Hello,
This is a continuation of our work to enable using Peer-to-Peer PCI
memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
provided valuable feedback to get these patches to where they are today.
The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVME target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device, avoiding system memory entirely. The upsides of this are better
QoS for applications running on the CPU that are using system memory, and
lower PCI bandwidth required to the CPU (such that systems could be
designed with fewer lanes connected to the CPU). At present, the
trade-off is a reduction in overall throughput, largely due to hardware
issues that should improve in the future.
Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This will mean many setups that could likely
work well will not be supported so that we can be more confident it
will work and not place any responsibility on the user to understand
their topology. (We chose to go this route based on feedback we
received at the last LSF). Future work may enable these transfers behind
a fabric of PCI switches or perhaps using a whitelist of known-good
root complexes.
In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
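As a rough illustration only (the driver function name and the choice of
BAR 4 below are made up for the example), a provider driver would
register and publish its BAR with the new interface roughly like this:

    /* Hypothetical provider: expose all of BAR 4 as p2pmem */
    static int foo_register_p2pmem(struct pci_dev *pdev)
    {
            int rc;

            /* Creates ZONE_DEVICE pages covering the whole BAR */
            rc = pci_p2pmem_add_resource(pdev, 4, 0, 0);
            if (rc)
                    return rc;

            /* Make the memory discoverable via pci_p2pmem_find() */
            pci_p2pmem_publish(pdev, true);
            return 0;
    }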
Another set of functions allows a client driver to create a list of
client devices that will be used in a given P2P transaction and then
use that list to find any P2P memory that is supported by all the
client devices. This list is also used to selectively disable the
ACS bits for the downstream ports behind these devices.
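For example (a sketch only; the 'nic' and 'nvme_dev' devices below are
placeholders and error handling is elided), an orchestrating driver
would use the interface along these lines:

    LIST_HEAD(clients);
    struct pci_dev *p2p_dev;
    void *buf;

    /* Collect every device that will touch the buffers */
    pci_p2pmem_add_client(&clients, &nic->dev);
    pci_p2pmem_add_client(&clients, &nvme_dev->dev);

    /* Find published p2pmem behind the same switch as all clients */
    p2p_dev = pci_p2pmem_find(&clients);
    if (p2p_dev)
            buf = pci_alloc_p2pmem(p2p_dev, SZ_4K);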
In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked so they are never merged, since a non-homogeneous request
would complicate the DMA mapping requirements.
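To sketch what that looks like for a submitter (a hypothetical helper,
not part of this series), a bio may only be flagged for P2P if the
target queue advertises support:

    /* Only mark the bio P2P if the destination queue supports it */
    static int foo_submit_p2p_bio(struct request_queue *q, struct bio *bio)
    {
            if (!blk_queue_pci_p2p(q))
                    return -EOPNOTSUPP;

            bio->bi_opf |= REQ_PCI_P2P;
            submit_bio(bio);
            return 0;
    }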
In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received it uses the
pci_p2pmem_map_sg() function which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.
In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not.
Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.
This series is based on Christoph's v3 series to revamp
dev_pagemap. A git repo of the patches is available here[2].
Logan
Christoph Hellwig (2):
nvme-pci: clean up CMB initialization
nvme-pci: clean up SMBSZ bit definitions
Logan Gunthorpe (10):
pci-p2p: Support peer to peer memory
pci-p2p: Add sysfs group to display p2pmem stats
pci-p2p: Add PCI p2pmem dma mappings to adjust the bus offset
pci-p2p: Clear ACS P2P flags for all client devices
block: Introduce PCI P2P flags for request and request queue
IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
nvme-pci: Use PCI p2pmem subsystem to manage the CMB
nvme-pci: Add support for P2P memory in requests
nvme-pci: Add a quirk for a pseudo CMB
nvmet: Optionally use PCI P2P memory
Documentation/ABI/testing/sysfs-bus-pci | 25 +
block/blk-core.c | 3 +
drivers/infiniband/core/rw.c | 22 +-
drivers/infiniband/ulp/isert/ib_isert.c | 5 +-
drivers/infiniband/ulp/srpt/ib_srpt.c | 7 +-
drivers/nvme/host/core.c | 4 +
drivers/nvme/host/nvme.h | 8 +
drivers/nvme/host/pci.c | 164 ++++---
drivers/nvme/target/configfs.c | 29 ++
drivers/nvme/target/core.c | 95 +++-
drivers/nvme/target/io-cmd.c | 3 +
drivers/nvme/target/nvmet.h | 10 +
drivers/nvme/target/rdma.c | 41 +-
drivers/pci/Kconfig | 14 +
drivers/pci/Makefile | 1 +
drivers/pci/p2p.c | 781 ++++++++++++++++++++++++++++++++
include/linux/blk_types.h | 18 +-
include/linux/blkdev.h | 2 +
include/linux/memremap.h | 19 +
include/linux/nvme.h | 22 +-
include/linux/pci-p2p.h | 94 ++++
include/linux/pci.h | 6 +
include/rdma/rw.h | 7 +-
net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +-
24 files changed, 1291 insertions(+), 95 deletions(-)
create mode 100644 drivers/pci/p2p.c
create mode 100644 include/linux/pci-p2p.h
--
2.11.0
From: Christoph Hellwig <[email protected]>
Refactor the call to nvme_map_cmb, and change the conditions for probing
for the CMB. First, remove the version check, as NVMe TPs always apply
to earlier versions of the spec as well. Second, check the whole CMBSZ
register for support of the CMB feature instead of just the size field
inside of it, to simplify the code a bit.
Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/nvme/host/pci.c | 41 ++++++++++++++---------------------------
1 file changed, 14 insertions(+), 27 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3c9082..58c379af6fb4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1663,21 +1663,20 @@ static ssize_t nvme_cmb_show(struct device *dev,
}
static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);
-static void __iomem *nvme_map_cmb(struct nvme_dev *dev)
+static void nvme_map_cmb(struct nvme_dev *dev)
{
u64 szu, size, offset;
resource_size_t bar_size;
struct pci_dev *pdev = to_pci_dev(dev->dev);
- void __iomem *cmb;
int bar;
dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
- if (!(NVME_CMB_SZ(dev->cmbsz)))
- return NULL;
+ if (!dev->cmbsz)
+ return;
dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
if (!use_cmb_sqes)
- return NULL;
+ return;
szu = (u64)1 << (12 + 4 * NVME_CMB_SZU(dev->cmbsz));
size = szu * NVME_CMB_SZ(dev->cmbsz);
@@ -1686,7 +1685,7 @@ static void __iomem *nvme_map_cmb(struct nvme_dev *dev)
bar_size = pci_resource_len(pdev, bar);
if (offset > bar_size)
- return NULL;
+ return;
/*
* Controllers may support a CMB size larger than their BAR,
@@ -1696,13 +1695,16 @@ static void __iomem *nvme_map_cmb(struct nvme_dev *dev)
if (size > bar_size - offset)
size = bar_size - offset;
- cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
- if (!cmb)
- return NULL;
-
+ dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
+ if (!dev->cmb)
+ return;
dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
dev->cmb_size = size;
- return cmb;
+
+ if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
+ &dev_attr_cmb.attr, NULL))
+ dev_warn(dev->ctrl.device,
+ "failed to add sysfs attribute for CMB\n");
}
static inline void nvme_release_cmb(struct nvme_dev *dev)
@@ -2124,22 +2126,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
"set queue depth=%u\n", dev->q_depth);
}
- /*
- * CMBs can currently only exist on >=1.2 PCIe devices. We only
- * populate sysfs if a CMB is implemented. Since nvme_dev_attrs_group
- * has no name we can pass NULL as final argument to
- * sysfs_add_file_to_group.
- */
-
- if (readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 2, 0)) {
- dev->cmb = nvme_map_cmb(dev);
- if (dev->cmb) {
- if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
- &dev_attr_cmb.attr, NULL))
- dev_warn(dev->ctrl.device,
- "failed to add sysfs attribute for CMB\n");
- }
- }
+ nvme_map_cmb(dev);
pci_enable_pcie_error_reporting(pdev);
pci_save_state(pdev);
--
2.11.0
In order to use PCI P2P memory, the pci_p2pmem_[un]map_sg() functions
must be called to map to the correct DMA addresses. To do this, we add
a flags argument and the RDMA_RW_CTX_FLAG_PCI_P2P flag. When the flag
is specified, the appropriate map function is used.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/infiniband/core/rw.c | 22 +++++++++++++++++-----
drivers/infiniband/ulp/isert/ib_isert.c | 5 +++--
drivers/infiniband/ulp/srpt/ib_srpt.c | 7 ++++---
drivers/nvme/target/rdma.c | 6 +++---
include/rdma/rw.h | 7 +++++--
net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +++---
6 files changed, 35 insertions(+), 18 deletions(-)
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..7956484da082 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
*/
#include <linux/moduleparam.h>
#include <linux/slab.h>
+#include <linux/pci-p2p.h>
#include <rdma/mr_pool.h>
#include <rdma/rw.h>
@@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
* @remote_addr:remote address to read/write (relative to @rkey)
* @rkey: remote key to operate on
* @dir: %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags: any of the RDMA_RW_CTX_FLAG_* flags
*
* Returns the number of WQEs that will be needed on the workqueue if
* successful, or a negative error code.
*/
int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
- u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+ u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+ unsigned int flags)
{
struct ib_device *dev = qp->pd->device;
int ret;
- ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+ if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
+ ret = pci_p2pmem_map_sg(sg, sg_cnt);
+ else
+ ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
if (!ret)
return -ENOMEM;
sg_cnt = ret;
@@ -499,7 +506,7 @@ struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
rdma_rw_update_lkey(&ctx->sig->data, true);
if (ctx->sig->prot.mr)
rdma_rw_update_lkey(&ctx->sig->prot, true);
-
+
ctx->sig->sig_mr->need_inval = true;
ib_update_fast_reg_key(ctx->sig->sig_mr,
ib_inc_rkey(ctx->sig->sig_mr->lkey));
@@ -579,9 +586,11 @@ EXPORT_SYMBOL(rdma_rw_ctx_post);
* @sg: scatterlist that was used for the READ/WRITE
* @sg_cnt: number of entries in @sg
* @dir: %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags: the same flags used to init the context
*/
void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
- struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir)
+ struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir,
+ unsigned int flags)
{
int i;
@@ -602,7 +611,10 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
break;
}
- ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+ if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
+ pci_p2pmem_unmap_sg(sg, sg_cnt);
+ else
+ ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy);
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 720dfb3a1ac2..a076da2ead16 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1496,7 +1496,8 @@ isert_rdma_rw_ctx_destroy(struct isert_cmd *cmd, struct isert_conn *conn)
se_cmd->t_prot_nents, dir);
} else {
rdma_rw_ctx_destroy(&cmd->rw, conn->qp, conn->cm_id->port_num,
- se_cmd->t_data_sg, se_cmd->t_data_nents, dir);
+ se_cmd->t_data_sg, se_cmd->t_data_nents,
+ dir, 0);
}
cmd->rw.nr_ops = 0;
@@ -2148,7 +2149,7 @@ isert_rdma_rw_ctx_post(struct isert_cmd *cmd, struct isert_conn *conn,
} else {
ret = rdma_rw_ctx_init(&cmd->rw, conn->qp, port_num,
se_cmd->t_data_sg, se_cmd->t_data_nents,
- offset, addr, rkey, dir);
+ offset, addr, rkey, dir, 0);
}
if (ret < 0) {
isert_err("Cmd: %p failed to prepare RDMA res\n", cmd);
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 8a1bd354b1cc..c5371ab2e47d 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -854,7 +854,8 @@ static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
goto unwind;
ret = rdma_rw_ctx_init(&ctx->rw, ch->qp, ch->sport->port,
- ctx->sg, ctx->nents, 0, remote_addr, rkey, dir);
+ ctx->sg, ctx->nents, 0, remote_addr, rkey,
+ dir, 0);
if (ret < 0) {
target_free_sgl(ctx->sg, ctx->nents);
goto unwind;
@@ -883,7 +884,7 @@ static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
- ctx->sg, ctx->nents, dir);
+ ctx->sg, ctx->nents, dir, 0);
target_free_sgl(ctx->sg, ctx->nents);
}
if (ioctx->rw_ctxs != &ioctx->s_rw_ctx)
@@ -901,7 +902,7 @@ static void srpt_free_rw_ctxs(struct srpt_rdma_ch *ch,
struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
- ctx->sg, ctx->nents, dir);
+ ctx->sg, ctx->nents, dir, 0);
target_free_sgl(ctx->sg, ctx->nents);
}
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 49912909c298..d4d0662ab071 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -480,7 +480,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
if (rsp->n_rdma) {
rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
queue->cm_id->port_num, rsp->req.sg,
- rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+ rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
}
if (rsp->req.sg != &rsp->cmd->inline_sg)
@@ -563,7 +563,7 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
queue->cm_id->port_num, rsp->req.sg,
- rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+ rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
rsp->n_rdma = 0;
if (unlikely(wc->status != IB_WC_SUCCESS)) {
@@ -634,7 +634,7 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
- nvmet_data_dir(&rsp->req));
+ nvmet_data_dir(&rsp->req), 0);
if (ret < 0)
return NVME_SC_INTERNAL;
rsp->req.transfer_len += len;
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index a3cbbc7b6417..ba8050434667 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -59,12 +59,15 @@ struct rdma_rw_ctx {
};
};
+#define RDMA_RW_CTX_FLAG_PCI_P2P (1 << 0)
+
int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
- u64 remote_addr, u32 rkey, enum dma_data_direction dir);
+ u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+ unsigned int flags);
void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct scatterlist *sg, u32 sg_cnt,
- enum dma_data_direction dir);
+ enum dma_data_direction dir, unsigned int flags);
int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
u8 port_num, struct scatterlist *sg, u32 sg_cnt,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 9bd04549a1ad..5f46c35e6707 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -140,7 +140,7 @@ static void svc_rdma_cc_release(struct svc_rdma_chunk_ctxt *cc,
rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
rdma->sc_port_num, ctxt->rw_sg_table.sgl,
- ctxt->rw_nents, dir);
+ ctxt->rw_nents, dir, 0);
svc_rdma_put_rw_ctxt(rdma, ctxt);
}
svc_xprt_put(&rdma->sc_xprt);
@@ -433,7 +433,7 @@ svc_rdma_build_writes(struct svc_rdma_write_info *info,
ret = rdma_rw_ctx_init(&ctxt->rw_ctx, rdma->sc_qp,
rdma->sc_port_num, ctxt->rw_sg_table.sgl,
ctxt->rw_nents, 0, seg_offset,
- seg_handle, DMA_TO_DEVICE);
+ seg_handle, DMA_TO_DEVICE, 0);
if (ret < 0)
goto out_initerr;
@@ -639,7 +639,7 @@ static int svc_rdma_build_read_segment(struct svc_rdma_read_info *info,
ret = rdma_rw_ctx_init(&ctxt->rw_ctx, cc->cc_rdma->sc_qp,
cc->cc_rdma->sc_port_num,
ctxt->rw_sg_table.sgl, ctxt->rw_nents,
- 0, offset, rkey, DMA_FROM_DEVICE);
+ 0, offset, rkey, DMA_FROM_DEVICE, 0);
if (ret < 0)
goto out_initerr;
--
2.11.0
When the ACS P2P flags are set on the downstream port of a switch,
any P2P TLPs will be redirected up to the root complex. The whole point
of the P2P work is to have TLPs avoid the RC, seeing as it may not
support P2P transactions at all or, at best, will handle them poorly.
So we clear these flags for all devices involved in transactions with
the p2pmem.
A count of the number of requests to disable the flags is maintained.
When the count transitions from 1 to 0, the old flags are restored.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2p.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/pci.h | 2 +
2 files changed, 143 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/p2p.c b/drivers/pci/p2p.c
index 87cec87b02e3..617adaa905d2 100644
--- a/drivers/pci/p2p.c
+++ b/drivers/pci/p2p.c
@@ -227,6 +227,89 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
}
/*
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled in the downstream port of each device in order for
+ * the TLPs to not be forwarded up to the RC.
+ */
+#define PCI_P2P_ACS_FLAGS (PCI_ACS_RR | PCI_ACS_CR)
+
+static int pci_p2pmem_disable_acs(struct pci_dev *pdev)
+{
+ int pos;
+ u16 ctrl;
+ struct pci_dev *downstream;
+
+ downstream = pci_dev_get(pci_upstream_bridge(pdev));
+ if (!downstream) {
+ dev_err(&pdev->dev, "could not find downstream port\n");
+ return -ENODEV;
+ }
+
+ device_lock(&downstream->dev);
+ if (downstream->p2p_acs_requests++)
+ goto unlock_and_return;
+
+ pos = pci_find_ext_capability(downstream, PCI_EXT_CAP_ID_ACS);
+ if (!pos)
+ goto unlock_and_return;
+
+ pci_read_config_word(downstream, pos + PCI_ACS_CTRL, &ctrl);
+
+ downstream->p2p_old_acs_flags = ctrl & PCI_P2P_ACS_FLAGS;
+
+ if (downstream->p2p_old_acs_flags)
+ dev_info(&pdev->dev, "disabling p2p acs flags: %x\n", ctrl);
+
+ ctrl &= ~PCI_P2P_ACS_FLAGS;
+
+ pci_write_config_word(downstream, pos + PCI_ACS_CTRL, ctrl);
+
+unlock_and_return:
+ device_unlock(&downstream->dev);
+ pci_dev_put(downstream);
+ return 0;
+}
+
+static int pci_p2pmem_reset_acs(struct pci_dev *pdev)
+{
+ int pos;
+ u16 ctrl;
+ struct pci_dev *downstream;
+
+ downstream = pci_dev_get(pci_upstream_bridge(pdev));
+ if (!downstream)
+ return -ENODEV;
+
+ device_lock(&downstream->dev);
+
+ /* Only actually reset the flags on a 1->0 transition */
+ if (!downstream->p2p_acs_requests)
+ goto unlock_and_return;
+
+ if (--downstream->p2p_acs_requests)
+ goto unlock_and_return;
+
+ pos = pci_find_ext_capability(downstream, PCI_EXT_CAP_ID_ACS);
+ if (!pos)
+ goto unlock_and_return;
+
+ pci_read_config_word(downstream, pos + PCI_ACS_CTRL, &ctrl);
+
+ ctrl &= ~PCI_P2P_ACS_FLAGS;
+ ctrl |= downstream->p2p_old_acs_flags;
+
+ if (downstream->p2p_old_acs_flags)
+ dev_info(&pdev->dev, "resetting p2p acs flags: %x\n", ctrl);
+
+ pci_write_config_word(downstream, pos + PCI_ACS_CTRL, ctrl);
+
+unlock_and_return:
+ device_unlock(&downstream->dev);
+ pci_dev_put(downstream);
+ return 0;
+}
+
+/*
* If a device is behind a switch, we try to find the upstream bridge
* port of the switch. This requires two calls to pci_upstream_bridge:
* one for the upstream port on the switch, one on the upstream port
@@ -340,6 +423,10 @@ int pci_p2pmem_add_client(struct list_head *head, struct device *dev)
ret = -EXDEV;
goto put_client;
}
+
+ ret = pci_p2pmem_disable_acs(client);
+ if (ret)
+ goto put_client;
}
new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
@@ -363,6 +450,9 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_add_client);
static void pci_p2pmem_client_free(struct pci_p2pmem_client *item)
{
+ if (item->p2pmem)
+ pci_p2pmem_reset_acs(item->client);
+
list_del(&item->list);
pci_dev_put(item->client);
pci_dev_put(item->p2pmem);
@@ -383,6 +473,7 @@ void pci_p2pmem_remove_client(struct list_head *head, struct device *dev)
{
struct pci_p2pmem_client *pos, *tmp;
struct pci_dev *pdev;
+ struct pci_dev *p2pmem = NULL;
pdev = find_parent_pci_dev(dev);
if (!pdev)
@@ -392,9 +483,16 @@ void pci_p2pmem_remove_client(struct list_head *head, struct device *dev)
if (pos->client != pdev)
continue;
+ if (!p2pmem)
+ p2pmem = pci_dev_get(pos->p2pmem);
+
pci_p2pmem_client_free(pos);
}
+ if (p2pmem && list_empty(head))
+ pci_p2pmem_reset_acs(p2pmem);
+
+ pci_dev_put(p2pmem);
pci_dev_put(pdev);
}
EXPORT_SYMBOL_GPL(pci_p2pmem_remove_client);
@@ -412,6 +510,10 @@ void pci_p2pmem_client_list_free(struct list_head *head)
{
struct pci_p2pmem_client *pos, *tmp;
+ pos = list_first_entry_or_null(head, struct pci_p2pmem_client, list);
+ if (pos && pos->p2pmem)
+ pci_p2pmem_reset_acs(pos->p2pmem);
+
list_for_each_entry_safe(pos, tmp, head, list)
pci_p2pmem_client_free(pos);
}
@@ -440,6 +542,40 @@ static bool upstream_bridges_match_list(struct pci_dev *pdev,
return ret;
}
+static int bind_clients(struct pci_dev *p2pmem, struct list_head *clients)
+{
+ int ret;
+ struct pci_p2pmem_client *pos, *unwind_pos;
+
+ ret = pci_p2pmem_disable_acs(p2pmem);
+ if (ret)
+ return ret;
+
+ list_for_each_entry(pos, clients, list) {
+ ret = pci_p2pmem_disable_acs(pos->client);
+ if (ret)
+ goto unwind;
+
+ pos->p2pmem = pci_dev_get(p2pmem);
+ }
+
+ return 0;
+
+unwind:
+ list_for_each_entry(unwind_pos, clients, list) {
+ if (pos == unwind_pos)
+ break;
+
+ pci_p2pmem_reset_acs(unwind_pos->client);
+ pci_dev_put(unwind_pos->p2pmem);
+ unwind_pos->p2pmem = NULL;
+ }
+
+ pci_p2pmem_reset_acs(p2pmem);
+
+ return ret;
+}
+
/**
* pci_p2pmem_find - find a p2p mem device compatible with the specified device
* @clients: list of client devices to check
@@ -450,11 +586,13 @@ static bool upstream_bridges_match_list(struct pci_dev *pdev,
*
* Returns a pointer to the PCI device with a reference taken (use pci_dev_put
* to return the reference) or NULL if no compatible device is found.
+ *
+ * The P2P ACS flags on all applicable PCI devices will be cleared and
+ * reset when the client is removed from the list.
*/
struct pci_dev *pci_p2pmem_find(struct list_head *clients)
{
struct pci_dev *pdev = NULL;
- struct pci_p2pmem_client *pos;
while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
if (!pdev->p2p || !pdev->p2p->published)
@@ -463,8 +601,8 @@ struct pci_dev *pci_p2pmem_find(struct list_head *clients)
if (!upstream_bridges_match_list(pdev, clients))
continue;
- list_for_each_entry(pos, clients, list)
- pos->p2pmem = pdev;
+ if (bind_clients(pdev, clients))
+ continue;
return pdev;
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 047aea679e87..cdd4d3c3e562 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -435,6 +435,8 @@ struct pci_dev {
#endif
#ifdef CONFIG_PCI_P2P
struct pci_p2p *p2p;
+ unsigned int p2p_acs_requests;
+ u16 p2p_old_acs_flags;
#endif
phys_addr_t rom; /* Physical address of ROM if it's not from the BAR */
size_t romlen; /* Length of ROM if it's not from the BAR */
--
2.11.0
Some PCI devices may have memory mapped in a BAR space that's
intended for use in Peer-to-Peer transactions. In order to enable
such transactions, the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.
A kernel interface is provided so that other subsystems can find and
allocate chunks of P2P memory as necessary to facilitate transfers
between two PCI peers. Depending on hardware, this may reduce the
bandwidth of the transfer but would significantly reduce pressure
on system memory. This may be desirable in many cases: for example, a
system could be designed with a small CPU connected to a PCI switch by a
small number of lanes, which would maximize the number of lanes available
to connect to NVMe devices.
The interface requires a user driver to collect a list of client devices
involved in the transaction with the pci_p2pmem_add_client*() functions,
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary. The ACS bits on the
downstream switch port will be managed for all the registered clients.
The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI switch. This is because
using P2P transactions through the PCI root complex can have performance
limitations or, worse, might not work at all. Finding out how well a
particular RC supports P2P transfers is non-trivial. Additionally, the
benefits of P2P transfers that go through the RC are limited to only
reducing DRAM usage.
This commit includes significant rework and feedback from Christoph
Hellwig.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/Kconfig | 14 ++
drivers/pci/Makefile | 1 +
drivers/pci/p2p.c | 549 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/memremap.h | 18 ++
include/linux/pci-p2p.h | 85 ++++++++
include/linux/pci.h | 4 +
6 files changed, 671 insertions(+)
create mode 100644 drivers/pci/p2p.c
create mode 100644 include/linux/pci-p2p.h
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index bda151788f3f..188ea94cfe2e 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -123,6 +123,20 @@ config PCI_PASID
If unsure, say N.
+config PCI_P2P
+ bool "PCI Peer to Peer transfer support"
+ depends on ZONE_DEVICE
+ select GENERIC_ALLOCATOR
+ help
+ Enables drivers to do PCI peer-to-peer transactions to and from
+ BARs that are exposed to other devices in the same domain.
+
+ Many PCIe root complexes do not support P2P transactions and
+ it's hard to tell which support it with good performance, so
+ at this time you will need a PCIe switch.
+
+ If unsure, say N.
+
config PCI_LABEL
def_bool y if (DMI || ACPI)
select NLS
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index c7819b973df7..749858201400 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_PCI_MSI) += msi.o
obj-$(CONFIG_PCI_ATS) += ats.o
obj-$(CONFIG_PCI_IOV) += iov.o
+obj-$(CONFIG_PCI_P2P) += p2p.o
#
# ACPI Related PCI FW Functions
diff --git a/drivers/pci/p2p.c b/drivers/pci/p2p.c
new file mode 100644
index 000000000000..aa465ac9273d
--- /dev/null
+++ b/drivers/pci/p2p.c
@@ -0,0 +1,549 @@
+/*
+ * Peer 2 Peer Memory support.
+ *
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/pci-p2p.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+
+struct pci_p2p {
+ struct percpu_ref devmap_ref;
+ struct completion devmap_ref_done;
+ struct gen_pool *pool;
+ bool published;
+};
+
+static void pci_p2pmem_percpu_release(struct percpu_ref *ref)
+{
+ struct pci_p2p *p2p =
+ container_of(ref, struct pci_p2p, devmap_ref);
+
+ complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pmem_percpu_kill(void *data)
+{
+ struct percpu_ref *ref = data;
+
+ if (percpu_ref_is_dying(ref))
+ return;
+
+ percpu_ref_kill(ref);
+}
+
+static void pci_p2pmem_release(void *data)
+{
+ struct pci_dev *pdev = data;
+
+ wait_for_completion(&pdev->p2p->devmap_ref_done);
+ percpu_ref_exit(&pdev->p2p->devmap_ref);
+
+ gen_pool_destroy(pdev->p2p->pool);
+ pdev->p2p = NULL;
+}
+
+static int pci_p2pmem_setup(struct pci_dev *pdev)
+{
+ int error = -ENOMEM;
+ struct pci_p2p *p2p;
+
+ p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+ if (!p2p)
+ return -ENOMEM;
+
+ p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+ if (!p2p->pool)
+ goto out;
+
+ init_completion(&p2p->devmap_ref_done);
+ error = percpu_ref_init(&p2p->devmap_ref,
+ pci_p2pmem_percpu_release, 0, GFP_KERNEL);
+ if (error)
+ goto out_pool_destroy;
+
+ percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pmem_release, pdev);
+ if (error)
+ goto out_pool_destroy;
+
+ pdev->p2p = p2p;
+
+ return 0;
+
+out_pool_destroy:
+ gen_pool_destroy(p2p->pool);
+out:
+ devm_kfree(&pdev->dev, p2p);
+ return error;
+}
+
+/**
+ * pci_p2pmem_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any dma request.
+ */
+int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar, size_t size,
+ u64 offset)
+{
+ struct dev_pagemap *pgmap;
+ void *addr;
+ int error;
+
+ if (WARN_ON(offset >= pci_resource_len(pdev, bar)))
+ return -EINVAL;
+
+ if (!size)
+ size = pci_resource_len(pdev, bar) - offset;
+
+ if (WARN_ON(size + offset > pci_resource_len(pdev, bar)))
+ return -EINVAL;
+
+ if (!pdev->p2p) {
+ error = pci_p2pmem_setup(pdev);
+ if (error)
+ return error;
+ }
+
+ pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+ if (!pgmap)
+ return -ENOMEM;
+
+ pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+ pgmap->res.end = pgmap->res.start + size - 1;
+ pgmap->ref = &pdev->p2p->devmap_ref;
+ pgmap->type = MEMORY_DEVICE_PCI_P2P;
+
+ addr = devm_memremap_pages(&pdev->dev, pgmap);
+ if (IS_ERR(addr))
+ return PTR_ERR(addr);
+
+ error = gen_pool_add_virt(pdev->p2p->pool, (uintptr_t)addr,
+ pci_bus_address(pdev, bar) + offset,
+ resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+ if (error)
+ return error;
+
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pmem_percpu_kill,
+ &pdev->p2p->devmap_ref);
+ if (error)
+ return error;
+
+ dev_info(&pdev->dev, "added %zdB of p2p memory\n", size);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+ struct device *parent;
+
+ dev = get_device(dev);
+
+ while (dev) {
+ if (dev_is_pci(dev))
+ return to_pci_dev(dev);
+
+ parent = get_device(dev->parent);
+ put_device(dev);
+ dev = parent;
+ }
+
+ return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge:
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_switch_port(struct pci_dev *pdev)
+{
+ struct pci_dev *up1, *up2;
+
+ if (!pdev)
+ return NULL;
+
+ up1 = pci_dev_get(pci_upstream_bridge(pdev));
+ if (!up1)
+ return NULL;
+
+ up2 = pci_dev_get(pci_upstream_bridge(up1));
+ pci_dev_put(up1);
+
+ return up2;
+}
+
+static bool __upstream_bridges_match(struct pci_dev *upstream,
+ struct pci_dev *client)
+{
+ struct pci_dev *dma_up;
+ bool ret = true;
+
+ dma_up = get_upstream_switch_port(client);
+
+ if (!dma_up) {
+ dev_dbg(&client->dev, "not a pci device behind a switch\n");
+ ret = false;
+ goto out;
+ }
+
+ if (upstream != dma_up) {
+ dev_dbg(&client->dev,
+ "does not reside on the same upstream bridge\n");
+ ret = false;
+ goto out;
+ }
+
+out:
+ pci_dev_put(dma_up);
+ return ret;
+}
+
+static bool upstream_bridges_match(struct pci_dev *pdev,
+ struct pci_dev *client)
+{
+ struct pci_dev *upstream;
+ bool ret;
+
+ upstream = get_upstream_switch_port(pdev);
+ if (!upstream) {
+ dev_warn(&pdev->dev, "not behind a pci switch\n");
+ return false;
+ }
+
+ ret = __upstream_bridges_match(upstream, client);
+
+ pci_dev_put(upstream);
+
+ return ret;
+}
+
+struct pci_p2pmem_client {
+ struct list_head list;
+ struct pci_dev *client;
+ struct pci_dev *p2pmem;
+};
+
+/**
+ * pci_p2pmem_add_client - allocate a new element in a client device list
+ * @head: list head of p2pmem clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pmem device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pmem
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pmem device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pmem functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pmem_add_client(struct list_head *head, struct device *dev)
+{
+ struct pci_p2pmem_client *item, *new_item;
+ struct pci_dev *p2pmem = NULL;
+ struct pci_dev *client;
+ int ret;
+
+ client = find_parent_pci_dev(dev);
+ if (!client) {
+ dev_warn(dev,
+ "cannot be used for p2p as it is not a pci device\n");
+ return -ENODEV;
+ }
+
+ item = list_first_entry_or_null(head, struct pci_p2pmem_client, list);
+ if (item && item->p2pmem) {
+ p2pmem = item->p2pmem;
+
+ if (!upstream_bridges_match(p2pmem, client)) {
+ ret = -EXDEV;
+ goto put_client;
+ }
+ }
+
+ new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+ if (!new_item) {
+ ret = -ENOMEM;
+ goto put_client;
+ }
+
+ new_item->client = client;
+ new_item->p2pmem = pci_dev_get(p2pmem);
+
+ list_add_tail(&new_item->list, head);
+
+ return 0;
+
+put_client:
+ pci_dev_put(client);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_add_client);
+
+static void pci_p2pmem_client_free(struct pci_p2pmem_client *item)
+{
+ list_del(&item->list);
+ pci_dev_put(item->client);
+ pci_dev_put(item->p2pmem);
+ kfree(item);
+}
+
+/**
+ * pci_p2pmem_remove_client - remove and free a p2pmem client
+ * @head: list head of p2pmem clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pmem device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pmem functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pmem_remove_client(struct list_head *head, struct device *dev)
+{
+ struct pci_p2pmem_client *pos, *tmp;
+ struct pci_dev *pdev;
+
+ pdev = find_parent_pci_dev(dev);
+ if (!pdev)
+ return;
+
+ list_for_each_entry_safe(pos, tmp, head, list) {
+ if (pos->client != pdev)
+ continue;
+
+ pci_p2pmem_client_free(pos);
+ }
+
+ pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_remove_client);
+
+/**
+ * pci_p2pmem_client_list_free - free an entire list of p2pmem clients
+ * @head: list head of p2pmem clients
+ *
+ * This removes all devices in a list of clients used by a p2pmem device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pmem functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pmem_client_list_free(struct list_head *head)
+{
+ struct pci_p2pmem_client *pos, *tmp;
+
+ list_for_each_entry_safe(pos, tmp, head, list)
+ pci_p2pmem_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_client_list_free);
+
+static bool upstream_bridges_match_list(struct pci_dev *pdev,
+ struct list_head *head)
+{
+ struct pci_p2pmem_client *pos;
+ struct pci_dev *upstream;
+ bool ret;
+
+ upstream = get_upstream_switch_port(pdev);
+ if (!upstream) {
+ dev_warn(&pdev->dev, "not behind a pci switch\n");
+ return false;
+ }
+
+ list_for_each_entry(pos, head, list) {
+ ret = __upstream_bridges_match(upstream, pos->client);
+ if (!ret)
+ break;
+ }
+
+ pci_dev_put(upstream);
+ return ret;
+}
+
+/**
+ * pci_p2pmem_find - find a p2p mem device compatible with the specified device
+ * @clients: list of client devices to check
+ *
+ * For now, we only support cases where the devices that will transfer to the
+ * p2pmem device are behind the same switch. This cuts out cases that may
+ * work, but it is the safest option for the user.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+ struct pci_dev *pdev = NULL;
+ struct pci_p2pmem_client *pos;
+
+ while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+ if (!pdev->p2p || !pdev->p2p->published)
+ continue;
+
+ if (!upstream_bridges_match_list(pdev, clients))
+ continue;
+
+ list_for_each_entry(pos, clients, list)
+ pos->p2pmem = pdev;
+
+ return pdev;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2pmem - allocate p2p memory
+ * @pdev: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+ void *ret;
+
+ if (unlikely(!pdev->p2p))
+ return NULL;
+
+ if (unlikely(!percpu_ref_tryget_live(&pdev->p2p->devmap_ref)))
+ return NULL;
+
+ ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2p->pool, size);
+
+ if (unlikely(!ret))
+ percpu_ref_put(&pdev->p2p->devmap_ref);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - free p2p memory
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+ gen_pool_free(pdev->p2p->pool, (uintptr_t)addr, size);
+ percpu_ref_put(&pdev->p2p->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
+ * address obtained with pci_alloc_p2pmem
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+ if (!addr)
+ return 0;
+ if (!pdev->p2p)
+ return 0;
+
+ return gen_pool_virt_to_phys(pdev->p2p->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate p2p memory in an sgl
+ * @pdev: the device to allocate memory from
+ * @sgl: the allocated sgl
+ * @nents: the number of sgs in the list
+ * @length: number of bytes to allocate
+ *
+ * Returns 0 on success
+ */
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+ unsigned int *nents, u32 length)
+{
+ struct scatterlist *sg;
+ void *addr;
+
+ sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+ if (!sg)
+ return -ENOMEM;
+
+ sg_init_table(sg, 1);
+
+ addr = pci_alloc_p2pmem(pdev, length);
+ if (!addr)
+ goto out_free_sg;
+
+ sg_set_buf(sg, addr, length);
+ *sgl = sg;
+ *nents = 1;
+ return 0;
+
+out_free_sg:
+ kfree(sg);
+ return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free an sgl allocated by pci_p2pmem_alloc_sgl
+ * @pdev: the device the memory was allocated from
+ * @sgl: the allocated sgl
+ * @nents: the number of sgs in the list
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+ unsigned int nents)
+{
+ struct scatterlist *sg;
+ int count;
+
+ if (!sgl || !nents)
+ return;
+
+ for_each_sg(sgl, sg, nents, count)
+ pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+ kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the p2p memory for use by other devices
+ * with pci_p2pmem_find
+ * @pdev: the device with p2p memory to publish
+ * @publish: set to true to publish the memory, false to unpublish it
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+ if (WARN_ON(publish && !pdev->p2p))
+ return;
+
+ pdev->p2p->published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..c17a6d167d48 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
* driver can hotplug the device memory using ZONE_DEVICE and with that memory
* type. Any page of a process can be migrated to such memory. However no one
should be allowed to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2P:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
*/
enum memory_type {
MEMORY_DEVICE_HOST = 0,
MEMORY_DEVICE_PRIVATE,
MEMORY_DEVICE_PUBLIC,
+ MEMORY_DEVICE_PCI_P2P,
};
/*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
}
#endif /* CONFIG_ZONE_DEVICE */
+#ifdef CONFIG_PCI_P2P
+static inline bool is_pci_p2p_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PCI_P2P;
+}
+#else
+static inline bool is_pci_p2p_page(const struct page *page)
+{
+ return false;
+}
+#endif
+
#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
static inline bool is_device_private_page(const struct page *page)
{
diff --git a/include/linux/pci-p2p.h b/include/linux/pci-p2p.h
new file mode 100644
index 000000000000..f811c97a5886
--- /dev/null
+++ b/include/linux/pci-p2p.h
@@ -0,0 +1,85 @@
+#ifndef _LINUX_PCI_P2P_H
+#define _LINUX_PCI_P2P_H
+/*
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2P
+int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar, size_t size,
+ u64 offset);
+int pci_p2pmem_add_client(struct list_head *head, struct device *dev);
+void pci_p2pmem_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pmem_client_list_free(struct list_head *head);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+ unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+ unsigned int nents);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2P */
+static inline int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar,
+ size_t size, u64 offset)
+{
+ return 0;
+}
+static inline int pci_p2pmem_add_client(struct list_head *head,
+ struct device *dev)
+{
+ return 0;
+}
+static inline void pci_p2pmem_remove_client(struct list_head *head,
+ struct device *dev)
+{
+}
+static inline void pci_p2pmem_client_list_free(struct list_head *head)
+{
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+ return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+ return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+ size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+ void *addr)
+{
+ return 0;
+}
+static inline int pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+ struct scatterlist **sgl, unsigned int *nents, u32 length)
+{
+ return -ENODEV;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+ struct scatterlist *sgl, unsigned int nents)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2P */
+#endif /* _LINUX_PCI_P2P_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index c170c9250c8b..047aea679e87 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -279,6 +279,7 @@ struct pcie_link_state;
struct pci_vpd;
struct pci_sriov;
struct pci_ats;
+struct pci_p2p;
/*
* The pci_dev structure is used to describe PCI devices.
@@ -432,6 +433,9 @@ struct pci_dev {
#ifdef CONFIG_PCI_PASID
u16 pasid_features;
#endif
+#ifdef CONFIG_PCI_P2P
+ struct pci_p2p *p2p;
+#endif
phys_addr_t rom; /* Physical address of ROM if it's not from the BAR */
size_t romlen; /* Length of ROM if it's not from the BAR */
char *driver_override; /* Driver name to force a match */
--
2.11.0
Introduce a quirk to use CMB-like memory on older devices that have
an exposed BAR but do not advertise support for it via the CMBLOC and
CMBSZ registers.
We'd like to use some of these older cards to test P2P memory.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/nvme.h | 7 +++++++
drivers/nvme/host/pci.c | 24 ++++++++++++++++++++----
2 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ae1453238250..abaf18ff2e09 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -81,6 +81,13 @@ enum nvme_quirks {
* Supports the LightNVM command set if indicated in vs[1].
*/
NVME_QUIRK_LIGHTNVM = (1 << 6),
+
+ /*
+ * Pseudo CMB Support on BAR 4. For adapters like the Microsemi
+ * NVRAM that have CMB-like memory on a BAR but do not set
+ * CMBLOC or CMBSZ.
+ */
+ NVME_QUIRK_PSEUDO_CMB_BAR4 = (1 << 7),
};
/*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 29ef3fd24938..0b6ad30b64a6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1688,6 +1688,13 @@ static ssize_t nvme_cmb_show(struct device *dev,
}
static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);
+static u32 nvme_pseudo_cmbsz(struct pci_dev *pdev, int bar)
+{
+ return NVME_CMBSZ_WDS | NVME_CMBSZ_RDS |
+ (((ilog2(SZ_16M) - 12) / 4) << NVME_CMBSZ_SZU_SHIFT) |
+ ((pci_resource_len(pdev, bar) / SZ_16M) << NVME_CMBSZ_SZ_SHIFT);
+}
+
static u64 nvme_cmb_size_unit(struct nvme_dev *dev)
{
u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK;
@@ -1707,10 +1714,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
struct pci_dev *pdev = to_pci_dev(dev->dev);
int bar;
- dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
- if (!dev->cmbsz)
- return;
- dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+ if (dev->ctrl.quirks & NVME_QUIRK_PSEUDO_CMB_BAR4) {
+ dev->cmbsz = nvme_pseudo_cmbsz(pdev, 4);
+ dev->cmbloc = 4;
+ } else {
+ dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
+ if (!dev->cmbsz)
+ return;
+ dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+ }
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
@@ -2708,6 +2720,10 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2807), /* CNEX WL */
.driver_data = NVME_QUIRK_LIGHTNVM, },
+ { PCI_DEVICE(0x11f8, 0xf117), /* Microsemi NVRAM adaptor */
+ .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
+ { PCI_DEVICE(0x1db1, 0x0002), /* Everspin nvNitro adaptor */
+ .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
--
2.11.0
Attributes display the total amount of P2P memory, the amount available
and whether it is published or not.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
Documentation/ABI/testing/sysfs-bus-pci | 25 ++++++++++++++++
drivers/pci/p2p.c | 51 +++++++++++++++++++++++++++++++++
2 files changed, 76 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 44d4b2be92fd..7b80ea77faca 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -323,3 +323,28 @@ Description:
This is similar to /sys/bus/pci/drivers_autoprobe, but
affects only the VFs associated with a specific PF.
+
+What: /sys/bus/pci/devices/.../p2pmem/available
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains the amount of memory that has not been
+ allocated (in decimal).
+
+What: /sys/bus/pci/devices/.../p2pmem/size
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains the total amount of memory that the device
+ provides (in decimal).
+
+What: /sys/bus/pci/devices/.../p2pmem/published
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains a '1' if the memory has been published for
+ use inside the kernel or a '0' if it is only intended
+ for use within the driver that published it.
diff --git a/drivers/pci/p2p.c b/drivers/pci/p2p.c
index aa465ac9273d..03330c7b5e6d 100644
--- a/drivers/pci/p2p.c
+++ b/drivers/pci/p2p.c
@@ -28,6 +28,53 @@ struct pci_p2p {
bool published;
};
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ size_t size = 0;
+
+ if (pdev->p2p->pool)
+ size = gen_pool_size(pdev->p2p->pool);
+
+ return snprintf(buf, PAGE_SIZE, "%zd\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t available_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ size_t avail = 0;
+
+ if (pdev->p2p->pool)
+ avail = gen_pool_avail(pdev->p2p->pool);
+
+ return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
+}
+static DEVICE_ATTR_RO(available);
+
+static ssize_t published_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ return snprintf(buf, PAGE_SIZE, "%d\n", pdev->p2p->published);
+}
+static DEVICE_ATTR_RO(published);
+
+static struct attribute *p2pmem_attrs[] = {
+ &dev_attr_size.attr,
+ &dev_attr_available.attr,
+ &dev_attr_published.attr,
+ NULL,
+};
+
+static const struct attribute_group p2pmem_group = {
+ .attrs = p2pmem_attrs,
+ .name = "p2pmem",
+};
+
static void pci_p2pmem_percpu_release(struct percpu_ref *ref)
{
struct pci_p2p *p2p =
@@ -54,6 +101,7 @@ static void pci_p2pmem_release(void *data)
percpu_ref_exit(&pdev->p2p->devmap_ref);
gen_pool_destroy(pdev->p2p->pool);
+ sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
pdev->p2p = NULL;
}
@@ -82,6 +130,9 @@ static int pci_p2pmem_setup(struct pci_dev *pdev)
if (error)
goto out_pool_destroy;
+ if (sysfs_create_group(&pdev->dev.kobj, &p2pmem_group))
+ dev_warn(&pdev->dev, "failed to create p2p sysfs group\n");
+
pdev->p2p = p2p;
return 0;
--
2.11.0
For P2P requests we must use the pci_p2pmem_[un]map_sg() functions
instead of the dma_map_sg functions.
With that, we can then indicate PCI_P2P support in the request queue.
For this, we create an NVME_F_PCI_P2P flag which tells the core to
set QUEUE_FLAG_PCI_P2P in the request queue.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/core.c | 4 ++++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/pci.c | 18 ++++++++++++++----
3 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e46e60b8f10..8a7caaa5ee49 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2861,7 +2861,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
ns->queue = blk_mq_init_queue(ctrl->tagset);
if (IS_ERR(ns->queue))
goto out_free_ns;
+
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
+ if (ctrl->ops->flags & NVME_F_PCI_P2P)
+ queue_flag_set_unlocked(QUEUE_FLAG_PCI_P2P, ns->queue);
+
ns->queue->queuedata = ns;
ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa5283e8e..ae1453238250 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -286,6 +286,7 @@ struct nvme_ctrl_ops {
unsigned int flags;
#define NVME_F_FABRICS (1 << 0)
#define NVME_F_METADATA_SUPPORTED (1 << 1)
+#define NVME_F_PCI_P2P (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 71893babb982..29ef3fd24938 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -792,6 +792,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
enum dma_data_direction dma_dir = rq_data_dir(req) ?
DMA_TO_DEVICE : DMA_FROM_DEVICE;
blk_status_t ret = BLK_STS_IOERR;
+ int nents;
sg_init_table(iod->sg, blk_rq_nr_phys_segments(req));
iod->nents = blk_rq_map_sg(q, req, iod->sg);
@@ -799,8 +800,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
goto out;
ret = BLK_STS_RESOURCE;
- if (!dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
- DMA_ATTR_NO_WARN))
+
+ if (REQ_IS_PCI_P2P(req))
+ nents = pci_p2pmem_map_sg(iod->sg, iod->nents);
+ else
+ nents = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+ dma_dir, DMA_ATTR_NO_WARN);
+ if (!nents)
goto out;
if (nvme_pci_use_sgls(dev, req))
@@ -844,7 +850,11 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
DMA_TO_DEVICE : DMA_FROM_DEVICE;
if (iod->nents) {
- dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+ if (REQ_IS_PCI_P2P(req))
+ pci_p2pmem_unmap_sg(iod->sg, iod->nents);
+ else
+ dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+
if (blk_integrity_rq(req)) {
if (req_op(req) == REQ_OP_READ)
nvme_dif_remap(req, nvme_dif_complete);
@@ -2417,7 +2427,7 @@ static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name = "pcie",
.module = THIS_MODULE,
- .flags = NVME_F_METADATA_SUPPORTED,
+ .flags = NVME_F_METADATA_SUPPORTED | NVME_F_PCI_P2P,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32 = nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
--
2.11.0
QUEUE_FLAG_PCI_P2P is introduced to indicate that a driver's request
queue supports targeting P2P memory.
REQ_PCI_P2P is introduced to indicate a particular bio request is
directed to/from PCI P2P memory. A request with this flag is not
accepted unless the corresponding queues have the QUEUE_FLAG_PCI_P2P
flag set.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
block/blk-core.c | 3 +++
include/linux/blk_types.h | 18 +++++++++++++++++-
include/linux/blkdev.h | 2 ++
3 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index b8881750a3ac..f347c25731ff 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2141,6 +2141,9 @@ generic_make_request_checks(struct bio *bio)
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
goto not_supported;
+ if ((bio->bi_opf & REQ_PCI_P2P) && !blk_queue_pci_p2p(q))
+ goto not_supported;
+
if (should_fail_request(&bio->bi_disk->part0, bio->bi_iter.bi_size))
goto end_io;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 9e7d8bd776d2..8e1b4d643478 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -228,6 +228,10 @@ enum req_flag_bits {
__REQ_BACKGROUND, /* background IO */
__REQ_NOWAIT, /* Don't wait if request will block */
+#ifdef CONFIG_PCI_P2P
+ __REQ_PCI_P2P, /* request is to/from P2P memory */
+#endif
+
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */
@@ -252,6 +256,18 @@ enum req_flag_bits {
#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
#define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
+#ifdef CONFIG_PCI_P2P
+/*
+ * Currently SGLs do not support mixed P2P and regular memory so
+ * requests with P2P memory must not be merged.
+ */
+#define REQ_PCI_P2P (1ULL << __REQ_PCI_P2P)
+#define REQ_IS_PCI_P2P(req) ((req)->cmd_flags & REQ_PCI_P2P)
+#else
+#define REQ_PCI_P2P 0
+#define REQ_IS_PCI_P2P(req) 0
+#endif /* CONFIG_PCI_P2P */
+
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
#define REQ_DRV (1ULL << __REQ_DRV)
@@ -260,7 +276,7 @@ enum req_flag_bits {
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
#define REQ_NOMERGE_FLAGS \
- (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)
+ (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA | REQ_PCI_P2P)
#define bio_op(bio) \
((bio)->bi_opf & REQ_OP_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0ce8a372d506..877b71dd1122 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -651,6 +651,7 @@ struct request_queue {
#define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */
#define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_PCI_P2P 30 /* device supports pci p2p requests */
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
@@ -746,6 +747,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
#define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
#define blk_queue_scsi_passthrough(q) \
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_pci_p2p(q) test_bit(QUEUE_FLAG_PCI_P2P, &(q)->queue_flags)
#define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
--
2.11.0
Register the CMB buffer as p2pmem and use the appropriate allocation
functions to create and destroy the IO SQ.
If the CMB supports WDS and RDS, publish it for use as p2p memory
by other devices.
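In condensed form, the flow added below looks roughly like this (error
handling trimmed; the example_* wrappers are illustrative only):

#include <linux/pci-p2p.h>

/*
 * Probe time: register the CMB BAR range as P2P memory and, if the
 * controller supports RDS and WDS, publish it for use by other devices.
 */
static int example_register_cmb(struct pci_dev *pdev, int bar, size_t size,
				u64 offset)
{
	int rc = pci_p2pmem_add_resource(pdev, bar, size, offset);

	if (rc)
		return rc;
	pci_p2pmem_publish(pdev, true);
	return 0;
}

/*
 * Queue allocation time: carve an SQ out of the CMB; the caller falls
 * back to dma_alloc_coherent() when this returns NULL.
 */
static void *example_alloc_sq(struct pci_dev *pdev, size_t sq_size,
			      dma_addr_t *dma_addr)
{
	void *sq = pci_alloc_p2pmem(pdev, sq_size);

	if (sq)
		*dma_addr = pci_p2pmem_virt_to_bus(pdev, sq);
	return sq;
}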
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/pci.c | 74 +++++++++++++++++++++++++++++--------------------
1 file changed, 44 insertions(+), 30 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5af239e46f52..71893babb982 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/sed-opal.h>
+#include <linux/pci-p2p.h>
#include "nvme.h"
@@ -91,9 +92,8 @@ struct nvme_dev {
struct work_struct remove_work;
struct mutex shutdown_lock;
bool subsystem;
- void __iomem *cmb;
- pci_bus_addr_t cmb_bus_addr;
u64 cmb_size;
+ bool cmb_use_sqes;
u32 cmbsz;
u32 cmbloc;
struct nvme_ctrl ctrl;
@@ -148,7 +148,7 @@ struct nvme_queue {
struct nvme_dev *dev;
spinlock_t q_lock;
struct nvme_command *sq_cmds;
- struct nvme_command __iomem *sq_cmds_io;
+ bool sq_cmds_is_io;
volatile struct nvme_completion *cqes;
struct blk_mq_tags **tags;
dma_addr_t sq_dma_addr;
@@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
{
u16 tail = nvmeq->sq_tail;
- if (nvmeq->sq_cmds_io)
- memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
- else
- memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
+ memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
if (++tail == nvmeq->q_depth)
tail = 0;
@@ -1279,9 +1276,21 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
{
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
- if (nvmeq->sq_cmds)
- dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
- nvmeq->sq_cmds, nvmeq->sq_dma_addr);
+
+ if (nvmeq->sq_cmds) {
+ if (nvmeq->sq_cmds_is_io)
+ pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
+ nvmeq->sq_cmds,
+ SQ_SIZE(nvmeq->q_depth));
+ else
+ dma_free_coherent(nvmeq->q_dmadev,
+ SQ_SIZE(nvmeq->q_depth),
+ nvmeq->sq_cmds,
+ nvmeq->sq_dma_addr);
+ }
+
+ nvmeq->sq_cmds = NULL;
+
kfree(nvmeq);
}
@@ -1369,18 +1378,24 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
int qid, int depth)
{
- if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
- unsigned offset = (qid - 1) * roundup(SQ_SIZE(depth),
- dev->ctrl.page_size);
- nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
- nvmeq->sq_cmds_io = dev->cmb + offset;
- } else {
+ struct pci_dev *pdev = to_pci_dev(dev->dev);
+
+ if (qid && dev->cmb_use_sqes) {
+ nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+ nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
+ nvmeq->sq_cmds);
+ nvmeq->sq_cmds_is_io = true;
+ }
+
+ if (!nvmeq->sq_cmds) {
nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
&nvmeq->sq_dma_addr, GFP_KERNEL);
- if (!nvmeq->sq_cmds)
- return -ENOMEM;
+ nvmeq->sq_cmds_is_io = false;
}
+ if (!nvmeq->sq_cmds)
+ return -ENOMEM;
+
return 0;
}
@@ -1687,9 +1702,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
return;
dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
- if (!use_cmb_sqes)
- return;
-
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
bar = NVME_CMB_BIR(dev->cmbloc);
@@ -1706,11 +1718,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
if (size > bar_size - offset)
size = bar_size - offset;
- dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
- if (!dev->cmb)
+ if (pci_p2pmem_add_resource(pdev, bar, size, offset))
return;
- dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
+
dev->cmb_size = size;
+ dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
+
+ if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
+ (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
+ pci_p2pmem_publish(pdev, true);
if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL))
@@ -1720,12 +1736,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
static inline void nvme_release_cmb(struct nvme_dev *dev)
{
- if (dev->cmb) {
- iounmap(dev->cmb);
- dev->cmb = NULL;
+ if (dev->cmb_size) {
sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL);
- dev->cmbsz = 0;
+ dev->cmb_size = 0;
}
}
@@ -1920,13 +1934,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (nr_io_queues == 0)
return 0;
- if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+ if (dev->cmb_use_sqes) {
result = nvme_cmb_qdepth(dev, nr_io_queues,
sizeof(struct nvme_command));
if (result > 0)
dev->q_depth = result;
else
- nvme_release_cmb(dev);
+ dev->cmb_use_sqes = false;
}
do {
--
2.11.0
The DMA address used when mapping PCI P2P memory must be the PCI bus
address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
addresses when using P2P memory.
For this, we assume that an SGL passed to these functions contains either
all p2p memory or no p2p memory.
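The expected calling pattern is then a simple branch at map time, mirroring
the NVMe driver change earlier in the series (sketch only; example_* is not
a real helper):

#include <linux/dma-mapping.h>
#include <linux/pci-p2p.h>

/* Map an SGL that is known to be entirely P2P or entirely regular memory. */
static int example_map_data(struct device *dev, struct scatterlist *sg,
			    int nents, enum dma_data_direction dir,
			    bool is_p2p)
{
	if (is_p2p)
		return pci_p2pmem_map_sg(sg, nents);

	return dma_map_sg(dev, sg, nents, dir);
}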
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2p.c | 43 +++++++++++++++++++++++++++++++++++++++++++
include/linux/memremap.h | 1 +
include/linux/pci-p2p.h | 9 +++++++++
3 files changed, 53 insertions(+)
diff --git a/drivers/pci/p2p.c b/drivers/pci/p2p.c
index 03330c7b5e6d..87cec87b02e3 100644
--- a/drivers/pci/p2p.c
+++ b/drivers/pci/p2p.c
@@ -184,6 +184,8 @@ int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->res.end = pgmap->res.start + size - 1;
pgmap->ref = &pdev->p2p->devmap_ref;
pgmap->type = MEMORY_DEVICE_PCI_P2P;
+ pgmap->pci_p2p_bus_offset = pci_bus_address(pdev, bar) -
+ pci_resource_start(pdev, bar);
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr))
@@ -598,3 +600,44 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
pdev->p2p->published = publish;
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
+
+/**
+ * pci_p2pmem_map_sg - map a pci p2p sg for dma
+ * @sg: scatter list to map
+ * @nents: elements in the scatter list
+ *
+ * Returns the number of sgls mapped
+ */
+int pci_p2pmem_map_sg(struct scatterlist *sg, int nents)
+{
+ struct dev_pagemap *pgmap;
+ struct scatterlist *s;
+ phys_addr_t paddr;
+ int i;
+
+ for_each_sg(sg, s, nents, i) {
+ pgmap = sg_page(s)->pgmap;
+ paddr = sg_phys(s);
+
+ s->dma_address = paddr - pgmap->pci_p2p_bus_offset;
+ sg_dma_len(s) = s->length;
+ }
+
+ return nents;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_map_sg);
+
+/**
+ * pci_p2pmem_unmap_sg - unmap a pci p2p sg for dma
+ * @sg: scatter list to unmap
+ * @nents: elements in the scatter list
+ */
+void pci_p2pmem_unmap_sg(struct scatterlist *sg, int nents)
+{
+ struct scatterlist *s;
+ int i;
+
+ for_each_sg(sg, s, nents, i)
+ dma_mark_clean(phys_to_virt(sg_phys(s)), sg_dma_len(s));
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_unmap_sg);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c17a6d167d48..a792f87b9dc3 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -125,6 +125,7 @@ struct dev_pagemap {
struct device *dev;
void *data;
enum memory_type type;
+ u64 pci_p2p_bus_offset;
};
#ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/pci-p2p.h b/include/linux/pci-p2p.h
index f811c97a5886..154c39905fc9 100644
--- a/include/linux/pci-p2p.h
+++ b/include/linux/pci-p2p.h
@@ -34,6 +34,8 @@ int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
unsigned int nents);
void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+int pci_p2pmem_map_sg(struct scatterlist *sg, int nents);
+void pci_p2pmem_unmap_sg(struct scatterlist *sg, int nents);
#else /* CONFIG_PCI_P2P */
static inline int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -81,5 +83,12 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
}
+static inline int pci_p2pmem_map_sg(struct scatterlist *sg, int nents)
+{
+ return -ENODEV;
+}
+static inline void pci_p2pmem_unmap_sg(struct scatterlist *sg, int nents)
+{
+}
#endif /* CONFIG_PCI_P2P */
#endif /* _LINUX_PCI_P2P_H */
--
2.11.0
From: Christoph Hellwig <[email protected]>
Define the bit positions instead of macros using the magic values,
and move the expanded helpers to calculate the size and size unit into
the implementation C file.
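As a worked example of what the new helpers compute (the register value
here is hypothetical):

/*
 * cmbsz = 0x00100101 (hypothetical)
 *   SQS = bit 0 set                        -> SQs may be placed in the CMB
 *   SZU = (cmbsz >> 8) & 0xf = 1           -> size unit = 1 << (12 + 4 * 1) = 64 KiB
 *   SZ  = (cmbsz >> 12) & 0xfffff = 0x100  -> 256 units
 *   CMB size = 256 * 64 KiB = 16 MiB
 */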
Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/nvme/host/pci.c | 23 +++++++++++++++++------
include/linux/nvme.h | 22 ++++++++++++++--------
2 files changed, 31 insertions(+), 14 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 58c379af6fb4..5af239e46f52 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1369,7 +1369,7 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
int qid, int depth)
{
- if (qid && dev->cmb && use_cmb_sqes && NVME_CMB_SQS(dev->cmbsz)) {
+ if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
unsigned offset = (qid - 1) * roundup(SQ_SIZE(depth),
dev->ctrl.page_size);
nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
@@ -1663,9 +1663,21 @@ static ssize_t nvme_cmb_show(struct device *dev,
}
static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);
+static u64 nvme_cmb_size_unit(struct nvme_dev *dev)
+{
+ u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK;
+
+ return 1ULL << (12 + 4 * szu);
+}
+
+static u32 nvme_cmb_size(struct nvme_dev *dev)
+{
+ return (dev->cmbsz >> NVME_CMBSZ_SZ_SHIFT) & NVME_CMBSZ_SZ_MASK;
+}
+
static void nvme_map_cmb(struct nvme_dev *dev)
{
- u64 szu, size, offset;
+ u64 size, offset;
resource_size_t bar_size;
struct pci_dev *pdev = to_pci_dev(dev->dev);
int bar;
@@ -1678,9 +1690,8 @@ static void nvme_map_cmb(struct nvme_dev *dev)
if (!use_cmb_sqes)
return;
- szu = (u64)1 << (12 + 4 * NVME_CMB_SZU(dev->cmbsz));
- size = szu * NVME_CMB_SZ(dev->cmbsz);
- offset = szu * NVME_CMB_OFST(dev->cmbloc);
+ size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
+ offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
bar = NVME_CMB_BIR(dev->cmbloc);
bar_size = pci_resource_len(pdev, bar);
@@ -1909,7 +1920,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (nr_io_queues == 0)
return 0;
- if (dev->cmb && NVME_CMB_SQS(dev->cmbsz)) {
+ if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
result = nvme_cmb_qdepth(dev, nr_io_queues,
sizeof(struct nvme_command));
if (result > 0)
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index aea87f0d917b..4112e2bd747f 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -124,14 +124,20 @@ enum {
#define NVME_CMB_BIR(cmbloc) ((cmbloc) & 0x7)
#define NVME_CMB_OFST(cmbloc) (((cmbloc) >> 12) & 0xfffff)
-#define NVME_CMB_SZ(cmbsz) (((cmbsz) >> 12) & 0xfffff)
-#define NVME_CMB_SZU(cmbsz) (((cmbsz) >> 8) & 0xf)
-
-#define NVME_CMB_WDS(cmbsz) ((cmbsz) & 0x10)
-#define NVME_CMB_RDS(cmbsz) ((cmbsz) & 0x8)
-#define NVME_CMB_LISTS(cmbsz) ((cmbsz) & 0x4)
-#define NVME_CMB_CQS(cmbsz) ((cmbsz) & 0x2)
-#define NVME_CMB_SQS(cmbsz) ((cmbsz) & 0x1)
+
+enum {
+ NVME_CMBSZ_SQS = 1 << 0,
+ NVME_CMBSZ_CQS = 1 << 1,
+ NVME_CMBSZ_LISTS = 1 << 2,
+ NVME_CMBSZ_RDS = 1 << 3,
+ NVME_CMBSZ_WDS = 1 << 4,
+
+ NVME_CMBSZ_SZ_SHIFT = 12,
+ NVME_CMBSZ_SZ_MASK = 0xfffff,
+
+ NVME_CMBSZ_SZU_SHIFT = 8,
+ NVME_CMBSZ_SZU_MASK = 0xf,
+};
/*
* Submission and Completion Queue Entry Sizes for the NVM command set.
--
2.11.0
We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only use p2p
memory if a p2p memory device can be found which is behind the
same switch as the RDMA port and all the block devices in use. If
the user enables it and no suitable devices are found, the system will
silently fall back on using regular memory.
If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device falling back to system memory should
anything fail.
Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available.
Signed-off-by: Stephen Bates <[email protected]>
Signed-off-by: Steve Wise <[email protected]>
[hch: partial rewrite of the initial code]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/target/configfs.c | 29 +++++++++++++
drivers/nvme/target/core.c | 95 +++++++++++++++++++++++++++++++++++++++++-
drivers/nvme/target/io-cmd.c | 3 ++
drivers/nvme/target/nvmet.h | 10 +++++
drivers/nvme/target/rdma.c | 39 ++++++++++++++---
5 files changed, 169 insertions(+), 7 deletions(-)
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e6b2d2af81b6..12933ca39e62 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -867,12 +867,41 @@ static void nvmet_port_release(struct config_item *item)
kfree(port);
}
+#ifdef CONFIG_PCI_P2P
+static ssize_t nvmet_allow_p2pmem_show(struct config_item *item, char *page)
+{
+ return sprintf(page, "%d\n", to_nvmet_port(item)->allow_p2pmem);
+}
+
+static ssize_t nvmet_allow_p2pmem_store(struct config_item *item,
+ const char *page, size_t count)
+{
+ struct nvmet_port *port = to_nvmet_port(item);
+ bool allow;
+ int ret;
+
+ ret = strtobool(page, &allow);
+ if (ret)
+ return ret;
+
+ down_write(&nvmet_config_sem);
+ port->allow_p2pmem = allow;
+ up_write(&nvmet_config_sem);
+
+ return count;
+}
+CONFIGFS_ATTR(nvmet_, allow_p2pmem);
+#endif /* CONFIG_PCI_P2P */
+
static struct configfs_attribute *nvmet_port_attrs[] = {
&nvmet_attr_addr_adrfam,
&nvmet_attr_addr_treq,
&nvmet_attr_addr_traddr,
&nvmet_attr_addr_trsvcid,
&nvmet_attr_addr_trtype,
+#ifdef CONFIG_PCI_P2P
+ &nvmet_attr_allow_p2pmem,
+#endif
NULL,
};
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index b54748ad5f48..77d1975388da 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/random.h>
#include <linux/rculist.h>
+#include <linux/pci-p2p.h>
#include "nvmet.h"
@@ -271,6 +272,25 @@ void nvmet_put_namespace(struct nvmet_ns *ns)
percpu_ref_put(&ns->ref);
}
+static int nvmet_p2pmem_add_client(struct nvmet_ctrl *ctrl,
+ struct nvmet_ns *ns)
+{
+ int ret;
+
+ if (!blk_queue_pci_p2p(ns->bdev->bd_queue)) {
+ pr_err("p2p is not supported by %s\n",
+ ns->device_path);
+ return -EINVAL;
+ }
+
+ ret = pci_p2pmem_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+ if (ret)
+ pr_err("failed to add p2pmem client %s: %d\n",
+ ns->device_path, ret);
+
+ return ret;
+}
+
int nvmet_ns_enable(struct nvmet_ns *ns)
{
struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
if (ret)
goto out_blkdev_put;
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ if (ctrl->p2p_dev) {
+ ret = nvmet_p2pmem_add_client(ctrl, ns);
+ if (ret)
+ goto out_remove_clients;
+ }
+ }
+
if (ns->nsid > subsys->max_nsid)
subsys->max_nsid = ns->nsid;
@@ -328,6 +356,9 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
out_unlock:
mutex_unlock(&subsys->lock);
return ret;
+out_remove_clients:
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ pci_p2pmem_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
out_blkdev_put:
blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
ns->bdev = NULL;
@@ -363,8 +394,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
percpu_ref_exit(&ns->ref);
mutex_lock(&subsys->lock);
- list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ pci_p2pmem_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0);
+ }
if (ns->bdev)
blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
@@ -758,6 +791,63 @@ bool nvmet_host_allowed(struct nvmet_req *req, struct nvmet_subsys *subsys,
return __nvmet_host_allowed(subsys, hostnqn);
}
+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
+ * I/O commands. This requires the PCI p2p device to be compatible with the
+ * backing device for every namespace on this controller.
+ */
+static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
+{
+ struct nvmet_ns *ns;
+ int ret;
+
+ if (!req->port->allow_p2pmem || !req->p2p_client)
+ return;
+
+ mutex_lock(&ctrl->subsys->lock);
+
+ ret = pci_p2pmem_add_client(&ctrl->p2p_clients, req->p2p_client);
+ if (ret) {
+ pr_err("failed adding p2pmem client %s: %d\n",
+ dev_name(req->p2p_client), ret);
+ goto free_devices;
+ }
+
+ list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
+ ret = nvmet_p2pmem_add_client(ctrl, ns);
+ if (ret)
+ goto free_devices;
+ }
+
+ ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
+ if (!ctrl->p2p_dev) {
+ pr_info("no supported p2pmem devices found\n");
+ goto free_devices;
+ }
+ mutex_unlock(&ctrl->subsys->lock);
+
+ pr_info("using p2pmem on %s\n", pci_name(ctrl->p2p_dev));
+ return;
+
+free_devices:
+ pci_p2pmem_client_list_free(&ctrl->p2p_clients);
+ mutex_unlock(&ctrl->subsys->lock);
+}
+
+static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
+{
+ if (!ctrl->p2p_dev)
+ return;
+
+ mutex_lock(&ctrl->subsys->lock);
+
+ pci_p2pmem_client_list_free(&ctrl->p2p_clients);
+ pci_dev_put(ctrl->p2p_dev);
+ ctrl->p2p_dev = NULL;
+
+ mutex_unlock(&ctrl->subsys->lock);
+}
+
u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
{
@@ -797,6 +887,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
INIT_LIST_HEAD(&ctrl->async_events);
+ INIT_LIST_HEAD(&ctrl->p2p_clients);
memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@@ -852,6 +943,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
ctrl->kato = DIV_ROUND_UP(kato, 1000);
}
nvmet_start_keep_alive_timer(ctrl);
+ nvmet_setup_p2pmem(ctrl, req);
mutex_lock(&subsys->lock);
list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
@@ -886,6 +978,7 @@ static void nvmet_ctrl_free(struct kref *ref)
flush_work(&ctrl->async_event_work);
cancel_work_sync(&ctrl->fatal_err_work);
+ nvmet_release_p2pmem(ctrl);
ida_simple_remove(&cntlid_ida, ctrl->cntlid);
nvmet_subsys_put(subsys);
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index 0a4372a016f2..c05368c6579c 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
op = REQ_OP_READ;
}
+ if (is_pci_p2p_page(sg_page(req->sg)))
+ op_flags |= REQ_PCI_P2P;
+
sector = le64_to_cpu(req->cmd->rw.slba);
sector <<= (req->ns->blksize_shift - 9);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 417f6c0331cc..85a170914588 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
return container_of(to_config_group(item), struct nvmet_ns, group);
}
+static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
+{
+ return disk_to_dev(ns->bdev->bd_disk);
+}
+
struct nvmet_cq {
u16 qid;
u16 size;
@@ -98,6 +103,7 @@ struct nvmet_port {
struct list_head referrals;
void *priv;
bool enabled;
+ bool allow_p2pmem;
};
static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
@@ -131,6 +137,8 @@ struct nvmet_ctrl {
struct work_struct fatal_err_work;
struct nvmet_fabrics_ops *ops;
+ struct pci_dev *p2p_dev;
+ struct list_head p2p_clients;
char subsysnqn[NVMF_NQN_FIELD_LEN];
char hostnqn[NVMF_NQN_FIELD_LEN];
@@ -232,6 +240,8 @@ struct nvmet_req {
void (*execute)(struct nvmet_req *req);
struct nvmet_fabrics_ops *ops;
+
+ struct device *p2p_client;
};
static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index d4d0662ab071..fb4a7fa03015 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
#include <linux/string.h>
#include <linux/wait.h>
#include <linux/inet.h>
+#include <linux/pci-p2p.h>
#include <asm/unaligned.h>
#include <rdma/ib_verbs.h>
@@ -68,6 +69,7 @@ struct nvmet_rdma_rsp {
u8 n_rdma;
u32 flags;
u32 invalidate_rkey;
+ struct pci_dev *p2p_dev;
struct list_head wait_list;
struct list_head free_list;
@@ -480,11 +482,18 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
if (rsp->n_rdma) {
rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
queue->cm_id->port_num, rsp->req.sg,
- rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
+ rsp->req.sg_cnt, nvmet_data_dir(&rsp->req),
+ rsp->p2p_dev ? RDMA_RW_CTX_FLAG_PCI_P2P : 0);
}
- if (rsp->req.sg != &rsp->cmd->inline_sg)
- nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt);
+ if (rsp->req.sg != &rsp->cmd->inline_sg) {
+ if (rsp->p2p_dev)
+ pci_p2pmem_free_sgl(rsp->p2p_dev, rsp->req.sg,
+ rsp->req.sg_cnt);
+ else
+ nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt);
+
+ }
if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
nvmet_rdma_process_wr_wait_list(queue);
@@ -620,6 +629,7 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
u64 addr = le64_to_cpu(sgl->addr);
u32 len = get_unaligned_le24(sgl->length);
u32 key = get_unaligned_le32(sgl->key);
+ struct pci_dev *p2p_dev = NULL;
int ret;
u16 status;
@@ -627,14 +637,29 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
if (!len)
return 0;
- status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
- len);
+ if (rsp->queue->nvme_sq.ctrl)
+ p2p_dev = rsp->queue->nvme_sq.ctrl->p2p_dev;
+
+ rsp->p2p_dev = NULL;
+ if (rsp->queue->nvme_sq.qid && p2p_dev) {
+ status = pci_p2pmem_alloc_sgl(p2p_dev, &rsp->req.sg,
+ &rsp->req.sg_cnt, len);
+ if (!status)
+ rsp->p2p_dev = p2p_dev;
+ }
+
+ if (!rsp->p2p_dev) {
+ status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
+ len);
+ }
+
if (status)
return status;
ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
- nvmet_data_dir(&rsp->req), 0);
+ nvmet_data_dir(&rsp->req),
+ rsp->p2p_dev ? RDMA_RW_CTX_FLAG_PCI_P2P : 0);
if (ret < 0)
return NVME_SC_INTERNAL;
rsp->req.transfer_len += len;
@@ -713,6 +738,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
cmd->send_sge.addr, cmd->send_sge.length,
DMA_TO_DEVICE);
+ cmd->req.p2p_client = &queue->dev->device->dev;
+
if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
&queue->nvme_sq, &nvmet_rdma_ops))
return;
--
2.11.0
On 04/01/18 12:01 PM, Logan Gunthorpe wrote:
> From: Christoph Hellwig <[email protected]>
>
> Refactor the call to nvme_map_cmb, and change the conditions for probing
> for the CMB. First remove the version check as NVMe TPs always apply
> to earlier versions of the spec as well. Second check for the whole CMBSZ
> register for support of the CMB feature instead of just the size field
> inside of it to simplify the code a bit.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
Oops, I forgot:
Signed-off-by: Logan Gunthorpe <[email protected]>
On 04/01/18 12:01 PM, Logan Gunthorpe wrote:
> From: Christoph Hellwig <[email protected]>
>
> Define the bit positions instead of macros using the magic values,
> and move the expanded helpers to calculate the size and size unit into
> the implementation C file.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
Oops, I forgot:
Signed-off-by: Logan Gunthorpe <[email protected]>
On Thu, Jan 04, 2018 at 12:01:31PM -0700, Logan Gunthorpe wrote:
>
> @@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
> * @remote_addr:remote address to read/write (relative to @rkey)
> * @rkey: remote key to operate on
> * @dir: %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
> + * @flags: any of the RDMA_RW_CTX_FLAG_* flags
> *
> * Returns the number of WQEs that will be needed on the workqueue if
> * successful, or a negative error code.
> */
> int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
> struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
> - u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> + u64 remote_addr, u32 rkey, enum dma_data_direction dir,
> + unsigned int flags)
> {
> struct ib_device *dev = qp->pd->device;
> int ret;
>
> - ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
> + if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
> + ret = pci_p2pmem_map_sg(sg, sg_cnt);
> + else
> + ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
This seems really clunky since we are going to want to do this same
logic all over the place.
I'd be much happier if dma_map_sg can tell the memory is P2P or not
from the scatterlist or dir arguments and not require the callers to
have this.
For instance adding DMA_TO_P2P_DEVICE and DMA_FROM_P2P_DEVICE, or
adding another bit to the page flags in scatterlist.
The signature of pci_p2pmem_map_sg also doesn't look very good, it
should mirror the usual dma_map_sg, otherwise it needs more revision
to extend this to do P2P through a root complex.
Jason
On 04/01/18 12:22 PM, Jason Gunthorpe wrote:
> This seems really clunky since we are going to want to do this same
> logic all over the place.
>
> I'd be much happier if dma_map_sg can tell the memory is P2P or not
> from the scatterlist or dir arguments and not require the callers to
> have this.
We tried things like this in an earlier iteration[1] which assumed the
SG was homogenous (all P2P or all regular memory). This required serious
ugliness to try and ensure SGs were in fact homogenous[2]. If you don't
assume homogeneity you need to change every DMA mapping provider or have
a way to efficiently know if the SGL contains any P2P pages.
Unfortunately, it's hard to do any of this without significant
improvements to the SGL code (like [3]) but I understand there's
significant push back against such changes at this time.
> For instance adding DMA_TO_P2P_DEVICE and DMA_FROM_P2P_DEVICE, or
> adding another bit to the page flags in scatterlist.
Creating new direction flags doesn't really solve the problem you point
out. It'd only really work in the rdma_rw_ctx_init() case because that
function already takes the direction argument. It doesn't work in the
similar block device case because the DMA direction isn't passed through
the request.
Though, if there's agreement to do something like that I'm not against it.
> The signature of pci_p2pmem_map_sg also doesn't look very good, it
> should mirror the usual dma_map_sg, otherwise it needs more revision
> to extend this to do P2P through a root complex.
Generally, the feedback that I get is to not make changes that try to
anticipate the future. It would be easy enough to add those arguments
when the code for going through the RC is made (which I expect will be
very involved anyway).
Logan
[1]
https://github.com/sbates130272/linux-p2pmem/commit/af3d3742669511c343c2c5bfbbfaccc325bee80a
[2]
https://github.com/sbates130272/linux-p2pmem/commit/61ec04fa63c79dab14570ea0e97432d9325ad127
[3]
https://github.com/sbates130272/linux-p2pmem/commit/d8635a4de3c674b577b95653d431fd61c2504ccd
Run "git log --oneline drivers/pci" and follow the convention. I
think it would make sense to add a new tag like "PCI/P2P", although
"P2P" has historically also been used in the "PCI-to-PCI bridge"
context, so maybe there's something less ambiguous. "P2PDMA"?
When you add new files, I guess we're looking for the new SPDX
copyright stuff?
On Thu, Jan 04, 2018 at 12:01:26PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
>
> A kernel interface is provided so that other subsystems can find and
> allocate chunks of P2P memory as necessary to facilitate transfers
> between two PCI peers. Depending on hardware, this may reduce the
> bandwidth of the transfer but would significantly reduce pressure
> on system memory. This may be desirable in many cases: for example a
> system could be designed with a small CPU connected to a PCI switch by a
> small number of lanes which would maximize the number of lanes available
> to connect to NVME devices.
>
> The interface requires a user driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary. The ACS bits on the
> downstream switch port will be managed for all the registered clients.
>
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI switch. This is because
> using P2P transactions through the PCI root complex can have performance
> limitations or, worse, might not work at all. Finding out how well a
> particular RC supports P2P transfers is non-trivial.
It's more than "non-trivial" or "with good performance" (from Kconfig
help), isn't it? AFAIK, there's no standard way at all to discover
whether P2P DMA is supported between root ports or RCs.
> +config PCI_P2P
> + bool "PCI Peer to Peer transfer support"
> + depends on ZONE_DEVICE
> + select GENERIC_ALLOCATOR
> + help
> + Enables drivers to do PCI peer to peer transactions to and from
> + bars that are exposed to other devices in the same domain.
s/bars/BARs/ (and similarly below, except in C code)
Similarly, s/dma/DMA/ and s/pci/PCI/ below.
And probably also s/p2p/peer-to-peer DMA/ in messages.
Maybe clarify this domain bit. Using "domain" suggests the common PCI
segment/domain usage, but I think you really mean something like the
part of the hierarchy where peer-to-peer DMA is guaranteed by the PCI
spec to work, i.e., anything below a single PCI bridge.
> +
> + Many PCIe root complexes do not support P2P transactions and
> + it's hard to tell which support it with good performance, so
> + at this time you will need a PCIe switch.
> +
> + If unsure, say N.
> + * pci_p2pmem_add_resource - add memory for use as p2p memory
> + * @pci: the device to add the memory to
> + * @bar: PCI bar to add
> + * @size: size of the memory to add, may be zero to use the whole bar
> + * @offset: offset into the PCI bar
> + *
> + * The memory will be given ZONE_DEVICE struct pages so that it may
> + * be used with any dma request.
> + */
> +int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar, size_t size,
> + u64 offset)
> +{
> + struct dev_pagemap *pgmap;
> + void *addr;
> + int error;
Seems like there should be
if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
return -EINVAL;
or similar here?
> + if (WARN_ON(offset >= pci_resource_len(pdev, bar)))
> + return -EINVAL;
Are these WARN_ONs for debugging purposes, or do you think we need
them in production? Granted, hitting it would probably be a kernel
driver bug, but still, not sure if the PCI core needs to coddle the
driver author that much.
> + if (!size)
> + size = pci_resource_len(pdev, bar) - offset;
> +
> + if (WARN_ON(size + offset > pci_resource_len(pdev, bar)))
> + return -EINVAL;
> +
> + if (!pdev->p2p) {
> + error = pci_p2pmem_setup(pdev);
> + if (error)
> + return error;
> + }
> +
> + pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> + if (!pgmap)
> + return -ENOMEM;
> +
> + pgmap->res.start = pci_resource_start(pdev, bar) + offset;
> + pgmap->res.end = pgmap->res.start + size - 1;
I'm guessing Christoph's dev_pagemap revamp repo must change
pgmap->res from a pointer to a structure, but I don't see the actual
link in your cover letter.
I think you should set pgmap->res.flags here, too.
> + pgmap->ref = &pdev->p2p->devmap_ref;
> + pgmap->type = MEMORY_DEVICE_PCI_P2P;
> +
> + addr = devm_memremap_pages(&pdev->dev, pgmap);
> + if (IS_ERR(addr))
> + return PTR_ERR(addr);
> +
> + error = gen_pool_add_virt(pdev->p2p->pool, (uintptr_t)addr,
> + pci_bus_address(pdev, bar) + offset,
> + resource_size(&pgmap->res), dev_to_node(&pdev->dev));
> + if (error)
> + return error;
> +
> + error = devm_add_action_or_reset(&pdev->dev, pci_p2pmem_percpu_kill,
> + &pdev->p2p->devmap_ref);
> + if (error)
> + return error;
> +
> + dev_info(&pdev->dev, "added %zdB of p2p memory\n", size);
Can we add %pR and print pgmap->res itself, too?
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_add_resource);
> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge:
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.
> + */
> +static struct pci_dev *get_upstream_switch_port(struct pci_dev *pdev)
> +{
> + struct pci_dev *up1, *up2;
> +
> + if (!pdev)
> + return NULL;
> +
> + up1 = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!up1)
> + return NULL;
> +
> + up2 = pci_dev_get(pci_upstream_bridge(up1));
> + pci_dev_put(up1);
> +
> + return up2;
> +}
> +
> +static bool __upstream_bridges_match(struct pci_dev *upstream,
> + struct pci_dev *client)
> +{
> + struct pci_dev *dma_up;
> + bool ret = true;
> +
> + dma_up = get_upstream_switch_port(client);
> +
> + if (!dma_up) {
> + dev_dbg(&client->dev, "not a pci device behind a switch\n");
You have a bit of a mix of PCI ("pci device", "bridge") and PCIe
("switch", "switch port") terminology. I haven't read the rest of the
patches yet, so I don't know if you intend to restrict this to
PCIe-only, e.g., so you can use ACS, or if you want to make it
available on conventional PCI as well.
If the latter, I would use the generic PCI terminology, i.e., "bridge"
instead of "switch".
> + * pci_virt_to_bus - return the pci bus address for a given virtual
> + * address obtained with pci_alloc_p2pmem
> + * @pdev: the device the memory was allocated from
> + * @addr: address of the memory that was allocated
> + */
> +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
> +{
> + if (!addr)
> + return 0;
> + if (!pdev->p2p)
> + return 0;
> +
> + return gen_pool_virt_to_phys(pdev->p2p->pool, (unsigned long)addr);
This doesn't seem right. A physical address is not the same as a PCI
bus address. I expected something like pci_bus_address() or
pcibios_resource_to_bus() here. Am I missing something? If so, a
clarifying comment would be helpful.
> + * pci_p2pmem_publish - publish the p2p memory for use by other devices
> + * with pci_p2pmem_find
> + * @pdev: the device with p2p memory to publish
> + * @publish: set to true to publish the memory, false to unpublish it
> + */
> +void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
> +{
> + if (WARN_ON(publish && !pdev->p2p))
> + return;
Same WARN_ON question -- is this really intended for production?
Doesn't seem like the end user can really do anything with the warning
information.
> diff --git a/include/linux/pci-p2p.h b/include/linux/pci-p2p.h
> new file mode 100644
> index 000000000000..f811c97a5886
> --- /dev/null
> +++ b/include/linux/pci-p2p.h
> @@ -0,0 +1,85 @@
> +#ifndef _LINUX_PCI_P2P_H
> +#define _LINUX_PCI_P2P_H
> +/*
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/pci.h>
> +
> +struct block_device;
> +struct scatterlist;
I've been noticing that we're accumulating PCI-related files in
include/linux: pci.h, pci-aspm.h, pci-ats.h, pci-dma.h, pcieport_if.h,
etc. I'm not sure there's value in all those and am thinking maybe
they should just be folded into pci.h. What do you think?
> +#ifdef CONFIG_PCI_P2P
> +int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar, size_t size,
> + u64 offset);
> +int pci_p2pmem_add_client(struct list_head *head, struct device *dev);
> ...
Bjorn
On Thu, Jan 04, 2018 at 12:01:27PM -0700, Logan Gunthorpe wrote:
> Attributes display the total amount of P2P memory, the ammount available
> and whether it is published or not.
s/ammount/amount/ (also below)
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-bus-pci | 25 ++++++++++++++++
> drivers/pci/p2p.c | 51 +++++++++++++++++++++++++++++++++
> 2 files changed, 76 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 44d4b2be92fd..7b80ea77faca 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -323,3 +323,28 @@ Description:
>
> This is similar to /sys/bus/pci/drivers_autoprobe, but
> affects only the VFs associated with a specific PF.
> +
> +What: /sys/bus/pci/devices/.../p2pmem/available
I wonder if "p2pdma" would be a more suggestive term? It's not really
the *memory* that is peer-to-peer; the peer-to-peer part is referring
to *access* to the memory.
> @@ -82,6 +130,9 @@ static int pci_p2pmem_setup(struct pci_dev *pdev)
> if (error)
> goto out_pool_destroy;
>
> + if (sysfs_create_group(&pdev->dev.kobj, &p2pmem_group))
> + dev_warn(&pdev->dev, "failed to create p2p sysfs group\n");
Not sure the warning (by itself) is worthwhile. If we were going to
disable the feature if sysfs_create_group() failed, that's one thing,
but we aren't doing anything except generating a warning, which the
user can't really do anything with. If the user is looking for the
sysfs file, its absence will be obvious even without the message.
> pdev->p2p = p2p;
>
> return 0;
> --
> 2.11.0
>
[+cc Alex]
On Thu, Jan 04, 2018 at 12:01:29PM -0700, Logan Gunthorpe wrote:
> When the ACS P2P flags are set in the downstream port of the switch,
> any P2P TLPs will be sent back to the root complex. The whole point of
> the P2P work is to have TLPs avoid the RC seeing it may not support
> P2P transactions or, at best, it will perform poorly. So we clear these
"It will perform poorly" seems like an empirical statement about the
hardware you've seen, not something that's driven by the spec. So I'm
not sure it adds information that will be useful long-term.
> flags for all devices involved in transactions with the p2pmem.
I'm not sure how this coordinates with other uses of ACS, e.g., VFIO.
I think that should be addressed here somehow.
> A count of the number of requests to disable the flags is maintained.
> When the count transitions from 1 to 0, the old flags are restored.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/p2p.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> include/linux/pci.h | 2 +
> 2 files changed, 143 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/p2p.c b/drivers/pci/p2p.c
> index 87cec87b02e3..617adaa905d2 100644
> --- a/drivers/pci/p2p.c
> +++ b/drivers/pci/p2p.c
> @@ -227,6 +227,89 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
> }
>
> /*
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled in the downstream port of each device in order for
> + * the TLPs to not be forwarded up to the RC.
> + */
> +#define PCI_P2P_ACS_FLAGS (PCI_ACS_RR | PCI_ACS_CR)
> +
> +static int pci_p2pmem_disable_acs(struct pci_dev *pdev)
> +{
> + int pos;
> + u16 ctrl;
> + struct pci_dev *downstream;
> +
> + downstream = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!downstream) {
> + dev_err(&pdev->dev, "could not find downstream port\n");
> + return -ENODEV;
> + }
> +
> + device_lock(&downstream->dev);
> + if (downstream->p2p_acs_requests++)
> + goto unlock_and_return;
> +
> + pos = pci_find_ext_capability(downstream, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + goto unlock_and_return;
> +
> + pci_read_config_word(downstream, pos + PCI_ACS_CTRL, &ctrl);
> +
> + downstream->p2p_old_acs_flags = ctrl & PCI_P2P_ACS_FLAGS;
> +
> + if (downstream->p2p_old_acs_flags)
> + dev_info(&pdev->dev, "disabling p2p acs flags: %x\n", ctrl);
> +
> + ctrl &= ~PCI_P2P_ACS_FLAGS;
> +
> + pci_write_config_word(downstream, pos + PCI_ACS_CTRL, ctrl);
> +
> +unlock_and_return:
> + device_unlock(&downstream->dev);
> + pci_dev_put(downstream);
> + return 0;
> +}
> +
> +static int pci_p2pmem_reset_acs(struct pci_dev *pdev)
> +{
> + int pos;
> + u16 ctrl;
> + struct pci_dev *downstream;
> +
> + downstream = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!downstream)
> + return -ENODEV;
> +
> + device_lock(&downstream->dev);
> +
> + /* Only actually reset the flags on a 1->0 transition */
> + if (!downstream->p2p_acs_requests)
> + goto unlock_and_return;
> +
> + if (--downstream->p2p_acs_requests)
> + goto unlock_and_return;
> +
> + pos = pci_find_ext_capability(downstream, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + goto unlock_and_return;
> +
> + pci_read_config_word(downstream, pos + PCI_ACS_CTRL, &ctrl);
> +
> + ctrl &= ~PCI_P2P_ACS_FLAGS;
> + ctrl |= downstream->p2p_old_acs_flags;
> +
> + if (downstream->p2p_old_acs_flags)
> + dev_info(&pdev->dev, "resetting p2p acs flags: %x\n", ctrl);
> +
> + pci_write_config_word(downstream, pos + PCI_ACS_CTRL, ctrl);
> +
> +unlock_and_return:
> + device_unlock(&downstream->dev);
> + pci_dev_put(downstream);
> + return 0;
> +}
> +
> +/*
> * If a device is behind a switch, we try to find the upstream bridge
> * port of the switch. This requires two calls to pci_upstream_bridge:
> * one for the upstream port on the switch, one on the upstream port
> @@ -340,6 +423,10 @@ int pci_p2pmem_add_client(struct list_head *head, struct device *dev)
> ret = -EXDEV;
> goto put_client;
> }
> +
> + ret = pci_p2pmem_disable_acs(client);
> + if (!ret)
> + goto put_client;
> }
>
> new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
> @@ -363,6 +450,9 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_add_client);
>
> static void pci_p2pmem_client_free(struct pci_p2pmem_client *item)
> {
> + if (item->p2pmem)
> + pci_p2pmem_reset_acs(item->client);
> +
> list_del(&item->list);
> pci_dev_put(item->client);
> pci_dev_put(item->p2pmem);
> @@ -383,6 +473,7 @@ void pci_p2pmem_remove_client(struct list_head *head, struct device *dev)
> {
> struct pci_p2pmem_client *pos, *tmp;
> struct pci_dev *pdev;
> + struct pci_dev *p2pmem = NULL;
>
> pdev = find_parent_pci_dev(dev);
> if (!pdev)
> @@ -392,9 +483,16 @@ void pci_p2pmem_remove_client(struct list_head *head, struct device *dev)
> if (pos->client != pdev)
> continue;
>
> + if (!p2pmem)
> + p2pmem = pci_dev_get(pos->p2pmem);
> +
> pci_p2pmem_client_free(pos);
> }
>
> + if (p2pmem && list_empty(head))
> + pci_p2pmem_reset_acs(p2pmem);
> +
> + pci_dev_put(p2pmem);
> pci_dev_put(pdev);
> }
> EXPORT_SYMBOL_GPL(pci_p2pmem_remove_client);
> @@ -412,6 +510,10 @@ void pci_p2pmem_client_list_free(struct list_head *head)
> {
> struct pci_p2pmem_client *pos, *tmp;
>
> + pos = list_first_entry_or_null(head, struct pci_p2pmem_client, list);
> + if (pos && pos->p2pmem)
> + pci_p2pmem_reset_acs(pos->p2pmem);
> +
> list_for_each_entry_safe(pos, tmp, head, list)
> pci_p2pmem_client_free(pos);
> }
> @@ -440,6 +542,40 @@ static bool upstream_bridges_match_list(struct pci_dev *pdev,
> return ret;
> }
>
> +static int bind_clients(struct pci_dev *p2pmem, struct list_head *clients)
> +{
> + int ret;
> + struct pci_p2pmem_client *pos, *unwind_pos;
> +
> + ret = pci_p2pmem_disable_acs(p2pmem);
> + if (ret)
> + return ret;
> +
> + list_for_each_entry(pos, clients, list) {
> + ret = pci_p2pmem_disable_acs(pos->client);
> + if (ret)
> + goto unwind;
> +
> + pos->p2pmem = pci_dev_get(p2pmem);
> + }
> +
> + return 0;
> +
> +unwind:
> + list_for_each_entry(unwind_pos, clients, list) {
> + if (pos == unwind_pos)
> + break;
> +
> + pci_p2pmem_reset_acs(unwind_pos->client);
> + pci_dev_put(unwind_pos->p2pmem);
> + unwind_pos->p2pmem = NULL;
> + }
> +
> + pci_p2pmem_reset_acs(p2pmem);
> +
> + return ret;
> +}
> +
> /**
> * pci_p2pmem_find - find a p2p mem device compatible with the specified device
> * @dev: list of device to check (NULL-terminated)
> @@ -450,11 +586,13 @@ static bool upstream_bridges_match_list(struct pci_dev *pdev,
> *
> * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
> * to return the reference) or NULL if no compatible device is found.
> + *
> + * The P2P ACS flags on all applicable PCI devices will be cleared and
> + * reset when the client is removed from the list.
> */
> struct pci_dev *pci_p2pmem_find(struct list_head *clients)
> {
> struct pci_dev *pdev = NULL;
> - struct pci_p2pmem_client *pos;
>
> while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
> if (!pdev->p2p || !pdev->p2p->published)
> @@ -463,8 +601,8 @@ struct pci_dev *pci_p2pmem_find(struct list_head *clients)
> if (!upstream_bridges_match_list(pdev, clients))
> continue;
>
> - list_for_each_entry(pos, clients, list)
> - pos->p2pmem = pdev;
> + if (bind_clients(pdev, clients))
> + continue;
>
> return pdev;
> }
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 047aea679e87..cdd4d3c3e562 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -435,6 +435,8 @@ struct pci_dev {
> #endif
> #ifdef CONFIG_PCI_P2P
> struct pci_p2p *p2p;
> + unsigned int p2p_acs_requests;
> + u16 p2p_old_acs_flags;
> #endif
> phys_addr_t rom; /* Physical address of ROM if it's not from the BAR */
> size_t romlen; /* Length of ROM if it's not from the BAR */
> --
> 2.11.0
>
On Thu, Jan 04, 2018 at 12:01:26PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
> ...
> + * pci_p2pmem_alloc_sgl - allocate p2p memory in an sgl
> + * @pdev: the device to allocate memory from
> + * @sgl: the allocated sgl
> + * @nents: the number of sgs in the list
> + * @length: number of bytes to allocate
Your later patches use "SGL" in English text. If that's the conventional
capitalization, please use it here, too.
On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
> >I'd be much happier if dma_map_sg can tell the memory is P2P or not
> >from the scatterlist or dir arguments and not require the callers to
> >have this.
>
> We tried things like this in an earlier iteration[1] which assumed the SG
> was homogenous (all P2P or all regular memory). This required serious
> ugliness to try and ensure SGs were in fact homogenous[2].
I'm confused, these patches already assume the sg is homogenous,
right? Sure looks that way. So [2] is just debugging??
What I meant is to rely on the fact it is already homogenous and
create a function like this:
static inline bool sg_is_table_p2p(struct scatterlist *sg)
{
return is_pci_p2p_page(sg_page(sg));
}
And then insert
if (sg_is_table_p2p(sg))
return pci_p2pmem_map_sg(...);
At the top of dma_map_sg_attrs() (and similar for unmap)
Then we don't need to patch RDMA because RDMA is not special when it
comes to P2P. P2P should work with everything.
This has been my objection to every P2P iteration so far, for
years. RDMA is not special, P2P should not be patching RDMA's DMA
path. This an issue between the DMA and PCI subsystems, not for RDMA.
> >The signature of pci_p2pmem_map_sg also doesn't look very good, it
> >should mirror the usual dma_map_sg, otherwise it needs more revision
> >to extend this to do P2P through a root complex.
>
> Generally, the feedback that I get is to not make changes that try to
> anticipate the future. It would be easy enough to add those arguments when
> the code for going through the RC is made (which I expect will be very
> involved anyway).
Yes, but DMA mapping fundamentally requires the arguments to
dma_map_sg, so it is hard to accept an API called dma_map without
them. Maybe just a naming issue.
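Concretely, the shape being asked for would mirror dma_map_sg()
(hypothetical prototypes, just to illustrate; not what the patch
currently implements):

#include <linux/dma-direction.h>
#include <linux/scatterlist.h>

/*
 * Hypothetical signatures: carrying the device and direction would let a
 * future path through the root complex use them without another API change.
 */
int pci_p2pmem_map_sg(struct device *dev, struct scatterlist *sg, int nents,
		      enum dma_data_direction dir);
void pci_p2pmem_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
			 enum dma_data_direction dir);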
[2] https://github.com/sbates130272/linux-p2pmem/commit/61ec04fa63c79dab14570ea0e97432d9325ad127
On Thu, Jan 04, 2018 at 03:50:40PM -0600, Bjorn Helgaas wrote:
> > This is similar to /sys/bus/pci/drivers_autoprobe, but
> > affects only the VFs associated with a specific PF.
> > +
> > +What: /sys/bus/pci/devices/.../p2pmem/available
>
> I wonder if "p2pdma" would be a more suggestive term? It's not really
> the *memory* that is peer-to-peer; the peer-to-peer part is referring
> to *access* to the memory.
There have been out of tree patches using P2P DMA for a long time, and
some of the use cases have nothing to do with 'memory' - eg DMA to
'registers'
I notice that this series particularly focuses on treating the target
BAR as 'memory' - ie it puts genalloc on top of the BAR, and I guess
treats all addresses as equal and interchangeable.
If this series gets accepted I would expect proposals to extend this
infrastructure to allow for P2P for registers in some way as well.
So I think the 'p2pmem' name is a good choice only when it is in
places that talk about the genalloc part of this design.
We should reserve p2pdma to talk about the generic infrastructure
unrelated to the genalloc pool.
Since these sysfs's seem to report the genalloc pool status, p2pmem
seems like a good choice to me.
> > @@ -82,6 +130,9 @@ static int pci_p2pmem_setup(struct pci_dev *pdev)
> > if (error)
> > goto out_pool_destroy;
> >
> > + if (sysfs_create_group(&pdev->dev.kobj, &p2pmem_group))
> > + dev_warn(&pdev->dev, "failed to create p2p sysfs group\n");
>
> Not sure the warning (by itself) is worthwhile. If we were going to
> disable the feature if sysfs_create_group() failed, that's one thing,
> but we aren't doing anything except generating a warning, which the
> user can't really do anything with. If the user is looking for the
> sysfs file, its absence will be obvious even without the message.
Don't most of the failure paths inside sysfs_create_group cause prints
anyhow?
Jason
On Thu, 4 Jan 2018 15:57:21 -0600
Bjorn Helgaas <[email protected]> wrote:
> [+cc Alex]
>
> On Thu, Jan 04, 2018 at 12:01:29PM -0700, Logan Gunthorpe wrote:
> > When the ACS P2P flags are set in the downstream port of the switch,
> > any P2P TLPs will be sent back to the root complex. The whole point of
> > the P2P work is to have TLPs avoid the RC seeing it may not support
> > P2P transactions or, at best, it will perform poorly. So we clear these
>
> "It will perform poorly" seems like an empirical statement about the
> hardware you've seen, not something that's driven by the spec. So I'm
> not sure it adds information that will be useful long-term.
>
> > flags for all devices involved in transactions with the p2pmem.
>
> I'm not sure how this coordinates with other uses of ACS, e.g., VFIO.
> I think that should be addressed here somehow.
Yep, flipping these ACS bits invalidates any IOMMU groups that depend
on the isolation of that downstream port and I suspect also any peers
within the same PCI slot of that port and their downstream devices. The
entire sub-hierarchy grouping needs to be re-evaluated. This
potentially affects running devices that depend on that isolation, so
I'm not sure how that happens dynamically. A boot option might be
easier. Thanks,
Alex
Thanks for the speedy review!
On 04/01/18 02:40 PM, Bjorn Helgaas wrote:
> Run "git log --oneline drivers/pci" and follow the convention. I
> think it would make sense to add a new tag like "PCI/P2P", although
> "P2P" has historically also been used in the "PCI-to-PCI bridge"
> context, so maybe there's something less ambiguous. "P2PDMA"?
Ok, I'll fix this for v2. I'm fine with renaming things to p2pdma.
> When you add new files, I guess we're looking for the new SPDX
> copyright stuff?
Will do.
> It's more than "non-trivial" or "with good performance" (from Kconfig
> help), isn't it? AFAIK, there's no standard way at all to discover
> whether P2P DMA is supported between root ports or RCs.
Yup, that's correct. This would have to be done with a white list.
>
> s/bars/BARs/ (and similarly below, except in C code)
> Similarly, s/dma/DMA/ and s/pci/PCI/ below.
> And probably also s/p2p/peer-to-peer DMA/ in messages.
Will do.
> Maybe clarify this domain bit. Using "domain" suggests the common PCI
> segment/domain usage, but I think you really mean something like the
> part of the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> spec to work, i.e., anything below a single PCI bridge.
Ok, I like the wording you proposed.
>
> Seems like there should be
>
> if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
> return -EINVAL;
>
> or similar here?
That sounds like a good idea. Will add.
>
>> + if (WARN_ON(offset >= pci_resource_len(pdev, bar)))
>> + return -EINVAL;
>
> Are these WARN_ONs for debugging purposes, or do you think we need
> them in production? Granted, hitting it would probably be a kernel
> driver bug, but still, not sure if the PCI core needs to coddle the
> driver author that much.
Sure, I'll drop all the WARN_ONs.
> I'm guessing Christoph's dev_pagemap revamp repo must change
> pgmap->res from a pointer to a structure, but I don't see the actual
> link in your cover letter.
Sorry, the patch set is here:
https://www.mail-archive.com/[email protected]/msg07323.html
git://git.infradead.org/users/hch/misc.git hch/pgmap-cleanups.3
> I think you should set pgmap->res.flags here, too.
Sure. I don't think it's used, and it isn't set by the NVDIMM code, but I
agree that it'd be a good idea to set it anyway.
>> + dev_info(&pdev->dev, "added %zdB of p2p memory\n", size);
>
> Can we add %pR and print pgmap->res itself, too?
Yup.
> You have a bit of a mix of PCI ("pci device", "bridge") and PCIe
> ("switch", "switch port") terminology. I haven't read the rest of the
> patches yet, so I don't know if you intend to restrict this to
> PCIe-only, e.g., so you can use ACS, or if you want to make it
> available on conventional PCI as well.
>
> If the latter, I would use the generic PCI terminology, i.e., "bridge"
> instead of "switch".
Ok, I'll change it to use the generic term bridge. There's no
restriction in the code to limit it to PCIe only, though I don't expect
anybody will ever be using this with legacy PCI.
>> + * pci_virt_to_bus - return the pci bus address for a given virtual
>> + * address obtained with pci_alloc_p2pmem
>> + * @pdev: the device the memory was allocated from
>> + * @addr: address of the memory that was allocated
>> + */
>> +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
>> +{
>> + if (!addr)
>> + return 0;
>> + if (!pdev->p2p)
>> + return 0;
>> +
>> + return gen_pool_virt_to_phys(pdev->p2p->pool, (unsigned long)addr);
>
> This doesn't seem right. A physical address is not the same as a PCI
> bus address. I expected something like pci_bus_address() or
> pcibios_resource_to_bus() here. Am I missing something? If so, a
> clarifying comment would be helpful.
What you're missing is that when we called gen_pool_add_virt we used the
PCI bus address as the physical address and not the CPU physical address
(which we don't care about). I'll add a comment explaining this.
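In other words, roughly the following when the BAR is registered (a
simplified sketch; 'addr', 'bar', 'offset' and 'size' stand in for the
actual variables in the patch):

    /*
     * Seed the pool with the PCI *bus* address in the "phys" slot so that
     * gen_pool_virt_to_phys() later returns a bus address directly.
     */
    error = gen_pool_add_virt(pdev->p2p->pool, (unsigned long)addr,
                              pci_bus_address(pdev, bar) + offset,
                              size, dev_to_node(&pdev->dev));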
> I've been noticing that we're accumulating PCI-related files in
> include/linux: pci.h, pci-aspm.h pci-ats.h, pci-dma.h, pcieport_if.h,
> etc. I'm not sure there's value in all those and am thinking maybe
> they should just be folded into pci.h. What do you think?
We started with that. Once we reached a certain amount of code,
Christoph suggested we put it in its own header.
Logan
On 04/01/18 02:50 PM, Bjorn Helgaas wrote:
> On Thu, Jan 04, 2018 at 12:01:27PM -0700, Logan Gunthorpe wrote:
>> Attributes display the total amount of P2P memory, the ammount available
>> and whether it is published or not.
>
> s/ammount/amount/ (also below)
Will fix.
> I wonder if "p2pdma" would be a more suggestive term? It's not really
> the *memory* that is peer-to-peer; the peer-to-peer part is referring
> to *access* to the memory.
I agree with Jason on this point. For now, it only describes the
peer-to-peer memory attributes. If we change the directory to p2pdma
then we'd have to add a mem prefix or something to each attribute.
>> @@ -82,6 +130,9 @@ static int pci_p2pmem_setup(struct pci_dev *pdev)
>> if (error)
>> goto out_pool_destroy;
>>
>> + if (sysfs_create_group(&pdev->dev.kobj, &p2pmem_group))
>> + dev_warn(&pdev->dev, "failed to create p2p sysfs group\n");
>
> Not sure the warning (by itself) is worthwhile. If we were going to
> disable the feature if sysfs_create_group() failed, that's one thing,
> but we aren't doing anything except generating a warning, which the
> user can't really do anything with. If the user is looking for the
> sysfs file, its absence will be obvious even without the message.
Unfortunately, sysfs_create_group() is marked warn_unused_result, so I
need to do something with the return value to squash the compiler warning.
I could just assign it to an unused variable, but it seemed like something
should be done with the error, and printing a warning was the easiest thing...
Logan
On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
> On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
>> We tried things like this in an earlier iteration[1] which assumed the SG
>> was homogenous (all P2P or all regular memory). This required serious
>> ugliness to try and ensure SGs were in fact homogenous[2].
>
> I'm confused, these patches already assume the sg is homogenous,
> right? Sure looks that way. So [2] is just debugging??
Yes, but it's a bit different to expect that someone calling
pci_p2pmem_map_sg() will know what they're doing and provide a
homogeneous SG. It is relatively clear by convention that the entire SG
must be homogeneous given they're calling a pci_p2pmem function. Whereas
allowing P2P SGs into the core DMA code means all we can do is hope
that future developers don't screw it up and allow P2P pages to mix in
with regular pages.
It's also very difficult to add similar functionality to dma_map_page,
seeing as dma_unmap_page won't have any way to know what it's dealing with.
It just seems confusing to support P2P in the SG version and not the
page version.
> What I meant is to rely on the fact it is already homogenous and
> create a function like this:
>
> static inline bool sg_is_table_p2p(struct scatterlist *sg)
> {
> return is_pci_p2p_page(sg_page(sg));
> }
>
> And then insert
>
> if (sg_is_table_p2p(sg))
> return pci_p2pmem_map_sg(...);
Yes, this is almost identical to the earlier patch I referenced...
> Then we don't need to patch RDMA because RDMA is not special when it
> comes to P2P. P2P should work with everything.
Yes, I agree this would be very nice. But it's challenging at this time
to do it safely. And something like this can still be done in future
work after this patch set gets merged. The changes in this set to the
RDMA and block layers are fairly small and the majority of the block
layer changes would still be required anyway to ensure a given
queue/driver supports P2P requests.
>>> The signature of pci_p2pmem_map_sg also doesn't look very good, it
>>> should mirror the usual dma_map_sg, otherwise it needs more revision
>>> to extend this to do P2P through a root complex.
>>
>> Generally, the feedback that I get is to not make changes that try to
>> anticipate the future. It would be easy enough to add those arguments when
>> the code for going through the RC is made (which I expect will be very
>> involved anyway).
>
> Yes, but DMA mapping fundamentally requires the arguments to
> dma_map_sg, so it is hard to accept an API called dma_map without
> them.. Maybe just a naming issue.
I don't agree that this is a fundamental requirement. But if no one else
objects to adding unused arguments I'd be fine with adding them.
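For example, a hypothetical signature mirroring dma_map_sg() might look
like this (sketch only; the extra arguments would simply be ignored until
P2P through a root complex is supported):

    int pci_p2pmem_map_sg(struct device *dma_dev, struct scatterlist *sg,
                          int nents, enum dma_data_direction dir);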
Logan
On 04/01/18 03:35 PM, Alex Williamson wrote:
> Yep, flipping these ACS bits invalidates any IOMMU groups that depend
> on the isolation of that downstream port and I suspect also any peers
> within the same PCI slot of that port and their downstream devices. The
> entire sub-hierarchy grouping needs to be re-evaluated. This
> potentially affects running devices that depend on that isolation, so
> I'm not sure how that happens dynamically. A boot option might be
> easier. Thanks,
I don't see how this is the case in current kernel code. It appears to
only enable ACS globally if the IOMMU requests it.
I also don't see how turning off ACS isolation for a specific device is
going to hurt anything. The IOMMU should still be able to keep going on
unaware that anything has changed. The only worry is that a security
hole may now be created if a user was relying on the isolation between
two devices that are in different VMs or something. However, if a user
was relying on this, they probably shouldn't have turned on P2P in the
first place.
We started with a fairly unintelligent choice to simply disable ACS on
any kernel that had CONFIG_PCI_P2P set. However, this did not seem like
a good idea going forward. Instead, we now selectively disable the ACS
bit only on the downstream ports that are involved in P2P transactions.
This seems like the safest choice and still allows people to (carefully)
use P2P adjacent to other devices that need to be isolated.
I don't think anyone wants another boot option that must be set in order
to use this functionality (and only some hardware would require this).
That's just a huge pain for users.
Logan
On 04/01/18 02:59 PM, Bjorn Helgaas wrote:
> On Thu, Jan 04, 2018 at 12:01:26PM -0700, Logan Gunthorpe wrote:
>> Some PCI devices may have memory mapped in a BAR space that's
>> intended for use in Peer-to-Peer transactions. In order to enable
>> such transactions the memory must be registered with ZONE_DEVICE pages
>> so it can be used by DMA interfaces in existing drivers.
>> ...
>
>> + * pci_p2pmem_alloc_sgl - allocate p2p memory in an sgl
>> + * @pdev: the device to allocate memory from
>> + * @sgl: the allocated sgl
>> + * @nents: the number of sgs in the list
>> + * @length: number of bytes to allocate
>
> Your later patches use "SGL" in English text. If that's the conventional
> capitalization, please use it here, too.
>
I'm not sure there is a convention. scatterlist.h uses 'sglist' and
'scatterlist' so I'll just change my comments to use the latter.
Logan
On 04/01/18 05:00 PM, Logan Gunthorpe wrote:
>
>
> On 04/01/18 03:35 PM, Alex Williamson wrote:
>> Yep, flipping these ACS bits invalidates any IOMMU groups that depend
>> on the isolation of that downstream port and I suspect also any peers
>> within the same PCI slot of that port and their downstream devices. The
>> entire sub-hierarchy grouping needs to be re-evaluated. This
>> potentially affects running devices that depend on that isolation, so
>> I'm not sure how that happens dynamically. A boot option might be
>> easier. Thanks,
>
Also, considering this further: if you can point me to a way to tell
whether a device is in an IOMMU group, it should be pretty easy to disallow
a device that has been assigned to a group from being used in P2P
transactions.
Logan
On Thu, 4 Jan 2018 17:00:47 -0700
Logan Gunthorpe <[email protected]> wrote:
> On 04/01/18 03:35 PM, Alex Williamson wrote:
> > Yep, flipping these ACS bits invalidates any IOMMU groups that depend
> > on the isolation of that downstream port and I suspect also any peers
> > within the same PCI slot of that port and their downstream devices. The
> > entire sub-hierarchy grouping needs to be re-evaluated. This
> > potentially affects running devices that depend on that isolation, so
> > I'm not sure how that happens dynamically. A boot option might be
> > easier. Thanks,
>
> I don't see how this is the case in current kernel code. It appears to
> only enable ACS globally if the IOMMU requests it.
IOMMU groups don't exist unless the IOMMU is enabled and x86 and ARM
both request ACS be enabled if an IOMMU is present, so I'm not sure
what you're getting at here. Also, in reply to your other email, if
the IOMMU is enabled, every device handled by the IOMMU is a member of
an IOMMU group, see struct device.iommu_group. There's an
iommu_group_get() accessor to get a reference to it.
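For example, something along these lines would tell you whether a device
is covered by a group (a minimal, untested sketch):

    struct iommu_group *group = iommu_group_get(&pdev->dev);

    if (group) {
            /* device is isolated by the IOMMU as part of a group */
            iommu_group_put(group);
            /* refuse to use it for p2p, or re-evaluate the group */
    }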
> I also don't see how turning off ACS isolation for a specific device is
> going to hurt anything. The IOMMU should still be able to keep going on
> unaware that anything has changed. The only worry is that a security
> hole may now be created if a user was relying on the isolation between
> two devices that are in different VMs or something. However, if a user
> was relying on this, they probably shouldn't have turned on P2P in the
> first place.
That's exactly what IOMMU groups represent, the smallest set of devices
which have DMA isolation from other devices. By poking this hole, the
IOMMU group is invalid. We cannot turn off ACS only for a specific
device, in order to enable p2p it needs to be disabled at every
downstream port between the devices where we want to enable p2p.
Depending on the topology, that could mean we're also enabling p2p for
unrelated devices. Those unrelated devices might be in active use and
the p2p IOVAs now have a different destination which is no longer IOMMU
translated.
> We started with a fairly unintelligent choice to simply disable ACS on
> any kernel that had CONFIG_PCI_P2P set. However, this did not seem like
> a good idea going forward. Instead, we now selectively disable the ACS
> bit only on the downstream ports that are involved in P2P transactions.
> This seems like the safest choice and still allows people to (carefully)
> use P2P adjacent to other devices that need to be isolated.
I don't see that the code is doing much checking that adjacent devices
are also affected by the p2p change and of course the IOMMU group is
entirely invalid once the p2p holes start getting poked.
> I don't think anyone wants another boot option that must be set in order
> to use this functionality (and only some hardware would require this).
> That's just a huge pain for users.
No, but nor do we need IOMMU groups that no longer represent what
they're intended to describe or runtime, unchecked routing changes
through the topology for devices that might already be using
conflicting IOVA ranges. Maybe soft hotplugs are another possibility,
designate a sub-hierarchy to be removed and re-scanned with ACS
disabled. Otherwise it seems like disabling and re-enabling ACS needs
to also handle merging and splitting groups dynamically. Thanks,
Alex
On Thu, Jan 04, 2018 at 04:44:00PM -0700, Logan Gunthorpe wrote:
> On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
> >On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
> >>We tried things like this in an earlier iteration[1] which assumed the SG
> >>was homogenous (all P2P or all regular memory). This required serious
> >>ugliness to try and ensure SGs were in fact homogenous[2].
> >
> >I'm confused, these patches already assume the sg is homogenous,
> >right? Sure looks that way. So [2] is just debugging??
>
> Yes, but it's a bit different to expect that someone calling
> pci_p2pmem_map_sg() will know what they're doing and provide a homogenous
> SG. It is relatively clear by convention that the entire SG must be
> homogenous given they're calling a pci_p2pmem function. Where as, allowing
> P2P SGs into the core DMA code means all we can do is hope that future
> developers don't screw it up and allow P2P pages to mix in with regular
> pages.
Well that argument applies equally to the RDMA RW API wrappers around
the DMA API. I think it is fine if an SGL is defined to be either all P2P
or not, and that debugging support seemed reasonable to me..
> It's also very difficult to add similar functionality to dma_map_page seeing
> dma_unmap_page won't have any way to know what it's dealing with. It just
> seems confusing to support P2P in the SG version and not the page version.
Well, this proposal is to support P2P in only some RDMA APIs and not
others, so it seems about as confusing to me..
> >Then we don't need to patch RDMA because RDMA is not special when it
> >comes to P2P. P2P should work with everything.
>
> Yes, I agree this would be very nice.
Well, it is more than very nice. We have to keep RDMA working after
all, and if you make it even more special things become harder for us.
It is already the case that DMA in RDMA is very strange. We have
drivers that provide their own DMA ops, for instance.
And on that topic, does this scheme work with HFI?
On first glance, it looks like no. The PCI device the HFI device is
attached to may be able to do P2P, so it should be able to trigger the
support.
However, substituting the p2p_dma_map for the real device op dma_map
will cause a kernel crash when working with HFI. HFI uses custom DMA
ops that return CPU addresses in the dma_addr_t, which the driver
handles in various special ways. One cannot just replace them with PCI
bus addresses.
So, this kinda looks to me like it causes bad breakage for some RDMA
drivers??
This is why P2P must fit in to the common DMA framework somehow, we
rely on these abstractions to work properly and fully in RDMA.
I think you should consider pushing this directly into the dma_ops
implementations. Add a p2p_supported flag to struct dma_map_ops, and
only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
Only map_sg would be supported for P2P. Upgraded implementations can
call the helper function.
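Something like this, roughly (sketch only - the p2p_supported flag does
not exist today and would have to be added to struct dma_map_ops):

    /* in the dma_map_sg() wrapper, before calling ops->map_sg() */
    if (is_pci_p2p_page(sg_page(sg)) && !get_dma_ops(dev)->p2p_supported)
            return 0;       /* mapping refused */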
Jason
On Thu, Jan 04, 2018 at 08:33:00PM -0700, Alex Williamson wrote:
> On Thu, 4 Jan 2018 17:00:47 -0700
> Logan Gunthorpe <[email protected]> wrote:
>
> > On 04/01/18 03:35 PM, Alex Williamson wrote:
> > > Yep, flipping these ACS bits invalidates any IOMMU groups that depend
> > > on the isolation of that downstream port and I suspect also any peers
> > > within the same PCI slot of that port and their downstream devices. The
> > > entire sub-hierarchy grouping needs to be re-evaluated. This
> > > potentially affects running devices that depend on that isolation, so
> > > I'm not sure how that happens dynamically. A boot option might be
> > > easier. Thanks,
> >
> > I don't see how this is the case in current kernel code. It appears to
> > only enable ACS globally if the IOMMU requests it.
>
> IOMMU groups don't exist unless the IOMMU is enabled and x86 and ARM
> both request ACS be enabled if an IOMMU is present, so I'm not sure
> what you're getting at here. Also, in reply to your other email, if
> the IOMMU is enabled, every device handled by the IOMMU is a member of
> an IOMMU group, see struct device.iommu_group. There's an
> iommu_group_get() accessor to get a reference to it.
>
> > I also don't see how turning off ACS isolation for a specific device is
> > going to hurt anything. The IOMMU should still be able to keep going on
> > unaware that anything has changed. The only worry is that a security
> > hole may now be created if a user was relying on the isolation between
> > two devices that are in different VMs or something. However, if a user
> > was relying on this, they probably shouldn't have turned on P2P in the
> > first place.
>
> That's exactly what IOMMU groups represent, the smallest set of devices
> which have DMA isolation from other devices. By poking this hole, the
> IOMMU group is invalid. We cannot turn off ACS only for a specific
> device, in order to enable p2p it needs to be disabled at every
> downstream port between the devices where we want to enable p2p.
> Depending on the topology, that could mean we're also enabling p2p for
> unrelated devices. Those unrelated devices might be in active use and
> the p2p IOVAs now have a different destination which is no longer IOMMU
> translated.
>
> > We started with a fairly unintelligent choice to simply disable ACS on
> > any kernel that had CONFIG_PCI_P2P set. However, this did not seem like
> > a good idea going forward. Instead, we now selectively disable the ACS
> > bit only on the downstream ports that are involved in P2P transactions.
> > This seems like the safest choice and still allows people to (carefully)
> > use P2P adjacent to other devices that need to be isolated.
>
> I don't see that the code is doing much checking that adjacent devices
> are also affected by the p2p change and of course the IOMMU group is
> entirely invalid once the p2p holes start getting poked.
>
> > I don't think anyone wants another boot option that must be set in order
> > to use this functionality (and only some hardware would require this).
> > That's just a huge pain for users.
>
> No, but nor do we need IOMMU groups that no longer represent what
> they're intended to describe or runtime, unchecked routing changes
> through the topology for devices that might already be using
> conflicting IOVA ranges. Maybe soft hotplugs are another possibility,
> designate a sub-hierarchy to be removed and re-scanned with ACS
> disabled. Otherwise it seems like disabling and re-enabling ACS needs
> to also handle merging and splitting groups dynamically. Thanks,
>
Dumb question: can we map the PCI BAR address of one device into the
IOMMU page table of another device, i.e. like we would DMA map a
regular system page?
It would be much better in my view to go down that path if that
is at all possible from a hardware point of view (I am not sure where
to dig in the specification to answer my question above).
Cheers,
Jérôme
> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> {
> u16 tail = nvmeq->sq_tail;
>
> - if (nvmeq->sq_cmds_io)
> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> - else
> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>
Why is it always memcpy() and not memcpy_toio()? I'd expect something like
if (nvmeq->sq_cmds_is_io)
memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
else
memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
Marta
On Fri, 5 Jan 2018 01:47:01 -0500
Jerome Glisse <[email protected]> wrote:
> On Thu, Jan 04, 2018 at 08:33:00PM -0700, Alex Williamson wrote:
> > On Thu, 4 Jan 2018 17:00:47 -0700
> > Logan Gunthorpe <[email protected]> wrote:
> >
> > > On 04/01/18 03:35 PM, Alex Williamson wrote:
> > > > Yep, flipping these ACS bits invalidates any IOMMU groups that depend
> > > > on the isolation of that downstream port and I suspect also any peers
> > > > within the same PCI slot of that port and their downstream devices. The
> > > > entire sub-hierarchy grouping needs to be re-evaluated. This
> > > > potentially affects running devices that depend on that isolation, so
> > > > I'm not sure how that happens dynamically. A boot option might be
> > > > easier. Thanks,
> > >
> > > I don't see how this is the case in current kernel code. It appears to
> > > only enable ACS globally if the IOMMU requests it.
> >
> > IOMMU groups don't exist unless the IOMMU is enabled and x86 and ARM
> > both request ACS be enabled if an IOMMU is present, so I'm not sure
> > what you're getting at here. Also, in reply to your other email, if
> > the IOMMU is enabled, every device handled by the IOMMU is a member of
> > an IOMMU group, see struct device.iommu_group. There's an
> > iommu_group_get() accessor to get a reference to it.
> >
> > > I also don't see how turning off ACS isolation for a specific device is
> > > going to hurt anything. The IOMMU should still be able to keep going on
> > > unaware that anything has changed. The only worry is that a security
> > > hole may now be created if a user was relying on the isolation between
> > > two devices that are in different VMs or something. However, if a user
> > > was relying on this, they probably shouldn't have turned on P2P in the
> > > first place.
> >
> > That's exactly what IOMMU groups represent, the smallest set of devices
> > which have DMA isolation from other devices. By poking this hole, the
> > IOMMU group is invalid. We cannot turn off ACS only for a specific
> > device, in order to enable p2p it needs to be disabled at every
> > downstream port between the devices where we want to enable p2p.
> > Depending on the topology, that could mean we're also enabling p2p for
> > unrelated devices. Those unrelated devices might be in active use and
> > the p2p IOVAs now have a different destination which is no longer IOMMU
> > translated.
> >
> > > We started with a fairly unintelligent choice to simply disable ACS on
> > > any kernel that had CONFIG_PCI_P2P set. However, this did not seem like
> > > a good idea going forward. Instead, we now selectively disable the ACS
> > > bit only on the downstream ports that are involved in P2P transactions.
> > > This seems like the safest choice and still allows people to (carefully)
> > > use P2P adjacent to other devices that need to be isolated.
> >
> > I don't see that the code is doing much checking that adjacent devices
> > are also affected by the p2p change and of course the IOMMU group is
> > entirely invalid once the p2p holes start getting poked.
> >
> > > I don't think anyone wants another boot option that must be set in order
> > > to use this functionality (and only some hardware would require this).
> > > That's just a huge pain for users.
> >
> > No, but nor do we need IOMMU groups that no longer represent what
> > they're intended to describe or runtime, unchecked routing changes
> > through the topology for devices that might already be using
> > conflicting IOVA ranges. Maybe soft hotplugs are another possibility,
> > designate a sub-hierarchy to be removed and re-scanned with ACS
> > disabled. Otherwise it seems like disabling and re-enabling ACS needs
> > to also handle merging and splitting groups dynamically. Thanks,
> >
>
> Dumb question, can we use a PCI bar address of one device into the
> IOMMU page table of another address ie like we would DMA map a
> regular system page ?
>
> It would be much better in my view to follow down such path if that
> is at all possible from hardware point of view (i am not sure where
> to dig in the specification to answer my above question).
Yes, we can bounce device MMIO through the IOMMU, or at least vfio does
enable these mappings to allow p2p between devices assigned to the same
VM and we've seen evidence that they work, but like p2p in general,
there are likely platform dependencies. For Logan this doesn't really
solve the problem unless the device and downstream port support ATS and
ACS Direct Translated P2P is enabled. That would allow the device to do
p2p with addresses pre-translated by the IOMMU and cached by ATS on the
device without the transaction needing to traverse all the way to the
IOMMU on the root bus. The security of ATS is pretty questionable in
general, but that would likely be a solution that wouldn't require
manipulating the groupings. Thanks,
Alex
On 04/01/18 08:33 PM, Alex Williamson wrote:
> That's exactly what IOMMU groups represent, the smallest set of devices
> which have DMA isolation from other devices. By poking this hole, the
> IOMMU group is invalid. We cannot turn off ACS only for a specific
> device, in order to enable p2p it needs to be disabled at every
> downstream port between the devices where we want to enable p2p.
> Depending on the topology, that could mean we're also enabling p2p for
> unrelated devices. Those unrelated devices might be in active use and
> the p2p IOVAs now have a different destination which is no longer IOMMU
> translated.
Oh, so IOMMU groups are created based on the existing hierarchy at boot
time and not based on the user's needs for isolation?
Logan
On Fri, 5 Jan 2018 10:10:51 -0700
Logan Gunthorpe <[email protected]> wrote:
> On 04/01/18 08:33 PM, Alex Williamson wrote:
> > That's exactly what IOMMU groups represent, the smallest set of devices
> > which have DMA isolation from other devices. By poking this hole, the
> > IOMMU group is invalid. We cannot turn off ACS only for a specific
> > device, in order to enable p2p it needs to be disabled at every
> > downstream port between the devices where we want to enable p2p.
> > Depending on the topology, that could mean we're also enabling p2p for
> > unrelated devices. Those unrelated devices might be in active use and
> > the p2p IOVAs now have a different destination which is no longer IOMMU
> > translated.
>
> Oh, so IOMMU groups are created based on the existing hierarchy at boot
> time and not based on the user's needs for isolation?
Yes, IOMMU groups expose the isolation of the system as devices are
discovered. Nothing currently accounts for intentionally decreasing
the isolation between devices. Thanks,
Alex
On Thu, Jan 04, 2018 at 12:01:34PM -0700, Logan Gunthorpe wrote:
> Register the CMB buffer as p2pmem and use the appropriate allocation
> functions to create and destroy the IO SQ.
>
> If the CMB supports WDS and RDS, publish it for use as p2p memory
> by other devices.
<>
> + if (qid && dev->cmb_use_sqes) {
> + nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
> + nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
> + nvmeq->sq_cmds);
> + nvmeq->sq_cmds_is_io = true;
> + }
This gets into some spec type trouble for creating an IO submission
queue. We use the sq_dma_addr as the PRP entry to the Create I/O
SQ command, which has some requirements:
"the PRP Entry shall have an offset of 0h."
So this has to be 4k aligned, but pci_alloc_p2pmem doesn't guarantee
that. I doubt the spec's intention was to require such alignment for
CMB SQs, but there may be controllers enforcing the rule as written.
On 05/01/18 08:30 AM, Marta Rybczynska wrote:
>> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>> {
>> u16 tail = nvmeq->sq_tail;
>>
>> - if (nvmeq->sq_cmds_io)
>> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>> - else
>> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>
>
> Why is it always memcpy() and not memcpy_toio()? I'd expect something like
> if (nvmeq->sq_cmds_is_io)
> memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> else
> memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
We're going on the assumption that memory mapped with
devm_memremap_pages() can be treated as regular memory. So memcpy_toio
is not necessary for P2P memory.
Logan
On 05/01/18 11:11 AM, Keith Busch wrote:
> On Thu, Jan 04, 2018 at 12:01:34PM -0700, Logan Gunthorpe wrote:
>> Register the CMB buffer as p2pmem and use the appropriate allocation
>> functions to create and destroy the IO SQ.
>>
>> If the CMB supports WDS and RDS, publish it for use as p2p memory
>> by other devices.
>
> <>
>
>> + if (qid && dev->cmb_use_sqes) {
>> + nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
>
>> + nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
>> + nvmeq->sq_cmds);
>> + nvmeq->sq_cmds_is_io = true;
>> + }
>
> This gets into some spec type trouble for creating an IO submission
> queue. We use the sq_dma_addr as the PRP entry to the Create I/O
> SQ command, which has some requirements:
>
> "the PRP Entry shall have an offset of 0h."
>
> So this has to be 4k aligned, but pci_alloc_p2pmem doesn't guarantee
> that. I doubt the spec's intention was to require such alignment for
> CMB SQs, but there may be controllers enforcing the rule as written.
Although it is not explicitly stated anywhere, pci_alloc_p2pmem() should
always be at least 4k aligned. This is because the gen_pool that
implements it is created with PAGE_SHIFT for its min_alloc_order.
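i.e., the pool is created roughly like this (simplified; the actual patch
may use the devm_ variant):

    pdev->p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));

so every allocation is a multiple of PAGE_SIZE and therefore at least 4k
aligned.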
Logan
On Fri, Jan 05, 2018 at 11:19:28AM -0700, Logan Gunthorpe wrote:
> Although it is not explicitly stated anywhere, pci_alloc_p2pmem() should
> always be at least 4k aligned. This is because the gen_pool that implements
> it is created with PAGE_SHIFT for its min_alloc_order.
Ah, I see that now. Thanks for the explanation.
Does it need to be created with a page-sized minimum alloc order? That
granularity makes it difficult to fit SQs in the CMB on archs with larger
pages when we only need 4k alignment.
I was also hoping to extend this for PRP/SGL in CMB where even 4k is
too high a granularity to make it really useful.
It looks like creating the gen pool with a smaller minimum and
gen_pool_first_fit_order_align algo would satisfy my use cases, but I'm
not sure if there's another reason you've set it to page alignment.
On 05/01/18 12:01 PM, Keith Busch wrote:
> On Fri, Jan 05, 2018 at 11:19:28AM -0700, Logan Gunthorpe wrote:
>> Although it is not explicitly stated anywhere, pci_alloc_p2pmem() should
>> always be at least 4k aligned. This is because the gen_pool that implements
>> it is created with PAGE_SHIFT for its min_alloc_order.
>
> Ah, I see that now. Thanks for the explanation.
>
> Does it need to be created with page sized minimum alloc order? That
> granularity makes it difficult to fit SQs in CMB on archs with larger
> pages when we only needed 4k alignment.
>
> I was also hoping to extend this for PRP/SGL in CMB where even 4k is
> too high a granularity to make it really useful.
>
> It looks like creating the gen pool with a smaller minimum and
> gen_pool_first_fit_order_align algo would satisfy my use cases, but I'm
> not sure if there's another reason you've set it to page alignment.
I don't see any reason why we couldn't change this at some point if it
makes sense to. PAGE_SIZE just seemed like a suitable choice but I don't
think anything depends on it. In the end, I guess it depends on how
limited typical CMB buffers end up being.
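For instance, the change might look roughly like this (untested sketch):

    pool = gen_pool_create(ilog2(64), nid);   /* e.g. 64B granularity */
    gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);

gen_pool_first_fit_order_align rounds each allocation's start to the order
of its size, so an SQ allocation would still end up suitably aligned.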
Logan
On Thu, Jan 04, 2018 at 09:50:31PM -0700, Jason Gunthorpe wrote:
> Well that argument applies equally to the RDMA RW API wrappers around
> the DMA API. I think it is fine if sgl are defined to only have P2P or
> not, and that debugging support seemed reasonable to me..
>
> > It's also very difficult to add similar functionality to dma_map_page seeing
> > dma_unmap_page won't have any way to know what it's dealing with. It just
> > seems confusing to support P2P in the SG version and not the page version.
>
> Well, this proposal is to support P2P in only some RDMA APIs and not
> others, so it seems about as confusing to me..
As usual we implement what actually has a consumer. On top of that the
R/W API is the only core RDMA API that actually does DMA mapping for the
ULP at the moment. For SENDs and everything else dma maps are done by
the ULP (I'd like to eventually change that, though - e.g. sends that
are inlined into the work queue don't need a dma map to start with).
> Well, it is more than very nice. We have to keep RDMA working after
> all, and if you make it even more special things become harder for us.
>
> It is already the case that DMA in RDMA is very strange. We have
> drivers that provide their own DMA ops, for instance.
That's because the initial design was to let the ULPs do the DMA
mappings, which fundamentally is wrong. I've fixed it for the R/W
API when adding it, but no one has started work on SENDs and atomics.
> And on that topic, does this scheme work with HFI?
No, and I guess we need an opt-out. HFI generally seems to be
extremely weird.
> This is why P2P must fit in to the common DMA framework somehow, we
> rely on these abstractions to work properly and fully in RDMA.
Moving P2P up to common RDMA code isn't going to fix this. For that
we need to stop pretending that something that isn't DMA can abuse the
dma mapping framework, and until then opt them out of behavior that
assumes actual DMA like P2P.
> I think you should consider pushing this directly into the dma_ops
> implementations. Add a p2p_supported flag to struct dma_map_ops, and
> only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
> Only map_sg would be supported for P2P. Upgraded implementations can
> call the helper function.
If at all it should be in the dma_map* wrappers, but for that we'd need
a good identifier. And it still would not solve the whole fake dma
ops issue.
So for now I'd much prefer to let the drivers handle it, and once usage
grows and we know usage patterns better (and I make progress refactoring
both the dma-mapping subsystem itself and the RDMA dma map code to not
be a complete mess) we can move it to the core.
On Mon, Jan 08, 2018 at 03:59:01PM +0100, Christoph Hellwig wrote:
> On Thu, Jan 04, 2018 at 09:50:31PM -0700, Jason Gunthorpe wrote:
> > Well that argument applies equally to the RDMA RW API wrappers around
> > the DMA API. I think it is fine if sgl are defined to only have P2P or
> > not, and that debugging support seemed reasonable to me..
> >
> > > It's also very difficult to add similar functionality to dma_map_page seeing
> > > dma_unmap_page won't have any way to know what it's dealing with. It just
> > > seems confusing to support P2P in the SG version and not the page version.
> >
> > Well, this proposal is to support P2P in only some RDMA APIs and not
> > others, so it seems about as confusing to me..
>
> As usual we implement what actually has a consumer. On top of that the
> R/W API is the only core RDMA API that actually does DMA mapping for the
> ULP at the moment.
Well again the same can be said for dma_map_page vs dma_map_sg...
> For SENDs and everything else dma maps are done by the ULP (I'd like
> to eventually change that, though - e.g. sends through that are
> inline to the workqueue don't need a dma map to start with).
> That's because the initial design was to let the ULPs do the DMA
> mappings, which fundamentally is wrong. I've fixed it for the R/W
> API when adding it, but no one has started work on SENDs and atomics.
Well, you know why it is like this, and it is very complicated to
unwind - the HW driver does not have enough information during CQ
processing to properly do any unmaps, let alone serious error tear
down unmaps, so we'd need a bunch of new APIs developed first, like RW
did. :\
> > And on that topic, does this scheme work with HFI?
>
> No, and I guess we need an opt-out. HFI generally seems to be
> extremely weird.
This series needs some kind of fix so HFI, QIB, rxe, etc don't get
broken, and it shouldn't be 'fixed' at the RDMA level.
> > This is why P2P must fit in to the common DMA framework somehow, we
> > rely on these abstractions to work properly and fully in RDMA.
>
> Moving P2P up to common RDMA code isn't going to fix this. For that
> we need to stop preting that something that isn't DMA can abuse the
> dma mapping framework, and until then opt them out of behavior that
> assumes actual DMA like P2P.
It could, if we had a DMA op for p2p then the drivers that provide
their own ops can implement it appropriately or not at all.
E.g. the correct implementation for rxe to support p2p memory is
probably somewhat straightforward.
> > I think you should consider pushing this directly into the dma_ops
> > implementations. Add a p2p_supported flag to struct dma_map_ops, and
> > only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
> > Only map_sg would be supported for P2P. Upgraded implementations can
> > call the helper function.
>
> If at all it should be in the dma_map* wrappers, but for that we'd need
> a good identifier. And it still would not solve the whole fake dma
> ops issue.
Very long term the IOMMUs under the ops will need to care about this,
so the wrapper is not an optimal place to put it - but I wouldn't
object if it gets it out of RDMA :)
Jason
On 08/01/18 11:09 AM, Jason Gunthorpe wrote:
> It could, if we had a DMA op for p2p then the drivers that provide
> their own ops can implement it appropriately or not at all.
I was thinking of doing something like this. I'll probably rough out a
patch and send it along today or tomorrow.
>> If at all it should be in the dma_map* wrappers, but for that we'd need
>> a good identifier. And it still would not solve the whole fake dma
>> ops issue.
>
> Very long term the IOMMUs under the ops will need to care about this,
> so the wrapper is not an optimal place to put it - but I wouldn't
> object if it gets it out of RDMA :)
Well, creating the extra op doesn't really change anything in the RDMA
patch in this series. The point is, for the time being, we need to track
whether we are doing a P2P or normal mapping using a side channel as we
don't have a good way of tracking it in the SGL at this time.
Adding an extra DMA op is just one way to allow the existing DMA
providers to opt-in/out.
Logan
On Mon, Jan 08, 2018 at 11:17:38AM -0700, Logan Gunthorpe wrote:
> >>If at all it should be in the dma_map* wrappers, but for that we'd need
> >>a good identifier. And it still would not solve the whole fake dma
> >>ops issue.
> >
> >Very long term the IOMMUs under the ops will need to care about this,
> >so the wrapper is not an optimal place to put it - but I wouldn't
> >object if it gets it out of RDMA :)
>
> Well, creating the extra op doesn't really change anything to the RDMA patch
> in this series.
Not fundamentally, but it lets us solve the bugs the patch introduces
with hfi/etc
> The point is, for the time being, we need to track whether we are
> doing a P2P or normal mapping using a side channel as we don't have
> a good way of tracking it in the SGL at this time.
Well, that is disappointing for now, but I'm OK with the flag in the
RW interface - as long as we all sort of agree it is not desirable and
the SG should self-identify in an ideal someday future world..
Jason
On Mon, Jan 08, 2018 at 11:09:17AM -0700, Jason Gunthorpe wrote:
> > As usual we implement what actually has a consumer. On top of that the
> > R/W API is the only core RDMA API that actually does DMA mapping for the
> > ULP at the moment.
>
> Well again the same can be said for dma_map_page vs dma_map_sg...
I don't understand this comment.
>
> > For SENDs and everything else dma maps are done by the ULP (I'd like
> > to eventually change that, though - e.g. sends through that are
> > inline to the workqueue don't need a dma map to start with).
>
>
> > That's because the initial design was to let the ULPs do the DMA
> > mappings, which fundamentally is wrong. I've fixed it for the R/W
> > API when adding it, but no one has started work on SENDs and atomics.
>
> Well, you know why it is like this, and it is very complicated to
> unwind - the HW driver does not have enough information during CQ
> processing to properly do any unmaps, let alone serious error tear
> down unmaps, so we'd need a bunch of new APIs developed first, like RW
> did. :\
Yes, if it was trivial we would have done it already.
> > > And on that topic, does this scheme work with HFI?
> >
> > No, and I guess we need an opt-out. HFI generally seems to be
> > extremely weird.
>
> This series needs some kind of fix so HFI, QIB, rxe, etc don't get
> broken, and it shouldn't be 'fixed' at the RDMA level.
I don't think rxe is a problem as it won't show up as a PCI device.
HFI and QIB do show as PCI devices, and could be used for P2P transfers
from the PCI point of view. It's just that they have a layer of
software indirection between their hardware and what is exposed at
the RDMA layer.
So I very much disagree about where to place that workaround - the
RDMA code is exactly the right place.
> > > This is why P2P must fit in to the common DMA framework somehow, we
> > > rely on these abstractions to work properly and fully in RDMA.
> >
> > Moving P2P up to common RDMA code isn't going to fix this. For that
> > we need to stop preting that something that isn't DMA can abuse the
> > dma mapping framework, and until then opt them out of behavior that
> > assumes actual DMA like P2P.
>
> It could, if we had a DMA op for p2p then the drivers that provide
> their own ops can implement it appropriately or not at all.
>
> Eg the correct implementation for rxe to support p2p memory is
> probably somewhat straightfoward.
But P2P is _not_ a factor of the dma_ops implementation at all,
it is something that happens behind the dma_map implementation.
Think about what the dma mapping routines do:
(a) translate from host address to bus addresses
and
(b) flush caches (in non-coherent architectures)
Both are obviously not needed for P2P transfers, as they never reach
the host.
> Very long term the IOMMUs under the ops will need to care about this,
> so the wrapper is not an optimal place to put it - but I wouldn't
> object if it gets it out of RDMA :)
Unless you have an IOMMU on your PCIe switch and not before/inside
the root complex that is not correct.
On 08/01/18 11:34 AM, Christoph Hellwig wrote:
> But P2P is _not_ a factor of the dma_ops implementation at all,
> it is something that happens behind the dma_map implementation.
>
> Think about what the dma mapping routines do:
>
> (a) translate from host address to bus addresses
>
> and
>
> (b) flush caches (in non-coherent architectures)
>
> Both are obviously not needed for P2P transfers, as they never reach
> the host.
Isn't pci_p2pdma_map_sg doing (a)? It's just translating from a host
address to a PCI bus address.
>> Very long term the IOMMUs under the ops will need to care about this,
>> so the wrapper is not an optimal place to put it - but I wouldn't
>> object if it gets it out of RDMA :)
>
> Unless you have an IOMMU on your PCIe switch and not before/inside
> the root complex that is not correct.
Per the ACS discussion, in the future we might want to implement "ACS
Direct Translated P2P" as Alex described. I expect it would be the IOMMU
that needs to set that up. So, I think, we could also have the dma_map
implementations doing something like:
(c) setup/manage any security permissions on mappings
Which P2P may at some point be concerned with.
Logan
On Mon, Jan 08, 2018 at 11:44:41AM -0700, Logan Gunthorpe wrote:
>> Think about what the dma mapping routines do:
>>
>> (a) translate from host address to bus addresses
>>
>> and
>>
>> (b) flush caches (in non-coherent architectures)
>>
>> Both are obviously not needed for P2P transfers, as they never reach
>> the host.
>
> Isn't pci_p2pdma_map_sg doing (a)? It's just translating from a host
> address to a PCI bus address.
It does, sort of - but in a different way than the normal DMA map
ops. And only to work around the fact that we need to map our
P2P space into struct pages. Without that we could just pass the
bus address around, but the Linux stack and VM isn't anywhere near
ready for something that advanced.
>>> Very long term the IOMMUs under the ops will need to care about this,
>>> so the wrapper is not an optimal place to put it - but I wouldn't
>>> object if it gets it out of RDMA :)
>>
>> Unless you have an IOMMU on your PCIe switch and not before/inside
>> the root complex that is not correct.
>
> Per the ACS discussion, in the future we might want to implement "ACS
> Direct Translated P2P" as Alex described. I expect it would be the IOMMU
> that needs to set that up. So, I think, we also have the dma_map
> implementations also doing something like:
>
> (c) setup/manage any security permissions on mappings
> Which P2P may at some point be concerned with.
Maybe once root complexes with iommus actually support P2P. But until
then we have a lot of more important problems to solve.
On Mon, Jan 08, 2018 at 07:34:34PM +0100, Christoph Hellwig wrote:
> > > > And on that topic, does this scheme work with HFI?
> > >
> > > No, and I guess we need an opt-out. HFI generally seems to be
> > > extremely weird.
> >
> > This series needs some kind of fix so HFI, QIB, rxe, etc don't get
> > broken, and it shouldn't be 'fixed' at the RDMA level.
>
> I don't think rxe is a problem as it won't show up a pci device.
Right today's restrictions save us..
> HFI and QIB do show as PCI devices, and could be used for P2P transfers
> from the PCI point of view. It's just that they have a layer of
> software indirection between their hardware and what is exposed at
> the RDMA layer.
>
> So I very much disagree about where to place that workaround - the
> RDMA code is exactly the right place.
But why? RDMA is using core code to do this. It uses dma_ops in struct
device and it uses normal dma_map SG. How is it RDMA's problem that
some PCI drivers provide strange DMA ops?
Admittedly they are RDMA drivers, but it is a core mechanism they
(ab)use these days..
> > It could, if we had a DMA op for p2p then the drivers that provide
> > their own ops can implement it appropriately or not at all.
> >
> > Eg the correct implementation for rxe to support p2p memory is
> > probably somewhat straightfoward.
>
> But P2P is _not_ a factor of the dma_ops implementation at all,
> it is something that happens behind the dma_map implementation.
Only as long as the !ACS and switch limitations are present.
Those limitations are fine to get things started, but there is going
to be a push to improve the system to remove them.
> > Very long term the IOMMUs under the ops will need to care about this,
> > so the wrapper is not an optimal place to put it - but I wouldn't
> > object if it gets it out of RDMA :)
>
> Unless you have an IOMMU on your PCIe switch and not before/inside
> the root complex that is not correct.
I understand the proposed patches restrict things to require a switch
and not transit the IOMMU.
But *very long term* P2P will need to work with paths that transit the
system IOMMU and root complex.
This already exists as out-of-tree functionality that has been deployed
in production for years and years, doing P2P through the root
complex with the IOMMU turned off.
Jason
On 08/01/18 11:57 AM, Christoph Hellwig wrote:
> It does, sort of - but in a different way then the normal DMA map
> ops. And only to work around the fact that we need to map our
> P2P space into struct pages. Without that we could just pass the
> bus address around, but the Linux stack and VM isn't anywhere near
> ready for something that advanced.
Yeah, wouldn't that be nice.
Ok, so if we shouldn't touch the dma_map infrastructure, how should the
workaround to opt out HFI and QIB look? I'm not that familiar with the
RDMA code.
Thanks,
Logan
On Mon, Jan 8, 2018 at 11:57 AM, Christoph Hellwig <[email protected]> wrote:
>> (c) setup/manage any security permissions on mappings
>> Which P2P may at some point be concerned with.
>
> Maybe once root complexes with iommus actually support P2P. But until
> then we have a lot more more important problems to solve.
Pretty sure P2P capable IOMMU hardware exists.
With SoCs we also have the scenario where a DMA originated from an
on-die device wishes to target an off-die PCI BAR (through the IOMMU);
that definitely exists today, and people care about it :)
Jason
On Mon, Jan 08, 2018 at 12:49:50PM -0700, Jason Gunthorpe wrote:
> Pretty sure P2P capable IOMMU hardware exists.
>
> With SOC's we also have the scenario that an DMA originated from an
> on-die device wishes to target an off-die PCI BAR (through the IOMMU),
> that definitely exists today, and people care about it :)
Then people will have to help and contribute support for it.
On Mon, Jan 08, 2018 at 12:05:57PM -0700, Logan Gunthorpe wrote:
> Ok, so if we shouldn't touch the dma_map infrastructure how should the
> workaround to opt-out HFI and QIB look? I'm not that familiar with the RDMA
> code.
We can add a no_p2p quirk, I'm just not sure what the right place
for it is. As said, these devices don't really mind P2P at the PCIe
level; it's just that their RDMA implementation is really strange.
So I'd be tempted to do it in RDMA.
On Mon, Jan 08, 2018 at 12:01:16PM -0700, Jason Gunthorpe wrote:
> > So I very much disagree about where to place that workaround - the
> > RDMA code is exactly the right place.
>
> But why? RDMA is using core code to do this. It uses dma_ops in struct
> device and it uses normal dma_map SG. How is it RDMA's problem that
> some PCI drivers provide strange DMA ops?
Because RDMA uses the dma_virt_ops to pretend a device does DMA when
in fact it doesn't - at least not for the exact data mapped, or
as far as I can tell often not at all - e.g. the QIB/HFI devices
might do MMIO access for the data mapped.
This whole problem only exist because a few RDMA HCA drivers lie
with the help of the RDMA core.
On Tue, Jan 09, 2018 at 05:46:40PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 08, 2018 at 12:49:50PM -0700, Jason Gunthorpe wrote:
> > Pretty sure P2P capable IOMMU hardware exists.
> >
> > With SOC's we also have the scenario that an DMA originated from an
> > on-die device wishes to target an off-die PCI BAR (through the IOMMU),
> > that definitely exists today, and people care about it :)
>
> Then people will have to help and contribute support for it.
Sure. But my point was all this will have to migrate under the dma_ops
for that to work, so why the resistance to using dma_ops right away?
Jason