Hi Everyone,
Here's v2 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc3, which already
includes Christoph's devpagemap work that the previous version was based
on, as well as a couple of the cleanup patches that were in v1.
Additionally, we've made the following changes based on feedback:
* Renamed everything to 'p2pdma' per the suggestion from Bjorn, and
applied a bunch of cleanup and spelling fixes he pointed out in the
last series.
* To address Alex's ACS concerns, we changed to a simpler method of
just disabling ACS behind switches for any kernel that has
CONFIG_PCI_P2PDMA enabled.
* We also reject using devices that employ 'dma_virt_ops', which should
fairly simply address Jason's concern that this work might break the
HFI, QIB and rxe drivers that use the virtual ops to implement their
own special DMA operations.
Thanks,
Logan
--
This is a continuation of our work to enable using Peer-to-Peer PCI
memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
provided valuable feedback to get these patches to where they are today.
The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVME target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device, avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU and using system memory, as
well as lower PCI bandwidth required to the CPU (such that systems could
be designed with fewer lanes connected to the CPU). However, the
trade-off is presently a reduction in overall throughput, largely due
to hardware issues that should improve in the future.
Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This will mean many setups that could likely
work well will not be supported so that we can be more confident it
will work and not place any responsibility on the user to understand
their topology. (We chose to go this route based on feedback we
received at the last LSF). Future work may enable these transfers behind
a fabric of PCI switches or perhaps using a whitelist of known-good
root complexes.
In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
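For illustration, a provider driver would register (part of) a BAR
roughly as follows (hypothetical driver, error handling trimmed; the
functions come from the first patches in this series):

  static int example_register_p2pmem(struct pci_dev *pdev)
  {
          int rc;

          /* create ZONE_DEVICE pages for all of BAR 4; the bus offset
           * is recorded in the pagemap at registration time */
          rc = pci_p2pdma_add_resource(pdev, 4, 0, 0);
          if (rc)
                  return rc;

          /* make the memory discoverable via pci_p2pmem_find() */
          pci_p2pmem_publish(pdev, true);
          return 0;
  }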
Another set of functions allows a client driver to create a list of
client devices that will be used in a given P2P transaction and then
use that list to find any P2P memory that is supported by all the
client devices. This list is then also used to selectively disable the
ACS bits for the downstream ports behind these devices.
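As a rough sketch of the intended calling convention (the 'nic' and
'nvme' device pointers here are made up; the real user is the nvmet code
in the last patch of the series):

  LIST_HEAD(clients);
  struct pci_dev *p2p_dev;

  /* every device that will touch the buffers is added as a client */
  if (pci_p2pdma_add_client(&clients, &nic->dev) ||
      pci_p2pdma_add_client(&clients, &nvme->dev))
          goto fallback;

  /* find p2pmem behind the same switch as all clients (takes a ref) */
  p2p_dev = pci_p2pmem_find(&clients);
  if (!p2p_dev)
          goto fallback;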
In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked not to be merged, since a non-homogeneous request would
complicate the DMA mapping requirements.
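A rough sketch of what a submitter does with these flags (this mirrors
the nvmet changes later in the series; the helper names come from the
block and memremap patches):

  /* only target P2P memory if the queue has advertised support */
  if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue))
          return -EINVAL;

  /* mark requests whose pages come from a P2P BAR */
  if (is_pci_p2pdma_page(sg_page(req->sg)))
          op_flags |= REQ_PCI_P2PDMA;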
In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received, it uses the
pci_p2pdma_map_sg() function, which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.
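Concretely, the mapping call in nvme_map_data() ends up looking like
this (taken from the nvme-pci patch later in this series):

  if (REQ_IS_PCI_P2PDMA(req))
          nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
                                        dma_dir);
  else
          nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
                                       dma_dir, DMA_ATTR_NO_WARN);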
In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not.
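For example, nvmet-rdma passes the flag only when its scatterlist was
allocated from P2P memory (condensed from the last patch); all other
existing callers simply pass 0:

  ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
                         rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
                         nvmet_data_dir(&rsp->req),
                         rsp->p2p_dev ? RDMA_RW_CTX_FLAG_PCI_P2PDMA : 0);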
Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.
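The core of that logic (condensed from nvmet_setup_p2pmem() in the last
patch, with error handling trimmed) is roughly:

  if (!req->port->allow_p2pmem || !req->p2p_client)
          return;

  /* the RDMA NIC and every namespace must share an upstream bridge */
  pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
  list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link)
          nvmet_p2pdma_add_client(ctrl, ns);

  ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);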
Logan Gunthorpe (10):
PCI/P2PDMA: Support peer to peer memory
PCI/P2PDMA: Add sysfs group to display p2pmem stats
PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
block: Introduce PCI P2P flags for request and request queue
IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
nvme-pci: Use PCI p2pmem subsystem to manage the CMB
nvme-pci: Add support for P2P memory in requests
nvme-pci: Add a quirk for a pseudo CMB
nvmet: Optionally use PCI P2P memory
Documentation/ABI/testing/sysfs-bus-pci | 25 ++
block/blk-core.c | 3 +
drivers/infiniband/core/rw.c | 21 +-
drivers/infiniband/ulp/isert/ib_isert.c | 5 +-
drivers/infiniband/ulp/srpt/ib_srpt.c | 7 +-
drivers/nvme/host/core.c | 4 +
drivers/nvme/host/nvme.h | 8 +
drivers/nvme/host/pci.c | 118 ++++--
drivers/nvme/target/configfs.c | 29 ++
drivers/nvme/target/core.c | 95 ++++-
drivers/nvme/target/io-cmd.c | 3 +
drivers/nvme/target/nvmet.h | 10 +
drivers/nvme/target/rdma.c | 43 +-
drivers/pci/Kconfig | 20 +
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 713 ++++++++++++++++++++++++++++++++
drivers/pci/pci.c | 4 +
include/linux/blk_types.h | 18 +-
include/linux/blkdev.h | 3 +
include/linux/memremap.h | 19 +
include/linux/pci-p2pdma.h | 105 +++++
include/linux/pci.h | 4 +
include/rdma/rw.h | 7 +-
net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +-
24 files changed, 1204 insertions(+), 67 deletions(-)
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h
--
2.11.0
Introduce a quirk to use CMB-like memory on older devices that have
an exposed BAR but do not advertise support for using CMBLOC and
CMBSZ.
We'd like to use some of these older cards to test P2P memory.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/nvme.h | 7 +++++++
drivers/nvme/host/pci.c | 24 ++++++++++++++++++++----
2 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e38beb1917e2..c84a1b9580c8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -83,6 +83,13 @@ enum nvme_quirks {
* Supports the LighNVM command set if indicated in vs[1].
*/
NVME_QUIRK_LIGHTNVM = (1 << 6),
+
+ /*
+ * Pseudo CMB Support on BAR 4. For adapters like the Microsemi
+ * NVRAM that have CMB-like memory on a BAR but do not set
+ * CMBLOC or CMBSZ.
+ */
+ NVME_QUIRK_PSEUDO_CMB_BAR4 = (1 << 7),
};
/*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b896a78225cb..2ad20fb9f05d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1679,6 +1679,13 @@ static ssize_t nvme_cmb_show(struct device *dev,
}
static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);
+static u32 nvme_pseudo_cmbsz(struct pci_dev *pdev, int bar)
+{
+ return NVME_CMBSZ_WDS | NVME_CMBSZ_RDS |
+ (((ilog2(SZ_16M) - 12) / 4) << NVME_CMBSZ_SZU_SHIFT) |
+ ((pci_resource_len(pdev, bar) / SZ_16M) << NVME_CMBSZ_SZ_SHIFT);
+}
+
static u64 nvme_cmb_size_unit(struct nvme_dev *dev)
{
u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK;
@@ -1698,10 +1705,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
struct pci_dev *pdev = to_pci_dev(dev->dev);
int bar;
- dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
- if (!dev->cmbsz)
- return;
- dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+ if (dev->ctrl.quirks & NVME_QUIRK_PSEUDO_CMB_BAR4) {
+ dev->cmbsz = nvme_pseudo_cmbsz(pdev, 4);
+ dev->cmbloc = 4;
+ } else {
+ dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
+ if (!dev->cmbsz)
+ return;
+ dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+ }
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
@@ -2715,6 +2727,10 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2807), /* CNEX WL */
.driver_data = NVME_QUIRK_LIGHTNVM, },
+ { PCI_DEVICE(0x11f8, 0xf117), /* Microsemi NVRAM adaptor */
+ .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
+ { PCI_DEVICE(0x1db1, 0x0002), /* Everspin nvNitro adaptor */
+ .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
--
2.11.0
For P2P requests we must use the pci_p2pdma_[un]map_sg() functions
instead of the dma_map_sg() functions.
With that, we can then indicate PCI P2PDMA support in the request queue.
For this, we create an NVME_F_PCI_P2PDMA flag which tells the core to
set QUEUE_FLAG_PCI_P2PDMA in the request queue.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/core.c | 4 ++++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/pci.c | 19 +++++++++++++++----
3 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0fe7ea35c221..4dd564f175fa 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2954,7 +2954,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
ns->queue = blk_mq_init_queue(ctrl->tagset);
if (IS_ERR(ns->queue))
goto out_free_ns;
+
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
+ if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+ queue_flag_set_unlocked(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
+
ns->queue->queuedata = ns;
ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0521e4707d1c..e38beb1917e2 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -290,6 +290,7 @@ struct nvme_ctrl_ops {
unsigned int flags;
#define NVME_F_FABRICS (1 << 0)
#define NVME_F_METADATA_SUPPORTED (1 << 1)
+#define NVME_F_PCI_P2PDMA (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 56ca79be8476..b896a78225cb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -796,8 +796,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
goto out;
ret = BLK_STS_RESOURCE;
- nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
- DMA_ATTR_NO_WARN);
+
+ if (REQ_IS_PCI_P2PDMA(req))
+ nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
+ dma_dir);
+ else
+ nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+ dma_dir, DMA_ATTR_NO_WARN);
if (!nr_mapped)
goto out;
@@ -842,7 +847,12 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
DMA_TO_DEVICE : DMA_FROM_DEVICE;
if (iod->nents) {
- dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+ if (REQ_IS_PCI_P2PDMA(req))
+ pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
+ dma_dir);
+ else
+ dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+
if (blk_integrity_rq(req)) {
if (req_op(req) == REQ_OP_READ)
nvme_dif_remap(req, nvme_dif_complete);
@@ -2422,7 +2432,8 @@ static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name = "pcie",
.module = THIS_MODULE,
- .flags = NVME_F_METADATA_SUPPORTED,
+ .flags = NVME_F_METADATA_SUPPORTED |
+ NVME_F_PCI_P2PDMA,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32 = nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
--
2.11.0
In order to use PCI P2P memory, the pci_p2pdma_[un]map_sg() functions
must be called to map the correct DMA addresses. To do this, we add a
flags argument and the RDMA_RW_CTX_FLAG_PCI_P2PDMA flag. When the flag
is specified, the appropriate map function is used.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/infiniband/core/rw.c | 21 +++++++++++++++++----
drivers/infiniband/ulp/isert/ib_isert.c | 5 +++--
drivers/infiniband/ulp/srpt/ib_srpt.c | 7 ++++---
drivers/nvme/target/rdma.c | 6 +++---
include/rdma/rw.h | 7 +++++--
net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +++---
6 files changed, 35 insertions(+), 17 deletions(-)
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..775a9f8b15a6 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
*/
#include <linux/moduleparam.h>
#include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
#include <rdma/mr_pool.h>
#include <rdma/rw.h>
@@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
* @remote_addr:remote address to read/write (relative to @rkey)
* @rkey: remote key to operate on
* @dir: %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags: any of the RDMA_RW_CTX_FLAG_* flags
*
* Returns the number of WQEs that will be needed on the workqueue if
* successful, or a negative error code.
*/
int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
- u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+ u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+ unsigned int flags)
{
struct ib_device *dev = qp->pd->device;
int ret;
- ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+ if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
+ ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
+ else
+ ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
if (!ret)
return -ENOMEM;
sg_cnt = ret;
@@ -579,9 +586,11 @@ EXPORT_SYMBOL(rdma_rw_ctx_post);
* @sg: scatterlist that was used for the READ/WRITE
* @sg_cnt: number of entries in @sg
* @dir: %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags: the same flags used to init the context
*/
void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
- struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir)
+ struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir,
+ unsigned int flags)
{
int i;
@@ -602,7 +611,11 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
break;
}
- ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+ if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
+ pci_p2pdma_unmap_sg(qp->pd->device->dma_device, sg,
+ sg_cnt, dir);
+ else
+ ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy);
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index fff40b097947..933200645262 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1497,7 +1497,8 @@ isert_rdma_rw_ctx_destroy(struct isert_cmd *cmd, struct isert_conn *conn)
se_cmd->t_prot_nents, dir);
} else {
rdma_rw_ctx_destroy(&cmd->rw, conn->qp, conn->cm_id->port_num,
- se_cmd->t_data_sg, se_cmd->t_data_nents, dir);
+ se_cmd->t_data_sg, se_cmd->t_data_nents,
+ dir, 0);
}
cmd->rw.nr_ops = 0;
@@ -2152,7 +2153,7 @@ isert_rdma_rw_ctx_post(struct isert_cmd *cmd, struct isert_conn *conn,
} else {
ret = rdma_rw_ctx_init(&cmd->rw, conn->qp, port_num,
se_cmd->t_data_sg, se_cmd->t_data_nents,
- offset, addr, rkey, dir);
+ offset, addr, rkey, dir, 0);
}
if (ret < 0) {
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 0373b7c40902..5dcbe0a18db8 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -899,7 +899,8 @@ static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
goto unwind;
ret = rdma_rw_ctx_init(&ctx->rw, ch->qp, ch->sport->port,
- ctx->sg, ctx->nents, 0, remote_addr, rkey, dir);
+ ctx->sg, ctx->nents, 0, remote_addr, rkey,
+ dir, 0);
if (ret < 0) {
target_free_sgl(ctx->sg, ctx->nents);
goto unwind;
@@ -928,7 +929,7 @@ static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
- ctx->sg, ctx->nents, dir);
+ ctx->sg, ctx->nents, dir, 0);
target_free_sgl(ctx->sg, ctx->nents);
}
if (ioctx->rw_ctxs != &ioctx->s_rw_ctx)
@@ -946,7 +947,7 @@ static void srpt_free_rw_ctxs(struct srpt_rdma_ch *ch,
struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
- ctx->sg, ctx->nents, dir);
+ ctx->sg, ctx->nents, dir, 0);
target_free_sgl(ctx->sg, ctx->nents);
}
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 978e169c11bf..020354e11351 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -427,7 +427,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
if (rsp->n_rdma) {
rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
queue->cm_id->port_num, rsp->req.sg,
- rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+ rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
}
if (rsp->req.sg != &rsp->cmd->inline_sg)
@@ -510,7 +510,7 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
queue->cm_id->port_num, rsp->req.sg,
- rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+ rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
rsp->n_rdma = 0;
if (unlikely(wc->status != IB_WC_SUCCESS)) {
@@ -579,7 +579,7 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
- nvmet_data_dir(&rsp->req));
+ nvmet_data_dir(&rsp->req), 0);
if (ret < 0)
return NVME_SC_INTERNAL;
rsp->req.transfer_len += len;
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index a3cbbc7b6417..dbc342b53901 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -59,12 +59,15 @@ struct rdma_rw_ctx {
};
};
+#define RDMA_RW_CTX_FLAG_PCI_P2PDMA (1 << 0)
+
int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
- u64 remote_addr, u32 rkey, enum dma_data_direction dir);
+ u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+ unsigned int flags);
void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct scatterlist *sg, u32 sg_cnt,
- enum dma_data_direction dir);
+ enum dma_data_direction dir, unsigned int flags);
int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
u8 port_num, struct scatterlist *sg, u32 sg_cnt,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 12b9a7e0b6d2..4364a8e98470 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -140,7 +140,7 @@ static void svc_rdma_cc_release(struct svc_rdma_chunk_ctxt *cc,
rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
rdma->sc_port_num, ctxt->rw_sg_table.sgl,
- ctxt->rw_nents, dir);
+ ctxt->rw_nents, dir, 0);
svc_rdma_put_rw_ctxt(rdma, ctxt);
}
svc_xprt_put(&rdma->sc_xprt);
@@ -433,7 +433,7 @@ svc_rdma_build_writes(struct svc_rdma_write_info *info,
ret = rdma_rw_ctx_init(&ctxt->rw_ctx, rdma->sc_qp,
rdma->sc_port_num, ctxt->rw_sg_table.sgl,
ctxt->rw_nents, 0, seg_offset,
- seg_handle, DMA_TO_DEVICE);
+ seg_handle, DMA_TO_DEVICE, 0);
if (ret < 0)
goto out_initerr;
@@ -639,7 +639,7 @@ static int svc_rdma_build_read_segment(struct svc_rdma_read_info *info,
ret = rdma_rw_ctx_init(&ctxt->rw_ctx, cc->cc_rdma->sc_qp,
cc->cc_rdma->sc_port_num,
ctxt->rw_sg_table.sgl, ctxt->rw_nents,
- 0, offset, rkey, DMA_FROM_DEVICE);
+ 0, offset, rkey, DMA_FROM_DEVICE, 0);
if (ret < 0)
goto out_initerr;
--
2.11.0
Attributes display the total amount of P2P memory, the amount available
and whether it is published or not.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
Documentation/ABI/testing/sysfs-bus-pci | 25 +++++++++++++++++
drivers/pci/p2pdma.c | 50 +++++++++++++++++++++++++++++++++
2 files changed, 75 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 44d4b2be92fd..f5656dae21be 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -323,3 +323,28 @@ Description:
This is similar to /sys/bus/pci/drivers_autoprobe, but
affects only the VFs associated with a specific PF.
+
+What: /sys/bus/pci/devices/.../p2pmem/available
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains the amount of memory that has not been
+ allocated (in decimal).
+
+What: /sys/bus/pci/devices/.../p2pmem/size
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains the total amount of memory that the device
+ provides (in decimal).
+
+What: /sys/bus/pci/devices/.../p2pmem/published
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains a '1' if the memory has been published for
+ use inside the kernel or a '0' if it is only intended
+ for use within the driver that published it.
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ec0a6cb9e500..a57df78f6a32 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -29,6 +29,53 @@ struct pci_p2pdma {
bool published;
};
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ size_t size = 0;
+
+ if (pdev->p2pdma->pool)
+ size = gen_pool_size(pdev->p2pdma->pool);
+
+ return snprintf(buf, PAGE_SIZE, "%zd\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t available_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ size_t avail = 0;
+
+ if (pdev->p2pdma->pool)
+ avail = gen_pool_avail(pdev->p2pdma->pool);
+
+ return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
+}
+static DEVICE_ATTR_RO(available);
+
+static ssize_t published_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ return snprintf(buf, PAGE_SIZE, "%d\n", pdev->p2pdma->published);
+}
+static DEVICE_ATTR_RO(published);
+
+static struct attribute *p2pmem_attrs[] = {
+ &dev_attr_size.attr,
+ &dev_attr_available.attr,
+ &dev_attr_published.attr,
+ NULL,
+};
+
+static const struct attribute_group p2pmem_group = {
+ .attrs = p2pmem_attrs,
+ .name = "p2pmem",
+};
+
static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
{
struct pci_p2pdma *p2p =
@@ -55,6 +102,7 @@ static void pci_p2pdma_release(void *data)
percpu_ref_exit(&pdev->p2pdma->devmap_ref);
gen_pool_destroy(pdev->p2pdma->pool);
+ sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
pdev->p2pdma = NULL;
}
@@ -83,6 +131,8 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
if (error)
goto out_pool_destroy;
+ error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+
pdev->p2pdma = p2p;
return 0;
--
2.11.0
Register the CMB buffer as p2pmem and use the appropriate allocation
functions to create and destroy the IO SQ.
If the CMB supports WDS and RDS, publish it for use as p2p memory
by other devices.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++----------------------
1 file changed, 41 insertions(+), 34 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 73036d2fbbd5..56ca79be8476 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/sed-opal.h>
+#include <linux/pci-p2pdma.h>
#include "nvme.h"
@@ -91,9 +92,8 @@ struct nvme_dev {
struct work_struct remove_work;
struct mutex shutdown_lock;
bool subsystem;
- void __iomem *cmb;
- pci_bus_addr_t cmb_bus_addr;
u64 cmb_size;
+ bool cmb_use_sqes;
u32 cmbsz;
u32 cmbloc;
struct nvme_ctrl ctrl;
@@ -148,7 +148,7 @@ struct nvme_queue {
struct nvme_dev *dev;
spinlock_t q_lock;
struct nvme_command *sq_cmds;
- struct nvme_command __iomem *sq_cmds_io;
+ bool sq_cmds_is_io;
volatile struct nvme_completion *cqes;
struct blk_mq_tags **tags;
dma_addr_t sq_dma_addr;
@@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
{
u16 tail = nvmeq->sq_tail;
- if (nvmeq->sq_cmds_io)
- memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
- else
- memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
+ memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
if (++tail == nvmeq->q_depth)
tail = 0;
@@ -1286,9 +1283,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
{
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
- if (nvmeq->sq_cmds)
- dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
- nvmeq->sq_cmds, nvmeq->sq_dma_addr);
+
+ if (nvmeq->sq_cmds) {
+ if (nvmeq->sq_cmds_is_io)
+ pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
+ nvmeq->sq_cmds,
+ SQ_SIZE(nvmeq->q_depth));
+ else
+ dma_free_coherent(nvmeq->q_dmadev,
+ SQ_SIZE(nvmeq->q_depth),
+ nvmeq->sq_cmds,
+ nvmeq->sq_dma_addr);
+ }
}
static void nvme_free_queues(struct nvme_dev *dev, int lowest)
@@ -1368,12 +1374,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
int qid, int depth)
{
- /* CMB SQEs will be mapped before creation */
- if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
- return 0;
+ struct pci_dev *pdev = to_pci_dev(dev->dev);
+
+ if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+ nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+ nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
+ nvmeq->sq_cmds);
+ nvmeq->sq_cmds_is_io = true;
+ }
+
+ if (!nvmeq->sq_cmds) {
+ nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+ &nvmeq->sq_dma_addr, GFP_KERNEL);
+ nvmeq->sq_cmds_is_io = false;
+ }
- nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
- &nvmeq->sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
return 0;
@@ -1449,13 +1464,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
struct nvme_dev *dev = nvmeq->dev;
int result;
- if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
- unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
- dev->ctrl.page_size);
- nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
- nvmeq->sq_cmds_io = dev->cmb + offset;
- }
-
nvmeq->cq_vector = qid - 1;
result = adapter_alloc_cq(dev, qid, nvmeq);
if (result < 0)
@@ -1685,9 +1693,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
return;
dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
- if (!use_cmb_sqes)
- return;
-
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
bar = NVME_CMB_BIR(dev->cmbloc);
@@ -1704,11 +1709,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
if (size > bar_size - offset)
size = bar_size - offset;
- dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
- if (!dev->cmb)
+ if (pci_p2pdma_add_resource(pdev, bar, size, offset))
return;
- dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
+
dev->cmb_size = size;
+ dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
+
+ if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
+ (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
+ pci_p2pmem_publish(pdev, true);
if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL))
@@ -1718,12 +1727,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
static inline void nvme_release_cmb(struct nvme_dev *dev)
{
- if (dev->cmb) {
- iounmap(dev->cmb);
- dev->cmb = NULL;
+ if (dev->cmb_size) {
sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL);
- dev->cmbsz = 0;
+ dev->cmb_size = 0;
}
}
@@ -1918,13 +1925,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (nr_io_queues == 0)
return 0;
- if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+ if (dev->cmb_use_sqes) {
result = nvme_cmb_qdepth(dev, nr_io_queues,
sizeof(struct nvme_command));
if (result > 0)
dev->q_depth = result;
else
- nvme_release_cmb(dev);
+ dev->cmb_use_sqes = false;
}
do {
--
2.11.0
We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only use p2p memory
if a p2p memory device can be found which is behind the same switch as
the RDMA port and all the block devices in use. If the user enables it
and no devices are found, the system will silently fall back to using
regular memory.
If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device falling back to system memory should
anything fail.
Ideally, we'd want to use an NVMe CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVMe card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available.
Signed-off-by: Stephen Bates <[email protected]>
Signed-off-by: Steve Wise <[email protected]>
[hch: partial rewrite of the initial code]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/target/configfs.c | 29 +++++++++++++
drivers/nvme/target/core.c | 95 +++++++++++++++++++++++++++++++++++++++++-
drivers/nvme/target/io-cmd.c | 3 ++
drivers/nvme/target/nvmet.h | 10 +++++
drivers/nvme/target/rdma.c | 41 ++++++++++++++----
5 files changed, 169 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e6b2d2af81b6..f1ad9d32344d 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -867,12 +867,41 @@ static void nvmet_port_release(struct config_item *item)
kfree(port);
}
+#ifdef CONFIG_PCI_P2PDMA
+static ssize_t nvmet_allow_p2pmem_show(struct config_item *item, char *page)
+{
+ return sprintf(page, "%d\n", to_nvmet_port(item)->allow_p2pmem);
+}
+
+static ssize_t nvmet_allow_p2pmem_store(struct config_item *item,
+ const char *page, size_t count)
+{
+ struct nvmet_port *port = to_nvmet_port(item);
+ bool allow;
+ int ret;
+
+ ret = strtobool(page, &allow);
+ if (ret)
+ return ret;
+
+ down_write(&nvmet_config_sem);
+ port->allow_p2pmem = allow;
+ up_write(&nvmet_config_sem);
+
+ return count;
+}
+CONFIGFS_ATTR(nvmet_, allow_p2pmem);
+#endif /* CONFIG_PCI_P2PDMA */
+
static struct configfs_attribute *nvmet_port_attrs[] = {
&nvmet_attr_addr_adrfam,
&nvmet_attr_addr_treq,
&nvmet_attr_addr_traddr,
&nvmet_attr_addr_trsvcid,
&nvmet_attr_addr_trtype,
+#ifdef CONFIG_PCI_P2PDMA
+ &nvmet_attr_allow_p2pmem,
+#endif
NULL,
};
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 0bd737117a80..59847aec27db 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/random.h>
#include <linux/rculist.h>
+#include <linux/pci-p2pdma.h>
#include "nvmet.h"
@@ -271,6 +272,25 @@ void nvmet_put_namespace(struct nvmet_ns *ns)
percpu_ref_put(&ns->ref);
}
+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+ struct nvmet_ns *ns)
+{
+ int ret;
+
+ if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+ pr_err("peer-to-peer DMA is not supported by %s\n",
+ ns->device_path);
+ return -EINVAL;
+ }
+
+ ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+ if (ret)
+ pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+ ns->device_path, ret);
+
+ return ret;
+}
+
int nvmet_ns_enable(struct nvmet_ns *ns)
{
struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
if (ret)
goto out_blkdev_put;
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ if (ctrl->p2p_dev) {
+ ret = nvmet_p2pdma_add_client(ctrl, ns);
+ if (ret)
+ goto out_remove_clients;
+ }
+ }
+
if (ns->nsid > subsys->max_nsid)
subsys->max_nsid = ns->nsid;
@@ -328,6 +356,9 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
out_unlock:
mutex_unlock(&subsys->lock);
return ret;
+out_remove_clients:
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
out_blkdev_put:
blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
ns->bdev = NULL;
@@ -363,8 +394,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
percpu_ref_exit(&ns->ref);
mutex_lock(&subsys->lock);
- list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0);
+ }
if (ns->bdev)
blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
@@ -761,6 +794,63 @@ bool nvmet_host_allowed(struct nvmet_req *req, struct nvmet_subsys *subsys,
return __nvmet_host_allowed(subsys, hostnqn);
}
+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
+ * I/O commands. This requires the PCI p2p device to be compatible with the
+ * backing device for every namespace on this controller.
+ */
+static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
+{
+ struct nvmet_ns *ns;
+ int ret;
+
+ if (!req->port->allow_p2pmem || !req->p2p_client)
+ return;
+
+ mutex_lock(&ctrl->subsys->lock);
+
+ ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
+ if (ret) {
+ pr_err("failed adding peer-to-peer DMA client %s: %d\n",
+ dev_name(req->p2p_client), ret);
+ goto free_devices;
+ }
+
+ list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
+ ret = nvmet_p2pdma_add_client(ctrl, ns);
+ if (ret)
+ goto free_devices;
+ }
+
+ ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
+ if (!ctrl->p2p_dev) {
+ pr_info("no supported peer-to-peer memory devices found\n");
+ goto free_devices;
+ }
+ mutex_unlock(&ctrl->subsys->lock);
+
+ pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
+ return;
+
+free_devices:
+ pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+ mutex_unlock(&ctrl->subsys->lock);
+}
+
+static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
+{
+ if (!ctrl->p2p_dev)
+ return;
+
+ mutex_lock(&ctrl->subsys->lock);
+
+ pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+ pci_dev_put(ctrl->p2p_dev);
+ ctrl->p2p_dev = NULL;
+
+ mutex_unlock(&ctrl->subsys->lock);
+}
+
u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
{
@@ -800,6 +890,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
INIT_LIST_HEAD(&ctrl->async_events);
+ INIT_LIST_HEAD(&ctrl->p2p_clients);
memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@@ -855,6 +946,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
ctrl->kato = DIV_ROUND_UP(kato, 1000);
}
nvmet_start_keep_alive_timer(ctrl);
+ nvmet_setup_p2pmem(ctrl, req);
mutex_lock(&subsys->lock);
list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
@@ -891,6 +983,7 @@ static void nvmet_ctrl_free(struct kref *ref)
flush_work(&ctrl->async_event_work);
cancel_work_sync(&ctrl->fatal_err_work);
+ nvmet_release_p2pmem(ctrl);
ida_simple_remove(&cntlid_ida, ctrl->cntlid);
kfree(ctrl->sqs);
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index 28bbdff4a88b..a213f8fc3bf3 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
op = REQ_OP_READ;
}
+ if (is_pci_p2pdma_page(sg_page(req->sg)))
+ op_flags |= REQ_PCI_P2PDMA;
+
sector = le64_to_cpu(req->cmd->rw.slba);
sector <<= (req->ns->blksize_shift - 9);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 417f6c0331cc..85a170914588 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
return container_of(to_config_group(item), struct nvmet_ns, group);
}
+static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
+{
+ return disk_to_dev(ns->bdev->bd_disk);
+}
+
struct nvmet_cq {
u16 qid;
u16 size;
@@ -98,6 +103,7 @@ struct nvmet_port {
struct list_head referrals;
void *priv;
bool enabled;
+ bool allow_p2pmem;
};
static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
@@ -131,6 +137,8 @@ struct nvmet_ctrl {
struct work_struct fatal_err_work;
struct nvmet_fabrics_ops *ops;
+ struct pci_dev *p2p_dev;
+ struct list_head p2p_clients;
char subsysnqn[NVMF_NQN_FIELD_LEN];
char hostnqn[NVMF_NQN_FIELD_LEN];
@@ -232,6 +240,8 @@ struct nvmet_req {
void (*execute)(struct nvmet_req *req);
struct nvmet_fabrics_ops *ops;
+
+ struct device *p2p_client;
};
static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 020354e11351..7a1f09995ed5 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
#include <linux/string.h>
#include <linux/wait.h>
#include <linux/inet.h>
+#include <linux/pci-p2pdma.h>
#include <asm/unaligned.h>
#include <rdma/ib_verbs.h>
@@ -68,6 +69,7 @@ struct nvmet_rdma_rsp {
u8 n_rdma;
u32 flags;
u32 invalidate_rkey;
+ struct pci_dev *p2p_dev;
struct list_head wait_list;
struct list_head free_list;
@@ -426,12 +428,18 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
if (rsp->n_rdma) {
rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
- queue->cm_id->port_num, rsp->req.sg,
- rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
+ queue->cm_id->port_num, rsp->req.sg,
+ rsp->req.sg_cnt, nvmet_data_dir(&rsp->req),
+ rsp->p2p_dev ? RDMA_RW_CTX_FLAG_PCI_P2PDMA : 0);
}
- if (rsp->req.sg != &rsp->cmd->inline_sg)
- sgl_free(rsp->req.sg);
+ if (rsp->req.sg != &rsp->cmd->inline_sg) {
+ if (rsp->p2p_dev)
+ pci_p2pmem_free_sgl(rsp->p2p_dev, rsp->req.sg,
+ rsp->req.sg_cnt);
+ else
+ sgl_free(rsp->req.sg);
+ }
if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
nvmet_rdma_process_wr_wait_list(queue);
@@ -567,19 +575,34 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
u64 addr = le64_to_cpu(sgl->addr);
u32 len = get_unaligned_le24(sgl->length);
u32 key = get_unaligned_le32(sgl->key);
+ struct pci_dev *p2p_dev = NULL;
int ret;
/* no data command? */
if (!len)
return 0;
- rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
- if (!rsp->req.sg)
- return NVME_SC_INTERNAL;
+ if (rsp->queue->nvme_sq.ctrl)
+ p2p_dev = rsp->queue->nvme_sq.ctrl->p2p_dev;
+
+ rsp->p2p_dev = NULL;
+ if (rsp->queue->nvme_sq.qid && p2p_dev) {
+ ret = pci_p2pmem_alloc_sgl(p2p_dev, &rsp->req.sg,
+ &rsp->req.sg_cnt, len);
+ if (!ret)
+ rsp->p2p_dev = p2p_dev;
+ }
+
+ if (!rsp->p2p_dev) {
+ rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
+ if (!rsp->req.sg)
+ return NVME_SC_INTERNAL;
+ }
ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
- nvmet_data_dir(&rsp->req), 0);
+ nvmet_data_dir(&rsp->req),
+ rsp->p2p_dev ? RDMA_RW_CTX_FLAG_PCI_P2PDMA : 0);
if (ret < 0)
return NVME_SC_INTERNAL;
rsp->req.transfer_len += len;
@@ -658,6 +681,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
cmd->send_sge.addr, cmd->send_sge.length,
DMA_TO_DEVICE);
+ cmd->req.p2p_client = &queue->dev->device->dev;
+
if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
&queue->nvme_sq, &nvmet_rdma_ops))
return;
--
2.11.0
Some PCI devices may have memory mapped in a BAR space that's
intended for use in Peer-to-Peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.
A kernel interface is provided so that other subsystems can find and
allocate chunks of P2P memory as necessary to facilitate transfers
between two PCI peers. Depending on hardware, this may reduce the
bandwidth of the transfer but would significantly reduce pressure
on system memory. This may be desirable in many cases: for example a
system could be designed with a small CPU connected to a PCI switch by a
small number of lanes which would maximize the number of lanes available
to connect to NVME devices.
The interface requires a user driver to collect a list of client devices
involved in the transaction with the pci_p2pdma_add_client*() functions,
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary. The ACS bits on the
downstream switch port will be managed for all the registered clients.
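A minimal sketch of how a user of this interface ties the pieces
together (the 'client_a'/'client_b' device pointers are made up and
error handling is omitted):

  LIST_HEAD(clients);
  struct pci_dev *p2p_dev;
  void *buf;

  pci_p2pdma_add_client(&clients, client_a);
  pci_p2pdma_add_client(&clients, client_b);

  p2p_dev = pci_p2pmem_find(&clients);    /* takes a device reference */
  if (p2p_dev) {
          buf = pci_alloc_p2pmem(p2p_dev, SZ_4K);
          /* peers DMA to/from pci_p2pmem_virt_to_bus(p2p_dev, buf) */
          pci_free_p2pmem(p2p_dev, buf, SZ_4K);
          pci_p2pdma_client_list_free(&clients);
          pci_dev_put(p2p_dev);
  }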
The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI switch. This is because
using P2P transactions through the PCI root complex can have performance
limitations or, worse, might not work at all. Finding out how well a
particular RC supports P2P transfers is non-trivial. Additionally, the
benefit of P2P transfers that go through the RC is limited to
reducing DRAM usage.
This commit includes significant rework and feedback from Christoph
Hellwig.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/Kconfig | 16 ++
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 568 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/memremap.h | 18 ++
include/linux/pci-p2pdma.h | 87 +++++++
include/linux/pci.h | 4 +
6 files changed, 694 insertions(+)
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..840831418cbd 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,22 @@ config PCI_PASID
If unsure, say N.
+config PCI_P2PDMA
+ bool "PCI Peer to Peer transfer support"
+ depends on ZONE_DEVICE
+ select GENERIC_ALLOCATOR
+ help
+ Enables drivers to do PCI peer-to-peer transactions to and from
+ BARs that are exposed in other devices that are part of
+ the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+ specification to work (i.e. anything below a single PCI bridge).
+
+ Many PCIe root complexes do not support P2P transactions and
+ it's hard to tell which support it with good performance, so
+ at this time you will need a PCIe switch.
+
+ If unsure, say N.
+
config PCI_LABEL
def_bool y if (DMI || ACPI)
depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 941970936840..45e0ff6f3213 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_MSI) += msi.o
obj-$(CONFIG_PCI_ATS) += ats.o
obj-$(CONFIG_PCI_IOV) += iov.o
+obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
#
# ACPI Related PCI FW Functions
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..ec0a6cb9e500
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,568 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+
+struct pci_p2pdma {
+ struct percpu_ref devmap_ref;
+ struct completion devmap_ref_done;
+ struct gen_pool *pool;
+ bool published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+ struct pci_p2pdma *p2p =
+ container_of(ref, struct pci_p2pdma, devmap_ref);
+
+ complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+ struct percpu_ref *ref = data;
+
+ if (percpu_ref_is_dying(ref))
+ return;
+
+ percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+ struct pci_dev *pdev = data;
+
+ wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+ percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+ gen_pool_destroy(pdev->p2pdma->pool);
+ pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+ int error = -ENOMEM;
+ struct pci_p2pdma *p2p;
+
+ p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+ if (!p2p)
+ return -ENOMEM;
+
+ p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+ if (!p2p->pool)
+ goto out;
+
+ init_completion(&p2p->devmap_ref_done);
+ error = percpu_ref_init(&p2p->devmap_ref,
+ pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+ if (error)
+ goto out_pool_destroy;
+
+ percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+ if (error)
+ goto out_pool_destroy;
+
+ pdev->p2pdma = p2p;
+
+ return 0;
+
+out_pool_destroy:
+ gen_pool_destroy(p2p->pool);
+out:
+ devm_kfree(&pdev->dev, p2p);
+ return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any dma request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+ u64 offset)
+{
+ struct dev_pagemap *pgmap;
+ void *addr;
+ int error;
+
+ if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+ return -EINVAL;
+
+ if (offset >= pci_resource_len(pdev, bar))
+ return -EINVAL;
+
+ if (!size)
+ size = pci_resource_len(pdev, bar) - offset;
+
+ if (size + offset > pci_resource_len(pdev, bar))
+ return -EINVAL;
+
+ if (!pdev->p2pdma) {
+ error = pci_p2pdma_setup(pdev);
+ if (error)
+ return error;
+ }
+
+ pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+ if (!pgmap)
+ return -ENOMEM;
+
+ pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+ pgmap->res.end = pgmap->res.start + size - 1;
+ pgmap->res.flags = pci_resource_flags(pdev, bar);
+ pgmap->ref = &pdev->p2pdma->devmap_ref;
+ pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+ addr = devm_memremap_pages(&pdev->dev, pgmap);
+ if (IS_ERR(addr))
+ return PTR_ERR(addr);
+
+ error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
+ pci_bus_address(pdev, bar) + offset,
+ resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+ if (error)
+ return error;
+
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+ &pdev->p2pdma->devmap_ref);
+ if (error)
+ return error;
+
+ dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
+ &pgmap->res);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+ struct device *parent;
+
+ dev = get_device(dev);
+
+ while (dev) {
+ if (dev_is_pci(dev))
+ return to_pci_dev(dev);
+
+ parent = get_device(dev->parent);
+ put_device(dev);
+ dev = parent;
+ }
+
+ return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge:
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+ struct pci_dev *up1, *up2;
+
+ if (!pdev)
+ return NULL;
+
+ up1 = pci_dev_get(pci_upstream_bridge(pdev));
+ if (!up1)
+ return NULL;
+
+ up2 = pci_dev_get(pci_upstream_bridge(up1));
+ pci_dev_put(up1);
+
+ return up2;
+}
+
+static bool __upstream_bridges_match(struct pci_dev *upstream,
+ struct pci_dev *client)
+{
+ struct pci_dev *dma_up;
+ bool ret = true;
+
+ dma_up = get_upstream_bridge_port(client);
+
+ if (!dma_up) {
+ dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
+ ret = false;
+ goto out;
+ }
+
+ if (upstream != dma_up) {
+ dev_dbg(&client->dev,
+ "does not reside on the same upstream bridge\n");
+ ret = false;
+ goto out;
+ }
+
+out:
+ pci_dev_put(dma_up);
+ return ret;
+}
+
+static bool upstream_bridges_match(struct pci_dev *pdev,
+ struct pci_dev *client)
+{
+ struct pci_dev *upstream;
+ bool ret;
+
+ upstream = get_upstream_bridge_port(pdev);
+ if (!upstream) {
+ dev_warn(&pdev->dev, "not behind a PCI bridge\n");
+ return false;
+ }
+
+ ret = __upstream_bridges_match(upstream, client);
+
+ pci_dev_put(upstream);
+
+ return ret;
+}
+
+struct pci_p2pdma_client {
+ struct list_head list;
+ struct pci_dev *client;
+ struct pci_dev *p2pdma;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+ struct pci_p2pdma_client *item, *new_item;
+ struct pci_dev *p2pdma = NULL;
+ struct pci_dev *client;
+ int ret;
+
+ if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+ dev_warn(dev,
+ "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+ return -ENODEV;
+ }
+
+
+ client = find_parent_pci_dev(dev);
+ if (!client) {
+ dev_warn(dev,
+ "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+ return -ENODEV;
+ }
+
+ item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+ if (item && item->p2pdma) {
+ p2pdma = item->p2pdma;
+
+ if (!upstream_bridges_match(p2pdma, client)) {
+ ret = -EXDEV;
+ goto put_client;
+ }
+ }
+
+ new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+ if (!new_item) {
+ ret = -ENOMEM;
+ goto put_client;
+ }
+
+ new_item->client = client;
+ new_item->p2pdma = pci_dev_get(p2pdma);
+
+ list_add_tail(&new_item->list, head);
+
+ return 0;
+
+put_client:
+ pci_dev_put(client);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+ list_del(&item->list);
+ pci_dev_put(item->client);
+ pci_dev_put(item->p2pdma);
+ kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+ struct pci_p2pdma_client *pos, *tmp;
+ struct pci_dev *pdev;
+
+ pdev = find_parent_pci_dev(dev);
+ if (!pdev)
+ return;
+
+ list_for_each_entry_safe(pos, tmp, head, list) {
+ if (pos->client != pdev)
+ continue;
+
+ pci_p2pdma_client_free(pos);
+ }
+
+ pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+ struct pci_p2pdma_client *pos, *tmp;
+
+ list_for_each_entry_safe(pos, tmp, head, list)
+ pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+static bool upstream_bridges_match_list(struct pci_dev *pdev,
+ struct list_head *head)
+{
+ struct pci_p2pdma_client *pos;
+ struct pci_dev *upstream;
+ bool ret;
+
+ upstream = get_upstream_bridge_port(pdev);
+ if (!upstream) {
+ dev_warn(&pdev->dev, "not behind a PCI bridge\n");
+ return false;
+ }
+
+ list_for_each_entry(pos, head, list) {
+ ret = __upstream_bridges_match(upstream, pos->client);
+ if (!ret)
+ break;
+ }
+
+ pci_dev_put(upstream);
+ return ret;
+}
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ * the specified list of clients
+ * @clients: list of client devices to check
+ *
+ * For now, we only support cases where the devices that will transfer to the
+ * p2pmem device are behind the same bridge. This cuts out cases that may work
+ * but is safest for the user.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+ struct pci_dev *pdev = NULL;
+ struct pci_p2pdma_client *pos;
+
+ while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+ if (!pdev->p2pdma || !pdev->p2pdma->published)
+ continue;
+
+ if (!upstream_bridges_match_list(pdev, clients))
+ continue;
+
+ list_for_each_entry(pos, clients, list)
+ pos->p2pdma = pdev;
+
+ return pdev;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+ void *ret;
+
+ if (unlikely(!pdev->p2pdma))
+ return NULL;
+
+ if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+ return NULL;
+
+ ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+ if (unlikely(!ret))
+ percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - free peer-to-peer DMA memory
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that were allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+ gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+ percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
+ * address obtained with pci_alloc_p2pmem
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+ if (!addr)
+ return 0;
+ if (!pdev->p2pdma)
+ return 0;
+
+ /*
+ * Note: when we added the memory to the pool we used the PCI
+ * bus address as the physical address. So gen_pool_virt_to_phys()
+ * actually returns the bus address despite the misleading name.
+ */
+ return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev: the device to allocate memory from
+ * @sgl: the allocated scatterlist
+ * @nents: the number of SG entries in the list
+ * @length: number of bytes to allocate
+ *
+ * Returns 0 on success
+ */
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+ unsigned int *nents, u32 length)
+{
+ struct scatterlist *sg;
+ void *addr;
+
+ sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+ if (!sg)
+ return -ENOMEM;
+
+ sg_init_table(sg, 1);
+
+ addr = pci_alloc_p2pmem(pdev, length);
+ if (!addr)
+ goto out_free_sg;
+
+ sg_set_buf(sg, addr, length);
+ *sgl = sg;
+ *nents = 1;
+ return 0;
+
+out_free_sg:
+ kfree(sg);
+ return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl
+ * @pdev: the device the memory was allocated from
+ * @sgl: the allocated scatterlist
+ * @nents: the number of SG entries in the list
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+ unsigned int nents)
+{
+ struct scatterlist *sg;
+ int count;
+
+ if (!sgl || !nents)
+ return;
+
+ for_each_sg(sgl, sg, nents, count)
+ pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+ kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ * other devices with pci_p2pmem_find
+ * @pdev: the device with peer-to-peer DMA memory to publish
+ * @publish: set to true to publish the memory, false to unpublish it
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+ if (!pdev->p2pdma)
+ return;
+
+ pdev->p2pdma->published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
* driver can hotplug the device memory using ZONE_DEVICE and with that memory
* type. Any page of a process can be migrated to such memory. However no one
* should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
*/
enum memory_type {
MEMORY_DEVICE_HOST = 0,
MEMORY_DEVICE_PRIVATE,
MEMORY_DEVICE_PUBLIC,
+ MEMORY_DEVICE_PCI_P2PDMA,
};
/*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
}
#endif /* CONFIG_ZONE_DEVICE */
+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+ return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
static inline bool is_device_private_page(const struct page *page)
{
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..c0dde3d3aac4
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+ u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+ unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+ unsigned int nents);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+ size_t size, u64 offset)
+{
+ return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+ struct device *dev)
+{
+ return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+ struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+ return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+ return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+ size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+ void *addr)
+{
+ return 0;
+}
+static inline int pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+ struct scatterlist **sgl, unsigned int *nents, u32 length)
+{
+ return -ENODEV;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+ struct scatterlist *sgl, unsigned int nents)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2PDMA_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 024a1beda008..437e42615896 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -276,6 +276,7 @@ struct pcie_link_state;
struct pci_vpd;
struct pci_sriov;
struct pci_ats;
+struct pci_p2pdma;
/* The pci_dev structure describes PCI devices */
struct pci_dev {
@@ -429,6 +430,9 @@ struct pci_dev {
#ifdef CONFIG_PCI_PASID
u16 pasid_features;
#endif
+#ifdef CONFIG_PCI_P2PDMA
+ struct pci_p2pdma *p2pdma;
+#endif
phys_addr_t rom; /* Physical address if not from BAR */
size_t romlen; /* Length if not from BAR */
char *driver_override; /* Driver name to force a match */
--
2.11.0
For peer-to-peer transactions to work the downstream ports in each
switch must not have the ACS flags set. At this time there is no way
to dynamically change the flags and update the corresponding IOMMU
groups, so this is done at enumeration time before the groups are
assigned.
This effectively means that if CONFIG_PCI_P2PDMA is selected then
all devices behind any switch will be in the same IOMMU group.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/Kconfig | 4 ++++
drivers/pci/p2pdma.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci.c | 4 ++++
include/linux/pci-p2pdma.h | 5 +++++
4 files changed, 57 insertions(+)
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 840831418cbd..a430672f0ad4 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -138,6 +138,10 @@ config PCI_P2PDMA
it's hard to tell which support it with good performance, so
at this time you will need a PCIe switch.
+ Enabling this option will also disable ACS on all ports behind
+ any PCIe switch. This effectively puts all devices behind any
+ switch into the same IOMMU group.
+
If unsure, say N.
config PCI_LABEL
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4e1c81f64b29..61af07acd21a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -255,6 +255,50 @@ static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
return up2;
}
+/*
+ * pci_p2pdma_disable_acs - disable ACS flags for ports in PCI
+ * bridges/switches
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any downstream port in any switch so that P2P TLPs
+ * are not forwarded up to the root complex.
+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any switch being in the same IOMMU
+ * group. At this time there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
+ */
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+ struct pci_dev *up;
+ int pos;
+ u16 ctrl;
+
+ up = get_upstream_bridge_port(pdev);
+ if (!up)
+ return 0;
+ pci_dev_put(up);
+
+ pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+ if (!pos)
+ return 0;
+
+ dev_info(&pdev->dev, "disabling ACS flags for peer-to-peer DMA\n");
+
+ pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+
+ ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
+
+ pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
+
+ return 1;
+}
+
static bool __upstream_bridges_match(struct pci_dev *upstream,
struct pci_dev *client)
{
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f6a4dd10d9b0..95ad3cf288c8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -16,6 +16,7 @@
#include <linux/of.h>
#include <linux/of_pci.h>
#include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
#include <linux/pm.h>
#include <linux/slab.h>
#include <linux/module.h>
@@ -2826,6 +2827,9 @@ static void pci_std_enable_acs(struct pci_dev *dev)
*/
void pci_enable_acs(struct pci_dev *dev)
{
+ if (pci_p2pdma_disable_acs(dev))
+ return;
+
if (!pci_acs_enable)
return;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 126eca697ab3..f537f521f60c 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -22,6 +22,7 @@ struct block_device;
struct scatterlist;
#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_disable_acs(struct pci_dev *pdev);
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
@@ -41,6 +42,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir);
#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+ return 0;
+}
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
{
--
2.11.0
QUEUE_FLAG_PCI_P2PDMA is introduced to indicate that a driver's request
queue supports targeting P2P memory.
REQ_PCI_P2PDMA is introduced to indicate that a particular bio request is
directed to/from PCI P2P memory. A request with this flag is not
accepted unless the corresponding queue has the QUEUE_FLAG_PCI_P2PDMA
flag set.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
block/blk-core.c | 3 +++
include/linux/blk_types.h | 18 +++++++++++++++++-
include/linux/blkdev.h | 3 +++
3 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 2d1a7bbe0634..36a6a304a559 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2183,6 +2183,9 @@ generic_make_request_checks(struct bio *bio)
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
goto not_supported;
+ if ((bio->bi_opf & REQ_PCI_P2PDMA) && !blk_queue_pci_p2pdma(q))
+ goto not_supported;
+
if (should_fail_bio(bio))
goto end_io;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index bf18b95ed92d..490122c85b3f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -274,6 +274,10 @@ enum req_flag_bits {
__REQ_BACKGROUND, /* background IO */
__REQ_NOWAIT, /* Don't wait if request will block */
+#ifdef CONFIG_PCI_P2PDMA
+ __REQ_PCI_P2PDMA, /* request is to/from P2P memory */
+#endif
+
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */
@@ -298,6 +302,18 @@ enum req_flag_bits {
#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
#define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
+#ifdef CONFIG_PCI_P2PDMA
+/*
+ * Currently SGLs do not support mixed P2P and regular memory so
+ * requests with P2P memory must not be merged.
+ */
+#define REQ_PCI_P2PDMA (1ULL << __REQ_PCI_P2PDMA)
+#define REQ_IS_PCI_P2PDMA(req) ((req)->cmd_flags & REQ_PCI_P2PDMA)
+#else
+#define REQ_PCI_P2PDMA 0
+#define REQ_IS_PCI_P2PDMA(req) 0
+#endif /* CONFIG_PCI_P2PDMA */
+
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
#define REQ_DRV (1ULL << __REQ_DRV)
@@ -306,7 +322,7 @@ enum req_flag_bits {
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
#define REQ_NOMERGE_FLAGS \
- (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)
+ (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA | REQ_PCI_P2PDMA)
#define bio_op(bio) \
((bio)->bi_opf & REQ_OP_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed63f3b69c12..0b4a386c73ea 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -698,6 +698,7 @@ struct request_queue {
#define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */
#define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_PCI_P2PDMA 30 /* device supports pci p2p requests */
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
@@ -793,6 +794,8 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
#define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
#define blk_queue_scsi_passthrough(q) \
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_pci_p2pdma(q) \
+ test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
#define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
--
2.11.0
The DMA address used when mapping PCI P2P memory must be the PCI bus
address. Thus, introduce pci_p2pdma_[un]map_sg() to map the correct
addresses when using P2P memory.
For this, we assume that an SGL passed to these functions contains either
all P2P memory or no P2P memory.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2pdma.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/memremap.h | 1 +
include/linux/pci-p2pdma.h | 13 ++++++++++++
3 files changed, 65 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index a57df78f6a32..4e1c81f64b29 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -188,6 +188,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->res.flags = pci_resource_flags(pdev, bar);
pgmap->ref = &pdev->p2pdma->devmap_ref;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+ pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
+ pci_resource_start(pdev, bar);
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr))
@@ -616,3 +618,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
pdev->p2pdma->published = publish;
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
+
+/**
+ * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to map
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ *
+ * Returns the number of SG entries mapped
+ */
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir)
+{
+ struct dev_pagemap *pgmap;
+ struct scatterlist *s;
+ phys_addr_t paddr;
+ int i;
+
+ /*
+ * p2pdma mappings are not compatible with devices that use
+ * dma_virt_ops.
+ */
+ if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops)
+ return 0;
+
+ for_each_sg(sg, s, nents, i) {
+ pgmap = sg_page(s)->pgmap;
+ paddr = sg_phys(s);
+
+ s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
+ sg_dma_len(s) = s->length;
+ }
+
+ return nents;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
+
+/**
+ * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to unmap
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ */
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir)
+{
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9e907c338a44..1660f64ce96f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -125,6 +125,7 @@ struct dev_pagemap {
struct device *dev;
void *data;
enum memory_type type;
+ u64 pci_p2pdma_bus_offset;
};
#ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index c0dde3d3aac4..126eca697ab3 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -36,6 +36,10 @@ int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
unsigned int nents);
void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
#else /* CONFIG_PCI_P2PDMA */
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -83,5 +87,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
}
+static inline int pci_p2pdma_map_sg(struct device *dev,
+ struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+ return 0;
+}
+static inline void pci_p2pdma_unmap_sg(struct device *dev,
+ struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+}
#endif /* CONFIG_PCI_P2PDMA */
#endif /* _LINUX_PCI_P2PDMA_H */
--
2.11.0
On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> Hi Everyone,
So Oliver (CC) was having issues getting any of that to work for us.
The problem is that according to him (I didn't double-check the latest
patches) you effectively hotplug the PCIe memory into the system when
creating struct pages.
This cannot possibly work for us. First, we cannot map PCIe memory as
cacheable. (Note that doing so is a bad idea if you are behind a PLX
switch anyway since you'd have to manage cache coherency in SW).
Then our MMIO space is so far away from our memory space that there is
not enough vmemmap virtual space to be able to do that.
So this can only work across architectures by using something like HMM
to create special device struct pages.
Ben.
> Here's v2 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.16-rc3 which already
> includes Christoph's devpagemap work the previous version was based
> off as well as a couple of the cleanup patches that were in v1.
>
> Additionally, we've made the following changes based on feedback:
>
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> as a bunch of cleanup and spelling fixes he pointed out in the last
> series.
>
> * To address Alex's ACS concerns, we change to a simpler method of
> just disabling ACS behind switches for any kernel that has
> CONFIG_PCI_P2PDMA.
>
> * We also reject using devices that employ 'dma_virt_ops' which should
> fairly simply handle Jason's concerns that this work might break with
> the HFI, QIB and rxe drivers that use the virtual ops to implement
> their own special DMA operations.
>
> Thanks,
>
> Logan
>
> --
>
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> provided valuable feedback to get these patches to where they are today.
>
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, presently, the
> trade-off is currently a reduction in overall throughput. (Largely due
> to hardware issues that would certainly improve in the future).
>
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch. This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We chose to go this route based on feedback we
> received at the last LSF). Future work may enable these transfers behind
> a fabric of PCI switches or perhaps using a white list of known good
> root complexes.
>
> In order to enable this functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
>
> Another set of functions allow a client driver to create a list of
> client devices that will be used in a given P2P transactions and then
> use that list to find any P2P memory that is supported by all the
> client devices. This list is then also used to selectively disable the
> ACS bits for the downstream ports behind these devices.
>
> In the block layer, we also introduce a P2P request flag to indicate a
> given request targets P2P memory as well as a flag for a request queue
> to indicate a given queue supports targeting P2P memory. P2P requests
> will only be accepted by queues that support it. Also, P2P requests
> are marked to not be merged seeing a non-homogenous request would
> complicate the DMA mapping requirements.
>
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the corrent pci_bus_addr_t for the DMA transactions.
>
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not.
>
> Finally, in the NVMe fabrics target port we introduce a new
> configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> to find P2P memory supported by the RDMA NIC and all namespaces. If
> supported memory is found, it will be used in all IO transfers. And if
> a port is using P2P memory, adding new namespaces that are not supported
> by that memory will fail.
>
> Logan Gunthorpe (10):
> PCI/P2PDMA: Support peer to peer memory
> PCI/P2PDMA: Add sysfs group to display p2pmem stats
> PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> block: Introduce PCI P2P flags for request and request queue
> IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
> nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> nvme-pci: Add support for P2P memory in requests
> nvme-pci: Add a quirk for a pseudo CMB
> nvmet: Optionally use PCI P2P memory
>
> Documentation/ABI/testing/sysfs-bus-pci | 25 ++
> block/blk-core.c | 3 +
> drivers/infiniband/core/rw.c | 21 +-
> drivers/infiniband/ulp/isert/ib_isert.c | 5 +-
> drivers/infiniband/ulp/srpt/ib_srpt.c | 7 +-
> drivers/nvme/host/core.c | 4 +
> drivers/nvme/host/nvme.h | 8 +
> drivers/nvme/host/pci.c | 118 ++++--
> drivers/nvme/target/configfs.c | 29 ++
> drivers/nvme/target/core.c | 95 ++++-
> drivers/nvme/target/io-cmd.c | 3 +
> drivers/nvme/target/nvmet.h | 10 +
> drivers/nvme/target/rdma.c | 43 +-
> drivers/pci/Kconfig | 20 +
> drivers/pci/Makefile | 1 +
> drivers/pci/p2pdma.c | 713 ++++++++++++++++++++++++++++++++
> drivers/pci/pci.c | 4 +
> include/linux/blk_types.h | 18 +-
> include/linux/blkdev.h | 3 +
> include/linux/memremap.h | 19 +
> include/linux/pci-p2pdma.h | 105 +++++
> include/linux/pci.h | 4 +
> include/rdma/rw.h | 7 +-
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +-
> 24 files changed, 1204 insertions(+), 67 deletions(-)
> create mode 100644 drivers/pci/p2pdma.c
> create mode 100644 include/linux/pci-p2pdma.h
>
> --
> 2.11.0
On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > Hi Everyone,
>
>
> So Oliver (CC) was having issues getting any of that to work for us.
>
> The problem is that acccording to him (I didn't double check the latest
> patches) you effectively hotplug the PCIe memory into the system when
> creating struct pages.
>
> This cannot possibly work for us. First we cannot map PCIe memory as
> cachable. (Note that doing so is a bad idea if you are behind a PLX
> switch anyway since you'd ahve to manage cache coherency in SW).
Note: I think the above means it won't work behind a switch on x86
either, will it ?
> Then our MMIO space is so far away from our memory space that there is
> not enough vmemmap virtual space to be able to do that.
>
> So this can only work accross achitectures by using something like HMM
> to create special device struct page's.
>
> Ben.
>
>
> > Here's v2 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.16-rc3 which already
> > includes Christoph's devpagemap work the previous version was based
> > off as well as a couple of the cleanup patches that were in v1.
> >
> > Additionally, we've made the following changes based on feedback:
> >
> > * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> > as a bunch of cleanup and spelling fixes he pointed out in the last
> > series.
> >
> > * To address Alex's ACS concerns, we change to a simpler method of
> > just disabling ACS behind switches for any kernel that has
> > CONFIG_PCI_P2PDMA.
> >
> > * We also reject using devices that employ 'dma_virt_ops' which should
> > fairly simply handle Jason's concerns that this work might break with
> > the HFI, QIB and rxe drivers that use the virtual ops to implement
> > their own special DMA operations.
> >
> > Thanks,
> >
> > Logan
> >
> > --
> >
> > This is a continuation of our work to enable using Peer-to-Peer PCI
> > memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> > provided valuable feedback to get these patches to where they are today.
> >
> > The concept here is to use memory that's exposed on a PCI BAR as
> > data buffers in the NVME target code such that data can be transferred
> > from an RDMA NIC to the special memory and then directly to an NVMe
> > device avoiding system memory entirely. The upside of this is better
> > QoS for applications running on the CPU utilizing memory and lower
> > PCI bandwidth required to the CPU (such that systems could be designed
> > with fewer lanes connected to the CPU). However, presently, the
> > trade-off is currently a reduction in overall throughput. (Largely due
> > to hardware issues that would certainly improve in the future).
> >
> > Due to these trade-offs we've designed the system to only enable using
> > the PCI memory in cases where the NIC, NVMe devices and memory are all
> > behind the same PCI switch. This will mean many setups that could likely
> > work well will not be supported so that we can be more confident it
> > will work and not place any responsibility on the user to understand
> > their topology. (We chose to go this route based on feedback we
> > received at the last LSF). Future work may enable these transfers behind
> > a fabric of PCI switches or perhaps using a white list of known good
> > root complexes.
> >
> > In order to enable this functionality, we introduce a few new PCI
> > functions such that a driver can register P2P memory with the system.
> > Struct pages are created for this memory using devm_memremap_pages()
> > and the PCI bus offset is stored in the corresponding pagemap structure.
> >
> > Another set of functions allow a client driver to create a list of
> > client devices that will be used in a given P2P transactions and then
> > use that list to find any P2P memory that is supported by all the
> > client devices. This list is then also used to selectively disable the
> > ACS bits for the downstream ports behind these devices.
> >
> > In the block layer, we also introduce a P2P request flag to indicate a
> > given request targets P2P memory as well as a flag for a request queue
> > to indicate a given queue supports targeting P2P memory. P2P requests
> > will only be accepted by queues that support it. Also, P2P requests
> > are marked to not be merged seeing a non-homogenous request would
> > complicate the DMA mapping requirements.
> >
> > In the PCI NVMe driver, we modify the existing CMB support to utilize
> > the new PCI P2P memory infrastructure and also add support for P2P
> > memory in its request queue. When a P2P request is received it uses the
> > pci_p2pmem_map_sg() function which applies the necessary transformation
> > to get the corrent pci_bus_addr_t for the DMA transactions.
> >
> > In the RDMA core, we also adjust rdma_rw_ctx_init() and
> > rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> > to use the PCI P2P mapping functions or not.
> >
> > Finally, in the NVMe fabrics target port we introduce a new
> > configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> > to find P2P memory supported by the RDMA NIC and all namespaces. If
> > supported memory is found, it will be used in all IO transfers. And if
> > a port is using P2P memory, adding new namespaces that are not supported
> > by that memory will fail.
> >
> > Logan Gunthorpe (10):
> > PCI/P2PDMA: Support peer to peer memory
> > PCI/P2PDMA: Add sysfs group to display p2pmem stats
> > PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> > PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> > block: Introduce PCI P2P flags for request and request queue
> > IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
> > nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> > nvme-pci: Add support for P2P memory in requests
> > nvme-pci: Add a quirk for a pseudo CMB
> > nvmet: Optionally use PCI P2P memory
> >
> > Documentation/ABI/testing/sysfs-bus-pci | 25 ++
> > block/blk-core.c | 3 +
> > drivers/infiniband/core/rw.c | 21 +-
> > drivers/infiniband/ulp/isert/ib_isert.c | 5 +-
> > drivers/infiniband/ulp/srpt/ib_srpt.c | 7 +-
> > drivers/nvme/host/core.c | 4 +
> > drivers/nvme/host/nvme.h | 8 +
> > drivers/nvme/host/pci.c | 118 ++++--
> > drivers/nvme/target/configfs.c | 29 ++
> > drivers/nvme/target/core.c | 95 ++++-
> > drivers/nvme/target/io-cmd.c | 3 +
> > drivers/nvme/target/nvmet.h | 10 +
> > drivers/nvme/target/rdma.c | 43 +-
> > drivers/pci/Kconfig | 20 +
> > drivers/pci/Makefile | 1 +
> > drivers/pci/p2pdma.c | 713 ++++++++++++++++++++++++++++++++
> > drivers/pci/pci.c | 4 +
> > include/linux/blk_types.h | 18 +-
> > include/linux/blkdev.h | 3 +
> > include/linux/memremap.h | 19 +
> > include/linux/pci-p2pdma.h | 105 +++++
> > include/linux/pci.h | 4 +
> > include/rdma/rw.h | 7 +-
> > net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +-
> > 24 files changed, 1204 insertions(+), 67 deletions(-)
> > create mode 100644 drivers/pci/p2pdma.c
> > create mode 100644 include/linux/pci-p2pdma.h
> >
> > --
> > 2.11.0
> Hi Everyone,
Hi Logan,
> Here's v2 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.16-rc3 which already
> includes Christoph's devpagemap work the previous version was based
> off as well as a couple of the cleanup patches that were in v1.
>
> Additionally, we've made the following changes based on feedback:
>
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> as a bunch of cleanup and spelling fixes he pointed out in the last
> series.
>
> * To address Alex's ACS concerns, we change to a simpler method of
> just disabling ACS behind switches for any kernel that has
> CONFIG_PCI_P2PDMA.
>
> * We also reject using devices that employ 'dma_virt_ops' which should
> fairly simply handle Jason's concerns that this work might break with
> the HFI, QIB and rxe drivers that use the virtual ops to implement
> their own special DMA operations.
That's good, but what would happen for these devices? Simply fail the
mapping, causing the ULP to fail its RDMA operation? I would think
that we need a capability flag for devices that support it.
On 03/01/2018 01:40 AM, Logan Gunthorpe wrote:
> In order to use PCI P2P memory pci_p2pmem_[un]map_sg() functions must be
> called to map the correct DMA address. To do this, we add a flags
> variable and the RDMA_RW_CTX_FLAG_PCI_P2P flag. When the flag is
> specified use the appropriate map function.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/infiniband/core/rw.c | 21 +++++++++++++++++----
> drivers/infiniband/ulp/isert/ib_isert.c | 5 +++--
> drivers/infiniband/ulp/srpt/ib_srpt.c | 7 ++++---
> drivers/nvme/target/rdma.c | 6 +++---
> include/rdma/rw.h | 7 +++++--
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +++---
> 6 files changed, 35 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
> index c8963e91f92a..775a9f8b15a6 100644
> --- a/drivers/infiniband/core/rw.c
> +++ b/drivers/infiniband/core/rw.c
> @@ -12,6 +12,7 @@
> */
> #include <linux/moduleparam.h>
> #include <linux/slab.h>
> +#include <linux/pci-p2pdma.h>
> #include <rdma/mr_pool.h>
> #include <rdma/rw.h>
>
> @@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
> * @remote_addr:remote address to read/write (relative to @rkey)
> * @rkey: remote key to operate on
> * @dir: %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
> + * @flags: any of the RDMA_RW_CTX_FLAG_* flags
> *
> * Returns the number of WQEs that will be needed on the workqueue if
> * successful, or a negative error code.
> */
> int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
> struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
> - u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> + u64 remote_addr, u32 rkey, enum dma_data_direction dir,
> + unsigned int flags)
> {
> struct ib_device *dev = qp->pd->device;
> int ret;
>
> - ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
> + if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
> + ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
> + else
> + ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
> +
Why not use is_pci_p2pdma_page(sg) instead of a flag? It would be so
much cleaner...
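Something like this untested sketch (using only names from this series) is
what the suggestion would look like in rdma_rw_ctx_init():

	/* Sketch only: key off the page type instead of a caller flag */
	if (is_pci_p2pdma_page(sg_page(sg)))
		ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
	else
		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);

Since an SGL is required to be either all P2P memory or no P2P memory,
checking the first page should be enough.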
> We create a configfs attribute in each nvme-fabrics target port to
> enable p2p memory use. When enabled, the port will only then use the
> p2p memory if a p2p memory device can be found which is behind the
> same switch as the RDMA port and all the block devices in use. If
> the user enabled it an no devices are found, then the system will
> silently fall back on using regular memory.
>
> If appropriate, that port will allocate memory for the RDMA buffers
> for queues from the p2pmem device falling back to system memory should
> anything fail.
Nice.
> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> save an extra PCI transfer as the NVME card could just take the data
> out of it's own memory. However, at this time, cards with CMB buffers
> don't seem to be available.
Can you describe what the plan would be to have it when these devices
do come along? I'd say that p2p_dev needs to become an nvmet_ns reference
and not live in nvmet_ctrl. Then, when CMB-capable devices come along, the
ns can prefer to use its own CMB instead of locating a p2p_dev device?
> +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
> + struct nvmet_ns *ns)
> +{
> + int ret;
> +
> + if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
> + pr_err("peer-to-peer DMA is not supported by %s\n",
> + ns->device_path);
> + return -EINVAL;
I'd say just skip it instead of failing it; in theory one can
connect nvme devices via p2p mem and expose other devices in the
same subsystem. The message would be pr_debug to reduce chattiness.
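Roughly, as an untested sketch of what I mean:

	if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
		pr_debug("peer-to-peer DMA is not supported by %s, skipping\n",
			 ns->device_path);
		return 0;
	}

so that namespace simply gets served from system memory while the rest of
the subsystem can still use P2P memory.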
> + }
> +
> + ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
> + if (ret)
> + pr_err("failed to add peer-to-peer DMA client %s: %d\n",
> + ns->device_path, ret);
> +
> + return ret;
> +}
> +
> int nvmet_ns_enable(struct nvmet_ns *ns)
> {
> struct nvmet_subsys *subsys = ns->subsys;
> @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
> if (ret)
> goto out_blkdev_put;
>
> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> + if (ctrl->p2p_dev) {
> + ret = nvmet_p2pdma_add_client(ctrl, ns);
> + if (ret)
> + goto out_remove_clients;
Is this really a fatal failure given that we fall back to main
memory? Why not continue with main memory (and warn at best)?
> +/*
> + * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
> + * Ι/O commands. This requires the PCI p2p device to be compatible with the
> + * backing device for every namespace on this controller.
> + */
> +static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
> +{
> + struct nvmet_ns *ns;
> + int ret;
> +
> + if (!req->port->allow_p2pmem || !req->p2p_client)
> + return;
> +
> + mutex_lock(&ctrl->subsys->lock);
> +
> + ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
> + if (ret) {
> + pr_err("failed adding peer-to-peer DMA client %s: %d\n",
> + dev_name(req->p2p_client), ret);
> + goto free_devices;
> + }
> +
> + list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
> + ret = nvmet_p2pdma_add_client(ctrl, ns);
> + if (ret)
> + goto free_devices;
> + }
> +
> + ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
This is the first p2p_dev found right? What happens if I have more than
a single p2p device? In theory I'd have more p2p memory I can use. Have
you considered making pci_p2pmem_find return the least used suitable
device?
> + if (!ctrl->p2p_dev) {
> + pr_info("no supported peer-to-peer memory devices found\n");
> + goto free_devices;
> + }
> + mutex_unlock(&ctrl->subsys->lock);
> +
> + pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
> + return;
> +
> +free_devices:
> + pci_p2pdma_client_list_free(&ctrl->p2p_clients);
> + mutex_unlock(&ctrl->subsys->lock);
> +}
> +
> +static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
> +{
> + if (!ctrl->p2p_dev)
> + return;
> +
> + mutex_lock(&ctrl->subsys->lock);
> +
> + pci_p2pdma_client_list_free(&ctrl->p2p_clients);
> + pci_dev_put(ctrl->p2p_dev);
> + ctrl->p2p_dev = NULL;
> +
> + mutex_unlock(&ctrl->subsys->lock);
> +}
> +
> u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
> struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
> {
> @@ -800,6 +890,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
>
> INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
> INIT_LIST_HEAD(&ctrl->async_events);
> + INIT_LIST_HEAD(&ctrl->p2p_clients);
>
> memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
> memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
> @@ -855,6 +946,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
> ctrl->kato = DIV_ROUND_UP(kato, 1000);
> }
> nvmet_start_keep_alive_timer(ctrl);
> + nvmet_setup_p2pmem(ctrl, req);
>
> mutex_lock(&subsys->lock);
> list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
> @@ -891,6 +983,7 @@ static void nvmet_ctrl_free(struct kref *ref)
> flush_work(&ctrl->async_event_work);
> cancel_work_sync(&ctrl->fatal_err_work);
>
> + nvmet_release_p2pmem(ctrl);
> ida_simple_remove(&cntlid_ida, ctrl->cntlid);
>
> kfree(ctrl->sqs);
> diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
> index 28bbdff4a88b..a213f8fc3bf3 100644
> --- a/drivers/nvme/target/io-cmd.c
> +++ b/drivers/nvme/target/io-cmd.c
> @@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
> op = REQ_OP_READ;
> }
>
> + if (is_pci_p2pdma_page(sg_page(req->sg)))
> + op_flags |= REQ_PCI_P2PDMA;
> +
> sector = le64_to_cpu(req->cmd->rw.slba);
> sector <<= (req->ns->blksize_shift - 9);
>
> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index 417f6c0331cc..85a170914588 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
> return container_of(to_config_group(item), struct nvmet_ns, group);
> }
>
> +static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
> +{
> + return disk_to_dev(ns->bdev->bd_disk);
> +}
> +
> struct nvmet_cq {
> u16 qid;
> u16 size;
> @@ -98,6 +103,7 @@ struct nvmet_port {
> struct list_head referrals;
> void *priv;
> bool enabled;
> + bool allow_p2pmem;
> };
>
> static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
> @@ -131,6 +137,8 @@ struct nvmet_ctrl {
> struct work_struct fatal_err_work;
>
> struct nvmet_fabrics_ops *ops;
> + struct pci_dev *p2p_dev;
> + struct list_head p2p_clients;
>
> char subsysnqn[NVMF_NQN_FIELD_LEN];
> char hostnqn[NVMF_NQN_FIELD_LEN];
> @@ -232,6 +240,8 @@ struct nvmet_req {
>
> void (*execute)(struct nvmet_req *req);
> struct nvmet_fabrics_ops *ops;
> +
> + struct device *p2p_client;
> };
>
> static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 020354e11351..7a1f09995ed5 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -23,6 +23,7 @@
> #include <linux/string.h>
> #include <linux/wait.h>
> #include <linux/inet.h>
> +#include <linux/pci-p2pdma.h>
> #include <asm/unaligned.h>
>
> #include <rdma/ib_verbs.h>
> @@ -68,6 +69,7 @@ struct nvmet_rdma_rsp {
> u8 n_rdma;
> u32 flags;
> u32 invalidate_rkey;
> + struct pci_dev *p2p_dev;
Given that p2p_client is in nvmet_req, I think it makes sense
that the p2p_dev itself would also live there. In theory, nothing
is preventing FC from using it as well.
Looks fine,
Reviewed-by: Sagi Grimberg <[email protected]>
Looks fine,
Reviewed-by: Sagi Grimberg <[email protected]>
> For P2P requests we must use the pci_p2pmem_[un]map_sg() functions
> instead of the dma_map_sg functions.
>
> With that, we can then indicate PCI_P2P support in the request queue.
> For this, we create an NVME_F_PCI_P2P flag which tells the core to
> set QUEUE_FLAG_PCI_P2P in the request queue.
This looks fine to me,
Reviewed-by: Sagi Grimberg <[email protected]>
Any plans adding the capability to nvme-rdma? Should be
straight-forward... In theory, the use-case would be rdma backend
fabric behind. Shouldn't be hard to test either...
> Any plans adding the capability to nvme-rdma? Should be
> straight-forward... In theory, the use-case would be rdma backend
> fabric behind. Shouldn't be hard to test either...
Nice idea, Sagi. Yes, we have been starting to look at that, though again we would probably want to impose the "attached to the same PCIe switch" rule, which might be less common to satisfy in initiator systems.
Down the road I would also like to discuss the best way to use this P2P framework to facilitate copies between NVMe namespaces (both PCIe and fabric-attached namespaces) without having to expose the CMB up to user space. Wasn't something like that done in the SCSI world at some point, Martin?
Stephen
> > Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> > save an extra PCI transfer as the NVME card could just take the data
> > out of it's own memory. However, at this time, cards with CMB buffers
> > don't seem to be available.
> Can you describe what would be the plan to have it when these devices
> do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
> and not from nvmet_ctrl. Then, when cmb capable devices come along, the
> ns can prefer to use its own cmb instead of locating a p2p_dev device?
Hi Sagi
Thanks for the review! That commit message is somewhat dated as NVMe controllers with CMBs that support RDS and WDS are now commercially available [1]. However, we have not yet tried to do any kind of optimization around this in terms of determining which p2p_dev to use. Your suggestion above looks good and we can look into this kind of optimization in due course.
[1] http://www.eideticom.com/uploads/images/NoLoad_Product_Spec.pdf
>> + ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
> This is the first p2p_dev found right? What happens if I have more than
> a single p2p device? In theory I'd have more p2p memory I can use. Have
> you considered making pci_p2pmem_find return the least used suitable
> device?
Yes, pci_p2pmem_find() will always return the first valid p2p_dev found. At the very least we should update this to allocate over all the valid p2p_devs. Since the load on any given p2p_dev will vary over time, I think a random allocation across the devices makes sense (at least for now); a rough sketch of that is below.
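For what it's worth, here is a rough, untested sketch of a random pick. It
would replace the body of pci_p2pmem_find() (so 'clients' is its argument)
and does reservoir sampling over the same loop the current code already
runs; everything except prandom_u32_max() comes from this series:

	struct pci_dev *pdev = NULL, *chosen = NULL;
	struct pci_p2pdma_client *pos;
	unsigned int count = 0;

	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
		if (!pdev->p2pdma || !pdev->p2pdma->published)
			continue;

		if (!upstream_bridges_match_list(pdev, clients))
			continue;

		/* keep the i-th candidate with probability 1/i */
		if (prandom_u32_max(++count) == 0) {
			pci_dev_put(chosen);
			chosen = pci_dev_get(pdev);
		}
	}

	if (chosen)
		list_for_each_entry(pos, clients, list)
			pos->p2pdma = chosen;

	return chosen;

This keeps a reference on the chosen device, just like the current code does.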
Stephen
Hey Sagi,
Thanks for the review!
On 01/03/18 03:32 AM, Sagi Grimberg wrote:
>> int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8
>> port_num,
>> struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
>> - u64 remote_addr, u32 rkey, enum dma_data_direction dir)
>> + u64 remote_addr, u32 rkey, enum dma_data_direction dir,
>> + unsigned int flags)
>> {
>> struct ib_device *dev = qp->pd->device;
>> int ret;
>> - ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
>> + if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
>> + ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
>> + else
>> + ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
>> +
>
> Why not use is_pci_p2pdma_page(sg) instead of a flag? It would be so
> much cleaner...
Yes, that sounds like a good idea. We can make that change for v3.
Logan
s/peer to peer/peer-to-peer/ to match text below and in spec.
On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
s/Peer-to-Peer/peer-to-peer/ to match spec and typical usage.
Is there anything about this memory that makes it specifically
intended for peer-to-peer transactions? I assume the device can't
really tell whether a transaction is from a CPU or a peer.
> A kernel interface is provided so that other subsystems can find and
> allocate chunks of P2P memory as necessary to facilitate transfers
> between two PCI peers. Depending on hardware, this may reduce the
> bandwidth of the transfer but would significantly reduce pressure
> on system memory. This may be desirable in many cases: for example a
> system could be designed with a small CPU connected to a PCI switch by a
> small number of lanes which would maximize the number of lanes available
> to connect to NVME devices.
"A kernel interface is provided" could mean "the kernel provides an
interface", independent of anything this patch does, but I think you
mean *this patch specifically* adds the interface.
Maybe something like:
Add interfaces for other subsystems to find and allocate ...:
int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();
This may reduce bandwidth of the transfer but significantly reduce
...
BTW, maybe there could be some kind of guide for device driver writers
in Documentation/PCI/?
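Something that walks a driver writer through the basic call sequence would
go a long way. As a hypothetical example (the function and device names are
made up, but the interfaces are the ones added by this series):

	#include <linux/pci-p2pdma.h>

	/* Set up a P2P buffer shared by two client devices, e.g. an RDMA
	 * NIC and an NVMe drive behind the same switch.
	 */
	static int example_setup_p2p(struct device *nic, struct device *nvme)
	{
		LIST_HEAD(clients);
		struct pci_dev *provider;
		pci_bus_addr_t bus_addr;
		void *buf;
		int ret;

		ret = pci_p2pdma_add_client(&clients, nic);
		if (ret)
			goto out_free_list;

		ret = pci_p2pdma_add_client(&clients, nvme);
		if (ret)
			goto out_free_list;

		/* finds published p2pmem behind the same switch; returns a ref */
		provider = pci_p2pmem_find(&clients);
		if (!provider) {
			ret = -ENODEV;
			goto out_free_list;
		}

		buf = pci_alloc_p2pmem(provider, PAGE_SIZE);
		if (!buf) {
			ret = -ENOMEM;
			goto out_put;
		}

		bus_addr = pci_p2pmem_virt_to_bus(provider, buf);
		dev_info(&provider->dev, "P2P buffer at bus address %llx\n",
			 (unsigned long long)bus_addr);

		/* ... hand buf/bus_addr to the DMA engines ... */

		pci_free_p2pmem(provider, buf, PAGE_SIZE);
		ret = 0;
	out_put:
		pci_dev_put(provider);
	out_free_list:
		pci_p2pdma_client_list_free(&clients);
		return ret;
	}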
> The interface requires a user driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary. The ACS bits on the
> downstream switch port will be managed for all the registered clients.
>
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI switch. This is because
> using P2P transactions through the PCI root complex can have performance
> limitations or, worse, might not work at all. Finding out how well a
> particular RC supports P2P transfers is non-trivial. Additionally, the
> benefits of P2P transfers that go through the RC is limited to only
> reducing DRAM usage.
I think it would be clearer and sufficient to simply say that we have
no way to know whether peer-to-peer routing between PCIe Root Ports is
supported (PCIe r4.0, sec 1.3.1).
The fact that you use the PCIe term "switch" suggests that a PCIe
Switch is required, but isn't it sufficient for the peers to be below
the same "PCI bridge", which would include PCIe Root Ports, PCIe
Switch Downstream Ports, and conventional PCI bridges?
The comments at get_upstream_bridge_port() suggest that this isn't
enough, and the peers actually do have to be below the same PCIe
Switch, but I don't know why.
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..840831418cbd 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,22 @@ config PCI_PASID
>
> If unsure, say N.
>
> +config PCI_P2PDMA
> + bool "PCI Peer to Peer transfer support"
> + depends on ZONE_DEVICE
> + select GENERIC_ALLOCATOR
> + help
> + Enableѕ drivers to do PCI peer to peer transactions to and from
s/peer to peer/peer-to-peer/ (in bool and help text)
> + BARs that are exposed in other devices that are the part of
> + the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> + specification to work (ie. anything below a single PCI bridge).
> +
> + Many PCIe root complexes do not support P2P transactions and
> + it's hard to tell which support it with good performance, so
> + at this time you will need a PCIe switch.
Until we have a way to figure out which of them support P2P,
performance is a don't-care.
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> new file mode 100644
> index 000000000000..ec0a6cb9e500
> --- /dev/null
> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,568 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PCI Peer 2 Peer DMA support.
s/Peer 2 Peer/peer-to-peer/
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
I think the SPDX tag is meant to remove the need for including the
license text, so you should be able to remove this. Oh, and one
trivial and annoying nit, I think for the SPDX tag, we're supposed to
use "//" in .c files and "/* */" in .h files.
> + * pci_p2pdma_add_resource - add memory for use as p2p memory
> + * @pci: the device to add the memory to
s/@pci/@pdev/
> + * @bar: PCI BAR to add
> + * @size: size of the memory to add, may be zero to use the whole BAR
> + * @offset: offset into the PCI BAR
> + *
> + * The memory will be given ZONE_DEVICE struct pages so that it may
> + * be used with any dma request.
s/dma/DMA/
> +int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> + u64 offset)
> +{
> + struct dev_pagemap *pgmap;
> + void *addr;
> + int error;
> +
> + if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
> + return -EINVAL;
> +
> + if (offset >= pci_resource_len(pdev, bar))
> + return -EINVAL;
> +
> + if (!size)
> + size = pci_resource_len(pdev, bar) - offset;
> +
> + if (size + offset > pci_resource_len(pdev, bar))
> + return -EINVAL;
> +
> + if (!pdev->p2pdma) {
> + error = pci_p2pdma_setup(pdev);
> + if (error)
> + return error;
> + }
> +
> + pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> + if (!pgmap)
> + return -ENOMEM;
> +
> + pgmap->res.start = pci_resource_start(pdev, bar) + offset;
> + pgmap->res.end = pgmap->res.start + size - 1;
> + pgmap->res.flags = pci_resource_flags(pdev, bar);
> + pgmap->ref = &pdev->p2pdma->devmap_ref;
> + pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> +
> + addr = devm_memremap_pages(&pdev->dev, pgmap);
> + if (IS_ERR(addr))
Free pgmap here? And in the other error case below? Or maybe this
happens via the devm_* magic? If so, when would that actually happen?
Would pgmap be effectively leaked until the pdev is destroyed?
> + return PTR_ERR(addr);
> +
> + error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
> + pci_bus_address(pdev, bar) + offset,
> + resource_size(&pgmap->res), dev_to_node(&pdev->dev));
> + if (error)
> + return error;
> +
> + error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
> + &pdev->p2pdma->devmap_ref);
> + if (error)
> + return error;
> +
> + dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
> + &pgmap->res);
s/dev_info/pci_info/ (also similar usages below, except for the one or
two cases where you don't have a pci_dev).
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge:
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.
s/pci_upstream_bridge/pci_upstream_bridge()/
This whole thing is confusing to me. Why do you want to reject peers
directly connected to the same root port? Why do you require the same
Switch Upstream Port? You don't exclude conventional PCI, but it
looks like you would require peers to share *two* upstream PCI-to-PCI
bridges? I would think a single shared upstream bridge (conventional,
PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?
Apologies if you've answered this before; maybe just include a little
explanation here so I don't ask again :)
> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +{
> + struct pci_dev *up1, *up2;
> +
> + if (!pdev)
> + return NULL;
> +
> + up1 = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!up1)
> + return NULL;
> +
> + up2 = pci_dev_get(pci_upstream_bridge(up1));
> + pci_dev_put(up1);
> +
> + return up2;
> +}
> +
> +static bool __upstream_bridges_match(struct pci_dev *upstream,
> + struct pci_dev *client)
> +{
> + struct pci_dev *dma_up;
> + bool ret = true;
> +
> + dma_up = get_upstream_bridge_port(client);
> +
> + if (!dma_up) {
> + dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
> + ret = false;
> + goto out;
> + }
> +
> + if (upstream != dma_up) {
> + dev_dbg(&client->dev,
> + "does not reside on the same upstream bridge\n");
> + ret = false;
> + goto out;
> + }
> +
> +out:
> + pci_dev_put(dma_up);
> + return ret;
> +}
> +
> +static bool upstream_bridges_match(struct pci_dev *pdev,
> + struct pci_dev *client)
> +{
> + struct pci_dev *upstream;
> + bool ret;
> +
> + upstream = get_upstream_bridge_port(pdev);
> + if (!upstream) {
> + dev_warn(&pdev->dev, "not behind a PCI bridge\n");
> + return false;
> + }
> +
> + ret = __upstream_bridges_match(upstream, client);
> +
> + pci_dev_put(upstream);
> +
> + return ret;
> +}
> +
> +struct pci_p2pdma_client {
> + struct list_head list;
> + struct pci_dev *client;
> + struct pci_dev *p2pdma;
Maybe call this "peer" or something instead of "p2pdma", since p2pdma
is also used for struct pci_p2pdma things?
> + * pci_p2pdma_add_client - allocate a new element in a client device list
> + * @head: list head of p2pdma clients
> + * @dev: device to add to the list
> + *
> + * This adds @dev to a list of clients used by a p2pdma device.
> + * This list should be passed to p2pmem_find(). Once p2pmem_find() has
> + * been called successfully, the list will be bound to a specific p2pdma
> + * device and new clients can only be added to the list if they are
> + * supported by that p2pdma device.
> + *
> + * The caller is expected to have a lock which protects @head as necessary
> + * so that none of the pci_p2p functions can be called concurrently
> + * on that list.
> + *
> + * Returns 0 if the client was successfully added.
> + */
> +int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
> +{
> + struct pci_p2pdma_client *item, *new_item;
> + struct pci_dev *p2pdma = NULL;
> + struct pci_dev *client;
> + int ret;
> +
> + if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
> + dev_warn(dev,
> + "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
> + return -ENODEV;
> + }
> +
> +
> + client = find_parent_pci_dev(dev);
Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
sort of weird that callers supply a non-PCI device and then we look up
a PCI device here. I assume you have some reason for this; if you
added a writeup in Documentation/PCI, that would be a good place to
elaborate on that, maybe with a one-line clue here.
> + if (!client) {
> + dev_warn(dev,
> + "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
> + return -ENODEV;
> + }
> +
> + item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
> + if (item && item->p2pdma) {
> + p2pdma = item->p2pdma;
> +
> + if (!upstream_bridges_match(p2pdma, client)) {
> + ret = -EXDEV;
> + goto put_client;
> + }
> + }
> +
> + new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
> + if (!new_item) {
> + ret = -ENOMEM;
> + goto put_client;
> + }
> +
> + new_item->client = client;
> + new_item->p2pdma = pci_dev_get(p2pdma);
> +
> + list_add_tail(&new_item->list, head);
> +
> + return 0;
> +
> +put_client:
> + pci_dev_put(client);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
> + * pci_alloc_p2p_mem - allocate peer-to-peer DMA memory
> + * @pdev: the device to allocate memory from
> + * @size: number of bytes to allocate
> + *
> + * Returns the allocated memory or NULL on error.
> + */
> +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
> +{
> + void *ret;
> +
> + if (unlikely(!pdev->p2pdma))
Is this a hot path? I'm not sure it's worth cluttering
non-performance paths with likely/unlikely.
> + return NULL;
> +
> + if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
> + return NULL;
> +
> + ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);
Why the double cast? Wouldn't "(void *)" be sufficient?
> + if (unlikely(!ret))
> + percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
> +
> +/**
> + * pci_free_p2pmem - allocate peer-to-peer DMA memory
> + * @pdev: the device the memory was allocated from
> + * @addr: address of the memory that was allocated
> + * @size: number of bytes that was allocated
> + */
> +void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
> +{
> + gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
In v4.16-rc1, gen_pool_free() takes "unsigned long addr". I know this
is based on -rc3; is this something that changed between -rc1 and
-rc3?
> + percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +}
> +EXPORT_SYMBOL_GPL(pci_free_p2pmem);
> +
> +/**
> + * pci_virt_to_bus - return the PCI bus address for a given virtual
> + * address obtained with pci_alloc_p2pmem
s/pci_alloc_p2pmem/pci_alloc_p2pmem()/
> + * @pdev: the device the memory was allocated from
> + * @addr: address of the memory that was allocated
> + */
> +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
> +{
> + if (!addr)
> + return 0;
> + if (!pdev->p2pdma)
> + return 0;
> +
> + /*
> + * Note: when we added the memory to the pool we used the PCI
> + * bus address as the physical address. So gen_pool_virt_to_phys()
> + * actually returns the bus address despite the misleading name.
> + */
> + return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
> +
> +/**
> + * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in an scatterlist
s/an scatterlist/a scatterlist/
> + * @pdev: the device to allocate memory from
> + * @sgl: the allocated scatterlist
> + * @nents: the number of SG entries in the list
> + * @length: number of bytes to allocate
> + *
> + * Returns 0 on success
> + */
> +int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
> + unsigned int *nents, u32 length)
> +{
> + struct scatterlist *sg;
> + void *addr;
> +
> + sg = kzalloc(sizeof(*sg), GFP_KERNEL);
> + if (!sg)
> + return -ENOMEM;
> +
> + sg_init_table(sg, 1);
> +
> + addr = pci_alloc_p2pmem(pdev, length);
> + if (!addr)
> + goto out_free_sg;
> +
> + sg_set_buf(sg, addr, length);
> + *sgl = sg;
> + *nents = 1;
> + return 0;
> +
> +out_free_sg:
> + kfree(sg);
> + return -ENOMEM;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
> +
> +/**
> + * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl
Remove the extra (doubled) space before "allocated".
s/pci_p2pmem_alloc_sgl/pci_p2pmem_alloc_sgl()/
> + * @pdev: the device to allocate memory from
> + * @sgl: the allocated scatterlist
> + * @nents: the number of SG entries in the list
> + */
> +void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
> + unsigned int nents)
> +{
> + struct scatterlist *sg;
> + int count;
> +
> + if (!sgl || !nents)
> + return;
> +
> + for_each_sg(sgl, sg, nents, count)
> + pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
> + kfree(sgl);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
> +
> +/**
> + * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
> + * other devices with pci_p2pmem_find
s/pci_p2pmem_find/pci_p2pmem_find()/
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> new file mode 100644
> index 000000000000..c0dde3d3aac4
> --- /dev/null
> +++ b/include/linux/pci-p2pdma.h
> @@ -0,0 +1,87 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
Remove license text.
On 01/03/18 04:03 AM, Sagi Grimberg wrote:
> Can you describe what would be the plan to have it when these devices
> do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
> and not from nvmet_ctrl. Then, when cmb capable devices come along, the
> ns can prefer to use its own cmb instead of locating a p2p_dev device?
The patchset already supports CMB drives. That's essentially what patch
7 is for. We change the nvme-pci driver to use the p2pmem code to
register and manage the CMB memory. After that, it is simply available
to the nvmet code. We have already been using this with a couple
prototype CMB drives.
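For reference, the registration side is roughly the following (a minimal
sketch of how a driver would use the p2pdma API from this series; the BAR
index and offset are made-up example values, not the actual nvme-pci code):

        /*
         * Sketch only: expose a device BAR as P2P memory. BAR 2 and the
         * zero offset are example values.
         */
        static int example_register_p2pmem(struct pci_dev *pdev)
        {
                int rc;

                /* create ZONE_DEVICE pages covering the whole BAR */
                rc = pci_p2pdma_add_resource(pdev, 2,
                                             pci_resource_len(pdev, 2), 0);
                if (rc)
                        return rc;

                /* allow other subsystems to find it via pci_p2pmem_find() */
                pci_p2pmem_publish(pdev, true);

                return 0;
        }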
>> +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
>> + struct nvmet_ns *ns)
>> +{
>> + int ret;
>> +
>> + if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
>> + pr_err("peer-to-peer DMA is not supported by %s\n",
>> + ns->device_path);
>> + return -EINVAL;
>
> I'd say that just skip it instead of failing it, in theory one can
> connect nvme devices via p2p mem and expose other devices in the
> same subsystem. The message would be pr_debug to reduce chattiness.
No, it's actually a bit more complicated than you make it. There's a
couple cases:
1) The user adds a namespace but there hasn't been a connection and no
p2pmem has been selected. In this case the add_client function is never
called and the user can add whatever namespace they like.
2) When the first connect happens, nvmet_setup_p2pmem() will call
add_client() for each namespace and rdma device. If any of them fail
then it does not use a P2P device and falls back, as you'd like, to
regular memory.
3) When the user adds a namespace after a port is in use and a
compatible P2P device has been found. In this case, if the user tries to
add a namespace that is not compatible with the P2P device in use then
it fails adding the new namespace. The only alternative here is to tear
everything down, ensure there are no P2P enabled buffers open and start
using regular memory again... That is very difficult.
I also disagree that these messages should be pr_debug. If a user is
trying to use P2P memory and they enable it but the system doesn't
choose to use their memory they must know why that is so they can make
the necessary adjustments. If the system doesn't at least print a dmesg
they are in the dark.
>> + }
>> +
>> + ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
>> + if (ret)
>> + pr_err("failed to add peer-to-peer DMA client %s: %d\n",
>> + ns->device_path, ret);
>> +
>> + return ret;
>> +}
>> +
>> int nvmet_ns_enable(struct nvmet_ns *ns)
>> {
>> struct nvmet_subsys *subsys = ns->subsys;
>> @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
>> if (ret)
>> goto out_blkdev_put;
>> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>> + if (ctrl->p2p_dev) {
>> + ret = nvmet_p2pdma_add_client(ctrl, ns);
>> + if (ret)
>> + goto out_remove_clients;
>
> Is this really a fatal failure given that we fall-back to main
> memory? Why not continue with main memory (and warn at best)?
See above. It's fatal because we are already using p2p memory and we
can't easily tear that all down when a user adds a new namespace.
>> + ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
>
> This is the first p2p_dev found right? What happens if I have more than
> a single p2p device? In theory I'd have more p2p memory I can use. Have
> you considered making pci_p2pmem_find return the least used suitable
> device?
Yes, it currently returns the first found. I imagine a bunch of
improvements could be made to it in future work. However, I'd probably
start with finding the nearest p2p device and then falling back to the
least used if that's necessary. At this time though it's not clear how
complicated these systems will get and what's actually needed here.
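To illustrate, a least-used policy could eventually look something like the
sketch below (untested; p2pmem_candidate_ok() stands in for the compatibility
checks pci_p2pmem_find() already does and is purely hypothetical):

        /* Rough sketch of a "least used" selection policy. */
        static struct pci_dev *p2pmem_find_least_used(struct list_head *clients)
        {
                struct pci_dev *pdev = NULL, *best = NULL;
                size_t best_avail = 0;

                while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
                        size_t avail;

                        if (!pdev->p2pdma || !pdev->p2pdma->published)
                                continue;
                        if (!p2pmem_candidate_ok(pdev, clients))
                                continue;

                        avail = gen_pool_avail(pdev->p2pdma->pool);
                        if (!best || avail > best_avail) {
                                pci_dev_put(best);
                                best = pci_dev_get(pdev);
                                best_avail = avail;
                        }
                }

                return best;
        }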
>> @@ -68,6 +69,7 @@ struct nvmet_rdma_rsp {
>> u8 n_rdma;
>> u32 flags;
>> u32 invalidate_rkey;
>> + struct pci_dev *p2p_dev;
>
> Given that p2p_client is in nvmet_req, I think it make sense
> that the p2p_dev itself would also live there. In theory, nothing
> is preventing FC from using it as well.
Fair point. I can look at moving it in v3.
Thanks,
Logan
On Wed, Feb 28, 2018 at 04:39:58PM -0700, Logan Gunthorpe wrote:
> Attributes display the total amount of P2P memory, the amount available
> and whether it is published or not.
Can you add enough text here to make the body of the changelog
complete in itself? That might mean just repeating the subject, which
is fine.
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-bus-pci | 25 +++++++++++++++++
> drivers/pci/p2pdma.c | 50 +++++++++++++++++++++++++++++++++
> 2 files changed, 75 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 44d4b2be92fd..f5656dae21be 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -323,3 +323,28 @@ Description:
>
> This is similar to /sys/bus/pci/drivers_autoprobe, but
> affects only the VFs associated with a specific PF.
> +
> +What: /sys/bus/pci/devices/.../p2pmem/available
> +Date: November 2017
> +Contact: Logan Gunthorpe <[email protected]>
> +Description:
> + If the device has any Peer-to-Peer memory registered, this
> + file contains the amount of memory that has not been
> + allocated (in decimal).
> +
> +What: /sys/bus/pci/devices/.../p2pmem/size
> +Date: November 2017
> +Contact: Logan Gunthorpe <[email protected]>
> +Description:
> + If the device has any Peer-to-Peer memory registered, this
> + file contains the total ammount of memory that the device
s/ammount/amount/
> + provides (in decimal).
> +
> +What: /sys/bus/pci/devices/.../p2pmem/published
> +Date: November 2017
> +Contact: Logan Gunthorpe <[email protected]>
> +Description:
> + If the device has any Peer-to-Peer memory registered, this
> + file contains a '1' if the memory has been published for
> + use inside the kernel or a '0' if it is only intended
> + for use within the driver that published it.
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ec0a6cb9e500..a57df78f6a32 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -29,6 +29,53 @@ struct pci_p2pdma {
> bool published;
> };
>
> +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + size_t size = 0;
> +
> + if (pdev->p2pdma->pool)
> + size = gen_pool_size(pdev->p2pdma->pool);
> +
> + return snprintf(buf, PAGE_SIZE, "%zd\n", size);
> +}
> +static DEVICE_ATTR_RO(size);
> +
> +static ssize_t available_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + size_t avail = 0;
> +
> + if (pdev->p2pdma->pool)
> + avail = gen_pool_avail(pdev->p2pdma->pool);
> +
> + return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
> +}
> +static DEVICE_ATTR_RO(available);
> +
> +static ssize_t published_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + return snprintf(buf, PAGE_SIZE, "%d\n", pdev->p2pdma->published);
> +}
> +static DEVICE_ATTR_RO(published);
> +
> +static struct attribute *p2pmem_attrs[] = {
> + &dev_attr_size.attr,
> + &dev_attr_available.attr,
> + &dev_attr_published.attr,
> + NULL,
> +};
> +
> +static const struct attribute_group p2pmem_group = {
> + .attrs = p2pmem_attrs,
> + .name = "p2pmem",
> +};
> +
> static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
> {
> struct pci_p2pdma *p2p =
> @@ -55,6 +102,7 @@ static void pci_p2pdma_release(void *data)
> percpu_ref_exit(&pdev->p2pdma->devmap_ref);
>
> gen_pool_destroy(pdev->p2pdma->pool);
> + sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
> pdev->p2pdma = NULL;
> }
>
> @@ -83,6 +131,8 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
> if (error)
> goto out_pool_destroy;
>
> + error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
> +
> pdev->p2pdma = p2p;
I think these two statements are out of order, since the attributes
dereference pdev->p2pdma. And it looks like you set "error"
unnecessarily, since you return immediately without looking at it.
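Something along these lines is what I had in mind (untested sketch, assuming
the existing out_pool_destroy label is the right cleanup):

        pdev->p2pdma = p2p;

        error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
        if (error) {
                pdev->p2pdma = NULL;
                goto out_pool_destroy;
        }

        return 0;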
> return 0;
> --
> 2.11.0
>
On Wed, Feb 28, 2018 at 04:39:59PM -0700, Logan Gunthorpe wrote:
> The DMA address used when mapping PCI P2P memory must be the PCI bus
> address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
> addresses when using P2P memory.
>
> For this, we assume that an SGL passed to these functions contain all
> p2p memory or no p2p memory.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/p2pdma.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/memremap.h | 1 +
> include/linux/pci-p2pdma.h | 13 ++++++++++++
> 3 files changed, 65 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index a57df78f6a32..4e1c81f64b29 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -188,6 +188,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> pgmap->res.flags = pci_resource_flags(pdev, bar);
> pgmap->ref = &pdev->p2pdma->devmap_ref;
> pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> + pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
> + pci_resource_start(pdev, bar);
>
> addr = devm_memremap_pages(&pdev->dev, pgmap);
> if (IS_ERR(addr))
> @@ -616,3 +618,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
> pdev->p2pdma->published = publish;
> }
> EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
> +
> +/*
> + * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
> + * @dev: device doing the DMA request
> + * @sg: scatter list to map
> + * @nents: elements in the scatterlist
> + * @dir: DMA direction
Can you fix these so the descriptions all have a single space after
the "@dev:", which seems to be the existing convention in this file?
The indentation looks pretty random now.
> + *
> + * Returns the number of SG entries mapped
> + */
> +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> + enum dma_data_direction dir)
Same question as before about why the mixture of "pci_*" interfaces
that take "struct device *" parameters.
> +{
> + struct dev_pagemap *pgmap;
> + struct scatterlist *s;
> + phys_addr_t paddr;
> + int i;
> +
> + /*
> + * p2pdma mappings are not compatible with devices that use
> + * dma_virt_ops.
> + */
> + if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops)
> + return 0;
> +
> + for_each_sg(sg, s, nents, i) {
> + pgmap = sg_page(s)->pgmap;
> + paddr = sg_phys(s);
> +
> + s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
> + sg_dma_len(s) = s->length;
> + }
> +
> + return nents;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
> +
> +/**
> + * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA
> + * @dev: device doing the DMA request
> + * @sg: scatter list to map
> + * @nents: elements in the scatterlist
> + * @dir: DMA direction
Same whitespace comment as above.
> + */
> +void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> + enum dma_data_direction dir)
> +{
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 9e907c338a44..1660f64ce96f 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -125,6 +125,7 @@ struct dev_pagemap {
> struct device *dev;
> void *data;
> enum memory_type type;
> + u64 pci_p2pdma_bus_offset;
> };
>
> #ifdef CONFIG_ZONE_DEVICE
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index c0dde3d3aac4..126eca697ab3 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -36,6 +36,10 @@ int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
> void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
> unsigned int nents);
> void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
> +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> + enum dma_data_direction dir);
> +void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> + enum dma_data_direction dir);
> #else /* CONFIG_PCI_P2PDMA */
> static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
> size_t size, u64 offset)
> @@ -83,5 +87,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
> static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
> {
> }
> +static inline int pci_p2pdma_map_sg(struct device *dev,
> + struct scatterlist *sg, int nents, enum dma_data_direction dir)
> +{
> + return 0;
> +}
> +static inline void pci_p2pdma_unmap_sg(struct device *dev,
> + struct scatterlist *sg, int nents, enum dma_data_direction dir)
> +{
> +}
> #endif /* CONFIG_PCI_P2PDMA */
> #endif /* _LINUX_PCI_P2P_H */
> --
> 2.11.0
>
On Wed, Feb 28, 2018 at 04:40:00PM -0700, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the the groups are
> assigned.
s/the the/the/
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any switch will be in the same IOMMU group.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/Kconfig | 4 ++++
> drivers/pci/p2pdma.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> drivers/pci/pci.c | 4 ++++
> include/linux/pci-p2pdma.h | 5 +++++
> 4 files changed, 57 insertions(+)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 840831418cbd..a430672f0ad4 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -138,6 +138,10 @@ config PCI_P2PDMA
> it's hard to tell which support it with good performance, so
> at this time you will need a PCIe switch.
>
> + Enabling this option will also disable ACS on all ports behind
> + any PCIe switch. This effictively puts all devices behind any
> + switch into the same IOMMU group.
s/effictively/effectively/
Does this really mean "all devices behind the same Root Port"?
What does this mean in terms of device security? I assume it means,
at least, that individual devices can't be assigned to separate VMs.
I don't mind admitting that this patch makes me pretty nervous, and I
don't have a clear idea of what the implications of this are, or how
to communicate those to end users. "The same IOMMU group" is a pretty
abstract idea.
> If unsure, say N.
>
> config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4e1c81f64b29..61af07acd21a 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -255,6 +255,50 @@ static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> return up2;
> }
>
> +/*
> + * pci_p2pdma_disable_acs - disable ACS flags for ports in PCI
> + * bridges/switches
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any downstream port in any switch in order for
> + * the TLPs to not be forwarded up to the RC which is not what we want
> + * for P2P.
> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any switch to be in the same IOMMU
> + * group. At this time there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
> + */
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> + struct pci_dev *up;
> + int pos;
> + u16 ctrl;
> +
> + up = get_upstream_bridge_port(pdev);
> + if (!up)
> + return 0;
> + pci_dev_put(up);
> +
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + return 0;
> +
> + dev_info(&pdev->dev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> + ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
> +
> + pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
> +
> + return 1;
> +}
> +
> static bool __upstream_bridges_match(struct pci_dev *upstream,
> struct pci_dev *client)
> {
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index f6a4dd10d9b0..95ad3cf288c8 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -16,6 +16,7 @@
> #include <linux/of.h>
> #include <linux/of_pci.h>
> #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
> #include <linux/pm.h>
> #include <linux/slab.h>
> #include <linux/module.h>
> @@ -2826,6 +2827,9 @@ static void pci_std_enable_acs(struct pci_dev *dev)
> */
> void pci_enable_acs(struct pci_dev *dev)
> {
> + if (pci_p2pdma_disable_acs(dev))
> + return;
This doesn't read naturally to me. I do see that when
CONFIG_PCI_P2PDMA is not set, pci_p2pdma_disable_acs() does nothing
and returns 0, so we'll go ahead and try to enable ACS as before.
But I think it would be clearer to have an #ifdef CONFIG_PCI_P2PDMA
right here so it's more obvious that we only disable ACS when it's
selected.
> if (!pci_acs_enable)
> return;
>
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 126eca697ab3..f537f521f60c 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -22,6 +22,7 @@ struct block_device;
> struct scatterlist;
>
> #ifdef CONFIG_PCI_P2PDMA
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
> int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> u64 offset);
> int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
> @@ -41,6 +42,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> enum dma_data_direction dir);
> #else /* CONFIG_PCI_P2PDMA */
> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> + return 0;
> +}
> static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
> size_t size, u64 offset)
> {
> --
> 2.11.0
>
On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> The problem is that acccording to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>>
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cachable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd ahve to manage cache coherency in SW).
>
> Note: I think the above means it won't work behind a switch on x86
> either, will it ?
This works perfectly fine on x86 behind a switch and we've tested it on
multiple machines. We've never had an issue of running out of virtual
space despite our PCI bars typically being located with an offset of
56TB or more. The arch code on x86 also somehow figures out not to map
the memory as cachable so that's not an issue (though, at this point,
the CPU never accesses the memory so even if it were, it wouldn't affect
anything).
We also had this working on ARM64 a while back but it required some out
of tree ZONE_DEVICE patches and some truly horrid hacks to its arch
code to ioremap the memory into the page map.
You didn't mention what architecture you were trying this on.
It may make sense at this point to make this feature dependent on x86
until more work is done to make it properly portable. Something like
arch functions that allow adding IO memory pages with a specific
cache setting. Though, if an arch has such restrictive limits on the map
size it would probably need to address that too somehow.
Thanks,
Logan
>> So Oliver (CC) was having issues getting any of that to work for us.
>>
>> The problem is that acccording to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>>
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cachable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd ahve to manage cache coherency in SW).
>
> Note: I think the above means it won't work behind a switch on x86
> either, will it ?
Ben
We have done extensive testing of this series and its predecessors using PCIe switches from both Broadcom (PLX) and Microsemi. We have also done testing on x86_64, ARM64 and ppc64el-based architectures with varying degrees of success. The series as it currently stands only works on x86_64 but modified (hacky) versions have been made to work on ARM64. The x86_64 testing has been done on a range of (Intel) CPUs, servers, PCI EPs (including RDMA NICs from at least three vendors, NVMe SSDs from at least four vendors and P2P devices from four vendors) and PCI switches.
I do find it slightly offensive that you would question the series even working. I hope you are not suggesting we would submit this framework multiple times without having done testing on it....
Stephen
> On 01/03/18 04:03 AM, Sagi Grimberg wrote:
>> Can you describe what would be the plan to have it when these devices
>> do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
>> and not from nvmet_ctrl. Then, when cmb capable devices come along, the
>> ns can prefer to use its own cmb instead of locating a p2p_dev device?
>
> The patchset already supports CMB drives. That's essentially what patch
> 7 is for. We change the nvme-pci driver to use the p2pmem code to
> register and manage the CMB memory. After that, it is simply available
> to the nvmet code. We have already been using this with a couple
> prototype CMB drives.
The comment was to your statement:
"Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of it's own memory. However, at this time, cards with CMB buffers
don't seem to be available."
Maybe it's a left-over which confused me...
Anyways, my question still holds. If I rack several of these
nvme drives, ideally we would use _their_ cmbs for I/O that is
directed to these namespaces. This is why I was suggesting that
p2p_dev should live in nvmet_ns and not in nvmet_ctrl as a single
p2p_dev used by all namespaces.
nvmet_ns seems like a more natural place to host p2p_dev with
cmb in mind.
>>> +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
>>> + struct nvmet_ns *ns)
>>> +{
>>> + int ret;
>>> +
>>> + if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
>>> + pr_err("peer-to-peer DMA is not supported by %s\n",
>>> + ns->device_path);
>>> + return -EINVAL;
>>
>> I'd say that just skip it instead of failing it, in theory one can
>> connect nvme devices via p2p mem and expose other devices in the
>> same subsystem. The message would be pr_debug to reduce chattiness.
>
> No, it's actually a bit more complicated than you make it. There's a
> couple cases:
>
> 1) The user adds a namespace but there hasn't been a connection and no
> p2pmem has been selected. In this case the add_client function is never
> called and the user can add whatever namespace they like.
>
> 2) When the first connect happens, nvmet_setup_p2pmem() will call
> add_client() for each namespace and rdma device. If any of them fail
> then it does not use a P2P device and falls back, as you'd like, to
> regular memory.
>
> 3) When the user adds a namespace after a port is in use and a
> compatible P2P device has been found. In this case, if the user tries to
> add a namespace that is not compatible with the P2P device in use then
> it fails adding the new namespace. The only alternative here is to tear
> everything down, ensure there are no P2P enabled buffers open and start
> using regular memory again... That is very difficult.
>
Wouldn't it all be simpler if the p2p_dev resolution would be private
to the namespace?
So when adding a namespace, must all the namespaces in a subsystem comply
with using p2p? Seems a little bit harsh if it's not absolutely needed. Would
be nice to export a subsystem between two ports (on two HCAs, across
NUMA nodes) where the home node (primary path) would use p2p and
failover would use host memory...
Can you help me understand why this is absolutely not feasible?
> I also disagree that these messages should be pr_debug. If a user is
> trying to use P2P memory and they enable it but the system doesn't
> choose to use their memory they must know why that is so they can make
> the necessary adjustments. If the system doesn't at least print a dmesg
> they are in the dark.
I was arguing that the user might have intentionally wanted to use p2p
where possible but still expose other namespaces over host memory. For
this user, the messages are spam. I guess info/warn could also suffice
(assuming we allow it, if we fail it then no point of the debug level
discussion).
>
>>> + }
>>> +
>>> + ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
>>> + if (ret)
>>> + pr_err("failed to add peer-to-peer DMA client %s: %d\n",
>>> + ns->device_path, ret);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> int nvmet_ns_enable(struct nvmet_ns *ns)
>>> {
>>> struct nvmet_subsys *subsys = ns->subsys;
>>> @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
>>> if (ret)
>>> goto out_blkdev_put;
>>> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>>> + if (ctrl->p2p_dev) {
>>> + ret = nvmet_p2pdma_add_client(ctrl, ns);
>>> + if (ret)
>>> + goto out_remove_clients;
>>
>> Is this really a fatal failure given that we fall-back to main
>> memory? Why not continue with main memory (and warn at best)?
>
> See above. It's fatal because we are already using p2p memory and we
> can't easily tear that all down when a user adds a new namespace.
In my mind, I/O to this namespace would use host memory and be done
with it. I guess I still don't understand why this is not possible.
>>> + ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
>>
>> This is the first p2p_dev found right? What happens if I have more than
>> a single p2p device? In theory I'd have more p2p memory I can use. Have
>> you considered making pci_p2pmem_find return the least used suitable
>> device?
>
> Yes, it currently returns the first found. I imagine a bunch of
> improvements could be made to it in future work. However, I'd probably
> start with finding the nearest p2p device and then falling back to the
> least used if that's necessary. At this time though it's not clear how
> complicated these systems will get and what's actually needed here.
Fair enough. Just wondering.
On Thu, Mar 01, 2018 at 08:35:55PM +0200, Sagi Grimberg wrote:
>
> >On 01/03/18 04:03 AM, Sagi Grimberg wrote:
> >>Can you describe what would be the plan to have it when these devices
> >>do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
> >>and not from nvmet_ctrl. Then, when cmb capable devices come along, the
> >>ns can prefer to use its own cmb instead of locating a p2p_dev device?
> >
> >The patchset already supports CMB drives. That's essentially what patch 7
> >is for. We change the nvme-pci driver to use the p2pmem code to register
> >and manage the CMB memory. After that, it is simply available to the nvmet
> >code. We have already been using this with a couple prototype CMB drives.
>
> The comment was to your statement:
> "Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> save an extra PCI transfer as the NVME card could just take the data
> out of it's own memory. However, at this time, cards with CMB buffers
> don't seem to be available."
>
> Maybe its a left-over which confused me...
>
> Anyways, my question still holds. If I rack several of these
> nvme drives, ideally we would use _their_ cmbs for I/O that is
> directed to these namespaces. This is why I was suggesting that
> p2p_dev should live in nvmet_ns and not in nvmet_ctrl as a single
> p2p_dev used by all namespaces.
I agree, I don't think this series should target anything other than
using p2p memory located in one of the devices expected to participate
in the p2p transaction for a first pass.
locality is super important for p2p, so I don't think things should
start out in a way that makes specifying the desired locality hard.
This is also why I don't entirely understand why this series has a
generic allocator for p2p mem, it makes little sense to me.
Why wouldn't the nvme driver just claim the entire CMB of its local
device for its own use? Why involve p2p core code in this?
Jason
Thanks for the detailed review Bjorn!
>>
>> + Enabling this option will also disable ACS on all ports behind
>> + any PCIe switch. This effictively puts all devices behind any
>> + switch into the same IOMMU group.
>
> Does this really mean "all devices behind the same Root Port"?
Not necessarily. You might have a cascade of switches (i.e. switches below a switch) to achieve a very large fan-out (in an NVMe SSD array for example) and we will only disable ACS on the ports below the relevant switch.
> What does this mean in terms of device security? I assume it means,
> at least, that individual devices can't be assigned to separate VMs.
This was discussed during v1 [1]. Disabling ACS on all downstream ports of the switch means that all the EPs below it have to be part of the same IOMMU grouping. However it was also agreed that as long as the ACS disable occurs at boot time (which it does in v2) then the virtualization layer will be aware of it and will perform the IOMMU group formation correctly.
> I don't mind admitting that this patch makes me pretty nervous, and I
> don't have a clear idea of what the implications of this are, or how
> to communicate those to end users. "The same IOMMU group" is a pretty
> abstract idea.
Alex gave a good overview of the implications in [1].
Stephen
[1] https://marc.info/?l=linux-pci&m=151512320031739&w=2
Hi Bjorn,
Thanks for the review. I'll correct all the nits for the next version.
On 01/03/18 10:37 AM, Bjorn Helgaas wrote:
> On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
>> Some PCI devices may have memory mapped in a BAR space that's
>> intended for use in Peer-to-Peer transactions. In order to enable
>> such transactions the memory must be registered with ZONE_DEVICE pages
>> so it can be used by DMA interfaces in existing drivers.
> Is there anything about this memory that makes it specifically
> intended for peer-to-peer transactions? I assume the device can't
> really tell whether a transaction is from a CPU or a peer.
No, there's nothing special about the memory and it can still be accessed
by the CPU. This is just the intended purpose. You could use this PCI
memory as regular DMA buffers or regular memory but I'm not sure why you
would. It would probably be pretty bad performance-wise.
> BTW, maybe there could be some kind of guide for device driver writers
> in Documentation/PCI/?
Makes sense we can look at writing something for the next iteration.
> I think it would be clearer and sufficient to simply say that we have
> no way to know whether peer-to-peer routing between PCIe Root Ports is
> supported (PCIe r4.0, sec 1.3.1).
Fair enough.
> The fact that you use the PCIe term "switch" suggests that a PCIe
> Switch is required, but isn't it sufficient for the peers to be below
> the same "PCI bridge", which would include PCIe Root Ports, PCIe
> Switch Downstream Ports, and conventional PCI bridges?
> The comments at get_upstream_bridge_port() suggest that this isn't
> enough, and the peers actually do have to be below the same PCIe
> Switch, but I don't know why.
I do mean Switch as we do need to keep the traffic off the root complex.
Seeing as, as stated above, we don't know whether it actually supports it
(while we can be certain any PCI switch does). So we specifically want to
exclude PCIe Root Ports, and I'm not sure about the support of PCI
bridges but I can't imagine anyone wanting to do P2P around them so I'd
rather be safe than sorry and exclude them.
>> + addr = devm_memremap_pages(&pdev->dev, pgmap);
>> + if (IS_ERR(addr))
>
> Free pgmap here? And in the other error case below? Or maybe this
> happens via the devm_* magic? If so, when would that actually happen?
> Would pgmap be effectively leaked until the pdev is destroyed?
Yes, it happens via the devm magic as that's the way the
devm_memremap_pages() interface was designed. If I remember correctly,
in my testing, it would be de-allocated when the driver gets unbound.
>> + return PTR_ERR(addr);
>> +
>> + error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
>> + pci_bus_address(pdev, bar) + offset,
>> + resource_size(&pgmap->res), dev_to_node(&pdev->dev));
>> + if (error)
>> + return error;
>> +
>> + error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
>> + &pdev->p2pdma->devmap_ref);
>> + if (error)
>> + return error;
>> +
>> + dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
>> + &pgmap->res);
>
> s/dev_info/pci_info/ (also similar usages below, except for the one or
> two cases where you don't have a pci_dev).
Oh, nice, I didn't notice that was added.
> This whole thing is confusing to me. Why do you want to reject peers
> directly connected to the same root port? Why do you require the same
> Switch Upstream Port? You don't exclude conventional PCI, but it
> looks like you would require peers to share *two* upstream PCI-to-PCI
> bridges? I would think a single shared upstream bridge (conventional,
> PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?
Hmm, yes, this may just be laziness on my part. Finding the shared
upstream bridge is a bit more tricky than just showing that they are on
the same switch. So as coded, a fabric of switches with peers on
different legs of the fabric is not supported. But yes, maybe they just
need to be two devices with a single shared upstream bridge that is not
the root port. Again, we need to reject the root port because we can't
know if the root complex can support P2P traffic.
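For what it's worth, a more general check might end up looking something
like this (untested sketch, PCIe-only; pci_upstream_bridge() and
pci_pcie_type() are existing helpers):

        /*
         * Untested sketch: walk up from each device looking for a shared
         * upstream bridge that is not a PCIe Root Port.
         */
        static bool share_non_root_upstream_bridge(struct pci_dev *a,
                                                   struct pci_dev *b)
        {
                struct pci_dev *up_a, *up_b;

                for (up_a = pci_upstream_bridge(a); up_a;
                     up_a = pci_upstream_bridge(up_a)) {
                        if (pci_pcie_type(up_a) == PCI_EXP_TYPE_ROOT_PORT)
                                continue;

                        for (up_b = pci_upstream_bridge(b); up_b;
                             up_b = pci_upstream_bridge(up_b)) {
                                if (up_a == up_b)
                                        return true;
                        }
                }

                return false;
        }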
> Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
> sort of weird that callers supply a non-PCI device and then we look up
> a PCI device here. I assume you have some reason for this; if you
> added a writeup in Documentation/PCI, that would be a good place to
> elaborate on that, maybe with a one-line clue here.
Well yes, but this is much more convenient for callers which don't need
to care if the device they are attempting to add (which in the NVMe
target case, could be a random block device) is a PCI device or not.
Especially seeing as find_parent_pci_dev() is non-trivial.
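For context, that helper is essentially just a walk up the device tree,
roughly like the following (a simplified sketch, not the exact code from
the patch):

        static struct pci_dev *find_parent_pci_dev(struct device *dev)
        {
                struct device *parent;

                dev = get_device(dev);

                while (dev) {
                        if (dev_is_pci(dev))
                                return to_pci_dev(dev);

                        /* keep walking up, transferring the reference */
                        parent = get_device(dev->parent);
                        put_device(dev);
                        dev = parent;
                }

                return NULL;
        }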
>> +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
>> +{
>> + void *ret;
>> +
>> + if (unlikely(!pdev->p2pdma))
>
> Is this a hot path? I'm not sure it's worth cluttering
> non-performance paths with likely/unlikely.
I'd say it can be pretty hot given that we need to allocate and free
buffers at multiple GB/s for the NVMe target case. I don't exactly have
benchmarks or anything to show this though...
>> + return NULL;
>> +
>> + if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
>> + return NULL;
>> +
>> + ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);
>
> Why the double cast? Wouldn't "(void *)" be sufficient?
Oh, hmm, I can't remember now. I suspect I added it to fix a kbuild test
robot error. I'll check and see if it can be removed.
> In v4.6-rc1, gen_pool_free() takes "unsigned long addr". I know this
> is based on -rc3; is this something that changed between -rc1 and
> -rc3?
Yes, I think it's taken an unsigned long for a while now. Per the above,
I can't remember exactly why I casted it this way and I'll take a look
again.
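i.e. presumably it can just be:

        ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);

gen_pool_alloc() returns an unsigned long, so a single cast should be fine;
I'll confirm it doesn't upset the kbuild robot.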
Logan
> I agree, I don't think this series should target anything other than
> using p2p memory located in one of the devices expected to participate
> in the p2p trasnaction for a first pass..
I disagree. There is definitely interest in using an NVMe CMB as a bounce buffer and in deploying systems where only some of the NVMe SSDs below a switch have a CMB but use P2P to access all of them. Also there are some devices that only expose memory and whose entire purpose is to act as a p2p device; supporting these devices would be valuable.
> locality is super important for p2p, so I don't think things should
> start out in a way that makes specifying the desired locality hard.
Ensuring that the EPs engaged in p2p are all directly connected to the same PCIe switch ensures locality and (for the switches we have tested) performance. I agree solving the case where the namespace and CMB are on the same PCIe EP is valuable but I don't see it as critical to initial acceptance of the series.
Stephen
> Wouldn't it all be simpler if the p2p_dev resolution would be private
> to the namespace?
>
> So is adding some all the namespaces in a subsystem must comply to
> using p2p? Seems a little bit harsh if its not absolutely needed. Would
> be nice to export a subsystems between two ports (on two HCAs, across
> NUMA nodes) where the home node (primary path) would use p2p and
> failover would use host memory...
>
> Can you help me understand why this is absolutely not feasible?
Yes, it would simplify things. However, as best as I can tell, when we
allocate memory we don't know what namespace it will be used for. If
there is a way, I could probably rework it so there's a P2P device per
namespace.
Can you show me how we'd know the namespace to use in
nvmet_rdma_map_sgl_keyed()?
Thanks,
Logan
On 01/03/18 11:02 AM, Bjorn Helgaas wrote:
>> void pci_enable_acs(struct pci_dev *dev)
>> {
>> + if (pci_p2pdma_disable_acs(dev))
>> + return;
>
> This doesn't read naturally to me. I do see that when
> CONFIG_PCI_P2PDMA is not set, pci_p2pdma_disable_acs() does nothing
> and returns 0, so we'll go ahead and try to enable ACS as before.
>
> But I think it would be clearer to have an #ifdef CONFIG_PCI_P2PDMA
> right here so it's more obvious that we only disable ACS when it's
> selected.
I could do this... however, I wrote it this way because I've read Linus
dislikes using #ifdef's inside function bodies and I personally agree
with that sentiment.
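If the concern is mainly readability, one middle ground might be to make the
config dependence explicit at the call site without an #ifdef block, e.g.
(untested):

        if (IS_ENABLED(CONFIG_PCI_P2PDMA) && pci_p2pdma_disable_acs(dev))
                return;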
Logan
On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
>> > Hi Everyone,
>>
>>
>> So Oliver (CC) was having issues getting any of that to work for us.
>>
>> The problem is that acccording to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>>
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cachable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd ahve to manage cache coherency in SW).
>
> Note: I think the above means it won't work behind a switch on x86
> either, will it ?
The devm_memremap_pages() infrastructure allows placing the memmap in
"System-RAM" even if the hotplugged range is in PCI space. So, even if
it is an issue on some configurations, it's just a simple adjustment
to where the memmap is placed.
On 01/03/18 11:42 AM, Jason Gunthorpe wrote:
> On Thu, Mar 01, 2018 at 08:35:55PM +0200, Sagi Grimberg wrote:
> This is also why I don't entirely understand why this series has a
> generic allocator for p2p mem, it makes little sense to me.
> Why wouldn't the nmve driver just claim the entire CMB of its local
> device for its own use? Why involve p2p core code in this?
We'd prefer to have a generic way to get p2pmem instead of restricting
ourselves to only using CMBs. We did work in the past where the P2P
memory was part of an IB adapter and not the NVMe card. So this won't
work if it's an NVMe only interface.
As Stephen mentioned, we also use a couple devices that only exposes P2P
memory and this isn't related to NVMe at all. So there's value in having
a generic interface and allocator to enable all devices to provide this
memory.
If there were a hypothetical situation where a driver wants to use some
of the memory for P2P and some of it for other purposes then they'd just
divide it themselves and only pass a subset to
pci_p2pdma_add_resource(). However, as per our changes to the nvme-pci
code, it's really just easier to use the allocator for everything inside
the driver.
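In other words, nothing stops a driver from doing something like this today
(the BAR number and the half/half split are made-up example values):

        /* donate only the second half of BAR 4 to the p2pdma allocator */
        size_t half = pci_resource_len(pdev, 4) / 2;
        int rc = pci_p2pdma_add_resource(pdev, 4, half, half);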
Logan
On 01/03/18 12:21 PM, Dan Williams wrote:
>> Note: I think the above means it won't work behind a switch on x86
>> either, will it ?
>
> The devm_memremap_pages() infrastructure allows placing the memmap in
> "System-RAM" even if the hotplugged range is in PCI space. So, even if
> it is an issue on some configurations, it's just a simple adjustment
> to where the memmap is placed.
Thanks for the confirmation Dan!
Logan
On 01/03/18 03:31 AM, Sagi Grimberg wrote:
>> * We also reject using devices that employ 'dma_virt_ops' which should
>> fairly simply handle Jason's concerns that this work might break with
>> the HFI, QIB and rxe drivers that use the virtual ops to implement
>> their own special DMA operations.
>
> That's good, but what would happen for these devices? simply fail the
> mapping causing the ulp to fail its rdma operation? I would think
> that we need a capability flag for devices that support it.
pci_p2pmem_find() will simply not return any devices when any client
uses dma_virt_ops. So in the NVMe target case it simply will not
use P2P memory.
And just in case, pci_p2pdma_map_sg() will also return 0 if the device
passed to it uses dma_virt_ops as well. So if someone bypasses
pci_p2pmem_find() they will get a failure during map.
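So from the client's point of view the flow stays trivial (sketch; the
pr_info message is just illustrative):

        ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
        if (!ctrl->p2p_dev)
                pr_info("no compatible p2p memory found, using system memory\n");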
Logan
On 01/03/18 10:49 AM, Bjorn Helgaas wrote:
>> +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>> + enum dma_data_direction dir)
>
> Same question as before about why the mixture of "pci_*" interfaces
> that take "struct device *" parameters.
In this case, per the discussion in v1, it was felt the _map_sg()
function should take the same arguments as dma_map_sg().
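So a caller ends up with something like the following (sketch only; the
req_is_pci_p2pdma condition stands in for the P2P request flag added in the
block layer patches):

        if (req_is_pci_p2pdma)
                count = pci_p2pdma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
        else
                count = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
        if (!count)
                return -EIO;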
Logan
On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
>
> On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > The problem is that acccording to him (I didn't double check the latest
> > > patches) you effectively hotplug the PCIe memory into the system when
> > > creating struct pages.
> > >
> > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > switch anyway since you'd ahve to manage cache coherency in SW).
> >
> > Note: I think the above means it won't work behind a switch on x86
> > either, will it ?
>
> This works perfectly fine on x86 behind a switch and we've tested it on
> multiple machines. We've never had an issue of running out of virtual
> space despite our PCI bars typically being located with an offset of
> 56TB or more. The arch code on x86 also somehow figures out not to map
> the memory as cachable so that's not an issue (though, at this point,
> the CPU never accesses the memory so even if it were, it wouldn't affect
> anything).
Oliver, can you look into this? You said the memory was effectively
hotplugged into the system when creating the struct pages. That would
mean to me that it's a) mapped (which for us is cachable, maybe x86 has
tricks to avoid that) and b) potentially used to populate userspace
pages (that will definitely be cachable). Unless there's something in
there you didn't see that prevents it.
> We also had this working on ARM64 a while back but it required some out
> of tree ZONE_DEVICE patches and some truly horrid hacks to it's arch
> code to ioremap the memory into the page map.
>
> You didn't mention what architecture you were trying this on.
ppc64.
> It may make sense at this point to make this feature dependent on x86
> until more work is done to make it properly portable. Something like
> arch functions that allow adding IO memory pages to with a specific
> cache setting. Though, if an arch has such restrictive limits on the map
> size it would probably need to address that too somehow.
Not a fan of that approach.
So there are two issues to consider here:
- Our MMIO space is very far away from memory (high bits set in the
address) which causes problems with things like vmemmap, page_address,
virt_to_page etc... Do you have similar issues on arm64?
- We need to ensure that the mechanism (which I'm not familiar with)
that you use to create the struct pages for the device doesn't end up
turning those device pages into normal "general use" pages for the
system. Oliver thinks it does, you say it doesn't, ...
Jerome (Glisse), what's your take on this ? Smells like something that
could be covered by HMM...
Logan, the only reason you need struct page's to begin with is for the
DMA API right ? Or am I missing something here ?
Cheers,
Ben.
> Thanks,
>
> Logan
>
On Thu, 2018-03-01 at 18:09 +0000, Stephen Bates wrote:
> > > So Oliver (CC) was having issues getting any of that to work for us.
> > >
> > > The problem is that acccording to him (I didn't double check the latest
> > > patches) you effectively hotplug the PCIe memory into the system when
> > > creating struct pages.
> > >
> > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > switch anyway since you'd ahve to manage cache coherency in SW).
> >
> >
> > Note: I think the above means it won't work behind a switch on x86
> > either, will it ?
>
>
> Ben
>
> We have done extensive testing of this series and its predecessors
> using PCIe switches from both Broadcom (PLX) and Microsemi. We have
> also done testing on x86_64, ARM64 and ppc64el based ARCH with
> varying degrees of success. The series as it currently stands only
> works on x86_64 but modified (hacky) versions have been made to work
> on ARM64. The x86_64 testing has been done on a range of (Intel)
> CPUs, servers, PCI EPs (including RDMA NICs from at least three
> vendors, NVMe SSDs from at least four vendors and P2P devices from
> four vendors) and PCI switches.
>
> I do find it slightly offensive that you would question the series
> even working. I hope you are not suggesting we would submit this
> framework multiple times without having done testing on it....
No need to get personal on that. I did specify that this was based on
some incomplete understanding of what's going on with that new hack
used to create struct pages.
As it is, however, it cannot work on ppc64: in part because, according to
Oliver, we end up mapping things cachable, and in part because of the
address range issues.
The latter issue might be fundamental to the approach and unfixable
unless we have ways to use hooks for virt_to_page/page_address on these
things.
Ben.
On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
> On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
> <[email protected]> wrote:
> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > > > Hi Everyone,
> > >
> > >
> > > So Oliver (CC) was having issues getting any of that to work for us.
> > >
> > > The problem is that acccording to him (I didn't double check the latest
> > > patches) you effectively hotplug the PCIe memory into the system when
> > > creating struct pages.
> > >
> > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > switch anyway since you'd ahve to manage cache coherency in SW).
> >
> > Note: I think the above means it won't work behind a switch on x86
> > either, will it ?
>
> The devm_memremap_pages() infrastructure allows placing the memmap in
> "System-RAM" even if the hotplugged range is in PCI space. So, even if
> it is an issue on some configurations, it's just a simple adjustment
> to where the memmap is placed.
But what happens with that PCI memory? Is it effectively turned into
normal memory (ie, usable for normal allocations, potentially used to
populate user pages etc...) or is it kept aside?
Also on ppc64, the physical addresses of PCIe are so far apart
that there's no way we can map them into the linear mapping at the
normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
page_address or virt_to_page cannot work as-is on PCIe addresses.
Ben.
On Fri, 2018-03-02 at 07:34 +1100, Benjamin Herrenschmidt wrote:
>
> But what happens with that PCI memory ? Is it effectively turned into
> nromal memory (ie, usable for normal allocations, potentially used to
> populate user pages etc...) or is it kept aside ?
(What I mean is: is it added to the page allocator, basically?)
Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
on ppc64 (maybe via an arch hook as it might depend on the processor
family). Server powerpc cannot do cachable accesses on IO memory
(unless it's special OpenCAPI or nVlink, but not on PCIe).
> Also on ppc64, the physical addresses of PCIe make it so far appart
> that there's no way we can map them into the linear mapping at the
> normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
> page_address or virt_to_page cannot work as-is on PCIe addresses.
Talking of which ... is there any documentation on the whole
memremap_pages thing? My grep turned out empty...
Cheers,
Ben.
On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
> Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
> on ppc64 (maybe via an arch hook as it might depend on the processor
> family). Server powerpc cannot do cachable accesses on IO memory
> (unless it's special OpenCAPI or nVlink, but not on PCIe).
I think you are right on this - even on x86 we must not create
cachable mappings of PCI BARs - there is no way that works the way
anyone would expect.
I think this series doesn't have a problem here only because it never
touches the BAR pages with the CPU.
BAR memory should be mapped into the CPU as WC at best on all arches..
Jason
On Fri, Mar 02, 2018 at 07:29:55AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> >
> > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > > The problem is that acccording to him (I didn't double check the latest
> > > > patches) you effectively hotplug the PCIe memory into the system when
> > > > creating struct pages.
> > > >
> > > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > > switch anyway since you'd ahve to manage cache coherency in SW).
> > >
> > > Note: I think the above means it won't work behind a switch on x86
> > > either, will it ?
> >
> > This works perfectly fine on x86 behind a switch and we've tested it on
> > multiple machines. We've never had an issue of running out of virtual
> > space despite our PCI bars typically being located with an offset of
> > 56TB or more. The arch code on x86 also somehow figures out not to map
> > the memory as cachable so that's not an issue (though, at this point,
> > the CPU never accesses the memory so even if it were, it wouldn't affect
> > anything).
>
> Oliver can you look into this ? You sais the memory was effectively
> hotplug'ed into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cachable, maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (that will definitely be cachable). Unless there's something in
> there you didn't see that prevents it.
>
> > We also had this working on ARM64 a while back but it required some out
> > of tree ZONE_DEVICE patches and some truly horrid hacks to its arch
> > code to ioremap the memory into the page map.
> >
> > You didn't mention what architecture you were trying this on.
>
> ppc64.
>
> > It may make sense at this point to make this feature dependent on x86
> > until more work is done to make it properly portable. Something like
> > arch functions that allow adding IO memory pages with a specific
> > cache setting. Though, if an arch has such restrictive limits on the map
> > size it would probably need to address that too somehow.
>
> Not a fan of that approach.
>
> So there are two issues to consider here:
>
> - Our MMIO space is very far away from memory (high bits set in the
> address) which causes problems with things like vmemmap, page_address,
> virt_to_page etc... Do you have similar issues on arm64 ?
HMM private (HMM public is different) works around that by looking for a
"hole" in the address space and using it for the hotplug (ie. page_to_pfn()
!= physical pfn of the memory). This is OK for HMM because the memory
is never mapped by the CPU and we can find the physical pfn with a little
bit of math (page_to_pfn(page) - page->pgmap->res->start +
page->pgmap->dev->physical_base_address).
To avoid anything going bad I actually do not populate the kernel linear
mapping for the range, hence there is definitely no CPU access at all through
those struct pages. The CPU can still access the PCIe BAR through the usual
MMIO map.
>
> - We need to ensure that the mechanism (which I'm not familiar with)
> that you use to create the struct page's for the device don't end up
> turning those device pages into normal "general use" pages for the
> system. Oliver thinks it does, you say it doesn't, ...
>
> Jerome (Glisse), what's your take on this ? Smells like something that
> could be covered by HMM...
Well, this is again a new user of struct page for device memory, just for
one use case. I wanted HMM to be more versatile so that it could be used
for this kind of thing too. I guess the message didn't go through. I
will take some cycles tomorrow to look into this patchset to ascertain
how struct page is used in this context.
Note that I also want peer-to-peer for HMM users, but with ACS and using
the IOMMU, ie. having to populate the IOMMU page table of one device to
point to the BAR of another device. I need to test how many platforms this
works on; hardware engineers are unable/unwilling to commit on whether it
works or not.
> Logan, the only reason you need struct page's to begin with is for the
> DMA API right ? Or am I missing something here ?
If it is only needed for that, it sounds like a waste of memory for
struct page. Though I understand this allows the new API to match the
previous one.
Cheers,
Jérôme
On 01/03/18 01:53 PM, Jason Gunthorpe wrote:
> On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
>> Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
>> on ppc64 (maybe via an arch hook as it might depend on the processor
>> family). Server powerpc cannot do cachable accesses on IO memory
>> (unless it's special OpenCAPI or nVlink, but not on PCIe).
>
> I think you are right on this - even on x86 we must not create
> cachable mappings of PCI BARs - there is no way that works the way
> anyone would expect.
On x86, even if I try to make a cachable mapping of a PCI BAR it always
ends up being un-cached. The arch code in x86 always does the right
thing here.... Other arches, not so much.
Logan
On 01/03/18 01:29 PM, Benjamin Herrenschmidt wrote:
> Oliver can you look into this ? You said the memory was effectively
> hotplug'ed into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cachable, maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (that will definitely be cachable). Unless there's something in
> there you didn't see that prevents it.
Yes, we've been specifically prohibiting all cases where these pages get
passed to userspace. We don't want that. Although it works in limited
cases (ie x86), and we use it for some testing, there are dragons there.
> - Our MMIO space is very far away from memory (high bits set in the
> address) which causes problems with things like vmemmap, page_address,
> virt_to_page etc... Do you have similar issues on arm64 ?
No similar issues on arm64. Any chance you could simply not map the PCI
bars that way? What's the point of that? It may simply mean ppc64 can't
be supported until either that changes or the kernel infrastructure gets
more sophisticated.
> Logan, the only reason you need struct page's to begin with is for the
> DMA API right ? Or am I missing something here ?
It's not so much the DMA map API as it is the entire kernel
infrastructure. Scatterlists (which are universally used to set up DMA
requests) require pages, and bios require pages, etc, etc. In fact, this
patch set, in its current form, routes around the DMA API entirely.
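As a rough illustration (not code from this series; 'p2p_page' is a
hypothetical ZONE_DEVICE page backing a chunk of BAR space), pretty much
every DMA path starts with something like:

static void fill_one_sg_entry(struct scatterlist *sgl, struct page *p2p_page)
{
        sg_init_table(sgl, 1);
        /* A scatterlist entry is a page + offset + length, nothing else. */
        sg_set_page(sgl, p2p_page, PAGE_SIZE, 0);
}

so memory that has no struct page simply doesn't fit into the existing
plumbing.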
Myself[1] and others have done prototype work to migrate away from
struct pages and to use pfn_t instead but this work doesn't seem to get
very far in the community.
Logan
[1] https://marc.info/?l=linux-kernel&m=149566222124326&w=2
On 01/03/18 01:55 PM, Jerome Glisse wrote:
> Well, this is again a new user of struct page for device memory, just for
> one use case. I wanted HMM to be more versatile so that it could be used
> for this kind of thing too. I guess the message didn't go through. I
> will take some cycles tomorrow to look into this patchset to ascertain
> how struct page is used in this context.
We looked at it but didn't see how any of it was applicable to our needs.
Logan
On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
>
>
> The devm_memremap_pages() infrastructure allows placing the memmap in
> "System-RAM" even if the hotplugged range is in PCI space. So, even if
> it is an issue on some configurations, it's just a simple adjustment
> to where the memmap is placed.
Actually can you explain a bit more here ?
devm_memremap_pages() doesn't take any specific argument about what to
do with the memory.
It does create the vmemmap sections etc... but does so by calling
arch_add_memory(). So __add_memory() isn't called, which means the
pages aren't added to the linear mapping. Then you manually add them to
ZONE_DEVICE.
Am I correct ?
In that case, they indeed can't be used as normal memory pages, which
is good, and if they are indeed not in the linear mapping, then there
are no caching issues.
However, what happens if anything calls page_address() on them ? Some
DMA ops do that for example, or some devices might ...
This is all quite convoluted with no documentation I can find that
explains the various expectations.
So the question is are those pages landing in the linear mapping, and
if yes, by what code path ?
The next question is if we ever want that to work on ppc64, we need a
way to make this fit in our linear mapping and map it non-cachable,
which will require some wrangling on how we handle that mapping.
Cheers,
Ben.
On Thu, Mar 01, 2018 at 02:03:26PM -0700, Logan Gunthorpe wrote:
>
>
> On 01/03/18 01:55 PM, Jerome Glisse wrote:
> > Well, this is again a new user of struct page for device memory, just for
> > one use case. I wanted HMM to be more versatile so that it could be used
> > for this kind of thing too. I guess the message didn't go through. I
> > will take some cycles tomorrow to look into this patchset to ascertain
> > how struct page is used in this context.
>
> We looked at it but didn't see how any of it was applicable to our needs.
>
It seems people misunderstand HMM :( you do not have to use all of
its features. If all you care about is having struct page then just
use that; for instance, in your case you would only use the following
three functions:
hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove()
for cleanup.
You can set the fault callback to an empty stub that always returns
VM_FAULT_SIGBUS, or add a patch to allow a NULL callback inside HMM.
You don't have to use the free callback if you don't care, and if there
is something that doesn't quite match what you want, HMM can always be
adjusted to address this.
The intention of HMM is to be useful for all device memory that wishes
to have struct page for various reasons.
Cheers,
Jérôme
On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:
> However, what happens if anything calls page_address() on them ? Some
> DMA ops do that for example, or some devices might ...
Although we could probably work around it with some pain, we rely on
page_address() and virt_to_phys(), etc to work on these pages. So on
x86, yes, it makes it into the linear mapping.
Logan
On 01/03/18 02:10 PM, Jerome Glisse wrote:
> It seems people misunderstand HMM :( you do not have to use all of
> its features. If all you care about is having struct page then just
> use that; for instance, in your case you would only use the following
> three functions:
>
> hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove()
> for cleanup.
To what benefit over just using devm_memremap_pages()? If I'm using the
hmm interface and disabling all the features, I don't see the point.
We've also cleaned up the devm_memremap_pages() interface to be more
usefully generic in such a way that I'd hope HMM starts using it too and
gets rid of the code duplication.
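For reference, the usage is roughly the sketch below (simplified and from
memory; the exact signature and field names vary between kernel versions,
so treat it as illustrative only):

static void *register_bar_pages(struct pci_dev *pdev, int bar,
                                struct dev_pagemap *pgmap)
{
        /* Describe the BAR region we want ZONE_DEVICE struct pages for. */
        pgmap->res.start = pci_resource_start(pdev, bar);
        pgmap->res.end = pci_resource_end(pdev, bar);
        pgmap->res.flags = pci_resource_flags(pdev, bar);

        /* Creates struct pages covering the whole BAR. */
        return devm_memremap_pages(&pdev->dev, pgmap);
}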
Logan
On Thu, Mar 01, 2018 at 02:11:34PM -0700, Logan Gunthorpe wrote:
>
>
> On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:
> > However, what happens if anything calls page_address() on them ? Some
> > DMA ops do that for example, or some devices might ...
>
> Although we could probably work around it with some pain, we rely on
> page_address() and virt_to_phys(), etc to work on these pages. So on x86,
> yes, it makes it into the linear mapping.
This is pretty easy to do with HMM:
unsigned long hmm_page_to_phys_pfn(struct page *page)
{
        struct hmm_devmem *devmem;
        unsigned long ppfn;

        /* Sanity test, maybe BUG_ON() */
        if (!is_device_private_page(page))
                return -1UL;

        devmem = page->pgmap->data;
        /* Offset of the page within the device memory region ... */
        ppfn = page_to_pfn(page) - devmem->pfn_first;
        /* ... rebased onto the device's physical base pfn. */
        return ppfn + devmem->device_phys_base_pfn;
}
Note that the last field does not exist in today's HMM because I did not need
such a helper so far, but it can be added.
Cheers,
Jérôme
On Thu, 1 Mar 2018 18:54:01 +0000
"Stephen Bates" <[email protected]> wrote:
> Thanks for the detailed review Bjorn!
>
> >>
> >> + Enabling this option will also disable ACS on all ports behind
> >> + any PCIe switch. This effectively puts all devices behind any
> >> + switch into the same IOMMU group.
>
> >
> > Does this really mean "all devices behind the same Root Port"?
>
> Not necessarily. You might have a cascade of switches (i.e switches below a switch) to achieve a very large fan-out (in an NVMe SSD array for example) and we will only disable ACS on the ports below the relevant switch.
>
> > What does this mean in terms of device security? I assume it means,
> > at least, that individual devices can't be assigned to separate VMs.
>
> This was discussed during v1 [1]. Disabling ACS on all downstream ports of the switch means that all the EPs below it have to be part of the same IOMMU grouping. However it was also agreed that as long as the ACS disable occurred at boot time (which it does in v2) then the virtualization layer will be aware of it and will perform the IOMMU group formation correctly.
This is still a pretty terrible solution though, your kernel provider
needs to decide whether they favor device assignment or p2p, because we
can't do both, unless there's a patch I haven't seen yet that allows
boot time rather than compile time configuration. There are absolutely
supported device assignment cases of switches providing isolation between
devices allowing the downstream EPs to be used independently. I think
this is a non-starter for distribution support without boot time or
dynamic configuration. I could imagine dynamic configuration through
sysfs that might trigger a soft remove and rescan of the affected
devices in order to rebuild the IOMMU group. The hard part might be
determining which points to allow that to guarantee correctness. For
instance, upstream switch ports don't actually support ACS, but they'd
otherwise be an obvious concentration point to trigger a
reconfiguration. Thanks,
Alex
On 01/03/18 02:18 PM, Jerome Glisse wrote:
> This is pretty easy to do with HMM:
>
> unsigned long hmm_page_to_phys_pfn(struct page *page)
This is not useful unless you want to go through all the kernel paths we
are using and replace page_to_phys() and friends with something else
that calls an HMM function when appropriate...
The problem isn't getting the physical address from a page, it's that we
are passing these pages through various kernel interfaces which expect
pages that work in the usual manner. (Look at the code: we quite simply
provide a way to get the PCI bus address from a page when necessary).
Logan
On Thu, Mar 01, 2018 at 02:15:01PM -0700, Logan Gunthorpe wrote:
>
>
> On 01/03/18 02:10 PM, Jerome Glisse wrote:
> > It seems people misunderstand HMM :( you do not have to use all of
> > its features. If all you care about is having struct page then just
> > use that; for instance, in your case you would only use the following
> > three functions:
> >
> > hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove()
> > for cleanup.
>
> To what benefit over just using devm_memremap_pages()? If I'm using the hmm
> interface and disabling all the features, I don't see the point. We've also
> cleaned up the devm_memremap_pages() interface to be more usefully generic
> in such a way that I'd hope HMM starts using it too and gets rid of the code
> duplication.
>
The first HMM variant finds a hole and does not require a resource as an
input parameter. Besides that, internally for PCIe device memory
devm_memremap_pages() does not do the right thing; last time I checked it
always creates a linear mapping of the range, ie. HMM calls add_pages()
while devm_memremap_pages() calls arch_add_memory().
When I upstreamed HMM, Dan didn't want me to touch devm_memremap_pages()
to match my needs. I am more than happy to modify devm_memremap_pages() to
also handle HMM's needs.
Note that the intention of HMM is to be a middle layer between low-level
infrastructure and device drivers. The idea is that such an impedance layer
should make it easier down the road to change how things are handled down
below without having to touch many device drivers.
Cheers,
Jérôme
On 01/03/18 02:21 PM, Alex Williamson wrote:
> This is still a pretty terrible solution though, your kernel provider
> needs to decide whether they favor device assignment or p2p, because we
> can't do both, unless there's a patch I haven't seen yet that allows
> boot time rather than compile time configuration. There are absolutely
> supported device assignment cases of switches providing isolation between
> devices allowing the downstream EPs to be used independently. I think
> this is a non-starter for distribution support without boot time or
> dynamic configuration. I could imagine dynamic configuration through
> sysfs that might trigger a soft remove and rescan of the affected
> devices in order to rebuild the IOMMU group. The hard part might be
> determining which points to allow that to guarantee correctness. For
> instance, upstream switch ports don't actually support ACS, but they'd
> otherwise be an obvious concentration point to trigger a
> reconfiguration. Thanks,
At this point, I don't expect this to be enabled in any distribution.
The use case is for custom hardware intended to do P2P which means they
can have a custom kernel with P2P enabled. The default for
CONFIG_PCI_P2PDMA is 'no' and I expect it to stay that way for a long time.
Another boot option which has to be set for this to work isn't something
we'd like to see.
Logan
> your kernel provider needs to decide whether they favor device assignment or p2p
Thanks Alex! The hardware requirements for P2P (switch, high performance EPs) are such that we really only expect CONFIG_PCI_P2PDMA to be enabled in specific instances and in those instances the users have made a decision to favor P2P over IOMMU isolation. Or they have set up their PCIe topology in a way that gives them IOMMU isolation where they want it and P2P where they want it.
Stephen
On Thu, Mar 01, 2018 at 09:32:20PM +0000, Stephen Bates wrote:
> > your kernel provider needs to decide whether they favor device assignment or p2p
>
> Thanks Alex! The hardware requirements for P2P (switch, high performance EPs) are such that we really only expect CONFIG_PCI_P2PDMA to be enabled in specific instances and in those instances the users have made a decision to favor P2P over IOMMU isolation. Or they have set up their PCIe topology in a way that gives them IOMMU isolation where they want it and P2P where they want it.
>
>
Note that there are use cases for P2P where IOMMU isolation matters and
the traffic through the root complex isn't seen as an issue. For instance,
for GPUs the idea is that you want to allow the RDMA device to directly
read or write GPU memory to avoid having to migrate memory to
system memory. This isn't so much about performance as about ease of
use.
Cheers,
Jérôme
On Thu, Mar 1, 2018 at 12:34 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
>> On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
>> <[email protected]> wrote:
>> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> > > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
>> > > > Hi Everyone,
>> > >
>> > >
>> > > So Oliver (CC) was having issues getting any of that to work for us.
>> > >
>> > > The problem is that according to him (I didn't double check the latest
>> > > patches) you effectively hotplug the PCIe memory into the system when
>> > > creating struct pages.
>> > >
>> > > This cannot possibly work for us. First we cannot map PCIe memory as
>> > > cachable. (Note that doing so is a bad idea if you are behind a PLX
>> > > switch anyway since you'd have to manage cache coherency in SW).
>> >
>> > Note: I think the above means it won't work behind a switch on x86
>> > either, will it ?
>>
>> The devm_memremap_pages() infrastructure allows placing the memmap in
>> "System-RAM" even if the hotplugged range is in PCI space. So, even if
>> it is an issue on some configurations, it's just a simple adjustment
>> to where the memmap is placed.
>
> But what happens with that PCI memory ? Is it effectively turned into
> normal memory (ie, usable for normal allocations, potentially used to
> populate user pages etc...) or is it kept aside ?
>
> Also on ppc64, the physical addresses of PCIe make it so far apart
> that there's no way we can map them into the linear mapping at the
> normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
> page_address or virt_to_page cannot work as-is on PCIe addresses.
Ah ok, I'd need to look at the details. I had been assuming that
sparse-vmemmap could handle such a situation, but that could indeed be
a broken assumption.
On 01/03/18 02:35 PM, Jerome Glisse wrote:
> Note that there are use cases for P2P where IOMMU isolation matters and
> the traffic through the root complex isn't seen as an issue.
Well, we can worry about that once we have a solution to the problem of
knowing whether a root complex supports P2P at all. I'm not sure how
people are going to solve that one.
For the time being, this work is for cards behind a switch only.
Logan
> The intention of HMM is to be useful for all device memory that wishes
> to have struct page for various reasons.
Hi Jerome and thanks for your input! Understood. We have looked at HMM in the past and long term I definitely would like to consider how we can add P2P functionality to HMM for both DEVICE_PRIVATE and DEVICE_PUBLIC so we can pass addressable and non-addressable blocks of data between devices. However that is well beyond the intentions of this series ;-).
Stephen
On 01/03/18 02:37 PM, Dan Williams wrote:
> Ah ok, I'd need to look at the details. I had been assuming that
> sparse-vmemmap could handle such a situation, but that could indeed be
> a broken assumption.
It handles it fine for many situations. But when you try to map
something that is at the end of the physical address space then the
sparse-vmemmap needs virtual address space that's the size of the
physical address space divided by PAGE_SIZE which may be a little bit
too large...
Logan
On 01/03/18 02:45 PM, Logan Gunthorpe wrote:
> It handles it fine for many situations. But when you try to map
> something that is at the end of the physical address space then the
> sparse-vmemmap needs virtual address space that's the size of the
> physical address space divided by PAGE_SIZE which may be a little bit
> too large...
Though, considering this more, maybe this shouldn't be a problem...
Let's say you have 56 bits of address space. That's 64PB. If you need
a sparse vmemmap for the entire space it will take 16TB which leaves you
with 63.98PB of address space left. (Similar calculations for other
numbers of address bits.)
So I'm not sure what the problem with this is.
We still have to ensure all the arches map the memory with the right
cache bits but that should be relatively easy to solve.
Logan
On Thu, 2018-03-01 at 13:53 -0700, Jason Gunthorpe wrote:
> On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
> > Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
> > on ppc64 (maybe via an arch hook as it might depend on the processor
> > family). Server powerpc cannot do cachable accesses on IO memory
> > (unless it's special OpenCAPI or nVlink, but not on PCIe).
>
> I think you are right on this - even on x86 we must not create
> cachable mappings of PCI BARs - there is no way that works the way
> anyone would expect.
>
> I think this series doesn't have a problem here only because it never
> touches the BAR pages with the CPU.
>
> BAR memory should be mapped into the CPU as WC at best on all arches..
Could be that x86 has the smarts to do the right thing, still trying to
untangle the code :-)
Cheers,
Ben.
On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt <[email protected]> wrote:
>
> Could be that x86 has the smarts to do the right thing, still trying to
> untangle the code :-)
Afaik, x86 will not cache PCI unless the system is misconfigured, and
even then it's more likely to just raise a machine check exception
than cache things.
The last-level cache is going to do fills and spills directly to the
memory controller, not to the PCIe side of things.
(I guess you *can* do things differently, and I wouldn't be surprised
if some people inside Intel did try to do things differently with
trying nvram over PCIe, but in general I think the above is true)
You won't find it in the kernel code either. It's in hardware with
firmware configuration of what addresses are mapped to the memory
controllers (and _how_ they are mapped) and which are not.
You _might_ find it in the BIOS, assuming you understood the tables
and had the BIOS writer's guide to unravel the magic registers.
But you might not even find it there. Some of the memory unit timing
programming is done very early, and by code that Intel doesn't even
release to the BIOS writers except as a magic encrypted blob, afaik.
Some of the magic might even be in microcode.
The page table settings for cacheability are more like a hint, and
only _part_ of the whole picture. The memory type range registers are
another part. And magic low-level uarch, northbridge and memory unit
specific magic is yet another part.
So you can disable caching for memory, but I'm pretty sure you can't
enable caching for PCIe at least in the common case. At best you can
affect how the store buffer works for PCIe.
Linus
On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt <[email protected]> wrote:
> >
> > Could be that x86 has the smarts to do the right thing, still trying to
> > untangle the code :-)
>
> Afaik, x86 will not cache PCI unless the system is misconfigured, and
> even then it's more likely to just raise a machine check exception
> than cache things.
>
> The last-level cache is going to do fills and spills directly to the
> memory controller, not to the PCIe side of things.
>
> (I guess you *can* do things differently, and I wouldn't be surprised
> if some people inside Intel did try to do things differently with
> trying nvram over PCIe, but in general I think the above is true)
>
> You won't find it in the kernel code either. It's in hardware with
> firmware configuration of what addresses are mapped to the memory
> controllers (and _how_ they are mapped) and which are not.
Ah thanks! That explains it. We can fix that on ppc64 in our linear
mapping code by checking the address vs. memblocks to choose the right
page table attributes.
So the main problem on our side is to figure out what to do about the
too-big PFNs. I need to look at this with Aneesh; we might be able to make
things fit with a bit of wrangling.
> You _might_ find it in the BIOS, assuming you understood the tables
> and had the BIOS writer's guide to unravel the magic registers.
>
> But you might not even find it there. Some of the memory unit timing
> programming is done very early, and by code that Intel doesn't even
> release to the BIOS writers except as a magic encrypted blob, afaik.
> Some of the magic might even be in microcode.
>
> The page table settings for cacheability are more like a hint, and
> only _part_ of the whole picture. The memory type range registers are
> another part. And magic low-level uarch, northbridge and memory unit
> specific magic is yet another part.
>
> So you can disable caching for memory, but I'm pretty sure you can't
> enable caching for PCIe at least in the common case. At best you can
> affect how the store buffer works for PCIe.
On Thu, Mar 01, 2018 at 12:27:03PM -0700, Logan Gunthorpe wrote:
>
>
> On 01/03/18 11:42 AM, Jason Gunthorpe wrote:
> >On Thu, Mar 01, 2018 at 08:35:55PM +0200, Sagi Grimberg wrote:
> >This is also why I don't entirely understand why this series has a
> >generic allocator for p2p mem, it makes little sense to me.
>
> >Why wouldn't the nmve driver just claim the entire CMB of its local
> >device for its own use? Why involve p2p core code in this?
>
> We'd prefer to have a generic way to get p2pmem instead of restricting
> ourselves to only using CMBs. We did work in the past where the P2P memory
> was part of an IB adapter and not the NVMe card. So this won't work if it's
> an NVMe only interface.
It just seems like it is making it too complicated.
In 99% of cases locality is important and you don't want to accidentally
use a buffer in a 3rd device that has no relation to the two doing the
transfer just because it happened to be returned by the generic
allocator.
I can appreciate you might have some special use case for that, but it
absolutely should require special configuration and not just magically
happen.
You bring up IB adaptor memory - if we put that into the allocator
then what is to stop the NVMe driver just using it instead of the CMB
buffer? That would be totally wrong in almost all cases.
Seems like a very subtle and hard to debug performance trap to leave
for the users, and pretty much the only reason to use P2P is
performance... So why have such a dangerous interface?
Jason
On 01/03/18 03:45 PM, Jason Gunthorpe wrote:
> I can appreciate you might have some special use case for that, but it
> absolutely should require special configuration and not just magically
> happen.
Well, if a driver doesn't want someone doing P2P transfers with the memory,
it shouldn't publish it to be used for exactly that purpose.
> You bring up IB adaptor memory - if we put that into the allocator
> then what is to stop the NVMe driver just using it instead of the CMB
> buffer? That would be totally wrong in almost all cases.
If you mean for SQEs in the NVMe driver, look at the code, it
specifically allocates it from its own device. If you mean the
nvmet-rdma then it's using that memory exactly as it was meant to.
Again, if the IB driver doesn't want someone to use that memory for P2P
transfers it shouldn't publish it as such.
> Seems like a very subtle and hard to debug performance trap to leave
> for the users, and pretty much the only reason to use P2P is
> performance... So why have such a dangerous interface?
It's not at all dangerous, the code specifically only uses P2P memory
that's local enough. And the majority of the code is there to make sure
it will all work in all cases.
Honestly, though, I'd love to go back to the case where the user selects
which p2pmem device to use, but that was very unpopular last year. It
would simplify a bunch of things though.
Also, no, the reason to use P2P is not performance. Only if you have
very specific hardware can you get a performance bump and it isn't all
that significant. The reason to use P2P is so you can design performant
systems with small CPUs, less or slower DRAM, and low lane counts to the
CPU, etc.
Logan
On Thu, 2018-03-01 at 14:57 -0700, Logan Gunthorpe wrote:
>
> On 01/03/18 02:45 PM, Logan Gunthorpe wrote:
> > It handles it fine for many situations. But when you try to map
> > something that is at the end of the physical address space then the
> > sparse-vmemmap needs virtual address space that's the size of the
> > physical address space divided by PAGE_SIZE which may be a little bit
> > too large...
>
> Though, considering this more, maybe this shouldn't be a problem...
>
> Let's say you have 56 bits of address space.
We use only 52 in practice but yes.
> That's 64PB. If you need
> a sparse vmemmap for the entire space it will take 16TB which leaves you
> with 63.98PB of address space left. (Similar calculations for other
> numbers of address bits.)
We only have 52 bits of virtual space for the kernel with the radix
MMU.
> So I'm not sure what the problem with this is.
>
> We still have to ensure all the arches map the memory with the right
> cache bits but that should be relatively easy to solve.
>
> Logan
On Thu, Mar 01, 2018 at 11:55:51AM -0700, Logan Gunthorpe wrote:
> Hi Bjorn,
>
> Thanks for the review. I'll correct all the nits for the next version.
>
> On 01/03/18 10:37 AM, Bjorn Helgaas wrote:
> > On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
> > > Some PCI devices may have memory mapped in a BAR space that's
> > > intended for use in Peer-to-Peer transactions. In order to enable
> > > such transactions the memory must be registered with ZONE_DEVICE pages
> > > so it can be used by DMA interfaces in existing drivers.
>
> > Is there anything about this memory that makes it specifically
> > intended for peer-to-peer transactions? I assume the device can't
> > really tell whether a transaction is from a CPU or a peer.
>
> No there's nothing special about the memory and it can still be accessed by
> the CPU. This is just the intended purpose. You could use this PCI memory as
> regular DMA buffers or regular memory but I'm not sure why you would. It
> would probably be pretty bad performance-wise.
>
>
> > BTW, maybe there could be some kind of guide for device driver writers
> > in Documentation/PCI/?
> Makes sense we can look at writing something for the next iteration.
>
> > I think it would be clearer and sufficient to simply say that we have
> > no way to know whether peer-to-peer routing between PCIe Root Ports is
> > supported (PCIe r4.0, sec 1.3.1).
>
> Fair enough.
>
> > The fact that you use the PCIe term "switch" suggests that a PCIe
> > Switch is required, but isn't it sufficient for the peers to be below
> > the same "PCI bridge", which would include PCIe Root Ports, PCIe
> > Switch Downstream Ports, and conventional PCI bridges?
> > The comments at get_upstream_bridge_port() suggest that this isn't
> > enough, and the peers actually do have to be below the same PCIe
> > Switch, but I don't know why.
>
> I do mean Switch, as we do need to keep the traffic off the root complex,
> seeing as, stated above, we don't know if it actually supports it (while we
> can be certain any PCI switch does). So we specifically want to exclude PCIe
> Root Ports, and I'm not sure about the support of PCI bridges, but I can't
> imagine anyone wanting to do P2P around them, so I'd rather be safe than
> sorry and exclude them.
I don't think this is correct. A Root Port defines a hierarchy domain
(I'm looking at PCIe r4.0, sec 1.3.1). The capability to route
peer-to-peer transactions *between* hierarchy domains is optional. I
think this means a Root Complex is not required to route transactions
from one Root Port to another Root Port.
This doesn't say anything about routing between two different devices
below a Root Port. Those would be in the same hierarchy domain and
should follow the conventional PCI routing rules. Of course, since a
Root Port has one link that leads to one device, they would probably
be different functions of a single multi-function device, so I don't
know how practical it would be to test this.
> > This whole thing is confusing to me. Why do you want to reject peers
> > directly connected to the same root port? Why do you require the same
> > Switch Upstream Port? You don't exclude conventional PCI, but it
> > looks like you would require peers to share *two* upstream PCI-to-PCI
> > bridges? I would think a single shared upstream bridge (conventional,
> > PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?
>
> Hmm, yes, this may just be laziness on my part. Finding the shared upstream
> bridge is a bit more tricky than just showing that they are on the same
> switch. So as coded, a fabric of switches with peers on different legs of
> the fabric are not supported. But yes, maybe they just need to be two
> devices with a single shared upstream bridge that is not the root port.
> Again, we need to reject the root port because we can't know if the root
> complex can support P2P traffic.
This sounds like the same issue as above, so we just need to resolve
that.
> > Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
> > sort of weird that callers supply a non-PCI device and then we look up
> > a PCI device here. I assume you have some reason for this; if you
> > added a writeup in Documentation/PCI, that would be a good place to
> > elaborate on that, maybe with a one-line clue here.
>
> Well yes, but this is much more convenient for callers which don't need to
> care if the device they are attempting to add (which in the NVMe target
> case, could be a random block device) is a pci device or not. Especially
> seeing find_parent_pci_dev() is non-trivial.
OK. I accept that it might be convenient, but I still think it leads
to a weird API. Maybe that's OK; I don't know enough about the
scenario in the caller to do anything more than say "hmm, strange".
> > > +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
> > > +{
> > > + void *ret;
> > > +
> > > + if (unlikely(!pdev->p2pdma))
> >
> > Is this a hot path? I'm not sure it's worth cluttering
> > non-performance paths with likely/unlikely.
>
> I'd say it can be pretty hot given that we need to allocate and free buffers
> at multiple GB/s for the NVMe target case. I don't exactly have benchmarks
> or anything to show this though...
OK, I'll take your word for that. I'm fine with this as long as it is
performance-sensitive.
Bjorn
>> We'd prefer to have a generic way to get p2pmem instead of restricting
>> ourselves to only using CMBs. We did work in the past where the P2P memory
>> was part of an IB adapter and not the NVMe card. So this won't work if it's
>> an NVMe only interface.
> It just seems like it is making it too complicated.
I disagree. Having a common allocator (instead of some separate allocator per driver) makes things simpler.
> Seems like a very subtle and hard to debug performance trap to leave
> for the users, and pretty much the only reason to use P2P is
> performance... So why have such a dangerous interface?
P2P is about offloading the memory and PCI subsystem of the host CPU and this is achieved no matter which p2p_dev is used.
Stephen
> I don't think this is correct. A Root Port defines a hierarchy domain
> (I'm looking at PCIe r4.0, sec 1.3.1). The capability to route
> peer-to-peer transactions *between* hierarchy domains is optional. I
> think this means a Root Complex is not required to route transactions
> from one Root Port to another Root Port.
>
> This doesn't say anything about routing between two different devices
> below a Root Port. Those would be in the same hierarchy domain and
> should follow the conventional PCI routing rules. Of course, since a
> Root Port has one link that leads to one device, they would probably
> be different functions of a single multi-function device, so I don't
> know how practical it would be to test this.
Yes, given that there's only one device below a root port it will either
be a switch or a multi-function device. In the multi-function device
case, I'm pretty sure the spec disallows routing-to-self so doing a P2P
transaction in that sense isn't going to work unless the device
specifically supports it and intercepts the traffic before it gets to
the port.
But, if we're talking about multi-function devices it should be able to
do everything within its own driver so it's not exactly Peer-to-Peer.
Still, if someone has such hardware I think it's up to them to add
support for this odd situation.
Logan
> I'm pretty sure the spec disallows routing-to-self so doing a P2P
> transaction in that sense isn't going to work unless the device
> specifically supports it and intercepts the traffic before it gets to
> the port.
This is correct. Unless the device intercepts the TLP before it hits the root-port then this would be considered a "route to self" violation and an error event would occur. The same holds for the downstream port on a PCI switch (unless route-to-self violations are disabled which violates the spec but which I have seen done in certain applications).
Stephen
On Thu, Mar 01, 2018 at 06:54:01PM +0000, Stephen Bates wrote:
> Thanks for the detailed review Bjorn!
>
> >> + Enabling this option will also disable ACS on all ports behind
> >> + any PCIe switch. This effectively puts all devices behind any
> >> + switch into the same IOMMU group.
>
> > Does this really mean "all devices behind the same Root Port"?
>
> Not necessarily. You might have a cascade of switches (i.e switches
> below a switch) to achieve a very large fan-out (in an NVMe SSD
> array for example) and we will only disable ACS on the ports below
> the relevant switch.
The question is what the relevant switch is. We call pci_enable_acs()
on every PCI device, including Root Ports. It looks like this relies
on get_upstream_bridge_port() to filter out some things. I don't
think get_upstream_bridge_port() is doing the right thing, so we need
to sort that out first, I guess.
> > What does this mean in terms of device security? I assume it means,
> > at least, that individual devices can't be assigned to separate VMs.
>
> This was discussed during v1 [1]. Disabling ACS on all downstream
> ports of the switch means that all the EPs below it have to be part of
> the same IOMMU grouping. However it was also agreed that as long as
> the ACS disable occurred at boot time (which it does in v2) then the
> virtualization layer will be aware of it and will perform the IOMMU
> group formation correctly.
>
> > I don't mind admitting that this patch makes me pretty nervous, and I
> > don't have a clear idea of what the implications of this are, or how
> > to communicate those to end users. "The same IOMMU group" is a pretty
> > abstract idea.
>
> Alex gave a good overview of the implications in [1].
>
> [1] https://marc.info/?l=linux-pci&m=151512320031739&w=2
This might be a good start, but whatever the implications are, they
need to be distilled and made clear directly in the changelog and the
Kconfig help text.
Bjorn
On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> We use only 52 in practice but yes.
>
>> That's 64PB. If you need
>> a sparse vmemmap for the entire space it will take 16TB which leaves you
>> with 63.98PB of address space left. (Similar calculations for other
>> numbers of address bits.)
>
> We only have 52 bits of virtual space for the kernel with the radix
> MMU.
Ok, assuming you only have 52 bits of physical address space: the sparse
vmemmap takes 1TB and you're left with 3.9PB of address space for other
things. So, again, why doesn't that work? Is my math wrong?
Logan
On Thu, Mar 01, 2018 at 11:00:51PM +0000, Stephen Bates wrote:
> > Seems like a very subtle and hard to debug performance trap to leave
> > for the users, and pretty much the only reason to use P2P is
> > performance... So why have such a dangerous interface?
>
> P2P is about offloading the memory and PCI subsystem of the host CPU
> and this is achieved no matter which p2p_dev is used.
No, locality matters. If you have a bunch of NICs and bunch of drives
and the allocator chooses to put all P2P memory on a single drive your
performance will suck horribly even if all the traffic is offloaded.
Performance will suck if you have speed differences between the PCI-E
devices. Eg a bunch of Gen 3 x8 NVMe cards paired with a Gen 4 x16 NIC
will not reach full performance unless the p2p buffers are properly
balanced between all cards.
This is why I think putting things in one big pool and pretending like
any P2P mem is interchangeable with any other makes zero sense to me,
it is not how the system actually works. Proper locality matters a
lot.
Jason
On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote:
>
> On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> > We use only 52 in practice but yes.
> >
> > > That's 64PB. If you need
> > > a sparse vmemmap for the entire space it will take 16TB which leaves you
> > > with 63.98PB of address space left. (Similar calculations for other
> > > numbers of address bits.)
> >
> > We only have 52 bits of virtual space for the kernel with the radix
> > MMU.
>
> Ok, assuming you only have 52 bits of physical address space: the sparse
> vmemmap takes 1TB and you're left with 3.9PB of address space for other
> things. So, again, why doesn't that work? Is my math wrong
The big problem is not the vmemmap, it's the linear mapping.
Cheers,
Ben.
On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote:
(Switching back to my non-IBM address ...)
> On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> > We use only 52 in practice but yes.
> >
> > > That's 64PB. If you need
> > > a sparse vmemmap for the entire space it will take 16TB which leaves you
> > > with 63.98PB of address space left. (Similar calculations for other
> > > numbers of address bits.)
> >
> > We only have 52 bits of virtual space for the kernel with the radix
> > MMU.
>
> Ok, assuming you only have 52 bits of physical address space: the sparse
> vmemmap takes 1TB and you're left with 3.9PB of address space for other
> things. So, again, why doesn't that work? Is my math wrong
The big problem is not the vmemmap, it's the linear mapping.
Cheers,
Ben.
> No, locality matters. If you have a bunch of NICs and bunch of drives
> and the allocator chooses to put all P2P memory on a single drive your
> performance will suck horribly even if all the traffic is offloaded.
Sagi brought this up earlier in his comments about the _find_ function. We are planning to do something about this in the next version. This might be a randomization or a "user-pick" and include a rule around using the p2p_dev on the EP if that EP is part of the transaction.
Stephen
On Thu, Mar 01, 2018 at 11:14:46PM +0000, Stephen Bates wrote:
> > I'm pretty sure the spec disallows routing-to-self so doing a P2P
> > transaction in that sense isn't going to work unless the device
> > specifically supports it and intercepts the traffic before it gets to
> > the port.
>
> This is correct. Unless the device intercepts the TLP before it hits
> the root-port then this would be considered a "route to self"
> violation and an error event would occur. The same holds for the
> downstream port on a PCI switch (unless route-to-self violations are
> disabled which violates the spec but which I have seen done in
> certain applications).
I agree that a function doing DMA to a sibling within the same
multi-function device would probably not generate a TLP for it (I
would be curious to read about this in the spec if you have a
pointer).
More fundamentally, is there some multi-function-specific restriction
on peer-to-peer DMA? In conventional PCI, single-function devices on
the same bus can DMA to each other. The transactions will appear on
the bus, but the upstream bridge will ignore them because the address
is inside the bridge's memory window. As far as I know, the same
should happen on PCIe.
I don't know what happens with functions of a multi-function device,
either in conventional PCI or PCIe. I don't remember a restriction on
whether they can DMA to each other, but maybe there is.
Bjorn
On Thu, Mar 01, 2018 at 11:00:51PM +0000, Stephen Bates wrote:
>
> P2P is about offloading the memory and PCI subsystem of the host CPU
> and this is achieved no matter which p2p_dev is used.
Even within a device, memory attributes for its various regions may not be
the same. There's a meaningful difference between writing to an NVMe CMB
vs PMR, and a single device could have both. We don't want to lump these
all together without knowing which region you're allocating from, right?
On 01/03/18 04:49 PM, Keith Busch wrote:
> On Thu, Mar 01, 2018 at 11:00:51PM +0000, Stephen Bates wrote:
>>
>> P2P is about offloading the memory and PCI subsystem of the host CPU
>> and this is achieved no matter which p2p_dev is used.
>
> Even within a device, memory attributes for its various regions may not be
> the same. There's a meaningful difference between writing to an NVMe CMB
> vs PMR, and a single device could have both. We don't want to lump these
> all together without knowing which region you're allocating from, right?
Yes, PMRs are a whole other beast. I have no idea what anyone is going
to actually do with those yet. But no one is proposing to publish them
as p2pmem with this code.
Logan
> There's a meaningful difference between writing to an NVMe CMB vs PMR
When the PMR spec becomes public we can discuss how best to integrate it into the P2P framework (if at all) ;-).
Stephen
On 01/03/18 04:26 PM, Benjamin Herrenschmidt wrote:
> The big problem is not the vmemmap, it's the linear mapping.
Ah, yes, ok.
Logan
> We don't want to lump these all together without knowing which region you're allocating from, right?
In all seriousness I do agree with you on these Keith in the long term. We would consider adding property flags for the memory as it is added to the p2p core and then the allocator could evolve to intelligently dish it out. Attributes like endurance, latency and special write commit requirements could all become attributes in time. Perhaps one more reason for a central entity for p2p memory allocation so this code does not end up having to go into many different drivers?
Stephen
On 01/03/18 04:57 PM, Stephen Bates wrote:
>> We don't want to lump these all together without knowing which region you're allocating from, right?
>
> In all seriousness I do agree with you on these Keith in the long term. We would consider adding property flags for the memory as it is added to the p2p core and then the allocator could evolve to intelligently dish it out. Attributes like endurance, latency and special write commit requirements could all become attributes in time. Perhaps one more reason for a central entity for p2p memory allocation so this code does not end up having to go into many different drivers?
Ugh, I don't know... An allocator for PMR is going to be quite a
different animal. It would probably need to look more like a filesystem
given you'd need to access the same regions across reboots. I don't
think it fits the p2pmem paradigm at all. I think it's something
entirely new all together and who knows exactly what it will look like.
Logan
On 01/03/18 04:15 PM, Bjorn Helgaas wrote:
> The question is what the relevant switch is. We call pci_enable_acs()
> on every PCI device, including Root Ports. It looks like this relies
> on get_upstream_bridge_port() to filter out some things. I don't
> think get_upstream_bridge_port() is doing the right thing, so we need
> to sort that out first, I guess.
Yes, well, if a port has an upstream bridge port and that port also has an
upstream bridge port, we are almost certainly looking at a port in a
switch. If you have a better suggestion for only disabling ACS bits on
downstream switch ports I'd be happy to consider it.
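Roughly, the check boils down to something like this (illustrative only,
not the actual get_upstream_bridge_port() code from the series; the
port-type tests are extra detail):

static bool behind_pcie_switch(struct pci_dev *pdev)
{
        struct pci_dev *up = pci_upstream_bridge(pdev);
        struct pci_dev *upup = up ? pci_upstream_bridge(up) : NULL;

        if (!up || !upup)
                return false;

        /* A downstream port whose parent is an upstream port => switch. */
        return pci_pcie_type(up) == PCI_EXP_TYPE_DOWNSTREAM &&
               pci_pcie_type(upup) == PCI_EXP_TYPE_UPSTREAM;
}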
> This might be a good start, but whatever the implications are, they
> need to be distilled and made clear directly in the changelog and the
> Kconfig help text.
Well, I tried. I'll expand on it for v3.
Logan
On 01/03/18 04:20 PM, Jason Gunthorpe wrote:
> On Thu, Mar 01, 2018 at 11:00:51PM +0000, Stephen Bates wrote:
> No, locality matters. If you have a bunch of NICs and bunch of drives
> and the allocator chooses to put all P2P memory on a single drive your
> performance will suck horribly even if all the traffic is offloaded.
>
> Performance will suck if you have speed differences between the PCI-E
> devices. Eg a bunch of Gen 3 x8 NVMe cards paired with a Gen 4 x16 NIC
> will not reach full performance unless the p2p buffers are properly
> balanced between all cards.
This would be solved better by choosing the closest devices in the
hierarchy in the p2pmem_find function (ie. preferring devices involved
in the transaction). Also, based on the current code, it's a bit of a
moot point seeing as it requires all devices to be on the same switch. So
unless you have a giant switch hidden somewhere you're not going to get
too many NIC/NVME pairs.
In any case, we are looking at changing both those things so distance in
the topology is preferred and the ability to span multiple switches is
allowed. We're also discussing changing it to "user picks" which
simplifies all of this.
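One way to express "closest" (purely illustrative, not code from the
series) would be to count bridge hops to the first common ancestor and
prefer the provider with the smallest total distance to all the clients:

static int p2p_distance(struct pci_dev *a, struct pci_dev *b)
{
        struct pci_dev *pa, *pb;
        int da, db;

        /* Walk a's ancestors; for each, see how far b is from it. */
        for (pa = a, da = 0; pa; pa = pci_upstream_bridge(pa), da++) {
                for (pb = b, db = 0; pb; pb = pci_upstream_bridge(pb), db++) {
                        if (pa == pb)
                                return da + db; /* first common ancestor */
                }
        }
        return -1;      /* no common PCI ancestor */
}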
Logan
On 01/03/18 10:44 AM, Bjorn Helgaas wrote:
> I think these two statements are out of order, since the attributes
> dereference pdev->p2pdma. And it looks like you set "error"
> unnecessarily, since you return immediately looking at it.
Per the previous series, sysfs_create_group is must_check for some
reason. I had a printk there but you didn't think it was necessary. So
assigning it to error is the only way to squash the warning.
Logan
On Thu, Mar 1, 2018 at 4:15 PM, Logan Gunthorpe <[email protected]> wrote:
>
>
> On 01/03/18 10:44 AM, Bjorn Helgaas wrote:
>>
>> I think these two statements are out of order, since the attributes
>> dereference pdev->p2pdma. And it looks like you set "error"
>> unnecessarily, since you return immediately looking at it.
>
>
> Per the previous series, sysfs_create_group is must_check for some reason. I
> had a printk there but you didn't think it was necessary. So assigning it to
> error is the only way to squash the warning.
Why not fail the setup if the sysfs_create_group() fails? Sure, it may
not be strictly required for userspace to have access to these
attributes, but it seems hostile that userspace can't make assumptions
about the presence of the "p2pmem" directory relative to the
capability being setup.
On 01/03/18 05:36 PM, Dan Williams wrote:
> On Thu, Mar 1, 2018 at 4:15 PM, Logan Gunthorpe <[email protected]> wrote:
>>
>>
>> On 01/03/18 10:44 AM, Bjorn Helgaas wrote:
>>>
>>> I think these two statements are out of order, since the attributes
>>> dereference pdev->p2pdma. And it looks like you set "error"
>>> unnecessarily, since you return immediately looking at it.
>>
>>
>> Per the previous series, sysfs_create_group is must_check for some reason. I
>> had a printk there but you didn't think it was necessary. So assigning it to
>> error is the only way to squash the warning.
>
> Why not fail the setup if the sysfs_create_group() fails? Sure, it may
> not be strictly required for userspace to have access to these
> attributes, but it seems hostile that userspace can't make assumptions
> about the presence of the "p2pmem" directory relative to the
> capability being setup.
Works for me. I'll do that.
Thanks,
Logan
On Thu, Mar 01, 2018 at 11:57:43PM +0000, Stephen Bates wrote:
> > We don't want to lump these all together without knowing which region you're allocating from, right?
>
> In all seriousness I do agree with you on these Keith in the long
> term. We would consider adding property flags for the memory as it
> is added to the p2p core and then the allocator could evolve to
> intelligently dish it out. Attributes like endurance, latency and
> special write commit requirements could all become attributes in
> time. Perhaps one more reason for a central entity for p2p memory
> allocation so this code does not end up having to go into many
> different drivers?
I fear we will find that every kind of P2P memory has special
allocator requirements and dumping it all into one giant pool is not
going to be helpful.
This allocator already seems not useful for the P2P target memory
on a Mellanox NIC due to the way it has a special allocation flow
(windowing) and special usage requirements..
Nor can it be useful for the doorbell memory in the NIC.
Both of these are existing use cases for P2P with out of tree patches..
The allocator seems to only be able to solve the CMB problem, and I
think due to that it is better to put this allocator in the NVMe
driver and not the core code.. At least until we find a 2nd user that
needs the same allocation scheme...
Jason
On Fri, 2018-03-02 at 09:34 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> > On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt <[email protected]> wrote:
> > >
> > > Could be that x86 has the smarts to do the right thing, still trying to
> > > untangle the code :-)
> >
> > Afaik, x86 will not cache PCI unless the system is misconfigured, and
> > even then it's more likely to just raise a machine check exception
> > than cache things.
> >
> > The last-level cache is going to do fills and spills directly to the
> > memory controller, not to the PCIe side of things.
> >
> > (I guess you *can* do things differently, and I wouldn't be surprised
> > if some people inside Intel did try to do things differently with
> > trying nvram over PCIe, but in general I think the above is true)
> >
> > You won't find it in the kernel code either. It's in hardware with
> > firmware configuration of what addresses are mapped to the memory
> > controllers (and _how_ they are mapped) and which are not.
>
> Ah thanks! That explains it. We can fix that on ppc64 in our linear
> mapping code by checking the address vs. memblocks to choose the right
> page table attributes.
FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.
These registers effectively override the page table setup. The Intel SDM
defines the effect as follows. 'PAT Entry Value' is the page table
setup.
MTRR Memory Type    PAT Entry Value    Effective Memory Type
-------------------------------------------------------------
UC                  UC                 UC
UC                  WC                 WC
UC                  WT                 UC
UC                  WB                 UC
UC                  WP                 UC
On my system, the BIOS sets MTRRs to cover all the MMIO ranges with UC.
Other BIOSes may simply set the MTRR default type to UC, i.e. uncovered
ranges become UC.
# cat /proc/mtrr
:
reg01: base=0xc0000000000 (12582912MB), size=2097152MB, count=1: uncachable
:
# cat /proc/iomem | grep 'PCI Bus'
:
c0000000000-c3fffffffff : PCI Bus 0000:00
c4000000000-c7fffffffff : PCI Bus 0000:11
c8000000000-cbfffffffff : PCI Bus 0000:36
cc000000000-cffffffffff : PCI Bus 0000:5b
d0000000000-d3fffffffff : PCI Bus 0000:80
d4000000000-d7fffffffff : PCI Bus 0000:85
d8000000000-dbfffffffff : PCI Bus 0000:ae
dc000000000-dffffffffff : PCI Bus 0000:d7
-Toshi
On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi <[email protected]> wrote:
>
> FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.
No.
Or rather, that's simply just another (small) part of it all - and an
architected and documented one at that.
Like the page table caching entries, the memory type range registers
are really just "secondary information". They don't actually select
between PCIe and RAM, they just affect the behavior on top of that.
The really nitty-gritty stuff is not architected, and generally not
documented outside (possibly) the BIOS writer's guide that is not made
public.
Those magical registers contain details like how the DRAM is
interleaved (if it is), what the timings are, where which memory
controller handles which memory range, and what goes to PCIe etc.
Basically all the actual *steering* information is very much hidden
away from the kernel (and often from the BIOS too). The parts we see
at a higher level are just tuning and tweaks.
Note: the details differ _enormously_ between different chips. The
setup can be very different, with things like Knights Landing having
the external cache that can also act as local memory that isn't a
cache but maps at a different physical address instead etc. That's the
kind of steering I'm talking about - at a low level how physical
addresses get mapped to different cache partitions, memory
controllers, or to the IO system etc.
Linus
On Fri, 2018-03-02 at 08:57 -0800, Linus Torvalds wrote:
> On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi <[email protected]> wrote:
> >
> > FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.
>
> No.
>
> Or rather, that's simply just another (small) part of it all - and an
> architected and documented one at that.
>
> Like the page table caching entries, the memory type range registers
> are really just "secondary information". They don't actually select
> between PCIe and RAM, they just affect the behavior on top of that.
>
> The really nitty-gritty stuff is not architected, and generally not
> documented outside (possibly) the BIOS writer's guide that is not made
> public.
>
> Those magical registers contain details like how the DRAM is
> interleaved (if it is), what the timings are, where which memory
> controller handles which memory range, and what are goes to PCIe etc.
>
> Basically all the actual *steering* information is very much hidden
> away from the kernel (and often from the BIOS too). The parts we see
> at a higher level are just tuning and tweaks.
>
> Note: the details differ _enormously_ between different chips. The
> setup can be very different, with things like Knights Landing having
> the external cache that can also act as local memory that isn't a
> cache but maps at a different physical address instead etc. That's the
> kind of steering I'm talking about - at a low level how physical
> addresses get mapped to different cache partitions, memory
> controllers, or to the IO system etc.
Right, the MRC code is not documented publicly, and it is very much CPU
dependent. It programs the address decoders and maps DRAM to physical
addresses as you described. MTRRs have nothing to do with this memory
controller setup. That said, MTRRs specify the CPU's memory access type,
such as UC and WB.
Thanks,
-Toshi
> http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip
@Keith - my apologies.
@Christoph - thanks for the link
So my understanding of when the technical content surrounding new NVMe Technical Proposals (TPs) could be discussed was wrong. I thought the TP content could only be discussed once disclosed in the public standard. I have now learnt that once the TPs are ratified they are publicly available!
However, as Logan pointed out, PMRs are not relevant to this series so let's defer discussion on how to support them to a later date!
Stephen
> It seems people miss-understand HMM :(
Hi Jerome
Your unhappy face emoticon made me sad so I went off to (re)read up on HMM. Along the way I came up with a couple of things.
While hmm.txt is really nice to read it makes no mention of DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication when one might choose to use one over the other. Would it be possible to update hmm.txt to include some discussion on this? I understand that DEVICE_PUBLIC creates a mapping in the kernel's linear address space for the device memory and DEVICE_PRIVATE does not. However, like I said, I am not sure when you would use either one and the pros and cons of doing so. I actually ended up finding some useful information in memremap.h but I don't think it is fair to expect people to dig *that* deep to find this information ;-).
A quick grep shows no drivers using the HMM API in the upstream code today. Is this correct? Are there any examples of out-of-tree drivers that use HMM you can point me to? As a driver developer, what resources exist to help me write an HMM-aware driver?
The (very nice) hmm.txt document is not referenced in the MAINTAINERS file. You might want to fix that when you have a moment.
Stephen
On Fri, 2018-03-02 at 10:25 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote:
> >
> > On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> > > We use only 52 in practice but yes.
> > >
> > > > That's 64PB. If you use need
> > > > a sparse vmemmap for the entire space it will take 16TB which leaves you
> > > > with 63.98PB of address space left. (Similar calculations for other
> > > > numbers of address bits.)
> > >
> > > We only have 52 bits of virtual space for the kernel with the radix
> > > MMU.
> >
> > Ok, assuming you only have 52 bits of physical address space: the sparse
> > vmemmap takes 1TB and you're left with 3.9PB of address space for other
> > things. So, again, why doesn't that work? Is my math wrong
>
> The big problem is not the vmemmap, it's the linear mapping
Alright, so, I think I have a plan to fix this, but it will take a
little bit of time.
Basically the idea is to have firmware pass to Linux a region that's
known to not have anything in it, which it can use for the vmalloc space
rather than have Linux arbitrarily cut the address space in half.
I'm pretty sure I can always find large enough "holes" in the physical
address space that are outside of both RAM/OpenCAPI/Nvlink and
PCIe/MMIO space. If anything, unused chip IDs. But I don't want Linux
to have to know about the intimate HW details so I'll pass it from FW.
It will take some time to adjust Linux and get updated FW around
though.
Once that's done, I'll be able to have the linear mapping go through
the entire 52-bit space (minus that hole). Of course the hole needs to
be large enough to hold a vmemmap for a 52-bit space, so that's about
4TB. So I probably need a hole that's at least 8TB.
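(For reference, assuming the usual ppc64 configuration of 64 KiB base pages and a 64-byte struct page: a vmemmap covering 2^52 bytes of address space needs (2^52 / 2^16) * 64 bytes = 2^42 bytes = 4 TiB, which matches the figure above.)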
As for the mapping attributes, it should be easy for my linear mapping
code to ensure anything that isn't actual RAM is mapped NC.
Cheers,
Ben.
On Fri, Mar 02, 2018 at 09:38:43PM +0000, Stephen Bates wrote:
> > It seems people miss-understand HMM :(
>
> Hi Jerome
>
> Your unhappy face emoticon made me sad so I went off to (re)read up
> on HMM. Along the way I came up with a couple of things.
>
> While hmm.txt is really nice to read it makes no mention of
> DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication when
> one might choose to use one over the other. Would it be possible to
> update hmm.txt to include some discussion on this? I understand
> that DEVICE_PUBLIC creates a mapping in the kernel's linear address
> space for the device memory and DEVICE_PRIVATE does not. However,
> like I said, I am not sure when you would use either one and the
> pros and cons of doing so. I actually ended up finding some useful
> information in memremap.h but I don't think it is fair to expect
> people to dig *that* deep to find this information ;-).
Yes, I need to document that some more in hmm.txt. PRIVATE is for devices
that have memory which does not fit regular memory expectations, i.e.
cacheable, so PCIe device memory falls under that category. So if all you
need is struct page for such memory then this is a perfect fit. On top of
that you can use more HMM features, like using this memory transparently
inside a process address space.
PUBLIC is for memory that belongs to a device but can still be accessed by
the CPU in a cache-coherent way (CAPI, CCIX, ...). Again, if you have such
memory and just want struct page you can use that, and again, if you want
to use it inside a process address space HMM provides more helpers to
do so.
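To make the distinction concrete, here is a minimal illustrative sketch (not part of the series) assuming the page-type helpers available in mm.h around this kernel version; the function name is hypothetical and the comments just restate the descriptions above:

#include <linux/mm.h>

/* Hypothetical helper, for illustration only. */
static const char *zone_device_kind(struct page *page)
{
	if (is_device_private_page(page))
		/* Memory that does not fit regular (cacheable) memory
		 * expectations, e.g. PCIe device memory. */
		return "DEVICE_PRIVATE";

	if (is_device_public_page(page))
		/* Device memory the CPU can access in a cache-coherent
		 * way (CAPI, CCIX, ...). */
		return "DEVICE_PUBLIC";

	return "regular system memory";
}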
> A quick grep shows no drivers using the HMM API in the upstream code
> today. Is this correct? Are there any examples of out of tree drivers
> that use HMM you can point me too? As a driver developer what
> resources exist to help me write a HMM aware driver?
I am about to send an RFC for nouveau; I am still working out some bugs.
I was hoping to be done today but I am still fighting with the hardware.
There are other drivers being worked on with HMM. I do not know exactly
when they will be made public (I expect in the coming months).
How you use HMM is under the control of the device driver, as well as
how you expose it to userspace; drivers use it how they want to use it.
There is no pattern or requirement imposed by HMM. All the drivers being
worked on so far are GPU-like hardware, i.e. a big chunk of on-board
memory (several gigabytes) that they want to use inside a process
address space in a fashion transparent to the program and CPU.
Each has its own API exposed to userspace and, while there is a lot of
similarity among them, many details of the userspace API are hardware
specific. In the GPU world most of the driver lives in userspace:
applications target high-level APIs such as OpenGL, Vulkan, OpenCL or
CUDA, and those APIs then have a hardware-specific userspace driver that
talks to hardware-specific IOCTLs. So this is not like a network or block
device.
> The (very nice) hmm.txt document is not references in the MAINTAINERS
> file? You might want to fix that when you have a moment.
I have a couple of small fixes/typo patches that I need to clean up and
send; I will fix MAINTAINERS as part of those.
Cheers,
Jérôme
On Thu, Mar 01, 2018 at 11:53:16PM +0000, Stephen Bates wrote:
> > There's a meaningful difference between writing to an NVMe CMB vs PMR
>
> When the PMR spec becomes public we can discuss how best to integrate it into the P2P framework (if at all) ;-).
http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip
On 02/03/18 09:18 AM, Jason Gunthorpe wrote:
> This allocator is already seems not useful for the P2P target memory
> on a Mellanox NIC due to the way it has a special allocation flow
> (windowing) and special usage requirements..
>
> Nor can it be usefull for the doorbell memory in the NIC.
No one says every P2P use has to use P2P memory. Doorbells are obviously
not P2P memory. But it's the p2pmem interface that's important, and the
interface absolutely does not belong in the NVMe driver. Once you have
the P2P memory interface you need an allocator behind it, and the obvious
place is in the P2P code to handle the common case where you're just
mapping a BAR. We don't need to implement a genalloc in every driver
that has P2P memory attached to it. If some hardware wants to expose
memory that requires complicated allocation it's up to them to solve
that problem, but that isn't enough justification, to me, to push common
code into every driver.
> Both of these are existing use cases for P2P with out of tree patches..
And we have out of tree code that uses the generic allocator part of
p2pmem.
> The allocator seems to only be able to solve the CMB problem, and I
> think due to that it is better to put this allocator in the NVMe
> driver and not the core code.. At least until we find a 2nd user that
> needs the same allocation scheme...
See the first P2PMEM RFC. We used Chelsio's NIC instead of the CMB with
a very similar allocation scheme. We'd still be enabling that NIC in the
same way if we didn't run into hardware issues with it. A simple BAR
with memory behind it is always going to be the most common case.
Logan
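For illustration, a hedged sketch of what a consumer of the proposed p2pmem interface looks like, using only the calls that appear in the nvme-pci patch later in this thread; 'pdev' and 'size' are placeholders and the bus-address return type is assumed:

#include <linux/pci-p2pdma.h>

static int p2pmem_example(struct pci_dev *pdev, size_t size)
{
	void *buf;
	pci_bus_addr_t bus_addr;

	/* Allocate a buffer from memory the device published as p2pmem. */
	buf = pci_alloc_p2pmem(pdev, size);
	if (!buf)
		return -ENOMEM;

	/* Translate to a PCI bus address a peer device can DMA to. */
	bus_addr = pci_p2pmem_virt_to_bus(pdev, buf);

	/* ... program the peer device's DMA engine with bus_addr ... */

	pci_free_p2pmem(pdev, buf, size);
	return 0;
}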
On Fri, Mar 2, 2018 at 8:57 AM, Linus Torvalds
<[email protected]> wrote:
>
> Like the page table caching entries, the memory type range registers
> are really just "secondary information". They don't actually select
> between PCIe and RAM, they just affect the behavior on top of that.
Side note: historically the two may have been almost the same, since
the CPU only had one single unified bus for "memory" (whether that was
memory-mapped PCI or actual RAM). The steering was external.
But even back then you had extended bits to specify things like how
the 640k-1M region got remapped - which could depend on not just the
address, but on whether you read or wrote to it. The "lost" 384kB of
RAM could either be remapped at a different address, or could be used
for shadowing the (slow) ROM contents, or whatever.
Linus
On 02/03/18 02:44 PM, Benjamin Herrenschmidt wrote:
> Allright, so, I think I have a plan to fix this, but it will take a
> little bit of time.
>
> Basically the idea is to have firmware pass to Linux a region that's
> known to not have anything in it that it can use for the vmalloc space
> rather than have linux arbitrarily cut the address space in half.
>
> I'm pretty sure I can always find large enough "holes" in the physical
> address space that are outside of both RAM/OpenCAPI/Nvlink and
> PCIe/MMIO space. If anything, unused chip IDs. But I don't want Linux
> to have to know about the intimate HW details so I'll pass it from FW.
>
> It will take some time to adjust Linux and get updated FW around
> though.
>
> Once that's done, I'll be able to have the linear mapping go through
> the entire 52-bit space (minus that hole). Of course the hole need to
> be large enough to hold a vmemmap for a 52-bit space, so that's about
> 4TB. So I probably need a hole that's at least 8TB.
>
> As for the mapping attributes, it should be easy for my linear mapping
> code to ensure anything that isn't actual RAM is mapped NC.
Very cool. I'm glad to hear you found a way to fix this.
Thanks,
Logan
On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <[email protected]> wrote:
> Register the CMB buffer as p2pmem and use the appropriate allocation
> functions to create and destroy the IO SQ.
>
> If the CMB supports WDS and RDS, publish it for use as p2p memory
> by other devices.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++----------------------
> 1 file changed, 41 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 73036d2fbbd5..56ca79be8476 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -29,6 +29,7 @@
> #include <linux/types.h>
> #include <linux/io-64-nonatomic-lo-hi.h>
> #include <linux/sed-opal.h>
> +#include <linux/pci-p2pdma.h>
>
> #include "nvme.h"
>
> @@ -91,9 +92,8 @@ struct nvme_dev {
> struct work_struct remove_work;
> struct mutex shutdown_lock;
> bool subsystem;
> - void __iomem *cmb;
> - pci_bus_addr_t cmb_bus_addr;
> u64 cmb_size;
> + bool cmb_use_sqes;
> u32 cmbsz;
> u32 cmbloc;
> struct nvme_ctrl ctrl;
> @@ -148,7 +148,7 @@ struct nvme_queue {
> struct nvme_dev *dev;
> spinlock_t q_lock;
> struct nvme_command *sq_cmds;
> - struct nvme_command __iomem *sq_cmds_io;
> + bool sq_cmds_is_io;
> volatile struct nvme_completion *cqes;
> struct blk_mq_tags **tags;
> dma_addr_t sq_dma_addr;
> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> {
> u16 tail = nvmeq->sq_tail;
> - if (nvmeq->sq_cmds_io)
> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> - else
> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
the _toio() variant enforces alignment, does the copy with 4 byte
stores, and has a full barrier after the copy. In comparison our
regular memcpy() does none of those things and may use unaligned and
vector load/stores. For normal (cacheable) memory that is perfectly
fine, but they can cause alignment faults when targeted at MMIO
(cache-inhibited) memory.
I think in this particular case it might be ok since we know SQEs are
aligned to 64 byte boundaries and the copy is too small to use our
vectorised memcpy(). I'll assume we don't need explicit ordering
between writes of SQEs since the existing code doesn't seem to care
unless the doorbell is being rung, so you're probably fine there too.
That said, I still think this is a little bit sketchy and at the very
least you should add a comment explaining what's going on when the CMB
is being used. If someone more familiar with the NVMe driver could
chime in I would appreciate it.
> if (++tail == nvmeq->q_depth)
> tail = 0;
> @@ -1286,9 +1283,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
> {
> dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
> (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
> - if (nvmeq->sq_cmds)
> - dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
> - nvmeq->sq_cmds, nvmeq->sq_dma_addr);
> +
> + if (nvmeq->sq_cmds) {
> + if (nvmeq->sq_cmds_is_io)
> + pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
> + nvmeq->sq_cmds,
> + SQ_SIZE(nvmeq->q_depth));
> + else
> + dma_free_coherent(nvmeq->q_dmadev,
> + SQ_SIZE(nvmeq->q_depth),
> + nvmeq->sq_cmds,
> + nvmeq->sq_dma_addr);
> + }
> }
>
> static void nvme_free_queues(struct nvme_dev *dev, int lowest)
> @@ -1368,12 +1374,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
> static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
> int qid, int depth)
> {
> - /* CMB SQEs will be mapped before creation */
> - if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
> - return 0;
> + struct pci_dev *pdev = to_pci_dev(dev->dev);
> +
> + if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> + nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
> + nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
> + nvmeq->sq_cmds);
> + nvmeq->sq_cmds_is_io = true;
> + }
> +
> + if (!nvmeq->sq_cmds) {
> + nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
> + &nvmeq->sq_dma_addr, GFP_KERNEL);
> + nvmeq->sq_cmds_is_io = false;
> + }
>
> - nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
> - &nvmeq->sq_dma_addr, GFP_KERNEL);
> if (!nvmeq->sq_cmds)
> return -ENOMEM;
> return 0;
> @@ -1449,13 +1464,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> struct nvme_dev *dev = nvmeq->dev;
> int result;
>
> - if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> - unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> - dev->ctrl.page_size);
> - nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
> - nvmeq->sq_cmds_io = dev->cmb + offset;
> - }
> -
> nvmeq->cq_vector = qid - 1;
> result = adapter_alloc_cq(dev, qid, nvmeq);
> if (result < 0)
> @@ -1685,9 +1693,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
> return;
> dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
>
> - if (!use_cmb_sqes)
> - return;
> -
> size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
> offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
> bar = NVME_CMB_BIR(dev->cmbloc);
> @@ -1704,11 +1709,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
> if (size > bar_size - offset)
> size = bar_size - offset;
>
> - dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
> - if (!dev->cmb)
> + if (pci_p2pdma_add_resource(pdev, bar, size, offset))
> return;
> - dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
> +
> dev->cmb_size = size;
> + dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
> +
> + if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
> + (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
> + pci_p2pmem_publish(pdev, true);
>
> if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
> &dev_attr_cmb.attr, NULL))
> @@ -1718,12 +1727,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
>
> static inline void nvme_release_cmb(struct nvme_dev *dev)
> {
> - if (dev->cmb) {
> - iounmap(dev->cmb);
> - dev->cmb = NULL;
> + if (dev->cmb_size) {
> sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
> &dev_attr_cmb.attr, NULL);
> - dev->cmbsz = 0;
> + dev->cmb_size = 0;
> }
> }
>
> @@ -1918,13 +1925,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
> if (nr_io_queues == 0)
> return 0;
>
> - if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> + if (dev->cmb_use_sqes) {
> result = nvme_cmb_qdepth(dev, nr_io_queues,
> sizeof(struct nvme_command));
> if (result > 0)
> dev->q_depth = result;
> else
> - nvme_release_cmb(dev);
> + dev->cmb_use_sqes = false;
> }
>
> do {
> --
> 2.11.0
>
On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <[email protected]> wrote:
> > @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> > {
> > u16 tail = nvmeq->sq_tail;
>
> > - if (nvmeq->sq_cmds_io)
> > - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> > - else
> > - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> > + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>
> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
> the _toio() variant enforces alignment, does the copy with 4 byte
> stores, and has a full barrier after the copy. In comparison our
> regular memcpy() does none of those things and may use unaligned and
> vector load/stores. For normal (cacheable) memory that is perfectly
> fine, but they can cause alignment faults when targeted at MMIO
> (cache-inhibited) memory.
>
> I think in this particular case it might be ok since we know SEQs are
> aligned to 64 byte boundaries and the copy is too small to use our
> vectorised memcpy(). I'll assume we don't need explicit ordering
> between writes of SEQs since the existing code doesn't seem to care
> unless the doorbell is being rung, so you're probably fine there too.
> That said, I still think this is a little bit sketchy and at the very
> least you should add a comment explaining what's going on when the CMB
> is being used. If someone more familiar with the NVMe driver could
> chime in I would appreciate it.
I may not be understanding the concern, but I'll give it a shot.
You're right, the start of any SQE is always 64-byte aligned, so that
should satisfy alignment requirements.
The order when writing multiple/successive SQEs in a submission queue
does matter, and this is currently serialized through the q_lock.
The order in which the bytes of a single SQE are written doesn't really
matter as long as the entire SQE is written into the CMB prior to writing
that SQ's doorbell register.
The doorbell register is written immediately after copying a command
entry into the submission queue (ignore "shadow buffer" features),
so the ratio of doorbells written to commands submitted is 1:1.
If CMB SQE and DB ordering is not enforced by the memcpy, then we do
need a barrier after the SQE's memcpy and before the doorbell's writel.
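To make that last point concrete, a simplified sketch of where such a barrier would sit when the SQ lives in the CMB, if writel() alone did not already provide the ordering. The function name is hypothetical and the shadow doorbell buffer handling mentioned above is omitted:

static void nvme_submit_cmd_sketch(struct nvme_queue *nvmeq,
				   struct nvme_command *cmd)
{
	u16 tail = nvmeq->sq_tail;

	/* With this patch, sq_cmds may point into the CMB (BAR memory). */
	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

	if (++tail == nvmeq->q_depth)
		tail = 0;

	/*
	 * Make the whole SQE visible to the device before the doorbell write
	 * tells it to fetch the entry. Whether writel() alone already provides
	 * this ordering is exactly what the rest of the thread discusses.
	 */
	wmb();
	writel(tail, nvmeq->q_db);
	nvmeq->sq_tail = tail;
}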
On 05/03/18 09:00 AM, Keith Busch wrote:
> On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
>> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <[email protected]> wrote:
>>> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>>> {
>>> u16 tail = nvmeq->sq_tail;
>>
>>> - if (nvmeq->sq_cmds_io)
>>> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>>> - else
>>> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>
>> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
>> the _toio() variant enforces alignment, does the copy with 4 byte
>> stores, and has a full barrier after the copy. In comparison our
>> regular memcpy() does none of those things and may use unaligned and
>> vector load/stores. For normal (cacheable) memory that is perfectly
>> fine, but they can cause alignment faults when targeted at MMIO
>> (cache-inhibited) memory.
>>
>> I think in this particular case it might be ok since we know SEQs are
>> aligned to 64 byte boundaries and the copy is too small to use our
>> vectorised memcpy(). I'll assume we don't need explicit ordering
>> between writes of SEQs since the existing code doesn't seem to care
>> unless the doorbell is being rung, so you're probably fine there too.
>> That said, I still think this is a little bit sketchy and at the very
>> least you should add a comment explaining what's going on when the CMB
>> is being used. If someone more familiar with the NVMe driver could
>> chime in I would appreciate it.
>
> I may not be understanding the concern, but I'll give it a shot.
>
> You're right, the start of any SQE is always 64-byte aligned, so that
> should satisfy alignment requirements.
>
> The order when writing multiple/successive SQEs in a submission queue
> does matter, and this is currently serialized through the q_lock.
>
> The order in which the bytes of a single SQE is written doesn't really
> matter as long as the entire SQE is written into the CMB prior to writing
> that SQ's doorbell register.
>
> The doorbell register is written immediately after copying a command
> entry into the submission queue (ignore "shadow buffer" features),
> so the doorbells written to commands submitted is 1:1.
>
> If a CMB SQE and DB order is not enforced with the memcpy, then we do
> need a barrier after the SQE's memcpy and before the doorbell's writel.
Thanks for the information Keith.
Adding to this: regular memcpy generally also enforces alignment as
unaligned access to regular memory is typically bad in some way on most
arches. The generic memcpy_toio also does not have any barrier as it is
just a call to memcpy. Arm64 also does not appear to have a barrier in
its implementation and in the short survey I did I could not find any
implementation with a barrier. I also did not find a ppc implementation
in the tree but it would be weird for it to add a barrier when other
arches do not appear to need it.
We've been operating on the assumption that memory mapped by
devm_memremap_pages() can be treated as regular memory. This is
emphasized by the fact that it does not return an __iomem pointer. If
this assumption does not hold for an arch then we cannot support P2P DMA
without an overhaul of many kernel interfaces or creating other backend
interfaces into the drivers which take different data types (i.e. we'd
have to bypass the entire block layer when trying to write data in
p2pmem to an nvme device). This is very undesirable.
Logan
On 3/5/2018 12:10 PM, Logan Gunthorpe wrote:
>
>
> On 05/03/18 09:00 AM, Keith Busch wrote:
>> On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
>>> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <[email protected]> wrote:
>>>> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>>>> {
>>>> u16 tail = nvmeq->sq_tail;
>>>
>>>> - if (nvmeq->sq_cmds_io)
>>>> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>>>> - else
>>>> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>
>>> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
>>> the _toio() variant enforces alignment, does the copy with 4 byte
>>> stores, and has a full barrier after the copy. In comparison our
>>> regular memcpy() does none of those things and may use unaligned and
>>> vector load/stores. For normal (cacheable) memory that is perfectly
>>> fine, but they can cause alignment faults when targeted at MMIO
>>> (cache-inhibited) memory.
>>>
>>> I think in this particular case it might be ok since we know SEQs are
>>> aligned to 64 byte boundaries and the copy is too small to use our
>>> vectorised memcpy(). I'll assume we don't need explicit ordering
>>> between writes of SEQs since the existing code doesn't seem to care
>>> unless the doorbell is being rung, so you're probably fine there too.
>>> That said, I still think this is a little bit sketchy and at the very
>>> least you should add a comment explaining what's going on when the CMB
>>> is being used. If someone more familiar with the NVMe driver could
>>> chime in I would appreciate it.
>>
>> I may not be understanding the concern, but I'll give it a shot.
>>
>> You're right, the start of any SQE is always 64-byte aligned, so that
>> should satisfy alignment requirements.
>>
>> The order when writing multiple/successive SQEs in a submission queue
>> does matter, and this is currently serialized through the q_lock.
>>
>> The order in which the bytes of a single SQE is written doesn't really
>> matter as long as the entire SQE is written into the CMB prior to writing
>> that SQ's doorbell register.
>>
>> The doorbell register is written immediately after copying a command
>> entry into the submission queue (ignore "shadow buffer" features),
>> so the doorbells written to commands submitted is 1:1.
>>
>> If a CMB SQE and DB order is not enforced with the memcpy, then we do
>> need a barrier after the SQE's memcpy and before the doorbell's writel.
>
>
> Thanks for the information Keith.
>
> Adding to this: regular memcpy generally also enforces alignment as unaligned access to regular memory is typically bad in some way on most arches. The generic memcpy_toio also does not have any barrier as it is just a call to memcpy. Arm64 also does not appear to have a barrier in its implementation and in the short survey I did I could not find any implementation with a barrier. I also did not find a ppc implementation in the tree but it would be weird for it to add a barrier when other arches do not appear to need it.
>
> We've been operating on the assumption that memory mapped by devm_memremap_pages() can be treated as regular memory. This is emphasized by the fact that it does not return an __iomem pointer. If this assumption does not hold for an arch then we cannot support P2P DMA without an overhaul of many kernel interfaces or creating other backend interfaces into the drivers which take different data types (ie. we'd have to bypass the entire block layer when trying to write data in p2pmem to an nvme device. This is very undesirable.
>
writel has a barrier inside on ARM64.
https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/io.h#L143
Why do you need another barrier?
ACCESSING DEVICES
-----------------
Many devices can be memory mapped, and so appear to the CPU as if they're just
a set of memory locations. To control such a device, the driver usually has to
make the right memory accesses in exactly the right order.
However, having a clever CPU or a clever compiler creates a potential problem
in that the carefully sequenced accesses in the driver code won't reach the
device in the requisite order if the CPU or the compiler thinks it is more
efficient to reorder, combine or merge accesses - something that would cause
the device to malfunction.
Inside of the Linux kernel, I/O should be done through the appropriate accessor
routines - such as inb() or writel() - which know how to make such accesses
appropriately sequential.
> Logan
>
>
>
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
On 05/03/18 11:02 AM, Sinan Kaya wrote:
> writel has a barrier inside on ARM64.
>
> https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/io.h#L143
Yes, and no barrier inside memcpy_toio as it uses __raw_writes. This
should be sufficient as we are only accessing addresses that look like
memory and have no side effects (those enabling doorbell accesses may
need to worry about this though). Typically, what could happen, in this
case, is the CPU would issue writes to the BAR normally and the next
time it programmed the DMA engine it would flush everything via the
flush in writel.
> Why do you need another barrier?
We don't.
Thanks,
Logan
>>> - if (nvmeq->sq_cmds_io)
>>> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>>> - else
>>> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>
>> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
>> the _toio() variant enforces alignment, does the copy with 4 byte
>> stores, and has a full barrier after the copy. In comparison our
>> regular memcpy() does none of those things and may use unaligned and
>> vector load/stores. For normal (cacheable) memory that is perfectly
>> fine, but they can cause alignment faults when targeted at MMIO
>> (cache-inhibited) memory.
>>
>> I think in this particular case it might be ok since we know SEQs are
>> aligned to 64 byte boundaries and the copy is too small to use our
>> vectorised memcpy(). I'll assume we don't need explicit ordering
>> between writes of SEQs since the existing code doesn't seem to care
>> unless the doorbell is being rung, so you're probably fine there too.
>> That said, I still think this is a little bit sketchy and at the very
>> least you should add a comment explaining what's going on when the CMB
>> is being used. If someone more familiar with the NVMe driver could
>> chime in I would appreciate it.
>
> I may not be understanding the concern, but I'll give it a shot.
>
> You're right, the start of any SQE is always 64-byte aligned, so that
> should satisfy alignment requirements.
>
> The order when writing multiple/successive SQEs in a submission queue
> does matter, and this is currently serialized through the q_lock.
>
> The order in which the bytes of a single SQE is written doesn't really
> matter as long as the entire SQE is written into the CMB prior to writing
> that SQ's doorbell register.
>
> The doorbell register is written immediately after copying a command
> entry into the submission queue (ignore "shadow buffer" features),
> so the doorbells written to commands submitted is 1:1.
>
> If a CMB SQE and DB order is not enforced with the memcpy, then we do
> need a barrier after the SQE's memcpy and before the doorbell's writel.
Keith, while we're on this, regardless of cmb, is SQE memcpy and DB
update ordering always guaranteed?
If you look at mlx4 (rdma device driver) that works exactly the same as
nvme you will find:
--
qp->sq.head += nreq;
/*
* Make sure that descriptors are written before
* doorbell record.
*/
wmb();
writel(qp->doorbell_qpn,
to_mdev(ibqp->device)->uar_map +
MLX4_SEND_DOORBELL);
/*
* Make sure doorbells don't leak out of SQ spinlock
* and reach the HCA out of order.
*/
mmiowb();
--
That seems to explicitly make sure to place a barrier before updating
the doorbell. So as I see it, either ordering is guaranteed and the
above code is redundant, or nvme needs to do the same.
Thoughts?
On Mon, Mar 05, 2018 at 09:57:27PM +0200, Sagi Grimberg wrote:
> Keith, while we're on this, regardless of cmb, is SQE memcopy and DB update
> ordering always guaranteed?
>
> If you look at mlx4 (rdma device driver) that works exactly the same as
> nvme you will find:
> --
> qp->sq.head += nreq;
>
> /*
> * Make sure that descriptors are written before
> * doorbell record.
> */
> wmb();
>
> writel(qp->doorbell_qpn,
> to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);
>
> /*
> * Make sure doorbells don't leak out of SQ spinlock
> * and reach the HCA out of order.
> */
> mmiowb();
> --
>
> That seems to explicitly make sure to place a barrier before updating
> the doorbell. So as I see it, either ordering is guaranteed and the
> above code is redundant, or nvme needs to do the same.
A wmb() is always required before operations that can trigger DMA.
The reason ARM has a barrier in writel() is not to make it ordered
with respect to CPU stores to cachable memory, but to make it ordered
with respect to *other* writels.
Eg Linux defines this:
writel(A, mem);
writel(B, mem);
To always produce two TLPs on PCI-E when mem is UC BAR memory.
And ARM cannot guarantee that without the extra barrier.
So now we see stuff like this:
writel_relaxed(A, mem);
writel_relaxed(B, mem+4);
Which says the TLPs to A and B can be issued in any order..
So when reading the above mlx code, we see the first wmb() being used
to ensure that CPU stores to cachable memory are visible to the DMA
triggered by the doorbell ring.
The mmiowb() is used to ensure that DB writes are not combined and not
issued in any order other than implied by the lock that encloses the
whole thing. This is needed because uar_map is WC memory.
We don't have ordering with respect to two writel's here, so if ARM
performance was a concern the writel could be switched to
writel_relaxed().
Presumably nvme has similar requirements, although I guess the DB
register is mapped UC not WC?
Jason
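For illustration, a hedged sketch of the pattern described above; all names here (my_desc, ring_doorbell, db_val, db_reg) are hypothetical rather than from any real driver. The wmb() orders the cacheable descriptor stores before the DMA-triggering MMIO write, the relaxed accessor is the optional optimization mentioned, and mmiowb() plays the same role as in the mlx4 snippet:

#include <linux/io.h>
#include <linux/types.h>

struct my_desc {
	__le64 addr;
	__le32 len;
	__le32 flags;
};

static void ring_doorbell(struct my_desc *desc, dma_addr_t buf, u32 len,
			  u32 db_val, void __iomem *db_reg)
{
	/* CPU stores to a descriptor in DMA-coherent (cacheable) memory. */
	desc->addr = cpu_to_le64(buf);
	desc->len = cpu_to_le32(len);

	/* Make those stores visible before the MMIO write that triggers DMA. */
	wmb();

	/*
	 * writel() would additionally order against other writels; with wmb()
	 * and the caller's lock already doing that work, writel_relaxed()
	 * avoids paying for a second barrier on arches like ARM64.
	 */
	writel_relaxed(db_val, db_reg);

	/*
	 * Don't let the doorbell leak past the enclosing spin_unlock() on
	 * arches where that can happen (same purpose as the mmiowb() in the
	 * mlx4 code above).
	 */
	mmiowb();
}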
On 05/03/18 12:57 PM, Sagi Grimberg wrote:
> Keith, while we're on this, regardless of cmb, is SQE memcopy and DB
> update ordering always guaranteed?
>
> If you look at mlx4 (rdma device driver) that works exactly the same as
> nvme you will find:
> --
> qp->sq.head += nreq;
>
> /*
> * Make sure that descriptors are written before
> * doorbell record.
> */
> wmb();
>
> writel(qp->doorbell_qpn,
> to_mdev(ibqp->device)->uar_map +
> MLX4_SEND_DOORBELL);
>
> /*
> * Make sure doorbells don't leak out of SQ spinlock
> * and reach the HCA out of order.
> */
> mmiowb();
> --
To me, it looks like the wmb() is redundant as writel should guarantee
the order. (Indeed, per Sinan's comment, writel on arm64 starts with a
wmb() which means, on that platform, there are two wmb() calls in a row.)
The mmiowb() call, on the other hand, looks correct per my understanding
of its purpose with respect to the spinlock.
Logan
On 05/03/18 01:10 PM, Jason Gunthorpe wrote:
> So when reading the above mlx code, we see the first wmb() being used
> to ensure that CPU stores to cachable memory are visible to the DMA
> triggered by the doorbell ring.
Oh, yes, that makes sense. Disregard my previous email as I was wrong.
Logan
> Yes i need to document that some more in hmm.txt...
Hi Jérôme, thanks for the explanation. Can I suggest you update hmm.txt with what you sent out?
> I am about to send RFC for nouveau, i am still working out some bugs.
Great. I will keep an eye out for it. An example user of HMM will be very helpful.
> i will fix the MAINTAINERS as part of those.
Awesome, thanks.
Stephen
On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote:
> So when reading the above mlx code, we see the first wmb() being used
> to ensure that CPU stores to cachable memory are visible to the DMA
> triggered by the doorbell ring.
IIUC, we don't need a similar barrier for NVMe to ensure memory is
visible to DMA since the SQE memory is allocated DMA coherent when the
SQ is not within a CMB.
> The mmiowb() is used to ensure that DB writes are not combined and not
> issued in any order other than implied by the lock that encloses the
> whole thing. This is needed because uar_map is WC memory.
>
> We don't have ordering with respect to two writel's here, so if ARM
> performance was a concern the writel could be switched to
> writel_relaxed().
>
> Presumably nvme has similar requirments, although I guess the DB
> register is mapped UC not WC?
Yep, the NVMe DB register is required by the spec to be mapped UC.
On Mon, Mar 05, 2018 at 01:42:12PM -0700, Keith Busch wrote:
> On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote:
> > So when reading the above mlx code, we see the first wmb() being used
> > to ensure that CPU stores to cachable memory are visible to the DMA
> > triggered by the doorbell ring.
>
> IIUC, we don't need a similar barrier for NVMe to ensure memory is
> visibile to DMA since the SQE memory is allocated DMA coherent when the
> SQ is not within a CMB.
You still need it.
DMA coherent just means you don't need to call the DMA API after
writing, it says nothing about CPU ordering.
E.g. on x86 DMA coherent is just normal system memory, and you do need
the SFENCE between system memory stores and DMA-triggering MMIO,
apparently.
Jason
On Thu, Mar 01, 2018 at 12:13:10PM -0700, Logan Gunthorpe wrote:
>
>
> On 01/03/18 11:02 AM, Bjorn Helgaas wrote:
> > > void pci_enable_acs(struct pci_dev *dev)
> > > {
> > > + if (pci_p2pdma_disable_acs(dev))
> > > + return;
> >
> > This doesn't read naturally to me. I do see that when
> > CONFIG_PCI_P2PDMA is not set, pci_p2pdma_disable_acs() does nothing
> > and returns 0, so we'll go ahead and try to enable ACS as before.
> >
> > But I think it would be clearer to have an #ifdef CONFIG_PCI_P2PDMA
> > right here so it's more obvious that we only disable ACS when it's
> > selected.
>
> I could do this... however, I wrote it this way because I've read Linus
> dislikes using #ifdef's inside function bodies and I personally agree with
> that sentiment.
I try to avoid #ifdefs too, but in this case the plain reading of the
code makes no sense (we're in a function called "enable_acs()", and
the first thing we do is call a function to "disable_acs()").
Disabling ACS is scary for all the security reasons mentioned
elsewhere, so a reader who knows what ACS does and cares about
virtualization and security *has* to go look up the definition of
pci_p2pdma_disable_acs().
If you put the #ifdef right here, then it's easier to read because we
can see that "oh, this is a special and uncommon case that I can
probably ignore".
Bjorn
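Concretely, the shape being suggested here would look something like the following (a sketch of the suggestion, not the final patch; the existing ACS-enabling body is elided):

void pci_enable_acs(struct pci_dev *dev)
{
#ifdef CONFIG_PCI_P2PDMA
	/* P2PDMA requires ACS to stay disabled behind the switch. */
	if (pci_p2pdma_disable_acs(dev))
		return;
#endif
	/* ... existing code that enables ACS ... */
}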
On 05/03/18 03:28 PM, Bjorn Helgaas wrote:
> If you put the #ifdef right here, then it's easier to read because we
> can see that "oh, this is a special and uncommon case that I can
> probably ignore".
Makes sense. I'll do that.
Thanks,
Logan
On Tue, Mar 6, 2018 at 4:10 AM, Logan Gunthorpe <[email protected]> wrote:
>
>
> On 05/03/18 09:00 AM, Keith Busch wrote:
>>
>> On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
>>>
>>> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <[email protected]>
>>> wrote:
>>>>
>>>> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue
>>>> *nvmeq,
>>>> {
>>>> u16 tail = nvmeq->sq_tail;
>>>
>>>
>>>> - if (nvmeq->sq_cmds_io)
>>>> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd,
>>>> sizeof(*cmd));
>>>> - else
>>>> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>
>>>
>>> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
>>> the _toio() variant enforces alignment, does the copy with 4 byte
>>> stores, and has a full barrier after the copy. In comparison our
>>> regular memcpy() does none of those things and may use unaligned and
>>> vector load/stores. For normal (cacheable) memory that is perfectly
>>> fine, but they can cause alignment faults when targeted at MMIO
>>> (cache-inhibited) memory.
>>>
>>> I think in this particular case it might be ok since we know SEQs are
>>> aligned to 64 byte boundaries and the copy is too small to use our
>>> vectorised memcpy(). I'll assume we don't need explicit ordering
>>> between writes of SEQs since the existing code doesn't seem to care
>>> unless the doorbell is being rung, so you're probably fine there too.
>>> That said, I still think this is a little bit sketchy and at the very
>>> least you should add a comment explaining what's going on when the CMB
>>> is being used. If someone more familiar with the NVMe driver could
>>> chime in I would appreciate it.
>>
>>
>> I may not be understanding the concern, but I'll give it a shot.
>>
>> You're right, the start of any SQE is always 64-byte aligned, so that
>> should satisfy alignment requirements.
>>
>> The order when writing multiple/successive SQEs in a submission queue
>> does matter, and this is currently serialized through the q_lock.
>>
>> The order in which the bytes of a single SQE is written doesn't really
>> matter as long as the entire SQE is written into the CMB prior to writing
>> that SQ's doorbell register.
>>
>> The doorbell register is written immediately after copying a command
>> entry into the submission queue (ignore "shadow buffer" features),
>> so the doorbells written to commands submitted is 1:1.
>>
>> If a CMB SQE and DB order is not enforced with the memcpy, then we do
>> need a barrier after the SQE's memcpy and before the doorbell's writel.
>
>
>
> Thanks for the information Keith.
>
> Adding to this: regular memcpy generally also enforces alignment as
> unaligned access to regular memory is typically bad in some way on most
> arches. The generic memcpy_toio also does not have any barrier as it is just
> a call to memcpy. Arm64 also does not appear to have a barrier in its
> implementation and in the short survey I did I could not find any
> implementation with a barrier. I also did not find a ppc implementation in
> the tree but it would be weird for it to add a barrier when other arches do
> not appear to need it.
It's in arch/powerpc/kernel/io.c as _memcpy_toio() and it has two full barriers!
Awesome!
Our io.h indicates that our iomem accessors are designed to provide x86ish
strong ordering of accesses to MMIO space. The git log indicates
arch/powerpc/kernel/io.c has barely been touched in the last decade so
odds are most of that code was written in the elder days when people
were less aware of ordering issues. It might just be overly conservative
by today's standards, but maybe not (see below).
> We've been operating on the assumption that memory mapped by
> devm_memremap_pages() can be treated as regular memory. This is emphasized
> by the fact that it does not return an __iomem pointer.
I think the lack of an __iomem annotation is a bit of a red herring. The comment
header for devm_memremap_pages() outlines the conditions for when using
it is valid, the important one being:
> * 4/ res is expected to be a host memory range that could feasibly be
> * treated as a "System RAM" range, i.e. not a device mmio range, but
> * this is not enforced.
Granted, the CMB is an MMIO buffer rather than a register range, but from
my perspective it's still a device MMIO range rather than something we
could feasibly treat as system RAM. The two big reasons being:
a) System RAM is cacheable and coherent while the CMB is neither, and
b) On PPC MMIO accesses are in a separate ordering domain to cacheable
memory accesses.
The latter isn't as terrifying as it sounds since a full barrier will
order everything with everything, but it still causes problems. For
performance reasons we want to minimise the number of cases where we need
to use a sync instruction (full barrier) in favour of a lightweight sync
(lwsync) which only orders cacheable accesses. To implement that our MMIO
accessors set a percpu flag that arch_spin_unlock() checks to see if any
MMIO accesses have occurred which would require a full barrier. If we
haven't done any MMIO operations then it'll only do a lwsync. My concern
here is that a driver might do something like this:
void *buffer = devm_memremap_pages(<stuff corresponding to card memory buffer>);
spin_lock();
<do some stuff with system memory>
// buffer is in MMIO space, so the writes are uncached
memcpy(buffer, source_data, <size>);
<do more stuff with system memory>
spin_unlock(); // no MMIO access detected, so it does a lwsync
As far as I know the spinlock API guarantees that accesses occurring
inside the critical section will be ordered relative to what happens
outside the critical section. In this case we would be violating that
guarantee and I don't see a clean way to keep the fix for it contained
to arch/powerpc/. Fundamentally we need to know when something is
accessing MMIO space.
(I'm not going to suggest ditching the lwsync trick. mpe is not going
to take that patch without a really good reason.)
>If this assumption
> does not hold for an arch then we cannot support P2P DMA without an overhaul
> of many kernel interfaces or creating other backend interfaces into the
> drivers which take different data types (ie. we'd have to bypass the entire
> block layer when trying to write data in p2pmem to an nvme device. This is
> very undesirable.
I know. I'm not trying to ruin your day, but the separation between I/O
and normal memory is there for a reason. Maybe we can relax that
separation, but we need to be careful about how we do it.
Oliver
On 05/03/18 05:49 PM, Oliver wrote:
> It's in arch/powerpc/kernel/io.c as _memcpy_toio() and it has two full barriers!
>
> Awesome!
>
> Our io.h indicates that our iomem accessors are designed to provide x86ish
> strong ordering of accesses to MMIO space. The git log indicates
> arch/powerpc/kernel/io.c has barely been touched in the last decade so
> odds are most of that code was written in the elder days when people
> were less aware of ordering issues. It might just be overly conservative
> by today's standards, but maybe not (see below).
Yes, that seems overly conservative.
> (I'm not going to suggest ditching the lwsync trick. mpe is not going
> to take that patch
> without a really good reason)
Well, that's pretty gross. Is this not exactly the situation mmiowb() is
meant to solve? See [1].
Though, you're right in principle. Even if power was similar to other
systems in this way, it's still a risk that if these pages get passed
somewhere in the kernel that uses a spin lock like that without an
mmiowb() call, then it's going to have a bug. For now, the risk is
pretty low as we know exactly where all the p2pmem pages will be used
but if it gets into other places, all bets are off. I did do some work
trying to make a safe version of io-pages and also trying to change from
pages to pfn_t in large areas but neither approach seemed likely to get
any traction in the community, at least not in the near term.
Logan
[1] ACQUIRES VS I/O ACCESSES in
https://www.kernel.org/doc/Documentation/memory-barriers.txt
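For reference, the pattern that section of memory-barriers.txt describes looks roughly like this (a generic sketch with placeholder names, not tied to any particular driver):

	spin_lock(&io_lock);
	writel(val, ioaddr);	/* MMIO write inside the critical section */
	mmiowb();		/* keep the MMIO write from being reordered past the
				 * unlock, so another CPU taking the lock cannot have
				 * its own MMIO writes reach the device first */
	spin_unlock(&io_lock);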
On Tue, Mar 6, 2018 at 12:14 PM, Logan Gunthorpe <[email protected]> wrote:
>
> On 05/03/18 05:49 PM, Oliver wrote:
>>
>> It's in arch/powerpc/kernel/io.c as _memcpy_toio() and it has two full
>> barriers!
>>
>> Awesome!
>>
>> Our io.h indicates that our iomem accessors are designed to provide x86ish
>> strong ordering of accesses to MMIO space. The git log indicates
>> arch/powerpc/kernel/io.c has barely been touched in the last decade so
>> odds are most of that code was written in the elder days when people
>> were less aware of ordering issues. It might just be overly conservative
>> by today's standards, but maybe not (see below).
>
>
> Yes, that seems overly conservative.
>
>> (I'm not going to suggest ditching the lwsync trick. mpe is not going
>> to take that patch
>> without a really good reason)
>
>
> Well, that's pretty gross. Is this not exactly the situation mmiowb() is
> meant to solve? See [1].
Yep, mmiowb() is supposed to be used in this situation. According to BenH,
author of that io_sync hack, we implement the stronger semantics
so that we don't break existing drivers that assume spin_unlock() does
order i/o even though it's not supposed to. At a guess the x86 version of
spin_unlock() happens to do that so the rest of us need to either live
with it or fix all the bugs :)
> Though, you're right in principle. Even if power was similar to other
> systems in this way, it's still a risk that if these pages get passed
> somewhere in the kernel that uses a spin lock like that without an mmiowb()
> call, then it's going to have a bug. For now, the risk is pretty low as we
> know exactly where all the p2pmem pages will be used but if it gets into
> other places, all bets are off.
Yeah this was my main concern with the whole approach. For ioremap()ed
memory we have the __iomem annotation to help with tracking when we
need to be careful, but we'd lose that entirely here.
> I did do some work trying to make a safe
> version of io-pages and also trying to change from pages to pfn_t in large
> areas but neither approach seemed likely to get any traction in the
> community, at least not in the near term.
It's a tricky problem. HMM with DEVICE_PRIVATE might be more
palatable than the pfn_t conversion since there would still be struct pages
backing the device memory. That said, there are probably other issues with
device private memory not being part of the linear mapping, but HMM
provides some assistance there.
Oliver
On Thu, 2018-03-01 at 15:58 +0000, Stephen Bates wrote:
> > Any plans adding the capability to nvme-rdma? Should be
> > straight-forward... In theory, the use-case would be rdma backend
> > fabric behind. Shouldn't be hard to test either...
>
> Nice idea Sagi. Yes we have been starting to look at that. Though again we
> would probably want to impose the "attached to the same PCIe switch" rule
> which might be less common to satisfy in initiator systems.
>
> Down the road I would also like to discuss the best way to use this P2P
> framework to facilitate copies between NVMe namespaces (on both PCIe and
> fabric attached namespaces) without having to expose the CMB up to user
> space. Wasn't something like that done in the SCSI world at some point
> Martin?
Are you perhaps referring to the following patch series: "Copy Offload"
(https://www.spinics.net/lists/linux-scsi/msg74680.html /
https://lwn.net/Articles/592094/)? I will contact Martin off-list and in
case he wouldn't have the time myself to revive that patch series then I
will free up some time to work on this.
Bart.