2021-04-08 17:03:12

by Logan Gunthorpe

Subject: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

Hi,

This patchset continues my work to add P2PDMA support to the common
dma map operations. This allows for creating SGLs that have both P2PDMA
and regular pages, which is a necessary step toward allowing P2PDMA
pages in userspace.

The earlier RFC[1] generated a lot of great feedback and I heard no
show-stopping objections. Thus, I've incorporated all the feedback and have
decided to post this as a proper patch series with hopes of eventually
getting it in mainline.

I'm happy to do a few more passes if anyone has any further feedback
or better ideas.

This series is based on v5.12-rc6 and a git branch can be found here:

https://github.com/sbates130272/linux-p2pmem/ p2pdma_map_ops_v1

Thanks,

Logan

[1] https://lore.kernel.org/linux-block/[email protected]/


Changes since the RFC:
* Added comment and fixed up the pci_get_slot patch. (per Bjorn)
* Fixed glaring sg_phys() double offset bug. (per Robin)
* Created a new map operation (dma_map_sg_p2pdma()) with a new calling
convention instead of modifying the calling convention of
dma_map_sg(). (per Robin)
* Integrated the two similar pci_p2pdma_dma_map_type() and
pci_p2pdma_map_type() functions into one (per Ira)
* Reworked some of the logic in the map_sg() implementations into
helpers in the p2pdma code. (per Christoph)
* Dropped a bunch of unnecessary symbol exports (per Christoph)
* Expanded the code in dma_pci_p2pdma_supported() for clarity. (per
Ira and Christoph)
* Finished off using the new dma_map_sg_p2pdma() call in rdma_rw
and removed the old pci_p2pdma_[un]map_sg(). (per Jason)

--

Logan Gunthorpe (16):
PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
PCI/P2PDMA: Avoid pci_get_slot() which sleeps
PCI/P2PDMA: Attempt to set map_type if it has not been set
PCI/P2PDMA: Refactor pci_p2pdma_map_type() to take pagemap and device
dma-mapping: Introduce dma_map_sg_p2pdma()
lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
PCI/P2PDMA: Make pci_p2pdma_map_type() non-static
PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
nvme-pci: Convert to using dma_map_sg_p2pdma for p2pdma pages
nvme-rdma: Ensure dma support when using p2pdma
RDMA/rw: use dma_map_sg_p2pdma()
PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

drivers/infiniband/core/rw.c | 50 +++-------
drivers/iommu/dma-iommu.c | 66 ++++++++++--
drivers/nvme/host/core.c | 3 +-
drivers/nvme/host/nvme.h | 2 +-
drivers/nvme/host/pci.c | 39 ++++----
drivers/nvme/target/rdma.c | 3 +-
drivers/pci/Kconfig | 2 +-
drivers/pci/p2pdma.c | 188 +++++++++++++++++++----------------
include/linux/dma-map-ops.h | 3 +
include/linux/dma-mapping.h | 20 ++++
include/linux/pci-p2pdma.h | 53 ++++++----
include/linux/scatterlist.h | 49 ++++++++-
include/rdma/ib_verbs.h | 32 ++++++
kernel/dma/direct.c | 25 ++++-
kernel/dma/mapping.c | 70 +++++++++++--
15 files changed, 416 insertions(+), 189 deletions(-)


base-commit: e49d033bddf5b565044e2abe4241353959bc9120
--
2.20.1


2021-04-08 17:03:14

by Logan Gunthorpe

Subject: [PATCH 15/16] RDMA/rw: use dma_map_sg_p2pdma()

Drop the use of pci_p2pdma_map_sg() in favour of dma_map_sg_p2pdma().

The new interface allows mapping scatterlists that mix both regular
and P2PDMA pages and will verify that the dma device can communicate
with the device the pages are on.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/infiniband/core/rw.c | 50 ++++++++++--------------------------
include/rdma/ib_verbs.h | 32 +++++++++++++++++++++++
2 files changed, 46 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 31156e22d3e7..0c6213d9b044 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -273,26 +273,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
return 1;
}

-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
- u32 sg_cnt, enum dma_data_direction dir)
-{
- if (is_pci_p2pdma_page(sg_page(sg)))
- pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
- else
- ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
- u32 sg_cnt, enum dma_data_direction dir)
-{
- if (is_pci_p2pdma_page(sg_page(sg))) {
- if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
- return 0;
- return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
- }
- return ib_dma_map_sg(dev, sg, sg_cnt, dir);
-}
-
/**
* rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
* @ctx: context to initialize
@@ -315,9 +295,9 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct ib_device *dev = qp->pd->device;
int ret;

- ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
- if (!ret)
- return -ENOMEM;
+ ret = ib_dma_map_sg_p2pdma(dev, sg, sg_cnt, dir);
+ if (ret < 0)
+ return ret;
sg_cnt = ret;

/*
@@ -354,7 +334,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
return ret;

out_unmap_sg:
- rdma_rw_unmap_sg(dev, sg, sg_cnt, dir);
+ ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
return ret;
}
EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -394,17 +374,15 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
return -EINVAL;
}

- ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
- if (!ret)
- return -ENOMEM;
+ ret = ib_dma_map_sg_p2pdma(dev, sg, sg_cnt, dir);
+ if (ret < 0)
+ return ret;
sg_cnt = ret;

if (prot_sg_cnt) {
- ret = rdma_rw_map_sg(dev, prot_sg, prot_sg_cnt, dir);
- if (!ret) {
- ret = -ENOMEM;
+ ret = ib_dma_map_sg_p2pdma(dev, prot_sg, prot_sg_cnt, dir);
+ if (ret < 0)
goto out_unmap_sg;
- }
prot_sg_cnt = ret;
}

@@ -469,9 +447,9 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
kfree(ctx->reg);
out_unmap_prot_sg:
if (prot_sg_cnt)
- rdma_rw_unmap_sg(dev, prot_sg, prot_sg_cnt, dir);
+ ib_dma_unmap_sg(dev, prot_sg, prot_sg_cnt, dir);
out_unmap_sg:
- rdma_rw_unmap_sg(dev, sg, sg_cnt, dir);
+ ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
return ret;
}
EXPORT_SYMBOL(rdma_rw_ctx_signature_init);
@@ -603,7 +581,7 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
break;
}

- rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+ ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy);

@@ -631,8 +609,8 @@ void rdma_rw_ctx_destroy_signature(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
kfree(ctx->reg);

if (prot_sg_cnt)
- rdma_rw_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
- rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+ ib_dma_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
+ ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy_signature);

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index ca28fca5736b..a541ed1702f5 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4028,6 +4028,17 @@ static inline int ib_dma_map_sg_attrs(struct ib_device *dev,
dma_attrs);
}

+static inline int ib_dma_map_sg_p2pdma_attrs(struct ib_device *dev,
+ struct scatterlist *sg, int nents,
+ enum dma_data_direction direction,
+ unsigned long dma_attrs)
+{
+ if (ib_uses_virt_dma(dev))
+ return ib_dma_virt_map_sg(dev, sg, nents);
+ return dma_map_sg_p2pdma_attrs(dev->dma_device, sg, nents, direction,
+ dma_attrs);
+}
+
static inline void ib_dma_unmap_sg_attrs(struct ib_device *dev,
struct scatterlist *sg, int nents,
enum dma_data_direction direction,
@@ -4052,6 +4063,27 @@ static inline int ib_dma_map_sg(struct ib_device *dev,
return ib_dma_map_sg_attrs(dev, sg, nents, direction, 0);
}

+/**
+ * ib_dma_map_sg_p2pdma - Map a scatter/gather list to DMA addresses
+ * @dev: The device for which the DMA addresses are to be created
+ * @sg: The array of scatter/gather entries
+ * @nents: The number of scatter/gather entries
+ * @direction: The direction of the DMA
+ *
+ * Map a scatter/gather list that might contain P2PDMA pages.
+ * Unlike ib_dma_map_sg() it will return either a negative errno or
+ * a positive value indicating the number of dma segments. See
+ * dma_map_sg_p2pdma_attrs() for details.
+ *
+ * The resulting list should be unmapped with ib_dma_unmap_sg().
+ */
+static inline int ib_dma_map_sg_p2pdma(struct ib_device *dev,
+ struct scatterlist *sg, int nents,
+ enum dma_data_direction direction)
+{
+ return ib_dma_map_sg_p2pdma_attrs(dev, sg, nents, direction, 0);
+}
+
/**
* ib_dma_unmap_sg - Unmap a scatter/gather list of DMA addresses
* @dev: The device for which the DMA addresses were created
--
2.20.1

2021-04-08 17:03:27

by Logan Gunthorpe

Subject: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()

In order to call upstream_bridge_distance_warn() from a dma_map function,
it must not sleep. The only reason it does sleep is to allocate the seqbuf
to print which devices are within the ACS path.

Switch the kmalloc call to use a passed in gfp_mask and don't print that
message if the buffer fails to be allocated.

Signed-off-by: Logan Gunthorpe <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
---
drivers/pci/p2pdma.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 196382630363..bd89437faf06 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -267,7 +267,7 @@ static int pci_bridge_has_acs_redir(struct pci_dev *pdev)

static void seq_buf_print_bus_devfn(struct seq_buf *buf, struct pci_dev *pdev)
{
- if (!buf)
+ if (!buf || !buf->buffer)
return;

seq_buf_printf(buf, "%s;", pci_name(pdev));
@@ -495,25 +495,26 @@ upstream_bridge_distance(struct pci_dev *provider, struct pci_dev *client,

static enum pci_p2pdma_map_type
upstream_bridge_distance_warn(struct pci_dev *provider, struct pci_dev *client,
- int *dist)
+ int *dist, gfp_t gfp_mask)
{
struct seq_buf acs_list;
bool acs_redirects;
int ret;

- seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE);
- if (!acs_list.buffer)
- return -ENOMEM;
+ seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, gfp_mask), PAGE_SIZE);

ret = upstream_bridge_distance(provider, client, dist, &acs_redirects,
&acs_list);
if (acs_redirects) {
pci_warn(client, "ACS redirect is set between the client and provider (%s)\n",
pci_name(provider));
- /* Drop final semicolon */
- acs_list.buffer[acs_list.len-1] = 0;
- pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
- acs_list.buffer);
+
+ if (acs_list.buffer) {
+ /* Drop final semicolon */
+ acs_list.buffer[acs_list.len - 1] = 0;
+ pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
+ acs_list.buffer);
+ }
}

if (ret == PCI_P2PDMA_MAP_NOT_SUPPORTED) {
@@ -566,7 +567,7 @@ int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,

if (verbose)
ret = upstream_bridge_distance_warn(provider,
- pci_client, &distance);
+ pci_client, &distance, GFP_KERNEL);
else
ret = upstream_bridge_distance(provider, pci_client,
&distance, NULL, NULL);
--
2.20.1

2021-04-08 17:03:30

by Logan Gunthorpe

Subject: [PATCH 11/16] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

Similar to dma-direct, the sg_mark_pci_p2pdma() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is
set to DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for
P2PDMA pages.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/iommu/dma-iommu.c | 66 ++++++++++++++++++++++++++++++++++-----
1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..ef49635f9819 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
#include <linux/swiotlb.h>
#include <linux/scatterlist.h>
#include <linux/vmalloc.h>
@@ -864,6 +865,16 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;

+ if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+ if (i > 0)
+ cur = sg_next(cur);
+
+ pci_p2pdma_map_bus_segment(s, cur);
+ count++;
+ cur_len = 0;
+ continue;
+ }
+
/*
* Now fill in the real DMA data. If...
* - there is a valid output segment to append to
@@ -961,10 +972,12 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+ struct dev_pagemap *pgmap = NULL;
+ enum pci_p2pdma_map_type map_type;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
- int i;
+ int i, ret = 0;

if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
iommu_deferred_attach(dev, domain))
@@ -993,6 +1006,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;

+ if (is_pci_p2pdma_page(sg_page(s))) {
+ if (sg_page(s)->pgmap != pgmap) {
+ pgmap = sg_page(s)->pgmap;
+ map_type = pci_p2pdma_map_type(pgmap, dev,
+ attrs);
+ }
+
+ switch (map_type) {
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ /*
+ * A zero length will be ignored by
+ * iommu_map_sg() and then can be detected
+ * in __finalise_sg() to actually map the
+ * bus address.
+ */
+ s->length = 0;
+ continue;
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ break;
+ default:
+ ret = -EREMOTEIO;
+ goto out_restore_sg;
+ }
+ }
+
/*
* Due to the alignment of our single IOVA allocation, we can
* depend on these assumptions about the segment boundary mask:
@@ -1015,6 +1053,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
prev = s;
}

+ if (!iova_len)
+ return __finalise_sg(dev, sg, nents, 0);
+
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova)
goto out_restore_sg;
@@ -1032,13 +1073,13 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
iommu_dma_free_iova(cookie, iova, iova_len, NULL);
out_restore_sg:
__invalidate_sg(sg, nents);
- return 0;
+ return ret;
}

static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
{
- dma_addr_t start, end;
+ dma_addr_t end, start = DMA_MAPPING_ERROR;
struct scatterlist *tmp;
int i;

@@ -1054,14 +1095,22 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
* The scatterlist segments are mapped into a single
* contiguous IOVA allocation, so this is incredibly easy.
*/
- start = sg_dma_address(sg);
- for_each_sg(sg_next(sg), tmp, nents - 1, i) {
+ for_each_sg(sg, tmp, nents, i) {
+ if (sg_is_pci_p2pdma(tmp)) {
+ sg_unmark_pci_p2pdma(tmp);
+ continue;
+ }
if (sg_dma_len(tmp) == 0)
break;
- sg = tmp;
+
+ if (start == DMA_MAPPING_ERROR)
+ start = sg_dma_address(tmp);
+
+ end = sg_dma_address(tmp) + sg_dma_len(tmp);
}
- end = sg_dma_address(sg) + sg_dma_len(sg);
- __iommu_dma_unmap(dev, start, end - start);
+
+ if (start != DMA_MAPPING_ERROR)
+ __iommu_dma_unmap(dev, start, end - start);
}

static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
@@ -1254,6 +1303,7 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
}

static const struct dma_map_ops iommu_dma_ops = {
+ .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
.free = iommu_dma_free,
.alloc_pages = dma_common_alloc_pages,
--
2.20.1

2021-04-08 17:03:35

by Logan Gunthorpe

Subject: [PATCH 14/16] nvme-rdma: Ensure dma support when using p2pdma

Ensure the dma operations support p2pdma before using the RDMA
device for P2PDMA. This allows switching the RDMA driver from
pci_p2pdma_map_sg() to dma_map_sg_p2pdma().

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/target/rdma.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 6c1f3ab7649c..3ec7e77e5416 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -414,7 +414,8 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
goto out_free_rsp;

- if (!ib_uses_virt_dma(ndev->device))
+ if (!ib_uses_virt_dma(ndev->device) &&
+ dma_pci_p2pdma_supported(&ndev->device->dev))
r->req.p2p_client = &ndev->device->dev;
r->send_sge.length = sizeof(*r->req.cqe);
r->send_sge.lkey = ndev->pd->local_dma_lkey;
--
2.20.1

2021-04-08 17:03:59

by Logan Gunthorpe

Subject: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

dma_map_sg() either returns a positive number indicating the number
of entries mapped or zero indicating that resources were not available
to create the mapping. When zero is returned, it is always safe to retry
the mapping later once resources have been freed.

Once P2PDMA pages are mixed into the SGL there may be pages that can
never be successfully mapped with a given device because that device may
not actually be able to access those pages. Thus, multiple error
conditions will need to be distinguished to determine whether a retry
is safe.

Introduce dma_map_sg_p2pdma[_attrs]() with a different calling
convention from dma_map_sg(). The function will return a positive
integer on success or a negative errno on failure.

ENOMEM will be used to indicate a resource failure and EREMOTEIO to
indicate that a P2PDMA page is not mappable.

The __DMA_ATTR_PCI_P2PDMA attribute is introduced to inform the lower
level implementations that P2PDMA pages are allowed and to warn if a
caller introduces them into the regular dma_map_sg() interface.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
include/linux/dma-mapping.h | 15 +++++++++++
kernel/dma/mapping.c | 52 ++++++++++++++++++++++++++++++++-----
2 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2a984cb4d1e0..50b8f586cf59 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -60,6 +60,12 @@
* at least read-only at lesser-privileged levels).
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)
+/*
+ * __DMA_ATTR_PCI_P2PDMA: This should not be used directly, use
+ * dma_map_sg_p2pdma() instead. Used internally to indicate that the
+ * caller is using the dma_map_sg_p2pdma() interface.
+ */
+#define __DMA_ATTR_PCI_P2PDMA (1UL << 10)

/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
@@ -107,6 +113,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
enum dma_data_direction dir, unsigned long attrs);
int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir, unsigned long attrs);
+int dma_map_sg_p2pdma_attrs(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction dir, unsigned long attrs);
void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir,
unsigned long attrs);
@@ -160,6 +168,12 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
{
return 0;
}
+static inline int dma_map_sg_p2pdma_attrs(struct device *dev,
+ struct scatterlist *sg, int nents, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return 0;
+}
static inline void dma_unmap_sg_attrs(struct device *dev,
struct scatterlist *sg, int nents, enum dma_data_direction dir,
unsigned long attrs)
@@ -392,6 +406,7 @@ static inline void dma_sync_sgtable_for_device(struct device *dev,
#define dma_map_single(d, a, s, r) dma_map_single_attrs(d, a, s, r, 0)
#define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, 0)
#define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0)
+#define dma_map_sg_p2pdma(d, s, n, r) dma_map_sg_p2pdma_attrs(d, s, n, r, 0)
#define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, 0)
#define dma_map_page(d, p, o, s, r) dma_map_page_attrs(d, p, o, s, r, 0)
#define dma_unmap_page(d, a, s, r) dma_unmap_page_attrs(d, a, s, r, 0)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6a633679933..923089c4267b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -177,12 +177,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
}
EXPORT_SYMBOL(dma_unmap_page_attrs);

-/*
- * dma_maps_sg_attrs returns 0 on error and > 0 on success.
- * It should never return a value < 0.
- */
-int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
- enum dma_data_direction dir, unsigned long attrs)
+static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction dir, unsigned long attrs)
{
const struct dma_map_ops *ops = get_dma_ops(dev);
int ents;
@@ -197,6 +193,20 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
+ return ents;
+}
+
+/*
+ * dma_maps_sg_attrs returns 0 on error and > 0 on success.
+ * It should never return a value < 0.
+ */
+int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir, unsigned long attrs)
+{
+ int ents;
+
+ ents = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
BUG_ON(ents < 0);
debug_dma_map_sg(dev, sg, nents, ents, dir);

@@ -204,6 +214,36 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
}
EXPORT_SYMBOL(dma_map_sg_attrs);

+/*
+ * like dma_map_sg_attrs, but returns a negative errno on error (and > 0
+ * on success). This function must be used if PCI P2PDMA pages might
+ * be in the scatterlist.
+ *
+ * On error this function may return:
+ * -ENOMEM indicating that there was not enough resources available and
+ * the transfer may be retried later
+ * -EREMOTEIO indicating that P2PDMA pages were included but cannot
+ * be mapped by the specified device, retries will always fail
+ *
+ * The scatterlist should be unmapped with the regular dma_unmap_sg[_attrs]().
+ */
+int dma_map_sg_p2pdma_attrs(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction dir, unsigned long attrs)
+{
+ int ents;
+
+ ents = __dma_map_sg_attrs(dev, sg, nents, dir,
+ attrs | __DMA_ATTR_PCI_P2PDMA);
+ if (!ents)
+ ents = -ENOMEM;
+
+ if (ents > 0)
+ debug_dma_map_sg(dev, sg, nents, ents, dir);
+
+ return ents;
+}
+EXPORT_SYMBOL_GPL(dma_map_sg_p2pdma_attrs);
+
void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir,
unsigned long attrs)
--
2.20.1

2021-04-08 17:04:19

by Logan Gunthorpe

Subject: [PATCH 10/16] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support

Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
include/linux/dma-map-ops.h | 3 +++
include/linux/dma-mapping.h | 5 +++++
kernel/dma/mapping.c | 18 ++++++++++++++++++
3 files changed, 26 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 51872e736e7b..481892822104 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -12,6 +12,9 @@
struct cma;

struct dma_map_ops {
+ unsigned int flags;
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+
void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 50b8f586cf59..c31980ecca62 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -146,6 +146,7 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
unsigned long attrs);
bool dma_can_mmap(struct device *dev);
int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
int dma_set_mask(struct device *dev, u64 mask);
int dma_set_coherent_mask(struct device *dev, u64 mask);
u64 dma_get_required_mask(struct device *dev);
@@ -247,6 +248,10 @@ static inline int dma_supported(struct device *dev, u64 mask)
{
return 0;
}
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+ return 0;
+}
static inline int dma_set_mask(struct device *dev, u64 mask)
{
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 923089c4267b..ce44a0fcc4e5 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -573,6 +573,24 @@ int dma_supported(struct device *dev, u64 mask)
}
EXPORT_SYMBOL(dma_supported);

+bool dma_pci_p2pdma_supported(struct device *dev)
+{
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ /* if ops is not set, dma direct will be used which supports P2PDMA */
+ if (!ops)
+ return true;
+
+ /*
+ * Note: dma_ops_bypass is not checked here because P2PDMA should
+ * not be used with dma mapping ops that do not have support even
+ * if the specific device is bypassing them.
+ */
+
+ return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
#ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
void arch_dma_set_mask(struct device *dev, u64 mask);
#else
--
2.20.1

2021-04-08 17:04:36

by Logan Gunthorpe

Subject: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

SGL segments that contain PCI bus addresses are marked with
sg_mark_pci_p2pdma() and are ignored when unmapped.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
kernel/dma/direct.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 002268262c9a..108dfb4ecbd5 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
#include <linux/vmalloc.h>
#include <linux/set_memory.h>
#include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
#include "direct.h"

/*
@@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
struct scatterlist *sg;
int i;

- for_each_sg(sgl, sg, nents, i)
+ for_each_sg(sgl, sg, nents, i) {
+ if (sg_is_pci_p2pdma(sg)) {
+ sg_unmark_pci_p2pdma(sg);
+ continue;
+ }
+
dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
attrs);
+ }
}
#endif

int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
{
- int i;
+ struct pci_p2pdma_map_state p2pdma_state = {};
struct scatterlist *sg;
+ int i, ret = 0;

for_each_sg(sgl, sg, nents, i) {
+ if (is_pci_p2pdma_page(sg_page(sg))) {
+ ret = pci_p2pdma_map_segment(&p2pdma_state, dev, sg,
+ attrs);
+ if (ret < 0) {
+ goto out_unmap;
+ } else if (ret) {
+ ret = 0;
+ continue;
+ }
+ }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
if (sg->dma_address == DMA_MAPPING_ERROR)
@@ -411,7 +430,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,

out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
- return 0;
+ return ret;
}

dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
--
2.20.1

2021-04-08 17:04:44

by Logan Gunthorpe

Subject: [PATCH 13/16] nvme-pci: Convert to using dma_map_sg_p2pdma for p2pdma pages

Convert to using dma_map_sg_p2pdma() for PCI p2pdma pages.

This should be equivalent but allows for heterogeneous scatterlists
with both P2PDMA and regular pages. However, P2PDMA support will be
slightly more restricted (only dma-direct and dma-iommu are currently
supported).

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/pci.c | 28 ++++++++--------------------
1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 14f092973792..a1ed07ff38b7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -577,17 +577,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)

}

-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
- if (is_pci_p2pdma_page(sg_page(iod->sg)))
- pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
- rq_dma_dir(req));
- else
- dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -600,7 +589,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)

WARN_ON_ONCE(!iod->nents);

- nvme_unmap_sg(dev, req);
+ dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
iod->first_dma);
@@ -868,14 +857,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
if (!iod->nents)
goto out_free_sg;

- if (is_pci_p2pdma_page(sg_page(iod->sg)))
- nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
- iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
- else
- nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
- rq_dma_dir(req), DMA_ATTR_NO_WARN);
- if (!nr_mapped)
+ nr_mapped = dma_map_sg_p2pdma_attrs(dev->dev, iod->sg, iod->nents,
+ rq_dma_dir(req), DMA_ATTR_NO_WARN);
+ if (nr_mapped < 0) {
+ if (nr_mapped != -ENOMEM)
+ ret = BLK_STS_TARGET;
goto out_free_sg;
+ }

iod->use_sgl = nvme_pci_use_sgls(dev, req);
if (iod->use_sgl)
@@ -887,7 +875,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
return BLK_STS_OK;

out_unmap_sg:
- nvme_unmap_sg(dev, req);
+ dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
out_free_sg:
mempool_free(iod->sg, dev->iod_mempool);
return ret;
--
2.20.1

2021-04-08 17:05:07

by Logan Gunthorpe

Subject: [PATCH 06/16] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL

Make use of the third free LSB in scatterlist's page_link on 64bit systems.

The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
given SGL segment's dma_address points to a PCI bus address.
dma_unmap_sg_p2pdma() will need to perform different cleanup when a
segment is marked as P2PDMA.

Using this bit requires adding an additional dependency on CONFIG_64BIT to
CONFIG_PCI_P2PDMA. This should be acceptable as the majority of P2PDMA
use cases are restricted to newer root complexes, which typically require
the extra address space for the memory BARs used in the transactions.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/Kconfig | 2 +-
include/linux/scatterlist.h | 49 ++++++++++++++++++++++++++++++++++---
2 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 0c473d75e625..90b4bddb3300 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -163,7 +163,7 @@ config PCI_PASID

config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
- depends on ZONE_DEVICE
+ depends on ZONE_DEVICE && 64BIT
select GENERIC_ALLOCATOR
help
Enables drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 6f70572b2938..5525d3ebf36f 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -58,6 +58,21 @@ struct sg_table {
#define SG_CHAIN 0x01UL
#define SG_END 0x02UL

+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a PCI
+ * bus address when doing P2PDMA.
+ * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
+ */
+
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_PCI_P2PDMA 0x04UL
+#else
+#define SG_PCI_P2PDMA 0x00UL
+#endif
+
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_PCI_P2PDMA)
+
/*
* We overload the LSB of the page pointer to indicate whether it's
* a valid sg entry, or whether it points to the start of a new scatterlist.
@@ -65,8 +80,9 @@ struct sg_table {
*/
#define sg_is_chain(sg) ((sg)->page_link & SG_CHAIN)
#define sg_is_last(sg) ((sg)->page_link & SG_END)
+#define sg_is_pci_p2pdma(sg) ((sg)->page_link & SG_PCI_P2PDMA)
#define sg_chain_ptr(sg) \
- ((struct scatterlist *) ((sg)->page_link & ~(SG_CHAIN | SG_END)))
+ ((struct scatterlist *) ((sg)->page_link & ~SG_PAGE_LINK_MASK))

/**
* sg_assign_page - Assign a given page to an SG entry
@@ -80,13 +96,13 @@ struct sg_table {
**/
static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
{
- unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
+ unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;

/*
* In order for the low bit stealing approach to work, pages
* must be aligned at a 32-bit boundary as a minimum.
*/
- BUG_ON((unsigned long) page & (SG_CHAIN | SG_END));
+ BUG_ON((unsigned long) page & SG_PAGE_LINK_MASK);
#ifdef CONFIG_DEBUG_SG
BUG_ON(sg_is_chain(sg));
#endif
@@ -120,7 +136,7 @@ static inline struct page *sg_page(struct scatterlist *sg)
#ifdef CONFIG_DEBUG_SG
BUG_ON(sg_is_chain(sg));
#endif
- return (struct page *)((sg)->page_link & ~(SG_CHAIN | SG_END));
+ return (struct page *)((sg)->page_link & ~SG_PAGE_LINK_MASK);
}

/**
@@ -222,6 +238,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
sg->page_link &= ~SG_END;
}

+/**
+ * sg_mark_pci_p2pdma - Mark the scatterlist entry for PCI p2pdma
+ * @sg: SG entry
+ *
+ * Description:
+ * Marks the passed in sg entry to indicate that the dma_address is
+ * a PCI bus address.
+ **/
+static inline void sg_mark_pci_p2pdma(struct scatterlist *sg)
+{
+ sg->page_link |= SG_PCI_P2PDMA;
+}
+
+/**
+ * sg_unmark_pci_p2pdma - Unmark the scatterlist entry for PCI p2pdma
+ * @sg: SG entry
+ *
+ * Description:
+ * Clears the PCI P2PDMA mark
+ **/
+static inline void sg_unmark_pci_p2pdma(struct scatterlist *sg)
+{
+ sg->page_link &= ~SG_PCI_P2PDMA;
+}
+
/**
* sg_phys - Return physical address of an sg entry
* @sg: SG entry
--
2.20.1

2021-04-08 17:05:19

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH 08/16] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
implementations. It takes a scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, -EREMOTEIO is returned.

This helper uses a state structure to track changes to the pgmap
across calls and avoid looking up the pgmap in the xarray for every
page.

Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
dma_map_sg() implementations where the sg segment containing the page
differs from the sg segment containing the DMA address.
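The state-caching idea behind pci_p2pdma_map_segment() can be sketched in
isolation. This is a hedged userspace illustration, not the patch's code:
`fake_pgmap`, `map_state` and `expensive_lookup()` are hypothetical
stand-ins for the pgmap, pci_p2pdma_map_state and the xarray lookup,
chosen only to show that the lookup runs once per run of segments sharing
a pgmap rather than once per segment:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch (not kernel code). */
struct fake_pgmap { int map_type; };

struct map_state {
	struct fake_pgmap *pgmap;	/* last pgmap seen, NULL initially */
	int map;			/* cached mapping type */
};

static int lookup_count;	/* stands in for the xarray lookup cost */

static int expensive_lookup(struct fake_pgmap *pgmap)
{
	lookup_count++;
	return pgmap->map_type;
}

/* Mirrors the shape of pci_p2pdma_map_segment(): only consult the
 * expensive lookup when the segment's pgmap differs from the cached one. */
static int map_segment(struct map_state *state, struct fake_pgmap *pgmap)
{
	if (state->pgmap != pgmap) {
		state->pgmap = pgmap;
		state->map = expensive_lookup(pgmap);
	}
	return state->map;
}
```

A caller declares the state on the stack, zero-initialized, outside the
segment loop, exactly as the helper's kernel-doc describes.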

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2pdma.c | 65 ++++++++++++++++++++++++++++++++++++++
include/linux/pci-p2pdma.h | 21 ++++++++++++
2 files changed, 86 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 38c93f57a941..44ad7664e875 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -923,6 +923,71 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
}
EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);

+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared on the stack outside of
+ * the for_each_sg() loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ * @attrs: dma mapping attributes
+ *
+ * This is a helper to be used by non-iommu dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in an is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns 1 if the segment was mapped, 0 if the segment should be mapped
+ * directly (or through the IOMMU) and -EREMOTEIO if the segment should not
+ * be mapped at all.
+ */
+int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
+ struct device *dev, struct scatterlist *sg,
+ unsigned long dma_attrs)
+{
+ if (state->pgmap != sg_page(sg)->pgmap) {
+ state->pgmap = sg_page(sg)->pgmap;
+ state->map = pci_p2pdma_map_type(state->pgmap, dev, dma_attrs);
+ state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+ }
+
+ switch (state->map) {
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ sg->dma_address = sg_phys(sg) + state->bus_off;
+ sg_dma_len(sg) = sg->length;
+ sg_mark_pci_p2pdma(sg);
+ return 1;
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ return 0;
+ default:
+ return -EREMOTEIO;
+ }
+}
+
+/**
+ * pci_p2pdma_map_bus_segment - map an sg segment predetermined to
+ * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
+ * @pg_sg: scatterlist segment with the page to map
+ * @dma_sg: scatterlist segment to assign a dma address to
+ *
+ * This is a helper for iommu dma_map_sg() implementations when the
+ * segment for the dma address differs from the segment containing the
+ * source page.
+ *
+ * pci_p2pdma_map_type() must have already been called on the pg_sg and
+ * returned PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+ struct scatterlist *dma_sg)
+{
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
+
+ dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
+ sg_dma_len(dma_sg) = pg_sg->length;
+ sg_mark_pci_p2pdma(dma_sg);
+}
+
/**
* pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
* to enable p2pdma
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index a06072ac3a52..49e7679403cf 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -13,6 +13,12 @@

#include <linux/pci.h>

+struct pci_p2pdma_map_state {
+ struct dev_pagemap *pgmap;
+ int map;
+ u64 bus_off;
+};
+
struct block_device;
struct scatterlist;

@@ -43,6 +49,11 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
+int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
+ struct device *dev, struct scatterlist *sg,
+ unsigned long dma_attrs);
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+ struct scatterlist *dma_sg);
int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -109,6 +120,16 @@ static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
unsigned long attrs)
{
}
+static inline int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
+ struct device *dev, struct scatterlist *sg,
+ unsigned long dma_attrs)
+{
+ return 0;
+}
+static inline void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+ struct scatterlist *dma_sg)
+{
+}
static inline int pci_p2pdma_enable_store(const char *page,
struct pci_dev **p2p_dev, bool *use_p2pdma)
{
--
2.20.1

2021-04-08 17:05:23

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH 07/16] PCI/P2PDMA: Make pci_p2pdma_map_type() non-static

pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
implementation, which must determine the mapping type ahead of time in
order to create the IOMMU mapping.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2pdma.c | 34 +++++++++++++++++++++++-----------
include/linux/pci-p2pdma.h | 15 +++++++++++++++
2 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index bcb1a6d6119d..38c93f57a941 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -20,13 +20,6 @@
#include <linux/seq_buf.h>
#include <linux/xarray.h>

-enum pci_p2pdma_map_type {
- PCI_P2PDMA_MAP_UNKNOWN = 0,
- PCI_P2PDMA_MAP_NOT_SUPPORTED,
- PCI_P2PDMA_MAP_BUS_ADDR,
- PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
@@ -822,13 +815,30 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);

-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
- struct device *dev)
+/**
+ * pci_p2pdma_map_type - return the type of mapping that should be used for
+ * a given device and pgmap
+ * @pgmap: the pagemap of a page to determine the mapping type for
+ * @dev: device that is mapping the page
+ * @dma_attrs: the attributes passed to the dma_map operation --
+ * this is so they can be checked to ensure P2PDMA pages were not
+ * introduced into an incorrect interface (like dma_map_sg).
+ *
+ * Returns one of:
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
+ * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done directly
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+ struct device *dev, unsigned long dma_attrs)
{
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
enum pci_p2pdma_map_type ret;
struct pci_dev *client;

+ WARN_ONCE(!(dma_attrs & __DMA_ATTR_PCI_P2PDMA),
+ "PCI P2PDMA pages were mapped with dma_map_sg!");
+
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;

@@ -879,7 +889,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
struct pci_p2pdma_pagemap *p2p_pgmap =
to_p2p_pgmap(sg_page(sg)->pgmap);

- switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
+ switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev,
+ __DMA_ATTR_PCI_P2PDMA)) {
case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
case PCI_P2PDMA_MAP_BUS_ADDR:
@@ -904,7 +915,8 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
{
enum pci_p2pdma_map_type map_type;

- map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
+ map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev,
+ __DMA_ATTR_PCI_P2PDMA);

if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..a06072ac3a52 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,13 @@
struct block_device;
struct scatterlist;

+enum pci_p2pdma_map_type {
+ PCI_P2PDMA_MAP_UNKNOWN = 0,
+ PCI_P2PDMA_MAP_NOT_SUPPORTED,
+ PCI_P2PDMA_MAP_BUS_ADDR,
+ PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
#ifdef CONFIG_PCI_P2PDMA
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
@@ -30,6 +37,8 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
unsigned int *nents, u32 length);
void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+ struct device *dev, unsigned long dma_attrs);
int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
@@ -83,6 +92,12 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
}
+static inline enum pci_p2pdma_map_type pci_p2pdma_map_type(
+ struct dev_pagemap *pgmap, struct device *dev,
+ unsigned long dma_attrs)
+{
+ return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
struct scatterlist *sg, int nents, enum dma_data_direction dir,
unsigned long attrs)
--
2.20.1

2021-04-08 17:06:01

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH 12/16] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA

Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.
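The flag-to-callback conversion follows a common ops-table pattern, which
can be sketched in a hedged userspace form. The `fake_*` names below are
hypothetical stand-ins for nvme_ctrl_ops and dma_pci_p2pdma_supported();
the point is that an optional callback lets support depend on runtime
(DMA ops) state, with a NULL callback meaning "not supported":

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch (not the nvme code). */
struct fake_ctrl;

struct fake_ctrl_ops {
	/* Optional: NULL means P2PDMA is never supported. */
	int (*supports_p2pdma)(struct fake_ctrl *ctrl);
};

struct fake_ctrl {
	const struct fake_ctrl_ops *ops;
	int dma_ops_ok;	/* stands in for dma_pci_p2pdma_supported() */
};

/* A PCI transport implements the callback by querying its DMA ops. */
static int pci_supports_p2pdma(struct fake_ctrl *ctrl)
{
	return ctrl->dma_ops_ok;
}

/* Core code checks the callback exists before calling it, as the
 * nvme_alloc_ns() hunk in this patch does. */
static int ctrl_supports_p2pdma(struct fake_ctrl *ctrl)
{
	return ctrl->ops->supports_p2pdma &&
	       ctrl->ops->supports_p2pdma(ctrl);
}
```

Compared with a static feature flag, the answer can now vary per device
rather than being fixed at ops-table definition time.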

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/core.c | 3 ++-
drivers/nvme/host/nvme.h | 2 +-
drivers/nvme/host/pci.c | 11 +++++++++--
3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0896e21642be..223419454516 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3907,7 +3907,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);

blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
- if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+ if (ctrl->ops->supports_pci_p2pdma &&
+ ctrl->ops->supports_pci_p2pdma(ctrl))
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);

ns->queue->queuedata = ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 07b34175c6ce..9c04df982d2c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -473,7 +473,6 @@ struct nvme_ctrl_ops {
unsigned int flags;
#define NVME_F_FABRICS (1 << 0)
#define NVME_F_METADATA_SUPPORTED (1 << 1)
-#define NVME_F_PCI_P2PDMA (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -481,6 +480,7 @@ struct nvme_ctrl_ops {
void (*submit_async_event)(struct nvme_ctrl *ctrl);
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+ bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
};

#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7249ae74f71f..14f092973792 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2759,17 +2759,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
}

+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+ struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+ return dma_pci_p2pdma_supported(dev->dev);
+}
+
static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name = "pcie",
.module = THIS_MODULE,
- .flags = NVME_F_METADATA_SUPPORTED |
- NVME_F_PCI_P2PDMA,
+ .flags = NVME_F_METADATA_SUPPORTED,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32 = nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
.free_ctrl = nvme_pci_free_ctrl,
.submit_async_event = nvme_pci_submit_async_event,
.get_address = nvme_pci_get_address,
+ .supports_pci_p2pdma = nvme_pci_supports_pci_p2pdma,
};

static int nvme_dev_map(struct nvme_dev *dev)
--
2.20.1

2021-04-27 19:24:45

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

On Thu, Apr 08, 2021 at 11:01:12AM -0600, Logan Gunthorpe wrote:
> dma_map_sg() either returns a positive number indicating the number
> of entries mapped or zero indicating that resources were not available
> to create the mapping. When zero is returned, it is always safe to retry
> the mapping later once resources have been freed.
>
> Once P2PDMA pages are mixed into the SGL there may be pages that may
> never be successfully mapped with a given device because that device may
> not actually be able to access those pages. Thus, multiple error
> conditions will need to be distinguished to determine whether a retry
> is safe.
>
> Introduce dma_map_sg_p2pdma[_attrs]() with a different calling
> convention from dma_map_sg(). The function will return a positive
> integer on success or a negative errno on failure.
>
> ENOMEM will be used to indicate a resource failure and EREMOTEIO to
> indicate that a P2PDMA page is not mappable.
>
> The __DMA_ATTR_PCI_P2PDMA attribute is introduced to inform the lower
> level implementations that P2PDMA pages are allowed and to warn if a
> caller introduces them into the regular dma_map_sg() interface.

So this new API is all about being able to return an error code
because auditing the old API is basically terrifying?

OK, but why name everything new P2PDMA? It seems nicer to give this
some generic name and have some general program to gradually deprecate
normal non-error-capable dma_map_sg() ?

I think that will raise less questions when subsystem people see the
changes, as I was wondering why RW was being moved to use what looked
like a p2pdma only API.

dma_map_sg_or_err() would have been clearer

The flag is also clearer as to the purpose if it is named
__DMA_ATTR_ERROR_ALLOWED

Jason

2021-04-27 19:30:36

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

On Thu, Apr 08, 2021 at 11:01:07AM -0600, Logan Gunthorpe wrote:
> Hi,
>
> This patchset continues my work to add P2PDMA support to the common
> dma map operations. This allows for creating SGLs that have both P2PDMA
> and regular pages which is a necessary step to allowing P2PDMA pages in
> userspace.
>
> The earlier RFC[1] generated a lot of great feedback and I heard no show
> stopping objections. Thus, I've incorporated all the feedback and have
> decided to post this as a proper patch series with hopes of eventually
> getting it in mainline.
>
> I'm happy to do a few more passes if anyone has any further feedback
> or better ideas.

For the user of the DMA API the idea seems reasonable enough, the next
steps to integrate with pin_user_pages() seem fairly straightforward
too

Was there no feedback on this at all?

Jason

2021-04-27 19:33:27

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

On Thu, Apr 08, 2021 at 11:01:12AM -0600, Logan Gunthorpe wrote:
> +/*
> + * dma_maps_sg_attrs returns 0 on error and > 0 on success.
> + * It should never return a value < 0.
> + */

Also it is weird a function that can't return 0 is returning an int type

> +int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
> + enum dma_data_direction dir, unsigned long attrs)
> +{
> + int ents;
> +
> + ents = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
> BUG_ON(ents < 0);

if (WARN_ON(ents < 0))
return 0;

instead of bug on?

Also, I see only 8 users of this function. How about just fix them all
to support negative returns and use this as the p2p API instead of
adding new API?

Add the opposite logic flag, 'DMA_ATTRS_NO_ERROR' and pass it through
the other api entry callers that can't handle it?

Jason

2021-04-27 19:35:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

On Thu, Apr 08, 2021 at 11:01:16AM -0600, Logan Gunthorpe wrote:
> Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
> PCI P2PDMA pages directly without a hack in the callers. This allows
> for heterogeneous SGLs that contain both P2PDMA and regular pages.
>
> SGL segments that contain PCI bus addresses are marked with
> sg_mark_pci_p2pdma() and are ignored when unmapped.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> kernel/dma/direct.c | 25 ++++++++++++++++++++++---
> 1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 002268262c9a..108dfb4ecbd5 100644
> +++ b/kernel/dma/direct.c
> @@ -13,6 +13,7 @@
> #include <linux/vmalloc.h>
> #include <linux/set_memory.h>
> #include <linux/slab.h>
> +#include <linux/pci-p2pdma.h>
> #include "direct.h"
>
> /*
> @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
> struct scatterlist *sg;
> int i;
>
> - for_each_sg(sgl, sg, nents, i)
> + for_each_sg(sgl, sg, nents, i) {
> + if (sg_is_pci_p2pdma(sg)) {
> + sg_unmark_pci_p2pdma(sg);

This doesn't seem nice, the DMA layer should only alter the DMA
portion of the SG, not the other portions. Is it necessary?

Jason

2021-04-27 19:41:50

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

On Tue, Apr 27, 2021 at 04:33:51PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:16AM -0600, Logan Gunthorpe wrote:
> > Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
> > PCI P2PDMA pages directly without a hack in the callers. This allows
> > for heterogeneous SGLs that contain both P2PDMA and regular pages.
> >
> > SGL segments that contain PCI bus addresses are marked with
> > sg_mark_pci_p2pdma() and are ignored when unmapped.
> >
> > Signed-off-by: Logan Gunthorpe <[email protected]>
> > kernel/dma/direct.c | 25 ++++++++++++++++++++++---
> > 1 file changed, 22 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> > index 002268262c9a..108dfb4ecbd5 100644
> > +++ b/kernel/dma/direct.c
> > @@ -13,6 +13,7 @@
> > #include <linux/vmalloc.h>
> > #include <linux/set_memory.h>
> > #include <linux/slab.h>
> > +#include <linux/pci-p2pdma.h>
> > #include "direct.h"
> >
> > /*
> > @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
> > struct scatterlist *sg;
> > int i;
> >
> > - for_each_sg(sgl, sg, nents, i)
> > + for_each_sg(sgl, sg, nents, i) {
> > + if (sg_is_pci_p2pdma(sg)) {
> > + sg_unmark_pci_p2pdma(sg);
>
> This doesn't seem nice, the DMA layer should only alter the DMA
> portion of the SG, not the other portions. Is it necessary?

Oh, I got it completely wrong what this is for.

This should be named sg_dma_mark_pci_p2p() and similar for other
functions to make it clear it is part of the DMA side of the SG
interface (eg it is like sg_dma_address, sg_dma_len, etc)

Jason

2021-04-27 19:45:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 11/16] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

On Thu, Apr 08, 2021 at 11:01:18AM -0600, Logan Gunthorpe wrote:
> When a PCI P2PDMA page is seen, set the IOVA length of the segment
> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
> apply the appropriate bus address to the segment. The IOVA is not
> created if the scatterlist only consists of P2PDMA pages.

I expect P2P to work with systems that use ATS, so we'd want to see
those systems have the IOMMU programmed with the bus address.

Is it OK like this because the other logic prohibits all PCI cases
that would lean on the IOMMU, like ATS, hairpinning through the root
port, or transiting the root complex?

If yes, the code deserves a big comment explaining this is incomplete,
and I'd want to know we can finish this to include ATS at least based
on this series.

Jason

2021-04-27 19:52:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/16] nvme-rdma: Ensure dma support when using p2pdma

On Thu, Apr 08, 2021 at 11:01:21AM -0600, Logan Gunthorpe wrote:
> Ensure the dma operations support p2pdma before using the RDMA
> device for P2PDMA. This allows switching the RDMA driver from
> pci_p2pdma_map_sg() to dma_map_sg_p2pdma().
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> drivers/nvme/target/rdma.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 6c1f3ab7649c..3ec7e77e5416 100644
> +++ b/drivers/nvme/target/rdma.c
> @@ -414,7 +414,8 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
> if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
> goto out_free_rsp;
>
> - if (!ib_uses_virt_dma(ndev->device))
> + if (!ib_uses_virt_dma(ndev->device) &&
> + dma_pci_p2pdma_supported(&ndev->device->dev))

ib_uses_virt_dma() should not be called by nvme and this is using the
wrong device pointer to query for DMA related properties.

I suspect this wants a ib_dma_pci_p2p_dma_supported() wrapper like
everything else.

Jason

2021-04-27 20:24:33

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

On 4/27/21 12:28 PM, Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:07AM -0600, Logan Gunthorpe wrote:
>> Hi,
>>
>> This patchset continues my work to add P2PDMA support to the common
>> dma map operations. This allows for creating SGLs that have both P2PDMA
>> and regular pages which is a necessary step to allowing P2PDMA pages in
>> userspace.
>>
>> The earlier RFC[1] generated a lot of great feedback and I heard no show
>> stopping objections. Thus, I've incorporated all the feedback and have
>> decided to post this as a proper patch series with hopes of eventually
>> getting it in mainline.
>>
>> I'm happy to do a few more passes if anyone has any further feedback
>> or better ideas.
>
> For the user of the DMA API the idea seems reasonable enough, the next
> steps to integrate with pin_user_pages() seem fairly straightforward
> too
>
> Was there no feedback on this at all?
>

oops, I meant to review this a lot sooner, because this whole p2pdma thing is
actually very interesting and important...somehow it slipped but I'll take
a look now.

thanks,
--
John Hubbard
NVIDIA

2021-04-27 20:50:12

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

On Tue, Apr 27, 2021 at 1:22 PM John Hubbard <[email protected]> wrote:
>
> On 4/27/21 12:28 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 08, 2021 at 11:01:07AM -0600, Logan Gunthorpe wrote:
> >> Hi,
> >>
> >> This patchset continues my work to add P2PDMA support to the common
> >> dma map operations. This allows for creating SGLs that have both P2PDMA
> >> and regular pages which is a necessary step to allowing P2PDMA pages in
> >> userspace.
> >>
> >> The earlier RFC[1] generated a lot of great feedback and I heard no show
> >> stopping objections. Thus, I've incorporated all the feedback and have
> >> decided to post this as a proper patch series with hopes of eventually
> >> getting it in mainline.
> >>
> >> I'm happy to do a few more passes if anyone has any further feedback
> >> or better ideas.
> >
> > For the user of the DMA API the idea seems reasonable enough, the next
> > steps to integrate with pin_user_pages() seem fairly straightforward
> > too
> >
> > Was there no feedback on this at all?
> >
>
> oops, I meant to review this a lot sooner, because this whole p2pdma thing is
> actually very interesting and important...somehow it slipped but I'll take
> a look now.

Still in my queue as well behind Joao's memmap consolidation series,
and a recent copy_mc_to_iter() fix series from Al.

2021-04-27 22:52:10

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()



On 2021-04-27 1:22 p.m., Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:12AM -0600, Logan Gunthorpe wrote:
>> dma_map_sg() either returns a positive number indicating the number
>> of entries mapped or zero indicating that resources were not available
>> to create the mapping. When zero is returned, it is always safe to retry
>> the mapping later once resources have been freed.
>>
>> Once P2PDMA pages are mixed into the SGL there may be pages that may
>> never be successfully mapped with a given device because that device may
>> not actually be able to access those pages. Thus, multiple error
>> conditions will need to be distinguished to determine whether a retry
>> is safe.
>>
>> Introduce dma_map_sg_p2pdma[_attrs]() with a different calling
>> convention from dma_map_sg(). The function will return a positive
>> integer on success or a negative errno on failure.
>>
>> ENOMEM will be used to indicate a resource failure and EREMOTEIO to
>> indicate that a P2PDMA page is not mappable.
>>
>> The __DMA_ATTR_PCI_P2PDMA attribute is introduced to inform the lower
>> level implementations that P2PDMA pages are allowed and to warn if a
>> caller introduces them into the regular dma_map_sg() interface.
>
> So this new API is all about being able to return an error code
> because auditing the old API is basically terrifying?
>
> OK, but why name everything new P2PDMA? It seems nicer to give this
> some generic name and have some general program to gradually deprecate
> normal non-error-capable dma_map_sg() ?
>
> I think that will raise less questions when subsystem people see the
> changes, as I was wondering why RW was being moved to use what looked
> like a p2pdma only API.
>
> dma_map_sg_or_err() would have been clearer
>
> The flag is also clearer as to the purpose if it is named
> __DMA_ATTR_ERROR_ALLOWED

I'm not opposed to these names. I can use them for v2 if there are no
other opinions.

Logan

2021-04-27 22:57:49

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg



On 2021-04-27 1:40 p.m., Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 04:33:51PM -0300, Jason Gunthorpe wrote:
>> On Thu, Apr 08, 2021 at 11:01:16AM -0600, Logan Gunthorpe wrote:
>>> Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
>>> PCI P2PDMA pages directly without a hack in the callers. This allows
>>> for heterogeneous SGLs that contain both P2PDMA and regular pages.
>>>
>>> SGL segments that contain PCI bus addresses are marked with
>>> sg_mark_pci_p2pdma() and are ignored when unmapped.
>>>
>>> Signed-off-by: Logan Gunthorpe <[email protected]>
>>> kernel/dma/direct.c | 25 ++++++++++++++++++++++---
>>> 1 file changed, 22 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>>> index 002268262c9a..108dfb4ecbd5 100644
>>> +++ b/kernel/dma/direct.c
>>> @@ -13,6 +13,7 @@
>>> #include <linux/vmalloc.h>
>>> #include <linux/set_memory.h>
>>> #include <linux/slab.h>
>>> +#include <linux/pci-p2pdma.h>
>>> #include "direct.h"
>>>
>>> /*
>>> @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
>>> struct scatterlist *sg;
>>> int i;
>>>
>>> - for_each_sg(sgl, sg, nents, i)
>>> + for_each_sg(sgl, sg, nents, i) {
>>> + if (sg_is_pci_p2pdma(sg)) {
>>> + sg_unmark_pci_p2pdma(sg);
>>
>> This doesn't seem nice, the DMA layer should only alter the DMA
>> portion of the SG, not the other portions. Is it necessary?
>
> Oh, I got it completely wrong what this is for.
>
> This should be named sg_dma_mark_pci_p2p() and similar for other
> functions to make it clear it is part of the DMA side of the SG
> interface (eg it is like sg_dma_address, sg_dma_len, etc)

Fair point. Yes, I'll rename this for the next version.

Logan

2021-04-27 23:00:32

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()



On 2021-04-27 1:31 p.m., Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:12AM -0600, Logan Gunthorpe wrote:
>> +/*
>> + * dma_maps_sg_attrs returns 0 on error and > 0 on success.
>> + * It should never return a value < 0.
>> + */
>
> Also it is weird a function that can't return 0 is returning an int type

Yes, Christoph mentioned in the last series that this should probably
change to an unsigned but I wasn't really sure if that change should be
a part of the P2PDMA series.

>> +int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
>> + enum dma_data_direction dir, unsigned long attrs)
>> +{
>> + int ents;
>> +
>> + ents = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
>> BUG_ON(ents < 0);
>
> if (WARN_ON(ents < 0))
> return 0;
>
> instead of bug on?

It was BUG_ON in the original code. So I felt I should leave it.

> Also, I see only 8 users of this function. How about just fix them all
> to support negative returns and use this as the p2p API instead of
> adding new API?

Well there might be 8 users of dma_map_sg_attrs() but there are a very
large number of dma_map_sg(). Seems odd to me to single out the first as
requiring these changes, but leave the latter.

> Add the opposite logic flag, 'DMA_ATTRS_NO_ERROR' and pass it through
> the other api entry callers that can't handle it?

I'm not that opposed to this. But it will make this series a fair bit
longer to change the 8 map_sg_attrs() usages.

Logan

2021-04-27 23:01:26

by Logan Gunthorpe

Subject: Re: [PATCH 11/16] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg



On 2021-04-27 1:43 p.m., Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:18AM -0600, Logan Gunthorpe wrote:
>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>> apply the appropriate bus address to the segment. The IOVA is not
>> created if the scatterlist only consists of P2PDMA pages.
>
> I expect P2P to work with systems that use ATS, so we'd want to see
> those systems have the IOMMU programmed with the bus address.

Oh, the paragraph you quote isn't quite as clear as it could be. The bus
address is only used in specific circumstances depending on how the
P2PDMA core code figures how the addresses should be mapped (see the
documentation for upstream_bridge_distance()). The P2PDMA code
currently doesn't have any provisions for ATS (I haven't had access to
any such hardware) but I'm sure it wouldn't be too hard to add.

Logan

2021-04-27 23:03:37

by Jason Gunthorpe

Subject: Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

On Tue, Apr 27, 2021 at 04:55:45PM -0600, Logan Gunthorpe wrote:

> > Also, I see only 8 users of this function. How about just fix them all
> > to support negative returns and use this as the p2p API instead of
> > adding new API?
>
> Well there might be 8 users of dma_map_sg_attrs() but there are a very
> large number of dma_map_sg(). Seems odd to me to single out the first as
> requiring these changes, but leave the latter.

At a high level I'm OK with it. dma_map_sg_attrs() is the extra
extended version of dma_map_sg(), it already has a different
signature, a different return code is not out of the question.

dma_map_sg() is just the simple easy to use interface that can't do
advanced stuff.

> I'm not that opposed to this. But it will make this series a fair bit
> longer to change the 8 map_sg_attrs() usages.

Yes, but the result seems much nicer to not grow the DMA API further.

Jason

2021-04-27 23:04:14

by Logan Gunthorpe

Subject: Re: [PATCH 14/16] nvme-rdma: Ensure dma support when using p2pdma



On 2021-04-27 1:47 p.m., Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:21AM -0600, Logan Gunthorpe wrote:
>> Ensure the dma operations support p2pdma before using the RDMA
>> device for P2PDMA. This allows switching the RDMA driver from
>> pci_p2pdma_map_sg() to dma_map_sg_p2pdma().
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> drivers/nvme/target/rdma.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>> index 6c1f3ab7649c..3ec7e77e5416 100644
>> +++ b/drivers/nvme/target/rdma.c
>> @@ -414,7 +414,8 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
>> if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
>> goto out_free_rsp;
>>
>> - if (!ib_uses_virt_dma(ndev->device))
>> + if (!ib_uses_virt_dma(ndev->device) &&
>> + dma_pci_p2pdma_supported(&ndev->device->dev))
>
> ib_uses_virt_dma() should not be called by nvme and this is using the
> wrong device pointer to query for DMA related properties.
>
> I suspect this wants a ib_dma_pci_p2p_dma_supported() wrapper like
> everything else.

Makes sense. Will add for v2.

Logan

2021-05-02 01:41:03

by John Hubbard

Subject: Re: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> Hi,
>
> This patchset continues my work to add P2PDMA support to the common
> dma map operations. This allows for creating SGLs that have both P2PDMA
> and regular pages which is a necessary step to allowing P2PDMA pages in
> userspace.
>
> The earlier RFC[1] generated a lot of great feedback and I heard no show
> stopping objections. Thus, I've incorporated all the feedback and have
> decided to post this as a proper patch series with hopes of eventually
> getting it in mainline.
>
> I'm happy to do a few more passes if anyone has any further feedback
> or better ideas.
>

After an initial pass through these, I think I like the approach. And I
don't have any huge structural comments or new ideas, just smaller comments
and notes.

I'll respond to each patch, but just wanted to say up front that this is
looking promising, in my opinion.


thanks,
--
John Hubbard
NVIDIA

2021-05-02 04:16:29

by John Hubbard

Subject: Re: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> In order to call upstream_bridge_distance_warn() from a dma_map function,
> it must not sleep. The only reason it does sleep is to allocate the seqbuf
> to print which devices are within the ACS path.
>
> Switch the kmalloc call to use a passed in gfp_mask and don't print that
> message if the buffer fails to be allocated.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Acked-by: Bjorn Helgaas <[email protected]>
> ---
> drivers/pci/p2pdma.c | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 196382630363..bd89437faf06 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -267,7 +267,7 @@ static int pci_bridge_has_acs_redir(struct pci_dev *pdev)
>
> static void seq_buf_print_bus_devfn(struct seq_buf *buf, struct pci_dev *pdev)
> {
> - if (!buf)
> + if (!buf || !buf->buffer)

This is not great, sort of from an overall design point of view, even though
it makes the rest of the patch work. See below for other ideas, that will
avoid the need for this sort of odd point fix.

> return;
>
> seq_buf_printf(buf, "%s;", pci_name(pdev));
> @@ -495,25 +495,26 @@ upstream_bridge_distance(struct pci_dev *provider, struct pci_dev *client,
>
> static enum pci_p2pdma_map_type
> upstream_bridge_distance_warn(struct pci_dev *provider, struct pci_dev *client,
> - int *dist)
> + int *dist, gfp_t gfp_mask)
> {
> struct seq_buf acs_list;
> bool acs_redirects;
> int ret;
>
> - seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE);
> - if (!acs_list.buffer)
> - return -ENOMEM;

Another odd thing: this used to check for memory failure and just give
up, and now it doesn't. Yes, I realize that it all still works at the
moment, but this is quirky and we shouldn't stop here.

Instead, a cleaner approach would be to push the memory allocation
slightly higher up the call stack, out to the
pci_p2pdma_distance_many(). So pci_p2pdma_distance_many() should make
the kmalloc() call, and fail out if it can't get a page for the seq_buf
buffer. Then you don't have to do all this odd stuff.

Furthermore, the call sites can then decide for themselves which GFP
flags, GFP_ATOMIC or GFP_KERNEL or whatever they want for kmalloc().

A related thing: this whole exercise would go better if there were a
preparatory patch or two that changed the return codes in this file to
something less crazy. There are too many functions that can fail, but
are treated as if they sort-of-mostly-would-never-fail, in the hopes of
using the return value directly for counting and such. This is badly
mistaken, and it leads developers to try to avoid returning -ENOMEM
(which is what we need here).

Really, these functions should all be doing "0 for success, -ERRNO for
failure, and pass other values, including results, in the arg list".
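To illustrate that convention on a toy function (every name below is made up for the example, not the real p2pdma API): the callee takes a caller-owned buffer and returns 0/-ERRNO, and the caller alone decides the allocation policy.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for upstream_bridge_distance_warn(): 0 on success, -ERRNO
 * on failure, and the actual result comes back via the dist argument. */
static int bridge_distance(const char *provider, const char *client,
			   int *dist, char *acs_buf, size_t buf_len)
{
	if (!provider || !client || !dist)
		return -EINVAL;

	*dist = 2;			/* pretend we walked the hierarchy */
	if (acs_buf && buf_len)
		snprintf(acs_buf, buf_len, "%s;%s", provider, client);
	return 0;
}

/* The caller owns the allocation, so it alone picks the allocation policy
 * (the kernel analogue of choosing GFP_KERNEL vs GFP_ATOMIC), and a failed
 * allocation fails the whole call up front. */
static int distance_many(const char *provider, const char *client, int *dist)
{
	char *buf = malloc(128);
	int ret;

	if (!buf)
		return -ENOMEM;
	ret = bridge_distance(provider, client, dist, buf, 128);
	free(buf);
	return ret;
}
```

With this shape there is no need for the callee to tolerate a NULL seq_buf buffer at all.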


> + seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, gfp_mask), PAGE_SIZE);
>
> ret = upstream_bridge_distance(provider, client, dist, &acs_redirects,
> &acs_list);
> if (acs_redirects) {
> pci_warn(client, "ACS redirect is set between the client and provider (%s)\n",
> pci_name(provider));
> - /* Drop final semicolon */
> - acs_list.buffer[acs_list.len-1] = 0;
> - pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
> - acs_list.buffer);
> +
> + if (acs_list.buffer) {
> + /* Drop final semicolon */
> + acs_list.buffer[acs_list.len - 1] = 0;
> + pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
> + acs_list.buffer);
> + }
> }
>
> if (ret == PCI_P2PDMA_MAP_NOT_SUPPORTED) {
> @@ -566,7 +567,7 @@ int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
>
> if (verbose)
> ret = upstream_bridge_distance_warn(provider,
> - pci_client, &distance);
> + pci_client, &distance, GFP_KERNEL);
> else
> ret = upstream_bridge_distance(provider, pci_client,
> &distance, NULL, NULL);
>

thanks,
--
John Hubbard
NVIDIA

2021-05-02 21:26:57

by John Hubbard

Subject: Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> dma_map_sg() either returns a positive number indicating the number
> of entries mapped or zero indicating that resources were not available
> to create the mapping. When zero is returned, it is always safe to retry
> the mapping later once resources have been freed.
>
> Once P2PDMA pages are mixed into the SGL there may be pages that may
> never be successfully mapped with a given device because that device may
> not actually be able to access those pages. Thus, multiple error
> conditions will need to be distinguished to determine whether a retry
> is safe.
>
> Introduce dma_map_sg_p2pdma[_attrs]() with a different calling
> convention from dma_map_sg(). The function will return a positive
> integer on success or a negative errno on failure.
>
> ENOMEM will be used to indicate a resource failure and EREMOTEIO to
> indicate that a P2PDMA page is not mappable.
>
> The __DMA_ATTR_PCI_P2PDMA attribute is introduced to inform the lower
> level implementations that P2PDMA pages are allowed and to warn if a
> caller introduces them into the regular dma_map_sg() interface.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> include/linux/dma-mapping.h | 15 +++++++++++
> kernel/dma/mapping.c | 52 ++++++++++++++++++++++++++++++++-----
> 2 files changed, 61 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 2a984cb4d1e0..50b8f586cf59 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -60,6 +60,12 @@
> * at least read-only at lesser-privileged levels).
> */
> #define DMA_ATTR_PRIVILEGED (1UL << 9)
> +/*
> + * __DMA_ATTR_PCI_P2PDMA: This should not be used directly, use
> + * dma_map_sg_p2pdma() instead. Used internally to indicate that the
> + * caller is using the dma_map_sg_p2pdma() interface.
> + */
> +#define __DMA_ATTR_PCI_P2PDMA (1UL << 10)
>

As mentioned near the top of this file,
Documentation/core-api/dma-attributes.rst also needs to be updated, for
this new item.


> /*
> * A dma_addr_t can hold any valid DMA or bus address for the platform. It can
> @@ -107,6 +113,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
> enum dma_data_direction dir, unsigned long attrs);
> int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
> enum dma_data_direction dir, unsigned long attrs);
> +int dma_map_sg_p2pdma_attrs(struct device *dev, struct scatterlist *sg,
> + int nents, enum dma_data_direction dir, unsigned long attrs);
> void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir,
> unsigned long attrs);
> @@ -160,6 +168,12 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> {
> return 0;
> }
> +static inline int dma_map_sg_p2pdma_attrs(struct device *dev,
> + struct scatterlist *sg, int nents, enum dma_data_direction dir,
> + unsigned long attrs)
> +{
> + return 0;
> +}
> static inline void dma_unmap_sg_attrs(struct device *dev,
> struct scatterlist *sg, int nents, enum dma_data_direction dir,
> unsigned long attrs)
> @@ -392,6 +406,7 @@ static inline void dma_sync_sgtable_for_device(struct device *dev,
> #define dma_map_single(d, a, s, r) dma_map_single_attrs(d, a, s, r, 0)
> #define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, 0)
> #define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0)
> +#define dma_map_sg_p2pdma(d, s, n, r) dma_map_sg_p2pdma_attrs(d, s, n, r, 0)

This hunk is fine, of course.

But, about pre-existing issues: note to self, or to anyone: send a patch to turn
these into inline functions. The macro redirection here is not adding value, but
it does make things just a little bit worse.


> #define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, 0)
> #define dma_map_page(d, p, o, s, r) dma_map_page_attrs(d, p, o, s, r, 0)
> #define dma_unmap_page(d, a, s, r) dma_unmap_page_attrs(d, a, s, r, 0)
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index b6a633679933..923089c4267b 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -177,12 +177,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
> }
> EXPORT_SYMBOL(dma_unmap_page_attrs);
>
> -/*
> - * dma_maps_sg_attrs returns 0 on error and > 0 on success.
> - * It should never return a value < 0.
> - */

It would be better to leave the comment in place, given the non-standard
return values. However, looking around here, it would be better if we go
with the standard -ERRNO for error, and >0 for success.

There are pre-existing BUG_ON() and WARN_ON_ONCE() items that are partly
an attempt to compensate for not being able to return proper -ERRNO
codes. For example, this:

BUG_ON(!valid_dma_direction(dir));

...arguably should be more like this:

if (WARN_ON_ONCE(!valid_dma_direction(dir)))
return -EINVAL;
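For reference, the retry semantics the commit message spells out can be modeled in a few lines (a toy userspace model, not the real interface; the function names are invented for the example):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of the proposed contract: > 0 mapped entries on success,
 * -ENOMEM when resources ran out (retrying later is safe), -EREMOTEIO
 * when the device can never reach the P2PDMA pages (retries always fail). */
static int toy_map_sg_p2pdma(int nents, bool no_resources, bool unreachable)
{
	if (unreachable)
		return -EREMOTEIO;
	if (no_resources)
		return -ENOMEM;
	return nents;
}

/* A caller distinguishing the two failure modes. */
static bool should_retry(int ret)
{
	return ret == -ENOMEM;
}
```

This is exactly the distinction the old 0-on-error convention cannot express, which is why the new entry point returns a negative errno.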


> -int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
> - enum dma_data_direction dir, unsigned long attrs)
> +static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> + int nents, enum dma_data_direction dir, unsigned long attrs)
> {
> const struct dma_map_ops *ops = get_dma_ops(dev);
> int ents;
> @@ -197,6 +193,20 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
> ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
> else
> ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +
> + return ents;
> +}
> +
> +/*
> + * dma_maps_sg_attrs returns 0 on error and > 0 on success.
> + * It should never return a value < 0.
> + */
> +int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
> + enum dma_data_direction dir, unsigned long attrs)
> +{
> + int ents;

Pre-existing note, feel free to ignore: the ents and nents names in the
same routine are way too close to each other. Maybe
using "requested_nents", or "nents_arg", for the incoming value, would
help.

> +
> + ents = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
> BUG_ON(ents < 0);
> debug_dma_map_sg(dev, sg, nents, ents, dir);
>
> @@ -204,6 +214,36 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
> }
> EXPORT_SYMBOL(dma_map_sg_attrs);
>
> +/*
> + * like dma_map_sg_attrs, but returns a negative errno on error (and > 0
> + * on success). This function must be used if PCI P2PDMA pages might
> + * be in the scatterlist.

Let's turn this into a kernel doc comment block, seeing as how it clearly
wants to be--you're almost there already. You've even reinvented @Return,
below. :)

> + *
> + * On error this function may return:
> + * -ENOMEM indicating that there was not enough resources available and
> + * the transfer may be retried later
> + * -EREMOTEIO indicating that P2PDMA pages were included but cannot
> + * be mapped by the specified device, retries will always fail
> + *
> + * The scatterlist should be unmapped with the regular dma_unmap_sg[_attrs]().

How about:

"The scatterlist should be unmapped via dma_unmap_sg[_attrs]()."

> + */
> +int dma_map_sg_p2pdma_attrs(struct device *dev, struct scatterlist *sg,
> + int nents, enum dma_data_direction dir, unsigned long attrs)
> +{
> + int ents;
> +
> + ents = __dma_map_sg_attrs(dev, sg, nents, dir,
> + attrs | __DMA_ATTR_PCI_P2PDMA);
> + if (!ents)
> + ents = -ENOMEM;
> +
> + if (ents > 0)
> + debug_dma_map_sg(dev, sg, nents, ents, dir);
> +
> + return ents;
> +}
> +EXPORT_SYMBOL_GPL(dma_map_sg_p2pdma_attrs);
> +
> void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir,
> unsigned long attrs)
>

thanks,
--
John Hubbard
NVIDIA

2021-05-02 22:35:05

by John Hubbard

Subject: Re: [PATCH 06/16] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> Make use of the third free LSB in scatterlist's page_link on 64bit systems.
>
> The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
> given SGL segment's dma_address points to a PCI bus address.
> dma_unmap_sg_p2pdma() will need to perform different cleanup when a
> segment is marked as P2PDMA.
>
> Using this bit requires adding an additional dependency on CONFIG_64BIT to
> CONFIG_PCI_P2PDMA. This should be acceptable as the majority of P2PDMA
> use cases are restricted to newer root complexes and roughly require the
> extra address space for memory BARs used in the transactions.

Totally agree with the CONFIG_64BIT call.

Also, I have failed to find anything wrong with this patch. :)

Reviewed-by: John Hubbard <[email protected]>
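As an aside, the pointer bit-stealing this patch extends is easy to demonstrate in isolation (a userspace sketch with uintptr_t standing in for page_link; the pack/unpack helper names are invented here):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* The three stolen low bits of page_link, as in the patch. */
#define SG_CHAIN          0x01UL
#define SG_END            0x02UL
#define SG_PCI_P2PDMA     0x04UL
#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_PCI_P2PDMA)

/* Pack a pointer and flags into one word; the pointer must be at least
 * 8-byte aligned so bit 2 is genuinely free, which is where the new
 * CONFIG_64BIT dependency comes from. */
static uintptr_t pack_page_link(void *page, uintptr_t flags)
{
	assert(((uintptr_t)page & SG_PAGE_LINK_MASK) == 0);
	return (uintptr_t)page | (flags & SG_PAGE_LINK_MASK);
}

static void *unpack_page(uintptr_t page_link)
{
	return (void *)(page_link & ~SG_PAGE_LINK_MASK);
}
```

Masking with SG_PAGE_LINK_MASK everywhere is what lets sg_page() and sg_chain_ptr() stay correct once the third bit is in use.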

thanks,
--
John Hubbard
NVIDIA

>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/Kconfig | 2 +-
> include/linux/scatterlist.h | 49 ++++++++++++++++++++++++++++++++++---
> 2 files changed, 46 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 0c473d75e625..90b4bddb3300 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -163,7 +163,7 @@ config PCI_PASID
>
> config PCI_P2PDMA
> bool "PCI peer-to-peer transfer support"
> - depends on ZONE_DEVICE
> + depends on ZONE_DEVICE && 64BIT
> select GENERIC_ALLOCATOR
> help
> Enables drivers to do PCI peer-to-peer transactions to and from
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index 6f70572b2938..5525d3ebf36f 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -58,6 +58,21 @@ struct sg_table {
> #define SG_CHAIN 0x01UL
> #define SG_END 0x02UL
>
> +/*
> + * bit 2 is the third free bit in the page_link on 64bit systems which
> + * is used by dma_unmap_sg() to determine if the dma_address is a PCI
> + * bus address when doing P2PDMA.
> + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
> + */
> +
> +#ifdef CONFIG_PCI_P2PDMA
> +#define SG_PCI_P2PDMA 0x04UL
> +#else
> +#define SG_PCI_P2PDMA 0x00UL
> +#endif
> +
> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_PCI_P2PDMA)
> +
> /*
> * We overload the LSB of the page pointer to indicate whether it's
> * a valid sg entry, or whether it points to the start of a new scatterlist.
> @@ -65,8 +80,9 @@ struct sg_table {
> */
> #define sg_is_chain(sg) ((sg)->page_link & SG_CHAIN)
> #define sg_is_last(sg) ((sg)->page_link & SG_END)
> +#define sg_is_pci_p2pdma(sg) ((sg)->page_link & SG_PCI_P2PDMA)
> #define sg_chain_ptr(sg) \
> - ((struct scatterlist *) ((sg)->page_link & ~(SG_CHAIN | SG_END)))
> + ((struct scatterlist *) ((sg)->page_link & ~SG_PAGE_LINK_MASK))
>
> /**
> * sg_assign_page - Assign a given page to an SG entry
> @@ -80,13 +96,13 @@ struct sg_table {
> **/
> static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
> {
> - unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
> + unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;
>
> /*
> * In order for the low bit stealing approach to work, pages
> * must be aligned at a 32-bit boundary as a minimum.
> */
> - BUG_ON((unsigned long) page & (SG_CHAIN | SG_END));
> + BUG_ON((unsigned long) page & SG_PAGE_LINK_MASK);
> #ifdef CONFIG_DEBUG_SG
> BUG_ON(sg_is_chain(sg));
> #endif
> @@ -120,7 +136,7 @@ static inline struct page *sg_page(struct scatterlist *sg)
> #ifdef CONFIG_DEBUG_SG
> BUG_ON(sg_is_chain(sg));
> #endif
> - return (struct page *)((sg)->page_link & ~(SG_CHAIN | SG_END));
> + return (struct page *)((sg)->page_link & ~SG_PAGE_LINK_MASK);
> }
>
> /**
> @@ -222,6 +238,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
> sg->page_link &= ~SG_END;
> }
>
> +/**
> + * sg_mark_pci_p2pdma - Mark the scatterlist entry for PCI p2pdma
> + * @sg: SG entry
> + *
> + * Description:
> + * Marks the passed in sg entry to indicate that the dma_address is
> + * a PCI bus address.
> + **/
> +static inline void sg_mark_pci_p2pdma(struct scatterlist *sg)
> +{
> + sg->page_link |= SG_PCI_P2PDMA;
> +}
> +
> +/**
> + * sg_unmark_pci_p2pdma - Unmark the scatterlist entry for PCI p2pdma
> + * @sg: SG entry
> + *
> + * Description:
> + * Clears the PCI P2PDMA mark
> + **/
> +static inline void sg_unmark_pci_p2pdma(struct scatterlist *sg)
> +{
> + sg->page_link &= ~SG_PCI_P2PDMA;
> +}
> +
> /**
> * sg_phys - Return physical address of an sg entry
> * @sg: SG entry
>

2021-05-02 22:53:39

by John Hubbard

Subject: Re: [PATCH 07/16] PCI/P2PDMA: Make pci_p2pdma_map_type() non-static

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
> implementation because it will need to determine the mapping type
> ahead of actually doing the mapping to create the actual iommu mapping.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/p2pdma.c | 34 +++++++++++++++++++++++-----------
> include/linux/pci-p2pdma.h | 15 +++++++++++++++
> 2 files changed, 38 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index bcb1a6d6119d..38c93f57a941 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -20,13 +20,6 @@
> #include <linux/seq_buf.h>
> #include <linux/xarray.h>
>
> -enum pci_p2pdma_map_type {
> - PCI_P2PDMA_MAP_UNKNOWN = 0,
> - PCI_P2PDMA_MAP_NOT_SUPPORTED,
> - PCI_P2PDMA_MAP_BUS_ADDR,
> - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
> -};
> -
> struct pci_p2pdma {
> struct gen_pool *pool;
> bool p2pmem_published;
> @@ -822,13 +815,30 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
> }
> EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
>
> -static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
> - struct device *dev)
> +/**
> + * pci_p2pdma_map_type - return the type of mapping that should be used for
> + * a given device and pgmap
> + * @pgmap: the pagemap of a page to determine the mapping type for
> + * @dev: device that is mapping the page
> + * @dma_attrs: the attributes passed to the dma_map operation --
> + * this is so they can be checked to ensure P2PDMA pages were not
> + * introduced into an incorrect interface (like dma_map_sg).
> + *
> + * Returns one of:
> + * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
> + * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
> + * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done directly
> + */
> +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
> + struct device *dev, unsigned long dma_attrs)
> {
> struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
> enum pci_p2pdma_map_type ret;
> struct pci_dev *client;
>
> + WARN_ONCE(!(dma_attrs & __DMA_ATTR_PCI_P2PDMA),
> + "PCI P2PDMA pages were mapped with dma_map_sg!");

This really ought to also return -EINVAL, assuming that my review suggestions
about return types, in earlier patches, are acceptable.

> +
> if (!provider->p2pdma)
> return PCI_P2PDMA_MAP_NOT_SUPPORTED;
>
> @@ -879,7 +889,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> struct pci_p2pdma_pagemap *p2p_pgmap =
> to_p2p_pgmap(sg_page(sg)->pgmap);
>
> - switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
> + switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev,
> + __DMA_ATTR_PCI_P2PDMA)) {
> case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
> case PCI_P2PDMA_MAP_BUS_ADDR:
> @@ -904,7 +915,8 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> {
> enum pci_p2pdma_map_type map_type;
>
> - map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
> + map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev,
> + __DMA_ATTR_PCI_P2PDMA);

These areas might end up looking a bit different, if my suggestion about
applying pci_dev type safety throughout are accepted.

The patch looks generally correct, aside from these details.

thanks,
--
John Hubbard
NVIDIA

>
> if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
> dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 8318a97c9c61..a06072ac3a52 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -16,6 +16,13 @@
> struct block_device;
> struct scatterlist;
>
> +enum pci_p2pdma_map_type {
> + PCI_P2PDMA_MAP_UNKNOWN = 0,
> + PCI_P2PDMA_MAP_NOT_SUPPORTED,
> + PCI_P2PDMA_MAP_BUS_ADDR,
> + PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
> +};
> +
> #ifdef CONFIG_PCI_P2PDMA
> int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> u64 offset);
> @@ -30,6 +37,8 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
> unsigned int *nents, u32 length);
> void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
> void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
> +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
> + struct device *dev, unsigned long dma_attrs);
> int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir, unsigned long attrs);
> void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> @@ -83,6 +92,12 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
> static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
> {
> }
> +static inline enum pci_p2pdma_map_type pci_p2pdma_map_type(
> + struct dev_pagemap *pgmap, struct device *dev,
> + unsigned long dma_attrs)
> +{
> + return PCI_P2PDMA_MAP_NOT_SUPPORTED;
> +}
> static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
> struct scatterlist *sg, int nents, enum dma_data_direction dir,
> unsigned long attrs)
>

2021-05-02 23:00:47

by John Hubbard

Subject: Re: [PATCH 08/16] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
> implementations. It takes a scatterlist segment that must point to a
> pci_p2pdma struct page and will map it if the mapping requires a bus
> address.
>
> The return value indicates whether the mapping required a bus address
> or whether the caller still needs to map the segment normally. If the
> segment should not be mapped, -EREMOTEIO is returned.
>
> This helper uses a state structure to track the changes to the
> pgmap across calls and avoid needing to lookup into the xarray for
> every page.
>
> Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
> dma_map_sg() implementations where the sg segment containing the page
> differs from the sg segment containing the DMA address.
>

Hard to properly review this patch by itself, because it doesn't show
any callers of the new routine. If you end up shuffling patches and/or
refactoring for other reasons, it would be nice if the next version of
the series included a caller here. In particular, the new
pci_p2pdma_map_state concept is something I want to double-check, to
see if it hits any common pitfalls. I'm sure it doesn't, but still. :)

Meanwhile, I'll keep working through the series, and come back to this
one when I have seen the callers.
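For what it's worth, my reading of the state-caching idea boils down to the following (toy model; the real types and the xarray lookup are elided, and all names are invented for the example):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of pci_p2pdma_map_state: remember the last pgmap seen so the
 * expensive type lookup runs once per run of segments sharing a pgmap,
 * not once per segment. */
struct toy_map_state {
	const void *pgmap;		/* last pgmap seen */
	int map;			/* cached mapping type */
};

static int toy_lookups;			/* counts the slow path */

static int toy_map_type(const void *pgmap)
{
	(void)pgmap;
	toy_lookups++;			/* stands in for the xarray lookup */
	return 1;			/* pretend: bus-address mapping */
}

static int toy_map_segment(struct toy_map_state *state, const void *pgmap)
{
	if (state->pgmap != pgmap) {
		state->pgmap = pgmap;
		state->map = toy_map_type(pgmap);
	}
	return state->map;
}
```

Since SGLs built from P2PDMA allocations tend to have long runs of pages from the same pgmap, the cache hit rate should be high in practice.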

thanks,
--
John Hubbard
NVIDIA

> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/p2pdma.c | 65 ++++++++++++++++++++++++++++++++++++++
> include/linux/pci-p2pdma.h | 21 ++++++++++++
> 2 files changed, 86 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 38c93f57a941..44ad7664e875 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -923,6 +923,71 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> }
> EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>
> +/**
> + * pci_p2pdma_map_segment - map an sg segment determining the mapping type
> + * @state: State structure that should be declared on the stack outside of
> + * the for_each_sg() loop and initialized to zero.
> + * @dev: DMA device that's doing the mapping operation
> + * @sg: scatterlist segment to map
> + * @attrs: dma mapping attributes
> + *
> + * This is a helper to be used by non-iommu dma_map_sg() implementations where
> + * the sg segment is the same for the page_link and the dma_address.
> + *
> + * Attempt to map a single segment in an SGL with the PCI bus address.
> + * The segment must point to a PCI P2PDMA page and thus must be
> + * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
> + *
> + * Returns 1 if the segment was mapped, 0 if the segment should be mapped
> + * directly (or through the IOMMU) and -EREMOTEIO if the segment should not
> + * be mapped at all.
> + */
> +int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
> + struct device *dev, struct scatterlist *sg,
> + unsigned long dma_attrs)
> +{
> + if (state->pgmap != sg_page(sg)->pgmap) {
> + state->pgmap = sg_page(sg)->pgmap;
> + state->map = pci_p2pdma_map_type(state->pgmap, dev, dma_attrs);
> + state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
> + }
> +
> + switch (state->map) {
> + case PCI_P2PDMA_MAP_BUS_ADDR:
> + sg->dma_address = sg_phys(sg) + state->bus_off;
> + sg_dma_len(sg) = sg->length;
> + sg_mark_pci_p2pdma(sg);
> + return 1;
> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> + return 0;
> + default:
> + return -EREMOTEIO;
> + }
> +}
> +
> +/**
> + * pci_p2pdma_map_bus_segment - map an sg segment predetermined to
> + * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
> + * @pg_sg: scatterlist segment with the page to map
> + * @dma_sg: scatterlist segment to assign a dma address to
> + *
> + * This is a helper for iommu dma_map_sg() implementations when the
> + * segment for the dma address differs from the segment containing the
> + * source page.
> + *
> + * pci_p2pdma_map_type() must have already been called on the pg_sg and
> + * returned PCI_P2PDMA_MAP_BUS_ADDR.
> + */
> +void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
> + struct scatterlist *dma_sg)
> +{
> + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
> +
> + dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
> + sg_dma_len(dma_sg) = pg_sg->length;
> + sg_mark_pci_p2pdma(dma_sg);
> +}
> +
> /**
> * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
> * to enable p2pdma
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index a06072ac3a52..49e7679403cf 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -13,6 +13,12 @@
>
> #include <linux/pci.h>
>
> +struct pci_p2pdma_map_state {
> + struct dev_pagemap *pgmap;
> + int map;
> + u64 bus_off;
> +};
> +
> struct block_device;
> struct scatterlist;
>
> @@ -43,6 +49,11 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir, unsigned long attrs);
> void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir, unsigned long attrs);
> +int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
> + struct device *dev, struct scatterlist *sg,
> + unsigned long dma_attrs);
> +void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
> + struct scatterlist *dma_sg);
> int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
> bool *use_p2pdma);
> ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> @@ -109,6 +120,16 @@ static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
> unsigned long attrs)
> {
> }
> +static inline int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
> + struct device *dev, struct scatterlist *sg,
> + unsigned long dma_attrs)
> +{
> + return 0;
> +}
> +static inline void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
> + struct scatterlist *dma_sg)
> +{
> +}
> static inline int pci_p2pdma_enable_store(const char *page,
> struct pci_dev **p2p_dev, bool *use_p2pdma)
> {
>

2021-05-02 23:30:08

by John Hubbard

Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
> PCI P2PDMA pages directly without a hack in the callers. This allows
> for heterogeneous SGLs that contain both P2PDMA and regular pages.
>
> SGL segments that contain PCI bus addresses are marked with
> sg_mark_pci_p2pdma() and are ignored when unmapped.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> kernel/dma/direct.c | 25 ++++++++++++++++++++++---
> 1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 002268262c9a..108dfb4ecbd5 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -13,6 +13,7 @@
> #include <linux/vmalloc.h>
> #include <linux/set_memory.h>
> #include <linux/slab.h>
> +#include <linux/pci-p2pdma.h>
> #include "direct.h"
>
> /*
> @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,

This routine now deserves a little bit of commenting, now that it is
doing less obvious things. How about something like this:

/*
* Unmaps pages, except for PCI_P2PDMA pages, which were never mapped in the
* first place. Instead of unmapping PCI_P2PDMA entries, simply remove the
* SG_PCI_P2PDMA mark
*/
void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs)
{


> struct scatterlist *sg;
> int i;
>
> - for_each_sg(sgl, sg, nents, i)
> + for_each_sg(sgl, sg, nents, i) {
> + if (sg_is_pci_p2pdma(sg)) {
> + sg_unmark_pci_p2pdma(sg);
> + continue;
> + }
> +
> dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
> attrs);
> + }

The same thing can be achieved with fewer lines and a bit more clarity.
Can we please do it like this instead:

	for_each_sg(sgl, sg, nents, i) {
		if (sg_is_pci_p2pdma(sg))
			sg_unmark_pci_p2pdma(sg);
		else
			dma_direct_unmap_page(dev, sg->dma_address,
					      sg_dma_len(sg), dir, attrs);
	}


> }
> #endif
>

Also here, a block comment for the function would be nice. How about
approximately this:

/*
* Maps each SG segment. Returns the number of entries mapped, or 0 upon
* failure. If any entry could not be mapped, then no entries are mapped.
*/

I'll stop complaining about the pre-existing return code conventions,
since by now you know what I was thinking of saying. :)

> int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
> enum dma_data_direction dir, unsigned long attrs)
> {
> - int i;
> + struct pci_p2pdma_map_state p2pdma_state = {};

Is it worth putting this stuff on the stack--is there a noticeable
performance improvement from caching the state? Because if it's
invisible, then simplicity is better. I suspect you're right, and that
it *is* worth it, but it's good to know for real.
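
To make the cost argument concrete, here is a rough userspace model (all names invented, this is not kernel code) of what the on-stack state buys: the expensive per-pgmap lookup (pci_p2pdma_map_type() with its xarray access in the kernel) only runs when consecutive segments come from different pgmaps, which is the common SGL layout:

```c
#include <stddef.h>

/* Toy stand-in for a scatterlist entry; pgmap_id models sg_page(sg)->pgmap. */
struct seg { int pgmap_id; };

static int lookup_calls; /* counts "expensive" map-type lookups */

static int expensive_map_type_lookup(int pgmap_id)
{
	lookup_calls++;
	return pgmap_id % 2;	/* stand-in for the real map type */
}

/*
 * Walk segments the way dma_direct_map_sg() would, caching the last
 * pgmap so a run of segments from one device costs a single lookup.
 */
int count_lookups(const struct seg *sgl, int nents)
{
	int cached_pgmap = -1;
	int i;

	lookup_calls = 0;
	for (i = 0; i < nents; i++) {
		if (sgl[i].pgmap_id != cached_pgmap) {
			cached_pgmap = sgl[i].pgmap_id;
			expensive_map_type_lookup(cached_pgmap);
		}
	}
	return lookup_calls;
}
```

With six segments drawn from two pgmaps in runs, the uncached version would do six lookups while the cached version does one per run.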

> struct scatterlist *sg;
> + int i, ret = 0;
>
> for_each_sg(sgl, sg, nents, i) {
> + if (is_pci_p2pdma_page(sg_page(sg))) {
> + ret = pci_p2pdma_map_segment(&p2pdma_state, dev, sg,
> + attrs);
> + if (ret < 0) {
> + goto out_unmap;
> + } else if (ret) {
> + ret = 0;
> + continue;

Is this a bug? If neither of those "if" branches fires (ret == 0), then
the code (probably unintentionally) falls through and continues on to
attempt to call dma_direct_map_page()--despite being a PCI_P2PDMA page!

See below for suggestions:

> + }
> + }
> +
> sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
> sg->offset, sg->length, dir, attrs);
> if (sg->dma_address == DMA_MAPPING_ERROR)

This is another case in which "continue" is misleading and not as good
as "else". Because unless I'm wrong above, you really only want to take
one path *or* the other.

Also, the "else if (ret)" can be simplified to just setting ret = 0
unconditionally.

Given all that, here's a suggested alternative, which is both shorter
and clearer, IMHO:

	for_each_sg(sgl, sg, nents, i) {
		if (is_pci_p2pdma_page(sg_page(sg))) {
			ret = pci_p2pdma_map_segment(&p2pdma_state, dev, sg,
						     attrs);
			if (ret < 0)
				goto out_unmap;
			else
				ret = 0;
		} else {
			sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
					sg->offset, sg->length, dir, attrs);
			if (sg->dma_address == DMA_MAPPING_ERROR)
				goto out_unmap;
			sg_dma_len(sg) = sg->length;
		}
	}

thanks,
--
John Hubbard
NVIDIA

> @@ -411,7 +430,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
>
> out_unmap:
> dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
> - return 0;
> + return ret;
> }
>
> dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
>


2021-05-02 23:34:12

by John Hubbard

Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

On 5/2/21 4:28 PM, John Hubbard wrote:
> On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
...
>> @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
>
> This routine now deserves a little bit of commenting, now that it is
> doing less obvious things. How about something like this:
>
> /*
> * Unmaps pages, except for PCI_P2PDMA pages, which were never mapped in the
> * first place. Instead of unmapping PCI_P2PDMA entries, simply remove the
> * SG_PCI_P2PDMA mark
> */

I got that kind of wrong. They *were* mapped, but need to be left mostly
alone...maybe you can word it better. Here's my second draft:

/*
* Unmaps pages, except for PCI_P2PDMA pages, which should not be unmapped at
* this point. Instead of unmapping PCI_P2PDMA entries, simply remove the
* SG_PCI_P2PDMA mark.
*/

...am I getting close? :)

thanks,
--
John Hubbard
NVIDIA

2021-05-03 00:36:45

by John Hubbard

Subject: Re: [PATCH 10/16] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> Add a flags member to the dma_map_ops structure with one flag to
> indicate support for PCI P2PDMA.
>
> Also, add a helper to check if a device supports PCI P2PDMA.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> include/linux/dma-map-ops.h | 3 +++
> include/linux/dma-mapping.h | 5 +++++
> kernel/dma/mapping.c | 18 ++++++++++++++++++
> 3 files changed, 26 insertions(+)
>
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 51872e736e7b..481892822104 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -12,6 +12,9 @@
> struct cma;
>
> struct dma_map_ops {
> + unsigned int flags;
> +#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
> +

Can we move this up and out of the struct member area, so that it looks
more like this:

/*
* Values for struct dma_map_ops.flags:
*
* DMA_F_PCI_P2PDMA_SUPPORTED: <documentation here...this is a good place to
* explain exactly what this flag is for.>
*/
#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)

struct dma_map_ops {
unsigned int flags;


> void *(*alloc)(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t gfp,
> unsigned long attrs);
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 50b8f586cf59..c31980ecca62 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -146,6 +146,7 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
> unsigned long attrs);
> bool dma_can_mmap(struct device *dev);
> int dma_supported(struct device *dev, u64 mask);
> +bool dma_pci_p2pdma_supported(struct device *dev);
> int dma_set_mask(struct device *dev, u64 mask);
> int dma_set_coherent_mask(struct device *dev, u64 mask);
> u64 dma_get_required_mask(struct device *dev);
> @@ -247,6 +248,10 @@ static inline int dma_supported(struct device *dev, u64 mask)
> {
> return 0;
> }
> +static inline bool dma_pci_p2pdma_supported(struct device *dev)
> +{
> + return 0;

Should be:

return false;

> +}
> static inline int dma_set_mask(struct device *dev, u64 mask)
> {
> return -EIO;
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index 923089c4267b..ce44a0fcc4e5 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -573,6 +573,24 @@ int dma_supported(struct device *dev, u64 mask)
> }
> EXPORT_SYMBOL(dma_supported);
>
> +bool dma_pci_p2pdma_supported(struct device *dev)
> +{
> + const struct dma_map_ops *ops = get_dma_ops(dev);
> +
> + /* if ops is not set, dma direct will be used which supports P2PDMA */
> + if (!ops)
> + return true;
> +
> + /*
> + * Note: dma_ops_bypass is not checked here because P2PDMA should
> + * not be used with dma mapping ops that do not have support even
> + * if the specific device is bypassing them.
> + */
> +
> + return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;

Wow, rather unusual combination of things in order decide this. It feels
a bit over-complicated to have flags and ops and a bool function all
dealing with the same 1-bit answer, but there is no caller shown here,
so I'll have to come back to this after reviewing subsequent patches.
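
For reference, the decision itself boils down to two cases, which can be modeled in isolation (a userspace sketch with invented names, not the kernel structures):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified model of the logic in dma_pci_p2pdma_supported(): a NULL
 * ops pointer means dma-direct is in use (which supports P2PDMA);
 * otherwise the answer is a single flag bit in the ops structure.
 * dma_ops_bypass is deliberately not modeled, per the patch's comment.
 */
#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)

struct fake_dma_map_ops { unsigned int flags; };

bool p2pdma_supported(const struct fake_dma_map_ops *ops)
{
	if (!ops)		/* dma-direct path */
		return true;
	return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
}
```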

thanks,
--
John Hubbard
NVIDIA

> +}
> +EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
> +
> #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
> void arch_dma_set_mask(struct device *dev, u64 mask);
> #else
>


2021-05-03 00:55:55

by John Hubbard

Subject: Re: [PATCH 08/16] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
> implementations. It takes a scatterlist segment that must point to a
> pci_p2pdma struct page and will map it if the mapping requires a bus
> address.
>
> The return value indicates whether the mapping required a bus address
> or whether the caller still needs to map the segment normally. If the
> segment should not be mapped, -EREMOTEIO is returned.
>
> This helper uses a state structure to track the changes to the
> pgmap across calls and avoid needing to lookup into the xarray for
> every page.
>

OK, coming back to this patch, after seeing how it is used later in
the series...

> Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
> dma_map_sg() implementations where the sg segment containing the page
> differs from the sg segment containing the DMA address.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/p2pdma.c | 65 ++++++++++++++++++++++++++++++++++++++
> include/linux/pci-p2pdma.h | 21 ++++++++++++
> 2 files changed, 86 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 38c93f57a941..44ad7664e875 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -923,6 +923,71 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> }
> EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>
> +/**
> + * pci_p2pdma_map_segment - map an sg segment determining the mapping type
> + * @state: State structure that should be declared on the stack outside of
> + * the for_each_sg() loop and initialized to zero.

Silly fine point for the docs here: it doesn't actually have to be on
the stack, so I don't think you need to write that constraint in the
documentation. It just has to be somehow allocated and zeroed.


> + * @dev: DMA device that's doing the mapping operation
> + * @sg: scatterlist segment to map
> + * @attrs: dma mapping attributes
> + *
> + * This is a helper to be used by non-iommu dma_map_sg() implementations where
> + * the sg segment is the same for the page_link and the dma_address.
> + *
> + * Attempt to map a single segment in an SGL with the PCI bus address.
> + * The segment must point to a PCI P2PDMA page and thus must be
> + * wrapped in an is_pci_p2pdma_page(sg_page(sg)) check.

Should this be backed up with actual checks in the function, that
the prerequisites are met?

> + *
> + * Returns 1 if the segment was mapped, 0 if the segment should be mapped
> + * directly (or through the IOMMU) and -EREMOTEIO if the segment should not
> + * be mapped at all.
> + */
> +int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
> + struct device *dev, struct scatterlist *sg,
> + unsigned long dma_attrs)
> +{
> + if (state->pgmap != sg_page(sg)->pgmap) {
> + state->pgmap = sg_page(sg)->pgmap;
> + state->map = pci_p2pdma_map_type(state->pgmap, dev, dma_attrs);
> + state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
> + }

I'll quote myself from patch 9, because I had a comment there that actually
was meant for this patch:

Is it worth putting this stuff on the caller's stack? I mean, is there a
noticeable performance improvement from caching the state? Because if
it's invisible, then simplicity is better. I suspect you're right, and
that it *is* worth it, but it's good to know for real.


> +
> + switch (state->map) {
> + case PCI_P2PDMA_MAP_BUS_ADDR:
> + sg->dma_address = sg_phys(sg) + state->bus_off;
> + sg_dma_len(sg) = sg->length;
> + sg_mark_pci_p2pdma(sg);
> + return 1;
> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> + return 0;
> + default:
> + return -EREMOTEIO;
> + }
> +}
> +
> +/**
> + * pci_p2pdma_map_bus_segment - map an sg segment pre determined to
> + * be mapped with PCI_P2PDMA_MAP_BUS_ADDR

Or:

* pci_p2pdma_map_bus_segment - map an SG segment that is already known
* to be mapped with PCI_P2PDMA_MAP_BUS_ADDR

Also, should that prerequisite be backed up with checks in the function?

> + * @pg_sg: scatterlist segment with the page to map
> + * @dma_sg: scatterlist segment to assign a dma address to
> + *
> + * This is a helper for iommu dma_map_sg() implementations when the
> + * segment for the dma address differs from the segment containing the
> + * source page.
> + *
> + * pci_p2pdma_map_type() must have already been called on the pg_sg and
> + * returned PCI_P2PDMA_MAP_BUS_ADDR.

Another prerequisite, so same question: do you think that the code should
also check that this prerequisite is met?

thanks,
--
John Hubbard
NVIDIA

> + */
> +void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
> + struct scatterlist *dma_sg)
> +{
> + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
> +
> + dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
> + sg_dma_len(dma_sg) = pg_sg->length;
> + sg_mark_pci_p2pdma(dma_sg);
> +}
> +
> /**
> * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
> * to enable p2pdma
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index a06072ac3a52..49e7679403cf 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -13,6 +13,12 @@
>
> #include <linux/pci.h>
>
> +struct pci_p2pdma_map_state {
> + struct dev_pagemap *pgmap;
> + int map;
> + u64 bus_off;
> +};
> +
> struct block_device;
> struct scatterlist;
>
> @@ -43,6 +49,11 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir, unsigned long attrs);
> void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir, unsigned long attrs);
> +int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
> + struct device *dev, struct scatterlist *sg,
> + unsigned long dma_attrs);
> +void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
> + struct scatterlist *dma_sg);
> int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
> bool *use_p2pdma);
> ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> @@ -109,6 +120,16 @@ static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
> unsigned long attrs)
> {
> }
> +static inline int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
> + struct device *dev, struct scatterlist *sg,
> + unsigned long dma_attrs)
> +{
> + return 0;
> +}
> +static inline void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
> + struct scatterlist *dma_sg)
> +{
> +}
> static inline int pci_p2pdma_enable_store(const char *page,
> struct pci_dev **p2p_dev, bool *use_p2pdma)
> {
>

2021-05-03 01:17:35

by John Hubbard

Subject: Re: [PATCH 11/16] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> When a PCI P2PDMA page is seen, set the IOVA length of the segment
> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
> apply the appropriate bus address to the segment. The IOVA is not
> created if the scatterlist only consists of P2PDMA pages.
>
> Similar to dma-direct, the sg_mark_pci_p2pdma() flag is used to
> indicate bus address segments. On unmap, P2PDMA segments are skipped
> over when determining the start and end IOVA addresses.
>
> With this change, the flags variable in the dma_map_ops is
> set to DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for
> P2PDMA pages.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/iommu/dma-iommu.c | 66 ++++++++++++++++++++++++++++++++++-----
> 1 file changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index af765c813cc8..ef49635f9819 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -20,6 +20,7 @@
> #include <linux/mm.h>
> #include <linux/mutex.h>
> #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
> #include <linux/swiotlb.h>
> #include <linux/scatterlist.h>
> #include <linux/vmalloc.h>
> @@ -864,6 +865,16 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
> sg_dma_address(s) = DMA_MAPPING_ERROR;
> sg_dma_len(s) = 0;
>
> + if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {

Newbie question: I'm in the dark as to why the !s_iova_len check is there,
can you please enlighten me?

> + if (i > 0)
> + cur = sg_next(cur);
> +
> + pci_p2pdma_map_bus_segment(s, cur);
> + count++;
> + cur_len = 0;
> + continue;
> + }
> +

This is really an if/else condition. And arguably, it would be better
to split out two subroutines, and call one or the other depending on
the result of is_pci_p2pdma_page(), instead of this "continue" approach.

> /*
> * Now fill in the real DMA data. If...
> * - there is a valid output segment to append to
> @@ -961,10 +972,12 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> struct iova_domain *iovad = &cookie->iovad;
> struct scatterlist *s, *prev = NULL;
> int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
> + struct dev_pagemap *pgmap = NULL;
> + enum pci_p2pdma_map_type map_type;
> dma_addr_t iova;
> size_t iova_len = 0;
> unsigned long mask = dma_get_seg_boundary(dev);
> - int i;
> + int i, ret = 0;
>
> if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
> iommu_deferred_attach(dev, domain))
> @@ -993,6 +1006,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> s_length = iova_align(iovad, s_length + s_iova_off);
> s->length = s_length;
>
> + if (is_pci_p2pdma_page(sg_page(s))) {
> + if (sg_page(s)->pgmap != pgmap) {
> + pgmap = sg_page(s)->pgmap;
> + map_type = pci_p2pdma_map_type(pgmap, dev,
> + attrs);
> + }
> +
> + switch (map_type) {
> + case PCI_P2PDMA_MAP_BUS_ADDR:
> + /*
> + * A zero length will be ignored by
> + * iommu_map_sg() and then can be detected
> + * in __finalise_sg() to actually map the
> + * bus address.
> + */
> + s->length = 0;
> + continue;
> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> + break;
> + default:
> + ret = -EREMOTEIO;
> + goto out_restore_sg;
> + }
> + }
> +
> /*
> * Due to the alignment of our single IOVA allocation, we can
> * depend on these assumptions about the segment boundary mask:
> @@ -1015,6 +1053,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> prev = s;
> }
>
> + if (!iova_len)
> + return __finalise_sg(dev, sg, nents, 0);
> +

ohhh, we're really slicing up this function pretty severely, what with the
continue and the early out and several other control flow changes. I think
it would be better to spend some time factoring this function into two
cases, now that you're adding a second case for PCI P2PDMA. Roughly,
two subroutines would do it.

As it is, this leaves behind a routine that is extremely hard to mentally
verify as correct.
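
In case it helps, the shape I have in mind is roughly this (a toy sketch with invented names, just to show the dispatch structure): one loop whose body is a plain if/else handing each segment to a dedicated helper, rather than weaving "continue" statements through shared code:

```c
/* All names here are illustrative, not the real dma-iommu symbols. */
enum seg_kind { SEG_NORMAL, SEG_P2PDMA };

struct demo_seg {
	enum seg_kind kind;
	int handled_by;	/* records which helper processed the segment */
};

static void handle_p2pdma_seg(struct demo_seg *s)
{
	s->handled_by = SEG_P2PDMA;	/* would apply the bus address */
}

static void handle_normal_seg(struct demo_seg *s)
{
	s->handled_by = SEG_NORMAL;	/* would fill in the IOVA mapping */
}

/* One loop, one decision per segment, no fall-through to reason about. */
void map_all(struct demo_seg *sgl, int nents)
{
	for (int i = 0; i < nents; i++) {
		if (sgl[i].kind == SEG_P2PDMA)
			handle_p2pdma_seg(&sgl[i]);
		else
			handle_normal_seg(&sgl[i]);
	}
}
```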


thanks,
--
John Hubbard
NVIDIA

> iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
> if (!iova)
> goto out_restore_sg;
> @@ -1032,13 +1073,13 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> iommu_dma_free_iova(cookie, iova, iova_len, NULL);
> out_restore_sg:
> __invalidate_sg(sg, nents);
> - return 0;
> + return ret;
> }
>
> static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir, unsigned long attrs)
> {
> - dma_addr_t start, end;
> + dma_addr_t end, start = DMA_MAPPING_ERROR;
> struct scatterlist *tmp;
> int i;
>
> @@ -1054,14 +1095,22 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> * The scatterlist segments are mapped into a single
> * contiguous IOVA allocation, so this is incredibly easy.
> */
> - start = sg_dma_address(sg);
> - for_each_sg(sg_next(sg), tmp, nents - 1, i) {
> + for_each_sg(sg, tmp, nents, i) {
> + if (sg_is_pci_p2pdma(tmp)) {
> + sg_unmark_pci_p2pdma(tmp);
> + continue;
> + }
> if (sg_dma_len(tmp) == 0)
> break;
> - sg = tmp;
> +
> + if (start == DMA_MAPPING_ERROR)
> + start = sg_dma_address(tmp);
> +
> + end = sg_dma_address(tmp) + sg_dma_len(tmp);
> }
> - end = sg_dma_address(sg) + sg_dma_len(sg);
> - __iommu_dma_unmap(dev, start, end - start);
> +
> + if (start != DMA_MAPPING_ERROR)
> + __iommu_dma_unmap(dev, start, end - start);
> }
>
> static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
> @@ -1254,6 +1303,7 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
> }
>
> static const struct dma_map_ops iommu_dma_ops = {
> + .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> .alloc = iommu_dma_alloc,
> .free = iommu_dma_free,
> .alloc_pages = dma_common_alloc_pages,
>

2021-05-03 01:39:57

by John Hubbard

Subject: Re: [PATCH 12/16] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
> replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
> flags can be checked for PCI P2PDMA support.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/nvme/host/core.c | 3 ++-
> drivers/nvme/host/nvme.h | 2 +-
> drivers/nvme/host/pci.c | 11 +++++++++--
> 3 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 0896e21642be..223419454516 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3907,7 +3907,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
> blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
>
> blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
> - if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
> + if (ctrl->ops->supports_pci_p2pdma &&
> + ctrl->ops->supports_pci_p2pdma(ctrl))

This is a little excessive, as I suspected. How about providing a
default .supports_pci_p2pdma routine that returns false, so that
the op is always available (non-null)? By "default", maybe that
means either requiring an init_the_ops_struct() routine to be
used, and/or checking all the users of struct nvme_ctrl_ops.

Another idea: maybe you don't really need a bool .supports_pci_p2pdma()
routine at all, because the existing .flags really is about right.
You just need the flags to be filled in dynamically. So, do that
during nvme_pci setup/init time: that's when this module would call
dma_pci_p2pdma_supported().

Actually, I think that second idea simplifies things quite a
bit, but only if it's possible. I haven't worked through the
startup order of calls in nvme_pci.
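
To sketch what the second idea would look like (with invented names; the real flag and init path would differ): the flag bit is computed once at controller-init time from the runtime capability check, and every later consumer just tests the flags word:

```c
#include <stdbool.h>

/* Hypothetical names standing in for NVME_F_PCI_P2PDMA and friends. */
#define NVME_F_PCI_P2PDMA_DEMO (1 << 2)

struct demo_ctrl_ops { unsigned int flags; };

/* Stand-in for dma_pci_p2pdma_supported(dev->dev). */
static bool demo_dma_pci_p2pdma_supported(bool dma_ops_support_p2pdma)
{
	return dma_ops_support_p2pdma;
}

/*
 * Fill in the flag dynamically at setup time, so callers keep the
 * simple "ctrl->ops->flags & NVME_F_PCI_P2PDMA" test and no new
 * function pointer is needed.
 */
void demo_init_ctrl_ops(struct demo_ctrl_ops *ops, bool dma_ops_support_p2pdma)
{
	if (demo_dma_pci_p2pdma_supported(dma_ops_support_p2pdma))
		ops->flags |= NVME_F_PCI_P2PDMA_DEMO;
}
```

The catch, as noted above, is whether the DMA ops are known early enough in the nvme_pci startup sequence for this to work.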

thanks,
--
John Hubbard
NVIDIA

> blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
>
> ns->queue->queuedata = ns;
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 07b34175c6ce..9c04df982d2c 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -473,7 +473,6 @@ struct nvme_ctrl_ops {
> unsigned int flags;
> #define NVME_F_FABRICS (1 << 0)
> #define NVME_F_METADATA_SUPPORTED (1 << 1)
> -#define NVME_F_PCI_P2PDMA (1 << 2)
> int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
> int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
> int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
> @@ -481,6 +480,7 @@ struct nvme_ctrl_ops {
> void (*submit_async_event)(struct nvme_ctrl *ctrl);
> void (*delete_ctrl)(struct nvme_ctrl *ctrl);
> int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
> + bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
> };
>
> #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 7249ae74f71f..14f092973792 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2759,17 +2759,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
> return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
> }
>
> +static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
> +{
> + struct nvme_dev *dev = to_nvme_dev(ctrl);
> +
> + return dma_pci_p2pdma_supported(dev->dev);
> +}
> +
> static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
> .name = "pcie",
> .module = THIS_MODULE,
> - .flags = NVME_F_METADATA_SUPPORTED |
> - NVME_F_PCI_P2PDMA,
> + .flags = NVME_F_METADATA_SUPPORTED,
> .reg_read32 = nvme_pci_reg_read32,
> .reg_write32 = nvme_pci_reg_write32,
> .reg_read64 = nvme_pci_reg_read64,
> .free_ctrl = nvme_pci_free_ctrl,
> .submit_async_event = nvme_pci_submit_async_event,
> .get_address = nvme_pci_get_address,
> + .supports_pci_p2pdma = nvme_pci_supports_pci_p2pdma,
> };
>
> static int nvme_dev_map(struct nvme_dev *dev)
>

2021-05-03 01:40:39

by John Hubbard

Subject: Re: [PATCH 13/16] nvme-pci: Convert to using dma_map_sg_p2pdma for p2pdma pages

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> Convert to using dma_map_sg_p2pdma() for PCI p2pdma pages.
>
> This should be equivalent but allows for heterogeneous scatterlists
> with both P2PDMA and regular pages. However, P2PDMA support will be
> slightly more restricted (only dma-direct and dma-iommu are currently
> supported).
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/nvme/host/pci.c | 28 ++++++++--------------------
> 1 file changed, 8 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 14f092973792..a1ed07ff38b7 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -577,17 +577,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)
>
> }
>
> -static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
> -{
> - struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> -
> - if (is_pci_p2pdma_page(sg_page(iod->sg)))
> - pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
> - rq_dma_dir(req));
> - else
> - dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
> -}
> -
> static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
> {
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> @@ -600,7 +589,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
>
> WARN_ON_ONCE(!iod->nents);
>
> - nvme_unmap_sg(dev, req);
> + dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));


Nice simplification!


> if (iod->npages == 0)
> dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
> iod->first_dma);
> @@ -868,14 +857,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> if (!iod->nents)
> goto out_free_sg;
>
> - if (is_pci_p2pdma_page(sg_page(iod->sg)))
> - nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
> - iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
> - else
> - nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
> - rq_dma_dir(req), DMA_ATTR_NO_WARN);
> - if (!nr_mapped)
> + nr_mapped = dma_map_sg_p2pdma_attrs(dev->dev, iod->sg, iod->nents,
> + rq_dma_dir(req), DMA_ATTR_NO_WARN);
> + if (nr_mapped < 0) {
> + if (nr_mapped != -ENOMEM)
> + ret = BLK_STS_TARGET;
> goto out_free_sg;
> + }

But now the "nr_mapped == 0" case is no longer doing an early out_free_sg.
Is that OK?
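
For clarity, the convention the new call seems to rely on can be modeled like this (illustrative names only): the mapper returns a positive entry count or a negative errno, never 0, so the caller branches on the sign and translates -ENOMEM differently from other errors:

```c
#include <errno.h>

/*
 * Sketch of the error classification the hunk above performs: any
 * negative return frees the SGL, but only non-ENOMEM errors (e.g.
 * -EREMOTEIO for an unmappable P2PDMA segment) become a target error.
 */
enum demo_status { DEMO_STS_OK, DEMO_STS_RESOURCE, DEMO_STS_TARGET };

enum demo_status demo_classify(int nr_mapped)
{
	if (nr_mapped > 0)
		return DEMO_STS_OK;
	if (nr_mapped == -ENOMEM)
		return DEMO_STS_RESOURCE;
	return DEMO_STS_TARGET;
}
```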

>
> iod->use_sgl = nvme_pci_use_sgls(dev, req);
> if (iod->use_sgl)
> @@ -887,7 +875,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> return BLK_STS_OK;
>
> out_unmap_sg:
> - nvme_unmap_sg(dev, req);
> + dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
> out_free_sg:
> mempool_free(iod->sg, dev->iod_mempool);
> return ret;
>

thanks,
--
John Hubbard
NVIDIA

2021-05-03 01:40:55

by John Hubbard

Subject: Re: [PATCH 14/16] nvme-rdma: Ensure dma support when using p2pdma

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> Ensure the dma operations support p2pdma before using the RDMA
> device for P2PDMA. This allows switching the RDMA driver from
> pci_p2pdma_map_sg() to dma_map_sg_p2pdma().

Tentatively, this looks right, but it really should be combined
with a following patch that uses it. Then you don't have to try
to explain, above, why it's needed. :)

>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/nvme/target/rdma.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 6c1f3ab7649c..3ec7e77e5416 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -414,7 +414,8 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
> if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
> goto out_free_rsp;
>
> - if (!ib_uses_virt_dma(ndev->device))
> + if (!ib_uses_virt_dma(ndev->device) &&
> + dma_pci_p2pdma_supported(&ndev->device->dev))
> r->req.p2p_client = &ndev->device->dev;
> r->send_sge.length = sizeof(*r->req.cqe);
> r->send_sge.lkey = ndev->pd->local_dma_lkey;
>

thanks,
--
John Hubbard
NVIDIA

2021-05-03 17:09:08

by Logan Gunthorpe

Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg



On 2021-05-02 5:28 p.m., John Hubbard wrote:
>> @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
>
> This routine now deserves a little bit of commenting, now that it is
> doing less obvious things. How about something like this:
>
> /*
> * Unmaps pages, except for PCI_P2PDMA pages, which were never mapped in the
> * first place. Instead of unmapping PCI_P2PDMA entries, simply remove the
> * SG_PCI_P2PDMA mark
> */
> void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
> int nents, enum dma_data_direction dir, unsigned long attrs)
> {
>

Ok.

>> struct scatterlist *sg;
>> int i;
>>
>> - for_each_sg(sgl, sg, nents, i)
>> + for_each_sg(sgl, sg, nents, i) {
>> + if (sg_is_pci_p2pdma(sg)) {
>> + sg_unmark_pci_p2pdma(sg);
>> + continue;
>> + }
>> +
>> dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
>> attrs);
>> + }
>
> The same thing can be achieved with fewer lines and a bit more clarity.
> Can we please do it like this instead:
>
> for_each_sg(sgl, sg, nents, i) {
> if (sg_is_pci_p2pdma(sg))
> sg_unmark_pci_p2pdma(sg);
> else
> dma_direct_unmap_page(dev, sg->dma_address,
> sg_dma_len(sg), dir, attrs);
> }
>
>

That's debatable (the way I did it emphasizes the common case). But I'll
consider changing it.

>
> Also here, a block comment for the function would be nice. How about
> approximately this:
>
> /*
> * Maps each SG segment. Returns the number of entries mapped, or 0 upon
> * failure. If any entry could not be mapped, then no entries are mapped.
> */
>
> I'll stop complaining about the pre-existing return code conventions,
> since by now you know what I was thinking of saying. :)

Not really part of this patchset... Seems like if you think there should
be a comment like that here, you should send a patch. But this patch
starts returning a negative value here.

>> int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
>> enum dma_data_direction dir, unsigned long attrs)
>> {
>> - int i;
>> + struct pci_p2pdma_map_state p2pdma_state = {};
>
> Is it worth putting this stuff on the stack--is there a noticeable
> performance improvement from caching the state? Because if it's
> invisible, then simplicity is better. I suspect you're right, and that
> it *is* worth it, but it's good to know for real.
>
>> struct scatterlist *sg;
>> + int i, ret = 0;
>>
>> for_each_sg(sgl, sg, nents, i) {
>> + if (is_pci_p2pdma_page(sg_page(sg))) {
>> + ret = pci_p2pdma_map_segment(&p2pdma_state, dev, sg,
>> + attrs);
>> + if (ret < 0) {
>> + goto out_unmap;
>> + } else if (ret) {
>> + ret = 0;
>> + continue;
>
> Is this a bug? If neither of those "if" branches fires (ret == 0), then
> the code (probably unintentionally) falls through and continues on to
> attempt to call dma_direct_map_page()--despite being a PCI_P2PDMA page!

No, it's not a bug. Per the documentation of pci_p2pdma_map_segment(),
if it returns zero the segment should be mapped normally. P2PDMA pages
must be mapped with physical addresses (or IOVA addresses) if the TLPs
for the transaction will go through the host bridge.
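For illustration, the intended three-way caller contract can be modelled in stand-alone C (everything below is an invented stand-in for the sketch, not the real kernel API):

```c
#include <string.h>

/*
 * Stand-in model of the pci_p2pdma_map_segment() return contract only:
 *   1  - segment was mapped with a PCI bus address, skip normal mapping
 *   0  - caller must still map the segment normally (thru host bridge)
 *  <0  - the segment must not be mapped at all
 */
static int p2pdma_map_segment_stub(int outcome)
{
	return outcome;	/* 1, 0 or a negative errno */
}

static const char *map_one_segment(int outcome)
{
	int ret = p2pdma_map_segment_stub(outcome);

	if (ret < 0)
		return "fail";		/* goto out_unmap */
	if (ret > 0)
		return "bus-addr";	/* already mapped, continue */
	/* ret == 0: fall through to the normal dma_direct_map_page() path */
	return "map-normally";
}
```

This is why the "ret == 0 falls through" behaviour in the patch is deliberate rather than a bug: zero is the "map it normally" case.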

> See below for suggestions:
>
>> + }
>> + }
>> +
>> sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
>> sg->offset, sg->length, dir, attrs);
>> if (sg->dma_address == DMA_MAPPING_ERROR)
>
> This is another case in which "continue" is misleading and not as good
> as "else". Because unless I'm wrong above, you really only want to take
> one path *or* the other.

No, per above, it's not one path or the other. If it's a P2PDMA page it
may still need to be mapped normally.

> Also, the "else if (ret)" can be simplified to just setting ret = 0
> unconditionally.

I don't follow. If ret is set, we need to unset it before the end of the
loop.

> Given all that, here's a suggested alternative, which is both shorter
> and clearer, IMHO:
>
> for_each_sg(sgl, sg, nents, i) {
> if (is_pci_p2pdma_page(sg_page(sg))) {
> ret = pci_p2pdma_map_segment(&p2pdma_state, dev, sg,
> attrs);
> if (ret < 0)
> goto out_unmap;
> else
> ret = 0;
> } else {
> sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
> sg->offset, sg->length, dir, attrs);
> if (sg->dma_address == DMA_MAPPING_ERROR)
> goto out_unmap;
> sg_dma_len(sg) = sg->length;
> }
> }

No, per the comments above, this does not accomplish the same thing and
is not correct.

I'll try to add a comment to the code to make it clearer. But the
kernel doc on pci_p2pdma_map_segment() does mention what must be done
for different return values explicitly.

Logan

2021-05-03 17:10:22

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg



On 2021-05-02 5:32 p.m., John Hubbard wrote:
> On 5/2/21 4:28 PM, John Hubbard wrote:
>> On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
> ...
>>> @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
>>
>> This routine now deserves a little bit of commenting, now that it is
>> doing less obvious things. How about something like this:
>>
>> /*
>> * Unmaps pages, except for PCI_P2PDMA pages, which were never mapped in the
>> * first place. Instead of unmapping PCI_P2PDMA entries, simply remove the
>> * SG_PCI_P2PDMA mark
>> */
>
> I got that kind of wrong. They *were* mapped, but need to be left mostly
> alone...maybe you can word it better. Here's my second draft:
>
> /*
> * Unmaps pages, except for PCI_P2PDMA pages, which should not be unmapped at
> * this point. Instead of unmapping PCI_P2PDMA entries, simply remove the
> * SG_PCI_P2PDMA mark.
> */
>
> ...am I getting close? :)

I don't think your original comment was wrong per se. But I guess it
depends on your definition of "mapped". In dma-direct the physical
address is added to the SGL and, on some arches, the address has to be
synced on unmap. With P2PDMA, the PCI bus address is sometimes added to
the SGL and no sync is necessary at the end.

Logan

2021-05-03 17:10:55

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

Oops missed a comment:

On 2021-05-02 5:28 p.m., John Hubbard wrote:
>> int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
>> enum dma_data_direction dir, unsigned long attrs)
>> {
>> - int i;
>> + struct pci_p2pdma_map_state p2pdma_state = {};
>
> Is it worth putting this stuff on the stack--is there a noticeable
> performance improvement from caching the state? Because if it's
> invisible, then simplicity is better. I suspect you're right, and that
> it *is* worth it, but it's good to know for real.

I haven't measured it (it would be hard to measure), but I think it's
fairly clear here. Without the state, xa_load() would need to be called
on *every* page in an SGL that maps only P2PDMA memory from one device.
With the state, it only needs to be called once. xa_load() is cheap, but
it is not that cheap.

There's essentially the same optimization in get_user_pages for
ZONE_DEVICE pages. So, if it is necessary there, it should be necessary
here.
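The effect of the cached state can be modelled in a few lines of stand-alone C (`xa_load_stub()` and the counter stand in for the real xa_load() lookups; all names here are made up for the sketch):

```c
struct pgmap_stub { int map_type; };

struct map_state_stub {
	const struct pgmap_stub *pgmap;
	int map_type;
};

static int lookup_calls;	/* counts the "expensive" lookups */

static int xa_load_stub(const struct pgmap_stub *p)
{
	lookup_calls++;		/* stands in for xa_load() */
	return p->map_type;
}

static int map_type_cached(struct map_state_stub *state,
			   const struct pgmap_stub *p)
{
	/* only re-derive the mapping type when the pgmap changes */
	if (state->pgmap != p) {
		state->pgmap = p;
		state->map_type = xa_load_stub(p);
	}
	return state->map_type;
}
```

An SGL of N pages from a single P2PDMA device costs one lookup instead of N.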

Logan

2021-05-03 17:14:32

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 10/16] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support



On 2021-05-02 6:32 p.m., John Hubbard wrote:
> On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
>> Add a flags member to the dma_map_ops structure with one flag to
>> indicate support for PCI P2PDMA.
>>
>> Also, add a helper to check if a device supports PCI P2PDMA.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> include/linux/dma-map-ops.h | 3 +++
>> include/linux/dma-mapping.h | 5 +++++
>> kernel/dma/mapping.c | 18 ++++++++++++++++++
>> 3 files changed, 26 insertions(+)
>>
>> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
>> index 51872e736e7b..481892822104 100644
>> --- a/include/linux/dma-map-ops.h
>> +++ b/include/linux/dma-map-ops.h
>> @@ -12,6 +12,9 @@
>> struct cma;
>>
>> struct dma_map_ops {
>> + unsigned int flags;
>> +#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
>> +
>
> Can we move this up and out of the struct member area, so that it looks
> more like this:
>
> /*
> * Values for struct dma_map_ops.flags:
> *
> * DMA_F_PCI_P2PDMA_SUPPORTED: <documentation here...this is a good place to
> * explain exactly what this flag is for.>
> */
> #define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
>
> struct dma_map_ops {
> unsigned int flags;
>

Sure, I don't care that much. I was just following the style in nvme.h.

>> void *(*alloc)(struct device *dev, size_t size,
>> dma_addr_t *dma_handle, gfp_t gfp,
>> unsigned long attrs);
>> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
>> index 50b8f586cf59..c31980ecca62 100644
>> --- a/include/linux/dma-mapping.h
>> +++ b/include/linux/dma-mapping.h
>> @@ -146,6 +146,7 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
>> unsigned long attrs);
>> bool dma_can_mmap(struct device *dev);
>> int dma_supported(struct device *dev, u64 mask);
>> +bool dma_pci_p2pdma_supported(struct device *dev);
>> int dma_set_mask(struct device *dev, u64 mask);
>> int dma_set_coherent_mask(struct device *dev, u64 mask);
>> u64 dma_get_required_mask(struct device *dev);
>> @@ -247,6 +248,10 @@ static inline int dma_supported(struct device *dev, u64 mask)
>> {
>> return 0;
>> }
>> +static inline bool dma_pci_p2pdma_supported(struct device *dev)
>> +{
>> + return 0;
>
> Should be:
>
> return false;

Yup, will fix.

>> +}
>> static inline int dma_set_mask(struct device *dev, u64 mask)
>> {
>> return -EIO;
>> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
>> index 923089c4267b..ce44a0fcc4e5 100644
>> --- a/kernel/dma/mapping.c
>> +++ b/kernel/dma/mapping.c
>> @@ -573,6 +573,24 @@ int dma_supported(struct device *dev, u64 mask)
>> }
>> EXPORT_SYMBOL(dma_supported);
>>
>> +bool dma_pci_p2pdma_supported(struct device *dev)
>> +{
>> + const struct dma_map_ops *ops = get_dma_ops(dev);
>> +
>> + /* if ops is not set, dma direct will be used which supports P2PDMA */
>> + if (!ops)
>> + return true;
>> +
>> + /*
>> + * Note: dma_ops_bypass is not checked here because P2PDMA should
>> + * not be used with dma mapping ops that do not have support even
>> + * if the specific device is bypassing them.
>> + */
>> +
>> + return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
>
> Wow, rather unusual combination of things in order decide this. It feels
> a bit over-complicated to have flags and ops and a bool function all
> dealing with the same 1-bit answer, but there is no caller shown here,
> so I'll have to come back to this after reviewing subsequent patches.

Yeah, I originally had it much simpler, but it confused Ira and it was
clear it had to be written more explicitly and commented better.

Logan

2021-05-03 17:18:25

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 08/16] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations



On 2021-05-02 6:50 p.m., John Hubbard wrote:
> On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
>> Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
>> implementations. It takes an scatterlist segment that must point to a
>> pci_p2pdma struct page and will map it if the mapping requires a bus
>> address.
>>
>> The return value indicates whether the mapping required a bus address
>> or whether the caller still needs to map the segment normally. If the
>> segment should not be mapped, -EREMOTEIO is returned.
>>
>> This helper uses a state structure to track the changes to the
>> pgmap across calls and avoid needing to lookup into the xarray for
>> every page.
>>
>
> OK, coming back to this patch, after seeing how it is used later in
> the series...
>
>> Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
>> dma_map_sg() implementations where the sg segment containing the page
>> differs from the sg segment containing the DMA address.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> drivers/pci/p2pdma.c | 65 ++++++++++++++++++++++++++++++++++++++
>> include/linux/pci-p2pdma.h | 21 ++++++++++++
>> 2 files changed, 86 insertions(+)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index 38c93f57a941..44ad7664e875 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -923,6 +923,71 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>> }
>> EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>>
>> +/**
>> + * pci_p2pdma_map_segment - map an sg segment determining the mapping type
>> + * @state: State structure that should be declared on the stack outside of
>> + * the for_each_sg() loop and initialized to zero.
>
> Silly fine point for the docs here: it doesn't actually have to be on
> the stack, so I don't think you need to write that constraint in the
> documentation. It just has be be somehow allocated and zeroed.

Yeah, that's true, but there's really no reason it would ever not be
allocated on the stack.

>
>> + * @dev: DMA device that's doing the mapping operation
>> + * @sg: scatterlist segment to map
>> + * @attrs: dma mapping attributes
>> + *
>> + * This is a helper to be used by non-iommu dma_map_sg() implementations where
>> + * the sg segment is the same for the page_link and the dma_address.
>> + *
>> + * Attempt to map a single segment in an SGL with the PCI bus address.
>> + * The segment must point to a PCI P2PDMA page and thus must be
>> + * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
>
> Should this be backed up with actual checks in the function, that
> the prerequisites are met?

I think that would be unnecessary. All callers are going to call this
inside an is_pci_p2pdma_page() check, otherwise it would slow down the
fast path.

>> + *
>> + * Returns 1 if the segment was mapped, 0 if the segment should be mapped
>> + * directly (or through the IOMMU) and -EREMOTEIO if the segment should not
>> + * be mapped at all.
>> + */
>> +int pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
>> + struct device *dev, struct scatterlist *sg,
>> + unsigned long dma_attrs)
>> +{
>> + if (state->pgmap != sg_page(sg)->pgmap) {
>> + state->pgmap = sg_page(sg)->pgmap;
>> + state->map = pci_p2pdma_map_type(state->pgmap, dev, dma_attrs);
>> + state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
>> + }
>
> I'll quote myself from patch 9, because I had a comment there that actually
> was meant for this patch:
>
> Is it worth putting this stuff on the caller's stack? I mean, is there a
> noticeable performance improvement from caching the state? Because if
> it's invisible, then simplicity is better. I suspect you're right, and
> that it *is* worth it, but it's good to know for real.

Yeah, I responded to this in another email.

>
>> +
>> + switch (state->map) {
>> + case PCI_P2PDMA_MAP_BUS_ADDR:
>> + sg->dma_address = sg_phys(sg) + state->bus_off;
>> + sg_dma_len(sg) = sg->length;
>> + sg_mark_pci_p2pdma(sg);
>> + return 1;
>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>> + return 0;
>> + default:
>> + return -EREMOTEIO;
>> + }
>> +}
>> +
>> +/**
>> + * pci_p2pdma_map_bus_segment - map an sg segment pre determined to
>> + * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
>
> Or:
>
> * pci_p2pdma_map_bus_segment - map an SG segment that is already known
> * to be mapped with PCI_P2PDMA_MAP_BUS_ADDR
>
> Also, should that prerequisite be backed up with checks in the function?

No, this function only really exists for the needs of iommu_dma_map_sg().

>> + * @pg_sg: scatterlist segment with the page to map
>> + * @dma_sg: scatterlist segment to assign a dma address to
>> + *
>> + * This is a helper for iommu dma_map_sg() implementations when the
>> + * segment for the dma address differs from the segment containing the
>> + * source page.
>> + *
>> + * pci_p2pdma_map_type() must have already been called on the pg_sg and
>> + * returned PCI_P2PDMA_MAP_BUS_ADDR.
>
> Another prerequisite, so same question: do you think that the code should
> also check that this prerequisite is met?

Again, no, simply because it's this way because of what's required by
iommu_dma.

Logan

2021-05-03 17:19:14

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 12/16] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA



On 2021-05-02 7:29 p.m., John Hubbard wrote:
> On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
>> Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
>> replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
>> flags can be checked for PCI P2PDMA support.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> drivers/nvme/host/core.c | 3 ++-
>> drivers/nvme/host/nvme.h | 2 +-
>> drivers/nvme/host/pci.c | 11 +++++++++--
>> 3 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 0896e21642be..223419454516 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3907,7 +3907,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
>> blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
>>
>> blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
>> - if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
>> + if (ctrl->ops->supports_pci_p2pdma &&
>> + ctrl->ops->supports_pci_p2pdma(ctrl))
>
> This is a little excessive, as I suspected. How about providing a
> default .supports_pci_p2pdma routine that returns false, so that
> the op is always available (non-null)? By "default", maybe that
> means either requiring an init_the_ops_struct() routine to be
> used, and/or checking all the users of struct nvme_ctrl_ops.

Honestly that sounds much messier to me than simply checking if it's
NULL before using it (which is a common, accepted pattern for ops).
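The pattern in question is just the usual optional-op check; a minimal sketch with invented stand-in names (not the real nvme_ctrl_ops):

```c
struct ctrl_ops_stub {
	/* optional: may be NULL when the transport has nothing to report */
	int (*supports_pci_p2pdma)(void *ctrl);
};

static int yes_op(void *ctrl)
{
	(void)ctrl;
	return 1;
}

static int ctrl_supports_p2pdma(const struct ctrl_ops_stub *ops, void *ctrl)
{
	/* a NULL op simply means "not supported"; no default stub needed */
	return ops->supports_pci_p2pdma && ops->supports_pci_p2pdma(ctrl);
}
```

The ops struct stays const, and transports that don't care just leave the member unset.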

> Another idea: maybe you don't really need a bool .supports_pci_p2pdma()
> routine at all, because the existing .flags really is about right.
> You just need the flags to be filled in dynamically. So, do that
> during nvme_pci setup/init time: that's when this module would call
> dma_pci_p2pdma_supported().

If the flag is filled in dynamically, then the ops struct would have to
be non-constant. Ops structs should be constant for security reasons.

Logan

2021-05-03 17:21:48

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 13/16] nvme-pci: Convert to using dma_map_sg_p2pdma for p2pdma pages



On 2021-05-02 7:34 p.m., John Hubbard wrote:
>> if (iod->npages == 0)
>> dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
>> iod->first_dma);
>> @@ -868,14 +857,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>> if (!iod->nents)
>> goto out_free_sg;
>>
>> - if (is_pci_p2pdma_page(sg_page(iod->sg)))
>> - nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
>> - iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
>> - else
>> - nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
>> - rq_dma_dir(req), DMA_ATTR_NO_WARN);
>> - if (!nr_mapped)
>> + nr_mapped = dma_map_sg_p2pdma_attrs(dev->dev, iod->sg, iod->nents,
>> + rq_dma_dir(req), DMA_ATTR_NO_WARN);
>> + if (nr_mapped < 0) {
>> + if (nr_mapped != -ENOMEM)
>> + ret = BLK_STS_TARGET;
>> goto out_free_sg;
>> + }
>
> But now the "nr_mapped == 0" case is no longer doing an early out_free_sg.
> Is that OK?

dma_map_sg_p2pdma_attrs() never returns zero. It will return -ENOMEM in
the same situation and results in the same goto out_free_sg.
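In other words the caller only has to distinguish the two error classes. A stand-alone model of that nvme_map_data() error handling (status strings stand in for the BLK_STS codes; this is a sketch, not the driver code):

```c
#include <string.h>

/* stand-ins for the two errnos the new interface can return */
#define STUB_ENOMEM	12
#define STUB_EREMOTEIO	121

/*
 * -ENOMEM keeps the default retryable status; any other error
 * (e.g. -EREMOTEIO for unmappable P2PDMA pages) becomes a hard
 * target error that must not be retried.
 */
static const char *nvme_map_status(int nr_mapped)
{
	const char *ret = "BLK_STS_RESOURCE";	/* retryable default */

	if (nr_mapped < 0) {
		if (nr_mapped != -STUB_ENOMEM)
			ret = "BLK_STS_TARGET";	/* never retry */
		return ret;
	}
	return "BLK_STS_OK";
}
```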

Logan

2021-05-03 18:12:00

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()



On 2021-05-01 9:58 p.m., John Hubbard wrote:
> Another odd thing: this used to check for memory failure and just give
> up, and now it doesn't. Yes, I realize that it all still works at the
> moment, but this is quirky and we shouldn't stop here.
>
> Instead, a cleaner approach would be to push the memory allocation
> slightly higher up the call stack, out to the
> pci_p2pdma_distance_many(). So pci_p2pdma_distance_many() should make
> the kmalloc() call, and fail out if it can't get a page for the seq_buf
> buffer. Then you don't have to do all this odd stuff.

I don't really agree with this assessment. If kmalloc fails to
initialize the seq_buf() (which should be very rare), the only thing
that is lost is the one warning print that tells the user the command
line parameter needed to disable the ACS. Everything else works fine,
nothing else can fail. I don't see the need to add extra complexity just
so the code errors out in no-mem instead of just skipping the one,
slightly more informative, warning line.

Also, keep in mind the results of all these functions are cached, so it
only ever happens once. So for this to matter, the user would have to do
their first transaction between two devices exactly at the time memory
allocations would fail.
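i.e. the allocation-failure path is purely best-effort; roughly this shape (a user-space sketch with invented names, not the real upstream_bridge_distance_warn()):

```c
#include <stdlib.h>

static int warnings_printed;

/*
 * If the buffer for the pretty-printed warning can't be allocated,
 * the warning is simply skipped; the distance calculation itself
 * still succeeds and its result still gets cached.
 */
static int bridge_distance_stub(int distance, int alloc_ok)
{
	/* stands in for the seq_buf kmalloc() */
	char *buf = alloc_ok ? malloc(128) : NULL;

	if (buf) {
		warnings_printed++;	/* would print the ACS hint here */
		free(buf);
	}
	return distance;	/* computed either way */
}
```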


> Furthermore, the call sites can then decide for themselves which GFP
> flags, GFP_ATOMIC or GFP_KERNEL or whatever they want for kmalloc().
>
> A related thing: this whole exercise would go better if there were a
> preparatory patch or two that changed the return codes in this file to
> something less crazy. There are too many functions that can fail, but
> are treated as if they sort-of-mostly-would-never-fail, in the hopes of
> using the return value directly for counting and such. This is badly
> mistaken, and it leads developers to try to avoid returning -ENOMEM
> (which is what we need here).

Hmm? Which functions can fail? and how?

Logan

2021-05-03 18:12:43

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 07/16] PCI/P2PDMA: Make pci_p2pdma_map_type() non-static



On 2021-05-02 4:44 p.m., John Hubbard wrote:
> On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
>> pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
>> implementation because it will need to determine the mapping type
>> ahead of actually doing the mapping to create the actual iommu mapping.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> drivers/pci/p2pdma.c | 34 +++++++++++++++++++++++-----------
>> include/linux/pci-p2pdma.h | 15 +++++++++++++++
>> 2 files changed, 38 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index bcb1a6d6119d..38c93f57a941 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -20,13 +20,6 @@
>> #include <linux/seq_buf.h>
>> #include <linux/xarray.h>
>>
>> -enum pci_p2pdma_map_type {
>> - PCI_P2PDMA_MAP_UNKNOWN = 0,
>> - PCI_P2PDMA_MAP_NOT_SUPPORTED,
>> - PCI_P2PDMA_MAP_BUS_ADDR,
>> - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
>> -};
>> -
>> struct pci_p2pdma {
>> struct gen_pool *pool;
>> bool p2pmem_published;
>> @@ -822,13 +815,30 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
>> }
>> EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
>>
>> -static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
>> - struct device *dev)
>> +/**
>> + * pci_p2pdma_map_type - return the type of mapping that should be used for
>> + * a given device and pgmap
>> + * @pgmap: the pagemap of a page to determine the mapping type for
>> + * @dev: device that is mapping the page
>> + * @dma_attrs: the attributes passed to the dma_map operation --
>> + * this is so they can be checked to ensure P2PDMA pages were not
>> + * introduced into an incorrect interface (like dma_map_sg). *
>> + *
>> + * Returns one of:
>> + * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
>> + * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
>> + * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done directly
>> + */
>> +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
>> + struct device *dev, unsigned long dma_attrs)
>> {
>> struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
>> enum pci_p2pdma_map_type ret;
>> struct pci_dev *client;
>>
>> + WARN_ONCE(!(dma_attrs & __DMA_ATTR_PCI_P2PDMA),
>> + "PCI P2PDMA pages were mapped with dma_map_sg!");
>
> This really ought to also return -EINVAL, assuming that my review suggestions
> about return types, in earlier patches, are acceptable.

That can't happen because, by convention, dma_map_sg() cannot return
-EINVAL. I think the best we can do is proceed normally and just warn
loudly.

Logan

2021-05-03 18:19:35

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()



On 2021-05-02 3:23 p.m., John Hubbard wrote:
> On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
>> dma_map_sg() either returns a positive number indicating the number
>> of entries mapped or zero indicating that resources were not available
>> to create the mapping. When zero is returned, it is always safe to retry
>> the mapping later once resources have been freed.
>>
>> Once P2PDMA pages are mixed into the SGL there may be pages that may
>> never be successfully mapped with a given device because that device may
>> not actually be able to access those pages. Thus, multiple error
>> conditions will need to be distinguished to determine whether a retry
>> is safe.
>>
>> Introduce dma_map_sg_p2pdma[_attrs]() with a different calling
>> convention from dma_map_sg(). The function will return a positive
>> integer on success or a negative errno on failure.
>>
>> ENOMEM will be used to indicate a resource failure and EREMOTEIO to
>> indicate that a P2PDMA page is not mappable.
>>
>> The __DMA_ATTR_PCI_P2PDMA attribute is introduced to inform the lower
>> level implementations that P2PDMA pages are allowed and to warn if a
>> caller introduces them into the regular dma_map_sg() interface.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> include/linux/dma-mapping.h | 15 +++++++++++
>> kernel/dma/mapping.c | 52 ++++++++++++++++++++++++++++++++-----
>> 2 files changed, 61 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
>> index 2a984cb4d1e0..50b8f586cf59 100644
>> --- a/include/linux/dma-mapping.h
>> +++ b/include/linux/dma-mapping.h
>> @@ -60,6 +60,12 @@
>> * at least read-only at lesser-privileged levels).
>> */
>> #define DMA_ATTR_PRIVILEGED (1UL << 9)
>> +/*
>> + * __DMA_ATTR_PCI_P2PDMA: This should not be used directly, use
>> + * dma_map_sg_p2pdma() instead. Used internally to indicate that the
>> + * caller is using the dma_map_sg_p2pdma() interface.
>> + */
>> +#define __DMA_ATTR_PCI_P2PDMA (1UL << 10)
>>
>
> As mentioned near the top of this file,
> Documentation/core-api/dma-attributes.rst also needs to be updated, for
> this new item.

As this attribute is not meant to be used by anyone outside the dma
functions, I don't think it should be documented here. (That's why it
has a double underscore prefix).

>> /*
>> * A dma_addr_t can hold any valid DMA or bus address for the platform. It can
>> @@ -107,6 +113,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
>> enum dma_data_direction dir, unsigned long attrs);
>> int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
>> enum dma_data_direction dir, unsigned long attrs);
>> +int dma_map_sg_p2pdma_attrs(struct device *dev, struct scatterlist *sg,
>> + int nents, enum dma_data_direction dir, unsigned long attrs);
>> void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>> int nents, enum dma_data_direction dir,
>> unsigned long attrs);
>> @@ -160,6 +168,12 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>> {
>> return 0;
>> }
>> +static inline int dma_map_sg_p2pdma_attrs(struct device *dev,
>> + struct scatterlist *sg, int nents, enum dma_data_direction dir,
>> + unsigned long attrs)
>> +{
>> + return 0;
>> +}
>> static inline void dma_unmap_sg_attrs(struct device *dev,
>> struct scatterlist *sg, int nents, enum dma_data_direction dir,
>> unsigned long attrs)
>> @@ -392,6 +406,7 @@ static inline void dma_sync_sgtable_for_device(struct device *dev,
>> #define dma_map_single(d, a, s, r) dma_map_single_attrs(d, a, s, r, 0)
>> #define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, 0)
>> #define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0)
>> +#define dma_map_sg_p2pdma(d, s, n, r) dma_map_sg_p2pdma_attrs(d, s, n, r, 0)
>
> This hunk is fine, of course.
>
> But, about pre-existing issues: note to self, or to anyone: send a patch to turn
> these into inline functions. The macro redirection here is not adding value, but
> it does make things just a little bit worse.
>
>
>> #define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, 0)
>> #define dma_map_page(d, p, o, s, r) dma_map_page_attrs(d, p, o, s, r, 0)
>> #define dma_unmap_page(d, a, s, r) dma_unmap_page_attrs(d, a, s, r, 0)
>> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
>> index b6a633679933..923089c4267b 100644
>> --- a/kernel/dma/mapping.c
>> +++ b/kernel/dma/mapping.c
>> @@ -177,12 +177,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
>> }
>> EXPORT_SYMBOL(dma_unmap_page_attrs);
>>
>> -/*
>> - * dma_maps_sg_attrs returns 0 on error and > 0 on success.
>> - * It should never return a value < 0.
>> - */
>
> It would be better to leave the comment in place, given the non-standard
> return values. However, looking around here, it would be better if we go
> with the standard -ERRNO for error, and >0 for sucess.

The comment is actually left in place. The diff just makes it look like
it was removed. It is added back lower down in the diff.

> There are pre-existing BUG_ON() and WARN_ON_ONCE() items that are partly
> an attempt to compensate for not being able to return proper -ERRNO
> codes. For example, this:
>
> BUG_ON(!valid_dma_direction(dir));
>
> ...arguably should be more like this:
>
> if (WARN_ON_ONCE(!valid_dma_direction(dir)))
> return -EINVAL;

Yes, but you'll have to see the discussion in the RFC. The complaint was
that the calling convention for dma_map_sg() is not expected to return
anything other than 0 or the number of entries mapped. It can't return a
negative error code. That's why BUG_ON(ents < 0) is in the existing
code. That's also why this series introduces the new dma_map_sg_p2pdma()
function. (Though, Jason has made some suggestions to further change this).

>
>> -int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
>> - enum dma_data_direction dir, unsigned long attrs)
>> +static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>> + int nents, enum dma_data_direction dir, unsigned long attrs)
>> {
>> const struct dma_map_ops *ops = get_dma_ops(dev);
>> int ents;
>> @@ -197,6 +193,20 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
>> ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
>> else
>> ents = ops->map_sg(dev, sg, nents, dir, attrs);
>> +
>> + return ents;
>> +}
>> +
>> +/*
>> + * dma_maps_sg_attrs returns 0 on error and > 0 on success.
>> + * It should never return a value < 0.
>> + */
>> +int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
>> + enum dma_data_direction dir, unsigned long attrs)
>> +{
>> + int ents;
>
> Pre-existing note, feel free to ignore: the ents and nents in the same
> routines together, are way too close to the each other in naming. Maybe
> using "requested_nents", or "nents_arg", for the incoming value, would
> help.

Ok, will change.

>> +
>> + ents = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
>> BUG_ON(ents < 0);
>> debug_dma_map_sg(dev, sg, nents, ents, dir);
>>
>> @@ -204,6 +214,36 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
>> }
>> EXPORT_SYMBOL(dma_map_sg_attrs);
>>
>> +/*
>> + * like dma_map_sg_attrs, but returns a negative errno on error (and > 0
>> + * on success). This function must be used if PCI P2PDMA pages might
>> + * be in the scatterlist.
>
> Let's turn this into a kernel doc comment block, seeing as how it clearly
> wants to be--you're almost there already. You've even reinvented @Return,
> below. :)

Just trying to follow the convention in this file. But I can make it a
kernel doc.

>> + *
>> + * On error this function may return:
>> + * -ENOMEM indicating that there was not enough resources available and
>> + * the transfer may be retried later
>> + * -EREMOTEIO indicating that P2PDMA pages were included but cannot
>> + * be mapped by the specified device, retries will always fail
>> + *
>> + * The scatterlist should be unmapped with the regular dma_unmap_sg[_attrs]().
>
> How about:
>
> "The scatterlist should be unmapped via dma_unmap_sg[_attrs]()."

Ok

Logan

2021-05-03 18:22:10

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()



On 2021-05-03 12:17 p.m., John Hubbard wrote:
> On 5/3/21 8:57 AM, Logan Gunthorpe wrote:
>>
>>
>> On 2021-05-01 9:58 p.m., John Hubbard wrote:
>>> Another odd thing: this used to check for memory failure and just give
>>> up, and now it doesn't. Yes, I realize that it all still works at the
>>> moment, but this is quirky and we shouldn't stop here.
>>>
>>> Instead, a cleaner approach would be to push the memory allocation
>>> slightly higher up the call stack, out to the
>>> pci_p2pdma_distance_many(). So pci_p2pdma_distance_many() should make
>>> the kmalloc() call, and fail out if it can't get a page for the seq_buf
>>> buffer. Then you don't have to do all this odd stuff.
>>
>> I don't really agree with this assessment. If kmalloc fails to
>> initialize the seq_buf() (which should be very rare), the only thing
>> that is lost is the one warning print that tells the user the command
line parameter needed to disable the ACS. Everything else works fine,
>> nothing else can fail. I don't see the need to add extra complexity just
>> so the code errors out in no-mem instead of just skipping the one,
>> slightly more informative, warning line.
>
> That's the thing: memory failure should be exceedingly rare for this.
> Therefore, just fail out entirely (which I don't expect we'll likely
> ever see), instead of doing all this weird stuff to try to continue
> on if you cannot allocate a single page. If you are in that case, the
> system is not in a state that is going to run your dma p2p setup well
> anyway.
>
> I think it's *less* complexity to allocate up front, fail early if
> allocation fails, and then not have to deal with these really odd
> quirks at the lower levels.
>

I don't see how it's all that weird. We're skipping a warning if we
can't allocate memory to calculate part of the message. It's really not
necessary. If the memory really can't be allocated then something else
will fail, but we really don't need to fail here just because we couldn't
print a verbose warning message.

Logan

2021-05-03 18:22:12

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()

On 5/3/21 8:57 AM, Logan Gunthorpe wrote:
>
>
> On 2021-05-01 9:58 p.m., John Hubbard wrote:
>> Another odd thing: this used to check for memory failure and just give
>> up, and now it doesn't. Yes, I realize that it all still works at the
>> moment, but this is quirky and we shouldn't stop here.
>>
>> Instead, a cleaner approach would be to push the memory allocation
>> slightly higher up the call stack, out to the
>> pci_p2pdma_distance_many(). So pci_p2pdma_distance_many() should make
>> the kmalloc() call, and fail out if it can't get a page for the seq_buf
>> buffer. Then you don't have to do all this odd stuff.
>
> I don't really agree with this assessment. If kmalloc fails to
> initialize the seq_buf() (which should be very rare), the only thing
> that is lost is the one warning print that tells the user the command
> line parameter needed to disable the ACS. Everything else works fine,
> nothing else can fail. I don't see the need to add extra complexity just
> so the code errors out in no-mem instead of just skipping the one,
> slightly more informative, warning line.

That's the thing: memory failure should be exceedingly rare for this.
Therefore, just fail out entirely (which I don't expect we'll likely
ever see), instead of doing all this weird stuff to try to continue
on if you cannot allocate a single page. If you are in that case, the
system is not in a state that is going to run your dma p2p setup well
anyway.

I think it's *less* complexity to allocate up front, fail early if
allocation fails, and then not have to deal with these really odd
quirks at the lower levels.

>
> Also, keep in mind the results of all these functions are cached, so it
> only ever happens once. So for this to matter, the user would have to do
> their first transaction between two devices exactly at the time memory
> allocations would fail.
>
>
>> Furthermore, the call sites can then decide for themselves which GFP
>> flags, GFP_ATOMIC or GFP_KERNEL or whatever they want for kmalloc().
>>
>> A related thing: this whole exercise would go better if there were a
>> preparatory patch or two that changed the return codes in this file to
>> something less crazy. There are too many functions that can fail, but
>> are treated as if they sort-of-mostly-would-never-fail, in the hopes of
>> using the return value directly for counting and such. This is badly
>> mistaken, and it leads developers to try to avoid returning -ENOMEM
>> (which is what we need here).
>
> Hmm? Which functions can fail? and how?
>

Let's defer that to the other patches, I was sort of looking ahead to
those, sorry.

thanks,
--
John Hubbard
NVIDIA

2021-05-03 18:25:21

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()

On 5/3/21 11:20 AM, Logan Gunthorpe wrote:
...
>> That's the thing: memory failure should be exceedingly rare for this.
>> Therefore, just fail out entirely (which I don't expect we'll likely
>> ever see), instead of doing all this weird stuff to try to continue
>> on if you cannot allocate a single page. If you are in that case, the
>> system is not in a state that is going to run your dma p2p setup well
>> anyway.
>>
>> I think it's *less* complexity to allocate up front, fail early if
>> allocation fails, and then not have to deal with these really odd
>> quirks at the lower levels.
>>
>
> I don't see how it's all that weird. We're skipping a warning if we
> can't allocate memory to calculate part of the message. It's really not
> necessary. If the memory really can't be allocated then something else
> will fail, but we really don't need to fail here just because we couldn't
> print a verbose warning message.
>

Well, I really dislike the result we have in this particular patch, but
I won't stand in the way of progress if that's how you really are going
to do it.

thanks,
--
John Hubbard
NVIDIA

2021-05-03 18:26:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()

On Mon, May 03, 2021 at 11:17:31AM -0700, John Hubbard wrote:
> That's the thing: memory failure should be exceedingly rare for this.
> Therefore, just fail out entirely (which I don't expect we'll likely
> ever see), instead of doing all this weird stuff to try to continue
> on if you cannot allocate a single page. If you are in that case, the
> system is not in a state that is going to run your dma p2p setup well
> anyway.
>
> I think it's *less* complexity to allocate up front, fail early if
> allocation fails, and then not have to deal with these really odd
> quirks at the lower levels.

Agreed.

2021-05-03 18:29:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

On Tue, Apr 27, 2021 at 08:01:13PM -0300, Jason Gunthorpe wrote:
> At a high level I'm OK with it. dma_map_sg_attrs() is the extra
> extended version of dma_map_sg(), it already has a different
> signature, a different return code is not out of the question.
>
> dma_map_sg() is just the simple easy to use interface that can't do
> advanced stuff.
>
> > I'm not that opposed to this. But it will make this series a fair bit
> > longer to change the 8 map_sg_attrs() usages.
>
> Yes, but the result seems much nicer to not grow the DMA API further.

We already have a mapping function that can return errors:
dma_map_sgtable.

I think it might make more sense to piggy back on that, as the sg_table
abstraction is pretty useful basically everywhere that we deal with
scatterlists anyway.

In the hopefully not too long run I plan to get rid of scatterlists in
at least NVMe and other high performance devices anyway.

2021-05-03 18:32:46

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()



On 2021-05-03 12:28 p.m., Christoph Hellwig wrote:
> On Tue, Apr 27, 2021 at 08:01:13PM -0300, Jason Gunthorpe wrote:
>> At a high level I'm OK with it. dma_map_sg_attrs() is the extra
>> extended version of dma_map_sg(), it already has a different
>> signature, a different return code is not out of the question.
>>
>> dma_map_sg() is just the simple easy to use interface that can't do
>> advanced stuff.
>>
>>> I'm not that opposed to this. But it will make this series a fair bit
>>> longer to change the 8 map_sg_attrs() usages.
>>
>> Yes, but the result seems much nicer to not grow the DMA API further.
>
> We already have a mapping function that can return errors:
> dma_map_sgtable.
>
> I think it might make more sense to piggy back on that, as the sg_table
> abstraction is pretty useful basically everywhere that we deal with
> scatterlists anyway.

Oh, I didn't even realize that existed. I'll use dma_map_sgtable() for v2.

Thanks,

Logan

2021-05-04 00:05:50

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

On 5/3/21 10:04 AM, Logan Gunthorpe wrote:
> Oops missed a comment:
>
> On 2021-05-02 5:28 p.m., John Hubbard wrote:
>>> int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
>>> enum dma_data_direction dir, unsigned long attrs)
>>> {
>>> - int i;
>>> + struct pci_p2pdma_map_state p2pdma_state = {};
>>
>> Is it worth putting this stuff on the stack--is there a noticeable
>> performance improvement from caching the state? Because if it's
>> invisible, then simplicity is better. I suspect you're right, and that
>> it *is* worth it, but it's good to know for real.
>
> I haven't measured it (it would be hard to measure), but I think it's
> fairly clear here. Without the state, xa_load() would need to be called
> on *every* page in an SGL that maps only P2PDMA memory from one device.
> With the state, it only needs to be called once. xa_load() is cheap, but
> it is not that cheap.

OK, thanks for spelling it out for me. :)

>
> There's essentially the same optimization in get_user_pages for
> ZONE_DEVICE pages. So, if it is necessary there, it should be necessary
> here.
>

Right, that's a pretty solid example.

thanks,
--
John Hubbard
NVIDIA
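The per-call state caching Logan describes can be modelled in user space. This is a toy sketch with made-up names: the point is that the expensive per-pgmap lookup (xa_load() in the real code) runs once per run of pages sharing a pgmap, not once per page.

```c
#include <assert.h>

/* Counts how many "expensive" lookups actually happened. */
static int lookup_count;

/* Stand-in for struct pci_p2pdma_map_state: remembers the last
 * pgmap seen and the map type computed for it. */
struct p2pdma_state_sketch {
	const void *last_pgmap;
	int map_type;
};

/* Stand-in for the xa_load()-backed map-type query. */
static int expensive_map_type_lookup(const void *pgmap)
{
	lookup_count++;
	return 1;	/* pretend every pgmap maps the same way */
}

/* Only consult the expensive lookup when the pgmap changes. */
static int cached_map_type(struct p2pdma_state_sketch *st, const void *pgmap)
{
	if (st->last_pgmap != pgmap) {
		st->last_pgmap = pgmap;
		st->map_type = expensive_map_type_lookup(pgmap);
	}
	return st->map_type;
}
```

For an SGL whose pages all come from one device's pgmap, the lookup fires exactly once regardless of the number of pages.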

2021-05-04 00:16:11

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

On 5/3/21 9:55 AM, Logan Gunthorpe wrote:
...
>> The same thing can be achieved with fewer lines and a bit more clarity.
>> Can we please do it like this instead:
>>
>> for_each_sg(sgl, sg, nents, i) {
>> 	if (sg_is_pci_p2pdma(sg))
>> 		sg_unmark_pci_p2pdma(sg);
>> 	else
>> 		dma_direct_unmap_page(dev, sg->dma_address,
>> 				sg_dma_len(sg), dir, attrs);
>> }
>>
>>
>
> That's debatable (the way I did it emphasizes the common case). But I'll
> consider changing it.
>

Thanks for considering it.

>>
>> Also here, a block comment for the function would be nice. How about
>> approximately this:
>>
>> /*
>> * Maps each SG segment. Returns the number of entries mapped, or 0 upon
>> * failure. If any entry could not be mapped, then no entries are mapped.
>> */
>>
>> I'll stop complaining about the pre-existing return code conventions,
>> since by now you know what I was thinking of saying. :)
>
> Not really part of this patchset... Seems like if you think there should
> be a comment like that here, you should send a patch. But this patch
> starts returning a negative value here.

OK, that's fine. Like you say, that comment is rather beyond this patchset.

>>> for_each_sg(sgl, sg, nents, i) {
>>> + if (is_pci_p2pdma_page(sg_page(sg))) {
>>> + ret = pci_p2pdma_map_segment(&p2pdma_state, dev, sg,
>>> + attrs);
>>> + if (ret < 0) {
>>> + goto out_unmap;
>>> + } else if (ret) {
>>> + ret = 0;
>>> + continue;
>>
>> Is this a bug? If neither of those "if" branches fires (ret == 0), then
>> the code (probably unintentionally) falls through and continues on to
>> attempt to call dma_direct_map_page()--despite being a PCI_P2PDMA page!
>
No, it's not a bug. Per the documentation of pci_p2pdma_map_segment(),
if it returns zero the segment should be mapped normally. P2PDMA pages
must be mapped with physical addresses (or IOVA addresses) if the TLPs
for the transaction will go through the host bridge.

Could we maybe put a little comment there, to that effect? It would be
nice to call out that point, especially since it is common to miss one
case (negative, 0, positive) when handling return values. Sort of like
we used to put "// fallthrough" in the case statements. Not a big deal
of course.

>
>> See below for suggestions:
>>
>>> + }
>>> + }
>>> +
>>> sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
>>> sg->offset, sg->length, dir, attrs);
>>> if (sg->dma_address == DMA_MAPPING_ERROR)
>>
>> This is another case in which "continue" is misleading and not as good
>> as "else". Because unless I'm wrong above, you really only want to take
>> one path *or* the other.
>
> No, per above, it's not one path or the other. If it's a P2PDMA page it
> may still need to be mapped normally.
>

Right. That follows.

thanks,
--
John Hubbard
NVIDIA
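The three-way return contract being discussed can be sketched with a stand-in classifier (the enum names are invented for illustration; only the branching mirrors the convention described above):

```c
#include <assert.h>
#include <errno.h>

/*
 * Sketch of the pci_p2pdma_map_segment()-style contract:
 *   ret < 0  -> error, unwind the mapping;
 *   ret > 0  -> segment was handled with a bus address, skip it;
 *   ret == 0 -> fall through and map the segment normally, because
 *               the P2P TLPs will be routed through the host bridge.
 */
enum seg_action { SEG_ERROR, SEG_ALREADY_MAPPED, SEG_MAP_NORMALLY };

static enum seg_action classify_segment_ret(int ret)
{
	if (ret < 0)
		return SEG_ERROR;
	if (ret > 0)
		return SEG_ALREADY_MAPPED;
	return SEG_MAP_NORMALLY;	/* ret == 0 is intentional, not a bug */
}
```

The ret == 0 case is the one that looks like an accidental fall-through in the patch; spelling it out shows it is a deliberate third outcome.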

2021-05-04 00:19:19

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 12/16] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA

On 5/3/21 10:17 AM, Logan Gunthorpe wrote:
...
>>> blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
>>> - if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
>>> + if (ctrl->ops->supports_pci_p2pdma &&
>>> + ctrl->ops->supports_pci_p2pdma(ctrl))
>>
>> This is a little excessive, as I suspected. How about providing a
>> default .supports_pci_p2pdma routine that returns false, so that
>> the op is always available (non-null)? By "default", maybe that
>> means either requiring an init_the_ops_struct() routine to be
>> used, and/or checking all the users of struct nvme_ctrl_ops.
>
> Honestly that sounds much more messy to me than simply checking if it's
> NULL before using it (which is a common, accepted pattern for ops).

OK, it's a minor suggestion, so feel free to ignore if you prefer it
the other way, sure.

>
>> Another idea: maybe you don't really need a bool .supports_pci_p2pdma()
>> routine at all, because the existing .flags really is about right.
>> You just need the flags to be filled in dynamically. So, do that
>> during nvme_pci setup/init time: that's when this module would call
>> dma_pci_p2pdma_supported().
>
> If the flag is filled in dynamically, then the ops struct would have to
> be non-constant. Ops structs should be constant for security reasons.
>

Hadn't thought about keeping ops structs constant. OK.

thanks,
--
John Hubbard
NVIDIA

2021-05-04 00:27:45

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 13/16] nvme-pci: Convert to using dma_map_sg_p2pdma for p2pdma pages

On 5/3/21 10:19 AM, Logan Gunthorpe wrote:
...
>>> + nr_mapped = dma_map_sg_p2pdma_attrs(dev->dev, iod->sg, iod->nents,
>>> + rq_dma_dir(req), DMA_ATTR_NO_WARN);
>>> + if (nr_mapped < 0) {
>>> + if (nr_mapped != -ENOMEM)
>>> + ret = BLK_STS_TARGET;
>>> goto out_free_sg;
>>> + }
>>
>> But now the "nr_mapped == 0" case is no longer doing an early out_free_sg.
>> Is that OK?
>
> dma_map_sg_p2pdma_attrs() never returns zero. It will return -ENOMEM in
> the same situation and results in the same goto out_free_sg.
>

OK...that's true, it doesn't return zero. A comment or WARN or something
might be nice, to show that the zero case hasn't been overlooked. It's
true that the dma_map_sg_p2pdma_attrs() documentation sort of says
that (although not quite out loud).

thanks,
--
John Hubbard
NVIDIA
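The nvme-pci error translation in the quoted hunk can be sketched as follows. The enum values are stand-ins for block-layer statuses (a requeue-able status vs. BLK_STS_TARGET); only the branching mirrors the patch:

```c
#include <assert.h>
#include <errno.h>

enum req_outcome { REQ_PROCEED, REQ_RETRY_LATER, REQ_FAIL_PERMANENT };

/* Stand-in for the caller-side handling: the mapping API never
 * returns 0, so only positive counts and negative errnos occur. */
static enum req_outcome handle_map_ret(int nr_mapped)
{
	if (nr_mapped > 0)
		return REQ_PROCEED;
	if (nr_mapped == -ENOMEM)
		return REQ_RETRY_LATER;		/* transient resource shortage */
	return REQ_FAIL_PERMANENT;		/* e.g. -EREMOTEIO: retries always fail */
}
```

This is why the old "nr_mapped == 0" early-out has no equivalent: allocation failure now arrives as -ENOMEM and takes the retry path.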

2021-05-07 05:03:42

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 11/16] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

Sorry, I think I missed responding to this one so here are the answers:

On 2021-05-02 7:14 p.m., John Hubbard wrote:
> On 4/8/21 10:01 AM, Logan Gunthorpe wrote:
>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>> apply the appropriate bus address to the segment. The IOVA is not
>> created if the scatterlist only consists of P2PDMA pages.
>>
>> Similar to dma-direct, the sg_mark_pci_p2pdma() flag is used to
>> indicate bus address segments. On unmap, P2PDMA segments are skipped
>> over when determining the start and end IOVA addresses.
>>
>> With this change, the flags variable in the dma_map_ops is
>> set to DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for
>> P2PDMA pages.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>>   drivers/iommu/dma-iommu.c | 66 ++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 58 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index af765c813cc8..ef49635f9819 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/mm.h>
>>   #include <linux/mutex.h>
>>   #include <linux/pci.h>
>> +#include <linux/pci-p2pdma.h>
>>   #include <linux/swiotlb.h>
>>   #include <linux/scatterlist.h>
>>   #include <linux/vmalloc.h>
>> @@ -864,6 +865,16 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
>>           sg_dma_address(s) = DMA_MAPPING_ERROR;
>>           sg_dma_len(s) = 0;
>>   +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>
> Newbie question: I'm in the dark as to why the !s_iova_len check is there,
> can you please enlighten me?

The loop in iommu_dma_map_sg() will decide what to do with P2PDMA pages.
If it is to map it with the bus address it will set s_iova_len to zero
so that no space is allocated in the IOVA. If it is to map it through
the host bridge, then it it will leave s_iova_len alone and create the
appropriate mapping with the CPU physical address.

This condition notices that s_iova_len was set to zero and fills in an SG
segment with the PCI bus address for that region.
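The two-pass, zero-length trick can be modelled in user space with made-up types and addresses (everything below is a stand-in for illustration):

```c
#include <assert.h>

/* Toy segment: pass 1 zeroes the length of bus-address segments so
 * they consume no IOVA space; pass 2 recognises the zero length and
 * fills in the bus address instead of an IOVA-relative address. */
struct seg { int len; int is_p2p_bus; long out_addr; };

static long plan_iova(struct seg *s, int n)
{
	long iova_len = 0;

	for (int i = 0; i < n; i++) {
		if (s[i].is_p2p_bus) {
			s[i].len = 0;	/* excluded from the IOVA area */
			continue;
		}
		iova_len += s[i].len;
	}
	return iova_len;
}

static void finalise(struct seg *s, int n, long iova, long bus_base)
{
	long off = 0;

	for (int i = 0; i < n; i++) {
		if (s[i].is_p2p_bus && !s[i].len) {
			s[i].out_addr = bus_base;	/* bus address, no IOVA */
			continue;
		}
		s[i].out_addr = iova + off;
		off += s[i].len;
	}
}
```

A mixed list of normal and bus-address segments ends up with an IOVA area sized only for the normal segments, while the bus-address segments are patched in afterwards.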


>
>> +            if (i > 0)
>> +                cur = sg_next(cur);
>> +
>> +            pci_p2pdma_map_bus_segment(s, cur);
>> +            count++;
>> +            cur_len = 0;
>> +            continue;
>> +        }
>> +
>
> This is really an if/else condition. And arguably, it would be better
> to split out two subroutines, and call one or the other depending on
> the result of if is_pci_p2pdma_page(), instead of this "continue" approach.

I really disagree here. Putting the exceptional condition in its own if
statement and leaving the normal case un-indented is easier to read and
understand. It also saves an extra level of indentation in code that is
already starting to look a little squished.


>>           /*
>>            * Now fill in the real DMA data. If...
>>            * - there is a valid output segment to append to
>> @@ -961,10 +972,12 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>       struct iova_domain *iovad = &cookie->iovad;
>>       struct scatterlist *s, *prev = NULL;
>>       int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
>> +    struct dev_pagemap *pgmap = NULL;
>> +    enum pci_p2pdma_map_type map_type;
>>       dma_addr_t iova;
>>       size_t iova_len = 0;
>>       unsigned long mask = dma_get_seg_boundary(dev);
>> -    int i;
>> +    int i, ret = 0;
>>         if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
>>           iommu_deferred_attach(dev, domain))
>> @@ -993,6 +1006,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>           s_length = iova_align(iovad, s_length + s_iova_off);
>>           s->length = s_length;
>>   +        if (is_pci_p2pdma_page(sg_page(s))) {
>> +            if (sg_page(s)->pgmap != pgmap) {
>> +                pgmap = sg_page(s)->pgmap;
>> +                map_type = pci_p2pdma_map_type(pgmap, dev,
>> +                                   attrs);
>> +            }
>> +
>> +            switch (map_type) {
>> +            case PCI_P2PDMA_MAP_BUS_ADDR:
>> +                /*
>> +                 * A zero length will be ignored by
>> +                 * iommu_map_sg() and then can be detected
>> +                 * in __finalise_sg() to actually map the
>> +                 * bus address.
>> +                 */
>> +                s->length = 0;
>> +                continue;
>> +            case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>> +                break;
>> +            default:
>> +                ret = -EREMOTEIO;
>> +                goto out_restore_sg;
>> +            }
>> +        }
>> +
>>           /*
>>            * Due to the alignment of our single IOVA allocation, we can
>>            * depend on these assumptions about the segment boundary mask:
>> @@ -1015,6 +1053,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>           prev = s;
>>       }
>>   +    if (!iova_len)
>> +        return __finalise_sg(dev, sg, nents, 0);
>> +
>
> ohhh, we're really slicing up this function pretty severely, what with the
> continue and the early out and several other control flow changes. I think
> it would be better to spend some time factoring this function into two
> cases, now that you're adding a second case for PCI P2PDMA. Roughly,
> two subroutines would do it.

I don't see how we can factor this into two cases. The SGL may contain
normal pages or P2PDMA pages or a mix of both and we have to create an
IOVA area for all the regions that map the CPU physical address (ie
normal pages and some P2PDMA pages) then also insert segments for any
PCI bus address.

> As it is, this leaves behind a routine that is extremely hard to mentally
> verify as correct.

Yes, this is tricky code, but not that incomprehensible. Most of the
difficulty is in understanding how it works before adding the P2PDMA bits.

There are two loops: one to prepare the IOVA region and another to fill
in the SGL. We have to add cases in both loops to skip the segments that
need to be mapped with the bus address in the first loop, and insert the
dma SGL segments in the second loop.

Logan