2022-06-15 16:33:12

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 00/21] Userspace P2PDMA with O_DIRECT NVMe devices

Hi,

This patchset continues my work to add userspace P2PDMA access using
O_DIRECT NVMe devices. This posting cleans up the way the pages are
stored in the VMA and relies on proper reference counting that was
fixed up recently in the kernel. The new method uses vm_insert_page()
in pci_mmap_p2pmem(), so there are no longer any faults or other ops, and
the pages are just freed sensibly when the VMA is removed. This simplifies
the VMA code significantly.

The previous posting was here[1].

This patch set enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe-backed files or block devices. A flag is added
to GUP in Patch 14, then Patches 15 through 19 wire this flag up based
on whether the block queue indicates P2PDMA support. Patches 20
through 21 enable the CMB to be mapped into userspace by mmapping
the nvme char device.
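
As a rough sketch, the intended userspace flow looks like the following
(the device path, mount point, and sizes are illustrative only and error
handling is trimmed; this is not code from the series):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define CHUNK (2 * 1024 * 1024)

  int main(void)
  {
          /* mmap() a chunk of the CMB via the nvme char device;
           * the file offset is ignored by pci_mmap_p2pmem() */
          int cfd = open("/dev/nvme0", O_RDWR);
          void *cmb = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                           MAP_SHARED, cfd, 0);
          if (cfd < 0 || cmb == MAP_FAILED)
                  return 1;

          /* fill the CMB buffer, then issue O_DIRECT IO against an
           * NVMe-backed file so the transfer stays peer-to-peer */
          memset(cmb, 0xab, CHUNK);
          int fd = open("/mnt/nvme/test.dat", O_WRONLY | O_DIRECT);
          if (fd < 0 || pwrite(fd, cmb, CHUNK, 0) != CHUNK)
                  perror("p2p write");

          munmap(cmb, CHUNK);
          return 0;
  }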

This is relatively straightforward, however the one significant
problem is that, presently, pci_p2pdma_map_sg() requires a homogeneous
SGL with all P2PDMA pages or all regular pages. Enhancing GUP to
support enforcing this rule would require a huge hack that I don't
expect would be all that palatable. So the first 13 patches add
support for P2PDMA pages to dma_map_sg[table]() to the dma-direct
and dma-iommu implementations. Thus, systems without an IOMMU, as well
as those with Intel and AMD IOMMUs, are supported. (Other IOMMU
implementations, notably ARM and PowerPC, remain unsupported for now;
support would be added when they convert to dma-iommu.)

dma_map_sgtable() is preferred when dealing with P2PDMA memory as it
will return -EREMOTEIO when the DMA device cannot map specific P2PDMA
pages based on the existing rules in calc_map_type_and_dist().
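
For a consumer of the API, that ends up looking roughly like this
hypothetical sketch (not code from the series; dev and sgt are assumed
to be set up by the caller):

  static int example_map(struct device *dev, struct sg_table *sgt)
  {
          int rc = dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);

          if (rc == -EREMOTEIO)   /* P2PDMA to this device is impossible */
                  return rc;      /* permanent error, don't retry */
          if (rc)                 /* e.g. -ENOMEM, may succeed if retried */
                  return rc;

          /* ... issue the transfer ... */

          dma_unmap_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
          return 0;
  }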

The other issue is dma_unmap_sg() needs a flag to determine whether a
given dma_addr_t was mapped regularly or as a PCI bus address. To allow
this, a third flag is added to the page_link field in struct
scatterlist. This effectively means support for P2PDMA will now depend
on CONFIG_64BIT.
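
On the unmap side, the flag is consumed roughly as in the dma-direct
hunk later in the series (Patch 6), i.e.:

  for_each_sg(sgl, sg, nents, i) {
          if (sg_is_dma_bus_address(sg))
                  sg_dma_unmark_bus_address(sg);  /* bus address: nothing to undo */
          else
                  dma_direct_unmap_page(dev, sg->dma_address,
                                        sg_dma_len(sg), dir, attrs);
  }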

Feedback welcome.

This series is based on v5.19-rc1. A git branch is available here:

https://github.com/sbates130272/linux-p2pmem/ p2pdma_user_cmb_v7

Thanks,

Logan

[1] https://lkml.kernel.org/r/[email protected]

--

Changes since v6:
- Rebase onto v5.19-rc1
- Rework how the pages are stored in the VMA per Jason's suggestion

Changes since v5:
- Rebased onto v5.18-rc1 which includes Christoph's cleanup to
free_zone_device_page() (similar to Ralph's patch).
- Fix bug with concurrent first calls to pci_p2pdma_vma_fault()
that caused a double allocation and lost p2p memory. Noticed
by Andrew Maier.
- Collected a Reviewed-by tag from Chaitanya.
- Numerous minor fixes to commit messages

Changes since v4:
- Rebase onto v5.17-rc1.
- Included Ralph Campbell's patches which remove the ZONE_DEVICE page
reference count offset. This is just to demonstrate that this
series is compatible with that direction.
- Added a comment in pci_p2pdma_map_sg_attrs(), per Chaitanya and
included his Reviewed-by tags.
- Patch 1 in the last series which cleaned up scatterlist.h
has been upstreamed.
- Dropped NEED_SG_DMA_BUS_ADDR_FLAG seeing as 'depends on' doesn't
work with selected symbols, per Christoph.
- Switched iov_iter_get_pages_[alloc_]flags to be exported with
EXPORT_SYMBOL_GPL, per Christoph.
- Renamed zone_device_pages_are_mergeable() to
zone_device_pages_have_same_pgmap(), per Christoph.
- Renamed .mmap_file_open operation in nvme_ctrl_ops to
cdev_file_open(), per Christoph.

Changes since v3:
- Add some comment and commit message cleanups I had missed for v3,
also moved the prototypes for some of the p2pdma helpers to
dma-map-ops.h (which I missed in v3 and was suggested in v2).
- Add separate cleanup patch for scatterlist.h and change the macros
to functions. (Suggested by Chaitanya and Jason, respectively)
- Rename sg_dma_mark_pci_p2pdma() and sg_is_dma_pci_p2pdma() to
sg_dma_mark_bus_address() and sg_is_dma_bus_address() which
is a more generic name (As requested by Jason)
- Fixes to some comments and commit messages as suggested by Bjorn
and Jason.
- Ensure swiotlb is not used with P2PDMA pages. (Per Jason)
- The sgtable conversion in RDMA was split out and sent upstream
separately, the new patch is only the removal. (Per Jason)
- Moved the FOLL_PCI_P2PDMA check outside of get_dev_pagemap() as
Jason suggested this will be removed in the near term.
- Add two patches to ensure that zone device pages with different
pgmaps are never merged in the block layer or
sg_alloc_append_table_from_pages() (Per Jason)
- Ensure synchronize_rcu() or call_rcu() is used before returning
pages to the genalloc. (Jason pointed out that pages are not
guaranteed to be unused on all architectures until at least
after an RCU grace period, and that synchronize_rcu() was likely
too slow to use in the vma close operation.)
- Collected Acks and Reviews by Bjorn, Jason and Max.

--

Logan Gunthorpe (21):
lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
PCI/P2PDMA: Attempt to set map_type if it has not been set
PCI/P2PDMA: Expose pci_p2pdma_map_type()
PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
dma-direct: support PCI P2PDMA pages in dma-direct map_sg
dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
nvme-pci: check DMA ops when indicating support for PCI P2PDMA
nvme-pci: convert to using dma_map_sgtable()
RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
RDMA/rw: drop pci_p2pdma_[un]map_sg()
PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
block: add check when merging zone device pages
lib/scatterlist: add check when merging zone device pages
block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
PCI/P2PDMA: Introduce pci_mmap_p2pmem()
nvme-pci: allow mmaping the CMB in userspace

block/bio.c | 10 +-
block/blk-map.c | 7 +-
drivers/infiniband/core/rw.c | 45 +----
drivers/iommu/dma-iommu.c | 68 ++++++-
drivers/nvme/host/core.c | 38 +++-
drivers/nvme/host/nvme.h | 5 +-
drivers/nvme/host/pci.c | 103 ++++++-----
drivers/nvme/target/rdma.c | 2 +-
drivers/pci/Kconfig | 5 +
drivers/pci/p2pdma.c | 337 ++++++++++++++++++++++++++++-------
include/linux/dma-map-ops.h | 76 ++++++++
include/linux/dma-mapping.h | 5 +
include/linux/mm.h | 24 +++
include/linux/pci-p2pdma.h | 43 ++---
include/linux/scatterlist.h | 44 ++++-
include/linux/uio.h | 6 +
include/rdma/ib_verbs.h | 11 ++
include/uapi/linux/magic.h | 1 +
kernel/dma/direct.c | 43 ++++-
kernel/dma/direct.h | 8 +-
kernel/dma/mapping.c | 22 ++-
lib/iov_iter.c | 25 ++-
lib/scatterlist.c | 25 +--
mm/gup.c | 22 ++-
24 files changed, 765 insertions(+), 210 deletions(-)


base-commit: f2906aa863381afb0015a9eb7fefad885d4e5a56
--
2.30.2


2022-06-15 16:33:13

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 05/21] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers

Add EREMOTEIO error return to dma_map_sgtable() which will be used
by .map_sg() implementations that detect P2PDMA pages that the
underlying DMA device cannot access.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
kernel/dma/mapping.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..9f65d1041638 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -197,7 +197,7 @@ static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
if (ents > 0)
debug_dma_map_sg(dev, sg, nents, ents, dir, attrs);
else if (WARN_ON_ONCE(ents != -EINVAL && ents != -ENOMEM &&
- ents != -EIO))
+ ents != -EIO && ents != -EREMOTEIO))
return -EIO;

return ents;
@@ -255,6 +255,8 @@ EXPORT_SYMBOL(dma_map_sg_attrs);
* complete the mapping. Should succeed if retried later.
* -EIO Legacy error code with an unknown meaning. eg. this is
* returned if a lower level call returned DMA_MAPPING_ERROR.
+ * -EREMOTEIO The DMA device cannot access P2PDMA memory specified in
+ * the sg_table. This will not succeed if retried.
*/
int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
enum dma_data_direction dir, unsigned long attrs)
--
2.30.2

2022-06-15 16:33:13

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 13/21] PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

This interface is superseded by support in dma_map_sg() which now supports
heterogeneous scatterlists. There are no longer any users, so remove it.

Signed-off-by: Logan Gunthorpe <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
---
drivers/pci/p2pdma.c | 66 --------------------------------------
include/linux/pci-p2pdma.h | 27 ----------------
2 files changed, 93 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 2fc0f4750a2e..d4e635012ffe 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -885,72 +885,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
return type;
}

-static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
- struct device *dev, struct scatterlist *sg, int nents)
-{
- struct scatterlist *s;
- int i;
-
- for_each_sg(sg, s, nents, i) {
- s->dma_address = sg_phys(s) + p2p_pgmap->bus_offset;
- sg_dma_len(s) = s->length;
- }
-
- return nents;
-}
-
-/**
- * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_map_sg() (if called)
- *
- * Scatterlists mapped with this function should be unmapped using
- * pci_p2pdma_unmap_sg_attrs().
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
- int nents, enum dma_data_direction dir, unsigned long attrs)
-{
- struct pci_p2pdma_pagemap *p2p_pgmap =
- to_p2p_pgmap(sg_page(sg)->pgmap);
-
- switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
- case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
- return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
- case PCI_P2PDMA_MAP_BUS_ADDR:
- return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
- default:
- /* Mapping is not Supported */
- return 0;
- }
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
-
-/**
- * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
- * mapped with pci_p2pdma_map_sg()
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: number of elements returned by pci_p2pdma_map_sg()
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
- */
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
- int nents, enum dma_data_direction dir, unsigned long attrs)
-{
- enum pci_p2pdma_map_type map_type;
-
- map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
-
- if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
- dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
-
/**
* pci_p2pdma_map_segment - map an sg segment determining the mapping type
* @state: State structure that should be declared outside of the for_each_sg()
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..2c07aa6b7665 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -30,10 +30,6 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
unsigned int *nents, u32 length);
void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
- int nents, enum dma_data_direction dir, unsigned long attrs);
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
- int nents, enum dma_data_direction dir, unsigned long attrs);
int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -83,17 +79,6 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
}
-static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
- struct scatterlist *sg, int nents, enum dma_data_direction dir,
- unsigned long attrs)
-{
- return 0;
-}
-static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
- struct scatterlist *sg, int nents, enum dma_data_direction dir,
- unsigned long attrs)
-{
-}
static inline int pci_p2pdma_enable_store(const char *page,
struct pci_dev **p2p_dev, bool *use_p2pdma)
{
@@ -119,16 +104,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
return pci_p2pmem_find_many(&client, 1);
}

-static inline int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg,
- int nents, enum dma_data_direction dir)
-{
- return pci_p2pdma_map_sg_attrs(dev, sg, nents, dir, 0);
-}
-
-static inline void pci_p2pdma_unmap_sg(struct device *dev,
- struct scatterlist *sg, int nents, enum dma_data_direction dir)
-{
- pci_p2pdma_unmap_sg_attrs(dev, sg, nents, dir, 0);
-}
-
#endif /* _LINUX_PCI_P2P_H */
--
2.30.2

2022-06-15 16:33:22

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 02/21] PCI/P2PDMA: Attempt to set map_type if it has not been set

Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages are to come from userspace then there's no
way to ensure the path was checked ahead of time.

With this change, the mapping type is calculated on first use if it has
not been pre-calculated, so it is no longer invalid to call
pci_p2pdma_map_sg() before the mapping type is known; drop the WARN_ON
for that case.

Signed-off-by: Logan Gunthorpe <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
---
drivers/pci/p2pdma.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 462b429ad243..4e8bc457e29a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -854,6 +854,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
struct pci_dev *client;
struct pci_p2pdma *p2pdma;
+ int dist;

if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
@@ -870,6 +871,10 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
type = xa_to_value(xa_load(&p2pdma->map_types,
map_types_idx(client)));
rcu_read_unlock();
+
+ if (type == PCI_P2PDMA_MAP_UNKNOWN)
+ return calc_map_type_and_dist(provider, client, &dist, true);
+
return type;
}

@@ -912,7 +917,7 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
case PCI_P2PDMA_MAP_BUS_ADDR:
return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
default:
- WARN_ON_ONCE(1);
+ /* Mapping is not Supported */
return 0;
}
}
--
2.30.2

2022-06-15 16:33:49

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 15/21] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
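
A caller is then expected to do something like the following hypothetical
sketch (the real call sites are wired up in the block layer in patches
18 and 19; bio, iter, pages, offset, etc. are assumed context):

  unsigned int gup_flags = 0;
  ssize_t size;

  if (bio->bi_bdev && blk_queue_pci_p2pdma(bdev_get_queue(bio->bi_bdev)))
          gup_flags |= FOLL_PCI_P2PDMA;

  size = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
                                  &offset, gup_flags);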

Signed-off-by: Logan Gunthorpe <[email protected]>
---
include/linux/uio.h | 6 ++++++
lib/iov_iter.c | 25 +++++++++++++++++++------
2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285fe5a2f..ddf9e4cf4a59 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -232,8 +232,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
loff_t start, size_t count);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+ size_t maxsize, unsigned maxpages, size_t *start,
+ unsigned int gup_flags);
ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+ struct page ***pages, size_t maxsize, size_t *start,
+ unsigned int gup_flags);
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
int iov_iter_npages(const struct iov_iter *i, int maxpages);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..9bf6e3af5120 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1515,9 +1515,9 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
return page;
}

-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
- size_t *start)
+ size_t *start, unsigned int gup_flags)
{
size_t len;
int n, res;
@@ -1528,7 +1528,6 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return 0;

if (likely(iter_is_iovec(i))) {
- unsigned int gup_flags = 0;
unsigned long addr;

if (iov_iter_rw(i) != WRITE)
@@ -1558,6 +1557,13 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
return -EFAULT;
}
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_flags);
+
+ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
+ size_t maxsize, unsigned maxpages, size_t *start)
+{
+ return iov_iter_get_pages_flags(i, pages, maxsize, maxpages, start, 0);
+}
EXPORT_SYMBOL(iov_iter_get_pages);

static struct page **get_pages_array(size_t n)
@@ -1640,9 +1646,9 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
return actual;
}

-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
struct page ***pages, size_t maxsize,
- size_t *start)
+ size_t *start, unsigned int gup_flags)
{
struct page **p;
size_t len;
@@ -1654,7 +1660,6 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return 0;

if (likely(iter_is_iovec(i))) {
- unsigned int gup_flags = 0;
unsigned long addr;

if (iov_iter_rw(i) != WRITE)
@@ -1667,6 +1672,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
p = get_pages_array(n);
if (!p)
return -ENOMEM;
+
res = get_user_pages_fast(addr, n, gup_flags, p);
if (unlikely(res <= 0)) {
kvfree(p);
@@ -1694,6 +1700,13 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
return -EFAULT;
}
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc_flags);
+
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+ size_t maxsize, size_t *start)
+{
+ return iov_iter_get_pages_alloc_flags(i, pages, maxsize, start, 0);
+}
EXPORT_SYMBOL(iov_iter_get_pages_alloc);

size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
--
2.30.2

2022-06-15 16:34:00

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 04/21] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
implementations. It takes a scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, -EREMOTEIO is returned.

This helper uses a state structure to track the changes to the
pgmap across calls and avoid needing to look up the xarray for
every page.

Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
dma_map_sg() implementations where the sg segment containing the page
differs from the sg segment containing the DMA address.

Prototypes for these helpers are added to dma-map-ops.h as they are only
useful to dma map implementations and don't need to pollute the public
pci-p2pdma header.
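
A non-IOMMU dma_map_sg() implementation is then expected to use the
helper roughly as follows (a condensed sketch of the dma-direct change
in Patch 6; error unwinding is omitted):

  struct pci_p2pdma_map_state p2pdma_state = {};
  struct scatterlist *sg;
  int i;

  for_each_sg(sgl, sg, nents, i) {
          if (is_pci_p2pdma_page(sg_page(sg))) {
                  switch (pci_p2pdma_map_segment(&p2pdma_state, dev, sg)) {
                  case PCI_P2PDMA_MAP_BUS_ADDR:
                          continue;       /* dma_address already set */
                  case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
                          break;          /* map normally below */
                  default:
                          return -EREMOTEIO;
                  }
          }
          /* ... regular dma_map_page() path ... */
  }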

Signed-off-by: Logan Gunthorpe <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
---
drivers/pci/p2pdma.c | 59 +++++++++++++++++++++++++++++++++++++
include/linux/dma-map-ops.h | 21 +++++++++++++
2 files changed, 80 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 10b1d5c2b5de..2fc0f4750a2e 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -951,6 +951,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
}
EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);

+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared outside of the for_each_sg()
+ * loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ *
+ * This is a helper to be used by non-IOMMU dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns the type of mapping used and maps the page if the type is
+ * PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+ struct scatterlist *sg)
+{
+ if (state->pgmap != sg_page(sg)->pgmap) {
+ state->pgmap = sg_page(sg)->pgmap;
+ state->map = pci_p2pdma_map_type(state->pgmap, dev);
+ state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+ }
+
+ if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
+ sg->dma_address = sg_phys(sg) + state->bus_off;
+ sg_dma_len(sg) = sg->length;
+ sg_dma_mark_bus_address(sg);
+ }
+
+ return state->map;
+}
+
+/**
+ * pci_p2pdma_map_bus_segment - map an sg segment predetermined to
+ * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
+ * @pg_sg: scatterlist segment with the page to map
+ * @dma_sg: scatterlist segment to assign a DMA address to
+ *
+ * This is a helper for iommu dma_map_sg() implementations when the
+ * segment for the DMA address differs from the segment containing the
+ * source page.
+ *
+ * pci_p2pdma_map_type() must have already been called on the pg_sg and
+ * returned PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+ struct scatterlist *dma_sg)
+{
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
+
+ dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
+ sg_dma_len(dma_sg) = pg_sg->length;
+ sg_dma_mark_bus_address(dma_sg);
+}
+
/**
* pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
* to enable p2pdma
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index d693a0e33bac..752f91e5eb5d 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -413,15 +413,36 @@ enum pci_p2pdma_map_type {
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
};

+struct pci_p2pdma_map_state {
+ struct dev_pagemap *pgmap;
+ int map;
+ u64 bus_off;
+};
+
#ifdef CONFIG_PCI_P2PDMA
enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
struct device *dev);
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+ struct scatterlist *sg);
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+ struct scatterlist *dma_sg);
#else /* CONFIG_PCI_P2PDMA */
static inline enum pci_p2pdma_map_type
pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
{
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
}
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+ struct scatterlist *sg)
+{
+ return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
+static inline void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+ struct scatterlist *dma_sg)
+{
+}
#endif /* CONFIG_PCI_P2PDMA */

#endif /* _LINUX_DMA_MAP_OPS_H */
--
2.30.2

2022-06-15 16:34:28

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 21/21] nvme-pci: allow mmaping the CMB in userspace

Allow userspace to obtain CMB memory by mmapping the controller's
char device. The mmap call allocates and returns a hunk of CMB memory
(the offset is ignored), so userspace does not have control over the
address within the CMB.

A VMA allocated in this way will only be usable by drivers that set
FOLL_PCI_P2PDMA when calling GUP, and inter-device support will be
checked the first time the pages are mapped for DMA.

Currently this is only supported by O_DIRECT to a PCI NVMe device
or through the NVMe passthrough IOCTL.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/core.c | 35 +++++++++++++++++++++++++++++++----
drivers/nvme/host/nvme.h | 3 +++
drivers/nvme/host/pci.c | 23 +++++++++++++++++++++++
3 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d6e76f2dc293..23fe4b544bf1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3166,6 +3166,7 @@ static int nvme_dev_open(struct inode *inode, struct file *file)
{
struct nvme_ctrl *ctrl =
container_of(inode->i_cdev, struct nvme_ctrl, cdev);
+ int ret = -EINVAL;

switch (ctrl->state) {
case NVME_CTRL_LIVE:
@@ -3175,13 +3176,25 @@ static int nvme_dev_open(struct inode *inode, struct file *file)
}

nvme_get_ctrl(ctrl);
- if (!try_module_get(ctrl->ops->module)) {
- nvme_put_ctrl(ctrl);
- return -EINVAL;
- }
+ if (!try_module_get(ctrl->ops->module))
+ goto err_put_ctrl;

file->private_data = ctrl;
+
+ if (ctrl->ops->cdev_file_open) {
+ ret = ctrl->ops->cdev_file_open(ctrl, file);
+ if (ret)
+ goto err_put_mod;
+ }
+
return 0;
+
+err_put_mod:
+ module_put(ctrl->ops->module);
+err_put_ctrl:
+ nvme_put_ctrl(ctrl);
+ return ret;
+
}

static int nvme_dev_release(struct inode *inode, struct file *file)
@@ -3189,11 +3202,24 @@ static int nvme_dev_release(struct inode *inode, struct file *file)
struct nvme_ctrl *ctrl =
container_of(inode->i_cdev, struct nvme_ctrl, cdev);

+ if (ctrl->ops->cdev_file_release)
+ ctrl->ops->cdev_file_release(file);
+
module_put(ctrl->ops->module);
nvme_put_ctrl(ctrl);
return 0;
}

+static int nvme_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct nvme_ctrl *ctrl = file->private_data;
+
+ if (!ctrl->ops->mmap_cmb)
+ return -ENODEV;
+
+ return ctrl->ops->mmap_cmb(ctrl, vma);
+}
+
static const struct file_operations nvme_dev_fops = {
.owner = THIS_MODULE,
.open = nvme_dev_open,
@@ -3201,6 +3227,7 @@ static const struct file_operations nvme_dev_fops = {
.unlocked_ioctl = nvme_dev_ioctl,
.compat_ioctl = compat_ptr_ioctl,
.uring_cmd = nvme_dev_uring_cmd,
+ .mmap = nvme_dev_mmap,
};

static ssize_t nvme_sysfs_reset(struct device *dev,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 957f79420cf3..44ff05d8e24d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -503,6 +503,9 @@ struct nvme_ctrl_ops {
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+ int (*cdev_file_open)(struct nvme_ctrl *ctrl, struct file *file);
+ void (*cdev_file_release)(struct file *file);
+ int (*mmap_cmb)(struct nvme_ctrl *ctrl, struct vm_area_struct *vma);
};

/*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 52b52a7efa9a..8ef3752b7ddb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2972,6 +2972,26 @@ static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
return dma_pci_p2pdma_supported(dev->dev);
}

+static int nvme_pci_cdev_file_open(struct nvme_ctrl *ctrl, struct file *file)
+{
+ struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+ return pci_p2pdma_file_open(pdev, file);
+}
+
+static void nvme_pci_cdev_file_release(struct file *file)
+{
+ pci_p2pdma_file_release(file);
+}
+
+static int nvme_pci_mmap_cmb(struct nvme_ctrl *ctrl,
+ struct vm_area_struct *vma)
+{
+ struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+ return pci_mmap_p2pmem(pdev, vma);
+}
+
static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name = "pcie",
.module = THIS_MODULE,
@@ -2983,6 +3003,9 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.submit_async_event = nvme_pci_submit_async_event,
.get_address = nvme_pci_get_address,
.supports_pci_p2pdma = nvme_pci_supports_pci_p2pdma,
+ .cdev_file_open = nvme_pci_cdev_file_open,
+ .cdev_file_release = nvme_pci_cdev_file_release,
+ .mmap_cmb = nvme_pci_mmap_cmb,
};

static int nvme_dev_map(struct nvme_dev *dev)
--
2.30.2

2022-06-15 16:35:24

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

Make use of the third free LSB in scatterlist's page_link on 64bit systems.

The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
given SGL segments dma_address points to a PCI bus address.
dma_unmap_sg_p2pdma() will need to perform different cleanup when a
segment is marked as a bus address.

The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
majority of P2PDMA use cases are restricted to newer root complexes and
roughly require the extra address space for memory BARs used in the
transactions.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
---
drivers/pci/Kconfig | 5 +++++
include/linux/scatterlist.h | 44 ++++++++++++++++++++++++++++++++++++-
2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 133c73207782..5cc7cba1941f 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -164,6 +164,11 @@ config PCI_PASID
config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on ZONE_DEVICE
+ #
+ # The need for the scatterlist DMA bus address flag means PCI P2PDMA
+ # requires 64bit
+ #
+ depends on 64BIT
select GENERIC_ALLOCATOR
help
Enables drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 7ff9d6386c12..6561ca8aead8 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -64,12 +64,24 @@ struct sg_append_table {
#define SG_CHAIN 0x01UL
#define SG_END 0x02UL

+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a
+ * bus address when doing P2PDMA.
+ */
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_DMA_BUS_ADDRESS 0x04UL
+static_assert(__alignof__(struct page) >= 8);
+#else
+#define SG_DMA_BUS_ADDRESS 0x00UL
+#endif
+
/*
* We overload the LSB of the page pointer to indicate whether it's
* a valid sg entry, or whether it points to the start of a new scatterlist.
* Those low bits are there for everyone! (thanks mason :-)
*/
-#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)

static inline unsigned int __sg_flags(struct scatterlist *sg)
{
@@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
return __sg_flags(sg) & SG_END;
}

+static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
+{
+ return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
+}
+
/**
* sg_assign_page - Assign a given page to an SG entry
* @sg: SG entry
@@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
sg->page_link &= ~SG_END;
}

+/**
+ * sg_dma_mark_bus_address - Mark the scatterlist entry as a bus address
+ * @sg: SG entry
+ *
+ * Description:
+ * Marks the passed in sg entry to indicate that the dma_address is
+ * a bus address and doesn't need to be unmapped.
+ **/
+static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
+{
+ sg->page_link |= SG_DMA_BUS_ADDRESS;
+}
+
+/**
+ * sg_dma_unmark_bus_address - Unmark the scatterlist entry as a bus address
+ * @sg: SG entry
+ *
+ * Description:
+ * Clears the bus address mark.
+ **/
+static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
+{
+ sg->page_link &= ~SG_DMA_BUS_ADDRESS;
+}
+
/**
* sg_phys - Return physical address of an sg entry
* @sg: SG entry
--
2.30.2

2022-06-15 16:44:26

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 10/21] nvme-pci: convert to using dma_map_sgtable()

The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating EREMOTEIO errors in case an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
---
drivers/nvme/host/pci.c | 69 +++++++++++++++++------------------------
1 file changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e5e032ab1c71..52b52a7efa9a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -230,11 +230,10 @@ struct nvme_iod {
bool use_sgl;
int aborted;
int npages; /* In the PRP list. 0 means small pool in use */
- int nents; /* Used in scatterlist */
dma_addr_t first_dma;
unsigned int dma_len; /* length of single DMA segment mapping */
dma_addr_t meta_dma;
- struct scatterlist *sg;
+ struct sg_table sgt;
};

static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -524,7 +523,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
static void **nvme_pci_iod_list(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
+ return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req));
}

static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
@@ -576,17 +575,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)
}
}

-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
- if (is_pci_p2pdma_page(sg_page(iod->sg)))
- pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
- rq_dma_dir(req));
- else
- dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -597,9 +585,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
return;
}

- WARN_ON_ONCE(!iod->nents);
+ WARN_ON_ONCE(!iod->sgt.nents);
+
+ dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);

- nvme_unmap_sg(dev, req);
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
iod->first_dma);
@@ -607,7 +596,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
nvme_free_sgls(dev, req);
else
nvme_free_prps(dev, req);
- mempool_free(iod->sg, dev->iod_mempool);
+ mempool_free(iod->sgt.sgl, dev->iod_mempool);
}

static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -630,7 +619,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
- struct scatterlist *sg = iod->sg;
+ struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
u64 dma_addr = sg_dma_address(sg);
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -703,16 +692,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
dma_len = sg_dma_len(sg);
}
done:
- cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+ cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
free_prps:
nvme_free_prps(dev, req);
return BLK_STS_RESOURCE;
bad_sgl:
- WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+ WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
"Invalid SGL for payload:%d nents:%d\n",
- blk_rq_payload_bytes(req), iod->nents);
+ blk_rq_payload_bytes(req), iod->sgt.nents);
return BLK_STS_IOERR;
}

@@ -738,12 +727,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
}

static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
- struct request *req, struct nvme_rw_command *cmd, int entries)
+ struct request *req, struct nvme_rw_command *cmd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
- struct scatterlist *sg = iod->sg;
+ struct scatterlist *sg = iod->sgt.sgl;
+ unsigned int entries = iod->sgt.nents;
dma_addr_t sgl_dma;
int i = 0;

@@ -841,7 +831,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
blk_status_t ret = BLK_STS_RESOURCE;
- int nr_mapped;
+ int rc;

if (blk_rq_nr_phys_segments(req) == 1) {
struct bio_vec bv = req_bvec(req);
@@ -859,26 +849,25 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
}

iod->dma_len = 0;
- iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
- if (!iod->sg)
+ iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
+ if (!iod->sgt.sgl)
return BLK_STS_RESOURCE;
- sg_init_table(iod->sg, blk_rq_nr_phys_segments(req));
- iod->nents = blk_rq_map_sg(req->q, req, iod->sg);
- if (!iod->nents)
+ sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
+ iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
+ if (!iod->sgt.orig_nents)
goto out_free_sg;

- if (is_pci_p2pdma_page(sg_page(iod->sg)))
- nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
- iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
- else
- nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
- rq_dma_dir(req), DMA_ATTR_NO_WARN);
- if (!nr_mapped)
+ rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
+ DMA_ATTR_NO_WARN);
+ if (rc) {
+ if (rc == -EREMOTEIO)
+ ret = BLK_STS_TARGET;
goto out_free_sg;
+ }

iod->use_sgl = nvme_pci_use_sgls(dev, req);
if (iod->use_sgl)
- ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw, nr_mapped);
+ ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
else
ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
if (ret != BLK_STS_OK)
@@ -886,9 +875,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
return BLK_STS_OK;

out_unmap_sg:
- nvme_unmap_sg(dev, req);
+ dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
out_free_sg:
- mempool_free(iod->sg, dev->iod_mempool);
+ mempool_free(iod->sgt.sgl, dev->iod_mempool);
return ret;
}

@@ -912,7 +901,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)

iod->aborted = 0;
iod->npages = -1;
- iod->nents = 0;
+ iod->sgt.nents = 0;

ret = nvme_setup_cmd(req->q->queuedata, req);
if (ret)
--
2.30.2

2022-06-15 16:45:51

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 12/21] RDMA/rw: drop pci_p2pdma_[un]map_sg()

dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped. This means the
rdma_rw_[un]map_sg() helpers are no longer necessary. Remove it all.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
drivers/infiniband/core/rw.c | 45 ++++++++----------------------------
1 file changed, 9 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 4d98f931a13d..8367974b7998 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -274,33 +274,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
return 1;
}

-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
- u32 sg_cnt, enum dma_data_direction dir)
-{
- if (is_pci_p2pdma_page(sg_page(sg)))
- pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
- else
- ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sgtable(struct ib_device *dev, struct sg_table *sgt,
- enum dma_data_direction dir)
-{
- int nents;
-
- if (is_pci_p2pdma_page(sg_page(sgt->sgl))) {
- if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
- return 0;
- nents = pci_p2pdma_map_sg(dev->dma_device, sgt->sgl,
- sgt->orig_nents, dir);
- if (!nents)
- return -EIO;
- sgt->nents = nents;
- return 0;
- }
- return ib_dma_map_sgtable_attrs(dev, sgt, dir, 0);
-}
-
/**
* rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
* @ctx: context to initialize
@@ -327,7 +300,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
};
int ret;

- ret = rdma_rw_map_sgtable(dev, &sgt, dir);
+ ret = ib_dma_map_sgtable_attrs(dev, &sgt, dir, 0);
if (ret)
return ret;
sg_cnt = sgt.nents;
@@ -366,7 +339,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
return ret;

out_unmap_sg:
- rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+ ib_dma_unmap_sgtable_attrs(dev, &sgt, dir, 0);
return ret;
}
EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -414,12 +387,12 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
return -EINVAL;
}

- ret = rdma_rw_map_sgtable(dev, &sgt, dir);
+ ret = ib_dma_map_sgtable_attrs(dev, &sgt, dir, 0);
if (ret)
return ret;

if (prot_sg_cnt) {
- ret = rdma_rw_map_sgtable(dev, &prot_sgt, dir);
+ ret = ib_dma_map_sgtable_attrs(dev, &prot_sgt, dir, 0);
if (ret)
goto out_unmap_sg;
}
@@ -486,9 +459,9 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
kfree(ctx->reg);
out_unmap_prot_sg:
if (prot_sgt.nents)
- rdma_rw_unmap_sg(dev, prot_sgt.sgl, prot_sgt.orig_nents, dir);
+ ib_dma_unmap_sgtable_attrs(dev, &prot_sgt, dir, 0);
out_unmap_sg:
- rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+ ib_dma_unmap_sgtable_attrs(dev, &sgt, dir, 0);
return ret;
}
EXPORT_SYMBOL(rdma_rw_ctx_signature_init);
@@ -621,7 +594,7 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
break;
}

- rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+ ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy);

@@ -649,8 +622,8 @@ void rdma_rw_ctx_destroy_signature(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
kfree(ctx->reg);

if (prot_sg_cnt)
- rdma_rw_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
- rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+ ib_dma_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
+ ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy_signature);

--
2.30.2

2022-06-15 16:47:22

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 14/21] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If GUP is called without the flag and a
P2PDMA page is found, it will return an error.

FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.
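
A GUP caller that can handle P2PDMA pages then opts in roughly like this
(hypothetical example, not from this series; addr and the surrounding
context are assumed):

  struct page *pages[16];
  int nr;

  /* must not be combined with FOLL_LONGTERM */
  nr = get_user_pages_fast(addr, 16, FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
  if (nr < 0)
          return nr;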

Signed-off-by: Logan Gunthorpe <[email protected]>
---
include/linux/mm.h | 1 +
mm/gup.c | 22 +++++++++++++++++++++-
2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..0bcb54ea503c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2941,6 +2941,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */

/*
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..f15f01d06a09 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -564,6 +564,12 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
goto out;
}

+ if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+ is_pci_p2pdma_page(page))) {
+ page = ERR_PTR(-EREMOTEIO);
+ goto out;
+ }
+
VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
!PageAnonExclusive(page), page);

@@ -994,6 +1000,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
return -EOPNOTSUPP;

+ if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
+ return -EOPNOTSUPP;
+
if (vma_is_secretmem(vma))
return -EFAULT;

@@ -2289,6 +2298,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);

+ if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+ is_pci_p2pdma_page(page)))
+ goto pte_unmap;
+
folio = try_grab_folio(page, 1, flags);
if (!folio)
goto pte_unmap;
@@ -2368,6 +2381,12 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
undo_dev_pagemap(nr, nr_start, flags, pages);
break;
}
+
+ if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+ undo_dev_pagemap(nr, nr_start, flags, pages);
+ break;
+ }
+
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(!try_grab_page(page, flags))) {
@@ -2856,7 +2875,8 @@ static int internal_get_user_pages_fast(unsigned long start,

if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
FOLL_FORCE | FOLL_PIN | FOLL_GET |
- FOLL_FAST_ONLY | FOLL_NOFAULT)))
+ FOLL_FAST_ONLY | FOLL_NOFAULT |
+ FOLL_PCI_P2PDMA)))
return -EINVAL;

if (gup_flags & FOLL_PIN)
--
2.30.2

2022-06-15 16:47:41

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

A P2PDMA page may have three possible outcomes when being mapped:
1) If the data path between the two devices doesn't go through
the root port, then it should be mapped with a PCI bus address
2) If the data path goes through the host bridge, it should be mapped
normally with an IOMMU IOVA.
3) It is not possible for the two devices to communicate and thus
the mapping operation should fail (and it will return -EREMOTEIO).

Similar to dma-direct, sg_dma_mark_bus_address() is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is set to
DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
drivers/iommu/dma-iommu.c | 68 +++++++++++++++++++++++++++++++++++----
1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..b01ca0c6a7ab 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -21,6 +21,7 @@
#include <linux/iova.h>
#include <linux/irq.h>
#include <linux/list_sort.h>
+#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/pci.h>
@@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;

+ if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+ if (i > 0)
+ cur = sg_next(cur);
+
+ pci_p2pdma_map_bus_segment(s, cur);
+ count++;
+ cur_len = 0;
+ continue;
+ }
+
/*
* Now fill in the real DMA data. If...
* - there is a valid output segment to append to
@@ -1158,6 +1169,8 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+ struct dev_pagemap *pgmap = NULL;
+ enum pci_p2pdma_map_type map_type;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
@@ -1193,6 +1206,35 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;

+ if (is_pci_p2pdma_page(sg_page(s))) {
+ if (sg_page(s)->pgmap != pgmap) {
+ pgmap = sg_page(s)->pgmap;
+ map_type = pci_p2pdma_map_type(pgmap, dev);
+ }
+
+ switch (map_type) {
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ /*
+ * A zero length will be ignored by
+ * iommu_map_sg() and then can be detected
+ * in __finalise_sg() to actually map the
+ * bus address.
+ */
+ s->length = 0;
+ continue;
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ /*
+ * Mapping through host bridge should be
+ * mapped with regular IOVAs, thus we
+ * do nothing here and continue below.
+ */
+ break;
+ default:
+ ret = -EREMOTEIO;
+ goto out_restore_sg;
+ }
+ }
+
/*
* Due to the alignment of our single IOVA allocation, we can
* depend on these assumptions about the segment boundary mask:
@@ -1215,6 +1257,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
prev = s;
}

+ if (!iova_len)
+ return __finalise_sg(dev, sg, nents, 0);
+
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
@@ -1236,7 +1281,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
out_restore_sg:
__invalidate_sg(sg, nents);
out:
- if (ret != -ENOMEM)
+ if (ret != -ENOMEM && ret != -EREMOTEIO)
return -EINVAL;
return ret;
}
@@ -1244,7 +1289,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
{
- dma_addr_t start, end;
+ dma_addr_t end, start = DMA_MAPPING_ERROR;
struct scatterlist *tmp;
int i;

@@ -1260,14 +1305,22 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
* The scatterlist segments are mapped into a single
* contiguous IOVA allocation, so this is incredibly easy.
*/
- start = sg_dma_address(sg);
- for_each_sg(sg_next(sg), tmp, nents - 1, i) {
+ for_each_sg(sg, tmp, nents, i) {
+ if (sg_is_dma_bus_address(tmp)) {
+ sg_dma_unmark_bus_address(tmp);
+ continue;
+ }
if (sg_dma_len(tmp) == 0)
break;
- sg = tmp;
+
+ if (start == DMA_MAPPING_ERROR)
+ start = sg_dma_address(tmp);
+
+ end = sg_dma_address(tmp) + sg_dma_len(tmp);
}
- end = sg_dma_address(sg) + sg_dma_len(sg);
- __iommu_dma_unmap(dev, start, end - start);
+
+ if (start != DMA_MAPPING_ERROR)
+ __iommu_dma_unmap(dev, start, end - start);
}

static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
@@ -1460,6 +1513,7 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
}

static const struct dma_map_ops iommu_dma_ops = {
+ .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
.free = iommu_dma_free,
.alloc_pages = dma_common_alloc_pages,
--
2.30.2

2022-06-15 16:54:14

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v7 06/21] dma-direct: support PCI P2PDMA pages in dma-direct map_sg

Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

A P2PDMA page may have three possible outcomes when being mapped:
1) If the data path between the two devices doesn't go through the
root port, then it should be mapped with a PCI bus address
2) If the data path goes through the host bridge, it should be mapped
normally, as though it were a CPU physical address
3) It is not possible for the two devices to communicate and thus
the mapping operation should fail (and it will return -EREMOTEIO).

SGL segments that contain PCI bus addresses are marked with
sg_dma_mark_bus_address() and are ignored when unmapped.

P2PDMA mappings also fail if swiotlb needs to be used on the
mapping.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
kernel/dma/direct.c | 43 +++++++++++++++++++++++++++++++++++++------
kernel/dma/direct.h | 8 +++++++-
2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index e978f36e6be8..133a4be2d3e5 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -454,29 +454,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_sync_dma_for_cpu_all();
}

+/*
+ * Unmaps segments, except for ones marked as pci_p2pdma which do not
+ * require any further action as they contain a bus address.
+ */
void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs)
{
struct scatterlist *sg;
int i;

- for_each_sg(sgl, sg, nents, i)
- dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
- attrs);
+ for_each_sg(sgl, sg, nents, i) {
+ if (sg_is_dma_bus_address(sg))
+ sg_dma_unmark_bus_address(sg);
+ else
+ dma_direct_unmap_page(dev, sg->dma_address,
+ sg_dma_len(sg), dir, attrs);
+ }
}
#endif

int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
{
- int i;
+ struct pci_p2pdma_map_state p2pdma_state = {};
+ enum pci_p2pdma_map_type map;
struct scatterlist *sg;
+ int i, ret;

for_each_sg(sgl, sg, nents, i) {
+ if (is_pci_p2pdma_page(sg_page(sg))) {
+ map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
+ switch (map) {
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ continue;
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ /*
+ * Any P2P mapping that traverses the PCI
+ * host bridge must be mapped with CPU physical
+ * address and not PCI bus addresses. This is
+ * done with dma_direct_map_page() below.
+ */
+ break;
+ default:
+ ret = -EREMOTEIO;
+ goto out_unmap;
+ }
+ }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
- if (sg->dma_address == DMA_MAPPING_ERROR)
+ if (sg->dma_address == DMA_MAPPING_ERROR) {
+ ret = -EIO;
goto out_unmap;
+ }
sg_dma_len(sg) = sg->length;
}

@@ -484,7 +515,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,

out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
- return -EIO;
+ return ret;
}

dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index a78c0ba70645..e38ffc5e6bdd 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -8,6 +8,7 @@
#define _KERNEL_DMA_DIRECT_H

#include <linux/dma-direct.h>
+#include <linux/memremap.h>

int dma_direct_get_sgtable(struct device *dev, struct sg_table *sgt,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
@@ -87,10 +88,15 @@ static inline dma_addr_t dma_direct_map_page(struct device *dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);

- if (is_swiotlb_force_bounce(dev))
+ if (is_swiotlb_force_bounce(dev)) {
+ if (is_pci_p2pdma_page(page))
+ return DMA_MAPPING_ERROR;
return swiotlb_map(dev, phys, size, dir, attrs);
+ }

if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
+ if (is_pci_p2pdma_page(page))
+ return DMA_MAPPING_ERROR;
if (is_swiotlb_active(dev))
return swiotlb_map(dev, phys, size, dir, attrs);

--
2.30.2

2022-06-29 06:57:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

On Wed, Jun 15, 2022 at 10:12:13AM -0600, Logan Gunthorpe wrote:
> Make use of the third free LSB in scatterlist's page_link on 64bit systems.
>
> The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
> given SGL segment's dma_address points to a PCI bus address.
> dma_unmap_sg_p2pdma() will need to perform different cleanup when a
> segment is marked as a bus address.
>
> The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
> PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
> majority of P2PDMA use cases are restricted to newer root complexes and
> roughly require the extra address space for memory BARs used in the
> transactions.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Reviewed-by: Chaitanya Kulkarni <[email protected]>

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-06-29 06:58:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 14/21] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

On Wed, Jun 15, 2022 at 10:12:26AM -0600, Logan Gunthorpe wrote:
> GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
> allow obtaining P2PDMA pages. If GUP is called without the flag and a
> P2PDMA page is found, it will return an error.
>
> FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-06-29 07:00:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 00/21] Userspace P2PDMA with O_DIRECT NVMe devices

Given how long this has been stuck, and how big it is and how many
subsystems it touches, can we start to make progress on this piecemeal?

I think patches 1-13 look pretty solid, and assuming a review for the
dma-iommu bits these patches could probably be queued up ASAP.

2022-06-29 07:11:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 12/21] RDMA/rw: drop pci_p2pdma_[un]map_sg()

On Wed, Jun 15, 2022 at 10:12:24AM -0600, Logan Gunthorpe wrote:
> dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
> is no longer necessary and may be dropped. This means the
> rdma_rw_[un]map_sg() helpers are no longer necessary. Remove it all.

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-06-29 07:19:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 15/21] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

I think this is going to have massive conflicts with Al's iov_iter
support..

2022-06-29 09:18:41

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

On 2022-06-15 17:12, Logan Gunthorpe wrote:
> Make use of the third free LSB in scatterlist's page_link on 64bit systems.
>
> The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
> given SGL segments dma_address points to a PCI bus address.
> dma_unmap_sg_p2pdma() will need to perform different cleanup when a
> segment is marked as a bus address.
>
> The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
> PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
> majority of P2PDMA use cases are restricted to newer root complexes and
> roughly require the extra address space for memory BARs used in the
> transactions.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Reviewed-by: Chaitanya Kulkarni <[email protected]>
> ---
> drivers/pci/Kconfig | 5 +++++
> include/linux/scatterlist.h | 44 ++++++++++++++++++++++++++++++++++++-
> 2 files changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 133c73207782..5cc7cba1941f 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -164,6 +164,11 @@ config PCI_PASID
> config PCI_P2PDMA
> bool "PCI peer-to-peer transfer support"
> depends on ZONE_DEVICE
> + #
> + # The need for the scatterlist DMA bus address flag means PCI P2PDMA
> + # requires 64bit
> + #
> + depends on 64BIT
> select GENERIC_ALLOCATOR
> help
> Enableѕ drivers to do PCI peer-to-peer transactions to and from
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index 7ff9d6386c12..6561ca8aead8 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -64,12 +64,24 @@ struct sg_append_table {
> #define SG_CHAIN 0x01UL
> #define SG_END 0x02UL
>
> +/*
> + * bit 2 is the third free bit in the page_link on 64bit systems which
> + * is used by dma_unmap_sg() to determine if the dma_address is a
> + * bus address when doing P2PDMA.
> + */
> +#ifdef CONFIG_PCI_P2PDMA
> +#define SG_DMA_BUS_ADDRESS 0x04UL
> +static_assert(__alignof__(struct page) >= 8);
> +#else
> +#define SG_DMA_BUS_ADDRESS 0x00UL
> +#endif
> +
> /*
> * We overload the LSB of the page pointer to indicate whether it's
> * a valid sg entry, or whether it points to the start of a new scatterlist.
> * Those low bits are there for everyone! (thanks mason :-)
> */
> -#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)
>
> static inline unsigned int __sg_flags(struct scatterlist *sg)
> {
> @@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
> return __sg_flags(sg) & SG_END;
> }
>
> +static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
> +{
> + return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
> +}
> +
> /**
> * sg_assign_page - Assign a given page to an SG entry
> * @sg: SG entry
> @@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
> sg->page_link &= ~SG_END;
> }
>
> +/**
> + * sg_dma_mark_bus address - Mark the scatterlist entry as a bus address
> + * @sg: SG entryScatterlist

entryScatterlist?

> + *
> + * Description:
> + * Marks the passed in sg entry to indicate that the dma_address is
> + * a bus address and doesn't need to be unmapped.
> + **/
> +static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
> +{
> + sg->page_link |= SG_DMA_BUS_ADDRESS;
> +}
> +
> +/**
> + * sg_unmark_pci_p2pdma - Unmark the scatterlist entry as a bus address
> + * @sg: SG entryScatterlist
> + *
> + * Description:
> + * Clears the bus address mark.
> + **/
> +static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
> +{
> + sg->page_link &= ~SG_DMA_BUS_ADDRESS;
> +}

Does this serve any useful purpose? If a page is determined to be device
memory, it's not going to suddenly stop being device memory, and if the
underlying sg is recycled to point elsewhere then sg_assign_page() will
still (correctly) clear this flag anyway. Trying to reason about this
beyond superficial API symmetry - i.e. why exactly would a caller need
to call it, and what would the implications be of failing to do so -
seems to lead straight to confusion.

In fact I'd be inclined to have sg_assign_page() be responsible for
setting the flag automatically as well, and thus not need
sg_dma_mark_bus_address() either, however I can see the argument for
doing it this way round to not entangle the APIs too much, so I don't
have any great objection to that.

Thanks,
Robin.

> +
> /**
> * sg_phys - Return physical address of an sg entry
> * @sg: SG entry

2022-06-29 12:21:42

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

On 2022-06-15 17:12, Logan Gunthorpe wrote:
> When a PCI P2PDMA page is seen, set the IOVA length of the segment
> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
> apply the appropriate bus address to the segment. The IOVA is not
> created if the scatterlist only consists of P2PDMA pages.
>
> A P2PDMA page may have three possible outcomes when being mapped:
> 1) If the data path between the two devices doesn't go through
> the root port, then it should be mapped with a PCI bus address
> 2) If the data path goes through the host bridge, it should be mapped
> normally with an IOMMU IOVA.
> 3) It is not possible for the two devices to communicate and thus
> the mapping operation should fail (and it will return -EREMOTEIO).
>
> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
> indicate bus address segments. On unmap, P2PDMA segments are skipped
> over when determining the start and end IOVA addresses.
>
> With this change, the flags variable in the dma_map_ops is set to
> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Reviewed-by: Jason Gunthorpe <[email protected]>
> ---
> drivers/iommu/dma-iommu.c | 68 +++++++++++++++++++++++++++++++++++----
> 1 file changed, 61 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index f90251572a5d..b01ca0c6a7ab 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -21,6 +21,7 @@
> #include <linux/iova.h>
> #include <linux/irq.h>
> #include <linux/list_sort.h>
> +#include <linux/memremap.h>
> #include <linux/mm.h>
> #include <linux/mutex.h>
> #include <linux/pci.h>
> @@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
> sg_dma_address(s) = DMA_MAPPING_ERROR;
> sg_dma_len(s) = 0;
>
> + if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {

Logically, should we not be able to use sg_is_dma_bus_address() here? I
think it should be feasible, and simpler, to prepare the p2p segments
up-front, such that at this point all we need to do is restore the
original length (if even that, see below).

> + if (i > 0)
> + cur = sg_next(cur);
> +
> + pci_p2pdma_map_bus_segment(s, cur);
> + count++;
> + cur_len = 0;
> + continue;
> + }
> +
> /*
> * Now fill in the real DMA data. If...
> * - there is a valid output segment to append to
> @@ -1158,6 +1169,8 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> struct iova_domain *iovad = &cookie->iovad;
> struct scatterlist *s, *prev = NULL;
> int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
> + struct dev_pagemap *pgmap = NULL;
> + enum pci_p2pdma_map_type map_type;
> dma_addr_t iova;
> size_t iova_len = 0;
> unsigned long mask = dma_get_seg_boundary(dev);
> @@ -1193,6 +1206,35 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> s_length = iova_align(iovad, s_length + s_iova_off);
> s->length = s_length;
>
> + if (is_pci_p2pdma_page(sg_page(s))) {
> + if (sg_page(s)->pgmap != pgmap) {
> + pgmap = sg_page(s)->pgmap;
> + map_type = pci_p2pdma_map_type(pgmap, dev);
> + }

There's a definite code smell here, but per above and below I think we
*should* actually call the new helper instead of copy-pasting half of it.

> +
> + switch (map_type) {
> + case PCI_P2PDMA_MAP_BUS_ADDR:
> + /*
> + * A zero length will be ignored by
> + * iommu_map_sg() and then can be detected

If that is required behaviour then it needs an explicit check in
iommu_map_sg() to guarantee (and document) it. It's only by chance that
__iommu_map() happens to return success for size == 0 *if* all the other
arguments still line up, which is a far cry from a safe no-op.

However, rather than play yet more silly tricks, I think it would make
even more sense to make iommu_map_sg() properly aware and able to skip
direct p2p segments on its own. Once it becomes normal to pass mixed
scatterlists around, it's only a matter of time until one ends up being
handed to a driver which manages its own IOMMU domain, and then what?

> + * in __finalise_sg() to actually map the
> + * bus address.
> + */
> + s->length = 0;
> + continue;
> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> + /*
> + * Mapping through host bridge should be
> + * mapped with regular IOVAs, thus we
> + * do nothing here and continue below.
> + */
> + break;
> + default:
> + ret = -EREMOTEIO;
> + goto out_restore_sg;
> + }
> + }
> +
> /*
> * Due to the alignment of our single IOVA allocation, we can
> * depend on these assumptions about the segment boundary mask:
> @@ -1215,6 +1257,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> prev = s;
> }
>
> + if (!iova_len)
> + return __finalise_sg(dev, sg, nents, 0);
> +
> iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
> if (!iova) {
> ret = -ENOMEM;
> @@ -1236,7 +1281,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> out_restore_sg:
> __invalidate_sg(sg, nents);
> out:
> - if (ret != -ENOMEM)
> + if (ret != -ENOMEM && ret != -EREMOTEIO)
> return -EINVAL;
> return ret;
> }
> @@ -1244,7 +1289,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir, unsigned long attrs)
> {
> - dma_addr_t start, end;
> + dma_addr_t end, start = DMA_MAPPING_ERROR;

There are several things I don't like about this logic, I'd rather have
"end = 0" here...

> struct scatterlist *tmp;
> int i;
>
> @@ -1260,14 +1305,22 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> * The scatterlist segments are mapped into a single
> * contiguous IOVA allocation, so this is incredibly easy.
> */

[ This comment rather stops being true :( ]

> - start = sg_dma_address(sg);
> - for_each_sg(sg_next(sg), tmp, nents - 1, i) {

...then generalise the first-element special case here into a dedicated
"walk to the first non-p2p element" loop...

> + for_each_sg(sg, tmp, nents, i) {
> + if (sg_is_dma_bus_address(tmp)) {
> + sg_dma_unmark_bus_address(tmp);

[ Again I question what this actually achieves ]

> + continue;
> + }
> if (sg_dma_len(tmp) == 0)
> break;
> - sg = tmp;
> +
> + if (start == DMA_MAPPING_ERROR)
> + start = sg_dma_address(tmp);
> +
> + end = sg_dma_address(tmp) + sg_dma_len(tmp);
> }
> - end = sg_dma_address(sg) + sg_dma_len(sg);
> - __iommu_dma_unmap(dev, start, end - start);
> +
> + if (start != DMA_MAPPING_ERROR)

...then "if (end)" here.

Thanks,
Robin.

> + __iommu_dma_unmap(dev, start, end - start);
> }
>
> static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
> @@ -1460,6 +1513,7 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
> }
>
> static const struct dma_map_ops iommu_dma_ops = {
> + .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> .alloc = iommu_dma_alloc,
> .free = iommu_dma_free,
> .alloc_pages = dma_common_alloc_pages,

2022-06-29 16:02:54

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg




On 2022-06-29 06:07, Robin Murphy wrote:
> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>> apply the appropriate bus address to the segment. The IOVA is not
>> created if the scatterlist only consists of P2PDMA pages.
>>
>> A P2PDMA page may have three possible outcomes when being mapped:
>>    1) If the data path between the two devices doesn't go through
>>       the root port, then it should be mapped with a PCI bus address
>>    2) If the data path goes through the host bridge, it should be mapped
>>       normally with an IOMMU IOVA.
>>    3) It is not possible for the two devices to communicate and thus
>>       the mapping operation should fail (and it will return -EREMOTEIO).
>>
>> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
>> indicate bus address segments. On unmap, P2PDMA segments are skipped
>> over when determining the start and end IOVA addresses.
>>
>> With this change, the flags variable in the dma_map_ops is set to
>> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> Reviewed-by: Jason Gunthorpe <[email protected]>
>> ---
>>   drivers/iommu/dma-iommu.c | 68 +++++++++++++++++++++++++++++++++++----
>>   1 file changed, 61 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index f90251572a5d..b01ca0c6a7ab 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -21,6 +21,7 @@
>>   #include <linux/iova.h>
>>   #include <linux/irq.h>
>>   #include <linux/list_sort.h>
>> +#include <linux/memremap.h>
>>   #include <linux/mm.h>
>>   #include <linux/mutex.h>
>>   #include <linux/pci.h>
>> @@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev,
>> struct scatterlist *sg, int nents,
>>           sg_dma_address(s) = DMA_MAPPING_ERROR;
>>           sg_dma_len(s) = 0;
>>   +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>
> Logically, should we not be able to use sg_is_dma_bus_address() here? I
> think it should be feasible, and simpler, to prepare the p2p segments
> up-front, such that at this point all we need to do is restore the
> original length (if even that, see below).

Per my previous email, no, because the flag sg_is_dma_bus_address()
tests is not set yet, and it's not meant to tell you anything about the
page. That flag will be set below by pci_p2pdma_map_bus_segment() and
then checked in iommu_dma_unmap_sg() to determine if the dma_address in
the segment needs to be unmapped.

>
>> +            if (i > 0)
>> +                cur = sg_next(cur);
>> +
>> +            pci_p2pdma_map_bus_segment(s, cur);
>> +            count++;
>> +            cur_len = 0;
>> +            continue;
>> +        }
>> +
>>           /*
>>            * Now fill in the real DMA data. If...
>>            * - there is a valid output segment to append to
>> @@ -1158,6 +1169,8 @@ static int iommu_dma_map_sg(struct device *dev,
>> struct scatterlist *sg,
>>       struct iova_domain *iovad = &cookie->iovad;
>>       struct scatterlist *s, *prev = NULL;
>>       int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
>> +    struct dev_pagemap *pgmap = NULL;
>> +    enum pci_p2pdma_map_type map_type;
>>       dma_addr_t iova;
>>       size_t iova_len = 0;
>>       unsigned long mask = dma_get_seg_boundary(dev);
>> @@ -1193,6 +1206,35 @@ static int iommu_dma_map_sg(struct device *dev,
>> struct scatterlist *sg,
>>           s_length = iova_align(iovad, s_length + s_iova_off);
>>           s->length = s_length;
>>   +        if (is_pci_p2pdma_page(sg_page(s))) {
>> +            if (sg_page(s)->pgmap != pgmap) {
>> +                pgmap = sg_page(s)->pgmap;
>> +                map_type = pci_p2pdma_map_type(pgmap, dev);
>> +            }
>
> There's a definite code smell here, but per above and below I think we
> *should* actually call the new helper instead of copy-pasting half of it.


>
>> +
>> +            switch (map_type) {
>> +            case PCI_P2PDMA_MAP_BUS_ADDR:
>> +                /*
>> +                 * A zero length will be ignored by
>> +                 * iommu_map_sg() and then can be detected
>
> If that is required behaviour then it needs an explicit check in
> iommu_map_sg() to guarantee (and document) it. It's only by chance that
> __iommu_map() happens to return success for size == 0 *if* all the other
> arguments still line up, which is a far cry from a safe no-op.

What should such a check look like? I could certainly add some comments
to iommu_map_sg(), but I don't see what the code would need to check for...

> However, rather than play yet more silly tricks, I think it would make
> even more sense to make iommu_map_sg() properly aware and able to skip
> direct p2p segments on its own. Once it becomes normal to pass mixed
> scatterlists around, it's only a matter of time until one ends up being
> handed to a driver which manages its own IOMMU domain, and then what?

I suppose we can add another call to is_pci_p2pdma_page() inside
iommu_map_sg() if you think that is cleaner. Seems like more work on the
fast path to me, but I'm not opposed to it.

>> +                 * in __finalise_sg() to actually map the
>> +                 * bus address.
>> +                 */
>> +                s->length = 0;
>> +                continue;
>> +            case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>> +                /*
>> +                 * Mapping through host bridge should be
>> +                 * mapped with regular IOVAs, thus we
>> +                 * do nothing here and continue below.
>> +                 */
>> +                break;
>> +            default:
>> +                ret = -EREMOTEIO;
>> +                goto out_restore_sg;
>> +            }
>> +        }
>> +
>>           /*
>>            * Due to the alignment of our single IOVA allocation, we can
>>            * depend on these assumptions about the segment boundary mask:
>> @@ -1215,6 +1257,9 @@ static int iommu_dma_map_sg(struct device *dev,
>> struct scatterlist *sg,
>>           prev = s;
>>       }
>>   +    if (!iova_len)
>> +        return __finalise_sg(dev, sg, nents, 0);
>> +
>>       iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev),
>> dev);
>>       if (!iova) {
>>           ret = -ENOMEM;
>> @@ -1236,7 +1281,7 @@ static int iommu_dma_map_sg(struct device *dev,
>> struct scatterlist *sg,
>>   out_restore_sg:
>>       __invalidate_sg(sg, nents);
>>   out:
>> -    if (ret != -ENOMEM)
>> +    if (ret != -ENOMEM && ret != -EREMOTEIO)
>>           return -EINVAL;
>>       return ret;
>>   }
>> @@ -1244,7 +1289,7 @@ static int iommu_dma_map_sg(struct device *dev,
>> struct scatterlist *sg,
>>   static void iommu_dma_unmap_sg(struct device *dev, struct
>> scatterlist *sg,
>>           int nents, enum dma_data_direction dir, unsigned long attrs)
>>   {
>> -    dma_addr_t start, end;
>> +    dma_addr_t end, start = DMA_MAPPING_ERROR;
>
> There are several things I don't like about this logic, I'd rather have
> "end = 0" here...

Ok, I think that should work.

>>       struct scatterlist *tmp;
>>       int i;
>>   @@ -1260,14 +1305,22 @@ static void iommu_dma_unmap_sg(struct device
>> *dev, struct scatterlist *sg,
>>        * The scatterlist segments are mapped into a single
>>        * contiguous IOVA allocation, so this is incredibly easy.
>>        */
>
> [ This comment rather stops being true :( ]

Not exactly. Sure there are some segments in the SGL that have bus
addresses, but all the regular IOVAs still have a single contiguous
allocation and only require one call to __iommu_dma_unmap(). The only
tricky issue is finding the first and last actual IOVA SGs to get the range.

>
>> -    start = sg_dma_address(sg);
>> -    for_each_sg(sg_next(sg), tmp, nents - 1, i) {
>
> ...then generalise the first-element special case here into a dedicated
> "walk to the first non-p2p element" loop...

Ok, I'll see what I can do for that.
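
Something like this, perhaps, with the "end = 0" sentinel from above
(rough sketch, untested):

        dma_addr_t end = 0, start = 0;

        for_each_sg(sg, tmp, nents, i) {
                if (sg_is_dma_bus_address(tmp)) {
                        sg_dma_unmark_bus_address(tmp);
                        continue;
                }
                if (sg_dma_len(tmp) == 0)
                        break;
                if (!end)
                        start = sg_dma_address(tmp);
                end = sg_dma_address(tmp) + sg_dma_len(tmp);
        }
        if (end)
                __iommu_dma_unmap(dev, start, end - start);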

Logan

2022-06-29 16:06:37

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL




On 2022-06-29 03:05, Robin Murphy wrote:
> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>> Make use of the third free LSB in scatterlist's page_link on 64bit
>> systems.
>>
>> The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
>> given SGL segments dma_address points to a PCI bus address.
>> dma_unmap_sg_p2pdma() will need to perform different cleanup when a
>> segment is marked as a bus address.
>>
>> The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
>> PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
>> majority of P2PDMA use cases are restricted to newer root complexes and
>> roughly require the extra address space for memory BARs used in the
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> Reviewed-by: Chaitanya Kulkarni <[email protected]>
>> ---
>>   drivers/pci/Kconfig         |  5 +++++
>>   include/linux/scatterlist.h | 44 ++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 48 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index 133c73207782..5cc7cba1941f 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -164,6 +164,11 @@ config PCI_PASID
>>   config PCI_P2PDMA
>>       bool "PCI peer-to-peer transfer support"
>>       depends on ZONE_DEVICE
>> +    #
>> +    # The need for the scatterlist DMA bus address flag means PCI P2PDMA
>> +    # requires 64bit
>> +    #
>> +    depends on 64BIT
>>       select GENERIC_ALLOCATOR
>>       help
>>         Enableѕ drivers to do PCI peer-to-peer transactions to and from
>> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
>> index 7ff9d6386c12..6561ca8aead8 100644
>> --- a/include/linux/scatterlist.h
>> +++ b/include/linux/scatterlist.h
>> @@ -64,12 +64,24 @@ struct sg_append_table {
>>   #define SG_CHAIN    0x01UL
>>   #define SG_END        0x02UL
>>   +/*
>> + * bit 2 is the third free bit in the page_link on 64bit systems which
>> + * is used by dma_unmap_sg() to determine if the dma_address is a
>> + * bus address when doing P2PDMA.
>> + */
>> +#ifdef CONFIG_PCI_P2PDMA
>> +#define SG_DMA_BUS_ADDRESS    0x04UL
>> +static_assert(__alignof__(struct page) >= 8);
>> +#else
>> +#define SG_DMA_BUS_ADDRESS    0x00UL
>> +#endif
>> +
>>   /*
>>    * We overload the LSB of the page pointer to indicate whether it's
>>    * a valid sg entry, or whether it points to the start of a new
>> scatterlist.
>>    * Those low bits are there for everyone! (thanks mason :-)
>>    */
>> -#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
>> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)
>>     static inline unsigned int __sg_flags(struct scatterlist *sg)
>>   {
>> @@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
>>       return __sg_flags(sg) & SG_END;
>>   }
>>   +static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
>> +{
>> +    return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
>> +}
>> +
>>   /**
>>    * sg_assign_page - Assign a given page to an SG entry
>>    * @sg:            SG entry
>> @@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct
>> scatterlist *sg)
>>       sg->page_link &= ~SG_END;
>>   }
>>   +/**
>> + * sg_dma_mark_bus address - Mark the scatterlist entry as a bus address
>> + * @sg:         SG entryScatterlist
>
> entryScatterlist?
>
>> + *
>> + * Description:
>> + *   Marks the passed in sg entry to indicate that the dma_address is
>> + *   a bus address and doesn't need to be unmapped.
>> + **/
>> +static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
>> +{
>> +    sg->page_link |= SG_DMA_BUS_ADDRESS;
>> +}
>> +
>> +/**
>> + * sg_unmark_pci_p2pdma - Unmark the scatterlist entry as a bus address
>> + * @sg:         SG entryScatterlist
>> + *
>> + * Description:
>> + *   Clears the bus address mark.
>> + **/
>> +static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
>> +{
>> +    sg->page_link &= ~SG_DMA_BUS_ADDRESS;
>> +}
>
> Does this serve any useful purpose? If a page is determined to be device
> memory, it's not going to suddenly stop being device memory, and if the
> underlying sg is recycled to point elsewhere then sg_assign_page() will
> still (correctly) clear this flag anyway. Trying to reason about this
> beyond superficial API symmetry - i.e. why exactly would a caller need
> to call it, and what would the implications be of failing to do so -
> seems to lead straight to confusion.
>
> In fact I'd be inclined to have sg_assign_page() be responsible for
> setting the flag automatically as well, and thus not need
> sg_dma_mark_bus_address() either, however I can see the argument for
> doing it this way round to not entangle the APIs too much, so I don't
> have any great objection to that.

Yes, I think you misunderstand what this is for. The SG_DMA_BUS_ADDRESS
flag doesn't mark the segment for the page, but for the dma address. It
cannot be set in sg_assign_page() since it's not a property of the page
but a property of the dma_address in the sgl.

It's not meant for use by regular SG users, it's only meant for use
inside DMA mapping implementations. The purpose is to know whether a
given dma_address in the SGL is a bus address or regular memory, because
the two different types must be unmapped differently. We can't rely on
the page because, as you know, in many dma_map_sg() implementations the
dma_address entry in the sgl does not map to the same memory as the
page. Or to put it another way: is_pci_p2pdma_page(sg_page(sg)) does not
imply that sg->dma_address points to a bus address.
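
For example, the dma-direct unmap loop earlier in this series ends up
as roughly (trimmed):

        for_each_sg(sgl, sg, nents, i) {
                if (sg_is_dma_bus_address(sg))
                        sg_dma_unmark_bus_address(sg); /* bus address, nothing to unmap */
                else
                        dma_direct_unmap_page(dev, sg->dma_address,
                                              sg_dma_len(sg), dir, attrs);
        }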

Does that make sense?

Logan

2022-06-29 18:17:00

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

On 2022-06-29 16:39, Logan Gunthorpe wrote:
>
>
>
> On 2022-06-29 03:05, Robin Murphy wrote:
>> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>>> Make use of the third free LSB in scatterlist's page_link on 64bit
>>> systems.
>>>
>>> The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
>>> given SGL segments dma_address points to a PCI bus address.
>>> dma_unmap_sg_p2pdma() will need to perform different cleanup when a
>>> segment is marked as a bus address.
>>>
>>> The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
>>> PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
>>> majority of P2PDMA use cases are restricted to newer root complexes and
>>> roughly require the extra address space for memory BARs used in the
>>> transactions.
>>>
>>> Signed-off-by: Logan Gunthorpe <[email protected]>
>>> Reviewed-by: Chaitanya Kulkarni <[email protected]>
>>> ---
>>>   drivers/pci/Kconfig         |  5 +++++
>>>   include/linux/scatterlist.h | 44 ++++++++++++++++++++++++++++++++++++-
>>>   2 files changed, 48 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>>> index 133c73207782..5cc7cba1941f 100644
>>> --- a/drivers/pci/Kconfig
>>> +++ b/drivers/pci/Kconfig
>>> @@ -164,6 +164,11 @@ config PCI_PASID
>>>   config PCI_P2PDMA
>>>       bool "PCI peer-to-peer transfer support"
>>>       depends on ZONE_DEVICE
>>> +    #
>>> +    # The need for the scatterlist DMA bus address flag means PCI P2PDMA
>>> +    # requires 64bit
>>> +    #
>>> +    depends on 64BIT
>>>       select GENERIC_ALLOCATOR
>>>       help
>>>         Enableѕ drivers to do PCI peer-to-peer transactions to and from
>>> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
>>> index 7ff9d6386c12..6561ca8aead8 100644
>>> --- a/include/linux/scatterlist.h
>>> +++ b/include/linux/scatterlist.h
>>> @@ -64,12 +64,24 @@ struct sg_append_table {
>>>   #define SG_CHAIN    0x01UL
>>>   #define SG_END        0x02UL
>>>   +/*
>>> + * bit 2 is the third free bit in the page_link on 64bit systems which
>>> + * is used by dma_unmap_sg() to determine if the dma_address is a
>>> + * bus address when doing P2PDMA.
>>> + */
>>> +#ifdef CONFIG_PCI_P2PDMA
>>> +#define SG_DMA_BUS_ADDRESS    0x04UL
>>> +static_assert(__alignof__(struct page) >= 8);
>>> +#else
>>> +#define SG_DMA_BUS_ADDRESS    0x00UL
>>> +#endif
>>> +
>>>   /*
>>>    * We overload the LSB of the page pointer to indicate whether it's
>>>    * a valid sg entry, or whether it points to the start of a new
>>> scatterlist.
>>>    * Those low bits are there for everyone! (thanks mason :-)
>>>    */
>>> -#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
>>> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)
>>>     static inline unsigned int __sg_flags(struct scatterlist *sg)
>>>   {
>>> @@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
>>>       return __sg_flags(sg) & SG_END;
>>>   }
>>>   +static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
>>> +{
>>> +    return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
>>> +}
>>> +
>>>   /**
>>>    * sg_assign_page - Assign a given page to an SG entry
>>>    * @sg:            SG entry
>>> @@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct
>>> scatterlist *sg)
>>>       sg->page_link &= ~SG_END;
>>>   }
>>>   +/**
>>> + * sg_dma_mark_bus address - Mark the scatterlist entry as a bus address
>>> + * @sg:         SG entryScatterlist
>>
>> entryScatterlist?
>>
>>> + *
>>> + * Description:
>>> + *   Marks the passed in sg entry to indicate that the dma_address is
>>> + *   a bus address and doesn't need to be unmapped.
>>> + **/
>>> +static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
>>> +{
>>> +    sg->page_link |= SG_DMA_BUS_ADDRESS;
>>> +}
>>> +
>>> +/**
>>> + * sg_unmark_pci_p2pdma - Unmark the scatterlist entry as a bus address
>>> + * @sg:         SG entryScatterlist
>>> + *
>>> + * Description:
>>> + *   Clears the bus address mark.
>>> + **/
>>> +static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
>>> +{
>>> +    sg->page_link &= ~SG_DMA_BUS_ADDRESS;
>>> +}
>>
>> Does this serve any useful purpose? If a page is determined to be device
>> memory, it's not going to suddenly stop being device memory, and if the
>> underlying sg is recycled to point elsewhere then sg_assign_page() will
>> still (correctly) clear this flag anyway. Trying to reason about this
>> beyond superficial API symmetry - i.e. why exactly would a caller need
>> to call it, and what would the implications be of failing to do so -
>> seems to lead straight to confusion.
>>
>> In fact I'd be inclined to have sg_assign_page() be responsible for
>> setting the flag automatically as well, and thus not need
>> sg_dma_mark_bus_address() either, however I can see the argument for
>> doing it this way round to not entangle the APIs too much, so I don't
>> have any great objection to that.
>
> Yes, I think you misunderstand what this is for. The SG_DMA_BUS_ADDDRESS
> flag doesn't mark the segment for the page, but for the dma address. It
> cannot be set in sg_assign_page() seeing it's not a property of the page
> but a property of the dma_address in the sgl.
>
> It's not meant for use by regular SG users, it's only meant for use
> inside DMA mapping implementations. The purpose is to know whether a
> given dma_address in the SGL is a bus address or regular memory because
> the two different types must be unmapped differently. We can't rely on
> the page because, as you know, many dma_map_sg() the dma_address entry
> in the sgl does not map to the same memory as the page. Or to put it
> another way: is_pci_p2pdma_page(sg->page) does not imply that
> sg->dma_address points to a bus address.
>
> Does that make sense?

Ah, you're quite right - in trying to take in the whole series at once
first thing in the morning I did fail to properly grasp that detail, so
indeed the sg_assign_page() thing couldn't possibly work; but as I said,
that's fine anyway. I still think the lifecycle management is a bit off
though - equivalently, a bus address doesn't stop being a bus address,
so it would seem appropriate to update this flag whenever
sg_dma_address() is assigned to, and not when it isn't.

Thanks,
Robin.

2022-06-29 18:45:11

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL



On 2022-06-29 12:02, Robin Murphy wrote:
> On 2022-06-29 16:39, Logan Gunthorpe wrote:
>> On 2022-06-29 03:05, Robin Murphy wrote:
>>> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>>> Does this serve any useful purpose? If a page is determined to be device
>>> memory, it's not going to suddenly stop being device memory, and if the
>>> underlying sg is recycled to point elsewhere then sg_assign_page() will
>>> still (correctly) clear this flag anyway. Trying to reason about this
>>> beyond superficial API symmetry - i.e. why exactly would a caller need
>>> to call it, and what would the implications be of failing to do so -
>>> seems to lead straight to confusion.
>>>
>>> In fact I'd be inclined to have sg_assign_page() be responsible for
>>> setting the flag automatically as well, and thus not need
>>> sg_dma_mark_bus_address() either, however I can see the argument for
>>> doing it this way round to not entangle the APIs too much, so I don't
>>> have any great objection to that.
>>
>> Yes, I think you misunderstand what this is for. The SG_DMA_BUS_ADDDRESS
>> flag doesn't mark the segment for the page, but for the dma address. It
>> cannot be set in sg_assign_page() seeing it's not a property of the page
>> but a property of the dma_address in the sgl.
>>
>> It's not meant for use by regular SG users, it's only meant for use
>> inside DMA mapping implementations. The purpose is to know whether a
>> given dma_address in the SGL is a bus address or regular memory because
>> the two different types must be unmapped differently. We can't rely on
>> the page because, as you know, many dma_map_sg() the dma_address entry
>> in the sgl does not map to the same memory as the page. Or to put it
>> another way: is_pci_p2pdma_page(sg->page) does not imply that
>> sg->dma_address points to a bus address.
>>
>> Does that make sense?
>
> Ah, you're quite right, in trying to take in the whole series at once
> first thing in the morning I did fail to properly grasp that detail, so
> indeed the sg_assign_page() thing couldn't possibly work, but as I said
> that's fine anyway. I still think the lifecycle management is a bit off
> though - equivalently, a bus address doesn't stop being a bus address,
> so it would seem appropriate to update this flag appropriately whenever
> sg_dma_address() is assigned to, and not when it isn't.

Yes, that's pretty much the way the code is now. The only two places
sg_dma_mark_bus_address() is called are in the two pci_p2pdma helpers
that set the dma address to the bus address. The lines before both calls
set the dma_address and dma_len.
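
i.e. in both helpers the pattern is roughly (sketch only; "bus_addr"
here stands in for the peer's PCI bus address computed by the helper):

        sg->dma_address = bus_addr;
        sg_dma_len(sg) = sg->length;
        sg_dma_mark_bus_address(sg);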

Logan

2022-06-29 19:48:12

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

On 2022-06-29 16:57, Logan Gunthorpe wrote:
>
>
>
> On 2022-06-29 06:07, Robin Murphy wrote:
>> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>>> apply the appropriate bus address to the segment. The IOVA is not
>>> created if the scatterlist only consists of P2PDMA pages.
>>>
>>> A P2PDMA page may have three possible outcomes when being mapped:
>>>    1) If the data path between the two devices doesn't go through
>>>       the root port, then it should be mapped with a PCI bus address
>>>    2) If the data path goes through the host bridge, it should be mapped
>>>       normally with an IOMMU IOVA.
>>>    3) It is not possible for the two devices to communicate and thus
>>>       the mapping operation should fail (and it will return -EREMOTEIO).
>>>
>>> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
>>> indicate bus address segments. On unmap, P2PDMA segments are skipped
>>> over when determining the start and end IOVA addresses.
>>>
>>> With this change, the flags variable in the dma_map_ops is set to
>>> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
>>>
>>> Signed-off-by: Logan Gunthorpe <[email protected]>
>>> Reviewed-by: Jason Gunthorpe <[email protected]>
>>> ---
>>>   drivers/iommu/dma-iommu.c | 68 +++++++++++++++++++++++++++++++++++----
>>>   1 file changed, 61 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>> index f90251572a5d..b01ca0c6a7ab 100644
>>> --- a/drivers/iommu/dma-iommu.c
>>> +++ b/drivers/iommu/dma-iommu.c
>>> @@ -21,6 +21,7 @@
>>>   #include <linux/iova.h>
>>>   #include <linux/irq.h>
>>>   #include <linux/list_sort.h>
>>> +#include <linux/memremap.h>
>>>   #include <linux/mm.h>
>>>   #include <linux/mutex.h>
>>>   #include <linux/pci.h>
>>> @@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev,
>>> struct scatterlist *sg, int nents,
>>>           sg_dma_address(s) = DMA_MAPPING_ERROR;
>>>           sg_dma_len(s) = 0;
>>>   +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>>
>> Logically, should we not be able to use sg_is_dma_bus_address() here? I
>> think it should be feasible, and simpler, to prepare the p2p segments
>> up-front, such that at this point all we need to do is restore the
>> original length (if even that, see below).
>
> Per my previous email, no, because sg_is_dma_bus_address() is not set
> yet and not meant to tell you something about the page. That flag will
> be set below by pci_p2pdma_map_bus_segment() and then checkd in
> iommu_dma_unmap_sg() to determine if the dma_address in the segment
> needs to be unmapped.

I know it's not set yet as-is; I'm suggesting things should be
restructured so that it *would be*. In the logical design of this code,
the DMA addresses are effectively determined in iommu_dma_map_sg(), and
__finalise_sg() merely converts them from a relative to an absolute form
(along with undoing the other trickery). Thus the call to
pci_p2pdma_map_bus_segment() absolutely belongs in the main
iommu_map_sg() loop.

>>> +            if (i > 0)
>>> +                cur = sg_next(cur);
>>> +
>>> +            pci_p2pdma_map_bus_segment(s, cur);
>>> +            count++;
>>> +            cur_len = 0;
>>> +            continue;
>>> +        }
>>> +
>>>           /*
>>>            * Now fill in the real DMA data. If...
>>>            * - there is a valid output segment to append to
>>> @@ -1158,6 +1169,8 @@ static int iommu_dma_map_sg(struct device *dev,
>>> struct scatterlist *sg,
>>>       struct iova_domain *iovad = &cookie->iovad;
>>>       struct scatterlist *s, *prev = NULL;
>>>       int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
>>> +    struct dev_pagemap *pgmap = NULL;
>>> +    enum pci_p2pdma_map_type map_type;
>>>       dma_addr_t iova;
>>>       size_t iova_len = 0;
>>>       unsigned long mask = dma_get_seg_boundary(dev);
>>> @@ -1193,6 +1206,35 @@ static int iommu_dma_map_sg(struct device *dev,
>>> struct scatterlist *sg,
>>>           s_length = iova_align(iovad, s_length + s_iova_off);
>>>           s->length = s_length;
>>>   +        if (is_pci_p2pdma_page(sg_page(s))) {
>>> +            if (sg_page(s)->pgmap != pgmap) {
>>> +                pgmap = sg_page(s)->pgmap;
>>> +                map_type = pci_p2pdma_map_type(pgmap, dev);
>>> +            }
>>
>> There's a definite code smell here, but per above and below I think we
>> *should* actually call the new helper instead of copy-pasting half of it.
>
>
>>
>>> +
>>> +            switch (map_type) {
>>> +            case PCI_P2PDMA_MAP_BUS_ADDR:
>>> +                /*
>>> +                 * A zero length will be ignored by
>>> +                 * iommu_map_sg() and then can be detected
>>
>> If that is required behaviour then it needs an explicit check in
>> iommu_map_sg() to guarantee (and document) it. It's only by chance that
>> __iommu_map() happens to return success for size == 0 *if* all the other
>> arguments still line up, which is a far cry from a safe no-op.
>
> What should such a check look like? I could certainly add some comments
> to iommu_map_sg(), but I don't see what the code would need to check for...

I'd say a check for zero-length segments would look like "if (sg->length
== 0)", most likely with a "continue;" on the following line.

>> However, rather than play yet more silly tricks, I think it would make
>> even more sense to make iommu_map_sg() properly aware and able to skip
>> direct p2p segments on its own. Once it becomes normal to pass mixed
>> scatterlists around, it's only a matter of time until one ends up being
>> handed to a driver which manages its own IOMMU domain, and then what?
>
> I suppose we can add another call to is_pci_p2pdma_page() inside
> iommu_map_sg() if you think that is cleaner. Seems like more work on the
> fast path to me, but I'm not opposed to it.

I was thinking more of sg_is_dma_bus_address() but admittedly under my
initial misapprehension of that. I suppose there's still a tenuous
argument that even though we're not actually consuming sg_dma_address()
at that point, if a segment *has* been earmarked for direct p2p, then
skipping it, rather than mapping it at the root complex TA that's
probably never going to see those transactions, might seem the more
logical choice.

However it's all a bit hypothetical, and not significantly cleaner than
a zero-size special case, so I'm not particularly tied to the idea either.

>>> +                 * in __finalise_sg() to actually map the
>>> +                 * bus address.
>>> +                 */
>>> +                s->length = 0;
>>> +                continue;
>>> +            case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>>> +                /*
>>> +                 * Mapping through host bridge should be
>>> +                 * mapped with regular IOVAs, thus we
>>> +                 * do nothing here and continue below.
>>> +                 */
>>> +                break;
>>> +            default:
>>> +                ret = -EREMOTEIO;
>>> +                goto out_restore_sg;
>>> +            }
>>> +        }
>>> +
>>>           /*
>>>            * Due to the alignment of our single IOVA allocation, we can
>>>            * depend on these assumptions about the segment boundary mask:
>>> @@ -1215,6 +1257,9 @@ static int iommu_dma_map_sg(struct device *dev,
>>> struct scatterlist *sg,
>>>           prev = s;
>>>       }
>>>   +    if (!iova_len)
>>> +        return __finalise_sg(dev, sg, nents, 0);
>>> +
>>>       iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev),
>>> dev);
>>>       if (!iova) {
>>>           ret = -ENOMEM;
>>> @@ -1236,7 +1281,7 @@ static int iommu_dma_map_sg(struct device *dev,
>>> struct scatterlist *sg,
>>>   out_restore_sg:
>>>       __invalidate_sg(sg, nents);
>>>   out:
>>> -    if (ret != -ENOMEM)
>>> +    if (ret != -ENOMEM && ret != -EREMOTEIO)
>>>           return -EINVAL;
>>>       return ret;
>>>   }
>>> @@ -1244,7 +1289,7 @@ static int iommu_dma_map_sg(struct device *dev,
>>> struct scatterlist *sg,
>>>   static void iommu_dma_unmap_sg(struct device *dev, struct
>>> scatterlist *sg,
>>>           int nents, enum dma_data_direction dir, unsigned long attrs)
>>>   {
>>> -    dma_addr_t start, end;
>>> +    dma_addr_t end, start = DMA_MAPPING_ERROR;
>>
>> There are several things I don't like about this logic, I'd rather have
>> "end = 0" here...
>
> Ok, I think that should work.
>
>>>       struct scatterlist *tmp;
>>>       int i;
>>>   @@ -1260,14 +1305,22 @@ static void iommu_dma_unmap_sg(struct device
>>> *dev, struct scatterlist *sg,
>>>        * The scatterlist segments are mapped into a single
>>>        * contiguous IOVA allocation, so this is incredibly easy.
>>>        */
>>
>> [ This comment rather stops being true :( ]
>
> Not exactly. Sure there are some segments in the SGL that have bus
> addresses, but all the regular IOVAs still have a single contiguous
> allocation and only require one call to __iommu_dma_unmap(). The only
> trick issues is finding the first and last actual IOVA SG to get the range.

"The scatterlist segments" means all of them, including any p2p ones
which are demonstrably not mapped into anything. The extra logic
required to skip non-mapped p2p segments goes beyond "incredibly easy".
If there's one thing I *do* definitely understand, it's my own comments ;)

Thanks,
Robin.

>
>>
>>> -    start = sg_dma_address(sg);
>>> -    for_each_sg(sg_next(sg), tmp, nents - 1, i) {
>>
>> ...then generalise the first-element special case here into a dedicated
>> "walk to the first non-p2p element" loop...
>
> Ok, I'll see what I can do for that.
>
> Logan

2022-06-29 22:43:34

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg



On 2022-06-29 13:15, Robin Murphy wrote:
> On 2022-06-29 16:57, Logan Gunthorpe wrote:
>>
>>
>>
>> On 2022-06-29 06:07, Robin Murphy wrote:
>>> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>>>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>>>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>>>> apply the appropriate bus address to the segment. The IOVA is not
>>>> created if the scatterlist only consists of P2PDMA pages.
>>>>
>>>> A P2PDMA page may have three possible outcomes when being mapped:
>>>>     1) If the data path between the two devices doesn't go through
>>>>        the root port, then it should be mapped with a PCI bus address
>>>>     2) If the data path goes through the host bridge, it should be
>>>> mapped
>>>>        normally with an IOMMU IOVA.
>>>>     3) It is not possible for the two devices to communicate and thus
>>>>        the mapping operation should fail (and it will return
>>>> -EREMOTEIO).
>>>>
>>>> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
>>>> indicate bus address segments. On unmap, P2PDMA segments are skipped
>>>> over when determining the start and end IOVA addresses.
>>>>
>>>> With this change, the flags variable in the dma_map_ops is set to
>>>> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
>>>>
>>>> Signed-off-by: Logan Gunthorpe <[email protected]>
>>>> Reviewed-by: Jason Gunthorpe <[email protected]>
>>>> ---
>>>>    drivers/iommu/dma-iommu.c | 68
>>>> +++++++++++++++++++++++++++++++++++----
>>>>    1 file changed, 61 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>>> index f90251572a5d..b01ca0c6a7ab 100644
>>>> --- a/drivers/iommu/dma-iommu.c
>>>> +++ b/drivers/iommu/dma-iommu.c
>>>> @@ -21,6 +21,7 @@
>>>>    #include <linux/iova.h>
>>>>    #include <linux/irq.h>
>>>>    #include <linux/list_sort.h>
>>>> +#include <linux/memremap.h>
>>>>    #include <linux/mm.h>
>>>>    #include <linux/mutex.h>
>>>>    #include <linux/pci.h>
>>>> @@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev,
>>>> struct scatterlist *sg, int nents,
>>>>            sg_dma_address(s) = DMA_MAPPING_ERROR;
>>>>            sg_dma_len(s) = 0;
>>>>    +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>>>
>>> Logically, should we not be able to use sg_is_dma_bus_address() here? I
>>> think it should be feasible, and simpler, to prepare the p2p segments
>>> up-front, such that at this point all we need to do is restore the
>>> original length (if even that, see below).
>>
>> Per my previous email, no, because sg_is_dma_bus_address() is not set
>> yet and not meant to tell you something about the page. That flag will
>> be set below by pci_p2pdma_map_bus_segment() and then checkd in
>> iommu_dma_unmap_sg() to determine if the dma_address in the segment
>> needs to be unmapped.
>
> I know it's not set yet as-is; I'm suggesting things should be
> restructured so that it *would be*. In the logical design of this code,
> the DMA addresses are effectively determined in iommu_dma_map_sg(), and
> __finalise_sg() merely converts them from a relative to an absolute form
> (along with undoing the other trickery). Thus the call to
> pci_p2pdma_map_bus_segment() absolutely belongs in the main
> iommu_map_sg() loop.

I don't see how that can work: __finalise_sg() does more than convert
them from relative to absolute, it also figures out which SG entry will
contain which dma_address segment. Which segment a P2PDMA address needs
to be programmed into depends on how 'cur' is calculated, which in
turn depends on things like seg_mask and max_len. This calculation is
not done in iommu_dma_map_sg() so I don't see how there's any hope of
assigning the bus address for the P2P segments in that function.

If there's a way to restructure things to make that possible which I'm
not seeing, I'm open to it, but it's certainly not immediately obvious.

>>>> +
>>>> +            switch (map_type) {
>>>> +            case PCI_P2PDMA_MAP_BUS_ADDR:
>>>> +                /*
>>>> +                 * A zero length will be ignored by
>>>> +                 * iommu_map_sg() and then can be detected
>>>
>>> If that is required behaviour then it needs an explicit check in
>>> iommu_map_sg() to guarantee (and document) it. It's only by chance that
>>> __iommu_map() happens to return success for size == 0 *if* all the other
>>> arguments still line up, which is a far cry from a safe no-op.
>>
>> What should such a check look like? I could certainly add some comments
>> to iommu_map_sg(), but I don't see what the code would need to check
>> for...
>
> I'd say a check for zero-length segments would look like "if (sg->length
> == 0)", most likely with a "continue;" on the following line.

Oh, that's pretty simple to add. Will change.

>>> However, rather than play yet more silly tricks, I think it would make
>>> even more sense to make iommu_map_sg() properly aware and able to skip
>>> direct p2p segments on its own. Once it becomes normal to pass mixed
>>> scatterlists around, it's only a matter of time until one ends up being
>>> handed to a driver which manages its own IOMMU domain, and then what?
>>
>> I suppose we can add another call to is_pci_p2pdma_page() inside
>> iommu_map_sg() if you think that is cleaner. Seems like more work on the
>> fast path to me, but I'm not opposed to it.
>
> I was thinking more of sg_is_dma_bus_address() but admittedly under my
> initial misapprehension of that. I suppose there's still a tenuous
> argument that even though we're not actually consuming sg_dma_address()
> at that point, if a segment *has* been earmarked for direct p2p, then
> skipping it rather than mapping it at the root complex TA that's
> probably never going to see those transactions might seem the more
> logical choice.
>
> However it's all a bit hypothetical, and not significantly cleaner than
> a zero-size special case, so I'm not particularly tied to the idea either.

Yeah, looking at it closer, I can't see how to get rid of the zero-size
special case without doing the whole pci_p2pdma_map_type() calculation
twice, which we really want to avoid.

Logan

2022-06-30 15:18:22

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

On 2022-06-29 23:41, Logan Gunthorpe wrote:
>
>
> On 2022-06-29 13:15, Robin Murphy wrote:
>> On 2022-06-29 16:57, Logan Gunthorpe wrote:
>>>
>>>
>>>
>>> On 2022-06-29 06:07, Robin Murphy wrote:
>>>> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>>>>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>>>>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>>>>> apply the appropriate bus address to the segment. The IOVA is not
>>>>> created if the scatterlist only consists of P2PDMA pages.
>>>>>
>>>>> A P2PDMA page may have three possible outcomes when being mapped:
>>>>>     1) If the data path between the two devices doesn't go through
>>>>>        the root port, then it should be mapped with a PCI bus address
>>>>>     2) If the data path goes through the host bridge, it should be
>>>>> mapped
>>>>>        normally with an IOMMU IOVA.
>>>>>     3) It is not possible for the two devices to communicate and thus
>>>>>        the mapping operation should fail (and it will return
>>>>> -EREMOTEIO).
>>>>>
>>>>> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
>>>>> indicate bus address segments. On unmap, P2PDMA segments are skipped
>>>>> over when determining the start and end IOVA addresses.
>>>>>
>>>>> With this change, the flags variable in the dma_map_ops is set to
>>>>> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
>>>>>
>>>>> Signed-off-by: Logan Gunthorpe <[email protected]>
>>>>> Reviewed-by: Jason Gunthorpe <[email protected]>
>>>>> ---
>>>>>    drivers/iommu/dma-iommu.c | 68
>>>>> +++++++++++++++++++++++++++++++++++----
>>>>>    1 file changed, 61 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>>>> index f90251572a5d..b01ca0c6a7ab 100644
>>>>> --- a/drivers/iommu/dma-iommu.c
>>>>> +++ b/drivers/iommu/dma-iommu.c
>>>>> @@ -21,6 +21,7 @@
>>>>>    #include <linux/iova.h>
>>>>>    #include <linux/irq.h>
>>>>>    #include <linux/list_sort.h>
>>>>> +#include <linux/memremap.h>
>>>>>    #include <linux/mm.h>
>>>>>    #include <linux/mutex.h>
>>>>>    #include <linux/pci.h>
>>>>> @@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev,
>>>>> struct scatterlist *sg, int nents,
>>>>>            sg_dma_address(s) = DMA_MAPPING_ERROR;
>>>>>            sg_dma_len(s) = 0;
>>>>>    +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>>>>
>>>> Logically, should we not be able to use sg_is_dma_bus_address() here? I
>>>> think it should be feasible, and simpler, to prepare the p2p segments
>>>> up-front, such that at this point all we need to do is restore the
>>>> original length (if even that, see below).
>>>
>>> Per my previous email, no, because sg_is_dma_bus_address() is not set
>>> yet and not meant to tell you something about the page. That flag will
>>> be set below by pci_p2pdma_map_bus_segment() and then checkd in
>>> iommu_dma_unmap_sg() to determine if the dma_address in the segment
>>> needs to be unmapped.
>>
>> I know it's not set yet as-is; I'm suggesting things should be
>> restructured so that it *would be*. In the logical design of this code,
>> the DMA addresses are effectively determined in iommu_dma_map_sg(), and
>> __finalise_sg() merely converts them from a relative to an absolute form
>> (along with undoing the other trickery). Thus the call to
>> pci_p2pdma_map_bus_segment() absolutely belongs in the main
>> iommu_map_sg() loop.
>
> I don't see how that can work: __finalise_sg() does more than convert
> them from relative to absolute, it also figures out which SG entry will
> contain which dma_address segment. Which segment a P2PDMA address needs
> to be programmed into depends on how 'cur' is calculated, which in
> turn depends on things like seg_mask and max_len. This calculation is
> not done in iommu_dma_map_sg() so I don't see how there's any hope of
> assigning the bus address for the P2P segments in that function.
>
> If there's a way to restructure things so that's possible that I'm not
> seeing, I'm open to it but it's certainly not immediately obvious.

Huh? It's still virtually the same thing; iommu_dma_map_sg() calls
pci_p2pdma_map_bus_segment(s) and sets s->length to 0 if
PCI_P2PDMA_MAP_BUS_ADDR, then __finalise_sg() can use
sg_is_dma_bus_address(s) in place of is_pci_p2pdma_page(sg_page(s)), and
just propagate the DMA address and original length from s to cur.
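To put that concretely, here is a rough sketch of the restructuring (not the
actual patch; the helper names, the p2pdma_state bookkeeping and the error
path are just my reading of this series, so treat the details as
illustrative):

    /* In the main iommu_dma_map_sg() segment loop: */
    if (is_pci_p2pdma_page(sg_page(s))) {
            switch (pci_p2pdma_map_segment(&p2pdma_state, dev, s)) {
            case PCI_P2PDMA_MAP_BUS_ADDR:
                    /*
                     * The helper has already set the bus address and
                     * marked the segment; a zero length keeps it out of
                     * the IOVA allocation.
                     */
                    s->length = 0;
                    continue;
            case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
                    break;  /* map through the IOMMU as normal */
            default:
                    ret = -EREMOTEIO;
                    goto out_restore_sg;
            }
    }

    /* Then in __finalise_sg(), replacing the is_pci_p2pdma_page() check: */
    if (sg_is_dma_bus_address(s)) {
            /*
             * The bus address was assigned in the map loop above, so just
             * close off any segment being concatenated and copy the
             * address and original length straight from s to cur.
             */
            ...
    }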

Here you've written a patch which looks to correctly interrupt any
ongoing concatenation state and convey some data from the given input
segment to the appropriate output segment, so I'm baffled as to why you'd
think you couldn't do what you've already done.

Thanks,
Robin.

2022-06-30 21:28:48

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg



On 2022-06-30 08:56, Robin Murphy wrote:
> On 2022-06-29 23:41, Logan Gunthorpe wrote:
>>
>>
>> On 2022-06-29 13:15, Robin Murphy wrote:
>>> On 2022-06-29 16:57, Logan Gunthorpe wrote:
>>>>
>>>>
>>>>
>>>> On 2022-06-29 06:07, Robin Murphy wrote:
>>>>> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>>>>>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>>>>>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>>>>>> apply the appropriate bus address to the segment. The IOVA is not
>>>>>> created if the scatterlist only consists of P2PDMA pages.
>>>>>>
>>>>>> A P2PDMA page may have three possible outcomes when being mapped:
>>>>>>      1) If the data path between the two devices doesn't go through
>>>>>>         the root port, then it should be mapped with a PCI bus address
>>>>>>      2) If the data path goes through the host bridge, it should be
>>>>>>         mapped normally with an IOMMU IOVA.
>>>>>>      3) It is not possible for the two devices to communicate and thus
>>>>>>         the mapping operation should fail (and it will return
>>>>>>         -EREMOTEIO).
>>>>>>
>>>>>> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
>>>>>> indicate bus address segments. On unmap, P2PDMA segments are skipped
>>>>>> over when determining the start and end IOVA addresses.
>>>>>>
>>>>>> With this change, the flags variable in the dma_map_ops is set to
>>>>>> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
>>>>>>
>>>>>> Signed-off-by: Logan Gunthorpe <[email protected]>
>>>>>> Reviewed-by: Jason Gunthorpe <[email protected]>
>>>>>> ---
>>>>>>     drivers/iommu/dma-iommu.c | 68 +++++++++++++++++++++++++++++++++++----
>>>>>>     1 file changed, 61 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>>>>> index f90251572a5d..b01ca0c6a7ab 100644
>>>>>> --- a/drivers/iommu/dma-iommu.c
>>>>>> +++ b/drivers/iommu/dma-iommu.c
>>>>>> @@ -21,6 +21,7 @@
>>>>>>     #include <linux/iova.h>
>>>>>>     #include <linux/irq.h>
>>>>>>     #include <linux/list_sort.h>
>>>>>> +#include <linux/memremap.h>
>>>>>>     #include <linux/mm.h>
>>>>>>     #include <linux/mutex.h>
>>>>>>     #include <linux/pci.h>
>>>>>> @@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev,
>>>>>> struct scatterlist *sg, int nents,
>>>>>>             sg_dma_address(s) = DMA_MAPPING_ERROR;
>>>>>>             sg_dma_len(s) = 0;
>>>>>>     +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>>>>>
>>>>> Logically, should we not be able to use sg_is_dma_bus_address() here? I
>>>>> think it should be feasible, and simpler, to prepare the p2p segments
>>>>> up-front, such that at this point all we need to do is restore the
>>>>> original length (if even that, see below).
>>>>
>>>> Per my previous email, no, because sg_is_dma_bus_address() is not set
>>>> yet and not meant to tell you something about the page. That flag will
>>>> be set below by pci_p2pdma_map_bus_segment() and then checked in
>>>> iommu_dma_unmap_sg() to determine if the dma_address in the segment
>>>> needs to be unmapped.
>>>
>>> I know it's not set yet as-is; I'm suggesting things should be
>>> restructured so that it *would be*. In the logical design of this code,
>>> the DMA addresses are effectively determined in iommu_dma_map_sg(), and
>>> __finalise_sg() merely converts them from a relative to an absolute form
>>> (along with undoing the other trickery). Thus the call to
>>> pci_p2pdma_map_bus_segment() absolutely belongs in the main
>>> iommu_map_sg() loop.
>>
>> I don't see how that can work: __finalise_sg() does more than convert
>> them from relative to absolute, it also figures out which SG entry will
>> contain which dma_address segment. Which segment a P2PDMA address needs
>> to be programmed into depends on how 'cur' is calculated, which in
>> turn depends on things like seg_mask and max_len. This calculation is
>> not done in iommu_dma_map_sg() so I don't see how there's any hope of
>> assigning the bus address for the P2P segments in that function.
>>
>> If there's a way to restructure things so that's possible that I'm not
>> seeing, I'm open to it but it's certainly not immediately obvious.
>
> Huh? It's still virtually the same thing; iommu_dma_map_sg() calls
> pci_p2pdma_map_bus_segment(s) and sets s->length to 0 if
> PCI_P2PDMA_MAP_BUS_ADDR, then __finalise_sg() can use
> sg_is_dma_bus_address(s) in place of is_pci_p2pdma_page(sg_page(s)), and
> just propagate the DMA address and original length from s to cur.
>
> Here you've written a patch which looks to correctly interrupt any
> ongoing concatenation state and convey some data from the given input
> segment to the appropriate output segment, so I'm baffled as to why you'd
> think you couldn't do what you've already done.

Ah, I understand now, thanks for your patience. It took me a couple of
read-throughs before I got it, but I figured it out and now have a
working implementation that looks really nice. It's a big improvement
not needing the two different P2PDMA helpers.

I'll send a v8 of just the first 13 patches next week after a bit more
testing. I've put a draft git branch here if you want to look at it
before that:

https://github.com/sbates130272/linux-p2pmem p2pdma_map_v8
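For reference while the v8 code is still settling, a speculative sketch of
what a single merged helper might end up looking like (the actual
implementation is in the branch above; the state fields and names here are
only my rough outline, not the final code):

    enum pci_p2pdma_map_type
    pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state,
                           struct device *dev, struct scatterlist *sg)
    {
            /* Recompute the map type only when the pgmap changes */
            if (state->pgmap != sg_page(sg)->pgmap) {
                    state->pgmap = sg_page(sg)->pgmap;
                    state->map = pci_p2pdma_map_type(state->pgmap, dev);
                    state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
            }

            /* For bus-address mappings, set and mark the DMA address here */
            if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
                    sg->dma_address = sg_phys(sg) + state->bus_off;
                    sg_dma_len(sg) = sg->length;
                    sg_dma_mark_bus_address(sg);
            }

            return state->map;
    }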

Thanks!

Logan

2022-07-04 15:13:16

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

On 2022-06-15 17:12, Logan Gunthorpe wrote:
> Make use of the third free LSB in scatterlist's page_link on 64bit systems.
>
> The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
> given SGL segments dma_address points to a PCI bus address.
> dma_unmap_sg_p2pdma() will need to perform different cleanup when a
> segment is marked as a bus address.
>
> The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
> PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
> majority of P2PDMA use cases are restricted to newer root complexes and
> generally require the extra address space for memory BARs used in the
> transactions.

Another thought that's hit me slightly late: depending on CONFIG_64BIT
also means that we've got a whole 4 bytes of padding in struct
scatterlist to play with, so at that point maybe it's worth considering
carrying extra DMA mapping properties in their own field(s). For
instance, it would also be really helpful to flag whether a segment is
bounce-buffered or not.
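Just to illustrate (purely hypothetical, not something proposed in this
series): with 64BIT guaranteed, that padding could become an explicit
dma_flags field rather than stealing page_link bits, e.g.:

    struct scatterlist {
            unsigned long   page_link;
            unsigned int    offset;
            unsigned int    length;
            dma_addr_t      dma_address;
    #ifdef CONFIG_NEED_SG_DMA_LENGTH
            unsigned int    dma_length;
    #endif
    #ifdef CONFIG_PCI_P2PDMA
            unsigned int    dma_flags;      /* fits in the existing padding */
    #endif
    };

    /* Flag values would then be private to the DMA layer: */
    #define SG_DMA_BUS_ADDRESS      (1 << 0)  /* dma_address is a PCI bus address */
    #define SG_DMA_BOUNCE_BUFFER    (1 << 1)  /* hypothetical: segment was bounced */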

Robin.

> Signed-off-by: Logan Gunthorpe <[email protected]>
> Reviewed-by: Chaitanya Kulkarni <[email protected]>
> ---
> drivers/pci/Kconfig | 5 +++++
> include/linux/scatterlist.h | 44 ++++++++++++++++++++++++++++++++++++-
> 2 files changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 133c73207782..5cc7cba1941f 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -164,6 +164,11 @@ config PCI_PASID
> config PCI_P2PDMA
> bool "PCI peer-to-peer transfer support"
> depends on ZONE_DEVICE
> + #
> + # The need for the scatterlist DMA bus address flag means PCI P2PDMA
> + # requires 64bit
> + #
> + depends on 64BIT
> select GENERIC_ALLOCATOR
> help
> Enables drivers to do PCI peer-to-peer transactions to and from
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index 7ff9d6386c12..6561ca8aead8 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -64,12 +64,24 @@ struct sg_append_table {
> #define SG_CHAIN 0x01UL
> #define SG_END 0x02UL
>
> +/*
> + * bit 2 is the third free bit in the page_link on 64bit systems which
> + * is used by dma_unmap_sg() to determine if the dma_address is a
> + * bus address when doing P2PDMA.
> + */
> +#ifdef CONFIG_PCI_P2PDMA
> +#define SG_DMA_BUS_ADDRESS 0x04UL
> +static_assert(__alignof__(struct page) >= 8);
> +#else
> +#define SG_DMA_BUS_ADDRESS 0x00UL
> +#endif
> +
> /*
> * We overload the LSB of the page pointer to indicate whether it's
> * a valid sg entry, or whether it points to the start of a new scatterlist.
> * Those low bits are there for everyone! (thanks mason :-)
> */
> -#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)
>
> static inline unsigned int __sg_flags(struct scatterlist *sg)
> {
> @@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
> return __sg_flags(sg) & SG_END;
> }
>
> +static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
> +{
> + return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
> +}
> +
> /**
> * sg_assign_page - Assign a given page to an SG entry
> * @sg: SG entry
> @@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
> sg->page_link &= ~SG_END;
> }
>
> +/**
> + * sg_dma_mark_bus_address - Mark the scatterlist entry as a bus address
> + * @sg: SG entry
> + *
> + * Description:
> + * Marks the passed in sg entry to indicate that the dma_address is
> + * a bus address and doesn't need to be unmapped.
> + **/
> +static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
> +{
> + sg->page_link |= SG_DMA_BUS_ADDRESS;
> +}
> +
> +/**
> + * sg_dma_unmark_bus_address - Unmark the scatterlist entry as a bus address
> + * @sg: SG entry
> + *
> + * Description:
> + * Clears the bus address mark.
> + **/
> +static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
> +{
> + sg->page_link &= ~SG_DMA_BUS_ADDRESS;
> +}
> +
> /**
> * sg_phys - Return physical address of an sg entry
> * @sg: SG entry