2024-03-05 11:19:13

by Leon Romanovsky

Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

This is a complementary part to the proposed LSF/MM topic.
https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057

This is posted as an RFC to get feedback on the proposed split. The RDMA, VFIO
and DMA patches are ready for review and inclusion; the NVMe patches are still
in progress, as they require agreement on the API first.

Thanks

-------------------------------------------------------------------------------
The DMA mapping operation performs two steps at the same time: it allocates
IOVA space and actually maps DMA pages to that space. This one-shot
operation works perfectly for simple scenarios, where callers use
the DMA API in the control path while setting up hardware.

However, in more complex scenarios, where DMA mapping is needed in the data
path, and especially when some specific datatype is involved,
such a one-shot approach has its drawbacks.

That approach pushes developers to introduce new DMA APIs for each specific
datatype: for example, the existing scatter-gather mapping functions,
Chuck's recent RFC series adding biovec-related DMA mapping [1], and
probably struct folio will need one too.

These advanced DMA mapping APIs are needed to calculate the IOVA size, so it
can be allocated as one chunk, and to perform the offset calculations that
determine which part of the IOVA to map.

Instead of teaching the DMA core about these specific datatypes, let's separate
the existing DMA mapping routine into two steps and give advanced
callers (subsystems) the option to perform all calculations internally in
advance and map pages later, when needed.

In this series, three users are converted, and each conversion demonstrates a
different gain:
1. RDMA simplifies and speeds up its pagefault handling for
on-demand-paging (ODP) mode.
2. VFIO PCI live migration code saves a huge chunk of memory.
3. NVMe PCI avoids intermediate SG table manipulation and operates
directly on BIOs.

Thanks

[1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net

Chaitanya Kulkarni (2):
block: add dma_link_range() based API
nvme-pci: use blk_rq_dma_map() for NVMe SGL

Leon Romanovsky (14):
mm/hmm: let users to tag specific PFNs
dma-mapping: provide an interface to allocate IOVA
dma-mapping: provide callbacks to link/unlink pages to specific IOVA
iommu/dma: Provide an interface to allow preallocate IOVA
iommu/dma: Prepare map/unmap page functions to receive IOVA
iommu/dma: Implement link/unlink page callbacks
RDMA/umem: Preallocate and cache IOVA for UMEM ODP
RDMA/umem: Store ODP access mask information in PFN
RDMA/core: Separate DMA mapping to caching IOVA and page linkage
RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
vfio/mlx5: Explicitly use number of pages instead of allocated length
vfio/mlx5: Rewrite create mkey flow to allow better code reuse
vfio/mlx5: Explicitly store page list
vfio/mlx5: Convert vfio to use DMA link API

Documentation/core-api/dma-attributes.rst | 7 +
block/blk-merge.c | 156 ++++++++++++++
drivers/infiniband/core/umem_odp.c | 219 +++++++------------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
drivers/infiniband/hw/mlx5/odp.c | 59 +++--
drivers/iommu/dma-iommu.c | 129 ++++++++---
drivers/nvme/host/pci.c | 220 +++++--------------
drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++----------
drivers/vfio/pci/mlx5/cmd.h | 22 +-
drivers/vfio/pci/mlx5/main.c | 136 +++++-------
include/linux/blk-mq.h | 9 +
include/linux/dma-map-ops.h | 13 ++
include/linux/dma-mapping.h | 39 ++++
include/linux/hmm.h | 3 +
include/rdma/ib_umem_odp.h | 22 +-
include/rdma/ib_verbs.h | 54 +++++
kernel/dma/debug.h | 2 +
kernel/dma/direct.h | 7 +-
kernel/dma/mapping.c | 91 ++++++++
mm/hmm.c | 34 +--
20 files changed, 870 insertions(+), 605 deletions(-)

--
2.44.0



2024-03-05 11:19:41

by Leon Romanovsky

Subject: [RFC RESEND 01/16] mm/hmm: let users to tag specific PFNs

From: Leon Romanovsky <[email protected]>

Introduce a new sticky flag, which isn't overwritten by an HMM range fault.
Such a flag allows users to tag specific PFNs with extra data, in addition
to the data already filled in by HMM.

Signed-off-by: Leon Romanovsky <[email protected]>
---
include/linux/hmm.h | 3 +++
mm/hmm.c | 34 +++++++++++++++++++++-------------
2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 126a36571667..b90902baa593 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,7 @@ struct mmu_interval_notifier;
* HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
* HMM_PFN_ERROR - accessing the pfn is impossible and the device should
* fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_STICKY - Flag preserved on input-to-output transformation
*
* On input:
* 0 - Return the current state of the page, do not fault it.
@@ -36,6 +37,8 @@ enum hmm_pfn_flags {
HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
+ /* Sticky flag, carried from Input to Output */
+ HMM_PFN_STICKY = 1UL << (BITS_PER_LONG - 7),
HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),

/* Input flags */
diff --git a/mm/hmm.c b/mm/hmm.c
index 277ddcab4947..9645a72beec0 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -44,8 +44,10 @@ static int hmm_pfns_fill(unsigned long addr, unsigned long end,
{
unsigned long i = (addr - range->start) >> PAGE_SHIFT;

- for (; addr < end; addr += PAGE_SIZE, i++)
- range->hmm_pfns[i] = cpu_flags;
+ for (; addr < end; addr += PAGE_SIZE, i++) {
+ range->hmm_pfns[i] &= HMM_PFN_STICKY;
+ range->hmm_pfns[i] |= cpu_flags;
+ }
return 0;
}

@@ -202,8 +204,10 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
return hmm_vma_fault(addr, end, required_fault, walk);

pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
- hmm_pfns[i] = pfn | cpu_flags;
+ for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+ hmm_pfns[i] &= HMM_PFN_STICKY;
+ hmm_pfns[i] |= pfn | cpu_flags;
+ }
return 0;
}
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -236,7 +240,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (required_fault)
goto fault;
- *hmm_pfn = 0;
+ *hmm_pfn = *hmm_pfn & HMM_PFN_STICKY;
return 0;
}

@@ -253,14 +257,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
cpu_flags |= HMM_PFN_WRITE;
- *hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
+ *hmm_pfn = (*hmm_pfn & HMM_PFN_STICKY) | swp_offset_pfn(entry) | cpu_flags;
return 0;
}

required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (!required_fault) {
- *hmm_pfn = 0;
+ *hmm_pfn = *hmm_pfn & HMM_PFN_STICKY;
return 0;
}

@@ -304,11 +308,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
pte_unmap(ptep);
return -EFAULT;
}
- *hmm_pfn = HMM_PFN_ERROR;
+ *hmm_pfn = (*hmm_pfn & HMM_PFN_STICKY) | HMM_PFN_ERROR;
return 0;
}

- *hmm_pfn = pte_pfn(pte) | cpu_flags;
+ *hmm_pfn = (*hmm_pfn & HMM_PFN_STICKY) | pte_pfn(pte) | cpu_flags;
return 0;

fault:
@@ -453,8 +457,10 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
}

pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
- for (i = 0; i < npages; ++i, ++pfn)
- hmm_pfns[i] = pfn | cpu_flags;
+ for (i = 0; i < npages; ++i, ++pfn) {
+ hmm_pfns[i] &= HMM_PFN_STICKY;
+ hmm_pfns[i] |= pfn | cpu_flags;
+ }
goto out_unlock;
}

@@ -512,8 +518,10 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
}

pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
- for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
- range->hmm_pfns[i] = pfn | cpu_flags;
+ for (; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+ range->hmm_pfns[i] &= HMM_PFN_STICKY;
+ range->hmm_pfns[i] |= pfn | cpu_flags;
+ }

spin_unlock(ptl);
return 0;
--
2.44.0


2024-03-05 11:20:11

by Leon Romanovsky

Subject: [RFC RESEND 02/16] dma-mapping: provide an interface to allocate IOVA

From: Leon Romanovsky <[email protected]>

The existing .map_page() callback does two things at the same time: it
allocates an IOVA and links DMA pages. That combination works great for
most callers, who use it in control paths, but is less effective
in fast paths.

Advanced callers already manage their data in some sort of
database and can perform the IOVA allocation in advance, leaving only the
range linkage operation in the fast path.

Provide an interface to allocate/deallocate an IOVA; the next patch will
link/unlink DMA ranges to that specific IOVA.

Signed-off-by: Leon Romanovsky <[email protected]>
---
include/linux/dma-map-ops.h | 3 +++
include/linux/dma-mapping.h | 20 ++++++++++++++++++++
kernel/dma/mapping.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4abc60f04209..bd605b44bb57 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -83,6 +83,9 @@ struct dma_map_ops {
size_t (*max_mapping_size)(struct device *dev);
size_t (*opt_mapping_size)(void);
unsigned long (*get_merge_boundary)(struct device *dev);
+
+ dma_addr_t (*alloc_iova)(struct device *dev, size_t size);
+ void (*free_iova)(struct device *dev, dma_addr_t dma_addr, size_t size);
};

#ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 4a658de44ee9..176fb8a86d63 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -91,6 +91,16 @@ static inline void debug_dma_map_single(struct device *dev, const void *addr,
}
#endif /* CONFIG_DMA_API_DEBUG */

+struct dma_iova_attrs {
+ /* OUT field */
+ dma_addr_t addr;
+ /* IN fields */
+ struct device *dev;
+ size_t size;
+ enum dma_data_direction dir;
+ unsigned long attrs;
+};
+
#ifdef CONFIG_HAS_DMA
static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
{
@@ -101,6 +111,9 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
return 0;
}

+int dma_alloc_iova(struct dma_iova_attrs *iova);
+void dma_free_iova(struct dma_iova_attrs *iova);
+
dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs);
@@ -159,6 +172,13 @@ void dma_vunmap_noncontiguous(struct device *dev, void *vaddr);
int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
size_t size, struct sg_table *sgt);
#else /* CONFIG_HAS_DMA */
+static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
+{
+ return -EOPNOTSUPP;
+}
+static inline void dma_free_iova(struct dma_iova_attrs *iova)
+{
+}
static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
enum dma_data_direction dir, unsigned long attrs)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 58db8fd70471..b6b27bab90f3 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -183,6 +183,36 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
}
EXPORT_SYMBOL(dma_unmap_page_attrs);

+int dma_alloc_iova(struct dma_iova_attrs *iova)
+{
+ struct device *dev = iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) || !ops->alloc_iova) {
+ iova->addr = 0;
+ return 0;
+ }
+
+ iova->addr = ops->alloc_iova(dev, iova->size);
+ if (dma_mapping_error(dev, iova->addr))
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL(dma_alloc_iova);
+
+void dma_free_iova(struct dma_iova_attrs *iova)
+{
+ struct device *dev = iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) || !ops->free_iova)
+ return;
+
+ ops->free_iova(dev, iova->addr, iova->size);
+}
+EXPORT_SYMBOL(dma_free_iova);
+
static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
{
--
2.44.0


2024-03-05 11:20:39

by Leon Romanovsky

Subject: [RFC RESEND 03/16] dma-mapping: provide callbacks to link/unlink pages to specific IOVA

From: Leon Romanovsky <[email protected]>

Introduce a new DMA link/unlink API to provide a way for advanced users
to directly map/unmap pages without the need to allocate an IOVA on every
map call.

Signed-off-by: Leon Romanovsky <[email protected]>
---
include/linux/dma-map-ops.h | 10 +++++++
include/linux/dma-mapping.h | 13 +++++++++
kernel/dma/debug.h | 2 ++
kernel/dma/direct.h | 3 ++
kernel/dma/mapping.c | 57 +++++++++++++++++++++++++++++++++++++
5 files changed, 85 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index bd605b44bb57..fd03a080df1e 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -86,6 +86,13 @@ struct dma_map_ops {

dma_addr_t (*alloc_iova)(struct device *dev, size_t size);
void (*free_iova)(struct device *dev, dma_addr_t dma_addr, size_t size);
+ dma_addr_t (*link_range)(struct device *dev, struct page *page,
+ unsigned long offset, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs);
+ void (*unlink_range)(struct device *dev, dma_addr_t dma_handle,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs);
};

#ifdef CONFIG_DMA_OPS
@@ -428,6 +435,9 @@ bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg,
#define arch_dma_unmap_sg_direct(d, s, n) (false)
#endif

+#define arch_dma_link_range_direct arch_dma_map_page_direct
+#define arch_dma_unlink_range_direct arch_dma_unmap_page_direct
+
#ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
bool coherent);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 176fb8a86d63..91cc084adb53 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -113,6 +113,9 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)

int dma_alloc_iova(struct dma_iova_attrs *iova);
void dma_free_iova(struct dma_iova_attrs *iova);
+dma_addr_t dma_link_range(struct page *page, unsigned long offset,
+ struct dma_iova_attrs *iova, dma_addr_t dma_offset);
+void dma_unlink_range(struct dma_iova_attrs *iova, dma_addr_t dma_offset);

dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
@@ -179,6 +182,16 @@ static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
static inline void dma_free_iova(struct dma_iova_attrs *iova)
{
}
+static inline dma_addr_t dma_link_range(struct page *page, unsigned long offset,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ return DMA_MAPPING_ERROR;
+}
+static inline void dma_unlink_range(struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+}
static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
enum dma_data_direction dir, unsigned long attrs)
diff --git a/kernel/dma/debug.h b/kernel/dma/debug.h
index f525197d3cae..3d529f355c6d 100644
--- a/kernel/dma/debug.h
+++ b/kernel/dma/debug.h
@@ -127,4 +127,6 @@ static inline void debug_dma_sync_sg_for_device(struct device *dev,
{
}
#endif /* CONFIG_DMA_API_DEBUG */
+#define debug_dma_link_range debug_dma_map_page
+#define debug_dma_unlink_range debug_dma_unmap_page
#endif /* _KERNEL_DMA_DEBUG_H */
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 18d346118fe8..1c30e1cd607a 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -125,4 +125,7 @@ static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
+
+#define dma_direct_link_range dma_direct_map_page
+#define dma_direct_unlink_range dma_direct_unmap_page
#endif /* _KERNEL_DMA_DIRECT_H */
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6b27bab90f3..f989c64622c2 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -213,6 +213,63 @@ void dma_free_iova(struct dma_iova_attrs *iova)
}
EXPORT_SYMBOL(dma_free_iova);

+/**
+ * dma_link_range - Link a physical page to DMA address
+ * @page: The page to be mapped
+ * @offset: The offset within the page
+ * @iova: Preallocated IOVA attributes
+ * @dma_offset: DMA offset from which this page needs to be linked
+ *
+ * dma_alloc_iova() allocates IOVA based on the size specified by the user in
+ * iova->size. Call this function after IOVA allocation to link @page from
+ * @offset to get the DMA address. Note that the very first call to this
+ * function will have @dma_offset set to 0 in the IOVA space allocated from
+ * dma_alloc_iova(). For subsequent calls to this function on the same @iova,
+ * @dma_offset needs to be advanced by the caller with the size of the
+ * previous page that was linked, i.e. each page is linked at the DMA
+ * offset right after the previously linked one.
+ */
+dma_addr_t dma_link_range(struct page *page, unsigned long offset,
+ struct dma_iova_attrs *iova, dma_addr_t dma_offset)
+{
+ struct device *dev = iova->dev;
+ size_t size = iova->size;
+ enum dma_data_direction dir = iova->dir;
+ unsigned long attrs = iova->attrs;
+ dma_addr_t addr = iova->addr + dma_offset;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) ||
+ arch_dma_link_range_direct(dev, page_to_phys(page) + offset + size))
+ addr = dma_direct_link_range(dev, page, offset, size, dir, attrs);
+ else if (ops->link_range)
+ addr = ops->link_range(dev, page, offset, addr, size, dir, attrs);
+
+ kmsan_handle_dma(page, offset, size, dir);
+ debug_dma_link_range(dev, page, offset, size, dir, addr, attrs);
+ return addr;
+}
+EXPORT_SYMBOL(dma_link_range);
+
+void dma_unlink_range(struct dma_iova_attrs *iova, dma_addr_t dma_offset)
+{
+ struct device *dev = iova->dev;
+ size_t size = iova->size;
+ enum dma_data_direction dir = iova->dir;
+ unsigned long attrs = iova->attrs;
+ dma_addr_t addr = iova->addr + dma_offset;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) ||
+ arch_dma_unlink_range_direct(dev, addr + size))
+ dma_direct_unlink_range(dev, addr, size, dir, attrs);
+ else if (ops->unlink_range)
+ ops->unlink_range(dev, addr, size, dir, attrs);
+
+ debug_dma_unlink_range(dev, addr, size, dir);
+}
+EXPORT_SYMBOL(dma_unlink_range);
+
static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
{
--
2.44.0


2024-03-05 11:21:47

by Leon Romanovsky

Subject: [RFC RESEND 05/16] iommu/dma: Prepare map/unmap page functions to receive IOVA

From: Leon Romanovsky <[email protected]>

Extend the existing map_page/unmap_page function implementations to accept a
preallocated IOVA. In that case, the IOVA allocation is
skipped, but the rest of the code stays the same.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/iommu/dma-iommu.c | 68 ++++++++++++++++++++++++++-------------
1 file changed, 45 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index e55726783501..dbdd373a609a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -824,7 +824,7 @@ static void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
}

static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
- size_t size)
+ size_t size, bool free_iova)
{
struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -843,17 +843,19 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,

if (!iotlb_gather.queued)
iommu_iotlb_sync(domain, &iotlb_gather);
- __iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
+ if (free_iova)
+ __iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
}

static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
- size_t size, int prot, u64 dma_mask)
+ dma_addr_t iova, size_t size, int prot,
+ u64 dma_mask)
{
struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iova_domain *iovad = &cookie->iovad;
size_t iova_off = iova_offset(iovad, phys);
- dma_addr_t iova;
+ bool no_iova = !iova;

if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
iommu_deferred_attach(dev, domain))
@@ -861,12 +863,14 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,

size = iova_align(iovad, size + iova_off);

- iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+ if (no_iova)
+ iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
if (!iova)
return DMA_MAPPING_ERROR;

if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
- __iommu_dma_free_iova(cookie, iova, size, NULL);
+ if (no_iova)
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
return DMA_MAPPING_ERROR;
}
return iova + iova_off;
@@ -1031,7 +1035,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
return vaddr;

out_unmap:
- __iommu_dma_unmap(dev, *dma_handle, size);
+ __iommu_dma_unmap(dev, *dma_handle, size, true);
__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
return NULL;
}
@@ -1060,7 +1064,7 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
{
struct dma_sgt_handle *sh = sgt_handle(sgt);

- __iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
+ __iommu_dma_unmap(dev, sgt->sgl->dma_address, size, true);
__iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
sg_free_table(&sh->sgt);
kfree(sh);
@@ -1131,9 +1135,11 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
}

-static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size, enum dma_data_direction dir,
- unsigned long attrs)
+static dma_addr_t __iommu_dma_map_pages(struct device *dev, struct page *page,
+ unsigned long offset, dma_addr_t iova,
+ size_t size,
+ enum dma_data_direction dir,
+ unsigned long attrs)
{
phys_addr_t phys = page_to_phys(page) + offset;
bool coherent = dev_is_dma_coherent(dev);
@@ -1141,7 +1147,7 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iova_domain *iovad = &cookie->iovad;
- dma_addr_t iova, dma_mask = dma_get_mask(dev);
+ dma_addr_t addr, dma_mask = dma_get_mask(dev);

/*
* If both the physical buffer start address and size are
@@ -1182,14 +1188,23 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
arch_sync_dma_for_device(phys, size, dir);

- iova = __iommu_dma_map(dev, phys, size, prot, dma_mask);
- if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
+ addr = __iommu_dma_map(dev, phys, iova, size, prot, dma_mask);
+ if (addr == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
- return iova;
+ return addr;
}

-static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
- size_t size, enum dma_data_direction dir, unsigned long attrs)
+static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return __iommu_dma_map_pages(dev, page, offset, 0, size, dir, attrs);
+}
+
+static void __iommu_dma_unmap_pages(struct device *dev, dma_addr_t dma_handle,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs, bool free_iova)
{
struct iommu_domain *domain = iommu_get_dma_domain(dev);
phys_addr_t phys;
@@ -1201,12 +1216,19 @@ static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && !dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(phys, size, dir);

- __iommu_dma_unmap(dev, dma_handle, size);
+ __iommu_dma_unmap(dev, dma_handle, size, free_iova);

if (unlikely(is_swiotlb_buffer(dev, phys)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
}

+static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ __iommu_dma_unmap_pages(dev, dma_handle, size, dir, attrs, true);
+}
+
/*
* Prepare a successfully-mapped scatterlist to give back to the caller.
*
@@ -1509,13 +1531,13 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
}

if (end)
- __iommu_dma_unmap(dev, start, end - start);
+ __iommu_dma_unmap(dev, start, end - start, true);
}

static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
- return __iommu_dma_map(dev, phys, size,
+ return __iommu_dma_map(dev, phys, 0, size,
dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
dma_get_mask(dev));
}
@@ -1523,7 +1545,7 @@ static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
- __iommu_dma_unmap(dev, handle, size);
+ __iommu_dma_unmap(dev, handle, size, true);
}

static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
@@ -1560,7 +1582,7 @@ static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
static void iommu_dma_free(struct device *dev, size_t size, void *cpu_addr,
dma_addr_t handle, unsigned long attrs)
{
- __iommu_dma_unmap(dev, handle, size);
+ __iommu_dma_unmap(dev, handle, size, true);
__iommu_dma_free(dev, size, cpu_addr);
}

@@ -1626,7 +1648,7 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
if (!cpu_addr)
return NULL;

- *handle = __iommu_dma_map(dev, page_to_phys(page), size, ioprot,
+ *handle = __iommu_dma_map(dev, page_to_phys(page), 0, size, ioprot,
dev->coherent_dma_mask);
if (*handle == DMA_MAPPING_ERROR) {
__iommu_dma_free(dev, size, cpu_addr);
--
2.44.0


2024-03-05 11:22:10

by Leon Romanovsky

Subject: [RFC RESEND 06/16] iommu/dma: Implement link/unlink page callbacks

From: Leon Romanovsky <[email protected]>

Add an implementation of the link/unlink interface to map/unmap
pages in the fast path with a pre-allocated IOVA.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/iommu/dma-iommu.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index dbdd373a609a..b683c4a4e9f8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1752,6 +1752,21 @@ static void iommu_dma_free_iova(struct device *dev, dma_addr_t iova,
__iommu_dma_free_iova(cookie, iova, size, &iotlb_gather);
}

+static dma_addr_t iommu_dma_link_range(struct device *dev, struct page *page,
+ unsigned long offset, dma_addr_t iova,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return __iommu_dma_map_pages(dev, page, offset, iova, size, dir, attrs);
+}
+
+static void iommu_dma_unlink_range(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ __iommu_dma_unmap_pages(dev, addr, size, dir, attrs, false);
+}
+
static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
@@ -1776,6 +1791,8 @@ static const struct dma_map_ops iommu_dma_ops = {
.opt_mapping_size = iommu_dma_opt_mapping_size,
.alloc_iova = iommu_dma_alloc_iova,
.free_iova = iommu_dma_free_iova,
+ .link_range = iommu_dma_link_range,
+ .unlink_range = iommu_dma_unlink_range,
};

/*
--
2.44.0


2024-03-05 11:23:25

by Leon Romanovsky

Subject: [RFC RESEND 09/16] RDMA/core: Separate DMA mapping to caching IOVA and page linkage

From: Leon Romanovsky <[email protected]>

Reuse the newly added DMA API to cache the IOVA and only link/unlink pages
in the fast path.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 57 ++----------------------------
drivers/infiniband/hw/mlx5/odp.c | 22 +++++++++++-
include/rdma/ib_umem_odp.h | 8 +----
include/rdma/ib_verbs.h | 36 +++++++++++++++++++
4 files changed, 61 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 3619fb78f786..1301009a6b78 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -81,20 +81,13 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
if (!umem_odp->pfn_list)
return -ENOMEM;

- umem_odp->dma_list = kvcalloc(
- ndmas, sizeof(*umem_odp->dma_list), GFP_KERNEL);
- if (!umem_odp->dma_list) {
- ret = -ENOMEM;
- goto out_pfn_list;
- }

umem_odp->iova.dev = dev->dma_device;
umem_odp->iova.size = end - start;
umem_odp->iova.dir = DMA_BIDIRECTIONAL;
ret = ib_dma_alloc_iova(dev, &umem_odp->iova);
if (ret)
- goto out_dma_list;
-
+ goto out_pfn_list;

ret = mmu_interval_notifier_insert(&umem_odp->notifier,
umem_odp->umem.owning_mm,
@@ -107,8 +100,6 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,

out_free_iova:
ib_dma_free_iova(dev, &umem_odp->iova);
-out_dma_list:
- kvfree(umem_odp->dma_list);
out_pfn_list:
kvfree(umem_odp->pfn_list);
return ret;
@@ -288,7 +279,6 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
mutex_unlock(&umem_odp->umem_mutex);
mmu_interval_notifier_remove(&umem_odp->notifier);
ib_dma_free_iova(dev, &umem_odp->iova);
- kvfree(umem_odp->dma_list);
kvfree(umem_odp->pfn_list);
}
put_pid(umem_odp->tgid);
@@ -296,40 +286,10 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
}
EXPORT_SYMBOL(ib_umem_odp_release);

-/*
- * Map for DMA and insert a single page into the on-demand paging page tables.
- *
- * @umem: the umem to insert the page to.
- * @dma_index: index in the umem to add the dma to.
- * @page: the page struct to map and add.
- * @access_mask: access permissions needed for this page.
- *
- * The function returns -EFAULT if the DMA mapping operation fails.
- *
- */
-static int ib_umem_odp_map_dma_single_page(
- struct ib_umem_odp *umem_odp,
- unsigned int dma_index,
- struct page *page)
-{
- struct ib_device *dev = umem_odp->umem.ibdev;
- dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];
-
- *dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
- DMA_BIDIRECTIONAL);
- if (ib_dma_mapping_error(dev, *dma_addr)) {
- *dma_addr = 0;
- return -EFAULT;
- }
- umem_odp->npages++;
- return 0;
-}
-
/**
* ib_umem_odp_map_dma_and_lock - DMA map userspace memory in an ODP MR and lock it.
*
* Maps the range passed in the argument to DMA addresses.
- * The DMA addresses of the mapped pages is updated in umem_odp->dma_list.
* Upon success the ODP MR will be locked to let caller complete its device
* page table update.
*
@@ -437,15 +397,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
__func__, hmm_order, page_shift);
break;
}
-
- ret = ib_umem_odp_map_dma_single_page(
- umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
- if (ret < 0) {
- ibdev_dbg(umem_odp->umem.ibdev,
- "ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
- break;
- }
- range.hmm_pfns[pfn_index] |= HMM_PFN_STICKY;
}
/* upon success lock should stay on hold for the callee */
if (!ret)
@@ -465,7 +416,6 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
u64 bound)
{
- dma_addr_t dma;
int idx;
u64 addr;
struct ib_device *dev = umem_odp->umem.ibdev;
@@ -479,15 +429,14 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);

idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
- dma = umem_odp->dma_list[idx];

if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
continue;
if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_STICKY))
continue;

- ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
- DMA_BIDIRECTIONAL);
+ ib_dma_unlink_range(dev, &umem_odp->iova,
+ idx * (1 << umem_odp->page_shift));
if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
struct page *head_page = compound_head(page);
/*
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5713fe25f4de..13d61f1ab40b 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -149,6 +149,7 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
{
struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
+ struct ib_device *dev = odp->umem.ibdev;
unsigned long pfn;
dma_addr_t pa;
size_t i;
@@ -162,12 +163,31 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
/* Initial ODP init */
continue;

- pa = odp->dma_list[idx + i];
+ if (pfn & HMM_PFN_STICKY && odp->iova.addr)
+ /*
+ * We are in this flow when there is a need to resync flags:
+ * for example, the page was already linked in a prefetch call
+ * with the READ flag and now we need to add the WRITE flag.
+ *
+ * The page was already programmed to HW, and we don't want or
+ * need to unlink and relink it just to resync flags.
+ *
+ * The DMA address calculation below relies on the fact that
+ * RDMA UMEM doesn't work with swiotlb.
+ */
+ pa = odp->iova.addr + (idx + i) * (1 << odp->page_shift);
+ else
+ pa = ib_dma_link_range(dev, hmm_pfn_to_page(pfn), 0, &odp->iova,
+ (idx + i) * (1 << odp->page_shift));
+ WARN_ON_ONCE(ib_dma_mapping_error(dev, pa));
+
pa |= MLX5_IB_MTT_READ;
if ((pfn & HMM_PFN_WRITE) && !downgrade)
pa |= MLX5_IB_MTT_WRITE;

pas[i] = cpu_to_be64(pa);
+ odp->pfn_list[idx + i] |= HMM_PFN_STICKY;
+ odp->npages++;
}
}

diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 095b1297cfb1..a786556c65f9 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -17,15 +17,9 @@ struct ib_umem_odp {
/* An array of the pfns included in the on-demand paging umem. */
unsigned long *pfn_list;

- /*
- * An array with DMA addresses mapped for pfns in pfn_list.
- * The lower two bits designate access permissions.
- * See ODP_READ_ALLOWED_BIT and ODP_WRITE_ALLOWED_BIT.
- */
- dma_addr_t *dma_list;
struct dma_iova_attrs iova;
/*
- * The umem_mutex protects the page_list and dma_list fields of an ODP
+ * The umem_mutex protects the page_list field of an ODP
* umem, allowing only a single thread to map/unmap pages. The mutex
* also protects access to the mmu notifier counters.
*/
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index e71fa19187cc..c9e2bcd5268a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4160,6 +4160,42 @@ static inline void ib_dma_unmap_page(struct ib_device *dev,
dma_unmap_page(dev->dma_device, addr, size, direction);
}

+/**
+ * ib_dma_link_range - Link a physical page to DMA address
+ * @dev: The device for which the dma_addr is to be created
+ * @page: The page to be mapped
+ * @offset: The offset within the page
+ * @iova: Preallocated IOVA attributes
+ * @dma_offset: DMA offset
+ */
+static inline dma_addr_t ib_dma_link_range(struct ib_device *dev,
+ struct page *page,
+ unsigned long offset,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ if (ib_uses_virt_dma(dev))
+ return (uintptr_t)(page_address(page) + offset);
+
+ return dma_link_range(page, offset, iova, dma_offset);
+}
+
+/**
+ * ib_dma_unlink_range - Unlink a mapping created by ib_dma_link_range()
+ * @dev: The device for which the DMA address was created
+ * @iova: DMA IOVA properties
+ * @dma_offset: DMA offset
+ */
+static inline void ib_dma_unlink_range(struct ib_device *dev,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ if (ib_uses_virt_dma(dev))
+ return;
+
+ dma_unlink_range(iova, dma_offset);
+}
+
int ib_dma_virt_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents);
static inline int ib_dma_map_sg_attrs(struct ib_device *dev,
struct scatterlist *sg, int nents,
--
2.44.0


2024-03-05 11:23:48

by Leon Romanovsky

Subject: [RFC RESEND 10/16] RDMA/umem: Prevent UMEM ODP creation with SWIOTLB

From: Leon Romanovsky <[email protected]>

RDMA UMEM has never supported DMA addresses returned from SWIOTLB, as
these addresses are programmed directly into hardware that is unaware it
is accessing bounce buffers rather than the real pages.

Instead of silently leaving a broken system for users who didn't know
about this limitation, let's be explicit and return an error to them.
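
With this change the failure surfaces at IOVA allocation time instead of
as silent data corruption. A caller-side sketch (kernel-only code, using
the `dma_iova_attrs` interface proposed earlier in this series; field
names taken from the umem_odp diff below):

```c
/* Sketch: refuse SWIOTLB bouncing up front for HW-programmed addresses. */
struct dma_iova_attrs iova = {
	.dev   = dev->dma_device,
	.size  = mapping_size,
	.dir   = DMA_BIDIRECTIONAL,
	.attrs = DMA_ATTR_NO_TRANSLATION,	/* no bounce buffers allowed */
};

ret = dma_alloc_iova(&iova);
if (ret)	/* -EOPNOTSUPP when bouncing would be forced for this device */
	return ret;
```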

Signed-off-by: Leon Romanovsky <[email protected]>
---
Documentation/core-api/dma-attributes.rst | 7 +++
drivers/infiniband/core/umem_odp.c | 77 +++++++++++------------
include/linux/dma-mapping.h | 6 ++
kernel/dma/direct.h | 4 +-
kernel/dma/mapping.c | 4 ++
5 files changed, 58 insertions(+), 40 deletions(-)

diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e92..b337ec65d506 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,10 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
+
+DMA_ATTR_NO_TRANSLATION
+-----------------------
+
+This attribute indicates to the DMA-mapping subsystem that the buffer
+is not subject to any address translation. It is used for devices that
+cannot have their buffers bounced or their DMA addresses fixed up.
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 1301009a6b78..57c56000f60e 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -50,51 +50,50 @@
static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
const struct mmu_interval_notifier_ops *ops)
{
+ size_t page_size = 1UL << umem_odp->page_shift;
struct ib_device *dev = umem_odp->umem.ibdev;
+ size_t ndmas, npfns;
+ unsigned long start;
+ unsigned long end;
int ret;

umem_odp->umem.is_odp = 1;
mutex_init(&umem_odp->umem_mutex);

- if (!umem_odp->is_implicit_odp) {
- size_t page_size = 1UL << umem_odp->page_shift;
- unsigned long start;
- unsigned long end;
- size_t ndmas, npfns;
-
- start = ALIGN_DOWN(umem_odp->umem.address, page_size);
- if (check_add_overflow(umem_odp->umem.address,
- (unsigned long)umem_odp->umem.length,
- &end))
- return -EOVERFLOW;
- end = ALIGN(end, page_size);
- if (unlikely(end < page_size))
- return -EOVERFLOW;
-
- ndmas = (end - start) >> umem_odp->page_shift;
- if (!ndmas)
- return -EINVAL;
-
- npfns = (end - start) >> PAGE_SHIFT;
- umem_odp->pfn_list = kvcalloc(
- npfns, sizeof(*umem_odp->pfn_list), GFP_KERNEL);
- if (!umem_odp->pfn_list)
- return -ENOMEM;
-
-
- umem_odp->iova.dev = dev->dma_device;
- umem_odp->iova.size = end - start;
- umem_odp->iova.dir = DMA_BIDIRECTIONAL;
- ret = ib_dma_alloc_iova(dev, &umem_odp->iova);
- if (ret)
- goto out_pfn_list;
-
- ret = mmu_interval_notifier_insert(&umem_odp->notifier,
- umem_odp->umem.owning_mm,
- start, end - start, ops);
- if (ret)
- goto out_free_iova;
- }
+ if (umem_odp->is_implicit_odp)
+ return 0;
+
+ start = ALIGN_DOWN(umem_odp->umem.address, page_size);
+ if (check_add_overflow(umem_odp->umem.address,
+ (unsigned long)umem_odp->umem.length, &end))
+ return -EOVERFLOW;
+ end = ALIGN(end, page_size);
+ if (unlikely(end < page_size))
+ return -EOVERFLOW;
+
+ ndmas = (end - start) >> umem_odp->page_shift;
+ if (!ndmas)
+ return -EINVAL;
+
+ npfns = (end - start) >> PAGE_SHIFT;
+ umem_odp->pfn_list =
+ kvcalloc(npfns, sizeof(*umem_odp->pfn_list), GFP_KERNEL);
+ if (!umem_odp->pfn_list)
+ return -ENOMEM;
+
+ umem_odp->iova.dev = dev->dma_device;
+ umem_odp->iova.size = end - start;
+ umem_odp->iova.dir = DMA_BIDIRECTIONAL;
+ umem_odp->iova.attrs = DMA_ATTR_NO_TRANSLATION;
+ ret = ib_dma_alloc_iova(dev, &umem_odp->iova);
+ if (ret)
+ goto out_pfn_list;
+
+ ret = mmu_interval_notifier_insert(&umem_odp->notifier,
+ umem_odp->umem.owning_mm, start,
+ end - start, ops);
+ if (ret)
+ goto out_free_iova;

return 0;

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 91cc084adb53..89945e707a9b 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -62,6 +62,12 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)

+/*
+ * DMA_ATTR_NO_TRANSLATION: used to indicate that the buffer should not be mapped
+ * through address translation.
+ */
+#define DMA_ATTR_NO_TRANSLATION (1UL << 10)
+
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
* be given to a device to use as a DMA source or target. It is specific to a
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 1c30e1cd607a..1c9ec204c999 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -92,6 +92,8 @@ static inline dma_addr_t dma_direct_map_page(struct device *dev,
if (is_swiotlb_force_bounce(dev)) {
if (is_pci_p2pdma_page(page))
return DMA_MAPPING_ERROR;
+ if (attrs & DMA_ATTR_NO_TRANSLATION)
+ return DMA_MAPPING_ERROR;
return swiotlb_map(dev, phys, size, dir, attrs);
}

@@ -99,7 +101,7 @@ static inline dma_addr_t dma_direct_map_page(struct device *dev,
dma_kmalloc_needs_bounce(dev, size, dir)) {
if (is_pci_p2pdma_page(page))
return DMA_MAPPING_ERROR;
- if (is_swiotlb_active(dev))
+ if (is_swiotlb_active(dev) && !(attrs & DMA_ATTR_NO_TRANSLATION))
return swiotlb_map(dev, phys, size, dir, attrs);

dev_WARN_ONCE(dev, 1,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index f989c64622c2..49b1fde510c5 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -188,6 +188,10 @@ int dma_alloc_iova(struct dma_iova_attrs *iova)
struct device *dev = iova->dev;
const struct dma_map_ops *ops = get_dma_ops(dev);

+ if (dma_map_direct(dev, ops) && is_swiotlb_force_bounce(dev) &&
+ iova->attrs & DMA_ATTR_NO_TRANSLATION)
+ return -EOPNOTSUPP;
+
if (dma_map_direct(dev, ops) || !ops->alloc_iova) {
iova->addr = 0;
return 0;
--
2.44.0


2024-03-05 11:24:44

by Leon Romanovsky

Subject: [RFC RESEND 12/16] vfio/mlx5: Rewrite create mkey flow to allow better code reuse

From: Leon Romanovsky <[email protected]>

Change mkey creation to be performed in multiple steps: data
allocation, DMA setup, and the actual HW call that creates the mkey.

In this new flow, the whole input to the MKEY command is saved,
eliminating the need to keep an array of DMA address pointers for the
receive list (and, in future patches, for the send list too).

In addition to reducing memory usage and eliminating unnecessary data
movement when setting the MKEY input, the code is prepared for future
reuse.
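
The split flow can be sketched as follows (helper names are the ones
introduced in the diff below; error unwinding abbreviated):

```c
/* Step 1: allocate and prefill the MKEY command input (no HW access). */
buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
if (!buf->mkey_in)
	return -ENOMEM;

/* Step 2: DMA-map the pages; the MTT entries are written directly into
 * mkey_in, so no separate dma_addrs[] array is needed.
 */

/* Step 3: issue the single HW command consuming the saved input. */
ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
```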

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/vfio/pci/mlx5/cmd.c | 149 +++++++++++++++++++++---------------
drivers/vfio/pci/mlx5/cmd.h | 3 +-
2 files changed, 88 insertions(+), 64 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 45104e47b7b2..44762980fcb9 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -300,39 +300,21 @@ static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
return ret;
}

-static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
- struct mlx5_vhca_data_buffer *buf,
- struct mlx5_vhca_recv_buf *recv_buf,
- u32 *mkey)
+static u32 *alloc_mkey_in(u32 npages, u32 pdn)
{
- size_t npages = buf ? buf->npages : recv_buf->npages;
- int err = 0, inlen;
- __be64 *mtt;
+ int inlen;
void *mkc;
u32 *in;

inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
- sizeof(*mtt) * round_up(npages, 2);
+ sizeof(__be64) * round_up(npages, 2);

- in = kvzalloc(inlen, GFP_KERNEL);
+ in = kvzalloc(inlen, GFP_KERNEL_ACCOUNT);
if (!in)
- return -ENOMEM;
+ return NULL;

MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
DIV_ROUND_UP(npages, 2));
- mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
-
- if (buf) {
- struct sg_dma_page_iter dma_iter;
-
- for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
- *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
- } else {
- int i;
-
- for (i = 0; i < npages; i++)
- *mtt++ = cpu_to_be64(recv_buf->dma_addrs[i]);
- }

mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
@@ -346,9 +328,30 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2));
MLX5_SET64(mkc, mkc, len, npages * PAGE_SIZE);
- err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
- kvfree(in);
- return err;
+
+ return in;
+}
+
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
+ struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+ u32 *mkey)
+{
+ __be64 *mtt;
+ int inlen;
+
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+
+ if (buf) {
+ struct sg_dma_page_iter dma_iter;
+
+ for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
+ *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+ }
+
+ inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+ sizeof(__be64) * round_up(npages, 2);
+
+ return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
}

static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -368,13 +371,22 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
if (ret)
return ret;

- ret = _create_mkey(mdev, buf->migf->pdn, buf, NULL, &buf->mkey);
- if (ret)
+ buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
+ if (!buf->mkey_in) {
+ ret = -ENOMEM;
goto err;
+ }
+
+ ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+ if (ret)
+ goto err_create_mkey;

buf->dmaed = true;

return 0;
+
+err_create_mkey:
+ kvfree(buf->mkey_in);
err:
dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
return ret;
@@ -390,6 +402,7 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)

if (buf->dmaed) {
mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+ kvfree(buf->mkey_in);
dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
buf->dma_dir, 0);
}
@@ -1286,46 +1299,45 @@ static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
return -ENOMEM;
}

-static int register_dma_recv_pages(struct mlx5_core_dev *mdev,
- struct mlx5_vhca_recv_buf *recv_buf)
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ u32 *mkey_in)
{
- int i, j;
+ dma_addr_t addr;
+ __be64 *mtt;
+ int i;

- recv_buf->dma_addrs = kvcalloc(recv_buf->npages,
- sizeof(*recv_buf->dma_addrs),
- GFP_KERNEL_ACCOUNT);
- if (!recv_buf->dma_addrs)
- return -ENOMEM;
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);

- for (i = 0; i < recv_buf->npages; i++) {
- recv_buf->dma_addrs[i] = dma_map_page(mdev->device,
- recv_buf->page_list[i],
- 0, PAGE_SIZE,
- DMA_FROM_DEVICE);
- if (dma_mapping_error(mdev->device, recv_buf->dma_addrs[i]))
- goto error;
+ for (i = npages - 1; i >= 0; i--) {
+ addr = be64_to_cpu(mtt[i]);
+ dma_unmap_single(mdev->device, addr, PAGE_SIZE,
+ DMA_FROM_DEVICE);
}
- return 0;
-
-error:
- for (j = 0; j < i; j++)
- dma_unmap_single(mdev->device, recv_buf->dma_addrs[j],
- PAGE_SIZE, DMA_FROM_DEVICE);
-
- kvfree(recv_buf->dma_addrs);
- return -ENOMEM;
}

-static void unregister_dma_recv_pages(struct mlx5_core_dev *mdev,
- struct mlx5_vhca_recv_buf *recv_buf)
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ struct page **page_list, u32 *mkey_in)
{
+ dma_addr_t addr;
+ __be64 *mtt;
int i;

- for (i = 0; i < recv_buf->npages; i++)
- dma_unmap_single(mdev->device, recv_buf->dma_addrs[i],
- PAGE_SIZE, DMA_FROM_DEVICE);
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+
+ for (i = 0; i < npages; i++) {
+ addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
+ DMA_FROM_DEVICE);
+ if (dma_mapping_error(mdev->device, addr))
+ goto error;
+
+ *mtt++ = cpu_to_be64(addr);
+ }
+
+ return 0;

- kvfree(recv_buf->dma_addrs);
+error:
+ unregister_dma_pages(mdev, i, mkey_in);
+ return -ENOMEM;
}

static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
@@ -1334,7 +1346,8 @@ static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;

mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
- unregister_dma_recv_pages(mdev, recv_buf);
+ unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+ kvfree(recv_buf->mkey_in);
free_recv_pages(&qp->recv_buf);
}

@@ -1350,18 +1363,28 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
if (err < 0)
return err;

- err = register_dma_recv_pages(mdev, recv_buf);
- if (err)
+ recv_buf->mkey_in = alloc_mkey_in(npages, pdn);
+ if (!recv_buf->mkey_in) {
+ err = -ENOMEM;
goto end;
+ }
+
+ err = register_dma_pages(mdev, npages, recv_buf->page_list,
+ recv_buf->mkey_in);
+ if (err)
+ goto err_register_dma;

- err = _create_mkey(mdev, pdn, NULL, recv_buf, &recv_buf->mkey);
+ err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
+ &recv_buf->mkey);
if (err)
goto err_create_mkey;

return 0;

err_create_mkey:
- unregister_dma_recv_pages(mdev, recv_buf);
+ unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+err_register_dma:
+ kvfree(recv_buf->mkey_in);
end:
free_recv_pages(recv_buf);
return err;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 887267ebbd8a..83728c0669e7 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -62,6 +62,7 @@ struct mlx5_vhca_data_buffer {
u64 length;
u32 npages;
u32 mkey;
+ u32 *mkey_in;
enum dma_data_direction dma_dir;
u8 dmaed:1;
u8 stop_copy_chunk_num;
@@ -137,8 +138,8 @@ struct mlx5_vhca_cq {
struct mlx5_vhca_recv_buf {
u32 npages;
struct page **page_list;
- dma_addr_t *dma_addrs;
u32 next_rq_offset;
+ u32 *mkey_in;
u32 mkey;
};

--
2.44.0


2024-03-05 11:25:57

by Leon Romanovsky

Subject: [RFC RESEND 14/16] vfio/mlx5: Convert vfio to use DMA link API

From: Leon Romanovsky <[email protected]>

Remove the intermediate scatter-gather table, as it is not needed when
the DMA link API is used. This conversion drastically reduces the
memory needed to manage that table.
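
The memory saving comes from replacing the per-buffer scatter-gather
machinery with a flat page array plus one IOVA descriptor. The
before/after shape of the tracking state (fields taken from the cmd.h
diff below):

```c
/* Before: sg table plus a sequential-access iterator cache per buffer. */
struct sg_append_table table;		/* scatterlist chunks */
struct scatterlist *last_offset_sg;	/* cache for mlx5vf_get_migration_page() */
unsigned int sg_last_entry;
unsigned long last_offset;

/* After: one IOVA descriptor; pages are addressed by simple arithmetic. */
struct dma_iova_attrs iova;		/* dev, size, dir, base address */
struct page **page_list;		/* page i maps at iova offset i * PAGE_SIZE */
```

This is also why `mlx5vf_get_migration_page()` collapses to a single
array lookup in the main.c hunk below.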

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/vfio/pci/mlx5/cmd.c | 177 ++++++++++++++++-------------------
drivers/vfio/pci/mlx5/cmd.h | 8 +-
drivers/vfio/pci/mlx5/main.c | 50 ++--------
3 files changed, 91 insertions(+), 144 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 5e2103042d9b..cfae03f7b7da 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -332,26 +332,60 @@ static u32 *alloc_mkey_in(u32 npages, u32 pdn)
return in;
}

-static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
- struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages, u32 *mkey_in,
u32 *mkey)
{
+ int inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+ sizeof(__be64) * round_up(npages, 2);
+
+ return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+}
+
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ u32 *mkey_in, struct dma_iova_attrs *iova)
+{
+ dma_addr_t addr;
__be64 *mtt;
- int inlen;
+ int i;

mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);

- if (buf) {
- struct sg_dma_page_iter dma_iter;
+ for (i = npages - 1; i >= 0; i--) {
+ addr = be64_to_cpu(mtt[i]);
+ dma_unlink_range(iova, addr);
+ }
+ dma_free_iova(iova);
+}
+
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ struct page **page_list, u32 *mkey_in,
+ struct dma_iova_attrs *iova)
+{
+ dma_addr_t addr;
+ __be64 *mtt;
+ int i, err;
+
+ iova->dev = mdev->device;
+ iova->size = npages * PAGE_SIZE;
+ err = dma_alloc_iova(iova);
+ if (err)
+ return err;
+
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+
+ for (i = 0; i < npages; i++) {
+ addr = dma_link_range(page_list[i], 0, iova, i * PAGE_SIZE);
+ if (dma_mapping_error(mdev->device, addr))
+ goto error;

- for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
- *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+ *mtt++ = cpu_to_be64(addr);
}

- inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
- sizeof(__be64) * round_up(npages, 2);
+ return 0;

- return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+error:
+ unregister_dma_pages(mdev, i, mkey_in, iova);
+ return -ENOMEM;
}

static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -367,17 +401,16 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
if (buf->dmaed || !buf->npages)
return -EINVAL;

- ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
- if (ret)
- return ret;
-
buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
- if (!buf->mkey_in) {
- ret = -ENOMEM;
- goto err;
- }
+ if (!buf->mkey_in)
+ return -ENOMEM;
+
+ ret = register_dma_pages(mdev, buf->npages, buf->page_list,
+ buf->mkey_in, &buf->iova);
+ if (ret)
+ goto err_register_dma;

- ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+ ret = create_mkey(mdev, buf->npages, buf->mkey_in, &buf->mkey);
if (ret)
goto err_create_mkey;

@@ -386,32 +419,39 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
return 0;

err_create_mkey:
+ unregister_dma_pages(mdev, buf->npages, buf->mkey_in, &buf->iova);
+err_register_dma:
kvfree(buf->mkey_in);
-err:
- dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
return ret;
}

+static void free_page_list(u32 npages, struct page **page_list)
+{
+ int i;
+
+ /* Undo alloc_pages_bulk_array() */
+ for (i = npages - 1; i >= 0; i--)
+ __free_page(page_list[i]);
+
+ kvfree(page_list);
+}
+
void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
{
- struct mlx5_vf_migration_file *migf = buf->migf;
- struct sg_page_iter sg_iter;
+ struct mlx5vf_pci_core_device *mvdev = buf->migf->mvdev;
+ struct mlx5_core_dev *mdev = mvdev->mdev;

- lockdep_assert_held(&migf->mvdev->state_mutex);
- WARN_ON(migf->mvdev->mdev_detach);
+ lockdep_assert_held(&mvdev->state_mutex);
+ WARN_ON(mvdev->mdev_detach);

if (buf->dmaed) {
- mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+ mlx5_core_destroy_mkey(mdev, buf->mkey);
+ unregister_dma_pages(mdev, buf->npages, buf->mkey_in,
+ &buf->iova);
kvfree(buf->mkey_in);
- dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
- buf->dma_dir, 0);
}

- /* Undo alloc_pages_bulk_array() */
- for_each_sgtable_page(&buf->table.sgt, &sg_iter, 0)
- __free_page(sg_page_iter_page(&sg_iter));
- sg_free_append_table(&buf->table);
- kvfree(buf->page_list);
+ free_page_list(buf->npages, buf->page_list);
kfree(buf);
}

@@ -426,7 +466,7 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
if (!buf)
return ERR_PTR(-ENOMEM);

- buf->dma_dir = dma_dir;
+ buf->iova.dir = dma_dir;
buf->migf = migf;
if (npages) {
ret = mlx5vf_add_migration_pages(buf, npages);
@@ -469,7 +509,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,

spin_lock_irq(&migf->list_lock);
list_for_each_entry_safe(buf, temp_buf, &migf->avail_list, buf_elm) {
- if (buf->dma_dir == dma_dir) {
+ if (buf->iova.dir == dma_dir) {
list_del_init(&buf->buf_elm);
if (buf->npages >= npages) {
spin_unlock_irq(&migf->list_lock);
@@ -1253,17 +1293,6 @@ static void mlx5vf_destroy_qp(struct mlx5_core_dev *mdev,
kfree(qp);
}

-static void free_recv_pages(struct mlx5_vhca_recv_buf *recv_buf)
-{
- int i;
-
- /* Undo alloc_pages_bulk_array() */
- for (i = 0; i < recv_buf->npages; i++)
- __free_page(recv_buf->page_list[i]);
-
- kvfree(recv_buf->page_list);
-}
-
static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
unsigned int npages)
{
@@ -1300,56 +1329,16 @@ static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
return -ENOMEM;
}

-static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
- u32 *mkey_in)
-{
- dma_addr_t addr;
- __be64 *mtt;
- int i;
-
- mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-
- for (i = npages - 1; i >= 0; i--) {
- addr = be64_to_cpu(mtt[i]);
- dma_unmap_single(mdev->device, addr, PAGE_SIZE,
- DMA_FROM_DEVICE);
- }
-}
-
-static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
- struct page **page_list, u32 *mkey_in)
-{
- dma_addr_t addr;
- __be64 *mtt;
- int i;
-
- mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-
- for (i = 0; i < npages; i++) {
- addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
- DMA_FROM_DEVICE);
- if (dma_mapping_error(mdev->device, addr))
- goto error;
-
- *mtt++ = cpu_to_be64(addr);
- }
-
- return 0;
-
-error:
- unregister_dma_pages(mdev, i, mkey_in);
- return -ENOMEM;
-}
-
static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
struct mlx5_vhca_qp *qp)
{
struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;

mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
- unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+ unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in,
+ &recv_buf->iova);
kvfree(recv_buf->mkey_in);
- free_recv_pages(&qp->recv_buf);
+ free_page_list(recv_buf->npages, recv_buf->page_list);
}

static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
@@ -1370,24 +1359,24 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
goto end;
}

+ recv_buf->iova.dir = DMA_FROM_DEVICE;
err = register_dma_pages(mdev, npages, recv_buf->page_list,
- recv_buf->mkey_in);
+ recv_buf->mkey_in, &recv_buf->iova);
if (err)
goto err_register_dma;

- err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
- &recv_buf->mkey);
+ err = create_mkey(mdev, npages, recv_buf->mkey_in, &recv_buf->mkey);
if (err)
goto err_create_mkey;

return 0;

err_create_mkey:
- unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+ unregister_dma_pages(mdev, npages, recv_buf->mkey_in, &recv_buf->iova);
err_register_dma:
kvfree(recv_buf->mkey_in);
end:
- free_recv_pages(recv_buf);
+ free_page_list(npages, recv_buf->page_list);
return err;
}

diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 815fcb54494d..3a046166d9f2 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -57,22 +57,17 @@ struct mlx5_vf_migration_header {
};

struct mlx5_vhca_data_buffer {
+ struct dma_iova_attrs iova;
struct page **page_list;
- struct sg_append_table table;
loff_t start_pos;
u64 length;
u32 npages;
u32 mkey;
u32 *mkey_in;
- enum dma_data_direction dma_dir;
u8 dmaed:1;
u8 stop_copy_chunk_num;
struct list_head buf_elm;
struct mlx5_vf_migration_file *migf;
- /* Optimize mlx5vf_get_migration_page() for sequential access */
- struct scatterlist *last_offset_sg;
- unsigned int sg_last_entry;
- unsigned long last_offset;
};

struct mlx5vf_async_data {
@@ -137,6 +132,7 @@ struct mlx5_vhca_cq {
};

struct mlx5_vhca_recv_buf {
+ struct dma_iova_attrs iova;
u32 npages;
struct page **page_list;
u32 next_rq_offset;
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 7ffe24693a55..668c28bc429c 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -34,35 +34,10 @@ static struct mlx5vf_pci_core_device *mlx5vf_drvdata(struct pci_dev *pdev)
core_device);
}

-struct page *
-mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
- unsigned long offset)
+struct page *mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
+ unsigned long offset)
{
- unsigned long cur_offset = 0;
- struct scatterlist *sg;
- unsigned int i;
-
- /* All accesses are sequential */
- if (offset < buf->last_offset || !buf->last_offset_sg) {
- buf->last_offset = 0;
- buf->last_offset_sg = buf->table.sgt.sgl;
- buf->sg_last_entry = 0;
- }
-
- cur_offset = buf->last_offset;
-
- for_each_sg(buf->last_offset_sg, sg,
- buf->table.sgt.orig_nents - buf->sg_last_entry, i) {
- if (offset < sg->length + cur_offset) {
- buf->last_offset_sg = sg;
- buf->sg_last_entry += i;
- buf->last_offset = cur_offset;
- return nth_page(sg_page(sg),
- (offset - cur_offset) / PAGE_SIZE);
- }
- cur_offset += sg->length;
- }
- return NULL;
+ return buf->page_list[offset / PAGE_SIZE];
}

int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
@@ -72,13 +47,9 @@ int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
size_t old_size, new_size;
struct page **page_list;
unsigned long filled;
- unsigned int to_fill;
- int ret;

- to_fill = min_t(unsigned int, npages,
- PAGE_SIZE / sizeof(*buf->page_list));
old_size = buf->npages * sizeof(*buf->page_list);
- new_size = old_size + to_fill * sizeof(*buf->page_list);
+ new_size = old_size + to_alloc * sizeof(*buf->page_list);
page_list = kvrealloc(buf->page_list, old_size, new_size,
GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!page_list)
@@ -87,22 +58,13 @@ int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
buf->page_list = page_list;

do {
- filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
+ filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_alloc,
buf->page_list + buf->npages);
if (!filled)
return -ENOMEM;

to_alloc -= filled;
- ret = sg_alloc_append_table_from_pages(
- &buf->table, buf->page_list + buf->npages, filled, 0,
- filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
- GFP_KERNEL_ACCOUNT);
- if (ret)
- return ret;
-
buf->npages += filled;
- to_fill = min_t(unsigned int, to_alloc,
- PAGE_SIZE / sizeof(*buf->page_list));
} while (to_alloc > 0);

return 0;
@@ -164,7 +126,7 @@ static void mlx5vf_buf_read_done(struct mlx5_vhca_data_buffer *vhca_buf)
struct mlx5_vf_migration_file *migf = vhca_buf->migf;

if (vhca_buf->stop_copy_chunk_num) {
- bool is_header = vhca_buf->dma_dir == DMA_NONE;
+ bool is_header = vhca_buf->iova.dir == DMA_NONE;
u8 chunk_num = vhca_buf->stop_copy_chunk_num;
size_t next_required_umem_size = 0;

--
2.44.0


2024-03-05 11:35:53

by Leon Romanovsky

Subject: [RFC RESEND 04/16] iommu/dma: Provide an interface to allow preallocate IOVA

From: Leon Romanovsky <[email protected]>

Separate IOVA allocation into a dedicated callback. This allows the
IOVA to be cached and reused in fast paths for devices that support
the ODP (on-demand paging) mechanism.
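
The caller pattern this split enables looks roughly like the following
(a sketch against the series' proposed `dma_iova_attrs` interface, not
code from this patch; error handling omitted):

```c
/* Control path, once at setup: reserve a contiguous IOVA range. */
err = dma_alloc_iova(&iova);		/* iova.dev/.size/.dir preset */

/* Data path (e.g. an ODP page fault): map a page into the cached range
 * at a caller-computed offset, without allocating IOVA again.
 */
addr = dma_link_range(page, 0, &iova, offset_in_range);

/* Later, unmap just that page; the IOVA range stays reserved. */
dma_unlink_range(&iova, offset_in_range);

/* Teardown: release the range. */
dma_free_iova(&iova);
```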

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/iommu/dma-iommu.c | 50 +++++++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 50ccc4f1ef81..e55726783501 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -356,7 +356,7 @@ int iommu_dma_init_fq(struct iommu_domain *domain)
atomic_set(&cookie->fq_timer_on, 0);
/*
* Prevent incomplete fq state being observable. Pairs with path from
- * __iommu_dma_unmap() through iommu_dma_free_iova() to queue_iova()
+ * __iommu_dma_unmap() through __iommu_dma_free_iova() to queue_iova()
*/
smp_wmb();
WRITE_ONCE(cookie->fq_domain, domain);
@@ -760,7 +760,7 @@ static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
}
}

-static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
+static dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
size_t size, u64 dma_limit, struct device *dev)
{
struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -806,7 +806,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
return (dma_addr_t)iova << shift;
}

-static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
+static void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
dma_addr_t iova, size_t size, struct iommu_iotlb_gather *gather)
{
struct iova_domain *iovad = &cookie->iovad;
@@ -843,7 +843,7 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,

if (!iotlb_gather.queued)
iommu_iotlb_sync(domain, &iotlb_gather);
- iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
+ __iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
}

static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
@@ -861,12 +861,12 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,

size = iova_align(iovad, size + iova_off);

- iova = iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+ iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
if (!iova)
return DMA_MAPPING_ERROR;

if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
- iommu_dma_free_iova(cookie, iova, size, NULL);
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
return DMA_MAPPING_ERROR;
}
return iova + iova_off;
@@ -970,7 +970,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
return NULL;

size = iova_align(iovad, size);
- iova = iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
+ iova = __iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
if (!iova)
goto out_free_pages;

@@ -1004,7 +1004,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
out_free_sg:
sg_free_table(sgt);
out_free_iova:
- iommu_dma_free_iova(cookie, iova, size, NULL);
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
out_free_pages:
__iommu_dma_free_pages(pages, count);
return NULL;
@@ -1436,7 +1436,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
if (!iova_len)
return __finalise_sg(dev, sg, nents, 0);

- iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
+ iova = __iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
goto out_restore_sg;
@@ -1453,7 +1453,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
return __finalise_sg(dev, sg, nents, iova);

out_free_iova:
- iommu_dma_free_iova(cookie, iova, iova_len, NULL);
+ __iommu_dma_free_iova(cookie, iova, iova_len, NULL);
out_restore_sg:
__invalidate_sg(sg, nents);
out:
@@ -1706,6 +1706,30 @@ static size_t iommu_dma_opt_mapping_size(void)
return iova_rcache_range();
}

+static dma_addr_t iommu_dma_alloc_iova(struct device *dev, size_t size)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ dma_addr_t dma_mask = dma_get_mask(dev);
+
+ size = iova_align(iovad, size);
+ return __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+}
+
+static void iommu_dma_free_iova(struct device *dev, dma_addr_t iova,
+ size_t size)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ struct iommu_iotlb_gather iotlb_gather;
+
+ size = iova_align(iovad, size);
+ iommu_iotlb_gather_init(&iotlb_gather);
+ __iommu_dma_free_iova(cookie, iova, size, &iotlb_gather);
+}
+
static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
@@ -1728,6 +1752,8 @@ static const struct dma_map_ops iommu_dma_ops = {
.unmap_resource = iommu_dma_unmap_resource,
.get_merge_boundary = iommu_dma_get_merge_boundary,
.opt_mapping_size = iommu_dma_opt_mapping_size,
+ .alloc_iova = iommu_dma_alloc_iova,
+ .free_iova = iommu_dma_free_iova,
};

/*
@@ -1776,7 +1802,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
if (!msi_page)
return NULL;

- iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
+ iova = __iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
if (!iova)
goto out_free_page;

@@ -1790,7 +1816,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
return msi_page;

out_free_iova:
- iommu_dma_free_iova(cookie, iova, size, NULL);
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
out_free_page:
kfree(msi_page);
return NULL;
--
2.44.0


2024-03-05 11:37:23

by Leon Romanovsky

Subject: [RFC RESEND 07/16] RDMA/umem: Preallocate and cache IOVA for UMEM ODP

From: Leon Romanovsky <[email protected]>

In preparation for providing a two-step interface to map pages,
preallocate the IOVA when the UMEM is initialized.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 16 +++++++++++++++-
include/rdma/ib_umem_odp.h | 1 +
include/rdma/ib_verbs.h | 18 ++++++++++++++++++
3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e9fa22d31c23..f69d1233dc82 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -50,6 +50,7 @@
static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
const struct mmu_interval_notifier_ops *ops)
{
+ struct ib_device *dev = umem_odp->umem.ibdev;
int ret;

umem_odp->umem.is_odp = 1;
@@ -87,15 +88,25 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
goto out_pfn_list;
}

+ umem_odp->iova.dev = dev->dma_device;
+ umem_odp->iova.size = end - start;
+ umem_odp->iova.dir = DMA_BIDIRECTIONAL;
+ ret = ib_dma_alloc_iova(dev, &umem_odp->iova);
+ if (ret)
+ goto out_dma_list;
+
+
ret = mmu_interval_notifier_insert(&umem_odp->notifier,
umem_odp->umem.owning_mm,
start, end - start, ops);
if (ret)
- goto out_dma_list;
+ goto out_free_iova;
}

return 0;

+out_free_iova:
+ ib_dma_free_iova(dev, &umem_odp->iova);
out_dma_list:
kvfree(umem_odp->dma_list);
out_pfn_list:
@@ -262,6 +273,8 @@ EXPORT_SYMBOL(ib_umem_odp_get);

void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
{
+ struct ib_device *dev = umem_odp->umem.ibdev;
+
/*
* Ensure that no more pages are mapped in the umem.
*
@@ -274,6 +287,7 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
ib_umem_end(umem_odp));
mutex_unlock(&umem_odp->umem_mutex);
mmu_interval_notifier_remove(&umem_odp->notifier);
+ ib_dma_free_iova(dev, &umem_odp->iova);
kvfree(umem_odp->dma_list);
kvfree(umem_odp->pfn_list);
}
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 0844c1d05ac6..bb2d7f2a5b04 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -23,6 +23,7 @@ struct ib_umem_odp {
* See ODP_READ_ALLOWED_BIT and ODP_WRITE_ALLOWED_BIT.
*/
dma_addr_t *dma_list;
+ struct dma_iova_attrs iova;
/*
* The umem_mutex protects the page_list and dma_list fields of an ODP
* umem, allowing only a single thread to map/unmap pages. The mutex
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index b7b6b58dd348..e71fa19187cc 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4077,6 +4077,24 @@ static inline int ib_dma_mapping_error(struct ib_device *dev, u64 dma_addr)
return dma_mapping_error(dev->dma_device, dma_addr);
}

+static inline int ib_dma_alloc_iova(struct ib_device *dev,
+ struct dma_iova_attrs *iova)
+{
+ if (ib_uses_virt_dma(dev))
+ return 0;
+
+ return dma_alloc_iova(iova);
+}
+
+static inline void ib_dma_free_iova(struct ib_device *dev,
+ struct dma_iova_attrs *iova)
+{
+ if (ib_uses_virt_dma(dev))
+ return;
+
+ dma_free_iova(iova);
+}
+
/**
* ib_dma_map_single - Map a kernel virtual address to DMA address
* @dev: The device for which the dma_addr is to be created
--
2.44.0


2024-03-05 11:39:29

by Leon Romanovsky

Subject: [RFC RESEND 11/16] vfio/mlx5: Explicitly use number of pages instead of allocated length

From: Leon Romanovsky <[email protected]>

allocated_length is always a multiple of the page size, so change the
functions to accept a number of pages instead. This opens an avenue to
combine the receive and send paths and improves code readability.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/vfio/pci/mlx5/cmd.c | 31 ++++++++---------
drivers/vfio/pci/mlx5/cmd.h | 10 +++---
drivers/vfio/pci/mlx5/main.c | 65 +++++++++++++++++++++++-------------
3 files changed, 62 insertions(+), 44 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index efd1d252cdc9..45104e47b7b2 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -305,8 +305,7 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
struct mlx5_vhca_recv_buf *recv_buf,
u32 *mkey)
{
- size_t npages = buf ? DIV_ROUND_UP(buf->allocated_length, PAGE_SIZE) :
- recv_buf->npages;
+ size_t npages = buf ? buf->npages : recv_buf->npages;
int err = 0, inlen;
__be64 *mtt;
void *mkc;
@@ -362,7 +361,7 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
if (mvdev->mdev_detach)
return -ENOTCONN;

- if (buf->dmaed || !buf->allocated_length)
+ if (buf->dmaed || !buf->npages)
return -EINVAL;

ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
@@ -403,8 +402,7 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
}

struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length,
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
enum dma_data_direction dma_dir)
{
struct mlx5_vhca_data_buffer *buf;
@@ -416,9 +414,8 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,

buf->dma_dir = dma_dir;
buf->migf = migf;
- if (length) {
- ret = mlx5vf_add_migration_pages(buf,
- DIV_ROUND_UP_ULL(length, PAGE_SIZE));
+ if (npages) {
+ ret = mlx5vf_add_migration_pages(buf, npages);
if (ret)
goto end;

@@ -444,8 +441,8 @@ void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf)
}

struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length, enum dma_data_direction dma_dir)
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+ enum dma_data_direction dma_dir)
{
struct mlx5_vhca_data_buffer *buf, *temp_buf;
struct list_head free_list;
@@ -460,7 +457,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
list_for_each_entry_safe(buf, temp_buf, &migf->avail_list, buf_elm) {
if (buf->dma_dir == dma_dir) {
list_del_init(&buf->buf_elm);
- if (buf->allocated_length >= length) {
+ if (buf->npages >= npages) {
spin_unlock_irq(&migf->list_lock);
goto found;
}
@@ -474,7 +471,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
}
}
spin_unlock_irq(&migf->list_lock);
- buf = mlx5vf_alloc_data_buffer(migf, length, dma_dir);
+ buf = mlx5vf_alloc_data_buffer(migf, npages, dma_dir);

found:
while ((temp_buf = list_first_entry_or_null(&free_list,
@@ -645,7 +642,7 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
MLX5_SET(save_vhca_state_in, in, op_mod, 0);
MLX5_SET(save_vhca_state_in, in, vhca_id, mvdev->vhca_id);
MLX5_SET(save_vhca_state_in, in, mkey, buf->mkey);
- MLX5_SET(save_vhca_state_in, in, size, buf->allocated_length);
+ MLX5_SET(save_vhca_state_in, in, size, buf->npages * PAGE_SIZE);
MLX5_SET(save_vhca_state_in, in, incremental, inc);
MLX5_SET(save_vhca_state_in, in, set_track, track);

@@ -668,8 +665,12 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
}

if (!header_buf) {
- header_buf = mlx5vf_get_data_buffer(migf,
- sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+ u32 npages = DIV_ROUND_UP(
+ sizeof(struct mlx5_vf_migration_header),
+ PAGE_SIZE);
+
+ header_buf =
+ mlx5vf_get_data_buffer(migf, npages, DMA_NONE);
if (IS_ERR(header_buf)) {
err = PTR_ERR(header_buf);
goto err_free;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index f2c7227fa683..887267ebbd8a 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -60,7 +60,7 @@ struct mlx5_vhca_data_buffer {
struct sg_append_table table;
loff_t start_pos;
u64 length;
- u64 allocated_length;
+ u32 npages;
u32 mkey;
enum dma_data_direction dma_dir;
u8 dmaed:1;
@@ -219,12 +219,12 @@ int mlx5vf_cmd_alloc_pd(struct mlx5_vf_migration_file *migf);
void mlx5vf_cmd_dealloc_pd(struct mlx5_vf_migration_file *migf);
void mlx5fv_cmd_clean_migf_resources(struct mlx5_vf_migration_file *migf);
struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length, enum dma_data_direction dma_dir);
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+ enum dma_data_direction dma_dir);
void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf);
struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length, enum dma_data_direction dma_dir);
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+ enum dma_data_direction dma_dir);
void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf);
int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
unsigned int npages);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index fe09a8c8af95..b11b1c27d284 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -94,7 +94,7 @@ int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,

if (ret)
goto err;
- buf->allocated_length += filled * PAGE_SIZE;
+ buf->npages += filled;
/* clean input for another bulk allocation */
memset(page_list, 0, filled * sizeof(*page_list));
to_fill = min_t(unsigned int, to_alloc,
@@ -352,6 +352,7 @@ static struct mlx5_vhca_data_buffer *
mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
u8 index, size_t required_length)
{
+ u32 npages = DIV_ROUND_UP(required_length, PAGE_SIZE);
struct mlx5_vhca_data_buffer *buf = migf->buf[index];
u8 chunk_num;

@@ -359,12 +360,11 @@ mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
chunk_num = buf->stop_copy_chunk_num;
buf->migf->buf[index] = NULL;
/* Checking whether the pre-allocated buffer can fit */
- if (buf->allocated_length >= required_length)
+ if (buf->npages >= npages)
return buf;

mlx5vf_put_data_buffer(buf);
- buf = mlx5vf_get_data_buffer(buf->migf, required_length,
- DMA_FROM_DEVICE);
+ buf = mlx5vf_get_data_buffer(buf->migf, npages, DMA_FROM_DEVICE);
if (IS_ERR(buf))
return buf;

@@ -417,7 +417,9 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
u8 *to_buff;
int ret;

- header_buf = mlx5vf_get_data_buffer(migf, size, DMA_NONE);
+ BUILD_BUG_ON(size > PAGE_SIZE);
+ header_buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(size, PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(header_buf))
return PTR_ERR(header_buf);

@@ -432,7 +434,7 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
to_buff = kmap_local_page(page);
memcpy(to_buff, &header, sizeof(header));
header_buf->length = sizeof(header);
- data.stop_copy_size = cpu_to_le64(migf->buf[0]->allocated_length);
+ data.stop_copy_size = cpu_to_le64(migf->buf[0]->npages * PAGE_SIZE);
memcpy(to_buff + sizeof(header), &data, sizeof(data));
header_buf->length += sizeof(data);
kunmap_local(to_buff);
@@ -481,15 +483,22 @@ static int mlx5vf_prep_stop_copy(struct mlx5vf_pci_core_device *mvdev,

num_chunks = mvdev->chunk_mode ? MAX_NUM_CHUNKS : 1;
for (i = 0; i < num_chunks; i++) {
- buf = mlx5vf_get_data_buffer(migf, inc_state_size, DMA_FROM_DEVICE);
+ buf = mlx5vf_get_data_buffer(
+ migf, DIV_ROUND_UP(inc_state_size, PAGE_SIZE),
+ DMA_FROM_DEVICE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto err;
}

+ BUILD_BUG_ON(sizeof(struct mlx5_vf_migration_header) >
+ PAGE_SIZE);
migf->buf[i] = buf;
- buf = mlx5vf_get_data_buffer(migf,
- sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+ buf = mlx5vf_get_data_buffer(
+ migf,
+ DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+ PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto err;
@@ -597,7 +606,8 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
* We finished transferring the current state and the device has a
* dirty state, save a new state to be ready for.
*/
- buf = mlx5vf_get_data_buffer(migf, inc_length, DMA_FROM_DEVICE);
+ buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(inc_length, PAGE_SIZE),
+ DMA_FROM_DEVICE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
mlx5vf_mark_err(migf);
@@ -718,8 +728,8 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev, bool track)

if (track) {
/* leave the allocated buffer ready for the stop-copy phase */
- buf = mlx5vf_alloc_data_buffer(migf,
- migf->buf[0]->allocated_length, DMA_FROM_DEVICE);
+ buf = mlx5vf_alloc_data_buffer(migf, migf->buf[0]->npages,
+ DMA_FROM_DEVICE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto out_pd;
@@ -783,16 +793,15 @@ mlx5vf_resume_read_image_no_header(struct mlx5_vhca_data_buffer *vhca_buf,
const char __user **buf, size_t *len,
loff_t *pos, ssize_t *done)
{
+ u32 npages = DIV_ROUND_UP(requested_length, PAGE_SIZE);
int ret;

if (requested_length > MAX_LOAD_SIZE)
return -ENOMEM;

- if (vhca_buf->allocated_length < requested_length) {
- ret = mlx5vf_add_migration_pages(
- vhca_buf,
- DIV_ROUND_UP(requested_length - vhca_buf->allocated_length,
- PAGE_SIZE));
+ if (vhca_buf->npages < npages) {
+ ret = mlx5vf_add_migration_pages(vhca_buf,
+ npages - vhca_buf->npages);
if (ret)
return ret;
}
@@ -992,11 +1001,14 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
goto out_unlock;
break;
case MLX5_VF_LOAD_STATE_PREP_HEADER_DATA:
- if (vhca_buf_header->allocated_length < migf->record_size) {
+ {
+ u32 npages = DIV_ROUND_UP(migf->record_size, PAGE_SIZE);
+
+ if (vhca_buf_header->npages < npages) {
mlx5vf_free_data_buffer(vhca_buf_header);

- migf->buf_header[0] = mlx5vf_alloc_data_buffer(migf,
- migf->record_size, DMA_NONE);
+ migf->buf_header[0] = mlx5vf_alloc_data_buffer(
+ migf, npages, DMA_NONE);
if (IS_ERR(migf->buf_header[0])) {
ret = PTR_ERR(migf->buf_header[0]);
migf->buf_header[0] = NULL;
@@ -1009,6 +1021,7 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
vhca_buf_header->start_pos = migf->max_pos;
migf->load_state = MLX5_VF_LOAD_STATE_READ_HEADER_DATA;
break;
+ }
case MLX5_VF_LOAD_STATE_READ_HEADER_DATA:
ret = mlx5vf_resume_read_header_data(migf, vhca_buf_header,
&buf, &len, pos, &done);
@@ -1019,12 +1032,13 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
{
u64 size = max(migf->record_size,
migf->stop_copy_prep_size);
+ u32 npages = DIV_ROUND_UP(size, PAGE_SIZE);

- if (vhca_buf->allocated_length < size) {
+ if (vhca_buf->npages < npages) {
mlx5vf_free_data_buffer(vhca_buf);

migf->buf[0] = mlx5vf_alloc_data_buffer(migf,
- size, DMA_TO_DEVICE);
+ npages, DMA_TO_DEVICE);
if (IS_ERR(migf->buf[0])) {
ret = PTR_ERR(migf->buf[0]);
migf->buf[0] = NULL;
@@ -1115,8 +1129,11 @@ mlx5vf_pci_resume_device_data(struct mlx5vf_pci_core_device *mvdev)

migf->buf[0] = buf;
if (MLX5VF_PRE_COPY_SUPP(mvdev)) {
- buf = mlx5vf_alloc_data_buffer(migf,
- sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+ buf = mlx5vf_alloc_data_buffer(
+ migf,
+ DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+ PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto out_buf;
--
2.44.0


2024-03-05 11:41:55

by Leon Romanovsky

Subject: [RFC RESEND 08/16] RDMA/umem: Store ODP access mask information in PFN

From: Leon Romanovsky <[email protected]>

In preparation for removing dma_list, store the access mask in the PFN
entry rather than in the dma_addr_t.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 99 +++++++++++-----------------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
drivers/infiniband/hw/mlx5/odp.c | 37 ++++++-----
include/rdma/ib_umem_odp.h | 13 ----
4 files changed, 59 insertions(+), 91 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index f69d1233dc82..3619fb78f786 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -310,22 +310,11 @@ EXPORT_SYMBOL(ib_umem_odp_release);
static int ib_umem_odp_map_dma_single_page(
struct ib_umem_odp *umem_odp,
unsigned int dma_index,
- struct page *page,
- u64 access_mask)
+ struct page *page)
{
struct ib_device *dev = umem_odp->umem.ibdev;
dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];

- if (*dma_addr) {
- /*
- * If the page is already dma mapped it means it went through
- * a non-invalidating trasition, like read-only to writable.
- * Resync the flags.
- */
- *dma_addr = (*dma_addr & ODP_DMA_ADDR_MASK) | access_mask;
- return 0;
- }
-
*dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
DMA_BIDIRECTIONAL);
if (ib_dma_mapping_error(dev, *dma_addr)) {
@@ -333,7 +322,6 @@ static int ib_umem_odp_map_dma_single_page(
return -EFAULT;
}
umem_odp->npages++;
- *dma_addr |= access_mask;
return 0;
}

@@ -369,9 +357,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
struct hmm_range range = {};
unsigned long timeout;

- if (access_mask == 0)
- return -EINVAL;
-
if (user_virt < ib_umem_start(umem_odp) ||
user_virt + bcnt > ib_umem_end(umem_odp))
return -EFAULT;
@@ -397,7 +382,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
if (fault) {
range.default_flags = HMM_PFN_REQ_FAULT;

- if (access_mask & ODP_WRITE_ALLOWED_BIT)
+ if (access_mask & HMM_PFN_WRITE)
range.default_flags |= HMM_PFN_REQ_WRITE;
}

@@ -429,22 +414,17 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
for (pfn_index = 0; pfn_index < num_pfns;
pfn_index += 1 << (page_shift - PAGE_SHIFT), dma_index++) {

- if (fault) {
- /*
- * Since we asked for hmm_range_fault() to populate
- * pages it shouldn't return an error entry on success.
- */
- WARN_ON(range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
- WARN_ON(!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
- } else {
- if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID)) {
- WARN_ON(umem_odp->dma_list[dma_index]);
- continue;
- }
- access_mask = ODP_READ_ALLOWED_BIT;
- if (range.hmm_pfns[pfn_index] & HMM_PFN_WRITE)
- access_mask |= ODP_WRITE_ALLOWED_BIT;
- }
+ /*
+ * Since we asked for hmm_range_fault() to populate
+ * pages it shouldn't return an error entry on success.
+ */
+ WARN_ON(fault && range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
+ WARN_ON(fault && !(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
+ if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID))
+ continue;
+
+ if (range.hmm_pfns[pfn_index] & HMM_PFN_STICKY)
+ continue;

hmm_order = hmm_pfn_to_map_order(range.hmm_pfns[pfn_index]);
/* If a hugepage was detected and ODP wasn't set for, the umem
@@ -459,13 +439,13 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
}

ret = ib_umem_odp_map_dma_single_page(
- umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]),
- access_mask);
+ umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
if (ret < 0) {
ibdev_dbg(umem_odp->umem.ibdev,
"ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
break;
}
+ range.hmm_pfns[pfn_index] |= HMM_PFN_STICKY;
}
/* upon success lock should stay on hold for the callee */
if (!ret)
@@ -485,7 +465,6 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
u64 bound)
{
- dma_addr_t dma_addr;
dma_addr_t dma;
int idx;
u64 addr;
@@ -496,34 +475,34 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
virt = max_t(u64, virt, ib_umem_start(umem_odp));
bound = min_t(u64, bound, ib_umem_end(umem_odp));
for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
+ unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
+ struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
+
idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
dma = umem_odp->dma_list[idx];

- /* The access flags guaranteed a valid DMA address in case was NULL */
- if (dma) {
- unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
- struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
-
- dma_addr = dma & ODP_DMA_ADDR_MASK;
- ib_dma_unmap_page(dev, dma_addr,
- BIT(umem_odp->page_shift),
- DMA_BIDIRECTIONAL);
- if (dma & ODP_WRITE_ALLOWED_BIT) {
- struct page *head_page = compound_head(page);
- /*
- * set_page_dirty prefers being called with
- * the page lock. However, MMU notifiers are
- * called sometimes with and sometimes without
- * the lock. We rely on the umem_mutex instead
- * to prevent other mmu notifiers from
- * continuing and allowing the page mapping to
- * be removed.
- */
- set_page_dirty(head_page);
- }
- umem_odp->dma_list[idx] = 0;
- umem_odp->npages--;
+ if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
+ continue;
+ if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_STICKY))
+ continue;
+
+ ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
+ DMA_BIDIRECTIONAL);
+ if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
+ struct page *head_page = compound_head(page);
+ /*
+ * set_page_dirty prefers being called with
+ * the page lock. However, MMU notifiers are
+ * called sometimes with and sometimes without
+ * the lock. We rely on the umem_mutex instead
+ * to prevent other mmu notifiers from
+ * continuing and allowing the page mapping to
+ * be removed.
+ */
+ set_page_dirty(head_page);
}
+ umem_odp->pfn_list[pfn_idx] &= ~HMM_PFN_STICKY;
+ umem_odp->npages--;
}
}
EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index bbe79b86c717..4f368242680d 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -334,6 +334,7 @@ struct mlx5_ib_flow_db {
#define MLX5_IB_UPD_XLT_PD BIT(4)
#define MLX5_IB_UPD_XLT_ACCESS BIT(5)
#define MLX5_IB_UPD_XLT_INDIRECT BIT(6)
+#define MLX5_IB_UPD_XLT_DOWNGRADE BIT(7)

/* Private QP creation flags to be passed in ib_qp_init_attr.create_flags.
*
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 4a04cbc5b78a..5713fe25f4de 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -34,6 +34,7 @@
#include <linux/kernel.h>
#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
+#include <linux/hmm.h>

#include "mlx5_ib.h"
#include "cmd.h"
@@ -143,22 +144,12 @@ static void populate_klm(struct mlx5_klm *pklm, size_t idx, size_t nentries,
}
}

-static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
-{
- u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
-
- if (umem_dma & ODP_READ_ALLOWED_BIT)
- mtt_entry |= MLX5_IB_MTT_READ;
- if (umem_dma & ODP_WRITE_ALLOWED_BIT)
- mtt_entry |= MLX5_IB_MTT_WRITE;
-
- return mtt_entry;
-}
-
static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
struct mlx5_ib_mr *mr, int flags)
{
struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
+ bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
+ unsigned long pfn;
dma_addr_t pa;
size_t i;

@@ -166,8 +157,17 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
return;

for (i = 0; i < nentries; i++) {
+ pfn = odp->pfn_list[idx + i];
+ if (!(pfn & HMM_PFN_VALID))
+ /* Initial ODP init */
+ continue;
+
pa = odp->dma_list[idx + i];
- pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
+ pa |= MLX5_IB_MTT_READ;
+ if ((pfn & HMM_PFN_WRITE) && !downgrade)
+ pa |= MLX5_IB_MTT_WRITE;
+
+ pas[i] = cpu_to_be64(pa);
}
}

@@ -268,8 +268,7 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
* estimate the cost of another UMR vs. the cost of bigger
* UMR.
*/
- if (umem_odp->dma_list[idx] &
- (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+ if (umem_odp->pfn_list[idx] & HMM_PFN_VALID) {
if (!in_block) {
blk_start_idx = idx;
in_block = 1;
@@ -555,7 +554,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
{
int page_shift, ret, np;
bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
- u64 access_mask;
+ u64 access_mask = 0;
u64 start_idx;
bool fault = !(flags & MLX5_PF_FLAGS_SNAPSHOT);
u32 xlt_flags = MLX5_IB_UPD_XLT_ATOMIC;
@@ -563,12 +562,14 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
if (flags & MLX5_PF_FLAGS_ENABLE)
xlt_flags |= MLX5_IB_UPD_XLT_ENABLE;

+ if (flags & MLX5_PF_FLAGS_DOWNGRADE)
+ xlt_flags |= MLX5_IB_UPD_XLT_DOWNGRADE;
+
page_shift = odp->page_shift;
start_idx = (user_va - ib_umem_start(odp)) >> page_shift;
- access_mask = ODP_READ_ALLOWED_BIT;

if (odp->umem.writable && !downgrade)
- access_mask |= ODP_WRITE_ALLOWED_BIT;
+ access_mask |= HMM_PFN_WRITE;

np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
if (np < 0)
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index bb2d7f2a5b04..095b1297cfb1 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -68,19 +68,6 @@ static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
umem_odp->page_shift;
}

-/*
- * The lower 2 bits of the DMA address signal the R/W permissions for
- * the entry. To upgrade the permissions, provide the appropriate
- * bitmask to the map_dma_pages function.
- *
- * Be aware that upgrading a mapped address might result in change of
- * the DMA address for the page.
- */
-#define ODP_READ_ALLOWED_BIT (1<<0ULL)
-#define ODP_WRITE_ALLOWED_BIT (1<<1ULL)
-
-#define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
-
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING

struct ib_umem_odp *
--
2.44.0


2024-03-05 11:42:24

by Leon Romanovsky

Subject: [RFC RESEND 15/16] block: add dma_link_range() based API

From: Chaitanya Kulkarni <[email protected]>

Add two helper functions needed by drivers: blk_rq_get_dma_length()
to calculate the total DMA length of a request, and blk_rq_dma_map()
to create the DMA mapping.

blk_rq_get_dma_length() returns the total length of the request, used
when the driver allocates IOVA space for the request with
dma_alloc_iova(). This length initializes iova->size and is passed
down the IOVA allocation call chain :-
dma_map_ops->alloc_iova()
	iommu_dma_alloc_iova()
		alloc_iova_fast()
			iova_rcache_get()
			OR
			alloc_iova()

blk_rq_dma_map() iterates through the bvec list and creates a DMA
mapping for each page using the iova parameter with the help of
dma_link_range(). Note that @iova is allocated and pre-initialized by
the caller using dma_alloc_iova(). After creating a mapping for each
page, the callback function @cb provided by the driver is called with
the mapped DMA address for this page, the offset into the IOVA space
(needed at unlink time), the length of the mapped page, and the index
of the page mapped in this request. The driver is responsible for
using this DMA address to complete the mapping of the underlying
protocol-specific data structures, such as NVMe PRPs or NVMe SGLs.
This callback approach allows us to iterate the bvec list only once,
mapping one bvec page at a time to a DMA address and using that DMA
address to build the protocol-specific data structure. Finally, the
number of linked ranges is returned.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
block/blk-merge.c | 156 +++++++++++++++++++++++++++++++++++++++++
include/linux/blk-mq.h | 9 +++
2 files changed, 165 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2d470cf2173e..63effc8ac1db 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -583,6 +583,162 @@ int __blk_rq_map_sg(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL(__blk_rq_map_sg);

+static dma_addr_t blk_dma_link_page(struct page *page, unsigned int page_offset,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ dma_addr_t dma_addr;
+ int ret;
+
+ dma_addr = dma_link_range(page, page_offset, iova, dma_offset);
+ ret = dma_mapping_error(iova->dev, dma_addr);
+ if (ret) {
+ pr_err("dma_mapping_err %d dma_addr 0x%llx dma_offset %llu\n",
+ ret, dma_addr, dma_offset);
+ /* better way ? */
+ dma_addr = 0;
+ }
+ return dma_addr;
+}
+
+/**
+ * blk_rq_dma_map: block layer request to DMA mapping helper.
+ *
+ * @req : [in] request to be mapped
+ * @cb : [in] callback to be called for each bvec mapped into the
+ * underlying driver.
+ * @cb_data : [in] callback data to be passed, private to the underlying
+ * driver.
+ * @iova : [in] iova to be used to create DMA mapping for this request's
+ * bvecs.
+ * Description:
+ * Iterates through the bvec list and creates a DMA mapping for each bvec page
+ * using @iova with dma_link_range(). Note that @iova needs to be allocated and
+ * pre-initialized using dma_alloc_iova() by the caller. After creating
+ * a mapping for each page, the callback function @cb provided by the driver
+ * is called with the mapped DMA address for this bvec, the offset into the
+ * IOVA space, the length of the mapped page, and the bvec number mapped in
+ * this request. The driver is
+ * responsible for using this DMA address to complete the mapping of the
+ * underlying protocol-specific data structures, such as NVMe PRPs or NVMe
+ * SGLs. This callback approach allows us to iterate the bvec list only once
+ * to create the bvec-to-DMA mapping and use that DMA address in the driver
+ * to build the protocol-specific data structure, essentially mapping one
+ * bvec page at a time to a DMA address and using that DMA address to create
+ * the underlying protocol-specific data structure.
+ *
+ * Caller needs to ensure @iova is initialized and allocated using
+ * dma_alloc_iova().
+int blk_rq_dma_map(struct request *req, driver_map_cb cb, void *cb_data,
+ struct dma_iova_attrs *iova)
+{
+ dma_addr_t curr_dma_offset = 0;
+ dma_addr_t prev_dma_addr = 0;
+ dma_addr_t dma_addr;
+ size_t prev_dma_len = 0;
+ struct req_iterator iter;
+ struct bio_vec bv;
+ int linked_cnt = 0;
+
+ rq_for_each_bvec(bv, req, iter) {
+ if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
+ curr_dma_offset = prev_dma_addr + prev_dma_len;
+
+ dma_addr = blk_dma_link_page(bv.bv_page, bv.bv_offset,
+ iova, curr_dma_offset);
+ if (!dma_addr)
+ break;
+
+ cb(cb_data, linked_cnt, dma_addr, curr_dma_offset,
+ bv.bv_len);
+
+ prev_dma_len = bv.bv_len;
+ prev_dma_addr = dma_addr;
+ linked_cnt++;
+ } else {
+ unsigned nbytes = bv.bv_len;
+ unsigned total = 0;
+ unsigned offset, len;
+
+ while (nbytes > 0) {
+ struct page *page = bv.bv_page;
+
+ offset = bv.bv_offset + total;
+ len = min(get_max_segment_size(&req->q->limits,
+ page, offset),
+ nbytes);
+
+ page += (offset >> PAGE_SHIFT);
+ offset &= ~PAGE_MASK;
+
+ curr_dma_offset = prev_dma_addr + prev_dma_len;
+
+ dma_addr = blk_dma_link_page(page, offset,
+ iova,
+ curr_dma_offset);
+ if (!dma_addr)
+ break;
+
+ cb(cb_data, linked_cnt, dma_addr,
+ curr_dma_offset, len);
+
+ total += len;
+ nbytes -= len;
+
+ prev_dma_len = len;
+ prev_dma_addr = dma_addr;
+ linked_cnt++;
+ }
+ }
+ }
+ return linked_cnt;
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map);
+
+/*
+ * Calculate total DMA length needed to satisfy this request.
+ */
+size_t blk_rq_get_dma_length(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct bio *bio = rq->bio;
+ unsigned int offset, len;
+ struct bvec_iter iter;
+ size_t dma_length = 0;
+ struct bio_vec bvec;
+
+ if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
+ return rq->special_vec.bv_len;
+
+ if (!rq->bio)
+ return 0;
+
+ for_each_bio(bio) {
+ bio_for_each_bvec(bvec, bio, iter) {
+ unsigned int nbytes = bvec.bv_len;
+ unsigned int total = 0;
+
+ if (bvec.bv_offset + bvec.bv_len <= PAGE_SIZE) {
+ dma_length += bvec.bv_len;
+ continue;
+ }
+
+ while (nbytes > 0) {
+ offset = bvec.bv_offset + total;
+ len = min(get_max_segment_size(&q->limits,
+ bvec.bv_page,
+ offset), nbytes);
+ total += len;
+ nbytes -= len;
+ dma_length += len;
+ }
+ }
+ }
+
+ return dma_length;
+}
+EXPORT_SYMBOL(blk_rq_get_dma_length);
+
static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
sector_t offset)
{
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7a8150a5f051..80b9c7f2c3a0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -8,6 +8,7 @@
#include <linux/scatterlist.h>
#include <linux/prefetch.h>
#include <linux/srcu.h>
+#include <linux/dma-mapping.h>

struct blk_mq_tags;
struct blk_flush_queue;
@@ -1144,7 +1145,15 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,

return __blk_rq_map_sg(q, rq, sglist, &last_sg);
}
+
+typedef void (*driver_map_cb)(void *cb_data, u32 cnt, dma_addr_t dma_addr,
+ dma_addr_t offset, u32 len);
+
+int blk_rq_dma_map(struct request *req, driver_map_cb cb, void *cb_data,
+ struct dma_iova_attrs *iova);
+
void blk_dump_rq_flags(struct request *, char *);
+size_t blk_rq_get_dma_length(struct request *rq);

#ifdef CONFIG_BLK_DEV_ZONED
static inline unsigned int blk_rq_zone_no(struct request *rq)
--
2.44.0


2024-03-05 11:42:50

by Leon Romanovsky

[permalink] [raw]
Subject: [RFC RESEND 13/16] vfio/mlx5: Explicitly store page list

From: Leon Romanovsky <[email protected]>

As a preparation for removing the scatter-gather table and unifying
the receive and send lists, explicitly store the page list.
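The kvrealloc()-based growth pattern in the patch below can be illustrated with a small userspace sketch (using realloc() plus an explicit memset() to stand in for kvrealloc() with __GFP_ZERO; `struct buf` is a hypothetical stand-in for mlx5_vhca_data_buffer):

```c
#include <stdlib.h>
#include <string.h>

struct buf {
	void **page_list;  /* grows across calls, like buf->page_list */
	size_t npages;     /* slots filled so far */
};

/* Extend buf->page_list by 'to_fill' zeroed slots, preserving existing
 * entries -- the same shape as the kvrealloc() call in the patch. */
static int extend_page_list(struct buf *buf, size_t to_fill)
{
	size_t old_size = buf->npages * sizeof(*buf->page_list);
	size_t new_size = old_size + to_fill * sizeof(*buf->page_list);
	void **p = realloc(buf->page_list, new_size);

	if (!p)
		return -1;
	/* Zero only the newly added tail, like __GFP_ZERO on growth. */
	memset((char *)p + old_size, 0, new_size - old_size);
	buf->page_list = p;
	return 0;
}
```

Because the old contents are preserved on each growth step, the patch no longer needs the per-iteration memset() and temporary list that the previous code used.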

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/vfio/pci/mlx5/cmd.c | 1 +
drivers/vfio/pci/mlx5/cmd.h | 1 +
drivers/vfio/pci/mlx5/main.c | 35 +++++++++++++++++------------------
3 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 44762980fcb9..5e2103042d9b 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -411,6 +411,7 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
for_each_sgtable_page(&buf->table.sgt, &sg_iter, 0)
__free_page(sg_page_iter_page(&sg_iter));
sg_free_append_table(&buf->table);
+ kvfree(buf->page_list);
kfree(buf);
}

diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 83728c0669e7..815fcb54494d 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -57,6 +57,7 @@ struct mlx5_vf_migration_header {
};

struct mlx5_vhca_data_buffer {
+ struct page **page_list;
struct sg_append_table table;
loff_t start_pos;
u64 length;
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index b11b1c27d284..7ffe24693a55 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -69,44 +69,43 @@ int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
unsigned int npages)
{
unsigned int to_alloc = npages;
+ size_t old_size, new_size;
struct page **page_list;
unsigned long filled;
unsigned int to_fill;
int ret;

- to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
- page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
+ to_fill = min_t(unsigned int, npages,
+ PAGE_SIZE / sizeof(*buf->page_list));
+ old_size = buf->npages * sizeof(*buf->page_list);
+ new_size = old_size + to_fill * sizeof(*buf->page_list);
+ page_list = kvrealloc(buf->page_list, old_size, new_size,
+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!page_list)
return -ENOMEM;

+ buf->page_list = page_list;
+
do {
filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
- page_list);
- if (!filled) {
- ret = -ENOMEM;
- goto err;
- }
+ buf->page_list + buf->npages);
+ if (!filled)
+ return -ENOMEM;
+
to_alloc -= filled;
ret = sg_alloc_append_table_from_pages(
- &buf->table, page_list, filled, 0,
+ &buf->table, buf->page_list + buf->npages, filled, 0,
filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
GFP_KERNEL_ACCOUNT);
-
if (ret)
- goto err;
+ return ret;
+
buf->npages += filled;
- /* clean input for another bulk allocation */
- memset(page_list, 0, filled * sizeof(*page_list));
to_fill = min_t(unsigned int, to_alloc,
- PAGE_SIZE / sizeof(*page_list));
+ PAGE_SIZE / sizeof(*buf->page_list));
} while (to_alloc > 0);

- kvfree(page_list);
return 0;
-
-err:
- kvfree(page_list);
- return ret;
}

static void mlx5vf_disable_fd(struct mlx5_vf_migration_file *migf)
--
2.44.0


2024-03-05 11:42:57

by Leon Romanovsky

[permalink] [raw]
Subject: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

From: Chaitanya Kulkarni <[email protected]>

Update the nvme_iod structure to hold the iova, the list of DMA linked
addresses, and the total linked count. The first is needed in the
request submission path to create the request-to-DMA mapping, and the
last two are needed in the request completion path to remove the DMA
mapping. In nvme_map_data(), initialize the iova with the device, the
direction, and the iova DMA length obtained with the help of
blk_rq_get_dma_length(), then allocate the iova using dma_alloc_iova().

In nvme_pci_setup_sgls(), call the newly added blk_rq_dma_map() to
create the request-to-DMA mapping, providing the callback function
nvme_pci_sgl_map(). The callback initializes the NVMe SGL DMA
addresses.

Finally, in nvme_unmap_data(), unlink the DMA addresses and free the
iova.

Full disclosure:-
-----------------

This is an RFC to demonstrate that the newly added DMA APIs can be used
to map/unmap bvecs without the use of an sg list, hence I've modified
the pci code to only handle SGLs for now. Once we have some agreement
on the structure of the new DMA API, I'll add support for PRPs, along
with all the optimizations that I've removed from the code for this
RFC, for both NVMe SGLs and PRPs.

I was able to run fio verification job successfully :-

$ fio fio/verify.fio --ioengine=io_uring --filename=/dev/nvme0n1
--loops=10
write-and-verify: (g=0): rw=randwrite, bs=(R) 8192B-8192B, (W) 8192B-8192B,
(T) 8192B-8192B, ioengine=io_uring, iodepth=16
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [V(1)][81.6%][r=12.2MiB/s][r=1559 IOPS][eta 03m:00s]
write-and-verify: (groupid=0, jobs=1): err= 0: pid=4435: Mon Mar 4 20:54:48 2024
read: IOPS=2789, BW=21.8MiB/s (22.9MB/s)(6473MiB/297008msec)
slat (usec): min=4, max=5124, avg=356.51, stdev=604.30
clat (nsec): min=1593, max=23376k, avg=5377076.99, stdev=2039189.93
lat (usec): min=493, max=23407, avg=5733.58, stdev=2103.22
clat percentiles (usec):
| 1.00th=[ 1172], 5.00th=[ 2114], 10.00th=[ 2835], 20.00th=[ 3654],
| 30.00th=[ 4228], 40.00th=[ 4752], 50.00th=[ 5276], 60.00th=[ 5800],
| 70.00th=[ 6325], 80.00th=[ 7046], 90.00th=[ 8094], 95.00th=[ 8979],
| 99.00th=[10421], 99.50th=[11076], 99.90th=[12780], 99.95th=[14222],
| 99.99th=[16909]
write: IOPS=2608, BW=20.4MiB/s (21.4MB/s)(10.0GiB/502571msec); 0 zone resets
slat (usec): min=4, max=5787, avg=382.68, stdev=649.01
clat (nsec): min=521, max=23650k, avg=5751363.17, stdev=2676065.35
lat (usec): min=95, max=23674, avg=6134.04, stdev=2813.48
clat percentiles (usec):
| 1.00th=[ 709], 5.00th=[ 1270], 10.00th=[ 1958], 20.00th=[ 3261],
| 30.00th=[ 4228], 40.00th=[ 5014], 50.00th=[ 5800], 60.00th=[ 6521],
| 70.00th=[ 7373], 80.00th=[ 8225], 90.00th=[ 9241], 95.00th=[ 9896],
| 99.00th=[11469], 99.50th=[11863], 99.90th=[13960], 99.95th=[15270],
| 99.99th=[17695]
bw ( KiB/s): min= 1440, max=132496, per=99.28%, avg=20715.88, stdev=13123.13, samples=1013
iops : min= 180, max=16562, avg=2589.34, stdev=1640.39, samples=1013
lat (nsec) : 750=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 100=0.01%, 250=0.01%, 500=0.07%
lat (usec) : 750=0.79%, 1000=1.22%
lat (msec) : 2=5.94%, 4=18.87%, 10=69.53%, 20=3.58%, 50=0.01%
cpu : usr=1.01%, sys=98.95%, ctx=1591, majf=0, minf=2286
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=828524,1310720,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
READ: bw=21.8MiB/s (22.9MB/s), 21.8MiB/s-21.8MiB/s (22.9MB/s-22.9MB/s),
io=6473MiB (6787MB), run=297008-297008msec
WRITE: bw=20.4MiB/s (21.4MB/s), 20.4MiB/s-20.4MiB/s (21.4MB/s-21.4MB/s),
io=10.0GiB (10.7GB), run=502571-502571msec

Disk stats (read/write):
nvme0n1: ios=829189/1310720, sectors=13293416/20971520, merge=0/0,
ticks=836561/1340351, in_queue=2176913, util=99.30%

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/nvme/host/pci.c | 220 +++++++++-------------------------------
1 file changed, 49 insertions(+), 171 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e6267a6aa380..140939228409 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -236,7 +236,9 @@ struct nvme_iod {
unsigned int dma_len; /* length of single DMA segment mapping */
dma_addr_t first_dma;
dma_addr_t meta_dma;
- struct sg_table sgt;
+ struct dma_iova_attrs iova;
+ dma_addr_t dma_link_address[128];
+ u16 nr_dma_link_address;
union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
};

@@ -521,25 +523,10 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
return true;
}

-static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
-{
- const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- dma_addr_t dma_addr = iod->first_dma;
- int i;
-
- for (i = 0; i < iod->nr_allocations; i++) {
- __le64 *prp_list = iod->list[i].prp_list;
- dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);
-
- dma_pool_free(dev->prp_page_pool, prp_list, dma_addr);
- dma_addr = next_dma_addr;
- }
-}
-
static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ u16 i;

if (iod->dma_len) {
dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
@@ -547,9 +534,8 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
return;
}

- WARN_ON_ONCE(!iod->sgt.nents);
-
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
+ for (i = 0; i < iod->nr_dma_link_address; i++)
+ dma_unlink_range(&iod->iova, iod->dma_link_address[i]);

if (iod->nr_allocations == 0)
dma_pool_free(dev->prp_small_pool, iod->list[0].sg_list,
@@ -557,120 +543,15 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
else if (iod->nr_allocations == 1)
dma_pool_free(dev->prp_page_pool, iod->list[0].sg_list,
iod->first_dma);
- else
- nvme_free_prps(dev, req);
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
-}
-
-static void nvme_print_sgl(struct scatterlist *sgl, int nents)
-{
- int i;
- struct scatterlist *sg;
-
- for_each_sg(sgl, sg, nents, i) {
- dma_addr_t phys = sg_phys(sg);
- pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
- "dma_address:%pad dma_length:%d\n",
- i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
- sg_dma_len(sg));
- }
-}
-
-static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
- struct request *req, struct nvme_rw_command *cmnd)
-{
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- struct dma_pool *pool;
- int length = blk_rq_payload_bytes(req);
- struct scatterlist *sg = iod->sgt.sgl;
- int dma_len = sg_dma_len(sg);
- u64 dma_addr = sg_dma_address(sg);
- int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
- __le64 *prp_list;
- dma_addr_t prp_dma;
- int nprps, i;
-
- length -= (NVME_CTRL_PAGE_SIZE - offset);
- if (length <= 0) {
- iod->first_dma = 0;
- goto done;
- }
-
- dma_len -= (NVME_CTRL_PAGE_SIZE - offset);
- if (dma_len) {
- dma_addr += (NVME_CTRL_PAGE_SIZE - offset);
- } else {
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
- }
-
- if (length <= NVME_CTRL_PAGE_SIZE) {
- iod->first_dma = dma_addr;
- goto done;
- }
-
- nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
- if (nprps <= (256 / 8)) {
- pool = dev->prp_small_pool;
- iod->nr_allocations = 0;
- } else {
- pool = dev->prp_page_pool;
- iod->nr_allocations = 1;
- }
-
- prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
- if (!prp_list) {
- iod->nr_allocations = -1;
- return BLK_STS_RESOURCE;
- }
- iod->list[0].prp_list = prp_list;
- iod->first_dma = prp_dma;
- i = 0;
- for (;;) {
- if (i == NVME_CTRL_PAGE_SIZE >> 3) {
- __le64 *old_prp_list = prp_list;
- prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
- if (!prp_list)
- goto free_prps;
- iod->list[iod->nr_allocations++].prp_list = prp_list;
- prp_list[0] = old_prp_list[i - 1];
- old_prp_list[i - 1] = cpu_to_le64(prp_dma);
- i = 1;
- }
- prp_list[i++] = cpu_to_le64(dma_addr);
- dma_len -= NVME_CTRL_PAGE_SIZE;
- dma_addr += NVME_CTRL_PAGE_SIZE;
- length -= NVME_CTRL_PAGE_SIZE;
- if (length <= 0)
- break;
- if (dma_len > 0)
- continue;
- if (unlikely(dma_len < 0))
- goto bad_sgl;
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
- }
-done:
- cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
- cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
- return BLK_STS_OK;
-free_prps:
- nvme_free_prps(dev, req);
- return BLK_STS_RESOURCE;
-bad_sgl:
- WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
- "Invalid SGL for payload:%d nents:%d\n",
- blk_rq_payload_bytes(req), iod->sgt.nents);
- return BLK_STS_IOERR;
+ dma_free_iova(&iod->iova);
}

static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
- struct scatterlist *sg)
+ dma_addr_t dma_addr,
+ unsigned int dma_len)
{
- sge->addr = cpu_to_le64(sg_dma_address(sg));
- sge->length = cpu_to_le32(sg_dma_len(sg));
+ sge->addr = cpu_to_le64(dma_addr);
+ sge->length = cpu_to_le32(dma_len);
sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}

@@ -682,25 +563,37 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
}

+struct nvme_pci_sgl_map_data {
+ struct nvme_iod *iod;
+ struct nvme_sgl_desc *sgl_list;
+};
+
+static void nvme_pci_sgl_map(void *data, u32 cnt, dma_addr_t dma_addr,
+ dma_addr_t offset, u32 len)
+{
+ struct nvme_pci_sgl_map_data *d = data;
+ struct nvme_sgl_desc *sgl_list = d->sgl_list;
+ struct nvme_iod *iod = d->iod;
+
+ nvme_pci_sgl_set_data(&sgl_list[cnt], dma_addr, len);
+ iod->dma_link_address[cnt] = offset;
+ iod->nr_dma_link_address++;
+}
+
static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
struct request *req, struct nvme_rw_command *cmd)
{
+ unsigned int entries = blk_rq_nr_phys_segments(req);
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
- struct scatterlist *sg = iod->sgt.sgl;
- unsigned int entries = iod->sgt.nents;
+ struct dma_pool *pool;
dma_addr_t sgl_dma;
- int i = 0;
+ int linked_count;
+ struct nvme_pci_sgl_map_data data;

/* setting the transfer type as SGL */
cmd->flags = NVME_CMD_SGL_METABUF;

- if (entries == 1) {
- nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
- return BLK_STS_OK;
- }
-
if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
pool = dev->prp_small_pool;
iod->nr_allocations = 0;
@@ -718,11 +611,13 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
iod->list[0].sg_list = sg_list;
iod->first_dma = sgl_dma;

- nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
- do {
- nvme_pci_sgl_set_data(&sg_list[i++], sg);
- sg = sg_next(sg);
- } while (--entries > 0);
+ data.iod = iod;
+ data.sgl_list = sg_list;
+
+ linked_count = blk_rq_dma_map(req, nvme_pci_sgl_map, &data,
+ &iod->iova);
+
+ nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, linked_count);

return BLK_STS_OK;
}
@@ -788,36 +683,20 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
&cmnd->rw, &bv);
}
}
-
- iod->dma_len = 0;
- iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
- if (!iod->sgt.sgl)
+ iod->iova.dev = dev->dev;
+ iod->iova.dir = rq_dma_dir(req);
+ iod->iova.attrs = DMA_ATTR_NO_WARN;
+ iod->iova.size = blk_rq_get_dma_length(req);
+ if (!iod->iova.size)
return BLK_STS_RESOURCE;
- sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
- iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
- if (!iod->sgt.orig_nents)
- goto out_free_sg;

- rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
- DMA_ATTR_NO_WARN);
- if (rc) {
- if (rc == -EREMOTEIO)
- ret = BLK_STS_TARGET;
- goto out_free_sg;
- }
+ rc = dma_alloc_iova(&iod->iova);
+ if (rc)
+ return BLK_STS_RESOURCE;

- if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
- ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
- else
- ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
- if (ret != BLK_STS_OK)
- goto out_unmap_sg;
- return BLK_STS_OK;
+ iod->dma_len = 0;

-out_unmap_sg:
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-out_free_sg:
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
+ ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
return ret;
}

@@ -841,7 +720,6 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)

iod->aborted = false;
iod->nr_allocations = -1;
- iod->sgt.nents = 0;

ret = nvme_setup_cmd(req->q->queuedata, req);
if (ret)
--
2.44.0


2024-03-05 12:05:51

by Robin Murphy

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On 2024-03-05 11:18 am, Leon Romanovsky wrote:
> This is complimentary part to the proposed LSF/MM topic.
> https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
>
> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and
> DMA patches are ready for review and inclusion, the NVMe patches are still in
> progress as they require agreement on API first.
>
> Thanks
>
> -------------------------------------------------------------------------------
> The DMA mapping operation performs two steps at one same time: allocates
> IOVA space and actually maps DMA pages to that space. This one shot
> operation works perfectly for non-complex scenarios, where callers use
> that DMA API in control path when they setup hardware.
>
> However in more complex scenarios, when DMA mapping is needed in data
> path and especially when some sort of specific datatype is involved,
> such one shot approach has its drawbacks.
>
> That approach pushes developers to introduce new DMA APIs for specific
> datatype. For example existing scatter-gather mapping functions, or
> latest Chuck's RFC series to add biovec related DMA mapping [1] and
> probably struct folio will need it too.
>
> These advanced DMA mapping APIs are needed to calculate IOVA size to
> allocate it as one chunk and some sort of offset calculations to know
> which part of IOVA to map.

I don't follow this part at all - at *some* point, something must know a
range of memory addresses involved in a DMA transfer, so that's where it
should map that range for DMA. Even in a badly-designed system where the
point it's most practical to make the mapping is further out and only
knows that DMA will touch some subset of a buffer, but doesn't know
exactly what subset yet, you'd usually just map the whole buffer. I
don't see why the DMA API would ever need to know about anything other
than pages/PFNs and dma_addr_ts (yes, it does also accept them being
wrapped together in scatterlists; yes, scatterlists are awful and it
would be nice to replace them with a better general DMA descriptor; that
is a whole other subject of its own).

> Instead of teaching DMA to know these specific datatypes, let's separate
> existing DMA mapping routine to two steps and give an option to advanced
> callers (subsystems) perform all calculations internally in advance and
> map pages later when it is needed.

From a brief look, this is clearly an awkward reinvention of the IOMMU
API. If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU
address spaces then they can and should use the IOMMU API. Perhaps
there's room for some quality-of-life additions to the IOMMU API to help
with common usage patterns, but the generic DMA mapping API is
absolutely not the place for it.

Thanks,
Robin.

> In this series, three users are converted and each of such conversion
> presents different positive gain:
> 1. RDMA simplifies and speeds up its pagefault handling for
> on-demand-paging (ODP) mode.
> 2. VFIO PCI live migration code saves huge chunk of memory.
> 3. NVMe PCI avoids intermediate SG table manipulation and operates
> directly on BIOs.
>
> Thanks
>
> [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
>
> Chaitanya Kulkarni (2):
> block: add dma_link_range() based API
> nvme-pci: use blk_rq_dma_map() for NVMe SGL
>
> Leon Romanovsky (14):
> mm/hmm: let users to tag specific PFNs
> dma-mapping: provide an interface to allocate IOVA
> dma-mapping: provide callbacks to link/unlink pages to specific IOVA
> iommu/dma: Provide an interface to allow preallocate IOVA
> iommu/dma: Prepare map/unmap page functions to receive IOVA
> iommu/dma: Implement link/unlink page callbacks
> RDMA/umem: Preallocate and cache IOVA for UMEM ODP
> RDMA/umem: Store ODP access mask information in PFN
> RDMA/core: Separate DMA mapping to caching IOVA and page linkage
> RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
> vfio/mlx5: Explicitly use number of pages instead of allocated length
> vfio/mlx5: Rewrite create mkey flow to allow better code reuse
> vfio/mlx5: Explicitly store page list
> vfio/mlx5: Convert vfio to use DMA link API
>
> Documentation/core-api/dma-attributes.rst | 7 +
> block/blk-merge.c | 156 ++++++++++++++
> drivers/infiniband/core/umem_odp.c | 219 +++++++------------
> drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
> drivers/infiniband/hw/mlx5/odp.c | 59 +++--
> drivers/iommu/dma-iommu.c | 129 ++++++++---
> drivers/nvme/host/pci.c | 220 +++++--------------
> drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++----------
> drivers/vfio/pci/mlx5/cmd.h | 22 +-
> drivers/vfio/pci/mlx5/main.c | 136 +++++-------
> include/linux/blk-mq.h | 9 +
> include/linux/dma-map-ops.h | 13 ++
> include/linux/dma-mapping.h | 39 ++++
> include/linux/hmm.h | 3 +
> include/rdma/ib_umem_odp.h | 22 +-
> include/rdma/ib_verbs.h | 54 +++++
> kernel/dma/debug.h | 2 +
> kernel/dma/direct.h | 7 +-
> kernel/dma/mapping.c | 91 ++++++++
> mm/hmm.c | 34 +--
> 20 files changed, 870 insertions(+), 605 deletions(-)
>

2024-03-05 12:31:29

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 05, 2024 at 12:05:23PM +0000, Robin Murphy wrote:
> On 2024-03-05 11:18 am, Leon Romanovsky wrote:
> > This is complimentary part to the proposed LSF/MM topic.
> > https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
> >
> > This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and
> > DMA patches are ready for review and inclusion, the NVMe patches are still in
> > progress as they require agreement on API first.
> >
> > Thanks
> >
> > -------------------------------------------------------------------------------
> > The DMA mapping operation performs two steps at one same time: allocates
> > IOVA space and actually maps DMA pages to that space. This one shot
> > operation works perfectly for non-complex scenarios, where callers use
> > that DMA API in control path when they setup hardware.
> >
> > However in more complex scenarios, when DMA mapping is needed in data
> > path and especially when some sort of specific datatype is involved,
> > such one shot approach has its drawbacks.
> >
> > That approach pushes developers to introduce new DMA APIs for specific
> > datatype. For example existing scatter-gather mapping functions, or
> > latest Chuck's RFC series to add biovec related DMA mapping [1] and
> > probably struct folio will need it too.
> >
> > These advanced DMA mapping APIs are needed to calculate IOVA size to
> > allocate it as one chunk and some sort of offset calculations to know
> > which part of IOVA to map.
>
> I don't follow this part at all - at *some* point, something must know a
> range of memory addresses involved in a DMA transfer, so that's where it
> should map that range for DMA.

In all of the cases presented in this series, the overall DMA size is
known in advance. In the RDMA case, it is known when the user registers
the memory; in VFIO, when live migration happens; and in NVMe, when the
BIO is created.

So once we have allocated the IOVA, we only need to link the ranges,
which is the same as map but without the IOVA allocation.
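The split Leon describes can be simulated in a few lines of userspace C. The dma_* names follow this series' proposal; the bodies here are stand-ins for illustration, not the kernel implementation:

```c
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Stand-in for the series' struct dma_iova_attrs. */
struct dma_iova_attrs {
	dma_addr_t addr;   /* base of the allocated IOVA range */
	uint64_t size;     /* total size, known in advance by the caller */
};

/* Step 1: allocate the whole IOVA range up front (simulated). */
static int dma_alloc_iova_sim(struct dma_iova_attrs *iova)
{
	iova->addr = 0xff000000ULL; /* arbitrary simulated IOVA base */
	return 0;
}

/* Step 2: link one range at the given offset inside the allocation
 * (simulated). The real dma_link_range() would also program the IOMMU
 * page tables for that range. */
static dma_addr_t dma_link_range_sim(const struct dma_iova_attrs *iova,
				     uint64_t dma_offset)
{
	return iova->addr + dma_offset;
}
```

With this split, the sizing work (e.g. blk_rq_get_dma_length()) feeds step 1 once per transfer, and the data path only performs step 2 per range.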

> Even in a badly-designed system where the
> point it's most practical to make the mapping is further out and only knows
> that DMA will touch some subset of a buffer, but doesn't know exactly what
> subset yet, you'd usually just map the whole buffer. I don't see why the DMA
> API would ever need to know about anything other than pages/PFNs and
> dma_addr_ts (yes, it does also accept them being wrapped together in
> scatterlists; yes, scatterlists are awful and it would be nice to replace
> them with a better general DMA descriptor; that is a whole other subject of
> its own).

This is exactly what was done here, we got rid of scatterlists.

>
> > Instead of teaching DMA to know these specific datatypes, let's separate
> > existing DMA mapping routine to two steps and give an option to advanced
> > callers (subsystems) perform all calculations internally in advance and
> > map pages later when it is needed.
>
> From a brief look, this is clearly an awkward reinvention of the IOMMU API.
> If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address
> spaces then they can and should use the IOMMU API. Perhaps there's room for
> some quality-of-life additions to the IOMMU API to help with common usage
> patterns, but the generic DMA mapping API is absolutely not the place for
> it.

DMA mapping gives a nice abstraction over the IOMMU, and allows us to
have the same flow for IOMMU and non-IOMMU cases without duplicating
code, whereas you suggest teaching almost every part of the kernel
about the IOMMU.

In this series, we changed RDMA, VFIO and NVMe, and in all cases we
removed more code than we added. From what I saw, VDPA and virtio-blk
will benefit from the proposed API too.

Even in this RFC, where Chaitanya did a partial job and didn't convert
the whole driver, the gain is pretty obvious:
https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u

drivers/nvme/host/pci.c | 220 ++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------------------------------------------------------------------------------------------------------
1 file changed, 49 insertions(+), 171 deletions(-)


Thanks

>
> Thanks,
> Robin.
>
> > In this series, three users are converted and each of such conversion
> > presents different positive gain:
> > 1. RDMA simplifies and speeds up its pagefault handling for
> > on-demand-paging (ODP) mode.
> > 2. VFIO PCI live migration code saves huge chunk of memory.
> > 3. NVMe PCI avoids intermediate SG table manipulation and operates
> > directly on BIOs.
> >
> > Thanks
> >
> > [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
> >
> > Chaitanya Kulkarni (2):
> > block: add dma_link_range() based API
> > nvme-pci: use blk_rq_dma_map() for NVMe SGL
> >
> > Leon Romanovsky (14):
> > mm/hmm: let users to tag specific PFNs
> > dma-mapping: provide an interface to allocate IOVA
> > dma-mapping: provide callbacks to link/unlink pages to specific IOVA
> > iommu/dma: Provide an interface to allow preallocate IOVA
> > iommu/dma: Prepare map/unmap page functions to receive IOVA
> > iommu/dma: Implement link/unlink page callbacks
> > RDMA/umem: Preallocate and cache IOVA for UMEM ODP
> > RDMA/umem: Store ODP access mask information in PFN
> > RDMA/core: Separate DMA mapping to caching IOVA and page linkage
> > RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
> > vfio/mlx5: Explicitly use number of pages instead of allocated length
> > vfio/mlx5: Rewrite create mkey flow to allow better code reuse
> > vfio/mlx5: Explicitly store page list
> > vfio/mlx5: Convert vfio to use DMA link API
> >
> > Documentation/core-api/dma-attributes.rst | 7 +
> > block/blk-merge.c | 156 ++++++++++++++
> > drivers/infiniband/core/umem_odp.c | 219 +++++++------------
> > drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
> > drivers/infiniband/hw/mlx5/odp.c | 59 +++--
> > drivers/iommu/dma-iommu.c | 129 ++++++++---
> > drivers/nvme/host/pci.c | 220 +++++--------------
> > drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++----------
> > drivers/vfio/pci/mlx5/cmd.h | 22 +-
> > drivers/vfio/pci/mlx5/main.c | 136 +++++-------
> > include/linux/blk-mq.h | 9 +
> > include/linux/dma-map-ops.h | 13 ++
> > include/linux/dma-mapping.h | 39 ++++
> > include/linux/hmm.h | 3 +
> > include/rdma/ib_umem_odp.h | 22 +-
> > include/rdma/ib_verbs.h | 54 +++++
> > kernel/dma/debug.h | 2 +
> > kernel/dma/direct.h | 7 +-
> > kernel/dma/mapping.c | 91 ++++++++
> > mm/hmm.c | 34 +--
> > 20 files changed, 870 insertions(+), 605 deletions(-)
> >

2024-03-05 15:52:18

by Keith Busch

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
> @@ -236,7 +236,9 @@ struct nvme_iod {
> unsigned int dma_len; /* length of single DMA segment mapping */
> dma_addr_t first_dma;
> dma_addr_t meta_dma;
> - struct sg_table sgt;
> + struct dma_iova_attrs iova;
> + dma_addr_t dma_link_address[128];
> + u16 nr_dma_link_address;
> union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
> };

That's quite a lot of space to add to the iod. We preallocate one for
every request, and there could be millions of them.

2024-03-05 16:10:02

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On 3/5/24 8:51 AM, Keith Busch wrote:
> On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
>> @@ -236,7 +236,9 @@ struct nvme_iod {
>> unsigned int dma_len; /* length of single DMA segment mapping */
>> dma_addr_t first_dma;
>> dma_addr_t meta_dma;
>> - struct sg_table sgt;
>> + struct dma_iova_attrs iova;
>> + dma_addr_t dma_link_address[128];
>> + u16 nr_dma_link_address;
>> union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
>> };
>
> That's quite a lot of space to add to the iod. We preallocate one for
> every request, and there could be millions of them.

Yeah, that's just a complete non-starter. As far as I can tell, this
ends up adding 1052 bytes per request. Doing the quick math on my test
box (24 drives), that's just a smidge over 3GB of extra memory. That's
not going to work, not even close.

--
Jens Axboe


2024-03-05 16:40:30

by Chaitanya Kulkarni

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On 3/5/24 08:08, Jens Axboe wrote:
> On 3/5/24 8:51 AM, Keith Busch wrote:
>> On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
>>> @@ -236,7 +236,9 @@ struct nvme_iod {
>>> unsigned int dma_len; /* length of single DMA segment mapping */
>>> dma_addr_t first_dma;
>>> dma_addr_t meta_dma;
>>> - struct sg_table sgt;
>>> + struct dma_iova_attrs iova;
>>> + dma_addr_t dma_link_address[128];
>>> + u16 nr_dma_link_address;
>>> union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
>>> };
>> That's quite a lot of space to add to the iod. We preallocate one for
>> every request, and there could be millions of them.
> Yeah, that's just a complete non-starter. As far as I can tell, this
> ends up adding 1052 bytes per request. Doing the quick math on my test
> box (24 drives), that's just a smidge over 3GB of extra memory. That's
> not going to work, not even close.
>

I don't have any intent to use more space for the nvme_iod than it
uses now. I'll trim down the iod structure and send out a patch soon with
this fixed so we can continue the discussion here on this thread ...

-ck


2024-03-05 16:47:36

by Chaitanya Kulkarni

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On 3/5/24 08:39, Chaitanya Kulkarni wrote:
> On 3/5/24 08:08, Jens Axboe wrote:
>> On 3/5/24 8:51 AM, Keith Busch wrote:
>>> On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
>>>> @@ -236,7 +236,9 @@ struct nvme_iod {
>>>> unsigned int dma_len; /* length of single DMA segment mapping */
>>>> dma_addr_t first_dma;
>>>> dma_addr_t meta_dma;
>>>> - struct sg_table sgt;
>>>> + struct dma_iova_attrs iova;
>>>> + dma_addr_t dma_link_address[128];
>>>> + u16 nr_dma_link_address;
>>>> union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
>>>> };
>>> That's quite a lot of space to add to the iod. We preallocate one for
>>> every request, and there could be millions of them.
>> Yeah, that's just a complete non-starter. As far as I can tell, this
>> ends up adding 1052 bytes per request. Doing the quick math on my test
>> box (24 drives), that's just a smidge over 3GB of extra memory. That's
>> not going to work, not even close.
>>
> I don't have any intent to use more space for the nvme_iod than what
> it is now. I'll trim down the iod structure and send out a patch soon with
> this fixed to continue the discussion here on this thread ...
>
> -ck
>
>

For the final version, once the DMA API discussion is concluded, I plan to use
the iod_mempool for the allocation of nvme_iod->dma_link_address. However, I
won't wait for that and will send out an updated version with the nvme_iod
size trimmed.

If you guys have any other comments please let me know, or we can continue
the discussion once I post the new version of this patch on this thread ...

Thanks a lot Keith and Jens for looking into it ...

-ck


2024-03-06 14:34:03

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On Tue, Mar 05, 2024 at 08:51:56AM -0700, Keith Busch wrote:
> On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
> > @@ -236,7 +236,9 @@ struct nvme_iod {
> > unsigned int dma_len; /* length of single DMA segment mapping */
> > dma_addr_t first_dma;
> > dma_addr_t meta_dma;
> > - struct sg_table sgt;
> > + struct dma_iova_attrs iova;
> > + dma_addr_t dma_link_address[128];
> > + u16 nr_dma_link_address;
> > union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
> > };
>
> That's quite a lot of space to add to the iod. We preallocate one for
> every request, and there could be millions of them.

Yes. And this whole proposal also seems clearly confused (not just
because of the gazillion reposts) but because it mixes up the case
where we can coalesce CPU regions into a single dma_addr_t range
(iommu and maybe in the future swiotlb) and one where we need a
dma_addr_t range per cpu range (direct misc cruft).

2024-03-06 14:44:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 05, 2024 at 02:29:35PM +0200, Leon Romanovsky wrote:
> > > These advanced DMA mapping APIs are needed to calculate IOVA size to
> > > allocate it as one chunk and some sort of offset calculations to know
> > > which part of IOVA to map.
> >
> > I don't follow this part at all - at *some* point, something must know a
> > range of memory addresses involved in a DMA transfer, so that's where it
> > should map that range for DMA.
>
> In all presented cases in this series, the overall DMA size is known in
> advance. In RDMA case, it is known when user registers the memory, in
> VFIO, when live migration is happening and in NVMe, when BIO is created.
>
> So once we allocated IOVA, we will need to link ranges, which is the
> same as map but without IOVA allocation.

But above you say:

"These advanced DMA mapping APIs are needed to calculate IOVA size to
allocate it as one chunk and some sort of offset calculations to know
which part of IOVA to map."

this suggests you need helpers to calculate the len and offset. I
can't see where that would ever make sense. The total transfer
size should just be passed in by the callers and be known, and
there should be no offset.

> > > Instead of teaching DMA to know these specific datatypes, let's separate
> > > existing DMA mapping routine to two steps and give an option to advanced
> > > callers (subsystems) perform all calculations internally in advance and
> > > map pages later when it is needed.
> >
> > From a brief look, this is clearly an awkward reinvention of the IOMMU API.
> > If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address
> > spaces then they can and should use the IOMMU API. Perhaps there's room for
> > some quality-of-life additions to the IOMMU API to help with common usage
> > patterns, but the generic DMA mapping API is absolutely not the place for
> > it.
>
> DMA mapping gives nice abstraction from IOMMU, and allows us to have
> same flow for IOMMU and non-IOMMU flows without duplicating code, while
> you suggest to teach almost every part in the kernel to know about IOMMU.

Except that the flows are fundamentally different for the "can coalesce"
vs "can't coalesce" case. In the former we have one dma_addr_t range,
and in the latter as many as there are input vectors (this is ignoring
the weird iommu merging case where we coalesce some but not all
segments, but I'd rather not have that in a new API).

So if we want to efficiently be able to handle these cases we need
two APIs in the driver and a good framework to switch between them.
Robin makes a point here that the iommu API handles the can coalesce
case and he has a point as that's exactly how the IOMMU API works.
I'd still prefer to wrap it with dma callers to handle things like
swiotlb and maybe Xen grant tables and to avoid the type confusion
between dma_addr_t and then untyped iova in the iommu layer, but
having this layer or not is probably worth a discussion.

>
> In this series, we changed RDMA, VFIO and NVMe, and in all cases we
> removed more code than added. From what I saw, VDPA and virito-blk will
> benefit from proposed API too.
>
> Even in this RFC, where Chaitanya did a partial job and didn't convert
> whole driver, the gain is pretty obvious:
> https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u
>

I have no idea how that nvme patch is even supposed to work. It removes
the PRP path in nvme-pci, which not only is the most common I/O path
but actually required for the admin queue as NVMe doesn't support
SGLs for the admin queue.


2024-03-06 15:07:13

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On Wed, Mar 06, 2024 at 03:33:21PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 05, 2024 at 08:51:56AM -0700, Keith Busch wrote:
> > On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
> > > @@ -236,7 +236,9 @@ struct nvme_iod {
> > > unsigned int dma_len; /* length of single DMA segment mapping */
> > > dma_addr_t first_dma;
> > > dma_addr_t meta_dma;
> > > - struct sg_table sgt;
> > > + struct dma_iova_attrs iova;
> > > + dma_addr_t dma_link_address[128];
> > > + u16 nr_dma_link_address;
> > > union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
> > > };
> >
> > That's quite a lot of space to add to the iod. We preallocate one for
> > every request, and there could be millions of them.
>
> Yes. And this whole proposal also seems clearly confused (not just
> because of the gazillion reposts) but because it mixes up the case
> where we can coalesce CPU regions into a single dma_addr_t range
> (iommu and maybe in the future swiotlb) and one where we need a

I had the broad expectation that the DMA API user would already be
providing a place to store the dma_addr_t as it has to feed that into
the HW. That memory should simply last up until we do dma unmap and
the cases that need dma_addr_t during unmap can go get it from there.

If that is how things are organized, is there another reason to lean
further into single-range case optimization?

We can't do much on the map side as single range doesn't imply
contiguous range, P2P and alignment create discontinuities in the
dma_addr_t that still have to be dealt with.

Jason

2024-03-06 15:44:03

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 03:44:16PM +0100, Christoph Hellwig wrote:

> Except that the flows are fundamentally different for the "can coalesce"
> vs "can't coalesce" case. In the former we have one dma_addr_t range,
> and in the latter as many as there are input vectors (this is ignoring
> the weird iommu merging case where we coalesce some but not all
> segments, but I'd rather not have that in a new API).

I don't think they are so fundamentally different, at least in our
past conversations I never came out with the idea we should burden the
driver with two different flows based on what kind of alignment the
transfer happens to have.

Certainly if we split the API so that one API focuses on doing only
page-aligned transfers, the aligned part does become a little simpler.

At least the RDMA drivers could productively use just a page aligned
interface. But I didn't think this would make BIO users happy so never
even thought about it..

> The total transfer size should just be passed in by the callers and
> be known, and there should be no offset.

The API needs the caller to figure out the total number of IOVA pages
it needs, rounding up the CPU ranges to full aligned pages. That
becomes the IOVA allocation.

offset is something that arises to support non-aligned transfers.

> So if we want to efficiently be able to handle these cases we need
> two APIs in the driver and a good framework to switch between them.

But, what does the non-page-aligned version look like? Doesn't it
still look basically like this?

And what is the actual difference if the input is aligned? The caller
can assume it doesn't need to provide a per-range dma_addr_t during
unmap.

It still can't assume the HW programming will be linear due to the P2P
!ACS support.

And it still has to call an API per-cpu range to actually program the
IOMMU.

So are they really so different to want different APIs? That strikes
me as a big driver cost.

> I'd still prefer to wrap it with dma callers to handle things like
> swiotlb and maybe Xen grant tables and to avoid the type confusion
> between dma_addr_t and then untyped iova in the iommu layer, but
> having this layer or not is probably worth a discussion.

I'm surprised by the idea of random drivers reaching past dma-iommu.c
and into the iommu layer to setup DMA directly on the DMA API's
iommu_domain?? That seems like completely giving up on the DMA API
abstraction to me. :(

IMHO, it needs to be wrapped, the wrapper needs to do all the special
P2P stuff, at a minimum. The wrapper should multiplex to all the
non-iommu cases for the driver too.

We still need to achieve some kind of abstraction here that doesn't
burden every driver with different code paths for each DMA back end!
Don't we??

Jason

2024-03-06 16:15:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On Wed, Mar 06, 2024 at 11:05:18AM -0400, Jason Gunthorpe wrote:
> > Yes. And this whole proposal also seems clearly confused (not just
> > because of the gazillion reposts) but because it mixes up the case
> > where we can coalesce CPU regions into a single dma_addr_t range
> > (iommu and maybe in the future swiotlb) and one where we need a
>
> I had the broad expectation that the DMA API user would already be
> providing a place to store the dma_addr_t as it has to feed that into
> the HW. That memory should simply last up until we do dma unmap and
> the cases that need dma_addr_t during unmap can go get it from there.

Well. The dma_addr_t needs to be fed into the hardware somehow
obviously. But for a the coalesced case we only need one such
field, not Nranges.

> We can't do much on the map side as single range doesn't imply
> contiguous range, P2P and alignment create discontinuities in the
> dma_addr_t that still have to be delt with.

For alignment the right answer is almost always to require the
upper layers to align to the iommu granularity. We've been a bit
lax about that due to the way scatterlists are designed, but
requiring the proper alignment actually benefits everyone.

2024-03-06 16:20:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 11:43:28AM -0400, Jason Gunthorpe wrote:
> I don't think they are so fundamentally different, at least in our
> past conversations I never came out with the idea we should burden the
> driver with two different flows based on what kind of alignment the
> transfer happens to have.

Then we talked past each other..

> At least the RDMA drivers could productively use just a page aligned
> interface. But I didn't think this would make BIO users happy so never
> even thought about it..

page aligned is generally the right thing for the block layer. NVMe
for example already requires that anyway due to PRPs.

> > The total transfer size should just be passed in by the callers and
> > be known, and there should be no offset.
>
> The API needs the caller to figure out the total number of IOVA pages
> it needs, rounding up the CPU ranges to full aligned pages. That
> becomes the IOVA allocation.

Yes, it's a basic align up to the granularity assuming we don't bother
with non-aligned transfers.

>
> > So if we want to efficiently be able to handle these cases we need
> > two APIs in the driver and a good framework to switch between them.
>
> But, what does the non-page-aligned version look like? Doesn't it
> still look basically like this?

I'd just rather have the non-aligned case for those who really need
it be the loop over map single region that is needed for the direct
mapping anyway.

>
> And what is the actual difference if the input is aligned? The caller
> can assume it doesn't need to provide a per-range dma_addr_t during
> unmap.

A per-range dma_addr_t doesn't really make sense for the aligned and
coalesced case.

> It still can't assume the HW programming will be linear due to the P2P
> !ACS support.
>
> And it still has to call an API per-cpu range to actually program the
> IOMMU.
>
> So are they really so different to want different APIs? That strikes
> me as a big driver cost.

To not have to store a dma_address range per CPU range that doesn't
actually get used at all.

2024-03-06 17:45:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 05:20:22PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 11:43:28AM -0400, Jason Gunthorpe wrote:
> > I don't think they are so fundamentally different, at least in our
> > past conversations I never came out with the idea we should burden the
> > driver with two different flows based on what kind of alignment the
> > transfer happens to have.
>
> Then we talked past each other..

Well, we never talked in such detail

> > > So if we want to efficiently be able to handle these cases we need
> > > two APIs in the driver and a good framework to switch between them.
> >
> > But, what does the non-page-aligned version look like? Doesn't it
> > still look basically like this?
>
> I'd just rather have the non-aligned case for those who really need
> it be the loop over map single region that is needed for the direct
> mapping anyway.

There is a list of interesting cases this has to cover:

1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs
3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
4. swiotlb single range. Only IOVA range at unmap, single HW SGL
5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs

I think we agree that 1 and 2 should be optimized highly as they are
the common case. That mainly means no dma_addr_t storage in either

5 is the slowest and has the most overhead.

4 is basically the same as 2 from the driver's viewpoint

3 is quite similar to 1, but it has the IOVA range at unmap.

6 doesn't have to be optimal, from the driver perspective it can be
like 5

That is three basic driver flows 1/3, 2/4 and 5/6

So are you thinking something more like a driver flow of:

.. extent IO and get # aligned pages and know if there is P2P ..
dma_init_io(state, num_pages, p2p_flag)
if (dma_io_single_range(state)) {
// #2, #4
for each io()
dma_link_aligned_pages(state, io range)
hw_sgl = (state->iova, state->len)
} else {
// #1, #3, #5, #6
hw_sgls = alloc_hw_sgls(num_ios)
if (dma_io_needs_dma_addr_unmap(state))
dma_addr_storage = alloc_num_ios(); // #5 only
for each io()
hw_sgl[i] = dma_map_single(state, io range)
if (dma_addr_storage)
dma_addr_storage[i] = hw_sgl[i]; // #5 only
}

?

This is not quite what you said, we split the driver flow based on
needing 1 HW SGL vs need many HW SGL.

> > So are they really so different to want different APIs? That strikes
> > me as a big driver cost.
>
> To not have to store a dma_address range per CPU range that doesn't
> actually get used at all.

Right, that is a nice optimization we should reach for.

Jason

2024-03-06 22:14:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 01:44:56PM -0400, Jason Gunthorpe wrote:
> There is a list of interesting cases this has to cover:
>
> 1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
> 2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs
> 3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
> 4. swiotlb single range. Only IOVA range at unmap, single HW SGL
> 5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
> 6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs
>
> I think we agree that 1 and 2 should be optimized highly as they are
> the common case. That mainly means no dma_addr_t storage in either

I don't think you can do without dma_addr_t storage. In most cases
your can just store the dma_addr_t in the LE/BE encoded hardware
SGL, so no extra storage should be needed though.

> 3 is quite similar to 1, but it has the IOVA range at unmap.

Can you explain what P2P case you mean? The switch one with the
bus address is indeed basically the same, just with potentially a
different offset, while the through host bridge case is the same
as a normal iommu map.

>
> 4 is basically the same as 2 from the driver's viewpoint

I'd actually treat it the same as one.

> 5 is the slowest and has the most overhead.

and 5 could be broken into multiple 4s at least for now. Or do you
have a different definition of range here?

> So are you thinking something more like a driver flow of:
>
> .. extent IO and get # aligned pages and know if there is P2P ..
> dma_init_io(state, num_pages, p2p_flag)
> if (dma_io_single_range(state)) {
> // #2, #4
> for each io()
> dma_link_aligned_pages(state, io range)
> hw_sgl = (state->iova, state->len)
> } else {

I think what you have a dma_io_single_range should become before
the dma_init_io. If we know we can't coalesce it really just is a
dma_map_{single,page,bvec} loop, no need for any extra state.

And we're back to roughly the proposal I sent out years ago.

> This is not quite what you said, we split the driver flow based on
> needing 1 HW SGL vs need many HW SGL.

That's at least what I intended to say, and I'm a little curious as to
how it came across.


2024-03-07 00:01:04

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 11:14:00PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 01:44:56PM -0400, Jason Gunthorpe wrote:
> > There is a list of interesting cases this has to cover:
> >
> > 1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
> > 2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs
> > 3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
> > 4. swiotlb single range. Only IOVA range at unmap, single HW SGL
> > 5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
> > 6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs
> >
> > I think we agree that 1 and 2 should be optimized highly as they are
> > the common case. That mainly means no dma_addr_t storage in either
>
> I don't think you can do without dma_addr_t storage. In most cases
> you can just store the dma_addr_t in the LE/BE encoded hardware
> SGL, so no extra storage should be needed though.

RDMA (and often DRM too) generally doesn't work like that, the driver
copies the page table into the device and then the only reason to have
a dma_addr_t storage is to pass that to the dma unmap API. Optionally
eliminating long term dma_addr_t storage would be a worthwhile memory
savings for large long lived user space memory registrations.

> > 3 is quite similar to 1, but it has the IOVA range at unmap.
>
> Can you explain what P2P case you mean? The switch one with the
> bus address is indeed basically the same, just with potentially a
> different offset, while the through host bridge case is the same
> as a normal iommu map.

Yes, the bus address case. The IOMMU is turned on, ACS on a local
switch is off.

All pages go through the IOMMU in the normal way except P2P pages
between devices on the same switch. (ie the dma_addr_t is CPU physical
of the P2P plus an offset). RDMA must support a mixture of IOVA and
P2P addresses in the same IO operation.

I suppose it would make more sense to say it is similar to 6.

> > 5 is the slowest and has the most overhead.
>
> and 5 could be broken into multiple 4s at least for now. Or do you
> have a different definition of range here?

I wrote the list as from a single IO operation perspective, so all but
5 need to store a single IOVA range that could be stored in some
simple non-dynamic memory along with whatever HW SGLs/etc are needed.

The point of 5 being different is because the driver has to provide a
dynamically sized list of dma_addr_t's as storage until unmap. 5 is
the only case that requires that full list.

So yes, 5 could be broken up into multiple IOs, but then the
specialness of 5 is the driver must keep track of multiple IOs..

> > So are you thinking something more like a driver flow of:
> >
> > .. extent IO and get # aligned pages and know if there is P2P ..
> > dma_init_io(state, num_pages, p2p_flag)
> > if (dma_io_single_range(state)) {
> > // #2, #4
> > for each io()
> > dma_link_aligned_pages(state, io range)
> > hw_sgl = (state->iova, state->len)
> > } else {
>
> I think what you have a dma_io_single_range should become before
> the dma_init_io. If we know we can't coalesce it really just is a
> dma_map_{single,page,bvec} loop, no need for any extra state.

I imagine dma_io_single_range() to just check a flag in state.

I still want to call dma_init_io() for the non-coalescing cases
because all the flows, regardless of composition, should be about as
fast as dma_map_sg is today.

That means we need to always pre-allocate the IOVA in any case where
the IOMMU might be active - even on a non-coalescing flow.

IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
be used and we can't just call today's dma_map_page() in a loop on the
non-coalescing side and pay the overhead of Nx IOVA allocations.

In large part this is for RDMA, were a single P2P page in a large
multi-gigabyte user memory registration shouldn't drastically harm the
registration performance by falling down to doing dma_map_page, and an
IOVA allocation, on a 4k page by page basis.

The other thing that got hand waved here is how does dma_init_io()
know which of the 6 states we are looking at? I imagine we probably
want to do something like:

struct dma_io_summarize summary = {};
for each io()
dma_io_summarize_range(&summary, io range)
dma_init_io(dev, &state, &summary);
if (state->single_range) {
} else {
}
dma_io_done_mapping(&state); <-- flush IOTLB once

At least this way the DMA API still has some decent opportunity for
abstraction and future growth using state to pass bits of information
between the API family.

There is some swiotlb complexity that needs something like this, a
system with iommu can still fail to coalesce if the pages are
encrypted and the device doesn't support DMA from encrypted pages. We
need to check for P2P pages, encrypted memory pages, and who knows
what else.

> And we're back to roughly the proposal I sent out years ago.

Well, all of this is roughly your original proposal, just with
different optimization choices and some enhancement to also cover
hmm_range_fault() users.

Enhancing the single sgl case is not a big change, I think. It does
seem simplifying for the driver to not have to coalesce SGLs to detect
the single-SGL fast-path.

> > This is not quite what you said, we split the driver flow based on
> > needing 1 HW SGL vs need many HW SGL.
>
> That's at least what I intended to say, and I'm a little curious as to
> how it came across.

Ok, I was reading the discussion more about as alignment than single
HW SGL, I think you meant alignment as implying coalescing behavior
implying single HW SGL..

Jason

2024-03-07 06:01:47

by Zhu Yanjun

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

在 2024/3/5 12:18, Leon Romanovsky 写道:
> This is complimentary part to the proposed LSF/MM topic.
> https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057

I am interested in this topic. Hope I can join the meeting to discuss
this topic.

Zhu Yanjun



2024-03-07 15:06:13

by Christoph Hellwig

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> >
> > I don't think you can do without dma_addr_t storage. In most cases
> > you can just store the dma_addr_t in the LE/BE encoded hardware
> > SGL, so no extra storage should be needed though.
>
> RDMA (and often DRM too) generally doesn't work like that, the driver
> copies the page table into the device and then the only reason to have
> a dma_addr_t storage is to pass that to the dma unmap API. Optionally
> eliminating long term dma_addr_t storage would be a worthwhile memory
> savings for large long lived user space memory registrations.

It's just kinda hard to do. For aligned IOMMU mapping you'd only
have one dma_addr_t mapping (or maybe a few if P2P regions are
involved), so this probably doesn't matter. For direct mappings
you'd have a few, but maybe the better answer is to use THP
more aggressively and reduce the number of segments.

> I wrote the list as from a single IO operation perspective, so all but
> 5 need to store a single IOVA range that could be stored in some
> simple non-dynamic memory along with whatever HW SGLs/etc are needed.
>
> The point of 5 being different is because the driver has to provide a
> dynamically sized list of dma_addr_t's as storage until unmap. 5 is
> the only case that requires that full list.

No, all cases need to store one or more ranges.

> > > So are you thinking something more like a driver flow of:
> > >
> > > .. extent IO and get # aligned pages and know if there is P2P ..
> > > dma_init_io(state, num_pages, p2p_flag)
> > > if (dma_io_single_range(state)) {
> > > // #2, #4
> > > for each io()
> > > dma_link_aligned_pages(state, io range)
> > > hw_sgl = (state->iova, state->len)
> > > } else {
> >
> > I think what you have as dma_io_single_range should come before
> > the dma_init_io. If we know we can't coalesce, it really is just a
> > dma_map_{single,page,bvec} loop, no need for any extra state.
>
> I imagine dma_io_single_range() to just check a flag in state.
>
> I still want to call dma_init_io() for the non-coalescing cases
> because all the flows, regardless of composition, should be about as
> fast as dma_map_sg is today.

If all flows include multiple non-coalesced regions, that just makes
things very complicated, and that's exactly what I'd want to avoid.

> That means we need to always pre-allocate the IOVA in any case where
> the IOMMU might be active - even on a non-coalescing flow.
>
> IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> be used and we can't just call today's dma_map_page() in a loop on the
> non-coalescing side and pay the overhead of Nx IOVA allocations.
>
> In large part this is for RDMA, where a single P2P page in a large
> multi-gigabyte user memory registration shouldn't drastically harm the
> registration performance by falling down to doing dma_map_page, and an
> IOVA allocation, on a 4k page by page basis.

But that P2P page needs to be handled very differently, as with it
we can't actually use a single iova range. So I'm not sure how that
is even supposed to work. If you have

+-------+-----+-------+
| local | P2P | local |
+-------+-----+-------+

you need at least 3 HW SGL entries, as the IOVA won't be contiguous.

> The other thing that got hand waved here is how does dma_init_io()
> know which of the 6 states we are looking at? I imagine we probably
> want to do something like:
>
> struct dma_io_summarize summary = {};
> for each io()
> dma_io_summarize_range(&summary, io range)
> dma_init_io(dev, &state, &summary);
> if (state->single_range) {
> } else {
> }
> dma_io_done_mapping(&state); <-- flush IOTLB once

That's why I really just want 2 cases. If the caller guarantees the
range is coalescable and there is an IOMMU use the iommu-API like
API, else just iter over map_single/page.

> Enhancing the single-SGL case is not a big change, I think. It does
> seem simpler for the driver not to have to coalesce SGLs to detect
> the single-SGL fast path.
>
> > > This is not quite what you said, we split the driver flow based on
> > > needing 1 HW SGL vs need many HW SGL.
> >
> > That's at least what I intended to say, and I'm a little curious as what
> > it came across.
>
> Ok, I was reading the discussion more as being about alignment than a
> single HW SGL; I think you meant alignment as implying coalescing
> behavior, implying a single HW SGL.

Yes.

2024-03-07 21:01:38

by Jason Gunthorpe

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Thu, Mar 07, 2024 at 04:05:05PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> > >
> > > I don't think you can do without dma_addr_t storage. In most cases
> > > you can just store the dma_addr_t in the LE/BE encoded hardware
> > > SGL, so no extra storage should be needed though.
> >
> > RDMA (and often DRM too) generally doesn't work like that, the driver
> > copies the page table into the device and then the only reason to have
> > a dma_addr_t storage is to pass that to the dma unmap API. Optionally
> > eliminating long term dma_addr_t storage would be a worthwhile memory
> > savings for large long lived user space memory registrations.
>
> It's just kinda hard to do. For aligned IOMMU mapping you'd only
> have one dma_addr_t mapping (or maybe a few if P2P regions are
> involved), so this probably doesn't matter. For direct mappings
> you'd have a few, but maybe the better answer is to use THP
> more aggressively and reduce the number of segments.

Right, those things have all been done. 100GB of huge pages is still
using a fair amount of memory for storing dma_addr_t's.

It is hard to do perfectly, but I think it is not so bad if we focus
on the direct only case and simple systems that can exclude swiotlb
early on.

> > > > So are you thinking something more like a driver flow of:
> > > >
> > > > .. extent IO and get # aligned pages and know if there is P2P ..
> > > > dma_init_io(state, num_pages, p2p_flag)
> > > > if (dma_io_single_range(state)) {
> > > > // #2, #4
> > > > for each io()
> > > > dma_link_aligned_pages(state, io range)
> > > > hw_sgl = (state->iova, state->len)
> > > > } else {
> > >
> > > I think what you have as dma_io_single_range should come before
> > > the dma_init_io. If we know we can't coalesce, it really is just a
> > > dma_map_{single,page,bvec} loop, no need for any extra state.
> >
> > I imagine dma_io_single_range() to just check a flag in state.
> >
> > I still want to call dma_init_io() for the non-coalescing cases
> > because all the flows, regardless of composition, should be about as
> > fast as dma_map_sg is today.
>
> If all flows include multiple non-coalesced regions, that just makes
> things very complicated, and that's exactly what I'd want to avoid.

I don't see how to avoid it unless we say RDMA shouldn't use this API,
which is kind of the whole point from my perspective..

I want an API that can handle all the same complexity as dma_map_sg()
without forcing the use of scatterlist. Instead "bring your own
datastructure". This is the essence of what we discussed.

An API that is inferior to dma_map_sg() is really problematic to use
with RDMA.

> > That means we need to always pre-allocate the IOVA in any case where
> > the IOMMU might be active - even on a non-coalescing flow.
> >
> > IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> > be used and we can't just call today's dma_map_page() in a loop on the
> > non-coalescing side and pay the overhead of Nx IOVA allocations.
> >
> > In large part this is for RDMA, where a single P2P page in a large
> > multi-gigabyte user memory registration shouldn't drastically harm the
> > registration performance by falling down to doing dma_map_page, and an
> > IOVA allocation, on a 4k page by page basis.
>
> But that P2P page needs to be handled very differently, as with it
> we can't actually use a single iova range. So I'm not sure how that
> is even supposed to work. If you have
>
> +-------+-----+-------+
> | local | P2P | local |
> +-------+-----+-------+
>
> you need at least 3 HW SGL entries, as the IOVA won't be contiguous.

Sure, 3 SGL entries is fine, that isn't what I'm pointing at

I'm saying that today if you give such a scatterlist to dma_map_sg()
it scans it and computes the IOVA space needed, allocates one IOVA
space, then subdivides that single space up into the 3 HW SGLs you
show.

If you don't preserve that then we are calling, 4k at a time, a
dma_map_page() which is not anywhere close to the same outcome as what
dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
and we call into the IOVA allocator a huge number of times.

It needs to work following the same basic structure of dma_map_sg,
unfolding that logic into helpers so that the driver can provide
the data structure:

- Scan the io ranges and figure out how much IOVA needed
(dma_io_summarize_range)
- Allocate the IOVA (dma_init_io)
- Scan the io ranges again to generate the final HW SGL
(dma_io_link_page)
- Finish the iommu batch (dma_io_done_mapping)

And you can make that pattern work for all the other cases too.

So I don't see this as particularly worse, calling some other API
instead of dma_map_page is not really a complexity on the
driver. Calling dma_init_io every time is also not a complexity. The
DMA API side is a bit more, but not substantively different logic from
what dma_map_sg already does.

Otherwise what is the alternative? How do I keep these complex things
working in RDMA and remove scatterlist?

> > The other thing that got hand waved here is how does dma_init_io()
> > know which of the 6 states we are looking at? I imagine we probably
> > want to do something like:
> >
> > struct dma_io_summarize summary = {};
> > for each io()
> > dma_io_summarize_range(&summary, io range)
> > dma_init_io(dev, &state, &summary);
> > if (state->single_range) {
> > } else {
> > }
> > dma_io_done_mapping(&state); <-- flush IOTLB once
>
> That's why I really just want 2 cases. If the caller guarantees the
> range is coalescable and there is an IOMMU use the iommu-API like
> API, else just iter over map_single/page.

But how does the caller even know if it is coalescable? Other than the
trivial case of a single CPU range, that is a complicated detail based
on what pages are inside the range combined with the capability of the
device doing DMA. I don't see a simple way for the caller to figure
this out. You need to sweep every page and collect some information on
it. The above is to abstract that detail.

It was simpler before the confidential compute stuff :(

Jason

2024-03-08 16:49:56

by Christoph Hellwig

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Thu, Mar 07, 2024 at 05:01:16PM -0400, Jason Gunthorpe wrote:
> >
> > It's just kinda hard to do. For aligned IOMMU mapping you'd only
> > have one dma_addr_t mapping (or maybe a few if P2P regions are
> > involved), so this probably doesn't matter. For direct mappings
> > you'd have a few, but maybe the better answer is to use THP
> > more aggressively and reduce the number of segments.
>
> Right, those things have all been done. 100GB of huge pages is still
> using a fair amount of memory for storing dma_addr_t's.
>
> It is hard to do perfectly, but I think it is not so bad if we focus
> on the direct only case and simple systems that can exclude swiotlb
> early on.

Even with direct mappings only we still need to take care of
cache synchronization.

> > If all flows include multiple non-coalesced regions, that just makes
> > things very complicated, and that's exactly what I'd want to avoid.
>
> I don't see how to avoid it unless we say RDMA shouldn't use this API,
> which is kind of the whole point from my perspective..

The DMA API callers really need to know what is P2P or not for
various reasons. And they should generally have that information
available, either from pin_user_pages that needs to special case
it or from the in-kernel I/O submitter that build it from P2P and
normal memory.

> Sure, 3 SGL entries is fine, that isn't what I'm pointing at
>
> I'm saying that today if you give such a scatterlist to dma_map_sg()
> it scans it and computes the IOVA space needed, allocates one IOVA
> space, then subdivides that single space up into the 3 HW SGLs you
> show.
>
> If you don't preserve that then we are calling, 4k at a time, a
> dma_map_page() which is not anywhere close to the same outcome as what
> dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
> and we call into the IOVA allocator a huge number of times.

Again, your callers must know what is a P2P region and what is not.
I don't think it is a hard burden to do mappings at that granularity,
and we can encapsulate this in nice helpers for, say, the block layer
and pin_user_pages callers to start.

>
> It needs to work following the same basic structure of dma_map_sg,
> unfolding that logic into helpers so that the driver can provide
> the data structure:
>
> - Scan the io ranges and figure out how much IOVA needed
> (dma_io_summarize_range)

That is in general a function of the upper layer and not the DMA code.

> - Allocate the IOVA (dma_init_io)

And this step is only needed for the iommu case.

> > That's why I really just want 2 cases. If the caller guarantees the
> > range is coalescable and there is an IOMMU use the iommu-API like
> > API, else just iter over map_single/page.
>
> But how does the caller even know if it is coalescable? Other than the
> trivial case of a single CPU range, that is a complicated detail based
> on what pages are inside the range combined with the capability of the
> device doing DMA. I don't see a simple way for the caller to figure
> this out. You need to sweep every page and collect some information on
> it. The above is to abstract that detail.

dma_get_merge_boundary already provides this information in terms
of the device capabilities. And given that the caller knows what
is P2P and what is not, we have all the information that is needed.


2024-03-08 20:24:53

by Jason Gunthorpe

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Fri, Mar 08, 2024 at 05:49:20PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 07, 2024 at 05:01:16PM -0400, Jason Gunthorpe wrote:
> > >
> > > It's just kinda hard to do. For aligned IOMMU mapping you'd only
> > > have one dma_addr_t mapping (or maybe a few if P2P regions are
> > > involved), so this probably doesn't matter. For direct mappings
> > > you'd have a few, but maybe the better answer is to use THP
> > > more aggressively and reduce the number of segments.
> >
> > Right, those things have all been done. 100GB of huge pages is still
> > using a fair amount of memory for storing dma_addr_t's.
> >
> > It is hard to do perfectly, but I think it is not so bad if we focus
> > on the direct only case and simple systems that can exclude swiotlb
> > early on.
>
> Even with direct mappings only we still need to take care of
> cache synchronization.

Yes, we still have to unmap, but the unmap for cache synchronization
doesn't need the dma_addr_t to flush the CPU cache.

> > > If all flows include multiple non-coalesced regions, that just makes
> > > things very complicated, and that's exactly what I'd want to avoid.
> >
> > I don't see how to avoid it unless we say RDMA shouldn't use this API,
> > which is kind of the whole point from my perspective..
>
> The DMA API callers really need to know what is P2P or not for
> various reasons. And they should generally have that information
> available, either from pin_user_pages that needs to special case
> it or from the in-kernel I/O submitter that build it from P2P and
> normal memory.

I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
shoves the resulting page list into a scattertable. It never checks
if any returned page is P2P - it has no reason to care. dma_map_sg()
does all the work.

That is the kind of abstraction I am coming to this problem with.

You are looking at BIO where you already needed to split things up for
other reasons, but I think that is a uniquely block thing that will
not be shared in other subsystems.

> > If you don't preserve that then we are calling, 4k at a time, a
> > dma_map_page() which is not anywhere close to the same outcome as what
> > dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
> > and we call into the IOVA allocator a huge number of times.
>
> Again, your callers must know what is a P2P region and what is not.

I don't see this at all. We don't do this today in RDMA. There is no
"P2P region".

> > > That's why I really just want 2 cases. If the caller guarantees the
> > > range is coalescable and there is an IOMMU use the iommu-API like
> > > API, else just iter over map_single/page.
> >
> > But how does the caller even know if it is coalescable? Other than the
> > trivial case of a single CPU range, that is a complicated detail based
> > on what pages are inside the range combined with the capability of the
> > device doing DMA. I don't see a simple way for the caller to figure
> > this out. You need to sweep every page and collect some information on
> > it. The above is to abstract that detail.
>
> dma_get_merge_boundary already provides this information in terms
> of the device capabilities. And given that the caller knows what
> is P2P and what is not, we have all the information that is needed.

Encrypted memory too.

RDMA also doesn't call dma_get_merge_boundary(). It doesn't keep track
of P2P regions. It doesn't break out encrypted memory. It has no
purpose to do any of those things.

You fundamentally cannot subdivide a memory registration.

So we could artificially introduce the concept of limited coalescing
into RDMA, dmabuf and others just to drive this new API - but really
that feels much much worse than just making the DMA API still able to
do IOMMU coalescing in more cases.

Even if we did that, it will still be less efficient than today where
we just call dma_map_sg() on the jumble of pages.

Jason

2024-03-09 16:14:46

by Christoph Hellwig

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > The DMA API callers really need to know what is P2P or not for
> > various reasons. And they should generally have that information
> > available, either from pin_user_pages that needs to special case
> > it or from the in-kernel I/O submitter that build it from P2P and
> > normal memory.
>
> I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> shoves the resulting page list into a scattertable. It never checks
> if any returned page is P2P - it has no reason to care. dma_map_sg()
> does all the work.

Right now it does, but that's not really a good interface. If we have
a pin_user_pages variant that only pins until the next relevant P2P
boundary and tells you about it, we can significantly simplify the overall
interface.

2024-03-10 09:35:33

by Leon Romanovsky

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > The DMA API callers really need to know what is P2P or not for
> > > various reasons. And they should generally have that information
> > > available, either from pin_user_pages that needs to special case
> > > it or from the in-kernel I/O submitter that build it from P2P and
> > > normal memory.
> >
> > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > shoves the resulting page list into a scattertable. It never checks
> > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > does all the work.
>
> Right now it does, but that's not really a good interface. If we have
> a pin_user_pages variant that only pins until the next relevant P2P
> boundary and tells you about it, we can significantly simplify the overall
> interface.

And you will need to have a way to instruct that pin_user_pages() variant
to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
force, you will have !FOLL_PCI_P2PDMA behaviour.

When you say "simplify the overall interface", which interface do you mean?

Thanks

2024-03-12 21:29:24

by Christoph Hellwig

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote:
> And you will need to have a way to instruct that pin_user_pages() variant
> to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
> force, you will have !FOLL_PCI_P2PDMA behaviour.

I don't understand what you mean.

> When you say "simplify the overall interface", which interface do you mean?

Primarily the dma mapping interface. Secondarily also everything around
it.

2024-03-13 07:46:58

by Leon Romanovsky

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 12, 2024 at 10:28:44PM +0100, Christoph Hellwig wrote:
> On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote:
> > And you will need to have a way to instruct that pin_user_pages() variant
> > to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
> > force, you will have !FOLL_PCI_P2PDMA behaviour.
>
> I don't understand what you mean.

Jason talked about the need to call pin_user_pages(..., gup_flags | FOLL_PCI_P2PDMA, ...),
but in your proposal this call won't be possible anymore.

>
> > When you say "simplify the overall interface", which interface do you mean?
>
> Primarily the dma mapping interface. Secondarily also everything around
> it.

OK, thanks

2024-03-13 21:44:46

by Christoph Hellwig

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 13, 2024 at 09:46:36AM +0200, Leon Romanovsky wrote:
> On Tue, Mar 12, 2024 at 10:28:44PM +0100, Christoph Hellwig wrote:
> > On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote:
> > > And you will need to have a way to instruct that pin_user_pages() variant
> > > to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
> > > force, you will have !FOLL_PCI_P2PDMA behaviour.
> >
> > I don't understand what you mean.
>
> Jason talked about the need to call to pin_user_pages(..., gup_flags | FOLL_PCI_P2PDMA, ...),
> but in your proposal this call won't be possible anymore.

Why?


2024-03-19 17:54:53

by Jason Gunthorpe

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > The DMA API callers really need to know what is P2P or not for
> > > various reasons. And they should generally have that information
> > > available, either from pin_user_pages that needs to special case
> > > it or from the in-kernel I/O submitter that build it from P2P and
> > > normal memory.
> >
> > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > shoves the resulting page list into a scattertable. It never checks
> > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > does all the work.
>
> Right now it does, but that's not really a good interface. If we have
> a pin_user_pages variant that only pins until the next relevant P2P
> boundary and tells you about it, we can significantly simplify the overall
> interface.

Sorry for the delay, I was away..

I kind of understand your thinking on the DMA side, but I don't see
how this is good for users of the API beyond BIO.

How will this make RDMA better? We have one MR, the MR has pages, the
HW doesn't care about the SW distinction of p2p, swiotlb, direct,
encrypted, iommu, etc. It needs to create one HW page list for
whatever user VA range was given.

Or worse, whatever thing is inside a DMABUF from a DRM
driver. DMABUFs can have a (dynamic!) mixture of P2P and regular
memory, AFAIK, based on the GPU's migration behavior.

Or triple worse, ODP can dynamically change the type on a page-by-page
basis depending on what hmm_range_fault() returns.

So I take it as a requirement that RDMA MUST make single MR's out of a
hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
not a functional replacement for a single MR.

Go back to the start of what are we trying to do here:
1) Make a DMA API that can support hmm_range_fault() users in a
sensible and performant way
2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
user VA's without restriction
3) Allow to remove scatterlist from BIO paths
4) Provide a DMABUF API that is not scatterlist that can feed into
the new DMA API - again supporting DMABUF's hodgepodge of types.

I'd like to do all of these things. I know 3 is your highest priority,
but it is my lowest :)

So, if the new API can only do uniformity, I immediately lose #1 -
hmm_range_fault() can't guarantee anything, so it loses the IOVA
optimization that Leon's patches illustrate.

For uniformity, #2 probably needs multiple DMA API "transactions". This
is doable, but it is less performant than one "transaction".

#3 is perfectly happy because BIO already creates uniformity

#4 is like #2: there is no guaranteed uniformity inside DMABUF, so
every DMABUF importer needs to take on some complexity to deal with
it. There are many DMABUF importers so this feels like a poor API
abstraction if we force everyone there to take on complexity.

So I'm just not seeing why this would be better. I think Leon's series
shows the cost of non-uniformity support is actually pretty
small. Still, we could do better: if the caller can optionally
indicate it knows it has uniformity, then that can be optimized further.

I'd like to find something that works well for all of the above, and I
think abstracting non-uniformity at the API level is important for the
above reasons.

Can we tweak what Leon has done to keep the hmm_range_fault support
and non-uniformity for RDMA but add a uniformity optimized flow for
BIO?

Jason

2024-03-20 08:59:32

by Leon Romanovsky

Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> > On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > > The DMA API callers really need to know what is P2P or not for
> > > > various reasons. And they should generally have that information
> > > > available, either from pin_user_pages that needs to special case
> > > > it or from the in-kernel I/O submitter that build it from P2P and
> > > > normal memory.
> > >
> > > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > > shoves the resulting page list into a scattertable. It never checks
> > > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > > does all the work.
> >
> > Right now it does, but that's not really a good interface. If we have
> > a pin_user_pages variant that only pins until the next relevant P2P
> > boundary and tells you about it, we can significantly simplify the overall
> > interface.
>
> Sorry for the delay, I was away..

<...>

> Can we tweak what Leon has done to keep the hmm_range_fault support
> and non-uniformity for RDMA but add a uniformity optimized flow for
> BIO?

Something like this will do the trick.

From 45e739e7073fb04bc168624f77320130bb3f9267 Mon Sep 17 00:00:00 2001
Message-ID: <45e739e7073fb04bc168624f77320130bb3f9267.1710924764.git.leonro@nvidia.com>
From: Leon Romanovsky <[email protected]>
Date: Mon, 18 Mar 2024 11:16:41 +0200
Subject: [PATCH] mm/gup: add strict interface to pin user pages according to
FOLL flag

All pin_user_pages*() and get_user_pages*() callbacks pin user pages
while only partially taking their p2p vs. non-p2p properties into account.

If the user sets the FOLL_PCI_P2PDMA flag, the pinned pages may include
both p2p and "regular" pages, while if FOLL_PCI_P2PDMA is not provided,
only regular pages are returned.

To make sure that with the FOLL_PCI_P2PDMA flag only p2p pages are
returned, introduce a new internal FOLL_STRICT flag and provide a
special pin_user_pages_fast_strict() API call.

Signed-off-by: Leon Romanovsky <[email protected]>
---
include/linux/mm.h | 3 +++
mm/gup.c | 36 +++++++++++++++++++++++++++++++++++-
mm/internal.h | 4 +++-
3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..910b65dde24a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2491,6 +2491,9 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
void folio_add_pin(struct folio *folio);

+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+ unsigned int gup_flags, struct page **pages);
+
int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
struct task_struct *task, bool bypass_rlim);
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..11b5c626a4ab 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -133,6 +133,10 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
return NULL;

+ if (flags & FOLL_STRICT)
+ if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+ return NULL;
+
if (flags & FOLL_GET)
return try_get_folio(page, refs);

@@ -232,6 +236,10 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
return -EREMOTEIO;

+ if (flags & FOLL_STRICT)
+ if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+ return -EREMOTEIO;
+
if (flags & FOLL_GET)
folio_ref_inc(folio);
else if (flags & FOLL_PIN) {
@@ -2243,6 +2251,8 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
* - FOLL_TOUCH/FOLL_PIN/FOLL_TRIED/FOLL_FAST_ONLY are internal only
* - FOLL_REMOTE is internal only and used on follow_page()
* - FOLL_UNLOCKABLE is internal only and used if locked is !NULL
+ * - FOLL_STRICT is internal only and used to distinguish between p2p
+ * and "regular" pages.
*/
if (WARN_ON_ONCE(gup_flags & INTERNAL_GUP_FLAGS))
return false;
@@ -3187,7 +3197,8 @@ static int internal_get_user_pages_fast(unsigned long start,
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
FOLL_FORCE | FOLL_PIN | FOLL_GET |
FOLL_FAST_ONLY | FOLL_NOFAULT |
- FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT)))
+ FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT |
+ FOLL_STRICT)))
return -EINVAL;

if (gup_flags & FOLL_PIN)
@@ -3322,6 +3333,29 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
}
EXPORT_SYMBOL_GPL(pin_user_pages_fast);

+/**
+ * pin_user_pages_fast_strict() - pin_user_pages_fast() variant that makes
+ * sure only pages with the same properties are pinned.
+ *
+ * @start: starting user address
+ * @nr_pages: number of pages from start to pin
+ * @gup_flags: flags modifying pin behaviour
+ * @pages: array that receives pointers to the pages pinned.
+ * Should be at least nr_pages long.
+ *
+ * Nearly the same as pin_user_pages_fast(), except that FOLL_STRICT is set.
+ *
+ * FOLL_STRICT means that all pinned pages must share the same FOLL_* properties.
+ */
+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+ unsigned int gup_flags, struct page **pages)
+{
+ if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN | FOLL_STRICT))
+ return -EINVAL;
+ return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast_strict);
+
/**
* pin_user_pages_remote() - pin pages of a remote process
*
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..7578837a0444 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1031,10 +1031,12 @@ enum {
FOLL_FAST_ONLY = 1 << 20,
/* allow unlocking the mmap lock */
FOLL_UNLOCKABLE = 1 << 21,
+ /* don't mix pages with different properties, e.g. p2p with "regular" ones */
+ FOLL_STRICT = 1 << 22,
};

#define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \
- FOLL_FAST_ONLY | FOLL_UNLOCKABLE)
+ FOLL_FAST_ONLY | FOLL_UNLOCKABLE | FOLL_STRICT)

/*
* Indicates for which pages that are write-protected in the page table,
--
2.44.0

2024-03-21 22:39:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> I kind of understand your thinking on the DMA side, but I don't see
> how this is good for users of the API beyond BIO.
>
> How will this make RDMA better? We have one MR, the MR has pages, the
> HW doesn't care about the SW distinction of p2p, swiotlb, direct,
> encrypted, iommu, etc. It needs to create one HW page list for
> whatever user VA range was given.

Well, the hardware (as in the PCIe card) never cares. But the setup
path for the IOMMU does, and something in the OS needs to know about
it. So unless we want to stash away a 'is this P2P' flag in every
page / SG entry / bvec, or do a lookup to find that out for each
of them, we need to manage chunks at these boundaries. And that's
what I'm proposing.

> Or worse, whatever thing is inside a DMABUF from a DRM
> driver. DMABUF's can have a (dynamic!) mixture of P2P and regular
> AFAIK based on the GPU's migration behavior.

And that's fine. We just need to track it efficiently.

>
> Or triple worse, ODP can dynamically change on a page by page basis
> the type depending on what hmm_range_fault() returns.

Same. If this changes all the time you need to track it. And we
should find a way to share the code if we have multiple users for it.

But most DMA API consumers will never see P2P, and when they see it
it will be static. So don't build the DMA API to automatically do
the (not exactly super cheap) checks and add complexity for it.

> So I take it as a requirement that RDMA MUST make single MR's out of a
> hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> not a functional replacement for a single MR.

But MRs consolidate multiple dma addresses anyway.

> Go back to the start of what are we trying to do here:
> 1) Make a DMA API that can support hmm_range_fault() users in a
> sensible and performant way
> 2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
> user VA's without restriction
> 3) Allow to remove scatterlist from BIO paths
> 4) Provide a DMABUF API that is not scatterlist that can feed into
> the new DMA API - again supporting DMABUF's hodgepodge of types.
>
> I'd like to do all of these things. I know 3 is your highest priority,
> but it is my lowest :)

Well, 3 and 4. And 3 is not just limited to bio, but all the other
pointless scatterlist uses.


2024-03-21 22:43:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 20, 2024 at 10:55:36AM +0200, Leon Romanovsky wrote:
> Something like this will do the trick.

As far as I can tell it totally misses the point. Which is not to never
return non-P2P if the flag is set, but to return either all P2P or
non-P2P and not create a boundary in the single call.


2024-03-22 17:46:33

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Thu, Mar 21, 2024 at 11:40:13PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 20, 2024 at 10:55:36AM +0200, Leon Romanovsky wrote:
> > Something like this will do the trick.
>
> As far as I can tell it totally misses the point. Which is not to never
> return non-P2P if the flag is set, but to return either all P2P or
> non-P2P and not create a boundary in the single call.

You are treating FOLL_PCI_P2PDMA as a hint, but in iov_iter_extract_user_pages()
you set it only for p2p queues. I was under the impression that you want only
p2p pages in these queues.

Anyway, I can prepare another patch that will return either p2p or non-p2p pages in one shot.

Thanks

2024-03-22 18:44:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Thu, Mar 21, 2024 at 11:39:10PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> > I kind of understand your thinking on the DMA side, but I don't see
> > how this is good for users of the API beyond BIO.
> >
> > How will this make RDMA better? We have one MR, the MR has pages, the
> > HW doesn't care about the SW distinction of p2p, swiotlb, direct,
> > encrypted, iommu, etc. It needs to create one HW page list for
> > whatever user VA range was given.
>
> Well, the hardware (as in the PCIe card) never cares. But the setup
> path for the IOMMU does, and something in the OS needs to know about
> it. So unless we want to stash away a 'is this P2P' flag in every
> page / SG entry / bvec, or do a lookup to find that out for each
> of them, we need to manage chunks at these boundaries. And that's
> what I'm proposing.

Okay, if we look at the struct-page-less world (which we want for
DMABUF) then we need to keep track for sure. What I had drafted was to
keep track in the new "per-SG entry" because that seemed easiest to
migrate existing code into.

Though the datastructure could also be written to be a list of uniform
memory types and then a list of SG entries. (more like how bio is
organized)

No idea right now which is better, and I'm happy make it go either
way.

But Leon's series is not quite getting to this, it is still struct
page based and struct page itself has all the metadata - though as you
say it is a bit expensive to access.

> > Or worse, whatever thing is inside a DMABUF from a DRM
> > driver. DMABUF's can have a (dynamic!) mixture of P2P and regular
> > AFAIK based on the GPU's migration behavior.
>
> And that's fine. We just need to track it efficiently.

Right, DMABUF/etc will return a something that has a list of physical
addresses and some meta-data to indicate the "p2p memory provider" for
the P2P part.

Perhaps it could be as simple as 1 bit in the physical address/length
and a global "P2P memory provider" pointer for the entire DMA
BUF. Unclear to me right now, but sure.

> > Or triple worse, ODP can dynamically change on a page by page basis
> > the type depending on what hmm_range_fault() returns.
>
> Same. If this changes all the time you need to track it. And we
> should find a way to share the code if we have multiple users for it.

ODP (for at least the foreseeable future) is simpler because it is
always struct page based so we don't need more metadata if we pay the
cost to reach into the struct page. I suspect that is the right trade
off for hmm_range_fault users.

> But most DMA API consumers will never see P2P, and when they see it
> it will be static. So don't build the DMA API to automatically do
> the (not exactly super cheap) checks and add complexity for it.

Okay, I think I get what you'd like to see.

If we are going to make caller provided uniformity a requirement, lets
imagine a formal memory type idea to help keep this a little
abstracted?

DMA_MEMORY_TYPE_NORMAL
DMA_MEMORY_TYPE_P2P_NOT_ACS
DMA_MEMORY_TYPE_ENCRYPTED
DMA_MEMORY_TYPE_BOUNCE_BUFFER // ??

Then maybe the driver flow looks like:

if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {
struct dma_api_iommu_state state;

dma_api_iommu_start(&state, transaction.num_pages);
for_each_range(transaction, range)
dma_api_iommu_map_range(&state, range.start_page, range.length);
num_hwsgls = 1;
hwsgl.addr = state.iova;
hwsgl.length = transaction.length
dma_api_iommu_batch_done(&state);
} else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
num_hwsgls = transaction.num_sgls;
for_each_range(transaction, range) {
hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider);
hwsgl[i].len = range.size;
}
} else {
/* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
num_hwsgls = transaction.num_sgls;
for_each_range(transaction, range) {
hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
hwsgl[i].len = range.size;
}
}

And the hmm_range_fault case is sort of like:

struct dma_api_iommu_state state;
dma_api_iommu_start(&state, mr.num_pages);

[..]
hmm_range_fault(...)
if (present)
dma_link_page(&state, faulting_address_offset, page);
else
dma_unlink_page(&state, faulting_address_offset, page);

Is this looking closer?

> > So I take it as a requirement that RDMA MUST make single MR's out of a
> > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > not a functional replacement for a single MR.
>
> But MRs consolidate multiple dma addresses anyway.

I'm not sure I understand this?

> > Go back to the start of what are we trying to do here:
> > 1) Make a DMA API that can support hmm_range_fault() users in a
> > sensible and performant way
> > 2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
> > user VA's without restriction
> > 3) Allow to remove scatterlist from BIO paths
> > 4) Provide a DMABUF API that is not scatterlist that can feed into
> > the new DMA API - again supporting DMABUF's hodgepodge of types.
> >
> > I'd like to do all of these things. I know 3 is your highest priority,
> > but it is my lowest :)
>
> Well, 3 and 4. And 3 is not just limited to bio, but all the other
> pointless scatterlist uses.

Well, I didn't write a '5) remove all the other pointless scatterlist
cases' :)

Anyhow, I think we all agree on the high level objective, we just need
to get to an API that fuses all of these goals together.

To go back to my main thesis - I would like a high performance low
level DMA API that is capable enough that it could implement
scatterlist dma_map_sg() and thus also implement any future
scatterlist_v2, bio, hmm_range_fault or any other thing we come up
with on top of it. This is broadly what I thought we agreed to at LSF
last year.

Jason

2024-03-25 04:05:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Fri, Mar 22, 2024 at 07:46:17PM +0200, Leon Romanovsky wrote:
> > As far as I can tell it totally misses the point. Which is not to never
> > return non-P2P if the flag is set, but to return either all P2P or
> > non-P2P and not create a boundary in the single call.
>
> You are treating FOLL_PCI_P2PDMA as a hint, but in iov_iter_extract_user_pages()
> you set it only for p2p queues. I was under the impression that you want only p2p pages
> in these queues.

FOLL_PCI_P2PDMA is an indicator that the caller can cope with P2P
pages. Most callers simply can't.


2024-03-25 08:47:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Fri, Mar 22, 2024 at 03:43:30PM -0300, Jason Gunthorpe wrote:
> If we are going to make caller provided uniformity a requirement, lets
> imagine a formal memory type idea to help keep this a little
> abstracted?
>
> DMA_MEMORY_TYPE_NORMAL
> DMA_MEMORY_TYPE_P2P_NOT_ACS
> DMA_MEMORY_TYPE_ENCRYPTED
> DMA_MEMORY_TYPE_BOUNCE_BUFFER // ??
>
> Then maybe the driver flow looks like:
>
> if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {

Add a nice helper to make this somewhat readable, but yes.

> } else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
> num_hwsgls = transaction.num_sgls;
> for_each_range(transaction, range) {
> hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider);
> hwsgl[i].len = range.size;
> }
> } else {
> /* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
> num_hwsgls = transaction.num_sgls;
> for_each_range(transaction, range) {
> hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
> hwsgl[i].len = range.size;
> }
>

And these two are really the same except that we call a different map
helper underneath. So I think as far as the driver is concerned
they should be the same, the DMA API just needs to key off the
memory type.

> And the hmm_range_fault case is sort of like:
>
> struct dma_api_iommu_state state;
> dma_api_iommu_start(&state, mr.num_pages);
>
> [..]
> hmm_range_fault(...)
> if (present)
> dma_link_page(&state, faulting_address_offset, page);
> else
> dma_unlink_page(&state, faulting_address_offset, page);
>
> Is this looking closer?

Yes.

> > > So I take it as a requirement that RDMA MUST make single MR's out of a
> > > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > > not a functional replacement for a single MR.
> >
> > But MRs consolidate multiple dma addresses anyway.
>
> I'm not sure I understand this?

The RDMA MRs take a list of PFNish addresses (or SGLs with the
enhanced MRs from Mellanox) and give you back a single rkey/lkey.

> To go back to my main thesis - I would like a high performance low
> level DMA API that is capable enough that it could implement
> scatterlist dma_map_sg() and thus also implement any future
> scatterlist_v2, bio, hmm_range_fault or any other thing we come up
> with on top of it. This is broadly what I thought we agreed to at LSF
> last year.

I think the biggest underlying problem of the scatterlist based
DMA implementation for IOMMUs is that it's trying to handle too much,
that is magic coalescing even if the segment boundaries don't align
with the IOMMU page size. If we can get rid of that misfeature I
think we'd greatly simplify the API and implementation.

2024-03-27 17:37:07

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Mon, Mar 25, 2024 at 12:22:15AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 22, 2024 at 03:43:30PM -0300, Jason Gunthorpe wrote:
> > If we are going to make caller provided uniformity a requirement, lets
> > imagine a formal memory type idea to help keep this a little
> > abstracted?
> >
> > DMA_MEMORY_TYPE_NORMAL
> > DMA_MEMORY_TYPE_P2P_NOT_ACS
> > DMA_MEMORY_TYPE_ENCRYPTED
> > DMA_MEMORY_TYPE_BOUNCE_BUFFER // ??
> >
> > Then maybe the driver flow looks like:
> >
> > if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {
>
> Add a nice helper to make this somewhat readable, but yes.
>
> > } else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
> > num_hwsgls = transaction.num_sgls;
> > for_each_range(transaction, range) {
> > hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider);
> > hwsgl[i].len = range.size;
> > }
> > } else {
> > /* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
> > num_hwsgls = transaction.num_sgls;
> > for_each_range(transaction, range) {
> > hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
> > hwsgl[i].len = range.size;
> > }
> >
>
> And these two are really the same except that we call a different map
> helper underneath. So I think as far as the driver is concerned
> they should be the same, the DMA API just needs to key off the
> memory type.

Yeah.. If the caller is going to have to compute the memory type of the
range then let's pass it to the helper

dma_api_map_memory_type(transaction.memory_type, range.start_page, range.length);

Then we can just hide all the differences under the API without doing
duplicated work.

Function names need some work ...

> > > > So I take it as a requirement that RDMA MUST make single MR's out of a
> > > > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > > > not a functional replacement for a single MR.
> > >
> > > But MRs consolidate multiple dma addresses anyway.
> >
> > I'm not sure I understand this?
>
> The RDMA MRs take a list of PFNish addresses (or SGLs with the
> enhanced MRs from Mellanox) and give you back a single rkey/lkey.

Yes, that is the desire.

> > To go back to my main thesis - I would like a high performance low
> > level DMA API that is capable enough that it could implement
> > scatterlist dma_map_sg() and thus also implement any future
> > scatterlist_v2, bio, hmm_range_fault or any other thing we come up
> > with on top of it. This is broadly what I thought we agreed to at LSF
> > last year.
>
> I think the biggest underlying problem of the scatterlist based
> DMA implementation for IOMMUs is that it's trying to handle too much,
> that is magic coalescing even if the segment boundaries don't align
> with the IOMMU page size. If we can get rid of that misfeature I
> think we'd greatly simplify the API and implementation.

Yeah, that stuff is not easy at all and takes extra computation to
figure out. I always assumed it was there for block...

Leon & Chaitanya will make a RFC v2 along these lines, lets see how it
goes.

Thanks,
Jason

2024-04-09 20:39:47

by Zhu Yanjun

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps



On 2024/3/7 7:01, Zhu Yanjun wrote:
> On 2024/3/5 12:18, Leon Romanovsky wrote:
>> This is complimentary part to the proposed LSF/MM topic.
>> https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
>
> I am interested in this topic. Hope I can join the meeting to discuss
> this topic.
>

With the same idea, the dma_alloc_coherent call in the IDPF driver can
be divided into the following two functions:

iommu_dma_alloc_pages

and

iommu_dma_map_page

So the function iommu_dma_alloc_pages allocates pages, and
iommu_dma_map_page creates the mapping between pages and IOVA.

The above idea is now implemented in the NIC driver, and it currently
works well.

Next, the above idea will be implemented for block devices. Hopefully
this will increase block device performance.

Best Regards,
Zhu Yanjun

> Zhu Yanjun
>
>>
>> This is posted as RFC to get a feedback on proposed split, but RDMA,
>> VFIO and
>> DMA patches are ready for review and inclusion, the NVMe patches are
>> still in
>> progress as they require agreement on API first.
>>
>> Thanks
>>
>> -------------------------------------------------------------------------------
>> The DMA mapping operation performs two steps at the same time: allocates
>> IOVA space and actually maps DMA pages to that space. This one shot
>> operation works perfectly for non-complex scenarios, where callers use
>> that DMA API in control path when they setup hardware.
>>
>> However in more complex scenarios, when DMA mapping is needed in data
>> path and especially when some sort of specific datatype is involved,
>> such one shot approach has its drawbacks.
>>
>> That approach pushes developers to introduce new DMA APIs for specific
>> datatype. For example existing scatter-gather mapping functions, or
>> latest Chuck's RFC series to add biovec related DMA mapping [1] and
>> probably struct folio will need it too.
>>
>> These advanced DMA mapping APIs are needed to calculate IOVA size to
>> allocate it as one chunk and some sort of offset calculations to know
>> which part of IOVA to map.
>>
>> Instead of teaching DMA to know these specific datatypes, let's separate
>> the existing DMA mapping routine into two steps and give advanced
>> callers (subsystems) the option to perform all calculations internally in
>> advance and map pages later when it is needed.
>>
>> In this series, three users are converted and each of such conversion
>> presents different positive gain:
>> 1. RDMA simplifies and speeds up its pagefault handling for
>>     on-demand-paging (ODP) mode.
>> 2. VFIO PCI live migration code saves huge chunk of memory.
>> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>>     directly on BIOs.
>>
>> Thanks
>>
>> [1]
>> https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
>>
>> Chaitanya Kulkarni (2):
>>    block: add dma_link_range() based API
>>    nvme-pci: use blk_rq_dma_map() for NVMe SGL
>>
>> Leon Romanovsky (14):
>>    mm/hmm: let users to tag specific PFNs
>>    dma-mapping: provide an interface to allocate IOVA
>>    dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>>    iommu/dma: Provide an interface to allow preallocate IOVA
>>    iommu/dma: Prepare map/unmap page functions to receive IOVA
>>    iommu/dma: Implement link/unlink page callbacks
>>    RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>>    RDMA/umem: Store ODP access mask information in PFN
>>    RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>>    RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>>    vfio/mlx5: Explicitly use number of pages instead of allocated length
>>    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>>    vfio/mlx5: Explicitly store page list
>>    vfio/mlx5: Convert vfio to use DMA link API
>>
>>   Documentation/core-api/dma-attributes.rst |   7 +
>>   block/blk-merge.c                         | 156 ++++++++++++++
>>   drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>>   drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>>   drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>>   drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>>   drivers/nvme/host/pci.c                   | 220 +++++--------------
>>   drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>>   drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>>   drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>>   include/linux/blk-mq.h                    |   9 +
>>   include/linux/dma-map-ops.h               |  13 ++
>>   include/linux/dma-mapping.h               |  39 ++++
>>   include/linux/hmm.h                       |   3 +
>>   include/rdma/ib_umem_odp.h                |  22 +-
>>   include/rdma/ib_verbs.h                   |  54 +++++
>>   kernel/dma/debug.h                        |   2 +
>>   kernel/dma/direct.h                       |   7 +-
>>   kernel/dma/mapping.c                      |  91 ++++++++
>>   mm/hmm.c                                  |  34 +--
>>   20 files changed, 870 insertions(+), 605 deletions(-)
>>
>