2023-09-25 16:47:06

by Kelly Devilliv

Subject: Re: dma_map_resource() has a bad performance in pcie peer to peer transactions when iommu enabled in Linux

> On 2023-09-25 04:59, Kelly Devilliv wrote:
> > Dear all,
> >
> > I am working on an ARM-V8 server with two gpu cards on it. Recently, I need
> > to test pcie peer to peer communication between the two gpu cards, but the
> > throughput is only 4GB/s.
> >
> > After I explored the gpu's kernel mode driver, I found it was using the
> > dma_map_resource() API to map the peer device's MMIO space. The arm
> > iommu driver then will hardcode a 'IOMMU_MMIO' prot in the later dma map:
> >
> > static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
> >                 size_t size, enum dma_data_direction dir, unsigned long attrs)
> > {
> >         return __iommu_dma_map(dev, phys, size,
> >                         dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
> >                         dma_get_mask(dev));
> > }
> >
> > And that will finally set the 'ARM_LPAE_PTE_MEMATTR_DEV' attribute in PTE,
> > which may have a negative impact on the performance of the pcie peer to peer
> > transactions.
> >
> >         /*
> >          * Note that this logic is structured to accommodate Mali LPAE
> >          * having stage-1-like attributes but stage-2-like permissions.
> >          */
> >         if (data->iop.fmt == ARM_64_LPAE_S2 ||
> >             data->iop.fmt == ARM_32_LPAE_S2) {
> >                 if (prot & IOMMU_MMIO)
> >                         pte |= ARM_LPAE_PTE_MEMATTR_DEV;
> >                 else if (prot & IOMMU_CACHE)
> >                         pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> >                 else
> >                         pte |= ARM_LPAE_PTE_MEMATTR_NC;
> >         } else {
> >                 if (prot & IOMMU_MMIO)
> >                         pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
> >                                 << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> >                 else if (prot & IOMMU_CACHE)
> >                         pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> >                                 << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> >         }
> >
> > I tried to remove the 'IOMMU_MMIO' prot in the dma_map_resource() API
> > and re-compile the linux kernel, the throughput then can be up to 28GB/s.
> >
> > Is there an elegant way to solve this issue without modifying the linux kernel?
> > e.g., a substitution of dma_map_resource() API?
>
> Not really. Other use-cases for dma_map_resource() include DMA offload
> engines accessing FIFO registers, where allowing reordering, write-gathering,
> etc. would be a terrible idea. Thus it needs to assume a "safe" MMIO memory
> type, which on Arm means Device-nGnRE.
>
> However, the "proper" PCI peer-to-peer support under CONFIG_PCI_P2PDMA
> ended up moving away from the dma_map_resource() approach anyway, and
> allows this kind of device memory to be treated more like regular memory (via
> ZONE_DEVICE) rather than arbitrary MMIO resources, so your best bet would
> be to get the GPU driver converted over to using that.

Thanks Robin.
So your suggestion is that we should work out a new implementation along the
lines of what CONFIG_PCI_P2PDMA does, instead of just using the
dma_map_resource() API?

I have looked at the GPU drivers from AMD, Nvidia and habanalabs, for example,
and found that they all use the dma_map_resource() API to map the peer device's
BAR address. If so, could this be a common performance issue in PCI
peer-to-peer scenarios?

>
> Thanks,
> Robin.
>
> >
> > Thank you!
> >
> > Platform info:
> > Linux kernel version: 5.10
> > PCIE GEN4 x16
> >
> > Sincerely,
> > Kelly
> >


2023-09-25 23:15:01

by Christian König

Subject: Re: dma_map_resource() has a bad performance in pcie peer to peer transactions when iommu enabled in Linux

On 25.09.23 at 16:17, Kelly Devilliv wrote:
>> On 2023-09-25 04:59, Kelly Devilliv wrote:
>>> Dear all,
>>>
>>> I am working on an ARM-V8 server with two gpu cards on it. Recently, I need
>>> to test pcie peer to peer communication between the two gpu cards, but the
>>> throughput is only 4GB/s.
>>> After I explored the gpu's kernel mode driver, I found it was using the
>>> dma_map_resource() API to map the peer device's MMIO space. The arm
>>> iommu driver then will hardcode a 'IOMMU_MMIO' prot in the later dma map:
>>> static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
>>>                 size_t size, enum dma_data_direction dir, unsigned long attrs)
>>> {
>>>         return __iommu_dma_map(dev, phys, size,
>>>                         dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
>>>                         dma_get_mask(dev));
>>> }
>>>
>>> And that will finally set the 'ARM_LPAE_PTE_MEMATTR_DEV' attribute in PTE,
>>> which may have a negative impact on the performance of the pcie peer to peer
>>> transactions.
>>> /*
>>> * Note that this logic is structured to accommodate Mali LPAE
>>> * having stage-1-like attributes but stage-2-like permissions.
>>> */
>>> if (data->iop.fmt == ARM_64_LPAE_S2 ||
>>> data->iop.fmt == ARM_32_LPAE_S2) {
>>> if (prot & IOMMU_MMIO)
>>> pte |= ARM_LPAE_PTE_MEMATTR_DEV;
>>> else if (prot & IOMMU_CACHE)
>>> pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
>>> else
>>> pte |= ARM_LPAE_PTE_MEMATTR_NC;
>>> } else {
>>> if (prot & IOMMU_MMIO)
>>> pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
>>> << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>> else if (prot & IOMMU_CACHE)
>>> pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>>> << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>> }
>>>
>>> I tried to remove the 'IOMMU_MMIO' prot in the dma_map_resource() API
>>> and re-compile the linux kernel, the throughput then can be up to 28GB/s.
>>> Is there an elegant way to solve this issue without modifying the linux kernel?
>>> e.g., a substitution of dma_map_resource() API?
>>
>> Not really. Other use-cases for dma_map_resource() include DMA offload
>> engines accessing FIFO registers, where allowing reordering, write-gathering,
>> etc. would be a terrible idea. Thus it needs to assume a "safe" MMIO memory
>> type, which on Arm means Device-nGnRE.
>>
>> However, the "proper" PCI peer-to-peer support under CONFIG_PCI_P2PDMA
>> ended up moving away from the dma_map_resource() approach anyway, and
>> allows this kind of device memory to be treated more like regular memory (via
>> ZONE_DEVICE) rather than arbitrary MMIO resources, so your best bet would
>> be to get the GPU driver converted over to using that.
> Thanks Robin.
> So your suggestion is we'd better work out a new implementation just as what it
> does under CONFIG_PCI_P2PDMA instead of just using the dma_map_resource()
> API?
>
> I have explored the GPU drivers from AMD, Nvidia and habanalabs, e.g., and found
> they all using the dma_map_resource() API to map the peer device's bar address.
> If so, is it possible to be a common performance issue in PCI peer-to-peer scenario?

That's not an issue, but expected behavior.

When you enable IOMMU every transaction needs to go through the root
complex for address translation and you completely lose the performance
benefit of PCIe P2P.

This is a hardware limitation and not really related to
dma_map_resource() in any way.

Regards,
Christian.

>
>> Thanks,
>> Robin.
>>
>>> Thank you!
>>>
>>> Platform info:
>>> Linux kernel version: 5.10
>>> PCIE GEN4 x16
>>>
>>> Sincerely,
>>> Kelly
>>>

2023-09-26 05:38:20

by Kelly Devilliv

Subject: Re: dma_map_resource() has a bad performance in pcie peer to peer transactions when iommu enabled in Linux

On 2023-09-26 01:58, Christian König wrote:
> On 25.09.23 at 16:17, Kelly Devilliv wrote:
>> On 2023-09-25 19:16, Robin Murphy wrote:
>>> On 2023-09-25 04:59, Kelly Devilliv wrote:
>>>> Dear all,
>>>>
>>>> I am working on an ARM-V8 server with two gpu cards on it. Recently,
>>>> I need
>>>> to test pcie peer to peer communication between the two gpu cards,
>>>> but the throughput is only 4GB/s.
>>>> After I explored the gpu's kernel mode driver, I found it was using
>>>> the dma_map_resource() API to map the peer device's MMIO space. The arm
>>>> iommu driver then will hardcode a 'IOMMU_MMIO' prot in the later dma map:
>>>> static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
>>>>                 size_t size, enum dma_data_direction dir, unsigned long attrs)
>>>> {
>>>>         return __iommu_dma_map(dev, phys, size,
>>>>                         dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
>>>>                         dma_get_mask(dev));
>>>> }
>>>>
>>>> And that will finally set the 'ARM_LPAE_PTE_MEMATTR_DEV' attribute
>>>> in PTE,
>>>> which may have a negative impact on the performance of the pcie peer
>>>> to peer transactions.
>>>> /*
>>>> * Note that this logic is structured to accommodate Mali LPAE
>>>> * having stage-1-like attributes but stage-2-like permissions.
>>>> */
>>>> if (data->iop.fmt == ARM_64_LPAE_S2 ||
>>>> data->iop.fmt == ARM_32_LPAE_S2) {
>>>> if (prot & IOMMU_MMIO)
>>>> pte |= ARM_LPAE_PTE_MEMATTR_DEV;
>>>> else if (prot & IOMMU_CACHE)
>>>> pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
>>>> else
>>>> pte |= ARM_LPAE_PTE_MEMATTR_NC;
>>>> } else {
>>>> if (prot & IOMMU_MMIO)
>>>> pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
>>>> << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>> else if (prot & IOMMU_CACHE)
>>>> pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>>>> << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>> }
>>>>
>>>> I tried to remove the 'IOMMU_MMIO' prot in the dma_map_resource()
>>>> API
>>>> and re-compile the linux kernel, the throughput then can be up to 28GB/s.
>>>> Is there an elegant way to solve this issue without modifying the linux kernel?
>>>> e.g., a substitution of dma_map_resource() API?
>>>
>>> Not really. Other use-cases for dma_map_resource() include DMA
>>> offload engines accessing FIFO registers, where allowing reordering,
>>> write-gathering, etc. would be a terrible idea. Thus it needs to
>>> assume a "safe" MMIO memory type, which on Arm means Device-nGnRE.
>>>
>>> However, the "proper" PCI peer-to-peer support under
>>> CONFIG_PCI_P2PDMA ended up moving away from the dma_map_resource()
>>> approach anyway, and allows this kind of device memory to be treated
>>> more like regular memory (via ZONE_DEVICE) rather than arbitrary MMIO
>>> resources, so your best bet would be to get the GPU driver converted
>>> over to using that.
>>
>> Thanks Robin.
>> So your suggestion is we'd better work out a new implementation just
>> as what it does under CONFIG_PCI_P2PDMA instead of just using the
>> dma_map_resource() API?
>>
>> I have explored the GPU drivers from AMD, Nvidia and habanalabs, e.g.,
>> and found they all using the dma_map_resource() API to map the peer
>> device's bar address.
>> If so, is it possible to be a common performance issue in PCI peer-to-peer
>> scenario?
>
> That's not an issue, but expected behavior.
>
> When you enable IOMMU every transaction needs to go through the root
> complex for address translation and you completely lose the performance
> benefit of PCIe P2P.

Thanks Christian. That's true.

>
> This is a hardware limitation and not really related to
> dma_map_resource() in any way.
>

But when I removed the 'IOMMU_MMIO' prot in dma_map_resource(), the performance improved significantly (from 4GB/s to 28GB/s), which is almost the same as with the IOMMU disabled. So I guess that in my common PCI topology, what really matters may not be whether the IOMMU is enabled, but rather the memory attributes set by the DMA mapping in the ARM PTE.
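
As a rough sanity check on those two figures, the sketch below computes the raw bandwidth of the PCIe Gen4 x16 link listed in the platform info and the fraction of it that a measurement represents. It assumes 16 GT/s per lane and 128b/130b encoding, and it ignores TLP/DLLP protocol overhead; the function names are made up for illustration.

```c
/*
 * Raw PCIe link bandwidth in GB/s (10^9 bytes per second).
 * Assumption: 128b/130b line encoding (Gen3 and later); packet
 * overhead is not modelled, so real throughput is always lower.
 */
double pcie_raw_gbytes_per_s(double gt_per_s, int lanes)
{
    const double encoding = 128.0 / 130.0; /* payload bits per line bit */

    return gt_per_s * lanes * encoding / 8.0; /* bits -> bytes */
}

/* Fraction of the raw link rate that a measured throughput reaches. */
double pcie_link_utilisation(double measured_gbytes_per_s,
                             double gt_per_s, int lanes)
{
    return measured_gbytes_per_s / pcie_raw_gbytes_per_s(gt_per_s, lanes);
}
```

Under these assumptions, pcie_raw_gbytes_per_s(16.0, 16) is about 31.5 GB/s, so 28GB/s is close to 89% of the raw link rate while 4GB/s is only about 13%.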

I don't know if there is a way to make the memory attributes more configurable, so that such a mapping can be distinguished from the "safe" MMIO memory type, which on Arm means Device-nGnRE, as Robin said.
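
To make the attribute point concrete, here is a small userspace model of the stage-1 branch of the PTE code quoted earlier in the thread. It is only a sketch: the IOMMU_* and ARM_LPAE_* values are copied from one reading of include/linux/iommu.h and drivers/iommu/io-pgtable-arm.c and should be treated as assumptions, the model_* names are hypothetical and not kernel APIs, and it assumes that MAIR attribute index 0 selects Normal non-cacheable memory.

```c
#include <stdint.h>

/* Assumed values, mirroring the kernel headers for illustration only. */
#define IOMMU_READ   (1 << 0)
#define IOMMU_WRITE  (1 << 1)
#define IOMMU_CACHE  (1 << 2)
#define IOMMU_MMIO   (1 << 4)

#define ARM_LPAE_PTE_ATTRINDX_SHIFT  2
#define ARM_LPAE_MAIR_ATTR_IDX_NC    0
#define ARM_LPAE_MAIR_ATTR_IDX_CACHE 1
#define ARM_LPAE_MAIR_ATTR_IDX_DEV   2

/* Hypothetical labels for the memory type a mapping ends up with. */
enum model_memtype {
    MODEL_NORMAL_NC,  /* Normal, non-cacheable */
    MODEL_NORMAL_WB,  /* Normal, write-back cacheable */
    MODEL_DEVICE      /* Device-nGnRE */
};

/* Same decision as the "else" (stage-1) branch of the quoted code. */
uint64_t model_stage1_attr_bits(int prot)
{
    uint64_t pte = 0;

    if (prot & IOMMU_MMIO)
        pte |= (uint64_t)ARM_LPAE_MAIR_ATTR_IDX_DEV
                << ARM_LPAE_PTE_ATTRINDX_SHIFT;
    else if (prot & IOMMU_CACHE)
        pte |= (uint64_t)ARM_LPAE_MAIR_ATTR_IDX_CACHE
                << ARM_LPAE_PTE_ATTRINDX_SHIFT;
    /* Neither flag set: the index stays 0 (assumed Normal-NC). */
    return pte;
}

/* Decode the attribute index back into a memory type label. */
enum model_memtype model_stage1_memtype(int prot)
{
    uint64_t idx = (model_stage1_attr_bits(prot)
                    >> ARM_LPAE_PTE_ATTRINDX_SHIFT) & 0x7;

    switch (idx) {
    case ARM_LPAE_MAIR_ATTR_IDX_DEV:
        return MODEL_DEVICE;
    case ARM_LPAE_MAIR_ATTR_IDX_CACHE:
        return MODEL_NORMAL_WB;
    default:
        return MODEL_NORMAL_NC;
    }
}
```

In this model, a read/write mapping with IOMMU_MMIO set decodes to Device memory, while the same mapping with IOMMU_MMIO removed (and no IOMMU_CACHE, since iommu_dma_map_resource() passes coherent = false) falls through to attribute index 0. If that index is indeed Normal non-cacheable, the hack trades Device ordering guarantees for a Normal memory type, which would be consistent with the throughput difference reported above.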

Sincerely,
Kelly

> Regards,
> Christian.
>
>>
>>> Thanks,
>>> Robin.
>>>
>>>> Thank you!
>>>>
>>>> Platform info:
>>>> Linux kernel version: 5.10
>>>> PCIE GEN4 x16
>>>>
>>>> Sincerely,
>>>> Kelly
>>>>