by Peter Hurley

[permalink] [raw]

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

On 10/03/2014 12:06 PM, Akinobu Mita wrote:
> 2014-10-03 23:27 GMT+09:00 Peter Hurley <[email protected]>:
>> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <[email protected]>:
>>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>>
>>>>>> Which is different than if the plan is to ship production units for x86;
>>>>>> then a general purpose solution will be required.
>>>>>>
>>>>>> As to the good design of a general purpose solution for allocating and
>>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>>> than I am.
>>>>
>>>> What Akinobu's patches intend to support is:
>>>>
>>>> phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>>
>>>> which raises three issues:
>>>>
>>>> 1. Where do coherent blocks of this size come from?
>>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>>> existing DMA users?
>>>> 3. Is this support generically required across all iommu implementations on x86?
>>>>
>>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>>> allocator would already do this. Simply dropping in the contiguous memory
>>>> allocator doesn't work because CMA does not have the same policy and performance
>>>> as the page allocator, and is already causing performance regressions even
>>>> in the absence of huge page allocations.
>>>
>>> Could you take a look at the patches I sent? Can they fix these issues?
>>> https://lkml.org/lkml/2014/9/28/110
>>>
>>> With these patches, normal alloc_pages() is used for allocation first
>>> and dma_alloc_from_contiguous() is used as a fallback.
>>
>> Sure, I can test these patches this weekend.
>> Where are the unit tests?
>
> Thanks a lot. I would like to know whether the performance regression
> you see will disappear or not with these patches as if CONFIG_DMA_CMA is
> disabled.

I think something may have gotten lost in translation.

My "test" consists of doing my daily work (email, emacs, kernel builds,
web breaks, etc).

I don't have a testsuite that validates a page allocator or records any
performance metrics (for TTM allocations under load, as an example).

Without a unit test and performance metrics, my "test" is not really
positive affirmation of a correct implementation.

>>>> So that's why I raised question 3; is making the necessary compromises to support
>>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>>> required?
>>>>
>>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>>> designed to be limited to testing configurations, as the introductory
>>>> commit states:
>>>>
>>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>>> Author: Marek Szyprowski <[email protected]>
>>>> Date: Thu Dec 29 13:09:51 2011 +0100
>>>>
>>>> X86: integrate CMA with DMA-mapping subsystem
>>>>
>>>> This patch adds support for CMA to dma-mapping subsystem for x86
>>>> architecture that uses common pci-dma/pci-nommu implementation. This
>>>> allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>>
>>>> Signed-off-by: Marek Szyprowski <[email protected]>
>>>> Signed-off-by: Kyungmin Park <[email protected]>
>>>> CC: Michal Nazarewicz <[email protected]>
>>>> Acked-by: Arnd Bergmann <[email protected]>
>>>>
>>>>
>>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>>> required only for a special test platform, then could not this support
>>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>>> get initialized much the same way that iommu=calgary is now.
>>>>
>>>> The code for such a iommu configuration would mostly duplicate
>>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>>> the other x86 iommu implementations.
>>>
>>> I'm not sure I read correctly, though. Can boot option 'cma=0' also
>>> help avoiding CMA from IOMMU implementation?
>>
>> Maybe, but that's not an appropriate solution for distro kernels.
>>
>> Nor does this address configurations that want a really large CMA so
>> 1GB huge pages can be allocated (not for DMA though).
>
> Now I see the point of iommu=cma you suggested. But what should we do
> when CONFIG_SWIOTLB is disabled, especially for x86_32?
> Should we just introduce yet another flag to tell not using DMA_CMA
> instead of adding new swiotlb-like iommu implementation?

Again, since I don't know what you're using this for and
there are no existing mainline users, I can't really design this for
you.

I'm just trying to do my best to come up with alternative solutions
that limit the impact to existing x86 configurations, while still
achieving your goals (without really knowing what those design
constraints are).

Regards,
Peter Hurley

2014-10-05 06:01:46

by Akinobu Mita

[permalink] [raw]

Subject: Re: [PATCH v3 0/5] enhance DMA CMA on x86

2014-10-04 1:39 GMT+09:00 Peter Hurley <[email protected]>:
> On 10/03/2014 12:06 PM, Akinobu Mita wrote:
>> 2014-10-03 23:27 GMT+09:00 Peter Hurley <[email protected]>:
>>> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>>>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <[email protected]>:
>>>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>>>
>>>>>>> Which is different than if the plan is to ship production units for x86;
>>>>>>> then a general purpose solution will be required.
>>>>>>>
>>>>>>> As to the good design of a general purpose solution for allocating and
>>>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>>>> than I am.
>>>>>
>>>>> What Akinobu's patches intend to support is:
>>>>>
>>>>> phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>>>
>>>>> which raises three issues:
>>>>>
>>>>> 1. Where do coherent blocks of this size come from?
>>>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>>>> existing DMA users?
>>>>> 3. Is this support generically required across all iommu implementations on x86?
>>>>>
>>>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>>>> allocator would already do this. Simply dropping in the contiguous memory
>>>>> allocator doesn't work because CMA does not have the same policy and performance
>>>>> as the page allocator, and is already causing performance regressions even
>>>>> in the absence of huge page allocations.
>>>>
>>>> Could you take a look at the patches I sent? Can they fix these issues?
>>>> https://lkml.org/lkml/2014/9/28/110
>>>>
>>>> With these patches, normal alloc_pages() is used for allocation first
>>>> and dma_alloc_from_contiguous() is used as a fallback.
>>>
>>> Sure, I can test these patches this weekend.
>>> Where are the unit tests?
>>
>> Thanks a lot. I would like to know whether the performance regression
>> you see will disappear or not with these patches as if CONFIG_DMA_CMA is
>> disabled.
>
> I think something may have gotten lost in translation.
>
> My "test" consists of doing my daily work (email, emacs, kernel builds,
> web breaks, etc).
>
> I don't have a testsuite that validates a page allocator or records any
> performance metrics (for TTM allocations under load, as an example).
>
> Without a unit test and performance metrics, my "test" is not really
> positive affirmation of a correct implementation.
>
>
>>>>> So that's why I raised question 3; is making the necessary compromises to support
>>>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>>>> required?
>>>>>
>>>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>>>> designed to be limited to testing configurations, as the introductory
>>>>> commit states:
>>>>>
>>>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>>>> Author: Marek Szyprowski <[email protected]>
>>>>> Date: Thu Dec 29 13:09:51 2011 +0100
>>>>>
>>>>> X86: integrate CMA with DMA-mapping subsystem
>>>>>
>>>>> This patch adds support for CMA to dma-mapping subsystem for x86
>>>>> architecture that uses common pci-dma/pci-nommu implementation. This
>>>>> allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>>>
>>>>> Signed-off-by: Marek Szyprowski <[email protected]>
>>>>> Signed-off-by: Kyungmin Park <[email protected]>
>>>>> CC: Michal Nazarewicz <[email protected]>
>>>>> Acked-by: Arnd Bergmann <[email protected]>
>>>>>
>>>>>
>>>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>>>> required only for a special test platform, then could not this support
>>>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>>>> get initialized much the same way that iommu=calgary is now.
>>>>>
>>>>> The code for such a iommu configuration would mostly duplicate
>>>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>>>> the other x86 iommu implementations.
>>>>
>>>> I'm not sure I read correctly, though. Can boot option 'cma=0' also
>>>> help avoiding CMA from IOMMU implementation?
>>>
>>> Maybe, but that's not an appropriate solution for distro kernels.
>>>
>>> Nor does this address configurations that want a really large CMA so
>>> 1GB huge pages can be allocated (not for DMA though).

kernel parameter 'cma=' is only available when CONFIG_DMA_CMA is enabled.
cma=0 doesn't disable 1GB huge pages as far as I can see.
So I prepare a patch which make default cma size zero on x86.

>> Now I see the point of iommu=cma you suggested. But what should we do
>> when CONFIG_SWIOTLB is disabled, especially for x86_32?
>> Should we just introduce yet another flag to tell not using DMA_CMA
>> instead of adding new swiotlb-like iommu implementation?
>
> Again, since I don't know what you're using this for and
> there are no existing mainline users, I can't really design this for
> you.
>
> I'm just trying to do my best to come up with alternative solutions
> that limit the impact to existing x86 configurations, while still
> achieving your goals (without really knowing what those design
> constraints are).

Thanks a lot for your advise.