2014-02-24 19:04:09

by Joel Fernandes

[permalink] [raw]
Subject: Ideas/suggestions to avoid repeated locking and reduce the number of lists with dmaengine?

Hi folks,

Just wanted your thoughts/suggestions on how we can avoid overhead in the EDMA
dmaengine driver. I am seeing a large performance drop, especially for small
transfers, with EDMA versus before raw EDMA was moved to the DMAEngine
framework (at least 25%).

One of the things I am thinking about is the repeated (spin) locking/unlocking
of the virt_dma_chan->lock or vc->lock. In many cases, there's only 1 user or
thread that needs to do a DMA, so I feel the locking is unnecessary and a
potential overhead. If there's a sane way to detect this and avoid locking
altogether, that would be great.
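
(For reference, the lock I am talking about is taken on every submit in
drivers/dma/virt-dma.c - paraphrasing from memory, so treat this as a
sketch rather than the exact code:

static dma_cookie_t vchan_tx_submit(struct dma_async_tx_descriptor *tx)
{
        struct virt_dma_chan *vc = to_virt_chan(tx->chan);
        struct virt_dma_desc *vd = to_virt_desc(tx);
        unsigned long flags;
        dma_cookie_t cookie;

        spin_lock_irqsave(&vc->lock, flags);
        cookie = dma_cookie_assign(tx);
        /* just queue the descriptor; nothing starts until issue_pending */
        list_add_tail(&vd->node, &vc->desc_submitted);
        spin_unlock_irqrestore(&vc->lock, flags);

        return cookie;
}

The same vc->lock is taken again in issue_pending, in the completion
tasklet and in the driver's interrupt path, so for a 16-byte transfer we
lock/unlock several times per transaction.)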

Also, with respect to virt_dma (which is used by edma to manage all the
descriptors and lists), there are too many lists: submitted, issued,
completed, etc., and the descriptor moves from one to the other. I am
wondering whether we can avoid using so many lists and just have 2 lists,
moving the desc from one list to the other. That would avoid the
intermediate list altogether and classify DMA requests as "done" or "not done".
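
(For anyone not familiar with it, the relevant part of
drivers/dma/virt-dma.h looks roughly like this:

struct virt_dma_chan {
        struct dma_chan         chan;
        struct tasklet_struct   task;
        void (*desc_free)(struct virt_dma_desc *);

        spinlock_t lock;

        /* protected by vc.lock */
        struct list_head desc_submitted;
        struct list_head desc_issued;
        struct list_head desc_completed;
};

A descriptor walks desc_submitted -> desc_issued -> desc_completed
before it is finally freed.)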

Since this involves discussing concurrency primitives, copying linux-rt-users as
well.

Thanks,
-Joel


2014-02-24 19:22:09

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Ideas/suggestions to avoid repeated locking and reduce the number of lists with dmaengine?

Wrapping... (I've had to manually edit this.)

On Mon, Feb 24, 2014 at 01:03:32PM -0600, Joel Fernandes wrote:
> Just wanted your thoughts/suggestions on how we can avoid overhead in
> the EDMA dmaengine driver. I am seeing a large performance drop,
> especially for small transfers, with EDMA versus before raw EDMA was
> moved to the DMAEngine framework (at least 25%).
>
> One of the things I am thinking about is the repeated (spin)
> locking/unlocking of the virt_dma_chan->lock or vc->lock. In many
> cases, there's only 1 user or thread that needs to do a DMA, so I
> feel the locking is unnecessary and a potential overhead. If there's
> a sane way to detect this and avoid locking altogether, that
> would be great.

For the case where there's no contention, spinlocks /should/ be light.
What will make them more expensive is if you have things like lockdep
enabled, which adds much more code into those paths to do state tracking.
It's a known side effect of using that debug.

So, if you're developing, you should always have lockdep turned on.
If you're testing for performance, you should have lockdep and spinlock
debugging turned off.
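
The usual suspects in the .config are along these lines (not an
exhaustive list):

        CONFIG_PROVE_LOCKING    - lockdep proper; hooks every lock/unlock
        CONFIG_DEBUG_LOCK_ALLOC - lock tracking on allocation/free paths
        CONFIG_DEBUG_SPINLOCK   - sanity checks in the spinlock paths
        CONFIG_LOCK_STAT        - lock contention statistics

All of those want to be =n when you are measuring performance.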

> Also, with respect to virt_dma (which is used by edma to manage all the
> descriptors and lists), there are too many lists: submitted, issued,
> completed, etc., and the descriptor moves from one to the other. I am
> wondering whether we can avoid using so many lists and just have 2
> lists, moving the desc from one list to the other. That would avoid the
> intermediate list altogether and classify DMA requests as "done" or
> "not done".

The reason I created separate submitted and issued lists is that it's
much easier to manage than having everything on a single list.

We could merge the submitted and issued lists - the way to do that is to
have the channel store the cookie of the last issued descriptor - but I
wonder if it's worth the effort.
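
Hypothetically, something like this (untested, field names made up, and
ignoring cookie wrap-around):

        struct virt_dma_chan {
                ...
                struct list_head desc_list;     /* submitted + issued */
                dma_cookie_t last_issued;       /* watermark cookie */
        };

        /* a descriptor counts as issued if its cookie is not newer
         * than the last cookie seen by issue_pending() */
        static bool vchan_vd_issued(struct virt_dma_chan *vc,
                                    struct virt_dma_desc *vd)
        {
                return vd->tx.cookie <= vc->last_issued;
        }

issue_pending() would then just update last_issued rather than splicing
one list onto another.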

What I'd suggest is to try some profiling, and post some profiling
results which show where the problems are, rather than pointing at
bits of code you might not particularly like.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

2014-02-24 20:50:40

by Andy Gross

[permalink] [raw]
Subject: Re: Ideas/suggestions to avoid repeated locking and reduce the number of lists with dmaengine?

On Mon, Feb 24, 2014 at 01:03:32PM -0600, Joel Fernandes wrote:
> Hi folks,
>
> Just wanted your thoughts/suggestions on how we can avoid overhead in the EDMA
> dmaengine driver. I am seeing a large performance drop, especially for small
> transfers, with EDMA versus before raw EDMA was moved to the DMAEngine
> framework (at least 25%).

I've seen roughly the same drop in my testing. In my case it had to do with
the nature of how work is done using virt-dma. virt-dma is predicated on
letting only one transaction be active at a time, and that increases the
latency for getting the next transaction started. For large transactions,
it's negligible, but for small transactions it is pretty evident.


> One of the things I am thinking about is the repeated (spin) locking/unlocking
> of the virt_dma_chan->lock or vc->lock. In many cases, there's only 1 user or
> thread that needs to do a DMA, so I feel the locking is unnecessary and a
> potential overhead. If there's a sane way to detect this and avoid locking
> altogether, that would be great.

I'd expect the locking to not be the source of the problem, especially with
your use case.

[snip]

--
sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

2014-02-24 22:39:22

by Joel Fernandes

[permalink] [raw]
Subject: Re: Ideas/suggestions to avoid repeated locking and reduce the number of lists with dmaengine?

Hi Russell,

Firstly, thanks for your quick reply.

On 02/24/2014 01:21 PM, Russell King - ARM Linux wrote:
> Wrapping... (I've had to manually edit this.)
>
> On Mon, Feb 24, 2014 at 01:03:32PM -0600, Joel Fernandes wrote:
>> Just wanted your thoughts/suggestions on how we can avoid overhead in
>> the EDMA dmaengine driver. I am seeing a large performance drop,
>> especially for small transfers, with EDMA versus before raw EDMA was
>> moved to the DMAEngine framework (at least 25%).
>>
>> One of the things I am thinking about is the repeated (spin)
>> locking/unlocking of the virt_dma_chan->lock or vc->lock. In many
>> cases, there's only 1 user or thread that needs to do a DMA, so I
>> feel the locking is unnecessary and a potential overhead. If there's
>> a sane way to detect this and avoid locking altogether, that
>> would be great.
>
> For the case where there's no contention, spinlocks /should/ be light.
> What will make them more expensive is if you have things like lockdep
> enabled, which adds much more code into those paths to do state tracking.
> It's a known side effect of using that debug.
>
> So, if you're developing, you should always have lockdep turned on.
> If you're testing for performance, you should have lockdep and spinlock
> debugging turned off.

Thanks, indeed I had to turn them off.

>
>> Also, with respect to virt_dma (which is used by edma to manage all the
>> descriptors and lists), there are too many lists: submitted, issued,
>> completed, etc., and the descriptor moves from one to the other. I am
>> wondering whether we can avoid using so many lists and just have 2
>> lists, moving the desc from one list to the other. That would avoid the
>> intermediate list altogether and classify DMA requests as "done" or
>> "not done".
>
> The reason I created separate submitted and issued lists is that it's
> much easier to manage than having everything on a single list.
>
>> We could merge the submitted and issued lists - the way to do that is to
>> have the channel store the cookie of the last issued descriptor - but I
>> wonder if it's worth the effort.
>
> What I'd suggest is to try some profiling, and post some profiling
> results which show where the problems are, rather than pointing at
> bits of code you might not particularly like.
>

Actually, I did do some tracing before I posted this thread, and noticed
excessive traces of locking/unlocking. It is very light though, as you
pointed out, and lighter without the debug options. The only other notable
difference is that we are now going through the dmaengine framework in the
newer kernel, versus the faster older one.

One more thing in my trace is that omap_dma_sync is repeatedly called in
memcpy_toio for every barrier call, which is not necessary. I am working on
a fix for this.

On turning off DEBUG_KERNEL and running more tests, I do see some
improvements; however, the throughput reduction is still ~10%.

With a modified openssl speed test app, I sent 16-byte blocks
repeatedly to the AES crypto hardware accelerator using EDMA:

On v3.13.5 kernel:
root@am335x-evm:~# openssl speed -evp aes-128-cbc -engine cryptodev
engine "cryptodev" set.
Doing aes-128-cbc for 3s on 16 size blocks: 79902 aes-128-cbc's

With v3.2 kernel,
Doing aes-128-cbc for 3s on 16 size blocks: 92314 aes-128-cbc's

So we're able to encrypt around 13k more ops, or around 4.5k ops/second
with 3.13.5

As such, I do see this as a problem, but the difference is much smaller for
larger blocks, so it's not a big alarm. We should probably just be using PIO
mode for small blocks.
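
Something along these lines in the crypto driver, where the threshold
and helper names here are made up and the cut-over point would have to
be measured rather than guessed:

        /* below this size, DMA setup cost outweighs the transfer */
        #define AES_DMA_MIN_BYTES       256     /* needs benchmarking */

        if (req->nbytes < AES_DMA_MIN_BYTES)
                return aes_crypt_pio(dd, req);  /* CPU-driven I/O */

        return aes_crypt_dma(dd, req);          /* EDMA path */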

regards,
-Joel

2014-02-24 22:54:07

by Joel Fernandes

[permalink] [raw]
Subject: Re: Ideas/suggestions to avoid repeated locking and reduce the number of lists with dmaengine?

Correcting myself from an earlier post...

On 02/24/2014 04:38 PM, Joel Fernandes wrote:
>>> Also, with respect to virt_dma (which is used by edma to manage all the
>>> descriptors and lists), there are too many lists: submitted, issued,
>>> completed, etc., and the descriptor moves from one to the other. I am
>>> wondering whether we can avoid using so many lists and just have 2
>>> lists, moving the desc from one list to the other. That would avoid the
>>> intermediate list altogether and classify DMA requests as "done" or
>>> "not done".
>>
>> The reason I created separate submitted and issued lists is that it's
>> much easier to manage than having everything on a single list.
>>
>> We could merge the submitted and issued lists - the way to do that is to
>> have the channel store the cookie of the last issued descriptor - but I
>> wonder if it's worth the effort.
>>
>> What I'd suggest is to try some profiling, and post some profiling
>> results which show where the problems are, rather than pointing at
>> bits of code you might not particularly like.
>>
>
> Actually, I did do some tracing before I posted this thread, and noticed
> excessive traces of locking/unlocking. It is very light though, as you
> pointed out, and lighter without the debug options. The only other notable
> difference is that we are now going through the dmaengine framework in the
> newer kernel, versus the faster older one.
>
> One more thing in my trace is that omap_dma_sync is repeatedly called in
> memcpy_toio for every barrier call, which is not necessary. I am working
> on a fix for this.
>
> On turning off DEBUG_KERNEL and running more tests, I do see some
> improvements; however, the throughput reduction is still ~10%.
>
> With a modified openssl speed test app, I sent 16-byte blocks
> repeatedly to the AES crypto hardware accelerator using EDMA:
>
> On v3.13.5 kernel:
> root@am335x-evm:~# openssl speed -evp aes-128-cbc -engine cryptodev
> engine "cryptodev" set.
> Doing aes-128-cbc for 3s on 16 size blocks: 79902 aes-128-cbc's
>
> With v3.2 kernel,
> Doing aes-128-cbc for 3s on 16 size blocks: 92314 aes-128-cbc's
>
> So we're able to encrypt around 13k more ops, or around 4.5k ops/second
> with 3.13.5

We're able to encrypt around 13k more ops, or around 4.5k ops/second
with the older 3.2 kernel that didn't use DMAEngine.

Regards,
-Joel

2014-02-25 12:24:41

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Ideas/suggestions to avoid repeated locking and reduce the number of lists with dmaengine?

On Mon, Feb 24, 2014 at 02:50:28PM -0600, Andy Gross wrote:
> On Mon, Feb 24, 2014 at 01:03:32PM -0600, Joel Fernandes wrote:
> > Hi folks,
> >
> > Just wanted your thoughts/suggestions on how we can avoid overhead in the EDMA
> > dmaengine driver. I am seeing a large performance drop, especially for small
> > transfers, with EDMA versus before raw EDMA was moved to the DMAEngine
> > framework (at least 25%).
>
> I've seen roughly the same drop in my testing. In my case it had to do
> with the nature of how work is done using virt-dma. virt-dma is
> predicated on letting only one transaction be active at a time, and
> that increases the latency for getting the next transaction started.
> For large transactions, it's negligible, but for small transactions it
> is pretty evident.

Wrong. virt-dma allows you to fire off the next transaction in the queue
as soon as the previous transaction has finished. I know this, because
sa11x0-dma does exactly that.

You don't need to wait for the tasklet to be called before starting the
next transaction.
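
The pattern is roughly the following - a sketch with made-up driver
types, see sa11x0-dma.c for the real thing:

static irqreturn_t foo_dma_irq(int irq, void *data)
{
        struct foo_chan *c = data;
        struct virt_dma_desc *vd;
        unsigned long flags;

        spin_lock_irqsave(&c->vc.lock, flags);

        /* the just-finished descriptor was removed from the issued
         * list when the hardware was programmed, so completing it
         * here only moves it to desc_completed */
        if (c->cur_vd) {
                vchan_cookie_complete(c->cur_vd);
                c->cur_vd = NULL;
        }

        /* start the next issued descriptor right here, in the IRQ
         * handler - no tasklet round-trip before the hardware is
         * busy again */
        vd = vchan_next_desc(&c->vc);
        if (vd) {
                list_del(&vd->node);
                c->cur_vd = vd;
                foo_dma_start_hw(c, vd);
        }

        spin_unlock_irqrestore(&c->vc.lock, flags);
        return IRQ_HANDLED;
}

Only the client callbacks need to be deferred to the tasklet, and
vchan_cookie_complete() takes care of scheduling that.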

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

2014-02-25 12:29:23

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Ideas/suggestions to avoid repeated locking and reduce the number of lists with dmaengine?

On Mon, Feb 24, 2014 at 04:38:48PM -0600, Joel Fernandes wrote:
> Actually, I did do some tracing before I posted this thread, and noticed
> excessive traces of locking/unlocking. It is very light though, as you
> pointed out, and lighter without the debug options. The only other notable
> difference is that we are now going through the dmaengine framework in the
> newer kernel, versus the faster older one.

Okay, for OMAP, there's additional latency caused by the tasklet which
starts a transaction when there's no previous transaction running -
that will be your biggest problem.

I do want to move omap-dma away from having a fixed binding between
claimed channels and the logical channels since that's too wasteful,
and that's what the tasklet will eventually be doing - but with
omap-dma being divorced from the old code, this tasklet may not be
necessary anymore (since we no longer have to operate in process
context for that bit.)

There's a huge pile of patches I have queued up for the next merge
window in linux-next - I'd rather get these into mainline before
trying to do any restructuring, otherwise we're going to get into
a hell of a problem with merge conflicts.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.