2015-11-01 18:51:17

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver


>>
>>> Also, if you waste CPU cycles for hundreds of milliseconds, it's unlikely
>>> that the function is so performance critical that it requires writel_relaxed().
>>>
>>> Just use writel() here.
ok, replaced writel_relaxed+wmb with writel.

>>
>> The issue is not writel_relaxed vs. writel. After I issue reset, I need
>> wait for some time to confirm reset was done. I can use readl_polling
>> instead of mdelay if we don't like mdelay.
>
> I meant that both _relaxed() and mdelay() are probably wrong here.

You are right about redundant writel_relaxed + wmb. They are effectively
equal to writel.

However, after issuing the command; I still need to wait some amount of
time until hardware acknowledges the commands like reset/enable/disable.
These are relatively faster operations happening in microseconds. That's
why, I have mdelay there.

I'll take a look at workqueues but it could turn out to be an overkill
for few microseconds.

>
> readl_polling() would avoid the part with _relaxed(), but if that can
> still take more than a few microseconds, you should try to sleep inbetween
> rather than burn CPU cycles.
>
>>>> +/*
>>>> + * The interrupt handler for HIDMA will try to consume as many pending
>>>> + * EVRE from the event queue as possible. Each EVRE has an associated
>>>> + * TRE that holds the user interface parameters. EVRE reports the
>>>> + * result of the transaction. Hardware guarantees ordering between EVREs
>>>> + * and TREs. We use last processed offset to figure out which TRE is
>>>> + * associated with which EVRE. If two TREs are consumed by HW, the EVREs
>>>> + * are in order in the event ring.
>>>> + * This handler will do a one pass for consuming EVREs. Other EVREs may
>>>> + * be delivered while we are working. It will try to consume incoming
>>>> + * EVREs one more time and return.
>>>> + * For unprocessed EVREs, hardware will trigger another interrupt until
>>>> + * all the interrupt bits are cleared.
>>>> + */
>>>> +static void hidma_ll_int_handler_internal(struct hidma_lldev *lldev)
>>>> +{
>>>> + u32 status;
>>>> + u32 enable;
>>>> + u32 cause;
>>>> + int repeat = 2;
>>>> + unsigned long timeout;
>>>> +
>>>> + status = readl_relaxed(lldev->evca + EVCA_IRQ_STAT_OFFSET);
>>>> + enable = readl_relaxed(lldev->evca + EVCA_IRQ_EN_OFFSET);
>>>> + cause = status & enable;
>>>> +
>>>
>>> Reading the status probably requires a readl() rather than readl_relaxed()
>>> to guarantee that the DMA data has arrived in memory by the time that the
>>> register data is seen by the CPU. If using readl_relaxed() here is a valid
>>> and required optimization, please add a comment to explain why it works
>>> and how much you gain.
>>
>> I will add some description. This is a high speed peripheral. I don't
>> like spreading barriers as candies inside the readl and writel unless I
>> have to.
>>
>> According to the barriers video, I watched on youtube this should be the
>> rule for ordering.
>>
>> "if you do two relaxed reads and check the results of the returned
>> variables, ARM architecture guarantees that these two relaxed variables
>> will get observed during the check."
>>
>> this is called implied ordering or something of that sort.
>
> My point was a bit different: while it is guaranteed that the
> result of the readl_relaxed() is observed in order, they do not
> guarantee that a DMA from device to memory that was started by
> the device before the readl_relaxed() has arrived in memory
> by the time that the readl_relaxed() result is visible to the
> CPU and it starts accessing the memory.
>
I checked with the hardware designers. Hardware guarantees that by the
time interrupt is observed, all data transactions in flight are
delivered to their respective places and are visible to the CPU. I'll
add a comment in the code about this.

> In other words, when the hardware sends you data followed by an
> interrupt to tell you the data is there, your interrupt handler
> can tell the driver that is waiting for this data that the DMA
> is complete while the data itself is still in flight, e.g. waiting
> for an IOMMU to fetch page table entries.
>
There is HW guarantee for ordering.

On demand paging for IOMMU is only supported for PCIe via PRI (Page
Request Interface) not for HIDMA. All other hardware instances work on
pinned DMA addresses. I'll drop a note about this too to the code as well.


>>>> + wmb();
>>>> +
>>>> + mdelay(1);
>>>
>>> Another workqueue? You should basically never call mdelay().
>>
>> I'll use polled read instead.
>
> Ok, but again make sure that you call msleep() or usleep_range()
> between the reads.
>
>>>> +static int hidma_ll_hw_start(void *llhndl)
>>>> +{
>>>> + int rc = 0;
>>>> + struct hidma_lldev *lldev = llhndl;
>>>> + unsigned long irqflags;
>>>> +
>>>> + spin_lock_irqsave(&lldev->lock, irqflags);
>>>> + writel_relaxed(lldev->tre_write_offset,
>>>> + lldev->trca + TRCA_DOORBELL_OFFSET);
>>>> + spin_unlock_irqrestore(&lldev->lock, irqflags);
>>>
>>> How does this work? The writel_relaxed() won't synchronize with either
>>> the DMA data or the spinlock.
>>
>> mutex and spinlocks have barriers inside. See the youtube video.
>>
>> https://www.youtube.com/watch?v=6ORn6_35kKo
>
> I'm pretty sure these barriers only make sense to the CPU, so the
> spinlock guarantees that the access to lldev->tre_write_offset is
> protected, but not the access to lldev->trca, because that write
> is posted on the bus and might not complete until after the
> unlock. There is no "dsb(st)" in here:
>
> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> unsigned long tmp;
>
> asm volatile(ARM64_LSE_ATOMIC_INSN(
> /* LL/SC */
> " ldrh %w1, %0\n"
> " add %w1, %w1, #1\n"
> " stlrh %w1, %0",
> /* LSE atomics */
> " mov %w1, #1\n"
> " nop\n"
> " staddlh %w1, %0")
> : "=Q" (lock->owner), "=&r" (tmp)
> :
> : "memory");
> }
>
ok. I'm changing it to readl. Thanks for the insight.

>>>
>>> Also, your abstraction seem to go a little too far if the upper driver
>>> doesn't know what the lower driver calls its main device structure.
>>>
>>> Or you can go further and just embed the struct hidma_lldev within the
>>> struct hidma_dev to save one?
>>
>> That's how it was before. It got too complex and variables/spinlocks got
>> intermixed. I borrowed the upper layer and it worked as it is. I rather
>> keep all hardware stuff in another file and do not mix and match for safety.
>
> Ok, then just use a forward declaration for the struct name so you can
> have a type-safe pointer but don't need to show the members.
>
>>>> +void hidma_ll_chstats(struct seq_file *s, void *llhndl, u32 tre_ch)
>>>> +{
>>>> + struct hidma_lldev *lldev = llhndl;
>>>> + struct hidma_tre *tre;
>>>> + u32 length;
>>>> + dma_addr_t src_start;
>>>> + dma_addr_t dest_start;
>>>> + u32 *tre_local;
>>>> +
>>>> + if (unlikely(tre_ch >= lldev->nr_tres)) {
>>>> + dev_err(lldev->dev, "invalid TRE number in chstats:%d",
>>>> + tre_ch);
>>>> + return;
>>>> + }
>>>> + tre = &lldev->trepool[tre_ch];
>>>> + seq_printf(s, "------Channel %d -----\n", tre_ch);
>>>> + seq_printf(s, "allocated=%d\n", atomic_read(&tre->allocated));
>>>> + HIDMA_CHAN_SHOW(tre, queued);
>>>> + seq_printf(s, "err_info=0x%x\n",
>>>> + lldev->tx_status_list[tre->chidx].err_info);
>>>> + seq_printf(s, "err_code=0x%x\n",
>>>> + lldev->tx_status_list[tre->chidx].err_code);
>>>> + HIDMA_CHAN_SHOW(tre, status);
>>>> + HIDMA_CHAN_SHOW(tre, chidx);
>>>> + HIDMA_CHAN_SHOW(tre, dma_sig);
>>>> + seq_printf(s, "dev_name=%s\n", tre->dev_name);
>>>> + seq_printf(s, "callback=%p\n", tre->callback);
>>>> + seq_printf(s, "data=%p\n", tre->data);
>>>> + HIDMA_CHAN_SHOW(tre, tre_index);
>>>> +
>>>> + tre_local = &tre->tre_local[0];
>>>> + src_start = tre_local[TRE_SRC_LOW_IDX];
>>>> + src_start = ((u64)(tre_local[TRE_SRC_HI_IDX]) << 32) + src_start;
>>>> + dest_start = tre_local[TRE_DEST_LOW_IDX];
>>>> + dest_start += ((u64)(tre_local[TRE_DEST_HI_IDX]) << 32);
>>>> + length = tre_local[TRE_LEN_IDX];
>>>> +
>>>> + seq_printf(s, "src=%pap\n", &src_start);
>>>> + seq_printf(s, "dest=%pap\n", &dest_start);
>>>> + seq_printf(s, "length=0x%x\n", length);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(hidma_ll_chstats);
>>>
>>> Remove all the pointers here. I guess you can remove the entire debugfs
>>> file really ;-)
>>
>> ok, I need some facility to print out stuff when problems happened.
>> Would you rather use sysfs?
>
> sysfs would be less appropriate, as that requires providing a stable ABI
> for user space. I think ftrace should provide what you need. Let me know
> if that doesn't work out.
>
> Arnd
>

--
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


2015-11-01 20:21:14

by Timur Tabi

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver

Sinan Kaya wrote:
> However, after issuing the command; I still need to wait some amount of
> time until hardware acknowledges the commands like reset/enable/disable.
> These are relatively faster operations happening in microseconds. That's
> why, I have mdelay there.

Can you use readl_poll_timeout()?

--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.

2015-11-01 20:27:28

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver



On 11/1/2015 3:21 PM, Timur Tabi wrote:
> Sinan Kaya wrote:
>> However, after issuing the command; I still need to wait some amount of
>> time until hardware acknowledges the commands like reset/enable/disable.
>> These are relatively faster operations happening in microseconds. That's
>> why, I have mdelay there.
>
> Can you use readl_poll_timeout()?
>
Yes, I'm changing them to readl_poll_timeout after looking at the
workqueue vs. expected delay requirements. I decided to stick with
polled read.

First, I missed the point here that mdelay is discouraged.
https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

According to same document, msleep can sleep up to 20ms which is not
what I want.

This code is trying to sleep 10ms maximum and treating longer durations
as failures.

readl_poll_timeout calls usleep_range that seems to satisfy what I'm
looking for.

--
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

2015-11-02 16:34:01

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver

On Sunday 01 November 2015 13:50:53 Sinan Kaya wrote:
>
> >> The issue is not writel_relaxed vs. writel. After I issue reset, I need
> >> wait for some time to confirm reset was done. I can use readl_polling
> >> instead of mdelay if we don't like mdelay.
> >
> > I meant that both _relaxed() and mdelay() are probably wrong here.
>
> You are right about redundant writel_relaxed + wmb. They are effectively
> equal to writel.

Actually, writel() is wmb()+writel_relaxed(), not the other way round:

When sending a command to a device that can start a DMA transfer,
the barrier is required to ensure that the DMA happens after previously
written data has gone from the CPU write buffers into the memory that
is used as the source for the transfer.

A barrier after the writel() has no effect, as MMIO writes are posted
on the bus.

> However, after issuing the command; I still need to wait some amount of
> time until hardware acknowledges the commands like reset/enable/disable.
> These are relatively faster operations happening in microseconds. That's
> why, I have mdelay there.
>
> I'll take a look at workqueues but it could turn out to be an overkill
> for few microseconds.

Most devices are able to provide an interrupt for long-running commands.
Are you sure that yours is unable to do this? If so, is this a design
mistake or an implementation bug?

> >>> Reading the status probably requires a readl() rather than readl_relaxed()
> >>> to guarantee that the DMA data has arrived in memory by the time that the
> >>> register data is seen by the CPU. If using readl_relaxed() here is a valid
> >>> and required optimization, please add a comment to explain why it works
> >>> and how much you gain.
> >>
> >> I will add some description. This is a high speed peripheral. I don't
> >> like spreading barriers as candies inside the readl and writel unless I
> >> have to.
> >>
> >> According to the barriers video, I watched on youtube this should be the
> >> rule for ordering.
> >>
> >> "if you do two relaxed reads and check the results of the returned
> >> variables, ARM architecture guarantees that these two relaxed variables
> >> will get observed during the check."
> >>
> >> this is called implied ordering or something of that sort.
> >
> > My point was a bit different: while it is guaranteed that the
> > result of the readl_relaxed() is observed in order, they do not
> > guarantee that a DMA from device to memory that was started by
> > the device before the readl_relaxed() has arrived in memory
> > by the time that the readl_relaxed() result is visible to the
> > CPU and it starts accessing the memory.
> >
> I checked with the hardware designers. Hardware guarantees that by the
> time interrupt is observed, all data transactions in flight are
> delivered to their respective places and are visible to the CPU. I'll
> add a comment in the code about this.

I'm curious about this. Does that mean the device is not meant for
high-performance transfers and just synchronizes the bus before
triggering the interrupt?

> > In other words, when the hardware sends you data followed by an
> > interrupt to tell you the data is there, your interrupt handler
> > can tell the driver that is waiting for this data that the DMA
> > is complete while the data itself is still in flight, e.g. waiting
> > for an IOMMU to fetch page table entries.
> >
> There is HW guarantee for ordering.
>
> On demand paging for IOMMU is only supported for PCIe via PRI (Page
> Request Interface) not for HIDMA. All other hardware instances work on
> pinned DMA addresses. I'll drop a note about this too to the code as well.

I wasn't talking about paging, just fetching the IOTLB from the
preloaded page tables in RAM. This can takes several uncached memory
accesses, so it would generally be slow.


Arnd

2015-11-02 19:21:43

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver

On 11/2/2015 11:33 AM, Arnd Bergmann wrote:
> On Sunday 01 November 2015 13:50:53 Sinan Kaya wrote:
>>
>>>> The issue is not writel_relaxed vs. writel. After I issue reset, I need
>>>> wait for some time to confirm reset was done. I can use readl_polling
>>>> instead of mdelay if we don't like mdelay.
>>>
>>> I meant that both _relaxed() and mdelay() are probably wrong here.
>>
>> You are right about redundant writel_relaxed + wmb. They are effectively
>> equal to writel.
>
> Actually, writel() is wmb()+writel_relaxed(), not the other way round:

I agree. Semantics... but important one since order matters this time.

>
> When sending a command to a device that can start a DMA transfer,
> the barrier is required to ensure that the DMA happens after previously
> written data has gone from the CPU write buffers into the memory that
> is used as the source for the transfer.
>
Agreed.

> A barrier after the writel() has no effect, as MMIO writes are posted
> on the bus.

I had two use cases in the original code. We are talking about start
routine here. I was giving reference to enable/reset/disable uses above.

1. Start routine
--------------
spin_lock
writel_relaxed
spin_unlock

and

2. enable/reset/disable
--------------
writel_relaxed
wmb

I changed writel_relaxed to writel now in start routine and submitted
the second version of the patchset yesterday. I hope you have received
it. I was relying on the spinlocks before.


>
>> However, after issuing the command; I still need to wait some amount of
>> time until hardware acknowledges the commands like reset/enable/disable.
>> These are relatively faster operations happening in microseconds. That's
>> why, I have mdelay there.
>>
>> I'll take a look at workqueues but it could turn out to be an overkill
>> for few microseconds.
>
> Most devices are able to provide an interrupt for long-running commands.
> Are you sure that yours is unable to do this? If so, is this a design
> mistake or an implementation bug?

I think I was not clear on how long these command take. These command
are really fast and get acknowledged at status register in few
microseconds. That's why I choose polling.

I was waiting up to 10ms before and manually sleeping 1 milliseconds in
between each using mdelay. I followed your suggestion and got rid of the
mdelay. Then, I used polled read command which calls the usleep_range
function as you suggested.

Hardware supports error interrupts but this is a SW design philosophy
discussion. Why would you want to trigger an interrupt for few
microseconds delay that only happens during the first time init from probe?

>
>>>>> Reading the status probably requires a readl() rather than readl_relaxed()
>>>>> to guarantee that the DMA data has arrived in memory by the time that the
>>>>> register data is seen by the CPU. If using readl_relaxed() here is a valid
>>>>> and required optimization, please add a comment to explain why it works
>>>>> and how much you gain.
>>>>
>>>> I will add some description. This is a high speed peripheral. I don't
>>>> like spreading barriers as candies inside the readl and writel unless I
>>>> have to.
>>>>
>>>> According to the barriers video, I watched on youtube this should be the
>>>> rule for ordering.
>>>>
>>>> "if you do two relaxed reads and check the results of the returned
>>>> variables, ARM architecture guarantees that these two relaxed variables
>>>> will get observed during the check."
>>>>
>>>> this is called implied ordering or something of that sort.
>>>
>>> My point was a bit different: while it is guaranteed that the
>>> result of the readl_relaxed() is observed in order, they do not
>>> guarantee that a DMA from device to memory that was started by
>>> the device before the readl_relaxed() has arrived in memory
>>> by the time that the readl_relaxed() result is visible to the
>>> CPU and it starts accessing the memory.
>>>
>> I checked with the hardware designers. Hardware guarantees that by the
>> time interrupt is observed, all data transactions in flight are
>> delivered to their respective places and are visible to the CPU. I'll
>> add a comment in the code about this.
>
> I'm curious about this. Does that mean the device is not meant for
> high-performance transfers and just synchronizes the bus before
> triggering the interrupt?

HIDMA meaning, as you probably guessed, is high performance DMA. We had
several name iterations in the company and this was the one that sticked.

I'm a SW person. I don't have the expertise to go deeper into HW design.
I'm following the programming document. It says coherency and guaranteed
interrupt ordering. High performance can mean how fast you can move data
from one location to the other one vs. how fast you can queue up
multiple requests and get acks in response.

I followed a simple design here. HW can take multiple requests
simultaneously and give me an ack when it is finished with interrupt.

If there are requests in flight, other requests will get queued up in SW
and will not be serviced until the previous requests get acknowledged.
Then, as soon as HW stops processing; I queue a bunch of other requests
and kick start it. Current SW design does not allow simultaneous SW
queuing vs. HW processing. I can try this on the next iteration. This
implementation, IMO, is good enough now and has been working reliably
for a long time (since 2014).

>
>>> In other words, when the hardware sends you data followed by an
>>> interrupt to tell you the data is there, your interrupt handler
>>> can tell the driver that is waiting for this data that the DMA
>>> is complete while the data itself is still in flight, e.g. waiting
>>> for an IOMMU to fetch page table entries.
>>>
>> There is HW guarantee for ordering.
>>
>> On demand paging for IOMMU is only supported for PCIe via PRI (Page
>> Request Interface) not for HIDMA. All other hardware instances work on
>> pinned DMA addresses. I'll drop a note about this too to the code as well.
>
> I wasn't talking about paging, just fetching the IOTLB from the
> preloaded page tables in RAM. This can takes several uncached memory
> accesses, so it would generally be slow.
>
I see.

HIDMA is not aware of IOMMU presence since it follows the DMA API. All
IOMMU latency will be built into the data movement time. By the time
interrupt happens, IOMMU lookups + data movement has already taken place.

>
> Arnd
>

--
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

2015-11-02 20:55:38

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver

On Monday 02 November 2015 14:21:37 Sinan Kaya wrote:
> On 11/2/2015 11:33 AM, Arnd Bergmann wrote:
> > On Sunday 01 November 2015 13:50:53 Sinan Kaya wrote:
> > A barrier after the writel() has no effect, as MMIO writes are posted
> > on the bus.
>
> I had two use cases in the original code. We are talking about start
> routine here. I was giving reference to enable/reset/disable uses above.
>
> 1. Start routine
> --------------
> spin_lock
> writel_relaxed
> spin_unlock
>
> and
>
> 2. enable/reset/disable
> --------------
> writel_relaxed
> wmb
>
> I changed writel_relaxed to writel now in start routine and submitted
> the second version of the patchset yesterday. I hope you have received
> it. I was relying on the spinlocks before.

Ok

> >
> >> However, after issuing the command; I still need to wait some amount of
> >> time until hardware acknowledges the commands like reset/enable/disable.
> >> These are relatively faster operations happening in microseconds. That's
> >> why, I have mdelay there.
> >>
> >> I'll take a look at workqueues but it could turn out to be an overkill
> >> for few microseconds.
> >
> > Most devices are able to provide an interrupt for long-running commands.
> > Are you sure that yours is unable to do this? If so, is this a design
> > mistake or an implementation bug?
>
> I think I was not clear on how long these command take. These command
> are really fast and get acknowledged at status register in few
> microseconds. That's why I choose polling.
>
> I was waiting up to 10ms before and manually sleeping 1 milliseconds in
> between each using mdelay. I followed your suggestion and got rid of the
> mdelay. Then, I used polled read command which calls the usleep_range
> function as you suggested.
>
> Hardware supports error interrupts but this is a SW design philosophy
> discussion. Why would you want to trigger an interrupt for few
> microseconds delay that only happens during the first time init from probe?

If you get called in sleeping context and can use usleep_range() for
delaying, that is fine, but in effect that just means you generate another
interrupt from the timer that is not synchronized to your device, and
hide the complexity behind the usleep_range() function call.

My first choice would have been to use a struct completion to wait for
the next interrupt here, which has similar complexity on the source code
side, but never waits longer than necessary. If the hrtimer based method
works for you, there is no need to change that.

> >> I checked with the hardware designers. Hardware guarantees that by the
> >> time interrupt is observed, all data transactions in flight are
> >> delivered to their respective places and are visible to the CPU. I'll
> >> add a comment in the code about this.
> >
> > I'm curious about this. Does that mean the device is not meant for
> > high-performance transfers and just synchronizes the bus before
> > triggering the interrupt?
>
> HIDMA meaning, as you probably guessed, is high performance DMA. We had
> several name iterations in the company and this was the one that sticked.
>
> I'm a SW person. I don't have the expertise to go deeper into HW design.
> I'm following the programming document. It says coherency and guaranteed
> interrupt ordering. High performance can mean how fast you can move data
> from one location to the other one vs. how fast you can queue up
> multiple requests and get acks in response.
>
> I followed a simple design here. HW can take multiple requests
> simultaneously and give me an ack when it is finished with interrupt.
>
> If there are requests in flight, other requests will get queued up in SW
> and will not be serviced until the previous requests get acknowledged.
> Then, as soon as HW stops processing; I queue a bunch of other requests
> and kick start it. Current SW design does not allow simultaneous SW
> queuing vs. HW processing. I can try this on the next iteration. This
> implementation, IMO, is good enough now and has been working reliably
> for a long time (since 2014).

Are you using message signaled interrupts then? Typically MSI guarantees
ordering against DMA, but level or edge triggered interrupts by definition
cannot (at least on PCI, but most other buses are the same way), because
the DMA master has no insight into when a DMA is actually complete.

If you use MSI, please add a comment to the readl_relaxed() that it
is safe because of that, otherwise the next person who tries to debug
a problem with your driver has to look into this.

> >>> In other words, when the hardware sends you data followed by an
> >>> interrupt to tell you the data is there, your interrupt handler
> >>> can tell the driver that is waiting for this data that the DMA
> >>> is complete while the data itself is still in flight, e.g. waiting
> >>> for an IOMMU to fetch page table entries.
> >>>
> >> There is HW guarantee for ordering.
> >>
> >> On demand paging for IOMMU is only supported for PCIe via PRI (Page
> >> Request Interface) not for HIDMA. All other hardware instances work on
> >> pinned DMA addresses. I'll drop a note about this too to the code as well.
> >
> > I wasn't talking about paging, just fetching the IOTLB from the
> > preloaded page tables in RAM. This can takes several uncached memory
> > accesses, so it would generally be slow.
> >
> I see.
>
> HIDMA is not aware of IOMMU presence since it follows the DMA API. All
> IOMMU latency will be built into the data movement time. By the time
> interrupt happens, IOMMU lookups + data movement has already taken place.

Ok.

Arnd

2015-11-03 05:29:14

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver



On 11/2/2015 3:55 PM, Arnd Bergmann wrote:
> Are you using message signaled interrupts then?
> Typically MSI guarantees
> ordering against DMA, but level or edge triggered interrupts by definition
> cannot (at least on PCI, but most other buses are the same way), because
> the DMA master has no insight into when a DMA is actually complete.
>
> If you use MSI, please add a comment to the readl_relaxed() that it
> is safe because of that, otherwise the next person who tries to debug
> a problem with your driver has to look into this.

No, using regular GIC SPI interrupts at this moment. I know that HW
doesn't use any of the typical AHB/AXI ARM buses.

I'm familiar with how PCI endpoints works. While the first read in a
typical PCI endpoint ISR flushes all outstanding requests traditionally
to the destination, this concept does not apply here for this HW.

--
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

2015-11-03 10:44:19

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver

On Tuesday 03 November 2015 00:29:07 Sinan Kaya wrote:
> On 11/2/2015 3:55 PM, Arnd Bergmann wrote:
> > Are you using message signaled interrupts then?
> > Typically MSI guarantees
> > ordering against DMA, but level or edge triggered interrupts by definition
> > cannot (at least on PCI, but most other buses are the same way), because
> > the DMA master has no insight into when a DMA is actually complete.
> >
> > If you use MSI, please add a comment to the readl_relaxed() that it
> > is safe because of that, otherwise the next person who tries to debug
> > a problem with your driver has to look into this.
>
> No, using regular GIC SPI interrupts at this moment. I know that HW
> doesn't use any of the typical AHB/AXI ARM buses.
>
> I'm familiar with how PCI endpoints works. While the first read in a
> typical PCI endpoint ISR flushes all outstanding requests traditionally
> to the destination, this concept does not apply here for this HW.
>

Ok, got it.

Best add an explanation like the above in the interrupt handler,
to prevent this from accidentally getting 'cleaned up' to use
readl(), or copied into a driver that uses PCI ordering rules
where it is actually wrong.

I think it should be done like this:

- anything that is not performance critical, use normal readl/writel
- in the fast path, add a comment to each readl_relaxed()/writel_relaxed()
that is safe in this driver but that would not be safe in a PCI
device
- For the ones that would be safe on PCI as weel, use
readl_relaxed()/writel_relaxed() without a comment on each one,
but clarify somewhere that these are all intentional.

Arnd

2015-11-03 21:07:40

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver



On 11/3/2015 5:43 AM, Arnd Bergmann wrote:
> Ok, got it.
>
> Best add an explanation like the above in the interrupt handler,
> to prevent this from accidentally getting 'cleaned up' to use
> readl(), or copied into a driver that uses PCI ordering rules
> where it is actually wrong.
>

I'm adding this disclaimer into the ISR routine.

/*
* Fine tuned for this HW...
*
* This ISR has been designed for this particular hardware. Relaxed read
* and write accessors are used for performance reasons due to interrupt
* delivery guarantees. Do not copy this code blindly and expect
* that to work.
*/


> I think it should be done like this:
>
> - anything that is not performance critical, use normal readl/writel
> - in the fast path, add a comment to each readl_relaxed()/writel_relaxed()
> that is safe in this driver but that would not be safe in a PCI
> device
> - For the ones that would be safe on PCI as weel, use
> readl_relaxed()/writel_relaxed() without a comment on each one,
> but clarify somewhere that these are all intentional.

Makes sense.

--
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

2015-11-03 21:11:25

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver

On Tuesday 03 November 2015 16:07:34 Sinan Kaya wrote:
> On 11/3/2015 5:43 AM, Arnd Bergmann wrote:
> > Ok, got it.
> >
> > Best add an explanation like the above in the interrupt handler,
> > to prevent this from accidentally getting 'cleaned up' to use
> > readl(), or copied into a driver that uses PCI ordering rules
> > where it is actually wrong.
> >
>
> I'm adding this disclaimer into the ISR routine.
>
> /*
> * Fine tuned for this HW...
> *
> * This ISR has been designed for this particular hardware. Relaxed read
> * and write accessors are used for performance reasons due to interrupt
> * delivery guarantees. Do not copy this code blindly and expect
> * that to work.
> */
>
>

Sounds good.

Arnd