Hello,
I'm working on a Linux driver for a custom PCI card (containing a Xilinx FPGA) which is bus-master capable and has to transfer large amounts of data at high bandwidth. I finally succeeded in mmap()ing the DMA buffer residing in RAM into user space to avoid unnecessary copying. It actually seems to work quite well, but sometimes I run into trouble (apparently at random) with DMA transfers from RAM to the device. When the problem occurs, data arrives too late at the input FIFO on the PCI card (16 Kbit).
Looking at some signals with an oscilloscope shows the following behaviour:
1. After preparing the DMA buffer in RAM and telling the PCI card to begin the DMA transfer, the first DMA burst is transmitted normally.
2. After the first burst, the PCI bus grant signal is deasserted, so access to the bus appears to be denied.
3. About 400 nanoseconds later, the PCI device tries to initiate the next burst, but does not succeed (bus access is not granted).
=> this process is repeated 3 times
4. In most cases the next burst starts after the third attempt (and all following bursts proceed normally). But in the (rare) faulty case, the 2nd burst only starts after another delay of about 600 ns, which is too late, because meanwhile I get a buffer underrun in the FPGA. After some delayed bursts the transfer continues normally.
Does anybody have an idea why the DMA bursts could be delayed, even though I deactivated all other PCI devices that could disturb the transfers? Maybe it is a quite simple issue, since I'm not yet very experienced with DMA. Could it be a problem with my driver implementation? Whenever the problem occurs, it is always after the first burst. I allocated the DMA buffer in RAM with pci_alloc_consistent(), as described in Rubini's book and the DMA-mapping.txt documentation file.
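For reference, the allocation and the mmap() handler look roughly like this (a minimal sketch; BUF_SIZE and all names are placeholders, error handling trimmed):

#include <linux/pci.h>
#include <linux/mm.h>

#define BUF_SIZE (4 * 1024 * 1024)  /* placeholder buffer size */

static void *dma_virt;      /* kernel virtual address of the buffer */
static dma_addr_t dma_bus;  /* bus address to program into the card */

static int mydev_alloc_ring(struct pci_dev *pdev)
{
	dma_virt = pci_alloc_consistent(pdev, BUF_SIZE, &dma_bus);
	return dma_virt ? 0 : -ENOMEM;
}

/* file_operations.mmap: expose the same pages to user space.  On
 * i386 the bus address equals the physical address, so the page
 * frame number is simply dma_bus >> PAGE_SHIFT. */
static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > BUF_SIZE)
		return -EINVAL;

	return remap_pfn_range(vma, vma->vm_start,
			       dma_bus >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}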
Here is some information about my environment:
- Gigabyte GA-8I945GMF mainboard with Pentium D processor
- custom pci board with Xilinx FPGA Spartan 2 (XC2S150-6) with PCI 32 LogiCore
- Debian Linux with 2.6.13.4 SMP kernel
Another thing I should mention: I tried to configure the length of the DMA bursts in the PCI core, but that didn't work. The oscilloscope showed that the actual burst length never exceeded 512 bits, and I think that is far too short to be efficient!
Any hint would be much appreciated.
Kind regards,
Burkhard Schölpen
Burkhard Schölpen wrote:
> ... in the (rare) faulty case, the 2nd burst only starts
> after another delay of about 600 ns, which is too late
Looking at the PCI 2.3 specification,
arbitration latency on the order of a microsecond
or two does not seem excessive for a 33MHz bus.
> ... I deactivated all other PCI devices that could disturb the transfers?
Are you accessing registers on your device
during the DMA transfers? If so, the CPU is
acting as a PCI master that could delay granting
the bus to your device.
--
Paul Fulghum
Microgate Systems, Ltd.
>Paul Fulghum <[email protected]> wrote on 29.12.05 16:30:20:
>
>Burkhard Schölpen wrote:
>> ... in the (rare) faulty case, the 2nd burst only starts
>> after another delay of about 600 ns, which is too late
>
>Looking at the PCI 2.3 specification,
>arbitration latency on the order of a microsecond
>or two does not seem excessive for a 33MHz bus.
Okay, then I think I have to figure out why I cannot get bursts longer than 512 bits... does anybody have a clue how I can handle that?
>> ... I deactivated all other PCI devices that could disturb the transfers?
>
>Are you accessing registers on your device
>during the DMA transfers? If so, the CPU is
>acting as a PCI master that could delay granting
>the bus to your device.
No, I made sure of that: there are no register accesses during the DMA transfer. The driver puts the application to sleep until an interrupt signals completion.
Kind regards,
Burkhard
Burkhard Schölpen wrote:
> why I cannot get bursts longer than 512 bits...
What value is written by the system into the
PCI configuration space of your device for
the latency timer?
(8 bits at offset 0x0d, units = clock cycles)
You can try setting it to a higher
value in your driver.
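For example, something like this in your
probe routine (just a sketch; pdev is your
card's struct pci_dev, and 64 clocks is an
arbitrary illustrative value):

#include <linux/pci.h>

/* Sketch: read the latency timer (config offset 0x0d)
 * and raise it if the BIOS set it low.  64 is arbitrary;
 * pick whatever your FIFO depth requires. */
static void mydev_raise_latency_timer(struct pci_dev *pdev)
{
	u8 lat;

	pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &lat);
	printk(KERN_INFO "mydev: latency timer = %u clocks\n", lat);

	if (lat < 64)
		pci_write_config_byte(pdev, PCI_LATENCY_TIMER, 64);
}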
--
Paul Fulghum
Microgate Systems, Ltd.
Burkhard Schölpen wrote:
> Hello,
>
> I'm working on a Linux driver for a custom PCI card (containing a Xilinx FPGA) which is bus-master capable and has to transfer large amounts of data at high bandwidth. I finally succeeded in mmap()ing the DMA buffer residing in RAM into user space to avoid unnecessary copying. It actually seems to work quite well, but sometimes I run into trouble (apparently at random) with DMA transfers from RAM to the device. When the problem occurs, data arrives too late at the input FIFO on the PCI card (16 Kbit).
>
> Looking at some signals with an oscilloscope shows the following behaviour:
> 1. After preparing the DMA buffer in RAM and telling the PCI card to begin the DMA transfer, the first DMA burst is transmitted normally.
> 2. After the first burst, the PCI bus grant signal is deasserted, so access to the bus appears to be denied.
> 3. About 400 nanoseconds later, the PCI device tries to initiate the next burst, but does not succeed (bus access is not granted).
> => this process is repeated 3 times
> 4. In most cases the next burst starts after the third attempt (and all following bursts proceed normally). But in the (rare) faulty case, the 2nd burst only starts after another delay of about 600 ns, which is too late, because meanwhile I get a buffer underrun in the FPGA. After some delayed bursts the transfer continues normally.
>
> Does anybody have an idea why the DMA bursts could be delayed, even though I deactivated all other PCI devices that could disturb the transfers? Maybe it is a quite simple issue, since I'm not yet very experienced with DMA. Could it be a problem with my driver implementation? Whenever the problem occurs, it is always after the first burst. I allocated the DMA buffer in RAM with pci_alloc_consistent(), as described in Rubini's book and the DMA-mapping.txt documentation file.
What kind of PCI transaction is the core using to do the reads? I think
that Memory Read can cause bursts to be interrupted quickly on some
chipsets. If you can use Memory Read Line or Memory Read Multiple this
may increase performance.
You may also need more buffering in the FPGA, otherwise you may be
vulnerable to underruns if there is contention on the PCI bus. The
device should be able to handle normal arbitration delays.
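Note that a master will typically only emit Memory Read
Line/Multiple if the Cache Line Size register (config offset
0x0c) holds a valid value; whether the LogiCORE keys off that
register is an assumption you would need to verify against its
documentation. In the driver that would look like this (sketch):

#include <linux/pci.h>
#include <linux/cache.h>

/* Sketch: program the Cache Line Size register (offset 0x0c,
 * in units of 32-bit words).  Whether the LogiCORE consults it
 * when choosing MR vs. MRL/MRM is an assumption to verify. */
static void mydev_set_cacheline(struct pci_dev *pdev)
{
	pci_write_config_byte(pdev, PCI_CACHE_LINE_SIZE,
			      L1_CACHE_BYTES / sizeof(u32));
}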
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
>What kind of PCI transaction is the core using to do the reads? I think
>that Memory Read can cause bursts to be interrupted quickly on some
>chipsets. If you can use Memory Read Line or Memory Read Multiple this
>may increase performance.
>
>You may also need more buffering in the FPGA, otherwise you may be
>vulnerable to underruns if there is contention on the PCI bus. The
>device should be able to handle normal arbitration delays.
Yeah, that was it! I asked the FPGA programmer and he told me he was using the plain Memory Read transaction. After changing it to Memory Read Multiple we get a much longer burst length. Now the buffer underruns really seem to have disappeared, which is great! He also told me that the FIFO on the FPGA cannot be made larger, because its size is limited in the core (it is built from block RAM, I think), so we are lucky that the longer bursts seem to fix our problem.
By the way, another question comes to mind. The PCI card is being designed for a large-format copying machine (i.e. it is something like a framegrabber which simultaneously has to write data out to a printer), which calls for really high bandwidth. For now I allocate the DMA buffer in RAM (a ring buffer) using pci_alloc_consistent(), which unfortunately limits the size to about 4 MB. However, it would be convenient to allocate a larger DMA buffer, because then we could run some image-processing algorithms directly inside this buffer by mmap()ing it to user space. Is there any way to achieve this quite simply without being forced to use scatter/gather DMA (our hardware cannot do this - at least not yet)?
Thank you very much for your help!
Kind regards,
Burkhard Schölpen
Burkhard Schölpen wrote:
> By the way, another question comes to mind. The PCI card is being designed for a large-format copying machine (i.e. it is something like a framegrabber which simultaneously has to write data out to a printer), which calls for really high bandwidth. For now I allocate the DMA buffer in RAM (a ring buffer) using pci_alloc_consistent(), which unfortunately limits the size to about 4 MB. However, it would be convenient to allocate a larger DMA buffer, because then we could run some image-processing algorithms directly inside this buffer by mmap()ing it to user space. Is there any way to achieve this quite simply without being forced to use scatter/gather DMA (our hardware cannot do this - at least not yet)?
Unfortunately, if you need a physically contiguous memory buffer
to do DMA on, your choices are basically either pci_alloc_consistent(),
or possibly boot-time allocation, i.e. telling the kernel to use
less memory than is in the machine and claiming the rest for the
device. Trying to allocate a big chunk of contiguous memory after
the system has come up will not be very reliable, since memory
tends to become fragmented.
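For example, on a 512 MB machine you could boot with "mem=448M"
and map the hidden top 64 MB in the driver. A sketch (the
addresses are purely illustrative, and this assumes i386 where
the bus address equals the physical address):

#include <linux/ioport.h>
#include <asm/io.h>

/* Illustrative only: 512 MB machine booted with "mem=448M", so
 * the top 64 MB of physical RAM is invisible to the kernel and
 * free for the device. */
#define RESERVED_PHYS	(448UL << 20)
#define RESERVED_SIZE	(64UL << 20)

static void __iomem *buf;

static int mydev_map_reserved(void)
{
	if (!request_mem_region(RESERVED_PHYS, RESERVED_SIZE, "mydev"))
		return -EBUSY;

	buf = ioremap(RESERVED_PHYS, RESERVED_SIZE);
	if (!buf) {
		release_mem_region(RESERVED_PHYS, RESERVED_SIZE);
		return -ENOMEM;
	}

	/* program RESERVED_PHYS into the card as the DMA base */
	return 0;
}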
When dealing with this amount of data it really would be best to use
some form of scatter-gather DMA. Even if the hardware is not capable
of taking multiple addresses and doing the DMA on its own, you could
sort of fake it by telling it to do multiple transfers, one for each
block of memory - though that adds some overhead.
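A sketch of that faked scatter-gather (all names here are
hypothetical; the point is just that the interrupt handler hands
the card one address/length pair at a time):

#include <linux/interrupt.h>
#include <linux/wait.h>
#include <asm/io.h>

#define NBLOCKS		16	/* hypothetical block count */
#define REG_DMA_ADDR	0x00	/* hypothetical card registers */
#define REG_DMA_LEN	0x04
#define REG_DMA_CTRL	0x08
#define DMA_GO		0x01

struct mydev {
	void __iomem *regs;		/* BAR mapping of the card */
	wait_queue_head_t wait;		/* woken when all blocks done */
	dma_addr_t addr[NBLOCKS];	/* bus address of each block */
	u32 len[NBLOCKS];		/* length of each block */
	int cur;			/* block currently in flight */
};

static void start_block(struct mydev *dev, int i)
{
	/* program one address/length pair and kick off the transfer */
	writel(dev->addr[i], dev->regs + REG_DMA_ADDR);
	writel(dev->len[i],  dev->regs + REG_DMA_LEN);
	writel(DMA_GO,       dev->regs + REG_DMA_CTRL);
}

/* 2.6.13-era interrupt handler signature */
static irqreturn_t mydev_irq(int irq, void *dev_id, struct pt_regs *regs)
{
	struct mydev *dev = dev_id;

	/* ...acknowledge the card's interrupt here... */

	if (++dev->cur < NBLOCKS)
		start_block(dev, dev->cur);	/* chain the next block */
	else
		wake_up_interruptible(&dev->wait); /* whole buffer done */

	return IRQ_HANDLED;
}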