Hi,
I am trying to optimize a driver for a slave only PCI device and am
having a lot of trouble getting any kind of PCI burst transactions in
either the read or the write direction. Using bcopy/memcpy or even a
hand-crafted while (len--) { *pdst++ = *psrc++; } (with pdst and psrc
unsigned long*) I can only get writes to burst and even in that case
only for 2 data phases (8 bytes) and only on 64 bit machines. The
best that I have managed is to use a hand crafted asm function which
copies the data through mmx registers on i386 machines, but that still
only bursts a maximum of 16 bytes in the write direction and not at
all in the read direction. The source and destination pointers are
both aligned to 8 byte boundaries, so I don't think that it's an
alignment issue.
Is there any way to get PIO to burst over the PCI bus in the read and
write direction? My device has 4 BAR registers, but the area where I
am transferring data is marked 'prefetchable' (although the others are
not). I read here: http://lkml.org/lkml/2004/9/23/393 that this was a
prerequisite, but it is apparently not sufficient. He also mentioned
that the area had to be marked as write-back, but it's not clear how
you can tell (no, /proc/mtrr doesn't tell you) or that it has anything
to do with bursting reads.
Any ideas would be really appreciated,
thanks-
dan
"Dan Gora" <[email protected]> writes:
>
> Is there any way to get PIO
I assume you really mean MMIO, not PIO. PIO would be port IO.
> to burst over the PCI bus in the read and
> write direction?
You should set the MMIO mapping to write-combining using an MTRR.
You might need to add appropriate memory barriers if you rely
on write ordering, though.
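
A minimal sketch of where such a barrier would sit, assuming a hypothetical
device with a write-combined data window and a separate doorbell register
(neither is described in this thread). It is written as plain userspace C so
it can be compiled anywhere; in a driver the fence would be wmb(), and the
C11 fence used here only marks the position of the barrier, it does not by
itself drain write-combining buffers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy `words` 64-bit words into a (hypothetical) write-combined window,
 * then write a (hypothetical) doorbell register to tell the device that
 * the payload is complete.
 */
void post_payload(volatile uint64_t *window, volatile uint32_t *doorbell,
                  const uint64_t *src, size_t words)
{
    assert(window != NULL && doorbell != NULL && src != NULL);

    /* Payload stores: these are the writes the CPU is allowed to combine. */
    for (size_t i = 0; i < words; i++)
        window[i] = src[i];

    /*
     * Assumption: stand-in for the kernel's wmb(). The payload stores
     * above must not be reordered after the doorbell store below.
     */
    atomic_thread_fence(memory_order_release);

    /* Doorbell store: the device may consume the payload after this. */
    *doorbell = (uint32_t)words;
}
```

Without the barrier, a write-combined mapping gives no guarantee that the
payload reaches the device before the doorbell does.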
-Andi
> Is there any way to get PIO to burst over the PCI bus in the read and
> write direction? My device has 4 BAR registers, but the area where I
I think you are doing about as well as the X folks did when they spent
time on trying to optimise pio transfers to and from graphics card RAM.
> Any ideas would be really appreciated,
Put a DMA controller on it ;)
Alan
On Fri, Feb 15, 2008 at 2:54 AM, Andi Kleen <[email protected]> wrote:
> "Dan Gora" <[email protected]> writes:
> >
> > Is there any way to get PIO
>
> I assume you really mean MMIO, not PIO. PIO would be port IO.
Sorry, I always saw it referred to as "Programmed I/O" as opposed to DMA...
> You should set the MMIO mapping to write combining using an MTRR
Sorry to be thick here, but how would I go about doing that?
> You might need to add appropriate memory barriers if you rely
> on write ordering though.
Ok, thanks for the info...
dan
On Fri, Feb 15, 2008 at 5:02 AM, Alan Cox <[email protected]> wrote:
> > Is there any way to get PIO to burst over the PCI bus in the read and
> > write direction? My device has 4 BAR registers, but the area where I
>
> I think you are doing about as well as the X folks did when they spent
> time on trying to optimise pio transfers to and from graphics card RAM.
>
That's good to know. Do you have a link or anything to their
discussion or some key words that I could use to hunt it down?
>
> > Any ideas would be really appreciated,
>
> Put a DMA controller on it ;)
Ugh.. sadly that's what's coming. I really don't get why the
northbridge cannot burst however. If the memory is mapped
prefetchable and you have to do a PCI read through 3 PCIe bridges to
finally get to your device it seems like it would _really_ behoove the
bridge to do a Memory read multiple and get the whole cache line. I
have searched around a lot and there doesn't seem to be any info at
all on how you can persuade these bridges to do different PCI commands
or burst. I don't know why....
thanks again for your help,
dan
Dan Gora wrote:
>>
>> Put a DMA controller on it ;)
>
> Ugh.. sadly that's what's coming. I really don't get why the
> northbridge cannot burst however.
Because the early Intel northbridges didn't, so no one else bothered
either, since everyone designed their hardware to not require that
capability.
-hpa
On Fri, 15 Feb 2008 10:00:28 -0800
"Dan Gora" <[email protected]> wrote:
> On Fri, Feb 15, 2008 at 5:02 AM, Alan Cox <[email protected]> wrote:
> > > Is there any way to get PIO to burst over the PCI bus in the read and
> > > write direction? My device has 4 BAR registers, but the area where I
> >
> > I think you are doing about as well as the X folks did when they spent
> > time on trying to optimise pio transfers to and from graphics card RAM.
> >
>
> That's good to know. Do you have a link or anything to their
> discussion or some key words that I could use to hunt it down?
It was some time ago but a look at the X tree will find you the code.
It's basically the same as you did - using MMX.
Dan Gora wrote:
> Hi,
>
> I am trying to optimize a driver for a slave only PCI device and am
> having a lot of trouble getting any kind of PCI burst transactions in
> either the read or the write direction. Using bcopy/memcpy or even a
> hand-crafted while (len--) { *pdst++ = *psrc++; } (with pdst and psrc
> unsigned long*) I can only get writes to burst and even in that case
> only for 2 data phases (8 bytes) and only on 64 bit machines. The
> best that I have managed is to use a hand crafted asm function which
> copies the data through mmx registers on i386 machines, but that still
> only bursts a maximum of 16 bytes in the write direction and not at
> all in the read direction. The source and destination pointers are
> both aligned to 8 byte boundaries, so I don't think that it's an
> alignment issue.
The chipset is being limited by what the CPU is giving it. If the CPU
sends only a small amount of data in one access then the chipset usually
does not try to burst more than that.
>
> Is there any way to get PIO to burst over the PCI bus in the read and
> write direction? My device has 4 BAR registers, but the area where I
> am transferring data is marked 'prefetchable' (although the others are
> not). I read here: http://lkml.org/lkml/2004/9/23/393 that this was a
> prerequisite, but it is apparently not sufficient. He also mentioned
> that the area had to be marked as write-back, but it's not clear how
> you can tell (no, /proc/mtrr doesn't tell you) or that it has anything
> to do with bursting reads.
>
> Any ideas would be really appreciated,
Well, in order for the CPU to batch up more writes you'd have to map the
BAR as either write-combining or write-back. If it's not listed in
/proc/mtrr it will be the default setting of uncacheable. X has code to
set up the video memory on the video card as write-combining so it can
get better write performance, you could do something similar.
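
To put rough numbers on the batching, here is a small sketch that counts bus
transactions for a given transfer length and burst size. The 8-byte and
16-byte figures are the ones reported earlier in the thread; the 64-byte
figure assumes a write-combining buffer the size of one 64-byte cache line,
which is an assumption about the CPU and not something measured here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of bus transactions needed to move `len` bytes when each
 * transaction carries at most `burst` bytes (rounded up).
 */
size_t bus_transactions(size_t len, size_t burst)
{
    assert(burst > 0);
    return (len + burst - 1) / burst;
}
```

For a 4096-byte transfer this gives 512 transactions at 8 bytes each, 256 at
16 bytes each, and 64 if the CPU can hand the chipset 64 bytes at a time.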
Setting it as write-back might allow you to get the reads to do bursting
as well (since the CPU will do a cache-line fill instead of individual
accesses), but if the device is modifying this memory area, you'll get
stale data back unless you add code to invalidate those cache lines
before reading the data. You could run into some other less obvious
issues as well, as normally device memory regions are not mapped write-back.
In general, especially if you need to read data back from the device,
implementing a DMA engine would be by far the better option. Most
chipsets seem not at all optimized for handling sequential reads from
PCI memory from the CPU. (Even in the DMA case, you have to be careful
with what type of memory read transaction you use when transferring from
host memory - some chipsets don't like to burst more than one cycle if
you use normal Memory Read instead of Memory Read Line or Memory Read
Multiple.)
On Feb 15, 2008 10:00 PM, Robert Hancock <[email protected]> wrote:
>
> Well, in order for the CPU to batch up more writes you'd have to map the
> BAR as either write-combining or write-back. If it's not listed in
> /proc/mtrr it will be the default setting of uncacheable.
Ok, this is pretty much what I thought, but I still don't really have
any idea how to do this. ioremap() doesn't take any flags and I'm not
using ioremap_nocache(), plus the BAR is marked prefetchable...
> X has code to
> set up the video memory on the video card as write-combining so it can
> get better write performance, you could do something similar.
Alan mentioned this as well, but I haven't tried to hunt this code
yet. If you have any pointers as to where I might find this, I would
appreciate it.
> Setting it as write-back might allow you to get the reads to do bursting
> as well (since the CPU will do a cache-line fill instead of individual
> accesses)
I don't see what the cache write policy has to do with the reads. If
the region is marked cacheable, then all reads should try and read a
cache line, right? The write-back or write-through policy only has to
do with the writes. If it's write through then writes go directly to
RAM, if it's write-back then they hit the cache and are flushed when
the line is flushed (LRU replacement, explicit cache line flush,
etc..), right?
> but if the device is modifying this memory area, you'll get stale
> data back unless you add code to invalidate those cache lines
> before reading the data.
Yeah this could definitely be tricky, would pci_dma_sync suffice for this?
> You could run into some other less obvious
> issues as well, as normally device memory regions are not mapped write-back.
>
> In general, especially if you need to read data back from the device,
> implementing a DMA engine would be by far the better option. Most
> chipsets seem not at all optimized for handling sequential reads from
> PCI memory from the CPU. (Even in the DMA case, you have to be careful
> with what type of memory read transaction you use when transferring from
> host memory - some chipsets don't like to burst more than one cycle if
> you use normal Memory Read instead of Memory Read Line or Memory Read
> Multiple.)
True enough... Fortunately my device allows me to set these...
What I am trying to avoid is PCI read transactions in general. PCI
reads are slow pretty much no matter if they are originated from the
device or from the host because of all the multitude of bridges they
have to go through (I've seen 5 in some cases... sheesh). So
ultimately I'd like for everything going to the device to be written
from the host, then everything going towards the host be DMA'd into
RAM by the device, at least then we can take advantage of PCI write
posting and you don't have to wait for the write to actually complete
before we plod on. But this depends on at least getting write
burst performance from the host so that the time to write the data
from host is less than the time it would take for the device to read
the data out of RAM.
thanks again for your help!
dan
Dan Gora wrote:
> On Feb 15, 2008 10:00 PM, Robert Hancock <[email protected]> wrote:
>> Well, in order for the CPU to batch up more writes you'd have to map the
>> BAR as either write-combining or write-back. If it's not listed in
>> /proc/mtrr it will be the default setting of uncacheable.
>
> Ok, this is pretty much what I thought, but I still don't really have
> any idea how to do this. ioremap() doesn't take any flags and I'm not
> using ioremap_nocache(), plus the BAR is marked prefetchable...
Likely easiest to do it from userspace by writing into /proc/mtrr to
change the memory type attributes. Have a look at Documentation/mtrr.txt.
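
A rough sketch of doing the same thing from a small C program instead of the
shell, assuming the "base=... size=... type=..." line format described in
Documentation/mtrr.txt. The function names are made up for illustration, and
the base and size in the usage note are placeholder values: the real ones
have to come from the device's prefetchable BAR (lspci -v will show them).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one /proc/mtrr request line into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int format_mtrr_entry(char *buf, size_t buflen,
                      unsigned long long base, unsigned long long size,
                      const char *type)
{
    assert(buf != NULL && type != NULL);

    int n = snprintf(buf, buflen, "base=0x%llx size=0x%llx type=%s\n",
                     base, size, type);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}

/*
 * Write the request to /proc/mtrr. Needs root.
 * Returns 0 on success, -1 on any failure.
 */
int add_mtrr(unsigned long long base, unsigned long long size,
             const char *type)
{
    char line[128];
    int n = format_mtrr_entry(line, sizeof line, base, size, type);
    if (n < 0)
        return -1;

    FILE *f = fopen("/proc/mtrr", "w");
    if (f == NULL)
        return -1;

    int ok = (fwrite(line, 1, (size_t)n, f) == (size_t)n);
    if (fclose(f) != 0)   /* the kernel may reject the request at flush time */
        ok = 0;
    return ok ? 0 : -1;
}
```

Something like add_mtrr(0xf8000000ULL, 0x400000ULL, "write-combining") would
then request a 4 MB write-combining region. Reading /proc/mtrr back afterwards
should show whether the request was accepted.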
>
>> X has code to
>> set up the video memory on the video card as write-combining so it can
>> get better write performance, you could do something similar.
>
> Alan mentioned this as well, but I haven't tried to hunt this code
> yet. If you have any pointers as to where I might find this, I would
> appreciate it.
>
>> Setting it as write-back might allow you to get the reads to do bursting
>> as well (since the CPU will do a cache-line fill instead of individual
>> accesses)
>
> I don't see what the cache write policy has to do with the reads. If
> the region is marked cacheable, then all reads should try and read a
> cache line, right? The write-back or write-through policy only has to
> do with the writes. If it's write through then writes go directly to
> RAM, if it's write-back then they hit the cache and are flushed when
> the line is flushed (LRU replacement, explicit cache line flush,
> etc..), right?
That caching attribute affects reads as well. If it's marked uncacheable
or write-combining then reads will never be cached; reads are cached
only if it's marked write-back.
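
As a compact summary of that point, here is a small lookup sketch. The enum
and function names are made up for illustration. Write-through is included
only because it came up in the question; on x86 it also caches reads, but it
is not one of the options being suggested in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* x86 memory types mentioned in this thread (names are illustrative). */
enum mem_type { MT_UC, MT_WC, MT_WT, MT_WB, MT_COUNT };

/*
 * True if a load from memory of this type may be satisfied by a
 * cache-line fill, which is what would let reads go out as a burst.
 */
bool reads_cached(enum mem_type t)
{
    static const bool cached[MT_COUNT] = {
        [MT_UC] = false,  /* uncacheable: every load goes to the bus */
        [MT_WC] = false,  /* write-combining: only stores are batched */
        [MT_WT] = true,   /* write-through: loads cached, stores go through */
        [MT_WB] = true,   /* write-back: loads and stores both cached */
    };

    assert((unsigned)t < MT_COUNT);
    return cached[t];
}
```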
>
>> but if the device is modifying this memory area, you'll get stale
>> data back unless you add code to invalidate those cache lines
>> before reading the data.
>
> Yeah this could definitely be tricky, would pci_dma_sync suffice for this?
No, that's not meant to handle this case of stale data in the CPU's
cache since that doesn't normally happen. Something like clflush or
wbinvd would do it, those being x86-specific, of course.
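
A minimal sketch of the clflush approach, written as userspace C with the
compiler intrinsics so it can be tried outside a driver. It assumes 64-byte
cache lines and an x86 CPU with SSE2; on other builds it only counts the
lines it would have flushed. The function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#endif

#define CACHE_LINE 64u   /* assumption: 64-byte cache lines */

/*
 * Flush and invalidate every cache line overlapping [addr, addr + len),
 * so that the next load from the range has to go back to the device.
 * Returns the number of cache lines visited.
 */
size_t flush_cache_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    if (len == 0)
        return 0;

    /* Round the start down to a cache-line boundary. */
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += CACHE_LINE) {
#if defined(__SSE2__)
        _mm_clflush((const void *)p);
#endif
        lines++;
    }

#if defined(__SSE2__)
    _mm_mfence();   /* make sure the flushes are done before later loads */
#endif
    return lines;
}
```

This would have to run before every read of data the device may have
changed, which is part of why the write-back route gets awkward.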
>
>> You could run into some other less obvious
>> issues as well, as normally device memory regions are not mapped write-back.
>>
>> In general, especially if you need to read data back from the device,
>> implementing a DMA engine would be by far the better option. Most
>> chipsets seem not at all optimized for handling sequential reads from
>> PCI memory from the CPU. (Even in the DMA case, you have to be careful
>> with what type of memory read transaction you use when transferring from
>> host memory - some chipsets don't like to burst more than one cycle if
>> you use normal Memory Read instead of Memory Read Line or Memory Read
>> Multiple.)
>
> True enough... Fortunately my device allows me to set these...
>
> What I am trying to avoid is PCI read transactions in general. PCI
> reads are slow pretty much no matter if they are originated from the
> device or from the host because of all the multitude of bridges they
> have to go through (I've seen 5 in some cases... sheesh). So
> ultimately I'd like for everything going to the device to be written
> from the host, then everything going towards the host be DMA'd into
> RAM by the device, at least then we can take advantage of PCI write
> posting and you don't have to wait for the write to actually complete
> before we plod on. But this depends on at least getting write
> burst performance from the host so that the time to write the data
> from host is less than the time it would take for the device to read
> the data out of RAM.
>
> thanks again for your help!
> dan
Setting write-combining should be fairly easy without too many weird
side effects. Trying to use write-back to get burst reads is potentially
doable, but may be fraught with difficulty.
I think DMA in both directions is still likely better though, unless the
data you are writing is very small. Most chipsets have pretty small
posting buffers so the amount it will help you is small. If you fill
them up you'll just stall the CPU. With a DMA read, at least only
the device will stall.