Dear kernel-list,
I am writing a dev driver in which large amounts of data are
passed from user space to PCI device memory and I am seeing a
far lower throughput than I expected. I know this is likely to
be high architecture dependant but I would appreciate some
general guidance.
The essential bit of code looks like
#define CHUNK 512->4096 depending on implementation
static ssize_t BSL_write(..., const char *buf, size_t count..){
char chunk[CHUNK];
int i,pos,
for(i=0,pos=0;i<amount of data;i++,pos+=CHUNK){
copy_from_user(buff,buf+pos,CHUNK);
/* reorder data */
/* not significant in throughput */
for(k=0;k<CHUNK;k++){
chunk[k]=buff[B_SM(j+k)];
}
memcpy_toio(MEM_reg+pos+i*CHUNK+j,chunk,CHUNK);
}
return count;
}
+ lots of not important details-----
MEM_reg=ioremap(pci_resource_start(dev,B_SM_MEM)&PCI_BASE_ADDRESS_MEM_MASK,REG_SIZE);
Basically it reads a big buffer from user space, scrambles the
byte ordering (according to B_SM) and puts it onto the PCI
device memory.
** I currently get 3-4Mb/s throughput (on a fairly poor x86 pc),
this is about 1/10 of what I expected to achieve.
** The CHUNK size is not an important factor above 512 bytes- it
is contributing to 0.1 percent of the bottle neck. This
leads me to think that the copy_from overheads are small.
** The byte ordering is not significant, removing it or
replacing it with a null command indicates it contributes
~0.1 percent of the bottleneck.
** The transfer size will always exceed any cache sizes in the
system.
I will go onto implementing an mmap method sometime later but as
the essential memory / PCI memory bandwidth remains the same I
don't think it will make much difference- From testing with
different transfer sizes overheads seem small provided a minimum
transfer size of 512 bytes is enforced.
Any thoughts? pointers towards profiling this code, typical
throughputs, improving throughput etc gratefully received.
Thanks matt
--------
Thanks for the previous help on interrupts etc-
On Fri, Nov 09, 2001 at 10:48:58AM +0000, Matthew Clark wrote:
> #define CHUNK 512->4096 depending on implementation
>
> static ssize_t BSL_write(..., const char *buf, size_t count..){
> char chunk[CHUNK];
> int i,pos,
> for(i=0,pos=0;i<amount of data;i++,pos+=CHUNK){
> copy_from_user(buff,buf+pos,CHUNK);
> /* reorder data */
> /* not significant in throughput */
> for(k=0;k<CHUNK;k++){
> chunk[k]=buff[B_SM(j+k)];
> }
> memcpy_toio(MEM_reg+pos+i*CHUNK+j,chunk,CHUNK);
> }
> return count;
> }
> + lots of not important details-----
>
> MEM_reg=ioremap(pci_resource_start(dev,B_SM_MEM)&PCI_BASE_ADDRESS_MEM_MASK,REG_SIZE);
Don't know if this will improve the performance much but if you checked the
area you are copying from with access_ok you can replace copy_from_user by
__copy_from_user. The second version does not check the area and so saves some
cycles.
regards,
_sh_
On Fri, Nov 09, 2001 at 10:48:58AM +0000, Matthew Clark wrote:
>
> MEM_reg=ioremap(pci_resource_start(dev,B_SM_MEM)&PCI_BASE_ADDRESS_MEM_MASK,REG_SIZE);
>
Not directly related to your query, but a general observation: You don't
need to mask with PCI_BASE_ADDRESS_MEM_MASK - this is already handled by
the generic PCI layer when it reads the PCI BARs.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html