2009-01-07 23:04:02

by Om Narasimhan

Subject: 64 bit PCI access using MMX register -- how?

Hi,
I have buggy hardware with many 64-bit registers. (The bug: when a
64-bit read/write is split into two 32-bit reads/writes, the hardware
can reorder them under high load. As a result, the values read or
written are not correct.)

To avoid data corruption on 32-bit Linux when implementing `readq` as
two back-to-back `readl` calls, I would need a lock serializing access
to the whole PIO space. To avoid locking, I implemented a friend's
suggestion to use an MMX register, like below.

#if BITS_PER_LONG == 32
static inline void hxge_writeq32(u64 val, void __iomem *addr)
{
	u64 tmp = 0;

	__asm__ __volatile__ (
		"movq %%mm0, %[t]\n\t"		/* save caller's %mm0 */
		"movq %[d], %%mm0\n\t"		/* load the 64-bit value */
		"movl %[a], %%eax\n\t"
		"movq %%mm0, (%%eax)\n\t"	/* one 64-bit MMIO store */
		"movq %[t], %%mm0\n\t"		/* restore %mm0 */
		: [t] "+m" (tmp)
		: [a] "r" (addr), [d] "m" (val)
		: "%eax");
	smp_wmb();
}

static inline unsigned long long hxge_readq32(void __iomem *addr)
{
	u64 var = 0, tmp = 0xfeedfacedeadbeefULL;

	printk(KERN_ERR "addr = %#lx\n", (unsigned long) addr);
	__asm__ __volatile__ (
		"movq %%mm0, %[t]\n\t"		/* save caller's %mm0 */
		"movl %[a], %%eax\n\t"
		"movq (%%eax), %%mm0\n\t"	/* one 64-bit MMIO load */
		"movq %%mm0, %[r]\n\t"
		"movq %[t], %%mm0\n\t"		/* restore %mm0 */
		: [r] "=m" (var), [t] "+m" (tmp)
		: [a] "r" (addr)
		: "%eax");
	smp_rmb();
	printk(KERN_ERR "tmp = %#llx, var = %#llx\n", tmp, var);
	return var;
}

#endif

All the reads return 0xFFFFFFFFFFFFFFFF when tried on ioremap_nocache()'d
PCI address space.
E.g, output from dmesg:

addr= 0xf9f80010
tmp = 0x0, var = 0xffffffffffffffff

data = 0xffffffffffffffff, da = 0xbc000004

(da is the result of a 32-bit read of the same address)

Any idea what could be wrong? Any suggestions / pointers?
(I am not subscribed, please CC: me.)

I tried reading and writing an ordinary memory location (a global u64
variable); that works fine. Only PCI access gives a problem.

A PCI access in the form of
val = *((u64 *)ioremapped_address);
gets me the expected result. But my question is: is this operation
atomic on a 32-bit arch?
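
(For comparison, the locked fallback I mention above would be roughly
the following sketch; the lock name is made up, and the whole thing is
untested.)

```c
static DEFINE_SPINLOCK(hxge_reg_lock);	/* serializes all split 64-bit accesses */

static inline u64 hxge_readq32_locked(void __iomem *addr)
{
	unsigned long flags;
	u32 lo, hi;

	spin_lock_irqsave(&hxge_reg_lock, flags);
	lo = readl(addr);		/* low dword first */
	hi = readl(addr + 4);		/* then high dword, under the same lock */
	spin_unlock_irqrestore(&hxge_reg_lock, flags);

	return ((u64) hi << 32) | lo;
}
```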

Thanks,
Om.


2009-01-07 23:39:22

by Alan

Subject: Re: 64 bit PCI access using MMX register -- how?

One other problem: the kernel doesn't save the FPU state on context
switches or IRQ entry (takes far too long) so that will make a nasty mess.

2009-01-08 05:52:31

by Andi Kleen

Subject: Re: 64 bit PCI access using MMX register -- how?

Alan Cox <[email protected]> writes:

> One other problem: the kernel doesn't save the FPU state on context
> switches or IRQ entry (takes far too long) so that will make a nasty mess.


I think he was ok because he saved the MMX state by itself, except:

- There was no guarantee that the FPU is in MMX state, not x87 state
- He'll often get a lazy fpu save exception. This used to BUG()
in some cases when invoked from kernel space (but that might have been
changed now). Better is to disable this explicitly around
the access (like in kernel_fpu_begin()/end())
- Doing this all properly is fairly expensive and I suspect
just using a lock will be cheaper.
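
A minimal sketch of the kernel_fpu_begin()/end() variant, for
illustration only (the function name is assumed, and the header is
the 2009-era location of the API):

```c
#include <asm/i387.h>	/* kernel_fpu_begin()/kernel_fpu_end() */

/* Sketch: let the kernel save the FPU/MMX state and disable preemption
 * for us, then do a single 64-bit MMIO load with an MMX register. */
static inline u64 hxge_readq32_fpu(void __iomem *addr)
{
	u64 val;

	kernel_fpu_begin();		/* saves FPU state, disables preemption */
	__asm__ __volatile__(
		"movq (%[a]), %%mm0\n\t"	/* one 64-bit MMIO load */
		"movq %%mm0, %[v]\n\t"
		: [v] "=m" (val)
		: [a] "r" (addr));
	kernel_fpu_end();		/* restores FPU state */

	return val;
}
```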

-Andi


--
[email protected]

2009-01-08 06:46:56

by Roland Dreier

Subject: Re: 64 bit PCI access using MMX register -- how?

> I think he was ok because he saved the MMX state by itself, except:
>
> - There was no guarantee that the FPU is in MMX state, not x87 state
> - He'll often get a lazy fpu save exception. This used to BUG()
> in some cases when invoked from kernel space (but that might have been
> changed now). Better is to disable this explicitly around
> the access (like in kernel_fpu_begin()/end())
> - Doing this all properly is fairly expensive and I suspect
> just using a lock will be cheaper.

I had some code a long time ago that used SSE (I think movlps was the
opcode I chose) to get an atomic 64-bit PIO operation. To do that, I
just needed to disable preemption and save/restore cr0 around the SSE
operation, and just save/restore the single xmm register I used. Of
course it only works on CPUs that have SSE. That avoids the nastiness
of x87/mmx state, but in the end a spinlock around two readl()s was
faster and a ton simpler, so I threw all that code away.
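
The SSE approach described above might have looked roughly like this
(a reconstruction from the description, not his original code; all
names are illustrative, and it is untested). movlps only touches the
low 64 bits of %xmm0, so saving and restoring 8 bytes of it suffices:

```c
static inline u64 sse_readq(void __iomem *addr)
{
	u64 val, xmm_save;
	unsigned long cr0;

	preempt_disable();
	cr0 = read_cr0();
	clts();				/* clear CR0.TS: no device-not-available fault */
	asm volatile(
		"movlps %%xmm0, %[s]\n\t"	/* save low 64 bits of %xmm0 */
		"movlps %[m], %%xmm0\n\t"	/* one 64-bit load from MMIO */
		"movlps %%xmm0, %[v]\n\t"
		"movlps %[s], %%xmm0\n\t"	/* restore %xmm0 */
		: [v] "=m" (val), [s] "+m" (xmm_save)
		: [m] "m" (*(const u64 __force *) addr));
	write_cr0(cr0);			/* put CR0.TS back the way it was */
	preempt_enable();

	return val;
}
```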

- R.