On some platforms (e.g. SGI Challenge, Origin, and Altix machines), writes to
I/O space aren't ordered coming from different CPUs. For the most part, this
isn't a problem since drivers generally spinlock around code that does writeX
calls, but if the last operation a driver does before it releases a lock is a
write and some other CPU takes the lock and immediately does a write, it's
possible the second CPU's write could arrive before the first's.
This patch adds a mmiowb() call to deal with this sort of situation, and
modifies the qla1280 driver to use it. The idea is to mirror the regular,
cacheable memory barrier operation, wmb (arg! I misnamed it in my patch,
I'll fix that up in the next rev assuming it's ok). Example:
CPU A: spin_lock_irqsave(&dev_lock, flags)
CPU A: ...
CPU A: writel(newval, ring_ptr);
CPU A: spin_unlock_irqrestore(&dev_lock, flags)
...
CPU B: spin_lock_irqsave(&dev_lock, flags)
CPU B: writel(newval2, ring_ptr);
CPU B: ...
CPU B: spin_unlock_irqrestore(&dev_lock, flags)
In this case, newval2 could be written to ring_ptr before newval. Fixing it
is easy though:
CPU A: spin_lock_irqsave(&dev_lock, flags)
CPU A: ...
CPU A: writel(newval, ring_ptr);
CPU A: mmiowb(); /* ensure no other writes beat us to the device */
CPU A: spin_unlock_irqrestore(&dev_lock, flags)
...
CPU B: spin_lock_irqsave(&dev_lock, flags)
CPU B: writel(newval2, ring_ptr);
CPU B: ...
CPU B: mmiowb();
CPU B: spin_unlock_irqrestore(&dev_lock, flags)
Note that this doesn't address a related case where the driver may want to
actually make a given write get to the device before proceeding. This should
be dealt with by immediately reading a register from the card that has no
side effects. According to the PCI spec, that will guarantee that all writes
have arrived before being sent to the target bus. If no such register is
available (in the case of card resets perhaps), reading from config space is
sufficient (though it may return all ones if the card isn't responding to
read cycles).
I'm also not quite sure if I got the ppc64 case right (and obviously I need to
add the macro for other platforms). The assumption is that *most* platforms
don't have this issue, and those that do have a chipset register or CPU
instruction to ensure ordering. If that turns out not to be the case, we can
add an address argument to the function.
Thanks,
Jesse