Here it is again, updated to apply against the BK tree as of a few minutes
ago. Patches to use the new routine in tg3 and qla1280 to follow. Boot
tested on Altix w/the tg3 and qla1280 bits applied.
On some platforms (e.g. SGI Challenge, Origin, and Altix machines), writes to
I/O space aren't ordered coming from different CPUs. For the most part, this
isn't a problem since drivers generally spinlock around code that does writeX
calls, but if the last operation a driver does before it releases a lock is a
write and some other CPU takes the lock and immediately does a write, it's
possible the second CPU's write could arrive before the first's.
This patch adds a mmiowb() call to deal with this sort of situation, and
adds some documentation describing I/O ordering issues to deviceiobook.tmpl.
The idea is to mirror the regular, cacheable memory barrier operation, wmb.
Example of the problem this new macro solves:
CPU A: spin_lock_irqsave(&dev_lock, flags)
CPU A: ...
CPU A: writel(newval, ring_ptr);
CPU A: spin_unlock_irqrestore(&dev_lock, flags)
...
CPU B: spin_lock_irqsave(&dev_lock, flags)
CPU B: writel(newval2, ring_ptr);
CPU B: ...
CPU B: spin_unlock_irqrestore(&dev_lock, flags)
In this case, newval2 could be written to ring_ptr before newval. Fixing it
is easy though:
CPU A: spin_lock_irqsave(&dev_lock, flags)
CPU A: ...
CPU A: writel(newval, ring_ptr);
CPU A: mmiowb(); /* ensure no other writes beat us to the device */
CPU A: spin_unlock_irqrestore(&dev_lock, flags)
...
CPU B: spin_lock_irqsave(&dev_lock, flags)
CPU B: writel(newval2, ring_ptr);
CPU B: ...
CPU B: mmiowb();
CPU B: spin_unlock_irqrestore(&dev_lock, flags)
Note that this doesn't address a related case where the driver may want to
actually make a given write get to the device before proceeding. This should
be dealt with by immediately reading a register from the card that has no
side effects. According to the PCI spec, that will guarantee that all writes
have arrived before being sent to the target bus. If no such register is
available (in the case of card resets perhaps), reading from config space is
sufficient (though it may return all ones if the card isn't responding to
read cycles). I've tried to describe how mmiowb() differs from PCI posted
write flushing in the patch to deviceiobook.tmpl.
Signed-off-by: Jesse Barnes <[email protected]>
Thanks,
Jesse
There are a few spots in qla1280.c that don't need a full PCI write flush to
the device, but rather a simple write ordering guarantee. This patch changes
some of the PIO reads that cause write flushes into mmiowb calls instead,
which is a lighter weight way of ensuring ordering.
Jes and James, can you ack this and/or push it in via the SCSI BK tree?
Thanks,
Jesse
Signed-off-by: Jeremy Higdon <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
This patch originally from Greg Banks. Some parts of the tg3 driver depend on
PIO writes arriving in order. This patch ensures that in two key places
using the new mmiowb macro. This not only prevents bugs (the queues can be
corrupted), but is much faster than ensuring ordering using PIO reads (which
involve a few round trips to the target bus on some platforms).
Arthur has another patch that uses mmiowb in tg3 that he posted earlier as
well.
Signed-off-by: Greg Banks <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]
Thanks,
Jesse
On Thu, 21 Oct 2004 16:28:06 -0700
Jesse Barnes <[email protected]> wrote:
> This patch originally from Greg Banks. Some parts of the tg3 driver depend on
> PIO writes arriving in order. This patch ensures that in two key places
> using the new mmiowb macro. This not only prevents bugs (the queues can be
> corrupted), but is much faster than ensuring ordering using PIO reads (which
> involve a few round trips to the target bus on some platforms).
Do other PCI systems which post PIO writes also potentially reorder
them just like this SGI system does? Just trying to get this situation
straight in my head.
On Thu, Oct 21, 2004 at 04:13:19PM -0700, Jesse Barnes wrote:
> This patch adds a mmiowb() call to deal with this sort of situation, and
> adds some documentation describing I/O ordering issues to deviceiobook.tmpl.
Jesse,
This looks overall pretty good. Just a few nits.
> The idea is to mirror the regular, cacheable memory barrier operation, wmb.
> Example of the problem this new macro solves:
>
> CPU A: spin_lock_irqsave(&dev_lock, flags)
> CPU A: ...
> CPU A: writel(newval, ring_ptr);
> CPU A: spin_unlock_irqrestore(&dev_lock, flags)
> ...
> CPU B: spin_lock_irqsave(&dev_lock, flags)
> CPU B: writel(newval2, ring_ptr);
> CPU B: ...
> CPU B: spin_unlock_irqrestore(&dev_lock, flags)
>
> In this case, newval2 could be written to ring_ptr before newval. Fixing it
> is easy though:
>
> CPU A: spin_lock_irqsave(&dev_lock, flags)
> CPU A: ...
> CPU A: writel(newval, ring_ptr);
> CPU A: mmiowb(); /* ensure no other writes beat us to the device */
> CPU A: spin_unlock_irqrestore(&dev_lock, flags)
> ...
> CPU B: spin_lock_irqsave(&dev_lock, flags)
> CPU B: writel(newval2, ring_ptr);
> CPU B: ...
> CPU B: mmiowb();
> CPU B: spin_unlock_irqrestore(&dev_lock, flags)
This is a great example and should be used instead of the qla1280 code
snippet that you used. Please just point at the source file that contains
the code and name the register usage this example represents.
Ie make it easy to find the example in real code without using
line numbers.
> I've tried to describe how mmiowb() differs from PCI posted
> write flushing in the patch to deviceiobook.tmpl.
Yes - I think this is alot clearer than the previous documents - thanks!
...
> +<programlisting>
> + sp->flags |= SRB_SENT;
> + ha->actthreads++;
> + WRT_REG_WORD(&reg->mailbox4, ha->req_ring_index);
> +
> + /*
> + * A Memory Mapped I/O Write Barrier is needed to ensure that this write
> + * of the request queue in register is ordered ahead of writes issued
> + * after this one by other CPUs. Access to the register is protected
> + * by the host_lock. Without the mmiowb, however, it is possible for
> + * this CPU to release the host lock, another CPU acquire the host lock,
> + * and write to the request queue in, and have the second write make it
> + * to the chip first.
> + */
> + mmiowb(); /* posted write ordering */
> +</programlisting>
This is the example code I'd like to see replaced with your
synthetic example above.
thanks,
grant
On Fri, 2004-10-22 at 09:40, David S. Miller wrote:
> On Thu, 21 Oct 2004 16:28:06 -0700
> Jesse Barnes <[email protected]> wrote:
>
> > This patch originally from Greg Banks. Some parts of the tg3 driver depend on
> > PIO writes arriving in order. This patch ensures that in two key places
> > using the new mmiowb macro. This not only prevents bugs (the queues can be
> > corrupted), but is much faster than ensuring ordering using PIO reads (which
> > involve a few round trips to the target bus on some platforms).
>
> Do other PCI systems which post PIO writes also potentially reorder
> them just like this SGI system does? Just trying to get this situation
> straight in my head.
I think the problem they have is really related to their spinlock, that
is the IO leaking out of the spinlock vs. other CPUs.
On the other hand, as I discussed with Jesse in the past, I'd like to
extend the semantics of mmiowb() to also define full barrier between
cacheable and non-cacheable accesses. That would help fixing a lot of issues
on ppc and ppc64.
Typically, our normal "light" write barrier doesn't reorder between cacheable
and non-cacheable (MMIO) stores, which is why we had to put some heavy sync
barrier in our MMIO writes macros.
By having an mmiowb() that allow to explicitely do this ordering, it would
allow us to relax the barriers in the MMIO macros, and so get back a few
percent of perfs that we lost with the "fix".
I haven't had time to work on that yet though, I need to review with paulus
all the possible race cases, but hopefully, I'll have a patch on top of
Jesse next week or so and will start converting more drivers.
Ben.
On Fri, 22 Oct 2004, Benjamin Herrenschmidt wrote:
> ...
> Typically, our normal "light" write barrier doesn't reorder between cacheable
> and non-cacheable (MMIO) stores, which is why we had to put some heavy sync
> barrier in our MMIO writes macros.
> ...
Do you mean "impose order" rather than "reorder" here?
--
Arthur
On Fri, 2004-10-22 at 11:33, [email protected] wrote:
> On Fri, 22 Oct 2004, Benjamin Herrenschmidt wrote:
>
> > ...
> > Typically, our normal "light" write barrier doesn't reorder between cacheable
> > and non-cacheable (MMIO) stores, which is why we had to put some heavy sync
> > barrier in our MMIO writes macros.
> > ...
>
> Do you mean "impose order" rather than "reorder" here?
Right.
Ben.
On Thursday, October 21, 2004 8:01 pm, Grant Grundler wrote:
> > +<programlisting>
> > + sp->flags |= SRB_SENT;
> > + ha->actthreads++;
> > + WRT_REG_WORD(&reg->mailbox4, ha->req_ring_index);
> > +
> > + /*
> > + * A Memory Mapped I/O Write Barrier is needed to ensure that
> > this write + * of the request queue in register is ordered ahead
> > of writes issued + * after this one by other CPUs. Access to the
> > register is protected + * by the host_lock. Without the mmiowb,
> > however, it is possible for + * this CPU to release the host lock,
> > another CPU acquire the host lock, + * and write to the request
> > queue in, and have the second write make it + * to the chip first.
> > + */
> > + mmiowb(); /* posted write ordering */
> > +</programlisting>
>
> This is the example code I'd like to see replaced with your
> synthetic example above.
Ok, that makes sense. I'd like to update the documentation with a separate
patch though, if that's ok with you. I think Greg had some ideas about other
things to cover as well. Greg?
Thanks,
Jesse
On Thursday, October 21, 2004 6:40 pm, David S. Miller wrote:
> On Thu, 21 Oct 2004 16:28:06 -0700
>
> Jesse Barnes <[email protected]> wrote:
> > This patch originally from Greg Banks. Some parts of the tg3 driver
> > depend on PIO writes arriving in order. This patch ensures that in two
> > key places using the new mmiowb macro. This not only prevents bugs (the
> > queues can be corrupted), but is much faster than ensuring ordering using
> > PIO reads (which involve a few round trips to the target bus on some
> > platforms).
>
> Do other PCI systems which post PIO writes also potentially reorder
> them just like this SGI system does? Just trying to get this situation
> straight in my head.
The HP guys claim that theirs don't, but PPC does, afaik. And clearly any
large system that posts PCI writes has the *potential* of reordering them.
Thanks,
Jesse
Jesse Barnes writes:
> On Thursday, October 21, 2004 6:40 pm, David S. Miller wrote:
> > On Thu, 21 Oct 2004 16:28:06 -0700
> >
> > Jesse Barnes <[email protected]> wrote:
> > > This patch originally from Greg Banks. Some parts of the tg3 driver
> > > depend on PIO writes arriving in order. This patch ensures that in two
> > > key places using the new mmiowb macro. This not only prevents bugs (the
> > > queues can be corrupted), but is much faster than ensuring ordering using
> > > PIO reads (which involve a few round trips to the target bus on some
> > > platforms).
> >
> > Do other PCI systems which post PIO writes also potentially reorder
> > them just like this SGI system does? Just trying to get this situation
> > straight in my head.
>
> The HP guys claim that theirs don't, but PPC does, afaik. And clearly any
> large system that posts PCI writes has the *potential* of reordering them.
No, PPC systems don't reorder writes to PCI devices. Provided you use
inl/outl/readl/writel et al., all PCI accesses from one processor are
strictly ordered, and if you use a spinlock, that gives you strict
access ordering between processors.
Our barrier instructions mostly order cacheable accesses separately
from non-cacheable accesses, except for the strongest barrier
instruction, which orders everything. Thus it would be useful for us
to have an explicit indication of when a cacheable write (i.e. to main
memory) has to be completed (from a PCI device's point of view) before
a non-cacheable device read or write (e.g. to kick off DMA).
Paul.
On Fri, 2004-10-22 at 13:05, Jesse Barnes wrote:
> I think Greg had some ideas about other
> things to cover as well. Greg?
You've done most of the stuff I wanted, but it would be nice to see:
* a mention of the read-to-flush-writes technique and why mmiowb
is better
* an example of how and where to use read_relaxed and a description
of why its better than read
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
>>>>> "Jesse" == Jesse Barnes <[email protected]> writes:
Jesse> There are a few spots in qla1280.c that don't need a full PCI
Jesse> write flush to the device, but rather a simple write ordering
Jesse> guarantee. This patch changes some of the PIO reads that cause
Jesse> write flushes into mmiowb calls instead, which is a lighter
Jesse> weight way of ensuring ordering.
Jesse> Jes and James, can you ack this and/or push it in via the SCSI
Jesse> BK tree?
ACK!
James will you push this one?
Cheers,
Jes
On Thu, Oct 21, 2004 at 10:05:50PM -0500, Jesse Barnes wrote:
> Ok, that makes sense. I'd like to update the documentation with a separate
> patch though, if that's ok with you.
Sure, if that's easier for you.
grant
Returning from tg3_poll() without flushing the PIO write which
reenables interrupts can result in lower cpu utilization and higher
throughput. So use a memory barrier, mmiowb(), instead of flushing the
write with a PIO read.
Signed-off-by: Arthur Kepner <[email protected]>
---
--- linux-2.6.9-rc4-mm1.orig/drivers/net/tg3.c 2004-10-22 13:51:10.000000000 -0700
+++ linux-2.6.9-rc4-mm1/drivers/net/tg3.c 2004-10-22 13:53:36.000000000 -0700
@@ -417,6 +417,20 @@
tg3_cond_int(tp);
}
+/* tg3_restart_ints
+ * similar to tg3_enable_ints, but it can return without flushing the
+ * PIO write which reenables interrupts
+ */
+static void tg3_restart_ints(struct tg3 *tp)
+{
+ tw32(TG3PCI_MISC_HOST_CTRL,
+ (tp->misc_host_ctrl & ~MISC_HOST_CTRL_MASK_PCI_INT));
+ tw32_mailbox(MAILBOX_INTERRUPT_0 + TG3_64BIT_REG_LOW, 0x00000000);
+ mmiowb();
+
+ tg3_cond_int(tp);
+}
+
static inline void tg3_netif_stop(struct tg3 *tp)
{
netif_poll_disable(tp->dev);
@@ -2787,7 +2801,7 @@
if (done) {
spin_lock_irqsave(&tp->lock, flags);
__netif_rx_complete(netdev);
- tg3_enable_ints(tp);
+ tg3_restart_ints(tp);
spin_unlock_irqrestore(&tp->lock, flags);
}
--
Arthur
On Thu, 2004-10-21 at 19:17, Jesse Barnes wrote:
> There are a few spots in qla1280.c that don't need a full PCI write flush to
> the device, but rather a simple write ordering guarantee. This patch changes
> some of the PIO reads that cause write flushes into mmiowb calls instead,
> which is a lighter weight way of ensuring ordering.
>
> Jes and James, can you ack this and/or push it in via the SCSI BK tree?
This doesn't seem to work:
CC [M] drivers/scsi/qla1280.o
drivers/scsi/qla1280.c: In function `qla1280_64bit_start_scsi':
drivers/scsi/qla1280.c:3404: warning: implicit declaration of function
`mmiowb'
MODPOST
*** Warning: "mmiowb" [drivers/scsi/qla1280.ko] undefined!
James
On Sunday, October 24, 2004 9:20 am, James Bottomley wrote:
> On Thu, 2004-10-21 at 19:17, Jesse Barnes wrote:
> > There are a few spots in qla1280.c that don't need a full PCI write flush
> > to the device, but rather a simple write ordering guarantee. This patch
> > changes some of the PIO reads that cause write flushes into mmiowb calls
> > instead, which is a lighter weight way of ensuring ordering.
> >
> > Jes and James, can you ack this and/or push it in via the SCSI BK tree?
>
> This doesn't seem to work:
>
> CC [M] drivers/scsi/qla1280.o
> drivers/scsi/qla1280.c: In function `qla1280_64bit_start_scsi':
> drivers/scsi/qla1280.c:3404: warning: implicit declaration of function
> `mmiowb'
>
> MODPOST
> *** Warning: "mmiowb" [drivers/scsi/qla1280.ko] undefined!
Only works in Andrew's tree until he pushes mmiowb to Linus.
Jesse
Jesse Barnes <[email protected]> wrote:
>
> > *** Warning: "mmiowb" [drivers/scsi/qla1280.ko] undefined!
>
> Only works in Andrew's tree until he pushes mmiowb to Linus.
I haven't been following this stuff at all closely and I need help
determining which patches should be pushed, and when. Please.
On Monday, October 25, 2004 12:02 pm, Andrew Morton wrote:
> Jesse Barnes <[email protected]> wrote:
> > > *** Warning: "mmiowb" [drivers/scsi/qla1280.ko] undefined!
> >
> > Only works in Andrew's tree until he pushes mmiowb to Linus.
>
> I haven't been following this stuff at all closely and I need help
> determining which patches should be pushed, and when. Please.
I think it's safe to push the main mmiowb() patch now, then James and the
netdev tree can start including driver changes to make use of it.
Thanks,
Jesse