2006-10-24 19:13:22

by Roland Dreier

[permalink] [raw]
Subject: Ordering between PCI config space writes and MMIO reads?

John Partridge found an interesting bug involving mthca (Mellanox
InfiniBand HCA driver) on IA64/Altix systems. Basically, during
initialization, mthca does:

- do some config writes, including enabling BARs
- then start a firmware command
- read an MMIO register from a BAR (to check if FW is busy)

However, John found that the Altix PCI-X bridge was allowing the MMIO
read to start before the config write was done (which is allowed by
the PCI spec). The PCI trace looked like:

23454: Config Write REG = 01 TYPE = 1 BE = 0000 Req = (0,0,0) Tag = 1 Bus = 1 Device = 0 Function = 0 WAIT = 2
23462: Memory Rd DW A = 00280698 BE = 0000 Req = (0,0,0) Tag = 0 WAIT = 2
23470: Split compl. Lower A = 00 Req = (0,0,0) Tag = 0 Comp = (0,2,0) WAIT = 1 (Error completion)
23476: Split compl. Lower A = 00 Req = (0,0,0) Tag = 1 Comp = (0,2,0) WAIT = 1 (Normal completion of WRITE)

and that "Error completion" leads to a crash.

John proposed the following patch to fix this, which looks good to
me. However, I have a couple of questions about this situation:

1) Is this something that should be fixed in the driver? The PCI
spec allows MMIO cycles to start before an earlier config cycle
completed, but do we want to expose this fact to drivers? Would
it be better for ia64 to use some sort of barrier to make sure
pci_write_config_xxx() is strongly ordered with MMIO?

2) Is this issue lurking in other drivers?

Thanks,
Roland

commit 424b50b6360b325ce642ece687756a600c25d28a
Author: John Partridge <[email protected]>
Date: Tue Oct 24 11:54:16 2006 -0700

IB/mthca: Make sure all PCI config writes reach device before doing MMIO

During initialization, mthca writes some PCI config space registers
and then does an MMIO read from one of the BARs it just enabled. This
MMIO read sometimes failed and caused a crash on SGI Altix machines,
because the PCI-X host bridge (legitimately, according to the PCI
spec) allowed the MMIO read to start before the config write completed.

To fix this, add a config read after all config writes to make sure
they are all done before starting the MMIO read.

Signed-off-by: John Partridge <[email protected]>
Signed-off-by: Roland Dreier <[email protected]>

diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c
index 91934f2..578dc7c 100644
--- a/drivers/infiniband/hw/mthca/mthca_reset.c
+++ b/drivers/infiniband/hw/mthca/mthca_reset.c
@@ -281,6 +281,20 @@ good:
goto out;
}

+ /*
+ * Perform a "flush" of the PCI config writes here by reading
+ * the PCI_COMMAND register. This is needed to make sure that
+ * we don't try to touch other PCI BARs before the config
+ * writes are done -- otherwise an MMIO cycle could start
+ * before the config writes are done and reach the HCA before
+ * the BAR is actually enabled.
+ */
+ if (pci_read_config_dword(mdev->pdev, PCI_COMMAND, hca_header)) {
+ err = -ENODEV;
+ mthca_err(mdev, "Couldn't access HCA memory after restoring, "
+ "aborting.\n");
+ }
+
out:
if (bridge)
pci_dev_put(bridge);


2006-10-24 19:22:13

by Jeff Garzik

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 24, 2006 at 12:13:19PM -0700, Roland Dreier wrote:
> 1) Is this something that should be fixed in the driver? The PCI
> spec allows MMIO cycles to start before an earlier config cycle
> completed, but do we want to expose this fact to drivers? Would
> it be better for ia64 to use some sort of barrier to make sure
> pci_write_config_xxx() is strongly ordered with MMIO?

The PCI config APIs have traditionally enforced very strong ordering.
Heck, the PCI config APIs often take a spinlock on each read or write;
so they are definitely not intended to be as fast as MMIO.

Jeff



2006-10-24 21:24:52

by Alan

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

Ar Maw, 2006-10-24 am 12:13 -0700, ysgrifennodd Roland Dreier:
> 1) Is this something that should be fixed in the driver? The PCI
> spec allows MMIO cycles to start before an earlier config cycle
> completed, but do we want to expose this fact to drivers? Would
> it be better for ia64 to use some sort of barrier to make sure
> pci_write_config_xxx() is strongly ordered with MMIO?

It is good to be conservative in this area. Some AMD chipsets at least
had ordering problems with some configurations in the K7 era.


2006-10-24 21:30:08

by Roland Dreier

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

> It is good to be conservative in this area. Some AMD chipsets at least
> had ordering problems with some configurations in the K7 era.

Could you expand a little? Do you mean that the arch implementation
of pci_write_config_xxx() should have extra barriers, or that drivers
should do belt-and-suspenders flushes to make sure config writes are
really done properly?

- R.

2006-10-24 21:37:47

by Jeff Garzik

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 24, 2006 at 02:29:47PM -0700, Roland Dreier wrote:
> > It is good to be conservative in this area. Some AMD chipsets at least
> > had ordering problems with some configurations in the K7 era.
>
> Could you expand a little? Do you mean that the arch implementation
> of pci_write_config_xxx() should have extra barriers, or that drivers
> should do belt-and-suspenders flushes to make sure config writes are
> really done properly?

Drivers are -already- written to assume the pci_write_config_xxx() has
the requisite barriers. The fix doesn't belong in the drivers.

Jeff



2006-10-24 21:47:27

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 24, 2006 at 03:22:10PM -0400, Jeff Garzik wrote:
> The PCI config APIs have traditionally enforced very strong ordering.
> Heck, the PCI config APIs often take a spinlock on each read or write;
> so they are definitely not intended to be as fast as MMIO.

s/often/always/. It's implemented in drivers/pci/access.c.

I think the right way to fix this is to ensure mmio write ordering in
the pci_write_config_*() implementations. Like this.

Signed-off-by: Matthew Wilcox <[email protected]>

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index ea16805..c80f1ba 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -1,6 +1,6 @@
#include <linux/pci.h>
#include <linux/module.h>
-#include <linux/ioport.h>
+#include <linux/io.h>

#include "pci.h"

@@ -45,6 +45,7 @@ int pci_bus_write_config_##size \
if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER; \
spin_lock_irqsave(&pci_lock, flags); \
res = bus->ops->write(bus, devfn, pos, len, value); \
+ mmiowb(); \
spin_unlock_irqrestore(&pci_lock, flags); \
return res; \
}
@@ -102,6 +103,7 @@ int pci_user_write_config_##size \
if (likely(!dev->block_ucfg_access)) \
ret = dev->bus->ops->write(dev->bus, dev->devfn, \
pos, sizeof(type), val); \
+ mmiowb(); \
spin_unlock_irqrestore(&pci_lock, flags); \
return ret; \
}

2006-10-24 21:51:34

by Roland Dreier

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

> I think the right way to fix this is to ensure mmio write ordering in
> the pci_write_config_*() implementations. Like this.

I'm happy to fix this in the PCI core and not force drivers to worry
about this.

John, can you confirm that this patch fixes the issue for you?

Thanks,
Roland

2006-10-24 22:15:08

by John Partridge

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

Roland Dreier wrote:
> > I think the right way to fix this is to ensure mmio write ordering in
> > the pci_write_config_*() implementations. Like this.
>
> I'm happy to fix this in the PCI core and not force drivers to worry
> about this.
>
> John, can you confirm that this patch fixes the issue for you?
>
> Thanks,
> Roland

I'll give it a try and get back to you.

John

--
John Partridge

Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: [email protected]

2006-10-24 22:36:34

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 24, 2006 at 02:51:30PM -0700, Roland Dreier wrote:
> > I think the right way to fix this is to ensure mmio write ordering in
> > the pci_write_config_*() implementations. Like this.
>
> I'm happy to fix this in the PCI core and not force drivers to worry
> about this.
>
> John, can you confirm that this patch fixes the issue for you?

Hang on. I wasn't thinking clearly. mmiowb() only ensures the write
has got as far as the shub. There's no way to fix this in the pci core
-- any PCI-PCI bridge can reorder the two.

This is only really a problem for setup (when we program the BARs), so
it seems silly to enforce an ordering at any other time. Reluctantly, I
must disagree with Jeff -- drivers need to fix this.

2006-10-24 22:43:46

by David Miller

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

From: Matthew Wilcox <[email protected]>
Date: Tue, 24 Oct 2006 16:36:32 -0600

> This is only really a problem for setup (when we program the BARs), so
> it seems silly to enforce an ordering at any other time. Reluctantly, I
> must disagree with Jeff -- drivers need to fix this.

One thing is that we definitely don't want to fix this by,
for example, reading back the PCI_COMMAND register or something
like that. That causes two problems:

1) Some PCI config writes shut the device down and make it
no respond to some kinds of PCI config transactions.
One example is putting the device into D3 or similar
power state, another is performing a device reset.

2) Several drivers use PCI config space accesses to touch the
main registers in order to workaround bugs in the PCI-X
implementation of their chip or similar (tg3 has a few
cases like this), doing a PCI config space readback will
kill performance quite a bit for an already slow situation.

In fact, I do recall that one of the x86 PCI config space access
implementations did a readback like this, and we had to remove
it because it caused problems when doing a reset on tg3 chips
when using PCI config space register write to do the reset.

2006-10-24 23:00:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [openib-general] Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 24, 2006 at 04:36:32PM -0600, Matthew Wilcox wrote:
> On Tue, Oct 24, 2006 at 02:51:30PM -0700, Roland Dreier wrote:
> > > I think the right way to fix this is to ensure mmio write ordering in
> > > the pci_write_config_*() implementations. Like this.
> >
> > I'm happy to fix this in the PCI core and not force drivers to worry
> > about this.
> >
> > John, can you confirm that this patch fixes the issue for you?

> Hang on. I wasn't thinking clearly. mmiowb() only ensures the write
> has got as far as the shub. There's no way to fix this in the pci core

What about shifting the requirement down to the platform? Ie on ia64
it would seem that inb/outb already solve this problem via mf.a.

All platforms that support inb/outb correctly must have a
synchronizing primitive for outb..

> This is only really a problem for setup (when we program the BARs), so
> it seems silly to enforce an ordering at any other time. Reluctantly, I
> must disagree with Jeff -- drivers need to fix this.

I'm not sure that can work either. The PCI-X spec is very clear, you
must wait for a non-posted completion if you care about order. Doing a
config read in the driver as a surrogate flush is not good enough in
the general case. Like you say, a pci bridge is free to reorder all
in flight non-posted operations.

Jason

2006-10-24 23:09:20

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

Quoting r. Matthew Wilcox <[email protected]>:
> Subject: Re: Ordering between PCI config space writes and MMIO reads?
>
> On Tue, Oct 24, 2006 at 02:51:30PM -0700, Roland Dreier wrote:
> > > I think the right way to fix this is to ensure mmio write ordering in
> > > the pci_write_config_*() implementations. Like this.
> >
> > I'm happy to fix this in the PCI core and not force drivers to worry
> > about this.
> >
> > John, can you confirm that this patch fixes the issue for you?
>
> Hang on. I wasn't thinking clearly. mmiowb() only ensures the write
> has got as far as the shub. There's no way to fix this in the pci core
> -- any PCI-PCI bridge can reorder the two.
>
> This is only really a problem for setup (when we program the BARs), so
> it seems silly to enforce an ordering at any other time. Reluctantly, I
> must disagree with Jeff -- drivers need to fix this.

This can be true for any bridge.
Most arches, however, simply block until config write completes - this is why
driver doesn't issue any MMIO writes - and this is what we are looking for here
- a way to block the CPU until split completion for config write arrives.

By the way, e.g. the PCI Express spec says:
"Read Requests and I/O or Configuration Write Requests are permitted to be
blocked by or to pass other Read Requests and I/O or Configuration Write Requests."
so it is not clear that doing a config read will always flush all config writes
as you want.

--
MST

2006-10-24 23:28:29

by tip-bot for Jack Steiner

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 24, 2006 at 04:36:32PM -0600, Matthew Wilcox wrote:
> On Tue, Oct 24, 2006 at 02:51:30PM -0700, Roland Dreier wrote:
> > > I think the right way to fix this is to ensure mmio write ordering in
> > > the pci_write_config_*() implementations. Like this.
> >
> > I'm happy to fix this in the PCI core and not force drivers to worry
> > about this.
> >
> > John, can you confirm that this patch fixes the issue for you?
>
> Hang on. I wasn't thinking clearly. mmiowb() only ensures the write
> has got as far as the shub.

I think mmiowb() should work on SN hardware. mmiowb() delays until shub
reports that all previously issued PIO writes have completed.

The processor "mf.a" guarantees "platform acceptance" which on SN means
that shub has accepted the write - not that it has actually completed (or
even forwarded anywhere by shub). That makes "mf.a" more-or-less useless
on SN. However, shub has an additional MMR register (PIO_WRITE_COUNT) that
counts actual outstanding PIOs. mmiob() delays until that count goes to
zero.

I'll check if there is any additional reordering that can occur AFTER the
PIO_WRITE_COUNT goes to zero. If so, it would be at bus level - not in
shub or routers.


> There's no way to fix this in the pci core
> -- any PCI-PCI bridge can reorder the two.
>
> This is only really a problem for setup (when we program the BARs), so
> it seems silly to enforce an ordering at any other time. Reluctantly, I
> must disagree with Jeff -- drivers need to fix this.

-- jack

2006-10-25 06:30:28

by Grant Grundler

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 24, 2006 at 12:13:19PM -0700, Roland Dreier wrote:
> John Partridge found an interesting bug involving mthca (Mellanox
> InfiniBand HCA driver) on IA64/Altix systems. Basically, during
> initialization, mthca does:
>
> - do some config writes, including enabling BARs
> - then start a firmware command
> - read an MMIO register from a BAR (to check if FW is busy)
>
> However, John found that the Altix PCI-X bridge was allowing the MMIO
> read to start before the config write was done (which is allowed by
> the PCI spec).

Can someone provide a quote of the PCI Local bus spec that allows this?
(Or at least a reference to a spec version and section number)

> The PCI trace looked like:
>
> 23454: Config Write REG = 01 TYPE = 1 BE = 0000 Req = (0,0,0) Tag = 1 Bus = 1 Device = 0 Function = 0 WAIT = 2
> 23462: Memory Rd DW A = 00280698 BE = 0000 Req = (0,0,0) Tag = 0 WAIT = 2
> 23470: Split compl. Lower A = 00 Req = (0,0,0) Tag = 0 Comp = (0,2,0) WAIT = 1 (Error completion)
> 23476: Split compl. Lower A = 00 Req = (0,0,0) Tag = 1 Comp = (0,2,0) WAIT = 1 (Normal completion of WRITE)
>
> and that "Error completion" leads to a crash.
>
> John proposed the following patch to fix this, which looks good to
> me. However, I have a couple of questions about this situation:
>
> 1) Is this something that should be fixed in the driver? The PCI
> spec allows MMIO cycles to start before an earlier config cycle
> completed, but do we want to expose this fact to drivers?

I would prefer we did not.

> Would
> it be better for ia64 to use some sort of barrier to make sure
> pci_write_config_xxx() is strongly ordered with MMIO?

That would be my preference.

> 2) Is this issue lurking in other drivers?
>
> Thanks,
> Roland
>
> commit 424b50b6360b325ce642ece687756a600c25d28a
> Author: John Partridge <[email protected]>
> Date: Tue Oct 24 11:54:16 2006 -0700
>
> IB/mthca: Make sure all PCI config writes reach device before doing MMIO
>
> During initialization, mthca writes some PCI config space registers
> and then does an MMIO read from one of the BARs it just enabled. This
> MMIO read sometimes failed and caused a crash on SGI Altix machines,
> because the PCI-X host bridge (legitimately, according to the PCI
> spec) allowed the MMIO read to start before the config write completed.

Because of this past discussion with jesse barnes, I'm leary of
any kind of writes traveling through SN2 fabric. The issue is
described pretty well here:
http://www.usenetlinux.com/archive/topic.php/t-49141.html

I don't know that this is the same (or similar) problem.

> To fix this, add a config read after all config writes to make sure
> they are all done before starting the MMIO read.
>
> Signed-off-by: John Partridge <[email protected]>
> Signed-off-by: Roland Dreier <[email protected]>
>
> diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c
> index 91934f2..578dc7c 100644
> --- a/drivers/infiniband/hw/mthca/mthca_reset.c
> +++ b/drivers/infiniband/hw/mthca/mthca_reset.c
> @@ -281,6 +281,20 @@ good:
> goto out;
> }
>
> + /*
> + * Perform a "flush" of the PCI config writes here by reading
> + * the PCI_COMMAND register. This is needed to make sure that
> + * we don't try to touch other PCI BARs before the config
> + * writes are done -- otherwise an MMIO cycle could start
> + * before the config writes are done and reach the HCA before
> + * the BAR is actually enabled.
> + */

If this code is accepted, the comment should provide a specific reference
(PCI Version + section) to the PCI spec that allows the out-of-order.

I agree with jgarzik that the drivers already expect config cycles
to be ordered with respect to MMIO cycles.

I'm looking at arch/ia64/pci/pci.c.
Wouldn't it be reasonable to include memory barriers around calls
to SAL config space access functions?

thanks,
grant

2006-10-25 14:05:12

by Roland Dreier

[permalink] [raw]
Subject: Re: [openib-general] Ordering between PCI config space writes and MMIO reads?

> I'm not sure that can work either. The PCI-X spec is very clear, you
> must wait for a non-posted completion if you care about order. Doing a
> config read in the driver as a surrogate flush is not good enough in
> the general case. Like you say, a pci bridge is free to reorder all
> in flight non-posted operations.

No, hang on. Nothing can reorder a dependent read to start after a
write that it depends on, can it? So a config read of PCI_COMMAND
can't start until the completion of a config write of the same
register, right?

- R.

2006-10-25 14:06:04

by Roland Dreier

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

> I'll check if there is any additional reordering that can occur AFTER the
> PIO_WRITE_COUNT goes to zero. If so, it would be at bus level - not in
> shub or routers.

Unfortunately, at least in theory, the reordering can occur. For
example a bridge on some card plugged into an SN slot is allowed to
reorder things too.

- R.

2006-10-25 14:11:10

by Roland Dreier

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

> I'm looking at arch/ia64/pci/pci.c.
> Wouldn't it be reasonable to include memory barriers around calls
> to SAL config space access functions?

It's reasonable, but is there a memory barrier strong enough to
guarantee that a config write has actually completed?

- R.

2006-10-25 14:15:25

by Roland Dreier

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

> One thing is that we definitely don't want to fix this by,
> for example, reading back the PCI_COMMAND register or something
> like that. That causes two problems:
>
> 1) Some PCI config writes shut the device down and make it
> no respond to some kinds of PCI config transactions.
> One example is putting the device into D3 or similar
> power state, another is performing a device reset.

Hmm... it seems there is no other guaranteed way to make sure a config
write has really completed except doing a config read. And only the
driver knows what the config access it's doing means. So the
conclusion we seem to be forced into is that drivers need to include
these reads in the cases where they are needed.

- R.

2006-10-25 14:19:04

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Wed, Oct 25, 2006 at 12:30:22AM -0600, Grant Grundler wrote:
> Can someone provide a quote of the PCI Local bus spec that allows this?
> (Or at least a reference to a spec version and section number)

PCI-PCI bridges are allowed to do it. If you look in table E-1 of PCI
2.3, or table 8-3 of PCI-X 2.0, you'll see that a Posted Memory Write
can pass a Delayed Write Request (or in PCI-X, a Memory Write can pass a
Split Write Request).

So mmiowb() will solve the problem for Altix, but leave everybody else
vulnerable. I actually don't see a way of forcing the config write to
complete before a memory write -- everything is allowed to pass a config
write, even a config read. I initially thought "But only a crack monkey
would implement a system where a config read could pass a config write",
but the spec explains that:

In most PCI-X implementations, Split Requests are managed in separate
buffers from Split Completions, so Split Requests naturally pass Split
Completions. However, no deadlocks occur if Split Completions block
Split Requests.

So all this code that checks to see if a write had an effect is unsafe.
I'm a little perturbed by this. It means the only way to reliably
distinguish between a write that hasn't taken effect yet and a bit (say,
MWI) the device hasn't implemented is to do a memory access to the
device. Which is hard when you're trying to program the BARs.

I suppose this hasn't bitten us before in, what, 7 years of PCI-X, so
it can't be *that* common a thing for bridges to do. And we would have
noticed the BAR sizing code going wrong (as it does config write
followed immediately by config read), so maybe implementations aren't as
crackful as the PCI spec seems to permit them to be.

I find it really hard to believe the PCI committee have done something
this stupid. There must be another rule somewhere that I'm missing.

2006-10-25 17:16:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [openib-general] Ordering between PCI config space writes and MMIO reads?

On Wed, Oct 25, 2006 at 08:18:59AM -0600, Matthew Wilcox wrote:

> PCI-PCI bridges are allowed to do it. If you look in table E-1 of PCI
> 2.3, or table 8-3 of PCI-X 2.0, you'll see that a Posted Memory Write
> can pass a Delayed Write Request (or in PCI-X, a Memory Write can pass a
> Split Write Request).

Carefull here.. Only MMIO writes are of the posted variety. Non posted
transactions (config write, IO write, all reads) are not ever allowed
to pass a posted write, but they can be re-ordered.

Table 8-3 shows that Split Write vs Split Read are all Y/N
meaning they could be re-ordered with respect to each other.

The problem with mthca is exactly this, although the operations were
issued in-order by the bridges the end device (mthca) is free to
complete them in any order, and it choose to complete the MMIO read
before the config write.

> In most PCI-X implementations, Split Requests are managed in separate
> buffers from Split Completions, so Split Requests naturally pass Split
> Completions. However, no deadlocks occur if Split Completions block
> Split Requests.

Again, this is only for posted writes. All these rules are designed to
prevent the bus from deadlocking due to buffer starvation under
certain situations. [Basically split completions and posted writes are
given a seperate queue that can advance if the request queue is
stalled. Otherwise you can deadlock a bridge]

> So all this code that checks to see if a write had an effect is unsafe.
> I'm a little perturbed by this. It means the only way to reliably

MMIO based code that does this is correct and reliable.. PIO code that
does this is only safe if the platform is waiting for the PIO write
completion before starting the PIO read.

PCI-X (pg 80) says this about non-posted transaction ordering requirements:
As in convention PCI, if a requester requires one non-posted
transaction to complete before another, it must not initiate the
second transaction until the first one compeltes.

IMHO, all sane hardware implementations of config ops and PIO should
block the host bridge until the completion is generated by the end
device..I'm sure that most x86 platforms do this. (For instance I've
observed this kind of behavior with a Hyper Transport probe on
Opterons)

This is more than just worrying about ordering, it is about how to
engage a platform specific way to know that the completion has been
generated. The person who suggested polling the PIO_OUTSTANDING
register on SN2 seems to have the right idea (if that counts all
pending non-posted operations, not just PIO ones) :|

The risk of re-ordering is probably not so much in the bridges since
that would be a fairly strange thing to do - but it is very likely in
end-devices. This is especially true if the accesses are to different
internal resources!

Jason

2006-10-25 18:22:47

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

Quoting r. Matthew Wilcox <[email protected]>:
> Subject: Re: Ordering between PCI config space writes and MMIO reads?
>
> On Wed, Oct 25, 2006 at 12:30:22AM -0600, Grant Grundler wrote:
> > Can someone provide a quote of the PCI Local bus spec that allows this?
> > (Or at least a reference to a spec version and section number)
>
> PCI-PCI bridges are allowed to do it. If you look in table E-1 of PCI
> 2.3, or table 8-3 of PCI-X 2.0, you'll see that a Posted Memory Write
> can pass a Delayed Write Request (or in PCI-X, a Memory Write can pass a
> Split Write Request).
>
> So mmiowb() will solve the problem for Altix, but leave everybody else
> vulnerable. I actually don't see a way of forcing the config write to
> complete before a memory write -- everything is allowed to pass a config
> write, even a config read. I initially thought "But only a crack monkey
> would implement a system where a config read could pass a config write",
> but the spec explains that:
>
> In most PCI-X implementations, Split Requests are managed in separate
> buffers from Split Completions, so Split Requests naturally pass Split
> Completions. However, no deadlocks occur if Split Completions block
> Split Requests.
>
> So all this code that checks to see if a write had an effect is unsafe.
> I'm a little perturbed by this. It means the only way to reliably
> distinguish between a write that hasn't taken effect yet and a bit (say,
> MWI) the device hasn't implemented is to do a memory access to the
> device. Which is hard when you're trying to program the BARs.
>
> I suppose this hasn't bitten us before in, what, 7 years of PCI-X, so
> it can't be *that* common a thing for bridges to do. And we would have
> noticed the BAR sizing code going wrong (as it does config write
> followed immediately by config read), so maybe implementations aren't as
> crackful as the PCI spec seems to permit them to be.
>
> I find it really hard to believe the PCI committee have done something
> this stupid. There must be another rule somewhere that I'm missing.

I think typically CPUs stall until a non-posted operation completes.
And since config writes are non posted,

pci_config_write_...
write ....

does not *start* the write until config write has completed.
So there's only a single outstanding config operation and that's why
there's never any re-ordering, without any need for flushes.

Your Altix system seems the weird one here in that CPU actually
treats config writes as posted and does not wait for their completion.
I wander whether you can do a bus lock or something and force
waiting till the completion.
This would be much cleaner than trying to fix all drivers.

--
MST

2006-10-31 19:02:54

by Roland Dreier

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

The discussion fizzled out without really reaching a definitive
answer, so I'm going to apply the original patch (below), since I
pretty much convinced myself that only the driver doing the config
access has enough information to fix this reliably.

- R.

Author: John Partridge <[email protected]>
Date: Tue Oct 31 11:00:04 2006 -0800

IB/mthca: Make sure all PCI config writes reach device before doing MMIO

During initialization, mthca writes some PCI config space registers
and then does an MMIO read from one of the BARs it just enabled. This
MMIO read sometimes failed and caused a crash on SGI Altix machines,
because the PCI-X host bridge (legitimately, according to the PCI
spec) allowed the MMIO read to start before the config write completed.

To fix this, add a config read after all config writes to make sure
they are all done before starting the MMIO read.

Signed-off-by: John Partridge <[email protected]>
Signed-off-by: Roland Dreier <[email protected]>

diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c
index 91934f2..578dc7c 100644
--- a/drivers/infiniband/hw/mthca/mthca_reset.c
+++ b/drivers/infiniband/hw/mthca/mthca_reset.c
@@ -281,6 +281,20 @@ good:
goto out;
}

+ /*
+ * Perform a "flush" of the PCI config writes here by reading
+ * the PCI_COMMAND register. This is needed to make sure that
+ * we don't try to touch other PCI BARs before the config
+ * writes are done -- otherwise an MMIO cycle could start
+ * before the config writes are done and reach the HCA before
+ * the BAR is actually enabled.
+ */
+ if (pci_read_config_dword(mdev->pdev, PCI_COMMAND, hca_header)) {
+ err = -ENODEV;
+ mthca_err(mdev, "Couldn't access HCA memory after restoring, "
+ "aborting.\n");
+ }
+
out:
if (bridge)
pci_dev_put(bridge);

2006-10-31 19:53:46

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

Quoting r. Roland Dreier <[email protected]>:
> Subject: Re: Ordering between PCI config space writes and MMIO reads?
>
> The discussion fizzled out without really reaching a definitive
> answer, so I'm going to apply the original patch (below), since I
> pretty much convinced myself that only the driver doing the config
> access has enough information to fix this reliably.
>
> - R.
>
> Author: John Partridge <[email protected]>
> Date: Tue Oct 31 11:00:04 2006 -0800
>
> IB/mthca: Make sure all PCI config writes reach device before doing MMIO
>
> During initialization, mthca writes some PCI config space registers
> and then does an MMIO read from one of the BARs it just enabled. This
> MMIO read sometimes failed and caused a crash on SGI Altix machines,
> because the PCI-X host bridge (legitimately, according to the PCI
> spec) allowed the MMIO read to start before the config write completed.
>
> To fix this, add a config read after all config writes to make sure
> they are all done before starting the MMIO read.
>
> Signed-off-by: John Partridge <[email protected]>
> Signed-off-by: Roland Dreier <[email protected]>
>
> diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c
> index 91934f2..578dc7c 100644
> --- a/drivers/infiniband/hw/mthca/mthca_reset.c
> +++ b/drivers/infiniband/hw/mthca/mthca_reset.c
> @@ -281,6 +281,20 @@ good:
> goto out;
> }
>
> + /*
> + * Perform a "flush" of the PCI config writes here by reading
> + * the PCI_COMMAND register. This is needed to make sure that
> + * we don't try to touch other PCI BARs before the config
> + * writes are done -- otherwise an MMIO cycle could start
> + * before the config writes are done and reach the HCA before
> + * the BAR is actually enabled.
> + */
> + if (pci_read_config_dword(mdev->pdev, PCI_COMMAND, hca_header)) {
> + err = -ENODEV;
> + mthca_err(mdev, "Couldn't access HCA memory after restoring, "
> + "aborting.\n");
> + }
> +
> out:
> if (bridge)
> pci_dev_put(bridge);

Here's what I don't understand: according to PCI rules, pci config read
can bypass pci config write (both are non-posted).
So why does doing it help flush the writes as the comment claims?

Isn't this more the case of
/* pci_config_write seems to complete asynchronously on Altix systems.
* This is probably broken but its not clear what's the best
* thing to do is - for now, do pci_read_config_dword which seems to flush
* everything out. */

--
MST

2006-10-31 19:53:11

by Roland Dreier

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

> Here's what I don't understand: according to PCI rules, pci config read
> can bypass pci config write (both are non-posted).
> So why does doing it help flush the writes as the comment claims?

No, I don't believe a read of a config register can pass a write of
the same register. (Someone correct me if I'm wrong)

- R.

2006-10-31 19:58:14

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 31, 2006 at 11:53:02AM -0800, Roland Dreier wrote:
> > Here's what I don't understand: according to PCI rules, pci config read
> > can bypass pci config write (both are non-posted).
> > So why does doing it help flush the writes as the comment claims?
>
> No, I don't believe a read of a config register can pass a write of
> the same register. (Someone correct me if I'm wrong)

I don't see anything in the PCI spec which forbids it, but I would
expect that hardware designers don't actually do that in practice.

2006-10-31 20:28:44

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

Quoting r. Roland Dreier <[email protected]>:
> Subject: Re: Ordering between PCI config space writes and MMIO reads?
>
> > Here's what I don't understand: according to PCI rules, pci config read
> > can bypass pci config write (both are non-posted).
> > So why does doing it help flush the writes as the comment claims?
>
> No, I don't believe a read of a config register can pass a write of
> the same register. (Someone correct me if I'm wrong)

It can if PCI-X/PCI-Ex spec is anything to go by.

For example, see table 2-23, transaction ordering rules, in the PCI-express
spec: it is marked as "Y/N: there are no requirements. The second transaction
may optionally pass the first transaction or be blocked by it."

In typical systems the OS should take care not to start a new
non-posted transaction before the previous one completed.
In particular, all intel and ppc systems I've seen simply block the CPU
unti the split completion arrives.

I find it hard to believe that Altix des not supply a way to check
that completion for config write transaction has arrived.


--
MST

2006-10-31 20:37:00

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?


----- Original Message -----
From: "Michael S. Tsirkin" <[email protected]>
To: "Roland Dreier" <[email protected]>
Cc: <[email protected]>; <[email protected]>;
<[email protected]>; <[email protected]>; <[email protected]>;
<[email protected]>; "David Miller" <[email protected]>
Sent: Tuesday, October 31, 2006 2:53 PM
Subject: Re: Ordering between PCI config space writes and MMIO reads?


> Quoting r. Roland Dreier <[email protected]>:
>> Subject: Re: Ordering between PCI config space writes and MMIO reads?
>>
>> The discussion fizzled out without really reaching a definitive
>> answer, so I'm going to apply the original patch (below), since I
>> pretty much convinced myself that only the driver doing the config
>> access has enough information to fix this reliably.
>>
>> - R.
>>
>> Author: John Partridge <[email protected]>
>> Date: Tue Oct 31 11:00:04 2006 -0800
>>
>> IB/mthca: Make sure all PCI config writes reach device before doing
>> MMIO
>>
>> During initialization, mthca writes some PCI config space registers
>> and then does an MMIO read from one of the BARs it just enabled.
>> This
>> MMIO read sometimes failed and caused a crash on SGI Altix machines,
>> because the PCI-X host bridge (legitimately, according to the PCI
>> spec) allowed the MMIO read to start before the config write
>> completed.
>>
>> To fix this, add a config read after all config writes to make sure
>> they are all done before starting the MMIO read.
>>
>> Signed-off-by: John Partridge <[email protected]>
>> Signed-off-by: Roland Dreier <[email protected]>
>>
>> diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c
>> b/drivers/infiniband/hw/mthca/mthca_reset.c
>> index 91934f2..578dc7c 100644
>> --- a/drivers/infiniband/hw/mthca/mthca_reset.c
>> +++ b/drivers/infiniband/hw/mthca/mthca_reset.c
>> @@ -281,6 +281,20 @@ good:
>> goto out;
>> }
>>
>> + /*
>> + * Perform a "flush" of the PCI config writes here by reading
>> + * the PCI_COMMAND register. This is needed to make sure that
>> + * we don't try to touch other PCI BARs before the config
>> + * writes are done -- otherwise an MMIO cycle could start
>> + * before the config writes are done and reach the HCA before
>> + * the BAR is actually enabled.
>> + */
>> + if (pci_read_config_dword(mdev->pdev, PCI_COMMAND, hca_header)) {
>> + err = -ENODEV;
>> + mthca_err(mdev, "Couldn't access HCA memory after restoring, "
>> + "aborting.\n");
>> + }
>> +
>> out:
>> if (bridge)
>> pci_dev_put(bridge);
>
> Here's what I don't understand: according to PCI rules, pci config read
> can bypass pci config write (both are non-posted).
> So why does doing it help flush the writes as the comment claims?
>
> Isn't this more the case of
> /* pci_config_write seems to complete asynchronously on Altix systems.
> * This is probably broken but its not clear what's the best
> * thing to do is - for now, do pci_read_config_dword which seems to flush
> * everything out. */
>

If you write to the PCI bus and then you read the result, the read __might__
be the
read that flushes any posted writes rather than the read of device registers
that
would occur after the BARs were configured (hardware may be slower than
the CPU). So, it's best to do the required configuration cycles first, then
after
all is done, read something before you actually need to use data from
subsequent
read/write cycles.

> --
> MST
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 (somewhere)
New Book: http://www.AbominableFirebug.com


2006-10-31 20:47:19

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 31, 2006 at 03:34:47PM -0500, Richard B. Johnson wrote:
> If you write to the PCI bus and then you read the result, the read
> __might__ be the
> read that flushes any posted writes rather than the read of device

Config space writes aren't posted, they're delayed. So, for example,
you can do the config write on the primary bus, then it hits a bridge on
its way to the destination device. The bridge is entitled (obviously,
it's unlikely to) drop it, and then the config read can pass by the
config write.

I'm beginning to think Michael Tsirkin has the only solution to this
-- architectures need to check that their hardware blocks until the
config write completion has occurred (and if not, simulate that it has
in software).

2006-10-31 20:51:14

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

Quoting r. Richard B. Johnson <[email protected]>:
> Subject: Re: Ordering between PCI config space writes and MMIO reads?
>
>
> ----- Original Message -----
> From: "Michael S. Tsirkin" <[email protected]>
> To: "Roland Dreier" <[email protected]>
> Cc: <[email protected]>; <[email protected]>;
> <[email protected]>; <[email protected]>; <[email protected]>;
> <[email protected]>; "David Miller" <[email protected]>
> Sent: Tuesday, October 31, 2006 2:53 PM
> Subject: Re: Ordering between PCI config space writes and MMIO reads?
>
>
> > Quoting r. Roland Dreier <[email protected]>:
> >> Subject: Re: Ordering between PCI config space writes and MMIO reads?
> >>
> >> The discussion fizzled out without really reaching a definitive
> >> answer, so I'm going to apply the original patch (below), since I
> >> pretty much convinced myself that only the driver doing the config
> >> access has enough information to fix this reliably.
> >>
> >> - R.
> >>
> >> Author: John Partridge <[email protected]>
> >> Date: Tue Oct 31 11:00:04 2006 -0800
> >>
> >> IB/mthca: Make sure all PCI config writes reach device before doing
> >> MMIO
> >>
> >> During initialization, mthca writes some PCI config space registers
> >> and then does an MMIO read from one of the BARs it just enabled.
> >> This
> >> MMIO read sometimes failed and caused a crash on SGI Altix machines,
> >> because the PCI-X host bridge (legitimately, according to the PCI
> >> spec) allowed the MMIO read to start before the config write
> >> completed.
> >>
> >> To fix this, add a config read after all config writes to make sure
> >> they are all done before starting the MMIO read.
> >>
> >> Signed-off-by: John Partridge <[email protected]>
> >> Signed-off-by: Roland Dreier <[email protected]>
> >>
> >> diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c
> >> b/drivers/infiniband/hw/mthca/mthca_reset.c
> >> index 91934f2..578dc7c 100644
> >> --- a/drivers/infiniband/hw/mthca/mthca_reset.c
> >> +++ b/drivers/infiniband/hw/mthca/mthca_reset.c
> >> @@ -281,6 +281,20 @@ good:
> >> goto out;
> >> }
> >>
> >> + /*
> >> + * Perform a "flush" of the PCI config writes here by reading
> >> + * the PCI_COMMAND register. This is needed to make sure that
> >> + * we don't try to touch other PCI BARs before the config
> >> + * writes are done -- otherwise an MMIO cycle could start
> >> + * before the config writes are done and reach the HCA before
> >> + * the BAR is actually enabled.
> >> + */
> >> + if (pci_read_config_dword(mdev->pdev, PCI_COMMAND, hca_header)) {
> >> + err = -ENODEV;
> >> + mthca_err(mdev, "Couldn't access HCA memory after restoring, "
> >> + "aborting.\n");
> >> + }
> >> +
> >> out:
> >> if (bridge)
> >> pci_dev_put(bridge);
> >
> > Here's what I don't understand: according to PCI rules, pci config read
> > can bypass pci config write (both are non-posted).
> > So why does doing it help flush the writes as the comment claims?
> >
> > Isn't this more the case of
> > /* pci_config_write seems to complete asynchronously on Altix systems.
> > * This is probably broken but its not clear what's the best
> > * thing to do is - for now, do pci_read_config_dword which seems to flush
> > * everything out. */
> >
>
> If you write to the PCI bus and then you read the result, the read __might__
> be the read that flushes any posted writes rather than the read of device
> registers that would occur after the BARs were configured (hardware may be
> slower than the CPU). So, it's best to do the required configuration cycles
> first, then after all is done, read something before you actually need to use
> data from subsequent read/write cycles.

But why should it help? Accordig to the spec, read does not flush configuration
writes (unlike regular writes).

--
MST

2006-10-31 22:30:29

by Roland Dreier

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

> I'm beginning to think Michael Tsirkin has the only solution to this
> -- architectures need to check that their hardware blocks until the
> config write completion has occurred (and if not, simulate that it has
> in software).

OK, I guess I'm convinced. The vague language in the base PCI 3.0
spec about "dependencies" made me think that a read of a config
register had to wait until all previous writes to the same register
are done. So I'll drop this patch for now.

John, you'll need to try and come up with a way to solve this in the
Altix implementation of pci_write_config_xxx().

- R.

2006-11-01 16:29:51

by John Partridge

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

Roland Dreier wrote:
> > I'm beginning to think Michael Tsirkin has the only solution to this
> > -- architectures need to check that their hardware blocks until the
> > config write completion has occurred (and if not, simulate that it has
> > in software).
>
> OK, I guess I'm convinced. The vague language in the base PCI 3.0
> spec about "dependencies" made me think that a read of a config
> register had to wait until all previous writes to the same register
> are done. So I'll drop this patch for now.
>
> John, you'll need to try and come up with a way to solve this in the
> Altix implementation of pci_write_config_xxx().
>
> - R.

Sorry, but I find this change a bit puzzling. The problem is particular to
the PPB on the HCA and not Altix. I can't see anywhere that a PCI Config Write
is required to block until completion, it is the driver and the HCA ,not the
Altix hardware that requires the Config Write to have completed before we
leave mthca_reset() Changing pci_write_config_xxx() will change the behavior
for ALL drivers and the possibility of breaking something else. The fix was
very low risk in mthca_reset(), changing the PCI code to fix this is much
more onerous.

I know you must feel like "piggy in the middle" with this, so I don't mean
to cause you any problems, but I guess I don't understand the reluctance for
the driver fix.

John

--
John Partridge

Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: [email protected]

2006-11-01 16:46:47

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Wed, Nov 01, 2006 at 10:27:19AM -0600, John Partridge wrote:
> Sorry, but I find this change a bit puzzling. The problem is particular to
> the PPB on the HCA and not Altix.

That's not true; it's more likely on Altix, but it's not unique. *any*
PCI-PCI bridge can reorder pci config reads and writes. Apparently the
normal PCI host bridge implementation avoids this problem by blocking
until the completion comes back. If you put a quad-port tulip card into
an Altix, you could experience the same problem (but it would be
massively unlikely. You'd probably have to bring up three interfaces,
saturate them with traffic, then bring up the fourth to see it. And
even then it would be rare).

> I can't see anywhere that a PCI Config
> Write
> is required to block until completion, it is the driver and the HCA ,not the
> Altix hardware that requires the Config Write to have completed before we
> leave mthca_reset()

There's several places in the PCI midlayer that require the config write
to have completed before we do a config read. The MWI code relies on
this to see if the device supports MWI. If it gets out of order, we'll
think that the device doesn't support MWI when it thinks it's been told
to use MWI. Data corruption could result.

> Changing pci_write_config_xxx() will change the behavior
> for ALL drivers and the possibility of breaking something else. The fix was
> very low risk in mthca_reset(), changing the PCI code to fix this is much
> more onerous.

I really don't think so. At worst you'll be changing the timing.

2006-11-01 17:09:13

by John Partridge

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

Matthew,

So, if I understand correctly, you are saying because we cannot guarantee
the "flush" a config write even by doing a config read of the same register
(because the PPB can re-order) we have to make sure we block or spin on the
config write completion at the lowest level of the config write ?

Thanks
John

Matthew Wilcox wrote:
> On Wed, Nov 01, 2006 at 10:27:19AM -0600, John Partridge wrote:
>
>>Sorry, but I find this change a bit puzzling. The problem is particular to
>>the PPB on the HCA and not Altix.
>
>
> That's not true; it's more likely on Altix, but it's not unique. *any*
> PCI-PCI bridge can reorder pci config reads and writes. Apparently the
> normal PCI host bridge implementation avoids this problem by blocking
> until the completion comes back. If you put a quad-port tulip card into
> an Altix, you could experience the same problem (but it would be
> massively unlikely. You'd probably have to bring up three interfaces,
> saturate them with traffic, then bring up the fourth to see it. And
> even then it would be rare).
>
>
>>I can't see anywhere that a PCI Config
>>Write
>>is required to block until completion, it is the driver and the HCA ,not the
>>Altix hardware that requires the Config Write to have completed before we
>>leave mthca_reset()
>
>
> There's several places in the PCI midlayer that require the config write
> to have completed before we do a config read. The MWI code relies on
> this to see if the device supports MWI. If it gets out of order, we'll
> think that the device doesn't support MWI when it thinks it's been told
> to use MWI. Data corruption could result.
>
>
>>Changing pci_write_config_xxx() will change the behavior
>>for ALL drivers and the possibility of breaking something else. The fix was
>>very low risk in mthca_reset(), changing the PCI code to fix this is much
>>more onerous.
>
>
> I really don't think so. At worst you'll be changing the timing.


--
John Partridge

Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: [email protected]

2006-11-01 17:14:46

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Wed, Nov 01, 2006 at 11:08:08AM -0600, John Partridge wrote:
> So, if I understand correctly, you are saying because we cannot guarantee
> the "flush" a config write even by doing a config read of the same register
> (because the PPB can re-order) we have to make sure we block or spin on the
> config write completion at the lowest level of the config write ?

That's correct. And I'm also saying that the reason this hasn't
been thought about before is that other root bridges have a mechanism
(implicit on x86, explicit on parisc) for waiting for the config write
completion to come back.

Seems to me that Altix uses the SAL calls to access PCI config space
these days, so you can hide it in your firmware rather than patching
Linux.

2006-11-01 23:04:14

by David Miller

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

From: John Partridge <[email protected]>
Date: Wed, 01 Nov 2006 10:27:19 -0600

> Sorry, but I find this change a bit puzzling. The problem is
> particular to the PPB on the HCA and not Altix. I can't see anywhere
> that a PCI Config Write is required to block until completion, it is
> the driver and the HCA ,not the Altix hardware that requires the
> Config Write to have completed before we leave mthca_reset()
> Changing pci_write_config_xxx() will change the behavior for ALL
> drivers and the possibility of breaking something else. The fix was
> very low risk in mthca_reset(), changing the PCI code to fix this is
> much more onerous.

The issue is that something as simple as:

val = pci_read_config(REG);
val |= bit;
pci_write_config(REG, val);
newval = pci_read_config(REG);
BUG_ON(!(newval & bit));

is not guarenteed by PCI (aparently).

I see no valid reason why every PCI device driver should
be troubled with this lunacy and the ordering should thus
be ensured by the PCI layer.

It just so happens to take care of the original driver
issue too :-)

2006-11-02 01:10:19

by John Partridge

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

David Miller wrote:
> From: John Partridge <[email protected]>
> Date: Wed, 01 Nov 2006 10:27:19 -0600
>
>
>>Sorry, but I find this change a bit puzzling. The problem is
>>particular to the PPB on the HCA and not Altix. I can't see anywhere
>>that a PCI Config Write is required to block until completion, it is
>>the driver and the HCA ,not the Altix hardware that requires the
>>Config Write to have completed before we leave mthca_reset()
>>Changing pci_write_config_xxx() will change the behavior for ALL
>>drivers and the possibility of breaking something else. The fix was
>>very low risk in mthca_reset(), changing the PCI code to fix this is
>>much more onerous.
>
>
> The issue is that something as simple as:
>
> val = pci_read_config(REG);
> val |= bit;
> pci_write_config(REG, val);
> newval = pci_read_config(REG);
> BUG_ON(!(newval & bit));
>
> is not guarenteed by PCI (aparently).
>
> I see no valid reason why every PCI device driver should
> be troubled with this lunacy and the ordering should thus
> be ensured by the PCI layer.
>
> It just so happens to take care of the original driver
> issue too :-)

Yeah, Matthew has convinced me of that now.

Thanks


--
John Partridge

Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: [email protected]

2006-11-02 03:05:31

by Jeremy Higdon

[permalink] [raw]
Subject: Re: Ordering between PCI config space writes and MMIO reads?

On Tue, Oct 24, 2006 at 06:27:55PM -0500, Jack Steiner wrote:
> On Tue, Oct 24, 2006 at 04:36:32PM -0600, Matthew Wilcox wrote:
> > On Tue, Oct 24, 2006 at 02:51:30PM -0700, Roland Dreier wrote:
> > > > I think the right way to fix this is to ensure mmio write ordering in
> > > > the pci_write_config_*() implementations. Like this.
> > >
> > > I'm happy to fix this in the PCI core and not force drivers to worry
> > > about this.
> > >
> > > John, can you confirm that this patch fixes the issue for you?
> >
> > Hang on. I wasn't thinking clearly. mmiowb() only ensures the write
> > has got as far as the shub.
>
> I think mmiowb() should work on SN hardware. mmiowb() delays until shub
> reports that all previously issued PIO writes have completed.
>
> The processor "mf.a" guarantees "platform acceptance" which on SN means
> that shub has accepted the write - not that it has actually completed (or
> even forwarded anywhere by shub). That makes "mf.a" more-or-less useless
> on SN. However, shub has an additional MMR register (PIO_WRITE_COUNT) that
> counts actual outstanding PIOs. mmiob() delays until that count goes to
> zero.
>
> I'll check if there is any additional reordering that can occur AFTER the
> PIO_WRITE_COUNT goes to zero. If so, it would be at bus level - not in
> shub or routers.

As I understand it, the mmiowb on the shub waits only for the PIO write
to be accepted by the destination node (shub or tio) that the I/O device
is attached to, thus guaranteeing that no reordering will happen within
the NL.

If the PPB can reorder the write, then mmiowb is not sufficient. You'd
have to do a readback from a chip register (assuming you can trust the
PPB not to reorder reads and writes), or some other work around I haven't
thought of.

jeremy