2008-06-02 09:56:53

by Jes Sorensen

Subject: Re: MMIO and gcc re-ordering issue

Jeremy Higdon wrote:
> We don't actually have that problem on the Altix. All writes issued
> by CPU X will be ordered with respect to each other. But writes by
> CPU X and CPU Y will not be, unless an mmiowb() is done by the
> original CPU before the second CPU writes. I.e.
>
> CPU X writel
> CPU X writel
> CPU X mmiowb
>
> CPU Y writel
> ...
>
> Note that this implies some sort of locking. Also note that if in
> the above, CPU Y did the mmiowb, that would not work.

Hmmm,

Then it's less bad than I thought - my apologies for the confusion.

Would we be able to use Ben's trick of setting a per-cpu flag in
writel() and then checking it in spin_unlock(), issuing the mmiowb()
there if needed?
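
For illustration, the trick being proposed might look roughly like
this - a hypothetical sketch only, with invented names such as
mmio_write_pending, not actual kernel code:

#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/io.h>

/* one flag per CPU: "this CPU has an MMIO write in flight" */
static DEFINE_PER_CPU(int, mmio_write_pending);

static inline void my_writel(u32 val, void __iomem *addr)
{
	__raw_writel(val, addr);		/* the actual MMIO store */
	this_cpu_write(mmio_write_pending, 1);	/* remember it */
}

static inline void my_spin_unlock(spinlock_t *lock)
{
	/* pay for the expensive barrier only if this CPU did MMIO */
	if (this_cpu_read(mmio_write_pending)) {
		mmiowb();
		this_cpu_write(mmio_write_pending, 0);
	}
	spin_unlock(lock);
}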


Cheers,
Jes


2008-06-02 21:02:32

by Jeremy Higdon

Subject: Re: MMIO and gcc re-ordering issue

On Mon, Jun 02, 2008 at 11:56:39AM +0200, Jes Sorensen wrote:
> Jeremy Higdon wrote:
> >We don't actually have that problem on the Altix. All writes issued
> >by CPU X will be ordered with respect to each other. But writes by
> >CPU X and CPU Y will not be, unless an mmiowb() is done by the
> >original CPU before the second CPU writes. I.e.
> >
> > CPU X writel
> > CPU X writel
> > CPU X mmiowb
> >
> > CPU Y writel
> > ...
> >
> >Note that this implies some sort of locking. Also note that if in
> >the above, CPU Y did the mmiowb, that would not work.
>
> Hmmm,
>
> Then it's less bad than I thought - my apologies for the confusion.
>
> Would we be able to use Ben's trick of setting a per-cpu flag in
> writel() and then checking it in spin_unlock(), issuing the mmiowb()
> there if needed?


Yes, that should work fine. You will get more mmiowb's than you
need, since some drivers, such as Fusion, don't need them. On the
Origins (older SGI MIPS-based NUMA systems), the 'sync' instruction had
the same effect as mmiowb() with respect to mmio write ordering,
and it was issued unconditionally in the spin unlock. It was
cheaper than mmiowb, however.

If it matters, we could invent and use writel_relaxed() to get
performance back in drivers we care about....
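
Such an accessor would let a tuned driver batch cheap, unordered MMIO
stores and pay for ordering once. A sketch of the idea (hypothetical:
writel_relaxed() did not exist at the time, and the register names are
invented):

writel_relaxed(len, ring->mmio + TX_LEN_REG);	/* cheap, unordered */
writel_relaxed(addr, ring->mmio + TX_ADDR_REG);	/* cheap, unordered */
writel(TX_GO, ring->mmio + TX_DOORBELL_REG);	/* fully ordered write
						 * (or an explicit mmiowb())
						 * covers the whole batch */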

jeremy

2008-06-03 04:33:36

by Nick Piggin

Subject: Re: MMIO and gcc re-ordering issue

On Monday 02 June 2008 19:56, Jes Sorensen wrote:
> Jeremy Higdon wrote:
> > We don't actually have that problem on the Altix. All writes issued
> > by CPU X will be ordered with respect to each other. But writes by
> > CPU X and CPU Y will not be, unless an mmiowb() is done by the
> > original CPU before the second CPU writes. I.e.
> >
> > CPU X writel
> > CPU X writel
> > CPU X mmiowb
> >
> > CPU Y writel
> > ...
> >
> > Note that this implies some sort of locking. Also note that if in
> > the above, CPU Y did the mmiowb, that would not work.
>
> Hmmm,
>
> Then it's less bad than I thought - my apologies for the confusion.
>
> Would we be able to use Ben's trick of setting a per-cpu flag in
> writel() and then checking it in spin_unlock(), issuing the mmiowb()
> there if needed?

Yes you could, but your writels would still not be strongly ordered
within (or outside) spinlock regions, which is what Linus wants (and
I kind of agree with).

This comes back to my posting about mmiowb and io_*mb barriers etc.

Despite what you say, what you've done really _does_ change the semantics
of wmb() for all drivers. It is a really sad situation we've got ourselves
into somehow, AFAIKS in the hope of trying to save ourselves a tiny bit
of work upfront :( (this is not just the sgi folk with mmiowb I'm talking
about, but the whole random undefinedness of ordering and io barriers).

The right way to make any change is never to weaken the postcondition of
an existing interface *unless* you are willing to audit the entire tree
and fix it. Impossible for drivers, so the correct thing to do is introduce
a new interface, and move things over at an easier pace. Not rocket science.

The argument that "Altix only uses a few drivers so this way we can just
fix these up rather than make big modifications to large numbers of drivers"
is bogus. It is far worse even for Altix if you make incompatible changes,
because you first *break* every driver on your platform, then you have to
audit and fix them. If you make compatible changes, then you have to do
exactly the same audits to move them over to the new API, but you go from
slower->faster rather than broken->non broken. As a bonus, you haven't
got random broken stuff all over the tree that you forgot to audit.

I don't know how there is still so much debate about this :(

I have a proposal: I am a neutral party here, not being an arch maintainer,
so I'll take input and write up a backward compatible API specification
and force everybody to conform to it ;)

2008-06-03 08:16:23

by Jeremy Higdon

Subject: Re: MMIO and gcc re-ordering issue

On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote:
> On Monday 02 June 2008 19:56, Jes Sorensen wrote:
> > Jeremy Higdon wrote:
> > > We don't actually have that problem on the Altix. All writes issued
> > > by CPU X will be ordered with respect to each other. But writes by
> > > CPU X and CPU Y will not be, unless an mmiowb() is done by the
> > > original CPU before the second CPU writes. I.e.
> > >
> > > CPU X writel
> > > CPU X writel
> > > CPU X mmiowb
> > >
> > > CPU Y writel
> > > ...
> > >
> > > Note that this implies some sort of locking. Also note that if in
> > > the above, CPU Y did the mmiowb, that would not work.
> >
> > Hmmm,
> >
> > Then it's less bad than I thought - my apologies for the confusion.
> >
> > Would we be able to use Ben's trick of setting a per-cpu flag in
> > writel() and then checking it in spin_unlock(), issuing the mmiowb()
> > there if needed?
>
> Yes you could, but your writels would still not be strongly ordered
> within (or outside) spinlock regions, which is what Linus wants (and
> I kind of agree with).

Yes they would be. Writes from the same CPU are always ordered. Writes
from different CPUs are not, but that's only a concern if you protect
the writes with some sort of lock. If the lock release forces a barrier,
that should take care of the problem.

jeremy

2008-06-03 08:19:36

by Nick Piggin

Subject: Re: MMIO and gcc re-ordering issue

On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote:
> On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote:
> > On Monday 02 June 2008 19:56, Jes Sorensen wrote:

> > > Would we be able to use Ben's trick of setting a per-cpu flag in
> > > writel() and then checking it in spin_unlock(), issuing the mmiowb()
> > > there if needed?
> >
> > Yes you could, but your writels would still not be strongly ordered
> > within (or outside) spinlock regions, which is what Linus wants (and
> > I kind of agree with).
>
> Yes they would be. Writes from the same CPU are always ordered. Writes
> from different CPUs are not, but that's only a concern if you protect

They are not strongly ordered WRT writes to cacheable memory. If they
were, then they would not leak out of spinlocks.

2008-06-03 08:45:38

by Jeremy Higdon

Subject: Re: MMIO and gcc re-ordering issue

On Tue, Jun 03, 2008 at 06:19:05PM +1000, Nick Piggin wrote:
> On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote:
> > On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote:
> > > On Monday 02 June 2008 19:56, Jes Sorensen wrote:
> > > > Would we be able to use Ben's trick of setting a per-cpu flag in
> > > > writel() and then checking it in spin_unlock(), issuing the mmiowb()
> > > > there if needed?
> > >
> > > Yes you could, but your writels would still not be strongly ordered
> > > within (or outside) spinlock regions, which is what Linus wants (and
> > > I kind of agree with).
> >
> > Yes they would be. Writes from the same CPU are always ordered. Writes
> > from different CPUs are not, but that's only a concern if you protect
>
> They are not strongly ordered WRT writes to cacheable memory. If they
> were, then they would not leak out of spinlocks.

No posted writes are.
As I recall, the outX functions are not supposed to be posted, so the sn2
versions issue an mmiowb in the outX. But writeX has always been posted,
or at least postable. I thought that was generally accepted.

Normally, the only way for a device to see cacheable memory is via a DMA
read, and you are guaranteed on sn2 that in the following:

store of A to location X
mmio write to device
device DMA read from location X

that the device will see A. In the past, on some other archs, you'd have
to flush the Dcache for that to work.

Granted, if the compiler reorders the store and the mmio write, then you
have a problem.
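
In code, the conventional way to express that sequence guards against
exactly that reordering (a sketch, with invented names):

dev->desc.addr = A;			/* store of A to location X */
wmb();					/* keep the compiler and CPU from
					 * reordering the store past the
					 * MMIO write */
writel(START_DMA, dev->mmio + CTRL);	/* mmio write to device */
/* the device's DMA read of location X is now guaranteed to see A */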

jeremy

2008-06-03 16:53:19

by Jesse Barnes

Subject: Re: MMIO and gcc re-ordering issue

On Monday, June 02, 2008 9:33 pm Nick Piggin wrote:
> On Monday 02 June 2008 19:56, Jes Sorensen wrote:
> > Jeremy Higdon wrote:
> > > We don't actually have that problem on the Altix. All writes issued
> > > by CPU X will be ordered with respect to each other. But writes by
> > > CPU X and CPU Y will not be, unless an mmiowb() is done by the
> > > original CPU before the second CPU writes. I.e.
> > >
> > > CPU X writel
> > > CPU X writel
> > > CPU X mmiowb
> > >
> > > CPU Y writel
> > > ...
> > >
> > > Note that this implies some sort of locking. Also note that if in
> > > the above, CPU Y did the mmiowb, that would not work.
> >
> > Hmmm,
> >
> > Then it's less bad than I thought - my apologies for the confusion.
> >
> > Would we be able to use Ben's trick of setting a per-cpu flag in
> > writel() and then checking it in spin_unlock(), issuing the mmiowb()
> > there if needed?
>
> Yes you could, but your writels would still not be strongly ordered
> within (or outside) spinlock regions, which is what Linus wants (and
> I kind of agree with).

I think you mean wrt cacheable memory accesses here (though iirc on ia64
spin_unlock has release semantics, so at least it'll barrier other stores).

> This comes back to my posting about mmiowb and io_*mb barriers etc.
>
> Despite what you say, what you've done really _does_ change the semantics
> of wmb() for all drivers. It is a really sad situation we've got ourselves
> into somehow, AFAIKS in the hope of trying to save ourselves a tiny bit
> of work upfront :( (this is not just the sgi folk with mmiowb I'm talking
> about, but the whole random undefinedness of ordering and io barriers).
>
> The right way to make any change is never to weaken the postcondition of
> an existing interface *unless* you are willing to audit the entire tree
> and fix it. Impossible for drivers, so the correct thing to do is introduce
> a new interface, and move things over at an easier pace. Not rocket
> science.

Well, given how undefined things have been in the past, each arch has had to
figure out what things mean (based on looking at drivers & core code) then
come up with appropriate primitives. On Altix, we went both directions: we
made regular PIO reads (readX etc.) *very* expensive to preserve
compatibility with what existing drivers expect, and added a readX_relaxed to
give a big performance boost to tuned drivers.

OTOH, given that posted PCI writes were nothing new to Linux, but the Altix
network topology was, we introduced mmiowb() (with lots of discussion I might
add), which has clear and relatively simple usage guidelines.
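
Those guidelines amount to: any lock that serializes MMIO writes to a
device must issue mmiowb() before it is released. A sketch with
invented names:

spin_lock_irqsave(&dev->lock, flags);
writel(val1, dev->mmio + REG_A);
writel(val2, dev->mmio + REG_B);
mmiowb();	/* the writes above reach the device before any
		 * writes the next lock holder issues */
spin_unlock_irqrestore(&dev->lock, flags);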

Now, in hindsight, using a PIO write set & test flag approach in
writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc
that never came up in the discussion, probably because we were focused on PCI
posting and not uncached vs. cached ordering.

> The argument that "Altix only uses a few drivers so this way we can just
> fix these up rather than make big modifications to large numbers of
> drivers" is bogus. It is far worse even for Altix if you make incompatible
> changes, because you first *break* every driver on your platform, then you
> have to audit and fix them. If you make compatible changes, then you have
> to do exactly the same audits to move them over to the new API, but you go
> from slower->faster rather than broken->non broken. As a bonus, you haven't
> got random broken stuff all over the tree that you forgot to audit.

I agree, but afaik the only change Altix ended up forcing on people was
mmiowb(), but that turned out to be necessary on mips64 (and maybe some other
platforms?) anyway.

> I don't know how there is still so much debate about this :(
>
> I have a proposal: I am a neutral party here, not being an arch maintainer,
> so I'll take input and write up a backward compatible API specification
> and force everybody to conform to it ;)

Aside from the obvious performance impact of making all the readX/writeX
routines strongly ordered, both in terms of PCI posting and cacheable vs.
uncacheable accesses, it also makes things inconsistent. Both core code &
drivers will still have to worry about regular, cacheable memory barriers for
correctness, but it looks like you're proposing that they not have to think
about I/O ordering.

At any rate, I don't think anyone would argue against defining the ordering
semantics of all of these routines (btw you should also include ordering wrt
DMA & PCI posting); the question is what's the best balance between keeping
the driver requirements simple and the performance cost on complex arches.

Jesse

2008-06-05 08:41:08

by Jes Sorensen

Subject: Re: MMIO and gcc re-ordering issue

Jesse Barnes wrote:
> Now, in hindsight, using a PIO write set & test flag approach in
> writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc
> that never came up in the discussion, probably because we were focused on PCI
> posting and not uncached vs. cached ordering.

Hi Jesse,

I am going to take a stab at implementing this so we can see how much
of an impact it will have.

Cheers,
Jes

2008-06-05 08:44:52

by Benjamin Herrenschmidt

Subject: Re: MMIO and gcc re-ordering issue

On Thu, 2008-06-05 at 10:40 +0200, Jes Sorensen wrote:
> Jesse Barnes wrote:
> > Now, in hindsight, using a PIO write set & test flag approach in
> > writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc
> > that never came up in the discussion, probably because we were focused on PCI
> > posting and not uncached vs. cached ordering.
>
> Hi Jesse,
>
> I am going to take a stab at implementing this so we can see how much
> of an impact it will have.

Note that the powerpc implementation currently clears the flag
on spin_lock and tests it on unlock. We are considering changing
that to not touch the flag on spin_lock and just clear it whenever
we do a sync (ie, on unlock, on explicit mmiowb, and possibly even
on readl's where we happen to do sync's).
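
A sketch of the two variants, with invented names (in the real kernel
this logic lives inside the powerpc lock and I/O accessor code):

static DEFINE_PER_CPU(int, io_pending);	/* set by writel() */

/* current scheme: clear on lock, test on unlock */
static inline void my_lock(spinlock_t *l)
{
	spin_lock(l);
	this_cpu_write(io_pending, 0);
}

static inline void my_unlock(spinlock_t *l)
{
	if (this_cpu_read(io_pending))
		mb();			/* the "sync" */
	spin_unlock(l);
}

/* considered change: my_lock() would not touch io_pending at all;
 * instead, every point that actually issues a sync (unlock, an
 * explicit mmiowb(), possibly readl()) would clear it, so a nested
 * lock/unlock pair could no longer erase a flag set by an earlier
 * writel(). */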

Ben.

2008-06-12 15:07:43

by Matthew Wilcox

Subject: Re: MMIO and gcc re-ordering issue

On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote:
> Note that the powerpc implementation currently clears the flag
> on spin_lock and tests it on unlock. We are considering changing
> that to not touch the flag on spin_lock and just clear it whenever
> we do a sync (ie, on unlock, on explicit mmiowb, and possibly even
> on readl's where we happen to do sync's).

Your current scheme sounds like it's broken for

spin_lock(a);		/* clears the per-cpu flag */
writel();		/* sets the flag */
spin_lock(b);		/* clears the flag set by the writel()... */
spin_unlock(b);		/* ...so this sees no pending MMIO: no sync */
spin_unlock(a);		/* ...and neither does this: the writel can
			 * leak out of lock a entirely */

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-06-13 00:08:44

by Benjamin Herrenschmidt

Subject: Re: MMIO and gcc re-ordering issue

On Thu, 2008-06-12 at 09:07 -0600, Matthew Wilcox wrote:
> On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote:
> > Note that the powerpc implementation currently clears the flag
> > on spin_lock and tests it on unlock. We are considering changing
> > that to not touch the flag on spin_lock and just clear it whenever
> > we do a sync (ie, on unlock, on explicit mmiowb, and possibly even
> > on readl's where we happen to do sync's).
>
> Your current scheme sounds like it's broken for
>
> spin_lock(a)
> writel();
> spin_lock(b);
> spin_unlock(b);
> spin_unlock(a);

Which is why we are considering changing it :-)

But as Paulus said before, he did some measurements and we came to the
conclusion that (pending more measurements on a wider range of HW) we
may as well drop the whole scheme and make writel fully synchronous
instead.

Then, we can get some nice weakly ordered accessors and start adding
them with appropriate explicit barriers to the hot path of perf.
critical drivers we care about.

Cheers,
Ben.