Linus,
An issue has come up in the tg3 ethernet driver, where we are seeing
data corruption on ppc64 machines that is attributable to a lack of
ordering between writes to normal RAM and writes to an MMIO register.
Basically the driver does writes to RAM and then a writel to an MMIO
register to trigger DMA, and occasionally the device then reads old
values from memory.
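To make the failure mode concrete, here is a rough sketch of the kind of
code path involved (all names and register offsets are invented for
illustration; this is not the actual tg3 code):

#include <linux/io.h>
#include <linux/types.h>

struct tx_desc {
	__le64 addr;
	__le32 len_flags;
	__le32 reserved;
};

/* The driver builds a descriptor in normal RAM, then hits an MMIO
 * doorbell.  If nothing orders the MMIO store after the RAM stores,
 * the device may DMA a stale descriptor when the doorbell arrives. */
static void ring_tx_doorbell(void __iomem *regs, struct tx_desc *desc,
			     dma_addr_t buf, u32 len, u32 prod_idx)
{
	desc->addr = cpu_to_le64(buf);		/* store to normal RAM */
	desc->len_flags = cpu_to_le32(len);	/* store to normal RAM */
	writel(prod_idx, regs + 0x100);		/* hypothetical doorbell register */
}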
Do you have an opinion about whether the MMIO write in writel() should
be ordered with respect to preceding writes to normal memory?
Currently we have a sync instruction after the store in writel() but
not one before. The sync after is to keep the writel inside
spinlocked regions and to ensure that the store is ordered with
respect to the load in readl() and friends.
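For readers unfamiliar with the powerpc accessors, a simplified sketch of
the arrangement described above (illustrative only, not the literal kernel
source):

/* Simplified sketch: the byte-reversed MMIO store is followed by a
 * sync, but nothing orders it after earlier cacheable (RAM) stores. */
static inline void writel_sketch(u32 val, volatile void __iomem *addr)
{
	__asm__ __volatile__(
		"stwbrx %1,0,%2\n\t"	/* little-endian MMIO store */
		"sync"			/* barrier *after* the store only */
		: "=m" (*(volatile u32 __force *)addr)
		: "r" (val), "r" (addr)
		: "memory");
}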
Paul.
On Sat, 9 Sep 2006, Paul Mackerras wrote:
>
> Do you have an opinion about whether the MMIO write in writel() should
> be ordered with respect to preceding writes to normal memory?
It shouldn't. It's too expensive. The fact that PC's have nice memory
consistency models means that most of the testing is going to be with the
PC memory ordering, but the same way we have "smp_wmb()" (which is also a
no-op on x86) we should probably have a "mmiowb()" there.
Gaah. Right now, mmiowb() is actually broken on x86 (it's an empty define,
so it may cause compiler warnings about statements with no effects or
something).
I don't think anything but a few SCSI drivers that are used on some ia64
boxes use mmiowb(). And it's currently a no-op not only on x86 but also on
powerpc. Probably because it's defined to be a barrier between _two_ MMIO
operations, while we should probably have things like
a)
.. regular mem store ..
mem_to_io_barrier();
.. IOMEM store ..
b)
.. IOMEM store ..
io_to_mem_barrier();
.. regular mem store ..
although it's quite possible that (a) never makes any sense at all.
That said, it's also entirely possible that what you _should_ do is to
just make sure that the "sync" is always _before_ the IO op. But that
would require that you have one before an IO load too. Do you? I'm too
lazy to check.
(Keeping it inside a spinlock, I don't know. Spinlocks aren't _supposed_
to order IO, so I don't _think_ that's necessarily an argument for doing
so. So your rationale seems strange. Even on x86, a spinlock release by
_no_ means would mean that an IO write would be "done").
Linus
Linus Torvalds writes:
> I don't think anything but a few SCSI drivers that are used on some ia64
> boxes use mmiowb(). And it's currently a no-op not only on x86 but also on
The examples in Documentation/memory-barriers.txt and
Documentation/DocBook/deviceiobook.tmpl only seem to say that mmiowb
is needed between a writel and a spin_unlock.
> powerpc. Probably because it's defined to be a barrier between _two_ MMIO
> operations, while we should probably have things like
>
> a)
> .. regular mem store ..
> mem_to_io_barrier();
> .. IOMEM store ..
>
> b)
> .. IOMEM store ..
> io_to_mem_barrier();
> .. regular mem store ..
>
> although it's quite possible that (a) never makes any sense at all.
Do you mean (b) never makes sense? (a) is the classic scenario for a
device doing DMA to read a descriptor we've just constructed.
> That said, it's also entirely possible that what you _should_ do is to
> just make sure that the "sync" is always _before_ the IO op. But that
> would require that you have one before an IO load too. Do you? I'm too
> lazy to check.
Not at present; readX() have no barrier before the load. They have a
barrier after the load that waits for the data to arrive before
allowing any following instructions to execute.
> (Keeping it inside a spinlock, I don't know. Spinlocks aren't _supposed_
> to order IO, so I don't _think_ that's necessarily an argument for doing
> so. So your rationale seems strange. Even on x86, a spinlock release by
> _no_ means would mean that an IO write would be "done").
Well, it's about ordering between multiple CPUs rather than the write
being "done". I think the powerpc readX/writeX got to their current
form before the ia64 guys invented mmiowb().
I suspect the best thing at this point is to move the sync in writeX()
before the store, as you suggest, and add an "eieio" before the load
in readX(). That does mean that we are then relying on driver writers
putting in the mmiowb() between a writeX() and a spin_unlock, but at
least that is documented.
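In driver-visible terms, the contract being proposed would look roughly
like this (a sketch with invented device and register names, not code from
any real driver):

struct hypothetical_dev {
	spinlock_t lock;
	void __iomem *regs;
	__le32 *ring;			/* descriptor ring in normal RAM */
};

static void hypothetical_start_tx(struct hypothetical_dev *dev, u32 idx)
{
	spin_lock(&dev->lock);
	dev->ring[idx] = cpu_to_le32(0x1);	/* RAM store */
	/* With the sync moved before the store, writel() now orders the
	 * RAM store above before the MMIO store below... */
	writel(idx, dev->regs + 0x10);		/* hypothetical doorbell */
	/* ...but the MMIO write can still leak past the unlock on another
	 * CPU unless the driver says mmiowb() first, as documented. */
	mmiowb();
	spin_unlock(&dev->lock);
}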
Paul.
On Sat, 9 Sep 2006, Paul Mackerras wrote:
> >
> > although it's quite possible that (a) never makes any sense at all.
>
> Do you mean (b) never makes sense?
Yes.
> I suspect the best thing at this point is to move the sync in writeX()
> before the store, as you suggest, and add an "eieio" before the load
> in readX(). That does mean that we are then relying on driver writers
> putting in the mmiowb() between a writeX() and a spin_unlock, but at
> least that is documented.
Yeah, that sounds reasonable.
Linus
On Fri, 2006-09-08 at 19:42 -0700, Linus Torvalds wrote:
>
> On Sat, 9 Sep 2006, Paul Mackerras wrote:
> >
> > Do you have an opinion about whether the MMIO write in writel() should
> > be ordered with respect to preceding writes to normal memory?
>
> It shouldn't. It's too expensive.
That's a relief! I was worried we would have to add a second sync in
there... In fact, we might even be able to remove the one we have right
now (replace it with a more lightweight eieio) if we start using mmiowb
(see below)
> The fact that PC's have nice memory
> consistency models means that most of the testing is going to be with the
> PC memory ordering, but the same way we have "smp_wmb()" (which is also a
> no-op on x86) we should probably have a "mmiowb()" there.
>
> Gaah. Right now, mmiowb() is actually broken on x86 (it's an empty define,
> so it may cause compiler warnings about statements with no effects or
> something).
>
> I don't think anything but a few SCSI drivers that are used on some ia64
> boxes use mmiowb(). And it's currently a no-op not only on x86 but also on
> powerpc. Probably because it's defined to be a barrier between _two_ MMIO
> operations, while we should probably have things like
The problem is that very few people have any clear idea of what mmiowb
is :) In fact, what you described is not the definition of mmiowb
according to Jesse (who, iirc, added it in the first place on ia64). It
was defined as a way to order MMIO vs. MMIO from 2 different nodes. In
fact, it's actually a barrier between MMIO and spin_unlock, preventing
MMIO's from leaking outside of the lock, and thus MMIOs from 2 locked
sections on 2 CPUs from getting mixed together.
However, I'm very happy to define mmiowb() as an almighty barrier for
all sorts of stores between all domains (it would be the same instruction
in both cases on powerpc anyway, a sync).
If we start requiring mmiowb() to prevent MMIO writes from leaking out
of locks, then we can even remove the sync we have in our MMIO stores
(writel etc...) and replace it with a more lightweight eieio that will
only order vs. other MMIO operations (especially loads). That would need
quite a bit of driver auditing... but we would get back a few nice
percent of performance on heavy IO traffic that we lost due to those
barriers.
> a)
> .. regular mem store ..
> mem_to_io_barrier();
> .. IOMEM store ..
>
> b)
> .. IOMEM store ..
> io_to_mem_barrier();
> .. regular mem store ..
>
> although it's quite possible that (a) never makes any sense at all.
I quite like mem_to_io_* (barrier/rb/wb) and io_to_mem_* in fact :) That
naming probably speaks more clearly to device driver writers and would
allow more fine-grained barriers. As Segher mentioned in another thread
on this issue (see his mail titled '"A modest proposal" (Ho hum) [was:
Re: TG3 data corruption (TSO ?)]', sent to lkml earlier today), the
naming of the barriers could be improved based on when to use them, to
make things clearer to driver writers while providing archs a more
fine-grained approach.
> That said, it's also entirely possible that what you _should_ do is to
> just make sure that the "sync" is always _before_ the IO op. But that
> would require that you have one before an IO load too. Do you? I'm too
> lazy to check.
We need a sync after the store to prevent MMIO stores from leaking out of
spinlocks. The problem is that locks are in the coherent domain, MMIO
isn't, and on PowerPC only strong barriers can order between those two. So
either we have a sync after the store (performance hit on IOs), or in the
spin_unlock() (performance hit on anybody using spinlocks) ... or we
move the problem to mmiowb(), but that means fixing the many drivers
that don't use it properly and documenting it better.
> (Keeping it inside a spinlock, I don't know. Spinlocks aren't _supposed_
> to order IO, so I don't _think_ that's necessarily an argument for doing
> so. So your rationale seems strange. Even on x86, a spinlock release by
> _no_ means would mean that an IO write would be "done").
No, but they should guarantee that stores done within the lock remain
ordered between processors. Example:
Processor A Processor B
lock lock
write A write C
write B write D
unlock unlock
It's important that on the target bus, what is emitted is ABCD or CDAB,
that is, the 2 locked pairs remain consistent, and not ACBD or something
like that.
Ben.
> I suspect the best thing at this point is to move the sync in writeX()
> before the store, as you suggest, and add an "eieio" before the load
> in readX(). That does mean that we are then relying on driver writers
> putting in the mmiowb() between a writeX() and a spin_unlock, but at
> least that is documented.
Well, why keep the sync in writel then ? Isn't it agreed that the driver
should use an explicit barrier ? Or did I misunderstand Linus ?
Ben.
From: Paul Mackerras <[email protected]>
Date: Sat, 9 Sep 2006 13:02:27 +1000
> I suspect the best thing at this point is to move the sync in writeX()
> before the store, as you suggest, and add an "eieio" before the load
> in readX(). That does mean that we are then relying on driver writers
> putting in the mmiowb() between a writeX() and a spin_unlock, but at
> least that is documented.
I think not matching what PC systems do is, at least from one
perspective, a very bad engineering decision for 2 reasons.
1) You will be chasing down these kinds of problems forever,
you will fix tg3 today, but tomorrow it will be another driver
for which you will invest weeks of delicate debugging that
could have been spent on much more useful coding
2) Driver authors will not get these memory barriers right,
you can say they will because it will be "documented" but
that does not change reality which is that driver folks
will get simple interfaces right but these memory barriers
are relatively advanced concepts, which they thus will get
wrong half the time
Sure it's more expensive, but at least on sparc64 I'd much rather
spend my time working on more interesting things than "today's
missing memory barrier" :-)
I also don't want to see all of these memory barriers crapping up our
drivers. If I do an MMIO access, then access a descriptor, or vice versa,
those should be ordered because they are both technically accesses to
"physical device state". Having to say this explicitly seems really
the wrong thing to do, at least to me.
From: Benjamin Herrenschmidt <[email protected]>
Date: Sat, 09 Sep 2006 17:23:20 +1000
> I quite like mem_to_io_* (barrier/rb/wb) and io_to_mem_* in fact :) That
> is probably more talkative to device driver writers and would allow more
> fine grained barriers.
I firmly believe that the average driver person is not going
to be able to get these things right, even if you document it
in big bold letters in some Documentation/*.txt file.
This is like the Alpha OSF outb() interface for kernel drivers
that had something like 8 arguments.
And even if you define these things, and people start using it,
only PowerPC and a few other systems (I guess IA64) will ever
notice when this stuff is done wrong. That's really not a good
plan from a testing coverage point of view.
David Miller wrote:
> From: Paul Mackerras <[email protected]>
> Date: Sat, 9 Sep 2006 13:02:27 +1000
>
>> I suspect the best thing at this point is to move the sync in writeX()
>> before the store, as you suggest, and add an "eieio" before the load
>> in readX(). That does mean that we are then relying on driver writers
>> putting in the mmiowb() between a writeX() and a spin_unlock, but at
>> least that is documented.
>
> I think not matching what PC systems do is, at least from one
> perspective, a very bad engineering decision for 2 reasons.
>
> 1) You will be chasing down these kinds of problems forever,
> you will fix tg3 today, but tomorrow it will be another driver
> for which you will invest weeks of delicate debugging that
> could have been spent on much more useful coding
>
> 2) Driver authors will not get these memory barriers right,
> you can say they will because it will be "documented" but
> that does not change reality which is that driver folks
> will get simple interfaces right but these memory barriers
> are relatively advanced concepts, which they thus will get
> wrong half the time
>
> Sure it's more expensive, but at least on sparc64 I'd much rather
> spend my time working on more interesting things than "today's
> missing memory barrier" :-)
>
> I also don't want to see all of these memory barriers crapping up our
> drivers. I do a MMIO, then I access a descriptor, or vice versa, then
> those should be ordered because they are both technically accesses to
> "physical device state". Having to say this explicitly seems really
> the wrong thing to do, at least to me.
Agreed.
As (I think) BenH mentioned in another email, the normal way Linux
handles these interfaces is for the primary API (readX, writeX) to be
strongly ordered, strongly coherent, etc. And then there is a relaxed
version without barriers and syncs, for the smart guys who know what
they're doing. We used to do this for fbdev drivers; I dunno what happened
to that interface.
We want to make it tough for driver writers to screw up, unless they are
really trying...
Jeff
From: Jeff Garzik <[email protected]>
Date: Sat, 09 Sep 2006 05:55:19 -0400
> As (I think) BenH mentioned in another email, the normal way Linux
> handles these interfaces is for the primary API (readX, writeX) to be
> strongly ordered, strongly coherent, etc. And then there is a relaxed
> version without barriers and syncs, for the smart guys who know what
> they're doing
Indeed, I think that is the way to handle this.
David Miller writes:
> From: Paul Mackerras <[email protected]>
> Date: Sat, 9 Sep 2006 13:02:27 +1000
>
> > I suspect the best thing at this point is to move the sync in writeX()
> > before the store, as you suggest, and add an "eieio" before the load
> > in readX(). That does mean that we are then relying on driver writers
> > putting in the mmiowb() between a writeX() and a spin_unlock, but at
> > least that is documented.
>
> I think not matching what PC systems do is, at least from one
> perspective, a very bad engineering decision for 2 reasons.
Well, I believe that my suggestion *does* match what PC systems do.
It will mean that writes to memory are ordered with respect to
following MMIO writes. MMIO writes aren't ordered with respect to
following stores to memory, but then nobody sane expects that if you
do a writel() then a store to memory, the writel will hit the device
before the store hits memory; even PC systems do PCI write posting.
As for the writel followed by spin_unlock thing, Ben and I have an
idea for some debug code that will detect at runtime if someone does a
writel followed by a spin_unlock without a readl, spin_lock or mmiowb
in between.
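A minimal sketch of what such debug code might look like (purely
illustrative, with invented names and modern per-cpu helpers; not what was
actually implemented):

#include <linux/percpu.h>
#include <linux/bug.h>

static DEFINE_PER_CPU(int, mmio_write_pending);

/* hook called from writel(): an MMIO write is now "pending" */
static inline void debug_io_write(void)
{
	__this_cpu_write(mmio_write_pending, 1);
}

/* hook called from readl(), spin_lock() and mmiowb(): ordering restored */
static inline void debug_io_sync(void)
{
	__this_cpu_write(mmio_write_pending, 0);
}

/* hook called from spin_unlock(): complain if a writel is still pending */
static inline void debug_io_unlock(void)
{
	WARN_ON_ONCE(__this_cpu_read(mmio_write_pending));
	__this_cpu_write(mmio_write_pending, 0);
}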
The main question I'm trying to decide at the moment is whether to
make the change for 2.6.18 or not... If not then I'd like to get the
wmb() into tg3.c temporarily to fix that particular data corruption
problem.
Paul.
On Sat, 2006-09-09 at 12:03 +1000, Paul Mackerras wrote:
> Currently we have a sync instruction after the store in writel() but
> not one before. The sync after is to keep the writel inside
> spinlocked regions and to ensure that the store is ordered with
> respect to the load in readl() and friends.
The spinlock vs. writel case is not guaranteed on other platforms and
requires that you use mmiowb(). The main memory vs. writel ordering is
only half guaranteed, and for the PCI bus only, viz:
The following is OK:
	fooblock->command = 1;
	writel(fooblock_phys, pci_addr); /* mem write seen */
The following is not:
	fooblock->command = 1;
	writel(fooblock_phys, pci_addr);
	fooblock->command = 2; /* could occur before the PCI write */
Alan
On Fri, 2006-09-08 at 19:42 -0700, Linus Torvalds wrote:
> It shouldn't. It's too expensive.
Well, it does, because way back when you said it should 8)
__writel/__readl were the ones allowed to be un-ordered.
On Sat, 2006-09-09 at 17:23 +1000, Benjamin Herrenschmidt wrote:
> The problem is that very few people have any clear idea of what mmiowb
> is :) In fact, what you described is not the definition of mmiowb
> according to Jesse
Some of us talked a little about this at Linux Kongress and one
suggestion so people did understand it was
spin_lock_io();
spin_unlock_io();
so that it can be expressed not as a weird barrier op but as part of the
locking.
Linus Torvalds wrote:
>
> On Sat, 9 Sep 2006, Paul Mackerras wrote:
>> Do you have an opinion about whether the MMIO write in writel() should
>> be ordered with respect to preceding writes to normal memory?
>
> It shouldn't. It's too expensive. The fact that PC's have nice memory
> consistency models means that most of the testing is going to be with the
> PC memory ordering, but the same way we have "smp_wmb()" (which is also a
> no-op on x86) we should probably have a "mmiowb()" there.
>
> Gaah. Right now, mmiowb() is actually broken on x86 (it's an empty define,
> so it may cause compiler warnings about statements with no effects or
> something).
>
> I don't think anything but a few SCSI drivers that are used on some ia64
> boxes use mmiowb(). And it's currently a no-op not only on x86 but also on
> powerpc. Probably because it's defined to be a barrier between _two_ MMIO
> operations, while we should probably have things like
It seems to be a growing virus with network drivers too. bnx2, s2io, tg3 and
bcm43xx (which has like 40 of them) are already resorting to it. We're even
planning on putting one into e1000.
I'm not sure what bcm43xx chip will work with IA64, or if people actually have
itanium laptops(!) or MIPS, but for e1000 it definitely fixes ordering problems
on IA64.
Auke
On Saturday 09 September 2006 14:34, Auke Kok wrote:
> I'm not sure what bcm43xx chip will work with IA64, or if people actually
> have itanium laptops(!) or MIPS, but for e1000 it definitely fixes ordering
> problems on IA64.
Sometimes I wonder, with archs like IA64, why Linus doesn't just pretend IA64
doesn't exist and refuse to support it. It's such a horribly botched mess from
what I've seen on LKML.
> Auke
--
Patrick McFarland || http://AdTerrasPerAspera.com
"Computer games don't affect kids; I mean if Pac-Man affected us as kids,
we'd all be running around in darkened rooms, munching magic pills and
listening to repetitive electronic music." -- Kristian Wilson, Nintendo,
Inc, 1989
On Saturday, September 09, 2006 3:08 am, David Miller wrote:
> From: Jeff Garzik <[email protected]>
> Date: Sat, 09 Sep 2006 05:55:19 -0400
>
> > As (I think) BenH mentioned in another email, the normal way Linux
> > handles these interfaces is for the primary API (readX, writeX) to
> > be strongly ordered, strongly coherent, etc. And then there is a
> > relaxed version without barriers and syncs, for the smart guys who
> > know what they're doing
>
> Indeed, I think that is the way to handle this.
Well why didn't you guys mention this when mmiowb() went in?
I agree that having a relaxed PIO ordering version of writeX makes sense
(jejb convinced me of this on irc the other day). But what to name it?
We already have readX_relaxed, but that's for PIO vs. DMA ordering, not
PIO vs. PIO. To distinguish from that case maybe writeX_weak or
writeX_nobarrier would make sense?
Jesse
On Saturday, September 09, 2006 8:09 am, Alan Cox wrote:
> On Sat, 2006-09-09 at 17:23 +1000, Benjamin Herrenschmidt wrote:
> > The problem is that very few people have any clear idea of what
> > mmiowb is :) In fact, what you described is not the definition of
> > mmiowb according to Jesse
>
> Some of us talked a little about this at Linux Kongress and one
> suggestion so people did understand it was
>
> spin_lock_io();
> spin_unlock_io();
>
> so that it can be expressed not as a weird barrier op but as part of
> the locking.
That's what IRIX had. It would let us get rid of mmiowb and avoid doing
a full sync in writeX, so may be the best option.
Jesse
On Sunday 10 September 2006 19:19, Jesse Barnes wrote:
> On Saturday, September 09, 2006 8:09 am, Alan Cox wrote:
> > On Sat, 2006-09-09 at 17:23 +1000, Benjamin Herrenschmidt wrote:
> > > The problem is that very few people have any clear idea of what
> > > mmiowb is :) In fact, what you described is not the definition of
> > > mmiowb according to Jesse
> >
> > Some of us talked a little about this at Linux Kongress and one
> > suggestion so people did understand it was
> >
> > spin_lock_io();
> > spin_unlock_io();
> >
> > so that it can be expressed not as a weird barrier op but as part of
> > the locking.
>
> That's what IRIX had. It would let us get rid of mmiowb and avoid doing
> a full sync in writeX, so may be the best option.
Last time I suggested that, people did not want it.
Probably about 9 months ago. Don't remember exactly.
We came to the decision that if a driver depends on some weak
ordering, it should either directly use mmiowb() or have its
own locking wrapper which wraps spin_unlock() and mmiowb().
There is one little problem in practice with something
like spin_unlock_io().
spin_lock_io(&lock);
foovalue = new_foovalue;
if (device_is_fooing)
writel(foovalue, REGISTER);
spin_unlock_io(&lock);
That would be an unnecessary sync in case the device is not fooing.
In contrast to the explicit version:
spin_lock(&lock);
foovalue = new_foovalue;
if (device_is_fooing) {
writel(foovalue, REGISTER);
mmiowb();
}
spin_unlock(&lock);
--
Greetings Michael.
On Sun, 10 Sep 2006, Michael Buesch wrote:
> >
> > That's what IRIX had. It would let us get rid of mmiowb and avoid doing
> > a full sync in writeX, so may be the best option.
>
> Last time I suggested that, people did not want it.
I would personally _much_ rather have a separate mmiowb() and a regular
spinlock, than add a magic new spinlock.
Of course, mmiowb() itself is not a great name, and we could/should
probably rename it to make it more obvious what the hell it is.
> There is one little problem in practice with something
> like spin_unlock_io().
>
> spin_lock_io(&lock);
> foovalue = new_foovalue;
> if (device_is_fooing)
> writel(foovalue, REGISTER);
> spin_unlock_io(&lock);
>
> That would be an unnecessary sync in case the device is not fooing.
> In contrast to the explicit version:
>
> spin_lock(&lock);
> foovalue = new_foovalue;
> if (device_is_fooing) {
> writel(foovalue, REGISTER);
> mmiowb();
> }
> spin_unlock(&lock);
I think this is even more important when the actual IO is done somewhere
totally different from the locking. It's really confusing if you have a
"spin_unlock_io()" just because some routine you called wanted it.
But more importantly, I don't want to have "spin_unlock_io[_xyzzy]()",
where "xyzzy()" is all the irq/irqrestore/bh variations. It's just not
worth it. We already have enough variations on spinlocks, but at least
right now they are all in the "same category", ie it's all about what the
context of the _locking_ is, and at least the lock matches the unlock, and
there are no separate rules.
Linus
On Sunday 10 September 2006 19:49, Linus Torvalds wrote:
>
> On Sun, 10 Sep 2006, Michael Buesch wrote:
> > >
> > > That's what IRIX had. It would let us get rid of mmiowb and avoid doing
> > > a full sync in writeX, so may be the best option.
> >
> > Last time I suggested that, people did not want it.
>
> I would personally _much_ rather have a separate mmiowb() and a regular
> spinlock, than add a magic new spinlock.
Yeah, as far as I remember it was you who rejected it. ;)
But I second your statement because of the practical issues below.
> Of course, mmiowb() itself is not a great name, and we could/should
> probably rename it to make it more obvious what the hell it is.
>
> > There is one little problem in practice with something
> > like spin_unlock_io().
> >
> > spin_lock_io(&lock);
> > foovalue = new_foovalue;
> > if (device_is_fooing)
> > writel(foovalue, REGISTER);
> > spin_unlock_io(&lock);
> >
> > That would be an unnecessary sync in case the device is not fooing.
> > In contrast to the explicit version:
> >
> > spin_lock(&lock);
> > foovalue = new_foovalue;
> > if (device_is_fooing) {
> > writel(foovalue, REGISTER);
> > mmiowb();
> > }
> > spin_unlock(&lock);
>
> I think this is even more important when the actual IO is done somewhere
> totally different from the locking. It's really confusing if you have a
> "spin_unlock_io()" just because some routine you called wanted it.
>
> But more importantly, I don't want to have "spin_unlock_io[_xyzzy]()",
> where "xyzzy()" is all the irq/irqrestore/bh variations. It's just not
> worth it. We already have enough variations on spinlocks, but at least
> right now they are all in the "same category", ie it's all about what the
> context of the _locking_ is, and at least the lock matches the unlock, and
> there are no separate rules.
>
> Linus
>
--
Greetings Michael.
On Sun, 2006-09-10 at 10:18 -0700, Jesse Barnes wrote:
> We already have readX_relaxed, but that's for PIO vs. DMA ordering, not
> PIO vs. PIO. To distinguish from that case maybe writeX_weak or
> writeX_nobarrier would make sense?
We have existing users of the format "__foo" for unlocked or un-ordered
foo. __readl seems fairly natural and it's shorter to write than
_nobarriermaybesyncsafterlockbutnotwithmmio()
Alan
>>> As (I think) BenH mentioned in another email, the normal way Linux
>>> handles these interfaces is for the primary API (readX, writeX) to
>>> be strongly ordered, strongly coherent, etc. And then there is a
>>> relaxed version without barriers and syncs, for the smart guys who
>>> know what they're doing
>>
>> Indeed, I think that is the way to handle this.
>
> Well why didn't you guys mention this when mmiowb() went in?
>
> I agree that having a relaxed PIO ordering version of writeX makes
> sense
> (jejb convinced me of this on irc the other day). But what to name
> it?
> We already have readX_relaxed, but that's for PIO vs. DMA ordering,
> not
> PIO vs. PIO. To distinguish from that case maybe writeX_weak or
> writeX_nobarrier would make sense?
Why not just keep writel() etc. for *both* purposes; the address cookie
it gets as input can distinguish between the required behaviours for
different kinds of I/Os; it will have to be set up by the arch-specific
__ioremap() or similar. And then there could be a
pci_map_mmio_relaxed()
that, when a driver uses it, means "this driver requires only PCI
ordering rules for its MMIO, not x86 rules", so the driver promises to
do wmb()'s for DMA ordering and the mmiowb() [or whatever they become --
pci_mem_to_dma_barrier() and pci_cpu_to_cpu_barrier() or whatever] etc.
Archs that don't see the point in doing (much) faster and scalable I/O,
or just don't want to bother, can have this pci_map_mmio() just call
their x86-ordering ioremap() stuff.
The impact on architectures is minimal; they all need to implement this
new mapping function, but can just default it if they don't (yet) want
to take advantage of it. Archs that do care need to use the I/O cookie
as an actual cookie, not directly as a CPU address to write to, and then
use the proper ordering stuff for the kind of ordering the cookie says
is needed for this I/O range.
The impact on device drivers is even more minimal; drivers don't have to
change; and if they don't, they get x86 I/O ordering semantics. Drivers
that do want to be faster on other architectures, and have been converted
for it (and most interesting ones have), just need to change the ioremap()
of their MMIO space to the new API.
I hope this way everyone can be happy -- no one is forced to change, only
one new API is introduced.
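As a concrete (entirely hypothetical) illustration of the proposal, a
converted driver might look like this; pci_map_mmio_relaxed() and the
register offsets are invented names from the proposal, not existing
interfaces, and the barrier choice becomes the driver's own responsibility
under this scheme:

#include <linux/pci.h>
#include <linux/io.h>

static int hypothetical_probe(struct pci_dev *pdev)
{
	void __iomem *regs;

	/* Opt in to relaxed, PCI-rules-only ordering at map time
	 * (hypothetical API from the proposal above). */
	regs = pci_map_mmio_relaxed(pdev, 0 /* BAR */, 0x1000 /* length */);
	if (!regs)
		return -ENOMEM;

	/* ... descriptor setup in normal RAM elided ... */

	/* The accessor stays writel(); the driver now supplies ordering
	 * explicitly where it matters. */
	wmb();				/* order RAM stores before the doorbell */
	writel(1, regs + 0x40);		/* hypothetical doorbell register */
	return 0;
}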
Segher
I'm copying that from a private discussion I had. Please let me know if
you have comments. This proposal includes some questions too so please
answer :)
- writel/readl become totally ordered (including vs. memory). Basically
x86-like. Expensive (very expensive even on some architectures) but also
very safe. There might be room for one exception here: ordering vs.
locks, in which case we either still require mmiowb() or use a "trick"
consisting of having writel() set a flag in some per-cpu area and
spin_unlock() do a barrier when this flag is set. Thus Q: should it have
ordered semantics vs. locks too, or do we still require mmiowb(), though
maybe a renamed version of it? (That is, does it provide MMIO + memory
ordering instead of just memory + MMIO and MMIO + MMIO?) Remember that
providing the second is already expensive; providing both is -very-
expensive on some architectures. Or do we do neither (that is, no MMIO +
memory ordering) but also alleviate the need for mmiowb() with the trick
I described above for locks?
- __writel/__readl are totally relaxed between mem and IO, though they
still guarantee ordering between MMIO and MMIO. That guarantee, on
powerpc, can be achieved by putting most of the burden in __readl(),
thus leaving __writel() alone to still allow for write combining. We
still need to verify whether those semantics are realistic on other
out-of-order platforms. Jesse?
Or do you guys prefer them to be -also- relaxed for MMIO + MMIO? That
would make things a bit harder for platforms like PowerPC to use, as we
would need to define a specific mmio_to_mmio_barrier() for use with
them. I don't think that's useful if we can always implement ordering
between MMIO and MMIO and still allow for write combining on all
platforms.
- In order to provide explicit ordering with memory for the above, we
introduce mem_to_io_barrier() and io_to_mem_barrier() (see the sketch
after this list). It's still unclear whether to include mmiowb() as an
equivalent here, that is, whether the spinlock case has to be special
cased vs. io_to_mem_barrier(). I need to get Jesse's input on that one.
- We start removing those wmb()'s that have been added to drivers to
handle the ia64 & powerpc case, and possibly convert the "hot" code path
of such drivers to use the relaxed variants with explicit ordering.
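A sketch of where the two proposed barriers would sit relative to the
relaxed accessors (the accessor and barrier names are taken from the
proposal above and do not exist as kernel interfaces; the registers are
invented):

static void proposed_tx_and_poll(void __iomem *regs, __le32 *desc, u32 idx)
{
	u32 status;

	desc[0] = cpu_to_le32(idx);	/* descriptor store to normal RAM */
	mem_to_io_barrier();		/* order it before the MMIO store */
	__writel(idx, regs + 0x10);	/* relaxed MMIO store (hypothetical register) */

	status = __readl(regs + 0x14);	/* relaxed MMIO load (hypothetical register) */
	io_to_mem_barrier();		/* order it before the RAM store below */
	desc[1] = cpu_to_le32(status);
}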
Now, this is definitely not 2.6.18 material. For 2.6.18, we can either
ignore the problem and apply a band-aid to tg3 by adding some missing
barriers, or we can tweak the powerpc writel to become ordered vs.
previous memory writes. In that latter case, though, to avoid too big a
performance hit, we'll also remove its previous barrier that was used
to prevent leaking out of locks, and implement the "trick" described
above with a per-cpu variable, so that spin_unlock() does the barrier
whenever it was preceded by a writel.
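A minimal sketch of that per-cpu flag trick (illustrative only, using
invented names and current per-cpu helpers; the eventual powerpc
implementation may differ in detail):

static DEFINE_PER_CPU(int, io_sync_pending);

static inline void writel_with_trick(u32 val, void __iomem *addr)
{
	mb();					/* order prior RAM stores before the MMIO store */
	__raw_writel((__force u32)cpu_to_le32(val), addr);
	__this_cpu_write(io_sync_pending, 1);	/* remember that we did MMIO */
}

/* called from spin_unlock(): only pay for the heavy barrier ("sync" on
 * powerpc) when an MMIO write actually preceded the unlock */
static inline void unlock_io_sync(void)
{
	if (__this_cpu_read(io_sync_pending)) {
		mb();
		__this_cpu_write(io_sync_pending, 0);
	}
}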
Cheers,
Ben.
On Mon, 2006-09-11 at 07:25 +1000, Benjamin Herrenschmidt wrote:
> I'm copying that from a private discussion I had. Please let me know if
> you have comments. This proposal includes some questions too so please
> answer :)
Looks sane, and Linus seems to like mmiowb. Only other question: what are
the guarantees of memcpy_to/fromio? Does it access the memory in an ordered
fashion or not, and does it guarantee ordering only at the end of the copy
or during it?
On Sun, 2006-09-10 at 23:23 +0100, Alan Cox wrote:
> On Mon, 2006-09-11 at 07:25 +1000, Benjamin Herrenschmidt wrote:
> > I'm copying that from a private discussion I had. Please let me know if
> > you have comments. This proposal includes some questions too so please
> > answer :)
>
> Looks sane and Linus seems to like mmiowb. Only other question: what are
> the guarantees of memcpy_to/fromio. Does it access the memory in ordered
> fashion or not, does it guarantee only ordering at the end of the copy
> or during it ?
Well, Linus is also ok with writel not ordering memory and IO accesses :)
Though he also mentioned that if we go that route (which is what we have
now, in fact), we take on the burden of having to test and fix drivers
that don't get it...
That's why I think a compromise is in order, thus my proposal :)
Ben.
> - __writel/__readl are totally relaxed between mem and IO, though
> they
> still guarantee ordering between MMIO and MMIO.
__writel/__readl should be architecture specific no matter what; no
device driver has any business using them (except right now, some
_have_ to use them because everything else hurts performance way too
much -- not a good situation to be in).
> - In order to provide explicit ordering with memory for the above, we
> introduce mem_to_io_barrier() and io_to_mem_barrier().
These should be bus-specific. Device drivers know what kind of bus
they are on, they have to know anyway, and it's not exactly a bad
thing if they can take advantage of this knowledge.
> It's still unclear
> whether to include mmiowb() as an equivalent here, that is whether the
> spinlock
> case has to be special cased vs. io_to_mem_barrier(). I need to get
> Jesse input on that one.
C, and most assembly languages fwiw, presume a "single thread of
execution", a single execution context. Which is a fine and quite
performant enough model for most things. It's not how the real
world works though, so we need a barrier. I believe calling this
barrier "pci_cpu_to_cpu_barrier()" and having its semantics be,
any PCI access (by this driver) before the barrier on this CPU will
be seen by any agent in the system, before any access (again by this
driver) after the barrier (perhaps on a different CPU), will a) be
not too hard on driver authors to understand and get right; and b)
will not sacrifice a lot of performance on any system.
> - We start removing those wmb()'s that have been added to drivers to
> handle the ia64 & powerpc case, and possibly convert the "hot" code
> path
> of such drivers to use the relaxed variants with explicit ordering.
Not remove the wmb()'s, but replace them by something sane.
> Now, this is definitely not 2.6.18 material. For 2.6.18, we can either
> ignore the problem and apply a band-aid to tg3 by adding some missing
> barriers,
The tg3 bug actually seems not to be because of the missing wmb()'s
[the driver and all net traffic survive just fine in the case of non-TSO],
but just because of a plain-and-simple programming bug in the driver.
I'll run some tests tomorrow to confirm. If I'm right, this fix should
go into .18 and into .17-stable at least.
Of course the problems that the PowerPC port currently has with the
missing wmb()'s are still there, but in the non-TSO case they almost
always result in 100% garbage sent on the actual ethernet line, and
that doesn't impede correctness (and it has been there since about
forever; even the TSO case that _does_ corrupt data was in the released
2.6.17, it's very hard to hit). There's no reason to delay 2.6.18
release or change the PowerPC I/O accessors because of this single
issue, knowing it was in 2.6.17 already.
> or we can tweak the powerpc writel to become ordered vs.
> previous memory writes. In that later case, though, to avoid a too big
> performance hit, we'll also remove it's previous barrier that was used
> to prevent leaking out of locks, and implement the "trick" described
> above with a per-cpu variable, so that spin_unlock() does the barrier
> whenever it was preceded by a writel.
That "trick" to avoid I/O accesses to "leak out" of protected regions
is just fine, and should be done no matter what -- if it is decreed that
spinlocks order I/O accesses at all, which is not a bad idea at all (in
my opinion anyway).
Segher
Ok, so we have two different proposals here...
Maybe we should cast a vote ? :)
* Option A:
- writel/readl are fully synchronous (minus mmiowb for spinlocks)
- we provide __writel/__readl with some barriers with relaxed ordering
between memory and MMIO (though still _precise_ semantics, not arch
specific)
* Option B:
- The driver decides at ioremap time whether it wants a fully ordered
mapping or not, using a "special" version of ioremap (with flags ?)
- writel/readl() behave differently depending on the mapping
- __writel/__readl might exist but are architecture specific (ahem...
still to be debated)
The former seems easier to me to implement. The latter might indeed be a
bit easier for driver writers, though I'm not 100% convinced. The latter
means stuffing special tokens in the returned address from ioremap and
testing for them in writel. However, the latter is also what we need for
write combining (at least for PowerPC, and maybe for other archs; whether
a mapping can write combine has to be decided by using flags in the page
table, and thus has to be done at ioremap time). (*)
Votes ?
(*) Unrelated note about write combining: due to weird implementation
issues, one cannot guarantee, using barriers, to prevent write combining
on a write-combine-enabled mapping. This doesn't work because CPUs can
combine stores between threads on such mappings. I know of at least one
CPU doing that (Cell); there might be a lot more though. Thus a driver
using write combining might want to create two mappings for its IOs, one
with combining enabled and one ordered, and use the "right" one depending
on the requirements of a given IO access.
> The tg3 bug actually seems not to be because of the missing wmb()'s,
> [the driver and all net traffic survive just fine in the case of non-
> TSO],
> but just because of a plain-and-simple programming bug in the driver.
> I'll run some tests tomorrow to confirm. If I'm right, this fix should
> go into .18 and into .17-stable at least.
Interesting :) I didn't actually verify the barrier problem theory
(though the driver does indeed seem to lack barriers, so there _is_ a
problem there too); I trusted Michael Chan, who seemed to know about the
bug :) Anyway, that doesn't change the fact that there is a problem with
barriers, and we are on a good path to finally do something sane about
it.
> Of course the problems that the PowerPC port currently has with the
> missing wmb()'s are still there, but in the non-TSO case they almost
> always result in 100% garbage sent on the actual ethernet line, and
> that doesn't impede correctness (and it has been there since about
> forever; even the TSO case that _does_ corrupt data was in the released
> 2.6.17, it's very hard to hit). There's no reason to delay 2.6.18
> release or change the PowerPC I/O accessors because of this single
> issue, knowing it was in 2.6.17 already.
We can add the missing barriers to tg3 for 2.6.18 nonetheless.
> That "trick" to avoid I/O accesses to "leak out" of protected regions
> is just fine, and should be done no matter what -- if it is decreed that
> spinlocks order I/O accesses at all, which is not a bad idea at all (in
> my opinion anyway).
Or we rely on an explicit barrier for that, which is what mmiowb() has
been defined for. We can use the "trick" to detect the missing ones
though, as a debugging aid.
Ben.
On Sunday, September 10, 2006 2:25 pm, Benjamin Herrenschmidt wrote:
> I'm copying that from a private discussion I had. Please let me know
> if you have comments. This proposal includes some questions too so
> please answer :)
>
>
> - writel/readl become totally ordered (including vs. memory).
> Basically x86-like. Expensive (very expensive even on some
> architectures) but also very safe.
This approach will minimize driver changes, and would imply the removal
of some existing mmiowb() and wmb() macros.
> There might be room for one
> exception here: ordering vs locks, in which case we either still
> require mmiowb() or use a "trick" consisting of having writel() set a
> flag in some per-cpu area and spin_unlock() do a barrier when this
> flag is set. Thus Q: Should it have ordered semantics vs. locks too
> or do we require mmiowb() still, though maybe
> a renamed version of it ? (That is, does it provide MMIO + memory
> ordering
> instead of just memory + MMIO and MMIO + MMIO ?). Remember that
> providing the
> second is already expensive, providing both is -very- expensive on
> some architectures. Or do we do neither (that is no MMIO + memory
> ordering) but
> also alleviate the need for mmiowb() with the trick I described in
> locks ?
If we accept this, I don't think we're much better off than we are
currently (not that I have a problem with that). That is, many drivers
would still need to be fixed up.
> - __writel/__readl are totally relaxed between mem and IO, though
> they still guarantee ordering between MMIO and MMIO. That guaranteed,
> on powerpc, can be achieved by putting most of the burden in
> __readl(), thus letting __writel() alone to still allow for write
> combining. We still need to verify whether those semantics are
> realistic on other out-of-order platforms. Jesse ?
Depends on what you mean by "ordered between MMIO and MMIO". mmiowb()
was initially introduced to allow ordering of writes between CPUs, and
really didn't say anything about memory vs. I/O, so the semantics you
describe here would also be different (and more expensive) than what we
have now.
> Or do you guys prefer it to be -also- relaxed for MMIO + MMIO ? That
> would
> make thing a bit harder for platforms like PowerPC to use them as we
> would
> need to defined a specific mmio_to_mmio_barrier() for use with them.
> I don't
> think that's useful if we can always implement ordering between MMIO
> and MMIO
> and still allow for write combining on all plaforms.
This is what mmiowb() is supposed to be, though only for writes. I.e.
two writes from different CPUs may arrive out of order if you don't use
mmiowb() carefully. Do you also need a mmiorb() macro or just a
stronger mmiob()?
> - In order to provide explicit ordering with memory for the above,
> we introduce mem_to_io_barrier() and io_to_mem_barrier(). It's still
> unclear
> whether to include mmiowb() as an equivalent here, that is whether the
> spinlock
> case has to be special cased vs. io_to_mem_barrier(). I need to get
> Jesse input
> on that one.
mmiowb() could be written as io_to_io_write_barrier() if we wanted to be
extra verbose. AIUI it's the same thing as smp_wmb() but for MMIO
space.
Jesse
On Sunday, September 10, 2006 5:12 pm, Benjamin Herrenschmidt wrote:
> Ok, so we have two different proposals here...
>
> Maybe we should cast a vote ? :)
>
> * Option A:
>
> - writel/readl are fully synchronous (minus mmiowb for spinlocks)
> - we provide __writel/__readl with some barriers with relaxed
> ordering between memory and MMIO (though still _precise_ semantics,
> not arch specific)
>
> * Option B:
>
> - The driver decides at ioremap time whether it wants a fully ordered
> mapping or not using
> a "special" version of ioremap (with flags ?)
> - writel/readl() behave differently depending on the mapping
> - __writel/__readl might exist but are architecture specific
> (ahem... still to be debated)
>
> The former seems easier to me to implement. The later might indeed be
> a bit easier for drivers writers, I'm not 100% convinced tho. The
> later means stuffing special tokens in the returned address from
> ioremap and testing for them in writel. However, the later is also
> what we need for write combining (at least for PowerPC, maybe for
> other archs, whether a mapping can write combine has to be decided by
> using flags in the page table, thus has to be done at ioremap time.
> (*)
Yeah, write combining is a good point. After all these years we *still*
don't have a good in-kernel interface for changing memory mapped
attributes, so adding a 'flags' argument to ioremap might be a good
idea (cached, uncached, write combine are the three variants I can
think of off the top of my head).
But doing MMIO ordering this way seems somewhat expensive since it means
extra checks in the readX/writeX routines, which are normally very
fast.
So I guess I'm saying we should have both.
- existing readX/writeX routines are defined to be strongly ordered
- new MMIO accessors are added with weak semantics (not sure I like
the __ naming though, driver authors will have to continually refer
to documentation to figure out what they mean) along with new
barrier macros to synchronize things appropriately
- flags argument to ioremap for cached, uncached, write combine
attributes (this implies some TLB flushing and other arch specific
state flushing, also needed for proper PAT support)
Oh, and all MMIO accessors are *documented* with strongly defined
semantics. :)
If we go this route though, can I request that we don't introduce any
performance regressions in drivers currently using mmiowb()? I.e.
they'll be converted over to the new accessor routines when they become
available along with the new barrier macros?
Thanks,
Jesse
>> - writel/readl become totally ordered (including vs. memory).
>> Basically x86-like. Expensive (very expensive even on some
>> architectures) but also very safe.
>
> This approach will minimize driver changes, and would imply the
> removal
> of some existing mmiowb() and wmb() macros.
Like I tried to explain already, in my competing approach no drivers
would need changes either. And you could remove those macros (or their
more-verbosely-saying-what-they're-doing, preferably bus-specific,
equivalents) as well -- but you'll face the wrath of those who care
about the performance of those drivers on non-x86 platforms.
> This is what mmiowb() is supposed to be, though only for writes. I.e.
> two writes from different CPUs may arrive out of order if you don't
> use
> mmiowb() carefully. Do you also need a mmiorb() macro or just a
> stronger mmiob()?
I'd name this barrier pci_cpu_to_cpu_barrier() -- what it is supposed
to do is order I/O accesses from the same device driver to the same
device, from different CPUs. The same driver is never concurrently
running on more than one CPU right now, which is a fine model.
I include "pci_" in the name, so that we can distinguish between
different bus types, which after all have different ordering rules.
PCI is a very common bus type of course, which explains why there
is mmiowb() and no ..rb() -- this is simply not needed on PCI
(PCI MMIO reads are _always_ slow -- non-posted accesses, in PCI
terminology).
> mmiowb() could be written as io_to_io_write_barrier() if we wanted
> to be
> extra verbose. AIUI it's the same thing as smp_wmb() but for MMIO
> space.
Except that "main-memory" ("coherent domain") accesses are always
atomic as far as this ordering is concerned -- starting a transaction
and having its result are not separately observable. For I/O this
is different -- the whole point of mmiowb() was that although without
it the device drivers _start_ the transactions in the right order just
fine, the order the I/O adapters see them might be different (because
there are multiple paths from different CPUs to the same I/O adapter, or
whatnot).
Hence my proposal of calling it pci_cpu_to_cpu_barrier() -- what it
orders is accesses from separate CPUs. Oh, and it's bus-specific,
of course.
Segher
> If we accept this, I don't think we're much better off than we are
> currently (not that I have a problem with that). That is, many
> drivers
> would still need to be fixed up.
Not necessarily if you introduce the trick of doing the mmiowb() in
spin_unlock when a per-cpu flag has been set previously by writel... not
sure if it's worth it, though.
> > - __writel/__readl are totally relaxed between mem and IO, though
> > they still guarantee ordering between MMIO and MMIO. That
> guaranteed,
> > on powerpc, can be achieved by putting most of the burden in
> > __readl(), thus letting __writel() alone to still allow for write
> > combining. We still need to verify whether those semantics are
> > realistic on other out-of-order platforms. Jesse ?
>
> Depends on what you mean by "ordered between MMIO and MMIO".
> mmiowb()
> was initially introduced to allow ordering of writes between CPUs,
> and
> really didn't say anything about memory vs. I/O, so the semantics you
> describe here would also be different (and more expensive) than what
> we
> have now.
No. What I mean is that two consecutive MMIO writes on the same CPU stay
in order, and reads can't cross writes. The relaxed versions would still
require mmiowb() (or another name) for synchronisation against
spinlocks. As I told you before, I much prefer to present the sync of
mmiowb() as a sync with locks rather than a sync with "other MMIOs on
another CPU", since the latter doesn't mean anything outside of the
context of a lock.
> This is what mmiowb() is supposed to be, though only for writes.
> I.e.
> two writes from different CPUs may arrive out of order if you don't
> use
> mmiowb() carefully. Do you also need a mmiorb() macro or just a
> stronger mmiob()?
No, you misunderstand me. That's the main problem with mmiowb() and
that's why it's so unclear to so many people: the way you keep
presenting it as synchronizing MMIO vs. MMIO. I think it's a lot
clearer if you present it as synchronizing MMIOs with locks. MMIO vs.
MMIO is another matter. It's whether consecutive MMIO writes can be
re-ordered, whether MMIO loads can cross a write (either because the
load is performed late, only when the value is actually used, or because
the load can be executed before a preceding write). That's what the
current __raw_* versions on PowerPC will allow, in addition to not doing
the endian swap. My proposal was that __writel/__readl, however, would
keep MMIO vs. MMIO ordering (wouldn't allow that sort of re-ordering);
however, they wouldn't order vs. spinlocks (would still require mmiowb)
nor vs. main memory (cacheable storage).
> > - In order to provide explicit ordering with memory for the above,
> > we introduce mem_to_io_barrier() and io_to_mem_barrier(). It's still
> > unclear
> > whether to include mmiowb() as an equivalent here, that is whether the
> > spinlock
> > case has to be special cased vs. io_to_mem_barrier(). I need to get
> > Jesse input
> > on that one.
>
> mmiowb() could be written as io_to_io_write_barrier() if we wanted to be
> > extra verbose. AIUI it's the same thing as smp_wmb() but for MMIO
> space.
I'm very much against your terminology. It's -not- an IO to IO barrier.
It's an IO to lock barrier. Really. IO to IO is something else. Ordering
of IOs between CPUs has absolutely no meaning outside of the context of
locked regions in any case.
Cheers,
Ben.
> If we go this route though, can I request that we don't introduce any
> performance regressions in drivers currently using mmiowb()? I.e.
> they'll be converted over to the new accessor routines when they become
> available along with the new barrier macros?
There are few enough of them, I've grep'ed, so that should be doable.
The argument Segher mentioned in favor of his approach (option B ->
ioremap flags) is that doing a test in writeX/readX is very cheap
compared to the cost of IOs in general and would make driver conversion
easier: you don't have to change a single occurrence of writel/readl,
just add the necessary barriers and change the ioremap call. Thus I tend
to agree that his approach makes it easier from a driver writer's point
of view.
Now, I don't have a strong preference myself, which is why I asked for a
vote here. So far, I count your vote for A :)
Ben.
> Hence my proposal of calling it pci_cpu_to_cpu_barrier() -- what it
> orders is accesses from separate CPUs. Oh, and it's bus-specific,
> of course.
I disagree on that one, as I disagree with Jesse's terminology too :)
Ordering between stores issued by different CPUs has no meaning
whatsoever unless you have locks, that is, some kind of
synchronisation primitive between the 2 CPUs. Outside of that, the
concept of ordering doesn't make any sense.
Thus the problem is really only of MMIO stores leaking out of locks,
thus it's really a MMIO vs. lock barrier, and it's a lot easier to
understand that way imho.
Ben.
> Yeah, write combining is a good point. After all these years we
> *still*
> don't have a good in-kernel interface for changing memory mapped
> attributes, so adding a 'flags' argument to ioremap might be a good
> idea (cached, uncached, write combine are the three variants I can
> think of off the top of my head).
But what does "write-combine" mean? There are many different
implementations of basically this same idea, and implementing
just the lowest common denominator of this will hardly get us
out of the mess we are in already.
> - existing readX/writeX routines are defined to be strongly ordered
> - new MMIO accessors are added with weak semantics (not sure I like
> the __ naming though, driver authors will have to continually
> refer
> to documentation to figure out what they mean) along with new
> barrier macros to synchronize things appropriately
What exactly will "weak" mean? If it's weak enough to please all
architectures and busses, it'll be so weak that you'll need 2**N
(with a big N) different barriers.
> - flags argument to ioremap
ioremap is a bad name anyway; if we're going to change the API, change the
name as well (and it's a bad idea to keep the same name but make it
mean something different, anyway).
> Oh, and all MMIO accessors are *documented* with strongly defined
> semantics. :)
Not sure what this means? Document them in all-caps? :-)
> If we go this route though, can I request that we don't introduce any
> performance regressions in drivers currently using mmiowb()? I.e.
> they'll be converted over to the new accessor routines when they
> become
> available along with the new barrier macros?
In my proposal at least, those drivers won't lose any performance
(except for a conditional on their I/O cookie, which is a trivial
performance loss compared to the cost of the I/O /an sich/); they
won't need any changes either, except for some renaming. Drivers
_not_ using it might get a performance loss though, because they
will be forced to run with every-I/O-ordered semantics; they will
suddenly start to work *correctly* though.
Segher
> What exactly will "weak" mean? If it's weak enough to please all
> architectures and busses, it'll be so weak that you'll need 2**N
> (with a big N) different barriers.
Which is why I proposed a precise semantic: weak MMIO vs. main memory but
strong between MMIOs on the same CPU. Which, in PowerPC language,
translates into no barrier in __writel and an eieio, load, twi, isync
sequence in __readl (maybe even a second eieio, I have to think about it).
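For reference, that sequence in a (hedged) __readl sketch; the
never-trapping twi plus isync is the usual powerpc idiom for making later
instructions wait until the MMIO load data has actually returned.
Illustrative only, not the exact kernel source:

static inline u32 __readl_sketch(const volatile void __iomem *addr)
{
	u32 ret;

	__asm__ __volatile__(
		"eieio\n\t"		/* order vs. preceding MMIO accesses */
		"lwbrx %0,0,%1\n\t"	/* little-endian MMIO load */
		"twi 0,%0,0\n\t"	/* never traps, but creates a data dependency */
		"isync"			/* nothing later runs until the load completes */
		: "=r" (ret)
		: "r" (addr)
		: "memory");
	return ret;
}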
> > - flags argument to ioremap
>
> ioremap is a bad name anyway, if we'll change the API, change the
> name as well (and it's a bad idea to keep the same name but make it
> mean something different, anyway).
We already have a new API with a nicer name, pci_iomap, to use with the
new "ioread/iowrite" accessors that Linus defined a while ago. Problem:
- It's a unifed PIO/MMIO interface, thus there is some added overhead
to calls compared to a simple MMIO-only writel/readl. Though your
proposal would add a similar overhead to writel/readl, so maybe we can
consolidate all of these ...
- It doesn't have a flags argument. But then, there are few enough
users that this can be easily fixed (at least more easily than ioremap)
- It has few users :) Thus lots of drivers will need heavy conversion.
But it's a good point to remember the existence of this "other"
MMIO/PIO access API, because adding a whole lot of __* versions of those
calls in addition to the MMIO-only readX/writeX would be a mess, which
tends to favor your option of just having a flag passed at map time for
the sake of simplicity.
> > Oh, and all MMIO accessors are *documented* with strongly defined
> > semantics. :)
>
> Not sure what this means? Document them in all-caps? :-)
>
> > If we go this route though, can I request that we don't introduce any
> > performance regressions in drivers currently using mmiowb()? I.e.
> > they'll be converted over to the new accessor routines when they
> > become
> > available along with the new barrier macros?
>
> In my proposal at least, those drivers won't lose any performance
> (except for a conditional on their I/O cookie, which is a trivial
> performance loss compared to the cost of the I/O /an sich/); they
> won't need any changes either, except for some renaming. Drivers
> _not_ using it might get a performance loss though, because they
> will be forced to run with every-I/O-ordered semantics; they will
> suddenly start to work *correctly* though.
As I wrote in my "vote" email, I have no strong preference at this
point, though I must admit that the fact that your proposal makes the
driver conversion trivial and ends up, imho, easier for driver writers
(and possibly more flexible for archs) makes it appealing. The "only"
problem is the possibly added overhead of a test in readl/writel. I
don't know whether that's relevant enough to justify not going your way
though. Only platforms like ia64 and powerpc will need that test (thus
no performance regression on x86); it might be worth a bit of
benchmarking here to see whether the added test has any real impact on
performance.
Note that nothing prevents us from adding _both_ proposals. It may sound
a bit weird but at the end of the day, if writel/readl are defined as
being strongly ordered, as per Option A, adding a test to make them
optionally -not- strongly ordered based on a cookie in the address won't
hurt much.
In which case, we end up with those 3 cases:
- slow case: normal ioremap, readl/writel strongly ordered
- faster case : "special" ioremap, readl/writel faster
- even faster case : __readl/__writel (save a test from the above)
I mean that having both options at once is possible. Whether it's
something we want to do is a different matter, as it might be seen as
adding confusion.
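Just to show what I mean, the combined scheme could look vaguely like
this (the flag name and its encoding in the cookie are made up, that
part would be arch specific anyway):

static inline void writel(u32 val, volatile void __iomem *addr)
{
	unsigned long a = (unsigned long)addr;

	if (!(a & IOMEM_RELAXED))
		wmb();		/* strongly ordered mapping: fence vs. RAM first */
	__writel(val, (volatile void __iomem *)(a & ~IOMEM_RELAXED));
}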
Ben.
>> Hence my proposal of calling it pci_cpu_to_cpu_barrier() -- what it
>> orders is accesses from separate CPUs. Oh, and it's bus-specific,
>> of course.
>
> I disagree on that one, as I disagree on Jesse terminology too :)
>
> Ordering between stores issued by different CPUs has no meaning
> whatsoever unless you have locks. That is you have some kind of
> synchronisation primitive between the 2 CPUs.
And that's exactly what mmiowb() does right now -- it makes sure
the I/O ends up at some I/O hub that will keep the accesses in
order, before it allows the current CPU to continue.
> Outside of that, the
> concept of ordering doesn't make any sense.
>
> Thus the problem is really only of MMIO stores leaking out of locks,
> thus it's really a MMIO vs. lock barrier, and it's a lot easier to
> understand that way imho.
MMIO-as-seen-by-its-target vs. whatever-the-cpus-that-originated-those-
I/Os-think-the-order-is, sure.
The CPU running the "mmiowb()" needs to make sure that the mmiowb()
finished before it allows another CPU to run code that does I/O to the
same device. I thought (most of) this was automatic in Linux (except
for the difference between a CPU doing the access, and the I/O device
seeing it, which is what mmiowb() is meant to solve)? Or are
we just safe from all kinds of similar issues, because driver code
tends to run under interrupt locks?
Aaaaaaaanyway... the question of what to call mmiowb() and what its
exact semantics would become, is a bit of a side issue right now, let's
discuss it later...
Segher
On Mon, 2006-09-11 at 03:48 +0200, Segher Boessenkool wrote:
> > Ordering between stores issued by different CPUs has no meaning
> > whatsoever unless you have locks. That is you have some kind of
> > synchronisation primitive between the 2 CPUs.
>
> And that's exactly what mmiowb() does right now -- it makes sure
> the I/O ends up at some I/O hub that will keep the accesses in
> order, before it allows the current CPU to continue.
No. mmiowb isn't itself a synchronisation primitive between CPUs. It's a
barrier. The lock enclosing the MMIOs is the synchronisation primitive.
mmiowb() makes it actually work with MMIOs since otherwise, MMIO stores
might "leak" out of the lock.
There is no concept of ordering between a flow of stores from 2 CPUs.
There is no "before" or "after" (or rather, they aren't defined) unless
you create a common "t0" reference, in which case you can indeed say
whether a given store on a given side is before or after this "t0".
This common reference can only exist if code on -both- sides actually
implements it, that is, it has to appear somewhere in the program to have
any meaning.
mmiowb() doesn't do that. The spinlock does. That's the reference. That's
what creates a concept of "before" and "after" (or rather "inside" or
"outside"). mmiowb() is merely an implementation detail to make this
actually work when MMIOs are involved.
> Aaaaaaaanyway... the question of what to call mmiowb() and what its
> exact semantics would become, is a bit of a side issue right now, let's
> discuss it later...
See my proposed document with explicit semantics.
Ben.
Benjamin Herrenschmidt wrote:
> > The tg3 bug actually seems not to be because of the missing wmb()'s,
> > [the driver and all net traffic survive just fine in the case of
> > non-TSO], but just because of a plain-and-simple programming bug in
> > the driver. I'll run some tests tomorrow to confirm. If I'm right,
> > this fix should go into .18 and into .17-stable at least.
>
> Interesting :) I didn't actually verify the barrier problem theory
> (though the driver does indeed seem to lack them, so there _is_ a
> problem there too), I trusted Michael Chan who seemed to know about
> the bug :)
It definitely is caused by lack of memory barriers before the writel().
IBM, Anton, and all of us know about this. TSO probably makes it more
susceptible because you write to many buffer descriptors before you
issue one writel() to DMA all the descriptors. The large number of
TSO descriptors makes re-ordering more likely.
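In other words, the fix amounts to something like this in the TX path
(field and register names here are just illustrative, not the exact tg3
ones):

	/* fill in the descriptor(s) in RAM */
	txd->addr      = cpu_to_le32(mapping);
	txd->len_flags = cpu_to_le32((len << 16) | flags);

	/* make the descriptor stores visible to the device before the
	 * doorbell write that triggers the DMA */
	wmb();

	writel(prod_idx, regs + TX_PROD_MAILBOX);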
> It definitely is caused by lack of memory barriers before the writel().
> IBM, Anton, and all of us know about this. TSO probably makes it more
> susceptible because you write to many buffer descriptors before you
> issue one writel() to DMA all the descriptors. The large number of
> TSO descriptors makes re-ordering more likely.
The barrier problem exists for sure. However, I've added a barrier before
all writes to the mailboxes and I still see the problem. Any idea what I
may have missed?
I'm now investigating a theory of possible missing barriers in the
producer/consumer variable manipulation if the ring ends up being full
and emptied entirely all at once. It's a bit of a headache to wrap my head
around this driver as it manages to totally avoid locks and uses very few
barriers in general for ordering, both between CPUs and against the data
structures shared with the chip, like the hw status structure. I don't say
it's bad, I say it's complex to make sure it's completely right :)
For example the driver never flushes the PCI buffers with an MMIO read
before reading shared data structures. That means that it can, I suppose
(I hope) rely on the producer/consumer fields in there always being
updated last (that is after all relevant descriptors have been updated),
but also, that it can afford to lose interrupts (since it doesn't flush
the PCI buffers when getting an interrupt, it reads potentially stale
status and then acts upon that). It might all be on purpose though :)
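For comparison, a more conservative interrupt path would do something
like this (register names made up, this is not what tg3 actually does):

	writel(1, regs + INT_MBOX);	/* ack/mask the interrupt */
	(void)readl(regs + INT_MBOX);	/* read back: forces posted DMA writes
					 * (status block update) to memory
					 * before we look at them */
	rmb();
	status = hw_status->status;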
Cheers,
Ben.
On Mon, 2006-09-11 at 10:12 +1000, Benjamin Herrenschmidt wrote:
> - writel/readl are fully synchronous (minus mmiowb for spinlocks)
> - we provide __writel/__readl with some barriers with relaxed ordering
> between memory and MMIO (though still _precise_ semantics, not arch
> specific)
I'd rather they were less precise than your later proposal but that
reflects the uses I'm considering perhaps.
> * Option B:
>
> - The driver decides at ioremap time whether it wants a fully ordered
> mapping or not using
That is expensive because writel/readl end up full of if() while at the
moment they are often 1 instruction.
On Mon, 2006-09-11 at 10:02 +0100, Alan Cox wrote:
> On Mon, 2006-09-11 at 10:12 +1000, Benjamin Herrenschmidt wrote:
> > - writel/readl are fully synchronous (minus mmiowb for spinlocks)
> > - we provide __writel/__readl with some barriers with relaxed ordering
> > between memory and MMIO (though still _precise_ semantics, not arch
> > specific)
>
> I'd rather they were less precise than your later proposal but that
> reflects the uses I'm considering perhaps.
My latest proposal basically defines the rules that I think are useful
for drivers in practice and leaves anything else undefined on purpose
(which means cannot be relied on).
For example, Rule #1, MMIO vs. MMIO has to be strongly defined. Drivers
have to know they can rely on MMIOs staying in order as issued on a
given CPU (and thus read to flush posted writes works etc...). Rule #2
and #3 cover the common cases of mixing up DMA and MMIO, and rule #4 the
special case of MMIOs leaking out of spinlocks. I purposefully didn't
define rules for MMIO W + memory W, or memory R + MMIO R for example.
That would constrain arch implementations too much and I don't see such
rules being useful in practice.
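To illustrate the two common mixed DMA/MMIO situations those rules are
about (explicit barriers shown; with fully ordered writel/readl they
would be implied, and all names here are made up):

	/* DMA out: RAM stores visible before the MMIO doorbell */
	ring[prod].addr = cpu_to_le32(buf_dma);
	ring[prod].len  = cpu_to_le32(buf_len);
	wmb();
	writel(prod, regs + PROD_IDX);

	/* DMA in: MMIO read done before we read the DMA'd data */
	cons = readl(regs + CONS_IDX);
	rmb();
	handle(ring[cons].status);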
> > * Option B:
> >
> > - The driver decides at ioremap time whether it wants a fully ordered
> > mapping or not using
>
> That is expensive because writel/readl end up full of if() while at the
> moment they are often 1 instruction.
Yup. I agree. It also has another drawback, which is to make it
non-obvious when reading driver code whether a given writel() is ordered
or not, and thus whether it needs a barrier or not. So it makes auditing
drivers to find such ordering bugs harder for us. It does make porting
of drivers to relaxed semantics easier tho. In any case, it's not
fundamentally incompatible with my proposal in that we can have
readl/writel do the test thing (thus be either Class 1 or Class 2 as per
defined in my proposal) based on a map-time flag, and have the
__readl/__writel versions always be Class 2 (that is relaxed ordering
and fast). But I'd rather not mix up the problems right now and thus
leave that aside until I'm sure we have an agreement on the semantics
and the __* accessors semantics.
Cheers,
Ben.
>>>>> "Ben" == Benjamin Herrenschmidt <[email protected]> writes:
Ben> On Sun, 2006-09-10 at 23:23 +0100, Alan Cox wrote:
>> On Mon, 2006-09-11 at 07:25 +1000, Benjamin Herrenschmidt wrote:
>> > I'm copying that from a private discussion I had. Please let me
>> > know if you have comments. This proposal includes some questions
>> > too so please answer :)
>>
>> Looks sane and Linus seems to like mmiowb. Only other question:
>> what are the guarantees of memcpy_to/fromio. Does it access the
>> memory in ordered fashion or not, does it guarantee only ordering
>> at the end of the copy or during it ?
Ben> Well, Linus is also ok with writel not ordering memory and IO
Ben> accesses :) Though he also mentioned that if we go that route
Ben> (which is what we have now in fact), we take the burden of having
Ben> to test and fix drivers who don't get it...
We have to do this on SN2 anyway, so this way we can benefit from
each other's work :)
Cheers,
Jes
From: Segher Boessenkool <[email protected]>
Date: Sun, 10 Sep 2006 22:01:20 +0200
> Why not just keep writel() etc. for *both* purposes; the address cookie
> it gets as input can distinguish between the required behaviours for
> different kinds of I/Os; it will have to be setup by the arch-specific
> __ioremap() or similar.
This doesn't work when the I/O semantics are encoded into the
instruction, not some virtual mapping PTE bits. We'll have to use
a conditional or whatever in that case, which is silly.
>> Why not just keep writel() etc. for *both* purposes; the address
>> cookie
>> it gets as input can distinguish between the required behaviours for
>> different kinds of I/Os; it will have to be setup by the arch-
>> specific
>> __ioremap() or similar.
>
> This doesn't work when the I/O semantics are encoded into the
> instruction, not some virtual mapping PTE bits. We'll have to use
> a conditional or whatever in that case, which is silly.
Why is this "silly"? Slowing down I/O accessors by a tiny little
bit isn't expensive, certainly not when compared to the cost of
having to do big-hammer synchronisation all over the place all the
time, like we need to do in the "all busses are strongly ordered
wrt every other agent in the system" case.
Archs that _do_ implement only one set of ordering rules for every
bus, i.e. use the lowest common denominator on everything, do not
need such a conditional either of course -- only archs that want to
_gain_ performance.
There's plenty of other scenarios where such a conditional is
needed already btw, for example when some host bridges don't implement
PCI memory space as memory-mapped, but via an address+data register
(and yeah you can call such hardware "silly", and I certainly won't
disagree, but...)
Segher
On Sunday, September 10, 2006 6:00 pm, Benjamin Herrenschmidt wrote:
> > If we accept this, I don't think we're much better off than we are
> > currently (not that I have a problem with that). That is, many
> > drivers
> > would still need to be fixed up.
>
> Not necessarily if you introduce the trick of doing the mmiowb() in
> spin_unlock when a per-cpu flag has been set previously by writel... not
> sure if it's worth tho.
True, though again this would add a branch to writeX.
> > Depends on what you mean by "ordered between MMIO and MMIO".
> > mmiowb()
> > was initially introduced to allow ordering of writes between CPUs,
> > and
> > really didn't say anything about memory vs. I/O, so the semantics you
> > describe here would also be different (and more expensive) than what
> > we
> > have now.
>
> No. What I mean is that two consecutive MMIO writes on the same CPU stay
> in order, and reads can't cross writes. The relaxed versions would still
> require mmiowb() (or another name) for synchronisation against
> spinlocks. As I told you before, I much prefer to sync of mmiowb() as a
> sync with locks than a sync with "other MMIOs on anotehr CPU" since that
> doesn't mean anything outside of the context of a lock.
Sure, that's where one would typically use it, but it really is a memory
barrier...
>
> > This is what mmiowb() is supposed to be, though only for writes.
> > I.e.
> > two writes from different CPUs may arrive out of order if you don't
> > use
> > mmiowb() carefully. Do you also need a mmiorb() macro or just a
> > stronger mmiob()?
>
> No, you misunderstand me. That's the main problem with mmiowb() and
> that's why it's so not clear to so many people: the way you keep
> presenting it as synchronizing MMIO vs. MMIO. I think it's a lot more
> clear if you present it as synchronizing MMIOs with locks. MMIO vs. MMIO
> is another matter.
That's because it *is* a barrier. I don't think it's any harder to understand
than regular memory barriers for example. It's just that you'd typically use
it in conjunction with locks to ensure proper device access.
> It's whether consecutive MMIO writes can be
> re-ordered, whether MMIO loads can cross a write (either because the load
> is performed late, only when the value is actually used, or because the
> load can be executed before a preceding write). That's what current
> __raw_* versions on PowerPC will allow, in addition to not doing endian
> swap. My proposal was that __writel/__readl, however, would keep MMIO
> vs. MMIO ordering (wouldn't allow that sort of re-ordering), however,
> they wouldn't order vs. spinlock (would still require mmiowb) nor vs.
> main memory (cacheable storage).
Ok, that's fine, though I think you'd only want the very weak semantics (as
provided by your __raw* routines) on write combined memory typically?
> > mmiowb() could be written as io_to_io_write_barrier() if we wanted to be
> > extra verbose. AIUI it's the same thing as smp_wmb() but for MMIO
> > space.
>
> I'm very much against your terminology. It's -not- an IO to IO barrier.
> It's an IO to lock barrier. Really. IO to IO is something else. ordering
> of IOs between CPUs has absolutely no meaning outside of the context of
> locked regions in any case.
But it *is* MMIO vs. MMIO. There's confusion because your __raw* routines
don't even guarantee same CPU ordering, while mmiowb() is solely intended for
inter-CPU ordering.
But as you say, the most common (maybe only) use model for it is to make sure
critical sections protecting device access behave correctly, so I don't have
a problem tying it to locks somehow.
Jesse
On Sunday, September 10, 2006 5:54 pm, Segher Boessenkool wrote:
> >> - writel/readl become totally ordered (including vs. memory).
> >> Basically x86-like. Expensive (very expensive even on some
> >> architectures) but also very safe.
> >
> > This approach will minimize driver changes, and would imply the
> > removal
> > of some existing mmiowb() and wmb() macros.
>
> Like I tried to explain already, in my competing approach, no drivers
> would need changes either. And you could remove those macros (or
> their more-verbosely-saying-what-they're-doing, preferably bus-specific
> as well) as well -- but you'll face the wrath of those who care about
> performance of those drivers on non-x86 platforms.
Right, at the cost of more complexity in the accessor routines.
> Hence my proposal of calling it pci_cpu_to_cpu_barrier() -- what it
> orders is accesses from separate CPUs. Oh, and it's bus-specific,
> of course.
Makes sense to me, I have no problem with that name since it's really intended
to order posted PCI writes from different CPUs.
Jesse
On Mon, 2006-09-11 at 11:08 -0700, Jesse Barnes wrote:
> On Sunday, September 10, 2006 6:00 pm, Benjamin Herrenschmidt wrote:
> > > If we accept this, I don't think we're much better off than we are
> > > currently (not that I have a problem with that). That is, many
> > > drivers
> > > would still need to be fixed up.
> >
> > Not necessarily if you introduce the trick of doing the mmiowb() in
> > spin_unlock when a per-cpu flag has been set previously by writel... not
> > sure if it's worth tho.
>
> True, though again this would add a branch to writeX.
No, it adds a cacheable store to writeX and a branch to spin_unlock.
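Roughly like this (names made up, and ignoring the memory-ordering side
of writel itself, just to show where the cost lands):

	DEFINE_PER_CPU(int, mmio_pending);

	static inline void writel(u32 val, volatile void __iomem *addr)
	{
		__raw_writel(cpu_to_le32(val), addr);
		__get_cpu_var(mmio_pending) = 1;	/* cacheable store, cheap */
	}

	/* ... and in the spin_unlock() path: */
	if (unlikely(__get_cpu_var(mmio_pending))) {
		__get_cpu_var(mmio_pending) = 0;
		mmiowb();
	}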
> Sure, that's where one would typically use it, but it really is a memory
> barrier...
I prefer having separate semantics for it so people understand it better
but I may be wrong :)
> That's because it *is* a barrier. I don't think it's any harder to understand
> than regular memory barriers for example. It's just that you'd typically use
> it in conjunction with locks to ensure proper device access.
That's why I prefer defining it as a MMIO + lock barrier.
> Ok, that's fine, though I think you'd only want the very weak semantics (as
> provided by your __raw* routines) on write combined memory typically?
Well, that and memory with no side effects (like frame buffers)
> > I'm very much against your terminology. It's -not- an IO to IO barrier.
> > It's an IO to lock barrier. Really. IO to IO is something else. ordering
> > of IOs between CPUs has absolutely no meaning outside of the context of
> > locked regions in any case.
>
> But it *is* MMIO vs. MMIO. There's confusion because your __raw* routines
> don't even guarantee same CPU ordering, while mmiowb() is solely intended for
> inter-CPU ordering.
>
> But as you say, the most common (maybe only) use model for it is to make sure
> critical sections protecting device access behave correctly, so I don't have
> a problem tying it to locks somehow.
Ben.
From: Segher Boessenkool <[email protected]>
Date: Mon, 11 Sep 2006 16:17:18 +0200
> >> Why not just keep writel() etc. for *both* purposes; the address
> >> cookie
> >> it gets as input can distinguish between the required behaviours for
> >> different kinds of I/Os; it will have to be setup by the arch-
> >> specific
> >> __ioremap() or similar.
> >
> > This doesn't work when the I/O semantics are encoded into the
> > instruction, not some virtual mapping PTE bits. We'll have to use
> > a conditional or whatever in that case, which is silly.
>
> Why is this "silly"?
It's silly because if you just use different interface
names for the different semantics, the caller can
ask for what he wants at the call site and no conditionals
are needed in the implementation.
> It's silly because if you just use different interface
> names for the different semantics, the caller can
> ask for what he wants at the call site and no conditionals
> are needed in the implementation.
As Paulus also pointed out, having writel() behave differently based on
some magic done earlier at map time makes it harder to understand what
happens when reading the code, and thus harder to audit drivers for
missing barriers etc... since it's not obvious at first sight whether a
driver is using ordered or relaxed semantics. Thus I prefer keeping two
separate interfaces.
We've come up with the __writel() name, but I'm open to proposals for
something nicer :)
Ben.
David> It's silly because if you just use different interface
David> names for the different semantics, the caller can ask for
David> what he wants at the call site and no conditionals are
David> needed in the implementation.
OTOH that leads to a combinatorial explosion of 8/16/32/64 access
sizes, write combining or not, strongly ordered wrt local memory or
not, relaxed PCI ordering or not, etc. etc.
- R.
Benjamin Herrenschmidt writes:
> On Mon, 2006-09-11 at 11:08 -0700, Jesse Barnes wrote:
>> Ok, that's fine, though I think you'd only want the very weak
>> semantics (as provided by your __raw* routines) on write
>> combined memory typically?
>
> Well, that and memory with no side effects (like frame buffers)
Oh no, it's great for regular device driver work. I used this
type of system all the time on a different PowerPC OS.
Suppose you need to set up a piece of hardware. Assume that the
hardware isn't across some nasty bridge. You do this:
hw->x = 42;
hw->y = 19;
eieio();
hw->p = 11;
hw->q = 233;
hw->r = 87;
eieio();
hw->n = 101;
hw->m = 5;
eieio();
In that fictitious example, I get 7 writes to the hardware device
with only 3 "eieio" operations. It's not hard at all. Sometimes
a "sync" is used instead, also explicitly.
To get even more speed, you can mark memory as non-coherent.
You can even do this for RAM. There are cache control instructions
to take care of any problems; simply ask the CPU to write things
out as needed.
Linux should probably do this:
Plain stuff is like x86. If you want the performance of loose
ordering, ask for it when you get the mapping and use read/write
functions that have a "_" prefix. If you mix the "_" versions
with a plain x86-like mapping or the other way, the behavior you
get will be an arch-specific middle ground.
> Oh no, it's great for regular device driver work. I used this
> type of system all the time on a different PowerPC OS.
>
> Suppose you need to set up a piece of hardware. Assume that the
> hardware isn't across some nasty bridge. You do this:
>
> hw->x = 42;
> hw->y = 19;
> eieio();
> hw->p = 11;
> hw->q = 233;
> hw->r = 87;
> eieio()
> hw->n = 101;
> hw->m = 5;
> eieio()
>
> In that fictitious example, I get 7 writes to the hardware device
> with only 3 "eieio" operations. It's not hard at all. Sometimes
> a "sync" is used instead, also explicitly.
You can do that with my proposed __writel which is a simple store as
writes to non-cacheable and guarded storage have to stay in order
according to the PowerPC architecture. No need for __raw.
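I.e. very roughly, ignoring how the arch actually does the endian swap,
__writel can just be:

static inline void __writel(u32 val, volatile void __iomem *addr)
{
	/* cache-inhibited + guarded storage keeps stores in program
	 * order, so back-to-back __writel()s need no barrier */
	*(volatile u32 __iomem *)addr = cpu_to_le32(val);
}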
> To get even more speed, you can mark memory as non-coherent.
Ugh ? MMIO space is always marked non-coherent. You are not supposed to
set the M bit if the I is set in the page tables. If you are talking
about main memory, then it's a completely different discussion.
> You can even do this for RAM. There are cache control instructions
> to take care of any problems; simply ask the CPU to write things
> out as needed.
Sure, though that's not the topic.
> Linux should probably do this:
>
> Plain stuff is like x86. If you want the performance of loose
> ordering, ask for it when you get the mapping and use read/write
> functions that have a "_" prefix. If you mix the "_" versions
> with a plain x86-like mapping or the other way, the behavior you
> get will be an arch-specific middle ground.
No. I want precisely defined semantics in all cases.
Ben.
On 9/12/06, Benjamin Herrenschmidt <[email protected]> wrote:
>
> > Oh no, it's great for regular device driver work. I used this
> > type of system all the time on a different PowerPC OS.
> >
> > Suppose you need to set up a piece of hardware. Assume that the
> > hardware isn't across some nasty bridge. You do this:
> >
> > hw->x = 42;
> > hw->y = 19;
> > eieio();
> > hw->p = 11;
> > hw->q = 233;
> > hw->r = 87;
> > eieio()
> > hw->n = 101;
> > hw->m = 5;
> > eieio()
> >
> > In that fictitious example, I get 7 writes to the hardware device
> > with only 3 "eieio" operations. It's not hard at all. Sometimes
> > a "sync" is used instead, also explicitly.
>
> You can do that with my proposed __writel which is a simple store as
> writes to non-cacheable and guarded storage have to stay in order
> according to the PowerPC architecture. No need for __raw.
Oops, I forgot about store-store ordering being automatic.
Pretend I had some loads in my example.
A proper interface would be more explicit about what the
fence does, so that driver authors shouldn't need to know
this detail.
> > To get even more speed, you can mark memory as non-coherent.
>
> Ugh ? MMIO space is always marked non-coherent. You are not supposed to
> set the M bit if the I is set in the page tables. If you are talking
> about main memory, then it's a completely different discussion.
OK, a different discussion... though memory being used
for DMA seems rather related. You need to flush before
a DMA out, or invalidate before a DMA in.
> > Linux should probably do this:
> >
> > Plain stuff is like x86. If you want the performance of loose
> > ordering, ask for it when you get the mapping and use read/write
> > functions that have a "_" prefix. If you mix the "_" versions
> > with a plain x86-like mapping or the other way, the behavior you
> > get will be an arch-specific middle ground.
>
> No. I want precisely defined semantics in all cases.
So you say: never mix strict mappings with loose operations,
and never mix loose mappings with strict operations.
That is an excellent rule. I see no need to stop people from
actively trying to shoot their feet though. I'm certainly not
suggesting that people be mixing things.
For some CPUs, you want to be specifying things when you
set up the mapping. For other CPUs, the read/write code is
how this gets determined. So developers specify both.
> Oops, I forgot about store-store ordering being automatic.
> Pretend I had some loads in my example.
Well, in 99% of the cases, you want MMIO loads to be ordered vs. MMIO
stores and thus you can use __writel and __readl (which will only do an
eieio in __readl). If you are really that picky, then, of course you can
go use the __raw_* versions.
> A proper interface would be more explicit about what the
> fence does, so that driver authors shouldn't need to know
> this detail.
What detail ? Isn't my document explicit enough ? If not, please let me
know what is not clear in the definition of the 4 ordering rules and the
matching fences.
> OK, a different discussion... though memory being used
> for DMA seems rather related. You need to flush before
> a DMA out, or invalidate before a DMA in.
This is already documented elsewhere.
> So you say: never mix strict mappings with loose operations,
> and never mix loose mappings with strict operations.
I don't want the concept of "loose mappings" in the generic driver
interface for now anyway :)
It's too implementation specific and I want to know that a given access
is strictly ordered or relaxed just by looking at the accessor used, not
having to go look for where the driver did the ioremap. We can still
provide arch specific things where we feel it's useful but let's move
one step at a time with the generic accessors.
The only "lose" mapping that we'll introduce next is write combining,
but that's also a different debate. Again, one thing at a time :)
> That is an excellent rule. I see no need to stop people from
> actively trying to shoot their feet though. I'm certainly not
> suggesting that people be mixing things.
>
> For some CPUs, you want to be specifying things when you
> set up the mapping. For other CPUs, the read/write code is
> how this gets determined. So developers specify both.
For now, we are assuming that if the mapping controls ordering, then
it's strictly ordered. We'll see if we hit an arch where that becomes a
problem.
Ben.
On 9/12/06, Benjamin Herrenschmidt <[email protected]> wrote:
>
> > Oops, I forgot about store-store ordering being automatic.
> > Pretend I had some loads in my example.
>
> Well, in 99% of the cases, you want MMIO loads to be ordered vs. MMIO
> stores and thus you can use __writel and __readl (which will only do an
> eieio in __readl). If you are really that picky, then, of course you can
> go use the __raw_* versions.
If I could get the __raw_* versions for every arch, and there
wouldn't be any endianness troubles, it'd do.
I think a single or double "_" is enough warning to enable
full foot-shooting ability though.
> > A proper interface would be more explicit about what the
> > fence does, so that driver authors shouldn't need to know
> > this detail.
>
> What detail ? Isn't my document explicit enough ? If not, please let me
> know what is not clear in the definition of the 4 ordering rules and the
> matching fences.
The driver code is not very self-documenting this way.
If I see an io_to_io_barrier(), how am I to tell if it is
read to read, write to write, read to write, write to read,
read/write to read, read/write to write, read to read/write,
write to read/write, or read/write to read/write?
Considering just reads and writes to MMIO, there are
9 possible types of fence. SPARC seems to cover a
decent number of these distinctly; the instruction takes
an immediate value as flags.
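For example, on SPARC V9 (from memory; the macro names here are made up,
only the membar mask names are real):

	#define io_wmb()	__asm__ __volatile__("membar #StoreStore" ::: "memory")
	#define io_rmb()	__asm__ __volatile__("membar #LoadLoad"   ::: "memory")
	#define io_mb()		__asm__ __volatile__("membar #Sync"       ::: "memory")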
> > So you say: never mix strict mappings with loose operations,
> > and never mix loose mappings with strict operations.
>
> I don't want the concept of "lose mappings" in the generic driver
> interface for now anyway :)
>
> It's too implementation specific and I want to know that a given access
> is strictly ordered or relaxed just by looking at the accessor used, not
> having to go look for where the driver did the ioremap. We can still
> provide arch specific things where we feel it's useful but let's move
> one step at a time with the generic accessors.
I suppose the "sparse" tool could match up mapping type
with accessor type. That'd let you rely on the accessor to
be doing what you expect when you read the code.
Loose mappings are only arch-specific in implementation.
Generic non-arch driver code ought to be able to take
advantage of the performance benefits of loose mappings.
(that is: full any-any reordering)
> If I see an io_to_io_barrier(), how am I to tell if it is
> read to read, write to write, read to write, write to read,
> read/write to read, read/write to write, read to read/write,
> write to read/write, or read/write to read/write?
>
> Considering just reads and writes to MMIO, there are
> 9 possible types of fence. SPARC seems to cover a
> decent number of these distinctly; the instruction takes
> an immediate value as flags.
We need to decide whether a single one doing a full MMIO fence (and not
memory) is enough or if the performance difference justifies maybe having
io_to_io_{wmb,rmb,mb}. I don't see any real use for more combinations.
David ? It's your call here. What do you think ?
Ben.
On 9/12/06, Benjamin Herrenschmidt <[email protected]> wrote:
>
> > If I see an io_to_io_barrier(), how am I to tell if it is
> > read to read, write to write, read to write, write to read,
> > read/write to read, read/write to write, read to read/write,
> > write to read/write, or read/write to read/write?
> >
> > Considering just reads and writes to MMIO, there are
> > 9 possible types of fence. SPARC seems to cover a
> > decent number of these distinctly; the instruction takes
> > an immediate value as flags.
>
> We need to decide whether a single one doing a full MMIO fence (and not
> memory) is enough or if the performance difference justifies maybe having
> io_to_io_{wmb,rmb,mb}. I don't see any real use for more combinations.
Remember: it's more than just performance. It's documentation
in the code.
> As Paulus also pointed out, having writel() behave differently
> based on
> some magic done earlier at map time makes it harder to understand what
> happens when reading the code, and thus harder to audit drivers for
> missing barriers etc... since it's not obvious at first sight whether a
> driver is using ordered or relaxed semantics.
I do not buy this argument because I do not believe you
can "audit" a driver at "first sight". You'll have to
look at the mapping call anyway, something might be wrong
there (playing evil __ioremap() tricks, mapping the wrong
size, whatever).
I do see your point, but I don't believe the ramifications are
as severe as you make them out to be, esp. when compared
to all the (readability, auditing!) problems that having
more different interfaces for basically the same thing will
bring us.
Segher