On Tue, May 27, 2008 at 02:55:56PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
> >
> > A problem with __raw_ though is that they -also- don't do byteswap,
>
> Well, that's why there is __readl() and __raw_readl(), no?
>
> Neither does ordering, and __raw_readl() doesn't do byte-swap.
This is where the lack of documentation causes arch maintainers a big
problem. None of the semantics of __raw_readl vs __readl vs readl are
documented _anywhere_. If you look at x86 as a template, there's no
comments there about what the different variants are supposed to do
or not do.
So it's left up to arch maintainers to literally guess what should be
done. That's precisely what I did when I implemented ARM's __raw_readl
and friends. I guessed.
And it was only after I read a few mails on lkml which suggested that
readl and friends should always be LE that ARM's readl implementation
started to use le32_to_cpu()... before that it had always been native
endian. Again, lack of documentation...
So, can the semantics of what's expected from these IO accessor
functions be documented somewhere? Please? Before this thread gets
lost in the depths of time?
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
On Monday 02 June 2008 17:24, Russell King wrote:
> On Tue, May 27, 2008 at 02:55:56PM -0700, Linus Torvalds wrote:
> > On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
> > > A problem with __raw_ though is that they -also- don't do byteswap,
> >
> > Well, that's why there is __readl() and __raw_readl(), no?
> >
> > Neither does ordering, and __raw_readl() doesn't do byte-swap.
>
> This is where the lack of documentation causes arch maintainers a big
> problem. None of the semantics of __raw_readl vs __readl vs readl are
> documented _anywhere_. If you look at x86 as a template, there's no
> comments there about what the different variants are supposed to do
> or not do.
>
> So it's left up to arch maintainers to literally guess what should be
> done. That's precisely what I did when I implemented ARM's __raw_readl
> and friends. I guessed.
>
> And it was only after I read a few mails on lkml which suggested that
> readl and friends should always be LE that ARM's readl implementation
> started to use le32_to_cpu()... before that it had always been native
> endian. Again, lack of documentation...
>
> So, can the semantics of what's expected from these IO accessor
> functions be documented somewhere? Please? Before this thread gets
> lost in the depths of time?
This whole thread also ties in with my posts about mmiowb (which IMO
should go away).
readl/writel:             strongly ordered wrt one another and other stores
                          to cacheable RAM, byteswapping
__readl/__writel:         not ordered (needs mb/rmb/wmb to order with
                          other readl/writel and cacheable operations, or
                          io_*mb to order with one another)
raw_readl/raw_writel:     strongly ordered, no byteswapping
__raw_readl/__raw_writel: not ordered, no byteswapping
then get rid of *relaxed* variants.
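As a sketch of the layering (hypothetical generic code, not any particular
arch; the real barrier and swap choices are per-arch), this could look like:

static inline u32 __raw_readl(const volatile void __iomem *addr)
{
        return *(const volatile u32 __force *)addr;     /* no swap, no order */
}

static inline u32 __readl(const volatile void __iomem *addr)
{
        return le32_to_cpu((__force __le32)__raw_readl(addr)); /* swap only */
}

static inline u32 readl(const volatile void __iomem *addr)
{
        u32 val;

        mb();                   /* order vs. earlier MMIO and cacheable accesses */
        val = __readl(addr);
        mb();                   /* order vs. later accesses */
        return val;
}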
Linus: on x86, memory operations to wc and wc+ memory are not ordered
with one another, or operations to other memory types (ie. load/load
and store/store reordering is allowed). Also, as you know, store/load
reordering is explicitly allowed as well, which covers all memory
types. So perhaps it is not quite true to say readl/writel is strongly
ordered by default even on x86. You would have to put in some
mfence instructions in them to make it so.
So, what *exact* definition are you going to mandate for readl/writel?
If it's anything less than strict ordering, then we also need to ensure
drivers use the correct barriers (to implement strict ordering, we could
either put mfence instructions in, or explicitly disallow readl/writel to
be used on wc/wc+ memory).
The other way we can go is just say that they have x86 semantics,
although that would be a bit sad IMO: we should have strong ops, in
which case driver writers never need to use a single barrier provided
they have locking right, and weak ops, in which case they should match
up with the weak Linux memory ordering model for system RAM.
> This whole thread also ties in with my posts about mmiowb (which IMO
> should go away).
>
> readl/writel:             strongly ordered wrt one another and other stores
>                           to cacheable RAM, byteswapping
> __readl/__writel:         not ordered (needs mb/rmb/wmb to order with
>                           other readl/writel and cacheable operations, or
>                           io_*mb to order with one another)
> raw_readl/raw_writel:     strongly ordered, no byteswapping
> __raw_readl/__raw_writel: not ordered, no byteswapping
>
> then get rid of *relaxed* variants.
In addition, some archs like powerpc also provide readl_be/writel_be,
defined as big-endian (ie. byteswap on LE archs, no byteswap on BE
archs).
As of today, powerpc lacks the raw_readl/raw_writel and __readl/__writel
variants (ie, we only provide fully ordered + byteswap and no ordering +
no byteswap variants).
If we agree on the above semantics, I'll do a patch providing the
missing ones.
> Linus: on x86, memory operations to wc and wc+ memory are not ordered
> with one another, or operations to other memory types (ie. load/load
> and store/store reordering is allowed). Also, as you know, store/load
> reordering is explicitly allowed as well, which covers all memory
> types. So perhaps it is not quite true to say readl/writel is strongly
> ordered by default even on x86. You would have to put in some
> mfence instructions in them to make it so.
>
> So, what *exact* definition are you going to mandate for readl/writel?
> Anything less than strict ordering then we also need to ensure drivers
> use the correct barriers (to implement strict ordering, we could either
> put mfence instructions in, or explicitly disallow readl/writel to be
> used on wc/wc+ memory).
The ordering guarantees that I provide on powerpc for "ordered" variants
are:
 - cacheable store + writel stays ordered (ie, write to some
   DMA stuff and then a register to trigger the DMA).

 - readl + cacheable read stays ordered (ie. read some status
   register, for example, after an interrupt, and then read the
   resulting data in memory).

 - any of these ordered vs. spin_lock and spin_unlock (with the
   exception that stores done before the spin_lock
   could potentially leak into the lock).

 - readl is synchronous (ie, makes the CPU think the
   data was actually used before executing subsequent
   instructions, thus waits for the data to come back,
   for example to ensure that a read used to push out
   post buffers followed by a delay will indeed happen
   with the right delay).
We don't provide meaningless ones like writel + cacheable store for
example. (PCI posting would defeat it anyway).
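The write side of that boils down to something like this (simplified
sketch, not the exact powerpc code; out_le32() is the byte-reversed
powerpc MMIO store):

static inline void writel(u32 val, volatile void __iomem *addr)
{
        /* sync orders all previous stores, cacheable or not,
           before the MMIO store below */
        __asm__ __volatile__("sync" : : : "memory");
        out_le32(addr, val);
}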
> The other way we can go is just say that they have x86 semantics,
> although that would be a bit sad IMO: we should have strong ops, in
> which case driver writers never need to use a single barrier provided
> they have locking right, and weak ops, in which case they should match
> up with the weak Linux memory ordering model for system RAM.
Ben.
On Tuesday 03 June 2008 14:32, Benjamin Herrenschmidt wrote:
> > This whole thread also ties in with my posts about mmiowb (which IMO
> > should go away).
> >
> > readl/writel:             strongly ordered wrt one another and other stores
> >                           to cacheable RAM, byteswapping
> > __readl/__writel:         not ordered (needs mb/rmb/wmb to order with
> >                           other readl/writel and cacheable operations, or
> >                           io_*mb to order with one another)
> > raw_readl/raw_writel:     strongly ordered, no byteswapping
> > __raw_readl/__raw_writel: not ordered, no byteswapping
> >
> > then get rid of *relaxed* variants.
>
> In addition, some archs like powerpc also provide readl_be/writel_be as
> being defined as big endian (ie. byteswap on LE archs, no byteswap on BE
> archs).
Sure.
> As of today, powerpc lacks the raw_readl/raw_writel and __readl/__writel
> variants (ie, we only provide fully ordered + byteswap and no ordering +
> no byteswap variants).
>
> If we agree on the above semantics, I'll do a patch providing the
> missing ones.
Let's see what Linus thinks...
> > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > with one another, or operations to other memory types (ie. load/load
> > and store/store reordering is allowed). Also, as you know, store/load
> > reordering is explicitly allowed as well, which covers all memory
> > types. So perhaps it is not quite true to say readl/writel is strongly
> > ordered by default even on x86. You would have to put in some
> > mfence instructions in them to make it so.
> >
> > So, what *exact* definition are you going to mandate for readl/writel?
> > Anything less than strict ordering then we also need to ensure drivers
> > use the correct barriers (to implement strict ordering, we could either
> > put mfence instructions in, or explicitly disallow readl/writel to be
> > used on wc/wc+ memory).
>
> The ordering guarantees that I provide on powerpc for "ordered" variants
> are:
>
> - cacheable store + writel stays ordered (ie, write to some
>   DMA stuff and then a register to trigger the DMA).
>
> - readl + cacheable read stays ordered (ie. read some status
>   register, for example, after an interrupt, and then read the
>   resulting data in memory).
>
> - any of these ordered vs. spin_lock and spin_unlock (with the
>   exception that stores done before the spin_lock
>   could potentially leak into the lock).
>
> - readl is synchronous (ie, makes the CPU think the
>   data was actually used before executing subsequent
>   instructions, thus waits for the data to come back,
>   for example to ensure that a read used to push out
>   post buffers followed by a delay will indeed happen
>   with the right delay).
So your readl can pass an earlier cacheable store or earlier writel?
> We don't provide meaningless ones like writel + cacheable store for
> example. (PCI posting would defeat it anyway).
What do you mean by meaningless? Ordering of writel followed by a
cacheable store is meaningful eg. for retaining io operations within
locks. OK, you explicitly have some extra code for spin_unlock, but
not for bit locks, mutexes, etc. It would make sense to have the
default operations _very_ strongly ordered IMO, and then move drivers
to be more relaxed when they are verified.
On Tue, 2008-06-03 at 16:11 +1000, Nick Piggin wrote:
> > - readl is synchronous (ie, makes the CPU think the
> >   data was actually used before executing subsequent
> >   instructions, thus waits for the data to come back,
> >   for example to ensure that a read used to push out
> >   post buffers followed by a delay will indeed happen
> >   with the right delay).
>
> So your readl can pass an earlier cacheable store or earlier writel?
I forgot to mention that all MMIOs are ordered vs. each other, and I
do prevent readl from passing earlier cacheable stores too in my
current implementation, but I'm not 100% sure we want to "guarantee"
that, unless we have stupid devices that trigger DMAs on reads with
side effects... anyway, it is guaranteed in the current case.
Ben.
Nick Piggin writes:
> So your readl can pass an earlier cacheable store or earlier writel?
No. It's quite gross at the moment, it has a sync before the access
(i.e. a full mb()) and a twi; isync sequence after the access that
stalls execution until the data comes back.
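In code, the sequence is roughly (sketch; the real thing lives in the
powerpc io.h, in_le32() being the byte-reversed MMIO load):

static inline u32 readl(const volatile void __iomem *addr)
{
        u32 val;

        __asm__ __volatile__("sync" : : : "memory");    /* the full mb() before */
        val = in_le32(addr);                            /* byte-reversed MMIO load */
        __asm__ __volatile__("twi 0,%0,0; isync"        /* never-taken trap on val; */
                             : : "r" (val));            /* isync then stalls until
                                                           the load data is back */
        return val;
}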
> > We don't provide meaningless ones like writel + cacheable store for
> > example. (PCI posting would defeat it anyway).
>
> What do you mean by meaningless? Ordering of writel followed by a
> cacheable store is meaningful eg. for retaining io operations within
> locks. OK, you explicitly have some extra code for spin_unlock, but
> not for bit locks, mutexes, etc. It would make sense to have the
> default operations _very_ strongly ordered IMO, and then move drivers
> to be more relaxed when they are verified.
It's meaningless in the sense that nothing guarantees that the writel
has actually hit the device, even if we put a full mb() barrier in
between the writel and the cacheable write. That would guarantee that
the writel had got to the PCI host bridge, and couldn't be reordered
with other accesses to the device, but not that the device had
actually seen it.
I don't mind adding code to the mutex unlock to do the same as
spin_unlock, but I really don't want to have *two* sync instructions
for every MMIO write. One is bad enough.
Paul.
On Tuesday 03 June 2008 16:53, Paul Mackerras wrote:
> Nick Piggin writes:
> > So your readl can pass an earlier cacheable store or earlier writel?
>
> No. It's quite gross at the moment, it has a sync before the access
> (i.e. a full mb()) and a twi; isync sequence after the access that
> stalls execution until the data comes back.
OK.
> > > We don't provide meaningless ones like writel + cacheable store for
> > > example. (PCI posting would defeat it anyway).
> >
> > What do you mean by meaningless? Ordering of writel followed by a
> > cacheable store is meaningful eg. for retaining io operations within
> > locks. OK, you explicitly have some extra code for spin_unlock, but
> > not for bit locks, mutexes, etc. It would make sense to have the
> > default operations _very_ strongly ordered IMO, and then move drivers
> > to be more relaxed when they are verified.
>
> It's meaningless in the sense that nothing guarantees that the writel
> has actually hit the device, even if we put a full mb() barrier in
> between the writel and the cacheable write. That would guarantee that
> the writel had got to the PCI host bridge, and couldn't be reordered
> with other accesses to the device, but not that the device had
> actually seen it.
OK, but I think that fits OK with our SMP ordering model for cacheable
stores: no amount of barriers on CPU0 will guarantee that CPU1 has
seen the store; you actually have to observe a causal effect
of the store before you can say that.
> I don't mind adding code to the mutex unlock to do the same as
> spin_unlock, but I really don't want to have *two* sync instructions
> for every MMIO write. One is bad enough.
So you can't provide iostore/store ordering without another sync
after the writel store?
I guess the problem with providing exceptions is that it makes it
hard for people who absolutely don't know or care about ordering.
I don't like having to think about it "hmm, we can allow this type
of reordering... oh, unless some silly device does X...".
If we come up with a sane set of weakly ordered accessors
(including io_*mb()), it will make it easier to go through the
important drivers and improve them. We don't have to enforce the
new semantics strictly until then if they'll slow you down
too much in the meantime.
On Tue, 3 Jun 2008, Nick Piggin wrote:
>
> Linus: on x86, memory operations to wc and wc+ memory are not ordered
> with one another, or operations to other memory types (ie. load/load
> and store/store reordering is allowed). Also, as you know, store/load
> reordering is explicitly allowed as well, which covers all memory
> types. So perhaps it is not quite true to say readl/writel is strongly
> ordered by default even on x86. You would have to put in some
> mfence instructions in them to make it so.
Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
does that needs to be aware of it. IOW, it's a non-issue, imnsho.
Linus
On Tue, 3 Jun 2008, Linus Torvalds wrote:
> On Tue, 3 Jun 2008, Nick Piggin wrote:
>>
>> Linus: on x86, memory operations to wc and wc+ memory are not ordered
>> with one another, or operations to other memory types (ie. load/load
>> and store/store reordering is allowed). Also, as you know, store/load
>> reordering is explicitly allowed as well, which covers all memory
>> types. So perhaps it is not quite true to say readl/writel is strongly
>> ordered by default even on x86. You would have to put in some
>> mfence instructions in them to make it so.
So on x86, these could be re-ordered?
writel(START_OPERATION, CONTROL_REGISTER);
status = readl(STATUS_REGISTER);
> Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
> does that needs to be aware of it. IOW, it's a non-issue, imnsho.
You need to ask for coherent DMA memory too.
On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
> On Tue, 3 Jun 2008, Linus Torvalds wrote:
> >On Tue, 3 Jun 2008, Nick Piggin wrote:
> >>
> >>Linus: on x86, memory operations to wc and wc+ memory are not ordered
> >>with one another, or operations to other memory types (ie. load/load
> >>and store/store reordering is allowed). Also, as you know, store/load
> >>reordering is explicitly allowed as well, which covers all memory
> >>types. So perhaps it is not quite true to say readl/writel is strongly
> >>ordered by default even on x86. You would have to put in some
> >>mfence instructions in them to make it so.
>
> So on x86, these could be re-ordered?
>
> writel(START_OPERATION, CONTROL_REGISTER);
> status = readl(STATUS_REGISTER);
You wouldn't ask for write-combining memory mapping for control or
status registers.
> >Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
> >does that needs to be aware of it. IOW, it's a non-issue, imnsho.
>
> You need to ask for coherent DMA memory too.
Different case. Coherent DMA memory is *host* memory that the *device*
accesses. We're talking about *device* memory that the *cpu* accesses.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
On Tue, 3 Jun 2008, Trent Piepho wrote:
> On Tue, 3 Jun 2008, Linus Torvalds wrote:
> > On Tue, 3 Jun 2008, Nick Piggin wrote:
> > >
> > > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > > with one another, or operations to other memory types (ie. load/load
> > > and store/store reordering is allowed). Also, as you know, store/load
> > > reordering is explicitly allowed as well, which covers all memory
> > > types. So perhaps it is not quite true to say readl/writel is strongly
> > > ordered by default even on x86. You would have to put in some
> > > mfence instructions in them to make it so.
>
> So on x86, these could be re-ordered?
>
> writel(START_OPERATION, CONTROL_REGISTER);
> status = readl(STATUS_REGISTER);
With both registers in a WC+ area, yes. The write may be in the WC buffers
until the WC buffers are flushed (short list: a fence, a serializing
instruction, a read-write to uncached memory, or an interrupt. There are
others, but those are the main ones).
But if the status register is in uncached memory (which is the only *sane*
thing to do), then it doesn't matter if the control register is in WC
memory. Because the status register read is itself serializing with the WC
buffer, it's actually fine.
So this is used for putting things like ring queues in WC memory, filling
them up with writes, and getting nice bursty write traffic with the CPU
automatically buffering it up (think "stdio.h on a really low level"). And
if you then have the command registers in UC memory or using IO port
accesses, reading and writing to them will automatically serialize.
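In driver terms the pattern is roughly this (hypothetical names; it relies
on the Intel-documented WC flush on UC accesses, see the AMD caveat later
in the thread; on x86, writel is a plain store, so the WC combining works):

u32 __iomem *ring = wc_area;            /* mapped write-combining */
int i;

for (i = 0; i < ncmds; i++)
        writel(cmd[i], &ring[i]);       /* sits in the WC buffers, goes out
                                           to the bus as a nice burst */
writel(GO, uc_doorbell);                /* UC access: flushes the WC buffers
                                           first, then kicks the device */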
> > Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
> > does that needs to be aware of it. IOW, it's a non-issue, imnsho.
>
> You need to ask for coherent DMA memory too.
Not on x86.
And I don't care what anybody else says - x86 is *so* totally dominant,
that other architectures have to live with the fact that 99.9% of all
drivers are written for and tested on x86. As a result, anything else is
"theory".
Are some drivers good and careful? Yes. Are most? No, and even if they
were, people making changes would mostly test them on x86. Architectures
should strive to basically act as closely as possible to x86 semantics in
order to not get nasty surprises.
And architectures that can't do that well enough *will* often find that
they need to fix drivers, and only a small small subset of all kernel
drivers will generally work out-of-the-box.
If you're ready to do that, your architecture can do anything it damn well
pleases ;)
Linus
On Tue, 3 Jun 2008, Nick Piggin wrote:
> On Monday 02 June 2008 17:24, Russell King wrote:
>> So, can the semantics of what's expected from these IO accessor
>> functions be documented somewhere? Please? Before this thread gets
>> lost in the depths of time?
>
> This whole thread also ties in with my posts about mmiowb (which IMO
> should go away).
>
> readl/writel:             strongly ordered wrt one another and other stores
>                           to cacheable RAM, byteswapping
> __readl/__writel:         not ordered (needs mb/rmb/wmb to order with
>                           other readl/writel and cacheable operations, or
>                           io_*mb to order with one another)
> raw_readl/raw_writel:     strongly ordered, no byteswapping
> __raw_readl/__raw_writel: not ordered, no byteswapping
Byte-swapping vs not byte-swapping is not usually what the programmer wants.
Usually your device's registers are defined as being big-endian or
little-endian and you want whatever is needed to give you that.
I believe that on some archs that can run in either byte order, some
built-in devices will change their registers to match, and so you want
"native endian" or no swapping for these. Though that's definitely in the
minority.
An accessor that always byte-swaps regardless of the endianness of the host
is not something I've ever seen a driver want.
IOW, there are four ways one can define endianness/swapping:
1) Little-endian
2) Big-endian
3) Native-endian aka non-byte-swapping
4) Foreign-endian aka byte-swapping
1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet
our API is providing 3 & 4, the two which are the least useful.
Is it enough to provide only "all or none" for ordering strictness? For
instance on powerpc, one can get a speedup by dropping strict ordering for IO
vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks.
This is much easier to program for than no ordering at all. In fact, if one
doesn't use coherent DMA, it's basically the same as fully strict ordering.
On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
>> On Tue, 3 Jun 2008, Linus Torvalds wrote:
>>> On Tue, 3 Jun 2008, Nick Piggin wrote:
>>>>
>>>> Linus: on x86, memory operations to wc and wc+ memory are not ordered
>>>> with one another, or operations to other memory types (ie. load/load
>>>> and store/store reordering is allowed). Also, as you know, store/load
>>>> reordering is explicitly allowed as well, which covers all memory
>>>> types. So perhaps it is not quite true to say readl/writel is strongly
>>>> ordered by default even on x86. You would have to put in some
>>>> mfence instructions in them to make it so.
>>
>> So on x86, these could be re-ordered?
>>
>> writel(START_OPERATION, CONTROL_REGISTER);
>> status = readl(STATUS_REGISTER);
>
> You wouldn't ask for write-combining memory mapping for control or
> status registers.
But Nick said, "store/load reordering is explicitly allowed as well, which
covers *all* memory types."
On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote:
> IOW, there are four ways one can define endianness/swapping:
> 1) Little-endian
> 2) Big-endian
> 3) Native-endian aka non-byte-swapping
> 4) Foreign-endian aka byte-swapping
>
> 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet
> our API is providing 3 & 4, the two which are the least useful.
You've fundamentally misunderstood.
readX/writeX and __readX/__writeX provide little-endian access.
__raw_readX provides native-endian access.
If you want 2 or 4, define your own accessors. Some architectures define
other accessors (eg gsc_readX on parisc is native (big) endian, and
works on physical addresses that haven't been ioremapped. sbus_readX on
sparc64 also seems to be native (big) endian).
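e.g. a driver that wants 2 can build it the same way (sketch, hypothetical
name, with the same lack of ordering guarantees as __raw_*):

static inline u32 my_readl_be(const volatile void __iomem *addr)
{
        /* the raw load returns the register's bytes unswapped;
           be32_to_cpu gives a big-endian interpretation on either
           kind of host */
        return be32_to_cpu((__force __be32)__raw_readl(addr));
}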
> Is it enough to provide only "all or none" for ordering strictness? For
> instance on powerpc, one can get a speedup by dropping strict ordering
> for IO vs cacheable memory, but still keeping ordering for IO vs IO and
> IO vs locks. This is much easier to program for than no ordering at all.
> In fact, if one doesn't use coherent DMA, it's basically the same as
> fully strict ordering.
I don't understand why you keep talking about DMA. Are you talking
about ordering between readX() and DMA? PCI provides those guarantees.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote:
> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> >On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
> >>On Tue, 3 Jun 2008, Linus Torvalds wrote:
> >>>On Tue, 3 Jun 2008, Nick Piggin wrote:
> >>>>
> >>>>Linus: on x86, memory operations to wc and wc+ memory are not ordered
> >>>>with one another, or operations to other memory types (ie. load/load
> >>>>and store/store reordering is allowed). Also, as you know, store/load
> >>>>reordering is explicitly allowed as well, which covers all memory
> >>>>types. So perhaps it is not quite true to say readl/writel is strongly
> >>>>ordered by default even on x86. You would have to put in some
> >>>>mfence instructions in them to make it so.
> >>
> >>So on x86, these could be re-ordered?
> >>
> >>writel(START_OPERATION, CONTROL_REGISTER);
> >>status = readl(STATUS_REGISTER);
> >
> >You wouldn't ask for write-combining memory mapping for control or
> >status registers.
>
> But Nick said, "store/load reordering is explicitly allowed as well, which
> covers *all* memory types."
Then Nick is confused. PCI only defines one way to flush posted writes
to a device -- doing a read from it. There's no way that reads can
be allowed to pass writes (unless you've asked for it, like with write
combining).
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote:
>> IOW, there are four ways one can define endianness/swapping:
>> 1) Little-endian
>> 2) Big-endian
>> 3) Native-endian aka non-byte-swapping
>> 4) Foreign-endian aka byte-swapping
>>
>> 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet
>> our API is providing 3 & 4, the two which are the least useful.
>
> You've fundamentally misunderstood.
>
> readX/writeX and __readX/__writeX provide little-endian access.
> __raw_readX provide native-endian.
>
> If you want 2 or 4, define your own accessors. Some architectures define
> other accessors (eg gsc_readX on parisc is native (big) endian, and
How about providing 1 and 2, and if you want 3 or 4 define your own accessors?
>> Is it enough to provide only "all or none" for ordering strictness? For
>> instance on powerpc, one can get a speedup by dropping strict ordering
>> for IO vs cacheable memory, but still keeping ordering for IO vs IO and
>> IO vs locks. This is much easier to program for than no ordering at all.
>> In fact, if one doesn't use coherent DMA, it's basically the same as
>> fully strict ordering.
>
> I don't understand why you keep talking about DMA. Are you talking
> about ordering between readX() and DMA? PCI provides those guarantees.
I guess you haven't been reading the whole thread. The reason it started was
because gcc can re-order powerpc (and everyone else's too) IO accesses vs
accesses to cacheable memory (but not spin-locks), which ends up only being a
problem with coherent DMA.
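i.e. the problem case is something like this (hypothetical driver sketch):

/* descriptor sits in coherent DMA memory: plain cacheable stores */
desc->addr = cpu_to_le32(buf_dma);
desc->len  = cpu_to_le32(buf_len);
writel(DMA_GO, dev->ctrl);      /* if this MMIO store is reordered ahead of
                                   the cacheable stores above, the device
                                   reads a stale descriptor */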
On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote:
>> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
>>> On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
>>>> On Tue, 3 Jun 2008, Linus Torvalds wrote:
>>>>> On Tue, 3 Jun 2008, Nick Piggin wrote:
>>>>>>
>>>>>> Linus: on x86, memory operations to wc and wc+ memory are not ordered
>>>>>> with one another, or operations to other memory types (ie. load/load
>>>>>> and store/store reordering is allowed). Also, as you know, store/load
>>>>>> reordering is explicitly allowed as well, which covers all memory
>>>>>> types. So perhaps it is not quite true to say readl/writel is strongly
>>>>>> ordered by default even on x86. You would have to put in some
>>>>>> mfence instructions in them to make it so.
>>>>
>>>> So on x86, these could be re-ordered?
>>>>
>>>> writel(START_OPERATION, CONTROL_REGISTER);
>>>> status = readl(STATUS_REGISTER);
>>>
>>> You wouldn't ask for write-combining memory mapping for control or
>>> status registers.
>>
>> But Nick said, "store/load reordering is explicitly allowed as well, which
>> covers *all* memory types."
>
> Then Nick is confused. PCI only defines one way to flush posted writes
> to a device -- doing a read from it. There's no way that reads can
> be allowed to pass writes (unless you've asked for it, like with write
> combining).
But that requirement is for the PCI bridge, isn't it? It doesn't matter if
the bridge will flush all posted writes before allowing a read if the CPU
decides to give the bridge the read before the write. A powerpc CPU will
certainly do this if you don't take any steps like telling it the memory is
uncacheable and guarded. I didn't think it was allowed on x86 (except with
WC), but Nick seemed to say it was.
On Tue, 2008-06-03 at 12:43 -0700, Trent Piepho wrote:
>
> Byte-swapping vs not byte-swapping is not usually what the programmer wants.
> Usually your device's registers are defined as being big-endian or
> little-endian and you want whatever is needed to give you that.
Yes, which is why I (and some other archs) have writel_be/readl_be.
The standard writel/readl being LE.
However, the "raw" variants are defined to be native endian, which is of
some use to -some- archs apparently where they have SoC device whose
endianness follow the core.
> I believe that on some archs that can be either byte order, some built-in
> devices will change their registers to match, and so you want "native endian"
> or no swapping for these. Though that's definitely in the minority.
>
> An accessor that always byte-swaps regardless of the endianness of the host
> is never something I've seen a driver want.
>
> IOW, there are four ways one can define endianness/swapping:
> 1) Little-endian
> 2) Big-endian
> 3) Native-endian aka non-byte-swapping
> 4) Foreign-endian aka byte-swapping
>
> 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet
> our API is providing 3 & 4, the two which are the least useful.
No, we don't provide 4, it was something unclear with Nick.
We provide 1 (writel/readl and __variants), some archs provide 2
(writel_be/readl_be, tho I don't have __variants, I suppose I could),
and everybody provides 3, though in some cases (like us) only in the
form of __variants (ie, non ordered, like __raw_readl/__raw_writel).
Nick's proposal is to plug those gaps, though it's, I believe, missing
the _be variants.
> Is it enough to provide only "all or none" for ordering strictness? For
> instance on powerpc, one can get a speedup by dropping strict ordering for IO
> vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks.
> This is much easier to program for than no ordering at all. In fact, if one
> doesn't use coherent DMA, it's basically the same as fully strict ordering.
Ben.
On Wednesday 04 June 2008 07:58, Trent Piepho wrote:
> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> > On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote:
> >> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> >>> On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
> >>>> On Tue, 3 Jun 2008, Linus Torvalds wrote:
> >>>>> On Tue, 3 Jun 2008, Nick Piggin wrote:
> >>>>>> Linus: on x86, memory operations to wc and wc+ memory are not
> >>>>>> ordered with one another, or operations to other memory types (ie.
> >>>>>> load/load and store/store reordering is allowed). Also, as you know,
> >>>>>> store/load reordering is explicitly allowed as well, which covers
> >>>>>> all memory types. So perhaps it is not quite true to say
> >>>>>> readl/writel is strongly ordered by default even on x86. You would
> >>>>>> have to put in some mfence instructions in them to make it so.
> >>>>
> >>>> So on x86, these could be re-ordered?
> >>>>
> >>>> writel(START_OPERATION, CONTROL_REGISTER);
> >>>> status = readl(STATUS_REGISTER);
> >>>
> >>> You wouldn't ask for write-combining memory mapping for control or
> >>> status registers.
> >>
> >> But Nick said, "store/load reordering is explicitly allowed as well,
> >> which covers *all* memory types."
> >
> > Then Nick is confused. PCI only defines one way to flush posted writes
> > to a device -- doing a read from it. There's no way that reads can
> > be allowed to pass writes (unless you've asked for it, like with write
> > combining).
>
> But that requirement is for the PCI bridge, isn't it? It doesn't matter if
> the bridge will flush all posted writes before allowing a read if the CPU
> decides to give the bridge the read before the write. A powerpc CPU will
> certainly do this if you don't take any steps like telling it the memory is
> uncacheable and guarded. I didn't think it was allowed on x86 (except with
> WC), but Nick seemed to say it was.
Ah sorry, not UC, I was confused. UC I think actually is strongly ordered
WRT other UC and also cacheable operations.
WC is weakly ordered, anything goes.
On Wednesday 04 June 2008 05:07, Linus Torvalds wrote:
> On Tue, 3 Jun 2008, Trent Piepho wrote:
> > On Tue, 3 Jun 2008, Linus Torvalds wrote:
> > > On Tue, 3 Jun 2008, Nick Piggin wrote:
> > > > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > > > with one another, or operations to other memory types (ie. load/load
> > > > and store/store reordering is allowed). Also, as you know, store/load
> > > > reordering is explicitly allowed as well, which covers all memory
> > > > types. So perhaps it is not quite true to say readl/writel is
> > > > strongly ordered by default even on x86. You would have to put in
> > > > some mfence instructions in them to make it so.
> >
> > So on x86, these could be re-ordered?
> >
> > writel(START_OPERATION, CONTROL_REGISTER);
> > status = readl(STATUS_REGISTER);
>
> With both registers in a WC+ area, yes. The write may be in the WC buffers
> until the WC buffers are flushed (short list: a fence, a serializing
> instruction, a read-write to uncached memory, or an interrupt. There are
> others, but those are the main ones).
>
> But if the status register is in uncached memory (which is the only *sane*
> thing to do), then it doesn't matter if the control register is in WC
> memory. Because the status register read is itself serializing with the WC
> buffer, it's actually fine.
Actually, according to the document I am looking at (the AMD one), a UC
store may pass a previous WC store.
So you could have some code that writes to some WC memory on the card,
and then stores to an UC control register to start up the operation on
that memory, couldn't you? Those can go out of order.
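So on such parts you would want an explicit flush in between, e.g.
(sketch with hypothetical names; wmb() is an sfence on x86):

memcpy_toio(wc_ring, cmds, len);        /* WC stores may linger in the buffers */
wmb();                                  /* flush the WC buffers */
writel(GO, uc_doorbell);                /* the UC store can no longer pass
                                           the WC data */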
On Wednesday 04 June 2008 00:47, Linus Torvalds wrote:
> On Tue, 3 Jun 2008, Nick Piggin wrote:
> > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > with one another, or operations to other memory types (ie. load/load
> > and store/store reordering is allowed). Also, as you know, store/load
> > reordering is explicitly allowed as well, which covers all memory
> > types. So perhaps it is not quite true to say readl/writel is strongly
> > ordered by default even on x86. You would have to put in some
> > mfence instructions in them to make it so.
>
> Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
> does that needs to be aware of it. IOW, it's a non-issue, imnsho.
Ah, yes, UC is strongly ordered WRT all others *except* WC/WC+.
But WC memory is not an x86-specific thing, right? So do we need
some accessors for WC memory? Or can we just throw that in the
weakly ordered pile, and ensure mb/rmb/wmb does the right thing
for them.
And you want readl/writel to be strongly ordered like x86 on all
architectures, no exceptions? This will slow some things down,
but if we then also provide explicitly weakly ordered instructions
(and add io_mb/io_rmb/io_wmb) then at least it gives the framework
for drivers to be written to run on those architectures.
The other thing we could do is mandate only that readl/writel will
be ordered WRT one another, *and* with spinlocks, but otherwise not
with cacheable RAM...
On Wednesday 04 June 2008 07:44, Trent Piepho wrote:
> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
> > I don't understand why you keep talking about DMA. Are you talking
> > about ordering between readX() and DMA? PCI provides those guarantees.
>
> I guess you haven't been reading the whole thread. The reason it started
> was because gcc can re-order powerpc (and everyone else's too) IO accesses
> vs accesses to cacheable memory (but not spin-locks), which ends up only
> being a problem with coherent DMA.
I don't think it is only a problem with coherent DMA.
CPU0                                    CPU1
mutex_lock(mutex);
writel(something, DATA_REG);
writel(GO, CTRL_REG);
started = 1;
mutex_unlock(mutex);
                                        mutex_lock(mutex);
                                        if (started)
                                                /* oops, this can reach device before GO */
                                                writel(STOP, CTRL_REG);
On Wed, 4 Jun 2008, Nick Piggin wrote:
>
> Actually, according to the document I am looking at (the AMD one), a UC
> store may pass a previous WC store.
Hmm. Intel arch manual, Vol 3, 10.3 (page 10-7 in my version):
"If the WC buffer is partially filled, the writes may be delayed until
the next occurrence of a serializing event; such as, an SFENCE or MFENCE
instruction, CPUID execution, a read or write to uncached memory, ..."
Anyway, Intel certainly seems to document that WC memory is serialized by
any access to UC memory.
But yes, I can well imagine that AMD is different, and I also heartily
would recommend rather being safe than sorry. Putting an explicit memory
barrier in between those accesses when you know it might make a difference
is just a good idea.
But basically, as far as I know the thing was designed to be invisible to
old software: that is the whole idea behind WC memory. So the design was
certainly intended to be that you can generally mark a framebuffer-like
structure WC without any software _ever_ caring, as long as you keep all
control ports in UC memory.
Of course, because burst writes from the WC buffer are *so* much more
efficient on the PCI bus than dribbling them out one write at a time, it
didn't take long before all the graphics cards etc wanted to *also*
mark their command queues as WC memory, so that you could burst out the
commands to the ring buffers as fast as possible. So now you have both
your frame buffer *and* your command buffers mapped WC, and now ordering
really has to be ensured in software if you access both.
[ And then there are the crazy people who mark *main memory* as WC,
because they don't want to pollute the cache with all the data, and then
you have the issue of cache coherency etc crap. Which only gets worse
with SMP, especially if one processor thinks it has part of memory
exclusively cached, and another one - or even the same one,
through another aliasing address - ignores the cache protocol.
And you now get unhappy CPUs that think that there is a bug in the
cache protocol and they get machine check faults.
So what started out as a "we can do accesses to the frame buffer more
efficiently without anybody ever even having to know or care" has
turned into a whole nightmare of people using it for other things, and
then you very much _do_ have to care! ]
And it doesn't surprise me if AMD then didn't get exactly the same
rules.
Oh, well.
Linus
On Wed, 4 Jun 2008, Nick Piggin wrote:
> On Wednesday 04 June 2008 07:44, Trent Piepho wrote:
>> On Tue, 3 Jun 2008, Matthew Wilcox wrote:
>
>>> I don't understand why you keep talking about DMA. Are you talking
>>> about ordering between readX() and DMA? PCI provides those guarantees.
>>
>> I guess you haven't been reading the whole thread. The reason it started
>> was because gcc can re-order powerpc (and everyone else's too) IO accesses
>> vs accesses to cacheable memory (but not spin-locks), which ends up only
>> being a problem with coherent DMA.
>
> I don't think it is only a problem with coherent DMA.
>
> CPU0                                  CPU1
> mutex_lock(mutex);
> writel(something, DATA_REG);
> writel(GO, CTRL_REG);
> started = 1;
(A)
> mutex_unlock(mutex);
>                                       mutex_lock(mutex);
(B)
>                                       if (started)
>                                               /* oops, this can reach device before GO */
>                                               writel(STOP, CTRL_REG);
The locks themselves should have (and do have) ordering operations to ensure
gcc and/or the cpu can't move a store or load outside the locked region.
Generally you need that to keep stores/loads to cacheable memory inside the
critical area, much less I/O operations. Otherwise all you have to do is
replace writel(something, ...) with shared_data->something = ... and there's
an obvious problem. In your example, gcc currently can and will move the GO
operation to point A (if it can figure out that CTRL_REG and started aren't
aliased), but that's not a problem. If it could move it to B that would be a
problem, but it can't.
Other than coherent DMA, I don't think there is any reason to care if I/O
accessors are strongly ordered wrt load/stores to cacheable memory. locking
and streaming DMA sync operations already need to have ordering, so they don't
require all I/O to be ordered wrt all cacheable memory.
> Anyway, Intel certainly seems to document that WC memory is serialized by
> any access to UC memory.
I don't believe that is actually true on Pentium Pro at least.
> So what started out as a "we can do accesses to the frame buffer more
> efficiently without anybody ever even having to know or care" has
> turned into a whole nightmare of people using it for other things, and
> then you very much _do_ have to care! ]
Notably graphics where 3Dfx lined the registers up specifically to make
this work. We were also using it with I²O where it gave a small
performance gain.
Alan
On Wednesday 04 June 2008 05:07, Linus Torvalds wrote:
> On Tue, 3 Jun 2008, Trent Piepho wrote:
> > On Tue, 3 Jun 2008, Linus Torvalds wrote:
> > > On Tue, 3 Jun 2008, Nick Piggin wrote:
> > > > Linus: on x86, memory operations to wc and wc+ memory are not ordered
> > > > with one another, or operations to other memory types (ie. load/load
> > > > and store/store reordering is allowed). Also, as you know, store/load
> > > > reordering is explicitly allowed as well, which covers all memory
> > > > types. So perhaps it is not quite true to say readl/writel is
> > > > strongly ordered by default even on x86. You would have to put in
> > > > some mfence instructions in them to make it so.
> >
> > So on x86, these could be re-ordered?
> >
> > writel(START_OPERATION, CONTROL_REGISTER);
> > status = readl(STATUS_REGISTER);
>
> With both registers in a WC+ area, yes. The write may be in the WC buffers
> until the WC buffers are flushed (short list: a fence, a serializing
> instruction, a read-write to uncached memory, or an interrupt. There are
> others, but those are the main ones).
>
> But if the status register is in uncached memory (which is the only *sane*
> thing to do), then it doesn't matter if the control register is in WC
> memory. Because the status register read is itself serializing with the WC
> buffer, it's actually fine.
>
> So this is used for putting things like ring queues in WC memory, and fill
> them up with writes, and get nice bursty write traffic with the CPU
> automatically buffering it up (think "stdio.h on a really low level"). And
> if you then have the command registers in UC memory or using IO port
> accesses, reading and writing to them will automatically serialize.
OK, I'm still not quite sure where this has ended up. I guess you are happy
with x86 semantics as they are now. That is, all IO accesses are strongly
ordered WRT one another and WRT cacheable memory (which includes keeping
them within spinlocks), *unless* one asks for WC memory, in which case that
memory is quite weakly ordered (and is not even ordered by a regular IO
readl, at least according to AMD spec). So for WC memory, one still needs
to use mb/rmb/wmb.
So that still doesn't tell us what *minimum* level of ordering we should
provide in the cross platform readl/writel API. Some relatively sane
suggestions would be:
 - as strong as x86. guaranteed not to break drivers that work on x86,
   but slower on some archs. To me, this is most pleasing. It is much
   much easier to notice something is going a little slower and to work
   out how to use weaker ordering there, than it is to debug some
   once-in-a-bluemoon breakage caused by just the right architecture,
   driver, etc. It totally frees up the driver writer from thinking
   about barriers, provided they get the locking right.

 - ordered WRT other IO accessors, constrained within spinlocks, but not
   cacheable memory. This is what powerpc does now. It's a little faster
   for them, and probably covers the vast majority of drivers, but there
   are real possibilities to get it wrong (trivial example: using bit
   locks or mutexes or any kind of open coded locking or lockless
   synchronisation can break).

 - (less sane) same as above, but not ordered WRT spinlocks. This is what
   ia64 (sn2) does. From a purist POV, it is a little less arbitrary than
   powerpc, but in practice, it will break a lot more drivers than powerpc.
I was kind of joking about taking control of this issue :) But seriously,
it needs a decision to be made. I vote for #1. My rationale: I'm still
finding relatively major (well, found maybe 4 or 5 in the last couple of
years) bugs in the mm subsystem due to memory ordering problems. This is
apparently one of the most well reviewed and tested bits of code in the
kernel, by people who know all about memory ordering. Not to mention that
mm/ does not have to worry about IO ordering at all. Then apparently
drivers are the least reviewed and tested. Connect dots.
Now that doesn't leave weaker ordering architectures lumped with "slow old
x86 semantics". Think of it as giving them the benefit of sharing x86
development and testing :) We can then formalise the relaxed __ accessors
to be more complete (ie. +/- byteswapping). I'd also propose to add
io_rmb/io_wmb/io_mb that order io/io access, to help architectures like
sn2 where the io/cacheable barrier is pretty expensive.
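Usage would then look something like this (sketch with the proposed names):

__writel(desc_lo, ring);        /* relaxed MMIO stores...           */
__writel(desc_hi, ring + 4);
io_wmb();                       /* ...ordered only among themselves */
__writel(1, doorbell);
mb();                           /* only if cacheable RAM must be ordered
                                   against the MMIO as well */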
Any comments?
On Monday, June 09, 2008 11:56 pm Nick Piggin wrote:
> So that still doesn't tell us what *minimum* level of ordering we should
> provide in the cross platform readl/writel API. Some relatively sane
> suggestions would be:
>
> - as strong as x86. guaranteed not to break drivers that work on x86,
> but slower on some archs. To me, this is most pleasing. It is much
> much easier to notice something is going a little slower and to work
> out how to use weaker ordering there, than it is to debug some
> once-in-a-bluemoon breakage caused by just the right architecture,
> driver, etc. It totally frees up the driver writer from thinking
> about barriers, provided they get the locking right.
>
> - ordered WRT other IO accessors, constrained within spinlocks, but not
> cacheable memory. This is what powerpc does now. It's a little faster
> for them, and probably covers the vast majority of drivers, but there
> are real possibilities to get it wrong (trivial example: using bit
> locks or mutexes or any kind of open coded locking or lockless
> synchronisation can break).
>
> - (less sane) same as above, but not ordered WRT spinlocks. This is what
> ia64 (sn2) does. From a purist POV, it is a little less arbitrary than
> powerpc, but in practice, it will break a lot more drivers than powerpc.
>
> I was kind of joking about taking control of this issue :) But seriously,
> it needs a decision to be made. I vote for #1. My rationale: I'm still
> finding relatively major (well, found maybe 4 or 5 in the last couple of
> years) bugs in the mm subsystem due to memory ordering problems. This is
> apparently one of the most well reviewed and tested bits of code in the
> kernel by people who know all about memory ordering. Not to mention that
> mm/ does not have to worry about IO ordering at all. Then apparently
> drivers are the least reviewed and tested. Connect dots.
>
> Now that doesn't leave weaker ordering architectures lumped with "slow old
> x86 semantics". Think of it as giving them the benefit of sharing x86
> development and testing :) We can then formalise the relaxed __ accessors
> to be more complete (ie. +/- byteswapping). I'd also propose to add
> io_rmb/io_wmb/io_mb that order io/io access, to help architectures like
> sn2 where the io/cacheable barrier is pretty expensive.
>
> Any comments?
FWIW that approach sounds pretty good to me. Arches that suffer from
performance penalties can still add lower level primitives and port selected
drivers over, so really they won't be losing much. AFAICT though drivers
will still have to worry about regular memory ordering issues; they'll just
be safe from I/O related ones. :) Still, the simplification is probably
worth it.
Jesse
On Tue, 2008-06-10 at 10:41 -0700, Jesse Barnes wrote:
> On Monday, June 09, 2008 11:56 pm Nick Piggin wrote:
> > So that still doesn't tell us what *minimum* level of ordering we should
> > provide in the cross platform readl/writel API. Some relatively sane
> > suggestions would be:
> >
> > - as strong as x86. guaranteed not to break drivers that work on x86,
> > but slower on some archs. To me, this is most pleasing. It is much
> > much easier to notice something is going a little slower and to work
> > out how to use weaker ordering there, than it is to debug some
> > once-in-a-bluemoon breakage caused by just the right architecture,
> > driver, etc. It totally frees up the driver writer from thinking
> > about barriers, provided they get the locking right.
> >
> > - ordered WRT other IO accessors, constrained within spinlocks, but not
> > cacheable memory. This is what powerpc does now. It's a little faster
> > for them, and probably covers the vast majority of drivers, but there
> > are real possibilities to get it wrong (trivial example: using bit
> > locks or mutexes or any kind of open coded locking or lockless
> > synchronisation can break).
> >
> > - (less sane) same as above, but not ordered WRT spinlocks. This is what
> > ia64 (sn2) does. From a purist POV, it is a little less arbitrary than
> > powerpc, but in practice, it will break a lot more drivers than powerpc.
> >
> > I was kind of joking about taking control of this issue :) But seriously,
> > it needs a decision to be made. I vote for #1. My rationale: I'm still
> > finding relatively major (well, found maybe 4 or 5 in the last couple of
> > years) bugs in the mm subsystem due to memory ordering problems. This is
> > apparently one of the most well reviewed and tested bits of code in the
> > kernel by people who know all about memory ordering. Not to mention that
> > mm/ does not have to worry about IO ordering at all. Then apparently
> > drivers are the least reviewed and tested. Connect dots.
> >
> > Now that doesn't leave weaker ordering architectures lumped with "slow old
> > x86 semantics". Think of it as giving them the benefit of sharing x86
> > development and testing :) We can then formalise the relaxed __ accessors
> > to be more complete (ie. +/- byteswapping). I'd also propose to add
> > io_rmb/io_wmb/io_mb that order io/io access, to help architectures like
> > sn2 where the io/cacheable barrier is pretty expensive.
> >
> > Any comments?
>
> FWIW that approach sounds pretty good to me. Arches that suffer from
> performance penalties can still add lower level primitives and port selected
> drivers over, so really they won't be losing much. AFAICT though drivers
> will still have to worry about regular memory ordering issues; they'll just
> be safe from I/O related ones. :) Still, the simplification is probably
> worth it.
me too. That's the whole basis for readX_relaxed() and its cohorts: we
make our weirdest machines (like altix) conform to the x86 norm. Then
where it really kills us we introduce additional semantics to selected
drivers that enable us to recover I/O speed on the abnormal platforms.
About the only problem we've had is that architectures aren't very good
at co-ordinating for their additional accessors so we tend to get a
forest of strange ones growing up, which appear only in a few drivers
(i.e. the ones that need the speed ups) and which have no well
documented meaning.
James
> me too. That's the whole basis for readX_relaxed() and its cohorts: we
> make our weirdest machines (like altix) conform to the x86 norm. Then
> where it really kills us we introduce additional semantics to selected
> drivers that enable us to recover I/O speed on the abnormal platforms.
Except as I pointed out before, Altix doesn't conform to the norm and
many (most?) drivers are missing mmiowb()s that are needed for Altix.
Just no one has plugged most devices into an Altix (or haven't stressed
the driver in a way that exposes problems of IO ordering between CPUs).
It would be a great thing to use the powerpc trick of setting a flag
that is tested by spin_unlock()/mutex_unlock() and automatically doing
the mmiowb() if needed, and then killing off mmiowb() entirely.
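Something like this, say (sketch of the powerpc scheme, with hypothetical
names for the flag and the unlock hook):

DEFINE_PER_CPU(int, io_write_pending);

static inline void writel(u32 val, volatile void __iomem *addr)
{
        out_le32(addr, val);                    /* the arch's MMIO store */
        __get_cpu_var(io_write_pending) = 1;
}

/* called by spin_unlock()/mutex_unlock() before releasing the lock */
static inline void io_unlock_fixup(void)
{
        if (__get_cpu_var(io_write_pending)) {
                __get_cpu_var(io_write_pending) = 0;
                mmiowb();       /* push the MMIO writes out first */
        }
}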
- R.
On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote:
> > me too. That's the whole basis for readX_relaxed() and its cohorts: we
> > make our weirdest machines (like altix) conform to the x86 norm. Then
> > where it really kills us we introduce additional semantics to selected
> > drivers that enable us to recover I/O speed on the abnormal platforms.
>
> Except as I pointed out before, Altix doesn't conform to the norm and
> many (most?) drivers are missing mmiowb()s that are needed for Altix.
> Just no one has plugged most devices into an Altix (or haven't stressed
> the driver in a way that exposes problems of IO ordering between CPUs).
>
> It would be a great thing to use the powerpc trick of setting a flag
> that is tested by spin_unlock()/mutex_unlock() and automatically doing
> the mmiowb() if needed, and then killing off mmiowb() entirely.
Yeah I think that's what Nick's guidelines would guarantee. And Jes is
already working on the spin_unlock change you mentioned, so mmiowb() should
be history soon (in name only: assuming Nick also introduces the I/O barriers
he talked about for ordering the looser accessors, it would still be there,
just called io_wmb or something).
Jesse
On Wednesday 11 June 2008 05:19, Jesse Barnes wrote:
> On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote:
> > > me too. That's the whole basis for readX_relaxed() and its cohorts:
> > > we make our weirdest machines (like altix) conform to the x86 norm.
> > > Then where it really kills us we introduce additional semantics to
> > > selected drivers that enable us to recover I/O speed on the abnormal
> > > platforms.
> >
> > Except as I pointed out before, Altix doesn't conform to the norm and
> > many (most?) drivers are missing mmiowb()s that are needed for Altix.
> > Just no one has plugged most devices into an Altix (or haven't stressed
> > the driver in a way that exposes problems of IO ordering between CPUs).
> >
> > It would be a great thing to use the powerpc trick of setting a flag
> > that is tested by spin_unlock()/mutex_unlock() and automatically doing
> > the mmiowb() if needed, and then killing off mmiowb() entirely.
>
> Yeah I think that's what Nick's guidelines would guarantee. And Jes is
> already working on the spin_unlock change you mentioned, so mmiowb() should
> be history soon (in name only: assuming Nick also introduces the I/O
> barriers he talked about for ordering the looser accessors, it would still
> be there, just called io_wmb or something).
Exactly, yes. I guess everybody has had good intentions here, but
as noticed, what is lacking is coordination and documentation.
You mention strong ordering WRT spin_unlock, which suggests that
you would prefer to take option #2 (the current powerpc one): io/io
is ordered and io is contained inside spinlocks, but io/cacheable
in general is not ordered.
I *would* prefer that io/cacheable actually is strongly ordered with
the default accessors. Because if you have that, then the driver
writer never has to care about memory ordering, provided they use
correct locking for SMP issues. Same as x86. With option 2, there
are still windows where you could possibly have issues.
For any high performance drivers that are well maintained (ie. the
ones where slowdown might be noticed), everyone should have a pretty
good handle on memory ordering requirements, so it shouldn't take
long to go through and convert them to relaxed accessors.
On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote:
>
> Exactly, yes. I guess everybody has had good intentions here, but
> as noticed, what is lacking is coordination and documentation.
>
> You mention strong ordering WRT spin_unlock, which suggests that
> you would prefer to take option #2 (the current powerpc one): io/io
> is ordered and io is contained inside spinlocks, but io/cacheable
> in general is not ordered.
IO/cacheable -is- ordered on powerpc in what we believe are the directions
that matter: IO reads are fully ordered vs. anything and IO writes are
ordered vs. previous cacheable stores. The only "relaxed" situation is
IO writes followed by cacheable stores, which I believe shouldn't be
a problem. (except for spinlocks for which we use the flag trick)
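As a rough sketch of those semantics (a simplification; the real accessor
lives in arch code and this glosses over eieio vs. sync details):

	static inline void sketch_writel(u32 val, volatile void __iomem *addr)
	{
		/* sync orders this MMIO store after all previous stores,
		   cacheable ones included */
		asm volatile("sync" : : : "memory");
		*(volatile u32 __force *)addr = cpu_to_le32(val);
		/* no trailing barrier: a later cacheable store (e.g. an
		   unlock without the flag trick) may be seen first */
	}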
> I *would* prefer that io/cacheable actually is strongly ordered with
> the default accessors. Because if you have that, then the driver
> writer never has to care about memory ordering, provided they use
> correct locking for SMP issues. Same as x86. With option 2, there
> are still windows where you could possibly have issues.
>
> For any high performance drivers that are well maintained (ie. the
> ones where slowdown might be noticed), everyone should have a pretty
> good handle on memory ordering requirements, so it shouldn't take
> long to go through and convert them to relaxed accessors.
Ben.
On Wednesday 11 June 2008 13:40, Benjamin Herrenschmidt wrote:
> On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote:
> > Exactly, yes. I guess everybody has had good intentions here, but
> > as noticed, what is lacking is coordination and documentation.
> >
> > You mention strong ordering WRT spin_unlock, which suggests that
> > you would prefer to take option #2 (the current powerpc one): io/io
> > is ordered and io is contained inside spinlocks, but io/cacheable
> > in general is not ordered.
>
> IO/cacheable -is- ordered on powerpc in what we believe are the directions
> that matter: IO reads are fully ordered vs. anything and IO writes are
> ordered vs. previous cacheable stores. The only "relaxed" situation is
> IO writes followed by cacheable stores, which I believe shouldn't be
> a problem. (except for spinlocks for which we use the flag trick)
Spinlocks... mutexes, semaphores, rwsems, rwlocks, bit spinlocks, bit
mutexes, open coded bit locks (of which there are a number floating
around in drivers/).
But even assuming you get all of that fixed up, I wonder what benefit is
so big for powerpc that you'd rather add the exception "cacheable
stores are not ordered with previous io stores" than say "any driver
which works on x86 will work on powerpc as far as memory ordering goes"?
(don't you also need something to order io reads with cacheable reads?
as per my observation that your rmb is broken according to IBM docs)
Obviously you already have a sync instruction in your writel, so 1)
adding a second one doesn't slow it down by an order of magnitude or
anything, just some small factor; and 2) you obviously want to be
converting high performance drivers to a more relaxed model anyway
regardless of whether there is one sync or two in there.
Has this ever been measured or thought carefully about? Beyond the
extent of "one sync good, two sync bad" ;)
Nick Piggin writes:
> OK, I'm still not quite sure where this has ended up. I guess you are happy
> with x86 semantics as they are now. That is, all IO accesses are strongly
> ordered WRT one another and WRT cacheable memory (which includes keeping
> them within spinlocks),
My understanding was that on x86, loads could pass stores in general,
i.e. a later load could be performed before an earlier store. I guess
that can't be true for uncached loads, but could a cacheable load be
performed before an earlier uncached store?
> - as strong as x86. guaranteed not to break drivers that work on x86,
> but slower on some archs. To me, this is most pleasing. It is much
> much easier to notice something is going a little slower and to work
> out how to use weaker ordering there, than it is to debug some
> once-in-a-bluemoon breakage caused by just the right architecture,
> driver, etc. It totally frees up the driver writer from thinking
> about barriers, provided they get the locking right.
I just wish we had even one actual example of things going wrong with
the current rules we have on powerpc to motivate changing to this
model.
> Now that doesn't leave weaker ordering architectures lumped with "slow old
> x86 semantics". Think of it as giving them the benefit of sharing x86
> development and testing :) We can then formalise the relaxed __ accessors
> to be more complete (ie. +/- byteswapping).
That leaves a gulf between the extremely strongly ordered writel
etc. and the extremely weakly ordered __writel etc. The current
powerpc scheme is fine for a lot of drivers but your proposal would
leave us no way to deliver it to them.
Paul.
On Wednesday 11 June 2008 14:18, Paul Mackerras wrote:
> Nick Piggin writes:
> > OK, I'm still not quite sure where this has ended up. I guess you are
> > happy with x86 semantics as they are now. That is, all IO accesses are
> > strongly ordered WRT one another and WRT cacheable memory (which includes
> > keeping them within spinlocks),
>
> My understanding was that on x86, loads could pass stores in general,
> i.e. a later load could be performed before an earlier store.
Yes, this is the one reordering allowed by the ISA on cacheable memory.
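To make that concrete, this is the classic store-buffer litmus test
(hypothetical variables; a real test would need volatile or atomic
accesses):

	int X, Y, r0, r1;	/* all initially zero, cacheable */

	void cpu0(void) { X = 1; r0 = Y; }
	void cpu1(void) { Y = 1; r1 = X; }

On x86 the outcome r0 == 0 && r1 == 0 is permitted, because each CPU's
load may complete before its own earlier store drains from the store
buffer; no other reordering of cacheable loads and stores is
architecturally visible.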
WC memory is weaker, which Linus wants to allow an exception for because
one must explicitly ask for it. UC memory (which presumably is what
we're talking about as "IO access") is, I think, stronger: UC accesses
are not reordered against one another or against cacheable accesses:
AMD says this:
c — A store (wp,wt,wb,uc,wc,wc+) may not pass a previous load
(wp,wt,wb,uc,wc,wc+).
f — A load (uc) does not pass a previous store (wp,wt,wb,uc,wc,wc+).
g — A store (wp,wt,wb,uc) does not pass a previous store (wp,wt,wb,uc).
i — A load (wp,wt,wb,wc,wc+) does not pass a previous store (uc).
AMD does allow WC/WC+ to be weakly ordered WRT WC as well as UC, which
Intel seemingly does not. But we're already treating WC as special.
I can't actually find the definitive statement in the Intel manuals
saying UC is strongly ordered also WRT WB. Linus?
> I guess
> that can't be true for uncached loads, but could a cacheable load be
> performed before an earlier uncached store?
Not according to rule i above: a cacheable load does not pass a previous
UC store.
> > - as strong as x86. guaranteed not to break drivers that work on x86,
> > but slower on some archs. To me, this is most pleasing. It is much
> > much easier to notice something is going a little slower and to work
> > out how to use weaker ordering there, than it is to debug some
> > once-in-a-bluemoon breakage caused by just the right architecture,
> > driver, etc. It totally frees up the driver writer from thinking
> > about barriers, provided they get the locking right.
>
> I just wish we had even one actual example of things going wrong with
> the current rules we have on powerpc to motivate changing to this
> model.
~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l
506
How sure are you that none of those forms part of a cobbled-together
locking scheme that hopes to constrain some IO access?
~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | grep while | wc -l
29
How about those?
~/usr/src/linux-2.6> git grep mutex_lock drivers/ | wc -l
3138
How sure are you that none of them is hoping to constrain IO operations
within the lock?
Also grep for down, down_write, write_lock, and maybe some others I've
forgotten. And then forget completely about locking and imagine some
of the open coded things you see around the place (or parts where drivers
try to get creative and open code their own locking or try lockless
things).
> > Now that doesn't leave weaker ordering architectures lumped with "slow old
> > x86 semantics". Think of it as giving them the benefit of sharing x86
> > development and testing :) We can then formalise the relaxed __ accessors
> > to be more complete (ie. +/- byteswapping).
>
> That leaves a gulf between the extremely strongly ordered writel
> etc. and the extremely weakly ordered __writel etc. The current
> powerpc scheme is fine for a lot of drivers but your proposal would
> leave us no way to deliver it to them.
But surely you have to audit the drivers anyway to ensure they are OK
with the current powerpc scheme. In which case, once you have audited
them and know they are safe, you can easily convert them to the even
_faster_ __readl/__writel, and just add the appropriate barriers.
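For example, a conversion might end up looking something like this (a
hedged sketch using the __writel and io_wmb names as proposed in this
thread, not an existing API; the driver structure is hypothetical):

	struct my_nic {				/* hypothetical */
		u32 __iomem *ring;
		u32 __iomem *doorbell;
	};

	static void post_ring(struct my_nic *nic, u32 *desc, int n)
	{
		int i;

		/* descriptor writes need no ordering among themselves */
		for (i = 0; i < n; i++)
			__writel(desc[i], nic->ring + i);

		io_wmb();			/* descriptors before doorbell */
		__writel(1, nic->doorbell);	/* kick the device */
	}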
Nick Piggin writes:
> > I just wish we had even one actual example of things going wrong with
> > the current rules we have on powerpc to motivate changing to this
> > model.
>
> ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l
> 506
> How sure are you that none of those forms part of a cobbled-together
> locking scheme that hopes to constrain some IO access?
My comment was precisely about the fact that this sort of argument is
actually FUD. I want one example that is demonstrably wrong, not just
a "how sure are you".
> But surely you have to audit the drivers anyway to ensure they are OK
> with the current powerpc scheme. In which case, once you have audited
> them and know they are safe, you can easily convert them to the even
> _faster_ __readl/__writel, and just add the appropriate barriers.
The trouble is that as code gets maintained, using __writel + explicit
barriers is more fragile because some people will change the code, or
add new code, without understanding the barriers. So whenever a
driver gets converted to using __writel + barriers, we will end up
having to watch every change that goes into it forever. Whereas with
the current scheme there's a much smaller set of gotchas to watch out
for, and the gotchas are things that already raise red flags, such as
open-coded locking and any sort of "clever" lockless scheme.
Paul.
On Wednesday 11 June 2008 15:13, Paul Mackerras wrote:
> Nick Piggin writes:
> > > I just wish we had even one actual example of things going wrong with
> > > the current rules we have on powerpc to motivate changing to this
> > > model.
> >
> > ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l
> > 506
> > How sure are you that none of those forms part of a cobbled-together
> > locking scheme that hopes to constrain some IO access?
>
> My comment was precisely about the fact that this sort of argument is
> actually FUD. I want one example that is demonstrably wrong, not just
> a "how sure are you".
But this is a case where you really should be scared, so FUD is
completely appropriate.
At least, I don't see how it is at all worse than FUD about how much
slower everything will be with stronger ordering, and how nobody will
be able to do anything about it. In reality, for most devices it will
not be measurable, and some, like network or disk, might use a bit
of a tweak in one or two fastpaths.
If the driver author honestly does not know the code well enough to
say whether a particular ordering is allowed or not, then it should
just not be allowed.
Anyway, what do I win if I give you a concrete example?
I found one in the first file I picked out of my grep: drivers/net/s2io.c
if (test_and_set_bit(__S2IO_STATE_LINK_TASK, &(nic->state))) {
/* The card is being reset, no point doing anything */
goto out_unlock;
}
...
val64 = readq(&bar0->adapter_control);
val64 |= ADAPTER_LED_ON;
writeq(val64, &bar0->adapter_control);
s2io_link(nic, LINK_UP);
} else {
if (CARDS_WITH_FAULTY_LINK_INDICATORS(nic->device_type,
subid)) {
val64 = readq(&bar0->gpio_control);
val64 &= ~GPIO_CTRL_GPIO_0;
writeq(val64, &bar0->gpio_control);
val64 = readq(&bar0->gpio_control);
}
/* turn off LED */
val64 = readq(&bar0->adapter_control);
val64 = val64 &(~ADAPTER_LED_ON);
writeq(val64, &bar0->adapter_control);
s2io_link(nic, LINK_DOWN);
}
clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state));
Now I can't say definitively that this is going to be wrong on
powerpc, because I don't know the code well enough. But I'd be
90% sure that the unlock is not supposed to be visible to
other CPUs before the writeqs are queued to the card. On x86 it
wouldn't be.
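(For illustration, one way to make that exit path safe under the weaker
schemes, using barriers that exist today, would be:

	writeq(val64, &bar0->adapter_control);
	s2io_link(nic, LINK_DOWN);
	...
	mmiowb();			/* MMIO writes before the bit release */
	smp_mb__before_clear_bit();	/* and the cacheable stores too */
	clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state));

which is exactly the kind of barrier pairing that is easy to get wrong
or forget.)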
> > But surely you have to audit the drivers anyway to ensure they are OK
> > with the current powerpc scheme. In which case, once you have audited
> > them and know they are safe, you can easily convert them to the even
> > _faster_ __readl/__writel, and just add the appropriate barriers.
>
> The trouble is that as code gets maintained, using __writel + explicit
> barriers is more fragile because some people will change the code, or
> add new code, without understanding the barriers.
If the relaxed writel and explicit barriers are not well documented and
well understood, then they should not be used.
And if the memory ordering issues are not well understood, then it
should be done with locking and strongly ordered IO accessors.
> So whenever a
> driver gets converted to using __writel + barriers, we will end up
> having to watch every change that goes into it forever. Whereas with
*You* don't have to watch it, the maintainer does. And if they can't
understand the barriers, then it should not be converted to use them.
Unless you want to take over maintainership. And if you do, then you
should be watching all changes that go into it.
> the current scheme there's a much smaller set of gotchas to watch out
> for, and the gotchas are things that already raise red flags, such as
> open-coded locking and any sort of "clever" lockless scheme.
And with the strongly ordered scheme, there is a much smaller set of
gotchas again, and it behaves the same as the x86 it was developed on,
and you don't get confused by the list of rules: IO access is either
strongly ordered or weakly ordered.
On Wednesday 11 June 2008 15:35, Nick Piggin wrote:
> On Wednesday 11 June 2008 15:13, Paul Mackerras wrote:
> > Nick Piggin writes:
> > > > I just wish we had even one actual example of things going wrong with
> > > > the current rules we have on powerpc to motivate changing to this
> > > > model.
> > >
> > > ~/usr/src/linux-2.6> git grep test_and_set_bit drivers/ | wc -l
> > > 506
> > > How sure are you that none of those forms part of a cobbled-together
> > > locking scheme that hopes to constrain some IO access?
> >
> > My comment was precisely about the fact that this sort of argument is
> > actually FUD. I want one example that is demonstrably wrong, not just
> > a "how sure are you".
>
> But this is a case where you really should be scared, so FUD is
> completely appropriate.
>
> At least, I don't see how it is at all worse than FUD about how much
> slower everything will be with stronger ordering, and how nobody will
> be able to do anything about it. In reality, for most devices it will
> not be measurable, and some, like network or disk, might use a bit
> of a tweak in one or two fastpaths.
>
> If the driver author honestly does not know the code well enough to
> say whether a particular ordering is allowed or not, then it should
> just not be allowed.
>
>
> Anyway, what do I win if I give you a concrete example?
>
> I found one in the first file I picked out of my grep: drivers/net/s2io.c
>
> if (test_and_set_bit(__S2IO_STATE_LINK_TASK, &(nic->state))) {
> /* The card is being reset, no point doing anything */
> goto out_unlock;
> }
>
> ...
> val64 = readq(&bar0->adapter_control);
> val64 |= ADAPTER_LED_ON;
> writeq(val64, &bar0->adapter_control);
> s2io_link(nic, LINK_UP);
> } else {
> if (CARDS_WITH_FAULTY_LINK_INDICATORS(nic->device_type,
> subid)) {
> val64 = readq(&bar0->gpio_control);
> val64 &= ~GPIO_CTRL_GPIO_0;
> writeq(val64, &bar0->gpio_control);
> val64 = readq(&bar0->gpio_control);
> }
> /* turn off LED */
> val64 = readq(&bar0->adapter_control);
> val64 = val64 &(~ADAPTER_LED_ON);
> writeq(val64, &bar0->adapter_control);
> s2io_link(nic, LINK_DOWN);
> }
> clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state));
>
> Now I can't say definitively that this is going to be wrong on
> powerpc, because I don't know the code well enough. But I'd be
> 90% sure that the unlock is not supposed to be visible to
> other CPUs before the writeqs are queued to the card. On x86 it
> wouldn't be.
And I hope you're reading this, sn2 guys :)
The mmiowb() audit for spin_unlock was a nice idea, but this example
shows that unless you *really* know the code and have actually audited
everything, you should not be messing with relaxing of ordering
etc.
Now obviously the example I quote is a slowpath, in which case it can
very happily continue to use strongly ordered io accessors, and you
will never ever notice the extra sync or PCI read or whatever after
the writeq. Then you can easily relax the couple of fastpaths that
matter, put barriers in there, comment them, and you're away: you now
have a *working* driver (that doesn't randomly lose adapter_control
bits twice a year), and that is also faster than it was when it was
broken because you've specifically used the very weakly ordered IO
accessors in the fast paths that matter.
And when somebody plugs some obscure hardware into your box that
you haven't audited and put mmiowb etc etc into, then guess what? It
is also not going to blow up in an undebuggable way 6 months later in
the middle of <insert something important here>. The worst case is
that the user notices it is running a bit slowly and sends a full
profile output the next day.
Nick Piggin writes:
> Now that doesn't leave weaker ordering architectures lumped with "slow old
> x86 semantics". Think of it as giving them the benefit of sharing x86
> development and testing :)
Worth something, but not perhaps as much as you think, given that many
x86 driver writers still don't pay much attention to making their code
endian-safe and 64-bit-clean. And there are still plenty of drivers
that use virt_to_bus...
Paul.
On Wed, 11 Jun 2008, Nick Piggin wrote:
>
> I can't actually find the definitive statement in the Intel manuals
> saying UC is strongly ordered also WRT WB. Linus?
Definitive? Dunno. But look in the Architecture manual, volume 3A, 10.3
"Methods of Caching Available", and then under the bullet about Write
Combining (WC), it says
the writes may be delayed until the next occurrence of a serializing
event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read
or write to uncached memory, an interrupt occurrence, or a LOCK
instruction execution.
However, it's worth noting that
- documentation can be wrong, or even if right, can be Intel-specific.
- the above is expressly _only_ about the WC buffer, not about regular
memory writes. Cached memory accesses are different from WC accesses.
so in the end, the thing that matters is how things actually work.
Linus
On Tuesday, June 10, 2008 8:29 pm Nick Piggin wrote:
> On Wednesday 11 June 2008 05:19, Jesse Barnes wrote:
> > On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote:
> > > > me too. That's the whole basis for readX_relaxed() and its cohorts:
> > > > we make our weirdest machines (like altix) conform to the x86 norm.
> > > > Then where it really kills us we introduce additional semantics to
> > > > selected drivers that enable us to recover I/O speed on the abnormal
> > > > platforms.
> > >
> > > Except as I pointed out before, Altix doesn't conform to the norm and
> > > many (most?) drivers are missing mmiowb()s that are needed for Altix.
> > > Just no one has plugged most devices into an Altix (or hasn't stressed
> > > the driver in a way that exposes problems of IO ordering between CPUs).
> > >
> > > It would be a great thing to use the powerpc trick of setting a flag
> > > that is tested by spin_unlock()/mutex_unlock() and automatically doing
> > > the mmiowb() if needed, and then killing off mmiowb() entirely.
> >
> > Yeah I think that's what Nick's guidelines would guarantee. And Jes is
> > already working on the spin_unlock change you mentioned, so mmiowb()
> > should be history soon (in name only: assuming Nick also introduces the
> > I/O barriers he talked about for ordering the looser accessors, it would
> > still be there but would be called io_wmb or something).
>
> Exactly, yes. I guess everybody has had good intentions here, but
> as noticed, what is lacking is coordination and documentation.
>
> You mention strong ordering WRT spin_unlock, which suggests that
> you would prefer to take option #2 (the current powerpc one): io/io
> is ordered and io is contained inside spinlocks, but io/cacheable
> in general is not ordered.
I was thinking it would be good for the weaker accessors, but now that I think
about it you could just use the new io_* barrier functions.
I didn't mean to imply that I wasn't in favor of the io/cacheable ordering as
well.
> For any high performance drivers that are well maintained (ie. the
> ones where slowdown might be noticed), everyone should have a pretty
> good handle on memory ordering requirements, so it shouldn't take
> long to go through and convert them to relaxed accessors.
Yep. Thanks for working on this, Nick, it's definitely a good thing that
you're taking control of it. :)
Thanks,
Jesse
On Thursday 12 June 2008 02:07, Jesse Barnes wrote:
> On Tuesday, June 10, 2008 8:29 pm Nick Piggin wrote:
> > You mention strong ordering WRT spin_unlock, which suggests that
> > you would prefer to take option #2 (the current powerpc one): io/io
> > is ordered and io is contained inside spinlocks, but io/cacheable
> > in general is not ordered.
>
> I was thinking it would be good for the weaker accessors, but now that I
> think about it you could just use the new io_* barrier functions.
>
> I didn't mean to imply that I wasn't in favor of the io/cacheable ordering
> as well.
>
> > For any high performance drivers that are well maintained (ie. the
> > ones where slowdown might be noticed), everyone should have a pretty
> > good handle on memory ordering requirements, so it shouldn't take
> > long to go through and convert them to relaxed accessors.
>
> Yep. Thanks for working on this, Nick, it's definitely a good thing that
> you're taking control of it. :)
Well, I really am just trying to help the kernel for everyone (and every
architecture). Performance for all architectures really is my #2 priority,
so if any arch becomes irreparably slower under a proposal I would
go back to the drawing board.
I'll come up with a proposal in the form of an initial code+documentation
patch when I get some more time to work on it.
Nick Piggin writes:
> /* turn off LED */
> val64 = readq(&bar0->adapter_control);
> val64 = val64 &(~ADAPTER_LED_ON);
> writeq(val64, &bar0->adapter_control);
> s2io_link(nic, LINK_DOWN);
> }
> clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state));
>
> Now I can't say definitively that this is going to be wrong on
> powerpc, because I don't know the code well enough. But I'd be
> 90% sure that the unlock is not supposed to be visible to
> other CPUs before the writeqs are queued to the card. On x86 it
> wouldn't be.
Interestingly, there is also a store to cacheable memory
(nic->device_enabled_once), but no smp_wmb or equivalent before the
clear_bit. So there are other potential problems here besides the I/O
related ones.
Anyway, I have done some tests on a dual G5 here with putting a sync
on both sides of the store in writel etc. (i.e. making readl/writel
strongly ordered w.r.t. everything else), and as you predicted, there
wasn't a noticeable performance degradation, at least not on the
couple of things I tried. So I am now inclined to accept your
suggestion that we should do that. I should probably do some similar
checks on POWER6 and a few other machines first, though.
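Presumably that variant looks roughly like the following simplified
sketch (not the actual patch):

	static inline void sketch_writel_strong(u32 val,
						volatile void __iomem *addr)
	{
		asm volatile("sync" : : : "memory");	/* after prior accesses */
		*(volatile u32 __force *)addr = cpu_to_le32(val);
		asm volatile("sync" : : : "memory");	/* before later accesses,
							   so an unlock cannot
							   pass the device write */
	}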
Paul.
On Thursday 12 June 2008 22:14, Paul Mackerras wrote:
> Nick Piggin writes:
> > /* turn off LED */
> > val64 = readq(&bar0->adapter_control);
> > val64 = val64 &(~ADAPTER_LED_ON);
> > writeq(val64, &bar0->adapter_control);
> > s2io_link(nic, LINK_DOWN);
> > }
> > clear_bit(__S2IO_STATE_LINK_TASK, &(nic->state));
> >
> > Now I can't say definitively that this is going to be wrong on
> > powerpc, because I don't know the code well enough. But I'd be
> > 90% sure that the unlock is not supposed to be visible to
> > other CPUs before the writeqs are queued to the card. On x86 it
> > wouldn't be.
>
> Interestingly, there is also a store to cacheable memory
> (nic->device_enabled_once), but no smp_wmb or equivalent before the
> clear_bit. So there are other potential problems here besides the I/O
> related ones.
Yeah there sure is. That sucks too, but we go one step at a time ;)
I think proposing a strong ordering between set_bit/clear_bit would
actually be quite a noticeable slowdown in core kernel code at this
point.
Which reminds me, I have been meaning to do another pass of test and
set bit / clear bit conversions to the _lock primitives...
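For the s2io pattern quoted above, that conversion would look roughly
like this (the _lock bitop variants are in the tree and give
acquire/release ordering for cacheable memory, though not for MMIO):

	if (test_and_set_bit_lock(__S2IO_STATE_LINK_TASK, &(nic->state)))
		goto out_unlock;	/* acquire: later accesses stay below */
	/* ... LED / link handling as before ... */
	clear_bit_unlock(__S2IO_STATE_LINK_TASK, &(nic->state));
	/* release for cacheable stores; the MMIO side still needs mmiowb() */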
> Anyway, I have done some tests on a dual G5 here with putting a sync
> on both sides of the store in writel etc. (i.e. making readl/writel
> strongly ordered w.r.t. everything else), and as you predicted, there
> wasn't a noticeable performance degradation, at least not on the
> couple of things I tried. So I am now inclined to accept your
> suggestion that we should do that. I should probably do some similar
> checks on POWER6 and a few other machines first, though.
Oh good, thanks for looking into it. I guess it might be a little more
noticeable on bigger POWER systems. And I think we might even need to do
a PCI read after every writel on sn2 systems in order to get the
semantics I want.
I can't say it won't be noticeable. But if we consider the number of
drivers (maybe one or two dozen well maintained ones), and the number of
sites in each driver (maybe one or two submission and completion
fastpaths, which should have a minimum of IO operations in each one),
that will have to be converted in order to get performance as good as or
better than it is currently with relaxed accessors... and weigh that
against all the places in those and every other crappy obscure
driver that we *won't* have to audit, I really think we end up with
a net win even with some short term pain.