2001-10-10 16:27:10

by Paul McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion


> > true for older alphas, especially because they are strictly in-order
> > machines, unlike ev6.
>
> Yes, it sounds strange. However, according to Paul this would not be the
> cpu but a cache coherency issue. rmb() would enforce the cache coherency
> etc... so maybe the issue is related to old SMP motherboards etc... not
> even to the cpus ... dunno. But as said it sounded very strange that
> also new chips and new boards have such a weird reordering trouble.

It sounded strange to me, too. ;-) And my first reading of the
Alpha AXP Architecture Reference Manual didn't help me much.

It was indeed a cache coherency issue. The architect I talked to felt
that it was a feature rather than a bug. I have an email in to him.
In the meantime, Compaq's patent #6,108,737 leads me to believe that
others in DEC/Compaq also believe it to be a feature. The paragraph
starting at column 2 line 20 of the body of this patent states:

In a weakly-ordered system, an order is imposed between selected
sets of memory reference operations, while other operations are
considered unordered. One or more MB instructions are used to
indicate the required order. In the case of an MB instruction
defined by the Alpha (R) 21264 processor instruction set, the MB
denotes that all memory reference instructions above the MB (i.e.,
pre-MB instructions) are ordered before all reference instructions
after the MB (i.e., post-MB instructions). However, no order
is required between reference instructions that are not separated
by an MB.

(The patent talks about the WMB instruction later on.)

In other words, if there is no MB, the CPU is not required to maintain
ordering, regardless of data dependencies or anything else.

There is also an application note at

http://www.openvms.compaq.com/wizard/wiz_2637.html

which states:

For instance, your producer must issue a "memory barrier" instruction
after writing the data to shared memory and before inserting it on
the queue; likewise, your consumer must issue a memory barrier
instruction after removing an item from the queue and before reading
from its memory. Otherwise, you risk seeing stale data, since,
while the Alpha processor does provide coherent memory, it does
not provide implicit ordering of reads and writes. (That is, the
write of the producer's data might reach memory after the write of
the queue, such that the consumer might read the new item from the
queue but get the previous values from the item's memory.)

Note that they require a memory barrier (rmb()) between the time the
item is removed from the queue and the time that the data in the item
is referenced, despite the fact that there is a data dependency between
the dequeueing and the dereferencing. So, again, data dependency does
-not- substitute for an MB on Alpha.

Comments from DEC/Compaq people who know Alpha architecture details?

Thanx, Paul

> > I suspect some confusion here - probably that architect meant loads
> > to independent addresses. Of course, in this case mb() is required
> > to assure ordering.
> >
> > Ivan.
>
>
> Andrea
>


2001-10-10 16:59:14

by Andrea Arcangeli

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

On Wed, Oct 10, 2001 at 08:24:11AM -0700, Paul McKenney wrote:
> which states:
>
> For instance, your producer must issue a "memory barrier" instruction
> after writing the data to shared memory and before inserting it on
> the queue; likewise, your consumer must issue a memory barrier
> instruction after removing an item from the queue and before reading
> from its memory. Otherwise, you risk seeing stale data, since,
> while the Alpha processor does provide coherent memory, it does
> not provide implicit ordering of reads and writes. (That is, the
> write of the producer's data might reach memory after the write of
> the queue, such that the consumer might read the new item from the
> queue but get the previous values from the item's memory.
>
> Note that they require a memory barrier (rmb()) between the time the
> item is removed from the queue and the time that the data in the item
> is referenced, despite the fact that there is a data dependency between
> the dequeueing and the dereferencing. So, again, data dependency does
> -not- substitute for an MB on Alpha.

This is very explicit indeed.

In short I guess what happens is that the reader may have old data in
its cache, and the rmb() makes sure the cache used after the rmb() is
not older than the cache used before the rmb().

However, the more I think about it, the more I suspect we'd better use
rmb() in all readers in the common code, rather than defining rmbdd with
an IPI on alpha. Maybe alpha isn't the only one that needs the rmb()
after all; it's just the only one defining the behaviour in detail. I
can very well imagine other archs also not invalidating all caches of
all other cpus after a write followed by wmb(): the synchronization can
be delayed safely if the other cpus are readers and they didn't issue an
explicit rmb(). So what alpha is doing can really be seen as a "feature"
that allows delaying synchronization of caches after writes and wmb(),
unless an explicit order is requested by the reader via rmb() (or,
better, mb() on alpha).

The fact is that if there's old data in the cache, a data dependency is
no different from a non-data-dependency if the cpu caches aren't
synchronized or invalidated when the producer runs wmb().

Andrea

2001-10-10 17:26:15

by Linus Torvalds

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

In article <[email protected]>,
Andrea Arcangeli <[email protected]> wrote:
>
>However the more I think about it the more I suspect we'd better use
>rmb() in all readers in the common code

Absolutely. It's not that expensive an operation on sane hardware. And
it's definitely conceptually the only right thing to do - we're saying
that we're doing a read that depends on a previous read having seen
previous memory. Ergo, "rmb()".

Of course, right now Linux only exports a subset of the potential memory
barriers, and maybe we should export a fuller set - allowing CPU's that
have stricter ordering to possibly make it a no-op. But thinking about
even something like x86, I don't see where Intel would guarantee that
two reads (data-dependent or not) would have some implicit memory
ordering.

Re-ordering reads with data dependencies is hard, but it happens quite
naturally in a CPU that does address speculation. I don't know of
anybody who does that, but I bet _somebody_ will. Maybe even the P4?

Linus

2001-10-12 05:10:36

by Rusty Russell

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

On Wed, 10 Oct 2001 17:25:22 +0000 (UTC)
[email protected] (Linus Torvalds) wrote:

> Absolutely. It's not that expensive an operation on sane hardware. And
> it's definitely conceptually the only right thing to do - we're saying
> that we're doing a read that depends on a previous read having seen
> previous memory. Ergo, "rmb()".

Accessing pointer contents after you dereference the pointer is "obvious":
we've been trying to get Richard to understand the problem for FIVE MONTHS,
and he's not stupid!

The PPC manual (thanks Paul M) clearly indicates rmbdd() is not necessary.
That they mention it explicitly suggests it's going to happen on more
architectures: you are correct, we should sprinkle rmbdd() everywhere
(rmb() is heavy on current PPC) and I'll update the Kernel Locking Guide now
that the rules have changed.[1]

Rusty.
[1] Aren't we lucky our documentation is so sparse noone can accuse us of being
inconsistent? 8)

2001-10-12 05:44:25

by Albert D. Cahalan

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with

Paul McKenney writes:

> In the meantime, Compaq's patent #6,108,737 leads me to believe that
> others in DEC/Compaq also believe it to be a feature. The paragraph
> starting at column 2 line 20 of the body of this patent states:
>
> In a weakly-ordered system, an order is imposed between selected
> sets of memory reference operations, while other operations are
> considered unordered. One or more MB instructions are used to
> indicate the required order. In the case of an MB instruction
> defined by the Alpha (R) 21264 processor instruction set, the MB
> denotes that all memory reference instructions above the MB (i.e.,
> pre-MB instructions) are ordered before all reference instructions
> after the MB (i.e., post-MB instructions). However, no order
> is required between reference instructions that are not separated
> by an MB.
>
> (The patent talks about the WMB instruction later on.)
>
> In other words, if there is no MB, the CPU is not required to maintain
> ordering. Regardless of data dependencies or anything else.
>
> There is also an application note at
>
> http://www.openvms.compaq.com/wizard/wiz_2637.html
>
> which states:
>
> For instance, your producer must issue a "memory barrier" instruction
> after writing the data to shared memory and before inserting it on
> the queue; likewise, your consumer must issue a memory barrier
> instruction after removing an item from the queue and before reading
> from its memory. Otherwise, you risk seeing stale data, since,
> while the Alpha processor does provide coherent memory, it does
> not provide implicit ordering of reads and writes. (That is, the
> write of the producer's data might reach memory after the write of
> the queue, such that the consumer might read the new item from the
> queue but get the previous values from the item's memory.
>
> Note that they require a memory barrier (rmb()) between the time the
> item is removed from the queue and the time that the data in the item
> is referenced, despite the fact that there is a data dependency between
> the dequeueing and the dereferencing. So, again, data dependency does
> -not- substitute for an MB on Alpha.

This looks an awful lot like the PowerPC architecture.

In an SMP system, one would most likely mark pages as
requiring coherency. This means that stores to a memory
location from multiple processors will give sane results.
Ordering is undefined when multiple memory locations are
involved.

There is a memory barrier instruction called "eieio".
This is commonly used for IO, but is also useful for RAM.
Two separate sets of memory operations are simultaneously
and independently affected by eieio:

-- set one, generally memory-mapped IO space --
loads to uncached + guarded memory
stores to uncached + guarded memory
stores to write-through-required memory

-- set two, generally RAM on an SMP box --
stores to cached + write-back + coherent

"The eieio instruction is intended for use in managing shared data
structures ... the shared data structure and the lock that protects
it must be altered only by stores that are in the same set"
-- from the 32-bit ppc arch book

2001-10-12 06:26:42

by Paul Mackerras

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with

Albert D. Cahalan writes:

> This looks an awful lot like the PowerPC architecture.
>
> In an SMP system, one would most likely mark pages as
> requiring coherency. This means that stores to a memory
> location from multiple processors will give sane results.
> Ordering is undefined when multiple memory locations are
> involved.

The current PowerPC Architecture spec actually has a paragraph that
says that where a processor does two loads and the second load depends
on the first (i.e. the result from the first load is used in computing
the address for the second load), that they have to be performed in
program order with respect to other processors. In other cases you do
need a barrier as you say.

This constraint has evidently been added since the original PPC
architecture book was published. I strongly doubt that any of the
older PPC implementations would violate that constraint though.

Paul.

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with



--On Friday, 12 October, 2001 1:43 AM -0400 "Albert D. Cahalan"
<[email protected]> wrote:

> There is a memory barrier instruction called "eieio"

This must be April 1 by some calendar.

With a read/write here, a read/write there... (etc.)

Guess the engineer was called Mr/Ms McDonald.

--
Alex Bligh

2001-10-12 08:52:01

by Jonathan Lundell

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with

At 9:28 AM +0100 10/12/01, Alex Bligh - linux-kernel wrote:
>>There is a memory barrier instruction called "eieio"
>
>This must be April 1 by some calendar.

PPC "Enforce In-Order Execution of I/O"
--
/Jonathan Lundell.

2001-10-12 16:29:27

by Linus Torvalds

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion


On Fri, 12 Oct 2001, Rusty Russell wrote:
>
> The PPC manual (thanks Paul M) clearly indicates rmbdd() is not neccessary.
> That they mention it explicitly suggests it's going to happen on more
> architectures: you are correct, we should sprinkle rmbdd() everywhere
> (rmb() is heavy on current PPC) and I'll update the Kernel Locking Guide now
> the rules have changed.[1]

I would suggest re-naming "rmbdd()". I _assume_ that "dd" stands for "data
dependent", but quite frankly, "rmbdd" looks like the standard IBM "we
lost every vowel ever invented" kind of assembly language to me.

I'm sure that having programmed PPC assembly language, you find it very
natural (IBM motto: "We found five vowels hiding in a corner, and we used
them _all_ for the 'eieio' instruction so that we wouldn't have to use
them anywhere else").

But for us normal people, contractions that have more than 3 letters are
hard to remember. I wouldn't mind making the other memory barriers more
descriptive either, but with something like "mb()" at least you don't have
to look five times to make sure you got all the letters in the right
order..

(IBM motto: "If you can't read our assembly language, you must be
borderline dyslexic, and we don't want you to mess with it anyway").

So how about just going all the way and calling it what it is:
"data_dependent_read_barrier()", with a notice in the PPC docs about how
the PPC cannot speculate reads anyway, so it's a no-op.

(IBM motto: "TEN vowels? Don't you know vowels are scrd?")

And hey, if you want to, feel free to create the regular

#define read_barrier() rmb()
#define write_barrier() wmb()
#define memory_barrier() mb()

too. It's not as if we should ever have that many of them, and when we
_do_ have them, they might as well stand out to make it clear that we're
doing subtle things.. Although that "data-dependent read barrier" is a lot
more subtle than most.

Linus

2001-10-12 19:50:43

by Al Dunsmuir

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

On Fri, 12 Oct 2001 09:28:07 -0700 (PDT), Linus Torvalds wrote:
>I'm sure that having programmed PPC assembly language, you find it very
>natural (IBM motto: "We found five vowels hiding in a corner, and we used
>them _all_ for the 'eieio' instruction so that we wouldn't have to use
>them anywhere else").

More likely a bad pun on the children's song "Old McDonald had a farm".

I have no insider knowledge, but I've worked for IBM (on S/390) for 22 years,
_am_ dyslexic, and enjoy bad puns <GRIN>

Al Dunsmuir

2001-10-13 01:08:17

by Paul Mackerras

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

Linus Torvalds writes:

> So how about just going all the way and calling it what it is:
> "data_dependent_read_barrier()", with a notice in the PPC docs about how
> the PPC cannot speculate reads anyway, so it's a no-op.

To set the record straight, the PPC architecture spec says that
implementations *can* speculate reads, but they have to make it look
as if dependent loads have a read barrier between them.

And it isn't speculated reads that are the problem in the alpha case,
it's the fact that the cache can reorder invalidations that are
received from the bus. That's why you can read the new value of p but
the old value of *p on one processor after another processor has just
done something like a = 1; wmb(); p = &a.

My impression from what Paul McKenney was saying was that on most
modern architectures other than alpha, dependent loads act as if they
have a read barrier between them. What we need to know is which
architectures specify that behaviour in their architecture spec, as
against those which do that today but which might not do it tomorrow.

I just looked at the SPARC V9 specification; it has a formal
definition of the memory model which precludes reordering dependent
loads (once again this is an "as if" rule). So on ppc and sparc64 we
have an assurance that we won't need an rmb() between dependent loads
in the future.

As for intel x86, is there an architecture spec that talks about things
like memory ordering? My impression is that the x86 architecture is
pretty much defined by its implementations, but I could well be wrong.

> too. It's not as if we should ever have that many of them, and when we
> _do_ have them, they might as well stand out to make it clear that we're
> doing subtle things.. Although that "data-dependent read barrier" is a lot
> more subtle than most.

Indeed... getting the new p but the old *p is quite
counter-intuitive IMHO.

Paul.

2001-10-13 01:48:59

by Davide Libenzi

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

On Sat, 13 Oct 2001, Paul Mackerras wrote:

> Linus Torvalds writes:
>
> > So how about just going all the way and calling it what it is:
> > "data_dependent_read_barrier()", with a notice in the PPC docs about how
> > the PPC cannot speculate reads anyway, so it's a no-op.
>
> To set the record straight, the PPC architecture spec says that
> implementations *can* speculate reads, but they have to make it look
> as if dependent loads have a read barrier between them.
>
> And it isn't speculated reads that are the problem in the alpha case,
> it's the fact that the cache can reorder invalidations that are
> received from the bus. That's why you can read the new value of p but
> the old value of *p on one processor after another processor has just
> done something like a = 1; wmb(); p = &a.

I think that a necessary condition for reordering of bus-sent
invalidations is a partitioned cache architecture.
I don't see any valid reason for a cache controller of a linear cache
architecture to reorder an invalidation stream coming from a single cpu.
If the invalidation sequence for cpu1 (in a linear cache architecture)
is 1a, 1b, 1c, ... this case can happen:

1a, 2a, 1b, 2b, 2c, 1c, ...
| | |
.........................

but not this :

1a, 2a, 2b, 1c, 2c, 1b, ...
| | |
...............><........


> My impression from what Paul McKenney was saying was that on most
> modern architectures other than alpha, dependent loads act as if they
> have a read barrier between them. What we need to know is which
> architectures specify that behaviour in their architecture spec, as
> against those which do that today but which might not do it tomorrow.

Uhmm, an architecture that, given a = *p;, schedules the load of *p before
the load of p sounds screwy to me.
The problem is that even if cpu1 schedules the load of p before the
load of *p, and cpu2 does a = 1; wmb(); p = &a;, it could happen that
even though the invalidation stream exits cpu2 in order, cpu1 could see
the value of p before the value of *p due to a reordering done by the
cache controller delivering the stream to cpu1.




- Davide



2001-10-13 02:00:50

by Linus Torvalds

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion


On Sat, 13 Oct 2001, Paul Mackerras wrote:
>
> As for intel x86, is there a architecture spec that talks about things
> like memory ordering? My impression is that the x86 architecture is
> pretty much defined by its implementations but I could well be wrong.

It is discussed in the multi-processor management section, under "memory
ordering", and it does say that "reads can be carried out speculatively
and in any order".

HOWEVER, it does have what Intel calls "processor order", and it claims
that "writes by a single processor are observed in the same order by all
processors."

Which implies that if the _same_ CPU wrote *p and p, there is apparently
an ordering constraint there. But I don't think that's the case here. So
the x86 apparently needs a rmb().

(But Intel has redefined the memory ordering so many times that they might
redefine it in the future too and say that dependent loads are ok. I
suspect most of the definitions are of the type "Oh, it used to be ok in
the implementation even though it wasn't defined, and it turns out that
Windows doesn't work if we change it, so we'll define darkness to be the
new standard"..)

> Indeed... getting the new p but the old *p is quite
> counter-intuitive IMHO.

I disagree. I think the alpha has it right - it says that if you care
about ordering, you have to tell so. No ifs, no buts, no special cases
("Oh, dependent loads are special").

Of course, I used to absolutely _hate_ the x86 ordering rules just because
they are so ad-hoc. But I'm getting more and more used to them, and the
Intel "write ordered with store-buffer forwarding" model ends up being
really nice for locking ;)

Linus

2001-10-13 02:05:01

by Linus Torvalds

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion


On Fri, 12 Oct 2001, Davide Libenzi wrote:
>
> The problem is that even if cpu1 schedule the load of p before the
> load of *p and cpu2 does a = 1; wmb(); p = &a; , it could happen that
> even if from cpu2 the invalidation stream exit in order, cpu1 could see
> the value of p before the value of *p due a reordering done by the
> cache controller delivering the stream to cpu1.

Umm - if that happens, your cache controller isn't honouring the wmb(),
and you have problems quite regardless of any load ordering on _any_ CPU.

Ehh?

Linus

2001-10-13 02:26:03

by Davide Libenzi

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

On Fri, 12 Oct 2001, Linus Torvalds wrote:

>
> On Fri, 12 Oct 2001, Davide Libenzi wrote:
> >
> > The problem is that even if cpu1 schedule the load of p before the
> > load of *p and cpu2 does a = 1; wmb(); p = &a; , it could happen that
> > even if from cpu2 the invalidation stream exit in order, cpu1 could see
> > the value of p before the value of *p due a reordering done by the
> > cache controller delivering the stream to cpu1.
>
> Umm - if that happens, your cache controller isn't honouring the wmb(),
> and you have problems quite regardless of any load ordering on _any_ CPU.
>
> Ehh?

I'm searching the hp-parisc doc about it, but it seems that even Paul
McKenney pointed this out.
Suppose that p and *p are on two different cache partitions, and the
invalidation order that comes from the wmb() cpu is 1) *p 2) p.
Suppose that the partition where *p lies is damn busy and the one where
p lies is free.
The reader cpu could pick up the value of p before the value of *p,
thereby reading the old value of a.
The barrier on the reader side should establish a checkpoint that
enforces the commit of *p before p.





- Davide


2001-10-13 02:40:43

by Davide Libenzi

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

On Fri, 12 Oct 2001, Davide Libenzi wrote:

> On Fri, 12 Oct 2001, Linus Torvalds wrote:
>
> >
> > On Fri, 12 Oct 2001, Davide Libenzi wrote:
> > >
> > > The problem is that even if cpu1 schedule the load of p before the
> > > load of *p and cpu2 does a = 1; wmb(); p = &a; , it could happen that
> > > even if from cpu2 the invalidation stream exit in order, cpu1 could see
> > > the value of p before the value of *p due a reordering done by the
> > > cache controller delivering the stream to cpu1.
> >
> > Umm - if that happens, your cache controller isn't honouring the wmb(),
> > and you have problems quite regardless of any load ordering on _any_ CPU.
> >
> > Ehh?
>
> I'm searching the hp-parisc doc about it but it seems that even Paul
> McKenney pointed out this.

Oops, alpha.



- Davide


2001-10-13 03:30:36

by Linus Torvalds

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion


On Fri, 12 Oct 2001, Davide Libenzi wrote:
>
> Suppose that p and *p are on two different cache partitions and the
> invalidation order that comes from the wmb() cpu is 1) *p 2) p
> Suppose that the partition when *p lies is damn busy and the one where
> p lies is free.
> The reader cpu could pickup the value of p before the value of *p by
> reading the old value of a

Ahh.. I misunderstood. You are arguing for the rmb() even if the CPU
doesn't speculate addresses for loads. Yes, I agree.

Linus

2001-10-13 04:51:00

by Paul Mackerras

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

Linus Torvalds writes:

> On Fri, 12 Oct 2001, Davide Libenzi wrote:
> >
> > The problem is that even if cpu1 schedule the load of p before the
> > load of *p and cpu2 does a = 1; wmb(); p = &a; , it could happen that
> > even if from cpu2 the invalidation stream exit in order, cpu1 could see
> > the value of p before the value of *p due a reordering done by the
> > cache controller delivering the stream to cpu1.
>
> Umm - if that happens, your cache controller isn't honouring the wmb(),
> and you have problems quite regardless of any load ordering on _any_ CPU.

Well yes. But this is what happens on alpha apparently.

On alpha, it seems that the wmb only has an effect on the cache of the
processor doing the writes, it doesn't affect any other caches. The
wmb ensures that the invalidates from the two writes go out onto the
bus in the right order, but the wmb itself doesn't go on the bus.
Thus the invalidates can get processed in the reverse order by a
receiving cache. I presume that an rmb() on alpha must wait for all
outstanding invalidates to be processed by the cache. But I'm not an
expert on the alpha by any means.

Paul.

2001-10-13 07:42:51

by Rusty Russell

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists with insertion

In message <[email protected]> you
write:
> And hey, if you want to, feel free to create the regular
>
> #define read_barrier() rmb()
> #define write_barrier() wmb()
> #define memory_barrier() mb()

I agree... read_barrier_depends() then?

Rusty.
--
Premature optmztion is rt of all evl. --DK

2001-10-13 13:48:31

by Alan

Subject: Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of lists

> It is discussed in the multi-processor management section, under "memory
> ordering", and it does say that "reads can be carried out speculatively
> and in any order".
>
> HOWEVER, it does have what Intel calls "processor order", and it claims
> that "writes by a single processor are observed in the same order by all
> processors."

The other thing on the intel side is that you have to read the errata
documentation. There are an interesting collection of misordering errata.

> (But Intel has redefined the memory ordering so many times that they might
> redefine it in the future too and say that dependent loads are ok. I
> suspect most of the definitions are of the type "Oh, it used to be ok in
> the implementation even though it wasn't defined, and it turns out that
> Windows doesn't work if we change it, so we'll define darkness to be the
> new standard"..)

It's notable that the folks who did looser-ordering x86 clones had MTRRs
to enable the performance boost.

Alan

2001-10-13 14:11:16

by Paul E. McKenney

Subject: Re: Re: RFC: patch to allow lock-free traversal of lists

>> To set the record straight, the PPC architecture spec says that
>> implementations *can* speculate reads, but they have to make it look
>> as if dependent loads have a read barrier between them.
>>
>> And it isn't speculated reads that are the problem in the alpha case,
>> it's the fact that the cache can reorder invalidations that are
>> received from the bus. That's why you can read the new value of p but
>> the old value of *p on one processor after another processor has just
>> done something like a = 1; wmb(); p = &a.
>
>I think that necessary condition to have a reordering of bus sent
>invalidations is to have a partitioned cache architecture.
>I don't see any valid reason for a cache controller of a linear cache
>architecture to reorder an invalidation stream coming from a single cpu.
>If the invalidation sequence for cpu1 ( in a linear cache architecture )
>is 1a, 1b, 1c, ... this case can happen :
>
>1a, 2a, 1b, 2b, 2c, 1c, ...
>| | |
>.........................
>
>but not this :
>
>1a, 2a, 2b, 1c, 2c, 1b, ...
>| | |
>...............><........

A split cache is certainly one way to get the updates reordered. It is
not the only one. As Linus pointed out earlier, value speculation can
have the same effect (though I do not believe that value speculation is
in the cards for the i386 architecture).

>> My impression from what Paul McKenney was saying was that on most
>> modern architectures other than alpha, dependent loads act as if they
>> have a read barrier between them. What we need to know is which
>> architectures specify that behaviour in their architecture spec, as
>> against those which do that today but which might not do it tomorrow.
>
>Uhmm, an architecture that with a = *p; schedule the load of *p before
>the load of p sounds screwy to me.

Agreed, but value speculation could do just that. However, value
speculation would be handled correctly by rmbdd() and by wmbdd(),
though the latter case relies on precise interrupts.

>The problem is that even if cpu1 schedule the load of p before the
>load of *p and cpu2 does a = 1; wmb(); p = &a; , it could happen that
>even if from cpu2 the invalidation stream exit in order, cpu1 could see
>the value of p before the value of *p due a reordering done by the
>cache controller delivering the stream to cpu1.

Yep! I did not create Alpha, I am just trying to help everyone understand
how to live with it. ;-)

Thanx, Paul

2001-10-13 16:28:18

by Paul E. McKenney

Subject: Re: Re: RFC: patch to allow lock-free traversal of lists with insertion

>In message <[email protected]> you
> write:
>> And hey, if you want to, feel free to create the regular
>>
>> #define read_barrier() rmb()
>> #define write_barrier() wmb()
>> #define memory_barrier() mb()
>
>I agree... read_barrier_depends() then?
>
>Rusty.
>--
>Premature optmztion is rt of all evl. --DK

OK, here is an RFC patch with the read_barrier_depends(). (I know that
the indentation is messed up, will fix when I add the read_barrier()
and friends).

Thoughts? Especially from people familiar with MIPS and PA-RISC?

Thanx, Paul

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-alpha/system.h linux-2.4.10.read_barrier_depends/include/asm-alpha/system.h
--- linux-2.4.10/include/asm-alpha/system.h Sun Aug 12 10:38:47 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-alpha/system.h Sat Oct 13 08:40:34 2001
@@ -148,16 +148,21 @@
#define rmb() \
__asm__ __volatile__("mb": : :"memory")

+#define read_barrier_depends() \
+__asm__ __volatile__("mb": : :"memory")
+
#define wmb() \
__asm__ __volatile__("wmb": : :"memory")

#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#else
#define smp_mb() barrier()
#define smp_rmb() barrier()
+#define smp_read_barrier_depends() barrier()
#define smp_wmb() barrier()
#endif

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-arm/system.h linux-2.4.10.read_barrier_depends/include/asm-arm/system.h
--- linux-2.4.10/include/asm-arm/system.h Mon Nov 27 17:07:59 2000
+++ linux-2.4.10.read_barrier_depends/include/asm-arm/system.h Sat Oct 13 08:40:40 2001
@@ -38,6 +38,7 @@

#define mb() __asm__ __volatile__ ("" : : : "memory")
#define rmb() mb()
+#define read_barrier_depends() do { } while(0)
#define wmb() mb()
#define nop() __asm__ __volatile__("mov\tr0,r0\t@ nop\n\t");

@@ -67,12 +68,14 @@

#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()

#else

#define smp_mb() barrier()
#define smp_rmb() barrier()
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() barrier()

#define cli() __cli()
diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-cris/system.h linux-2.4.10.read_barrier_depends/include/asm-cris/system.h
--- linux-2.4.10/include/asm-cris/system.h Tue May 1 16:05:00 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-cris/system.h Sat Oct 13 08:40:48 2001
@@ -143,15 +143,18 @@

#define mb() __asm__ __volatile__ ("" : : : "memory")
#define rmb() mb()
+#define read_barrier_depends() do { } while(0)
#define wmb() mb()

#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#else
#define smp_mb() barrier()
#define smp_rmb() barrier()
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() barrier()
#endif

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-i386/system.h linux-2.4.10.read_barrier_depends/include/asm-i386/system.h
--- linux-2.4.10/include/asm-i386/system.h Sun Sep 23 10:31:01 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-i386/system.h Sat Oct 13 08:40:54 2001
@@ -284,15 +284,18 @@
*/
#define mb() __asm__ __volatile__ ("lock; addl $0,0(%%esp)": : :"memory")
#define rmb() mb()
+#define read_barrier_depends() do { } while(0)
#define wmb() __asm__ __volatile__ ("": : :"memory")

#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#else
#define smp_mb() barrier()
#define smp_rmb() barrier()
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() barrier()
#endif

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-ia64/system.h linux-2.4.10.read_barrier_depends/include/asm-ia64/system.h
--- linux-2.4.10/include/asm-ia64/system.h Tue Jul 31 10:30:09 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-ia64/system.h Sat Oct 13 08:55:13 2001
@@ -85,6 +85,9 @@
* stores and that all following stores will be
* visible only after all previous stores.
* rmb(): Like wmb(), but for reads.
+ * read_barrier_depends(): Like rmb(), but only for pairs
+ * of loads where the second load depends on the
+ * value loaded by the first.
* mb(): wmb()/rmb() combo, i.e., all previous memory
* accesses are visible before all subsequent
* accesses and vice versa. This is also known as
@@ -98,15 +101,18 @@
*/
#define mb() __asm__ __volatile__ ("mf" ::: "memory")
#define rmb() mb()
+#define read_barrier_depends() do { } while(0)
#define wmb() mb()

#ifdef CONFIG_SMP
# define smp_mb() mb()
# define smp_rmb() rmb()
+# define smp_read_barrier_depends() read_barrier_depends()
# define smp_wmb() wmb()
#else
# define smp_mb() barrier()
# define smp_rmb() barrier()
+# define smp_read_barrier_depends() do { } while(0)
# define smp_wmb() barrier()
#endif

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-m68k/system.h linux-2.4.10.read_barrier_depends/include/asm-m68k/system.h
--- linux-2.4.10/include/asm-m68k/system.h Mon Jun 11 19:15:27 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-m68k/system.h Sat Oct 13 08:41:08 2001
@@ -80,12 +80,14 @@
#define nop() do { asm volatile ("nop"); barrier(); } while (0)
#define mb() barrier()
#define rmb() barrier()
+#define read_barrier_depends() do { } while(0)
#define wmb() barrier()
#define set_mb(var, value) do { xchg(&var, value); } while (0)
#define set_wmb(var, value) do { var = value; wmb(); } while (0)

#define smp_mb() barrier()
#define smp_rmb() barrier()
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() barrier()


diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-mips/system.h linux-2.4.10.read_barrier_depends/include/asm-mips/system.h
--- linux-2.4.10/include/asm-mips/system.h Sun Sep 9 10:43:01 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-mips/system.h Sat Oct 13 08:41:14 2001
@@ -150,6 +150,7 @@

#include <asm/wbflush.h>
#define rmb() do { } while(0)
+#define read_barrier_depends() do { } while(0)
#define wmb() wbflush()
#define mb() wbflush()

@@ -166,6 +167,7 @@
: /* no input */ \
: "memory")
#define rmb() mb()
+#define read_barrier_depends() do { } while(0)
#define wmb() mb()

#endif /* CONFIG_CPU_HAS_WB */
@@ -173,10 +175,12 @@
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#else
#define smp_mb() barrier()
#define smp_rmb() barrier()
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() barrier()
#endif

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-mips64/system.h linux-2.4.10.read_barrier_depends/include/asm-mips64/system.h
--- linux-2.4.10/include/asm-mips64/system.h Wed Jul 4 11:50:39 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-mips64/system.h Sat Oct 13 08:41:22 2001
@@ -147,15 +147,18 @@
: /* no input */ \
: "memory")
#define rmb() mb()
+#define read_barrier_depends() do { } while(0)
#define wmb() mb()

#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#else
#define smp_mb() barrier()
#define smp_rmb() barrier()
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() barrier()
#endif

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-parisc/system.h linux-2.4.10.read_barrier_depends/include/asm-parisc/system.h
--- linux-2.4.10/include/asm-parisc/system.h Wed Dec 6 11:46:39 2000
+++ linux-2.4.10.read_barrier_depends/include/asm-parisc/system.h Sat Oct 13 08:41:28 2001
@@ -50,6 +50,7 @@
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() wmb()
#else
/* This is simply the barrier() macro from linux/kernel.h but when serial.c
@@ -58,6 +59,7 @@
*/
#define smp_mb() __asm__ __volatile__("":::"memory");
#define smp_rmb() __asm__ __volatile__("":::"memory");
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() __asm__ __volatile__("":::"memory");
#endif

@@ -122,6 +124,7 @@

#define mb() __asm__ __volatile__ ("sync" : : :"memory")
#define wmb() mb()
+#define read_barrier_depends() do { } while(0)

extern unsigned long __xchg(unsigned long, unsigned long *, int);

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-ppc/system.h linux-2.4.10.read_barrier_depends/include/asm-ppc/system.h
--- linux-2.4.10/include/asm-ppc/system.h Tue Aug 28 06:58:33 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-ppc/system.h Sat Oct 13 08:41:56 2001
@@ -24,6 +24,8 @@
*
* mb() prevents loads and stores being reordered across this point.
* rmb() prevents loads being reordered across this point.
+ * read_barrier_depends() prevents data-dependent loads being reordered
+ * across this point (nop on PPC).
* wmb() prevents stores being reordered across this point.
*
* We can use the eieio instruction for wmb, but since it doesn't
@@ -32,6 +34,7 @@
*/
#define mb() __asm__ __volatile__ ("sync" : : : "memory")
#define rmb() __asm__ __volatile__ ("sync" : : : "memory")
+#define read_barrier_depends() do { } while(0)
#define wmb() __asm__ __volatile__ ("eieio" : : : "memory")

#define set_mb(var, value) do { var = value; mb(); } while (0)
@@ -40,10 +43,12 @@
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#else
#define smp_mb() __asm__ __volatile__("": : :"memory")
#define smp_rmb() __asm__ __volatile__("": : :"memory")
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() __asm__ __volatile__("": : :"memory")
#endif /* CONFIG_SMP */

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-s390/system.h linux-2.4.10.read_barrier_depends/include/asm-s390/system.h
--- linux-2.4.10/include/asm-s390/system.h Wed Jul 25 14:12:02 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-s390/system.h Sat Oct 13 08:42:08 2001
@@ -117,9 +117,11 @@
# define SYNC_OTHER_CORES(x) eieio()
#define mb() eieio()
#define rmb() eieio()
+#define read_barrier_depends() do { } while(0)
#define wmb() eieio()
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#define smp_mb__before_clear_bit() smp_mb()
#define smp_mb__after_clear_bit() smp_mb()
diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-s390x/system.h linux-2.4.10.read_barrier_depends/include/asm-s390x/system.h
--- linux-2.4.10/include/asm-s390x/system.h Wed Jul 25 14:12:03 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-s390x/system.h Sat Oct 13 08:42:14 2001
@@ -130,9 +130,11 @@
# define SYNC_OTHER_CORES(x) eieio()
#define mb() eieio()
#define rmb() eieio()
+#define read_barrier_depends() do { } while(0)
#define wmb() eieio()
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#define smp_mb__before_clear_bit() smp_mb()
#define smp_mb__after_clear_bit() smp_mb()
diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-sh/system.h linux-2.4.10.read_barrier_depends/include/asm-sh/system.h
--- linux-2.4.10/include/asm-sh/system.h Sat Sep 8 12:29:09 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-sh/system.h Sat Oct 13 08:43:04 2001
@@ -88,15 +88,18 @@

#define mb() __asm__ __volatile__ ("": : :"memory")
#define rmb() mb()
+#define read_barrier_depends() do { } while(0)
#define wmb() __asm__ __volatile__ ("": : :"memory")

#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#else
#define smp_mb() barrier()
#define smp_rmb() barrier()
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() barrier()
#endif

diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-sparc/system.h linux-2.4.10.read_barrier_depends/include/asm-sparc/system.h
--- linux-2.4.10/include/asm-sparc/system.h Tue Oct 3 09:24:41 2000
+++ linux-2.4.10.read_barrier_depends/include/asm-sparc/system.h Sat Oct 13 08:43:09 2001
@@ -277,11 +277,13 @@
/* XXX Change this if we ever use a PSO mode kernel. */
#define mb() __asm__ __volatile__ ("" : : : "memory")
#define rmb() mb()
+#define read_barrier_depends() do { } while(0)
#define wmb() mb()
#define set_mb(__var, __value) do { __var = __value; mb(); } while(0)
#define set_wmb(__var, __value) set_mb(__var, __value)
#define smp_mb() __asm__ __volatile__("":::"memory");
#define smp_rmb() __asm__ __volatile__("":::"memory");
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() __asm__ __volatile__("":::"memory");

#define nop() __asm__ __volatile__ ("nop");
diff -urN -X /home/mckenney/dontdiff linux-2.4.10/include/asm-sparc64/system.h linux-2.4.10.read_barrier_depends/include/asm-sparc64/system.h
--- linux-2.4.10/include/asm-sparc64/system.h Fri Sep 7 11:01:20 2001
+++ linux-2.4.10.read_barrier_depends/include/asm-sparc64/system.h Sat Oct 13 08:43:14 2001
@@ -99,6 +99,7 @@
#define mb() \
membar("#LoadLoad | #LoadStore | #StoreStore | #StoreLoad");
#define rmb() membar("#LoadLoad")
+#define read_barrier_depends() do { } while(0)
#define wmb() membar("#StoreStore")
#define set_mb(__var, __value) \
do { __var = __value; membar("#StoreLoad | #StoreStore"); } while(0)
@@ -108,10 +109,12 @@
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
+#define smp_read_barrier_depends() read_barrier_depends()
#define smp_wmb() wmb()
#else
#define smp_mb() __asm__ __volatile__("":::"memory");
#define smp_rmb() __asm__ __volatile__("":::"memory");
+#define smp_read_barrier_depends() do { } while(0)
#define smp_wmb() __asm__ __volatile__("":::"memory");
#endif

2001-10-13 21:27:38

by Rusty Russell

[permalink] [raw]
Subject: Re: RFC: patch to allow lock-free traversal of lists with insertion

In message <[email protected]> you write:
> OK, here is an RFC patch with the read_barrier_depends(). (I know that
> the indentation is messed up, will fix when I add the read_barrier()
> and friends).

Yep, this is a 2.5 thing: we should probably change the
barriers so read_barrier et al. only have effect on SMP, and then
have io_read_barrier for IO (these can be usefully differentiated on
PPC IIUC).

The stillborn smp_mb() etc. can then be abandoned.

Rusty.
--
Premature optmztion is rt of all evl. --DK