LinuxLists.cc - single copy atomicity for double load/stores on 32-bit systems

2019-05-30 18:24:37

Subject: single copy atomicity for double load/stores on 32-bit systems

Hi Peter,

Had an interesting lunchtime discussion with our hardware architects pertinent to the
"minimal guarantees expected of a CPU" section of memory-barriers.txt

| (*) These guarantees apply only to properly aligned and sized scalar
| variables. "Properly sized" currently means variables that are
| the same size as "char", "short", "int" and "long". "Properly
| aligned" means the natural alignment, thus no constraints for
| "char", two-byte alignment for "short", four-byte alignment for
| "int", and either four-byte or eight-byte alignment for "long",
| on 32-bit and 64-bit systems, respectively.

I'm not sure how to interpret "natural alignment" for the case of double
load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)

I presume (and this is my question) that the LKMM doesn't expect such 8-byte
load/stores to be atomic unless they are 8-byte aligned.

ARMv7 arch ref manual seems to confirm this. Quoting

| LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
| VSTM, and VSTR instructions are executed as a sequence of word-aligned word
| accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
| subsequence of two or more word accesses from the sequence might not exhibit
| single-copy atomicity

While it seems reasonable from a hardware pov to not implement such atomicity by
default, it seems there's an additional burden on application writers. They could
be happily using a lockless algorithm with just a shared flag between 2 threads
w/o need for any explicit synchronization. But an upgrade to a new compiler which
aggressively "packs" structs, rendering long long 32-bit aligned (vs. 64-bit before),
causes the code to suddenly stop working. Is the onus on them to declare such
memory as C11 atomic or some such?
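
For illustration, a minimal sketch of the "declare it as C11 atomic" option
follows. The names (struct stats, shared_stats, stats_set, stats_get) are made
up for this example and do not come from any real code. With _Atomic the
compiler, not the programmer, picks the alignment and the access sequence; on
some 32-bit targets that may well be a library call or a lock rather than a
plain double load/store.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared record: a 32-bit flag next to a 64-bit counter. */
struct stats {
	uint32_t flag;
	_Atomic uint64_t counter;	/* compiler chooses alignment and access code */
};

static struct stats shared_stats;

/* Store the whole 64-bit value as one indivisible update (no ordering implied). */
static void stats_set(uint64_t v)
{
	atomic_store_explicit(&shared_stats.counter, v, memory_order_relaxed);
}

/* Load the whole 64-bit value; a reader never sees half old, half new. */
static uint64_t stats_get(void)
{
	return atomic_load_explicit(&shared_stats.counter, memory_order_relaxed);
}
```

Whether this is lock-free can be checked with the standard ATOMIC_LLONG_LOCK_FREE
macro; the sketch does not assume any particular answer.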

Thx,
-Vineet

2019-05-30 18:56:48

by Paul E. McKenney

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
> Hi Peter,
>
> Had an interesting lunch time discussion with our hardware architects pertinent to
> "minimal guarantees expected of a CPU" section of memory-barriers.txt
>
>
> | (*) These guarantees apply only to properly aligned and sized scalar
> | variables. "Properly sized" currently means variables that are
> | the same size as "char", "short", "int" and "long". "Properly
> | aligned" means the natural alignment, thus no constraints for
> | "char", two-byte alignment for "short", four-byte alignment for
> | "int", and either four-byte or eight-byte alignment for "long",
> | on 32-bit and 64-bit systems, respectively.
>
>
> I'm not sure how to interpret "natural alignment" for the case of double
> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
>
> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> be atomic unless 8-byte aligned

I would not expect 8-byte accesses to be atomic on 32-bit systems unless
some special instruction was in use. But that usually means special
intrinsics or assembly code.

> ARMv7 arch ref manual seems to confirm this. Quoting
>
> | LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
> | VSTM, and VSTR instructions are executed as a sequence of word-aligned word
> | accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
> | subsequence of two or more word accesses from the sequence might not exhibit
> | single-copy atomicity
>
> While it seems reasonable from hardware pov to not implement such atomicity by
> default it seems there's an additional burden on application writers. They could
> be happily using a lockless algorithm with just a shared flag between 2 threads
> w/o need for any explicit synchronization. But upgrade to a new compiler which
> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> causing the code to suddenly stop working. Is the onus on them to declare such
> memory as c11 atomic or some such.

There are also GCC extensions that allow specifying the alignment of
structure fields.
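
As a rough sketch of what such an extension looks like with GCC or Clang, the
two hypothetical structs below (the names are invented for this example) have
the same members, but one is packed and the other pins the 64-bit field with
the aligned attribute. The static assertions only document the layout these
compilers are expected to produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packed: the 64-bit member lands right after the flag, at offset 4. */
struct shared_packed {
	uint32_t flag;
	uint64_t seq;
} __attribute__((packed));

/* Explicit field alignment: the 64-bit member is pushed out to offset 8. */
struct shared_aligned {
	uint32_t flag;
	uint64_t seq __attribute__((aligned(8)));
};

static_assert(offsetof(struct shared_packed, seq) == 4,
	      "packed layout leaves seq only 4-byte aligned");
static_assert(offsetof(struct shared_aligned, seq) % 8 == 0,
	      "aligned(8) keeps seq naturally aligned within the struct");
```

Note that the attribute fixes the offset inside the struct and raises the
struct's own alignment requirement; it does not help if the struct is later
placed through a cast onto arbitrary memory.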

Thanx, Paul

2019-05-30 19:18:09

by Vineet Gupta

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On 5/30/19 11:55 AM, Paul E. McKenney wrote:
>
>> I'm not sure how to interpret "natural alignment" for the case of double
>> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
>> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
>>
>> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
>> be atomic unless 8-byte aligned
> I would not expect 8-byte accesses to be atomic on 32-bit systems unless
> some special instruction was in use. But that usually means special
> intrinsics or assembly code.

Thx for confirming.

In cases where we *do* expect the atomicity, it seems there's some existing type
checking but isn't water tight.
e.g.

#define __smp_load_acquire(p)                                          \
({                                                                     \
        typeof(*p) ___p1 = READ_ONCE(*p);                              \
        compiletime_assert_atomic_type(*p);                            \
        __smp_mb();                                                    \
        ___p1;                                                         \
})

#define compiletime_assert_atomic_type(t)                              \
        compiletime_assert(__native_word(t),                           \
                "Need native word sized stores/loads for atomicity.")

#define __native_word(t) \
        (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
         sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))

So it won't catch the usage of a 4-byte aligned long long, which gcc compiles to a
single double-load instruction.
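
To make the gap concrete, here is a user-space sketch (not kernel code) of a
size check in the style of __native_word plus a hypothetical alignment check on
the type. NATIVE_WORD, NATURALLY_ALIGNED_TYPE, ATOMIC_CANDIDATE and the u64_a4
typedef are all names invented for this example; unlike the kernel macro, these
take a type name rather than an expression. The check only sees the alignment
declared for the type, so it cannot catch an object that merely ends up at a
misaligned address at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Same size test as the kernel's __native_word, but taking a type name. */
#define NATIVE_WORD(type) \
	(sizeof(type) == sizeof(char) || sizeof(type) == sizeof(short) || \
	 sizeof(type) == sizeof(int) || sizeof(type) == sizeof(long))

/* Hypothetical extra test: the type's declared alignment equals its size. */
#define NATURALLY_ALIGNED_TYPE(type) (_Alignof(type) == sizeof(type))

#define ATOMIC_CANDIDATE(type) \
	(NATIVE_WORD(type) && NATURALLY_ALIGNED_TYPE(type))

/* A 64-bit type whose alignment has been lowered to 4 bytes (GCC/Clang). */
typedef uint64_t __attribute__((aligned(4))) u64_a4;
```

On an LP64 system u64_a4 passes the size test because it is as large as long,
and only the alignment test rejects it.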

Thx,
-Vineet

2019-05-31 08:23:12

by Peter Zijlstra

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
> Hi Peter,
>
> Had an interesting lunch time discussion with our hardware architects pertinent to
> "minimal guarantees expected of a CPU" section of memory-barriers.txt
>
>
> | (*) These guarantees apply only to properly aligned and sized scalar
> | variables. "Properly sized" currently means variables that are
> | the same size as "char", "short", "int" and "long". "Properly
> | aligned" means the natural alignment, thus no constraints for
> | "char", two-byte alignment for "short", four-byte alignment for
> | "int", and either four-byte or eight-byte alignment for "long",
> | on 32-bit and 64-bit systems, respectively.
>
>
> I'm not sure how to interpret "natural alignment" for the case of double
> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)

Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))

For any u64 type, that would give 8-byte alignment; the problem
otherwise being that your data could span two lines/pages etc.
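
The predicate above can be exercised in a small user-space sketch. The helper
name is_naturally_aligned and the buffer demo_buf are assumptions made for this
example only; the size is passed explicitly so that no misaligned typed pointer
ever has to be formed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when the address is a multiple of the access size. */
static inline bool is_naturally_aligned(const void *ptr, size_t size)
{
	return ((uintptr_t)ptr % size) == 0;
}

/* An 8-byte aligned scratch buffer to pick test addresses from. */
static _Alignas(8) unsigned char demo_buf[16];
```

With this, demo_buf + 4 is naturally aligned for a 4-byte access but not for an
8-byte one, which is exactly the 8n+4 placement discussed in this thread.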

> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> be atomic unless 8-byte aligned
>
> ARMv7 arch ref manual seems to confirm this. Quoting
>
> | LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
> | VSTM, and VSTR instructions are executed as a sequence of word-aligned word
> | accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
> | subsequence of two or more word accesses from the sequence might not exhibit
> | single-copy atomicity
>
> While it seems reasonable from hardware pov to not implement such atomicity by
> default it seems there's an additional burden on application writers. They could
> be happily using a lockless algorithm with just a shared flag between 2 threads
> w/o need for any explicit synchronization.

If you're that careless with lockless code, you deserve all the pain you
get.

> But upgrade to a new compiler which
> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> causing the code to suddenly stop working. Is the onus on them to declare such
> memory as c11 atomic or some such.

When a programmer wants guarantees they already need to know wth they're
doing.

And I'll stand by my earlier conviction that any architecture that has a
native u64 (be it a 64bit arch or a 32bit with double-width
instructions) but has an ABI that allows u32 alignment on them is daft.

2019-05-31 08:24:49

by Peter Zijlstra

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On Thu, May 30, 2019 at 07:16:36PM +0000, Vineet Gupta wrote:
> On 5/30/19 11:55 AM, Paul E. McKenney wrote:
> >
> >> I'm not sure how to interpret "natural alignment" for the case of double
> >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> >>
> >> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> >> be atomic unless 8-byte aligned
> > I would not expect 8-byte accesses to be atomic on 32-bit systems unless
> > some special instruction was in use. But that usually means special
> > intrinsics or assembly code.
>
> Thx for confirming.
>
> In cases where we *do* expect the atomicity, it seems there's some existing type
> checking but isn't water tight.
> e.g.
>
> #define __smp_load_acquire(p) \
> ({ \
> typeof(*p) ___p1 = READ_ONCE(*p); \
> compiletime_assert_atomic_type(*p); \
> __smp_mb(); \
> ___p1; \
> })
>
> #define compiletime_assert_atomic_type(t) \
> compiletime_assert(__native_word(t), \
> "Need native word sized stores/loads for atomicity.")
>
> #define __native_word(t) \
> (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
> sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
>
>
> So it won't catch the usage of 4 byte aligned long long which gcc targets to
> single double load instruction.

Yes, we didn't do those because that would result in runtime overhead.

We assume natural alignment for any type the hardware can do.

2019-05-31 08:27:55

by Peter Zijlstra

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On Thu, May 30, 2019 at 11:53:58AM -0700, Paul E. McKenney wrote:
> On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
> > Hi Peter,
> >
> > Had an interesting lunch time discussion with our hardware architects pertinent to
> > "minimal guarantees expected of a CPU" section of memory-barriers.txt
> >
> >
> > | (*) These guarantees apply only to properly aligned and sized scalar
> > | variables. "Properly sized" currently means variables that are
> > | the same size as "char", "short", "int" and "long". "Properly
> > | aligned" means the natural alignment, thus no constraints for
> > | "char", two-byte alignment for "short", four-byte alignment for
> > | "int", and either four-byte or eight-byte alignment for "long",
> > | on 32-bit and 64-bit systems, respectively.
> >
> >
> > I'm not sure how to interpret "natural alignment" for the case of double
> > load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> > alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> >
> > I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> > be atomic unless 8-byte aligned
>
> I would not expect 8-byte accesses to be atomic on 32-bit systems unless
> some special instruction was in use. But that usually means special
> intrinsics or assembly code.

If the GCC of said platform defaults to the double-word instructions for
long long, then I would very much expect natural alignment on it too.

If the feature is only available through inline asm or intrinsics, then
we can be a little more lenient perhaps.

2019-05-31 09:43:48

by David Laight

Subject: RE: single copy atomicity for double load/stores on 32-bit systems

From: Vineet Gupta
> Sent: 30 May 2019 19:23
...
> While it seems reasonable from hardware pov to not implement such atomicity by
> default it seems there's an additional burden on application writers. They could
> be happily using a lockless algorithm with just a shared flag between 2 threads
> w/o need for any explicit synchronization. But upgrade to a new compiler which
> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> causing the code to suddenly stop working. Is the onus on them to declare such
> memory as c11 atomic or some such.

A 'new' compiler can't suddenly change the alignment rules for structure elements.
The alignment rules will be part of the ABI.

More likely is that the structure itself is unexpectedly allocated on
an 8n+4 boundary due to code changes elsewhere.

It is also worth noting that for complete portability only writes to
'full words' can be assumed atomic.
Some old Alphas did RMW cycles for byte writes.
(Although I suspect Linux doesn't support those any more.)

Even x86 can catch you out.
The bit operations will do wider RMW cycles than you expect.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2019-05-31 11:47:11

by Paul E. McKenney

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On Fri, May 31, 2019 at 09:41:17AM +0000, David Laight wrote:
> From: Vineet Gupta
> > Sent: 30 May 2019 19:23
> ...
> > While it seems reasonable from hardware pov to not implement such atomicity by
> > default it seems there's an additional burden on application writers. They could
> > be happily using a lockless algorithm with just a shared flag between 2 threads
> > w/o need for any explicit synchronization. But upgrade to a new compiler which
> > aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> > causing the code to suddenly stop working. Is the onus on them to declare such
> > memory as c11 atomic or some such.
>
> A 'new' compiler can't suddenly change the alignment rules for structure elements.
> The alignment rules will be part of the ABI.
>
> More likely is that the structure itself is unexpectedly allocated on
> an 8n+4 boundary due to code changes elsewhere.
>
> It is also worth noting that for complete portability only writes to
> 'full words' can be assumed atomic.
> Some old Alphas did RMW cycles for byte writes.
> (Although I suspect Linux doesn't support those any more.)

Any C11 or later compiler needs to generate the atomic RMW cycles if
needed in cases like this. To see this, consider the following code:

	spinlock_t l1;
	spinlock_t l2;
	struct foo {
		char c1; // Protected by l1
		char c2; // Protected by l2
	};

	...

	spin_lock(&l1);
	fp->c1 = 42;
	do_something_protected_by_l1();
	spin_unlock(&l1);

	...

	spin_lock(&l2);
	fp->c2 = 206;
	do_something_protected_by_l2();
	spin_unlock(&l2);

A compiler that failed to generate atomic RMW code sequences for those
stores to ->c1 and ->c2 would be generating a data race in the object
code when there was no such race in the source code. Kudos to Hans Boehm
for having browbeat compiler writers into accepting this restriction,
which was not particularly popular -- they wanted to be able to use
vector units and such. ;-)

> Even x86 can catch you out.
> The bit operations will do wider RMW cycles than you expect.

But does the compiler automatically generate these?

Thanx, Paul

> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2019-06-03 18:10:02

by Vineet Gupta

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On 5/31/19 1:21 AM, Peter Zijlstra wrote:
>> I'm not sure how to interpret "natural alignment" for the case of double
>> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
>> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
>
> For any u64 type, that would give 8 byte alignment. the problem
> otherwise being that your data spans two lines/pages etc..

Sure, but as Paul said, if the software doesn't expect them to be atomic by
default, they could span 2 hardware lines to keep the implementation simpler/sane.

2019-06-03 18:45:05

by Vineet Gupta

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> And I'll stand by my earlier conviction that any architecture that has a
> native u64 (be it a 64bit arch or a 32bit with double-width
> instructions) but has an ABI that allows u32 alignment on them is daft.

Why? For 64-bit data on 32-bit systems, hardware doesn't claim to provide any
single-copy atomicity for such data and software doesn't expect it either.

2019-06-03 18:46:28

by Vineet Gupta

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On 5/31/19 2:41 AM, David Laight wrote:
>> While it seems reasonable from hardware pov to not implement such atomicity by
>> default it seems there's an additional burden on application writers. They could
>> be happily using a lockless algorithm with just a shared flag between 2 threads
>> w/o need for any explicit synchronization. But upgrade to a new compiler which
>> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
>> causing the code to suddenly stop working. Is the onus on them to declare such
>> memory as c11 atomic or some such.
> A 'new' compiler can't suddenly change the alignment rules for structure elements.
> The alignment rules will be part of the ABI.
>
> More likely is that the structure itself is unexpectedly allocated on
> an 8n+4 boundary due to code changes elsewhere.

Indeed, that's what I meant: the layout changed, as is typical of a new compiler.

2019-06-03 20:15:02

by Paul E. McKenney

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
> On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> >> I'm not sure how to interpret "natural alignment" for the case of double
> >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> >
> > For any u64 type, that would give 8 byte alignment. the problem
> > otherwise being that your data spans two lines/pages etc..
>
> Sure, but as Paul said, if the software doesn't expect them to be atomic by
> default, they could span 2 hardware lines to keep the implementation simpler/sane.

I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
but it would be quite a surprise on 64-bit systems.

Thanx, Paul

2019-06-03 22:01:02

by Vineet Gupta

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On 6/3/19 1:13 PM, Paul E. McKenney wrote:
> On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
>> On 5/31/19 1:21 AM, Peter Zijlstra wrote:
>>>> I'm not sure how to interpret "natural alignment" for the case of double
>>>> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
>>>> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
>>> Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
>>>
>>> For any u64 type, that would give 8 byte alignment. the problem
>>> otherwise being that your data spans two lines/pages etc..
>> Sure, but as Paul said, if the software doesn't expect them to be atomic by
>> default, they could span 2 hardware lines to keep the implementation simpler/sane.
> I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
> but it would be quite a surprise on 64-bit systems.

Totally agree !

Thx,
-Vineet

2019-06-04 07:43:53

by Geert Uytterhoeven

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

Hi Paul,

On Mon, Jun 3, 2019 at 10:14 PM Paul E. McKenney wrote:
> On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
> > On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> > >> I'm not sure how to interpret "natural alignment" for the case of double
> > >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> > >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> > >
> > > For any u64 type, that would give 8 byte alignment. the problem
> > > otherwise being that your data spans two lines/pages etc..
> >
> > Sure, but as Paul said, if the software doesn't expect them to be atomic by
> > default, they could span 2 hardware lines to keep the implementation simpler/sane.
>
> I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
> but it would be quite a surprise on 64-bit systems.

Or two-byte aligned?

M68k started with a 16-bit data bus, and alignment rules were retained
when gaining a wider data bus.

BTW, do any platforms have issues with atomicity of 4-byte types on
16-bit data buses? I believe some embedded ARM or PowerPC do have
such buses.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2019-06-06 09:46:40

by Paul E. McKenney

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On Tue, Jun 04, 2019 at 09:41:04AM +0200, Geert Uytterhoeven wrote:
> Hi Paul,
>
> On Mon, Jun 3, 2019 at 10:14 PM Paul E. McKenney wrote:
> > On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
> > > On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> > > >> I'm not sure how to interpret "natural alignment" for the case of double
> > > >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> > > >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > > > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> > > >
> > > > For any u64 type, that would give 8 byte alignment. the problem
> > > > otherwise being that your data spans two lines/pages etc..
> > >
> > > Sure, but as Paul said, if the software doesn't expect them to be atomic by
> > > default, they could span 2 hardware lines to keep the implementation simpler/sane.
> >
> > I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
> > but it would be quite a surprise on 64-bit systems.
>
> Or two-byte aligned?
>
> M68k started with a 16-bit data bus, and alignment rules were retained
> when gaining a wider data bus.
>
> BTW, do any platforms have issues with atomicity of 4-byte types on
> 16-bit data buses? I believe some embedded ARM or PowerPC do have
> such buses.

But m68k is !SMP-only, correct? If so, the only issues would be
interactions with interrupt handlers and the like, and doesn't current
m68k hardware use exact interrupts? Or is it still possible to interrupt
an m68k in the middle of an instruction like it was in the bad old days?

Thanx, Paul

> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds

2019-06-06 09:55:40

by Geert Uytterhoeven

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

Hi Paul,

On Thu, Jun 6, 2019 at 11:43 AM Paul E. McKenney wrote:
> On Tue, Jun 04, 2019 at 09:41:04AM +0200, Geert Uytterhoeven wrote:
> > On Mon, Jun 3, 2019 at 10:14 PM Paul E. McKenney wrote:
> > > On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
> > > > On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> > > > >> I'm not sure how to interpret "natural alignment" for the case of double
> > > > >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> > > > >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > > > > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> > > > >
> > > > > For any u64 type, that would give 8 byte alignment. the problem
> > > > > otherwise being that your data spans two lines/pages etc..
> > > >
> > > > Sure, but as Paul said, if the software doesn't expect them to be atomic by
> > > > default, they could span 2 hardware lines to keep the implementation simpler/sane.
> > >
> > > I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
> > > but it would be quite a surprise on 64-bit systems.
> >
> > Or two-byte aligned?
> >
> > M68k started with a 16-bit data bus, and alignment rules were retained
> > when gaining a wider data bus.
> >
> > BTW, do any platforms have issues with atomicity of 4-byte types on
> > 16-bit data buses? I believe some embedded ARM or PowerPC do have
> > such buses.
>
> But m68k is !SMP-only, correct? If so, the only issues would be

M68k support in Linux is uniprocessor-only.

> interactions with interrupt handlers and the like, and doesn't current
> m68k hardware use exact interrupts? Or is it still possible to interrupt
> an m68k in the middle of an instruction like it was in the bad old days?

TBH, I don't know.

Gr{oetje,eeting}s,

Geert

2019-06-06 18:39:04

by David Laight

Subject: RE: single copy atomicity for double load/stores on 32-bit systems

From: Paul E. McKenney
> Sent: 06 June 2019 10:44
...
> But m68k is !SMP-only, correct? If so, the only issues would be
> interactions with interrupt handlers and the like, and doesn't current
> m68k hardware use exact interrupts? Or is it still possible to interrupt
> an m68k in the middle of an instruction like it was in the bad old days?

Hardware interrupts were always on instruction boundaries, the
mid-instruction interrupts would only happen for page faults (etc).

There were SMP m68k systems (but I can't remember one).
It was important to continue from a mid-instruction trap on the
same cpu - unless you could guarantee that all the cpus had
exactly the same version of the microcode.

In any case you could probably use the 'cas2' instruction
for an atomic 64bit write.
OTOH setting that up was such a PITA it was always easier
to disable interrupts.

David

2019-06-06 22:33:42

by Paul E. McKenney

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On Thu, Jun 06, 2019 at 04:34:52PM +0000, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 06 June 2019 10:44
> ...
> > But m68k is !SMP-only, correct? If so, the only issues would be
> > interactions with interrupt handlers and the like, and doesn't current
> > m68k hardware use exact interrupts? Or is it still possible to interrupt
> > an m68k in the middle of an instruction like it was in the bad old days?
>
> Hardware interrupts were always on instruction boundaries, the
> mid-instruction interrupts would only happen for page faults (etc).

OK, !SMP should be fine, then.

> There were SMP m68k systems (but I can't remember one).
> It was important to continue from a mid-instruction trap on the
> same cpu - unless you could guarantee that all the cpus had
> exactly the same version of the microcode.

Yuck! ;-)

> In any case you could probably use the 'cas2' instruction
> for an atomic 64bit write.
> OTOH setting that up was such a PITA it was always easier
> to disable interrupts.

Unless I am forgetting something, given that m68k is a 32-bit system,
we should be OK without an atomic 64-bit write.

Thanx, Paul

2019-07-01 20:06:40

by Vineet Gupta

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
>> Hi Peter,
>>
>> Had an interesting lunch time discussion with our hardware architects pertinent to
>> "minimal guarantees expected of a CPU" section of memory-barriers.txt
>>
>>
>> | (*) These guarantees apply only to properly aligned and sized scalar
>> | variables. "Properly sized" currently means variables that are
>> | the same size as "char", "short", "int" and "long". "Properly
>> | aligned" means the natural alignment, thus no constraints for
>> | "char", two-byte alignment for "short", four-byte alignment for
>> | "int", and either four-byte or eight-byte alignment for "long",
>> | on 32-bit and 64-bit systems, respectively.
>>
>>
>> I'm not sure how to interpret "natural alignment" for the case of double
>> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
>> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
>
> Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
>
> For any u64 type, that would give 8 byte alignment. the problem
> otherwise being that your data spans two lines/pages etc..
>
>> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
>> be atomic unless 8-byte aligned
>>
>> ARMv7 arch ref manual seems to confirm this. Quoting
>>
>> | LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
>> | VSTM, and VSTR instructions are executed as a sequence of word-aligned word
>> | accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
>> | subsequence of two or more word accesses from the sequence might not exhibit
>> | single-copy atomicity
>>
>> While it seems reasonable from hardware pov to not implement such atomicity by
>> default it seems there's an additional burden on application writers. They could
>> be happily using a lockless algorithm with just a shared flag between 2 threads
>> w/o need for any explicit synchronization.
>
> If you're that careless with lockless code, you deserve all the pain you
> get.
>
>> But upgrade to a new compiler which
>> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
>> causing the code to suddenly stop working. Is the onus on them to declare such
>> memory as c11 atomic or some such.
>
> When a programmer wants guarantees they already need to know wth they're
> doing.
>
> And I'll stand by my earlier conviction that any architecture that has a
> native u64 (be it a 64bit arch or a 32bit with double-width
> instructions) but has an ABI that allows u32 alignment on them is daft.

So I agree with Paul's assertion that it is strange for an 8-byte type to be 4-byte
aligned on a 64-bit system, but is it totally broken even if the ISA of said
64-bit arch allows LD/ST to be augmented with acq/rel respectively?

Say the ISA guarantees single-copy atomicity for aligned cases (i.e. for 8-byte
data only if it is naturally aligned) and, lacking that, the programmer needs to use
the proper acq/release.

In my earlier example of lockless code, we do assume that the programmer will use a
release in the update of the flag.

2019-07-02 10:47:03

by Will Deacon

Subject: Re: single copy atomicity for double load/stores on 32-bit systems

On Mon, Jul 01, 2019 at 08:05:51PM +0000, Vineet Gupta wrote:
> On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> > On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
> >> Had an interesting lunch time discussion with our hardware architects pertinent to
> >> "minimal guarantees expected of a CPU" section of memory-barriers.txt
> >>
> >>
> >> | (*) These guarantees apply only to properly aligned and sized scalar
> >> | variables. "Properly sized" currently means variables that are
> >> | the same size as "char", "short", "int" and "long". "Properly
> >> | aligned" means the natural alignment, thus no constraints for
> >> | "char", two-byte alignment for "short", four-byte alignment for
> >> | "int", and either four-byte or eight-byte alignment for "long",
> >> | on 32-bit and 64-bit systems, respectively.
> >>
> >>
> >> I'm not sure how to interpret "natural alignment" for the case of double
> >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> >
> > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> >
> > For any u64 type, that would give 8 byte alignment. the problem
> > otherwise being that your data spans two lines/pages etc..
> >
> >> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> >> be atomic unless 8-byte aligned
> >>
> >> ARMv7 arch ref manual seems to confirm this. Quoting
> >>
> >> | LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
> >> | VSTM, and VSTR instructions are executed as a sequence of word-aligned word
> >> | accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
> >> | subsequence of two or more word accesses from the sequence might not exhibit
> >> | single-copy atomicity
> >>
> >> While it seems reasonable from hardware pov to not implement such atomicity by
> >> default it seems there's an additional burden on application writers. They could
> >> be happily using a lockless algorithm with just a shared flag between 2 threads
> >> w/o need for any explicit synchronization.
> >
> > If you're that careless with lockless code, you deserve all the pain you
> > get.
> >
> >> But upgrade to a new compiler which
> >> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> >> causing the code to suddenly stop working. Is the onus on them to declare such
> >> memory as c11 atomic or some such.
> >
> > When a programmer wants guarantees they already need to know wth they're
> > doing.
> >
> > And I'll stand by my earlier conviction that any architecture that has a
> > native u64 (be it a 64bit arch or a 32bit with double-width
> > instructions) but has an ABI that allows u32 alignment on them is daft.
>
> So I agree with Paul's assertion that it is strange for 8-byte type being 4-byte
> aligned on a 64-bit system, but is it totally broken even if the ISA of the said
> 64-bit arch allows LD/ST to be augmented with acq/rel respectively.
>
> Say the ISA guarantees single-copy atomicity for aligned cases (i.e. for 8-byte
> data only if it is naturally aligned) and in lack thereof programmer needs to use
> the proper acq/release

Apologies if I'm missing some context here, but it's not clear to me why the
use of acquire/release instructions has anything to do with single-copy
atomicity of unaligned accesses. The ordering they provide doesn't
necessarily prevent tearing, although a CPU architecture could obviously
provide that guarantee if it wanted to. Generally though, I wouldn't expect
the two to go hand-in-hand like you're suggesting.
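
A small C11 sketch may help separate the two properties. The names payload,
ready, publish and consume_or are invented for this example. The release store
and acquire load order the payload accesses around the flag; they do nothing to
make the 64-bit payload access itself indivisible. The pattern is only safe
because the reader never touches the payload before seeing the flag and the
writer never modifies it afterwards.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t payload;	/* plain 64-bit data; may be two word accesses */
static atomic_bool ready;	/* small flag; its own accesses are atomic */

/* Writer: fill in the payload, then publish it with a release store. */
static void publish(uint64_t v)
{
	payload = v;
	atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: only look at the payload after an acquire load saw the flag. */
static uint64_t consume_or(uint64_t fallback)
{
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return fallback;
	return payload;
}
```

If both sides accessed payload concurrently without the flag handshake, the
acquire/release annotations would not prevent a torn value; that case needs an
access that is single-copy atomic in its own right.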

Will