2005-10-06 12:58:28

by Kirill Korotaev

Subject: SMP syncronization on AMD processors (broken?)

Hello Linus, Andrew and others,

Please help with a non-trivial question about spin_lock()/spin_unlock() on
SMP archs. The question is whether concurrent spin_lock() calls should
acquire the lock in a more or less "fair" fashion, or whether one CPU can
starve for an arbitrarily long time while the others reacquire the lock in
a loop.

The question arose because the behaviour we observe on AMD processors is
really strange, and makes us believe that something is wrong in the kernel,
in the processor, or in our minds. An explanation follows:

The whole story started when we wrote the following code:

void XXX(void)
{
        /* ints disabled */
restart:
        spin_lock(&lock);
        do_something();
        if (!flag)
                need_restart = 1;
        spin_unlock(&lock);
        if (need_restart)
                goto restart;           /* <<<< LOOPS 4EVER ON AMD!!! */
}

void YYY(void)
{
        spin_lock(&lock);               /* <<<< SPINS 4EVER ON AMD!!! */
        flag = 1;
        spin_unlock(&lock);
}

Function XXX() starts on CPU0 and begins to loop since the flag is not set;
then CPU1 calls function YYY() and, it turns out, cannot take the lock for
an arbitrarily long time.

Other observations:
- This does not happen on Intel processors; moreover, on Intel the 2 CPUs
take the lock in a fair manner, exactly one by one!
- With do_something() == usleep(3), we observed that XXX() loops forever
while YYY() spins 4EVER on the same lock...
- cpu_relax() after spin_unlock() doesn't help...
- wbinvd() after spin_unlock() helps, and the 2 CPUs begin to take the lock
in a fair manner.

How can this happen? Is it regulated somehow by SMP specifications?

Kirill
P.S. /proc/cpuinfo of the affected machines is provided below.

-----------------------------------------------------------------------------

[root@ts25 ~]# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 35
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping : 2
cpu MHz : 2010.433
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext lm
3dnowext 3dnow pni
bogomips : 3981.31

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 35
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping : 2
cpu MHz : 2010.433
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext lm
3dnowext 3dnow pni
bogomips : 3964.92

-----------------------------------------------------------------------------

[root@opteron root]# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 246
stepping : 10
cpu MHz : 1992.595
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
3dnowext 3dnow
bogomips : 3915.77


2005-10-06 13:14:58

by linux-os (Dick Johnson)

Subject: Re: SMP syncronization on AMD processors (broken?)


On Thu, 6 Oct 2005, Kirill Korotaev wrote:

> Hello Linus, Andrew and others,
>
> Please help with a not simple question about spin_lock/spin_unlock on
> SMP archs. The question is whether concurrent spin_lock()'s should
> acquire it in more or less "fair" fashinon or one of CPUs can starve any
> arbitrary time while others do reacquire it in a loop.
>
> The question raised because the situation we observe on AMD processors
> is really strange and makes us believe that something is wrong in
> kerne/in processor or our minds. Below goes an explanation:
>
> The whole story started when we wrote the following code:
>
> void XXX(void)
> {
> /* ints disabled */
> restart:
> spin_lock(&lock);
> do_something();
> if (!flag)
> need_restart = 1;
> spin_unlock(&lock);
> if (need_restart)
> goto restart; <<<< LOOPS 4EVER ON AMD!!!
> }
>
> void YYY(void)
> {
> spin_lock(&lock); <<<< SPINS 4EVER ON AMD!!!
> flag = 1;
> spin_unlock(&lock);
> }
>
> function XXX() starts on CPU0 and begins to loop since flag is not set,
> then CPU1 calls function YYY() and it turns out that it can't take the
> lock any arbitrary time.
>
> Other observations:
> - This does not happen on Intel processors, more over on Intel 2 CPUs
> take locks in a fair manner, exactly one by one!
> - If do_something() = usleep(3) we observed that XXX() loops forever,
> while YYY spins 4EVER on the same lock...
> - cpu_relax() doesn't help after spin_unlock()...
> - wbinvd() after spin_unlock() helpes and 2 CPUs began to take the lock
> in a fair manner.
>
> How can this happen? Is it regulated somehow by SMP specifications?
>

Are your flags atomic_t types manipulated correctly with
the proper primitives, or are you just setting globals?

> - wbinvd() after spin_unlock() helpes and 2 CPUs began to take the lock
^^^^^^^^^_________ This is the hint that your flags are not
doing what you expect, i.e., a change not seen by another CPU.


> Kirill
> P.S. Below is provided /proc/cpuinfo of machines affected.
>
> [quoted /proc/cpuinfo and list footer snipped]

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.


2005-10-06 13:19:17

by Arjan van de Ven

Subject: Re: SMP syncronization on AMD processors (broken?)

On Thu, 2005-10-06 at 17:05 +0400, Kirill Korotaev wrote:
> Hello Linus, Andrew and others,
>
> Please help with a not simple question about spin_lock/spin_unlock on
> SMP archs. The question is whether concurrent spin_lock()'s should
> acquire it in more or less "fair" fashinon or one of CPUs can starve any
> arbitrary time while others do reacquire it in a loop.

Spinlocks are designed not to be fair, or rather are allowed not to be.
If you want them to be fair on x86 you need, at minimum, to put a
cpu_relax() in your busy loop...


2005-10-06 13:32:37

by Andi Kleen

Subject: Re: SMP syncronization on AMD processors (broken?)

Kirill Korotaev <[email protected]> writes:

> Please help with a not simple question about spin_lock/spin_unlock on
> SMP archs. The question is whether concurrent spin_lock()'s should
> acquire it in more or less "fair" fashinon or one of CPUs can starve
> any arbitrary time while others do reacquire it in a loop.

They are not fully fair because of the NUMAness of the system.
Same on many other NUMA systems.

We considered long ago to use queued locks to avoid this, but
they are quite costly for the uncongested case and never seemed worth it.

So live with it.

-Andi

2005-10-06 13:32:44

by Andrey Savochkin

Subject: Re: SMP syncronization on AMD processors (broken?)

On Thu, Oct 06, 2005 at 03:19:07PM +0200, Arjan van de Ven wrote:
> On Thu, 2005-10-06 at 17:05 +0400, Kirill Korotaev wrote:
> > Hello Linus, Andrew and others,
> >
> > Please help with a not simple question about spin_lock/spin_unlock on
> > SMP archs. The question is whether concurrent spin_lock()'s should
> > acquire it in more or less "fair" fashinon or one of CPUs can starve any
> > arbitrary time while others do reacquire it in a loop.
>
> spinlocks are designed to not be fair. or rather are allowed to not be.
> If you want them to be fair on x86 you need at minimum to put a
> cpu_relax() in your busy loop...

The question was raised exactly because cpu_relax() doesn't help on these
AMD CPUs.

Some Pentiums do more than is expected of them, and the program works in a
very fair manner even without cpu_relax(), so the question boils down to
whether there are new AMD rules for writing such loops, whether this is a
defect of the CPU, or whether we are missing something else.

Andrey

2005-10-06 13:46:09

by Andrey Savochkin

Subject: Re: SMP syncronization on AMD processors (broken?)

On Thu, Oct 06, 2005 at 03:32:30PM +0200, Andi Kleen wrote:
> Kirill Korotaev <[email protected]> writes:
>
> > Please help with a not simple question about spin_lock/spin_unlock on
> > SMP archs. The question is whether concurrent spin_lock()'s should
> > acquire it in more or less "fair" fashinon or one of CPUs can starve
> > any arbitrary time while others do reacquire it in a loop.
>
> They are not fully fair because of the NUMAness of the system.
> Same on many other NUMA systems.
>
> We considered long ago to use queued locks to avoid this, but
> they are quite costly for the uncongested case and never seemed worth it.
>
> So live with it.

Well, it's hard to swallow...
It's not about being not fully fair; it's about deadlocks that started
to appear after code changes inside retry loops...

A practical question is whether there is an "official" way to tell the CPU
that it should synchronize with memory, or whether you have ideas on how to
make it less costly than queued locks.

A theoretical question is how many places in the kernel use such retry loops
that may start to fail some day (or on some machines)...

Andrey

2005-10-06 13:51:07

by Eric Dumazet

Subject: Re: SMP syncronization on AMD processors (broken?)

Andi Kleen a écrit :
> Kirill Korotaev <[email protected]> writes:
>
>
>>Please help with a not simple question about spin_lock/spin_unlock on
>>SMP archs. The question is whether concurrent spin_lock()'s should
>>acquire it in more or less "fair" fashinon or one of CPUs can starve
>>any arbitrary time while others do reacquire it in a loop.
>
>
> They are not fully fair because of the NUMAness of the system.
> Same on many other NUMA systems.
>
> We considered long ago to use queued locks to avoid this, but
> they are quite costly for the uncongested case and never seemed worth it.
>
> So live with it.

Unrelated, but that reminds me that the current spinlock implementation on
x86 implies that NR_CPUS must be < 128.

Maybe we should reflect this in Kconfig?

config NR_CPUS
        range 2 128

Or use a plain int for the spinlock, instead of a signed char.

Eric

2005-10-06 14:02:17

by Andi Kleen

Subject: Re: [discuss] Re: SMP syncronization on AMD processors (broken?)

On Thursday 06 October 2005 15:50, Eric Dumazet wrote:

> Maybe we should reflect this in Kconfig ?
>
> config NR_CPUS
> range 2 128
>
> Or use a plain int for spinlock, instead of a signed char.

Hmm? 2.6 already uses int as far as I can see.

-Andi

2005-10-06 14:02:17

by Andi Kleen

Subject: Re: [discuss] Re: SMP syncronization on AMD processors (broken?)

On Thursday 06 October 2005 15:46, Andrey Savochkin wrote:
> On Thu, Oct 06, 2005 at 03:32:30PM +0200, Andi Kleen wrote:
> > Kirill Korotaev <[email protected]> writes:
> > > Please help with a not simple question about spin_lock/spin_unlock on
> > > SMP archs. The question is whether concurrent spin_lock()'s should
> > > acquire it in more or less "fair" fashinon or one of CPUs can starve
> > > any arbitrary time while others do reacquire it in a loop.
> >
> > They are not fully fair because of the NUMAness of the system.
> > Same on many other NUMA systems.
> >
> > We considered long ago to use queued locks to avoid this, but
> > they are quite costly for the uncongested case and never seemed worth it.
> >
> > So live with it.
>
> Well, it's hard to swallow...
> It's not about being not fully fair, it's about deadlocks that started
> to appear after code changes inside retry loops...

Don't do that then.

> A practical question is whether there is an "official" way to tell the CPU
> that it should synchronize with memory, or if you have ideas how to make it
> less costly than queued locks.

I don't think there is a way specified in the architecture. So you're
definitely in undocumented, system-dependent territory if you attempt this.

Maybe a delay would work.

Or maybe a write-combining access (movnti) followed by an sfence.


> A theoretical question is how many places in the kernel use such retry
> loops that may start to fail some day (or on some machines)...

We already have such cases - e.g. our rwlocks have always had such a
deadlock, even on SMP systems. As far as I know it has been reported
exactly once, on a 64-CPU IA64 system, but it wasn't possible to fix
without large-scale changes, so it was ignored. I am not aware of the
problem ever happening on a production system.

And in general, fairness has never been a strong point of Linux. A lot of
subsystems do resource handling/sharing without taking it into account, and
so far we have gotten away with it.

I'm not saying it's a good thing, but that general strategy doesn't seem to
have hurt us significantly so far, and the fixes are usually worse than the
problems.

-Andi

2005-10-06 14:11:22

by Andi Kleen

Subject: Re: [discuss] Re: SMP syncronization on AMD processors (broken?)

> Sorry Andi, I only trust assembly :)

Ok, the datatype is int, but the assembly still uses decb. Will change
that to a decl.

-Andi

2005-10-06 14:10:13

by Eric Dumazet

Subject: Re: [discuss] Re: SMP syncronization on AMD processors (broken?)

Andi Kleen a écrit :
> On Thursday 06 October 2005 15:50, Eric Dumazet wrote:
>
>
>>Maybe we should reflect this in Kconfig ?
>>
>>config NR_CPUS
>>range 2 128
>>
>>Or use a plain int for spinlock, instead of a signed char.
>
>
> Hmm? 2.6 already uses int as far as I can see.
>

Not in public 2.6 at least.

ffffffff8030be10 <_spin_lock>:
ffffffff8030be10: 55 push %rbp
ffffffff8030be11: 48 89 e5 mov %rsp,%rbp
ffffffff8030be14: f0 fe 0f lock decb (%rdi)
ffffffff8030be17: 0f 88 d1 02 00 00 js <.text.lock.spinlock>
ffffffff8030be1d: c9 leaveq
ffffffff8030be1e: c3 retq

Sorry Andi, I only trust assembly :)

Eric

2005-10-06 14:22:34

by Arjan van de Ven

Subject: Re: SMP syncronization on AMD processors (broken?)

On Thu, 2005-10-06 at 17:32 +0400, Andrey Savochkin wrote:
> On Thu, Oct 06, 2005 at 03:19:07PM +0200, Arjan van de Ven wrote:
> > On Thu, 2005-10-06 at 17:05 +0400, Kirill Korotaev wrote:
> > > Hello Linus, Andrew and others,
> > >
> > > Please help with a not simple question about spin_lock/spin_unlock on
> > > SMP archs. The question is whether concurrent spin_lock()'s should
> > > acquire it in more or less "fair" fashinon or one of CPUs can starve any
> > > arbitrary time while others do reacquire it in a loop.
> >
> > spinlocks are designed to not be fair. or rather are allowed to not be.
> > If you want them to be fair on x86 you need at minimum to put a
> > cpu_relax() in your busy loop...
>
> The question was raised exactly because cpu_relax() doesn't help on these AMD
> CPUs.
>
> Some Pentiums do more than expected from them, and the programs works in a
> very fair manner even without cpu_relax(), so the question boils down to
> whether there are some new AMD rules how to write such loops, is it a defect
> of the CPU, or if we are missing something else.

The rule basically is "don't write such a loop", though. And this is only
the beginning: if you have two such things on separate cores of a dual-core
CPU, you will for sure starve anything outside of that core just the same.
I.e., it goes one level up as well.

There is no spin_lock_yield() currently, and until there is, this is just a
code pattern you should avoid.


2005-10-06 14:46:23

by Linus Torvalds

Subject: Re: SMP syncronization on AMD processors (broken?)



On Thu, 6 Oct 2005, Kirill Korotaev wrote:
>
> The question raised because the situation we observe on AMD processors is
> really strange and makes us believe that something is wrong in kerne/in
> processor or our minds. Below goes an explanation:

Your code is buggy.

> The whole story started when we wrote the following code:
>
> void XXX(void)
> {
> /* ints disabled */
> restart:
> spin_lock(&lock);
> do_something();
> if (!flag)
> need_restart = 1;
> spin_unlock(&lock);
> if (need_restart)
> goto restart; <<<< LOOPS 4EVER ON AMD!!!
> }
>
> void YYY(void)
> {
> spin_lock(&lock); <<<< SPINS 4EVER ON AMD!!!
> flag = 1;
> spin_unlock(&lock);
> }

If you want to notify another CPU that you want the spinlock, then you
need to set the "flag" variable _outside_ of the spinlock.

Spinlocks are not fair, not by a long shot. They never have been, and they
never will be. Fairness would be extremely expensive indeed.

> Other observations:
> - This does not happen on Intel processors, more over on Intel 2 CPUs take
> locks in a fair manner, exactly one by one!

It depends entirely on the cache coherency protocol. I bet you'd find
differences even within Intel CPUs.

Linus

2005-10-06 14:53:12

by Linus Torvalds

Subject: Re: SMP syncronization on AMD processors (broken?)



On Thu, 6 Oct 2005, Andrey Savochkin wrote:
>
> Well, it's hard to swallow...
> It's not about being not fully fair, it's about deadlocks that started
> to appear after code changes inside retry loops...

No, it's not about fairness.

It's about BUGS IN YOUR CODE.

If you need fairness, you need to implement that yourself. You can do so
many ways. Either on top of spinlocks, by using an external side-band
channel, or by using semaphores instead of spinlocks (semaphores are much
higher cost, but part of the cost is that they _are_ fair).

Linus

2005-10-06 15:27:49

by Andrey Savochkin

Subject: Re: SMP syncronization on AMD processors (broken?)

On Thu, Oct 06, 2005 at 07:52:17AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 6 Oct 2005, Andrey Savochkin wrote:
> >
> > Well, it's hard to swallow...
> > It's not about being not fully fair, it's about deadlocks that started
> > to appear after code changes inside retry loops...
>
> No, it's not about fairness.
>
> It's about BUGS IN YOUR CODE.
>
> If you need fairness, you need to implement that yourself. You can do so
> many ways. Either on top of spinlocks, by using an external side-band
> channel, or by using semaphores instead of spinlocks (semaphores are much
> higher cost, but part of the cost is that they _are_ fair).

Ok, let it be a bug.

I just want to repeat that nobody wanted or expected any fairness in this
case. But such an extremity, that on some CPU models one CPU can never, not
in a billion cycles, get the lock if the other CPU repeatedly drops and
reacquires it, and that in this scenario memory changes seem never to
propagate to other CPUs - well, all of that is a surprise.

I'm starting to wonder about existing mainstream code, presumably bug-free,
that uses spinlocks without any problematic restart.
If one day some piece starts to be called too often for some legitimate
reason, it might fall into the same pattern and completely block others who
want to take the same spinlock.
I'm not advocating changing the spinlock implementation; it's just a
thought...

Andrey

2005-10-06 15:34:41

by Hugh Dickins

Subject: Re: SMP syncronization on AMD processors (broken?)

On Thu, 6 Oct 2005, Linus Torvalds wrote:
>
> If you want to notify another CPU that you want the spinlock, then you
> need to set the "flag" variable _outside_ of the spinlock.
>
> Spinlocks are not fair, not by a long shot. They never have been, and they
> never will. Fairness would be extremely expensive indeed.

That reminds me: ought cond_resched_lock to be doing something more?

int cond_resched_lock(spinlock_t *lock)
{
        int ret = 0;

        if (need_lockbreak(lock)) {
                spin_unlock(lock);
                cpu_relax();
                ret = 1;
                spin_lock(lock);
        }

2005-10-06 15:47:35

by Linus Torvalds

Subject: Re: SMP syncronization on AMD processors (broken?)



On Thu, 6 Oct 2005, Andrey Savochkin wrote:
>
> I start to wonder about existing mainstream code, presumably bug-free, that
> uses spinlocks without any problematic restart.

Actually, a number of cases where we found things to be a problem have
been converted.

It _is_ a real-life issue, although with "normal" code it only happens in
extreme machines (NUMA with tons of nodes). It's fairly fundamental in
NUMA, and in many ways you do absolutely _not_ want fairness, because it's
much better to take the lock locally a hundred times (almost free) than it
is to bounce it back-and-forth between two nodes a hundred times (very
expensive).

Fairness is often very expensive indeed.

But places where unfairness can be a problem have been converted to things
like RCU, which allows concurrent operations more gracefully.

And sometimes the answer is just "don't do that then".

Linus

2005-10-06 15:54:45

by Eric Dumazet

Subject: Re: SMP syncronization on AMD processors (broken?)

Hugh Dickins a écrit :
> On Thu, 6 Oct 2005, Linus Torvalds wrote:
>
>>If you want to notify another CPU that you want the spinlock, then you
>>need to set the "flag" variable _outside_ of the spinlock.
>>
>>Spinlocks are not fair, not by a long shot. They never have been, and they
>>never will. Fairness would be extremely expensive indeed.
>
>
> That reminds me: ought cond_resched_lock to be doing something more?
>
> int cond_resched_lock(spinlock_t *lock)
> {
> int ret = 0;
>
> if (need_lockbreak(lock)) {
> spin_unlock(lock);
> cpu_relax();
> ret = 1;
> spin_lock(lock);
> }
> -

Isn't it funny that some bugs can spot other bugs? :)

break_lock should be declared atomic_t and used like this:

void __lockfunc _##op##_lock(locktype##_t *lock)
{
        preempt_disable();
        for (;;) {
                if (likely(_raw_##op##_trylock(lock)))
                        break;
                preempt_enable();
                atomic_inc(&(lock)->break_lock);
                while (!op##_can_lock(lock))
                        cpu_relax();
                preempt_disable();
                atomic_dec(&(lock)->break_lock);
        }
}


Eric

2005-10-06 16:02:33

by Linus Torvalds

Subject: Re: SMP syncronization on AMD processors (broken?)



On Thu, 6 Oct 2005, Hugh Dickins wrote:
>
> That reminds me: ought cond_resched_lock to be doing something more?

I suspect the "need_resched()" case is sufficient in practice: it might
not be perfect, but let's face it, there's not a lot we can do.

It's true that the cpu_relax() basically does nothing except perhaps by
luck (and depending on CPU and memory setup). There's no good alternative,
though. You could make it more complex and change the cpu_relax() to
something like

        i = random_backoff();
        for (;;) {
                cpu_relax();
                if (spin_is_locked(lock))
                        break;
                if (!need_lockbreak(lock))
                        break;
                if (!--i)
                        break;
        }

but I'd argue vehemently that you shouldn't do this unless you can see
real problems.

I doubt it's really a problem in practice. You have to have a really hot
spinlock that stays hot on one CPU (and most of them you can't even _make_
hot from user space), and I suspect we've fixed the ones that matter.

Linus

2005-10-07 20:38:12

by Joe Seigh

Subject: Re: SMP syncronization on AMD processors (broken?)

Kirill Korotaev wrote:
> Hello Linus, Andrew and others,
>
> Please help with a not simple question about spin_lock/spin_unlock on
> SMP archs. The question is whether concurrent spin_lock()'s should
> acquire it in more or less "fair" fashinon or one of CPUs can starve any
> arbitrary time while others do reacquire it in a loop.
>
> The question raised because the situation we observe on AMD processors
> is really strange and makes us believe that something is wrong in
> kerne/in processor or our minds. Below goes an explanation:
>
> The whole story started when we wrote the following code:
>
> void XXX(void)
> {
> /* ints disabled */
> restart:
> spin_lock(&lock);
> do_something();
> if (!flag)
> need_restart = 1;
> spin_unlock(&lock);
> if (need_restart)
> goto restart; <<<< LOOPS 4EVER ON AMD!!!
> }
>
> void YYY(void)
> {
> spin_lock(&lock); <<<< SPINS 4EVER ON AMD!!!
> flag = 1;
> spin_unlock(&lock);
> }
>
> function XXX() starts on CPU0 and begins to loop since flag is not set,
> then CPU1 calls function YYY() and it turns out that it can't take the
> lock any arbitrary time.
>
> Other observations:
> - This does not happen on Intel processors, more over on Intel 2 CPUs
> take locks in a fair manner, exactly one by one!
> - If do_something() = usleep(3) we observed that XXX() loops forever,
> while YYY spins 4EVER on the same lock...
> - cpu_relax() doesn't help after spin_unlock()...

Unilateral backoff isn't guaranteed to work as well as multilateral
backoff, which can't be done in software AFAIK.

> - wbinvd() after spin_unlock() helpes and 2 CPUs began to take the lock
> in a fair manner.
>
> How can this happen? Is it regulated somehow by SMP specifications?

The hardware specs don't guarantee fairness. The only architecture I
know of that did guarantee fairness was IBM's 370 architecture and that
guarantee only appeared in registered confidential copies of the
architecture as an engineering note.

What you can do is write your own FIFO spin lock based on the bakery
algorithm. It will require two words instead of one: one for the
"next ticket" and one for the "current ticket". To get the next ticket
value, use LOCK XADD, which should be guaranteed to work. Then just spin
until the current value equals your ticket value. You also have the
advantage that when a contended lock is released, the waiting processors
don't all try to grab the lockword cache line exclusive at once; that
traffic was spaced out earlier, when the tickets were handed out. If you
have lots of storage you can put the two words in separate cache lines.
It's also quite easy to make a FIFO read/write upgradable spin lock out of
it without increasing the path length of the write-lock part. Bakery
algorithms are fun.


--
Joe Seigh

2005-10-07 21:01:16

by Stephen Hemminger

Subject: Re: SMP syncronization on AMD processors (broken?)

There are lots and lots of papers on this that you might also want to
consider, for example:

http://www.cs.vu.nl/~swami/fairlocks.pdf

--
Stephen Hemminger <[email protected]>
OSDL http://developer.osdl.org/~shemminger

2005-10-08 09:34:57

by Chuck Ebbert

Subject: Re: SMP syncronization on AMD processors (broken?)

In-Reply-To: <[email protected]>

On Thu, 06 Oct 2005 at 17:05:03 +0400, Kirill Korotaev wrote:

> The question is whether concurrent spin_lock()'s should
> acquire it in more or less "fair" fashinon or one of CPUs can starve any
> arbitrary time while others do reacquire it in a loop.

You neglected to say which CPU type you compiled the kernel for.

If it wasn't Pentium Pro, maybe you could patch include/asm-i386/spinlock.h
line 82 (or the same place in x86-64) like this:

___
* (PPro errata 66, 92)
*/

-#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+#if 0

#define __raw_spin_unlock_string \
"movb $1,%0" \
___

The data might not make it out of the CPU write buffer without a locking
instruction doing the update.
--
Chuck

2005-10-11 01:00:49

by Andrew Morton

Subject: Re: SMP syncronization on AMD processors (broken?)

Andrey Savochkin <[email protected]> wrote:
>
> I start to wonder about existing mainstream code, presumably bug-free, that
> uses spinlocks without any problematic restart.
> If one day some piece starts to be called too often by some legitimate
> reasons, it might fall into the same pattern and completely block others who
> want to take the same spinlock.

There are surely places in the kernel where we do get into a corner and
drop a lock in the expectation that another CPU will acquire it and get us
out of the mess we got ourselves into. One is
fs/jbd/commit.c:inverted_lock(), and I've always been afraid that it might
be vulnerable to NUMA capture effects such as you describe.

Presently inverted_lock() just does a completely pointless schedule(), and
it remains that way because nobody has reported it locking up yet :(

> I'm not advocating for changing spinlock implementation, it's just a
> thought...

It would make sense in these cases if there was some primitive which we
could call which says "hey, I expect+want another CPU to grab this lock in
preference to this CPU".

Note that inverted_lock() uses bit_spin_lock() - that's just a detail but
it does illustrate that any such primitive shouldn't be specific to just
spinlocks and rwlocks.

(And yes, I once looked at fixing JBD "for real" but it was really hard.
Fact is, the commit code and the filesystem code really do want to take
those locks in opposite order).

2005-10-11 01:20:18

by Andi Kleen

Subject: Re: SMP syncronization on AMD processors (broken?)

On Tuesday 11 October 2005 02:59, Andrew Morton wrote:

> > I'm not advocating for changing spinlock implementation, it's just a
> > thought...
>
> It would make sense in these cases if there was some primitive which we
> could call which says "hey, I expect+want another CPU to grab this lock in
> preference to this CPU".

I just don't know how to implement such a primitive given the guarantees
of the x86 architecture. It might be possible to do something that
works on specific CPUs, but that will likely break later.


-Andi

2005-10-11 03:19:49

by Joe Seigh

Subject: Re: SMP syncronization on AMD processors (broken?)

Andi Kleen wrote:
> On Tuesday 11 October 2005 02:59, Andrew Morton wrote:
>
>
>>> I'm not advocating for changing spinlock implementation, it's just a
>>> thought...
>>
>>It would make sense in these cases if there was some primitive which we
>>could call which says "hey, I expect+want another CPU to grab this lock in
>>preference to this CPU".
>
>
> I just don't know how to implement such a primitive given the guarantees
> of the x86 architecture. It might be possible to do something that
> works on specific CPUs, but that will likely break later.
>

I thought that's what the WBINVD did. Either the problem is the delayed
write buffer, or the fact that the store makes the lock cache line exclusive,
which gives the processor an unfair advantage if it immediately tries to
reacquire the lock. WBINVD solves both of those problems.

Or you could use a spin lock implementation that didn't have that problem
to begin with.

--
Joe Seigh



2005-10-11 23:50:26

by George Spelvin

[permalink] [raw]
Subject: Re: SMP syncronization on AMD processors (broken?)

> The whole story started when we wrote the following code:
>
> void XXX(void)
> {
> /* ints disabled */
> restart:
> spin_lock(&lock);
> do_something();
> if (!flag)
> need_restart = 1;
> spin_unlock(&lock);
> if (need_restart)
> goto restart; <<<< LOOPS 4EVER ON AMD!!!
> }
>
> void YYY(void)
> {
> spin_lock(&lock); <<<< SPINS 4EVER ON AMD!!!
> flag = 1;
> spin_unlock(&lock);
> }
>
> function XXX() starts on CPU0 and begins to loop since flag is not set,
> then CPU1 calls function YYY() and it turns out that it can't take the
> lock any arbitrary time.

The right thing to do here is to wait for the flag to be set *outside*
the lock, and then re-validate inside the lock:

void XXX(void)
{
/* ints disabled */
restart:
spin_lock(&lock);
do_something();
if (!flag)
need_restart = 1;
spin_unlock(&lock);
if (need_restart) {
while (!flag)
cpu_relax();
goto restart;
}
}

This way, XXX() keeps the lock dropped for as long as it takes for
YYY() to notice and grab it.


However, I realize that this is of course a simplified case of some real
code, where even *finding* the flag requires the spin lock.

The generic solution is to have a global "progress" counter, which
records "I made progress toward setting flag", that XXX() can
busy-loop on:

int progress;

void XXX(void)
{
int old_progress;
/* ints disabled */
restart:
spin_lock(&lock);
do_something();
if (!flag) {
old_progress = progress;
need_restart = 1;
}
spin_unlock(&lock);
if (need_restart) {
while (progress == old_progress)
cpu_relax();
goto restart;
}
}

void YYY(void)
{
spin_lock(&lock);
flag = 1;
progress++;
spin_unlock(&lock);
}

It may be that in your data structure, there is one or a series of
fields that already exist that you can use for the purpose. The goal
is to merely detect *change*, so you can reacquire the lock and test
definitively. It's okay to read freed memory while doing this, as long as
you can be sure that:
- The memory read won't oops the kernel, and
- You don't end up depending on the value of the freed memory to
get you out of the stall.

2005-10-12 02:13:18

by Chris Friesen

[permalink] [raw]
Subject: Re: SMP syncronization on AMD processors (broken?)

[email protected] wrote:

> The right thing to do here is to wait for the flag to be set *outside*
> the lock, and then re-validate inside the lock:

This may work on some processors, but on others the read of "progress"
in XXX, or the write in YYY may require arch-specific code to force the
update out to other cpus.

Alternately, explicitly atomic operations should suffice, but a simple
increment is probably not enough for portable code.

Chris

2005-10-12 02:39:52

by George Spelvin

[permalink] [raw]
Subject: Re: SMP syncronization on AMD processors (broken?)

> This may work on some processors, but on others the read of "progress"
> in XXX, or the write in YYY may require arch-specific code to force the
> update out to other cpus.
>
> Alternately, explicitly atomic operations should suffice, but a simple
> increment is probably not enough for portable code.

Er.. you mean, the pre-incremented value could be cached *indefinitely*
by XXX? That seems odd...

I can see an arch hook (memory barrier sort of thing) to push it out a
bit faster, but are there architectures on which noticing the increment
could be delayed indefinitely?

In particular, that same hook would already be used by the spin lock
release sequence (to ensure that someone else notices the lock is now
available), and unless it's address-specific, it would do for the
"progress" counter as well.

2005-10-12 03:28:00

by Kyle Moffett

[permalink] [raw]
Subject: Re: SMP syncronization on AMD processors (broken?)

On Oct 11, 2005, at 22:39:50, [email protected] wrote:
>> This may work on some processors, but on others the read of
>> "progress" in XXX, or the write in YYY may require arch-specific
>> code to force the update out to other cpus.
>>
>> Alternately, explicitly atomic operations should suffice, but a
>> simple increment is probably not enough for portable code.
>
> Er.. you mean, the pre-incremented value could be cached
> *indefinitely* by XXX? That seems odd...
>
> I can see an arch hook (memory barrier sort of thing) to push it
> out a bit faster, but are there architectures on which noticing the
> increment could be delayed indefinitely?
>
> In particular, that same hook would already be used by the spin
> lock release sequence (to ensure that someone else notices the lock
> is now available), and unless it's address-specific, it would do
> for the "progress" counter as well.

Umm, IIRC, some architectures (don't remember which ones, but I'd
guess it's the big 512-way boxen) have cache-line-and-memory models
such that a cacheline may remain out-of-date indefinitely unless the
CPU with the update runs a "cache-line flush" instruction or the CPU
that wants an update requests one with an exclusive cacheline lock or
similar. On such a system, the only way to ensure safe distribution
of data between CPUs is to make sure it's in the same cacheline as
the spinlock (and document that fact) or use special instructions to
verify coherency.

Cheers,
Kyle Moffett

--
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it.
-- Brian Kernighan


2005-10-13 12:17:51

by Kirill Korotaev

[permalink] [raw]
Subject: Re: SMP syncronization on AMD processors (broken?)

Thanks a lot for the interesting idea provided below.
I will try to implement it.

Kirill

>>The whole story started when we wrote the following code:
>>
>>void XXX(void)
>>{
>> /* ints disabled */
>>restart:
>> spin_lock(&lock);
>> do_something();
>> if (!flag)
>> need_restart = 1;
>> spin_unlock(&lock);
>> if (need_restart)
>> goto restart; <<<< LOOPS 4EVER ON AMD!!!
>>}
>>
>>void YYY(void)
>>{
>> spin_lock(&lock); <<<< SPINS 4EVER ON AMD!!!
>> flag = 1;
>> spin_unlock(&lock);
>>}
>>
>>function XXX() starts on CPU0 and begins to loop since flag is not set,
>>then CPU1 calls function YYY() and it turns out that it can't take the
>>lock any arbitrary time.
>
>
> The right thing to do here is to wait for the flag to be set *outside*
> the lock, and then re-validate inside the lock:
>
> void XXX(void)
> {
> /* ints disabled */
> restart:
> spin_lock(&lock);
> do_something();
> if (!flag)
> need_restart = 1;
> spin_unlock(&lock);
> if (need_restart) {
> while (!flag)
> cpu_relax();
> goto restart;
> }
> }
>
> This way, XXX() keeps the lock dropped for as long as it takes for
> YYY() to notice and grab it.
>
>
> However, I realize that this is of course a simplified case of some real
> code, where even *finding* the flag requires the spin lock.
>
> The generic solution is to have a global "progress" counter, which
> records "I made progress toward setting flag", that XXX() can
> busy-loop on:
>
> int progress;
>
> void XXX(void)
> {
> int old_progress;
> /* ints disabled */
> restart:
> spin_lock(&lock);
> do_something();
> if (!flag) {
> old_progress = progress;
> need_restart = 1;
> }
> spin_unlock(&lock);
> if (need_restart) {
> while (progress == old_progress)
> cpu_relax();
> goto restart;
> }
> }
>
> void YYY(void)
> {
> spin_lock(&lock);
> flag = 1;
> progress++;
> spin_unlock(&lock);
> }
>
> It may be that in your data structure, there is one or a series of
> fields that already exist that you can use for the purpose. The goal
> is to merely detect *change*, so you can reacquire the lock and test
> definitively. It's okay to read freed memory while doing this, as long as
> you can be sure that:
> - The memory read won't oops the kernel, and
> - You don't end up depending on the value of the freed memory to
> get you out of the stall.
>


2005-10-13 18:32:25

by Joe Seigh

[permalink] [raw]
Subject: Re: SMP syncronization on AMD processors (broken?)

Kirill Korotaev wrote:
> Hello Linus, Andrew and others,
>
> Please help with a not simple question about spin_lock/spin_unlock on
> SMP archs. The question is whether concurrent spin_lock()'s should
> acquire it in more or less "fair" fashion or one of the CPUs can starve any
> arbitrary time while others do reacquire it in a loop.
>
> The question raised because the situation we observe on AMD processors
> is really strange and makes us believe that something is wrong in
> kernel/in processor or our minds. Below goes an explanation:
>
> The whole story started when we wrote the following code:
>
> void XXX(void)
> {
> /* ints disabled */
> restart:
> spin_lock(&lock);
> do_something();
> if (!flag)
> need_restart = 1;
> spin_unlock(&lock);
> if (need_restart)
> goto restart; <<<< LOOPS 4EVER ON AMD!!!
> }
>
> void YYY(void)
> {
> spin_lock(&lock); <<<< SPINS 4EVER ON AMD!!!
> flag = 1;
> spin_unlock(&lock);
> }
>
> function XXX() starts on CPU0 and begins to loop since flag is not set,
> then CPU1 calls function YYY() and it turns out that it can't take the
> lock any arbitrary time.
>

Actually this should work:

void XXX() {
do {
do_something();
} while (!flag);
}

void YYY() {
flag = 1;
}

flag doesn't even have to be atomic, though you will want to
declare it volatile or use some other mechanism to keep the
compiler from optimizing the read away.

Other than that you don't need to do anything to make it visible to
the other processor. A strongly coherent cache does what you need.
Using a spin lock doesn't make it get seen any faster; in fact the
spin lock slows you down if anything.

It may be that on some hardware you need to take explicit action to
force something to be seen by another processor. Using a spin lock
may work in that case, but only because the implementer of the
spin lock on that platform decided to add in some forward-progress
semantics. There's no formal definition of spin lock semantics,
so whatever's in there is whatever the implementer thought there
should be. This may not seem like a big deal, since everyone thinks
lock semantics are obvious. But somehow no one seems to know how
to state them formally. POSIX couldn't do it; they gave up and
officially defined it as ... like you know ... obvious. C++0x
is stuck on how to define thread memory-model semantics. They
haven't even gotten to locks yet. All they know is it shouldn't
look like the Java memory model. And although Java has formal
semantics, they don't define forward progress. Java depends on
cooperative threading; if you use a competitive threading model,
you're on your own. Which gets us back to Linux: if you don't
have forward progress defined as part of your synchronization
semantics, then you depend on having a surplus of compute resources
for your code to work correctly, as you found out by putting your
two CPUs in busy loops.

In the case above, where you need to use a spin lock to force
visibility, just use the spin lock to read and write the flag value.
Don't hold it while you're executing do_something(), which hopefully
will be of sufficient duration to prevent livelock.
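Concretely, that restructuring of the original example might look like this (kernel-style sketch, assuming the lock is only needed for visibility of flag):

```c
/* Hold the lock only long enough to read or write flag,
 * never across do_something(). */
void XXX(void)
{
	int seen;

	do {
		do_something();		/* lock NOT held here */
		spin_lock(&lock);
		seen = flag;
		spin_unlock(&lock);
	} while (!seen);
}

void YYY(void)
{
	spin_lock(&lock);
	flag = 1;
	spin_unlock(&lock);
}
```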

--
Joe Seigh