2006-11-04 22:36:51

by Mikulas Patocka

[permalink] [raw]
Subject: 2048 CPUs [was: Re: New filesystem for Linux]

>> Does one Linux kernel run on system with 1024 cpus? I guess it
>> must fry spinlocks... (or even lockup due to spinlock livelocks)
>
> The SGI Altix can have 2048 CPUs.

And does it run one image of Linux? Or more images each on few cpus?

How do they solve problem with spinlock livelocks?

If time-spent-outside-spinlock/time-spent-in-spinlock < number-of-cpus,
the spinlock livelock may happen --- this condition is not true normally
with 2 or 4 cpus, but for that high amount of cpus, there is a danger.

Or do they have some special hardware spinlock instruction that guarantees
fairness?

Mikulas


2006-11-06 01:47:47

by Mikulas Patocka

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]



On Sun, 5 Nov 2006, Albert Cahalan wrote:

> On 11/4/06, Mikulas Patocka <[email protected]> wrote:
>> >> Does one Linux kernel run on system with 1024 cpus? I guess it
>> >> must fry spinlocks... (or even lockup due to spinlock livelocks)
>> >
>> > The SGI Altix can have 2048 CPUs.
>>
>> And does it run one image of Linux? Or more images each on few cpus?
>
> Yes, of course it runs one image of Linux. It's SMP, not a cluster.
> If I remember right:
>
> 2 CPUs per chip
> 2 chips per board (with local RAM; this is NUMA)
> 512 boards per machine
>
>> How do they solve problem with spinlock livelocks?
>>
>> If time-spent-outside-spinlock/time-spent-in-spinlock < number-of-cpus,
>> the spinlock livelock may happen --- this condition is not true normally
>> with 2 or 4 cpus, but for that high amount of cpus, there is a danger.
>>
>> Or do they have some special hardware spinlock instruction that guarantees
>> fairness?
>
> Well first of all it's using IA-64 processors. These do atomic operations
> via instructions that are essentially "take lock" and "release lock".
> The non-CPU hardware probably recognizes these operations by the
> special bus cycles and does whatever is needed to ensure fairness.

It looks like a normal looping spinlock with rep nop and cmpxchg on i386.
Not that I understand IA64 assembler --- but look at
ia64_spinlock_contention --- it looks like a loop with pause and nonatomic
test and a branch to acquire spinlock atomically with cmpxchg --- nothing
that would prevent starving there.

> Second of all, I think we just try to avoid having long spinlock hold
> times. The SGI people love using RCU. As I recall, SGI also caused
> Linux to get a sequence lock for jiffies.

That is right, that you should avoid contention --- but the kernel should
be stable no matter what userspace programs you run. Or does it mean that
scientists can run only trusted programs on Altix that do not perform too
many certain syscalls concurently in order to not jam spinlocks?

Mikulas

2006-11-07 21:41:07

by Pavel Machek

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

Hi!

> >The SGI Altix can have 2048 CPUs.
>
> And does it run one image of Linux? Or more images each
> on few cpus?
>
> How do they solve problem with spinlock livelocks?
>
> If time-spent-outside-spinlock/time-spent-in-spinlock <
> number-of-cpus, the spinlock livelock may happen ---
> this condition is not true normally with 2 or 4 cpus,
> but for that high amount of cpus, there is a danger.

Lets say time-spent-outside-spinlock == time-spent-in-spinlock and
number-of-cpus == 2.

1 < 2 , so it should livelock according to you...

...but afaict this should work okay. Even if spinlocks are very
unfair, as long as time-outside and time-inside comes in big chunks,
it should work.

If you are unlucky, one cpu may stall for a while, but... I see no
livelock.

--
Thanks for all the (sleeping) penguins.

2006-11-07 22:36:21

by Mikulas Patocka

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

Hi!

> Hi!
>
>>> The SGI Altix can have 2048 CPUs.
>>
>> And does it run one image of Linux? Or more images each
>> on few cpus?
>>
>> How do they solve problem with spinlock livelocks?
>>
>> If time-spent-outside-spinlock/time-spent-in-spinlock <
>> number-of-cpus, the spinlock livelock may happen ---
>> this condition is not true normally with 2 or 4 cpus,
>> but for that high amount of cpus, there is a danger.
>
> Lets say time-spent-outside-spinlock == time-spent-in-spinlock and
> number-of-cpus == 2.
>
> 1 < 2 , so it should livelock according to you...

There is off-by-one bug in the condition. It should be:
(time_spent_in_spinlock + time_spent_outside_spinlock) /
time_spent_in_spinlock < number_of_cpus

... or if you divide it by time_spent_in_spinlock:
time_spent_outside_spinlock / time_spent_in_spinlock + 1 < number_of_cpus

> ...but afaict this should work okay. Even if spinlocks are very
> unfair, as long as time-outside and time-inside comes in big chunks,
> it should work.
>
> If you are unlucky, one cpu may stall for a while, but... I see no
> livelock.

If some rogue threads (and it may not even be intetional) call the same
syscall stressing the one spinlock all the time, other syscalls needing
the same spinlock may stall.

Maybe there are so few Altices in the world that no one has yet observed
it...

Mikulas

2006-11-07 23:15:12

by Pavel Machek

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

Hi!

> >Lets say time-spent-outside-spinlock == time-spent-in-spinlock and
> >number-of-cpus == 2.
> >
> >1 < 2 , so it should livelock according to you...
>
> There is off-by-one bug in the condition. It should be:
> (time_spent_in_spinlock + time_spent_outside_spinlock) /
> time_spent_in_spinlock < number_of_cpus
>
> ... or if you divide it by time_spent_in_spinlock:
> time_spent_outside_spinlock / time_spent_in_spinlock + 1 < number_of_cpus
>
> >...but afaict this should work okay. Even if spinlocks are very
> >unfair, as long as time-outside and time-inside comes in big chunks,
> >it should work.
> >
> >If you are unlucky, one cpu may stall for a while, but... I see no
> >livelock.
>
> If some rogue threads (and it may not even be intetional) call the same
> syscall stressing the one spinlock all the time, other syscalls needing
> the same spinlock may stall.

Fortunately, they'll unstall with probability of 1... so no, I do not
think this is real problem.

If someone takes semaphore in syscall (we do), same problem may
happen, right...? Without need for 2048 cpus. Maybe semaphores/mutexes
are fair (or mostly fair) these days, but rwlocks may not be or
something.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-08 03:04:50

by David Chinner

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

On Sat, Nov 04, 2006 at 11:36:49PM +0100, Mikulas Patocka wrote:
> >>Does one Linux kernel run on system with 1024 cpus? I guess it
> >>must fry spinlocks... (or even lockup due to spinlock livelocks)
> >
> >The SGI Altix can have 2048 CPUs.
>
> And does it run one image of Linux? Or more images each on few cpus?

One image.

> How do they solve problem with spinlock livelocks?

By replacing contended spinlocks withsleeping locks, using no-lock
techniques (e.g. per-cpu) or changing the algorithm to remove the
contention point.

w.r.t filesystem locking scalability, you should read this paper:

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-08 18:26:29

by Mikulas Patocka

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

Hi!

>>> Lets say time-spent-outside-spinlock == time-spent-in-spinlock and
>>> number-of-cpus == 2.
>>>
>>> 1 < 2 , so it should livelock according to you...
>>
>> There is off-by-one bug in the condition. It should be:
>> (time_spent_in_spinlock + time_spent_outside_spinlock) /
>> time_spent_in_spinlock < number_of_cpus
>>
>> ... or if you divide it by time_spent_in_spinlock:
>> time_spent_outside_spinlock / time_spent_in_spinlock + 1 < number_of_cpus
>>
>>> ...but afaict this should work okay. Even if spinlocks are very
>>> unfair, as long as time-outside and time-inside comes in big chunks,
>>> it should work.
>>>
>>> If you are unlucky, one cpu may stall for a while, but... I see no
>>> livelock.
>>
>> If some rogue threads (and it may not even be intetional) call the same
>> syscall stressing the one spinlock all the time, other syscalls needing
>> the same spinlock may stall.
>
> Fortunately, they'll unstall with probability of 1... so no, I do not
> think this is real problem.

You can't tell that CPUs behave exactly probabilistically --- it may
happen that one gets out of the wait loop always too late.

SMP buses have complex protocols to prevent starvation in case all CPUs
are writing to the same cache line and similar --- however it is unusable
againt spinlock starvation.

> If someone takes semaphore in syscall (we do), same problem may
> happen, right...? Without need for 2048 cpus. Maybe semaphores/mutexes
> are fair (or mostly fair) these days, but rwlocks may not be or
> something.

Scheduler increases priority of sleeping process, so starving process
should be waken up first. But if there are so many processes, that process
that passed the semaphore, sleeps and tries to take the semaphore has
already increased priority to the level of process that waited on the
semaphor, livelock can happen too.

Mikulas

> Pavel
>
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>

2006-11-10 09:03:18

by Pavel Machek

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

Hi!

> >>If some rogue threads (and it may not even be intetional) call the same
> >>syscall stressing the one spinlock all the time, other syscalls needing
> >>the same spinlock may stall.
> >
> >Fortunately, they'll unstall with probability of 1... so no, I do not
> >think this is real problem.
>
> You can't tell that CPUs behave exactly probabilistically --- it may
> happen that one gets out of the wait loop always too late.

Well, I don't need them to be _exactly_ probabilistical.

Anyway, if you have 2048 CPUs... you can perhaps get some non-broken
ones.

> >If someone takes semaphore in syscall (we do), same problem may
> >happen, right...? Without need for 2048 cpus. Maybe semaphores/mutexes
> >are fair (or mostly fair) these days, but rwlocks may not be or
> >something.
>
> Scheduler increases priority of sleeping process, so starving process
> should be waken up first. But if there are so many processes, that
>process

I do not think this is how Linux scheduler works.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-10 15:20:41

by Mikulas Patocka

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

Hi!

>>>> If some rogue threads (and it may not even be intetional) call the same
>>>> syscall stressing the one spinlock all the time, other syscalls needing
>>>> the same spinlock may stall.
>>>
>>> Fortunately, they'll unstall with probability of 1... so no, I do not
>>> think this is real problem.
>>
>> You can't tell that CPUs behave exactly probabilistically --- it may
>> happen that one gets out of the wait loop always too late.
>
> Well, I don't need them to be _exactly_ probabilistical.
>
> Anyway, if you have 2048 CPUs... you can perhaps get some non-broken
> ones.

No intel document guarantees you that if more CPUs simultaneously execute
locked cmpxchg in a loop that a CPU will see compare success in a finite
time. In fact, CPUs can't guarantee this at all, because they don't know
that they're executing a spinlock --- for them its just an instruction
stream like anything else.

Intel only guarantees that cmpxchg (or any other instruction) completes in
finite time, but it doesn't say anything about the result of it.

>>> If someone takes semaphore in syscall (we do), same problem may
>>> happen, right...? Without need for 2048 cpus. Maybe semaphores/mutexes
>>> are fair (or mostly fair) these days, but rwlocks may not be or
>>> something.
>>
>> Scheduler increases priority of sleeping process, so starving process
>> should be waken up first. But if there are so many processes, that
>> process
>
> I do not think this is how Linux scheduler works.
> Pavel

<= 2.4 scheduler worked exactly like this. 2.6 has it more complicated,
but does similar thing. But you are right that starvation on semaphore can
happen, if the process has too high nice value, it will never be risen
above other processes.

Mikulas

2006-11-10 16:15:39

by Alan

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]


> Intel only guarantees that cmpxchg (or any other instruction) completes in
> finite time, but it doesn't say anything about the result of it.

They don't even guarantee that... (hlt for example) or instructions SMM trapping into stuff that doesn't come back

</pedant>

Alan

2006-11-12 18:16:44

by Pavel Machek

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

Hi!

> >>You can't tell that CPUs behave exactly
> >>probabilistically --- it may
> >>happen that one gets out of the wait loop always too
> >>late.
> >
> >Well, I don't need them to be _exactly_
> >probabilistical.
> >
> >Anyway, if you have 2048 CPUs... you can perhaps get
> >some non-broken
> >ones.
>
> No intel document guarantees you that if more CPUs
> simultaneously execute locked cmpxchg in a loop that a

If we are talking 2048 cpus, we are talking ia64.

> CPU will see compare success in a finite time. In fact,
> CPUs can't guarantee this at all, because they don't
> know that they're executing a spinlock --- for them its
> just an instruction stream like anything else.

...even i386 has monitor/mwait these days.
Pavel
--
Thanks for all the (sleeping) penguins.

2006-11-12 20:07:27

by Mikulas Patocka

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

Hi!

> Hi!
>
>>>> You can't tell that CPUs behave exactly
>>>> probabilistically --- it may
>>>> happen that one gets out of the wait loop always too
>>>> late.
>>>
>>> Well, I don't need them to be _exactly_
>>> probabilistical.
>>>
>>> Anyway, if you have 2048 CPUs... you can perhaps get
>>> some non-broken
>>> ones.
>>
>> No intel document guarantees you that if more CPUs
>> simultaneously execute locked cmpxchg in a loop that a
>
> If we are talking 2048 cpus, we are talking ia64.

IA64 spinlock is locked cmpxchg, if failed than pause (i386 equivalent of
rep nop) read the value, and if unlocked, try cmpxchg again.

There is no fairness in it.

>> CPU will see compare success in a finite time. In fact,
>> CPUs can't guarantee this at all, because they don't
>> know that they're executing a spinlock --- for them its
>> just an instruction stream like anything else.
>
> ...even i386 has monitor/mwait these days.

It also doesn't guarantee that subsequent locked instruction will take the
lock after finite number of loops.

Mikulas

> Pavel
> --
> Thanks for all the (sleeping) penguins.
>

2006-11-14 06:24:31

by Albert Cahalan

[permalink] [raw]
Subject: Re: 2048 CPUs [was: Re: New filesystem for Linux]

> >> No intel document guarantees you that if more CPUs
> >> simultaneously execute locked cmpxchg in a loop that a
> >
> > If we are talking 2048 cpus, we are talking ia64.
>
> IA64 spinlock is locked cmpxchg, if failed than pause (i386 equivalent of
> rep nop) read the value, and if unlocked, try cmpxchg again.
>
> There is no fairness in it.

I suppose we could use something better.

There is the MCS lock, the related CLH lock, and IBM's
improvement on the MCS lock. As with RCU, we'd need
to get IBM's permission to use their lock. (so, how did we
get permission for RCU?) The basic MCS lock is also
patented I think.

http://www.cs.rochester.edu/~scott/professional/Dijkstra/presentation.html