... continuing a discussion on a similar issue some months ago ...
Hi experts,
To do Futex in userspace and many things in the Kernel there are several
predefined "atomic" macros.
Depending on the arch, there are several ways to do this:
- normal "read-modify-write" processor instructions that by design are
not interrupted (e.g. x86 and 68K non-SMP)
- normal "read-modify-write" processor instructions that by design are
not interrupted and bus locking (e.g. Ubicom 7K SMP)
- normal "read-modify-write" processor instructions with bus lock
prefix instruction (x86 SMP)
- disable, enable the Interrupt (non-SMP archs that allow for this,
often used in user space with uClinux)
- "atomic region": the global ISR (return) code detects and finalizes
atomic functions, because they are in a certain code region (non-SMP
archs that don't have atomic instructions, e.g. BlackFin)
Now I (we ?) face a new class of archs. Here I use the (currently
non-SMP), MMU-enabled NIOS2 (provided as IP-Core in Altera FPGA). I
suppose there are other IP-Core processors or otherwise expendable CPUs
with similar specs:
- no normal processor "read-modify-write" instructions that by design
are not interrupted or even bus-locking
- new "custom" processor instructions can be defined that might work
atomically, as well not interruptible as kind of bus-locking for SMP use
- these custom instructions can't access the memory in normal way
(through the MMU and the cache).
So to implement atomic instructions a dedicated RAM area would be needed
to hold the atomically accessible data. Same can't be accessed by the
CPU in other ways. This RAM area would be implemented with the new
atomic instructions and be located within the FPGA and thus could
accessed very fast (no cache issues).
Of course this would need for the Kernel and the libraries to avoid any
access to that RAM area with normal instructions. All code would need to
use the set of "atomic" macros to access that RAM. Here I suppose we
would need some additional atomic Macros such as "atomic_malloc()",
"atomic_read_word()" etc, which, with normal archs would just reproduce
the standard code.
Of course the size of the "atomic RAM" would be limited and thus a
system error would be generated if same would be exhausted. I don't
think that, with the intended "embedded" use, this would be a serious issue.
I feel that thinking about this could be viable task - even if this
might ask for some modifications in the global arch-independent Kernel
and/or library code, as I suppose this is the only way to allow for
implementing an SMP design with NIOS and similar archs. (Happily these
modifications would be "compatible" by design as with any existing arch
just the existing code would be reproduced by the new Macros and so
their use is strictly optional with any arch other than those that would
profit from the new implementation.)
What do you think ?
-Michael
On 04/08/2010 11:28 AM, David Newall wrote:
> Will the system use multiple cores? If it's only single core, perhaps
> atomic operations are really necessary?
At the moment I am just planning a single core NIOS project. But SMP is
a decent option for NIOS, as it is no hardware problem - and often has
been requested - to design multiple CPUs in a single in such an FPGA and
run SMP Linux on it.
OTOH, to do FUTEX in full (MMU) - Linux, atomic operations are
definitively necessary, as in Userland you can't disable the interrupt
without a Kernel call (which to avoid FUTEX is all about.) Currently the
plan with the NIOS (MMU, non-SMP) arch is to simulate atomic operations
with the said "atomic region" (as is done with e.g. the BlackFin arch).
(BTW.: when enumerating the ways how atomic macros are done I forgot to
mention the "new ARM" method: dedicated "load locked, store conditional"
operations that help simulating atomic behavior in user space without
the Kernel's help.)
-Michael
On 04/08/2010 11:43 AM, Giuseppe Calderaro wrote:
>
> In the past I used the mutex connected to the avalon bus to get atomic
> test and set instructions:
> http://www.altera.com/literature/hb/nios2/n2cpu_nii51020.pdf
>
Of course I do know the Altera hardware Mutex, but same just implements
a single Mutex. To allow for FUTEX, which is the base of doing threaded
application with the pthread functions provided by libc, the application
programmer can create any number of pthread_mutex'es (of course a decent
restriction by hardware to some 1000 would usually not harm with typical
embedded applications).
Of course a multithreaded application should not be done in an
architecture depending way, but should be compilable for any arch and
thus the count of allowable pthread_mutex'es should not be greatly
restricted by the underlying hardware (be it NIOS or whatever else).
Of course you are right that using custom instructions to do a decent
count of atomic (hardware-) mutex'es necessary for implementing decent
SMP aware FUTEX (or even Kernel Mutex) stuff with NIOS and similar archs
is not strictly necessary, but I think it's handy to use them instead of
I/O-elements, as the MMU would disallow user space access to "normal"
I/O addresses.
Michael
> - no normal processor "read-modify-write" instructions that by design
> are not interrupted or even bus-locking
> - new "custom" processor instructions can be defined that might work
> atomically, as well not interruptible as kind of bus-locking for SMP use
> - these custom instructions can't access the memory in normal way
> (through the MMU and the cache).
>
> So to implement atomic instructions a dedicated RAM area would be needed
> to hold the atomically accessible data. Same can't be accessed by the
> CPU in other ways. This RAM area would be implemented with the new
> atomic instructions and be located within the FPGA and thus could
> accessed very fast (no cache issues).
Take a look at sparc 32bit. That only has a single meaningful atomic
instruction (swap byte with 0xFF). It provides all the kernel atomic_t
operations via this: arch/sparc/lib/atomic32.c. That bitops are done a
similar way, which leaves spinlocks and the like.
More importantly if your true locks in the FPGA are really fast in CPU
terms then you can think of every other atomic instructions as being
implemented using
lock(cpu_atomic_instruction_lock)
do bits
unlock(cpu_atomic_instruction_lock)
(its just this is normally done in hardware/microcode)
Doing it per instruction might be a bit na?ve but I think you can
reasonably do it so that things like spinlocks use a single (or a hashed
set) of non kernel locks to implement "atomic" instructions, and as
sparc32 shows you only need a tiny subset of them to implement the rest
in their terms.
So I don't actually think you need any kernel core changes to get going,
and given the kernel dynamically allocates a lot of locks I suspect
trying to dynamically manage atomic ram allocations is going to cost more
than executing a few instructions here and there under a single very fast
hardware assisted lock.
Alan
On 04/08/2010 12:45 PM, Alan Cox wrote:
> Take a look at sparc 32bit. That only has a single meaningful atomic
> instruction (swap byte with 0xFF). It provides all the kernel atomic_t
> operations via this: arch/sparc/lib/atomic32.c. That bitops are done a
> similar way, which leaves spinlocks and the like.
>
As the NIOS and similar "load/store" archs by the underlaying hardware
design can't provide any atomic memory read-modify write operations
(without or with bus lock) at all, we need to search for a completely
CPU-hardware independent way of doing atomic stuff both with non-SMP and
SMP designs. The new ARM processors provide "load locked" / "store
conditional" instructions to overcome this with a combination of
hardware and software means.
> More importantly if your true locks in the FPGA are really fast in CPU
> terms then you can think of every other atomic instructions as being
> implemented using
>
> lock(cpu_atomic_instruction_lock)
> do bits
> unlock(cpu_atomic_instruction_lock)
>
I feel that this does not help.
The important task (for me right now) is providing decent FUTEX
(multiple of those ! ). Here you need to do atomic instructions in user
spaces on the (multiple) FUTEX handling word (so interrupt
disabling/enabling is not possible). If you try to implement this by
using a _single_ lock (surrounding the would-be atomic instruction
sequence), this IMHO only lifts the problem to another level:
If one thread locks the "cpu_atomic_instruction_lock" and now the Kernel
does a task switch and now a second thread tries to lock it as well,
same would need to do a kernel call to do the waiting. So the
"cpu_atomic_instruction_lock" is nothing but a FUTEX itself and asks for
the same complexity the FUTEX handling needs: (A) it needs an SMP-safe
user space atomic read-modify-write <here of course only a single one
instead of multiple> and (B) it needs the Kernel-infrastructure to
handle this (see the multiple articles available on Futex being nasty ;)
. The "cpu_atomic_instruction_lock" Futex only provides for atomic
operations and thus the normal (here: second level) Futex needs to stay
in place, as well.
I have no idea how the Kernel infrastructure for two-level Futex could
be implemented.
> So I don't actually think you need any kernel core changes to get going,
> and given the kernel dynamically allocates a lot of locks I suspect
> trying to dynamically manage atomic ram allocations is going to cost more
> than executing a few instructions here and there under a single very fast
> hardware assisted lock.
>
I suppose the current NIOS _Kernel_ code implements atomic operations by
disabling/enabling the interrupts. This of course is not possible with
SMP designs and OTOH it's not possible in user space. That is why
implementing FUTEX and SMP seems to ask for similar considerations.
-Michael
From: Michael Schnell <[email protected]>
Date: Thu, 08 Apr 2010 14:11:00 +0200
> If one thread locks the "cpu_atomic_instruction_lock" and now the Kernel
> does a task switch and now a second thread tries to lock it as well,
> same would need to do a kernel call to do the waiting.
Using the spinlock array idea also doesn't work in userspace
because any signal handler that tries to do an atomic on the
same object will deadlock on the spinlock.
You'll have have to do this entirely in the kernel, and your
FUTEX implementation will have to always make the futex()
system call.
It's the only way to do this and have it work completely.
We have to do something similar on sparc32, and the signal handler
deadlocks are very real, many glibc testcases that use threads
and pthread mutexes will deadlock because of this very issue if
you try to do the spinlock trick in userspace.
On 04/08/2010 02:14 PM, David Miller wrote:
> Using the spinlock array idea also doesn't work in userspace
> because any signal handler that tries to do an atomic on the
> same object will deadlock on the spinlock.
>
Yep. I was beeing afraid of signal issues when thinking about this stuff
(on and off for several months :) ), too.
That is why I finally think that a _completely_ hardware based solution
for each necessary atomic operation is necessary, as well to do Futex
(if not using said "atomic region" workaround for non-SMP), as to do SMP.
I finally think that this might be possible in a decent way with custom
instructions using a - say - 1K Word internal FPGA memory space. But
this might need some changes in the non-arch dependent Kernel and/or
library code as the atomic macros would work on "handles" instead of
pointers (of course these handles would be the old pointers with
"normal" archs) and the words used by the macros would need to be
explicitly allocated and deallocated instead of potentially being just
static variables - even though the "atomic_allocate" macro would just
create a static variable for "normal archs" and return it's address.
-Michael
> The important task (for me right now) is providing decent FUTEX
> (multiple of those ! ). Here you need to do atomic instructions in user
> spaces on the (multiple) FUTEX handling word (so interrupt
> disabling/enabling is not possible). If you try to implement this by
> using a _single_ lock (surrounding the would-be atomic instruction
> sequence), this IMHO only lifts the problem to another level:
Sorry, but FUTEX is *irrelevant*, utterly and totally. It's an
implementation of a model of fast user space locking for certain classes
of processor, and its not exposed to applications in that form.
Your first problem is to implement spin_lock and friends in the kernel,
which you can do with a single fast lock in your special memory
area/instructions. So with your extra bit of magic FPGA memory and
instructions you can implement kernel locking. At that point you've got
the main thing you need - a bootable SMP kernel that doesn't require
mangling the entire kernel. Futexes or not you need a workable SMP kernel
first.
> If one thread locks the "cpu_atomic_instruction_lock" and now the Kernel
> does a task switch and now a second thread tries to lock it as well,
The cpu atomic instruction lock lets you do kernel space. That is all.
There are other approaches including a semi-insane ways of creating an SMP
ll/sc behaviour with scheduler hacks, at least for small processor counts.
> I suppose the current NIOS _Kernel_ code implements atomic operations by
> disabling/enabling the interrupts. This of course is not possible with
> SMP designs and OTOH it's not possible in user space. That is why
> implementing FUTEX and SMP seems to ask for similar considerations.
No
The kernel needs a way of implementing a tiny set of operations fast.
Those operations need to match the existing kernel locking system. They
don't have security properties, they don't have to deal with pageable
memory. They have a fixed, definite API.
FUTEX is to all intents and purposes an internal kernel magic interface
with arch specific corner cases used by the C library to provide posix
locking. You don't even need futex. If its not the right model for your
platform you make the C library use your own totally unrelated locking
scheme internally.
Indeed if your FPGA memory doesn't go via the MMU etc I don't see how you
can implement any kind of futex like system.
If you have got some kind of way to assign/protect futex space then I'd
expect your interfaces will look something like this at the low level
open /dev/lockspace
[some kind of operation to select if we are mapping a page of
locks from another task or making a new set]
mmap /dev/lockspace [map a suitable page of lockspace
somewhere into the task]
do lock ops on the lockspace mapping
providing userspace sees a correct fast implementation of posix locks
which is what actually gets used by well behaved apps (and most people not
clinically insane given how much fun futex is to work with at the low
level)
You might also of course want to review the store ordering and visibility
rules for the CPU for both cached and uncached memory. There are several
algorithms that work rather well which don't require locked instructions
to implement locks, merely some ordering rules
(eg look up Lamports Bakery, Wallace's Bakery etc)
http://research.microsoft.com/en-us/um/people/lamport/pubs/bakery.pdf
futex is just one way of skinning that particular cat
Alan
On Thursday 08 April 2010, Michael Schnell wrote:
> On 04/08/2010 02:14 PM, David Miller wrote:
> > Using the spinlock array idea also doesn't work in userspace
> > because any signal handler that tries to do an atomic on the
> > same object will deadlock on the spinlock.
> >
> Yep. I was beeing afraid of signal issues when thinking about this stuff
> (on and off for several months :) ), too.
>
> That is why I finally think that a completely hardware based solution
> for each necessary atomic operation is necessary, as well to do Futex
> (if not using said "atomic region" workaround for non-SMP), as to do SMP.
One really expensive but safe way to do atomic operations is to always
have them done on one CPU only, and provide a mechanism for other CPUs
to ask for an atomic operation using an inter-processor-interrupt.
> I finally think that this might be possible in a decent way with custom
> instructions using a - say - 1K Word internal FPGA memory space. But
> this might need some changes in the non-arch dependent Kernel and/or
> library code as the atomic macros would work on "handles" instead of
> pointers (of course these handles would be the old pointers with
> "normal" archs) and the words used by the macros would need to be
> explicitly allocated and deallocated instead of potentially being just
> static variables - even though the "atomic_allocate" macro would just
> create a static variable for "normal archs" and return it's address.
Why can't you do a hash by memory address for this?
I would guess you can define an instruction to atomically set and check
a bit in a shared array of implementation-specific size, by passing
a token in that by convention is the memory address you want to lock.
Given two priviledged instructions
/* returns one if we got the lock, zero if someone else holds it */
bool hashlock_addr(volatile void *addr);
void hashunlock_addr(volatile void *addr);
you can do
int atomic_add_return(int i, atomic_t *v)
{
int temp;
while (!hashlock_addr(v))
;
smp_rmb();
temp = v->counter;
temp += i;
v->counter = temp;
smp_wmb();
hashunlock_addr(v);
}
static inline unsigned long __cmpxchg(volatile unsigned long *m,
unsigned long old, unsigned long new)
{
unsigned long retval;
unsigned long flags;
while (!hashlock_addr(m))
;
smp_rmb()
retval = *m;
if (retval == old) {
*m = new;
smp_wmb();
}
hashunlock_addr(m);
return retval;
}
Anything else you can build on top of these two, including the system calls
that are used from user applications. Since you never hold that bit lock for
more than a few cycles, you could do with much less than 1K bits, in theory
a single global mutex (ignoring the address entirely) would be enough.
That said, a real load-locked/store-conditional would be much more powerful,
in particular because it can also be used from user space, and it is typically
more efficient because it uses the same mechanisms as the cache coherency
protocol.
Arnd
On Thu, Apr 8, 2010 at 03:32, Michael Schnell wrote:
> - new "custom" processor instructions can be defined that might work
> atomically, as well not interruptible as kind of bus-locking for SMP use
> - these custom instructions can't access the memory in normal way
> (through the MMU and the cache).
>
> So to implement atomic instructions a dedicated RAM area would be needed
> to hold the atomically accessible data. Same can't be accessed by the
> CPU in other ways. This RAM area would be implemented with the new
> atomic instructions and be located within the FPGA and thus could
> accessed very fast (no cache issues).
>
> Of course this would need for the Kernel and the libraries to avoid any
> access to that RAM area with normal instructions. All code would need to
> use the set of "atomic" macros to access that RAM. Here I suppose we
> would need some additional atomic Macros such as "atomic_malloc()",
> "atomic_read_word()" etc, which, with normal archs would just reproduce
> the standard code.
>
> Of course the size of the "atomic RAM" would be limited and thus a
> system error would be generated if same would be exhausted. I don't
> think that, with the intended "embedded" use, this would be a serious issue.
>
> I feel that thinking about this could be viable task - even if this
> might ask for some modifications in the global arch-independent Kernel
> and/or library code, as I suppose this is the only way to allow for
> implementing an SMP design with NIOS and similar archs. (Happily these
> modifications would be "compatible" by design as with any existing arch
> just the existing code would be reproduced by the new Macros and so
> their use is strictly optional with any arch other than those that would
> profit from the new implementation.)
the Blackfin already has such an instruction (TESTSET), but since it
has restrictions (cannot be used safely on cached memory), we found it
unusable as a generic solution. this is why we went to the userspace
fixed atomic code region.
the only time we use this instruction is with the Blackfin SMP port,
and only as an inter-core lock (so it has a fixed address and there is
only one such lock).
i think you underestimate the task of customizing every userspace
point that may want atomic instructions. and no matter what sized
atomic region you pick, having it arbitrarily fail at runtime because
applications have "too many" locks is a poor solution.
-mike
On 04/09/2010 03:36 AM, Mike Frysinger wrote:
> i think you underestimate the task of customizing every userspace
> point that may want atomic instructions.
I never said this would be easy ;) Maybe it's close to impossible :(
I had hoped modifying the "atomic" macros for the arch in question in a
way that they use a handle instead of a pointer to the "atomic" variable
in a compatible way and then modifying the code that uses these macros
in a way that theses handles are created by additional macros that with
other archs can be uses optionally (in a compatible way: returning the
appropriate pointer) could work. But of course you are correct that a
lot of code would need being inspected.
> and no matter what sized
> atomic region you pick, having it arbitrarily fail at runtime because
> applications have "too many" locks is a poor solution.
>
Of course you are right, but not being able to allow for SMP is a poor
solution, too.
I suppose it's possible to generate a dedicated system error when an
application runs out of "pthread_mutex" locks (i.e. when the
atomic_malloc() function detects that the atomic RAM is exhausted). You
also can unexpectedly get a normal "out of memory" system in an
application error at any point (e.g. when opening a file) ;) I don't
suppose any "normal" embedded application will use more than some 100.
(In any case, it would be good to have a (of course no-SMP-)
distribution that uses the "atomic region" software workaround, as well,
in case the user does not want to implement the necessary custom
instructions. This would be usable for applications excessively
consuming of "atomic RAM", too.)
-Michael
On 04/08/2010 04:15 PM, Arnd Bergmann wrote:
> /* returns one if we got the lock, zero if someone else holds it */
> bool hashlock_addr(volatile void *addr);
>
I don't see how (to do FUTEX) a hashlock can be implemented in a way
that we stay in user mode when locking it and - if it's already locked -
we do a Kernel call for waiting on it being unlocked by another thread.
(This is what FUTEX does.)
-Michael
On 04/08/2010 03:37 PM, Alan Cox wrote:
> Sorry, but FUTEX is *irrelevant*, utterly and totally. It's an
> implementation of a model of fast user space locking for certain classes
> of processor, and its not exposed to applications in that form.
>
FUTEX is kind of _invisible_ to the user code.Usually it's hidden behind
pthread_mutex() and pthread_mutex_...() automatically falls back to
non-FUTEX based locking (that always do Kernel calls).
But it is very relevant, as without it, a multithreaded application will
perform a lot poorer. Here many locks might be necessary to protect
resources (mostly modifications to memory locations) used by multiple
threads (running on multiple CPUs when doing SMP). This can happen very
often and always doing two Kernel calls (lock and unlock) is a very poor
option. This problem is even worse with the archs in question as even a
simple inc can't be done in an atomic way (even using ASM) and even for
this, a lock is necessary (or directly using the "atomic" macros we talk
about and that are not correctly implemented for the said archs right now).
> Your first problem is to implement spin_lock and friends in the kernel,
> which you can do with a single fast lock in your special memory
> area/instructions.
Yep. I suppose in Kernel space this can be easily handled either by
disabling / enabling the interrupt in non-SMP designs or by a hardware
MUTEX (that for NIOS is provided by Altera as an I/O element)
> Futexes or not you need a workable SMP kernel
> first.
>
If "you" is You that might be true. But if "you" is me its utterly and
totally wrong. For my heavily multithreaded application I need FUTEX but
not SMP (yet). For me, SMP is no advantage if it does not support FUTEX
and I suppose the SMP solution with a single hardware mutex can't do
this (but maybe I'm wrong here and a software workaround is possible).
Happily Thomas (the maintainer of the NIOS distribution) agrees with me
that FUTEX is important and I hope we soon will work together on making
it possible for the MMU based NIOS distribution. Right now I just want
to discuss if doing a hardware based thing - that might help with doing
SMP one day, too - would be more agreeable than the currently suggested
way with the "atomic region" software workaround (for non-SMP). I
suppose the current NIOS _Kernel_ code implements atomic operations by
enabling / disabling interrupt, so no hardware lock is necessary.
> FUTEX is to all intents and purposes an internal kernel magic interface
> with arch specific corner cases used by the C library to provide posix
> locking. You don't even need futex. If its not the right model for your
> platform you make the C library use your own totally unrelated locking
> scheme internally.
>
pthread_mutex..() uses FUTEX if available with the arch, so FUTEX is a
way of complying to the POSIX standard. Of course there are other ways
(that pthread_mutex_...() use if FUTEX is not available) but this asks
for Kernel calls with any lock and any unlock and thus is a lot slower -
maybe unusable with certain applications.
> Indeed if your FPGA memory doesn't go via the MMU etc I don't see how you
> can implement any kind of futex like system.
>
The internal memory can be designed to go via the MMU (and the cache).
That is not the problem. The problem is that with this simple
"load-store RISC"-architecture, there is _no_ way to have the processor
do a read-modify-write operation (sequence) in user space. Not SMP safe
and not even thread safe. This _can_ be relaxed by implementing a
"custom instruction" in "hardware". But a "custom instruction" can't use
the MMU and the cache. It can be designed to use a dedicated memory area
(not accessible by other CPU instructions) or to use the any memory
_directly_ (bypassing the MMU and the cache). It would be a very nice
feature if Altera would provide a "normal" memory interface for custom
instructions, but this is not an option right now.
> providing userspace sees a correct fast implementation of posix locks
> which is what actually gets used by well behaved apps (and most people not
> clinically insane given how much fun futex is to work with at the low
> level)
>
....
> futex is just one way of skinning that particular cat
>
IMHO, the only decent way to go is to provide FUTEX perfectly compatible
to what other archs do, and thus have it be accessed via pthread_mutex()
so that any "standard" POSIX compatible multithreaded application will
take advantage of the speed gain.
Thanks,
-Michael
> If "you" is You that might be true. But if "you" is me its utterly and
> totally wrong. For my heavily multithreaded application I need FUTEX but
> not SMP (yet). For me, SMP is no advantage if it does not support FUTEX
> and I suppose the SMP solution with a single hardware mutex can't do
> this (but maybe I'm wrong here and a software workaround is possible).
You are completing missing the point
Linux + glibc platforms don't "need" futex - you need fast user space
locks. Futex is an implementation of those locks really based around
platforms with atomic instructions. People were doing fast user space
locks before Linus was even born and on machines without atomic
operations.
Seperate out
- the purpose for which the system exists (fast user locking)
- the interfaces by which it must be presented (posix pthread mutex)
- the implementation of the system
> pthread_mutex..() uses FUTEX if available with the arch, so FUTEX is a
> way of complying to the POSIX standard. Of course there are other ways
> (that pthread_mutex_...() use if FUTEX is not available) but this asks
> for Kernel calls with any lock and any unlock and thus is a lot slower -
> maybe unusable with certain applications.
Nope. Glibc allows you to implement arch specific code for these locks
which may not be FUTEX but need not be kernel based. The user space
mechanics of the futex stuff include platform specific stuff for all
platforms. You might do the blocking kernel parts of it via the futex
syscall but what matters are the uncontended fast paths which are arch
specific C library code.
> IMHO, the only decent way to go is to provide FUTEX perfectly compatible
> to what other archs do, and thus have it be accessed via pthread_mutex()
> so that any "standard" POSIX compatible multithreaded application will
> take advantage of the speed gain.
You clearly need a pthread_mutex that is fast - but the idea that this
means FUTEX is misleading and futex on each platform in the user space
side is different per architecture anyway.
The idea that you need atomic operations to do fast user space locking is
also of course wrong - you only need store ordering.
Alan
On 04/09/2010 01:54 PM, Alan Cox wrote:
> Linux + glibc platforms don't "need" futex - you need fast user space
> locks. Futex is an implementation of those locks really based around
> platforms with atomic instructions. People were doing fast user space
> locks before Linus was even born and on machines without atomic
> operations.
>
Of course you are right, but IMHO this is why FUTEX was invented (and
AFAIK, Linux himself did the first implementation). With FUTEX there is
a standard way of speeding up Posix compatible thread locking (by
implementing the user space part of FUTEX in the pthread part of libc
and defining a Kernel interface for the fast thread locking/unlocking
functions that is not (much ?) more arch depending than other Kernel
interfaces.
Of course you are right that my suggestion in fact contradicts to this
by defining the FUTEX Kernel interface to work on a kind of Handles
instead of user-space pointers (even though same would still use the
same C-type an in fact can be understood as pointers into the "Atomic
RAM, accessible only by some special ASM instructions).
Anyway, working on FUTEX for the arch allows for community based work
(in the library and in the Kernel code) instead of having anybody
interested do their own implementation right within the (propriety) user
code.
> Seperate out
> - the purpose for which the system exists (fast user locking)
>
Yep.
> - the interfaces by which it must be presented (posix pthread mutex)
>
IMHO the only decent "community-compatible" implementation is doing it
in a POSIX way and allowing for "standard Linux user space code", thus
using pthread_mutex_...() (pthreadLib, libc).
> - the implementation of the system
>
Same as any and libc and Linux Kernel stuff: community based and done
under GPL, modifying common (arch-independent) code only if necessary
and then in an as "compatible" way as possible.
> Nope. Glibc allows you to implement arch specific code for these locks
> which may not be FUTEX but need not be kernel based.
Of course you are right again. But is there rally a libc version that
implements pthread_mutex() with user space locking without using FUTEX ?
I wonder what Kernel interface it uses to perform the waiting.
In fact I did a testing program to prepare the implementation of fast
user space locking. Here I tried out several methods e.g.
- pthread_mutex_...()
- system V sema
- my own code (several variants taken from "Futexes are tricky by
Ulrich Drepper") for the user space part of FUTEX, using the FUTEX
Kernel interface
- some hombrew buggy testing code
I ran this program on PC (libc using FUTEX) and NIOS (libc using Kernel
calls)
Based on this, I do suppose that creating any _working_ method for user
space based thread locking (on any new arch) will be at least as much
work as implementing FUTEX on same.
> The user space
> mechanics of the futex stuff include platform specific stuff for all
> platforms.
The Kernel space part of FUTEX stuff also includes platform specific
code, at least with SMP designs, as it will need to work SMP-atomic.
> You might do the blocking kernel parts of it via the futex
> syscall but what matters are the uncontended fast paths which are arch
> specific C library code.
>
The fast part needs atomic user space operations that are not existing
in the arch in question and thus need some help from the Kernel (i.e.:
the said "atomic region") and/or some dedicated hardware (this is what
this thread is about).
> You clearly need a pthread_mutex that is fast - but the idea that this
> means FUTEX is misleading and futex on each platform in the user space
> side is different per architecture anyway.
>
I understand that FUTEX was invented to allow for a more "standard",
less platform depending way of implementing pthread_mutex: using the
platform's "atomic" macros for the user space part and the FUTEX system
call for the Kernel part should allow for platform independent library
source code for any arch that supports FUTEX.
> The idea that you need atomic operations to do fast user space locking is
> also of course wrong - you only need store ordering.
>
I feel that store ordering is even more difficult to be implemented than
atomicness, but I'm eager to learn about this.
I don't think the NIOS can provide this (the normal instruction set is
quite limited and the custom instructions can't access memory in a
normal way at all)
If it's only meant for non-SMP this is not a limitation for me right now.
If you think it could be done with NIOS: using store ordering, how can I
implement a pthread_mutex_..() workalike ?
-Michael
> If you think it could be done with NIOS: using store ordering, how can I
> implement a pthread_mutex_..() workalike ?
Lamport's Bakery is the classic algorithm for doing this. It has very
minimal assumptions about visibility of writes and the order they are
seen so should certainly work uniprocessor and may work SMP depending
upon the cache coherency rules.
It doesn't scale well to large numbers of threads however.
Alan
On 04/09/2010 03:15 PM, Alan Cox wrote:
> Lamport's Bakery
Found it in Wikipedia. I'm working on the text...
Thanks a lot !
-Michael
On 04/09/2010 03:15 PM, Alan Cox wrote:
> Lamport's Bakery
Hmm. The code implements the lock as a busy spinning wait. This of
course is not possible in real world, as the thread that has the lock
will not get any CPU time (e.g. in a non-SMP system).
I understand that the main purpose of the FUTEX Kernel call(s) is doing
a not-busy wait and having the "unlock" code wake the (next) waiting thread.
I did implement something like this in my testing program: enhanced by a
sleep to allow for the thread that has the lock to proceed it's work,
but this of course is not fast at all, as a short sleep() produces too
much CPU load and a long sleep produces too much latency.
So maybe this algorithm can be used instead of the hardware stuff I
suggested but it would need a FUTEX-like Kernel part, too.
-Michael
On 04/09/2010 05:14 PM, Arnd Bergmann wrote:
>
>> I don't see how (to do FUTEX) a hashlock can be implemented in a way
>> that we stay in user mode when locking it and - if it's already locked -
>> we do a Kernel call for waiting on it being unlocked by another thread.
>> (This is what FUTEX does.)
>>
> You wouldn't. For user space, you can always do a syscall for atomic
> operations, like some architectures do that lack atomic instructions.
>
Of course. This is what pthread_mutex_...() does, if the arch does not
provide the appropriate atomic functions and/or does not have FUTEX. But
this discussion is about fast thread communication, thus avoiding
syscalls is essential to me..
> You already need that with a non-SMP system anyway. As Alan explained,
> futex is only an optimization for a relatively uninteresting case
> (multi-threaded user applications), you really need to solve this for
> kernel space first, because the kernel is inherently multi-threaded.
>
I don't see why optimizing for speed and especially latency is
uninteresting (with embedded systems like the one I'm planning).
Multi-threaded user applications is exactly the case that is extremely
interesting to me and that is why I started this discussion. The
non-SMP-Kernel ( and non-FUTEX) case already is solved for NIOS
(supposedly by interrupt disabling). An SMP-Linux is not yet crafted
(and for me its a lot lower priority than decent user-space
multithreading, but of course it _is_ a valuable task).
> If you want to have atomics in user space, why not go all the way and
> make a small extension to your cache coherency logic to do load-locked/
> store-conditional as well.
Of course doing load-locked, store-conditional custom instructions was
an option I did consider, but as there is no way to access memory
through cache and MMU with custom instructions, I don't see how this
could be done, as the current way FUTEX works, the code will define the
DWORDs to be handled atomically anywhere in the user space memory. Of
course disabling the cache completely is not an option for a task that
is aimed to improve user space performance.
> As far as I understand, NIOS-2 does not
> come with a coherency logic normally, so you must already have made some
> extensions on that level in order to make the dcache coherent.
>
The user space cache coherency is *somehow* provided by the current
(non-SMP) Kernel. I understand that special considerations only are
necessary when the MMU tables are modified. There is no SMP Kernel for
NIOS yet. I in fact have no idea how cache coherency can/will be handled
in the SMP case.
> Typically, there is one dcache line that you can mark as exclusive-locked,
> which behaves just like exclusive, except that the store-conditional
> instruction only performs a store to an exclusive-locked cache line,
> setting a condition code if the cache line has transitioned to another
> state in the meantime.
>
AFAIK, the current NIOS cache hardware design does not provide for
features such as external invalidating of cache lines. So those who
start the NIOS SMP project will need to deal with that by having Altera
enhance the NIOS design and/or use additional "hardware" outside of the
CPU/cache/MMU block.
Thanks,
-Michael
Hi!
> OTOH, to do FUTEX in full (MMU) - Linux, atomic operations are
> definitively necessary, as in Userland you can't disable the interrupt
> without a Kernel call (which to avoid FUTEX is all about.) Currently the
You could create unpriviledged 'disable interrupts for 10
instructions' and 'test if interrupts are still disabled'
instructions, and base your mutex implementation on that.
But you'll have to stop calling it futex at that point...
Or you could just optimize syscalls to be really fast...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Monday 12 April 2010, Michael Schnell wrote:
> > You already need that with a non-SMP system anyway. As Alan explained,
> > futex is only an optimization for a relatively uninteresting case
> > (multi-threaded user applications), you really need to solve this for
> > kernel space first, because the kernel is inherently multi-threaded.
> >
> I don't see why optimizing for speed and especially latency is
> uninteresting (with embedded systems like the one I'm planning).
>
> Multi-threaded user applications is exactly the case that is extremely
> interesting to me and that is why I started this discussion. The
> non-SMP-Kernel ( and non-FUTEX) case already is solved for NIOS
> (supposedly by interrupt disabling). An SMP-Linux is not yet crafted
> (and for me its a lot lower priority than decent user-space
> multithreading, but of course it is a valuable task).
Ok. Your initial post didn't make it clear that this is all you are
looking for. While atomic CPU operations would solve this problem,
you don't really need to make the RAM access itself atomic,
only the instruction flow.
> > If you want to have atomics in user space, why not go all the way and
> > make a small extension to your cache coherency logic to do load-locked/
> > store-conditional as well.
> Of course doing load-locked, store-conditional custom instructions was
> an option I did consider, but as there is no way to access memory
> through cache and MMU with custom instructions, I don't see how this
> could be done, as the current way FUTEX works, the code will define the
> DWORDs to be handled atomically anywhere in the user space memory. Of
> course disabling the cache completely is not an option for a task that
> is aimed to improve user space performance.
Right. So if you cannot implement a 'test-and-set', 'exchange' or
'store-conditional' instruction, I don't think any custom instructions
will help you.
You can probably implement an atomic function in a VDSO though, without
any CPU extensions, I think this has been discussed for blackfin
before. The idea is to let the kernel check if the instruction pointer
is in the critical section of the VDSO while returning to user space.
If it is, the kernel can jump back to the caller of that function
instead of the function itself, and indicate failure so the user can
retry.
Arnd
On 04/12/2010 05:02 PM, Arnd Bergmann wrote:
>
> if you cannot implement a 'test-and-set', 'exchange' or
> 'store-conditional' instruction, I don't think any custom instructions
> will help you.
>
That is not the point. Custom instructions easily can implement
thread-safe and SMP safe atomic read-modify-write user-land-enabled
operations such as "test-and-set", "exchange" or "inc" (I don't think
that "store conditional" would be necessary). But they can't work if the
same data can be accessed by other (non-custom) CPU instructions, as the
custom-instructions would necessarily bypass the cache and the MMU while
the normal CPU instructions would use both (and as all this is done with
performance in mind disabling or invalidating the cache is out of
question and in user-land such operations are impossible, anyway. This
is why the idea of a dedicated "atomic RAM" has been developed (which
requires the arch-independent Kernel and Library code to access all
would-be "atomic" locations _only_ through arch-depending macros
<including allocating and freeing these memory locations>, even though
for "standard" archs, the resulting ASM code would (hopefully, nearly
everywhere) stay the same as it is).
> You can probably implement an atomic function in a VDSO though, without
> any CPU extensions, I think this has been discussed for blackfin
> before. The idea is to let the kernel check if the instruction pointer
> is in the critical section of the VDSO while returning to user space.
> If it is, the kernel can jump back to the caller of that function
> instead of the function itself, and indicate failure so the user can
> retry.
>
Right., I already mentioned this "Atomic Region" way to implement an
"atomic operation" workaround for non-SMP systems in the first message
of this thread.
Of course it would be advantageous to have this method for NIOS, as
well, as it can be implemented without "hardware" support and without
modifications of the arch-independent Kernel and library code. But it
will not work with SMP and it will introduces an (even if supposedly not
huge) overhead with any interrupt.
A hardware implementation would avoid this overhead, would allow for
atomic SMP user-land operations and could be the base of implementing an
SMP Kernel (even though with NIOS/SMP other issues <like
cache-synchronization> might need to be solved additionally).
-Michael
On 04/12/2010 02:54 PM, Pavel Machek wrote:
>
> You could create unpriviledged 'disable interrupts for 10
> instructions' and 'test if interrupts are still disabled'
> instructions, and base your mutex implementation on that.
>
That would be great, but AFAIK, it's not decently possible with NIOS.
The interrupt enable flag of the CPU can't be accessed by custom
instructions. The CPU has up to 32 interrupt lines that it exposes to
the custom FPGA "hardware". of course it _would_ be possible to gate all
of them and manage this additional flag by a custom instruction. We
would need to investigate how big the min/max delay between setting the
interrupt and the CPU acknowledging it is. According to this, the count
of NOPs between the "custom interrupt disable" and the load of the
atomic value needs to be chosen and how many clock cycles the interrupt
lock needs to be held.
Unfortunately, AFAIK, the CPU-external FPGA "hardware" (that implements
the custom instructions) can't see the shifting of the CPU's instruction
queue. So the lock duration only can be counted in clock cycles, but not
in instructions. The CPU might need to wait for a very large count of
clocks for accessing (instructions or data) words e.g. in external
dynamic RAM.
That is why (when considering how to implement the atomic user land
macros necessary for FUTEX) I did consider your idea to reduce the
average overhead imposed by the necessity of having the Kernel ISR code
finish a would be atomic operation. But as the delay is very uncertain,
I feel that the ISR-trick can't be dropped completely.
It might be a good idea to ask Altera to implement such a
userland-enabled instruction (disable interrupt for the next n
instructions). That would be really easy within the NIOS CPU iP-Core,
but with custom instructions we are out of luck :(.
Moreover, of course any interrupt logic only helps with the nos-SMP-case.
> But you'll have to stop calling it futex at that point...
>
FUTEX (e.g. the userland part of same) is just one paradigm (and IMHO
the most important one, as pthread_mutex_..() uses it for fast POSIX
compatible thread synchronization) that requires atomic user land
operations. So any implementation of the appropriate atomic userland
operations can be used to do FUTEX on top of it.
> Or you could just optimize syscalls to be really fast...
I trust that the Kernel developers already did that :).
My test showed that with a x86 PC, the system calls really are
astonishingly fast. But same supposedly features sophisticated hardware
to support syscalls. Nonetheless using Futex did provide a considerable
speed gain, even with SMP hardware where atomic operations are
expensive, due to cache synchronization done by hardware.
But the little old NIOS hardware of course is done using as few gates as
possible ;)
Thanks !
-Michael
On 04/08/2010 03:37 PM, Alan Cox wrote:
> FUTEX ... is an implementation ... for certain classes of processor.
Alan,
If that is true "in depth", there should be no FUTEX-relevant code,
neither in the Kernel, nor in the library, that should be considered
"arch-independent".
If so, we could happily create a "atomic hardware RAM" - based (or
whatever type of) FUTEX for the arch in question in the arch's source
files, without needing to take care of what happens in the rest of the
Kernel or Library code.
Provided this implementing the idea I presented might be a doable task.
-Michael
> > FUTEX ... is an implementation ... for certain classes of processor.
> Alan,
>
> If that is true "in depth", there should be no FUTEX-relevant code,
> neither in the Kernel, nor in the library, that should be considered
> "arch-independent".
We have a lot of kernel code which is only usable on some architectures
and a few opt out of. The largest example of course being the mm layer
which is strictly for MMU equipped systems.
The important point is that you don't *have* to use futex to provide the
posix pthread lock primitives if they don't fit your platform.
On 04/14/2010 02:57 PM, Alan Cox wrote:
> The important point is that you don't *have* to use futex to provide the
> posix pthread lock primitives if they don't fit your platform.
>
I don't *have* to use Linux to provide the function the controller is
supposed to do for the end-user :).
But Linux is a very decent way to make it happen....
-Michael