2001-04-18 17:10:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores


[ Cc'd to linux-kernel, to get feedback etc. I've already talked this over
with some people a long time ago, but more people might get interested ]

On Tue, 17 Apr 2001, Mike Kravetz wrote:
>
> In the near future, I should have some time to begin
> working on a prototype implementation. One thing that
> I don't remember too clearly is a reference you made to
> the System V semaphore implementation. I'm pretty sure
> you indicated any new light weight implementation should
> not be based on the System V APIs. Is this correct, or
> did I remember incorrectly?

It's correct. I don't see any way the kernel can do the SysV semantics for
"cleanup" for a semaphore when a process dies in an uncontrolled manner
(or do it fast enough even when it can use at_exit() etc). The whole point
of fast semaphores would be to avoid the kernel entry entirely for the
non-contention case, which basically means that the kernel doesn't even
_know_ who holds the semaphore at any given moment. So the kernel cannot
do the cleanups on process exit that are part of the SysV semantics.

My personal absolute favourite "fast semaphore" implementation is as
follows. First the user interface, just to make it clear that the
implementation is very far from the interface:

/*
* a fast semaphore is a 128-byte opaque thing,
* aligned on a 128-byte boundary. This is partly
* to minimize false sharing in the L1 (we assume
* that 128-byte cache-lines are going to be fairly
* common), but also to allow the kernel to hide
* data there
*/
struct fast_semaphore {
unsigned int opaque[32];
} __attribute__((aligned, 64));

struct fast_semaphore *FS_create(char *ID);
int FS_down(struct fast_semaphore *, unsigned long timeout);
void FS_up(struct fast_semaphore *);

would basically be the interface. People would not need to know what the
implementation is like. Add to taste (ie make rw-semaphores, etc), but the
above is a kind of "fairly minimal thing". So "trydown()" would just be a
FS_down() with a zero timeout, for example.

Anyway, the implementation would be roughly:

- FS_create is responsible for allocating a shared memory region
at "FS_create()" time. This is what the ID is there for: a "anonymous"
semaphore would have an ID of NULL, and could only be used by threads
or across a fork(): it would basically be done with a MAP_ANON |
MAP_SHARED, and the pointer returned would just be a pointer to that
memory.

So FS_create() starts out by allocating the backing store for the
semaphore. This can basically be done in user space, although the
kernel does need to get involved for the second part of it, which
is to (a) allocate a kernel "backing store" thing that contains the
waiters and the wait-queues for other processes and (b) fill in the
opaque 128-bit area with the initial count AND the magic to make it
fly. More on the magic later.

So the second part of FS_create needs a new system call.

- FS_down() and FS_up() would be two parts: the fast case (no
contention), very similar to what the Linux kernel itself uses. And the
slow case (contention), which ends up being a system call. You'd have
something like this on x86 in user space:

extern void FS_down(struct fast_semahore *fs,
unsigned long timeout) __attribute__((regparm(3)));

/* Four-instruction fast-path: the call plus these ones */
FS_down:
lock ; decl (%edx)
js FS_down_contention
ret
FS_down_contention:
movl $FS_down_contention_syscall,%eax
int 80
ret

(Note: the regparm(3) thing makes the arguments be passed in %edx and
%ecx - check me on details in which order, and realize that they will
show up as arguments to the system call too because the x86 system call
interface is already register-based)

FS_up() does the same - see how the kernel already knows to avoid doing
the wakup if there has been no contention, and has a fast-path that
never goes out-of-line (ie the kernel semaphore out-of-line case is the
user-level system call case).

So now we get to the "subtle" part. Getting contention right. The above
causes us to get to the kernel when we have contention, and the kernel
gets only a pointer to user space. In particular, it gets a pointer to
memory that it cannot trust, and from that _untrusted_ pointer it needs to
quickly get to the _trusted_ part, ie the part that only the kernel itself
controls (the stuff with the wait-queues etc). This is where subtlety is
needed.

The speed concerns are paramount: I am convinced that the non-contention
case is the important one, but at the same time we can't allow contention
to be _too_ costly either. The system call is fairly cheap (and already
acts as a first-level back-off, so that's ok), but we can't afford to
spend more time than we need here.

So in my opinion the only reasonable approach is to have a kernel pointer
in the untrusted memory, and then have ways to quickyl validate the
pointer. My preferred approach:

- the first word of the "opaque" semaphore is obviously the semaphore
count (we already used it that way in the user-space thing).
- the second word of the semaphore is the pointer to kernel space that
was set up at kernel portion of FS_create.
- an arbitrary part (say 256 bits) of the rest of the semaphore are a
secure hash that the kernel did at FS_create time.

The validation boils down to:

unsigned long FS_down_system_call(
unsigned long unused, /* %ebx */
unsigned long timeout, /* %ecx */
struct fast_semaphore *fs) /* %edx */
{
struct kernel_fast_sem *kfs;

if ((unsigned long) fs & 127)
goto bad_sem;
if (!access_ok(VERIFY_READ, fs, 128))
goto bad_sem;

/*
* See if the system call already caused
* us to become un-contended. We don't need
* the kernel pointer for this, and thus
* we don't need the verification overhead.
*/
if (FS_trydown(fs))
return 0;

kfs = __get_user(fs->opaque+1);

/*
* Verify that it might be a valid kernel pointer
* before we even try to dereference it
*/
if ((unsigned long) kfs & 7)
goto bad_sem;
if (kfs < TASK_SIZE)
goto bad_sem;
if (kfs > TASK_SIZE+640k && kfs < TASK_SIZE + 1M)
goto bad_sem;
if (kfs > high_mem)
goto bad_sem;

/*
* Simple first-level check, so that user space
* cannot just try to make the signature match
* whatever is in kernel memory at the time. There
* are some common kernel patterns (like all zero),
* which might otherwise allow users to pass in a
* bogus kernel pointer.
*/
if (kfs->magic != FS_SIGNATURE_MAGIC)
goto bad_sem;
if (kfs->user_address != fs)
goto bad_sem;

/*
* Ok, we know we can dereference it, and that it _looks_
* like a valid semaphore. Make sure by verify secure
* signature
*/
for (i = 0; i < FS_SIGNATURE_WORDS; i++)
if (__get_user(fs->opaque+2+i) != kfs->signature[i])
goto bad_sem;

/*
* Ok, we now have the counter (in user space in "fs")
* and the kernel part (wait queues, waiter info etc).
* Do the slow path, return success/failure.
return do_fs_down(fs, kfs, timeout);

bad_sem:
/*
* EXIT. Don't let the process try billions of bad
* combinations fast. Make him fork() for each one.
*/
do_exit(11);
}


See? The only important part is that when you create the fast semaphore in
FS_create() (and that is going to be the slow part), the signature has to
be a cryptographically secure random number so that user space cannot
spoof kernel pointers.

So the overhead for the above is

- non-contention:
zero overhead (but semaphore creation is not free)

- contention:
kernel entry (unavoidable anyway)
verification

The verification boils down to a few range checks and a (cached - we've
already looked at, or will need to look at, the other fields in the same
structures) memcmp(), so the overhead there is on the order of 30 cycles.

Security issues:

- the user could create a non-shared user-mode "fs" pointer that has the
right signature, and thus fool the kernel into using the wrong
user-mode pointer.

Note that this is OK. The kernel won't mess up its own integrity, it
will just get the wrong answer. Who cares if the kernel allows multiple
users to enter if they are bad users?

- The user must _not_ be able to fool the kernel into using a completely
non-existing semaphore.

Comments?

Linus


2001-04-18 18:13:29

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: light weight user level semaphores

In article <[email protected]> you wrote:
> So FS_create() starts out by allocating the backing store for the
> semaphore. This can basically be done in user space, although the
> kernel does need to get involved for the second part of it, which
> is to (a) allocate a kernel "backing store" thing that contains the
> waiters and the wait-queues for other processes and (b) fill in the
> opaque 128-bit area with the initial count AND the magic to make it
> fly. More on the magic later.

> So the second part of FS_create needs a new system call.

How will the clean up of the kernelstore work?

> - The user must _not_ be able to fool the kernel into using a completely
> non-existing semaphore.

In that case the access to kernel level is protected by a very secure
combination of secure hash and magic number checking. But anyway there is a
small chance to get to some kernel memory unauthorized. Do you know if this is
the first (known) interface which has a more practical approach to kernel data
structure security?

If we want to be a bit more strict, we can have a pre-allocated pool of
semaphores and the kernel pointer check can add the kernelk address of the
semaphore region into account. It's faster than the checksum probably and more
secure in protecting the rest of the kernel memory. Spoofing access to other
semaphores would be still possible (but can be protected by a smaller hash).

Greetings
Bernd

2001-04-18 19:37:00

by Ulrich Drepper

[permalink] [raw]
Subject: Re: light weight user level semaphores

Linus Torvalds <[email protected]> writes:

Sounds good so far. Some comments.

> - FS_create is responsible for allocating a shared memory region
> at "FS_create()" time.

This is not so great. The POSIX shared semaphores require that an
pthread_mutex_t object placed in a shared memory region can be
initialized to work across process boundaries. I.e., the FS_create
function would actually be FS_init. There is no problem with the
kernel or the helper code at user level allocating more storage (for
the waitlist of whatever) but it must not be necessary for the user to
know about them and place them in share memory themselves.

The situation for non-shared (i.e. intra-process) semaphores are
easier. What I didn't understand is your remark about fork. The
semaphores should be cloned. Unless the shared flag is set there
should be no sharing among processes.


The rest seems OK. Thanks,

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-04-19 07:21:06

by Alon Ziv

[permalink] [raw]
Subject: Re: light weight user level semaphores

Hmm...
I already started (long ago, and abandoned since due to lack of time :-( )
down another path; I'd like to resurrect it...

My lightweight-semaphores were actually even simpler in userspace:
* the userspace struct was just a signed count and a file handle.
* Uncontended case is exactly like Linus' version (i.e., down() is decl +
js, up() is incl()).
* The contention syscall was (in my implementation) an ioctl on the FH; the
FH was a special one, from a private syscall (although with the new VFS I'd
have written it as just another specialized FS, or even referred into the
SysVsem FS).

So, there is no chance for user corruption of kernel data (as it just ain't
there...); and the contended-case cost is probably equivalent (VFS cost vs.
validation).

Hope I inspired someone...

-az

----- Original Message -----
From: "Ulrich Drepper" <[email protected]>
To: "Linus Torvalds" <[email protected]>
Cc: "Mike Kravetz" <[email protected]>; "Kernel Mailing List"
<[email protected]>
Sent: Wednesday, April 18, 2001 21:35
Subject: Re: light weight user level semaphores


> Linus Torvalds <[email protected]> writes:
>
> Sounds good so far. Some comments.
>
> > - FS_create is responsible for allocating a shared memory region
> > at "FS_create()" time.
>
> This is not so great. The POSIX shared semaphores require that an
> pthread_mutex_t object placed in a shared memory region can be
> initialized to work across process boundaries. I.e., the FS_create
> function would actually be FS_init. There is no problem with the
> kernel or the helper code at user level allocating more storage (for
> the waitlist of whatever) but it must not be necessary for the user to
> know about them and place them in share memory themselves.
>
> The situation for non-shared (i.e. intra-process) semaphores are
> easier. What I didn't understand is your remark about fork. The
> semaphores should be cloned. Unless the shared flag is set there
> should be no sharing among processes.
>
>
> The rest seems OK. Thanks,
>
> --
> ---------------. ,-. 1325 Chesapeake Terrace
> Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
> Red Hat `--' drepper at redhat.com `------------------------
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>

2001-04-19 08:54:19

by Abramo Bagnara

[permalink] [raw]
Subject: Re: light weight user level semaphores

Alon Ziv wrote:
>
> Hmm...
> I already started (long ago, and abandoned since due to lack of time :-( )
> down another path; I'd like to resurrect it...
>
> My lightweight-semaphores were actually even simpler in userspace:
> * the userspace struct was just a signed count and a file handle.
> * Uncontended case is exactly like Linus' version (i.e., down() is decl +
> js, up() is incl()).
> * The contention syscall was (in my implementation) an ioctl on the FH; the
> FH was a special one, from a private syscall (although with the new VFS I'd
> have written it as just another specialized FS, or even referred into the
> SysVsem FS).
>
> So, there is no chance for user corruption of kernel data (as it just ain't
> there...); and the contended-case cost is probably equivalent (VFS cost vs.
> validation).

This would also permit:
- to have poll()
- to use mmap() to obtain the userspace area

It would become something very near to sacred Unix dogmas ;-)

--
Abramo Bagnara mailto:[email protected]

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

ALSA project http://www.alsa-project.org
It sounds good!

2001-04-19 09:09:08

by Alexander Viro

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Abramo Bagnara wrote:

> Alon Ziv wrote:
> >
> > Hmm...
> > I already started (long ago, and abandoned since due to lack of time :-( )
> > down another path; I'd like to resurrect it...
> >
> > My lightweight-semaphores were actually even simpler in userspace:
> > * the userspace struct was just a signed count and a file handle.
> > * Uncontended case is exactly like Linus' version (i.e., down() is decl +
> > js, up() is incl()).
> > * The contention syscall was (in my implementation) an ioctl on the FH; the
> > FH was a special one, from a private syscall (although with the new VFS I'd
> > have written it as just another specialized FS, or even referred into the
> > SysVsem FS).
> >
> > So, there is no chance for user corruption of kernel data (as it just ain't
> > there...); and the contended-case cost is probably equivalent (VFS cost vs.
> > validation).
>
> This would also permit:
> - to have poll()
> - to use mmap() to obtain the userspace area
>
> It would become something very near to sacred Unix dogmas ;-)

I suspect that simple pipe with would be sufficient to handle contention
case - nothing fancy needed (read when you need to block, write upon up()
when you have contenders)

Would something along the lines of (inline as needed, etc.)

down:
lock decl count
js __down_failed
down_done:
ret

up:
lock incl count
jle __up_waking
up_done:
ret

__down_failed:
call down_failed
jmp down_done
__up_waking:
call up_waking
jmp up_done

down_failed()
{
read(pipe_fd, &dummy, 1);
}

up_waking()
{
write(pipe_fd, &dummy, 1);
}

be enough?
Al

2001-04-19 09:09:19

by Ingo Oeser

[permalink] [raw]
Subject: Re: light weight user level semaphores

On Thu, Apr 19, 2001 at 10:20:48AM +0200, Alon Ziv wrote:
> My lightweight-semaphores were actually even simpler in userspace:
> * the userspace struct was just a signed count and a file handle.
> * Uncontended case is exactly like Linus' version (i.e., down() is decl +
> js, up() is incl()).
> * The contention syscall was (in my implementation) an ioctl on the FH; the
> FH was a special one, from a private syscall (although with the new VFS I'd
> have written it as just another specialized FS, or even referred into the
> SysVsem FS).

This is roughly the way I would prefer it.

But I would dedicate a whole page to this struct, since this is
the granularity we can decide sharing on. This also has the
advantage, that we can include a lot of debugging info into this
page, too. Some people would like to know current contenders,
up/down ratio per second and contender etc.

Why? We have the infrastructure and all the semantics already in
place and it is well known to the programmers. We know how we
inherit this stuff, what will happen on process termination and
so on.

I thought about this myself a lot, but just didn't like the idea
to trust user space for up/down. I thought about abusing read() and
write() for down() and up(). Just doing it partially in user
space would be an significant speedup, once you got it right.

Maybe we can even combine both of it like this:

Then user space can do:

/* open or create sema4 with normal open semantics */
fd=open("/dev/sema4/myone");
sema4=mmap(NULL,getpagesize(),,,fd,0);

/* up */
atomic_inc_and_test_for_zero(sema4) && ioctl(fd,WAKE_SLEEPERS,NULL);

/* down */
atomic_dec_and_test_negative(sema4) && ioctl(fd,SLEEP_NOW,NULL);

or
/* open or create sema4 with normal open semantics */
fd=open("/dev/sema4/myone");

/* up */
write(fd, NULL,0); /* do the atomic stuff and wakeup in kernel */

/* this might be stupid, but COULD be implemented */
/* add 4 items to counter */
write(fd, NULL, 4);

/* down */
read(sama4, NULL, 0);

We could even do trylock() by default, if we open O_NONBLOCK. Or
we could do trylock sometimes using select() and poll(). This
also makes it easy to add it to existing select() loops like
Motif.

This differences could even be hidden by the libc. IIRC there are
some archs, which cannot do atomic operations without privileged
instructions, which is not acceptable in user space. Also there
are archs, which are not cache coherent (think NUMA) and where
flushing these caches to the other CPUs is privileged. Last but
not least there are clusters with process migration.

My twofold approach would solve all these problems rather simply.

It would be a libc decision on what to use now. And the libc
knows enough about the application to handle all these cases.

The only thing we still need, is what we do if a contender or
waiter ist killed. Should we send SIGPIPE? Should we simply wake
all the waiters?

And we are not creating a new namespace again, but just use the
standard UN*X one: File name space.

Hopes this "fit into namespace" solution will be considered,
because I don't like to have a new linux-only API with completely
new semantics and things to care in wrappers, even if you don't
use this stuff.

I also don't like the "kill me if I do a mistake"
that Linus proposed in the "bad_sem" label.

Comments? Flames? Overengineered?

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-19 10:47:12

by Abramo Bagnara

[permalink] [raw]
Subject: Re: light weight user level semaphores

Alexander Viro wrote:
>
> I suspect that simple pipe with would be sufficient to handle contention
> case - nothing fancy needed (read when you need to block, write upon up()
> when you have contenders)
>
> Would something along the lines of (inline as needed, etc.)
>
> down:
> lock decl count
> js __down_failed
> down_done:
> ret
>
> up:
> lock incl count
> jle __up_waking
> up_done:
> ret
>
> __down_failed:
> call down_failed
> jmp down_done
> __up_waking:
> call up_waking
> jmp up_done
>
> down_failed()
> {
> read(pipe_fd, &dummy, 1);
> }
>
> up_waking()
> {
> write(pipe_fd, &dummy, 1);
> }
>
> be enough?

There is something wonderful in this simple solution.

However I've a few doubts:
- choice policy for thread to wake is not selectable
- we separate shared memory area from file descriptor
- the implementation of down_try has neither been discussed nor
excluded, but I don't see how to implement it

The implementation of a specific filesystem seems to me more flexyble.

--
Abramo Bagnara mailto:[email protected]

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

ALSA project http://www.alsa-project.org
It sounds good!

2001-04-19 11:49:53

by Alan

[permalink] [raw]
Subject: Re: light weight user level semaphores

> My lightweight-semaphores were actually even simpler in userspace:
> * the userspace struct was just a signed count and a file handle.
> * Uncontended case is exactly like Linus' version (i.e., down() is decl +
> js, up() is incl()).
> * The contention syscall was (in my implementation) an ioctl on the FH; the
> FH was a special one, from a private syscall (although with the new VFS I'd
> have written it as just another specialized FS, or even referred into the
> SysVsem FS).

Which raises an even more interesting question. Suppose your semaphore function
wanst a magic file system but was flock on a standard file ? The contention
overhead is rather less nice than Linus proposal but it ought 8) to work
without any kernel patches

2001-04-19 16:04:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Alon Ziv wrote:
>
> * the userspace struct was just a signed count and a file handle.

The main reason I wanted to avoid a filehandle is just because it's
another name space that people already use, and that people know what the
semantics are for (ie "open()" is _defined_ to return the "lowest
available file descriptor", and people depend on that).

So if you use a file handle, you'd need to do magic - open it, and then
use dup2() to move it up high, or something. Which has its own set of
problems: just _how_ high woul dyou move it? Would it potentially disturb
an application that opens thousands of files, and knows that they get
consecutive file descriptors? Which is _legal_ and well-defined in UNIX.

However, I'm not married to the secure hash version - you could certainly
use another name-space, and something more akin to file descriptors. You
should be aware of issues like the above, though. Maybe it would be ok to
say "if you use fast semaphores, they use file descriptors and you should
no longer depend on consecutive fd's".

But note how that might make it really nasty for things like libraries:
can libraries use fast semaphores behind the back of the user? They might
well want to use the semaphores exactly for things like memory allocator
locking etc. But libc certainly cant use fd's behind peoples backs.

So personally, I actually think that you must _not_ use file descriptors.
But that doesn't mean that you couldn't have a more "file-desciptor-like"
approach.

Side note: the design _should_ allow for "lazy initialization". In
particular, it should be ok for FS_create() to not actually do a system
call at all, but just initialize the count and set a "uninitialized" flag.
And then the actual initialization would be done at "FS_down()" time, and
only if contention happens.

Why? Note that there are many cases where contention simply _cannot_
happen. The classic one is a thread-safe library that is used both by
threaded applications and by single-threaded ones, where the
single-threaded one would never actually trigger contention.

For these kinds of reasons it would actually be best to make try to
abstract the interfaces (notably the system call interface) as much as
possible, so that you can change the implementation inside the kernel
without having to recompile applications that use it. So the sanest
implementation might be one where

- FS_create is a system call that just gets a 128-byte area and an ID.
- the contention cases are plain system calls with no user-mode part to
them at all.

This allows people to modify the behaviour of the semaphores later,
_without_ having any real coupling between user-mode expectations and
kernel implementation.

For example, if the user-mode library actually does a physical "open()" or
plays games with file descriptors itself, we will -always- be stuck with
the fd approach, and we can never fix it. But if you have opaque system
calls, you mist start out with a system call that internally just does the
equivalent of the "open a file descriptor and hide it in the semaphore",
and later on the thing can be changed to do whatever else without the user
program ever even realizing..

Linus

2001-04-19 16:13:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Abramo Bagnara wrote:
>
> > [ Using file descriptors ]
>
> This would also permit:
> - to have poll()
> - to use mmap() to obtain the userspace area
>
> It would become something very near to sacred Unix dogmas ;-)

No, this is NOT what the UNIX dogmas are all about.

When UNIX says "everything is a file", it really means that "everything is
a stream of bytes". Things like magic operations on file desciptors are
_anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic
semantics of file descriptors is NOT Unix dogma at all, it is a horrible
corruption of the original UNIX cleanlyness.

Please don't excuse "semaphore file descriptors" with the "everything is a
file" mantra. It is not at ALL applicable.

The "everything is a file" mantra is to make pipe etc meaningful -
processes don't have to worry about whether the fd they have is from a
file open, a pipe() system call, opening a special block device, or a
socket()+connect() thing. They can just read and write. THAT is what UNIX
is all about.

And this is obviously NOT true of a "magic file descriptors for
semaphores". You can't pass it off as stdin to another process and expect
anything useful from it unless the other process _knows_ it is a special
semaphore thing and does mmap magic or something.

The greatness of UNIX comes from "everything is a stream of bytes". That's
something that almost nobody got right before UNIX. Remember VMS
structured files? Did anybody ever realize what an absolutely _idiotic_
crock the NT "CopyFile()" thing is for the same reason?

Don't confuse that with "everything should be a file descriptor". The two
have nothing to do with each other.

Linus

2001-04-19 16:34:13

by Alexander Viro

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Linus Torvalds wrote:

>
>
> On Thu, 19 Apr 2001, Abramo Bagnara wrote:
> >
> > > [ Using file descriptors ]
> >
> > This would also permit:
> > - to have poll()
> > - to use mmap() to obtain the userspace area
> >
> > It would become something very near to sacred Unix dogmas ;-)
>
> No, this is NOT what the UNIX dogmas are all about.
>
> When UNIX says "everything is a file", it really means that "everything is
> a stream of bytes". Things like magic operations on file desciptors are
> _anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic
> semantics of file descriptors is NOT Unix dogma at all, it is a horrible
> corruption of the original UNIX cleanlyness.

<applause>

The only reason for using file descriptors is that we can (AFAICS)
avoid any magic operations or new kinds of files. Honest-to-$DEITY
read() and write() on real pipes seems to be enough to implement
contention case for simple semaphores.

I see your point re sequential allocation of descriptors, but I'm not
sure that it's that serious - we need that stuff only for multi-threaded
programs and in that case we can't rely on sequentially allocated
descriptors anyway - stuff from different threads gets mixed together.

I certainly agree that introducing ioctl() in _any_ API is a shootable
offense. However, I wonder whether we really need any kernel changes
at all.
Al

2001-04-19 16:37:43

by Alan

[permalink] [raw]
Subject: Re: light weight user level semaphores

> can libraries use fast semaphores behind the back of the user? They might
> well want to use the semaphores exactly for things like memory allocator
> locking etc. But libc certainly cant use fd's behind peoples backs.

libc is entitled to, and most definitely does exactly that. Take a look at
things like gethostent, getpwent etc etc.

Alan

2001-04-19 16:44:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Alexander Viro wrote:
>
> I certainly agree that introducing ioctl() in _any_ API is a shootable
> offense. However, I wonder whether we really need any kernel changes
> at all.

I'd certainly be interested in seeing the pipe-based approach. Especially
if you make the pipe allocation lazy. That isn'tr trivial (it needs to be
done right with both up_failed() and down_failed() trying to allocate the
pipe on contention and using an atomic cmpxchg-style setting if none
existed before). It has the BIG advantage of working on old kernels, so
that you don't need to have backwards compatibility cruft in the
libraries.

Linus

2001-04-19 16:45:43

by Abramo Bagnara

[permalink] [raw]
Subject: Re: light weight user level semaphores

Linus Torvalds wrote:
>
> On Thu, 19 Apr 2001, Abramo Bagnara wrote:
> >
> > > [ Using file descriptors ]
> >
> > This would also permit:
> > - to have poll()
> > - to use mmap() to obtain the userspace area
> >
> > It would become something very near to sacred Unix dogmas ;-)
>
> No, this is NOT what the UNIX dogmas are all about.
>
> When UNIX says "everything is a file", it really means that "everything is
> a stream of bytes". Things like magic operations on file desciptors are
> _anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic
> semantics of file descriptors is NOT Unix dogma at all, it is a horrible
> corruption of the original UNIX cleanlyness.

Nice outpouring indeed, it seems taken from L'Ouvre au Noir by
Marguerite Yourcenar ;-)))

You're perfectly right but the file descriptor solution appeared to me a
nice way to work around the Unix limitation to have poll(2) working only
on file descriptor.

Said this, I've no doubt that a better poll-like syscall would solve all
that in a more elegant way.

You understand that sometime we've no other choice that to design
workarounds to minimize needed changes (and then often to maximize
acceptance probability).

OTOH you may always decide to do things in the elegant way, you've such
a responsibility for linux kernel.

--
Abramo Bagnara mailto:[email protected]

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

ALSA project http://www.alsa-project.org
It sounds good!

2001-04-19 16:47:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Alan Cox wrote:
> > can libraries use fast semaphores behind the back of the user? They might
> > well want to use the semaphores exactly for things like memory allocator
> > locking etc. But libc certainly cant use fd's behind peoples backs.
>
> libc is entitled to, and most definitely does exactly that. Take a look at
> things like gethostent, getpwent etc etc.

Ehh.. I will bet you $10 USD that if libc allocates the next file
descriptor on the first "malloc()" in user space (in order to use the
semaphores for mm protection), programs _will_ break.

You want to take the bet?

Linus

2001-04-19 17:11:55

by Alan

[permalink] [raw]
Subject: Re: light weight user level semaphores

> > libc is entitled to, and most definitely does exactly that. Take a look at
> > things like gethostent, getpwent etc etc.
>
> Ehh.. I will bet you $10 USD that if libc allocates the next file
> descriptor on the first "malloc()" in user space (in order to use the
> semaphores for mm protection), programs _will_ break.
>
> You want to take the bet?

Its not normally a good idea to take a Linus bet, but this time Im obviously
missing something. fd0-2 will be passed in (and if not then shit already
happens - see old bugtraq on the matter for setuid apps, glibc bugs)

So the C library gets fd 3
My first fopen gets fd 4.

That can already happen and isnt new. Several profiling libraries on Unix have
precisely this effect already. They dynamic link/loader will also open file
handles to do mmaps although generally you wont see those as they are closed
again after mapping.

Internationalisation code in glibc will also open and map tables during startup


2001-04-19 17:33:49

by Alexander Viro

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Linus Torvalds wrote:

>
>
> On Thu, 19 Apr 2001, Alexander Viro wrote:
> >
> > I certainly agree that introducing ioctl() in _any_ API is a shootable
> > offense. However, I wonder whether we really need any kernel changes
> > at all.
>
> I'd certainly be interested in seeing the pipe-based approach. Especially
> if you make the pipe allocation lazy. That isn'tr trivial (it needs to be
> done right with both up_failed() and down_failed() trying to allocate the
> pipe on contention and using an atomic cmpxchg-style setting if none
> existed before). It has the BIG advantage of working on old kernels, so
> that you don't need to have backwards compatibility cruft in the
> libraries.

Ehh... Non-lazy variant is just read() and write() as down_failed() and
up_wakeup() Lazy... How about

if (Lock <= 1)
goto must_open;
opened:
/* as in non-lazy case */


must_open:
pipe(fd);
lock decl Lock
jg lost_it /* Already seriously positive - clean up and go */
jl spin_and_lose
/* Lock went from 1 to 0 - go ahead */
reader = fd[0];
writer = fd[1];
Lock = MAX_INT;
goto opened;
spin_and_lose:
/* Won't take long - another guy got to do 3 memory writes */
while (Lock <= 0)
;
lost_it:
lock incl Lock
close(fd[0]);
close(fd[1]);
goto opened;

Al

2001-04-19 17:39:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Alexander Viro wrote:
>
> Ehh... Non-lazy variant is just read() and write() as down_failed() and
> up_wakeup() Lazy... How about

Looks good to me. Anybody want to try this out and test some benchmarks?

There may be problems with large numbers of semaphores, but hopefully that
won't be an issue. And the ability to select/poll on these things might
come in handy for various implementation issues (ie locks with timeouts
etc).

Linus

2001-04-19 18:25:01

by Alexander Viro

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Linus Torvalds wrote:

>
>
> On Thu, 19 Apr 2001, Alexander Viro wrote:
> >
> > Ehh... Non-lazy variant is just read() and write() as down_failed() and
> > up_wakeup() Lazy... How about
>
> Looks good to me. Anybody want to try this out and test some benchmarks?

Ugh. It doesn't look good for me. s/MAX_INT/MAX_INT>>1/ or we will
get into trouble on anything that goes into spin_and_lose. Window is
pretty narrow (notice that lost_it is OK - we only need to worry
about somebody coming in after winner drives Lock from 1 to 0
and before it gets it from 0 to MAX_INT), but we can get into serious
trouble if schedule() will hit that window.

MAX_INT/2 should be enough to deal with that, AFAICS.

However, I would _really_ like to get that code reviewed from the memory
access ordering POV. Warning: right now I'm half-asleep, so the thing can
very well be completely bogus in that area. Extra eyes would be certainly
welcome.

Al

PS: ->Lock should be set to 1 when we initialize semaphore. Destroying
semaphore should do
if (sem->Lock > 1) {
close(sem->writer);
close(sem->reader);
}

2001-04-19 19:03:56

by Olaf Titz

[permalink] [raw]
Subject: Re: light weight user level semaphores

> problems: just _how_ high woul dyou move it? Would it potentially disturb
> an application that opens thousands of files, and knows that they get
> consecutive file descriptors? Which is _legal_ and well-defined in UNIX.

Only if you close them before. The process may have been started with
arbitrary fds open.

> say "if you use fast semaphores, they use file descriptors and you should
> no longer depend on consecutive fd's".

Which you cannot anyway. Already some library routines can open fds
although they don't explicitly say so and don't have to in all
implementations, like openlog() or all the get*by*() stuff (or even
dlopen()), so you are never sure to know which or how many FDs you
actually have open.

Olaf

2001-04-19 19:28:02

by Ulrich Drepper

[permalink] [raw]
Subject: Re: light weight user level semaphores

Linus Torvalds <[email protected]> writes:

> Looks good to me. Anybody want to try this out and test some benchmarks?

I fail to see how this works across processes. How can you generate a
file descriptor for this pipe in a second process which simply shares
some memory with the first one? The first process is passive: no file
descriptor passing must be necessary.

How these things are working elsewhere is that a memory address
(probably a physical address) is used as a token. The semaphore
object is placed in the memory shared by the processes and the virtual
address is passed in the syscall.

Note that semaphores need not always be shared between processes.
This is a property the user has to choose. So the implementation can
be easier in the normal intra-process case.

In any case all kinds of user-level operations are possible as well
and all the schemes suggested for dealing with the common case without
syscalls can be applied here as well.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-04-19 19:37:00

by Alan

[permalink] [raw]
Subject: Re: light weight user level semaphores

> I fail to see how this works across processes. How can you generate a
> file descriptor for this pipe in a second process which simply shares
> some memory with the first one? The first process is passive: no file
> descriptor passing must be necessary.

mknod foo p. Or use sockets (although AF_UNIX sockets are higher latency)
Thats why I suggested using flock - its name based. Whether you mkstemp()
stuff and pass it around isnt something I care about

Files give you permissions for free too

> Note that semaphores need not always be shared between processes.
> This is a property the user has to choose. So the implementation can
> be easier in the normal intra-process case.

So you have unix file permissions on them ?


2001-04-19 19:48:49

by Ulrich Drepper

[permalink] [raw]
Subject: Re: light weight user level semaphores

Alan Cox <[email protected]> writes:

> > can libraries use fast semaphores behind the back of the user? They might
> > well want to use the semaphores exactly for things like memory allocator
> > locking etc. But libc certainly cant use fd's behind peoples backs.
>
> libc is entitled to, and most definitely does exactly that. Take a look at
> things like gethostent, getpwent etc etc.

You are mixing two completely different things.

Functions like gethostent() and catopen() are explicitly allowed to be
implemented using file descriptors. If this is allowed the standard
contains appropriate wording.

Other functions like setlocale() do use file descriptors, yes, but
these are not kept. Before the function returns they are closed.
This can cause disruptions in other threads which find descriptors not
allocated sequentially but this has to be taken into account. Rules
for multi-threaded applications are different. A single-threaded
application will not see such a difference.

Now, the standards do not allow POSIX mutexes to be implemented using
file descriptors. The same is true for unnamed POSIX semaphores. So
Linus is right, though for a different reason than he thought.

The situation is a bit different for named POSIX semaphores. These
can be implemented using file descriptors. But they don't have to and
IMO they shouldn't. A memory reference based semaphore implementation
would allow a named semaphore to be implemented using

fd = open (name)
addr = mmap (..fd..)
close (fd)
sem_syscall (addr)

i.e., it can be mapped to a memory reference again.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-04-19 20:08:19

by Ulrich Drepper

[permalink] [raw]
Subject: Re: light weight user level semaphores

Alan Cox <[email protected]> writes:

> mknod foo p. Or use sockets (although AF_UNIX sockets are higher latency)
> Thats why I suggested using flock - its name based. Whether you mkstemp()
> stuff and pass it around isnt something I care about
>
> Files give you permissions for free too

I don't want nor need file permissions. A program would look like this:


process 1:


fd = open("somefile")
addr = mmap(fd);

pthread_mutexattr_init(&attr);
pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);

pthread_mutex_init ((pthread_mutex_t *) addr, &attr);

pthread_mutex_lock ((pthread_mutex_t *) addr);

pthread_mutex_destroy((pthread_mutex_t *) addr);

process 2:

fd = open("somefile")
addr = mmap(fd);

pthread_mutex_lock ((pthread_mutex_t *) addr);


The shared mem segment can be retrieved in whatever way. The mutex in
this case is anonymous. Everybody who has access to the shared mem
*must* have access to the mutex.


For semaphores it looks similarly. First the anonymous case:

process 1:


fd = open("somefile")
addr = mmap(fd);

sem_init ((sem_t *) addr, 1, 10); // 10 is arbitrary

sem_wait ((sem_t *) addr);

sem_destroy((sem_t *) addr);


process 2:

fd = open("somefile")
addr = mmap(fd);

sem_wait ((sem_t *) addr);

Note that POSIX semaphores could be implemented with global POSIX
mutexes.


Finally, named semaphores:

semp = sem_open("somefile", O_CREAT|O_EXCL, 0600)

sem_wait (semp);

sem_close(semp);
sem_unlink(semp);


This is the only semaphore kind which maps nicely to a pipe or socket.
All the others don't. And even for named semaphores it is best to
have a separate name space like the shmfs.

> So you have unix file permissions on them ?

See above. Permissions are only allowed for named semaphores.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-04-19 20:15:04

by Alan

[permalink] [raw]
Subject: Re: light weight user level semaphores

> I don't want nor need file permissions. A program would look like this:

Your example opens/mmaps so has file permissions. Which is what I was asking

> The shared mem segment can be retrieved in whatever way. The mutex in
> this case is anonymous. Everybody who has access to the shared mem
> *must* have access to the mutex.

We agree 8)

2001-04-19 20:22:59

by Ingo Oeser

[permalink] [raw]
Subject: Re: light weight user level semaphores

On Thu, Apr 19, 2001 at 12:26:03PM -0700, Ulrich Drepper wrote:
> In any case all kinds of user-level operations are possible as well
> and all the schemes suggested for dealing with the common case without
> syscalls can be applied here as well.

Are you sure, you can implement SMP-safe, atomic operations (which you need
for all up()/down() in user space) WITHOUT using privileged
instructions on ALL archs Linux supports?

How do we do this on nccNUMA machines later? How on clusters[1]?

On what I can see in asm-*/atomic.h this is not possible, but I
probably miss sth. here ;-)

I didn't know that POSIX forbids using fds to implement a
semaphore. That's VERY bad.

Learning new APIs always means making a lot of mistakes and doing
this while we write production code, since nobody likes to pay for
experiments.

And I still see no point on speeding of creation and contention,
since these should be rare cases and the application overusing
these should be punished HARD.

Maybe someone can enlighten my on these aspects.

Regards

Ingo Oeser

[1] Ok, people already use other than Unix mechanisms for this
stuff on massive parallel computing. So this might not be an
issue. Only for libc internal sema4s
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-19 20:28:19

by Ulrich Drepper

[permalink] [raw]
Subject: Re: light weight user level semaphores

Alan Cox <[email protected]> writes:

> > I don't want nor need file permissions. A program would look like this:
>
> Your example opens/mmaps so has file permissions. Which is what I was asking

There are no permissions on the mutex object. It is the shared memory
which counts. If you would implement the global mutexes as
independent objects in the filesystem hierarchy you would somehow
magically make the permissions match those of the object containing
the memory representation of the global semaphore.


fd = open("somefile", O_CREAT|O_TRUNC, 0666)
addr=mmap(fd)
// assume attr is for a global mutex
pthread_mutex_init((pthread_mutex_t*)addr, &attr)
fchmod(fd, 0600)
fchown(fd, someuser, somegroup)

If pthread_mutex_attr() is allocating some kind of file, how do you
determine the permissions? How are they changed if the permissions to
the file change?

The kernel representation of the mutex must not be disassociated from
the shared memory region.

Even if you all think very little about Solaris, look at the kernel
interface for semaphores.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-04-19 20:41:49

by Ulrich Drepper

[permalink] [raw]
Subject: Re: light weight user level semaphores

Ingo Oeser <[email protected]> writes:

> Are you sure, you can implement SMP-safe, atomic operations (which you need
> for all up()/down() in user space) WITHOUT using privileged
> instructions on ALL archs Linux supports?

Which processors have no such instructions but are SMP-capable?

> How do we do this on nccNUMA machines later? How on clusters[1]?

Clusters are not my problem. They require additional software. And
NUMA machines maybe be requiring a certain sequence in which the
operations must be performed and the hardware should take care of the
rest.


I don't really care what the final implementation will be like. For
UP and SMP machines I definitely want to have as much as possible at
user-level. If you need a special libpthread for NUMA machines, so be
it.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-04-19 20:47:32

by Ingo Oeser

[permalink] [raw]
Subject: Re: light weight user level semaphores

On Thu, Apr 19, 2001 at 09:11:56AM -0700, Linus Torvalds wrote:
> No, this is NOT what the UNIX dogmas are all about.
>
> When UNIX says "everything is a file", it really means that "everything is
> a stream of bytes". Things like magic operations on file desciptors are
> _anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic
> semantics of file descriptors is NOT Unix dogma at all, it is a horrible
> corruption of the original UNIX cleanlyness.

Right. And on semaphores, this stream is exactly 0 bytes long.
This is perfectly normal and can be handled by all applications
I'm aware of.

My idea violates nothing here.

> Please don't excuse "semaphore file descriptors" with the "everything is a
> file" mantra. It is not at ALL applicable.
>
> The "everything is a file" mantra is to make pipe etc meaningful -
> processes don't have to worry about whether the fd they have is from a
> file open, a pipe() system call, opening a special block device, or a
> socket()+connect() thing. They can just read and write. THAT is what UNIX
> is all about.

Right. And with my approach read() and write() with a buffer
pointer != NULL would either yield an return value of "0" or
-1 and set errno=EINVAL ("object not suitable for reading/writing").
Anyway they should return IMMIDIATELY in these cases.

We already have these special semantics with devices. Look at
/dev/sgX for an example how we pass even structured data via
normal read/write (instead of "stream of bytes").

> And this is obviously NOT true of a "magic file descriptors for
> semaphores". You can't pass it off as stdin to another process and expect
> anything useful from it unless the other process _knows_ it is a special
> semaphore thing and does mmap magic or something.

see above. NOTHING special about this idea. No magic handling
involved, unless the user of the fd knows what it is. For other
users it will be just a normal fd with normal operations, since
the special case is hidden well enough.

This is even WAY simpler as all that tty-crap and similar
devices, which read/write very dependend on their actual ioctl
configuration.

But since stupid POSIX forbids using fds for semaphores
(according to Ulrich Drepper), this nice, simple and
non-intrusive solution is out.

Instead we should go with several new syscalls, user space
dependencies, strange error handling and yet-to-discuss
semantics.

Everybody else byt you would have been kicked out by the core
people for suggesting this ;-)

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-19 20:50:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores



On 19 Apr 2001, Ulrich Drepper wrote:

> Linus Torvalds <[email protected]> writes:
>
> > Looks good to me. Anybody want to try this out and test some benchmarks?
>
> I fail to see how this works across processes.

It's up to FS_create() to create whatever shared mapping is needed.

For threads, you don't need anything special.

For fork()'d helper stuff, you'd use MAP_ANON | MAP_SHARED.

For execve(), you need shm shared memory or MAP_SHARED on a file.

It all depends on your needs.

Linus

2001-04-19 20:52:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Ingo Oeser wrote:
>
> Are you sure, you can implement SMP-safe, atomic operations (which you need
> for all up()/down() in user space) WITHOUT using privileged
> instructions on ALL archs Linux supports?

Why do you care?

Sure, there are broken architectures out there. They'd need system calls.
They'd be slow. That's THEIR problem.

No sane architecture has this limitation.

Linus

2001-04-19 20:56:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Thu, 19 Apr 2001, Ingo Oeser wrote:

> On Thu, Apr 19, 2001 at 09:11:56AM -0700, Linus Torvalds wrote:
> > No, this is NOT what the UNIX dogmas are all about.
> >
> > When UNIX says "everything is a file", it really means that "everything is
> > a stream of bytes". Things like magic operations on file desciptors are
> > _anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic
> > semantics of file descriptors is NOT Unix dogma at all, it is a horrible
> > corruption of the original UNIX cleanlyness.
>
> Right. And on semaphores, this stream is exactly 0 bytes long.
> This is perfectly normal and can be handled by all applications
> I'm aware of.

It's perfectly normal, but it does NOT conform to the idea "everything is
a file".

The fact that there are other ugly examples (ioctls and special files)
does not mean that adding a new one is a good idea.

When people say "everything is a file", they mean that it can be _used_ as
a file, not that it can passably return a valid error code.

Linus

2001-04-19 21:20:04

by Ulrich Drepper

[permalink] [raw]
Subject: Re: light weight user level semaphores

Linus Torvalds <[email protected]> writes:

> > I fail to see how this works across processes.
>
> It's up to FS_create() to create whatever shared mapping is needed.

No, the point is that FS_create is *not* the one creating the shared
mapping. The user is explicitly doing this her/himself.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-04-19 21:38:45

by Alan

[permalink] [raw]
Subject: Re: light weight user level semaphores

> Are you sure, you can implement SMP-safe, atomic operations (which you need
> for all up()/down() in user space) WITHOUT using privileged
> instructions on ALL archs Linux supports?

You don't need to. For some architectures the semaphore code would always call
into the kernel. For those that allow fast locks in userspace it won't. The
API is the thing, and the public exposure would I assume be pthreads


2001-04-19 21:42:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores



On 19 Apr 2001, Ulrich Drepper wrote:
> Linus Torvalds <[email protected]> writes:
>
> > > I fail to see how this works across processes.
> >
> > It's up to FS_create() to create whatever shared mapping is needed.
>
> No, the point is that FS_create is *not* the one creating the shared
> mapping. The user is explicitly doing this her/himself.

No.

Who creates the shared mapping is _irrelevant_, because it ends up being
entirely a function of what the chosen interface is.

For example, quote often you want semaphores for threading purposes only,
and then you don't need a shared mapping at all. So you'd use the proper
interfaces for that, and for that, your "thread_semaphore()" function
would just do a malloc() and initialize the memory to zero. Doing a mmap
or something like that would just be stupid, because you're protecting
only one VM space anyway.

In other cases, you may need to have process-wide semaphores, and you'd
use "process_semaphore(char *ID)" or something, which actually does a
mmap() on a shared file. Or you'd have "fork_semaphore()" that creates a
semaphore that is valid across forks, not not valid across execve's and
cannot be passed around.

So normally the user does NOT create the shared mapping himself. Normally
you'd just use the "proper interface" for your needs, nothing more.

Sure, you can have the option of saying "I've created this shared memory
region, please make it use the generic semaphore engine code", but quite
frankly I think that is a BAD IDEA. Why? Because it won't work portably
across architectures anyway. You don't know what the requirements of the
architecture are, so it should be done by a nice "semaphore library". NOT
by the user.

Remember: these semaphores are NOT a new SysV bogosity. These semaphores
are a new interface, with sane performance and sane design. And you can
have multiple external interfaces to the same "semaphore engine".

I'm not interested in re-creating the idiocies of Sys IPC.

Linus

2001-04-19 22:36:53

by Rogier Wolff

[permalink] [raw]
Subject: Re: light weight user level semaphores

Alan Cox wrote:
> > > libc is entitled to, and most definitely does exactly that. Take a look at
> > > things like gethostent, getpwent etc etc.
> >
> > Ehh.. I will bet you $10 USD that if libc allocates the next file
> > descriptor on the first "malloc()" in user space (in order to use the
> > semaphores for mm protection), programs _will_ break.
> >
> > You want to take the bet?
>
> Its not normally a good idea to take a Linus bet, but this time Im obviously
> missing something. fd0-2 will be passed in (and if not then shit already
> happens - see old bugtraq on the matter for setuid apps, glibc bugs)
>
> So the C library gets fd 3
> My first fopen gets fd 4.

Code may
close (0);
close (1);
close (2);
...
malloc ();

/* Now open our controlling TTY/ stdin .. */
fd = open (... ) ;

After taking care of this (*), problem I find the fd trick WAY more
appealing than Linus' magic numbers. With file descriptors we have a
"small integer which can be validated quickly". We also have storage
for a private pointer somewhere in the fd structure.

If people are TOO afraid of breaking something, creating a new set of
small integers handled similarly as "fds" would do fine. (Maybe here
we'd allocate just a few, and reallocate when neccesary).

Roger.

(*) I bet that
get_sem_fd ()
{
int rv;
int fd;
fd = get_fd ();
if (fd < 5) {
rv = get_sem_fd ();
close(fd);
fd = rv;
}
return fd;
}

will not break much. (UGLY coding. Don't tell me.)

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.

2001-04-19 22:47:58

by Ulrich Drepper

[permalink] [raw]
Subject: Re: light weight user level semaphores

Linus Torvalds <[email protected]> writes:

> I'm not interested in re-creating the idiocies of Sys IPC.

I'm not talking about sysv semaphores (couldn't care less). And you
haven't read any of the mails with examples I sent.

If the new interface can be useful for anything it must allow to
implement process-shared POSIX mutexes. The user-level representation
of these mutexes are simple variables which in the case of
inter-process mutexes are placed in shared memory. These variables
must be usable with the normal pthread_mutex_lock() functions and
perform whatever is needed.

Whether the pthread_mutex_init() function for shared mutexes is doing
a lot more work and allocates even more memory, I don't care. The
standard certainly permits this and every pthread_mutex_init() must
have a pthread_mutex_destroy() which allows allocating and freeing
resources (no file descriptor, though). So, yes, your FS_create
syscall can allocate something.

But the question is what handle to put in the pthread_mutex_t variable
so the different processes can use the mutex. It cannot be a file
descriptor since it's not shared between processes. It cannot be a
pointer to some other place in the virtual memory since the place
pointed to might not be (and probably isn't if FS_create is allocating
something in the process setting up the mutex). You could put some
magic cookie in the pthread_mutex_t object the kernel can then use.


So, instead of repeating over and over again the same old story, fill
in the gaps here:


int
pthread_mutex_init (pthread_mutex_t *mutex,
const pthread_mutexattr_t *mutex_attr)
{
if (mutex_attr != NULL && mutex_attr->__pshared != 0)
{
... FILL IN HERE ...
}
else
...intra-process mutex, uninteresting here...
}

int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
if (mutex_attr != NULL && mutex_attr->__pshared != 0)
{
... FILL IN HERE ...
}
else
...intra-process mutex, uninteresting here...
}

int
pthread_mutex_destroy (pthread_mutex_t *mutex)
{
if (mutex_attr != NULL && mutex_attr->__pshared != 0)
{
... FILL IN HERE ...
}
else
...intra-process mutex, uninteresting here...
}


These functions must work with something like this:

~~~~~~~~~~~~~~~~~~~~~ cons.c ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int
main (int argc, char *argv[])
{
char tmpl[] = "/tmp/fooXXXXXX";
int fd = mkstemp (tmpl);
pthread_mutexattr_t attr;
pthread_mutex_t *m1;
pthread_mutex_t *m2;
void *addr;
volatile int *i;

pthread_mutexattr_init (&attr);
pthread_mutexattr_setpshared (&attr, PTHREAD_PROCESS_SHARED);

ftruncate (fd, 2 * sizeof (*m1) + sizeof (int));
addr = mmap (NULL, sizeof (*m1), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
m1 = addr;
m2 = m1 + 1;
i = (int *) (m2 + 1);
*i = 0;

pthread_mutex_init (m1, &attr);
pthread_mutex_lock (m1);

pthread_mutex_init (m2, &attr);
pthread_mutex_lock (m2);

if (fork () == 0)
{
char buf[10];
snprintf (buf, sizeof buf, "%d", fd);
execl ("./prod", "prod", buf, NULL);
}

while (1)
{
pthread_mutex_lock (m1);
printf ("*i = %d\n", *i);
pthread_mutex_unlock (m2);
}

return 0;
}
~~~~~~~~~~~~~~~~~~~~~~prod.c ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int
main (int argc, char *argv[])
{
int fd = atoi (argv[1]);
void *addr;
pthread_mutex_t *m1;
pthread_mutex_t *m2;
volatile int *i;

addr = mmap (NULL, sizeof (*m1), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
m1 = addr;
m2 = m1 + 1;
i = (int *) (m2 + 1);

while (1)
{
++*i;
pthread_mutex_unlock (m1);
pthread_mutex_lock (m2);
}

return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-04-20 01:35:51

by Alexander Viro

[permalink] [raw]
Subject: Re: light weight user level semaphores



On 19 Apr 2001, Ulrich Drepper wrote:

> Linus Torvalds <[email protected]> writes:
>
> > I'm not interested in re-creating the idiocies of Sys IPC.
>
> I'm not talking about sysv semaphores (couldn't care less). And you
> haven't read any of the mails with examples I sent.
>
> If the new interface can be useful for anything it must allow to
> implement process-shared POSIX mutexes.

Pardon me the bluntness, but... Why?
* on _any_ UNIX we can implement semaphore (object that has Dijkstra's
P and V operations, whatever) shared by processes that have access to pipe.
In a portable way. That's the part of pipe semantics that had been there
since way before v6. Pre-sysv, pre-POSIX, etc. When named pipes appeared
the same semantics had been carried to them. Agreed so far?
* if we have shared memory _and_ some implementation of semaphores
we can (on architectures that allow atomic_dec() and atomic_inc()) produce
semaphores that work via memory access in uncontended case and use slow
semaphores to handle contention side of the business. Nothing UNIX-specific
here.
* such objects _are_ useful. They are reasonably portable and
if they fit the task at hand and are cheaper than POSIX mutexes - that's
all rationale one could need for using them.

Sure, the variant I've posted was intra-process only, simply because it
uses normal pipes. Implementation with named pipes is also trivial -
when you map the shared area, allocate private one of the corresponding
size and keep descriptors there. End of story.

AFAICS mechanism is portable enough (and even on the architectures that
do not allow atomic userland operations we can survive - just fall back
to "slow" ones via read()/write() on pipes). And excuse me, but when
one writes an application code the question is not "how to make it use
POSIX semaphores", it's "how to get the serialization I need in a
portable way".

2001-04-20 02:47:03

by Ulrich Drepper

[permalink] [raw]
Subject: Re: light weight user level semaphores

Alexander Viro <[email protected]> writes:

> > If the new interface can be useful for anything it must allow to
> > implement process-shared POSIX mutexes.
>
> Pardon me the bluntness, but... Why?

Because otherwise there is no reason to even waste a second with this.
At least for me and everybody else who has interest in portable solutions.

I don't care how it's implemented. Look at the code example I posted.
If you can provide an implementation which can implement anonymous
inter-process mutexes then ring again. Until then I'll wait. If you
implement something else I couldn't care less since it's useless for
me.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-04-20 09:33:38

by Olaf Titz

[permalink] [raw]
Subject: Re: light weight user level semaphores

> Ehh.. I will bet you $10 USD that if libc allocates the next file
> descriptor on the first "malloc()" in user space (in order to use the
> semaphores for mm protection), programs _will_ break.

Of course, but this is a result from sloppy coding. In general, open()
can just return anything and about the only case where you can even
think of ignoring its result is this:
close(0); close(1); close(2);
open("/dev/null", O_RDWR); dup(0); dup(0);
(which is even not clean for other reasons).

I can't imagine depending on the "fact" that the first fd I open is 3,
the next is 4, etc. And what if the routine in question is not
malloc() but e.g. getpwuid()? Both are just arbitrary library
functions, and one of them clearly does open file descriptors,
depending on their implementation.

What would the reason[1] be for wanting contiguous fd space anyway?

Olaf

[1] apart from not having understood how poll() works of course.

2001-04-20 14:20:03

by Jesse Pollard

[permalink] [raw]
Subject: Re: light weight user level semaphores

Olaf Titz <[email protected]>:
> > Ehh.. I will bet you $10 USD that if libc allocates the next file
> > descriptor on the first "malloc()" in user space (in order to use the
> > semaphores for mm protection), programs _will_ break.
>
> Of course, but this is a result from sloppy coding. In general, open()
> can just return anything and about the only case where you can even
> think of ignoring its result is this:
> close(0); close(1); close(2);
> open("/dev/null", O_RDWR); dup(0); dup(0);
> (which is even not clean for other reasons).
>
> I can't imagine depending on the "fact" that the first fd I open is 3,
> the next is 4, etc. And what if the routine in question is not
> malloc() but e.g. getpwuid()? Both are just arbitrary library
> functions, and one of them clearly does open file descriptors,
> depending on their implementation.
>
> What would the reason[1] be for wanting contiguous fd space anyway?
>
> Olaf
>
> [1] apart from not having understood how poll() works of course.

Optimization use in select: If all "interesting" file id's are known
to be below "n", then only the first "n" bits in a FD_ISSET need to
be examined. As soon as the bits are scattered, it takes MUCH longer
to check for activity....

It may not be the "best" way, but what I tend to do is:

Umm - this is snipped from a multiplexed logger using FIFOs for
and indeterminate amount of data from differet utilities sending
text buffers (normally one line at a time but could be more).

static void fd_init(argc,argv)
int argc; /* number of parameters */
char **argv; /* parameter list */
{
int i,j; /* scratch counters */
static char str[50];

pnames = argv;
FD_ZERO(&in_files); /* init all file descriptor sets */

for (i = 0; i <= MAX_LOG && i < argc; i++) {
sprintf(str,"/tmp/%s",pnames[i]);
mkfifo(str,0600); /* assume it exists */
inlogfd[i] = open(str,O_RDONLY | O_NDELAY);
FD_SET(inlogfd[i],&in_files);
}
used = i;
}


Then I can scan for any activity by:

do {
while (select(MAX_LOG,&active,NULL,NULL,NULL) >= 0) {
for(i = 0; i <= used; i++) {
if (FD_ISSET(inlogfd[i],&active)) {
r=ioctl(inlogfd[i],FIONREAD,&n);
while (n > 0) {
r = (n > BUF_MAX - 1) ? BUF_MAX - 1: n;
read(inlogfd[i],buf,r);
printbuf(pnames[i],r);
n -= r;
}
}
}
active = in_files;
}
} while (errno == EINTR);

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2001-04-20 19:03:25

by Olaf Titz

[permalink] [raw]
Subject: Re: light weight user level semaphores

> Optimization use in select: If all "interesting" file id's are known
> to be below "n", then only the first "n" bits in a FD_ISSET need to
> be examined. As soon as the bits are scattered, it takes MUCH longer
> to check for activity....

That's an optimization, not a correctness issue.

> for (i = 0; i <= MAX_LOG && i < argc; i++) {
> sprintf(str,"/tmp/%s",pnames[i]);
> mkfifo(str,0600); /* assume it exists */
> inlogfd[i] = open(str,O_RDONLY | O_NDELAY);
> FD_SET(inlogfd[i],&in_files);
> }

This works regardless of what the open() returns. What does not work
is using MAX_LOG (assuming it is constant) later in the following form:

> while (select(MAX_LOG,&active,NULL,NULL,NULL) >= 0) {

I see no way around computing the maximum of the inlogfd[i] values +1.
(Which can of course be done just after the opens above. Note that the
last opened fd _is_ guaranteed to get the highest number; FD_SET is
one of the library routines where you can be pretty confident they
don't open fds...)

Btw. there are two problems even assuming you do get contiguous fds:
- an off by one error in the case of argc > MAX_LOG, the first
argument of select() is maximum fd _plus one_
- from an optimization POV it is highly advisable to take only the
real maximum anyway.

Olaf

2001-04-20 23:33:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: light weight user level semaphores

In article <[email protected]>,
Olaf Titz <[email protected]> wrote:
>> Ehh.. I will bet you $10 USD that if libc allocates the next file
>> descriptor on the first "malloc()" in user space (in order to use the
>> semaphores for mm protection), programs _will_ break.
>
>Of course, but this is a result from sloppy coding.

ABSOLUTELY NOT!

This is guaranteed behaviour of UNIX. You get file handles in order, or
you don't get them at all.

Sure, some library functions are allowed to use up file handles. But
most sure as hell are NOT.

> In general, open()
>can just return anything and about the only case where you can even
>think of ignoring its result is this:
> close(0); close(1); close(2);
> open("/dev/null", O_RDWR); dup(0); dup(0);

Which is quite common to do.

Imagine a server that starts up another process, which does exactly
something like the above: the _usual_ execve() case looks something like

pid = fork();
if (!pid) {
close(0);
close(1);
dup(pipe[0]); /* input pipe */
dup(pipe[1]); /* output pipe */
execve("child");
exit(1);
}

The above is absolutely _standard_ behaviour. It's required to work.

And btw, it's _still_ required to work even if there happens to be a
"malloc()" in between the close() and the dup() calls.

Trust me. You're arguing for clearly broken behaviour. malloc() and
friends MUST NOT open file descriptors. It _will_ break programs that
rely on traditional and documented features.

Linus

2001-04-21 10:33:52

by Olaf Titz

[permalink] [raw]
Subject: Re: light weight user level semaphores

> This is guaranteed behaviour of UNIX. You get file handles in order, or
> you don't get them at all.

You get the _next free_ file handle in order. What if your program
assumes they are all contiguous, and it is called by some other
program which forgot about FD_CLOEXEC and has some higher fds still
open? (xdm did this for ten years with its listening socket, just to
name a well-known example. So every program which asssumes contiguous
fd allocations would fail if started from an xdm session.)

If your program makes assumptions on its environment which are not
guaranteed it's broken.

What _is_ guaranteed is that after consecutive allocations of fds like
for (i=0; i<n; ++i)
fd[i]=open(...);
the following property holds:
fd[i] > fd[j] if (i > j and fd[i]!=-1 and fd[j]!=-1).
What is absolutely nowhere guaranteed is that
fd[i+1] = fd[i]+1.
It is not possible to guarantee this since any fd may be already open
before main() starts.

Of course you can guarantee that the fds are available like this:
for (i=getdtablesize(); i>=0; --i)
close(i);
and not calling library functions which may open fds.

> pid = fork();
> if (!pid) {
> close(0);
> close(1);
> dup(pipe[0]); /* input pipe */
> dup(pipe[1]); /* output pipe */
> execve("child");
> exit(1);
> }
>
> The above is absolutely _standard_ behaviour. It's required to work.

The reason why it works is that (a) the target fds are 0 and 1, and
(b) you close them explicitly. For less trivial uses, there is always
dup2().

> And btw, it's _still_ required to work even if there happens to be a
> "malloc()" in between the close() and the dup() calls.

I wouldn't count on that. It's clearly not required to work if there's
a getpwnam() in between. (I already had my share of problems with
syslog() in exactly this situation.)

Do we need a list of library functions which may open fds, like the
infamous "list of functions which may move or purge memory" on the Mac
(which grew longer with every OS release and Inside Mac supplement
issue)? Do we need to know for each library routine how it is
implemented?

> Trust me. You're arguing for clearly broken behaviour. malloc() and
> friends MUST NOT open file descriptors. It _will_ break programs that
> rely on traditional and documented features.

Traditional and documented is, in my view, the description as of the
open(2) man page:

When the
call is successful, the file descriptor returned will be
the lowest file descriptor not currently open for the pro?
cess.

which of course is exactly how it is implemented in the kernel.

Olaf

2001-04-21 23:14:47

by Edgar Toernig

[permalink] [raw]
Subject: Re: fd allocation [was: light weight user level semaphores]

Linus Torvalds wrote:
>
> pid = fork();
> if (!pid) {
> close(0);
> close(1);
> dup(pipe[0]); /* input pipe */
> dup(pipe[1]); /* output pipe */
> execve("child");
> exit(1);
> }
>
> The above is absolutely _standard_ behaviour. It's required to work.
>
> And btw, it's _still_ required to work even if there happens to be a
> "malloc()" in between the close() and the dup() calls.

Right. This is expected (and defined) behaviour. But do you have
_any_ example where this is used for fds > 2? I can't remember.
And IMHO that would be pretty fragile too. Shell scripts sometimes
open temporary fds > 2 and these are passed to called programs. I.e.

#!/bin/sh
exec 3>log
echo >&3 "script started"
ls /proc/self/fd # gets fd3 already opened
ls /proc/self/fd 4</dev/null # now 3 and 4 already in use...
# or look into any configure script...

So, IMHO as long as some library does not mess with fds 0, 1, and 2
it should be ok [1]. Yes, it would be against the standard but I
still have to find some code where this semantic is used for fds > 2.

Ciao, ET.


PS: I would prefer to keep the standard semantics but the reasons
for that are pretty weak ... ;-)

PPS: Even your sample code is fragile. It breaks if I start it
with ./a.out <&- ;-) (the close(0) is likely to close one end
of the pipe)

[1] Unintentionally setting the controlling tty may be a problem.

2001-04-22 10:03:39

by Olaf Titz

[permalink] [raw]
Subject: Re: fd allocation [was: light weight user level semaphores]

> So, IMHO as long as some library does not mess with fds 0, 1, and 2
> it should be ok [1]. Yes, it would be against the standard but I
>...
> [1] Unintentionally setting the controlling tty may be a problem.

The controlling tty is not what is first opened to fd 0 but what is
first opened, so this problem can occur at any time.

Olaf

2001-04-22 10:41:44

by Alon Ziv

[permalink] [raw]
Subject: Re: light weight user level semaphores

All of this FD allocation stuff is truly distrurbing.
This appears to be the one place where Win32 got it (almost) right---
quite about every kernel object looks to userland just like an opaque
handle, and the same operations apply to all of them.
So (e.g.) a mixed wait for socket operation or a semaphore or a timer
is very simple.
The only abstraction we have that is even remotely similar is the FD,
yet its semantics are far too strict to use this way.
The only remotely-feasible idea I've had, so far, was to allow
"negative" FDs (i.e., numbered 0x80000000+) to be used for semaphores;
this sidesteps the POSIX requirements (= we can just claim we don't
support more than 2G FDs per process), but still leaves us with the
problems of managing a split (or extremely large) FD table _and_ with
the issue of allocation policy...
Besides, as Linus already said, FDs are likely not the right abstraction
for objects without file behavior, like semaphores or timers.

[BTW, another solution is to truly support opaque "handles" to kernel
objects; I believe David Howells is already working on something like
this for Wine? The poll interface can be trivially extended to support
waiting on those...]

-az

2001-04-22 10:41:45

by Alon Ziv

[permalink] [raw]
Subject: Re: light weight user level semaphores

All of this FD allocation stuff is truly distrurbing.
This appears to be the one place where Win32 got it (almost) right---
quite about every kernel object looks to userland just like an opaque
handle, and the same operations apply to all of them.
So (e.g.) a mixed wait for socket operation or a semaphore or a timer
is very simple.
The only abstraction we have that is even remotely similar is the FD,
yet its semantics are far too strict to use this way.
The only remotely-feasible idea I've had, so far, was to allow
"negative" FDs (i.e., numbered 0x80000000+) to be used for semaphores;
this sidesteps the POSIX requirements (= we can just claim we don't
support more than 2G FDs per process), but still leaves us with the
problems of managing a split (or extremely large) FD table _and_ with
the issue of allocation policy...
Besides, as Linus already said, FDs are likely not the right abstraction
for objects without file behavior, like semaphores or timers.

[BTW, another solution is to truly support opaque "handles" to kernel
objects; I believe David Howells is already working on something like
this for Wine? The poll interface can be trivially extended to support
waiting on those...]

-az

2001-04-22 12:43:25

by Alan

[permalink] [raw]
Subject: Re: light weight user level semaphores

> All of this FD allocation stuff is truly distrurbing.
> This appears to be the one place where Win32 got it (almost) right---
> quite about every kernel object looks to userland just like an opaque
> handle, and the same operations apply to all of them.

Unix got this right, then AT&T broke it in System III. One very good reason
for pipe based semaphore stuff is precisely that it works in poll/select/SIGIO

Alan

2001-04-22 14:19:10

by David Woodhouse

[permalink] [raw]
Subject: Re: light weight user level semaphores


[email protected] said:
> [BTW, another solution is to truly support opaque "handles" to kernel
> objects; I believe David Howells is already working on something like
> this for Wine? The poll interface can be trivially extended to support
> waiting on those...]

ISTR it wasn't quite trivial to do it that way - it would require the
addition of an extra argument to the fops->poll() method.

David?

--
dwmw2


2001-04-22 14:19:51

by Alon Ziv

[permalink] [raw]
Subject: Re: light weight user level semaphores

Well, that's the reason for my small-negative-integer semaphore-FD idea...
(It won't support select() easily, but poll() is prob'ly good enough)
Still, there is the problem of read()/write()/etc. semantics; sure, we can
declare that 'negative FDs' have their own semantics which just happen to
include poll(), but it sure looks like a kludge...

-az

----- Original Message -----
From: "Alan Cox" <[email protected]>
To: "Alon Ziv" <[email protected]>
Cc: <[email protected]>
Sent: Sunday, April 22, 2001 14:44
Subject: Re: light weight user level semaphores


> > All of this FD allocation stuff is truly distrurbing.
> > This appears to be the one place where Win32 got it (almost) right---
> > quite about every kernel object looks to userland just like an opaque
> > handle, and the same operations apply to all of them.
>
> Unix got this right, then AT&T broke it in System III. One very good
reason
> for pipe based semaphore stuff is precisely that it works in
poll/select/SIGIO
>
> Alan
>
>
>

2001-04-22 14:31:53

by Alexander Viro

[permalink] [raw]
Subject: Re: light weight user level semaphores



On Sun, 22 Apr 2001, Alon Ziv wrote:

> Well, that's the reason for my small-negative-integer semaphore-FD idea...
> (It won't support select() easily, but poll() is prob'ly good enough)
> Still, there is the problem of read()/write()/etc. semantics; sure, we can
> declare that 'negative FDs' have their own semantics which just happen to
> include poll(), but it sure looks like a kludge...

You _still_ don't get it. The question is not "how to add magic kernel
objects that would look like descriptors and support a binch of
ioctls, allowing to do semaphores", it's "do we need semaphores
to be kernel-level objects". Implementation with pipes allows to avoid
the magic crap - they are real, normal pipes - nothing special from
the kernel POV. read(), write(), etc. are nothing but reading and writing
for pipes.

2001-04-22 15:08:28

by Alon Ziv

[permalink] [raw]
Subject: Re: light weight user level semaphores

Oh, I don't argue about that. (Well, almost--- see below...)
It's just that we need _some_ method for getting over the silly POSIX
FD-handling restrictions... And the negative-FDs may be the solution.

(Note I said we 'can' declare other semantics; not 'should'. So these
FDs can still be normal ones, just at the other end of the numbering
range...)

My misgivings are:
* There's no way to integrate other signalling mechanisms; e.g., we may
wish for a 'wake-all-waiters' signaller, or for a 'timed-wait' that
arrives via an FD and not as a signal
* a pipe is a more-or-less good semaphore; it may be too heavyweight,
as it's forced to pass useless [in this case] info, and we can't
control its wakeup order [although POSIX doesn't seem to require this]

[ Actually, I once had an idea of binding signals into an FD, so they can be
'read' out of it... with that, an alarm() is a 'timed-wait' waitable by
poll() :-) ]

-az

----- Original Message -----
From: "Alexander Viro" <[email protected]>
To: "Alon Ziv" <[email protected]>
Cc: <[email protected]>; "Alan Cox" <[email protected]>
Sent: Sunday, April 22, 2001 16:31
Subject: Re: light weight user level semaphores


>
>
> On Sun, 22 Apr 2001, Alon Ziv wrote:
>
> > Well, that's the reason for my small-negative-integer semaphore-FD
idea...
> > (It won't support select() easily, but poll() is prob'ly good enough)
> > Still, there is the problem of read()/write()/etc. semantics; sure, we
can
> > declare that 'negative FDs' have their own semantics which just happen
to
> > include poll(), but it sure looks like a kludge...
>
> You _still_ don't get it. The question is not "how to add magic kernel
> objects that would look like descriptors and support a binch of
> ioctls, allowing to do semaphores", it's "do we need semaphores
> to be kernel-level objects". Implementation with pipes allows to avoid
> the magic crap - they are real, normal pipes - nothing special from
> the kernel POV. read(), write(), etc. are nothing but reading and writing
> for pipes.
>
>
>

2001-04-23 13:20:23

by David Howells

[permalink] [raw]
Subject: Re: light weight user level semaphores

David Woodhouse <[email protected]> wrote:
> [email protected] said:
> > [BTW, another solution is to truly support opaque "handles" to kernel
> > objects; I believe David Howells is already working on something like
> > this for Wine?

Yes. However, it uses a different system call set to use them. They translate
to small object structures internally.

> > The poll interface can be trivially extended to support
> > waiting on those...]

No, they aren't files. I did not want to use "files" because this would incur
a fairly major penalty for each object:

struct file + struct dentry + struct inode

Which would mean that Win32 File objects would require two of each, one set to
hold the extra Win32 attributes and one set for the actual Linux file.

The way I've chosen uses somewhat less memory and should be faster.

> ISTR it wasn't quite trivial to do it that way - it would require the
> addition of an extra argument to the fops->poll() method.

Yes, the PulseEvent operation demands that all processes currently waiting on
the event should be woken, but that no processes attaching immediately
afterward get triggered.

This means that the PulseEvent handler has to be able to notify all the
processes currently waiting on the queue and only those processes. I got it to
do this by marking the waiter records each process links into the queue.

Oh... and WaitForMultipleObjects also has a "wait for all" option.

David

2001-04-23 13:48:26

by Alon Ziv

[permalink] [raw]
Subject: Re: light weight user level semaphores

From: "David Howells" <[email protected]>
> David Woodhouse <[email protected]> wrote:
> > [email protected] said:
> > > [BTW, another solution is to truly support opaque "handles" to kernel
> > > objects; I believe David Howells is already working on something like
> > > this for Wine?
>
> Yes. However, it uses a different system call set to use them. They
translate
> to small object structures internally.
>

Obviously... since they're handles, not FDs...
[BTW, are you using Windows' idea of storing the objects in process space,
in a
page that's inaccessible to the app itself, and passing pointers into this
page
as the handles?]

> > > The poll interface can be trivially extended to support
> > > waiting on those...]
>
> No, they aren't files. I did not want to use "files" because this would
incur
> a fairly major penalty for each object:
>
So what if they aren't files?
If you look at (e.g.) AIX's poll(), it allows you to put SysV semaphore IDs
in
pollfd structures. (Actually they do even more--- they have an extended
pollfd
struct; but even without it, just putting a handle instead of FD and a
special
event code in a normal pollfd should suffice...)

> struct file + struct dentry + struct inode
>
> Which would mean that Win32 File objects would require two of each, one
set to
> hold the extra Win32 attributes and one set for the actual Linux file.
>

I'm afraid I'm not following your logic in this; I believe most Win32 attrs
can
be mapped to more generic abstractions which should be able to exist at
'struct
file' level. (And even if not, a Win32 file handle could just hold two
pointers---
one to the 'struct file', and one to the extra attrs...)

> The way I've chosen uses somewhat less memory and should be faster.
>

And breaks _completely_ with the existing scheme :-/

> > ISTR it wasn't quite trivial to do it that way - it would require the
> > addition of an extra argument to the fops->poll() method.
>
> Yes, the PulseEvent operation demands that all processes currently waiting
on
> the event should be woken, but that no processes attaching immediately
> afterward get triggered.
>

Huh? Where did you get this?
Looking at my copy of MSDN (July '00), the PulseEvent remarks more-or-less
suggest
an implementation like
SetEvent(e)
ResetEvent(e)
I don't see any mention of 'currently waiting' vs 'new' waiters. (Besides, I
doubt MS
tries to solve this in the SMP case...)

> Oh... and WaitForMultipleObjects also has a "wait for all" option.

Yes, this is a valid point... I wonder if it's possible to add _just_ this
to
poll()...

-az


2001-04-23 15:37:43

by Jeff Garzik

[permalink] [raw]
Subject: Re: light weight user level semaphores

Linus Torvalds wrote:
> Trust me. You're arguing for clearly broken behaviour. malloc() and
> friends MUST NOT open file descriptors. It _will_ break programs that
> rely on traditional and documented features.

Indeed; STDIN_FILENO and friends are constants...

--
Jeff Garzik | The difference between America and England is that
Building 1024 | the English think 100 miles is a long distance and
MandrakeSoft | the Americans think 100 years is a long time.
| (random fortune)

2001-04-23 15:41:13

by David Howells

[permalink] [raw]
Subject: Re: light weight user level semaphores

Alon Ziv <[email protected]> wrote:
> Obviously... since they're handles, not FDs...
> [BTW, are you using Windows' idea of storing the objects in process space,
> in a page that's inaccessible to the app itself, and passing pointers into
> this page as the handles?]

No... I grab a page in kernel space and use it as an array. One problem is
that if an exit occurs, I have to be able to discard all attached objects
after the process's VM has been cleaned up (ie: what if it gets swapped
out?). Plus, mmap can clobber existing mappings, MapViewOfFile can't.

> So what if they aren't files?

Small structures private to my Win32 module.

> I'm afraid I'm not following your logic in this; I believe most Win32 attrs
> can be mapped to more generic abstractions which should be able to exist at
> 'struct file' level.

"Most"...

It'd mean adding extra fields into struct file (and possibly struct inode)
just for the use of this module (which probably wouldn't be accepted).

> (And even if not, a Win32 file handle could just hold two pointers---

No. the extra data has to be accessible from CreateFile (potentially running
in other processes), and this'd mean it'd have to go speculatively searching
all Win32 handle tables currently in use.

> And breaks _completely_ with the existing scheme :-/

So what? This is for a WINE accelerator/Win32 module only. There's already
been an argument over making the whole lot available as general Linux
functionality, but most people said that it'd be a bad idea because it'd not
be portable.

> Huh? Where did you get this?
> Looking at my copy of MSDN (July '00), the PulseEvent remarks more-or-less
> suggest an implementation like
> SetEvent(e)
> ResetEvent(e)

Consider the following:

WAITER 1 WAITER 2 WAITER 3 WAKER
wait-on-event wait-on-event wait-on-event
sleep sleep sleep
PulseEvent
set-event
wake(WAITER 1)
wake(WAITER 2)
wake(WAITER 3)
reset-event
wake wake wake
what-happened? what-happened? what-happened?
nothing! nothing! nothing!
sleep sleep sleep

All three waiters should wake up with a note that the event triggered, but
they don't. Plus a fourth waiter who begins to wait on the event after the
set-event is issue probably shouldn't wake up.

> I wonder if it's possible to add _just_ this to poll()...

No... there's no way to pass this to poll (or select).

Better to add a WaitForMultipleObjects() syscall and have that call
do_select() with a flag.

David

2001-04-23 19:18:31

by Ingo Oeser

[permalink] [raw]
Subject: Re: light weight user level semaphores

On Thu, Apr 19, 2001 at 09:46:17AM -0700, Linus Torvalds wrote:
> > libc is entitled to, and most definitely does exactly that. Take a look at
> > things like gethostent, getpwent etc etc.
>
> Ehh.. I will bet you $10 USD that if libc allocates the next file
> descriptor on the first "malloc()" in user space (in order to use the
> semaphores for mm protection), programs _will_ break.

But we would not open the semaphore on malloc() but instead in
the init functions of the libc. So the semaphore will be already
allocated. May be dup2()ed to some very high range
(INT_MAX-__GLIBC_MALLOC_SEM_FD) and the original fd closed.

So this will be no real problem. That's why I don't like lazy
init: May be you cannot init anymore, if you come to and
condition, where you would need it.

Also init/fini are usally very slow operations and as many things
as possible are burdend onto their shoulders.

Semaphores tend to be structures living very long (at least in
all code I've written and seen so far) so I see no point in
defering their initialization.

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-24 00:21:25

by daw

[permalink] [raw]
Subject: Re: light weight user level semaphores

Linus Torvalds wrote:
>Ehh.. I will bet you $10 USD that if libc allocates the next file
>descriptor on the first "malloc()" in user space (in order to use the
>semaphores for mm protection), programs _will_ break.
>
>You want to take the bet?

Good point. Speaking of which:
ioctl(fd, UIOCATTACHSEMA, ...);
seems to act like dup(fd) if fd was opened on "/dev/usemaclone"
(see drivers/sgi/char/usema.c). According to usema(7), this is
intended to help libraries implement semaphores.

Is this a bad coding? Should the kernel really support an ioctl()
that can silently allocate the next file descriptor? This seems
like asking for trouble. Or, maybe I just misunderstood something.

2001-04-24 00:42:01

by Alexander Viro

[permalink] [raw]
Subject: Re: light weight user level semaphores



On 24 Apr 2001, David Wagner wrote:

> Linus Torvalds wrote:
> >Ehh.. I will bet you $10 USD that if libc allocates the next file
> >descriptor on the first "malloc()" in user space (in order to use the
> >semaphores for mm protection), programs _will_ break.
> >
> >You want to take the bet?
>
> Good point. Speaking of which:
> ioctl(fd, UIOCATTACHSEMA, ...);
> seems to act like dup(fd) if fd was opened on "/dev/usemaclone"
> (see drivers/sgi/char/usema.c). According to usema(7), this is
> intended to help libraries implement semaphores.
>
> Is this a bad coding?

Yes. Not to mention side effects, it's just plain ugly. Anyone who invents
identifiers of _that_ level of ugliness should be forced to read them
aloud for a week or so, until somebody will shoot him out of mercy.
Out of curiosity: who was the author? It looks unusually nasty, even for
SGI.