2005-01-21 21:25:55

by Brandon Corey

[permalink] [raw]
Subject: Pollable Semaphores

I'm trying to find out if there is a pollable semaphore equivalent on Linux.

The main idea of a "pollable semaphore", is a semaphore with a related
file descriptor. The file descriptor can be used to select() when the
semaphore is acquirable. This provides a convenient way for users to
implement code synchronization between threads, where multiple file
descriptors are already being selected against.

We have a pollable semaphore implementation on IRIX that provides this
functionality. The API consists of a handful of calls for creation and
destruction of pollable semaphores, as well as a means to attach them
to a file descriptor. Beyond that, from the users point of view, they're
just treated as any other file descriptor.

These calls are routed through a library and then passed off to a kernel
driver that handles the events. If someone selects against a semaphore
when it's unaquirable, the driver sleeps on a synchronization variable.
When the semaphore is subsequently made aquirable, the driver will wake up
any waiters. Multiple pollable semaphores mixed with other file
descriptors can be selected against, and a wakeup will occur when any of
the semaphores become acquirable.

Is anyone aware of any equivalent functionality?

Brandon


2005-01-21 21:41:14

by Roland Dreier

[permalink] [raw]
Subject: Re: Pollable Semaphores

Brandon> I'm trying to find out if there is a pollable semaphore
Brandon> equivalent on Linux. The main idea of a "pollable
Brandon> semaphore", is a semaphore with a related file
Brandon> descriptor. The file descriptor can be used to select()
Brandon> when the semaphore is acquirable. This provides a
Brandon> convenient way for users to implement code
Brandon> synchronization between threads, where multiple file
Brandon> descriptors are already being selected against.

Yes, I believe futexes and specifically FUTEX_FD can be used to
implement this. See http://people.redhat.com/~drepper/futex.pdf for
full details.

- Roland

2005-01-21 23:25:25

by Brent Casavant

[permalink] [raw]
Subject: Re: Pollable Semaphores

On Fri, 21 Jan 2005, Roland Dreier wrote:

> Brandon> I'm trying to find out if there is a pollable semaphore
> Brandon> equivalent on Linux. The main idea of a "pollable
> Brandon> semaphore", is a semaphore with a related file
> Brandon> descriptor. The file descriptor can be used to select()
> Brandon> when the semaphore is acquirable. This provides a
> Brandon> convenient way for users to implement code
> Brandon> synchronization between threads, where multiple file
> Brandon> descriptors are already being selected against.
>
> Yes, I believe futexes and specifically FUTEX_FD can be used to
> implement this. See http://people.redhat.com/~drepper/futex.pdf for
> full details.

Thanks for pointing out that paper, I wasn't aware of it (Brandon and
I talked about this problem before he posted).

The problem listed in section 9 of that paper is a showstopper to using
FUTEX_FD. It's the very problem I ran into when trying to brainstorm a
wrapper library call around it to roughly implement the behavior Brandon
is looking for.

An additional major inconvenience is that the file descriptor can only
be used once, after which it must be closed and another reopened. If
somehow the poll/select call could reuse the same file descriptor we
could avoid a whole bunch of library glue goo to make it work that
way (i.e. special library routines similar to poll(2) and select(2)
which do some fake file descriptor table hand waving to make it look
like there's reusable futex fds).

Perhaps a new FUTEX_POLLFD would be a reasonable solution for both
problems? The semantics would be a bit different that FUTEX_FD.

It could solve the race condition (i.e. the problem in section 9 of
the paper) by the following:

1. A write to the fd is used to set the value of interest analagous to
the second parmeter to FUTEX_WAIT. This value is stored on a
per-fd (or maybe per-fd per-thread if it's not possible to have
multiple fds per futex) basis.
2. select/poll on the fd return EWOULDBLOCK if the current value of
the futex is not equal to the value of interest. Otherwise it
behaves as FUTEX_FD currently does.

I think it's even possible that this behavior could be wedged into
the current FUTEX_FD, as a write(2) to this fd is meaningless on it
at present. It's not a perfect solution, however, in that it would
be possible for one or more of the futexes that are the target of a
select(2) call to be updated so rapidly that we can never make progress
in the "write value then select" loop.

I'm not sure how to fix the reusability problem, as I haven't
determined the technical reason behind the current one-shot design.
Maybe it could be as simple as using another currently unused call
such as read(2) to reset the fd? But I suspect it's harder than
that, otherwise it wouldn't be a one-shot in the first place.

Brent

--
Brent Casavant If you had nothing to fear,
[email protected] how then could you be brave?
Silicon Graphics, Inc. -- Queen Dama, Source Wars

2005-01-22 00:13:02

by Davide Libenzi

[permalink] [raw]
Subject: Re: Pollable Semaphores

On Fri, 21 Jan 2005, Brandon Corey wrote:

> I'm trying to find out if there is a pollable semaphore equivalent on Linux.
>
> The main idea of a "pollable semaphore", is a semaphore with a related
> file descriptor. The file descriptor can be used to select() when the
> semaphore is acquirable. This provides a convenient way for users to
> implement code synchronization between threads, where multiple file
> descriptors are already being selected against.
>
> We have a pollable semaphore implementation on IRIX that provides this
> functionality. The API consists of a handful of calls for creation and
> destruction of pollable semaphores, as well as a means to attach them
> to a file descriptor. Beyond that, from the users point of view, they're
> just treated as any other file descriptor.
>
> These calls are routed through a library and then passed off to a kernel
> driver that handles the events. If someone selects against a semaphore
> when it's unaquirable, the driver sleeps on a synchronization variable.
> When the semaphore is subsequently made aquirable, the driver will wake up
> any waiters. Multiple pollable semaphores mixed with other file
> descriptors can be selected against, and a wakeup will occur when any of
> the semaphores become acquirable.
>
> Is anyone aware of any equivalent functionality?

I used pipe-based semaphores when I need that functionality (call psem_down_fd()
to get the pollable fd):

http://www.xmailserver.org/pipe-sem.c
http://www.xmailserver.org/pipe-sem.h

They have the problem of the maximum pipe buffer size that affects the
maximum count, but in my case it was fine. Or at least bugs did not come
biting me at the time ;)



- Davide

2005-01-22 03:43:11

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Pollable Semaphores

On Fri, 21 Jan 2005 17:17:51 -0600, Brent Casavant <[email protected]> wrote:

> 2. select/poll on the fd return EWOULDBLOCK if the current value of
> the futex is not equal to the value of interest. Otherwise it
> behaves as FUTEX_FD currently does.

This is the problematic part. The expected value, as you suggested,
can be handled with a write() and since the expected value is often
constant, this is a low-overhead method.

But the poll() interface is not so easy. You cannot change the poll()
semantic to return such an error. It makes really no sense.

What I thought could be done is to define instead a new POLL* constant
which signals the EWOULDBLOCK condition of the futex() syscall in the
revents member. The poll/epoll syscall would do it's normal work and
just fill all the appropriate revents. A futex value mismatch would
mean the call is not blocking at all, just as available data would be
for POLLIN.

For select, I would use the exception bitmap. The bit is set for
futex fds in the EWOULDBLOCK case.

All this _could_ work. But we've been bitten quite a few times in the
past. There might be special cases which may need at least some
additional functionality. This should be taken into account in the
original design.

So, if people are interested in this, code something up and try it.
Stress it as much as you can. I would oppose adding any new futex
interface created at a hunch if I'd be Andrew.

And is another thing to consider. There is at least one other event
which should be pollable: process (maybe threads) deaths. I was
hoping that we get support for this, perhaps in the form of polling
the /proc/PID directory. For poll(), a POLLERR value could mean the
process/thread died. For select(), once again a bit in the except
array could be set.

2005-01-22 05:52:07

by Chris Wright

[permalink] [raw]
Subject: Re: Pollable Semaphores

* Ulrich Drepper ([email protected]) wrote:
> And is another thing to consider. There is at least one other event
> which should be pollable: process (maybe threads) deaths. I was
> hoping that we get support for this, perhaps in the form of polling
> the /proc/PID directory. For poll(), a POLLERR value could mean the
> process/thread died. For select(), once again a bit in the except
> array could be set.

I have a simple patch that does just that. It worked after brief testing,
then I never went back to look at it any more. I'll see if I can't dig
it up, maybe it's useful.

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

2005-01-22 07:05:19

by Chris Wright

[permalink] [raw]
Subject: Re: Pollable Semaphores

* Chris Wright ([email protected]) wrote:
> * Ulrich Drepper ([email protected]) wrote:
> > And is another thing to consider. There is at least one other event
> > which should be pollable: process (maybe threads) deaths. I was
> > hoping that we get support for this, perhaps in the form of polling
> > the /proc/PID directory. For poll(), a POLLERR value could mean the
> > process/thread died. For select(), once again a bit in the except
> > array could be set.
>
> I have a simple patch that does just that. It worked after brief testing,
> then I never went back to look at it any more. I'll see if I can't dig
> it up, maybe it's useful.

Yeah, here it is. I refreshed it against a current kernel. It passes my
same old test, where I select on /proc/<pid>/status fd in exceptfds.

===== fs/proc/base.c 1.86 vs edited =====
--- 1.86/fs/proc/base.c 2005-01-10 17:29:31 -08:00
+++ edited/fs/proc/base.c 2005-01-21 22:51:00 -08:00
@@ -32,6 +32,7 @@
#include <linux/mount.h>
#include <linux/security.h>
#include <linux/ptrace.h>
+#include <linux/poll.h>
#include "internal.h"

/*
@@ -519,8 +520,21 @@ static ssize_t proc_info_read(struct fil
return length;
}

+static unsigned int proc_info_poll(struct file *file, poll_table *wait)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct task_struct *task = proc_task(inode);
+ wait_queue_head_t *pid_wait = &PROC_I(task->proc_dentry->d_inode)->wait;
+
+ poll_wait(file, pid_wait, wait);
+ if (!pid_alive(task))
+ return POLLPRI;
+ return 0;
+}
+
static struct file_operations proc_info_file_operations = {
.read = proc_info_read,
+ .poll = proc_info_poll,
};

static int mem_open(struct inode* inode, struct file* file)
@@ -1489,6 +1503,8 @@ void proc_pid_flush(struct dentry *proc_
{
might_sleep();
if(proc_dentry != NULL) {
+ wait_queue_head_t *pid_wait=&PROC_I(proc_dentry->d_inode)->wait;
+ wake_up_interruptible(pid_wait);
shrink_dcache_parent(proc_dentry);
dput(proc_dentry);
}
===== fs/proc/inode.c 1.31 vs edited =====
--- 1.31/fs/proc/inode.c 2005-01-04 18:48:14 -08:00
+++ edited/fs/proc/inode.c 2005-01-21 22:48:06 -08:00
@@ -97,6 +97,7 @@ static struct inode *proc_alloc_inode(st
ei->type = 0;
ei->op.proc_get_link = NULL;
ei->pde = NULL;
+ init_waitqueue_head(&ei->wait);
inode = &ei->vfs_inode;
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
return inode;
===== include/linux/proc_fs.h 1.41 vs edited =====
--- 1.41/include/linux/proc_fs.h 2005-01-07 21:44:33 -08:00
+++ edited/include/linux/proc_fs.h 2005-01-21 22:48:06 -08:00
@@ -243,6 +243,7 @@ struct proc_inode {
int (*proc_read)(struct task_struct *task, char *page);
} op;
struct proc_dir_entry *pde;
+ wait_queue_head_t wait;
struct inode vfs_inode;
};

2005-01-22 07:40:01

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Pollable Semaphores

On Fri, 21 Jan 2005 23:05:04 -0800, Chris Wright <[email protected]> wrote:
> Yeah, here it is. I refreshed it against a current kernel. It passes my
> same old test, where I select on /proc/<pid>/status fd in exceptfds.

Looks certainly attractive to me. Nice small patch. How quickly
after the death of the process is proc_pid_flush() called?

If this could go in and the futex stuff is handled, there is "only"
async I/O to handle. After that we could finally create a uniform
event mechanism at userlevel which binds all these events (I/O,
process/thread termination, sync primitives) together. Maybe support
for legacy sync primitives (SysV semaphores, msg queues) is needed as
well, don't know yet. Note that I assume that polling of POSIX
mqueues works as it did the last time I tried it.

2005-01-22 18:11:01

by Chris Wright

[permalink] [raw]
Subject: Re: Pollable Semaphores

* Ulrich Drepper ([email protected]) wrote:
> On Fri, 21 Jan 2005 23:05:04 -0800, Chris Wright <[email protected]> wrote:
> > Yeah, here it is. I refreshed it against a current kernel. It passes my
> > same old test, where I select on /proc/<pid>/status fd in exceptfds.
>
> Looks certainly attractive to me. Nice small patch. How quickly
> after the death of the process is proc_pid_flush() called?

I don't think it's called until it's reaped.

> If this could go in and the futex stuff is handled, there is "only"
> async I/O to handle. After that we could finally create a uniform
> event mechanism at userlevel which binds all these events (I/O,
> process/thread termination, sync primitives) together. Maybe support
> for legacy sync primitives (SysV semaphores, msg queues) is needed as
> well, don't know yet. Note that I assume that polling of POSIX
> mqueues works as it did the last time I tried it.

It should, AFAIK nothing there has changed.

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

2005-01-24 20:45:07

by Robert White

[permalink] [raw]
Subject: RE: Pollable Semaphores

The other obvious problem is that if there are contenders on the semaphore, the
semaphore may well not remain acquirable between the return from select and the call
to actually acquire the semaphore.

What I'd think you need is a "device" that you open, then attach a semaphore to (with
an ioctl()?), then you would write increment/decrement actions to the device file
descriptor and it would reply with the results.

By making the write and read actions be "normal" and "atomic" you end up with normal
produce/consume file operations and so normal interaction with poll/select.

I would imagine [off the top of my head] that each device node would have to have
some kind of tasklet-like (but blockable) context that could wait for the semaphore
in a non-blocking way. (Actually I'd imagine a reverse hook in the semaphore itself
that would contain a linked-list of inactive tasklet references which would each be
activated on a suitable semaphore operation (increment).

The short version is that the "available" state must be atomic with the actual
decrement operation or you will probably end up blocking in the decrement after the
select() anyway. Once you factor that in, all sorts of more-classic models suggest
themselves.

Rob White,
Casabyte, Inc.