2002-06-06 07:23:06

by Rusty Russell

[permalink] [raw]
Subject: [PATCH] Futex Asynchronous Interface

These two patches (requiring the other patches I sent to the list
which can also be found on my kernel.org page) add the ability to tie
a futex to a file descriptor, for use with poll/select or SIGIO
(required by NGPT).

The method is: open /dev/futex, use sys_futex(FUTEX_AWAIT) to attach
it to a particular futex, then use select or poll (or set the fd up
for sigio signals, and expect a SIGIO).

You need to use FUTEX_AWAIT again after poll succeeds or SIGIO
(ie. it's oneshot). Calling it while a futex is already outstanding
forgets about the old futex.

The reason for this method is that it's pretty convenient for
programs, and since each one pins a page down, tying that to a struct
file * means we have an implicit limit.

Code below. Feedback welcome.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Name: Waker can unpin page, rather than waiting process
Author: Rusty Russell
Status: Tested in 2.5.20
Depends: Futex/copy-from-user.patch.gz Futex/unpin-page-fix.patch.gz
Depends: Futex/waitq.patch.gz

D: This changes the implementation so that the waker actually unpins
D: the page. This is preparation for the async interface, where the
D: process which registered interest is not in the kernel.


diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.20.19104/kernel/futex.c linux-2.5.20.19104.updated/kernel/futex.c
--- linux-2.5.20.19104/kernel/futex.c Thu Jun 6 17:13:46 2002
+++ linux-2.5.20.19104.updated/kernel/futex.c Thu Jun 6 17:14:30 2002
@@ -98,11 +98,13 @@
if (this->page == page && this->offset == offset) {
list_del_init(i);
tell_waiter(this);
+ unpin_page(this->page);
num_woken++;
if (num_woken >= num) break;
}
}
spin_unlock(&futex_lock);
+ unpin_page(page);
return num_woken;
}

@@ -192,9 +194,10 @@
}
out:
set_current_state(TASK_RUNNING);
- /* Were we woken up anyway? */
+ /* Were we woken up anyway? If so, it unpinned page. */
if (!unqueue_me(&q))
return 0;
+ unpin_page(page);
return ret;
}

@@ -225,6 +228,7 @@
if (IS_ERR(page))
return PTR_ERR(page);

+ /* On success, these routines unpin the pages themselves. */
head = hash_futex(page, pos_in_page);
switch (op) {
case FUTEX_WAIT:
@@ -236,7 +240,8 @@
default:
ret = -EINVAL;
}
- unpin_page(page);
+ if (ret < 0)
+ unpin_page(page);

return ret;
}
Name: Asynchronous interface for futexes
Author: Rusty Russell
Status: Tested on 2.5.20
Depends: Futex/comment-fix.patch.gz Futex/copy-from-user.patch.gz
Depends: Futex/no-write-needed.patch.gz Futex/unpin-page-fix.patch.gz
Depends: Futex/waitq.patch.gz Futex/waker-unpin-page.patch.gz

D: This patch adds a FUTEX_AWAIT and /dev/futex, for attaching futexes
D: to file descriptors, which can be used with poll, select or SIGIO.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.20.15557/include/linux/futex.h linux-2.5.20.15557.updated/include/linux/futex.h
--- linux-2.5.20.15557/include/linux/futex.h Sat May 25 14:34:59 2002
+++ linux-2.5.20.15557.updated/include/linux/futex.h Wed Jun 5 22:01:44 2002
@@ -4,5 +4,6 @@
/* Second argument to futex syscall */
#define FUTEX_WAIT (0)
#define FUTEX_WAKE (1)
+#define FUTEX_AWAIT (2)

#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.20.15557/kernel/futex.c linux-2.5.20.15557.updated/kernel/futex.c
--- linux-2.5.20.15557/kernel/futex.c Wed Jun 5 22:01:41 2002
+++ linux-2.5.20.15557.updated/kernel/futex.c Wed Jun 5 22:02:09 2002
@@ -34,6 +34,10 @@
#include <linux/highmem.h>
#include <linux/time.h>
#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/devfs_fs_kernel.h>
+#include <linux/poll.h>
#include <asm/uaccess.h>

/* Simple "sleep if unchanged" interface. */
@@ -41,11 +45,18 @@
/* FIXME: This may be way too small. --RR */
#define FUTEX_HASHBITS 6

+extern void send_sigio(struct fown_struct *fown, int fd, int band);
+
/* We use this instead of a normal wait_queue_t, so we can wake only
the relevent ones (hashed queues may be shared) */
struct futex_q {
struct list_head list;
wait_queue_head_t waiters;
+
+ /* For AWAIT, sigio sent using these. */
+ int fd;
+ struct file *filp;
+
/* Page struct and offset within it. */
struct page *page;
unsigned int offset;
@@ -54,6 +65,7 @@
/* The key for the hash is the address + index + offset within page */
static struct list_head futex_queues[1<<FUTEX_HASHBITS];
static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+extern struct file_operations futex_fops;

static inline struct list_head *hash_futex(struct page *page,
unsigned long offset)
@@ -73,9 +85,12 @@
page_cache_release(page);
}

+/* Waiter may be sitting in FUTEX_WAIT or poll, or async */
static inline void tell_waiter(struct futex_q *q)
{
wake_up_all(&q->waiters);
+ if (q->fd != -1)
+ send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
}

static int futex_wake(struct list_head *head,
@@ -113,6 +128,7 @@
add_wait_queue(&q->waiters, wait);
q->page = page;
q->offset = offset;
+ q->fd = -1;

spin_lock(&futex_lock);
list_add_tail(&q->list, head);
@@ -196,6 +212,38 @@
return ret;
}

+static int futex_await(struct list_head *head,
+ struct page *page,
+ int offset,
+ int fd)
+{
+ struct file *filp;
+ struct futex_q *q;
+
+ filp = fget(fd);
+ if (!filp || filp->f_op != &futex_fops)
+ return -EBADF;
+ q = filp->private_data;
+
+ spin_lock(&futex_lock);
+ /* Eliminate any old notification, wake any pollers, release page. */
+ if (!list_empty(&q->list)) {
+ list_del(&q->list);
+ wake_up_all(&q->waiters);
+ unpin_page(q->page);
+ }
+
+ q->filp = filp;
+ q->fd = fd;
+ q->page = page;
+ q->offset = offset;
+ list_add_tail(&q->list, head);
+ spin_unlock(&futex_lock);
+ fput(filp);
+
+ return 0;
+}
+
asmlinkage int sys_futex(void *uaddr, int op, int val, struct timespec *utime)
{
int ret;
@@ -229,6 +277,9 @@
case FUTEX_WAIT:
ret = futex_wait(head, page, pos_in_page, val, uaddr, time);
break;
+ case FUTEX_AWAIT:
+ ret = futex_await(head, page, pos_in_page, val);
+ break;
case FUTEX_WAKE:
ret = futex_wake(head, page, pos_in_page, val);
break;
@@ -241,12 +292,68 @@
return ret;
}

+static int futex_open(struct inode *inode, struct file *filp)
+{
+ struct futex_q *q;
+
+ q = kmalloc(sizeof(*q), GFP_KERNEL);
+ if (!q)
+ return -ENOMEM;
+ INIT_LIST_HEAD(&q->list);
+ init_waitqueue_head(&q->waiters);
+
+ filp->private_data = q;
+ return 0;
+}
+
+static int futex_close(struct inode *inode, struct file *filp)
+{
+ struct futex_q *q = filp->private_data;
+
+ spin_lock(&futex_lock);
+ if (!list_empty(&q->list)) {
+ list_del(&q->list);
+ unpin_page(q->page);
+ BUG_ON(waitqueue_active(&q->waiters));
+ }
+ spin_unlock(&futex_lock);
+ kfree(filp->private_data);
+ return 0;
+}
+
+/* You need to do a FUTEX_AWAIT to arm this after each successful poll */
+static unsigned int futex_poll(struct file *filp,
+ struct poll_table_struct *wait)
+{
+ struct futex_q *q = filp->private_data;
+ int ret = 0;
+
+ spin_lock(&futex_lock);
+ if (!list_empty(&q->list))
+ poll_wait(filp, &q->waiters, wait);
+ else
+ ret = POLLIN | POLLRDNORM;
+ spin_unlock(&futex_lock);
+
+ return ret;
+}
+
+static struct file_operations futex_fops = {
+ open: futex_open,
+ release: futex_close,
+ poll: futex_poll,
+};
+
static int __init init(void)
{
+ int futex_major;
unsigned int i;

for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
INIT_LIST_HEAD(&futex_queues[i]);
+ futex_major = devfs_register_chrdev(0, "futex", &futex_fops);
+ devfs_register(NULL, "futex", DEVFS_FL_NONE, futex_major,
+ 0, S_IFCHR | 0666, &futex_fops, NULL);
return 0;
}
__initcall(init);


2002-06-06 16:11:40

by Martin Wirth

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Hi Rusty,

>if (this->page == page && this->offset == offset) {
> list_del_init(i);
> tell_waiter(this);
>+ unpin_page(this->page);
> num_woken++;
> if (num_woken >= num) break;
> }
> }
> spin_unlock(&futex_lock);
>+ unpin_page(page);
> return num_woken;

If I understand right you shouldn't unpin the page if you are not sure that
all waiters for a specific (page,offset)-combination are woken up and deleted
from the waitqueue. Otherwise a second call to futex_wake may look on the wrong
hash_queue or wake the wrong waiters.

In general, I think fast userspace synchronization primitives and asynchronous
notification are different enough to keep them logically more separated.
Your double use of the hashed wait queues and sys_call make the code difficult
to grasp and thus open for subtle error.

Martin

Martin


2002-06-06 16:36:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface



On Thu, 6 Jun 2002, Rusty Russell wrote:
>
> The method is: open /dev/futex

STOP!

What madness is this?

You have a damn mutex system call, don't introduce mode crap in /dev.

Do we create pipes by opening /dev/pipe? No. Do we have major and minor
numbers for sockets and populate /dev with them? No. And as a result,
there has _never_ been any sysadmin problems with either.

You already have to have a system call to bind the particular fd to the
futex _anyway_, so do the only sane thing, and allocate the fd _there_,
and get rid of that stupid and horrible /dev/futed which only buys you
pain, system administration, extra code, and a black star for being
stupid.

Linus

2002-06-06 18:37:20

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

On Thu, 2002-06-06 at 17:36, Linus Torvalds wrote:
> You have a damn mutex system call, don't introduce mode crap in /dev.
>
> Do we create pipes by opening /dev/pipe? No. Do we have major and minor
> numbers for sockets and populate /dev with them? No. And as a result,
> there has _never_ been any sysadmin problems with either.

Its a bitch trying to create sockets in shell scripts isnt it. I wonder
if there is a connection 8)

Alan

2002-06-06 22:55:12

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

In message <[email protected]> you write:
> Hi Rusty,
>
> >if (this->page == page && this->offset == offset) {
> > list_del_init(i);
> > tell_waiter(this);
> >+ unpin_page(this->page);
> > num_woken++;
> > if (num_woken >= num) break;
> > }
> > }
> > spin_unlock(&futex_lock);
> >+ unpin_page(page);
> > return num_woken;
>
> If I understand right you shouldn't unpin the page if you are not sure that
> all waiters for a specific (page,offset)-combination are woken up and deleted
> from the waitqueue. Otherwise a second call to futex_wake may look on the wro
ng
> hash_queue or wake the wrong waiters.

No, we delete them from the list (list_del_init) so we can't find it
again: ie. futex_wake can't be invoked twice on the same thing without
someone putting it back on the queue...

> In general, I think fast userspace synchronization primitives and
> asynchronous notification are different enough to keep them
> logically more separated. Your double use of the hashed wait queues
> and sys_call make the code difficult to grasp and thus open for
> subtle error.

I don't think the complexity is much worse: async is inherently
complex given that there are no real in-kernel primitives to do offer
both sync and async cleanly. Having two incompatible primitives would
be painful, too.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-06 23:18:26

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

In message <[email protected]> you wri
te:
> Do we have major and minor numbers for sockets and populate /dev
> with them? No. And as a result, there has _never_ been any sysadmin
> problems with either.

Ummm... you don't do much network programming, do you Linus? Don't
confuse familiarity with fondness: the socket API is *not* a good
model to copy.

> You already have to have a system call to bind the particular fd to the
> futex _anyway_, so do the only sane thing, and allocate the fd _there_,
> and get rid of that stupid and horrible /dev/futed which only buys you
> pain, system administration, extra code, and a black star for being
> stupid.

Yet another special way to create a special fd? Hmm...

That might be better than what I proposed, but it's not the epitomy of
taste either.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-07 08:31:19

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Rusty Russell wrote:
> In message <[email protected]> you wri
> te:
>
>>Do we have major and minor numbers for sockets and populate /dev
>>with them? No. And as a result, there has _never_ been any sysadmin
>>problems with either.
>>
>
> Ummm... you don't do much network programming, do you Linus? Don't
> confuse familiarity with fondness: the socket API is *not* a good
> model to copy.
>
>
>>You already have to have a system call to bind the particular fd to the
>>futex _anyway_, so do the only sane thing, and allocate the fd _there_,
>>and get rid of that stupid and horrible /dev/futed which only buys you
>>pain, system administration, extra code, and a black star for being
>>stupid.
>>
>
> Yet another special way to create a special fd? Hmm...
>
> That might be better than what I proposed, but it's not the epitomy of
> taste either.
>

What about /proc/futex then? Less adminstrative work, clean interface
(also for shell scripts like Alan suggested).
Al Viro would like this, it's more like Plan9 or QNX6. :)

Give it an entry in the namespace, why not with sockets (unix and ip) also?



2002-06-07 09:55:24

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

On Thu, 6 Jun 2002 09:36:23 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:
> You already have to have a system call to bind the particular fd to the
> futex _anyway_, so do the only sane thing, and allocate the fd _there_,

But Ma, I don't *want* to write a filesystem!

Linus, Al, is there an easier way to do this? I stole this from sockfs,
but I balked at another 50 lines for a proper inode creation, so I just use
the same dentry and inode over and over.

It's still an awful lot of irrelevant code: what can I cut?

/* Everyone needs a dentry and inode */
static struct dentry *futex_dentry;

/* Signal allows caller to avoid the race which would occur if they
set the sigio stuff up afterwards. */
static int futex_fd(struct list_head *head,
struct page *page,
int offset,
int signal)
{
int fd;
struct futex_q *q;
struct file *filp;

if (signal < 0 || signal > _NSIG)
return -EINVAL;

fd = get_unused_fd();
if (fd < 0)
return fd;
filp = get_empty_filp();
if (!filp) {
put_unused_fd(fd);
return -ENFILE;
}
filp->f_op = &futex_fops;
filp->f_dentry = dget(futex_dentry);

if (signal) {
filp->f_owner.pid = current->pid;
filp->f_owner.uid = current->uid;
filp->f_owner.euid = current->euid;
filp->f_owner.signum = signal;
}

q = kmalloc(sizeof(*q), GFP_KERNEL);
if (!q) {
put_unused_fd(fd);
put_filp(filp);
return -ENOMEM;
}

/* Initialize queue structure */
init_waitqueue_head(&q->waiters);
filp->private_data = q;

/* Go for it... */
queue_me(head, q, page, offset, fd, filp);

/* Now we map fd to filp, so userspace can access it */
fd_install(fd, filp);
return 0;
}

/* Oh yeah, makes sense to write a filesystem... */
static struct super_operations futexfs_ops = {
statfs: simple_statfs,
};

static int futexfs_fill_super(struct super_block *sb, void *data, int silent)
{
struct inode *root;
sb->s_blocksize = 1024;
sb->s_blocksize_bits = 10;
sb->s_magic = 0xFBAD1DEA;
sb->s_op = &futexfs_ops;
root = new_inode(sb);
if (!root)
return -ENOMEM;
root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
root->i_uid = root->i_gid = 0;
root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
sb->s_root = d_alloc(NULL, &(const struct qstr) { "futex", 5, 0 });
if (!sb->s_root) {
iput(root);
return -ENOMEM;
}
sb->s_root->d_sb = sb;
sb->s_root->d_parent = sb->s_root;
d_instantiate(sb->s_root, root);
return 0;
}

static struct super_block *
futexfs_get_sb(struct file_system_type *fs_type,
int flags, char *dev_name, void *data)
{
return get_sb_nodev(fs_type, flags, data, futexfs_fill_super);
}

static struct file_system_type futex_fs_type = {
name: "futexfs",
get_sb: futexfs_get_sb,
kill_sb: kill_anon_super,
fs_flags: FS_NOMOUNT,
};

static int __init init(void)
{
unsigned int i;
struct qstr name = { .name = "futex", .len = 5, .hash = 0 };
struct vfsmount *futex_mnt;

register_filesystem(&futex_fs_type);
futex_mnt = kern_mount(&futex_fs_type);
futex_dentry = d_alloc(NULL, &name);
futex_dentry->d_inode = new_inode(futex_mnt->mnt_sb);

for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
INIT_LIST_HEAD(&futex_queues[i]);
return 0;
}
__initcall(init);

--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy

2002-06-08 22:28:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface



On Fri, 7 Jun 2002, Peter W?chtler wrote:
>
> What about /proc/futex then?

Why?

Tell me _one_ advantage from having the thing exposed as a filename?

The whole point with "everything is a file" is not that you have some
random filename (indeed, sockets and pipes show that "file" and "filename"
have nothing to do with each other), but the fact that you can use common
tools to operate on different things.

But there's absolutely no point in opening /dev/futex from a shell script
or similar, because you don't get anything from it. You still have to bind
the fd to it's real object.

In short, the name "/dev/futex" (or "/proc/futex") is _meaningless_.
There's no point to it. It has no life outside the FUTEX system call, and
the only thing that you can do by exposing it as a name is to cause
problems for people who don't want to mount /proc, or who do not happen to
have that device node in their /dev.

> Give it an entry in the namespace, why not with sockets (unix and ip) also?

Perhaps because you cannot enumerate sockets and pipes? They don't _have_
names before they are created. Same as futexes, btw.

Linus

2002-06-08 22:42:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface



On Fri, 7 Jun 2002, Rusty Russell wrote:
>
> Linus, Al, is there an easier way to do this? I stole this from sockfs,
> but I balked at another 50 lines for a proper inode creation, so I just use
> the same dentry and inode over and over.

There's nothing inherently wrong with re-using the inode and dentry -
that's what /dev/futex would do too, of course.

> It's still an awful lot of irrelevant code: what can I cut?

I don't think it's a matter of cutting, as much as possibly a matter of
tryign to share some common code. pipefs, sockfs and now this: they all do
pretty much exactly the same thing, and there is nothing that says that
they should have separate super_operations, for example, since they are
all identical.

And once you have the same super_operations, you really have the same
"fill_super" functions too. The only thing that separates these
superblocks is the root name, so that /proc gets nice output. So it should
be fine to just have

sb = create_anon_fs("futex");

and share all of the setup code across futex/pipes/sockfs.

Which still leaves you with the

get_unused_fd();
get_empty_filp();
filp->f_dentry = dget(sb->s_root);
.. fill it ..
fd_install(fd, filp);

but by then we're talking single lines of overhead.

Linus

2002-06-09 10:03:43

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Linus Torvalds schrieb:
>
> On Fri, 7 Jun 2002, Peter W?chtler wrote:
> >
> > What about /proc/futex then?
>
> Why?
>
> Tell me _one_ advantage from having the thing exposed as a filename?
>

There is no enforcing advantage for this.

newbie question: how to provide file operations like poll
without an entry in the filesystem? (in the meantime I will try
to answer this myself :)

> The whole point with "everything is a file" is not that you have some
> random filename (indeed, sockets and pipes show that "file" and "filename"
> have nothing to do with each other), but the fact that you can use common
> tools to operate on different things.
>
> But there's absolutely no point in opening /dev/futex from a shell script
> or similar, because you don't get anything from it. You still have to bind
> the fd to it's real object.
>
> In short, the name "/dev/futex" (or "/proc/futex") is _meaningless_.
> There's no point to it. It has no life outside the FUTEX system call, and
> the only thing that you can do by exposing it as a name is to cause
> problems for people who don't want to mount /proc, or who do not happen to
> have that device node in their /dev.
>
> > Give it an entry in the namespace, why not with sockets (unix and ip) also?
>
> Perhaps because you cannot enumerate sockets and pipes? They don't _have_
> names before they are created. Same as futexes, btw.
>

Still you can open a file in the namespace and write some commands to it.
Then it turns out to be a socket on port 25:

fd=open("/dev/socket",O_RDWR);
write(fd,"connect stream 25\n",sizeof(..));
write(fd,"helo mail.my.com\n",..);
...


Just one more question: would it be possible to specify a poll operation
for /proc/blah?

2002-06-09 10:57:36

by kaih

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

[email protected] (Linus Torvalds) wrote on 08.06.02 in <[email protected]>:

> On Fri, 7 Jun 2002, Peter Waechtler wrote:
> >
> > What about /proc/futex then?
>
> Why?
>
> Tell me _one_ advantage from having the thing exposed as a filename?

None, of course - the shell can't do the other futex ops, either. Futex
file handles mean you can implement select() on them, but that's about all
they have in common with files - there is certainly no read() or write()
operation here!

> > Give it an entry in the namespace, why not with sockets (unix and ip)
> > also?
>
> Perhaps because you cannot enumerate sockets and pipes? They don't _have_
> names before they are created. Same as futexes, btw.

Now *there* I disagree, at least for sockets. First of all, there's
absolutely no need to be able to enumerate unopened sockets to justify
putting them into the namespace - you can create them (in a special fs, of
course) at the moment they are opened. (As in fact is done, in /proc/<pid>/
fd/.) And second, you *can* read() and write() there.

However, I don't think that's all that important. What I'd rather see is
making the network devices into namespace nodes. The situation of eth0 and
friends, from a Unix perspective, is utterly unnatural.

MfG Kai

2002-06-09 17:49:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface



On Sun, 9 Jun 2002, Peter W?chtler wrote:
>
> Still you can open a file in the namespace and write some commands to it.
> Then it turns out to be a socket on port 25:
>
> fd=open("/dev/socket",O_RDWR);
> write(fd,"connect stream 25\n",sizeof(..));
> write(fd,"helo mail.my.com\n",..);

Yes, obviously you can avoid system calls entirely, and replace all of
them with read/write of commands.

This is not even a very uncommon idea: the above is basically message
passing, and is largely how many microkernels work. Except they don't call
it read/write, they tend to call it send/recv, and they aren't "file
descriptors", they are "ports".

It has advantages: because you only have one set of primitives, it's more
easily abtracted at that level, meaning that you can (and people do) make
it distributed etc without having to worry about local semantics.

It has disadvantages too: performance tends to be bad (you have to copy
around and parse the commands that are no longer implicit in the system
call number), and while there is a high level of abstraction on one level
("everything is a 'port' that can receive or send messages), at some point
the proverbial shit hits the fan and you've moved the details behind the
abstraction down (and now the data stream is no longer just bytes, but has
a meaning in itself).

But yes, the sequences

open("/dev/socket") -> socket()
write(fd,"connect stream 25") -> connect()

are obviously "equivalent". It's not my personal favourite equivalence,
though. I'd much rather add the information at _open_ time, and make it a
name-space issue, so that you'd do something like

open("//sockfs/dst=123.45.67.89:25", O_RDWR);

instead. Which is _also_ entirely equivalent, of course (the "namespace"
approach does require that you be able to do "fd-relative" lookups, so
that you could also do

sk = open("//sockfs", O_RDWR);
sk2 = fd_open(sk, "dst=123.45.67.89:25", O_RDWR);

which is actually useful even in regular files too, just as a way of doing
directory-relative file opens without having to do a "chdir()").

HOWEVER, the fact is that exactly because they are equivalent, there is no
real difference between them. So you might as well just use the old UNIX
behaviour, and if you want to open sockets from a script, you use any of
the already _existing_ socket script helpers. For port 25, you have one
called "sendmail". For port 80, you have things like "lynx -source".

And you have tons of things like "netpipes", for doing generic scripting
of sockets.

The fact is, trying to come up with new ways to do the same old thing is
_not_ a good idea. It may look cool to expose sockets in the namespace,
but what's the actual added advantage over existing standard practices?
Unless that can be shown, there's just no point.

Do a google search for "netpipes", I'm sure you'll find it can do what you
wanted.

Sorry to rain on the "cool feature" parade, but I want to see some
_advantage_ from exposing new names in the namespace.

Linus

2002-06-09 17:58:21

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Hi!

> Name: Waker can unpin page, rather than waiting process
> Author: Rusty Russell
> Status: Tested in 2.5.20
> Depends: Futex/copy-from-user.patch.gz Futex/unpin-page-fix.patch.gz
> Depends: Futex/waitq.patch.gz
>
> D: This changes the implementation so that the waker actually unpins
> D: the page. This is preparation for the async interface, where the
> D: process which registered interest is not in the kernel.

What is it? Nice header, did you generate it by hand?
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2002-06-09 18:09:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface



On 9 Jun 2002, Kai Henningsen wrote:
>
> However, I don't think that's all that important. What I'd rather see is
> making the network devices into namespace nodes. The situation of eth0 and
> friends, from a Unix perspective, is utterly unnatural.

But what would you _do_ with them? What would be the advantage as compared
to the current situation?

Now, to configure a device, you get a fd to the device the same way you
get a fd _anyway_ - with "socket()".

And anybody who says that "socket()" is utterly unnatural to the UNIX way
is quite far out to lunch. It may be unnatural to the Plan-9 way of
"everything is a namespace", but that was never the UNIX way. The UNIX way
is "everything is a file descriptor or a process", but that was never
about namespaces.

Yes, some old-timers could argue that original UNIX didn't have sockets,
and that the BSD interface is ugly and an abomination and that it _should_
have been a namespace thing, but that argument falls flat on its face when
you realize that the "pipe()" system call _was_ in original UNIX, and has
all the same issues.

Don't get hung up about names.

Linus

2002-06-09 19:07:01

by Thunder from the hill

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Hi,

On Sun, 9 Jun 2002, Linus Torvalds wrote:
> Yes, some old-timers could argue that original UNIX didn't have sockets,

Not to mention MULTICS...

IMO this all looks like "exporting program variables to filesystem", would
you do that? (Then we'll need /dev/memory/16k/5362337156/blah, etc.) The
next issue would be "how do we stop other processes from using our
sockets, semaphores, etc., ending up where we started.

Sockets are a good implementation as long as they don't fall down for some
particular purpose. This isn't given yet. Semaphores didn't yet fall down,
either. So what do you want more? The "old" system might look crappy to
you, but it works! It works, even if it's a little more abstract than the
Plan-9. Are we Plan-9?

Regards,
Thunder

PS. this mail was sent through a happily working lot of sockets.
--
German attitude becoming | Thunder from the hill at ngforever
rightaway popular: |
"Get outa my way, | free inhabitant not directly
for I got a mobile phone!" | belonging anywhere

2002-06-10 06:41:52

by kaih

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

[email protected] (Linus Torvalds) wrote on 09.06.02 in <[email protected]>:

> On 9 Jun 2002, Kai Henningsen wrote:
> >
> > However, I don't think that's all that important. What I'd rather see is
> > making the network devices into namespace nodes. The situation of eth0 and
> > friends, from a Unix perspective, is utterly unnatural.
>
> But what would you _do_ with them? What would be the advantage as compared
> to the current situation?

For one, enumerate them. Network interfaces, as opposed to sockets, are
fairly static. And currently I have to either call /sbin/ifconfig or
figure out where this information is hidden in /proc. (And there is a
kernel interface, I think, some ioctl() stuff - I have mostly forgotten
all I learned about that, it was so ugly.)

Doesn't need to be /dev/xxx nodes (that *would* be the traditional Unix
way) - these days I'd argue for a special network device fs.

> Now, to configure a device, you get a fd to the device the same way you
> get a fd _anyway_ - with "socket()".

>From what little I remember, though, you don't bind your socket to some
address "eth0" - which might have been halfway reasonable - but instead
you say *that* with some ioctl().

> And anybody who says that "socket()" is utterly unnatural to the UNIX way
> is quite far out to lunch. It may be unnatural to the Plan-9 way of
> "everything is a namespace", but that was never the UNIX way. The UNIX way
> is "everything is a file descriptor or a process", but that was never
> about namespaces.
>
> Yes, some old-timers could argue that original UNIX didn't have sockets,
> and that the BSD interface is ugly and an abomination and that it _should_
> have been a namespace thing, but that argument falls flat on its face when
> you realize that the "pipe()" system call _was_ in original UNIX, and has
> all the same issues.

On the other hand, pipe() is about *anonymous* files. You could in fact
argue that it would be nice to have an interface for anonymous disk files
as well, instead of all the current acrobatics for safe temp file
creation.

> Don't get hung up about names.

Especially not if your argument is then about situations where you atually
don't want a name.

See, the problem with eth0 and friends is that they already *have* names -
their names just aren't in the standard namespace. *That* is the real
problem here.

Unix tradition or not, I propose the rule:

*Any kernel object that has a text name, shall appear,*
*under this name, somewhere in the filesystem.*

Or in other words, *there can only be one namespace*. (Well, only one
global namespace, anyway.)

MfG Kai

2002-06-10 06:55:08

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

In message <[email protected]> you write:
> Hi!
>
> > Name: Waker can unpin page, rather than waiting process
> > Author: Rusty Russell
> > Status: Tested in 2.5.20
> > Depends: Futex/copy-from-user.patch.gz Futex/unpin-page-fix.patch.gz
> > Depends: Futex/waitq.patch.gz
> >
> > D: This changes the implementation so that the waker actually unpins
> > D: the page. This is preparation for the async interface, where the
> > D: process which registered interest is not in the kernel.
>
> What is it? Nice header, did you generate it by hand?

It's a format I made up for all the patches on my web page. Most
important is the Depends: line which my patch scripts can deal with
quite nicely: see my kernel.org page.

Down with Bit Keeper! Long live patch! 8)
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-10 07:55:59

by Helge Hafting

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Linus Torvalds wrote:
>
> On 9 Jun 2002, Kai Henningsen wrote:
> >
> > However, I don't think that's all that important. What I'd rather see is
> > making the network devices into namespace nodes. The situation of eth0 and
> > friends, from a Unix perspective, is utterly unnatural.
>
> But what would you _do_ with them? What would be the advantage as compared
> to the current situation?

Not much, but
ls /dev/net
eth0 eth1 eth2 ippp0
would be a convenient way to see what net devices exists.
This already works for other devices, when using devfs.

I guess this isn't reason enough though.

Helge Hafting

2002-06-10 14:13:28

by Thunder from the hill

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Hi,

On Mon, 10 Jun 2002, Helge Hafting wrote:
> ls /dev/net
> eth0 eth1 eth2 ippp0

What is it worth? You have a few more files which you can't do anything
with, and ifconfig output is much more greppable etc.

I remember these network devices from Solaris. There wasn't any good about
them IIRC, the only sane way of working with them was to work around them,
i.e. ignoring. Do you want a /dev/ignoreme directory?

We shouldn't introduce to ignore.

Regards,
Thunder
--
German attitude becoming | Thunder from the hill at ngforever
rightaway popular: |
"Get outa my way, | free inhabitant not directly
for I got a mobile phone!" | belonging anywhere

2002-06-10 15:11:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface



On Mon, 10 Jun 2002, Helge Hafting wrote:
>
> Not much, but
> ls /dev/net
> eth0 eth1 eth2 ippp0
> would be a convenient way to see what net devices exists.
> This already works for other devices, when using devfs.

You might as well do

cat /proc/net/dev

instead.

Which works with existing kernels, going back to whatever..

Linus

2002-06-10 21:02:04

by kaih

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

[email protected] (Thunder from the hill) wrote on 10.06.02 in <[email protected]>:

> On Mon, 10 Jun 2002, Helge Hafting wrote:
> > ls /dev/net
> > eth0 eth1 eth2 ippp0
>
> What is it worth? You have a few more files which you can't do anything
> with, and ifconfig output is much more greppable etc.

Ifconfig output is *WHAT*?!

Ifconfig output, to be parsed by a script, is one of the shittiest
interfaces possible.

Look at this, and then tell me again that "ifconfig output is much more
greppable"!

# ifconfig eth0
eth0 Protokoll:Ethernet Hardware Adresse 00:50:FC:0C:63:69
inet Adresse:10.0.41.2 Bcast:10.255.255.255 Maske:255.255.255.0
EtherTalk Phase 2 Adresse:65280/237
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:26841078 errors:4240 dropped:0 overruns:0 frame:0
TX packets:26134055 errors:0 dropped:0 overruns:0 carrier:0
Kollisionen:8481
RX bytes:60708618 (57.8 MiB) TX bytes:2654812652 (2.4 GiB)

# LANG= ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:50:FC:0C:63:69
inet addr:10.0.41.2 Bcast:10.255.255.255 Mask:255.255.255.0
EtherTalk Phase 2 addr:65280/237
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:26841182 errors:4240 dropped:0 overruns:0 frame:0
TX packets:26134181 errors:0 dropped:0 overruns:0 carrier:0
collisions:8481
RX bytes:60727233 (57.9 MiB) TX bytes:2654827939 (2.4 GiB)

#

Even rooting around in /proc is better than this!

> I remember these network devices from Solaris. There wasn't any good about
> them IIRC, the only sane way of working with them was to work around them,
> i.e. ignoring. Do you want a /dev/ignoreme directory?

I have no idea what Solaris did, nor do I necessarily want to know, but I
*do* have experience with what Linux does, and I'm certainly not
impressed.

Given that I've fairly often been irritated about not having these things
be filesystem nodes, I'd expect there to *be* benefit in having them in
the filesystem if this is done halfway reasonable.

MfG Kai

2002-06-10 20:57:51

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Followup to: <[email protected]>
By author: Linus Torvalds <[email protected]>
In newsgroup: linux.dev.kernel
>
> But what would you _do_ with them? What would be the advantage as compared
> to the current situation?
>
> Now, to configure a device, you get a fd to the device the same way you
> get a fd _anyway_ - with "socket()".
>
> And anybody who says that "socket()" is utterly unnatural to the UNIX way
> is quite far out to lunch. It may be unnatural to the Plan-9 way of
> "everything is a namespace", but that was never the UNIX way. The UNIX way
> is "everything is a file descriptor or a process", but that was never
> about namespaces.
>
> Yes, some old-timers could argue that original UNIX didn't have sockets,
> and that the BSD interface is ugly and an abomination and that it _should_
> have been a namespace thing, but that argument falls flat on its face when
> you realize that the "pipe()" system call _was_ in original UNIX, and has
> all the same issues.
>

The biggest problem with the socket API is that it makes it impossible
to use in a shell script context or other context where it isn't
practical to operate on binary data. The second-biggest problem with
the API is that it at least historically makes it basically impossible
to write address-family-independent code... something like "telnet"
shouldn't need to know if it's operating over TCP on IPv4, TCP on
IPv6, SPX on IPX...; it should be able to resolve a string to a name and
then create a connection, and the address family stuff should be
encasulated in the library.

The first problem a namespace-like solution would deal with... then
one could open /dev/sock/stream/ipv4/206.189.222.9/23 using a normal
open() call. The second is to resolve "poetry.vogon.net","telnet" to
the above string, but that's a user-space issue.

At least it would be nice if open() on a Unix domain socket would
connect to that socket.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-06-11 09:14:59

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

In message <[email protected]> you wr
ite:
>
>
> On Fri, 7 Jun 2002, Rusty Russell wrote:
> >
> > Linus, Al, is there an easier way to do this? I stole this from sockfs,
> > but I balked at another 50 lines for a proper inode creation, so I just use
> > the same dentry and inode over and over.
>
> There's nothing inherently wrong with re-using the inode and dentry -
> that's what /dev/futex would do too, of course.
>
> > It's still an awful lot of irrelevant code: what can I cut?
>
> I don't think it's a matter of cutting, as much as possibly a matter of
> tryign to share some common code. pipefs, sockfs and now this: they all do
> pretty much exactly the same thing, and there is nothing that says that
> they should have separate super_operations, for example, since they are
> all identical.

Yeah, that's a nice idea. But sockfs is sufficiently different that
it's probably not worth it. Left for Al's larger brain.

Here's the new (tested) futex patch (actually 4 minipatches but patch
-p1 < msg works fine).

Updated futex library at:
http://www.[cc].kernel.org/pub/linux/kernel/people/rusty/futex-2.2.tar.gz

Name: Remove requirement for page to be writable.
Author: Rusty Russell
Status: Trivial

D: It's possible that someone could be passively monitoring: write
D: access is no longer strictly required (since new futex implementation).

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.20/kernel/futex.c working-2.5.20-afutex/kernel/futex.c
--- linux-2.5.20/kernel/futex.c Sat May 25 14:35:00 2002
+++ working-2.5.20-afutex/kernel/futex.c Wed Jun 5 18:42:17 2002
@@ -128,9 +158,9 @@
int err;

down_read(&mm->mmap_sem);
- err = get_user_pages(current, current->mm, page_start,
+ err = get_user_pages(current, mm, page_start,
1 /* one page */,
- 1 /* writable */,
+ 0 /* writable not important */,
0 /* don't force */,
&page,
NULL /* don't return vmas */);

Name: Unpin page bug fix.
Author: Rusty Russell
Status: Tested on 2.5.20

D: This uses page_cache_release() instead of put_page(), as it might
D: be a pagecache page.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.21.23289/kernel/futex.c linux-2.5.21.23289.updated/kernel/futex.c
--- linux-2.5.21.23289/kernel/futex.c Mon Jun 10 16:03:56 2002
+++ linux-2.5.21.23289.updated/kernel/futex.c Mon Jun 10 17:10:10 2002
@@ -70,6 +70,14 @@
wake_up_all(&q->waiters);
}

+static inline void unpin_page(struct page *page)
+{
+ /* Avoid releasing the page which is on the LRU list. I don't
+ know if this is correct, but it stops the BUG() in
+ __free_pages_ok(). */
+ page_cache_release(page);
+}
+
static int futex_wake(struct list_head *head,
struct page *page,
unsigned int offset,
@@ -223,7 +231,7 @@
default:
ret = -EINVAL;
}
- page_cache_release(page);
+ unpin_page(page);

return ret;
}

Name: Waker can unpin page, rather than waiting process
Author: Rusty Russell
Status: Tested in 2.5.20
Depends: Futex/copy-from-user.patch.gz Futex/unpin-page-fix.patch.gz
Depends: Futex/waitq.patch.gz

D: This changes the implementation so that the waker actually unpins
D: the page. This is preparation for the async interface, where the
D: process which registered interest is not in the kernel.


diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.20.19104/kernel/futex.c linux-2.5.20.19104.updated/kernel/futex.c
--- linux-2.5.20.19104/kernel/futex.c Thu Jun 6 17:13:46 2002
+++ linux-2.5.20.19104.updated/kernel/futex.c Thu Jun 6 17:14:30 2002
@@ -98,11 +98,13 @@
if (this->page == page && this->offset == offset) {
list_del_init(i);
tell_waiter(this);
+ unpin_page(this->page);
num_woken++;
if (num_woken >= num) break;
}
}
spin_unlock(&futex_lock);
+ unpin_page(page);
return num_woken;
}

@@ -192,9 +194,10 @@
}
out:
set_current_state(TASK_RUNNING);
- /* Were we woken up anyway? */
+ /* Were we woken up anyway? If so, it unpinned page. */
if (!unqueue_me(&q))
return 0;
+ unpin_page(page);
return ret;
}

@@ -225,6 +228,7 @@
if (IS_ERR(page))
return PTR_ERR(page);

+ /* On success, these routines unpin the pages themselves. */
head = hash_futex(page, pos_in_page);
switch (op) {
case FUTEX_WAIT:
@@ -236,7 +240,8 @@
default:
ret = -EINVAL;
}
- unpin_page(page);
+ if (ret < 0)
+ unpin_page(page);

return ret;
}

Name: New asynchronous interface for futexes
Author: Rusty Russell
Status: Tested on 2.5.20
Depends: Futex/comment-fix.patch.gz Futex/copy-from-user.patch.gz
Depends: Futex/no-write-needed.patch.gz Futex/unpin-page-fix.patch.gz
Depends: Futex/waitq.patch.gz Futex/waker-unpin-page.patch.gz

D: This patch adds a FUTEX_FD call, for opening a file descriptor
D: attached to a futex, which can be used with poll, select or SIGIO.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.5.21-pre-afutex/include/linux/futex.h working-2.5.21-afutex/include/linux/futex.h
--- working-2.5.21-pre-afutex/include/linux/futex.h Sat May 25 14:34:59 2002
+++ working-2.5.21-afutex/include/linux/futex.h Tue Jun 11 13:25:52 2002
@@ -4,5 +4,6 @@
/* Second argument to futex syscall */
#define FUTEX_WAIT (0)
#define FUTEX_WAKE (1)
+#define FUTEX_FD (2)

#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.5.21-pre-afutex/kernel/futex.c working-2.5.21-afutex/kernel/futex.c
--- working-2.5.21-pre-afutex/kernel/futex.c Tue Jun 11 13:20:06 2002
+++ working-2.5.21-afutex/kernel/futex.c Tue Jun 11 13:26:46 2002
@@ -34,6 +34,10 @@
#include <linux/highmem.h>
#include <linux/time.h>
#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
#include <asm/uaccess.h>

/* Simple "sleep if unchanged" interface. */
@@ -41,6 +45,11 @@
/* FIXME: This may be way too small. --RR */
#define FUTEX_HASHBITS 6

+extern void send_sigio(struct fown_struct *fown, int fd, int band);
+
+/* Everyone needs a dentry and inode */
+static struct dentry *futex_dentry;
+
/* We use this instead of a normal wait_queue_t, so we can wake only
the relevent ones (hashed queues may be shared) */
struct futex_q {
@@ -49,6 +58,9 @@
/* Page struct and offset within it. */
struct page *page;
unsigned int offset;
+ /* For fd, sigio sent using these. */
+ int fd;
+ struct file *filp;
};

/* The key for the hash is the address + index + offset within page */
@@ -65,9 +77,12 @@
return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
}

+/* Waiter either waiting in FUTEX_WAIT or poll(), or expecting signal */
static inline void tell_waiter(struct futex_q *q)
{
wake_up_all(&q->waiters);
+ if (q->filp)
+ send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
}

static inline void unpin_page(struct page *page)
@@ -105,14 +120,16 @@

/* Add at end to avoid starvation */
static inline void queue_me(struct list_head *head,
- wait_queue_t *wait,
struct futex_q *q,
struct page *page,
- unsigned int offset)
+ unsigned int offset,
+ int fd,
+ struct file *filp)
{
- add_wait_queue(&q->waiters, wait);
q->page = page;
q->offset = offset;
+ q->fd = fd;
+ q->filp = filp;

spin_lock(&futex_lock);
list_add_tail(&q->list, head);
@@ -166,7 +183,9 @@
int ret = 0;

set_current_state(TASK_INTERRUPTIBLE);
- queue_me(head, &wait, &q, page, offset);
+ init_waitqueue_head(&q.waiters);
+ add_wait_queue(&q.waiters, &wait);
+ queue_me(head, &q, page, offset, -1, NULL);

/* Page is pinned, but may no longer be in this address space. */
if (get_user(curval, uaddr) != 0) {
@@ -196,6 +215,95 @@
return ret;
}

+static int futex_close(struct inode *inode, struct file *filp)
+{
+ struct futex_q *q = filp->private_data;
+
+ spin_lock(&futex_lock);
+ if (!list_empty(&q->list)) {
+ list_del(&q->list);
+ unpin_page(q->page);
+ /* Noone can be polling on us now. */
+ BUG_ON(waitqueue_active(&q->waiters));
+ }
+ spin_unlock(&futex_lock);
+ kfree(filp->private_data);
+ return 0;
+}
+
+/* This is one-shot: once it's gone off you need a new fd */
+static unsigned int futex_poll(struct file *filp,
+ struct poll_table_struct *wait)
+{
+ struct futex_q *q = filp->private_data;
+ int ret = 0;
+
+ spin_lock(&futex_lock);
+ if (!list_empty(&q->list))
+ poll_wait(filp, &q->waiters, wait);
+ else
+ ret = POLLIN | POLLRDNORM;
+ spin_unlock(&futex_lock);
+
+ return ret;
+}
+
+static struct file_operations futex_fops = {
+ release: futex_close,
+ poll: futex_poll,
+};
+
+/* Signal allows caller to avoid the race which would occur if they
+ set the sigio stuff up afterwards. */
+static int futex_fd(struct list_head *head,
+ struct page *page,
+ int offset,
+ int signal)
+{
+ int fd;
+ struct futex_q *q;
+ struct file *filp;
+
+ if (signal < 0 || signal > _NSIG)
+ return -EINVAL;
+
+ fd = get_unused_fd();
+ if (fd < 0)
+ return fd;
+ filp = get_empty_filp();
+ if (!filp) {
+ put_unused_fd(fd);
+ return -ENFILE;
+ }
+ filp->f_op = &futex_fops;
+ filp->f_dentry = dget(futex_dentry);
+
+ if (signal) {
+ filp->f_owner.pid = current->pid;
+ filp->f_owner.uid = current->uid;
+ filp->f_owner.euid = current->euid;
+ filp->f_owner.signum = signal;
+ }
+
+ q = kmalloc(sizeof(*q), GFP_KERNEL);
+ if (!q) {
+ put_unused_fd(fd);
+ put_filp(filp);
+ return -ENOMEM;
+ }
+
+ /* Initialize queue structure */
+ init_waitqueue_head(&q->waiters);
+ filp->private_data = q;
+
+ /* Go for it... */
+ queue_me(head, q, page, offset, fd, filp);
+
+ /* Now we map fd to filp, so userspace can access it */
+ fd_install(fd, filp);
+ return fd;
+}
+
asmlinkage int sys_futex(void *uaddr, int op, int val, struct timespec *utime)
{
int ret;
@@ -232,6 +340,10 @@
case FUTEX_WAKE:
ret = futex_wake(head, page, pos_in_page, val);
break;
+ case FUTEX_FD:
+ /* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
+ ret = futex_fd(head, page, pos_in_page, val);
+ break;
default:
ret = -EINVAL;
}
@@ -241,9 +353,55 @@
return ret;
}

+/* FIXME: Oh yeah, makes sense to write a filesystem... */
+static struct super_operations futexfs_ops = { statfs: simple_statfs };
+
+/* Don't check error returns: we're dead if they happen */
+static int futexfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct inode *root;
+
+ sb->s_blocksize = 1024;
+ sb->s_blocksize_bits = 10;
+ sb->s_magic = 0xBAD1DEA;
+ sb->s_op = &futexfs_ops;
+
+ root = new_inode(sb);
+ root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
+ root->i_uid = root->i_gid = 0;
+ root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
+
+ sb->s_root = d_alloc(NULL, &(const struct qstr) { "futex", 5, 0 });
+ sb->s_root->d_sb = sb;
+ sb->s_root->d_parent = sb->s_root;
+ d_instantiate(sb->s_root, root);
+
+ /* We never let this drop to zero. */
+ futex_dentry = dget(sb->s_root);
+
+ return 0;
+}
+
+static struct super_block *
+futexfs_get_sb(struct file_system_type *fs_type,
+ int flags, char *dev_name, void *data)
+{
+ return get_sb_nodev(fs_type, flags, data, futexfs_fill_super);
+}
+
+static struct file_system_type futex_fs_type = {
+ name: "futexfs",
+ get_sb: futexfs_get_sb,
+ kill_sb: kill_anon_super,
+ fs_flags: FS_NOMOUNT,
+};
+
static int __init init(void)
{
unsigned int i;
+
+ register_filesystem(&futex_fs_type);
+ kern_mount(&futex_fs_type);

for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
INIT_LIST_HEAD(&futex_queues[i]);

--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-11 14:12:20

by john slee

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

On Mon, Jun 10, 2002 at 10:46:00PM +0200, Kai Henningsen wrote:
> [email protected] (Thunder from the hill) wrote on 10.06.02 in <[email protected]>:
>
> > On Mon, 10 Jun 2002, Helge Hafting wrote:
> > > ls /dev/net
> > > eth0 eth1 eth2 ippp0
> >
> > What is it worth? You have a few more files which you can't do anything
> > with, and ifconfig output is much more greppable etc.
>
> Ifconfig output is *WHAT*?!
>
> Ifconfig output, to be parsed by a script, is one of the shittiest
> interfaces possible.

it may be but you can use 'ip' instead which _is_ much more script
friendly and seems to have at least ifconfig functionality.

j.

--
toyota power: http://indigoid.net/

2002-06-11 15:16:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Linus Torvalds <[email protected]> writes:

> On Mon, 10 Jun 2002, Helge Hafting wrote:
> >
> > Not much, but
> > ls /dev/net
> > eth0 eth1 eth2 ippp0
> > would be a convenient way to see what net devices exists.
> > This already works for other devices, when using devfs.
>
> You might as well do
>
> cat /proc/net/dev
>
> instead.
>
> Which works with existing kernels, going back to whatever..

Gap, puke.

Sorry I have built kernels where space was tight and I only built in
/proc so I could read /proc/net/dev. And since with /proc everything
is all in one basket it is very hard to turn off unneeded features.

/proc might be nice to user space but as it is implemented it is
nasty to work with.

So a netdevfs or some solution that factors better would really
be nice.

Eric

2002-06-11 16:53:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface


Rusty,
this makes no sense:

D: This changes the implementation so that the waker actually unpins
D: the page. This is preparation for the async interface, where the
D: process which registered interest is not in the kernel.

Whazzup? The closing of the fd will unpin the page, the waker has no
reason to do so. It is very much against the linux philosophy (and a
design disaster anyway) to have the waker muck with the data structures of
anything waiting.

Linus

2002-06-12 05:32:37

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

In message <[email protected]> you wri
te:
>
> Rusty,
> this makes no sense:
>
> D: This changes the implementation so that the waker actually unpins
> D: the page. This is preparation for the async interface, where the
> D: process which registered interest is not in the kernel.
>
> Whazzup? The closing of the fd will unpin the page, the waker has no
> reason to do so. It is very much against the linux philosophy (and a
> design disaster anyway) to have the waker muck with the data structures of
> anything waiting.

Good catch: now the fd is a "one-shot" thing anyway, making close
unpin the page makes more sense. Tested patch below (against 2.5.21).

FYI: I already violate this philosophy as I remove the waiter from the
queue when I wake them: this allows them to tell that they were woken
(waker does a list_del_init() on the waiting entry, so waiting knows
if (list_empty()) I was woken).

It would be more natural for the waiter to examine the futex value,
and if it's still unchanged go back to sleep. But this makes
assumptions about what they're using the futex value for. For
example, we "PASS_THIS_DIRECTLY" value into the futex. This requires
that one (and ONLY one) process waiting actually wakes up.

This is why coming up with a primitive which allowed us to build posix
threads and fair queueing as well as "normal" unfair semantics took so
damn long.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.21/include/linux/futex.h working-2.5.21-afutex/include/linux/futex.h
--- linux-2.5.21/include/linux/futex.h Sat May 25 14:34:59 2002
+++ working-2.5.21-afutex/include/linux/futex.h Wed Jun 12 12:32:18 2002
@@ -4,5 +4,6 @@
/* Second argument to futex syscall */
#define FUTEX_WAIT (0)
#define FUTEX_WAKE (1)
+#define FUTEX_FD (2)

#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.21/kernel/futex.c working-2.5.21-afutex/kernel/futex.c
--- linux-2.5.21/kernel/futex.c Mon Jun 10 16:03:56 2002
+++ working-2.5.21-afutex/kernel/futex.c Wed Jun 12 14:10:41 2002
@@ -34,6 +34,10 @@
#include <linux/highmem.h>
#include <linux/time.h>
#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
#include <asm/uaccess.h>

/* Simple "sleep if unchanged" interface. */
@@ -41,6 +45,11 @@
/* FIXME: This may be way too small. --RR */
#define FUTEX_HASHBITS 6

+extern void send_sigio(struct fown_struct *fown, int fd, int band);
+
+/* Everyone needs a dentry and inode */
+static struct dentry *futex_dentry;
+
/* We use this instead of a normal wait_queue_t, so we can wake only
the relevent ones (hashed queues may be shared) */
struct futex_q {
@@ -49,6 +58,9 @@
/* Page struct and offset within it. */
struct page *page;
unsigned int offset;
+ /* For fd, sigio sent using these. */
+ int fd;
+ struct file *filp;
};

/* The key for the hash is the address + index + offset within page */
@@ -65,9 +77,20 @@
return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
}

+/* Waiter either waiting in FUTEX_WAIT or poll(), or expecting signal */
static inline void tell_waiter(struct futex_q *q)
{
wake_up_all(&q->waiters);
+ if (q->filp)
+ send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
+}
+
+static inline void unpin_page(struct page *page)
+{
+ /* Avoid releasing the page which is on the LRU list. I don't
+ know if this is correct, but it stops the BUG() in
+ __free_pages_ok(). */
+ page_cache_release(page);
}

static int futex_wake(struct list_head *head,
@@ -95,14 +118,16 @@

/* Add at end to avoid starvation */
static inline void queue_me(struct list_head *head,
- wait_queue_t *wait,
struct futex_q *q,
struct page *page,
- unsigned int offset)
+ unsigned int offset,
+ int fd,
+ struct file *filp)
{
- add_wait_queue(&q->waiters, wait);
q->page = page;
q->offset = offset;
+ q->fd = fd;
+ q->filp = filp;

spin_lock(&futex_lock);
list_add_tail(&q->list, head);
@@ -130,9 +155,9 @@
int err;

down_read(&mm->mmap_sem);
- err = get_user_pages(current, current->mm, page_start,
+ err = get_user_pages(current, mm, page_start,
1 /* one page */,
- 1 /* writable */,
+ 0 /* writable not important */,
0 /* don't force */,
&page,
NULL /* don't return vmas */);
@@ -156,7 +181,9 @@
int ret = 0;

set_current_state(TASK_INTERRUPTIBLE);
- queue_me(head, &wait, &q, page, offset);
+ init_waitqueue_head(&q.waiters);
+ add_wait_queue(&q.waiters, &wait);
+ queue_me(head, &q, page, offset, -1, NULL);

/* Page is pinned, but may no longer be in this address space. */
if (get_user(curval, uaddr) != 0) {
@@ -185,6 +212,93 @@
return ret;
}

+static int futex_close(struct inode *inode, struct file *filp)
+{
+ struct futex_q *q = filp->private_data;
+
+ spin_lock(&futex_lock);
+ if (!list_empty(&q->list)) {
+ list_del(&q->list);
+ /* Noone can be polling on us now. */
+ BUG_ON(waitqueue_active(&q->waiters));
+ }
+ spin_unlock(&futex_lock);
+ unpin_page(q->page);
+ kfree(filp->private_data);
+ return 0;
+}
+
+/* This is one-shot: once it's gone off you need a new fd */
+static unsigned int futex_poll(struct file *filp,
+ struct poll_table_struct *wait)
+{
+ struct futex_q *q = filp->private_data;
+ int ret = 0;
+
+ spin_lock(&futex_lock);
+ if (!list_empty(&q->list))
+ poll_wait(filp, &q->waiters, wait);
+ else
+ ret = POLLIN | POLLRDNORM;
+ spin_unlock(&futex_lock);
+
+ return ret;
+}
+
+static struct file_operations futex_fops = {
+ release: futex_close,
+ poll: futex_poll,
+};
+
+/* Signal allows caller to avoid the race which would occur if they
+ set the sigio stuff up afterwards. */
+static int futex_fd(struct list_head *head,
+ struct page *page,
+ int offset,
+ int signal)
+{
+ int fd;
+ struct futex_q *q;
+ struct file *filp;
+
+ if (signal < 0 || signal > _NSIG)
+ return -EINVAL;
+
+ fd = get_unused_fd();
+ if (fd < 0)
+ return fd;
+ filp = get_empty_filp();
+ if (!filp) {
+ put_unused_fd(fd);
+ return -ENFILE;
+ }
+ filp->f_op = &futex_fops;
+ filp->f_dentry = dget(futex_dentry);
+
+ if (signal) {
+ filp->f_owner.pid = current->pid;
+ filp->f_owner.uid = current->uid;
+ filp->f_owner.euid = current->euid;
+ filp->f_owner.signum = signal;
+ }
+
+ q = kmalloc(sizeof(*q), GFP_KERNEL);
+ if (!q) {
+ put_unused_fd(fd);
+ put_filp(filp);
+ return -ENOMEM;
+ }
+
+ /* Initialize queue structure, and add to hash table. */
+ filp->private_data = q;
+ init_waitqueue_head(&q->waiters);
+ queue_me(head, q, page, offset, fd, filp);
+
+ /* Now we map fd to filp, so userspace can access it */
+ fd_install(fd, filp);
+ return fd;
+}
+
asmlinkage int sys_futex(void *uaddr, int op, int val, struct timespec *utime)
{
int ret;
@@ -220,17 +334,70 @@
case FUTEX_WAKE:
ret = futex_wake(head, page, pos_in_page, val);
break;
+ case FUTEX_FD:
+ /* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
+ ret = futex_fd(head, page, pos_in_page, val);
+ if (ret >= 0)
+ /* Leave page pinned (attached to fd). */
+ return ret;
+ break;
default:
ret = -EINVAL;
}
- page_cache_release(page);
+ unpin_page(page);

return ret;
}

+/* FIXME: Oh yeah, makes sense to write a filesystem... */
+static struct super_operations futexfs_ops = { statfs: simple_statfs };
+
+/* Don't check error returns: we're dead if they happen */
+static int futexfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct inode *root;
+
+ sb->s_blocksize = 1024;
+ sb->s_blocksize_bits = 10;
+ sb->s_magic = 0xBAD1DEA;
+ sb->s_op = &futexfs_ops;
+
+ root = new_inode(sb);
+ root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
+ root->i_uid = root->i_gid = 0;
+ root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
+
+ sb->s_root = d_alloc(NULL, &(const struct qstr) { "futex", 5, 0 });
+ sb->s_root->d_sb = sb;
+ sb->s_root->d_parent = sb->s_root;
+ d_instantiate(sb->s_root, root);
+
+ /* We never let this drop to zero. */
+ futex_dentry = dget(sb->s_root);
+
+ return 0;
+}
+
+static struct super_block *
+futexfs_get_sb(struct file_system_type *fs_type,
+ int flags, char *dev_name, void *data)
+{
+ return get_sb_nodev(fs_type, flags, data, futexfs_fill_super);
+}
+
+static struct file_system_type futex_fs_type = {
+ name: "futexfs",
+ get_sb: futexfs_get_sb,
+ kill_sb: kill_anon_super,
+ fs_flags: FS_NOMOUNT,
+};
+
static int __init init(void)
{
unsigned int i;
+
+ register_filesystem(&futex_fs_type);
+ kern_mount(&futex_fs_type);

for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
INIT_LIST_HEAD(&futex_queues[i]);

2002-06-12 09:14:30

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Rusty Russell wrote:
> In message <[email protected]> you wri
> te:
>
>>Rusty,
>> this makes no sense:
>>
>>D: This changes the implementation so that the waker actually unpins
>>D: the page. This is preparation for the async interface, where the
>>D: process which registered interest is not in the kernel.
>>
>>Whazzup? The closing of the fd will unpin the page, the waker has no
>>reason to do so. It is very much against the linux philosophy (and a
>>design disaster anyway) to have the waker muck with the data structures of
>>anything waiting.
>>
>
> Good catch: now the fd is a "one-shot" thing anyway, making close
> unpin the page makes more sense. Tested patch below (against 2.5.21).
>
> FYI: I already violate this philosophy as I remove the waiter from the
> queue when I wake them: this allows them to tell that they were woken
> (waker does a list_del_init() on the waiting entry, so waiting knows
> if (list_empty()) I was woken).
>
> It would be more natural for the waiter to examine the futex value,
> and if it's still unchanged go back to sleep. But this makes
> assumptions about what they're using the futex value for. For
> example, we "PASS_THIS_DIRECTLY" value into the futex. This requires
> that one (and ONLY one) process waiting actually wakes up.
>
> This is why coming up with a primitive which allowed us to build posix
> threads and fair queueing as well as "normal" unfair semantics took so
> damn long.
>

What are the plans on how to deal with a waiter when the lock holder
dies abnormally?

What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
setting errno to a reasonable value (EIO?)

I couldn't find anything in susv3


2002-06-12 15:20:58

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

On Wednesday 12 June 2002 05:16 am, Peter W?chtler wrote:
> Rusty Russell wrote:
> > In message <[email protected]>
> > you wri
> >
> > te:
> >>Rusty,
> >> this makes no sense:
> >>
> >>D: This changes the implementation so that the waker actually unpins
> >>D: the page. This is preparation for the async interface, where the
> >>D: process which registered interest is not in the kernel.
> >>
> >>Whazzup? The closing of the fd will unpin the page, the waker has no
> >>reason to do so. It is very much against the linux philosophy (and a
> >>design disaster anyway) to have the waker muck with the data structures
> >> of anything waiting.
> >
> > Good catch: now the fd is a "one-shot" thing anyway, making close
> > unpin the page makes more sense. Tested patch below (against 2.5.21).
> >
> > FYI: I already violate this philosophy as I remove the waiter from the
> > queue when I wake them: this allows them to tell that they were woken
> > (waker does a list_del_init() on the waiting entry, so waiting knows
> > if (list_empty()) I was woken).
> >
> > It would be more natural for the waiter to examine the futex value,
> > and if it's still unchanged go back to sleep. But this makes
> > assumptions about what they're using the futex value for. For
> > example, we "PASS_THIS_DIRECTLY" value into the futex. This requires
> > that one (and ONLY one) process waiting actually wakes up.
> >
> > This is why coming up with a primitive which allowed us to build posix
> > threads and fair queueing as well as "normal" unfair semantics took so
> > damn long.
>
> What are the plans on how to deal with a waiter when the lock holder
> dies abnormally?
>
> What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
> setting errno to a reasonable value (EIO?)
>
> I couldn't find anything in susv3

I thing this was decided some time ago that we won't deal with this situation.
If we need to, the following needs to happen.

A) we need to follow a protocol to register the PID/TID id within the lock.
Example of this is described in
"Recoverable User-Level Mutual Exclusion" by Phiilip Bohannon
Proceedings of the 7th IEEE Symposium on Parallel and Distributed
Systems, 1995.

B) this again requires that some entity (kernel ?) knows about all locks,
whether waited on in the kernel or not.

C) I believe we need a deamon that occasinally identifies dead locks.

Is it worth all this trouble? Or do we need two versions of the ?

The paper describes that for a Sun SS20/61 the safe spin locks obtained
only 25% of the performance of spinlocks.


--
-- Hubertus Franke ([email protected])

2002-06-12 15:42:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface


On Wed, 12 Jun 2002, Peter W?chtler wrote:
>
> What are the plans on how to deal with a waiter when the lock holder
> dies abnormally?

That's why they are called FUTEX'es - they're fast. They're NOT SysV
semaphores, and they are done 99% in user space. The kernel doesn't even
_know_ about them until contention happens, and even then only in a rather
dim "somebody wants me to do this, but I don't know _what_ he is doing"
way.

> What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
> setting errno to a reasonable value (EIO?)

There's just nothing the kernel _can_ do. The common case (by far) is that
the kernel has never seen the futex at all, since many uses are likely to
not have much contention. So when a user program dies holding such a
uncontended lock, the kernel simply _cannot_ do anything.

(The kernel also cannot do anything even for the contended locks, because
the whole interface is designed for speed and with the knowledge that the
kernel won't be able to fix stuff up, so the kernel doesn't actually have
enough information even in the contention case. See the "dim notion"
above).

Besides, if you have a threads package that uses some lock for mutual
exclusion, and a thread dies while holding the lock, there's nothign sane
anybody can do about it anyway. The data structures are likely to be in an
invalid state, and just making every other thread block on the lock until
you can attach a debugger is probably the closest to a _right_ thing you
can do.

In short: it's not a bug, it's a design feature, and it's very much
designed for efficiency.

Linus

2002-06-12 16:27:38

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Linus Torvalds wrote:
> On Wed, 12 Jun 2002, Peter W?chtler wrote:
>
>>What are the plans on how to deal with a waiter when the lock holder
>>dies abnormally?
>>
>
> That's why they are called FUTEX'es - they're fast. They're NOT SysV
> semaphores, and they are done 99% in user space. The kernel doesn't even
> _know_ about them until contention happens, and even then only in a rather
> dim "somebody wants me to do this, but I don't know _what_ he is doing"
> way.
>
>
>>What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
>>setting errno to a reasonable value (EIO?)
>>
>
> There's just nothing the kernel _can_ do. The common case (by far) is that
> the kernel has never seen the futex at all, since many uses are likely to
> not have much contention. So when a user program dies holding such a
> uncontended lock, the kernel simply _cannot_ do anything.
>
For the uncontended case: their is no blocked process...

Huh, I think you misunderstood me.

One (or more) process is blocked in a waitqueue in the kernel - waiting
for a futex to be released.

The lock holder crashes - say with SIGSEGV. Now if we don't release the
waiters, they wait until reboot or user/admin kills them with a signal -
assuming they are interruptible sleeping.

I know that the kernel can't do anything about the aborted critical section.
But the waiters should be "freed" - and now we can discuss if we kill them
or report an error and let them deal with that.

So we surely have a process_exit_cleanup function (where FDs are closed etc).
There we would have to add a check if that process is holding a futex, the
waitqueue for that and "release" all waiters.

Can't be done? I don't think that this would add a performance hit
since it's only done on exit (and especially "abnormal" exit).

There is no way to check if a process holds a futex and which processes
are blocked on the associated waitqueue?

The waitqueue is built upon linking a struct futex_q list on the blocked
processes stack.

The entries to these lists are in a static array
struct list_head futex_queues[1<<FUTEX_HASHBITS].
At least we could search them on "exit due to fatal signal" when exiting.
Perhaps spending a bit in task_struct WHEN they got a lock - so we don't
have to search on every process exit.

Yes, searching the hash array lists could last a long time, but:

Is process exit time that important?

> (The kernel also cannot do anything even for the contended locks, because
> the whole interface is designed for speed and with the knowledge that the
> kernel won't be able to fix stuff up, so the kernel doesn't actually have
> enough information even in the contention case. See the "dim notion"
> above).
>
> Besides, if you have a threads package that uses some lock for mutual
> exclusion, and a thread dies while holding the lock, there's nothign sane
> anybody can do about it anyway. The data structures are likely to be in an
> invalid state, and just making every other thread block on the lock until
> you can attach a debugger is probably the closest to a _right_ thing you
> can do.
>
> In short: it's not a bug, it's a design feature, and it's very much
> designed for efficiency.
>

And leave dangling processes (lost futex zombies)?


2002-06-12 16:48:06

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Hubertus Franke wrote:
> On Wednesday 12 June 2002 05:16 am, Peter W?chtler wrote:
>
>>Rusty Russell wrote:
>>
>>>In message <[email protected]>
>>>you wri
>>>
>>>te:
>>>
>>>>Rusty,
>>>>this makes no sense:
>>>>
>>>>D: This changes the implementation so that the waker actually unpins
>>>>D: the page. This is preparation for the async interface, where the
>>>>D: process which registered interest is not in the kernel.
>>>>
>>>>Whazzup? The closing of the fd will unpin the page, the waker has no
>>>>reason to do so. It is very much against the linux philosophy (and a
>>>>design disaster anyway) to have the waker muck with the data structures
>>>>of anything waiting.
>>>>
>>>Good catch: now the fd is a "one-shot" thing anyway, making close
>>>unpin the page makes more sense. Tested patch below (against 2.5.21).
>>>
>>>FYI: I already violate this philosophy as I remove the waiter from the
>>>queue when I wake them: this allows them to tell that they were woken
>>>(waker does a list_del_init() on the waiting entry, so waiting knows
>>>if (list_empty()) I was woken).
>>>
>>>It would be more natural for the waiter to examine the futex value,
>>>and if it's still unchanged go back to sleep. But this makes
>>>assumptions about what they're using the futex value for. For
>>>example, we "PASS_THIS_DIRECTLY" value into the futex. This requires
>>>that one (and ONLY one) process waiting actually wakes up.
>>>
>>>This is why coming up with a primitive which allowed us to build posix
>>>threads and fair queueing as well as "normal" unfair semantics took so
>>>damn long.
>>>
>>What are the plans on how to deal with a waiter when the lock holder
>>dies abnormally?
>>
>>What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
>>setting errno to a reasonable value (EIO?)
>>
>>I couldn't find anything in susv3
>>
>
> I thing this was decided some time ago that we won't deal with this situation.
> If we need to, the following needs to happen.
>
> A) we need to follow a protocol to register the PID/TID id within the lock.
> Example of this is described in
> "Recoverable User-Level Mutual Exclusion" by Phiilip Bohannon
> Proceedings of the 7th IEEE Symposium on Parallel and Distributed
> Systems, 1995.
>
> B) this again requires that some entity (kernel ?) knows about all locks,
> whether waited on in the kernel or not.
>
> C) I believe we need a deamon that occasinally identifies dead locks.
>
> Is it worth all this trouble? Or do we need two versions of the ?
>
> The paper describes that for a Sun SS20/61 the safe spin locks obtained
> only 25% of the performance of spinlocks.
>

Oops, I see it already myself. We lack a way for saying who is/was owning
the futex and so we can hardly tell who is waiting on this "unknown" lock.
:-(







2002-06-12 16:52:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface




On Wed, 12 Jun 2002, Peter W?chtler wrote:
>
> For the uncontended case: their is no blocked process...

Wrong.

The process that holds the lock can die _before_ it gets contended.

When another thread comes in, it now is contended, but the kernel doesn't
know about anything.

> One (or more) process is blocked in a waitqueue in the kernel - waiting
> for a futex to be released.
>
> The lock holder crashes - say with SIGSEGV.

The lock holder may have crashed long before the waiting process even
started waiting.

Besides, the kernel only knows about those processes that see contention.
Even if the contention happened _before_ the lock holder crashed, the
kernel doesn't know about the lock holder itself - it only knows about the
process that caused the contention. The kernel will get to know about the
lock holder only when it tris to resolve the contention, and since that
one has crashed, that will never happen.

> I know that the kernel can't do anything about the aborted critical section.
> But the waiters should be "freed" - and now we can discuss if we kill them
> or report an error and let them deal with that.

The waiters should absolutely _not_ be freed. There's nothign they can do
about it. The data inside the critical region is no longer valid, and

> Can't be done? I don't think that this would add a performance hit
> since it's only done on exit (and especially "abnormal" exit).

But the point is not that it would be a performance hit on "exit()", but
that WE DON'T TRACK THE LOCKS in the kernel in the first place.

Right now the kernel does _zero_ work for a lock that isn't contended. It
doesn't know _anything_ about the process that got the lock initially.

Any amount of tracking would be _extremely_ expensive. Right now getting
an uncontended lock is about 15 CPU cycles in user space.

Tryin to tell the kernel about gettign that lock takes about 1us on a P4
(system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
minimum. Compared to the current 15 cycles. That's more than three orders
of magnitude slower than the current code, and you just lost the whole
point of doing this all in user space in the first place.

Linus

2002-06-12 17:05:45

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Linus Torvalds wrote:
>
>
> On Wed, 12 Jun 2002, Peter W?chtler wrote:
>
>>For the uncontended case: their is no blocked process...
>>
>
> Wrong.
>
> The process that holds the lock can die _before_ it gets contended.
>
> When another thread comes in, it now is contended, but the kernel doesn't
> know about anything.
>
>
>>One (or more) process is blocked in a waitqueue in the kernel - waiting
>>for a futex to be released.
>>
>>The lock holder crashes - say with SIGSEGV.
>>
>
> The lock holder may have crashed long before the waiting process even
> started waiting.
>
> Besides, the kernel only knows about those processes that see contention.
> Even if the contention happened _before_ the lock holder crashed, the
> kernel doesn't know about the lock holder itself - it only knows about the
> process that caused the contention. The kernel will get to know about the
> lock holder only when it tris to resolve the contention, and since that
> one has crashed, that will never happen.
>
>
>>I know that the kernel can't do anything about the aborted critical section.
>>But the waiters should be "freed" - and now we can discuss if we kill them
>>or report an error and let them deal with that.
>>
>
> The waiters should absolutely _not_ be freed. There's nothign they can do
> about it. The data inside the critical region is no longer valid, and
>
>
>>Can't be done? I don't think that this would add a performance hit
>>since it's only done on exit (and especially "abnormal" exit).
>>
>
> But the point is not that it would be a performance hit on "exit()", but
> that WE DON'T TRACK THE LOCKS in the kernel in the first place.
>
> Right now the kernel does _zero_ work for a lock that isn't contended. It
> doesn't know _anything_ about the process that got the lock initially.
>
> Any amount of tracking would be _extremely_ expensive. Right now getting
> an uncontended lock is about 15 CPU cycles in user space.
>
> Tryin to tell the kernel about gettign that lock takes about 1us on a P4
> (system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
> minimum. Compared to the current 15 cycles. That's more than three orders
> of magnitude slower than the current code, and you just lost the whole
> point of doing this all in user space in the first place.
>

Thanks for this patient explanation. I see the problem now clearly.

To Frank: I will read the (already downloaded) paper ;-)

And to all: Did you notice the "nutex" approach
http://marc.theaimsgroup.com/?l=linux-kernel&m=102373047428621&w=2

2002-06-12 18:15:01

by Vladimir Zidar

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

On Wed, 2002-06-12 at 18:50, Peter W?chtler wrote:

> >>What are the plans on how to deal with a waiter when the lock holder
> >>dies abnormally?
> >>
> >>What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
> >>setting errno to a reasonable value (EIO?)
> >>
> >>I couldn't find anything in susv3
> >>
> >
> > I thing this was decided some time ago that we won't deal with this situation.
> > If we need to, the following needs to happen.
> >
> > A) we need to follow a protocol to register the PID/TID id within the lock.
> > Example of this is described in
> > "Recoverable User-Level Mutual Exclusion" by Phiilip Bohannon
> > Proceedings of the 7th IEEE Symposium on Parallel and Distributed
> > Systems, 1995.
> >
> > B) this again requires that some entity (kernel ?) knows about all locks,
> > whether waited on in the kernel or not.
> >
> > C) I believe we need a deamon that occasinally identifies dead locks.
> >
> > Is it worth all this trouble? Or do we need two versions of the ?
> >
> > The paper describes that for a Sun SS20/61 the safe spin locks obtained
> > only 25% of the performance of spinlocks.
> >
>
> Oops, I see it already myself. We lack a way for saying who is/was owning
> the futex and so we can hardly tell who is waiting on this "unknown" lock.
> :-(

Which is not the case in 'nutex' implementation.

Take look:

http://www.mindnever.org/~mr_w/nutex_mod.tar.gz


--
Bye,

and have a very nice day !



2002-06-12 18:30:38

by Saurabh Desai

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Linus Torvalds wrote:
>
> On Wed, 12 Jun 2002, Peter W?chtler wrote:
> >
> > For the uncontended case: their is no blocked process...
>
> Wrong.
>
> The process that holds the lock can die _before_ it gets contended.
>
> When another thread comes in, it now is contended, but the kernel doesn't
> know about anything.
>
> > One (or more) process is blocked in a waitqueue in the kernel - waiting
> > for a futex to be released.
> >
> > The lock holder crashes - say with SIGSEGV.
>
> The lock holder may have crashed long before the waiting process even
> started waiting.

Is it possible to check whether the holder process exist before
start waiting? may be in user-space?

>
> Besides, the kernel only knows about those processes that see contention.
> Even if the contention happened _before_ the lock holder crashed, the
> kernel doesn't know about the lock holder itself - it only knows about the
> process that caused the contention. The kernel will get to know about the
> lock holder only when it tris to resolve the contention, and since that
> one has crashed, that will never happen.

Just curious: is there anything to register a cleanup handler
when a process holds a lock, so when that goes away, run the cleanup and
wakeup all waiters?

>
> > I know that the kernel can't do anything about the aborted critical section.
> > But the waiters should be "freed" - and now we can discuss if we kill them
> > or report an error and let them deal with that.
>
> The waiters should absolutely _not_ be freed. There's nothign they can do
> about it. The data inside the critical region is no longer valid, and
>
> > Can't be done? I don't think that this would add a performance hit
> > since it's only done on exit (and especially "abnormal" exit).
>
> But the point is not that it would be a performance hit on "exit()", but
> that WE DON'T TRACK THE LOCKS in the kernel in the first place.
>
> Right now the kernel does _zero_ work for a lock that isn't contended. It
> doesn't know _anything_ about the process that got the lock initially.
>
> Any amount of tracking would be _extremely_ expensive. Right now getting
> an uncontended lock is about 15 CPU cycles in user space.
>
> Tryin to tell the kernel about gettign that lock takes about 1us on a P4
> (system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
> minimum. Compared to the current 15 cycles. That's more than three orders
> of magnitude slower than the current code, and you just lost the whole
> point of doing this all in user space in the first place.
>
> Linus
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2002-06-12 20:06:08

by Oliver Xymoron

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

On Wed, 12 Jun 2002, Linus Torvalds wrote:

> Any amount of tracking would be _extremely_ expensive. Right now getting
> an uncontended lock is about 15 CPU cycles in user space.
>
> Tryin to tell the kernel about gettign that lock takes about 1us on a P4
> (system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
> minimum. Compared to the current 15 cycles. That's more than three orders
> of magnitude slower than the current code, and you just lost the whole
> point of doing this all in user space in the first place.

That doesn't rule out approaches like storing a cookie alongside the lock
once it's acquired (or in a parallel space). Which can easily be done with
a wrapper around lock acquisition. And stale lock detection needn't be
done in kernel space either.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-06-12 20:19:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface


On Wed, 12 Jun 2002, Oliver Xymoron wrote:
>
> That doesn't rule out approaches like storing a cookie alongside the lock
> once it's acquired (or in a parallel space). Which can easily be done with
> a wrapper around lock acquisition. And stale lock detection needn't be
> done in kernel space either.

Oh, agreed, you can do debugging locks in user-space, but it won't be the
kernel that does anything, it will instead have to depend on something
like "if I time out on the lock, I can explicitly test if the previous
holder (who saved his thread ID in memory after getting the lock) is still
alive and try to clean up after him".

This is what a lot of the filesystem locking code does for the things in
/var/lock/xxx, of course.

No kernel necessarily involved.

But it's going to have to depend on the politeness of all threads that
partake in the locking.

Linus

2002-06-13 01:29:53

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

In message <[email protected]> you write:
> What are the plans on how to deal with a waiter when the lock holder
> dies abnormally?
>
> What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
> setting errno to a reasonable value (EIO?)

This is becoming an FAQ:

No, you can't. These locks are fast because they don't go through the
kernel in the contented case -> the kernel doesn't know who has it.

You can deal with this in userspace, though, if you make certain
assumptions. Have a daemon which say, polls the futexes every second
and when a futex is taken for two seconds in a row, tries to grab it
itself with a 1 second timeout: if it fails it assumes the lock is
stuck and take action (note: anyone can lock or unlock, there's no
lock "owner" except by convention).

This is not perfect, but neither is the fcntl lock case of "process
dies, lock vanishes, noone gets told". And since the cleanup is
app-specific, making the whole thing app-specific is not so crazy.

Hope that clarifies,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-13 02:53:45

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

In message <[email protected]> you wr
ite:
>
>
>
> On Wed, 12 Jun 2002, Peter W=E4chtler wrote:
> >
> > For the uncontended case: their is no blocked process...
>
> Wrong.
>
> The process that holds the lock can die _before_ it gets contended.
>
> When another thread comes in, it now is contended, but the kernel doesn't
> know about anything.

Note also: this is a feature.

I have a little helper program which can grab or release a futex in a
(mmapped) file. It's great for shell scripts to grab locks. In this
case the helper exits with the lock held, and a later invocation
releases a lock it never held.

*AND* the lock is persistent across reboots, since it's in a file.
How cool is that!

This is the *third* major thread on this subject, BTW.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-13 09:35:48

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Rusty Russell wrote:
> In message <[email protected]> you wr
> ite:
>
>>
>>
>>On Wed, 12 Jun 2002, Peter W=E4chtler wrote:
>>
>>>For the uncontended case: their is no blocked process...
>>>
>>Wrong.
>>
>>The process that holds the lock can die _before_ it gets contended.
>>
>>When another thread comes in, it now is contended, but the kernel doesn't
>>know about anything.
>>
>
> Note also: this is a feature.
>
> I have a little helper program which can grab or release a futex in a
> (mmapped) file. It's great for shell scripts to grab locks. In this
> case the helper exits with the lock held, and a later invocation
> releases a lock it never held.
>
> *AND* the lock is persistent across reboots, since it's in a file.
> How cool is that!
>

Don't want to bugg you: but you would have to clean them up in any case
when you restart your system of cooperating programs.

Posix shmem would be a nice place to store your mmaped file so that it's
gone after reboot - but gives "kernel life time".

And not that I want to put the futexes down: but now I understand why
the PROCESS_SHARED locks on Irix live in the kernel. Yes, perhaps we
should provide both and the app can choose what suits best.


2002-06-13 09:52:34

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

In message <[email protected]> you write:
> Rusty Russell wrote:
> > *AND* the lock is persistent across reboots, since it's in a file.
> > How cool is that!
> >
>
> Don't want to bugg you: but you would have to clean them up in any case
> when you restart your system of cooperating programs.

Sorry, I should have added that I don't think it's cool in a *useful*
way.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-13 16:40:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface



On Thu, 13 Jun 2002, Gabriel Paubert wrote:
>
> Please tell us where you got your 18GHz CPU (18000 cycles/microsecond) ;-)

Ok, ok, I can't do math. Rub it in.

Linus

2002-06-13 16:38:27

by Gabriel Paubert

[permalink] [raw]
Subject: Re: [PATCH] Futex Asynchronous Interface

Linus Torvalds wrote:

> Right now the kernel does _zero_ work for a lock that isn't contended. It
> doesn't know _anything_ about the process that got the lock initially.
>
> Any amount of tracking would be _extremely_ expensive. Right now getting
> an uncontended lock is about 15 CPU cycles in user space.
>
> Tryin to tell the kernel about gettign that lock takes about 1us on a P4
> (system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
> minimum. Compared to the current 15 cycles. That's more than three orders
> of magnitude slower than the current code, and you just lost the whole
> point of doing this all in user space in the first place.


Please tell us where you got your 18GHz CPU (18000 cycles/microsecond) ;-)

I want one, badly!

Gabriel