2002-01-07 20:06:31

by Matthew Kirkwood

[permalink] [raw]
Subject: [PATCH][RFC] Lightweight user-level semaphores

Linus,

The patch below implements some of your design for a really
quick user-level locking primitive, as explained here:

http://lwn.net/2001/0419/a/lt-semaphores.php3

There's a user-level API and a couple of test programs in
the attached tarball. I haven't bothered wih the vital
security hash/signature thing yet.

It all seems to work (i686 UP and SMP), but isn't without
issues:

* It leaks. How were you going to refcount the kernel
portions? Could they be attached to the VM mapping?
Would a lockfs be too expensive?

* It doesn't have a timeout. Is there something like a
down_timeout() available?

* I don't do the:

if (kfs->user_address != fs)
goto bad_sem;

because it doesn't seem to add anything, and prevents
putting these locks in a non-fixed file or SysV SHM
map.

Is that a problem?

Any comments?

Matthew.


diff -ruN linux-2.4.17/arch/i386/kernel/entry.S linux-2.4.17-usersem/arch/i386/kernel/entry.S
--- linux-2.4.17/arch/i386/kernel/entry.S Sat Jan 5 14:00:37 2002
+++ linux-2.4.17-usersem/arch/i386/kernel/entry.S Sun Jan 6 13:52:50 2002
@@ -622,6 +622,9 @@
.long SYMBOL_NAME(sys_ni_syscall) /* Reserved for Security */
.long SYMBOL_NAME(sys_gettid)
.long SYMBOL_NAME(sys_readahead) /* 225 */
+ .long SYMBOL_NAME(sys_FS_create)
+ .long SYMBOL_NAME(sys_FS_down)
+ .long SYMBOL_NAME(sys_FS_up)

.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
diff -ruN linux-2.4.17/include/asm-i386/unistd.h linux-2.4.17-usersem/include/asm-i386/unistd.h
--- linux-2.4.17/include/asm-i386/unistd.h Wed Oct 17 18:03:03 2001
+++ linux-2.4.17-usersem/include/asm-i386/unistd.h Sun Jan 6 13:49:54 2002
@@ -230,6 +230,9 @@
#define __NR_security 223 /* syscall for security modules */
#define __NR_gettid 224
#define __NR_readahead 225
+#define __NR_FS_create 226
+#define __NR_FS_down 227
+#define __NR_FS_up 228

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

diff -ruN linux-2.4.17/include/linux/usersem.h linux-2.4.17-usersem/include/linux/usersem.h
--- linux-2.4.17/include/linux/usersem.h Thu Jan 1 01:00:00 1970
+++ linux-2.4.17-usersem/include/linux/usersem.h Sun Jan 6 17:50:04 2002
@@ -0,0 +1,9 @@
+#ifndef __LINUX_USERSEM_H
+#define __LINUX_USERSEM_H
+
+struct fast_sem {
+ int count;
+ void *__opaque_ksem;
+};
+
+#endif /* __LINUX_USERSEM_H */
diff -ruN linux-2.4.17/kernel/Makefile linux-2.4.17-usersem/kernel/Makefile
--- linux-2.4.17/kernel/Makefile Mon Sep 17 05:22:40 2001
+++ linux-2.4.17-usersem/kernel/Makefile Sat Jan 5 14:36:39 2002
@@ -9,12 +9,12 @@

O_TARGET := kernel.o

-export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o printk.o
+export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o printk.o usersem.o

obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o info.o time.o softirq.o resource.o \
sysctl.o acct.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o
+ signal.o sys.o kmod.o context.o usersem.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -ruN linux-2.4.17/kernel/usersem.c linux-2.4.17-usersem/kernel/usersem.c
--- linux-2.4.17/kernel/usersem.c Thu Jan 1 01:00:00 1970
+++ linux-2.4.17-usersem/kernel/usersem.c Sun Jan 6 17:50:20 2002
@@ -0,0 +1,82 @@
+/*
+ * Lightweight user-level semaphores
+ */
+
+#include <linux/slab.h>
+#include <linux/usersem.h>
+
+#include <linux/spinlock.h>
+#include <asm/semaphore.h>
+#include <asm/errno.h>
+
+/* Don't be messin' with my van^H^H^Hopaque data, sucka */
+#define FS_SIG_MAGIC 0xf001f001
+
+struct ksem {
+ unsigned magic;
+ spinlock_t spin;
+ struct semaphore sem;
+};
+
+
+/* XXX - _from_user */
+static inline struct ksem *get_ksem(struct fast_sem *s)
+{
+ struct ksem *r = (struct ksem*)s->__opaque_ksem;
+ if(!r) return NULL;
+ if(r->magic != FS_SIG_MAGIC) return NULL;
+ return r;
+}
+
+/* XXX - _to_user */
+static inline int store_ksem(struct fast_sem *s, struct ksem *k)
+{
+ s->count = 0;
+ s->__opaque_ksem = (void*)k;
+ return 0;
+}
+
+static struct ksem *ksem_new(void)
+{
+ struct ksem *s;
+ s = kmalloc(sizeof(*s), GFP_KERNEL);
+ s->magic = FS_SIG_MAGIC;
+ spin_lock_init(&s->spin);
+ sema_init(&s->sem, 1);
+ return s;
+}
+
+
+asmlinkage long sys_FS_create(struct fast_sem *sem)
+{
+ struct ksem *ksem;
+ if(!(ksem = ksem_new()))
+ return -ENOMEM;
+ if(store_ksem(sem, ksem)) {
+ //ksem_free(ksem);
+ return -EFAULT;
+ }
+ return 0;
+}
+
+
+asmlinkage long sys_FS_down(struct fast_sem *sem)
+{
+ struct ksem *ksem;
+
+ if(!(ksem = get_ksem(sem)))
+ return -ENOENT;
+
+ down_interruptible(&ksem->sem);
+ return 0;
+}
+
+asmlinkage long sys_FS_up(struct fast_sem *sem)
+{
+ struct ksem *ksem;
+ if(!(ksem = get_ksem(sem)))
+ return -ENOENT;
+
+ up(&ksem->sem);
+ return 0;
+}


Attachments:
ust.tar.gz (1.67 kB)

2002-01-07 20:31:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores


On Mon, 7 Jan 2002, Matthew Kirkwood wrote:
>
> * It leaks. How were you going to refcount the kernel
> portions? Could they be attached to the VM mapping?
> Would a lockfs be too expensive?

Yes, I was going to just attach to the vma, along with potentially also
require a flag at mmap time (MAP_SEMAPHORE - some other unixes have
something like it already) to tell the OS about the consistency issues
that might come up on some architectures (on x86 it would be a no-op).

> * It doesn't have a timeout. Is there something like a
> down_timeout() available?

Not as-is, but all the kernel infrastructure should be there in theory.

> * I don't do the:
>
> if (kfs->user_address != fs)
> goto bad_sem;
>
> because it doesn't seem to add anything, and prevents
> putting these locks in a non-fixed file or SysV SHM
> map.

Fair enough. I think I suggested that just as another sanity check, and
because some architectures _will_ require address issues (not necessarily
total equality, but at least "modulo X equality").

Linus

2002-01-07 22:04:02

by Matthew Kirkwood

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

On Mon, 7 Jan 2002, Linus Torvalds wrote:

> > * It leaks. How were you going to refcount the kernel
> > portions? Could they be attached to the VM mapping?
> > Would a lockfs be too expensive?
>
> Yes, I was going to just attach to the vma,

Wouldn't that have to be an address_space, so separate maps
of the same object will use the same count? Or (not unlikely)
am I misunderstanding the way these structures are laid out?

> along with potentially also require a flag at mmap time (MAP_SEMAPHORE
> - some other unixes have something like it already) to tell the OS
> about the consistency issues that might come up on some architectures
> (on x86 it would be a no-op).

OK.

> > * It doesn't have a timeout. Is there something like a
> > down_timeout() available?
>
> Not as-is, but all the kernel infrastructure should be there in
> theory.

OK, thanks.

> > * I don't do the:
> >
> > if (kfs->user_address != fs)
> > goto bad_sem;
> >
> > because it doesn't seem to add anything, and prevents
> > putting these locks in a non-fixed file or SysV SHM
> > map.
>
> Fair enough. I think I suggested that just as another sanity check,
> and because some architectures _will_ require address issues (not
> necessarily total equality, but at least "modulo X equality").

Should being in the same place in the same page (though
possibly at a different address) should suffice for all
architectures?

Matthew.

2002-01-07 22:40:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores


On Mon, 7 Jan 2002, Matthew Kirkwood wrote:
> >
> > Yes, I was going to just attach to the vma,
>
> Wouldn't that have to be an address_space, so separate maps
> of the same object will use the same count? Or (not unlikely)
> am I misunderstanding the way these structures are laid out?

I would just mke the creator be special. The guy who creates the semaphore
owns it, and if he unmaps it, it's gone.

Note that there are other, potentially cleaner solutions. In particular,
some people like the "semaphore as file descriptor" approach, and I have
to say that I think they may be right. Then you just pass the file
descriptor along as the cookie, and you can do dup()/close() etc on it.

Mind trying that approach instead? It's not all that far off from your
current setup, and it would certainly have none of the security
implications..

Linus

2002-01-08 13:45:34

by Alan

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

> * I don't do the:
>
> if (kfs->user_address != fs)
> goto bad_sem;
>
> because it doesn't seem to add anything, and prevents
> putting these locks in a non-fixed file or SysV SHM
> map.

The security side of it is basically non existant anyway. If you can map it
you can play naughty. If I decide to do raw I/O DMA directly into that
segment you either litter the kernel with special cases or accept that
if you do stupid things it breaks.

I'm for the latter 8)

> +static inline struct ksem *get_ksem(struct fast_sem *s)
> +{
> + struct ksem *r = (struct ksem*)s->__opaque_ksem;
> + if(!r) return NULL;
> + if(r->magic != FS_SIG_MAGIC) return NULL;

Bang, dead, raw hardware access - game over. You can't dereference the
untrusted pointer to check if its valid, even to look at it. Wouldn't it
be easier and safer to create a two page map, where page 0 is the r/w
objects and page 1 is mapped r/o or kernel private and consists of identical
sized objects ? Then its a case of offset from page start + array bias.

> + struct ksem *s;
> + s = kmalloc(sizeof(*s), GFP_KERNEL);

if(s==NULL) check missing


Alan

2002-01-09 23:24:15

by Matthew Kirkwood

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

On Tue, 8 Jan 2002, Linus Torvalds wrote:

> > Sorry to bother you, but how much inode/dentry stuff do I
> > need to have?
>
> You do need to have a dentry and an inode, but they don't have to
> contain anything interesting. In fact, you could have one global
> inode/dentry and just point the filp always at that.

Here we go. I'm sure Al and others will pull the fs stuff
to bits, but it does seem to work. The previously attached
test stuff didn't need any changes and can be fetched from

http://hairy.beasts.org/ust.tar.gz

Both user and kernel bits are, of course, improvable, but
this should at least show that the approach works.

Matthew.


diff -X dontdiff -ruN linux-2.4.17/arch/i386/kernel/entry.S linux-2.4.17-usersem/arch/i386/kernel/entry.S
--- linux-2.4.17/arch/i386/kernel/entry.S Sat Jan 5 14:00:37 2002
+++ linux-2.4.17-usersem/arch/i386/kernel/entry.S Sun Jan 6 13:52:50 2002
@@ -622,6 +622,9 @@
.long SYMBOL_NAME(sys_ni_syscall) /* Reserved for Security */
.long SYMBOL_NAME(sys_gettid)
.long SYMBOL_NAME(sys_readahead) /* 225 */
+ .long SYMBOL_NAME(sys_FS_create)
+ .long SYMBOL_NAME(sys_FS_down)
+ .long SYMBOL_NAME(sys_FS_up)

.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
diff -X dontdiff -ruN linux-2.4.17/include/asm-i386/unistd.h linux-2.4.17-usersem/include/asm-i386/unistd.h
--- linux-2.4.17/include/asm-i386/unistd.h Wed Oct 17 18:03:03 2001
+++ linux-2.4.17-usersem/include/asm-i386/unistd.h Sun Jan 6 13:49:54 2002
@@ -230,6 +230,9 @@
#define __NR_security 223 /* syscall for security modules */
#define __NR_gettid 224
#define __NR_readahead 225
+#define __NR_FS_create 226
+#define __NR_FS_down 227
+#define __NR_FS_up 228

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

diff -X dontdiff -ruN linux-2.4.17/include/linux/usersem.h linux-2.4.17-usersem/include/linux/usersem.h
--- linux-2.4.17/include/linux/usersem.h Thu Jan 1 01:00:00 1970
+++ linux-2.4.17-usersem/include/linux/usersem.h Mon Jan 7 22:47:28 2002
@@ -0,0 +1,11 @@
+#ifndef __LINUX_USERSEM_H
+#define __LINUX_USERSEM_H
+
+#define FS_OPAQUE_SIZE 24
+
+struct fast_sem {
+ int count;
+ int fd;
+} __attribute__((aligned(64)));
+
+#endif /* __LINUX_USERSEM_H */
diff -X dontdiff -ruN linux-2.4.17/kernel/Makefile linux-2.4.17-usersem/kernel/Makefile
--- linux-2.4.17/kernel/Makefile Mon Sep 17 05:22:40 2001
+++ linux-2.4.17-usersem/kernel/Makefile Sat Jan 5 14:36:39 2002
@@ -9,12 +9,12 @@

O_TARGET := kernel.o

-export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o printk.o
+export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o printk.o usersem.o

obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o info.o time.o softirq.o resource.o \
sysctl.o acct.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o
+ signal.o sys.o kmod.o context.o usersem.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -X dontdiff -ruN linux-2.4.17/kernel/usersem.c linux-2.4.17-usersem/kernel/usersem.c
--- linux-2.4.17/kernel/usersem.c Thu Jan 1 01:00:00 1970
+++ linux-2.4.17-usersem/kernel/usersem.c Wed Jan 9 14:42:38 2002
@@ -0,0 +1,144 @@
+/*
+ * Lightweight user-level semaphores
+ */
+
+#include <linux/usersem.h>
+
+#include <linux/compiler.h>
+#include <linux/slab.h>
+#include <linux/random.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+
+#include <asm/uaccess.h>
+#include <asm/semaphore.h>
+#include <asm/errno.h>
+
+static int lock_fop_release(struct inode *i, struct file *f)
+{
+ kfree(f->private_data);
+}
+
+static struct file_operations lock_file_ops = {
+release: lock_fop_release,
+};
+
+
+static struct semaphore *get_sem(struct fast_sem *fs)
+{
+ int fd;
+ struct file *f;
+ if(get_user(fd, &fs->fd))
+ return NULL;
+ if(!(f = fget(fd)))
+ return NULL;
+ /* XXX - check that it's the right kind of file */
+ return (struct semaphore*)f->private_data;
+}
+
+
+asmlinkage long sys_FS_create(struct fast_sem *sem)
+{
+ int fd, err;
+ struct semaphore *s;
+ struct file *f;
+ static struct dentry *dent = NULL;
+ static struct inode *semi = NULL;
+
+ if((fd = get_unused_fd()) < 0)
+ return fd;
+
+ if(!(f = get_empty_filp())) {
+ err = -ENFILE;
+ goto fd_out;
+ }
+
+ err = -ENOMEM;
+ if(unlikely(!dent)) {
+ if(!(dent = d_alloc(NULL, &(const struct qstr){ "lock", 4, 0})))
+ goto file_out;
+ }
+
+ if(unlikely(!semi)) {
+ if(!(semi = get_empty_inode()))
+ goto file_out;
+
+ semi->i_state = I_DIRTY;
+ semi->i_mode = S_IFREG | S_IRUSR | S_IWUSR;
+ semi->i_uid = current->fsuid;
+ semi->i_gid = current->fsgid;
+ semi->i_atime = semi->i_mtime = semi->i_ctime = CURRENT_TIME;
+ d_add(dent, semi);
+ }
+
+ if(!(s = kmalloc(sizeof(*s), GFP_KERNEL)))
+ goto file_out;
+
+ sema_init(s, 1);
+ f->f_dentry = dget(dent);
+ f->f_op = &lock_file_ops;
+ f->private_data = (void*)s;
+
+ if(put_user(0, &sem->count) || put_user(0, &sem->fd)) {
+ err = -EFAULT;
+ goto file_out;
+ }
+
+ fd_install(fd, f);
+
+ put_user(fd, &sem->fd);
+
+ return 0;
+
+file_out:
+ put_filp(f);
+fd_out:
+ put_unused_fd(fd);
+ return err;
+}
+
+
+static struct file * get_filp(struct fast_sem *s)
+{
+ int r;
+ if(get_user(r, &s->fd))
+ return NULL;
+ return fget(r);
+}
+
+
+asmlinkage long sys_FS_down(struct fast_sem *fs)
+{
+ int r = -EBADF;
+ struct semaphore *s;
+ struct file *f;
+
+ if(!(f = get_filp(fs)))
+ return -EBADF;
+
+ s = get_sem(fs);
+ if(s) {
+ down_interruptible(s);
+ r = 0;
+ }
+ fput(f);
+ return r;
+}
+
+asmlinkage long sys_FS_up(struct fast_sem *fs)
+{
+ int r = -EBADF;
+ struct semaphore *s;
+ struct file *f;
+
+ if(!(f = get_filp(fs)))
+ return -EBADF;
+
+ s = get_sem(fs);
+ if(s) {
+ up(s);
+ r = 0;
+ }
+ fput(f);
+ return r;
+}

2002-01-11 07:57:59

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

On Wed, 9 Jan 2002 23:23:55 +0000 (GMT)
Matthew Kirkwood <[email protected]> wrote:

> Both user and kernel bits are, of course, improvable, but
> this should at least show that the approach works.

Hi Matthew,

Prefer the "char device" approach myself. open = create, read = down,
write = up. Following (completely untested) patch stole your work to
implement "/dev/usem". Added bonus is the ability to mmap the fd to map in
the shared page, which means the fd carries a shared region with it (hey,
it was 14 more lines).

Userspace needs corrsponding updates, but this is basically FYI.

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.2-pre11/drivers/char/Makefile working-2.5.2-pre11-usem/drivers/char/Makefile
--- linux-2.5.2-pre11/drivers/char/Makefile Tue Nov 27 16:52:09 2001
+++ working-2.5.2-pre11-usem/drivers/char/Makefile Fri Jan 11 16:30:33 2002
@@ -16,7 +16,7 @@

O_TARGET := char.o

-obj-y += mem.o tty_io.o n_tty.o tty_ioctl.o raw.o pty.o misc.o random.o
+obj-y += mem.o tty_io.o n_tty.o tty_ioctl.o raw.o pty.o misc.o random.o usersem.o

# All of the (potential) objects that export symbols.
# This list comes from 'grep -l EXPORT_SYMBOL *.[hc]'.
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.2-pre11/drivers/char/usersem.c working-2.5.2-pre11-usem/drivers/char/usersem.c
--- linux-2.5.2-pre11/drivers/char/usersem.c Thu Jan 1 10:00:00 1970
+++ working-2.5.2-pre11-usem/drivers/char/usersem.c Fri Jan 11 16:25:04 2002
@@ -0,0 +1,93 @@
+/*
+ * Lightweight user-level semaphores
+ * (C) 2002 Matthew Kirkwood <[email protected]>
+ * Made into a char device by Rusty Russell.
+ */
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+/* Experimental major */
+#define USEM_MAJOR (240)
+
+struct usersem
+{
+ /* The kernel mutex itself */
+ struct semaphore mutex;
+ /* The shared page for users' convenience */
+ void *shared_page;
+};
+
+/* Read one byte = take mutex */
+static ssize_t read_usem(struct file *file, char *buf,
+ size_t count, loff_t *ppos)
+{
+ struct usersem *usem = file->private_data;
+ return down_interruptible(&usem->mutex);
+}
+
+/* Write one byte = give up mutex */
+static ssize_t write_usem(struct file *file, const char *buf,
+ size_t count, loff_t *ppos)
+{
+ struct usersem *usem = file->private_data;
+ up(&usem->mutex);
+ return 1;
+}
+
+/* Access page attached to the file descriptor. */
+static int mmap_usem(struct file *file, struct vm_area_struct *vma)
+{
+ struct usersem *usem = file->private_data;
+
+ if (vma->vm_end - vma->vm_start != PAGE_SIZE)
+ return -EINVAL;
+
+ return remap_page_range(vma->vm_start,
+ virt_to_phys(usem->shared_page),
+ vma->vm_end - vma->vm_start,
+ vma->vm_page_prot);
+}
+
+/* Open = create new usem */
+static int open_usem(struct inode *inode, struct file *filp)
+{
+ struct usersem *usem;
+
+ usem = kmalloc(sizeof(struct usersem), GFP_KERNEL);
+ if (!usem)
+ return -ENOMEM;
+ usem->shared_page = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!usem->shared_page) {
+ kfree(usem);
+ return -ENOMEM;
+ }
+ sema_init(usem->mutex, 1);
+ filp->private_data = usem;
+ return 0;
+}
+
+static int release(struct inode *inode, struct file *filp)
+{
+ struct usersem *usem = file->private_data;
+
+ free_page(usem->shared_page);
+ kfree(usem);
+ return 0;
+}
+
+static struct file_operations usem_fops = {
+ read: read_usem,
+ write: write_usem,
+ open: open_usem,
+ release: release_usem,
+};
+
+int __init usem_init(void)
+{
+ if (devfs_register_chrdev(USEM_MAJOR, "usem", &usem_fops))
+ printk("unable to get major %d for usem devs\n", USEM_MAJOR);
+ return 0;
+}
+
+__initcall(usem_init);

2002-01-11 11:41:39

by Matthew Kirkwood

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

On Fri, 11 Jan 2002, Rusty Russell wrote:

> > Both user and kernel bits are, of course, improvable, but
> > this should at least show that the approach works.

> Prefer the "char device" approach myself. open = create, read =
> down, write = up. Following (completely untested) patch stole your
> work to implement "/dev/usem". Added bonus is the ability to mmap the
> fd to map in the shared page, which means the fd carries a shared
> region with it (hey, it was 14 more lines).

Nice hack. I'm not particularly attached to my implementation
but:
* Dedicating a whole page per semaphore seems rather
expensive for a "lightweight" primitive.
* Possibly ditto even file descriptors. I may want
several thousand of these.
* Are there any numbers for the VFS overheads? There
are quite a few lock acquires in there, even if the
paths are fairly short.
* It would be nice to keep the userspace structure
opaque (except for the counter) and able to share
it transparently between processes.

Actually, the more I look at Linus's original idea, the more
sense it seems to make (and the more I regret scrapping my
almost-complete implementation of it for the fd idea :)

(Oh, and you forgot:

> +static struct file_operations usem_fops = {
> + read: read_usem,
> + write: write_usem,
> + open: open_usem,
> + release: release_usem,
mmap: mmap_usem,
> +};

)

Cheers,
Matthew.

2002-01-11 15:50:18

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

> Actually, the more I look at Linus's original idea, the more
> sense it seems to make (and the more I regret scrapping my
> almost-complete implementation of it for the fd idea :)

Do you have a plan how to implement the no-contention case entirely in
userspace?
That would make them really fast, not just saving 50 or 100 cycles
through a special syscall and bypassing VFS.

--
Manfred

2002-01-11 16:15:53

by Matthew Kirkwood

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

On Fri, 11 Jan 2002, Manfred Spraul wrote:

> > Actually, the more I look at Linus's original idea, the more
> > sense it seems to make (and the more I regret scrapping my
> > almost-complete implementation of it for the fd idea :)
>
> Do you have a plan how to implement the no-contention case entirely in
> userspace?
> That would make them really fast, not just saving 50 or 100 cycles
> through a special syscall and bypassing VFS.

Yep, it was really easy. Following Linus' design[0]
it was a really easy hack[1].

I'd like these things to be really easily shareable,
and that's harder to do when you need to communicate
a mapped area and a file descriptor.

But without an obvious handle, it's hard to collect
unreferenced locks.

Rusty's idea is nice (though I think it'd be better
with a filesystem than a device, so you can share
names rather than file descriptors) but the page per
lock seems like rather too much overhead.

Matthew.

[0] http://lwn.net/2001/0419/a/lt-semaphores.php3
[1] Kernel patch: http://hairy.beasts.org/usersem-2.4.17.diff
(I have a more complete patch on a machine which isn't up
right now.)
Userspace bit: http://hairy.beasts.org/ust.tar.gz

2002-01-11 16:52:43

by Alan

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

> Rusty's idea is nice (though I think it'd be better
> with a filesystem than a device, so you can share
> names rather than file descriptors) but the page per
> lock seems like rather too much overhead.

I don't think its a big problem. A page gets you 256 locks or whatever the
number ends up as. For the case where you have many fine grained locks
between a group of threads thats great. You just parcel them out. If its
two processes just wanting a lock between them, well they get a page with
room for 256 lock objects, but they only want one. That doesn't matter.
Its one page, if they need two or two hundred locks its stil one page.

Alan

2002-01-13 12:39:04

by Matthew Kirkwood

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

On Fri, 11 Jan 2002, Alan Cox wrote:

> > Rusty's idea is nice (though I think it'd be better
> > with a filesystem than a device, so you can share
> > names rather than file descriptors) but the page per
> > lock seems like rather too much overhead.
>
> I don't think its a big problem. A page gets you 256 locks or whatever
> the number ends up as. For the case where you have many fine grained
> locks between a group of threads thats great. You just parcel them
> out. If its two processes just wanting a lock between them, well they
> get a page with room for 256 lock objects, but they only want one.
> That doesn't matter. Its one page, if they need two or two hundred
> locks its stil one page.

Yep, that'd be fine. However, you then lose the neatness
of "lock==file descriptor", and need something other than
read/write for down/up.

So I think that the original idea of storing an opaque
cookie which, it happens, we can lookup to a kernel address,
is the way forward. Linus' design had:

/*
* Verify that it might be a valid kernel pointer
* before we even try to dereference it
*/
if ((unsigned long) kfs & 7)
goto bad_sem;
if (kfs < TASK_SIZE)
goto bad_sem;
if (kfs > TASK_SIZE+640k && kfs < TASK_SIZE + 1M)
goto bad_sem;
if (kfs > high_mem)
goto bad_sem;

which is a bit magic (and, no doubt, non-portable), but
doesn't look entirely unreasonable.

I guess the alternative is to store them in a hash table
or tree but I don't know what that would do to the
contended case.

Matthew.

2002-01-13 12:56:15

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

Matthew Kirkwood wrote:
>
>
> Yep, that'd be fine. However, you then lose the neatness
> of "lock==file descriptor", and need something other than
> read/write for down/up.
>
pread(), use the file pointer.

>
> I guess the alternative is to store them in a hash table
> or tree but I don't know what that would do to the
> contended case.
>
I'd start with file descriptor+pread(), and then check how much faster
your could get without any protection at all (i.e. just trust user
space). If the difference is small, then use the file descriptor.

--
Manfred

2002-01-13 15:02:01

by Alan

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

> Yep, that'd be fine. However, you then lose the neatness
> of "lock==file descriptor", and need something other than
> read/write for down/up.

If I have to have 2000 pages and 2000 file handles for 2000 locks I've
kind of lost interest. read/write syscalls take offsets. I can pread/pwrite
a lock in a set of locks. The only reason for using an fd I can see is so
you can poll on a lock. All the other neatness issues are wrapped in the
library support code anyway.

> I guess the alternative is to store them in a hash table
> or tree but I don't know what that would do to the
> contended case.

Read my old mail you need neither a hash table or a tree. You just need
the required shadow object

offset = addr - vma->vm_start;
offset /= sizeof(struct user_lock);
lock = ((struct lock *)vma->vm_private_data)[offset];

Alan

2002-01-14 01:36:00

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH][RFC] Lightweight user-level semaphores

On Sun, 13 Jan 2002 15:13:30 +0000 (GMT)
Alan Cox <[email protected]> wrote:

> > Yep, that'd be fine. However, you then lose the neatness
> > of "lock==file descriptor", and need something other than
> > read/write for down/up.
>
> If I have to have 2000 pages and 2000 file handles for 2000 locks I've
> kind of lost interest. read/write syscalls take offsets. I can pread/pwrite
> a lock in a set of locks. The only reason for using an fd I can see is so
> you can poll on a lock. All the other neatness issues are wrapped in the
> library support code anyway.

My interest in this is for TDB (Trivial Database: see sourceforge). TDB
requires a lock per hash chain (ie. arbitrary number of locks), a number
which does not change. With an extended version of the fd code (ie. first
mmap controls size of map, and offset control which semaphore you are
referring to, and semaphores created on demand), this requires:

Each TDB would have an associated ".locks" unix domain socket so you can
read the fd from the "master". This must be atomically created by the
first process to open the TDB, and must be asynchronously serviced by a
process at all times (ie. when the "master" exits, someone else takes
over).

Without even mentioning the impossibility of creating a Unix domain socket
with an arbitrary path name (can't chdir, might not be able to chdir
back), or the problem of cleaning up the socket when noone is using the
tdb, or the horror which is fd passing under Unix, I think it's clear that
the fd solutions are vastly inferior to the "magic cookie" approach.

Still, it was cute hack.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.