2002-02-11 19:38:10

by Hubertus Franke

[permalink] [raw]
Subject: Fast user level locking / sempahore

Enclosed is a prototype implementation of fast user level interprocess
synchronization for linux 2.4.17+ and 2.5.2+).
I attached the ulocks.tar.bz2 file which contains
an (at this time optional) kernel patch, a module implementing the required
kernel functionality, the user level portions and an elaborate test program.

-- Hubertus Franke <[email protected]>

-----------------------------------------------------------------------

There have been many
acronyms been used for this including:
fast semaphores
user level locking
fast user level interprocess synchronization (FULIPS)

I shall use the short-term FULIPS from here on.

As discussed in earlier messages on this topic it is desirable to have
multi-process applications synchronize efficiently. Most notably databases
would like this functionality. At the current time such synchronization
relies on standard SysV IPC mechanisms (most commomly SysV semaphores).
The disadvantage of using these routines is that in the uncontested cases
the overhead of a system call and the kernel functionality is incurred.

Instead in FULIPS we allow communication over a shared memory region
(e.g. SysV shmseg, or shared mmaped file). A user level datastructure
is allocated in that region and accessible to all processes/tasks that
have access to that memory. Then in the uncontested case, the lock can
be acquird via atomic operations and without entering the kernel. In
the contested case, the kernel has to be entered to resolve the
waiting and wakeup cases. As can be seen in the RESULTS file/attachment,
substantial performance improvements can be achieved.

In the following I express some of my opinions regarding issues.
Please feel free to provide some constructive criticism on these
positions.

For Linux to support various middlewares efficiently a good and
comprehensive FULIPS support is necessary. So far I have only seen
proposals for fast semaphores/mutexes.

IHMO-1: For the same reasons multiple reader/single writer locks
------- (rwlocks) were introduced in the kernel, they should also be
introduced in a FULIPS library/kernel.

In the proposal made by Linus sometime in 5/2001 timeframe, he
advocated to locate a kernel pointer/handle and an access signature
into the user lock datastructure. Presenting this handle/signature
combination is sufficient to gain access to the kernel object handling
the contention cases. Though the initial proposal exits a process
trying to attempt access with a false handle/signature pair, this
proposal is not tight.

IMHO-2: We should utilize the memory protection mechanisms in place.
------- In particular in my implementation I pass the virtual address of
the user lock into the kernel and I verify access to it.
To improve performance a hash table is maintained to map <mm,vaddr> to
the kernel object. I maintain hash access statistics and the cases
I have tried show on average 1-3 hash chains traversals.

In the previous proposal the kernel object related to the user lock
are allocated apriori. While one can argue that this is exactly what
is done for SysV locks and FULIPS only provide a fast path access to
such functionality one can also made a contrary point for dynamic
allocation on demand. Physical memory in the system is allocated to
accessed virtual pages only and mostly so on demand. Errors with
respect to accessibility are determined at access time not at creation time.
So why not apply the same logic to the kernel part of user level locks.

IMHO-3: Creating the kernel object can be deferred until the first contention
------- is experienced, rather then requiring apriori allocation.

Though I don't feel strongly about this position, this is what I
implemented in the current code. Through simple restructuring, the code
can be adopted to require mandatory creation and lock attachments.
e.g lock_create for the first and lock_attach for the following processes.

IMHO-4: We should provide build in per/lock statistics.
------- This was easily accomplished in my code and virtually did not create

Since I really would like to see a standardized API for FULIPS
supported in the kernel and glibc.a I like to solicitate opinion on my
stated IMHO-1..4.


IMPLEMENTATION
--------------

At the current time I structured the code such that it can be loaded
as a module rather than requiring constant kernel modification. The
absolute minimum required from the kernel is a callback function and a
mechanism to invoke it when a VMA with allocated ulocks in the kernel
calls munmap(). In that case we need to pass control back to the
kernel module implementing the kernel part of FULIPS.

The code provided here consists of the following components:

(a) KERNEL-PATCH: (optional) a patch to 2.4.17 to support FULIPS in
the an external module. It implements the callback function
mentioned above. Also define MAP_SEMAPHORE, however currently the
libc.a filters unknown flags before passing the flags to the
mmap/shmat system call. so this is not effective until libc is
recompiled.

The requirement of this patch can be bypassed if one is willing to
intercept the vm_ops. Unfortunately due to the shm_vm_ops usage,
the shm_detach behavior changes slightly, marking the segment for
destruction only at application termination and not at shm_detach
time.

My current module (see below) supports both approaches, so one
doesn't need to modify the kernel to experiment with this
subsystem. To do so, goto src/kulocks.c #line 56 and do not define
USE_VANILLA_KERNEL.


(b) KULOCKS Module: src/kulocks.c
the dynamically loadable module maintaining kernel related data structures
the code to deal with th
and a proc file system to access some internal statistics.

(c) Include Files: src/include
user level definitions and operations. it supports general
fast semaphores spinning semaphores with timeout (finite spinning
time). It also supports multiple reader/single writer locks.
We structured this include directory to mimic the include directory
of the kernel. This way we can copy everything right over to the
kernel once design is accepted and integration is required.

(d) user library: src/ulock.c
ulock.c simply provides some support for the spinning initialization
and slow path spin implementation, that ultimately should/could move
to libc.a. A library is created from this and must be linked with
applications.

(e) ULOCKFLEX:
a reasonable sophisticated test program to verify integrity of once
implementation as well as performance


Some more comments
<asm/ulocks.h> currently is only provided for the i386 architecture.
There are basically only a few atomic operations required:
decrement, increment, compare_and_swap
These can be taken almost identically from the kernel section ensuring
that they are SMP enabled.

ISSUES:
-------

Error handling can become a headache in this scenario. For instance
if a process get interrupted while waiting in the kernel, the
status word in user space will be out of sync with the state of
the semaphores in the kernel. Further operations will compromise the
integrity of the program.
Handling possible errors in user land can add complexity.
I opted therefore to make interrupted system call to restart.
That leaves the following error codes to be returned.

ENOMEM: Insufficient kernel memory to allocate internal objects
(here Linus's approach where there is only one kernel lock object
and no references and hash entries does seem advantageous)
EFAULT: illegal address was specified
system call was issued directly without a wrapper functions,
otherwise the wrapper functions would have SEGV
EACCES: MAP_SEMAPHORE not specified on the VMA (currently not active).
EINVAL: lock type unknown

All in all, it seems that the integrity of the program is already compromised
(other than the ENOMEM case) and those errors should results in a
program check anyway.

The other option would be to return the error and repair the status word
in the lock routine.
Comments on this issue.


ULOCKFLEX:
----------

Integrity and performance measurement tool. This tool will exercise the
user locks infrastructure. I intend to submit this to the Linux Test
Project, once the proper interface for user level locks has been
fixed/accepted. To many options to describe here.
For fast semaphores (non-spinning) I implemented also the same functionality
using SysV semaphores. Simply execute klockflex instead of ulockflex.

In essense the tool reports total throughput and lock contention
(latter only if compiled in via the -DLOCK_STATISTICS flag). It also
verifies whether the locking is working properly. It uses mean locking
hold times and non-hold times and uniforms distributions of [ 0.5
.. 1.5 ] * mean to create some randomness in the access and locking
events. It allows the specification of number of task (either
processes or threads), the number of locks to use, the type of locking
(semaphore or rwlocks) and their related parameters. One can specify
the type of shared memory (shmat or memory mapped files) and the
number of such shared memory objects to be used. It can check for
statistical significance when run


The file <RESULTS> shows a collection of some runs that demonstrate the
potential performance gains that can be obtained by utilizing a FULIPS
subsystem.

---------------------------------------------------------

What to do to try this out:

1) either
(a) patch the kernel and comment out the line
"#define USE_VANILLA_KERNEL" in src/kulocks.c#56
(b) or do nothing and this step
2) in top directory issue> make
3) insmod src/kulocks.o
4) cd ulockflex; doit (to run a comformance test)

----------------------------------------------------------

I'd appreciate your feedback.

-- Hubertus Franke <[email protected]>








Attachments:
(No filename) (9.70 kB)
ulocks.tar.bz2 (26.49 kB)
Download all attachments