Hello, guys.
This patchset implements brsem - big reader semaphore. It is
similar in concept to the late big reader lock: the reader side does
as little as possible while the writer side takes on all the
overhead. brsem is meant for cases where...
a. writer operations are *extremely* rare compared to read ops
b. both readers and writers need to be able to sleep
c. tolerating stale data or retrying on the reader side is
impossible or impractical
Device or subsystem hotplug/unplug synchronization fits the above
criteria very well. Plugging and unplugging occur very rarely but
still need stringent synchronization.
brsem was implemented while trying to solve the following race
condition in the vfs layer.
The super->s_umount rwsem protects filesystem unmounts and remounts
against other operations. When remounting a rw filesystem ro,
do_remount_sb() is invoked after write-locking s_umount; it scans all
open files and, depending on the force argument, either refuses to
remount or bangs all write-open files. After that, the filesystem is
remounted ro.
The problem is that a file can be opened rw after all open files are
scanned but before the filesystem is remounted ro. This can easily
be exposed by adding msleep(2000) between the open file scan and
remount_fs() and doing the following in user space, with an ext2 fs
mounted on tmp:
# mount -o remount,ro tmp& sleep 1; cat >> tmp/asdf
This leaves us with a write-opened file on a ro filesystem, and ext2
happily writes data to disk. I've only verified open(2), but this
race probably exists for other file operations, too.
I'll post a patch that adds the 2s sleep during remounting, along
with test results, in a reply to this mail.
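To illustrate, the test instrumentation is just a sleep dropped
between the open file scan and the actual remount. A minimal sketch,
assuming do_remount_sb() in fs/super.c is shaped as in current trees
(the real test patch follows in the reply):
	/* in do_remount_sb(), illustrative only */
	if ((flags & MS_RDONLY) && !(sb->s_flags & MS_RDONLY)) {
		if (force)
			mark_files_ro(sb);	/* bang write-open files */
		else if (!fs_may_remount_ro(sb))
			return -EBUSY;		/* refuse to remount */
	}
	msleep(2000);	/* open(2) can sneak in here */
	/* ... ->remount_fs() is called and MS_RDONLY is set below ... */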
The solution is to make the permission check and the rest of open
atomic against remounting by read-locking the s_umount rwsem. This
probably should be done for most file operations including write(2)
(maybe even when committing dirty pages to the fs for rw-mmapped
files in the 'force' case). This is an expensive overhead to pay for
such rare cases, and seqlock / rcupdate cannot be used here.
So, brsem is implemented. Read locking and unlocking involve only
local_bh_disable/enable, a little cpu-local arithmetic and a few
conditionals - no atomic ops, no shared word, no memory barrier.
Writer operations are *very* expensive: they have to issue workqueue
works to each cpu using smp_call_function and wait for them to
finish.
This patchset also converts the cpucontrol semaphore, which protects
cpu hotplug/unplug events - events which obviously occur very
rarely.
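For a quick feel of the interface, typical usage looks like the
following sketch (all functions are from patch 01; error handling
elided):
	struct brsem *sem;
	sem = create_brsem();		/* may sleep; NULL on failure */
	brsem_down_read(sem);		/* cheap; may sleep; not from irq */
	/* ... read-side critical section ... */
	brsem_up_read(sem);
	brsem_down_write(sem);		/* *very* expensive; may sleep */
	/* ... write-side critical section ... */
	brsem_up_write(sem);		/* sleeps; brsem_up_write_async()
					   is for in_atomic callers */
	destroy_brsem(sem);		/* async, via the brsem workqueue */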
[ Start of patch descriptions ]
01_brsem_implement_brsem.patch
: implement big reader semaphore
This patch implements big reader semaphore - a rwsem with very
cheap reader-side operations and very expensive writer-side
operations. For details, please read the comments at the top
of kernel/brsem.c.
02_brsem_convert-s_umount-to-brsem.patch
: convert super_block->s_umount to brsem
Convert super_block->s_umount from rwsem to brsem.
03_brsem_fix-remount-open-race.patch
: fix ro-remount <-> open race condition
A file can be opened rw while ro-remounting is in progress,
resulting in a rw-open file on a ro-mounted filesystem. This
patch kills the race by making the permission check and the
rest of open atomic w.r.t. remounting. Other file operations
also have this race condition and should be fixed in a
similar manner.
04_brsem_convert-cpucontrol-to-brsem.patch
: convert cpucontrol to brsem
cpucontrol synchronizes cpu hotplugging and used to be a
semaphore. This patch converts it to a brsem.
[ End of patch descriptions ]
Thanks.
--
tejun
03_brsem_fix-remount-open-race.patch
A file can be opened rw while ro-remounting is in progress,
resulting in a rw-open file on a ro-mounted filesystem. This
patch kills the race by making the permission check and the
rest of open atomic w.r.t. remounting. Other file operations
also have this race condition and should be fixed in a
similar manner.
Signed-off-by: Tejun Heo <[email protected]>
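The resulting protocol, condensed from the diff below: open_namei()
returns with s_umount read-locked on success, and the caller drops it
once the open is complete.
	/* condensed from fs/open.c:filp_open() after this patch */
	error = open_namei(filename, namei_flags, mode, &nd);
	if (!error) {
		/* s_umount read-locked - remount cannot proceed */
		f = __dentry_open(nd.dentry, nd.mnt, flags, f);
		brsem_up_read(nd.mnt->mnt_sb->s_umount);
		return f;
	}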
namei.c | 11 ++++++++++-
open.c | 12 ++++++++++--
2 files changed, 20 insertions(+), 3 deletions(-)
Index: linux-work/fs/namei.c
===================================================================
--- linux-work.orig/fs/namei.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/namei.c 2005-09-25 15:42:04.000000000 +0900
@@ -1412,6 +1412,12 @@ int may_open(struct nameidata *nd, int a
* 11 - read/write permissions needed
* which is a lot more logical, and also allows the "no perm" needed
* for symlinks (where the permissions are checked later).
+ *
+ * nd->mnt->mnt_sb->s_umount brsem is read-locked on successful return
+ * from this function. This is to make permission checking and the
+ * actual open operation atomic w.r.t. remounting. The caller must
+ * release s_umount after open is complete.
+ *
* SMP-safe
*/
int open_namei(const char * pathname, int flag, int mode, struct nameidata *nd)
@@ -1512,9 +1518,12 @@ do_last:
if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
goto exit;
ok:
+ brsem_down_read(nd->mnt->mnt_sb->s_umount);
error = may_open(nd, acc_mode, flag);
- if (error)
+ if (error) {
+ brsem_up_read(nd->mnt->mnt_sb->s_umount);
goto exit;
+ }
return 0;
exit_dput:
Index: linux-work/fs/open.c
===================================================================
--- linux-work.orig/fs/open.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/open.c 2005-09-25 15:42:04.000000000 +0900
@@ -828,8 +828,16 @@ struct file *filp_open(const char * file
return ERR_PTR(error);
error = open_namei(filename, namei_flags, mode, &nd);
- if (!error)
- return __dentry_open(nd.dentry, nd.mnt, flags, f);
+ if (!error) {
+ f = __dentry_open(nd.dentry, nd.mnt, flags, f);
+ /*
+ * On successful return from open_namei(), s_umount is
+ * read-locked, see comment above open_namei() for
+ * more information.
+ */
+ brsem_up_read(nd.mnt->mnt_sb->s_umount);
+ return f;
+ }
put_filp(f);
return ERR_PTR(error);
04_brsem_convert-cpucontrol-to-brsem.patch
cpucontrol synchronizes cpu hotplugging and used to be a
semaphore. This patch converts it to a brsem.
Signed-off-by: Tejun Heo <[email protected]>
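In short, readers keep using lock_cpu_hotplug() and
unlock_cpu_hotplug(), while cpu_up()/cpu_down() switch to write-side
wrappers that dodge the brsem-on-cpucontrol deadlock (condensed from
the kernel/cpu.c changes below):
	/* condensed from write_lock/unlock_cpucontrol() below */
	down(&cpucontrol_mutex);	/* keeps hotplug events away */
	cpucontrol_holder = current;	/* lets brsem skip lock_cpu_hotplug() */
	brsem_down_write(&cpucontrol);
	/* ... bring the cpu up or down ... */
	brsem_up_write(&cpucontrol);
	cpucontrol_holder = NULL;
	up(&cpucontrol_mutex);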
include/linux/brsem.h | 4 +-
include/linux/cpu.h | 16 ++++++---
init/main.c | 1
kernel/brsem.c | 38 +++++++++++++++--------
kernel/cpu.c | 81 +++++++++++++++++++++++++++++++++++++++++++-------
5 files changed, 110 insertions(+), 30 deletions(-)
Index: linux-work/include/linux/cpu.h
===================================================================
--- linux-work.orig/include/linux/cpu.h 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/include/linux/cpu.h 2005-09-25 15:42:04.000000000 +0900
@@ -23,7 +23,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
-#include <asm/semaphore.h>
+#include <linux/brsem.h>
struct cpu {
int node_id; /* The node which contains the CPU */
@@ -59,10 +59,11 @@ extern struct sysdev_class cpu_sysdev_cl
#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */
-extern struct semaphore cpucontrol;
-#define lock_cpu_hotplug() down(&cpucontrol)
-#define unlock_cpu_hotplug() up(&cpucontrol)
-#define lock_cpu_hotplug_interruptible() down_interruptible(&cpucontrol)
+extern struct brsem cpucontrol;
+extern struct task_struct *cpucontrol_holder;
+#define lock_cpu_hotplug() brsem_down_read(&cpucontrol)
+#define unlock_cpu_hotplug() brsem_up_read(&cpucontrol)
+#define is_cpu_hotplug_holder() (cpucontrol_holder == current)
#define hotcpu_notifier(fn, pri) { \
static struct notifier_block fn##_nb = \
{ .notifier_call = fn, .priority = pri }; \
@@ -74,11 +75,14 @@ extern int __attribute__((weak)) smp_pre
#else
#define lock_cpu_hotplug() do { } while (0)
#define unlock_cpu_hotplug() do { } while (0)
-#define lock_cpu_hotplug_interruptible() 0
+#define is_cpu_hotplug_holder() 0
#define hotcpu_notifier(fn, pri)
/* CPUs don't go offline once they're online w/o CONFIG_HOTPLUG_CPU */
static inline int cpu_is_offline(int cpu) { return 0; }
#endif
+/* cpucontrol initializer, called from init/main.c */
+void cpucontrol_init(void);
+
#endif /* _LINUX_CPU_H_ */
Index: linux-work/init/main.c
===================================================================
--- linux-work.orig/init/main.c 2005-09-25 15:42:03.000000000 +0900
+++ linux-work/init/main.c 2005-09-25 15:42:04.000000000 +0900
@@ -513,6 +513,7 @@ asmlinkage void __init start_kernel(void
setup_per_cpu_pageset();
numa_policy_init();
brsem_init_early();
+ cpucontrol_init();
if (late_time_init)
late_time_init();
calibrate_delay();
Index: linux-work/kernel/cpu.c
===================================================================
--- linux-work.orig/kernel/cpu.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/kernel/cpu.c 2005-09-25 15:42:04.000000000 +0900
@@ -15,8 +15,39 @@
#include <linux/stop_machine.h>
#include <asm/semaphore.h>
-/* This protects CPUs going up and down... */
-DECLARE_MUTEX(cpucontrol);
+/*
+ * cpucontrol is a brsem used to synchronize cpu hotplug events.
+ * Invoking lock_cpu_hotplug() read-locks cpucontrol and no
+ * hotplugging events will occur until it's released.
+ *
+ * Unfortunately, brsem itself makes use of lock_cpu_hotplug() and
+ * performing brsem write-lock operations on cpucontrol deadlocks.
+ * This is avoided by...
+ *
+ * a. guaranteeing that cpu hotplug events won't occur during the
+ * write-lock operations, and
+ *
+ * b. skipping lock_cpu_hotplug() inside brsem.
+ *
+ * #a is achieved by acquiring and releasing cpucontrol_mutex outside
+ * cpucontrol write-lock. #b is achieved by skipping
+ * lock_cpu_hotplug() inside brsem if the current task is
+ * cpucontrol_mutex holder (is_cpu_hotplug_holder() test).
+ *
+ * Also, note that cpucontrol is first initialized with
+ * BRSEM_BYPASS_INITIALIZER and then initialized again with
+ * __create_brsem() instead of simply using create_brsem(). This is
+ * necessary as the cpucontrol brsem gets used way before the brsem
+ * subsystem is up and running.
+ *
+ * Until brsem is properly initialized, all brsem ops succeed
+ * unconditionally. cpucontrol becomes operational only after
+ * cpucontrol_init() is finished, which should be called after
+ * brsem_init_early().
+ */
+struct brsem cpucontrol = BRSEM_BYPASS_INITIALIZER;
+static DECLARE_MUTEX(cpucontrol_mutex);
+struct task_struct *cpucontrol_holder;
static struct notifier_block *cpu_chain;
@@ -25,22 +56,45 @@ int register_cpu_notifier(struct notifie
{
int ret;
- if ((ret = down_interruptible(&cpucontrol)) != 0)
+ if ((ret = brsem_down_read_interruptible(&cpucontrol)) != 0)
return ret;
ret = notifier_chain_register(&cpu_chain, nb);
- up(&cpucontrol);
+ brsem_up_read(&cpucontrol);
return ret;
}
EXPORT_SYMBOL(register_cpu_notifier);
void unregister_cpu_notifier(struct notifier_block *nb)
{
- down(&cpucontrol);
+ brsem_down_read(&cpucontrol);
notifier_chain_unregister(&cpu_chain, nb);
- up(&cpucontrol);
+ brsem_up_read(&cpucontrol);
}
EXPORT_SYMBOL(unregister_cpu_notifier);
+static int write_lock_cpucontrol(void)
+{
+ int ret;
+
+ if ((ret = down_interruptible(&cpucontrol_mutex)) != 0)
+ return ret;
+ BUG_ON(cpucontrol_holder);
+ cpucontrol_holder = current;
+ if ((ret = brsem_down_write_interruptible(&cpucontrol)) != 0) {
+ cpucontrol_holder = NULL;
+ up(&cpucontrol_mutex);
+ return ret;
+ }
+ return 0;
+}
+
+static void write_unlock_cpucontrol(void)
+{
+ brsem_up_write(&cpucontrol);
+ cpucontrol_holder = NULL;
+ up(&cpucontrol_mutex);
+}
+
#ifdef CONFIG_HOTPLUG_CPU
static inline void check_for_tasks(int cpu)
{
@@ -80,7 +134,7 @@ int cpu_down(unsigned int cpu)
struct task_struct *p;
cpumask_t old_allowed, tmp;
- if ((err = lock_cpu_hotplug_interruptible()) != 0)
+ if ((err = write_lock_cpucontrol()) != 0)
return err;
if (num_online_cpus() == 1) {
@@ -145,7 +199,7 @@ out_thread:
out_allowed:
set_cpus_allowed(current, old_allowed);
out:
- unlock_cpu_hotplug();
+ write_unlock_cpucontrol();
return err;
}
#endif /*CONFIG_HOTPLUG_CPU*/
@@ -155,7 +209,7 @@ int __devinit cpu_up(unsigned int cpu)
int ret;
void *hcpu = (void *)(long)cpu;
- if ((ret = down_interruptible(&cpucontrol)) != 0)
+ if ((ret = write_lock_cpucontrol()) != 0)
return ret;
if (cpu_online(cpu) || !cpu_present(cpu)) {
@@ -184,6 +238,13 @@ out_notify:
if (ret != 0)
notifier_call_chain(&cpu_chain, CPU_UP_CANCELED, hcpu);
out:
- up(&cpucontrol);
+ write_unlock_cpucontrol();
return ret;
}
+
+void cpucontrol_init(void)
+{
+ struct brsem *sem;
+ sem = __create_brsem(&cpucontrol);
+ BUG_ON(!sem);
+}
Index: linux-work/kernel/brsem.c
===================================================================
--- linux-work.orig/kernel/brsem.c 2005-09-25 15:42:03.000000000 +0900
+++ linux-work/kernel/brsem.c 2005-09-25 15:42:04.000000000 +0900
@@ -156,7 +156,7 @@ static void coac_schedule_work_per_cpu(v
schedule_work(works + smp_processor_id());
}
-static void call_on_all_cpus(void (*func)(void *), void *data)
+static void call_on_all_cpus(void (*func)(void *), void *data, int skip_cpulock)
{
static DECLARE_MUTEX(coac_mutex); /* serializes uses of coac_works */
static struct work_struct coac_works[NR_CPUS];
@@ -174,7 +174,8 @@ static void call_on_all_cpus(void (*func
}
down(&coac_mutex);
- lock_cpu_hotplug();
+ if (!skip_cpulock)
+ lock_cpu_hotplug();
/*
* If we're on keventd, scheduling work and waiting for it
@@ -202,7 +203,8 @@ static void call_on_all_cpus(void (*func
wait_for_completion(&coac_arg.completion);
}
- unlock_cpu_hotplug();
+ if (!skip_cpulock)
+ unlock_cpu_hotplug();
up(&coac_mutex);
}
@@ -305,7 +307,7 @@ static int expand_brsem(int target_idx)
ebarg.new_len = new_len;
atomic_set(&ebarg.failed, 0);
- call_on_all_cpus(expand_brsem_cpucb, &ebarg);
+ call_on_all_cpus(expand_brsem_cpucb, &ebarg, 0);
res = -ENOMEM;
if (atomic_read(&ebarg.failed))
@@ -405,13 +407,13 @@ static void sync_brsem_cpucb(void *data)
__brsem_leave_crit();
}
-static void sync_brsem(struct brsem *sem, int write_locking)
+static void sync_brsem(struct brsem *sem, int write_locking, int skip_cpulock)
{
int cpu;
if (brsem_initialized) {
struct sync_brsem_arg sbarg = { sem, write_locking };
- call_on_all_cpus(sync_brsem_cpucb, &sbarg);
+ call_on_all_cpus(sync_brsem_cpucb, &sbarg, skip_cpulock);
return;
}
@@ -420,7 +422,8 @@ static void sync_brsem(struct brsem *sem
* single threaded. Sync manually. Note that lockings are
* not necessary here. They're done just for consistency.
*/
- lock_cpu_hotplug();
+ if (!skip_cpulock)
+ lock_cpu_hotplug();
spin_lock_crit(&sem->lock);
for_each_online_cpu(cpu) {
int *p = per_cpu(brsem_rcnt_ar, cpu) + sem->idx;
@@ -429,14 +432,15 @@ static void sync_brsem(struct brsem *sem
*p = write_locking ? INT_MIN : 0;
}
spin_unlock_crit(&sem->lock);
- unlock_cpu_hotplug();
+ if (!skip_cpulock)
+ unlock_cpu_hotplug();
}
static void do_destroy_brsem(struct brsem *sem)
{
check_idx(sem);
- sync_brsem(sem, 0);
+ sync_brsem(sem, 0, 0);
BUG_ON(sem->master_rcnt != 0);
spin_lock(&brsem_alloc_lock);
@@ -586,8 +590,14 @@ void __brsem_up_read_slow(struct brsem *
__brsem_leave_crit();
}
+/*
+ * skip_cpulock is used to avoid deadlocks when write operations are
+ * performed on cpu hotplug brsem. See kernel/cpu.c for more
+ * information.
+ */
static int __brsem_down_write(struct brsem *sem, int interruptible)
{
+ int skip_cpulock = is_cpu_hotplug_holder();
int res;
if (is_bypass(sem))
@@ -601,7 +611,7 @@ static int __brsem_down_write(struct brs
} else
down(&sem->write_mutex);
- sync_brsem(sem, 1);
+ sync_brsem(sem, 1, skip_cpulock);
spin_lock_crit(&sem->lock);
@@ -619,7 +629,7 @@ static int __brsem_down_write(struct brs
finish_wait(&sem->write_wait, &wait);
if (interruptible && signal_pending(current)) {
- sync_brsem(sem, 0);
+ sync_brsem(sem, 0, skip_cpulock);
wake_up_all(&sem->read_wait);
up(&sem->write_mutex);
return -EINTR;
@@ -661,12 +671,14 @@ int brsem_down_write_interruptible(struc
*/
void brsem_up_write(struct brsem *sem)
{
+ int skip_cpulock = is_cpu_hotplug_holder();
+
if (is_bypass(sem))
return;
check_idx(sem);
BUG_ON(sem->master_rcnt);
- sync_brsem(sem, 0);
+ sync_brsem(sem, 0, skip_cpulock);
wake_up_all(&sem->read_wait);
up(&sem->write_mutex);
}
@@ -856,7 +868,7 @@ void __init brsem_init(void)
brsem_initialized = 1;
/* Make sure other cpus see above change */
- call_on_all_cpus(dummy_cpucb, NULL);
+ call_on_all_cpus(dummy_cpucb, NULL, 0);
}
EXPORT_SYMBOL(__create_brsem);
Index: linux-work/include/linux/brsem.h
===================================================================
--- linux-work.orig/include/linux/brsem.h 2005-09-25 15:42:03.000000000 +0900
+++ linux-work/include/linux/brsem.h 2005-09-25 15:42:04.000000000 +0900
@@ -65,7 +65,9 @@ void brsem_init(void);
/*
* The following initializer and __create_brsem() are for cases where
- * brsem should be used before brsem_init_early() is finished.
+ * brsem should be used before brsem_init_early() is finished. If
+ * you're trying to use brsem for such cases, refer to
+ * include/linux/cpu.h and kernel/cpu.c for example.
*/
#define BRSEM_BYPASS_INITIALIZER { .idx = 0, .flags = BRSEM_F_BYPASS }
02_brsem_convert-s_umount-to-brsem.patch
Convert super_block->s_umount from rwsem to brsem.
Signed-off-by: Tejun Heo <[email protected]>
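The conversion is mostly mechanical - down_read/up_read and friends
become their brsem_* counterparts - but s_umount becomes a pointer,
so alloc_super() grows a failure path. Condensed from the fs/super.c
hunks below:
	/* in alloc_super() */
	if (!(s->s_umount = create_brsem()))	/* allocation can fail now */
		goto fail_umount;
	brsem_down_write(s->s_umount);	/* was down_write(&s->s_umount) */
	/* in destroy_super() */
	destroy_brsem(s->s_umount);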
arch/um/drivers/mconsole_kern.c | 2 -
fs/9p/vfs_super.c | 2 -
fs/afs/super.c | 2 -
fs/cifs/cifsfs.c | 2 -
fs/fs-writeback.c | 8 ++---
fs/jffs2/super.c | 2 -
fs/libfs.c | 2 -
fs/namespace.c | 8 ++---
fs/nfs/inode.c | 4 +-
fs/quota.c | 4 +-
fs/reiserfs/procfs.c | 2 -
fs/super.c | 60 +++++++++++++++++++++-------------------
include/linux/fs.h | 3 +-
security/selinux/hooks.c | 2 -
14 files changed, 54 insertions(+), 49 deletions(-)
Index: linux-work/fs/9p/vfs_super.c
===================================================================
--- linux-work.orig/fs/9p/vfs_super.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/9p/vfs_super.c 2005-09-25 15:42:04.000000000 +0900
@@ -189,7 +189,7 @@ static struct super_block *v9fs_get_sb(s
put_back_sb:
/* deactivate_super calls v9fs_kill_super which will frees the rest */
- up_write(&sb->s_umount);
+ brsem_up_write(sb->s_umount);
deactivate_super(sb);
return ERR_PTR(retval);
}
Index: linux-work/fs/afs/super.c
===================================================================
--- linux-work.orig/fs/afs/super.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/afs/super.c 2005-09-25 15:42:04.000000000 +0900
@@ -343,7 +343,7 @@ static struct super_block *afs_get_sb(st
ret = afs_fill_super(sb, &params, flags & MS_VERBOSE ? 1 : 0);
if (ret < 0) {
- up_write(&sb->s_umount);
+ brsem_up_write(sb->s_umount);
deactivate_super(sb);
goto error;
}
Index: linux-work/fs/cifs/cifsfs.c
===================================================================
--- linux-work.orig/fs/cifs/cifsfs.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/cifs/cifsfs.c 2005-09-25 15:42:04.000000000 +0900
@@ -435,7 +435,7 @@ cifs_get_sb(struct file_system_type *fs_
rc = cifs_read_super(sb, data, dev_name, flags & MS_VERBOSE ? 1 : 0);
if (rc) {
- up_write(&sb->s_umount);
+ brsem_up_write(sb->s_umount);
deactivate_super(sb);
return ERR_PTR(rc);
}
Index: linux-work/fs/fs-writeback.c
===================================================================
--- linux-work.orig/fs/fs-writeback.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/fs-writeback.c 2005-09-25 15:42:04.000000000 +0900
@@ -425,13 +425,13 @@ restart:
* waiting around, most of the time the FS is going to
* be unmounted by the time it is released.
*/
- if (down_read_trylock(&sb->s_umount)) {
+ if (brsem_down_read_trylock(sb->s_umount)) {
if (sb->s_root) {
spin_lock(&inode_lock);
sync_sb_inodes(sb, wbc);
spin_unlock(&inode_lock);
}
- up_read(&sb->s_umount);
+ brsem_up_read(sb->s_umount);
}
spin_lock(&sb_lock);
if (__put_super_and_need_restart(sb))
@@ -516,12 +516,12 @@ restart:
sb->s_syncing = 1;
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+ brsem_down_read(sb->s_umount);
if (sb->s_root) {
sync_inodes_sb(sb, wait);
sync_blockdev(sb->s_bdev);
}
- up_read(&sb->s_umount);
+ brsem_up_read(sb->s_umount);
spin_lock(&sb_lock);
if (__put_super_and_need_restart(sb))
goto restart;
Index: linux-work/fs/jffs2/super.c
===================================================================
--- linux-work.orig/fs/jffs2/super.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/jffs2/super.c 2005-09-25 15:42:04.000000000 +0900
@@ -156,7 +156,7 @@ static struct super_block *jffs2_get_sb_
if (ret) {
/* Failure case... */
- up_write(&sb->s_umount);
+ brsem_up_write(sb->s_umount);
deactivate_super(sb);
return ERR_PTR(ret);
}
Index: linux-work/fs/libfs.c
===================================================================
--- linux-work.orig/fs/libfs.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/libfs.c 2005-09-25 15:42:04.000000000 +0900
@@ -233,7 +233,7 @@ get_sb_pseudo(struct file_system_type *f
return s;
Enomem:
- up_write(&s->s_umount);
+ brsem_up_write(s->s_umount);
deactivate_super(s);
return ERR_PTR(-ENOMEM);
}
Index: linux-work/fs/namespace.c
===================================================================
--- linux-work.orig/fs/namespace.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/namespace.c 2005-09-25 15:42:04.000000000 +0900
@@ -421,14 +421,14 @@ static int do_umount(struct vfsmount *mn
* Special case for "unmounting" root ...
* we just try to remount it readonly.
*/
- down_write(&sb->s_umount);
+ brsem_down_write(sb->s_umount);
if (!(sb->s_flags & MS_RDONLY)) {
lock_kernel();
DQUOT_OFF(sb);
retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
unlock_kernel();
}
- up_write(&sb->s_umount);
+ brsem_up_write(sb->s_umount);
return retval;
}
@@ -681,11 +681,11 @@ static int do_remount(struct nameidata *
if (nd->dentry != nd->mnt->mnt_root)
return -EINVAL;
- down_write(&sb->s_umount);
+ brsem_down_write(sb->s_umount);
err = do_remount_sb(sb, flags, data, 0);
if (!err)
nd->mnt->mnt_flags=mnt_flags;
- up_write(&sb->s_umount);
+ brsem_up_write(sb->s_umount);
if (!err)
security_sb_post_remount(nd->mnt, flags, data);
return err;
Index: linux-work/fs/nfs/inode.c
===================================================================
--- linux-work.orig/fs/nfs/inode.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/nfs/inode.c 2005-09-25 15:42:04.000000000 +0900
@@ -1574,7 +1574,7 @@ static struct super_block *nfs_get_sb(st
error = nfs_fill_super(s, data, flags & MS_VERBOSE ? 1 : 0);
if (error) {
- up_write(&s->s_umount);
+ brsem_up_write(s->s_umount);
deactivate_super(s);
return ERR_PTR(error);
}
@@ -1931,7 +1931,7 @@ static struct super_block *nfs4_get_sb(s
error = nfs4_fill_super(s, data, flags & MS_VERBOSE ? 1 : 0);
if (error) {
- up_write(&s->s_umount);
+ brsem_up_write(s->s_umount);
deactivate_super(s);
return ERR_PTR(error);
}
Index: linux-work/fs/quota.c
===================================================================
--- linux-work.orig/fs/quota.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/quota.c 2005-09-25 15:42:04.000000000 +0900
@@ -209,10 +209,10 @@ restart:
continue;
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+ brsem_down_read(sb->s_umount);
if (sb->s_root && sb->s_qcop->quota_sync)
quota_sync_sb(sb, type);
- up_read(&sb->s_umount);
+ brsem_up_read(sb->s_umount);
spin_lock(&sb_lock);
if (__put_super_and_need_restart(sb))
goto restart;
Index: linux-work/fs/reiserfs/procfs.c
===================================================================
--- linux-work.orig/fs/reiserfs/procfs.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/reiserfs/procfs.c 2005-09-25 15:42:04.000000000 +0900
@@ -422,7 +422,7 @@ static void *r_start(struct seq_file *m,
if (IS_ERR(sget(&reiserfs_fs_type, test_sb, set_sb, s)))
return NULL;
- up_write(&s->s_umount);
+ brsem_up_write(s->s_umount);
if (de->deleted) {
deactivate_super(s);
Index: linux-work/fs/super.c
===================================================================
--- linux-work.orig/fs/super.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/fs/super.c 2005-09-25 15:42:04.000000000 +0900
@@ -60,20 +60,18 @@ static struct super_block *alloc_super(v
if (s) {
memset(s, 0, sizeof(struct super_block));
- if (security_sb_alloc(s)) {
- kfree(s);
- s = NULL;
- goto out;
- }
+ if (security_sb_alloc(s))
+ goto fail_sec;
INIT_LIST_HEAD(&s->s_dirty);
INIT_LIST_HEAD(&s->s_io);
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
- init_rwsem(&s->s_umount);
+ if (!(s->s_umount = create_brsem()))
+ goto fail_umount;
sema_init(&s->s_lock, 1);
- down_write(&s->s_umount);
+ brsem_down_write(s->s_umount);
s->s_count = S_BIAS;
atomic_set(&s->s_active, 1);
sema_init(&s->s_vfs_rename_sem,1);
@@ -87,8 +85,13 @@ static struct super_block *alloc_super(v
s->s_op = &default_op;
s->s_time_gran = 1000000000;
}
-out:
return s;
+
+fail_umount:
+ security_sb_free(s);
+fail_sec:
+ kfree(s);
+ return NULL;
}
/**
@@ -100,6 +103,7 @@ out:
static inline void destroy_super(struct super_block *s)
{
security_sb_free(s);
+ destroy_brsem(s->s_umount);
kfree(s);
}
@@ -171,7 +175,7 @@ void deactivate_super(struct super_block
if (atomic_dec_and_lock(&s->s_active, &sb_lock)) {
s->s_count -= S_BIAS-1;
spin_unlock(&sb_lock);
- down_write(&s->s_umount);
+ brsem_down_write(s->s_umount);
fs->kill_sb(s);
put_filesystem(fs);
put_super(s);
@@ -195,7 +199,7 @@ static int grab_super(struct super_block
{
s->s_count++;
spin_unlock(&sb_lock);
- down_write(&s->s_umount);
+ brsem_down_write(s->s_umount);
if (s->s_root) {
spin_lock(&sb_lock);
if (s->s_count > S_BIAS) {
@@ -206,7 +210,7 @@ static int grab_super(struct super_block
}
spin_unlock(&sb_lock);
}
- up_write(&s->s_umount);
+ brsem_up_write(s->s_umount);
put_super(s);
yield();
return 0;
@@ -258,7 +262,7 @@ void generic_shutdown_super(struct super
list_del_init(&sb->s_list);
list_del(&sb->s_instances);
spin_unlock(&sb_lock);
- up_write(&sb->s_umount);
+ brsem_up_write(sb->s_umount);
}
EXPORT_SYMBOL(generic_shutdown_super);
@@ -319,7 +323,7 @@ EXPORT_SYMBOL(sget);
void drop_super(struct super_block *sb)
{
- up_read(&sb->s_umount);
+ brsem_up_read(sb->s_umount);
put_super(sb);
}
@@ -349,9 +353,9 @@ restart:
if (sb->s_dirt) {
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+ brsem_down_read(sb->s_umount);
write_super(sb);
- up_read(&sb->s_umount);
+ brsem_up_read(sb->s_umount);
spin_lock(&sb_lock);
if (__put_super_and_need_restart(sb))
goto restart;
@@ -400,10 +404,10 @@ restart:
continue; /* hm. Was remounted r/o meanwhile */
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+ brsem_down_read(sb->s_umount);
if (sb->s_root && (wait || sb->s_dirt))
sb->s_op->sync_fs(sb, wait);
- up_read(&sb->s_umount);
+ brsem_up_read(sb->s_umount);
/* restart only when sb is no longer on the list */
spin_lock(&sb_lock);
if (__put_super_and_need_restart(sb))
@@ -434,10 +438,10 @@ rescan:
if (sb->s_bdev == bdev) {
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+ brsem_down_read(sb->s_umount);
if (sb->s_root)
return sb;
- up_read(&sb->s_umount);
+ brsem_up_read(sb->s_umount);
/* restart only when sb is no longer on the list */
spin_lock(&sb_lock);
if (__put_super_and_need_restart(sb))
@@ -460,10 +464,10 @@ rescan:
if (sb->s_dev == dev) {
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+ brsem_down_read(sb->s_umount);
if (sb->s_root)
return sb;
- up_read(&sb->s_umount);
+ brsem_up_read(sb->s_umount);
/* restart only when sb is no longer on the list */
spin_lock(&sb_lock);
if (__put_super_and_need_restart(sb))
@@ -568,7 +572,7 @@ static void do_emergency_remount(unsigne
list_for_each_entry(sb, &super_blocks, s_list) {
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+ brsem_down_read(sb->s_umount);
if (sb->s_root && sb->s_bdev && !(sb->s_flags & MS_RDONLY)) {
/*
* ->remount_fs needs lock_kernel().
@@ -701,7 +705,7 @@ struct super_block *get_sb_bdev(struct f
if (s->s_root) {
if ((flags ^ s->s_flags) & MS_RDONLY) {
- up_write(&s->s_umount);
+ brsem_up_write(s->s_umount);
deactivate_super(s);
s = ERR_PTR(-EBUSY);
}
@@ -715,7 +719,7 @@ struct super_block *get_sb_bdev(struct f
sb_set_blocksize(s, s->s_old_blocksize);
error = fill_super(s, data, flags & MS_VERBOSE ? 1 : 0);
if (error) {
- up_write(&s->s_umount);
+ brsem_up_write(s->s_umount);
deactivate_super(s);
s = ERR_PTR(error);
} else {
@@ -759,7 +763,7 @@ struct super_block *get_sb_nodev(struct
error = fill_super(s, data, flags & MS_VERBOSE ? 1 : 0);
if (error) {
- up_write(&s->s_umount);
+ brsem_up_write(s->s_umount);
deactivate_super(s);
return ERR_PTR(error);
}
@@ -788,7 +792,7 @@ struct super_block *get_sb_single(struct
s->s_flags = flags;
error = fill_super(s, data, flags & MS_VERBOSE ? 1 : 0);
if (error) {
- up_write(&s->s_umount);
+ brsem_up_write(s->s_umount);
deactivate_super(s);
return ERR_PTR(error);
}
@@ -840,12 +844,12 @@ do_kern_mount(const char *fstype, int fl
mnt->mnt_root = dget(sb->s_root);
mnt->mnt_mountpoint = sb->s_root;
mnt->mnt_parent = mnt;
- up_write(&sb->s_umount);
+ brsem_up_write(sb->s_umount);
free_secdata(secdata);
put_filesystem(type);
return mnt;
out_sb:
- up_write(&sb->s_umount);
+ brsem_up_write(sb->s_umount);
deactivate_super(sb);
sb = ERR_PTR(error);
out_free_secdata:
Index: linux-work/security/selinux/hooks.c
===================================================================
--- linux-work.orig/security/selinux/hooks.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/security/selinux/hooks.c 2005-09-25 15:42:04.000000000 +0900
@@ -4407,7 +4407,7 @@ next_sb:
sb->s_count++;
spin_unlock(&sb_lock);
spin_unlock(&sb_security_lock);
- down_read(&sb->s_umount);
+ brsem_down_read(sb->s_umount);
if (sb->s_root)
superblock_doinit(sb, NULL);
drop_super(sb);
Index: linux-work/include/linux/fs.h
===================================================================
--- linux-work.orig/include/linux/fs.h 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/include/linux/fs.h 2005-09-25 15:42:04.000000000 +0900
@@ -216,6 +216,7 @@ extern int dir_notify_enable;
#include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/sched.h>
+#include <linux/brsem.h>
#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -771,7 +772,7 @@ struct super_block {
unsigned long s_flags;
unsigned long s_magic;
struct dentry *s_root;
- struct rw_semaphore s_umount;
+ struct brsem *s_umount;
struct semaphore s_lock;
int s_count;
int s_syncing;
Index: linux-work/arch/um/drivers/mconsole_kern.c
===================================================================
--- linux-work.orig/arch/um/drivers/mconsole_kern.c 2005-09-25 15:26:32.000000000 +0900
+++ linux-work/arch/um/drivers/mconsole_kern.c 2005-09-25 15:42:04.000000000 +0900
@@ -150,7 +150,7 @@ void mconsole_proc(struct mc_request *re
mconsole_reply(req, "Failed to get procfs superblock", 1, 0);
goto out;
}
- up_write(&super->s_umount);
+ brsem_up_write(super->s_umount);
nd.dentry = super->s_root;
nd.mnt = NULL;
01_brsem_implement_brsem.patch
This patch implements big reader semaphore - a rwsem with very
cheap reader-side operations and very expensive writer-side
operations. For details, please read the comments at the top
of kernel/brsem.c.
Signed-off-by: Tejun Heo <[email protected]>
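The heart of the patch is the reader fast path, simplified here from
__brsem_down_read() in include/linux/brsem.h below:
	/* simplified from __brsem_down_read() below */
	local_bh_disable();			/* cpu-local exclusion */
	p = __get_cpu_var(brsem_rcnt_ar) + sem->idx;
	if (*p != INT_MIN && *p != INT_MAX) {	/* no writer, no overflow */
		(*p)++;				/* plain cpu-local increment */
		local_bh_enable();
		return 0;
	}
	return __brsem_down_read_slow(sem, interruptible);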
include/linux/brsem.h | 202 +++++++++++
init/main.c | 3
kernel/Makefile | 2
kernel/brsem.c | 869 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 1075 insertions(+), 1 deletion(-)
Index: linux-work/include/linux/brsem.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-work/include/linux/brsem.h 2005-09-25 15:42:03.000000000 +0900
@@ -0,0 +1,202 @@
+#ifndef __LINUX_BRSEM_H
+#define __LINUX_BRSEM_H
+
+/*
+ * include/linux/brsem.h - Big reader rw semaphore
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ * 02111-1307, USA.
+ *
+ * Copyright (C) 2005 Tejun Heo <[email protected]>
+ *
+ * See kernel/brsem.c for more information.
+ */
+
+#include <linux/spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+#include <linux/percpu.h>
+#include <asm/semaphore.h>
+
+DECLARE_PER_CPU(int *, brsem_rcnt_ar);
+
+struct brsem {
+ int idx;
+ spinlock_t lock;
+ long long master_rcnt;
+ wait_queue_head_t read_wait;
+ wait_queue_head_t write_wait;
+ struct semaphore write_mutex;
+ struct work_struct async_work;
+ unsigned flags;
+};
+
+enum {
+ /*
+ * sem->flags, protected by sem->lock
+ */
+ BRSEM_F_ALLOCATED = 0x0001,
+ BRSEM_F_BYPASS = 0x0002,
+
+ /* Async todo flags */
+ BRSEM_F_ASYNC_UPWRITE = 0x0100,
+ BRSEM_F_ASYNC_DESTROY = 0x0200,
+ BRSEM_F_ASYNC_MASK = 0x0300,
+};
+
+/*
+ * brsem subsys initialization routines, called from init/main.c
+ */
+void brsem_init_early(void);
+void brsem_init(void);
+
+/*
+ * The following initializer and __create_brsem() are for cases where
+ * brsem should be used before brsem_init_early() is finished.
+ */
+#define BRSEM_BYPASS_INITIALIZER { .idx = 0, .flags = BRSEM_F_BYPASS }
+
+struct brsem *__create_brsem(struct brsem *sem);
+
+int __brsem_down_read_slow(struct brsem *sem, int interruptible);
+int __brsem_down_read_trylock_slow(struct brsem *sem);
+void __brsem_up_read_slow(struct brsem *sem);
+
+static inline int *__brsem_rcnt(struct brsem *sem)
+{
+ return __get_cpu_var(brsem_rcnt_ar) + sem->idx;
+}
+
+/*
+ * The following *_crit functions can be changed to use
+ * local_irq_disable/enable on architectures where irq switching is
+ * cheaper than bh switching. See brsem.c for more information.
+ */
+static inline void __brsem_enter_crit(void)
+{
+ local_bh_disable();
+}
+
+static inline void __brsem_leave_crit(void)
+{
+ if (!irqs_disabled()) /* is this cheap on all archs? */
+ local_bh_enable();
+ else
+ __local_bh_enable();
+}
+
+static inline int __brsem_down_read(struct brsem *sem, int interruptible)
+{
+ int *p;
+ might_sleep();
+ __brsem_enter_crit();
+ p = __brsem_rcnt(sem);
+ if (*p - 1 < INT_MAX - 1) { /* *p != INT_MIN && *p != INT_MAX */
+ (*p)++;
+ __brsem_leave_crit();
+ return 0;
+ }
+ return __brsem_down_read_slow(sem, interruptible);
+}
+
+/*
+ * Public functions
+ */
+
+/**
+ * create_brsem - allocates and initializes a new brsem. Returns
+ * pointer to the new brsem on success, NULL on failure
+ *
+ * This function may sleep.
+ */
+static inline struct brsem *create_brsem(void)
+{
+ return __create_brsem(NULL);
+}
+
+void destroy_brsem(struct brsem *brsem);
+
+/**
+ * brsem_down_read - read lock the specified brsem
+ *
+ * This function may sleep.
+ */
+static inline void brsem_down_read(struct brsem *sem)
+{
+ __brsem_down_read(sem, 0);
+}
+
+/**
+ * brsem_down_read_interruptible - read lock the specified brsem
+ * (interruptible). Returns 0 on success and -EINTR if interrupted.
+ *
+ * This function is identical to brsem_down_read except that it may be
+ * interrupted by signal. This function may sleep.
+ */
+static inline int brsem_down_read_interruptible(struct brsem *sem)
+{
+ return __brsem_down_read(sem, 1);
+}
+
+/**
+ * brsem_down_read_trylock - try to read lock the specified brsem.
+ * Returns 1 on success and 0 on failure.
+ *
+ * If the specified brsem can be acquired without sleeping, it's
+ * acquired and 1 is returned; otherwise, 0 is returned. This
+ * function doesn't sleep and can be called from anywhere except for
+ * irq handlers.
+ */
+static inline int brsem_down_read_trylock(struct brsem *sem)
+{
+ int *p;
+ BUG_ON(in_irq());
+ __brsem_enter_crit();
+ p = __brsem_rcnt(sem);
+ if (*p - 1 < INT_MAX - 1) { /* *p != INT_MIN && *p != INT_MAX */
+ (*p)++;
+ __brsem_leave_crit();
+ return 1;
+ }
+ return __brsem_down_read_trylock_slow(sem);
+}
+
+/**
+ * brsem_up_read - read unlock the specified brsem
+ *
+ * This function doesn't sleep and can be called from anywhere except
+ * for irq handlers.
+ */
+static inline void brsem_up_read(struct brsem *sem)
+{
+ int *p;
+ BUG_ON(in_irq());
+ __brsem_enter_crit();
+ p = __brsem_rcnt(sem);
+ if (*p > INT_MIN + 1) { /* *p != INT_MIN && *p != INT_MIN + 1 */
+ (*p)--;
+ __brsem_leave_crit();
+ return;
+ }
+ __brsem_up_read_slow(sem);
+}
+
+void brsem_down_write(struct brsem *sem);
+int brsem_down_write_interruptible(struct brsem *sem);
+void brsem_up_write(struct brsem *sem);
+void brsem_up_write_async(struct brsem *sem);
+
+#endif /* __LINUX_BRSEM_H */
Index: linux-work/kernel/Makefile
===================================================================
--- linux-work.orig/kernel/Makefile 2005-09-25 15:26:33.000000000 +0900
+++ linux-work/kernel/Makefile 2005-09-25 15:42:03.000000000 +0900
@@ -7,7 +7,7 @@ obj-y = sched.o fork.o exec_domain.o
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o intermodule.o extable.o params.o posix-timers.o \
- kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o
+ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o brsem.o
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
Index: linux-work/kernel/brsem.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-work/kernel/brsem.c 2005-09-25 15:42:03.000000000 +0900
@@ -0,0 +1,869 @@
+/*
+ * kernel/brsem.c - Big reader rw semaphore
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ * 02111-1307, USA.
+ *
+ * Copyright (C) 2005 Tejun Heo <[email protected]>
+ *
+ * This file implements big-reader rw semaphore.
+ *
+ * Read locking and unlocking are very cheap - no shared word, no
+ * atomic operation, no memory barrier, just local bh enable/disable,
+ * some cpu-local arithmetics and conditionals. Of course, we can't
+ * cheat the mother nature and writers take all the overhead plus some
+ * extra. Write locking and unlocking involve issuing workqueue works
+ * to all active CPUs and waiting for them.
+ *
+ * DO NOT use brsem if updates are frequent. brsem is intended to be
+ * used for things like file system remount <-> open/write
+ * synchronization or cpu hotplug synchronization, where writer side is
+ * _very_ rare while the reader side can be frequent.
+ *
+ * brsem is semantically identical to rwsem except for the followings.
+ *
+ * a. All non-sleeping functions - destroy_brsem,
+ * brsem_down_read_trylock, brsem_up_read and brsem_up_write_async
+ * - cannot be called from irq handlers. They can be called from
+ * bh or while irq and/or bh are disabled (in_softirq ||
+ * irqs_disabled) but cannot be called from irq handlers (in_irq).
+ *
+ * This is because brsem uses bh disable/enable to achieve
+ * exclusion on local cpu. This choice is made because switching
+ * local irq is quite expensive on some architectures. On
+ * architectures where local irq switching is cheaper than bh
+ * switching, changing __brsem_enter/leave_crit to use
+ * local_irq_disable/enable would be a good idea. (maybe
+ * __ARCH_CHEAP_LOCAL_IRQ_OPS)
+ *
+ * b. brsem_up_write needs to sleep. If write unlocking needs to be
+ * done while in_atomic, brsem_up_write_async should be used.
+ *
+ * A dedicated workqueue is used for brsem_up_write_async as
+ * otherwise deadlock can occur if a work on the same workqueue
+ * releases a brsem while holding a spin and then tries to regrab
+ * it after releasing the spin.
+ *
+ * Note: destroy_brsem also operates asynchronously using the same
+ * brsem workqueue.
+ *
+ * brsem is different from rcu in that
+ *
+ * a. being a semaphore, not a spinlock, readers and writers can sleep
+ * while holding it.
+ *
+ * b. it actually synchronizes and can be used where transient stale
+ * data cannot be tolerated or working around is too cumbersome.
+ *
+ * brsem is different from seqlock in that
+ *
+ * a. both readers and writers can sleep while holding it.
+ *
+ * b. it actually synchronizes and can be used where reader side retry
+ * is impossible or impractical.
+ *
+ * On UP machines, brsem can be trivially replaced by rwsem with
+ * simple macro redirections once rwsem supports interruptible down
+ * operations.
+ */
+
+#include <linux/brsem.h>
+
+#include <linux/idr.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/module.h>
+#include <linux/completion.h>
+#include <asm/bitops.h>
+#include <asm/atomic.h>
+
+enum {
+ BRSEM_INITIAL_RCNT_ARLEN = 32,
+};
+
+static int initial_rcnt_ar[BRSEM_INITIAL_RCNT_ARLEN] __initdata = { INT_MIN };
+
+static int brsem_initialized;
+static spinlock_t brsem_alloc_lock = SPIN_LOCK_UNLOCKED;
+static int brsem_len = BRSEM_INITIAL_RCNT_ARLEN;
+static DEFINE_IDR(brsem_idr);
+
+DEFINE_PER_CPU(int *, brsem_rcnt_ar) = initial_rcnt_ar;
+static DEFINE_PER_CPU(int, brsem_rcnt_arlen) = BRSEM_INITIAL_RCNT_ARLEN;
+
+static struct workqueue_struct *brsem_async_wq;
+
+static void do_async_jobs(void *data);
+
+#define is_bypass(sem) \
+ unlikely((sem)->idx == 0 && (sem)->flags & BRSEM_F_BYPASS)
+
+#define check_idx(sem) \
+ BUG_ON((sem)->idx <= 0 || (sem)->idx >= brsem_len)
+
+static inline void spin_lock_crit(spinlock_t *spin)
+{
+ __brsem_enter_crit();
+ spin_lock(spin);
+}
+
+static inline void spin_unlock_crit(spinlock_t *spin)
+{
+ spin_unlock(spin);
+ __brsem_leave_crit();
+}
+
+/*
+ * Call on all cpus
+ */
+struct coac_work_arg {
+ void (*func)(void *);
+ void *data;
+ atomic_t remaining;
+ struct completion completion;
+};
+
+static void coac_work_fn(void *data)
+{
+ struct coac_work_arg *arg = data;
+
+ arg->func(arg->data);
+
+ smp_mb__before_atomic_dec();
+
+ if (atomic_dec_and_test(&arg->remaining))
+ complete(&arg->completion);
+}
+
+static void coac_schedule_work_per_cpu(void *data)
+{
+ struct work_struct *works = data;
+
+ schedule_work(works + smp_processor_id());
+}
+
+static void call_on_all_cpus(void (*func)(void *), void *data)
+{
+ static DECLARE_MUTEX(coac_mutex); /* serializes uses of coac_works */
+ static struct work_struct coac_works[NR_CPUS];
+ struct coac_work_arg coac_arg;
+ int cpu, local_cpu = -1;
+
+ coac_arg.func = func;
+ coac_arg.data = data;
+ atomic_set(&coac_arg.remaining, num_online_cpus());
+ init_completion(&coac_arg.completion);
+
+ for_each_online_cpu(cpu) {
+ struct work_struct *w = coac_works + cpu;
+ INIT_WORK(w, coac_work_fn, &coac_arg);
+ }
+
+ down(&coac_mutex);
+ lock_cpu_hotplug();
+
+ /*
+ * If we're on keventd, scheduling work and waiting for it
+ * will deadlock. In such cases, @func is invoked directly.
+ * Note that, if we're not on keventd, we cannot call @func
+ * directly as we aren't bound to specific cpu and @func
+ * cannot be called with preemption disabled.
+ */
+ preempt_disable();
+ if (current_is_keventd())
+ local_cpu = smp_processor_id();
+
+ smp_call_function(coac_schedule_work_per_cpu, coac_works, 1, 0);
+
+ if (local_cpu >= 0) {
+ preempt_enable();
+ func(data);
+ if (atomic_dec_and_test(&coac_arg.remaining))
+ smp_mb__after_atomic_dec();
+ else
+ wait_for_completion(&coac_arg.completion);
+ } else {
+ coac_schedule_work_per_cpu(coac_works);
+ preempt_enable();
+ wait_for_completion(&coac_arg.completion);
+ }
+
+ unlock_cpu_hotplug();
+ up(&coac_mutex);
+}
+
+static void *alloc_rcnt_ar(size_t size)
+{
+ if (size <= PAGE_SIZE)
+ return kmalloc(size, GFP_KERNEL);
+ else
+ return vmalloc(size);
+}
+
+static void free_rcnt_ar(void *ptr, size_t size)
+{
+ if (size <= PAGE_SIZE)
+ kfree(ptr);
+ else
+ vfree(ptr);
+}
+
+/*
+ * While expanding rcnt_ar's, substitution should be done locally on
+ * each cpu. Note that rcnt_ar's are also allocated by each cpu.
+ * This is simpler to implement and NUMA friendly.
+ *
+ * Once expanded, rcnt_ar's are never shrunk, not even when
+ * expansion on other cpus fail. Per-cpu rcnt_ar_len's are kept to
+ * maintain integrity and avoid unnecessary reallocation. Note that
+ * per-cpu rcnt_ar_len's are also necessary when processing cpu
+ * hot-plug events.
+ */
+struct expand_brsem_arg {
+ int new_len;
+ atomic_t failed;
+};
+
+static void expand_brsem_cpucb(void *data)
+{
+ struct expand_brsem_arg *ebarg = data;
+ int len, new_len;
+ size_t size, new_size;
+ int *rcnt_ar, *new_rcnt_ar;
+
+ len = __get_cpu_var(brsem_rcnt_arlen);
+ size = sizeof(rcnt_ar[0]) * len;
+ rcnt_ar = __get_cpu_var(brsem_rcnt_ar);
+
+ new_len = ebarg->new_len;
+ if (len >= new_len)
+ return;
+ new_size = sizeof(new_rcnt_ar[0]) * new_len;
+ new_rcnt_ar = alloc_rcnt_ar(new_size);
+ if (!new_rcnt_ar)
+ goto fail;
+
+ memset((void *)new_rcnt_ar + size, 0, new_size - size);
+
+ __brsem_enter_crit();
+ memcpy(new_rcnt_ar, rcnt_ar, size);
+ __get_cpu_var(brsem_rcnt_ar) = new_rcnt_ar;
+ __get_cpu_var(brsem_rcnt_arlen) = new_len;
+ __brsem_leave_crit();
+
+ free_rcnt_ar(rcnt_ar, size);
+
+ return;
+ fail:
+ atomic_inc(&ebarg->failed);
+}
+
+static int expand_brsem(int target_idx)
+{
+ static DECLARE_MUTEX(expand_mutex);
+ int new_len, res;
+ struct expand_brsem_arg ebarg;
+
+ /*
+ * brsem rcnt_ar's cannot be expanded until brsem is fully
+ * initialized. If the following BUG_ON is ever hit, bump
+ * BRSEM_INITIAL_RCNT_ARLEN.
+ */
+ BUG_ON(!brsem_initialized);
+
+ down(&expand_mutex);
+
+ new_len = brsem_len;
+ while (new_len <= target_idx) {
+ new_len <<= 1;
+ if (new_len < 0) {
+ /* Duh... wrapped around? */
+ WARN_ON(1);
+ res = -EBUSY;
+ goto out;
+ }
+ }
+
+ res = 0;
+ if (new_len <= brsem_len)
+ goto out;
+
+ ebarg.new_len = new_len;
+ atomic_set(&ebarg.failed, 0);
+
+ call_on_all_cpus(expand_brsem_cpucb, &ebarg);
+
+ res = -ENOMEM;
+ if (atomic_read(&ebarg.failed))
+ goto out;
+
+ brsem_len = new_len;
+ res = 0;
+ out:
+ up(&expand_mutex);
+ return res;
+}
+
+struct brsem *__create_brsem(struct brsem *sem)
+{
+ struct brsem *allocated_sem = NULL;
+ int idx, res;
+
+ if (!sem) {
+ sem = allocated_sem = kzalloc(sizeof(*sem), GFP_KERNEL);
+ if (!sem)
+ goto out;
+ }
+
+ do {
+ res = idr_pre_get(&brsem_idr, GFP_KERNEL);
+ if (res < 0)
+ goto out_free;
+ spin_lock(&brsem_alloc_lock);
+ res = idr_get_new_above(&brsem_idr, sem, 1, &idx);
+ spin_unlock(&brsem_alloc_lock);
+ } while (res == -EAGAIN);
+
+ if (res < 0)
+ goto out_free;
+
+ while (idx >= brsem_len) {
+ res = expand_brsem(idx);
+ if (res < 0)
+ goto out_idr_remove;
+ }
+
+ sem->idx = idx;
+ spin_lock_init(&sem->lock);
+ init_waitqueue_head(&sem->read_wait);
+ init_waitqueue_head(&sem->write_wait);
+ init_MUTEX(&sem->write_mutex);
+ INIT_WORK(&sem->async_work, do_async_jobs, sem);
+ sem->flags |= allocated_sem ? BRSEM_F_ALLOCATED : 0;
+
+ goto out;
+
+ out_idr_remove:
+ spin_lock(&brsem_alloc_lock);
+ idr_remove(&brsem_idr, idx);
+ spin_unlock(&brsem_alloc_lock);
+ out_free:
+ kfree(allocated_sem);
+ sem = NULL;
+ out:
+ return sem;
+}
+
+/*
+ * This is the heart and soul of brsem write operations. sync_brsem()
+ * does two things atomically (w.r.t. each cpu) - collecting all
+ * cpu-local reader counts into sem->master_rcnt, and resetting or
+ * locking those rcnts.
+ *
+ * Locking per-cpu rcnts is done by setting them to INT_MIN. All
+ * reader-side fast path functions fall back to slow path when they
+ * encounter INT_MIN, allowing inter-cpu synchronization and rwsem
+ * semantics to be implemented in slow path proper.
+ */
+struct sync_brsem_arg {
+ struct brsem *sem;
+ int write_locking;
+};
+
+static void sync_brsem_cpucb(void *data)
+{
+ struct sync_brsem_arg *sbarg = data;
+ struct brsem *sem = sbarg->sem;
+ int *p = __brsem_rcnt(sem);
+
+ __brsem_enter_crit();
+
+ /* collect rcnt */
+ if (*p != 0 && *p != INT_MIN) {
+ spin_lock(&sem->lock);
+ sem->master_rcnt += *p;
+ spin_unlock(&sem->lock);
+ }
+
+ /* lock or unlock rcnt */
+ *p = sbarg->write_locking ? INT_MIN : 0;
+
+ __brsem_leave_crit();
+}
+
+static void sync_brsem(struct brsem *sem, int write_locking)
+{
+ int cpu;
+
+ if (brsem_initialized) {
+ struct sync_brsem_arg sbarg = { sem, write_locking };
+ call_on_all_cpus(sync_brsem_cpucb, &sbarg);
+ return;
+ }
+
+ /*
+ * Workqueue is not operational yet. Fortunately, we're still
+ * single threaded. Sync manually. Note that lockings are
+ * not necessary here. They're done just for consistency.
+ */
+ lock_cpu_hotplug();
+ spin_lock_crit(&sem->lock);
+ for_each_online_cpu(cpu) {
+ int *p = per_cpu(brsem_rcnt_ar, cpu) + sem->idx;
+ if (*p != INT_MIN)
+ sem->master_rcnt += *p;
+ *p = write_locking ? INT_MIN : 0;
+ }
+ spin_unlock_crit(&sem->lock);
+ unlock_cpu_hotplug();
+}
+
+static void do_destroy_brsem(struct brsem *sem)
+{
+ check_idx(sem);
+
+ sync_brsem(sem, 0);
+ BUG_ON(sem->master_rcnt != 0);
+
+ spin_lock(&brsem_alloc_lock);
+
+ BUG_ON(idr_find(&brsem_idr, sem->idx) != sem);
+ idr_remove(&brsem_idr, sem->idx);
+
+ spin_unlock(&brsem_alloc_lock);
+
+ sem->idx = 0;
+
+ if (sem->flags & BRSEM_F_ALLOCATED)
+ kfree(sem);
+}
+
+static void do_async_jobs(void *data)
+{
+ struct brsem *sem = data;
+ unsigned flags;
+
+ spin_lock_crit(&sem->lock);
+ flags = sem->flags & BRSEM_F_ASYNC_MASK;
+ sem->flags ^= flags;
+ spin_unlock_crit(&sem->lock);
+
+ if (flags & BRSEM_F_ASYNC_UPWRITE)
+ brsem_up_write(sem);
+
+ if (flags & BRSEM_F_ASYNC_DESTROY)
+ do_destroy_brsem(sem);
+}
+
+static void queue_async(struct brsem *sem, unsigned todo)
+{
+ spin_lock_crit(&sem->lock);
+
+ BUG_ON(todo & ~BRSEM_F_ASYNC_MASK || sem->flags & todo);
+ sem->flags |= todo;
+
+ queue_work(brsem_async_wq, &sem->async_work);
+
+ /* sem->lock must be released after the last reference */
+ spin_unlock_crit(&sem->lock);
+}
+
+/**
+ * destroy_brsem - destroy and free the specified brsem
+ *
+ * This function doesn't sleep and can be called from anywhere except
+ * for irq handlers. See comment at the top of this file for more
+ * information regarding async operations.
+ */
+void destroy_brsem(struct brsem *sem)
+{
+ queue_async(sem, BRSEM_F_ASYNC_DESTROY);
+}
+
+int __brsem_down_read_slow(struct brsem *sem, int interruptible)
+{
+ int *p;
+ DEFINE_WAIT(wait);
+
+ if (is_bypass(sem))
+ goto out;
+ check_idx(sem);
+ retry:
+ p = __brsem_rcnt(sem);
+ switch (*p) {
+ case INT_MIN:
+ /* writer(s) present */
+ prepare_to_wait(&sem->read_wait, &wait,
+ interruptible ? TASK_INTERRUPTIBLE
+ : TASK_UNINTERRUPTIBLE);
+ __brsem_leave_crit();
+
+ schedule();
+ finish_wait(&sem->read_wait, &wait);
+
+ if (interruptible && signal_pending(current))
+ return -EINTR;
+
+ __brsem_enter_crit();
+ goto retry;
+
+ case INT_MAX:
+ /* local rcnt overflow, dump into master rcnt */
+ spin_lock(&sem->lock);
+ sem->master_rcnt += *p;
+ *p = 1;
+ spin_unlock(&sem->lock);
+ break;
+
+ default:
+ (*p)++;
+ }
+ out:
+ __brsem_leave_crit();
+ return 0;
+}
+
+int __brsem_down_read_trylock_slow(struct brsem *sem)
+{
+ int *p = __brsem_rcnt(sem);
+ int res = 1;
+
+ if (is_bypass(sem))
+ goto out;
+ check_idx(sem);
+
+ if (*p == INT_MIN) {
+ /* writer(s) present */
+ res = 0;
+ } else {
+ /* local rcnt overflow, dump into master rcnt */
+ spin_lock(&sem->lock);
+ sem->master_rcnt += *p;
+ *p = 1;
+ spin_unlock(&sem->lock);
+ }
+ out:
+ __brsem_leave_crit();
+ return res;
+}
+
+void __brsem_up_read_slow(struct brsem *sem)
+{
+ int *p = __brsem_rcnt(sem);
+
+ if (is_bypass(sem))
+ goto out;
+ check_idx(sem);
+
+ spin_lock(&sem->lock);
+
+ if (*p == INT_MIN) {
+ /* writer(s) present. */
+ if (--sem->master_rcnt == 0)
+ wake_up(&sem->write_wait);
+ } else {
+ /* local rcnt underflow, dump into master rcnt */
+ sem->master_rcnt += *p;
+ *p = -1;
+ }
+
+ spin_unlock(&sem->lock);
+ out:
+ __brsem_leave_crit();
+}
+
+static int __brsem_down_write(struct brsem *sem, int interruptible)
+{
+ int res;
+
+ if (is_bypass(sem))
+ return 0;
+ check_idx(sem);
+
+ if (interruptible) {
+ res = down_interruptible(&sem->write_mutex);
+ if (res < 0)
+ return res;
+ } else
+ down(&sem->write_mutex);
+
+ sync_brsem(sem, 1);
+
+ spin_lock_crit(&sem->lock);
+
+ if (sem->master_rcnt) {
+ /* there are still readers left, wait for them */
+ DEFINE_WAIT(wait);
+
+ BUG_ON(sem->master_rcnt < 0);
+
+ prepare_to_wait(&sem->write_wait, &wait,
+ interruptible ? TASK_INTERRUPTIBLE
+ : TASK_UNINTERRUPTIBLE);
+ spin_unlock_crit(&sem->lock);
+ schedule();
+ finish_wait(&sem->write_wait, &wait);
+
+ if (interruptible && signal_pending(current)) {
+ sync_brsem(sem, 0);
+ wake_up_all(&sem->read_wait);
+ up(&sem->write_mutex);
+ return -EINTR;
+ }
+ /* we got the lock */
+ } else
+ spin_unlock_crit(&sem->lock);
+
+ return 0;
+}
+
+/**
+ * brsem_down_write - write lock the specified brsem
+ *
+ * This function may sleep.
+ */
+void brsem_down_write(struct brsem *sem)
+{
+ int res = __brsem_down_write(sem, 0);
+ BUG_ON(res != 0);
+}
+
+/**
+ * brsem_down_write_interruptible - write lock the specified brsem
+ * (interruptible). Returns 0 on success and -EINTR if interrupted.
+ *
+ * This function is identical to brsem_down_write except that it may
+ * be interrupted by signal. This function may sleep.
+ */
+int brsem_down_write_interruptible(struct brsem *sem)
+{
+ return __brsem_down_write(sem, 1);
+}
+
+/**
+ * brsem_up_write - write unlock the specified brsem
+ *
+ * This function may sleep.
+ */
+void brsem_up_write(struct brsem *sem)
+{
+ if (is_bypass(sem))
+ return;
+ check_idx(sem);
+
+ BUG_ON(sem->master_rcnt);
+ sync_brsem(sem, 0);
+ wake_up_all(&sem->read_wait);
+ up(&sem->write_mutex);
+}
+
+/**
+ * brsem_up_write_async - write unlock the specified brsem asynchronously
+ *
+ * This function schedules write unlock of the specified brsem. It
+ * can be called from anywhere except irq handlers. See comment at
+ * the top of this file for more information regarding async
+ * operations.
+ */
+void brsem_up_write_async(struct brsem *sem)
+{
+ queue_async(sem, BRSEM_F_ASYNC_UPWRITE);
+}
+
+static int __cpuinit brsem_cpu_callback(struct notifier_block *nfb,
+ unsigned long action, void *data)
+{
+ int cpu = (unsigned long)data, online_cpu;
+ int *rcnt_ar, *new_rcnt_ar, len, i;
+ struct brsem *sem;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ /*
+ * Using any online cpu's rcnt_arlen as reference is
+ * safe because it is protected with cpu hotplug lock.
+ */
+ online_cpu = any_online_cpu(CPU_MASK_ALL);
+ rcnt_ar = per_cpu(brsem_rcnt_ar, online_cpu);
+ len = per_cpu(brsem_rcnt_arlen, online_cpu);
+
+ new_rcnt_ar = alloc_rcnt_ar(sizeof(new_rcnt_ar[0]) * len);
+ if (!new_rcnt_ar)
+ return NOTIFY_BAD;
+
+ /*
+ * Transition between INT_MIN and any other value is
+ * protected by cpu hotplug lock.
+ */
+ for (i = 0; i < len; i++)
+ new_rcnt_ar[i] = rcnt_ar[i] == INT_MIN ? INT_MIN : 0;
+
+ BUG_ON(per_cpu(brsem_rcnt_ar, cpu) ||
+ per_cpu(brsem_rcnt_arlen, cpu));
+
+ per_cpu(brsem_rcnt_ar, cpu) = new_rcnt_ar;
+ per_cpu(brsem_rcnt_arlen, cpu) = len;
+ break;
+
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_UP_CANCELED:
+ case CPU_DEAD:
+ rcnt_ar = per_cpu(brsem_rcnt_ar, cpu);
+ len = per_cpu(brsem_rcnt_arlen, cpu);
+ if (rcnt_ar == NULL)
+ break;
+
+ per_cpu(brsem_rcnt_ar, cpu) = NULL;
+ per_cpu(brsem_rcnt_arlen, cpu) = 0;
+
+ for (i = 0; i < len; i++) {
+ if (rcnt_ar[i] == 0 || rcnt_ar[i] == INT_MIN)
+ continue;
+
+ spin_lock(&brsem_alloc_lock);
+ sem = idr_find(&brsem_idr, i);
+ spin_unlock(&brsem_alloc_lock);
+
+ BUG_ON(!sem || sem->idx != i);
+
+ /*
+ * All inter-cpu synchronizations occur while
+ * rcnt_ar is INT_MIN, so the following
+ * doesn't interfere with any.
+ */
+ spin_lock_crit(&sem->lock);
+ sem->master_rcnt += rcnt_ar[i];
+ spin_unlock_crit(&sem->lock);
+ }
+
+ free_rcnt_ar(rcnt_ar, sizeof(rcnt_ar[0]) * len);
+ break;
+#endif
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block brsem_cpu_notifier =
+ { brsem_cpu_callback, NULL, 0 };
+
+/*
+ * brsem is initialized in three stages to make it usable as early as
+ * possible in the booting process.
+ *
+ * 1. per_cpu area setup
+ *
+ * This happens _very_ early in the booting process. Once this is
+ * done, all statically allocated brsems initialized with
+ * BRSEM_BYPASS_INITIALIZER can be passed to brsem functions. At
+ * this stage, these brsems don't do anything. All operations on
+ * them succeed unconditionally.
+ *
+ * As only one thread runs on only one cpu in this stage, bypassing
+ * locking mechanism doesn't cause any problem.
+ *
+ * brsems cannot be initialized with __create_brsem() or created with
+ * create_brsem() in this stage.
+ *
+ * 2. brsem_init_early
+ *
+ * This is done as soon as the slab allocator is online. The CPU
+ * notifier is installed and brsem_idr is initialized. The workqueue
+ * is not working yet, but the kernel runs only one thread until this
+ * stage, making access to other cpus' rcnt_ar's safe.
+ *
+ * In this stage, brsem operates properly (not of much use as we're
+ * still single-threaded) and brsems can be initialized or allocated.
+ *
+ * 3. brsem_init
+ *
+ * After workqueue is ready, brsem_init() is called and brsem becomes
+ * fully operational.
+ *
+ * Unless brsems are needed before stage 2, users of brsem
+ * don't have to worry about initialization. Simply calling
+ * create_brsem() works.
+ *
+ * However, if a brsem is needed before stage 2, it needs to be
+ * allocated manually (either statically or with alloc_bootmem) and
+ * initialized with BRSEM_BYPASS_INITIALIZER before the first use.
+ * The brsem must be initialized with __create_brsem() once stage 2
+ * is reached. The initialization can be done at any time after stage
+ * 2, but doing it while things are still single-threaded is strongly
+ * recommended.
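+ *
+ * For illustration, an early user might look like this (a rough
+ * sketch only - my_sem and my_subsys_init are made-up names, and
+ * the __create_brsem() argument list may differ):
+ *
+ *	static struct brsem my_sem = BRSEM_BYPASS_INITIALIZER;
+ *
+ *	void __init my_subsys_init(void)
+ *	{
+ *		__create_brsem(&my_sem);
+ *	}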
+ *
+ * *CAUTION* When initializing a brsem with __create_brsem(), the
+ * brsem MUST NOT have any writer or reader. brsem doesn't keep track
+ * of its holders while bypassing, and initializing it while holders
+ * are present will screw up the brsem when they release it.
+ */
+void __init brsem_init_early(void)
+{
+ int cpu;
+
+ /*
+ * per-cpu initializer linked initial rcnt_ar to all cpus.
+ * Bang all except cpu0.
+ */
+ for (cpu = 1; cpu < NR_CPUS; cpu++) {
+ per_cpu(brsem_rcnt_ar, cpu) = NULL;
+ per_cpu(brsem_rcnt_arlen, cpu) = 0;
+ }
+
+ register_cpu_notifier(&brsem_cpu_notifier);
+ idr_init(&brsem_idr);
+}
+
+static void __init dummy_cpucb(void *data)
+{
+ /* nothing */
+}
+
+void __init brsem_init(void)
+{
+ int *rcnt_ar, *new_rcnt_ar;
+ size_t size;
+
+ /* Replace cpu0's __initdata rcnt_ar with kmalloc'ed one */
+ rcnt_ar = per_cpu(brsem_rcnt_ar, 0);
+
+ size = sizeof(new_rcnt_ar[0]) * BRSEM_INITIAL_RCNT_ARLEN;
+ new_rcnt_ar = kmalloc(size, GFP_KERNEL);
+ BUG_ON(!new_rcnt_ar);
+
+ memcpy(new_rcnt_ar, rcnt_ar, size);
+ per_cpu(brsem_rcnt_ar, 0) = new_rcnt_ar;
+
+ /* Create async workqueue */
+ brsem_async_wq = create_singlethread_workqueue("brsem");
+ BUG_ON(!brsem_async_wq);
+
+	/* CPUs are all online now and the workqueue is working */
+ brsem_initialized = 1;
+
+ /* Make sure other cpus see above change */
+ call_on_all_cpus(dummy_cpucb, NULL);
+}
+
+EXPORT_SYMBOL(__create_brsem);
+EXPORT_SYMBOL(destroy_brsem);
+EXPORT_SYMBOL(__brsem_down_read_slow);
+EXPORT_SYMBOL(__brsem_down_read_trylock_slow);
+EXPORT_SYMBOL(__brsem_up_read_slow);
+EXPORT_SYMBOL(brsem_down_write);
+EXPORT_SYMBOL(brsem_down_write_interruptible);
+EXPORT_SYMBOL(brsem_up_write);
Index: linux-work/init/main.c
===================================================================
--- linux-work.orig/init/main.c 2005-09-25 15:26:33.000000000 +0900
+++ linux-work/init/main.c 2005-09-25 15:42:03.000000000 +0900
@@ -35,6 +35,7 @@
#include <linux/kernel_stat.h>
#include <linux/security.h>
#include <linux/workqueue.h>
+#include <linux/brsem.h>
#include <linux/profile.h>
#include <linux/rcupdate.h>
#include <linux/moduleparam.h>
@@ -511,6 +512,7 @@ asmlinkage void __init start_kernel(void
kmem_cache_init();
setup_per_cpu_pageset();
numa_policy_init();
+ brsem_init_early();
if (late_time_init)
late_time_init();
calibrate_delay();
@@ -605,6 +607,7 @@ static void __init do_basic_setup(void)
{
/* drivers will send hotplug events */
init_workqueues();
+ brsem_init();
usermodehelper_init();
driver_init();
Tejun Heo wrote:
> 01_brsem_implement_brsem.patch
>
> This patch implements big reader semaphore - a rwsem with very
> cheap reader-side operations and very expensive writer-side
> operations. For details, please read comments at the top of
> kern/brsem.c.
>
This thing looks pretty overengineered. It is difficult to
read with all the little wrapper functions and weird naming
schemes.
What would be wrong with an array of NR_CPUS rwsems? The only
tiny trick you would have to do AFAIKS is have up_read remember
what rwsem down_read took, but that could be returned from
down_read as a token.
I have been meaning to do something like this for mmap_sem to
see what happens to page fault scalability (though the heavy
write-side would make such a scheme unsuitable for mainline).
--
SUSE Labs, Novell Inc.
Tejun Heo wrote:
> +/*
> + * cpucontrol is a brsem used to synchronize cpu hotplug events.
> + * Invoking lock_cpu_hotplug() read-locks cpucontrol and no
> + * hotplugging events will occur until it's released.
> + *
> + * Unfortunately, brsem itself makes use of lock_cpu_hotplug() and
> + * performing brsem write-lock operations on cpucontrol deadlocks.
> + * This is avoided by...
> + *
> + * a. guaranteeing that cpu hotplug events won't occur during the
> + * write-lock operations, and
> + *
> + * b. skipping lock_cpu_hotplug() inside brsem.
> + *
> + * #a is achieved by acquiring and releasing cpucontrol_mutex outside
> + * cpucontrol write-lock. #b is achieved by skipping
> + * lock_cpu_hotplug() inside brsem if the current task is
> + * cpucontrol_mutex holder (is_cpu_hotplug_holder() test).
> + *
> + * Also, note that cpucontrol is first initialized with
> + * BRSEM_BYPASS_INITIALIZER and then initialized again with
> + * __create_brsem() instead of simply using create_brsem(). This is
> + * necessary as cpucontrol brsem gets used way before brsem subsystem
> + * becomes up and running.
> + *
> + * Until brsem is properly initialized, all brsem ops succeed
> + * unconditionally. cpucontrol becomes operational only after
> + * cpucontrol_init() is finished, which should be called after
> + * brsem_init_early().
> + */
Mmm, this is just insane IMO.
Note that I happen to also think the idea (brsems) has merit, and
that cpucontrol may be one of the places where a sane implementation
would actually be useful... but at least when you're introducing
this kind of complexity anywhere, you *really* need to be able to
back it up with numbers.
As far as the VFS race fix goes, I guess Al or someone else will
comment on its correctness. But I think it might be nicer to first
fix it with a regular rwsem and then show some numbers to justify
its conversion to a brsem.
If you need interruptible rwsems, I almost got an implementation
working a while back, and David Howells recently said he was
interested in doing them... so that's not an impossibility.
Nick
--
SUSE Labs, Novell Inc.
Index: linux-2.6/include/linux/brsem.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/brsem.h
@@ -0,0 +1,18 @@
+#ifndef __BRSEM_H
+#define __BRSEM_H
+
+#include <linux/rwsem.h>
+struct brsem {
+ struct rw_semaphore cpu_sem[NR_CPUS];
+};
+
+#define BRSEM_READ_TRYLOCK_FAILED -1
+typedef int brsem_read_t;
+
+brsem_read_t brsem_down_read(struct brsem *);
+brsem_read_t brsem_down_read_trylock(struct brsem *);
+void brsem_up_read(struct brsem *, brsem_read_t);
+void brsem_down_write(struct brsem *);
+void brsem_up_write(struct brsem *);
+
+#endif
Index: linux-2.6/lib/Makefile
===================================================================
--- linux-2.6.orig/lib/Makefile
+++ linux-2.6/lib/Makefile
@@ -5,7 +5,7 @@
lib-y := errno.o ctype.o string.o vsprintf.o cmdline.o \
bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
idr.o div64.o int_sqrt.o bitmap.o extable.o prio_tree.o \
- sha1.o
+ sha1.o brsem.o
lib-y += kobject.o kref.o kobject_uevent.o klist.o
Index: linux-2.6/lib/brsem.c
===================================================================
--- /dev/null
+++ linux-2.6/lib/brsem.c
@@ -0,0 +1,38 @@
+#include <linux/brsem.h>
+#include <linux/rwsem.h>
+#include <linux/smp.h>
+
+brsem_read_t brsem_down_read(struct brsem *brsem)
+{
+ brsem_read_t ret = smp_processor_id();
+ down_read(&brsem->cpu_sem[ret]);
+ return ret;
+}
+
+brsem_read_t brsem_down_read_trylock(struct brsem *brsem)
+{
+ brsem_read_t ret = smp_processor_id();
+ if (!down_read_trylock(&brsem->cpu_sem[ret]))
+ return BRSEM_READ_TRYLOCK_FAILED;
+ return ret;
+}
+
+void brsem_up_read(struct brsem *brsem, brsem_read_t token)
+{
+ up_read(&brsem->cpu_sem[token]);
+}
+
+void brsem_down_write(struct brsem *brsem)
+{
+ int i;
+ for (i = 0; i < NR_CPUS; i++)
+ down_write(&brsem->cpu_sem[i]);
+}
+
+void brsem_up_write(struct brsem *brsem)
+{
+ int i;
+ for (i = 0; i < NR_CPUS; i++)
+ up_write(&brsem->cpu_sem[i]);
+}
+
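For illustration, a caller would use the token API above roughly like
this (an illustrative sketch, not part of the patch; my_brsem is a
made-up name and its cpu_sem[] entries are assumed to have been
init_rwsem()'d already):

	static struct brsem my_brsem;

	void reader(void)
	{
		brsem_read_t tok = brsem_down_read(&my_brsem);
		/* read-side critical section */
		brsem_up_read(&my_brsem, tok);
	}

The token records which per-cpu rwsem down_read took, so up_read
still releases the right one even if the task has migrated to another
cpu in between.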
Nick Piggin wrote:
> Tejun Heo wrote:
>
>> +/*
>> + * cpucontrol is a brsem used to synchronize cpu hotplug events.
>> + * Invoking lock_cpu_hotplug() read-locks cpucontrol and no
>> + * hotplugging events will occur until it's released.
>> + *
>> + * Unfortunately, brsem itself makes use of lock_cpu_hotplug() and
>> + * performing brsem write-lock operations on cpucontrol deadlocks.
>> + * This is avoided by...
>> + *
>> + * a. guaranteeing that cpu hotplug events won't occur during the
>> + * write-lock operations, and
>> + *
>> + * b. skipping lock_cpu_hotplug() inside brsem.
>> + *
>> + * #a is achieved by acquiring and releasing cpucontrol_mutex outside
>> + * cpucontrol write-lock. #b is achieved by skipping
>> + * lock_cpu_hotplug() inside brsem if the current task is
>> + * cpucontrol_mutex holder (is_cpu_hotplug_holder() test).
>> + *
>> + * Also, note that cpucontrol is first initialized with
>> + * BRSEM_BYPASS_INITIALIZER and then initialized again with
>> + * __create_brsem() instead of simply using create_brsem(). This is
>> + * necessary as cpucontrol brsem gets used way before brsem subsystem
>> + * becomes up and running.
>> + *
>> + * Until brsem is properly initialized, all brsem ops succeed
>> + * unconditionally. cpucontrol becomes operational only after
>> + * cpucontrol_init() is finished, which should be called after
>> + * brsem_init_early().
>> + */
>
>
> Mmm, this is just insane IMO.
>
> Note that I happen to also think the idea (brsems) has merit, and
> that cpucontrol may be one of the places where a sane implementation
> would actually be useful... but at least when you're introducing
> this kind of complexity anywhere, you *really* need to be able to
> back it up with numbers.
>
> As far as the VFS race fix goes, I guess Al or someone else will
> comment on its correctness. But I think it might be nicer to first
> fix it with a regular rwsem and then show some numbers to justify
> its conversion to a brsem.
>
> If you need interruptible rwsems, I almost got an implementation
> working a while back, and David Howells recently said he was
> interested in doing them... so that's not an impossibility.
>
> Nick
>
Hello, Nick.
I do agree that it's absolutely ugly. I thought about ripping out the
3-stage init'ing and cpu hotplug stuff, as currently cpu hotplug
locking isn't used frequently, but was just giving it a shot. I'll
strip this thing out.
Thanks.
--
tejun
Hello, Nick.
Nick Piggin wrote:
> Tejun Heo wrote:
>
>> 01_brsem_implement_brsem.patch
>>
>> This patch implements big reader semaphore - a rwsem with very
>> cheap reader-side operations and very expensive writer-side
>> operations. For details, please read comments at the top of
>> kern/brsem.c.
>>
>
> This thing looks pretty overengineered. It is difficult to
> read with all the little wrapper functions and weird naming
> schemes.
As I've said in the other reply, I'll first rip out the three-stage
init thing for cpucontrol. That added quite a bit of complexity. As
for the weird naming scheme, please tell me how to fix it. I have no
problem renaming things.
> What would be wrong with an array of NR_CPUS rwsems? The only
> tiny trick you would have to do AFAIKS is have up_read remember
> what rwsem down_read took, but that could be returned from
> down_read as a token.
Using an array of rwsems means that every reader-side op performs
(unnecessary) real down and up operations on the semaphore, which
involve an atomic memory op and probably a memory barrier. These are
pretty expensive operations.
What brsem tries to do is implement rwsem semantics while performing
only normal (as opposed to atomic/barrier) instructions during
reader-side operations. That's why all the call_on_all_cpus stuff is
needed while write-locking. Do you think avoiding atomic/barrier
stuff would be overkill?
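Roughly, the read-lock fast path boils down to this (a hand-wavy
sketch - rcnt_of() is a made-up helper for the per-cpu counter lookup
and the real slow-path handoff is more involved):

	void brsem_down_read(struct brsem *sem)
	{
		int *rcnt;

		local_bh_disable();
		rcnt = rcnt_of(sem);	/* this cpu's counter for sem->idx */
		if (likely(*rcnt >= 0)) {
			(*rcnt)++;	/* plain increment - no atomics */
			local_bh_enable();
			return;
		}
		local_bh_enable();
		__brsem_down_read_slow(sem);	/* writer active, may sleep */
	}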
> I have been meaning to do something like this for mmap_sem to
> see what happens to page fault scalability (though the heavy
> write-side would make such a scheme unsuitable for mainline).
I agree that brsem write-locking would be too heavy for mmap_sem in
the usual cases.
Thanks.
--
tejun
Tejun Heo wrote:
>
> As I've said in the other reply, I'll first rip out the three-stage
> init thing for cpucontrol. That added quite a bit of complexity. As
> for the weird naming scheme, please tell me how to fix it. I have no
> problem renaming things.
>
OK, my criticism of your naming was not constructive so I apologise
for that. I will help you with some of those minor issues if we
establish that your overall design is a goer.
>> What would be wrong with an array of NR_CPUS rwsems? The only
>> tiny trick you would have to do AFAIKS is have up_read remember
>> what rwsem down_read took, but that could be returned from
>> down_read as a token.
>
>
> Using an array of rwsems means that every reader-side op performs
> (unnecessary) real down and up operations on the semaphore, which
> involve an atomic memory op and probably a memory barrier. These are
> pretty expensive operations.
>
> What brsem tries to do is implement rwsem semantics while performing
> only normal (as opposed to atomic/barrier) instructions during
> reader-side operations. That's why all the call_on_all_cpus stuff is
> needed while write-locking. Do you think avoiding atomic/barrier
> stuff would be overkill?
>
Yes I think so. I think the main problem on modern CPUs is
not atomic operations as such, but cacheline bouncing.
Without any numbers, I'd guess that your down_read is more
expensive than mine simply due to touching more cachelines
and having more branches.
The other thing is simply that you really want your
synchronization primitives to be as simple and verifiable
as possible. For example, rwsems even recently have had subtle
memory ordering and other implementation corner cases, and
they're much less complex than this brsem.
Nick
--
SUSE Labs, Novell Inc.
Hello, Nick.
Nick Piggin wrote:
> Tejun Heo wrote:
>
>>
>> As I've said in the other reply, I'll first rip out the three-stage
>> init thing for cpucontrol. That added quite a bit of complexity. As
>> for the weird naming scheme, please tell me how to fix it. I have no
>> problem renaming things.
>>
>
> OK, my criticism of your naming was not constructive so I apologise
> for that. I willll help you with some of those minor issues if we
> establish that your overall design is a a goer.
>
Thanks. :-)
>
>>> What would be wrong with an array of NR_CPUS rwsems? The only
>>> tiny trick you would have to do AFAIKS is have up_read remember
>>> what rwsem down_read took, but that could be returned from
>>> down_read as a token.
>>
>>
>>
>> Using an array of rwsems means that every reader-side op performs
>> (unnecessary) real down and up operations on the semaphore, which
>> involve an atomic memory op and probably a memory barrier. These
>> are pretty expensive operations.
>>
>> What brsem tries to do is implement rwsem semantics while
>> performing only normal (as opposed to atomic/barrier) instructions
>> during reader-side operations. That's why all the call_on_all_cpus
>> stuff is needed while write-locking. Do you think avoiding
>> atomic/barrier stuff would be overkill?
>>
>
> Yes I think so. I think the main problem on modern CPUs is
> not atomic operations as such, but cacheline bouncing.
>
> Without any numbers, I'd guess that your down_read is more
> expensive than mine simply due to touching more cachelines
> and having more branches.
Other than local_bh_disable/enable(), all a brsem read op does is
1. access sem->idx
2. calculate the per-cpu rcnt address from sem->idx
3. branch once on the value of the per-cpu rcnt
4. inc/dec the per-cpu rcnt
So, it does access one more cacheline and, yeap, there is one branch
for bh enabling and several more inside local_bh_enable. I'll try to
get some benchmark numbers for comparison.
I'm thinking about adding down_read(&xxx->s_umount) to write(2) and
comparing normal rwsem and brsem performance by repeatedly writing
short data into a file on a UP machine. Do you have better ideas?
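Something like the following in vfs_write() (an untested sketch - the
existing body stays as-is and ret is whatever it computes; in the
brsem tree the calls would become brsem_down_read/brsem_up_read):

	ssize_t vfs_write(struct file *file, const char __user *buf,
			  size_t count, loff_t *pos)
	{
		struct super_block *sb = file->f_dentry->d_inode->i_sb;
		ssize_t ret;

		down_read(&sb->s_umount);
		/* ... existing vfs_write() body sets ret ... */
		up_read(&sb->s_umount);
		return ret;
	}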
>
> The other thing is simply that you really want your
> synchronization primitives to be as simple and verifiable
> as possible. For example, rwsems even recently have had subtle
> memory ordering and other implementation corner cases, and
> they're much less complex than this brsem.
>
> Nick
>
Thanks.
--
tejun
Tejun Heo wrote:
> Other than local_bh_disable/enable(), all a brsem read op does is
>
> 1. access sem->idx
> 2. calculate the per-cpu rcnt address from sem->idx
> 3. branch once on the value of the per-cpu rcnt
> 4. inc/dec the per-cpu rcnt
>
> So, it does access one more cacheline and, yeap, there is one branch
> for bh enabling and several more inside local_bh_enable. I'll try to
> get some benchmark numbers for comparison.
>
Well local_bh_disable touches the preempt count too, although we
can probably assume that's hot in cache.
You might also find yours has a bigger icache footprint as well.
> I'm thinking about adding down_read(&xxx->s_umount) to write(2) and
> comparing normal rwsem and brsem performance by repeatedly writing
> short data into a file on a UP machine. Do you have better ideas?
>
To be honest I'd say that you wouldn't be able to measure it if
you're going through a regular system call path, although such
a measurement certainly won't hurt.
I don't have a better idea though. I don't think a busy-loop
microbenchmark is going to be very informative either; it might
actually give a measurable difference, but that difference
probably won't be too representative of real use :\
--
SUSE Labs, Novell Inc.
Hello, Nick.
Nick Piggin wrote:
> Tejun Heo wrote:
>
>> Other than local_bh_disable/enable(), all a brsem read op does is
>>
>> 1. access sem->idx
>> 2. calculate the per-cpu rcnt address from sem->idx
>> 3. branch once on the value of the per-cpu rcnt
>> 4. inc/dec the per-cpu rcnt
>>
>> So, it does access one more cacheline and, yeap, there is one branch
>> for bh enabling and several more inside local_bh_enable. I'll try to
>> get some benchmark numbers for comparison.
>>
>
> Well local_bh_disable touches the preempt count too, although we
> can probably assume that's hot in cache.
>
> You might also find yours has a bigger icache footprint as well.
>
>> I'm thinking about adding down_read(&xxx->s_umount) to write(2) and
>> comparing normal rwsem and brsem performance by repeatedly writing
>> short data into a file on a UP machine. Do you have better ideas?
>>
>
> To be honest I'd say that you wouldn't be able to measure it if
> you're going through a regular system call path, although such
> a measurement certainly won't hurt.
>
> I don't have a better idea though. I don't think a busy-loop
> microbenchmark is going to be very informative either; it might
> actually give a measurable difference, but that difference
> probably won't be too representative of real use :\
>
I did a busy-loop microbenchmark, and I think it's informative enough. :-)
The following command is run on three versions - the vanilla version,
one with down_read/up_read(->s_umount) added to vfs_write(), and one
with brsem_down_read/up_read(->s_umount) added to vfs_write().
# time -p dd if=/dev/zero of=out bs=32 count=10M
The test is run three times and the results are averaged.
a. vanilla
real 58.63
user 5.61
sys 52.37
b. rwsem
real 59.24
user 6.06
sys 52.29
c. brsem
real 61.74
user 5.78
sys 55.04
I don't think brsem has any chance of being faster than rwsem if it's
slower in this microbenchmark. One weird thing is that the rwsem
result consistently showed higher user time and lower system time
than the vanilla (no synchronization) case; maybe oprofiling will
tell something.
Anyways, you were absolutely right. My brsem was a pretty stupid idea
after all. Let's hope at least I learned something from it. :-(
Thanks a lot.
--
tejun
Tejun Heo wrote:
> I did a busy-loop microbenchmark, and I think it's informative enough. :-)
>
Great!
> The following command is run on three versions - the vanilla version,
> one with down_read/up_read(->s_umount) added to vfs_write(), and one
> with brsem_down_read/up_read(->s_umount) added to vfs_write().
>
> # time -p dd if=/dev/zero of=out bs=32 count=10M
>
> The test is run three times and the results are averaged.
>
> a. vanilla
>
> real 58.63
> user 5.61
> sys 52.37
>
> b. rwsem
>
> real 59.24
> user 6.06
> sys 52.29
>
> c. brsem
>
> real 61.74
> user 5.78
> sys 55.04
>
> I don't think brsem has any chance of being faster than rwsem if it's
> slower in this microbenchmark. One weird thing is that the rwsem
> result consistently showed higher user time and lower system time
> than the vanilla (no synchronization) case; maybe oprofiling will
> tell something.
>
Yep, probably just some timing or cache anomaly, generally just
sum the user and system time in the case where you are running
identical userspace code... I wouldn't worry too much about it.
> Anyways, you were absolutely right. My brsem was a pretty stupid idea
> after all. Let's hope at least I learned something from it. :-(
>
I wouldn't say stupid. The implementation was too complex, but some
clever ideas were required to solve the problems you identified.
There are definitely a couple of possible places where a brsem
may be useful, and I think the cpucontrol semaphore is one of those.
Al can probably comment on its use in the superblock. It seems
fair enough, though I think we'll want to slim down my naive
rwsem array and maybe think about using alloc_percpu.
Is the remount case much rarer than mount/unmount? Do we need to
be careful about bloating the size of the superblock on large
machines? In that case maybe a global remount brsem will be
enough to handle this race?
--
SUSE Labs, Novell Inc.
Nick Piggin wrote:
> Tejun Heo wrote:
>
> >+/*
> >+ * cpucontrol is a brsem used to synchronize cpu hotplug events.
> >+ * Invoking lock_cpu_hotplug() read-locks cpucontrol and no
> >+ * hotplugging events will occur until it's released.
> >+ *
> >+ * Unfortunately, brsem itself makes use of lock_cpu_hotplug() and
> >+ * performing brsem write-lock operations on cpucontrol deadlocks.
> >+ * This is avoided by...
> >+ *
> >+ * a. guaranteeing that cpu hotplug events won't occur during the
> >+ * write-lock operations, and
> >+ *
> >+ * b. skipping lock_cpu_hotplug() inside brsem.
> >+ *
> >+ * #a is achieved by acquiring and releasing cpucontrol_mutex outside
> >+ * cpucontrol write-lock. #b is achieved by skipping
> >+ * lock_cpu_hotplug() inside brsem if the current task is
> >+ * cpucontrol_mutex holder (is_cpu_hotplug_holder() test).
> >+ *
> >+ * Also, note that cpucontrol is first initialized with
> >+ * BRSEM_BYPASS_INITIALIZER and then initialized again with
> >+ * __create_brsem() instead of simply using create_brsem(). This is
> >+ * necessary as cpucontrol brsem gets used way before brsem subsystem
> >+ * becomes up and running.
> >+ *
> >+ * Until brsem is properly initialized, all brsem ops succeed
> >+ * unconditionally. cpucontrol becomes operational only after
> >+ * cpucontrol_init() is finished, which should be called after
> >+ * brsem_init_early().
> >+ */
>
> Mmm, this is just insane IMO.
>
> Note that I happen to also think the idea (brsems) has merit, and
> that cpucontrol may be one of the places where a sane implementation
> would actually be useful... but at least when you're introducing
> this kind of complexity anywhere, you *really* need to be able to
> back it up with numbers.
The only performance-related complaint with cpu hotplug of which I'm
aware -- that taking a cpu down on a large system can be painfully
slow -- resides in the "write side" of the code, which is not the case
that the brsem implementation optimizes. I think this patch would
make that case even worse. So I don't think it's appropriate to use a
brsem for cpu hotplug, especially without trying rwsem first.
Nathan
Nathan Lynch wrote:
>Nick Piggin wrote:
>
>>
>>Note that I happen to also think the idea (brsems) has merit, and
>>that cpucontrol may be one of the places where a sane implementation
>>would actually be useful... but at least when you're introducing
>>this kind of complexity anywhere, you *really* need to be able to
>>back it up with numbers.
>>
>
>The only performance-related complaint with cpu hotplug of which I'm
>aware -- that taking a cpu down on a large system can be painfully
>slow -- resides in the "write side" of the code, which is not the case
>that the brsem implementation optimizes. I think this patch would
>make that case even worse. So I don't think it's appropriate to use a
>brsem for cpu hotplug, especially without trying rwsem first.
>
>
I'm not sure that a brsem would make a noticeable difference.
It isn't that cpu hotplug semaphore is a performance problem
now, but that it isn't being used in as many cases as it could
be due to its unscalable nature. For example, a while back I
wanted to use it in the fork() path in the scheduler but
couldn't.
Anyway, as I said, you need to be able to back it up with
numbers ;)
Nick
Hello, Nathan & Nick.
Nick Piggin wrote:
> Nathan Lynch wrote:
>
>> Nick Piggin wrote:
>>
>>>
>>> Note that I happen to also think the idea (brsems) has merit, and
>>> that cpucontrol may be one of the places where a sane implementation
>>> would actually be useful... but at least when you're introducing
>>> this kind of complexity anywhere, you *really* need to be able to
>>> back it up with numbers.
>>>
>>
>> The only performance-related complaint with cpu hotplug of which I'm
>> aware -- that taking a cpu down on a large system can be painfully
>> slow -- resides in the "write side" of the code, which is not the case
>> that the brsem implementation optimizes. I think this patch would
>> make that case even worse. So I don't think it's appropriate to use a
>> brsem for cpu hotplug, especially without trying rwsem first.
>>
>>
Actually, a patch which converts cpucontrol to rwsem was once in -mm.
I don't know what happened to it. I can't see it in 2.6.13-rc2-mm1.
>
> I'm not sure that a brsem would make a noticeable difference.
>
> It isn't that cpu hotplug semaphore is a performance problem
> now, but that it isn't being used in as many cases as it could
> be due to its unscalable nature. For example, a while back I
> wanted to use it in the fork() path in the scheduler but
> couldn't.
I couldn't have put it better. As it currently stands, cpucontrol
doesn't have any heavy readers. That's partly because it's not
necessary, but at least in part it's because it cannot be used in
such cases as the overhead is too big. The same is true for the
super->s_umount rwsem - down_read'ing a per-fs rwsem on every
write(2) will hurt performance on SMP machines. I think we'll have
more and more of this class of synchronization problem as we add
hotplug capability to subsystems. Having spent a week implementing
something very ugly :-), I'm a bit embarrassed here, but I still want
to point out that we need something to solve this problem.
> Anyway, as I said, you need to be able to back it up with
> numbers ;)
Right, gotta benchmark before committing to a full implementation.
Thanks.
--
tejun