Message-ID: <48f7fe350808080704j4162451bk19b9987258c0d060@mail.gmail.com>
Date: Fri, 8 Aug 2008 10:04:05 -0400
From: "Ryan Hope"
To: "Peter Zijlstra"
Subject: Re: [PATCH][RESEND] scalable rw_mutex
Cc: LKML
In-Reply-To: <1218190862.8625.77.camel@twins>
References: <489B7DAD.3080104@gmail.com> <1218190862.8625.77.camel@twins>

Sorry for not CCing you, I had intended to CC you and others but I
clicked send too quickly. One of the reiser4 todos was to remove all
semaphores; I didn't realize rw_semaphores were not real semaphores.

On Fri, Aug 8, 2008 at 6:21 AM, Peter Zijlstra wrote:
> On Thu, 2008-08-07 at 18:56 -0400, Ryan Hope wrote:
>> This was posted sometime last year I think and it never got merged. Can this get
>> a go-around in -mm? It would help in converting the semaphores in reiser4 to
>> mutexes.
>
> Thanks for CC'ing me :-/
>
> I dropped it because it's only more scalable up to around 4 CPUs.
>
> Also, how would it help reiser4? Using rwsems is perfectly fine - as
> they aren't actual semaphores.
>
>> diff --git a/include/linux/rwmutex.h b/include/linux/rwmutex.h
>> new file mode 100644
>> index 0000000..39ec857
>> --- /dev/null
>> +++ b/include/linux/rwmutex.h
>> @@ -0,0 +1,83 @@
>> +/*
>> + * Scalable reader/writer lock.
>> + *
>> + * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
>> + *
>> + * This file contains the public data structure and API definitions.
>> + */
>> +#ifndef _LINUX_RWMUTEX_H
>> +#define _LINUX_RWMUTEX_H
>> +
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +
>> +struct rw_mutex {
>> +        /* Read mostly global */
>> +        struct percpu_counter readers;
>> +        unsigned int status;
>> +
>> +        /* The following variables are only for the slowpath */
>> +        struct mutex read_mutex;        /* r -> w waiting */
>> +        struct mutex write_mutex;       /* w -> w waiting */
>> +        struct task_struct *waiter;     /* w -> r waiting */
>> +        atomic_t read_waiters;
>> +
>> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>> +        struct lockdep_map dep_map;
>> +#endif
>> +};
>> +
>> +void __rw_mutex_init(struct rw_mutex *rw_mutex, const char *name,
>> +                     struct lock_class_key *key);
>> +void rw_mutex_destroy(struct rw_mutex *rw_mutex);
>> +
>> +#define rw_mutex_init(rw_mutex)                                 \
>> +        do {                                                    \
>> +                static struct lock_class_key __key;             \
>> +                __rw_mutex_init((rw_mutex), #rw_mutex, &__key);  \
>> +        } while (0)
>> +
>> +void rw_mutex_read_lock_slow(struct rw_mutex *rw_mutex);
>> +
>> +void rw_mutex_write_lock_nested(struct rw_mutex *rw_mutex, int subclass);
>> +void rw_mutex_write_unlock(struct rw_mutex *rw_mutex);
>> +
>> +int __rw_mutex_read_trylock(struct rw_mutex *rw_mutex);
>> +
>> +static inline int rw_mutex_read_trylock(struct rw_mutex *rw_mutex)
>> +{
>> +        int ret = __rw_mutex_read_trylock(rw_mutex);
>> +        if (ret)
>> +                rwsem_acquire_read(&rw_mutex->dep_map, 0, 1, _RET_IP_);
>> +        return ret;
>> +}
>> +
>> +static inline void rw_mutex_read_lock(struct rw_mutex *rw_mutex)
>> +{
>> +        int ret;
>> +
>> +        might_sleep();
>> +        rwsem_acquire_read(&rw_mutex->dep_map, 0, 0, _RET_IP_);
>> +
>> +        ret = __rw_mutex_read_trylock(rw_mutex);
>> +        if (!ret)
>> +                rw_mutex_read_lock_slow(rw_mutex);
>> +}
>> +
>> +void rw_mutex_read_unlock(struct rw_mutex *rw_mutex);
>> +
>> +static inline int rw_mutex_is_locked(struct rw_mutex *rw_mutex)
>> +{
>> +        return mutex_is_locked(&rw_mutex->write_mutex);
>> +}
>> +
>> +static inline void rw_mutex_write_lock(struct rw_mutex *rw_mutex)
>> +{
>> +        rw_mutex_write_lock_nested(rw_mutex, 0);
>> +}
>> +
>> +#endif /* _LINUX_RWMUTEX_H */
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index dd58bdc..8277ef5 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
>> 	    rcupdate.o extable.o params.o posix-timers.o \
>> 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>> 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>> -	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o
>> +	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o rwmutex.o
>>
>> CFLAGS_REMOVE_sched.o = -mno-spe
>>
>> diff --git a/kernel/rwmutex.c b/kernel/rwmutex.c
>> new file mode 100644
>> index 0000000..2b82d11
>> --- /dev/null
>> +++ b/kernel/rwmutex.c
>> @@ -0,0 +1,256 @@
>> +/*
>> + * Scalable reader/writer lock.
>> + *
>> + * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
>> + *
>> + * It's scalable in that the read count is a percpu counter and the reader fast
>> + * path does not write to a shared cache-line.
>> + *
>> + * It's not FIFO fair, but starvation proof by alternating readers and writers.
>> + */
>> +#include
>> +#include
>> +#include
>> +#include
>> +
>> +/*
>> + * rw mutex - an oxymoron when we take mutex to stand for 'MUTual EXclusion'
>> + *
>> + * However in this context we take mutex to mean a sleeping lock, with the
>> + * property that it must be released by the same context that acquired it.
>> + *
>> + * design goals:
>> + *
>> + * A sleeping reader/writer lock with a scalable read side, to avoid bouncing
>> + * cache-lines.
>> + *
>> + * dynamics:
>> + *
>> + * The reader fast path is a modification of a percpu_counter and a read of a
>> + * shared cache-line.
>> + *
>> + * The write side is quite heavy; it takes two mutexes, a writer mutex and a
>> + * readers mutex. The writer mutex is for w <-> w interaction, the read mutex
>> + * for r -> w. The read side is forced into the slow path by setting the
>> + * status bit. Then it waits for all current readers to disappear.
>> + *
>> + * The read lock slow path, taken when the status bit is set, takes the read
>> + * mutex. Because the write side also takes this mutex, the new readers are
>> + * blocked. The read unlock slow path tickles the writer every time a read
>> + * lock is released.
>> + *
>> + * Write unlock clears the status bit, and drops the read mutex, allowing new
>> + * readers. It then waits for at least one waiting reader to get a lock (if
>> + * there were any readers waiting) before releasing the write mutex, which will
>> + * allow possible other writers to come in and stop new readers, thus avoiding
>> + * starvation by alternating between readers and writers.
>> + *
>> + * considerations:
>> + *
>> + * The lock's space footprint is quite large (on x86_64):
>> + *
>> + *     96 bytes [struct rw_mutex]
>> + *      8 bytes per cpu NR_CPUS [void *]
>> + *     32 bytes per cpu (NR_CPUS ?= cpu_possible_map ?= nr_cpu_ids)
>> + *              [smallest slab]
>> + *
>> + *   1376 bytes for x86_64 defconfig (NR_CPUS = 32)
>> + */
>> +
>> +#define RW_MUTEX_READER_FAST   0
>> +#define RW_MUTEX_READER_SLOW   1
>> +
>> +void __rw_mutex_init(struct rw_mutex *rw_mutex, const char *name,
>> +                     struct lock_class_key *key)
>> +{
>> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>> +        debug_check_no_locks_freed((void *)rw_mutex, sizeof(*rw_mutex));
>> +        lockdep_init_map(&rw_mutex->dep_map, name, key, 0);
>> +#endif
>> +
>> +        percpu_counter_init(&rw_mutex->readers, 0);
>> +        rw_mutex->status = RW_MUTEX_READER_FAST;
>> +        mutex_init(&rw_mutex->read_mutex);
>> +        mutex_init(&rw_mutex->write_mutex);
>> +        rw_mutex->waiter = NULL;
>> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>> +        printk("rw_mutex size: %zu\n", sizeof(struct rw_mutex));
>> +#endif
>> +}
>> +EXPORT_SYMBOL(__rw_mutex_init);
>> +
>> +void rw_mutex_destroy(struct rw_mutex *rw_mutex)
>> +{
>> +        percpu_counter_destroy(&rw_mutex->readers);
>> +        mutex_destroy(&rw_mutex->read_mutex);
>> +        mutex_destroy(&rw_mutex->write_mutex);
>> +}
>> +EXPORT_SYMBOL(rw_mutex_destroy);
>> +
>> +static inline void rw_mutex_readers_inc(struct rw_mutex *rw_mutex)
>> +{
>> +        percpu_counter_inc(&rw_mutex->readers);
>> +        smp_wmb();
>> +}
>> +
>> +static inline void rw_mutex_readers_dec(struct rw_mutex *rw_mutex)
>> +{
>> +        percpu_counter_dec(&rw_mutex->readers);
>> +        smp_wmb();
>> +}
>> +
>> +static inline long rw_mutex_readers(struct rw_mutex *rw_mutex)
>> +{
>> +        smp_rmb();
>> +        return percpu_counter_sum(&rw_mutex->readers);
>> +}
>> +
>> +#define rw_mutex_writer_wait(rw_mutex, condition)                \
>> +do {                                                             \
>> +        struct task_struct *tsk = current;                       \
>> +                                                                  \
>> +        BUG_ON((rw_mutex)->waiter);                               \
>> +        set_task_state(tsk, TASK_UNINTERRUPTIBLE);                \
>> +        get_task_struct(tsk);                                     \
>> +        (rw_mutex)->waiter = tsk;                                 \
>> +        smp_wmb();                                                \
>> +        while (!(condition)) {                                    \
>> +                schedule();                                       \
>> +                set_task_state(tsk, TASK_UNINTERRUPTIBLE);        \
>> +        }                                                         \
>> +        tsk->state = TASK_RUNNING;                                \
>> +        (rw_mutex)->waiter = NULL;                                \
>> +        put_task_struct(tsk);                                     \
>> +} while (0)
>> +
>> +static inline void rw_mutex_writer_wake(struct rw_mutex *rw_mutex)
>> +{
>> +        struct task_struct *tsk;
>> +
>> +        smp_rmb();
>> +        tsk = rw_mutex->waiter;
>> +        if (tsk)
>> +                wake_up_process(tsk);
>> +}
>> +
>> +void rw_mutex_read_lock_slow(struct rw_mutex *rw_mutex)
>> +{
>> +        /*
>> +         * read lock slow path;
>> +         * count the number of readers waiting on the read_mutex
>> +         */
>> +        atomic_inc(&rw_mutex->read_waiters);
>> +        mutex_lock(&rw_mutex->read_mutex);
>> +
>> +        /*
>> +         * rw_mutex->status is only set while the read_mutex is held
>> +         * so by serialising on this lock, we're sure it's free.
>> +         */
>> +        BUG_ON(rw_mutex->status);
>> +
>> +        rw_mutex_readers_inc(rw_mutex);
>> +
>> +        /*
>> +         * wake up a possible write unlock; waiting for at least a single
>> +         * reader to pass before letting a new writer through.
>> +         */
>> +        atomic_dec(&rw_mutex->read_waiters);
>> +        rw_mutex_writer_wake(rw_mutex);
>> +        mutex_unlock(&rw_mutex->read_mutex);
>> +}
>> +EXPORT_SYMBOL(rw_mutex_read_lock_slow);
>> +
>> +static inline
>> +void rw_mutex_status_set(struct rw_mutex *rw_mutex, unsigned int status)
>> +{
>> +        rw_mutex->status = status;
>> +        /*
>> +         * allow new readers to see this change in status
>> +         */
>> +        smp_wmb();
>> +}
>> +
>> +static inline unsigned int rw_mutex_reader_slow(struct rw_mutex *rw_mutex)
>> +{
>> +        /*
>> +         * match rw_mutex_status_set()
>> +         */
>> +        smp_rmb();
>> +        return rw_mutex->status;
>> +}
>> +
>> +int __rw_mutex_read_trylock(struct rw_mutex *rw_mutex)
>> +{
>> +        rw_mutex_readers_inc(rw_mutex);
>> +        if (unlikely(rw_mutex_reader_slow(rw_mutex))) {
>> +                rw_mutex_readers_dec(rw_mutex);
>> +                /*
>> +                 * possibly wake up a writer waiting for this reference to
>> +                 * disappear
>> +                 */
>> +                rw_mutex_writer_wake(rw_mutex);
>> +                return 0;
>> +        }
>> +        return 1;
>> +}
>> +EXPORT_SYMBOL(__rw_mutex_read_trylock);
>> +
>> +void rw_mutex_read_unlock(struct rw_mutex *rw_mutex)
>> +{
>> +        rwsem_release(&rw_mutex->dep_map, 1, _RET_IP_);
>> +
>> +        rw_mutex_readers_dec(rw_mutex);
>> +        /*
>> +         * on the slow path;
>> +         * nudge the writer waiting for the last reader to go away
>> +         */
>> +        if (unlikely(rw_mutex_reader_slow(rw_mutex)))
>> +                rw_mutex_writer_wake(rw_mutex);
>> +}
>> +EXPORT_SYMBOL(rw_mutex_read_unlock);
>> +
>> +void rw_mutex_write_lock_nested(struct rw_mutex *rw_mutex, int subclass)
>> +{
>> +        might_sleep();
>> +        rwsem_acquire(&rw_mutex->dep_map, subclass, 0, _RET_IP_);
>> +
>> +        mutex_lock_nested(&rw_mutex->write_mutex, subclass);
>> +
>> +        /*
>> +         * block new readers
>> +         */
>> +        mutex_lock_nested(&rw_mutex->read_mutex, subclass);
>> +        rw_mutex_status_set(rw_mutex, RW_MUTEX_READER_SLOW);
>> +        /*
>> +         * and wait for all current readers to go away
>> +         */
>> +        rw_mutex_writer_wait(rw_mutex, (rw_mutex_readers(rw_mutex) == 0));
>> +}
>> +EXPORT_SYMBOL(rw_mutex_write_lock_nested);
>> +
>> +void rw_mutex_write_unlock(struct rw_mutex *rw_mutex)
>> +{
>> +        int waiters;
>> +
>> +        might_sleep();
>> +        rwsem_release(&rw_mutex->dep_map, 1, _RET_IP_);
>> +
>> +        /*
>> +         * let the readers rip
>> +         */
>> +        rw_mutex_status_set(rw_mutex, RW_MUTEX_READER_FAST);
>> +        waiters = atomic_read(&rw_mutex->read_waiters);
>> +        mutex_unlock(&rw_mutex->read_mutex);
>> +        /*
>> +         * wait for at least 1 reader to get through
>> +         */
>> +        if (waiters) {
>> +                rw_mutex_writer_wait(rw_mutex,
>> +                        (atomic_read(&rw_mutex->read_waiters) < waiters));
>> +        }
>> +        /*
>> +         * before we let the writers rip
>> +         */
>> +        mutex_unlock(&rw_mutex->write_mutex);
>> +}
>> +EXPORT_SYMBOL(rw_mutex_write_unlock);
"unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/