Date: Thu, 17 May 2012 08:18:38 -0700
From: "Paul E. McKenney"
To: Steven Rostedt
Cc: LKML, RT, Thomas Gleixner, Clark Williams, Peter Zijlstra
Subject: Re: [RFC][PATCH RT] rwsem_rt: Another (more sane) approach to multi reader rt locks
Message-ID: <20120517151838.GA8692@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <1337090625.14207.304.camel@gandalf.stny.rr.com>
In-Reply-To: <1337090625.14207.304.camel@gandalf.stny.rr.com>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Tue, May 15, 2012 at 10:03:45AM -0400, Steven Rostedt wrote:
> The RT patch has been having lots of trouble lately with large machines
> and applications running lots of threads. This usually boils down to a
> bottleneck on a single lock: the mm->mmap_sem.

Some researchers at MIT RCU-ified this lock:

	http://people.csail.mit.edu/nickolai/papers/clements-bonsai.pdf

They have the patches in a git tree that can be found here:

	http://pdos.csail.mit.edu/mosbench/

It looks like some work is required to get this mainline-ready -- for one
thing, it is based on an old kernel.

							Thanx, Paul

> The mmap_sem is a rwsem, which can sleep. It can be taken either for
> read or for write: a read lock can be held by several tasks at the
> same time, while the write lock can be held by only a single task.
>
> But due to priority inheritance, having multiple readers makes the code
> much more complex, so the -rt patch converts all rwsems into a single
> mutex, where readers may nest (the same task may grab the same rwsem for
> read multiple times), but only one task may hold the rwsem at any given
> time (for read or write).
>
> When we have lots of threads, the rwsem may be taken often, either for
> memory allocation or for filling in page faults. This becomes a bottleneck
> for threads, as only one thread at a time may grab the mmap_sem (which is
> shared by all threads of a process).
>
> Previous attempts at adding multiple readers became too complex and were
> error prone. This approach uses a much simpler technique, one that is
> actually used by per-cpu locks.
>
> The idea here is to have an rwsem create an rt_mutex for each CPU.
> Actually, it creates a rwsem for each CPU that can only be acquired by
> one task at a time. This allows readers on separate CPUs to take
> only their per-cpu lock. When a writer needs to take the lock, it must
> grab all CPU locks before continuing.
>
> This approach does nothing special with the rt_mutex or priority
> inheritance code. That stays the same, and works normally (and is thus
> less error prone). The trick here is that when a reader takes a rwsem
> for read, it must disable migration; that way it can unlock the rwsem
> without needing any special searches (which lock did it take?).
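The scheme described above can be sketched in user space, independent of the
actual patch further down. This is only an illustrative sketch under stated
assumptions: pthread mutexes stand in for rt_mutex, a caller-supplied cpu
index stands in for migrate_disable()/smp_processor_id(), and the names
(percpu_rwsem_sketch, sketch_down_read(), and so on) are made up here and are
not part of the patch. Readers take only their own CPU's mutex; a writer must
take every mutex, which is what serializes it against all readers.

#include <pthread.h>

#define SKETCH_NR_CPUS	4			/* stand-in for NR_CPUS */

struct percpu_rwsem_sketch {
	pthread_mutex_t lock[SKETCH_NR_CPUS];	/* one lock per "CPU" */
};

static void sketch_init(struct percpu_rwsem_sketch *s)
{
	int i;

	for (i = 0; i < SKETCH_NR_CPUS; i++)
		pthread_mutex_init(&s->lock[i], NULL);
}

/* Reader: takes only the lock of the CPU it is (notionally pinned) on. */
static void sketch_down_read(struct percpu_rwsem_sketch *s, int cpu)
{
	/* the kernel patch does migrate_disable() + smp_processor_id() here */
	pthread_mutex_lock(&s->lock[cpu]);
}

static void sketch_up_read(struct percpu_rwsem_sketch *s, int cpu)
{
	pthread_mutex_unlock(&s->lock[cpu]);
	/* migrate_enable() in the real code */
}

/* Writer: must take every CPU's lock, excluding all readers. */
static void sketch_down_write(struct percpu_rwsem_sketch *s)
{
	int i;

	for (i = 0; i < SKETCH_NR_CPUS; i++)
		pthread_mutex_lock(&s->lock[i]);
}

static void sketch_up_write(struct percpu_rwsem_sketch *s)
{
	int i;

	for (i = 0; i < SKETCH_NR_CPUS; i++)
		pthread_mutex_unlock(&s->lock[i]);
}

Even in this toy form the tradeoff is visible: uncontended readers on
different CPUs never touch a shared lock, while a write acquisition costs one
lock operation per possible CPU. Read nesting (read_depth) and priority
inheritance are not modeled here.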
> I've tested this a bit, and so far it works well. I haven't found a nice
> way to initialize the locks, so I'm using the silly initialize_rwsem()
> at all places that acquire the lock. But we can work on this later.
>
> Also, I don't use per_cpu sections for the locks, which means we have
> cache line collisions, but a normal (mainline) rwsem has that as well.
>
> These are all areas for improvement (and why this is just an RFC patch).
>
> I'll see if I can get some numbers to see how this fixes the issues with
> multiple threads on big boxes.
>
> Thoughts?
>
> -- Steve
>
> Not-yet-signed-off-by: Steven Rostedt
>
> diff --git a/include/linux/rwsem_rt.h b/include/linux/rwsem_rt.h
> index 802c690..cd0c812 100644
> --- a/include/linux/rwsem_rt.h
> +++ b/include/linux/rwsem_rt.h
> @@ -18,7 +18,7 @@
>
>  #include <linux/rtmutex.h>
>
> -struct rw_semaphore {
> +struct __rw_semaphore {
>  	struct rt_mutex		lock;
>  	int			read_depth;
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> @@ -26,22 +26,40 @@ struct rw_semaphore {
>  #endif
>  };
>
> +struct rw_semaphore {
> +	int			initialized;
> +	struct __rw_semaphore	lock[NR_CPUS];
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +	const char		*name;
> +	struct lock_class_key	__key[NR_CPUS];
> +#endif
> +};
> +
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +#define __RWSEM_INITIALIZER(_name)	\
> +	{ .name = _name }
> +#else
>  #define __RWSEM_INITIALIZER(name)	\
> -	{ .lock = __RT_MUTEX_INITIALIZER(name.lock), \
> -	  RW_DEP_MAP_INIT(name) }
> +	{ }
> +
> +#endif
>
>  #define DECLARE_RWSEM(lockname) \
>  	struct rw_semaphore lockname = __RWSEM_INITIALIZER(lockname)
>
> -extern void __rt_rwsem_init(struct rw_semaphore *rwsem, char *name,
> +extern void __rt_rwsem_init(struct __rw_semaphore *rwsem, char *name,
>  			    struct lock_class_key *key);
>
> -# define rt_init_rwsem(sem)					\
> -do {								\
> -	static struct lock_class_key __key;			\
> -								\
> -	rt_mutex_init(&(sem)->lock);				\
> -	__rt_rwsem_init((sem), #sem, &__key);			\
> +# define rt_init_rwsem(sem)					\
> +do {								\
> +	static struct lock_class_key __key[NR_CPUS];		\
> +	int ____i;						\
> +								\
> +	for (____i = 0; ____i < NR_CPUS; ____i++) {		\
> +		rt_mutex_init(&((sem)->lock[____i]).lock);	\
> +		__rt_rwsem_init(&((sem)->lock[____i]), #sem, &__key[____i]); \
> +	}							\
> +	(sem)->initialized = 1;					\
>  } while (0)
>
>  extern void rt_down_write(struct rw_semaphore *rwsem);
> @@ -55,7 +73,11 @@ extern void rt_up_write(struct rw_semaphore *rwsem);
>  extern void rt_downgrade_write(struct rw_semaphore *rwsem);
>
>  #define init_rwsem(sem)		rt_init_rwsem(sem)
> -#define rwsem_is_locked(s)	rt_mutex_is_locked(&(s)->lock)
> +/*
> + * Use raw_smp_processor_id(), as read locks use migrate disable,
> + * and write locks lock all of them (we don't care which one we test).
> + */
> +#define rwsem_is_locked(s)	rt_mutex_is_locked(&(s)->lock[raw_smp_processor_id()].lock)
>
>  static inline void down_read(struct rw_semaphore *sem)
>  {
> diff --git a/kernel/rt.c b/kernel/rt.c
> index 092d6b3..f8dab27 100644
> --- a/kernel/rt.c
> +++ b/kernel/rt.c
> @@ -306,18 +306,52 @@ EXPORT_SYMBOL(__rt_rwlock_init);
>   * rw_semaphores
>   */
>
> +static void __initialize_rwsem(struct rw_semaphore *rwsem)
> +{
> +	int i;
> +
> +	/* TODO add spinlock here? */
> +	rwsem->initialized = 1;
> +
> +	for (i = 0; i < NR_CPUS; i++) {
> +		rt_mutex_init(&rwsem->lock[i].lock);
> +		__rt_rwsem_init(&rwsem->lock[i],
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +				rwsem->name, &rwsem->__key[i]
> +#else
> +				"", 0
> +#endif
> +				);
> +	}
> +}
> +
> +#define initialize_rwsem(rwsem)				\
> +	do {						\
> +		if (unlikely(!rwsem->initialized))	\
> +			__initialize_rwsem(rwsem);	\
> +	} while (0)
> +
>  void rt_up_write(struct rw_semaphore *rwsem)
>  {
> -	rwsem_release(&rwsem->dep_map, 1, _RET_IP_);
> -	rt_mutex_unlock(&rwsem->lock);
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		rwsem_release(&rwsem->lock[i].dep_map, 1, _RET_IP_);
> +		rt_mutex_unlock(&rwsem->lock[i].lock);
> +	}
>  }
>  EXPORT_SYMBOL(rt_up_write);
>
>  void rt_up_read(struct rw_semaphore *rwsem)
>  {
> -	rwsem_release(&rwsem->dep_map, 1, _RET_IP_);
> -	if (--rwsem->read_depth == 0)
> -		rt_mutex_unlock(&rwsem->lock);
> +	int cpu;
> +
> +	cpu = smp_processor_id();
> +	rwsem_release(&rwsem->lock[cpu].dep_map, 1, _RET_IP_);
> +	if (--rwsem->lock[cpu].read_depth == 0) {
> +		rt_mutex_unlock(&rwsem->lock[cpu].lock);
> +		migrate_enable();
> +	}
>  }
>  EXPORT_SYMBOL(rt_up_read);
>
> @@ -327,67 +361,112 @@ EXPORT_SYMBOL(rt_up_read);
>   */
>  void rt_downgrade_write(struct rw_semaphore *rwsem)
>  {
> -	BUG_ON(rt_mutex_owner(&rwsem->lock) != current);
> -	rwsem->read_depth = 1;
> +	int cpu;
> +	int i;
> +
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	for_each_possible_cpu(i) {
> +		if (cpu == i) {
> +			BUG_ON(rt_mutex_owner(&rwsem->lock[i].lock) != current);
> +			rwsem->lock[i].read_depth = 1;
> +		} else {
> +			rwsem_release(&rwsem->lock[i].dep_map, 1, _RET_IP_);
> +			rt_mutex_unlock(&rwsem->lock[i].lock);
> +		}
> +	}
>  }
>  EXPORT_SYMBOL(rt_downgrade_write);
>
>  int rt_down_write_trylock(struct rw_semaphore *rwsem)
>  {
> -	int ret = rt_mutex_trylock(&rwsem->lock);
> +	int ret;
> +	int i;
>
> -	if (ret)
> -		rwsem_acquire(&rwsem->dep_map, 0, 1, _RET_IP_);
> -	return ret;
> +	initialize_rwsem(rwsem);
> +
> +	for_each_possible_cpu(i) {
> +		ret = rt_mutex_trylock(&rwsem->lock[i].lock);
> +		if (ret)
> +			rwsem_acquire(&rwsem->lock[i].dep_map, 0, 1, _RET_IP_);
> +		else
> +			goto release;
> +	}
> +	return 1;
> + release:
> +	while (--i >= 0) {
> +		rwsem_release(&rwsem->lock[i].dep_map, 1, _RET_IP_);
> +		rt_mutex_unlock(&rwsem->lock[i].lock);
> +	}
> +	return 0;
>  }
>  EXPORT_SYMBOL(rt_down_write_trylock);
>
>  void rt_down_write(struct rw_semaphore *rwsem)
>  {
> -	rwsem_acquire(&rwsem->dep_map, 0, 0, _RET_IP_);
> -	rt_mutex_lock(&rwsem->lock);
> +	int i;
> +	initialize_rwsem(rwsem);
> +	for_each_possible_cpu(i) {
> +		rwsem_acquire(&rwsem->lock[i].dep_map, 0, 0, _RET_IP_);
> +		rt_mutex_lock(&rwsem->lock[i].lock);
> +	}
>  }
>  EXPORT_SYMBOL(rt_down_write);
>
>  void rt_down_write_nested(struct rw_semaphore *rwsem, int subclass)
>  {
> -	rwsem_acquire(&rwsem->dep_map, subclass, 0, _RET_IP_);
> -	rt_mutex_lock(&rwsem->lock);
> +	int i;
> +
> +	initialize_rwsem(rwsem);
> +	for_each_possible_cpu(i) {
> +		rwsem_acquire(&rwsem->lock[i].dep_map, subclass, 0, _RET_IP_);
> +		rt_mutex_lock(&rwsem->lock[i].lock);
> +	}
>  }
>  EXPORT_SYMBOL(rt_down_write_nested);
>
>  int rt_down_read_trylock(struct rw_semaphore *rwsem)
>  {
> -	struct rt_mutex *lock = &rwsem->lock;
> +	struct rt_mutex *lock;
>  	int ret = 1;
> +	int cpu;
>
> +	initialize_rwsem(rwsem);
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	lock = &rwsem->lock[cpu].lock;
>  	/*
>  	 * recursive read locks succeed when current owns the rwsem,
>  	 * but not when read_depth == 0 which means that the rwsem is
>  	 * write locked.
>  	 */
>  	if (rt_mutex_owner(lock) != current)
> -		ret = rt_mutex_trylock(&rwsem->lock);
> -	else if (!rwsem->read_depth)
> +		ret = rt_mutex_trylock(lock);
> +	else if (!rwsem->lock[cpu].read_depth)
>  		ret = 0;
>
>  	if (ret) {
> -		rwsem->read_depth++;
> -		rwsem_acquire(&rwsem->dep_map, 0, 1, _RET_IP_);
> -	}
> +		rwsem->lock[cpu].read_depth++;
> +		rwsem_acquire(&rwsem->lock[cpu].dep_map, 0, 1, _RET_IP_);
> +	} else
> +		migrate_enable();
>  	return ret;
>  }
>  EXPORT_SYMBOL(rt_down_read_trylock);
>
>  static void __rt_down_read(struct rw_semaphore *rwsem, int subclass)
>  {
> -	struct rt_mutex *lock = &rwsem->lock;
> +	struct rt_mutex *lock;
> +	int cpu;
>
> -	rwsem_acquire_read(&rwsem->dep_map, subclass, 0, _RET_IP_);
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	lock = &rwsem->lock[cpu].lock;
> +	rwsem_acquire_read(&rwsem->lock[cpu].dep_map, subclass, 0, _RET_IP_);
>
>  	if (rt_mutex_owner(lock) != current)
> -		rt_mutex_lock(&rwsem->lock);
> -	rwsem->read_depth++;
> +		rt_mutex_lock(lock);
> +	rwsem->lock[cpu].read_depth++;
>  }
>
>  void rt_down_read(struct rw_semaphore *rwsem)
> @@ -402,7 +481,7 @@ void rt_down_read_nested(struct rw_semaphore *rwsem, int subclass)
>  }
>  EXPORT_SYMBOL(rt_down_read_nested);
>
> -void __rt_rwsem_init(struct rw_semaphore *rwsem, char *name,
> +void __rt_rwsem_init(struct __rw_semaphore *rwsem, char *name,
>  		     struct lock_class_key *key)
>  {
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC