Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S265290AbUFHTFc (ORCPT ); Tue, 8 Jun 2004 15:05:32 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S265292AbUFHTFc (ORCPT ); Tue, 8 Jun 2004 15:05:32 -0400 Received: from mx1.redhat.com ([66.187.233.31]:28085 "EHLO mx1.redhat.com") by vger.kernel.org with ESMTP id S265290AbUFHTFI (ORCPT ); Tue, 8 Jun 2004 15:05:08 -0400 From: David Howells In-Reply-To: <20040608093111.01a910e9.akpm@osdl.org> References: <20040608093111.01a910e9.akpm@osdl.org> <20040608154438.GK18083@dualathlon.random> To: Andrew Morton Cc: Andrea Arcangeli , torvalds@osdl.org, linux-kernel@vger.kernel.org Subject: Re: downgrade_write replacement in remap_file_pages User-Agent: EMH/1.14.1 SEMI/1.14.5 (Awara-Onsen) FLIM/1.14.5 (Demachiyanagi) APEL/10.6 Emacs/21.3 (i386-redhat-linux-gnu) MULE/5.0 (SAKAKI) MIME-Version: 1.0 (generated by SEMI 1.14.5 - "Awara-Onsen") Content-Type: multipart/mixed; boundary="Multipart_Tue_Jun__8_20:04:53_2004-1" Date: Tue, 08 Jun 2004 20:04:53 +0100 Message-ID: <2303.1086721493@redhat.com> Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5867 Lines: 154 --Multipart_Tue_Jun__8_20:04:53_2004-1 Content-Type: text/plain; charset=US-ASCII > > So ho-hum. As a first step, David, could you please take a look into > what's up with downgrade_write()? Yes. Patch attached. The critical bit is in the hunk beginning at line 95. It appears my rwsem testsuite didn't have a downgrade write test. I added one and the bug became obvious. > (Then again, we need to have a serious think about the overflow bug. It's > fatal. Should we fix it? If so, the current rwsem implementation is > probably unsalvageable). The optimised rwsem implementation is probably okay, provided you don't have a 32-bit machine with silly amounts of memory, for which the 32767 process limit is probably reasonable. Maybe make it contingent on CONFIG_HIGHMEM on X86? If you don't want to use an optimised version, there _is_ a spinlock version as well. For 64-bit archs like x86_64 and ppc64 this algorithm can easily made to use a 64-bit counter (lib/rwsem.c wants a signed long counter), so it's just a matter of using XADDQ, LDARX/STDCX or equivalents (I have patches for those two). I've also got MIPS64 patches somewhere too, though they're horribly out of date. Alpha and S390 already do the 64-bit counter thing. And if sparc64 has a 64-bit CAS, that can be used. IA64's rwsems should be 64-bitable too. Using a 64-bit counter gives you a limit of 2 billion processes. David --Multipart_Tue_Jun__8_20:04:53_2004-1 Content-Type: text/plain; charset=US-ASCII Signed-Off-By: David Howells D: Stop downgrade_write() from under-adjusting the rwsem counter in optimised D: rw-semaphores. diff -urp linux-2.6.6/lib/rwsem.c linux-2.6.6-keys/lib/rwsem.c --- linux-2.6.6/lib/rwsem.c 2004-05-11 11:28:57.000000000 +0100 +++ linux-2.6.6-keys/lib/rwsem.c 2004-06-08 19:31:35.550653817 +0100 @@ -29,15 +29,15 @@ void rwsemtrace(struct rw_semaphore *sem /* * handle the lock being released whilst there are processes blocked on it that can now run - * - if we come here, then: - * - the 'active part' of the count (&0x0000ffff) reached zero but has been re-incremented + * - if we come here from up_xxxx(), then: + * - the 'active part' of the count (&0x0000ffff) had reached zero (but may have changed) * - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so) * - there must be someone on the queue * - the spinlock must be held by the caller * - woken process blocks are discarded from the list after having task zeroed - * - writers are only woken if wakewrite is non-zero + * - writers are only woken if downgrading is false */ -static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite) +static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int downgrading) { struct rwsem_waiter *waiter; struct task_struct *tsk; @@ -46,10 +46,12 @@ static inline struct rw_semaphore *__rws rwsemtrace(sem,"Entering __rwsem_do_wake"); - if (!wakewrite) + if (downgrading) goto dont_wake_writers; - /* only wake someone up if we can transition the active part of the count from 0 -> 1 */ + /* if we came through an up_xxxx() call, we only only wake someone up + * if we can transition the active part of the count from 0 -> 1 + */ try_again: oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS,sem) - RWSEM_ACTIVE_BIAS; if (oldcount & RWSEM_ACTIVE_MASK) @@ -78,9 +80,10 @@ static inline struct rw_semaphore *__rws if (waiter->flags & RWSEM_WAITING_FOR_WRITE) goto out; - /* grant an infinite number of read locks to the readers at the front of the queue - * - note we increment the 'active part' of the count by the number of readers (less one - * for the activity decrement we've already done) before waking any processes up + /* grant an infinite number of read locks to the readers at the front + * of the queue + * - note we increment the 'active part' of the count by the number of + * readers before waking any processes up */ readers_only: woken = 0; @@ -95,8 +98,10 @@ static inline struct rw_semaphore *__rws } while (waiter->flags & RWSEM_WAITING_FOR_READ); loop = woken; - woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS; - woken -= RWSEM_ACTIVE_BIAS; + woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS; + if (!downgrading) + woken -= RWSEM_ACTIVE_BIAS; /* we'd already done one increment + * earlier */ rwsem_atomic_add(woken,sem); next = sem->wait_list.next; @@ -150,7 +155,7 @@ static inline struct rw_semaphore *rwsem * - it might even be this process, since the waker takes a more active part */ if (!(count & RWSEM_ACTIVE_MASK)) - sem = __rwsem_do_wake(sem,1); + sem = __rwsem_do_wake(sem, 0); spin_unlock(&sem->wait_lock); @@ -201,7 +206,7 @@ struct rw_semaphore fastcall __sched *rw /* * handle waking up a waiter on the semaphore - * - up_read has decremented the active part of the count if we come here + * - up_read/up_write has decremented the active part of the count if we come here */ struct rw_semaphore fastcall *rwsem_wake(struct rw_semaphore *sem) { @@ -211,7 +216,7 @@ struct rw_semaphore fastcall *rwsem_wake /* do nothing if list empty */ if (!list_empty(&sem->wait_list)) - sem = __rwsem_do_wake(sem,1); + sem = __rwsem_do_wake(sem, 0); spin_unlock(&sem->wait_lock); @@ -233,7 +238,7 @@ struct rw_semaphore fastcall *rwsem_down /* do nothing if list empty */ if (!list_empty(&sem->wait_list)) - sem = __rwsem_do_wake(sem,0); + sem = __rwsem_do_wake(sem, 1); spin_unlock(&sem->wait_lock); --Multipart_Tue_Jun__8_20:04:53_2004-1-- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/