From: David Howells <dhowells@redhat.com>
In-Reply-To: <20040608093111.01a910e9.akpm@osdl.org> 
References: <20040608093111.01a910e9.akpm@osdl.org>  <20040608154438.GK18083@dualathlon.random> 
To: Andrew Morton <akpm@osdl.org>
Cc: Andrea Arcangeli <andrea@suse.de>, torvalds@osdl.org,
       linux-kernel@vger.kernel.org
Subject: Re: downgrade_write replacement in remap_file_pages 
User-Agent: EMH/1.14.1 SEMI/1.14.5 (Awara-Onsen) FLIM/1.14.5 (Demachiyanagi) APEL/10.6 Emacs/21.3 (i386-redhat-linux-gnu) MULE/5.0 (SAKAKI)
MIME-Version: 1.0 (generated by SEMI 1.14.5 - "Awara-Onsen")
Content-Type: multipart/mixed;
 boundary="Multipart_Tue_Jun__8_20:04:53_2004-1"
Date: Tue, 08 Jun 2004 20:04:53 +0100
Message-ID: <2303.1086721493@redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5867
Lines: 154

--Multipart_Tue_Jun__8_20:04:53_2004-1
Content-Type: text/plain; charset=US-ASCII

> 
> So ho-hum.  As a first step, David, could you please take a look into
> what's up with downgrade_write()?

Yes. Patch attached. The critical bit is in the hunk beginning at line 95.

It appears my rwsem testsuite didn't have a downgrade write test. I added one
and the bug became obvious.

> (Then again, we need to have a serious think about the overflow bug.  It's
> fatal.  Should we fix it?  If so, the current rwsem implementation is
> probably unsalvageable).

The optimised rwsem implementation is probably okay, provided you don't have a
32-bit machine with silly amounts of memory, for which the 32767 process limit
is probably reasonable. Maybe make it contingent on CONFIG_HIGHMEM on X86?

If you don't want to use an optimised version, there _is_ a spinlock version
as well.

For 64-bit archs like x86_64 and ppc64 this algorithm can easily made to use a
64-bit counter (lib/rwsem.c wants a signed long counter), so it's just a
matter of using XADDQ, LDARX/STDCX or equivalents (I have patches for those
two). I've also got MIPS64 patches somewhere too, though they're horribly out
of date.

Alpha and S390 already do the 64-bit counter thing. And if sparc64 has a
64-bit CAS, that can be used. IA64's rwsems should be 64-bitable too.

Using a 64-bit counter gives you a limit of 2 billion processes.

David


--Multipart_Tue_Jun__8_20:04:53_2004-1
Content-Type: text/plain; charset=US-ASCII


Signed-Off-By: David Howells <dhowells@redhat.com>
D: Stop downgrade_write() from under-adjusting the rwsem counter in optimised
D: rw-semaphores.

diff -urp linux-2.6.6/lib/rwsem.c linux-2.6.6-keys/lib/rwsem.c
--- linux-2.6.6/lib/rwsem.c	2004-05-11 11:28:57.000000000 +0100
+++ linux-2.6.6-keys/lib/rwsem.c	2004-06-08 19:31:35.550653817 +0100
@@ -29,15 +29,15 @@ void rwsemtrace(struct rw_semaphore *sem
 
 /*
  * handle the lock being released whilst there are processes blocked on it that can now run
- * - if we come here, then:
- *   - the 'active part' of the count (&0x0000ffff) reached zero but has been re-incremented
+ * - if we come here from up_xxxx(), then:
+ *   - the 'active part' of the count (&0x0000ffff) had reached zero (but may have changed)
  *   - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so)
  *   - there must be someone on the queue
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only woken if wakewrite is non-zero
+ * - writers are only woken if downgrading is false
  */
-static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
+static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 {
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
@@ -46,10 +46,12 @@ static inline struct rw_semaphore *__rws
 
 	rwsemtrace(sem,"Entering __rwsem_do_wake");
 
-	if (!wakewrite)
+	if (downgrading)
 		goto dont_wake_writers;
 
-	/* only wake someone up if we can transition the active part of the count from 0 -> 1 */
+	/* if we came through an up_xxxx() call, we only only wake someone up
+	 * if we can transition the active part of the count from 0 -> 1
+	 */
  try_again:
 	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS,sem) - RWSEM_ACTIVE_BIAS;
 	if (oldcount & RWSEM_ACTIVE_MASK)
@@ -78,9 +80,10 @@ static inline struct rw_semaphore *__rws
 	if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
 		goto out;
 
-	/* grant an infinite number of read locks to the readers at the front of the queue
-	 * - note we increment the 'active part' of the count by the number of readers (less one
-	 *   for the activity decrement we've already done) before waking any processes up
+	/* grant an infinite number of read locks to the readers at the front
+	 * of the queue
+	 * - note we increment the 'active part' of the count by the number of
+	 *   readers before waking any processes up
 	 */
  readers_only:
 	woken = 0;
@@ -95,8 +98,10 @@ static inline struct rw_semaphore *__rws
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
 
 	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
-	woken -= RWSEM_ACTIVE_BIAS;
+	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+	if (!downgrading)
+		woken -= RWSEM_ACTIVE_BIAS; /* we'd already done one increment
+					     * earlier */
 	rwsem_atomic_add(woken,sem);
 
 	next = sem->wait_list.next;
@@ -150,7 +155,7 @@ static inline struct rw_semaphore *rwsem
 	 * - it might even be this process, since the waker takes a more active part
 	 */
 	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem,1);
+		sem = __rwsem_do_wake(sem, 0);
 
 	spin_unlock(&sem->wait_lock);
 
@@ -201,7 +206,7 @@ struct rw_semaphore fastcall __sched *rw
 
 /*
  * handle waking up a waiter on the semaphore
- * - up_read has decremented the active part of the count if we come here
+ * - up_read/up_write has decremented the active part of the count if we come here
  */
 struct rw_semaphore fastcall *rwsem_wake(struct rw_semaphore *sem)
 {
@@ -211,7 +216,7 @@ struct rw_semaphore fastcall *rwsem_wake
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem,1);
+		sem = __rwsem_do_wake(sem, 0);
 
 	spin_unlock(&sem->wait_lock);
 
@@ -233,7 +238,7 @@ struct rw_semaphore fastcall *rwsem_down
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem,0);
+		sem = __rwsem_do_wake(sem, 1);
 
 	spin_unlock(&sem->wait_lock);
 

--Multipart_Tue_Jun__8_20:04:53_2004-1--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/