Date: Wed, 9 May 2007 20:33:15 +0100 (BST)
From: Hugh Dickins
To: Nick Piggin
cc: Benjamin Herrenschmidt, linux-arch@vger.kernel.org, Andrew Morton,
    Linux Kernel Mailing List, Linux Memory Management List
Subject: Re: [rfc] optimise unlock_page
In-Reply-To: <20070508225012.GF20174@wotan.suse.de>
References: <20070508113709.GA19294@wotan.suse.de>
    <20070508114003.GB19294@wotan.suse.de>
    <1178659827.14928.85.camel@localhost.localdomain>
    <20070508224124.GD20174@wotan.suse.de>
    <20070508225012.GF20174@wotan.suse.de>

On Wed, 9 May 2007, Nick Piggin wrote:
> On Wed, May 09, 2007 at 12:41:24AM +0200, Nick Piggin wrote:
> > On Wed, May 09, 2007 at 07:30:27AM +1000, Benjamin Herrenschmidt wrote:
> > >
> > > Waking them all would fix it but at the risk of causing other
> > > problems... Maybe PG_waiters need to actually be a counter but if that
> > > is the case, then it complicates things even more.
> >
> > It will wake up 1 exclusive waiter, but no limit on non exclusive waiters.
> > Hmm, but it won't wake up waiters behind the exclusive guy... maybe the
> > wake up code can check whether the waitqueue is still active after the
> > wakeup, and set PG_waiters again in that case?
>
> Hm, I don't know if we can do that without a race either...
>
> OTOH, waking all non exclusive waiters may not be a really bad idea.

Not good enough, I'm afraid.  It looks like Ben's right and you need
a count - and counts in the page struct are a lot harder to add than
page flags.

I've now played around with the hangs on my three 4CPU machines
(all of them in io_schedule below __lock_page, waiting on pages
which were neither PG_locked nor PG_waiters when I looked).

Seeing Ben's mail, I thought the answer would be just to remove
the "_exclusive" from your three prepare_to_wait_exclusive()s.
That helped, but it didn't eliminate the hangs.

After fiddling around with different ideas for some while, I came
to realize that the ClearPageWaiters (in very misleadingly named
__unlock_page) is hopeless.  It's just so easy for it to clear the
PG_waiters that a third task relies upon for wakeup (and which
cannot loop around to set it again, because it simply won't be
woken by unlock_page/__unlock_page without it already being set).

Below is the patch I've applied to see some tests actually running
with your patches, but it's just a joke: absurdly racy and
presumptuous in itself (the "3" stands for us and the cache and one
waiter; I deleted the neighbouring mb and comment, not because I
disagree, but because it's ridiculous to pay so much attention to
such unlikely races when there's much worse nearby).  Though I've
not checked: if I've got the counting wrong, then maybe all my
pages are left marked PG_waiters by now.

(I did imagine we could go back to prepare_to_wait_exclusive once
I'd put in the page_count test before ClearPageWaiters; but
apparently not, that still hung.)
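
[Illustrative aside, not part of the patch further down.]
The lost wakeup described above can be made concrete with a small
userspace model.  This is only a sketch under loud assumptions: it is
not kernel code, struct toy_page and the toy_* helpers are hypothetical
names, everything runs single-threaded in one fixed order, and it covers
only the exclusive-waiter case in which each unlock wakes one sleeper.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page. All names here are hypothetical. */
struct toy_page {
	bool locked;	/* stands in for PG_locked */
	bool waiters;	/* stands in for PG_waiters */
	int sleeping;	/* tasks asleep on the page's waitqueue */
};

/* A task finds the page locked, sets the flag and goes to sleep. */
static void toy_wait(struct toy_page *p)
{
	assert(p->locked);
	p->waiters = true;
	p->sleeping++;
}

/* A woken task takes the lock directly, without setting the flag again. */
static void toy_lock(struct toy_page *p)
{
	assert(!p->locked);
	p->locked = true;
}

/*
 * Unlock with a waiters flag: the wakeup path runs only if the flag is
 * set, clears it unconditionally, then wakes one exclusive waiter.
 * Returns the number of tasks woken.
 */
static int toy_unlock(struct toy_page *p)
{
	p->locked = false;
	if (!p->waiters)
		return 0;	/* flag clear: no wakeup is attempted */
	p->waiters = false;	/* models ClearPageWaiters() */
	if (p->sleeping == 0)
		return 0;
	p->sleeping--;		/* one exclusive waiter is woken */
	return 1;
}

/* Two sleepers, two unlocks. Returns how many tasks are still asleep. */
static int toy_stranded(void)
{
	struct toy_page p = { .locked = true };

	toy_wait(&p);	/* task B sleeps */
	toy_wait(&p);	/* task C sleeps behind B */
	toy_unlock(&p);	/* A unlocks: flag cleared, B woken */
	toy_lock(&p);	/* B gets the lock */
	toy_unlock(&p);	/* B unlocks: flag is clear, C is not woken */
	return p.sleeping;
}
```

In this model toy_stranded() ends with one task still asleep while the
page is unlocked and the flag is clear, so no later unlock would ever
attempt to wake it.  The real races are concurrent and subtler than this
fixed ordering; the model only shows why an unconditional clear of the
flag is unsafe once more than one task depends on it.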

My intention had been to apply the patches to what I tested before
with lmbench, to get comparative numbers; but I don't think this
is worth the time, it's too far from being a real solution.

I was puzzled as to how you came up with any performance numbers
yourself, when I could hardly boot.  I see you mentioned 2CPU G5,
I guess you need a CPU or two more; or maybe it's that you didn't
watch what happened as it booted, often those hangs recover later.

Hugh

--- a/mm/filemap.c	2007-05-08 20:17:31.000000000 +0100
+++ b/mm/filemap.c	2007-05-09 19:14:03.000000000 +0100
@@ -517,13 +517,8 @@ EXPORT_SYMBOL(wait_on_page_bit);
  */
 void fastcall __unlock_page(struct page *page)
 {
-	ClearPageWaiters(page);
-	/*
-	 * The mb is necessary to enforce ordering between the clear_bit and
-	 * the read of the waitqueue (to avoid SMP races with a parallel
-	 * wait_on_page_locked()
-	 */
-	smp_mb__after_clear_bit();
+	if (page_count(page) <= 3 + page_has_buffers(page)+page_mapcount(page))
+		ClearPageWaiters(page);
 	wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(__unlock_page);
@@ -558,7 +553,7 @@ void fastcall __lock_page(struct page *p
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

 	do {
-		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 		SetPageWaiters(page);
 		if (likely(PageLocked(page)))
 			sync_page(page);
@@ -577,7 +572,7 @@ void fastcall __lock_page_nosync(struct
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

 	do {
-		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 		SetPageWaiters(page);
 		if (likely(PageLocked(page)))
 			io_schedule();
@@ -591,7 +586,7 @@ void fastcall __wait_on_page_locked(stru
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

 	do {
-		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 		SetPageWaiters(page);
 		if (likely(PageLocked(page)))
 			sync_page(page);
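
[Illustrative aside, not part of the message's patch.]
The thread's suggestion that a count is needed rather than a flag can be
sketched in the same toy style.  Everything here is an assumption made
for illustration: struct counted_page, nr_waiters and the counted_*
helpers are hypothetical, the model is single-threaded userspace C, and
it deliberately ignores the hard parts raised in the thread, namely
where such a count would live in struct page and how it would be
updated atomically.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page with a waiter count. Names are hypothetical. */
struct counted_page {
	bool locked;	/* stands in for PG_locked */
	int nr_waiters;	/* hypothetical count replacing the PG_waiters bit */
};

/* A task finds the page locked, registers itself and goes to sleep. */
static void counted_wait(struct counted_page *p)
{
	assert(p->locked);
	p->nr_waiters++;
}

/* A woken task takes the lock. */
static void counted_lock(struct counted_page *p)
{
	assert(!p->locked);
	p->locked = true;
}

/*
 * Unlock with a count: the wakeup path runs while any waiter remains,
 * and waking one task only removes that task from the count.
 * Returns the number of tasks woken.
 */
static int counted_unlock(struct counted_page *p)
{
	p->locked = false;
	if (p->nr_waiters == 0)
		return 0;	/* nobody is waiting */
	p->nr_waiters--;	/* one exclusive waiter is woken */
	return 1;
}

/* Two sleepers, two unlocks. Returns how many tasks are still asleep. */
static int counted_stranded(void)
{
	struct counted_page p = { .locked = true };

	counted_wait(&p);	/* task B sleeps */
	counted_wait(&p);	/* task C sleeps behind B */
	counted_unlock(&p);	/* A unlocks: B woken, count drops to 1 */
	counted_lock(&p);	/* B gets the lock */
	counted_unlock(&p);	/* B unlocks: count is nonzero, C woken */
	return p.nr_waiters;
}
```

With the same sequence of two sleepers and two unlocks, the count
reaches zero only after both tasks have been woken, because waking the
first task no longer erases the record of the second.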