Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751280AbdH1OvU (ORCPT ); Mon, 28 Aug 2017 10:51:20 -0400 Received: from mga14.intel.com ([192.55.52.115]:57511 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751184AbdH1OvS (ORCPT ); Mon, 28 Aug 2017 10:51:18 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.41,442,1498546800"; d="scan'208";a="142773064" From: "Liang, Kan" To: Linus Torvalds , Tim Chen CC: Mel Gorman , Peter Zijlstra , Ingo Molnar , Andi Kleen , Andrew Morton , "Johannes Weiner" , Jan Kara , Christopher Lameter , "Eric W . Biederman" , Davidlohr Bueso , linux-mm , Linux Kernel Mailing List Subject: RE: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit Thread-Topic: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit Thread-Index: AQHTHb+Q9zxUNJ5MmUaRufh57Tkd0qKU91KAgAAnnoCAAAjGgIAAA4IAgAAYmQCAACfWgIABATSAgANwC8A= Date: Mon, 28 Aug 2017 14:51:14 +0000 Message-ID: <37D7C6CF3E00A74B8858931C1DB2F077537A07E9@SHSMSX103.ccr.corp.intel.com> References: <83f675ad385d67760da4b99cd95ee912ca7c0b44.1503677178.git.tim.c.chen@linux.intel.com> In-Reply-To: Accept-Language: zh-CN, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiZTg4MGY4MTMtYjdhMi00ODhjLWEwZDItZmQ3YTI2YWYwOTg3IiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6IjFRenErQUlyZm9qbWxaUUQrdDFQaWhaUTRyT0NOTmlmbU9HN3p0OFh4dDA9In0= x-ctpclassification: CTP_IC dlp-product: dlpe-windows dlp-version: 11.0.0.116 dlp-reaction: no-action x-originating-ip: [10.239.127.40] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by nfs id v7SEpQqJ002006 Content-Length: 6391 Lines: 141 . > On Fri, Aug 25, 2017 at 7:54 PM, Linus Torvalds foundation.org> wrote: > > > > Simplify, simplify, simplify. > > I've now tried three different approaches (I stopped sending broken patches > after deciding the first one was unfixable), and they all suck. > > I was hoping for something lockless to avoid the whole issue with latency > over the long list, but anything that has the wait queue entry allocated on the > stack needs to synchronize the wakeup due to the stack usage, so once you > have lists that are thousands of entries, either you hold the lock for long > times (normal wait queues) or take and release them constantly (the swait > approach), or you batch things up (Tim's wait-queue patches). > > Of those three approaches, the batching does seem to be the best of the lot. > Allocating non-stack wait entries while waiting for pages seems like a bad > idea. We're probably waiting for IO in normal circumstances, possibly > because we're low on memory. > > So I *am* ok with the batching (your patch #1), but I'm *not* ok with the > nasty horrible book-keeping to avoid new entries when you batch and > release the lock and that allows new entries on the list (your patch #2). > > That said, I have now stared *way* too much at the page wait code after > having unsuccessfully tried to replace that wait-queue with other "clever" > approaches (all of which ended up being crap). > > And I'm starting to see a possible solution, or at least improvement. > > Let's just assume I take the batching patch. It's not conceptually ugly, it's > fairly simple, and there are lots of independent arguments for it (ie latency, > but also possible performance from better parallelism). > > That patch doesn't make any data structures bigger or more complicated, it's > just that single "is this a bookmark entry" thing. > So that patch just looks fundamentally fine to me, and the only real > argument I ever had against it was that I would really really _really_ have > wanted to root-cause the behavior. > > So that leaves my fundamental dislike of your other patch. > > And it turns out that all my looking at the page wait code wasn't entirely > unproductive. Yes, I went through three crap approaches before I gave up on > rewriting it, but in the meantime I did get way too intimate with looking at > that pile of crud. > > And I think I found the reason why you needed that patch #2 in the first place. > > Here's what is going on: > > - we're going to assume that the problem is all with a single page, not due to > hash collisions (as per your earlier reports) > > - we also know that the only bit that matters is the PG_locked bit > > - that means that the only way to get that "multiple concurrent waker" > situation that your patch #2 tries to handle better is because you have > multiple people unlocking the page - since that is what causes the wakeups > > - that in turn means that you obviously had multiple threads *locking* it too. > > - and that means that among those new entries there are lockers coming in > in the middle and adding an exclusive entry. > > - the exclusive entry already stops the wakeup list walking > > - but we add non-exclusive entries TO THE BEGINNING of the page waiters > list > > And I really think that last thing is fundamentally wrong. > > It's wrong for several reasons: > > - because it's unfair: threads that want to lock get put behind threads that > just want to see the unlocked state. > > - because it's stupid: our non-locking waiters will end up waiting again if the > page got locked, so waking up a locker *and* a lot of non-locking waiters will > just cause them to go back to sleep anyway > > - because it causes us to walk longer lists: we stop walking when we wake up > the exclusive waiter, but because we always put the non-exclusive waiters in > there first, we always walk the long list of non-exclusive waiters even if we > could just stop walking because we woke up an exclusive one. > > Now the reason we do this seems to be entirely random: for a *normal* wait > queue, you really want to always wake up all the non-exclusive waiters, > because exclusive waiters are only exclusive wrt each other. > > But for a page lock, an exclusive waiter really is getting the lock, and the > other waiters are going to wait for the lock to clear anyway, so the page wait > thing is actually almost exactly the reverse situation. We *could* put > exclusive waiters at the beginning of the list instead, and actively try to avoid > walking the list at all if we have pending lockers. > > I'm not doing that, because I think the fair thing to do is to just do things in > the order they came in. Plus the code is actually simpler if we just always add > to the tail. > > Now, the other thing to look at is how the wakeup function works. It checks > the aliasing information (page and bit number), but then it > *also* does: > > if (test_bit(key->bit_nr, &key->page->flags)) > return 0; > > basically saying "continue walking if somebody else already got the bit". > > That's *INSANE*. It's exactly the wrong thing to do. It's basically saying "even > if we had an exclusive waiter, let's not wake it up, but do let us continue to > walk the list even though the bit we're waiting for to clear is set already". > > What would be sane is to say "stop walking the list now".. So just do that - by > making "negative return from wake function" mean "stop walking". > > So how about just this fairly trivial patch? I tried this patch and https://lkml.org/lkml/2017/8/27/222 together. But they don't fix the issue. I can still get the similar call stack. Thanks, Kan > > In fact, I think this may help even *without* Tim's patch #1. So I think this > would be lovely to test on that problem load on its own, and seeing if it > makes the wait queues behave better. > > It might not cut down on the total length of the wait-queue, but it should > hopefully cause it to walk it much less. We now hit the exclusive waiters > earlier and stop waking things up when we have a new locker thread pending. > And when the page ends up being locked again, we stop walking the list > entirely, so no unnecessarily traversal. > > This patch is small and at least minimally works (I'm running it right now). > > Linus