Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932647AbdHYT6H (ORCPT ); Fri, 25 Aug 2017 15:58:07 -0400 Received: from mail-oi0-f67.google.com ([209.85.218.67]:33342 "EHLO mail-oi0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932421AbdHYT6G (ORCPT ); Fri, 25 Aug 2017 15:58:06 -0400 MIME-Version: 1.0 In-Reply-To: References: <83f675ad385d67760da4b99cd95ee912ca7c0b44.1503677178.git.tim.c.chen@linux.intel.com> From: Linus Torvalds Date: Fri, 25 Aug 2017 12:58:05 -0700 X-Google-Sender-Auth: hBzMRL0uv3L7JLOyGDE-bQbBIqw Message-ID: Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit To: Tim Chen Cc: Mel Gorman , Peter Zijlstra , Ingo Molnar , Andi Kleen , Kan Liang , Andrew Morton , Johannes Weiner , Jan Kara , Christopher Lameter , "Eric W . Biederman" , Davidlohr Bueso , linux-mm , Linux Kernel Mailing List Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2514 Lines: 69 On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen wrote: > Now that we have added breaks in the wait queue scan and allow bookmark > on scan position, we put this logic in the wake_up_page_bit function. Oh, _this_ is the other patch you were talking about. I thought it was the NUMA counter threashold that was discussed around the same time, and that's why I referred to Mel. Gods, _this_ patch is ugly. No, I'm not happy with it at all. It makes that wait_queue_head much bigger, for this disgusting one use. So no, this is no good. Now, maybe the page_wait_table[] itself could be changed to be something that is *not* just the wait-queue head. But if we need to change the page_wait_table[] itself to have more information, then we should make it not be a wait-queue at all, we should make it be a list of much more specialized entries, indexed by the {page,bit} tuple. And once we do that, we probably *could* use something like two specialized lists: one that is wake-all, and one that is wake-one. So you'd have something like struct page_wait_struct { struct list_node list; struct page *page; int bit; struct llist_head all; struct llist_head exclusive; }; and then the "page_wait_table[]" would be just an array of struct page_wait_head { spinlock_t lock; struct hlist_head list; }; and then the rule is: - each page/bit combination allocates one of these page_wait_struct entries when somebody starts waiting for it for the first time (and we use the spinlock in the page_wait_head to serialize that list). - an exclusive wait (ie somebody who waits to get the bit for locking) adds itself to the 'exclusive' llist - non-locking waiters add themselves to the 'all' list - we can use llist_del_all() to remove the 'all' list and then walk it and wake them up without any locks - we can use llist_del_first() to remove the first exclusive waiter and wait _it_ up without any locks. Hmm? How does that sound? That means that we wouldn't use the normal wait-queues at all for the page hash waiting. We'd use this two-level list: one list to find the page/bit thing, and then two lists within that fdor the wait-queues for just *that* page/bit. So no need for the 'key' stuff at all, because the page/bit data would be in the first data list, and the second list wouldn't have any of these traversal issues where you have to be careful and do it one entry at a time. Linus