From: ebiederm@xmission.com (Eric W. Biederman)
To: Linus Torvalds
Cc: Andi Kleen, Tim Chen, Peter Zijlstra, Ingo Molnar, Kan Liang,
 Andrew Morton, Johannes Weiner, Jan Kara, linux-mm,
 Linux Kernel Mailing List
Date: Wed, 16 Aug 2017 18:22:57 -0500
Message-ID: <87inhnrtbi.fsf@xmission.com>
In-Reply-To: (Linus Torvalds's message of "Tue, 15 Aug 2017 16:50:34 -0700")
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

Linus Torvalds writes:

> On Tue, Aug 15, 2017 at 3:57 PM, Linus Torvalds wrote:
>>
>> Oh, and the page wait-queue really needs that key argument too, which
>> is another thing that swait queue code got rid of in the name of
>> simplicity.
>
> Actually, it gets worse.
>
> Because the page wait queues are hashed, it's not an all-or-nothing
> thing even for the non-exclusive cases, and it's not a "wake up first
> entry" for the exclusive case. Both have to be conditional on the wait
> entry actually matching the page and bit in question.
>
> So no way to use swait, or any of the lockless queuing code in general
> (so we can't do some clever private wait-list using llist.h either).
>
> End result: it looks like you fairly fundamentally do need to use a
> lock over the whole list traversal (like the standard wait-queues),
> and then add a cursor entry like Tim's patch if dropping the lock in
> the middle.
>
> Anyway, looking at the old code, we *used* to limit the page wait hash
> table to 4k entries, and we used to have one hash table per memory
> zone.
>
> The per-zone thing didn't work at all for the generic bit-waitqueues,
> because of how people used them on virtual addresses on the stack.
>
> But it *could* work for the page waitqueues, which are now a totally
> separate entity, and is obviously always physically addressed (since
> the indexing is by "struct page" pointer), and doesn't have that
> issue.
>
> So I guess we could re-introduce the notion of per-zone page waitqueue
> hash tables. It was disgusting to allocate and free though (and hooked
> into the memory hotplug code).
>
> So I'd still hope that we can instead just have one larger hash table,
> and that is sufficient for the problem.

If increasing the hash table size fixes the problem, I am wondering if
rhashtables might be the proper solution. They start out small and then
grow as needed.

Eric