Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753197AbdHOC1o (ORCPT ); Mon, 14 Aug 2017 22:27:44 -0400 Received: from mga05.intel.com ([192.55.52.43]:37481 "EHLO mga05.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752986AbdHOC1n (ORCPT ); Mon, 14 Aug 2017 22:27:43 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.41,376,1498546800"; d="scan'208";a="1003634385" Date: Mon, 14 Aug 2017 19:27:43 -0700 From: Andi Kleen To: Linus Torvalds Cc: Tim Chen , Peter Zijlstra , Ingo Molnar , Kan Liang , Andrew Morton , Johannes Weiner , Jan Kara , linux-mm , Linux Kernel Mailing List Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk Message-ID: <20170815022743.GB28715@tassilo.jf.intel.com> References: <84c7f26182b7f4723c0fe3b34ba912a9de92b8b7.1502758114.git.tim.c.chen@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.8.3 (2017-05-23) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2433 Lines: 58 On Mon, Aug 14, 2017 at 06:48:06PM -0700, Linus Torvalds wrote: > On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen wrote: > > We encountered workloads that have very long wake up list on large > > systems. A waker takes a long time to traverse the entire wake list and > > execute all the wake functions. > > > > We saw page wait list that are up to 3700+ entries long in tests of large > > 4 and 8 socket systems. It took 0.8 sec to traverse such list during > > wake up. Any other CPU that contends for the list spin lock will spin > > for a long time. As page wait list is shared by many pages so it could > > get very long on systems with large memory. > > I really dislike this patch. > > The patch seems a band-aid for really horrible kernel behavior, rather > than fixing the underlying problem itself. > > Now, it may well be that we do end up needing this band-aid in the > end, so this isn't a NAK of the patch per se. But I'd *really* like to > see if we can fix the underlying cause for what you see somehow.. We could try it and it may even help in this case and it may be a good idea in any case on such a system, but: - Even with a large hash table it might be that by chance all CPUs will be queued up on the same page - There are a lot of other wait queues in the kernel and they all could run into a similar problem - I suspect it's even possible to construct it from user space as a kind of DoS attack Given all that I don't see any alternative to fixing wait queues somehow. It's just that systems are so big that now that they're starting to stretch the tried old primitives. Now in one case (on a smaller system) we debugged we had - 4S system with 208 logical threads - during the test the wait queue length was 3700 entries. - the last CPUs queued had to wait roughly 0.8s This gives a budget of roughly 1us per wake up. It could be that we could find some way to do "bulk wakeups" in the scheduler that are much cheaper, and switch to them if there are a lot of entries in the wait queues. With that it may be possible to do a wake up in less than 1us. But even with that it will be difficult to beat the scaling curve. If systems get bigger again (and they will be) it could easily break again, as the budget gets smaller and smaller. Also disabling interrupts for that long is just nasty. Given all that I still think a lock breaker of some form is needed. -Andi