From: Linus Torvalds
Date: Mon, 14 Aug 2017 18:48:06 -0700
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, Andi Kleen, Kan Liang, Andrew Morton,
    Johannes Weiner, Jan Kara, linux-mm, Linux Kernel Mailing List

On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen wrote:
> We encountered workloads that have very long wake up list on large
> systems. A waker takes a long time to traverse the entire wake list and
> execute all the wake functions.
>
> We saw page wait list that are up to 3700+ entries long in tests of large
> 4 and 8 socket systems. It took 0.8 sec to traverse such list during
> wake up. Any other CPU that contends for the list spin lock will spin
> for a long time. As page wait list is shared by many pages so it could
> get very long on systems with large memory.

I really dislike this patch.

The patch seems to be a band-aid for really horrible kernel behavior,
rather than fixing the underlying problem itself.

Now, it may well be that we do end up needing this band-aid in the
end, so this isn't a NAK of the patch per se. But I'd *really* like
to see if we can fix the underlying cause for what you see somehow...

In particular, if this is about the page wait table, maybe we can
just make the wait table bigger. IOW, are people actually waiting on
the *same* page, or are they mainly waiting on totally different
pages, just hashing to the same wait queue?

Because right now that page wait table is a small fixed size, and the
only reason it's a small fixed size is that nobody reported any
issues with it - particularly since we now avoid the wait table
entirely for the common cases by having that "contention" bit.

But it really is a *small* table. We literally have

    #define PAGE_WAIT_TABLE_BITS 8

so it's just 256 entries. We could easily make it much bigger, if we
are actually seeing a lot of collisions.

We *used* to have a very complex per-zone thing for bit-waitqueues,
but that was because we got lots and lots of contention issues, and
everybody *always* touched the wait-queues whether they waited or not
(so being per-zone was a big deal).

We got rid of all that per-zone complexity when the normal case
didn't hit the page wait queues at all, but we may have over-done the
simplification a bit since nobody showed any issue.

In particular, we used to size the per-zone thing by amount of
memory. We could easily re-introduce that for the new simpler page
queues.

page_waitqueue() is a simple helper function inside mm/filemap.c, and
thanks to the per-page "do we have actual waiters" bit that we have
now, we can actually afford to make it bigger and more complex if we
want to.

What happens to your load if you just make that table bigger?
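
For reference, the whole thing is only a handful of lines in
mm/filemap.c. It's roughly this - I'm writing it out from memory, so
treat it as a sketch of the current code rather than the exact thing:

    #define PAGE_WAIT_TABLE_BITS 8
    #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)

    /* One global, fixed-size array of wait queue heads... */
    static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

    /* ...and every page hashes into one of its 256 entries */
    static wait_queue_head_t *page_waitqueue(struct page *page)
    {
            return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
    }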
You can literally test this by just changing that constant from 8 to
16 or something, making us use twice as many bits for hashing (so
65536 entries instead of 256). A "real" patch would size it by amount
of memory, but just for testing the contention on your load, you can
do the hacky one-liner.

             Linus
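
PS. If the bigger table does end up helping, the "real" fix would
presumably look something like the below - a completely untested
sketch, sizing the table at boot the same way we size the inode and
dentry hashes, with the table name and the "one wait queue per 64kB
of memory" scale factor pulled out of thin air:

    static wait_queue_head_t *page_wait_table __read_mostly;
    static unsigned int page_wait_shift __read_mostly;

    static wait_queue_head_t *page_waitqueue(struct page *page)
    {
            /* Hash with however many bits the boot-time sizing gave us */
            return &page_wait_table[hash_ptr(page, page_wait_shift)];
    }

    void __init pagecache_init(void)
    {
            unsigned int i;

            /* Scale the wait table with memory, like the other system hashes */
            page_wait_table = alloc_large_system_hash("page-wait",
                                    sizeof(wait_queue_head_t),
                                    0,      /* compute entries from memory */
                                    16,     /* one bucket per 64kB */
                                    0,      /* no flags */
                                    &page_wait_shift,
                                    NULL,
                                    256,    /* no smaller than today */
                                    0);
            for (i = 0; i < (1U << page_wait_shift); i++)
                    init_waitqueue_head(&page_wait_table[i]);

            page_writeback_init();
    }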