Date: Sat, 22 Dec 2018 12:08:31 +0000
From: Mel Gorman
To: David Rientjes
Cc: Vlastimil Babka, Andrea Arcangeli, Linus Torvalds, Michal Hocko,
    ying.huang@intel.com, s.priebe@profihost.ag,
    Linux List Kernel Mailing, alex.williamson@redhat.com, lkp@01.org,
    kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu,
    Linux-MM layout
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Message-ID: <20181222120831.GC31517@techsingularity.net>
References: <20181210044916.GC24097@redhat.com>
    <0bbf4202-6187-28fb-37b7-da6885b89cce@suse.cz>
    <0700f5c3-66a8-338a-0ba0-2231cc3bb637@suse.cz>

On Fri, Dec 21, 2018 at 02:18:45PM -0800, David Rientjes wrote:
> On Fri, 14 Dec 2018, Vlastimil Babka wrote:
>
> > > It would be interesting to know if anybody has tried using the per-zone
> > > free_area's to determine migration targets and set a bit if it should be
> > > considered a migration source or a migration target. If all pages for a
> > > pageblock are not on free_areas, they are fully used.
> >
> > Repurposing/adding a new pageblock bit was in my mind to help multiple
> > compactors not undo each other's work in the scheme where there's no
> > free page scanner, but I didn't implement it yet.
> >
>
> It looks like Mel has a series posted that still is implemented with
> linear scans through memory, so I'm happy to move the discussion there; I
> think the goal for compaction with regard to this thread is determining
> whether reclaim in the page allocator would actually be useful and
> targeted reclaim to make memory available for isolate_freepages() could be
> expensive. I'd hope that we could move in a direction where compaction
> doesn't care where the pageblock is and does the minimal amount of work
> possible to make a high-order page available, not sure if that's possible
> with a linear scan. I'll take a look at Mel's series though.

That series has evolved significantly because there were a lot of
missing pieces. While it's somewhat ready other than badly written
changelogs, I didn't post it because I'm going offline and wouldn't
respond to feedback, and I imagine others are offline too and
unavailable for review. Besides, the merge window is about to open and
I know there are patches in Andrew's tree for mainline that should be
taken into account.

The series is now 25 patches long and covers a lot of prerequisites
that would be necessary before removing the linear scanner. What is
critical for a purely free-list scanner is that the exit conditions are
identified, and the series provides a lot of the pieces. For example, a
non-linear scanner must properly control skip bits and isolate
pageblocks from multiple compaction instances, which this series does.

The main takeaway from the series is that it reduces system CPU usage by
17%, reduces free scan rates by 99.5% and increases THP allocation
success rates by 33%, giving almost 99% allocation success rates. It
also:

o Isolates pageblocks for a single compaction instance
o Synchronises async/sync scanners when appropriate to reduce rescanning
o Identifies when a pageblock is being rescanned and is "sticky" and
  makes forward progress instead of looping excessively
o Uses smarter logic when clearing pageblock skip bits to reduce scanning
o Uses various methods for reducing unnecessary scanning
o Handles contention better
o Avoids compaction of remote nodes in direct compaction context

If you do not want to wait until the new year, it's at
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-fast-compact-v2r15

Preliminary results based on thpscale using MADV_HUGEPAGE to allocate
huge pages on a fragmented system.

thpscale Fault Latencies
                                    4.20.0-rc6             4.20.0-rc6
                                mmotm-20181210         noremote-v2r14
Amean     fault-both-1        864.83 (   0.00%)     1006.88 * -16.43%*
Amean     fault-both-3       3566.05 (   0.00%)     2460.97 *  30.99%*
Amean     fault-both-5       5685.02 (   0.00%)     4052.92 *  28.71%*
Amean     fault-both-7       7289.40 (   0.00%)     5929.65 (  18.65%)
Amean     fault-both-12     10937.46 (   0.00%)     8870.53 (  18.90%)
Amean     fault-both-18     15440.48 (   0.00%)    11464.86 *  25.75%*
Amean     fault-both-24     15345.83 (   0.00%)    13040.01 *  15.03%*
Amean     fault-both-30     20159.73 (   0.00%)    16618.73 *  17.56%*
Amean     fault-both-32     20843.51 (   0.00%)    14401.25 *  30.91%*

Fault latency (either huge or base) is mostly improved even when 32
tasks are trying to allocate huge pages on an 8-CPU single socket
machine where contention is a factor.

thpscale Percentage Faults Huge
                               4.20.0-rc6             4.20.0-rc6
                           mmotm-20181210         noremote-v2r14
Percentage huge-1        96.03 (   0.00%)       96.94 (   0.95%)
Percentage huge-3        71.43 (   0.00%)       95.43 (  33.60%)
Percentage huge-5        70.44 (   0.00%)       96.85 (  37.48%)
Percentage huge-7        70.39 (   0.00%)       94.77 (  34.63%)
Percentage huge-12       71.53 (   0.00%)       98.07 (  37.11%)
Percentage huge-18       70.61 (   0.00%)       98.42 (  39.38%)
Percentage huge-24       71.84 (   0.00%)       97.85 (  36.20%)
Percentage huge-30       69.94 (   0.00%)       98.13 (  40.31%)
Percentage huge-32       66.92 (   0.00%)       97.79 (  46.13%)

96-98% of THP requests get huge pages on request.

              4.20.0-rc6      4.20.0-rc6
          mmotm-20181210  noremote-v2r14
User               27.30           27.86
System            192.70          159.42
Elapsed           580.13          571.98

System CPU usage is reduced so we get more huge pages for less work and
the workload completes slightly faster.

                                 4.20.0-rc6      4.20.0-rc6
                             mmotm-20181210  noremote-v2r14
Allocation stalls                  19156.00         3627.00

Fewer stalls which is always a plus.

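As a cross-check, the percentage columns in the tables above can be
reproduced with a few lines of Python. This is only a sketch: the helper
name pct_change and its lower_is_better flag are made up for
illustration, and the assumption that the report signs its percentages
so that positive means "better than baseline" is inferred from the
numbers, not taken from the reporting tool.

```python
def pct_change(baseline, patched, lower_is_better=True):
    """Relative change against the baseline, in percent.

    Signed so that a positive result means the patched kernel improved
    (hypothetical convention, inferred from the tables).
    """
    if lower_is_better:
        delta = baseline - patched
    else:
        delta = patched - baseline
    return 100.0 * delta / baseline


# Values copied from the tables: baseline kernel vs. patched kernel.
fault_both_1 = pct_change(864.83, 1006.88)        # latency, lower is better
fault_both_3 = pct_change(3566.05, 2460.97)       # latency, lower is better
huge_3 = pct_change(71.43, 95.43, lower_is_better=False)  # % huge faults
system_cpu = pct_change(192.70, 159.42)           # system CPU time
alloc_stalls = pct_change(19156.00, 3627.00)      # allocation stalls

for name, value in [
    ("fault-both-1", fault_both_1),
    ("fault-both-3", fault_both_3),
    ("huge-3", huge_3),
    ("System CPU", system_cpu),
    ("Allocation stalls", alloc_stalls),
]:
    print(f"{name:20s} {value:7.2f}%")
```

The system CPU figure comes out at roughly 17%, which matches the
headline number quoted for the series.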
THP fault alloc                    77804.00        84618.00
THP fault fallback                  7628.00          816.00
THP collapse alloc                    12.00            0.00
THP collapse fail                      0.00            0.00
THP split                          56921.00        56920.00
THP split failed                    1982.00          116.00
Compaction stalls                  36350.00        25541.00
Compaction success                 17491.00        22651.00
Compaction failures                18859.00         2890.00
Compaction efficiency                 48.12           88.68

Compaction efficiency is increased a lot (efficiency is a basic measure
of success vs failure, i.e. successes as a percentage of stalls).
Previously almost half of the THP requests failed.

Page migrate success            10200844.00      7802473.00
Page migrate failure                3703.00          409.00
Compaction pages isolated       23093029.00     16532642.00
Compaction migrate scanned      28454655.00      8976143.00
Compaction free scanned        717517120.00      3632762.00
Compact scan efficiency                3.97          247.09

Migration scanning is down 68% (to about 32% of the baseline), free
scanning is down 99.5%. Scan efficiency is interesting because it's a
measure of how many pages the free scanner examines for one migration
source. Before the series, we had to scan *way* more pages to find a
free page whereas now we scan *fewer* pages to find a migration target
due to the use of free lists.

Kcompactd wake                         1.00            9.00
Kcompactd migrate scanned          14023.00        13318.00
Kcompactd free scanned              6932.00         6643.00

Minor improvements for kcompactd but for this workload, it was barely
active.

I'll rebase and repost in the new year and I think it should be
considered a prerequisite before considering the removal of the linear
scanning. It'll be impossible to remove completely due to memory
isolation. If built on this series, I would imagine that it would take
the following approach.

o The migration scanner remains linear or mostly linear (series uses
  free page lists to get hints on where suitable migration sources are)
o The free scanner would be purely based on the free lists i.e.
  fast_isolate_freepages would be the only scanner
o The migration scanner would need to be strict about obeying the skip
  bit to avoid picking a migration source that was previously a
  migration target
o The exit condition for compaction is not when scanners meet but when
  fast_isolate_freepages cannot find any pageblock that is
  MIGRATE_MOVABLE && !pageblock_skip

-- 
Mel Gorman
SUSE Labs
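
The exit condition in the last bullet can be illustrated with a toy
model. This is not kernel code and not how the series implements it:
the Pageblock class, find_free_target() and compact() are hypothetical
names, the migratetype strings are stand-ins for the kernel's constants,
and marking an exhausted target with the skip bit is an assumption made
to keep the example small.

```python
from dataclasses import dataclass

# Stand-in labels for illustration, not the kernel's enum values.
MIGRATE_MOVABLE = "movable"
MIGRATE_UNMOVABLE = "unmovable"


@dataclass
class Pageblock:
    migratetype: str
    skip: bool = False   # models the pageblock skip bit
    free_pages: int = 0  # pages this block has on the free lists


def find_free_target(pageblocks):
    """Return the first block usable as a migration target, or None.

    Mirrors the condition MIGRATE_MOVABLE && !pageblock_skip; requiring
    free pages models a search that only walks the free lists.
    """
    for block in pageblocks:
        if (block.migratetype == MIGRATE_MOVABLE
                and not block.skip
                and block.free_pages > 0):
            return block
    return None


def compact(pageblocks, pages_to_move):
    """Place pages into free targets until done or no target remains."""
    moved = 0
    while moved < pages_to_move:
        target = find_free_target(pageblocks)
        if target is None:
            # Exit condition: no eligible pageblock is left, regardless
            # of where any linear scanner position would be.
            break
        take = min(target.free_pages, pages_to_move - moved)
        target.free_pages -= take
        moved += take
        if target.free_pages == 0:
            # Assumption: a used-up target is marked so that it is not
            # picked again (and not treated as a migration source).
            target.skip = True
    return moved


blocks = [
    Pageblock(MIGRATE_MOVABLE, free_pages=4),
    Pageblock(MIGRATE_UNMOVABLE, free_pages=8),
    Pageblock(MIGRATE_MOVABLE, skip=True, free_pages=8),
    Pageblock(MIGRATE_MOVABLE, free_pages=2),
]
moved = compact(blocks, pages_to_move=10)
print(f"moved {moved} pages before running out of eligible targets")
```

In this model compaction stops because the search for a target comes up
empty, not because two scan positions have met, which is the point of
the last bullet.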