From: Aaron Lu <aaron.lu@intel.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
    Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman,
    Matthew Wilcox, Daniel Jordan, Tariq Toukan
Subject: [RFC v3 PATCH 0/5] Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free
Date: Wed, 9 May 2018 16:54:45 +0800
Message-Id: <20180509085450.3524-1-aaron.lu@intel.com>

This series is meant to improve zone->lock scalability for order 0 pages.
With the will-it-scale/page_fault1 workload on a 2-socket Intel Skylake
server with 112 CPUs, the CPUs spend 80% of their time spinning on
zone->lock. A perf profile shows the most time-consuming part under
zone->lock is the cache miss on "struct page", so this series tries to
avoid those cache misses.

v3:
v2 was sent out more than a month ago and I was asked to do a resend.
While at it, I figured I should rebase it to a newer kernel. So:
- Rebase to v4.17-rc4;
- Remove the useless "mt" param of add_to_buddy_common() as pointed out
  by Vlastimil Babka;
- Patch 4/5: optimize cluster operation on the free path for all
  possible migrate types; the previous version only considered MOVABLE
  pages;
- Patch 5/5 is newly added, to disable cluster alloc and no-merge only
  while compaction is in progress. Previously, cluster alloc and
  no-merge were disabled as long as there were compaction failures in
  the zone.

A branch is maintained here in case someone wants to give it a try:
https://github.com/aaronlu/linux zone_lock_rfc_v3

v2:
Patch 1/4 adds some wrapper functions for adding/removing a page
to/from buddy and has no functional change.

Patch 2/4 skips merging for order-0 pages to avoid cache misses on the
buddy's "struct page". On a 2-socket Intel Skylake, this has a very
good effect on the free path for will-it-scale/page_fault1 at full
load: it reduced zone->lock contention on the free path from 35% to
1.1%. It also shows good results for the parallel free(*) workload,
reducing zone->lock contention from 90% to almost zero (lru_lock
increased from almost 0 to 90% though).

Patch 3/4 deals with allocation-path zone->lock contention by not
touching pages on the free_list one by one inside zone->lock.
Together with patch 2/4, zone->lock contention is entirely eliminated
for will-it-scale/page_fault1 at full load, though this patch adds some
overhead to manage clusters on the free path and hurts the parallel
free workload: it increased zone->lock contention there from almost 0
to 25%.

Patch 4/4 is an optimization of the cluster operation on the free path.
It decreased the number of times add_to_cluster() has to be called and
restored performance for the parallel free workload by reducing
zone->lock contention to almost 0% again.

The good thing about this patchset is that it eliminated zone->lock
contention for will-it-scale/page_fault1 and parallel free on big
servers (contention shifted to lru_lock). The bad things are:
- it added some overhead in the compaction path, which now has to do
  the merging for those merge-skipped order-0 pages;
- it is unfriendly to high-order page allocation since we no longer
  merge order-0 pages.

To see how much effect it has on compaction, mmtests/stress-highalloc
was used on a desktop machine with 8 CPUs and 4G memory.
(mmtests/stress-highalloc: make N copies of a kernel tree and start
building them to consume almost all memory with reclaimable file page
cache. This file page cache will not be returned to buddy, so it is
effectively a worst case for a high-order page workload. Then, after 5
minutes, start allocating X order-9 pages to see how well compaction
works.)

With a delay of 100ms between allocations:
kernel     success_rate   average_time_of_alloc_one_hugepage
base       58%            3.95927e+06 ns
patch2/4   58%            5.45935e+06 ns
patch4/4   57%            6.59174e+06 ns

With a delay of 1ms between allocations:
kernel     success_rate   average_time_of_alloc_one_hugepage
base       53%            3.17362e+06 ns
patch2/4   44%            2.31637e+06 ns
patch4/4   59%            2.73029e+06 ns

Comparing patch4/4's result with base, it performed OK, I think. This
is probably because compaction is a heavy job, so the added overhead
doesn't affect it much.
To see how much effect it has on workloads that use hugepages, I did
the following test on a 2-socket Intel Skylake with 112 CPUs/64G
memory:
1. Break all high-order pages by starting a program that consumes
   almost all memory with anonymous pages and then exits. This creates
   an extremely bad case for this patchset compared to vanilla, which
   always does merging;
2. Start 56 processes of will-it-scale/page_fault1 that use hugepages
   through calling madvise(MADV_HUGEPAGE). To make things worse for
   this patchset, start another 56 processes of
   will-it-scale/page_fault1 that use order-0 pages to continually
   cause trouble for the 56 THP users. Let them run for 5 minutes.

Score result (higher is better):
kernel     order0            THP
base       1522246           10540254
patch2/4   5266247 +246%     3309816  -69%
patch4/4   2234073 +47%      9610295  -8.8%

TBH, I'm not sure if the test above is good enough to expose the
problems of this patchset. So if you have any thoughts on this
patchset, please feel free to let me know, thanks.

(*) Parallel free is a workload I used to see how fast freeing a large
VMA in parallel can be. I tested this on a 4-socket Intel Skylake
machine with 768G memory. The test program starts by doing a 512G anon
memory allocation with mmap() and then exits, to see how fast the exit
can be.
The parallelism is implemented inside the kernel and has been posted
before:
http://lkml.kernel.org/r/1489568404-7817-1-git-send-email-aaron.lu@intel.com

A branch is maintained here in case someone wants to give it a try:
https://github.com/aaronlu/linux zone_lock_rfc_v2

v1 is here:
https://lkml.kernel.org/r/20180205053013.GB16980@intel.com

Aaron Lu (5):
  mm/page_alloc: use helper functions to add/remove a page to/from buddy
  mm/__free_one_page: skip merge for order-0 page unless compaction failed
  mm/rmqueue_bulk: alloc without touching individual page structure
  mm/free_pcppages_bulk: reduce overhead of cluster operation on free path
  mm/can_skip_merge(): make it more aggressive to attempt cluster alloc/free

 include/linux/mm_types.h |   3 +
 include/linux/mmzone.h   |  35 ++++
 mm/compaction.c          |  17 +-
 mm/internal.h            |  57 ++++++
 mm/page_alloc.c          | 496 ++++++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 557 insertions(+), 51 deletions(-)

-- 
2.14.3