From: Aaron Lu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
    Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman, Matthew Wilcox
Subject: [PATCH v3 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock
Date: Mon, 26 Feb 2018 21:53:46 +0800
Message-Id: <20180226135346.7208-4-aaron.lu@intel.com>
In-Reply-To: <20180226135346.7208-1-aaron.lu@intel.com>
References: <20180226135346.7208-1-aaron.lu@intel.com>

When a page is freed back to the global pool, its buddy
will be checked to see if it's possible to do a merge. This requires
accessing the buddy's page structure, and that access could take a long
time if it's cache cold.

This patch adds a prefetch of the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure later
under zone->lock will be faster. Since we *always* do buddy merging and
check an order-0 page's buddy to try to merge it when it goes into the
main allocator, the cacheline will always come in, i.e. the prefetched
data will never be unused.

In the meantime, there are two concerns:
1 the prefetch could potentially evict existing cachelines, especially
  for the L1D cache since it is not huge;
2 there is some additional instruction overhead, namely calculating
  the buddy pfn twice.

For 1, it's hard to say: this microbenchmark shows a good result, but
the actual benefit of this patch will be workload/CPU dependent.
For 2, since the calculation is an XOR on two local variables, it's
expected that in many cases the cycles spent will be offset by reduced
memory latency later. This is especially true for NUMA machines where
multiple CPUs are contending on zone->lock and the most time-consuming
part under zone->lock is waiting for the 'struct page' cachelines of
the to-be-freed pages and their buddies.

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)    Skylake(2S)     Broadwell(4S)    Skylake(4S)
v4.16-rc2+   9034215          7971818        13667135         15677465
patch2/3     9536374 +5.6%    8314710 +4.3%  14070408 +3.0%   16675866 +6.4%
this patch  10338868 +8.4%    8544477 +2.8%  14839808 +5.5%   17155464 +2.9%

Note: this patch's performance improvement percentage is relative to
patch2/3.
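
To illustrate the XOR mentioned under concern 2, here is a minimal userspace sketch. It assumes the buddy of a 2^order block is found by flipping bit `order` of its pfn, which is my understanding of what the kernel's __find_buddy_pfn() does. The names find_buddy_pfn() and buddy_offset() are hypothetical stand-ins for illustration, not kernel code:

```c
#include <assert.h>
#include <limits.h>

/*
 * Userspace stand-in (hypothetical) for the kernel helper: the buddy of
 * a 2^order block is obtained by flipping bit 'order' of its pfn.
 */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	assert(order < sizeof(unsigned long) * CHAR_BIT);
	return pfn ^ (1UL << order);
}

/*
 * Signed distance, in pages, from a page to its buddy. This mirrors the
 * "page + (buddy_pfn - pfn)" arithmetic in the patch, written with an
 * explicit sign so that no unsigned wraparound is involved.
 */
static long buddy_offset(unsigned long pfn, unsigned int order)
{
	unsigned long buddy_pfn = find_buddy_pfn(pfn, order);

	if (buddy_pfn >= pfn)
		return (long)(buddy_pfn - pfn);
	return -(long)(pfn - buddy_pfn);
}
```

For order 0 the offset is always +1 or -1, so the extra work outside the lock is one XOR and one pointer adjustment per page.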

[changelog stolen from Dave Hansen and Mel Gorman's comments]
https://lkml.org/lkml/2018/1/24/551

Suggested-by: Ying Huang
Signed-off-by: Aaron Lu
---
 mm/page_alloc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35576da0a6c9..dc3b89894f2c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1142,6 +1142,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;
 
 		do {
+			unsigned long pfn, buddy_pfn;
+			struct page *buddy;
+
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
@@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 				continue;
 
 			list_add_tail(&page->lru, &head);
+
+			/*
+			 * We are going to put the page back to the global
+			 * pool, prefetch its buddy to speed up later access
+			 * under zone->lock. It is believed the overhead of
+			 * calculating buddy_pfn here can be offset by reduced
+			 * memory latency later.
+			 */
+			pfn = page_to_pfn(page);
+			buddy_pfn = __find_buddy_pfn(pfn, 0);
+			buddy = page + (buddy_pfn - pfn);
+			prefetch(buddy);
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
-- 
2.14.3
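
For readers who want to try the general pattern outside the kernel, here is a rough userspace sketch of "prefetch before taking the lock, touch the data under the lock". It is not the kernel code: struct item, pool_lock, prefetch_buddy() and drain_batch() are all hypothetical names, a pthread mutex stands in for zone->lock, and __builtin_prefetch() (a GCC/Clang builtin) stands in for the kernel's prefetch():

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical element: 'buddy' is the neighbor inspected under the lock. */
struct item {
	struct item *next;
	struct item *buddy;
	int merged;
};

/* Stand-in for zone->lock. */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hint that the buddy's cacheline will be needed soon; no lock required. */
static void prefetch_buddy(const struct item *it)
{
	assert(it != NULL);
	if (it->buddy)
		__builtin_prefetch(it->buddy);
}

/*
 * Pass 1 walks the batch without the lock and issues prefetches.
 * Pass 2 takes the lock and touches each buddy, which by then may
 * already be in cache. Returns the number of buddies touched.
 */
static int drain_batch(struct item *batch)
{
	struct item *it;
	int touched = 0;

	for (it = batch; it; it = it->next)
		prefetch_buddy(it);

	pthread_mutex_lock(&pool_lock);
	for (it = batch; it; it = it->next) {
		if (it->buddy) {
			it->buddy->merged = 1;
			touched++;
		}
	}
	pthread_mutex_unlock(&pool_lock);

	return touched;
}
```

Whether this helps in a given program depends on the same factors discussed in the changelog: cache pressure from the extra prefetches and how contended the lock is.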