Date: Tue, 13 Mar 2018 11:34:53 +0800
From: Aaron Lu
To: Vlastimil Babka
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton,
	Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen,
	Michal Hocko, Mel Gorman, Matthew Wilcox, David Rientjes
Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free
Message-ID: <20180313033453.GB13782@intel.com>
References: <20180301062845.26038-1-aaron.lu@intel.com>
 <20180301062845.26038-3-aaron.lu@intel.com>
 <9cad642d-9fe5-b2c3-456c-279065c32337@suse.cz>
In-Reply-To: <9cad642d-9fe5-b2c3-456c-279065c32337@suse.cz>

On Mon, Mar 12, 2018 at 03:22:53PM +0100, Vlastimil Babka wrote:
> On 03/01/2018 07:28 AM, Aaron Lu wrote:
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> >
> > Moving this part outside could reduce lock held time and improve
> > performance. Test with will-it-scale/page_fault1 full load:
> >
> > kernel       Broadwell(2S)   Skylake(2S)     Broadwell(4S)    Skylake(4S)
> > v4.16-rc2+   9034215         7971818         13667135         15677465
> > this patch   9536374 +5.6%   8314710 +4.3%   14070408 +3.0%   16675866 +6.4%
> >
> > What the test does is: starts $nr_cpu processes and each will repeatedly
> > do the following for 5 minutes:
> > 1 mmap 128M anonymous space;
> > 2 write access to that space;
> > 3 munmap.
> > The score is the aggregated iteration.
> >
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> >
> > Acked-by: Mel Gorman
> > Signed-off-by: Aaron Lu
> > ---
> >  mm/page_alloc.c | 39 +++++++++++++++++++++++----------------
> >  1 file changed, 23 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index faa33eac1635..dafdcdec9c1f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  	int migratetype = 0;
> >  	int batch_free = 0;
> >  	bool isolated_pageblocks;
> > -
> > -	spin_lock(&zone->lock);
> > -	isolated_pageblocks = has_isolate_pageblock(zone);
> > +	struct page *page, *tmp;
> > +	LIST_HEAD(head);
> >
> >  	while (count) {
> > -		struct page *page;
> >  		struct list_head *list;
> >
> >  		/*
> > @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  			batch_free = count;
> >
> >  		do {
> > -			int mt;	/* migratetype of the to-be-freed page */
> > -
> >  			page = list_last_entry(list, struct page, lru);
> > -			/* must delete as __free_one_page list manipulates */
> > +			/* must delete to avoid corrupting pcp list */
> >  			list_del(&page->lru);
>
> Well, since bulkfree_pcp_prepare() doesn't care about page->lru, you
> could maybe use list_move_tail() instead of list_del() +
> list_add_tail()? That avoids temporarily writing poison values.

Good point, except bulkfree_pcp_prepare() could return an error and then
the page would need to be removed from the to-be-freed list, like this:

	do {
		page = list_last_entry(list, struct page, lru);
		list_move_tail(&page->lru, &head);
		pcp->count--;
		if (bulkfree_pcp_prepare(page))
			list_del(&page->lru);
	} while (--count && --batch_free && !list_empty(list));

Considering that bulkfree_pcp_prepare() returning an error is the rare
case, this list_del() should rarely happen. At the same time, this part
is outside of zone->lock and can hardly impact performance... so I'm not
sure.

> Hm actually, you are reversing the list in the process, because page is
> obtained by list_last_entry and you use list_add_tail. That could have
> unintended performance consequences?
True, the order is changed while these to-be-freed pages sit in this
temporary list, but they are then iterated and freed one by one from
head to tail, so the order in which they land in free_list is the same
as before the patch (and also the same as in the pcp list).

> Also maybe list_cut_position() could be faster than shuffling pages one
> by one? I guess not really, because batch_free will be generally low?

We would need to know where to cut if list_cut_position() were to be
used, and to find that out the list needs to be iterated first. I guess
that's too much trouble.

Since this part of the code is per-CPU (outside of zone->lock) and these
pages are in the pcp (meaning their cachelines are not likely to be
remote), I didn't worry too much about iterating instead of using
list_cut_position().

On the allocation side though, when manipulating the global free_list
under zone->lock, this is a big problem since pages there are freed from
different CPUs and the cache could be cold for the allocating CPU. That
is why I proposed clustered allocation some time ago as an RFC patch,
where list_cut_position() is so good that it can eliminate the cacheline
miss issue since we do not need to iterate cold pages one by one.

I wish there were a data structure with the flexibility of a list that
also let us locate the Nth element without iterating. That's what I was
looking for when developing clustered allocation for order-0 pages. In
the end, I had to use another place to record where the Nth element is.

I hope to send out v2 of that RFC series soon but I'm still collecting
data for it. I would appreciate it if people could take a look then :-)

batch_free's value depends on what the system is doing. When a user
application is making use of memory, the common case is that only the
MIGRATE_MOVABLE migratetype has pages to free, and then batch_free will
be 1 in the first round and (pcp->batch - 1) in the 2nd round.

Here is some data I collected recently on how often only the
MIGRATE_MOVABLE list has pages to free in free_pcppages_bulk():

On my desktop, after boot:
free_pcppages_bulk: 6268
single_mt_movable:  2566 (41%)

free_pcppages_bulk means the number of times this function gets called,
single_mt_movable means the number of times only the MIGRATE_MOVABLE
list has pages to free.

After kbuild with a distro kconfig:
free_pcppages_bulk: 9100508
single_mt_movable:  8435483 (92.75%)

If we change the initial value of migratetype in free_pcppages_bulk()
from 0 (MIGRATE_UNMOVABLE) to 1 (MIGRATE_MOVABLE), then batch_free will
be pcp->batch in the 1st round and we can save something, but the saving
is negligible when running a workload so I didn't send a patch for it
yet.
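Just to illustrate, the change would be something like this (untested,
and the added comment is only there to explain the idea):

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ static void free_pcppages_bulk(struct zone *zone, int count,
-	int migratetype = 0;
+	/*
+	 * The round-robin scan advances migratetype before picking a
+	 * list, so starting from MIGRATE_MOVABLE makes the scan wrap
+	 * through all MIGRATE_PCPTYPES lists and free the whole batch
+	 * in the first round when only the movable list has pages.
+	 */
+	int migratetype = MIGRATE_MOVABLE;
 	int batch_free = 0;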
> >  			pcp->count--;
> >
> > -			mt = get_pcppage_migratetype(page);
> > -			/* MIGRATE_ISOLATE page should not go to pcplists */
> > -			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> > -			/* Pageblock could have been isolated meanwhile */
> > -			if (unlikely(isolated_pageblocks))
> > -				mt = get_pageblock_migratetype(page);
> > -
> >  			if (bulkfree_pcp_prepare(page))
> >  				continue;
> >
> > -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > -			trace_mm_page_pcpu_drain(page, 0, mt);
> > +			list_add_tail(&page->lru, &head);
> >  		} while (--count && --batch_free && !list_empty(list));
> >  	}
> > +
> > +	spin_lock(&zone->lock);
> > +	isolated_pageblocks = has_isolate_pageblock(zone);
> > +
> > +	/*
> > +	 * Use safe version since after __free_one_page(),
> > +	 * page->lru.next will not point to original list.
> > +	 */
> > +	list_for_each_entry_safe(page, tmp, &head, lru) {
> > +		int mt = get_pcppage_migratetype(page);
> > +		/* MIGRATE_ISOLATE page should not go to pcplists */
> > +		VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> > +		/* Pageblock could have been isolated meanwhile */
> > +		if (unlikely(isolated_pageblocks))
> > +			mt = get_pageblock_migratetype(page);
> > +
> > +		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > +		trace_mm_page_pcpu_drain(page, 0, mt);
> > +	}
> >  	spin_unlock(&zone->lock);
> >  }
> >