Date: Mon, 5 Feb 2018 13:31:39 +0800
From: Aaron Lu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
    Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman, Daniel Jordan
Subject: [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress
Message-ID: <20180205053139.GC16980@intel.com>
References: <20180124023050.20097-1-aaron.lu@intel.com>
 <20180205053013.GB16980@intel.com>
In-Reply-To: <20180205053013.GB16980@intel.com>

Running the will-it-scale/page_fault1 process mode workload on a
2-socket Intel Skylake server showed severe zone->lock contention:
about 80% of CPU cycles (43% on the allocation path and 38% on the
free path) are burnt spinning. According to perf, the most
time-consuming part inside that lock on the free path is cache misses
on page structures, mostly on the to-be-freed page's buddy due to
merging.

One way to avoid this overhead is to not do any merging at all for
order-0 pages and leave the production of high-order pages to
compaction. With this approach, zone->lock contention on the free path
dropped to 4%, but the allocation side still shows contention as high
as 43%. Meanwhile, the contention saved on the free side does not
translate into a performance increase; instead, it is consumed by
increased contention on the per-node lru_lock (which rose from 2% to
33%).

One concern with this approach is its impact on high-order page
allocation, e.g. for order-9 pages. I ran the stress-highalloc
workload on a Haswell desktop (8 CPUs/4G memory) some time ago and it
showed the same success rate for the vanilla and patched kernels, both
at 74%, though the patched kernel did take longer to finish the test:
244s vs. 218s.

1 vanilla
  Attempted allocations:    1917
  Failed allocs:             494
  Success allocs:           1423
  % Success:                  74
  Duration alloctest pass:  218s

2 no_merge_in_buddy
  Attempted allocations:    1917
  Failed allocs:             497
  Success allocs:           1420
  % Success:                  74
  Duration alloctest pass:  244s

The above test was done with the default --ms-delay=100, i.e. a delay
of 100ms between page allocations. If I set the delay to 1ms, the
success rate with this patch drops to 36% while vanilla stays at about
70%. Though a 1ms delay may not be realistic, it does show the
possible impact of this patch on high-order page allocation.

The next patch deals with zone->lock contention on the allocation
path.
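To make the page->private encoding concrete before diving into the
diff, here is a minimal userspace sketch (my own illustration, not
kernel code; the struct and helper names merely mirror those in the
patch below) of how bit 16 marks an order-0 page that skipped merging
while page_order() still reads back the true order:

	#include <assert.h>
	#include <stdbool.h>
	#include <stdio.h>

	#define MERGE_SKIPPED (1UL << 16)   /* bit 16 of page->private, as in the patch */

	struct page {
		unsigned long private;      /* stands in for page_private(page) */
		bool buddy;                 /* stands in for PageBuddy(page) */
	};

	/* Mirrors the mm/internal.h hunk: mask the flag bit out of the order */
	static unsigned int page_order(const struct page *page)
	{
		return (unsigned int)(page->private & ~MERGE_SKIPPED);
	}

	/* Mirrors page_merge_skipped(): only meaningful for pages in the buddy */
	static bool page_merge_skipped(const struct page *page)
	{
		return page->buddy && (page->private & MERGE_SKIPPED);
	}

	int main(void)
	{
		/* free path: an order-0 page is queued unmerged with the flag set */
		struct page p = { .private = 0 | MERGE_SKIPPED, .buddy = true };

		assert(page_order(&p) == 0);    /* order still reads back as 0 */
		assert(page_merge_skipped(&p)); /* compaction can merge it later */
		printf("order=%u merge_skipped=%d\n",
		       page_order(&p), (int)page_merge_skipped(&p));
		return 0;
	}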
Suggested-by: Dave Hansen
Signed-off-by: Aaron Lu
---
 mm/compaction.c |  28 ++++++++++++++
 mm/internal.h   |  15 +++++++-
 mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++--------------------
 3 files changed, 117 insertions(+), 42 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 10cd757f1006..b53c4d420533 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -669,6 +669,28 @@ static bool too_many_isolated(struct zone *zone)
 	return isolated > (inactive + active) / 2;
 }
 
+static int merge_page(struct zone *zone, struct page *page, unsigned long pfn)
+{
+	int order = 0;
+	unsigned long buddy_pfn = __find_buddy_pfn(pfn, order);
+	struct page *buddy = page + (buddy_pfn - pfn);
+
+	/* Only do merging if the merge skipped page's buddy is also free */
+	if (PageBuddy(buddy)) {
+		int mt = get_pageblock_migratetype(page);
+		unsigned long flags;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		if (likely(page_merge_skipped(page))) {
+			do_merge(zone, page, mt);
+			order = page_order(page);
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+
+	return order;
+}
+
 /**
  * isolate_migratepages_block() - isolate all migrate-able pages within
  *				  a single pageblock
@@ -777,6 +799,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		if (PageBuddy(page)) {
 			unsigned long freepage_order = page_order_unsafe(page);
+			/*
+			 * If the page didn't do merging on free time, now do
+			 * it since we are doing compaction.
+			 */
+			if (page_merge_skipped(page))
+				freepage_order = merge_page(zone, page, low_pfn);
 
 			/*
 			 * Without lock, we cannot be sure that what we got is
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..d2b0ac02d459 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -228,9 +228,22 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 static inline unsigned int page_order(struct page *page)
 {
 	/* PageBuddy() must be checked by the caller */
-	return page_private(page);
+	return page_private(page) & ~(1 << 16);
 }
 
+/*
+ * This function returns if the page is in buddy but didn't do any merging
+ * for performance reason. This function only makes sense if PageBuddy(page)
+ * is also true. The caller should hold zone->lock for this function to return
+ * correct value, or it can handle invalid values gracefully.
+ */
+static inline bool page_merge_skipped(struct page *page)
+{
+	return PageBuddy(page) && (page->private & (1 << 16));
+}
+
+void do_merge(struct zone *zone, struct page *page, int migratetype);
+
 /*
  * Like page_order(), but for callers who cannot afford to hold the zone lock.
  * PageBuddy() should be checked first by the caller to minimize race window,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ac7fa97dd55..9497c8c5f808 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -781,49 +781,14 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
-/*
- * Freeing function for a buddy system allocator.
- *
- * The concept of a buddy system is to maintain direct-mapped table
- * (containing bit values) for memory blocks of various "orders".
- * The bottom level table contains the map for the smallest allocatable
- * units of memory (here, pages), and each level above it describes
- * pairs of units from the levels below, hence, "buddies".
- * At a high level, all that happens here is marking the table entry
- * at the bottom level available, and propagating the changes upward
- * as necessary, plus some accounting needed to play nicely with other
- * parts of the VM system.
- * At each level, we keep a list of pages, which are heads of continuous
- * free pages of length of (1 << order) and marked with _mapcount
- * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
- * field.
- * So when we are allocating or freeing one, we can derive the state of the
- * other. That is, if we allocate a small block, and both were
- * free, the remainder of the region must be split into blocks.
- * If a block is freed, and its buddy is also free, then this
- * triggers coalescing into a block of larger size.
- *
- * -- nyc
- */
-
-static inline void __free_one_page(struct page *page,
-		unsigned long pfn,
-		struct zone *zone, unsigned int order,
-		int migratetype)
+static void inline __do_merge(struct page *page, unsigned int order,
+			struct zone *zone, int migratetype)
 {
+	unsigned int max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
+	unsigned long pfn = page_to_pfn(page);
 	unsigned long combined_pfn;
 	unsigned long uninitialized_var(buddy_pfn);
 	struct page *buddy;
-	unsigned int max_order;
-
-	max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
-
-	VM_BUG_ON(!zone_is_initialized(zone));
-	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
-
-	VM_BUG_ON(migratetype == -1);
-	if (likely(!is_migrate_isolate(migratetype)))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -879,8 +844,6 @@ static inline void __free_one_page(struct page *page,
 	}
 
 done_merging:
-	set_page_order(page, order);
-
 	/*
 	 * If this is not the largest possible page, check if the buddy
 	 * of the next-highest order is free. If it is, it's possible
@@ -905,9 +868,80 @@ static inline void __free_one_page(struct page *page,
 		list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
 
 out:
+	set_page_order(page, order);
 	zone->free_area[order].nr_free++;
 }
 
+void do_merge(struct zone *zone, struct page *page, int migratetype)
+{
+	VM_BUG_ON(page_order(page) != 0);
+
+	list_del(&page->lru);
+	zone->free_area[0].nr_free--;
+	rmv_page_order(page);
+
+	__do_merge(page, 0, zone, migratetype);
+}
+
+static inline bool should_skip_merge(struct zone *zone, unsigned int order)
+{
+#ifdef CONFIG_COMPACTION
+	return !zone->compact_considered && !order;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Freeing function for a buddy system allocator.
+ *
+ * The concept of a buddy system is to maintain direct-mapped table
+ * (containing bit values) for memory blocks of various "orders".
+ * The bottom level table contains the map for the smallest allocatable
+ * units of memory (here, pages), and each level above it describes
+ * pairs of units from the levels below, hence, "buddies".
+ * At a high level, all that happens here is marking the table entry
+ * at the bottom level available, and propagating the changes upward
+ * as necessary, plus some accounting needed to play nicely with other
+ * parts of the VM system.
+ * At each level, we keep a list of pages, which are heads of continuous
+ * free pages of length of (1 << order) and marked with _mapcount
+ * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
+ * field.
+ * So when we are allocating or freeing one, we can derive the state of the
+ * other. That is, if we allocate a small block, and both were
+ * free, the remainder of the region must be split into blocks.
+ * If a block is freed, and its buddy is also free, then this
+ * triggers coalescing into a block of larger size.
+ *
+ * -- nyc
+ */
+static inline void __free_one_page(struct page *page,
+		unsigned long pfn,
+		struct zone *zone, unsigned int order,
+		int migratetype)
+{
+	VM_BUG_ON(!zone_is_initialized(zone));
+	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
+
+	VM_BUG_ON(migratetype == -1);
+	if (likely(!is_migrate_isolate(migratetype)))
+		__mod_zone_freepage_state(zone, 1 << order, migratetype);
+
+	if (should_skip_merge(zone, order)) {
+		list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+		/*
+		 * 1 << 16 set on page->private to indicate this order0
+		 * page skipped merging during free time
+		 */
+		set_page_order(page, order | (1 << 16));
+		zone->free_area[order].nr_free++;
+		return;
+	}
+
+	__do_merge(page, order, zone, migratetype);
+}
+
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
-- 
2.14.3
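
A closing note on the gate itself: as I read the should_skip_merge()
hunk above, order-0 frees only skip merging while the zone's
compact_considered counter is still zero, i.e. until compaction has
started being deferred on that zone. A tiny userspace model of just
that predicate (again my own sketch under that reading, not kernel
code):

	#include <stdbool.h>
	#include <stdio.h>

	struct zone {
		unsigned int compact_considered;    /* deferred-compaction counter */
	};

	/* Mirrors the CONFIG_COMPACTION branch of should_skip_merge() */
	static bool should_skip_merge(const struct zone *zone, unsigned int order)
	{
		return !zone->compact_considered && !order;
	}

	int main(void)
	{
		struct zone z = { .compact_considered = 0 };

		printf("%d\n", (int)should_skip_merge(&z, 0)); /* 1: defer the merge */
		printf("%d\n", (int)should_skip_merge(&z, 3)); /* 0: high-order frees still merge */

		z.compact_considered = 1;                      /* compaction has been deferred */
		printf("%d\n", (int)should_skip_merge(&z, 0)); /* 0: merge again to help compaction */
		return 0;
	}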