This series is meant to improve zone->lock scalability for order-0 pages.
With the will-it-scale/page_fault1 workload on a 2-socket Intel Skylake
server with 112 CPUs, the CPUs spend about 80% of their time spinning on
zone->lock. A perf profile shows that the most time-consuming part under
zone->lock is cache misses on "struct page", so this series tries to avoid
those cache misses.
v3:
v2 has been out for more than a month and I was asked to resend it.
While doing so, I rebased it to a newer kernel. So...
- Rebase to v4.17-rc4;
- Remove the useless "mt" param of add_to_buddy_common(), as pointed out
by Vlastimil Babka;
- Patch 4/5: optimize cluster operation on the free path for all possible
migrate types; the previous version only considered MOVABLE pages;
- Patch 5/5 is newly added to only disable cluster alloc and no-merge
while compaction is in progress. Previously, cluster alloc and no-merge
were disabled as long as the zone had seen compaction failures.
A branch is maintained here in case someone wants to give it a try:
https://github.com/aaronlu/linux zone_lock_rfc_v3
v2:
Patch 1/4 adds wrapper functions for adding/removing a page to/from
buddy and has no functional change.
Patch 2/4 skips merging for order-0 pages to avoid cache misses on the
buddy's "struct page". On a 2-socket Intel Skylake server, this has a very
good effect on the free path for will-it-scale/page_fault1 at full load:
it reduced zone->lock contention on the free path from 35% to 1.1%. It
also shows a good result for the parallel free(*) workload, reducing
zone->lock contention from 90% to almost zero (lru_lock contention rose
from almost 0 to 90%, though).
Patch 3/4 deals with allocation-path zone->lock contention by not touching
pages on the free_list one by one inside zone->lock. Together with patch
2/4, zone->lock contention is entirely eliminated for
will-it-scale/page_fault1 at full load, though this patch adds some
overhead to manage clusters on the free path, which hurts the parallel
free workload: its zone->lock contention increased from almost 0 to 25%.
Patch 4/4 optimizes the cluster operation on the free path. It reduces the
number of times add_to_cluster() has to be called and restores performance
for the parallel free workload by bringing zone->lock contention back down
to almost 0%.
The good thing about this patchset is that it eliminates zone->lock
contention for will-it-scale/page_fault1 and parallel free on big
servers (contention shifts to lru_lock). The bad things are:
- it adds some overhead to the compaction path, which now has to merge
the merge-skipped order-0 pages;
- it is unfriendly to high-order page allocation since order-0 pages are
no longer merged.
To see how much effect it has on compaction, mmtests/stress-highalloc is
used on a desktop machine with 8 CPUs and 4G memory.
(mmtests/stress-highalloc: make N copies of the kernel tree and start
building them to consume almost all memory with reclaimable file page
cache. This file page cache will not be returned to buddy, which
effectively makes it a worst case for a high-order page workload. Then,
after 5 minutes, start allocating X order-9 pages to see how well
compaction works.)
With a delay of 100ms between allocations:
kernel      success_rate   average_time_of_alloc_one_hugepage
base        58%            3.95927e+06 ns
patch2/4    58%            5.45935e+06 ns
patch4/4    57%            6.59174e+06 ns
With a delay of 1ms between allocations:
kernel      success_rate   average_time_of_alloc_one_hugepage
base        53%            3.17362e+06 ns
patch2/4    44%            2.31637e+06 ns
patch4/4    59%            2.73029e+06 ns
Comparing patch 4/4 with base, it performed OK I think. This is probably
because compaction is a heavy job anyway, so the added overhead does not
matter much.
To see how much effect it has on a workload that uses hugepages, I did
the following test on a 2-socket Intel Skylake server with 112 CPUs/64G
memory:
1 Break all high-order pages by starting a program that consumes almost
all memory with anonymous pages and then exits. This creates an extremely
bad case for this patchset compared to vanilla, which always merges;
2 Start 56 processes of will-it-scale/page_fault1 that use hugepages
through madvise(MADV_HUGEPAGE). To make things worse for this patchset,
start another 56 processes of will-it-scale/page_fault1 that use order-0
pages to continually cause trouble for the 56 THP users. Let them run
for 5 minutes; each worker is essentially the loop sketched below.
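For reference, a worker is conceptually the loop below. This is only a
minimal sketch of what a will-it-scale/page_fault1 process does, not its
actual source; MAP_SIZE and the function name are made up for
illustration, and only the THP users pass use_thp:

#include <string.h>
#include <sys/mman.h>

#define MAP_SIZE	(128UL * 1024 * 1024)

static void page_fault1_worker(int use_thp)
{
	for (;;) {
		char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			break;
		if (use_thp)
			madvise(p, MAP_SIZE, MADV_HUGEPAGE);
		/* fault in every page, then give the memory back */
		memset(p, 1, MAP_SIZE);
		munmap(p, MAP_SIZE);
	}
}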
Score result (higher is better):
kernel      order0             THP
base        1522246            10540254
patch2/4    5266247 (+246%)    3309816 (-69%)
patch4/4    2234073 (+47%)     9610295 (-8.8%)
TBH, I'm not sure whether the tests above are good enough to expose the
problems of this patchset, so if you have any thoughts on it, please feel
free to let me know. Thanks.
(*) Parallel free is a workload I use to see how fast a large VMA can be
freed in parallel. I tested it on a 4-socket Intel Skylake machine with
768G memory. The test program mmap()s 512G of anonymous memory and then
exits, and we measure how fast the exit is. The parallelism is implemented
inside the kernel and has been posted before:
http://lkml.kernel.org/r/[email protected]
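The userspace part of the test is essentially just the program below (a
minimal sketch under the setup described above; the exit time is measured
from outside, and the parallel freeing itself happens in the patched
kernel):

#include <string.h>
#include <sys/mman.h>

#define ANON_SIZE	(512UL << 30)	/* 512G of anonymous memory */

int main(void)
{
	char *p = mmap(NULL, ANON_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* populate the whole VMA so there really are 512G worth of pages */
	memset(p, 1, ANON_SIZE);
	/* exiting tears the VMA down; that teardown is what gets timed */
	return 0;
}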
A branch is maintained here in case someone wants to give it a try:
https://github.com/aaronlu/linux zone_lock_rfc_v2
v1 is here:
https://lkml.kernel.org/r/[email protected]
Aaron Lu (5):
mm/page_alloc: use helper functions to add/remove a page to/from buddy
mm/__free_one_page: skip merge for order-0 page unless compaction
failed
mm/rmqueue_bulk: alloc without touching individual page structure
mm/free_pcppages_bulk: reduce overhead of cluster operation on free
path
mm/can_skip_merge(): make it more aggressive to attempt cluster
alloc/free
include/linux/mm_types.h | 3 +
include/linux/mmzone.h | 35 ++++
mm/compaction.c | 17 +-
mm/internal.h | 57 ++++++
mm/page_alloc.c | 496 ++++++++++++++++++++++++++++++++++++++++++-----
5 files changed, 557 insertions(+), 51 deletions(-)
--
2.14.3
There are multiple places that add/remove a page to/from buddy;
introduce helper functions for them.
This also makes it easier to add code that needs to run whenever a page
is added to/removed from buddy.
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
mm/page_alloc.c | 65 ++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 39 insertions(+), 26 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d7962f..a92afa362e1f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,12 +697,41 @@ static inline void set_page_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
}
+static inline void add_to_buddy_common(struct page *page, struct zone *zone,
+ unsigned int order)
+{
+ set_page_order(page, order);
+ zone->free_area[order].nr_free++;
+}
+
+static inline void add_to_buddy_head(struct page *page, struct zone *zone,
+ unsigned int order, int mt)
+{
+ add_to_buddy_common(page, zone, order);
+ list_add(&page->lru, &zone->free_area[order].free_list[mt]);
+}
+
+static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
+ unsigned int order, int mt)
+{
+ add_to_buddy_common(page, zone, order);
+ list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
+}
+
static inline void rmv_page_order(struct page *page)
{
__ClearPageBuddy(page);
set_page_private(page, 0);
}
+static inline void remove_from_buddy(struct page *page, struct zone *zone,
+ unsigned int order)
+{
+ list_del(&page->lru);
+ zone->free_area[order].nr_free--;
+ rmv_page_order(page);
+}
+
/*
* This function checks whether a page is free && is the buddy
* we can do coalesce a page and its buddy if
@@ -806,13 +835,10 @@ static inline void __free_one_page(struct page *page,
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
*/
- if (page_is_guard(buddy)) {
+ if (page_is_guard(buddy))
clear_page_guard(zone, buddy, order, migratetype);
- } else {
- list_del(&buddy->lru);
- zone->free_area[order].nr_free--;
- rmv_page_order(buddy);
- }
+ else
+ remove_from_buddy(buddy, zone, order);
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
@@ -844,8 +870,6 @@ static inline void __free_one_page(struct page *page,
}
done_merging:
- set_page_order(page, order);
-
/*
* If this is not the largest possible page, check if the buddy
* of the next-highest order is free. If it is, it's possible
@@ -862,15 +886,12 @@ static inline void __free_one_page(struct page *page,
higher_buddy = higher_page + (buddy_pfn - combined_pfn);
if (pfn_valid_within(buddy_pfn) &&
page_is_buddy(higher_page, higher_buddy, order + 1)) {
- list_add_tail(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
- goto out;
+ add_to_buddy_tail(page, zone, order, migratetype);
+ return;
}
}
- list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
-out:
- zone->free_area[order].nr_free++;
+ add_to_buddy_head(page, zone, order, migratetype);
}
/*
@@ -1830,9 +1851,7 @@ static inline void expand(struct zone *zone, struct page *page,
if (set_page_guard(zone, &page[size], high, migratetype))
continue;
- list_add(&page[size].lru, &area->free_list[migratetype]);
- area->nr_free++;
- set_page_order(&page[size], high);
+ add_to_buddy_head(&page[size], zone, high, migratetype);
}
}
@@ -1976,9 +1995,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
struct page, lru);
if (!page)
continue;
- list_del(&page->lru);
- rmv_page_order(page);
- area->nr_free--;
+ remove_from_buddy(page, zone, current_order);
expand(zone, page, order, current_order, area, migratetype);
set_pcppage_migratetype(page, migratetype);
return page;
@@ -2896,9 +2913,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
}
/* Remove page from free list */
- list_del(&page->lru);
- zone->free_area[order].nr_free--;
- rmv_page_order(page);
+ remove_from_buddy(page, zone, order);
/*
* Set the pageblock if the isolated page is at least half of a
@@ -8032,9 +8047,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
pr_info("remove from free list %lx %d %lx\n",
pfn, 1 << order, end_pfn);
#endif
- list_del(&page->lru);
- rmv_page_order(page);
- zone->free_area[order].nr_free--;
+ remove_from_buddy(page, zone, order);
for (i = 0; i < (1 << order); i++)
SetPageReserved((page+i));
pfn += (1 << order);
--
2.14.3
After the system has run for a long time, it is easy for a zone to have
no suitable high-order page available, and in the current implementation
that stops cluster alloc and free because compact_considered > 0.
To favour order-0 alloc/free, relax the condition so that cluster
alloc/free is only disallowed when a problem could actually occur, i.e.
when compaction is in progress.
Signed-off-by: Aaron Lu <[email protected]>
---
mm/internal.h | 4 ----
1 file changed, 4 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index e3f209f8fb39..521aa4d8f3c1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -552,10 +552,6 @@ void try_to_merge_page(struct page *page);
#ifdef CONFIG_COMPACTION
static inline bool can_skip_merge(struct zone *zone, int order)
{
- /* Compaction has failed in this zone, we shouldn't skip merging */
- if (zone->compact_considered)
- return false;
-
/* Only consider no_merge for order 0 pages */
if (order)
return false;
--
2.14.3
After "no_merge for order 0", the biggest overhead in free path for
order 0 pages is now add_to_cluster(). As pages are freed one by one,
it caused frequent operation of add_to_cluster().
Ideally, if only one migratetype pcp list has pages to free and
count=pcp->batch in free_pcppages_bulk(), we can avoid calling
add_to_cluster() one time per page but adding them in one go as
a single cluster so this patch just did this.
This optimization brings zone->lock contention down from 25% to
almost zero again using the parallel free workload.
Signed-off-by: Aaron Lu <[email protected]>
---
mm/page_alloc.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64afb26064ed..33814ffda507 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1251,6 +1251,36 @@ static inline void prefetch_buddy(struct page *page)
prefetch(buddy);
}
+static inline bool free_cluster_pages(struct zone *zone, struct list_head *list,
+ int mt, int count)
+{
+ struct cluster *c;
+ struct page *page, *n;
+
+ if (!can_skip_merge(zone, 0))
+ return false;
+
+ if (count != this_cpu_ptr(zone->pageset)->pcp.batch)
+ return false;
+
+ c = new_cluster(zone, count, list_first_entry(list, struct page, lru));
+ if (unlikely(!c))
+ return false;
+
+ list_for_each_entry_safe(page, n, list, lru) {
+ set_page_order(page, 0);
+ set_page_merge_skipped(page);
+ page->cluster = c;
+ list_add(&page->lru, &zone->free_area[0].free_list[mt]);
+ }
+
+ INIT_LIST_HEAD(list);
+ zone->free_area[0].nr_free += count;
+ __mod_zone_page_state(zone, NR_FREE_PAGES, count);
+
+ return true;
+}
+
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
@@ -1265,10 +1295,10 @@ static inline void prefetch_buddy(struct page *page)
static void free_pcppages_bulk(struct zone *zone, int count,
struct per_cpu_pages *pcp)
{
- int migratetype = 0;
- int batch_free = 0;
+ int migratetype = 0, i, count_mt[MIGRATE_PCPTYPES] = {0};
+ int batch_free = 0, saved_count = count;
int prefetch_nr = 0;
- bool isolated_pageblocks;
+ bool isolated_pageblocks, single_mt = false;
struct page *page, *tmp;
LIST_HEAD(head);
@@ -1292,6 +1322,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* This is the only non-empty list. Free them all. */
if (batch_free == MIGRATE_PCPTYPES)
batch_free = count;
+ count_mt[migratetype] += batch_free;
do {
page = list_last_entry(list, struct page, lru);
@@ -1323,12 +1354,24 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (--count && --batch_free && !list_empty(list));
}
+ for (i = 0; i < MIGRATE_PCPTYPES; i++) {
+ if (count_mt[i] == saved_count) {
+ single_mt = true;
+ break;
+ }
+ }
+
spin_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);
+ if (!isolated_pageblocks && single_mt)
+ free_cluster_pages(zone, &head, migratetype, saved_count);
+
/*
* Use safe version since after __free_one_page(),
* page->lru.next will not point to original list.
+ *
+ * If free_cluster_pages() succeeds, head will be an empty list here.
*/
list_for_each_entry_safe(page, tmp, &head, lru) {
int mt = get_pcppage_migratetype(page);
--
2.14.3
A profile on an Intel Skylake server shows that the most time-consuming
part under zone->lock on the allocation path is accessing the
to-be-returned pages' "struct page" on the free_list inside zone->lock.
One explanation is that different CPUs release pages to the head of the
free_list, so those pages' "struct page" may very well be cache cold for
the allocating CPU when it grabs them from the free_list's head. The
purpose here is to avoid touching these pages one by one inside
zone->lock.
The idea is to take the requested number of pages off the free_list with
something like list_cut_position(), adjust free_area's nr_free
accordingly inside zone->lock, and do the other operations, like clearing
the PageBuddy flag of these pages, outside of zone->lock.
list_cut_position() needs to know where to cut; that is what the new
'struct cluster' provides. Every page on the order-0 free_list belongs to
a cluster, so when a number of pages is needed, the cluster of the
free_list's head page is looked up and its tail page located. With the
tail page, list_cut_position() can drop the whole cluster off the
free_list. 'struct cluster' also records 'nr', the number of pages in the
cluster, so free_area's nr_free can be adjusted inside the lock too.
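Put together, the allocation side conceptually boils down to the helper
below. This is only a condensed sketch of the idea (the helper name is
made up; the real code is rmqueue_bulk_cluster()/__rmqueue_bulk_cluster()
in the diff below):

/*
 * Grab one whole cluster from the head of the order-0 free_list.
 * Called with zone->lock held; per-page work (clearing PageBuddy etc.)
 * is left to the caller, outside the lock.
 */
static int take_one_cluster(struct zone *zone, int mt, struct list_head *out)
{
	struct list_head *head = &zone->free_area[0].free_list[mt];
	struct page *first = list_first_entry_or_null(head, struct page, lru);
	struct cluster *c;

	if (!first || !first->cluster)
		return 0;

	c = first->cluster;
	/* cut off everything from the list head up to and including c->tail */
	list_cut_position(out, head, &c->tail->lru);
	zone->free_area[0].nr_free -= c->nr;
	return c->nr;
}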
Deferring the per-page work opens a race window though: from the moment
zone->lock is dropped until these pages' PageBuddy flags are cleared,
the pages are not in buddy but still have PageBuddy set.
This doesn't cause problems for users that access buddy pages through the
free_list. But there are other users, like move_freepages(), which moves
a pageblock's pages from one migratetype to another on the fallback
allocation path and tests the PageBuddy flag of pages derived from a PFN.
The end result could be that pages in the race window get moved back to
the free_list of another migratetype. For this reason, a synchronization
function, zone_wait_cluster_alloc(), is introduced to wait until all such
pages are in the correct state. It is meant to be called with zone->lock
held, so after it returns we do not need to worry about new pages
entering the racy state.
Another user is compaction, which scans a pageblock for migratable
candidates. In that process, pages derived from a PFN are checked for the
PageBuddy flag to decide whether they are merge-skipped pages.
To avoid a racy page getting merged back into buddy, the
zone_wait_and_disable_cluster_alloc() function is introduced to:
1 disable clustered allocation by increasing zone->cluster.disable_depth;
2 wait until the race window has passed by calling zone_wait_cluster_alloc().
This function is also meant to be called with zone->lock held, so after
it returns, all pages are in the correct state and no more cluster allocs
will be attempted until zone_enable_cluster_alloc() is called to decrease
zone->cluster.disable_depth.
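As a usage illustration only (the caller below is hypothetical; the real
users in this series are compaction and alloc_contig_range(), see the
diffs):

/* Hypothetical user that tests PageBuddy on PFN-derived pages. */
static void scan_pfn_range_safely(struct zone *zone)
{
	/* stop new cluster allocs and wait out the racy PageBuddy window */
	zone_wait_and_disable_cluster_alloc(zone);

	/* ... PageBuddy() checks on PFN-derived pages are reliable here ... */

	/* let order-0 cluster allocation resume */
	zone_enable_cluster_alloc(zone);
}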
Together, the two patches eliminate zone->lock contention entirely, but
at the same time pgdat->lru_lock contention rises to 82%. The final
performance increase is about 8.3%.
Suggested-by: Ying Huang <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
include/linux/mm_types.h | 2 +
include/linux/mmzone.h | 35 ++++++
mm/compaction.c | 4 +
mm/internal.h | 34 ++++++
mm/page_alloc.c | 288 +++++++++++++++++++++++++++++++++++++++++++++--
5 files changed, 355 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 17c5604e6ec0..9eb19448424a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -86,6 +86,8 @@ struct page {
void *s_mem; /* slab first object */
atomic_t compound_mapcount; /* first tail page */
/* page_deferred_list().next -- second tail page */
+
+ struct cluster *cluster; /* order 0 cluster this page belongs to */
};
/* Second double word */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..aed70d887648 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -356,6 +356,40 @@ enum zone_type {
#ifndef __GENERATING_BOUNDS_H
+struct cluster {
+ struct page *tail; /* tail page of the cluster */
+ int nr; /* how many pages are in this cluster */
+};
+
+struct order0_cluster {
+ /* order 0 cluster array, dynamically allocated */
+ struct cluster *array;
+ /*
+ * order 0 cluster array length, also used to indicate if cluster
+ * allocation is enabled for this zone(cluster allocation is disabled
+ * for small zones whose batch size is smaller than 1, like DMA zone)
+ */
+ int len;
+ /*
+ * smallest position from where we search for an
+ * empty cluster from the cluster array
+ */
+ int zero_bit;
+ /* bitmap used to quickly locate an empty cluster from cluster array */
+ unsigned long *bitmap;
+
+ /* disable cluster allocation to avoid new pages becoming racy state. */
+ unsigned long disable_depth;
+
+ /*
+ * used to indicate if there are pages allocated in cluster mode
+ * still in racy state. Caller with zone->lock held could use helper
+ * function zone_wait_cluster_alloc() to wait all such pages to exit
+ * the race window.
+ */
+ atomic_t in_progress;
+};
+
struct zone {
/* Read-mostly fields */
@@ -460,6 +494,7 @@ struct zone {
/* free areas of different sizes */
struct free_area free_area[MAX_ORDER];
+ struct order0_cluster cluster;
/* zone flags, see below */
unsigned long flags;
diff --git a/mm/compaction.c b/mm/compaction.c
index 004416312092..4714da6a4938 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1599,6 +1599,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
migrate_prep_local();
+ zone_wait_and_disable_cluster_alloc(zone);
+
while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
int err;
@@ -1697,6 +1699,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
zone->compact_cached_free_pfn = free_pfn;
}
+ zone_enable_cluster_alloc(zone);
+
count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
diff --git a/mm/internal.h b/mm/internal.h
index eeec12740dc2..e3f209f8fb39 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -560,12 +560,46 @@ static inline bool can_skip_merge(struct zone *zone, int order)
if (order)
return false;
+ /*
+ * Clustered allocation is only disabled when high-order pages
+ * are needed, e.g. in compaction and CMA alloc, so we should
+ * also skip merging in that case.
+ */
+ if (zone->cluster.disable_depth)
+ return false;
+
return true;
}
+
+static inline void zone_wait_cluster_alloc(struct zone *zone)
+{
+ while (atomic_read(&zone->cluster.in_progress))
+ cpu_relax();
+}
+
+static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&zone->lock, flags);
+ zone->cluster.disable_depth++;
+ zone_wait_cluster_alloc(zone);
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static inline void zone_enable_cluster_alloc(struct zone *zone)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&zone->lock, flags);
+ zone->cluster.disable_depth--;
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
#else /* CONFIG_COMPACTION */
static inline bool can_skip_merge(struct zone *zone, int order)
{
return false;
}
+static inline void zone_wait_cluster_alloc(struct zone *zone) {}
+static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone) {}
+static inline void zone_enable_cluster_alloc(struct zone *zone) {}
#endif /* CONFIG_COMPACTION */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0a7988d9935d..64afb26064ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -707,6 +707,82 @@ static inline void set_page_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
}
+static inline struct cluster *new_cluster(struct zone *zone, int nr,
+ struct page *tail)
+{
+ struct order0_cluster *cluster = &zone->cluster;
+ int n = find_next_zero_bit(cluster->bitmap, cluster->len, cluster->zero_bit);
+ if (n == cluster->len) {
+ printk_ratelimited("node%d zone %s cluster used up\n",
+ zone->zone_pgdat->node_id, zone->name);
+ return NULL;
+ }
+ cluster->zero_bit = n;
+ set_bit(n, cluster->bitmap);
+ cluster->array[n].nr = nr;
+ cluster->array[n].tail = tail;
+ return &cluster->array[n];
+}
+
+static inline struct cluster *add_to_cluster_common(struct page *page,
+ struct zone *zone, struct page *neighbor)
+{
+ struct cluster *c;
+
+ if (neighbor) {
+ int batch = this_cpu_ptr(zone->pageset)->pcp.batch;
+ c = neighbor->cluster;
+ if (c && c->nr < batch) {
+ page->cluster = c;
+ c->nr++;
+ return c;
+ }
+ }
+
+ c = new_cluster(zone, 1, page);
+ if (unlikely(!c))
+ return NULL;
+
+ page->cluster = c;
+ return c;
+}
+
+/*
+ * Add this page to the cluster where the previous head page belongs.
+ * Called after page is added to free_list(and becoming the new head).
+ */
+static inline void add_to_cluster_head(struct page *page, struct zone *zone,
+ int order, int mt)
+{
+ struct page *neighbor;
+
+ if (order || !zone->cluster.len)
+ return;
+
+ neighbor = page->lru.next == &zone->free_area[0].free_list[mt] ?
+ NULL : list_entry(page->lru.next, struct page, lru);
+ add_to_cluster_common(page, zone, neighbor);
+}
+
+/*
+ * Add this page to the cluster where the previous tail page belongs.
+ * Called after page is added to free_list(and becoming the new tail).
+ */
+static inline void add_to_cluster_tail(struct page *page, struct zone *zone,
+ int order, int mt)
+{
+ struct page *neighbor;
+ struct cluster *c;
+
+ if (order || !zone->cluster.len)
+ return;
+
+ neighbor = page->lru.prev == &zone->free_area[0].free_list[mt] ?
+ NULL : list_entry(page->lru.prev, struct page, lru);
+ c = add_to_cluster_common(page, zone, neighbor);
+ c->tail = page;
+}
+
static inline void add_to_buddy_common(struct page *page, struct zone *zone,
unsigned int order)
{
@@ -726,6 +802,7 @@ static inline void add_to_buddy_head(struct page *page, struct zone *zone,
{
add_to_buddy_common(page, zone, order);
list_add(&page->lru, &zone->free_area[order].free_list[mt]);
+ add_to_cluster_head(page, zone, order, mt);
}
static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
@@ -733,6 +810,7 @@ static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
{
add_to_buddy_common(page, zone, order);
list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
+ add_to_cluster_tail(page, zone, order, mt);
}
static inline void rmv_page_order(struct page *page)
@@ -741,9 +819,29 @@ static inline void rmv_page_order(struct page *page)
set_page_private(page, 0);
}
+/* called before removed from free_list */
+static inline void remove_from_cluster(struct page *page, struct zone *zone)
+{
+ struct cluster *c = page->cluster;
+ if (!c)
+ return;
+
+ page->cluster = NULL;
+ c->nr--;
+ if (!c->nr) {
+ int bit = c - zone->cluster.array;
+ c->tail = NULL;
+ clear_bit(bit, zone->cluster.bitmap);
+ if (bit < zone->cluster.zero_bit)
+ zone->cluster.zero_bit = bit;
+ } else if (page == c->tail)
+ c->tail = list_entry(page->lru.prev, struct page, lru);
+}
+
static inline void remove_from_buddy(struct page *page, struct zone *zone,
unsigned int order)
{
+ remove_from_cluster(page, zone);
list_del(&page->lru);
zone->free_area[order].nr_free--;
rmv_page_order(page);
@@ -2129,6 +2227,17 @@ static int move_freepages(struct zone *zone,
if (num_movable)
*num_movable = 0;
+ /*
+ * Cluster alloced pages may have their PageBuddy flag unclear yet
+ * after dropping zone->lock in rmqueue_bulk() and steal here could
+ * move them back to free_list. So it's necessary to wait till all
+ * those pages have their flags properly cleared.
+ *
+ * We do not need to disable cluster alloc though since we already
+ * held zone->lock and no allocation could happen.
+ */
+ zone_wait_cluster_alloc(zone);
+
for (page = start_page; page <= end_page;) {
if (!pfn_valid_within(page_to_pfn(page))) {
page++;
@@ -2153,8 +2262,10 @@ static int move_freepages(struct zone *zone,
}
order = page_order(page);
+ remove_from_cluster(page, zone);
list_move(&page->lru,
&zone->free_area[order].free_list[migratetype]);
+ add_to_cluster_head(page, zone, order, migratetype);
page += 1 << order;
pages_moved += 1 << order;
}
@@ -2303,7 +2414,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
single_page:
area = &zone->free_area[current_order];
+ remove_from_cluster(page, zone);
list_move(&page->lru, &area->free_list[start_type]);
+ add_to_cluster_head(page, zone, current_order, start_type);
}
/*
@@ -2564,6 +2677,145 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
return page;
}
+static int __init zone_order0_cluster_init(void)
+{
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ int len, mt, batch;
+ unsigned long flags;
+ struct order0_cluster *cluster;
+
+ if (!managed_zone(zone))
+ continue;
+
+ /* no need to enable cluster allocation for batch<=1 zone */
+ preempt_disable();
+ batch = this_cpu_ptr(zone->pageset)->pcp.batch;
+ preempt_enable();
+ if (batch <= 1)
+ continue;
+
+ cluster = &zone->cluster;
+ /* FIXME: possible overflow of int type */
+ len = DIV_ROUND_UP(zone->managed_pages, batch);
+ cluster->array = vzalloc(len * sizeof(struct cluster));
+ if (!cluster->array)
+ return -ENOMEM;
+ cluster->bitmap = vzalloc(DIV_ROUND_UP(len, BITS_PER_LONG) *
+ sizeof(unsigned long));
+ if (!cluster->bitmap)
+ return -ENOMEM;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ cluster->len = len;
+ for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+ struct page *page;
+ list_for_each_entry_reverse(page,
+ &zone->free_area[0].free_list[mt], lru)
+ add_to_cluster_head(page, zone, 0, mt);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+
+ return 0;
+}
+subsys_initcall(zone_order0_cluster_init);
+
+static inline int __rmqueue_bulk_cluster(struct zone *zone, unsigned long count,
+ struct list_head *list, int mt)
+{
+ struct list_head *head = &zone->free_area[0].free_list[mt];
+ int nr = 0;
+
+ while (nr < count) {
+ struct page *head_page;
+ struct list_head *tail, tmp_list;
+ struct cluster *c;
+ int bit;
+
+ head_page = list_first_entry_or_null(head, struct page, lru);
+ if (!head_page || !head_page->cluster)
+ break;
+
+ c = head_page->cluster;
+ tail = &c->tail->lru;
+
+ /* drop the cluster off free_list and attach to list */
+ list_cut_position(&tmp_list, head, tail);
+ list_splice_tail(&tmp_list, list);
+
+ nr += c->nr;
+ zone->free_area[0].nr_free -= c->nr;
+
+ /* this cluster is empty now */
+ c->tail = NULL;
+ c->nr = 0;
+ bit = c - zone->cluster.array;
+ clear_bit(bit, zone->cluster.bitmap);
+ if (bit < zone->cluster.zero_bit)
+ zone->cluster.zero_bit = bit;
+ }
+
+ return nr;
+}
+
+static inline int rmqueue_bulk_cluster(struct zone *zone, unsigned int order,
+ unsigned long count, struct list_head *list,
+ int migratetype)
+{
+ int alloced;
+ struct page *page;
+
+ /*
+ * Cluster alloc races with merging so don't try cluster alloc when we
+ * can't skip merging. Note that can_skip_merge() keeps the same return
+ * value from here till all pages have their flags properly processed,
+ * i.e. the end of the function where in_progress is incremented, even
+ * we have dropped the lock in the middle because the only place that
+ * can change can_skip_merge()'s return value is compaction code and
+ * compaction needs to wait on in_progress.
+ */
+ if (!can_skip_merge(zone, 0))
+ return 0;
+
+ /* Cluster alloc is disabled, mostly compaction is already in progress */
+ if (zone->cluster.disable_depth)
+ return 0;
+
+ /* Cluster alloc is disabled for this zone */
+ if (unlikely(!zone->cluster.len))
+ return 0;
+
+ alloced = __rmqueue_bulk_cluster(zone, count, list, migratetype);
+ if (!alloced)
+ return 0;
+
+ /*
+ * Cache miss on page structure could slow things down
+ * dramatically so accessing these alloced pages without
+ * holding lock for better performance.
+ *
+ * Since these pages still have PageBuddy set, there is a race
+ * window between now and when PageBuddy is cleared for them
+ * below. Any operation that would scan a pageblock and check
+ * PageBuddy(page), e.g. compaction, will need to wait till all
+ * such pages are properly processed. in_progress is used for
+ * such purpose so increase it now before dropping the lock.
+ */
+ atomic_inc(&zone->cluster.in_progress);
+ spin_unlock(&zone->lock);
+
+ list_for_each_entry(page, list, lru) {
+ rmv_page_order(page);
+ page->cluster = NULL;
+ set_pcppage_migratetype(page, migratetype);
+ }
+ atomic_dec(&zone->cluster.in_progress);
+
+ return alloced;
+}
+
/*
* Obtain a specified number of elements from the buddy allocator, all under
* a single hold of the lock, for efficiency. Add them to the supplied list.
@@ -2573,17 +2825,23 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long count, struct list_head *list,
int migratetype)
{
- int i, alloced = 0;
+ int i, alloced;
+ struct page *page, *tmp;
spin_lock(&zone->lock);
- for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order, migratetype);
+ alloced = rmqueue_bulk_cluster(zone, order, count, list, migratetype);
+ if (alloced > 0) {
+ if (alloced >= count)
+ goto out;
+ else
+ spin_lock(&zone->lock);
+ }
+
+ for (; alloced < count; alloced++) {
+ page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
- if (unlikely(check_pcp_refill(page)))
- continue;
-
/*
* Split buddy pages returned by expand() are received here in
* physical page order. The page is added to the tail of
@@ -2595,7 +2853,18 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages are ordered properly.
*/
list_add_tail(&page->lru, list);
- alloced++;
+ }
+ spin_unlock(&zone->lock);
+
+out:
+ i = alloced;
+ list_for_each_entry_safe(page, tmp, list, lru) {
+ if (unlikely(check_pcp_refill(page))) {
+ list_del(&page->lru);
+ alloced--;
+ continue;
+ }
+
if (is_migrate_cma(get_pcppage_migratetype(page)))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-(1 << order));
@@ -2608,7 +2877,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages added to the pcp list.
*/
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
- spin_unlock(&zone->lock);
return alloced;
}
@@ -7893,6 +8161,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
unsigned long outer_start, outer_end;
unsigned int order;
int ret = 0;
+ struct zone *zone = page_zone(pfn_to_page(start));
struct compact_control cc = {
.nr_migratepages = 0,
@@ -7935,6 +8204,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
if (ret)
return ret;
+ zone_wait_and_disable_cluster_alloc(zone);
/*
* In case of -EBUSY, we'd like to know which page causes problem.
* So, just fall through. test_pages_isolated() has a tracepoint
@@ -8017,6 +8287,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
done:
undo_isolate_page_range(pfn_max_align_down(start),
pfn_max_align_up(end), migratetype);
+
+ zone_enable_cluster_alloc(zone);
return ret;
}
--
2.14.3
Running the will-it-scale/page_fault1 process-mode workload on a
2-socket Intel Skylake server shows severe zone->lock contention: as much
as about 80% of CPU cycles (42% on the allocation path and 35% on the
free path) are burnt spinning. According to perf, the most time-consuming
part inside that lock on the free path is cache misses on page
structures, mostly on the to-be-freed page's buddy due to merging.
One way to avoid this overhead is to not do any merging at all for
order-0 pages. With this approach, zone->lock contention on the free path
drops to 1.1%, but the allocation side still has as much as 42% lock
contention. Meanwhile, the reduced lock contention on the free side does
not translate into a performance increase; instead, it is consumed by
increased contention on the per-node lru_lock (which rose from 5% to 37%)
and the final performance drops slightly, by about 1%.
Though performance drops a little, this almost eliminates zone->lock
contention on the free path and is the foundation for the next patch,
which eliminates zone->lock contention on the allocation path.
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
include/linux/mm_types.h | 1 +
mm/compaction.c | 13 ++++++-
mm/internal.h | 27 ++++++++++++++
mm/page_alloc.c | 94 +++++++++++++++++++++++++++++++++++++++++-------
4 files changed, 121 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 21612347d311..17c5604e6ec0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -93,6 +93,7 @@ struct page {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* sl[aou]b first free object */
/* page_deferred_list().prev -- second tail page */
+ bool buddy_merge_skipped; /* skipped merging when added to buddy */
};
union {
diff --git a/mm/compaction.c b/mm/compaction.c
index 028b7210a669..004416312092 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -777,8 +777,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
* potential isolation targets.
*/
if (PageBuddy(page)) {
- unsigned long freepage_order = page_order_unsafe(page);
+ unsigned long freepage_order;
+ /*
+ * If this is a merge_skipped page, do merge now
+ * since high-order pages are needed. zone lock
+ * isn't taken for the merge_skipped check so the
+ * check could be wrong but the worst case is we
+ * lose a merge opportunity.
+ */
+ if (page_merge_was_skipped(page))
+ try_to_merge_page(page);
+
+ freepage_order = page_order_unsafe(page);
/*
* Without lock, we cannot be sure that what we got is
* a valid page order. Consider only values in the
diff --git a/mm/internal.h b/mm/internal.h
index 62d8c34e63d5..eeec12740dc2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -541,4 +541,31 @@ static inline bool is_migrate_highatomic_page(struct page *page)
void setup_zone_pageset(struct zone *zone);
extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+
+static inline bool page_merge_was_skipped(struct page *page)
+{
+ return page->buddy_merge_skipped;
+}
+
+void try_to_merge_page(struct page *page);
+
+#ifdef CONFIG_COMPACTION
+static inline bool can_skip_merge(struct zone *zone, int order)
+{
+ /* Compaction has failed in this zone, we shouldn't skip merging */
+ if (zone->compact_considered)
+ return false;
+
+ /* Only consider no_merge for order 0 pages */
+ if (order)
+ return false;
+
+ return true;
+}
+#else /* CONFIG_COMPACTION */
+static inline bool can_skip_merge(struct zone *zone, int order)
+{
+ return false;
+}
+#endif /* CONFIG_COMPACTION */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a92afa362e1f..0a7988d9935d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -691,6 +691,16 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype) {}
#endif
+static inline void set_page_merge_skipped(struct page *page)
+{
+ page->buddy_merge_skipped = true;
+}
+
+static inline void clear_page_merge_skipped(struct page *page)
+{
+ page->buddy_merge_skipped = false;
+}
+
static inline void set_page_order(struct page *page, unsigned int order)
{
set_page_private(page, order);
@@ -700,6 +710,13 @@ static inline void set_page_order(struct page *page, unsigned int order)
static inline void add_to_buddy_common(struct page *page, struct zone *zone,
unsigned int order)
{
+ /*
+ * Always clear buddy_merge_skipped when added to buddy because
+ * buddy_merge_skipped shares space with index and index could
+ * be used as migratetype for PCP pages.
+ */
+ clear_page_merge_skipped(page);
+
set_page_order(page, order);
zone->free_area[order].nr_free++;
}
@@ -730,6 +747,7 @@ static inline void remove_from_buddy(struct page *page, struct zone *zone,
list_del(&page->lru);
zone->free_area[order].nr_free--;
rmv_page_order(page);
+ clear_page_merge_skipped(page);
}
/*
@@ -800,7 +818,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
* -- nyc
*/
-static inline void __free_one_page(struct page *page,
+static inline void do_merge(struct page *page,
unsigned long pfn,
struct zone *zone, unsigned int order,
int migratetype)
@@ -812,16 +830,6 @@ static inline void __free_one_page(struct page *page,
max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
- VM_BUG_ON(!zone_is_initialized(zone));
- VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
-
- VM_BUG_ON(migratetype == -1);
- if (likely(!is_migrate_isolate(migratetype)))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
-
- VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
- VM_BUG_ON_PAGE(bad_range(zone, page), page);
-
continue_merging:
while (order < max_order - 1) {
buddy_pfn = __find_buddy_pfn(pfn, order);
@@ -894,6 +902,61 @@ static inline void __free_one_page(struct page *page,
add_to_buddy_head(page, zone, order, migratetype);
}
+void try_to_merge_page(struct page *page)
+{
+ unsigned long pfn, buddy_pfn, flags;
+ struct page *buddy;
+ struct zone *zone;
+
+ /*
+ * No need to do merging if buddy is not free.
+ * zone lock isn't taken so this could be wrong but worst case
+ * is we lose a merge opportunity.
+ */
+ pfn = page_to_pfn(page);
+ buddy_pfn = __find_buddy_pfn(pfn, 0);
+ buddy = page + (buddy_pfn - pfn);
+ if (!PageBuddy(buddy))
+ return;
+
+ zone = page_zone(page);
+ spin_lock_irqsave(&zone->lock, flags);
+ /* Verify again after taking the lock */
+ if (likely(PageBuddy(page) && page_merge_was_skipped(page) &&
+ PageBuddy(buddy))) {
+ int mt = get_pageblock_migratetype(page);
+
+ remove_from_buddy(page, zone, 0);
+ do_merge(page, pfn, zone, 0, mt);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static inline void __free_one_page(struct page *page,
+ unsigned long pfn,
+ struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ VM_BUG_ON(!zone_is_initialized(zone));
+ VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
+
+ VM_BUG_ON(migratetype == -1);
+ if (likely(!is_migrate_isolate(migratetype)))
+ __mod_zone_freepage_state(zone, 1 << order, migratetype);
+
+ VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
+ VM_BUG_ON_PAGE(bad_range(zone, page), page);
+
+ if (can_skip_merge(zone, order)) {
+ add_to_buddy_head(page, zone, 0, migratetype);
+ set_page_merge_skipped(page);
+ return;
+ }
+
+ do_merge(page, pfn, zone, order, migratetype);
+}
+
+
/*
* A bad page could be due to a number of fields. Instead of multiple branches,
* try and check multiple fields with one check. The caller must do a detailed
@@ -1151,9 +1214,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
* can be offset by reduced memory latency later. To
* avoid excessive prefetching due to large count, only
* prefetch buddy for the first pcp->batch nr of pages.
+ *
+ * If merge can be skipped, no need to prefetch buddy.
*/
- if (prefetch_nr++ < pcp->batch)
- prefetch_buddy(page);
+ if (can_skip_merge(zone, 0) || prefetch_nr > pcp->batch)
+ continue;
+
+ prefetch_buddy(page);
+ prefetch_nr++;
} while (--count && --batch_free && !list_empty(list));
}
--
2.14.3
On Wed, May 09, 2018 at 04:54:46PM +0800, Aaron Lu wrote:
> +static inline void add_to_buddy_head(struct page *page, struct zone *zone,
> + unsigned int order, int mt)
> +{
> + add_to_buddy_common(page, zone, order);
> + list_add(&page->lru, &zone->free_area[order].free_list[mt]);
> +}
Isn't this function (and all of its friends) misnamed? We're not adding
this page to the buddy allocator, we're adding it to the freelist. It
doesn't go to the buddy allocator until later, if at all.
On Thu, May 17, 2018 at 04:48:21AM -0700, Matthew Wilcox wrote:
> On Wed, May 09, 2018 at 04:54:46PM +0800, Aaron Lu wrote:
> > +static inline void add_to_buddy_head(struct page *page, struct zone *zone,
> > + unsigned int order, int mt)
> > +{
> > + add_to_buddy_common(page, zone, order);
> > + list_add(&page->lru, &zone->free_area[order].free_list[mt]);
> > +}
>
> Isn't this function (and all of its friends) misnamed? We're not adding
> this page to the buddy allocator, we're adding it to the freelist. It
> doesn't go to the buddy allocator until later, if at all.
No, never mind, I misunderstood. Ignore this please.