At this year's MM summit[0], Daniel Jordan and others proposed an innovative
technique that lets multiple threads concurrently do list_del() at any
position of a list and list_add() at the list head, all without taking a lock.
This technique may be useful for improving zone lock scalability, so here is
my attempt. This series is based on Daniel Jordan's most recent patchset[1];
to keep it self contained, 2 of his patches are included here.
Scalability is best when multiple threads operate on different positions of
the list. Since the free path accesses (buddy) pages at random positions of
the free list during merging, it is a good fit for this technique. This
patchset makes the free path run concurrently.
Patch 1 is for testing purposes only: it takes the LRU lock out of the
picture so we can get a better idea of how much improvement this patchset
brings to the zone lock.
Patches 2-3 are Daniel's work implementing concurrent list_del() and
list_add(); the new APIs are called smp_list_del() and smp_list_splice().
Patches 4-7 make the free path run concurrently by converting the zone lock
from a spinlock to a rwlock and having the free path take the zone lock in
read mode. To avoid complexity and problems, all other code paths take the
zone lock in write mode (a usage sketch follows the patch descriptions below).
Patch 8 is an optimization that reduces accesses to the free list head to
avoid severe cache bouncing. It comes with a side effect: with this patch,
some mergeable pages will be left unmerged in Buddy.
Patch 9 mitigates the fragmentation introduced by patch 8 by pre-merging
pages before they are sent to be merged under the zone lock.
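
As a rough illustration of the locking convention used by patches 2-7 (this
is a hand-written sketch, not code from the patches; the example_* function
names are invented and irq handling plus statistics updates are omitted),
the free path takes the zone lock as reader and uses the new smp_list_*
helpers, while every other path takes it as writer and keeps using the plain
list API:

	/* free path: may run on many CPUs at the same time */
	static void example_free(struct zone *zone, struct page *page,
				 struct page *buddy, unsigned int order, int mt)
	{
		read_lock(&zone->lock);
		/* a buddy may be unlinked from any position of the free list ... */
		smp_list_del(&buddy->lru);
		/* ... while the (merged) page is always added at the list head */
		smp_list_add(&page->lru, &zone->free_area[order].free_list[mt]);
		read_unlock(&zone->lock);
	}

	/* allocation and every other path: exclusive, plain list API is fine */
	static struct page *example_alloc(struct zone *zone, unsigned int order, int mt)
	{
		struct page *page;

		write_lock(&zone->lock);
		page = list_first_entry_or_null(&zone->free_area[order].free_list[mt],
						struct page, lru);
		if (page)
			list_del(&page->lru);
		write_unlock(&zone->lock);
		return page;
	}
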
This patchset is based on v4.19-rc2.
Performance wise, on a 2-socket Intel Skylake server with 56 cores/112
threads, using will-it-scale/page_fault1 in process mode (higher is better):
kernel                    performance         zone lock contention
patch1                       9219349          76.99%
patch7                       2461133 -73.3%   54.46% (another 34.66% on smp_list_add())
patch8                      11712766 +27.0%   68.14%
patch9                      11386980 +23.5%   67.18%
Though lock contention dropped a lot with patch 7, performance dropped
considerably due to severe cache bouncing on the free list head among the
multiple threads freeing pages at the same time, because every page free
needs to add the page to the free list head.
Patch 8 is meant to solve this cache bouncing problem and gives a good
result, except for the above mentioned side effect of leaving mergeable
pages unmerged in Buddy. Patch 9 reduces that fragmentation to some extent
at the cost of a slight performance drop.
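
To make the difference concrete, here is a simplified sketch (hand-written
for illustration only, restricted to order-0 pages of a single migratetype,
with the example_* names invented and the PageBuddy/nr_free bookkeeping left
out): with patch 7 every freed page updates the shared free list head, while
with patch 8 a locally collected batch is published with a single
smp_list_splice():

	/* patch 7 style: one shared-head update per freed page */
	static void example_free_per_page(struct zone *zone,
					  struct list_head *to_free, int mt)
	{
		struct page *page, *tmp;

		read_lock(&zone->lock);
		list_for_each_entry_safe(page, tmp, to_free, lru)
			smp_list_add(&page->lru,
				     &zone->free_area[0].free_list[mt]);
		read_unlock(&zone->lock);
		/* to_free is stale after this and must not be used again */
	}

	/* patch 8 style: one shared-head update for the whole batch */
	static void example_free_batched(struct zone *zone,
					 struct list_head *to_free, int mt)
	{
		if (list_empty(to_free))
			return;

		read_lock(&zone->lock);
		smp_list_splice(to_free, &zone->free_area[0].free_list[mt]);
		read_unlock(&zone->lock);
		INIT_LIST_HEAD(to_free);
	}
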
As a comparison to the no_merge+cluster_alloc approach I posted before[2]:
kernel                    performance         zone lock contention
patch1                       9219349          76.99%
no_merge                    11733153 +27.3%   69.18%
no_merge+cluster_alloc      12094893 +31.2%    0.73%
no_merge (skip merging for order-0 pages on the free path) has similar
performance and zone lock contention to patch 8/9, while with cluster_alloc,
which also improves the allocation side, zone lock contention for this
workload is almost gone.
To get an idea of how fragmentation is affected by patch 8 and how much
patch 9 improves it, this is the result of /proc/buddyinfo after running
will-it-scale/page_fault1 for 3 minutes:
With patch7:
Node 0, zone DMA 0 2 1 1 3 2 2 1 0 1 3
Node 0, zone DMA32 7 3 6 5 5 10 6 7 6 10 410
Node 0, zone Normal 17820 16819 14645 12969 11367 9229 6365 3062 756 69 5646
Node 1, zone Normal 44789 60354 52331 37532 22071 9604 2750 241 32 11 6378
With patch8:
Node 0, zone DMA 0 2 1 1 3 2 2 1 0 1 3
Node 0, zone DMA32 7 9 5 4 5 10 6 7 6 10 410
Node 0, zone Normal 404917 119614 79446 58303 20679 3106 222 89 28 9 5615
Node 1, zone Normal 507659 127355 64470 53549 14104 1288 30 4 1 1 6078
With patch9:
Node 0, zone DMA 0 3 0 1 3 0 1 0 1 1 3
Node 0, zone DMA32 11 423 621 705 726 702 60 14 5 6 296
Node 0, zone Normal 20407 21016 18731 16195 13697 10483 6873 3148 735 39 5637
Node 1, zone Normal 79738 76963 59313 35996 18626 9743 3947 750 21 2 6080
Many more pages stay at order 0 with patch 8 than with patch 7;
consequently, for order 5 and above there are fewer pages with patch 8 than
with patch 7, suggesting that some pages are not properly merged into higher
order pages with patch 8 applied. Patch 9 leaves far fewer pages at order 0
than patch 8, which is a good sign, but still not as few as patch 7.
As a comparison, this is the result with no_merge (think of it as a worst
case result regarding fragmentation):
With no_merge:
Node 0, zone DMA 0 2 1 1 3 2 2 1 0 1 3
Node 0, zone DMA32 7 3 6 5 5 10 6 7 6 10 410
Node 0, zone Normal 1895199 5 1 1 4 2 2 1 1 1 5614
Node 1, zone Normal 1718733 4 1 13 10 3 2 0 1 1 6008
Conclusion: the approach proposed here initially caused a performance drop
due to cache bouncing on the free list head. If we can accept some mergeable
pages staying unmerged in Buddy, zone lock scalability can be improved:
performance increases 20%+ and lock contention drops 8%.
no_merge+cluster_alloc, on the other hand, can all but eliminate zone lock
contention, but has a worse fragmentation issue.
[0] https://lwn.net/Articles/753058/
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
Aaron Lu (7):
mm: do not add anon pages to LRU
mm: convert zone lock from spinlock to rwlock
mm/page_alloc: use helper functions to add/remove a page to/from buddy
use atomic for free_area[order].nr_free
mm: use read_lock for free path
mm: use smp_list_splice() on free path
mm: page_alloc: merge before sending pages to global pool
Daniel Jordan (2):
mm: introduce smp_list_del for concurrent list entry removals
mm: introduce smp_list_splice to prepare for concurrent LRU adds
include/linux/list.h | 4 +
include/linux/mm.h | 1 +
include/linux/mmzone.h | 4 +-
init/main.c | 1 +
lib/Makefile | 2 +-
lib/list.c | 227 ++++++++++++++++++++++++++++
mm/compaction.c | 90 +++++------
mm/hugetlb.c | 8 +-
mm/memory.c | 2 +-
mm/page_alloc.c | 332 ++++++++++++++++++++++++++++-------------
mm/page_isolation.c | 12 +-
mm/vmstat.c | 8 +-
12 files changed, 526 insertions(+), 165 deletions(-)
create mode 100644 lib/list.c
--
2.17.1
For testing purposes, do not add anon pages to the LRU. This avoids the LRU
lock so the zone lock can be tested in isolation.
Signed-off-by: Aaron Lu <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..080641255b8b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3208,7 +3208,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ //lru_cache_add_active_or_unevictable(page, vma);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
--
2.17.1
From: Daniel Jordan <[email protected]>
Now that the LRU lock is a RW lock, lay the groundwork for fine-grained
synchronization so that multiple threads holding the lock as reader can
safely remove pages from an LRU at the same time.
Add a thread-safe variant of list_del called smp_list_del that allows
multiple threads to delete nodes from a list, and wrap this new list API
in smp_del_page_from_lru to get the LRU statistics updates right.
For bisectability's sake, call the new function only when holding
lru_lock as writer. In the next patch, switch to taking it as reader.
The algorithm is explained in detail in the comments. Yosef Lev
conceived of the algorithm, and this patch is heavily based on an
earlier version from him. Thanks to Dave Dice for suggesting the
prefetch.
[aaronlu: only take list related code here]
Signed-off-by: Yosef Lev <[email protected]>
Signed-off-by: Daniel Jordan <[email protected]>
---
include/linux/list.h | 2 +
lib/Makefile | 2 +-
lib/list.c | 158 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 161 insertions(+), 1 deletion(-)
create mode 100644 lib/list.c
diff --git a/include/linux/list.h b/include/linux/list.h
index de04cc5ed536..0fd9c87dd14b 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -47,6 +47,8 @@ static inline bool __list_del_entry_valid(struct list_head *entry)
}
#endif
+extern void smp_list_del(struct list_head *entry);
+
/*
* Insert a new entry between two known consecutive entries.
*
diff --git a/lib/Makefile b/lib/Makefile
index ca3f7ebb900d..9527b7484653 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -38,7 +38,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
bsearch.o find_bit.o llist.o memweight.o kfifo.o \
percpu-refcount.o rhashtable.o reciprocal_div.o \
- once.o refcount.o usercopy.o errseq.o bucket_locks.o
+ once.o refcount.o usercopy.o errseq.o bucket_locks.o list.o
obj-$(CONFIG_STRING_SELFTEST) += test_string.o
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
diff --git a/lib/list.c b/lib/list.c
new file mode 100644
index 000000000000..4d0949ea1a09
--- /dev/null
+++ b/lib/list.c
@@ -0,0 +1,158 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (c) 2017, 2018 Oracle and/or its affiliates. All rights reserved.
+ *
+ * Authors: Yosef Lev <[email protected]>
+ * Daniel Jordan <[email protected]>
+ */
+
+#include <linux/list.h>
+#include <linux/prefetch.h>
+
+/*
+ * smp_list_del is a variant of list_del that allows concurrent list removals
+ * under certain assumptions. The idea is to get away from overly coarse
+ * synchronization, such as using a lock to guard an entire list, which
+ * serializes all operations even though those operations might be happening on
+ * disjoint parts.
+ *
+ * If you want to use other functions from the list API concurrently,
+ * additional synchronization may be necessary. For example, you could use a
+ * rwlock as a two-mode lock, where readers use the lock in shared mode and are
+ * allowed to call smp_list_del concurrently, and writers use the lock in
+ * exclusive mode and are allowed to use all list operations.
+ */
+
+/**
+ * smp_list_del - concurrent variant of list_del
+ * @entry: entry to delete from the list
+ *
+ * Safely removes an entry from the list in the presence of other threads that
+ * may try to remove adjacent entries. Uses the entry's next field and the
+ * predecessor entry's next field as locks to accomplish this.
+ *
+ * Assumes that no two threads may try to delete the same entry. This
+ * assumption holds, for example, if the objects on the list are
+ * reference-counted so that an object is only removed when its refcount falls
+ * to 0.
+ *
+ * @entry's next and prev fields are poisoned on return just as with list_del.
+ */
+void smp_list_del(struct list_head *entry)
+{
+ struct list_head *succ, *pred, *pred_reread;
+
+ /*
+ * The predecessor entry's cacheline is read before it's written, so to
+ * avoid an unnecessary cacheline state transition, prefetch for
+ * writing. In the common case, the predecessor won't change.
+ */
+ prefetchw(entry->prev);
+
+ /*
+ * Step 1: Lock @entry E by making its next field point to its
+ * predecessor D. This prevents any thread from removing the
+ * predecessor because that thread will loop in its step 4 while
+ * E->next == D. This also prevents any thread from removing the
+ * successor F because that thread will see that F->prev->next != F in
+ * the cmpxchg in its step 3. Retry if the successor is being removed
+ * and has already set this field to NULL in step 3.
+ */
+ succ = READ_ONCE(entry->next);
+ pred = READ_ONCE(entry->prev);
+ while (succ == NULL || cmpxchg(&entry->next, succ, pred) != succ) {
+ /*
+ * Reread @entry's successor because it may change until
+ * @entry's next field is locked. Reread the predecessor to
+ * have a better chance of publishing the right value and avoid
+ * entering the loop in step 2 while @entry is locked,
+ * but this isn't required for correctness because the
+ * predecessor is reread in step 2.
+ */
+ cpu_relax();
+ succ = READ_ONCE(entry->next);
+ pred = READ_ONCE(entry->prev);
+ }
+
+ /*
+ * Step 2: A racing thread may remove @entry's predecessor. Reread and
+ * republish @entry->prev until it does not change. This guarantees
+ * that the racing thread has not passed the while loop in step 4 and
+ * has not freed the predecessor, so it is safe for this thread to
+ * access predecessor fields in step 3.
+ */
+ pred_reread = READ_ONCE(entry->prev);
+ while (pred != pred_reread) {
+ WRITE_ONCE(entry->next, pred_reread);
+ pred = pred_reread;
+ /*
+ * Ensure the predecessor is published in @entry's next field
+ * before rereading the predecessor. Pairs with the smp_mb in
+ * step 4.
+ */
+ smp_mb();
+ pred_reread = READ_ONCE(entry->prev);
+ }
+
+ /*
+ * Step 3: If the predecessor points to @entry, lock it and continue.
+ * Otherwise, the predecessor is being removed, so loop until that
+ * removal finishes and this thread's @entry->prev is updated, which
+ * indicates the old predecessor has reached the loop in step 4. Write
+ * the new predecessor into @entry->next. This both releases the old
+ * predecessor from its step 4 loop and sets this thread up to lock the
+ * new predecessor.
+ */
+ while (pred->next != entry ||
+ cmpxchg(&pred->next, entry, NULL) != entry) {
+ /*
+ * The predecessor is being removed so wait for a new,
+ * unlocked predecessor.
+ */
+ cpu_relax();
+ pred_reread = READ_ONCE(entry->prev);
+ if (pred != pred_reread) {
+ /*
+ * The predecessor changed, so republish it and update
+ * it as in step 2.
+ */
+ WRITE_ONCE(entry->next, pred_reread);
+ pred = pred_reread;
+ /* Pairs with smp_mb in step 4. */
+ smp_mb();
+ }
+ }
+
+ /*
+ * Step 4: @entry and @entry's predecessor are both locked, so now
+ * actually remove @entry from the list.
+ *
+ * It is safe to write to the successor's prev pointer because step 1
+ * prevents the successor from being removed.
+ */
+
+ WRITE_ONCE(succ->prev, pred);
+
+ /*
+ * The full barrier guarantees that all changes are visible to other
+ * threads before the entry is unlocked by the final write, pairing
+ * with the implied full barrier before the cmpxchg in step 1.
+ *
+ * The barrier also guarantees that this thread writes succ->prev
+ * before reading succ->next, pairing with a thread in step 2 or 3 that
+ * writes entry->next before reading entry->prev, which ensures that
+ * the one that writes second sees the update from the other.
+ */
+ smp_mb();
+
+ while (READ_ONCE(succ->next) == entry) {
+ /* The successor is being removed, so wait for it to finish. */
+ cpu_relax();
+ }
+
+ /* Simultaneously completes the removal and unlocks the predecessor. */
+ WRITE_ONCE(pred->next, succ);
+
+ entry->next = LIST_POISON1;
+ entry->prev = LIST_POISON2;
+}
--
2.17.1
From: Daniel Jordan <[email protected]>
Now that we splice a local list onto the LRU, prepare for multiple tasks
doing this concurrently by adding a variant of the kernel's list
splicing API, list_splice, that's designed to work with multiple tasks.
Although there is naturally less parallelism to be gained from locking
the LRU head this way, the main benefit of doing this is to allow
removals to happen concurrently. The way lru_lock is today, an add
needlessly blocks removal of any page but the first in the LRU.
For now, hold lru_lock as writer to serialize the adds to ensure the
function is correct for a single thread at a time.
Yosef Lev came up with this algorithm.
[aaronlu: drop LRU related code, keep only list related code]
Suggested-by: Yosef Lev <[email protected]>
Signed-off-by: Daniel Jordan <[email protected]>
---
include/linux/list.h | 1 +
lib/list.c | 60 ++++++++++++++++++++++++++++++++++++++------
2 files changed, 54 insertions(+), 7 deletions(-)
diff --git a/include/linux/list.h b/include/linux/list.h
index 0fd9c87dd14b..5f203fb55939 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -48,6 +48,7 @@ static inline bool __list_del_entry_valid(struct list_head *entry)
#endif
extern void smp_list_del(struct list_head *entry);
+extern void smp_list_splice(struct list_head *list, struct list_head *head);
/*
* Insert a new entry between two known consecutive entries.
diff --git a/lib/list.c b/lib/list.c
index 4d0949ea1a09..104faa144abf 100644
--- a/lib/list.c
+++ b/lib/list.c
@@ -10,17 +10,18 @@
#include <linux/prefetch.h>
/*
- * smp_list_del is a variant of list_del that allows concurrent list removals
- * under certain assumptions. The idea is to get away from overly coarse
- * synchronization, such as using a lock to guard an entire list, which
- * serializes all operations even though those operations might be happening on
- * disjoint parts.
+ * smp_list_del and smp_list_splice are variants of list_del and list_splice,
+ * respectively, that allow concurrent list operations under certain
+ * assumptions. The idea is to get away from overly coarse synchronization,
+ * such as using a lock to guard an entire list, which serializes all
+ * operations even though those operations might be happening on disjoint
+ * parts.
*
* If you want to use other functions from the list API concurrently,
* additional synchronization may be necessary. For example, you could use a
* rwlock as a two-mode lock, where readers use the lock in shared mode and are
- * allowed to call smp_list_del concurrently, and writers use the lock in
- * exclusive mode and are allowed to use all list operations.
+ * allowed to call smp_list_* functions concurrently, and writers use the lock
+ * in exclusive mode and are allowed to use all list operations.
*/
/**
@@ -156,3 +157,48 @@ void smp_list_del(struct list_head *entry)
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
}
+
+/**
+ * smp_list_splice - thread-safe splice of two lists
+ * @list: the new list to add
+ * @head: the place to add it in the first list
+ *
+ * Safely handles concurrent smp_list_splice operations onto the same list head
+ * and concurrent smp_list_del operations of any list entry except @head.
+ * Assumes that @head cannot be removed.
+ */
+void smp_list_splice(struct list_head *list, struct list_head *head)
+{
+ struct list_head *first = list->next;
+ struct list_head *last = list->prev;
+ struct list_head *succ;
+
+ /*
+ * Lock the front of @head by replacing its next pointer with NULL.
+ * Should another thread be adding to the front, wait until it's done.
+ */
+ succ = READ_ONCE(head->next);
+ while (succ == NULL || cmpxchg(&head->next, succ, NULL) != succ) {
+ cpu_relax();
+ succ = READ_ONCE(head->next);
+ }
+
+ first->prev = head;
+ last->next = succ;
+
+ /*
+ * It is safe to write to succ, head's successor, because locking head
+ * prevents succ from being removed in smp_list_del.
+ */
+ succ->prev = last;
+
+ /*
+ * Pairs with the implied full barrier before the cmpxchg above.
+ * Ensures the write that unlocks the head is seen last to avoid list
+ * corruption.
+ */
+ smp_wmb();
+
+ /* Simultaneously complete the splice and unlock the head node. */
+ WRITE_ONCE(head->next, first);
+}
--
2.17.1
There are multiple places that add/remove a page to/from Buddy; introduce
helper functions for them.
This also makes it easier to add code that needs to run when a page is
added to or removed from Buddy.
No functional change.
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
mm/page_alloc.c | 65 +++++++++++++++++++++++++++++--------------------
1 file changed, 39 insertions(+), 26 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 38e39ccdd6d9..d0b954783f1d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,12 +697,41 @@ static inline void set_page_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
}
+static inline void add_to_buddy_common(struct page *page, struct zone *zone,
+ unsigned int order)
+{
+ set_page_order(page, order);
+ zone->free_area[order].nr_free++;
+}
+
+static inline void add_to_buddy_head(struct page *page, struct zone *zone,
+ unsigned int order, int mt)
+{
+ add_to_buddy_common(page, zone, order);
+ list_add(&page->lru, &zone->free_area[order].free_list[mt]);
+}
+
+static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
+ unsigned int order, int mt)
+{
+ add_to_buddy_common(page, zone, order);
+ list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
+}
+
static inline void rmv_page_order(struct page *page)
{
__ClearPageBuddy(page);
set_page_private(page, 0);
}
+static inline void remove_from_buddy(struct page *page, struct zone *zone,
+ unsigned int order)
+{
+ list_del(&page->lru);
+ zone->free_area[order].nr_free--;
+ rmv_page_order(page);
+}
+
/*
* This function checks whether a page is free && is the buddy
* we can coalesce a page and its buddy if
@@ -803,13 +832,10 @@ static inline void __free_one_page(struct page *page,
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
*/
- if (page_is_guard(buddy)) {
+ if (page_is_guard(buddy))
clear_page_guard(zone, buddy, order, migratetype);
- } else {
- list_del(&buddy->lru);
- zone->free_area[order].nr_free--;
- rmv_page_order(buddy);
- }
+ else
+ remove_from_buddy(buddy, zone, order);
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
@@ -841,8 +867,6 @@ static inline void __free_one_page(struct page *page,
}
done_merging:
- set_page_order(page, order);
-
/*
* If this is not the largest possible page, check if the buddy
* of the next-highest order is free. If it is, it's possible
@@ -859,15 +883,12 @@ static inline void __free_one_page(struct page *page,
higher_buddy = higher_page + (buddy_pfn - combined_pfn);
if (pfn_valid_within(buddy_pfn) &&
page_is_buddy(higher_page, higher_buddy, order + 1)) {
- list_add_tail(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
- goto out;
+ add_to_buddy_tail(page, zone, order, migratetype);
+ return;
}
}
- list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
-out:
- zone->free_area[order].nr_free++;
+ add_to_buddy_head(page, zone, order, migratetype);
}
/*
@@ -1805,9 +1826,7 @@ static inline void expand(struct zone *zone, struct page *page,
if (set_page_guard(zone, &page[size], high, migratetype))
continue;
- list_add(&page[size].lru, &area->free_list[migratetype]);
- area->nr_free++;
- set_page_order(&page[size], high);
+ add_to_buddy_head(&page[size], zone, high, migratetype);
}
}
@@ -1951,9 +1970,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
struct page, lru);
if (!page)
continue;
- list_del(&page->lru);
- rmv_page_order(page);
- area->nr_free--;
+ remove_from_buddy(page, zone, current_order);
expand(zone, page, order, current_order, area, migratetype);
set_pcppage_migratetype(page, migratetype);
return page;
@@ -2871,9 +2888,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
}
/* Remove page from free list */
- list_del(&page->lru);
- zone->free_area[order].nr_free--;
- rmv_page_order(page);
+ remove_from_buddy(page, zone, order);
/*
* Set the pageblock if the isolated page is at least half of a
@@ -8066,9 +8081,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
pr_info("remove from free list %lx %d %lx\n",
pfn, 1 << order, end_pfn);
#endif
- list_del(&page->lru);
- rmv_page_order(page);
- zone->free_area[order].nr_free--;
+ remove_from_buddy(page, zone, order);
for (i = 0; i < (1 << order); i++)
SetPageReserved((page+i));
pfn += (1 << order);
--
2.17.1
The last patch can leave mergeable pages unmerged in Buddy; this patch is a
step toward reducing how often that happens.
Suppose two buddy pages are on the list to be freed in free_pcppages_bulk():
the first page goes to merge, but its buddy is not in Buddy yet, so we hold
it locally as an order-0 page; then its buddy goes to merge and cannot merge
either, because we are holding the first page locally instead of having it
in Buddy. The end result is two mergeable buddy pages that fail to merge.
So this patch attempts to merge these to-be-freed pages before acquiring any
lock; it can, to some extent, reduce the fragmentation caused by the last
patch.
With this change the pcp_drain tracepoint isn't easy to use, so it is removed.
Signed-off-by: Aaron Lu <[email protected]>
---
mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 73 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df38c3f2a1cc..d3eafe857713 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1098,6 +1098,72 @@ void __init percpu_mergelist_init(void)
}
}
+static inline bool buddy_in_list(struct page *page, struct page *buddy,
+ struct list_head *list)
+{
+ list_for_each_entry_continue(page, list, lru)
+ if (page == buddy)
+ return true;
+
+ return false;
+}
+
+static inline void merge_in_pcp(struct list_head *list)
+{
+ int order;
+ struct page *page;
+
+ /* Set order information to 0 initially since they are PCP pages */
+ list_for_each_entry(page, list, lru)
+ set_page_private(page, 0);
+
+ /*
+ * Check for mergable pages for each order.
+ *
+ * For each order, check if their buddy is also in the list and
+ * if so, do merge, then remove the merged buddy from the list.
+ */
+ for (order = 0; order < MAX_ORDER - 1; order++) {
+ bool has_merge = false;
+
+ page = list_first_entry(list, struct page, lru);
+ while (&page->lru != list) {
+ unsigned long pfn, buddy_pfn, combined_pfn;
+ struct page *buddy, *n;
+
+ if (page_order(page) != order) {
+ page = list_next_entry(page, lru);
+ continue;
+ }
+
+ pfn = page_to_pfn(page);
+ buddy_pfn = __find_buddy_pfn(pfn, order);
+ buddy = page + (buddy_pfn - pfn);
+ if (!buddy_in_list(page, buddy, list) ||
+ page_order(buddy) != order) {
+ page = list_next_entry(page, lru);
+ continue;
+ }
+
+ combined_pfn = pfn & buddy_pfn;
+ if (combined_pfn == pfn) {
+ set_page_private(page, order + 1);
+ list_del(&buddy->lru);
+ page = list_next_entry(page, lru);
+ } else {
+ set_page_private(buddy, order + 1);
+ n = list_next_entry(page, lru);
+ list_del(&page->lru);
+ page = n;
+ }
+ has_merge = true;
+ }
+
+ if (!has_merge)
+ break;
+ }
+}
+
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
@@ -1165,6 +1231,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (--count && --batch_free && !list_empty(list));
}
+	/*
+	 * Before acquiring the possibly heavily contended zone lock, merge
+	 * these to-be-freed PCP pages among themselves before sending them to Buddy.
+	 */
+ merge_in_pcp(&head);
+
read_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);
@@ -1182,10 +1254,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (unlikely(isolated_pageblocks))
mt = get_pageblock_migratetype(page);
- order = 0;
+ order = page_order(page);
merged_page = do_merge(page, page_to_pfn(page), zone, &order, mt);
list_add(&merged_page->lru, this_cpu_ptr(&merge_lists[order][mt]));
- trace_mm_page_pcpu_drain(page, 0, mt);
}
for_each_migratetype_order(order, migratetype) {
--
2.17.1
With the free path running concurrently, cache bouncing on the free list
head is severe, since multiple threads can be freeing pages and every free
needs to add a page to the free list head.
To improve free path performance for order-0 pages, we can choose not to add
merged pages to Buddy immediately after merging, but keep them on a local
percpu list first and then, after all pages have finished merging, add these
merged pages to Buddy with smp_list_splice() in one go.
This optimization introduces a problem though: a page held on the local
percpu list can be the buddy of another page being freed, and we lose the
merge opportunity for them. With this patch, some mergeable pages will be
left unmerged in Buddy.
Because of this, I don't see much value in keeping the range lock that is
used to prevent exactly that, so the range lock is removed in this patch.
Signed-off-by: Aaron Lu <[email protected]>
---
include/linux/mm.h | 1 +
include/linux/mmzone.h | 3 -
init/main.c | 1 +
mm/page_alloc.c | 151 +++++++++++++++++++++++++----------------
4 files changed, 95 insertions(+), 61 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a61ebe8ad4ca..a99ba2cb7a0d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2155,6 +2155,7 @@ extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
extern void setup_per_zone_wmarks(void);
extern int __meminit init_per_zone_wmark_min(void);
extern void mem_init(void);
+extern void percpu_mergelist_init(void);
extern void __init mmap_init(void);
extern void show_mem(unsigned int flags, nodemask_t *nodemask);
extern long si_mem_available(void);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0ea52e9bb610..e66b8c63d5d1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -467,9 +467,6 @@ struct zone {
/* Primarily protects free_area */
rwlock_t lock;
- /* Protects merge operation for a range of order=(MAX_ORDER-1) pages */
- spinlock_t *range_locks;
-
/* Write-intensive fields used by compaction and vmstats. */
ZONE_PADDING(_pad2_)
diff --git a/init/main.c b/init/main.c
index 18f8f0140fa0..68a428e1bf15 100644
--- a/init/main.c
+++ b/init/main.c
@@ -517,6 +517,7 @@ static void __init mm_init(void)
* bigger than MAX_ORDER unless SPARSEMEM.
*/
page_ext_init_flatmem();
+ percpu_mergelist_init();
mem_init();
kmem_cache_init();
pgtable_init();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5f5cc671bcf7..df38c3f2a1cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -339,17 +339,6 @@ static inline bool update_defer_init(pg_data_t *pgdat,
}
#endif
-/* Return a pointer to the spinlock for a pageblock this page belongs to */
-static inline spinlock_t *get_range_lock(struct page *page)
-{
- struct zone *zone = page_zone(page);
- unsigned long zone_start_pfn = zone->zone_start_pfn;
- unsigned long range = (page_to_pfn(page) - zone_start_pfn) >>
- (MAX_ORDER - 1);
-
- return &zone->range_locks[range];
-}
-
/* Return a pointer to the bitmap storing bits affecting a block of pages */
static inline unsigned long *get_pageblock_bitmap(struct page *page,
unsigned long pfn)
@@ -711,9 +700,15 @@ static inline void set_page_order(struct page *page, unsigned int order)
static inline void add_to_buddy(struct page *page, struct zone *zone,
unsigned int order, int mt)
{
+	/*
+	 * Add the page to the free list before setting PageBuddy, otherwise
+	 * another thread doing a merge could see PageBuddy before the page
+	 * is on the list and attempt to merge with it, corrupting the list.
+	 */
+ smp_list_add(&page->lru, &zone->free_area[order].free_list[mt]);
+ smp_wmb();
set_page_order(page, order);
atomic_long_inc(&zone->free_area[order].nr_free);
- smp_list_add(&page->lru, &zone->free_area[order].free_list[mt]);
}
static inline void rmv_page_order(struct page *page)
@@ -784,40 +779,17 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
return 0;
}
-/*
- * Freeing function for a buddy system allocator.
- *
- * The concept of a buddy system is to maintain direct-mapped table
- * (containing bit values) for memory blocks of various "orders".
- * The bottom level table contains the map for the smallest allocatable
- * units of memory (here, pages), and each level above it describes
- * pairs of units from the levels below, hence, "buddies".
- * At a high level, all that happens here is marking the table entry
- * at the bottom level available, and propagating the changes upward
- * as necessary, plus some accounting needed to play nicely with other
- * parts of the VM system.
- * At each level, we keep a list of pages, which are heads of continuous
- * free pages of length of (1 << order) and marked with PageBuddy.
- * Page's order is recorded in page_private(page) field.
- * So when we are allocating or freeing one, we can derive the state of the
- * other. That is, if we allocate a small block, and both were
- * free, the remainder of the region must be split into blocks.
- * If a block is freed, and its buddy is also free, then this
- * triggers coalescing into a block of larger size.
- *
- * -- nyc
- */
-
-static inline void __free_one_page(struct page *page,
+/* Return merged page pointer with order updated */
+static inline struct page *do_merge(struct page *page,
unsigned long pfn,
- struct zone *zone, unsigned int order,
+ struct zone *zone, unsigned int *p_order,
int migratetype)
{
unsigned long combined_pfn;
unsigned long uninitialized_var(buddy_pfn);
struct page *buddy;
unsigned int max_order;
- spinlock_t *range_lock;
+ unsigned int order = *p_order;
max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
@@ -831,8 +803,6 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
- range_lock = get_range_lock(page);
- spin_lock(range_lock);
continue_merging:
while (order < max_order - 1) {
buddy_pfn = __find_buddy_pfn(pfn, order);
@@ -881,8 +851,41 @@ static inline void __free_one_page(struct page *page,
}
done_merging:
+ *p_order = order;
+ return page;
+}
+
+/*
+ * Freeing function for a buddy system allocator.
+ *
+ * The concept of a buddy system is to maintain direct-mapped table
+ * (containing bit values) for memory blocks of various "orders".
+ * The bottom level table contains the map for the smallest allocatable
+ * units of memory (here, pages), and each level above it describes
+ * pairs of units from the levels below, hence, "buddies".
+ * At a high level, all that happens here is marking the table entry
+ * at the bottom level available, and propagating the changes upward
+ * as necessary, plus some accounting needed to play nicely with other
+ * parts of the VM system.
+ * At each level, we keep a list of pages, which are heads of continuous
+ * free pages of length of (1 << order) and marked with PageBuddy.
+ * Page's order is recorded in page_private(page) field.
+ * So when we are allocating or freeing one, we can derive the state of the
+ * other. That is, if we allocate a small block, and both were
+ * free, the remainder of the region must be split into blocks.
+ * If a block is freed, and its buddy is also free, then this
+ * triggers coalescing into a block of larger size.
+ *
+ * -- nyc
+ */
+
+static inline void __free_one_page(struct page *page,
+ unsigned long pfn,
+ struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ page = do_merge(page, pfn, zone, &order, migratetype);
add_to_buddy(page, zone, order, migratetype);
- spin_unlock(range_lock);
}
/*
@@ -1081,6 +1084,20 @@ static inline void prefetch_buddy(struct page *page)
prefetch(buddy);
}
+static DEFINE_PER_CPU(struct list_head, merge_lists[MAX_ORDER][MIGRATE_TYPES]);
+
+void __init percpu_mergelist_init(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ unsigned int order, mt;
+
+ for_each_migratetype_order(order, mt)
+ INIT_LIST_HEAD(per_cpu_ptr(&merge_lists[order][mt], cpu));
+ }
+}
+
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
@@ -1101,10 +1118,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
bool isolated_pageblocks;
struct page *page, *tmp;
LIST_HEAD(head);
+ struct list_head *list;
+ unsigned int order;
while (count) {
- struct list_head *list;
-
/*
* Remove pages from lists in a round-robin fashion. A
* batch_free count is maintained that is incremented when an
@@ -1157,15 +1174,46 @@ static void free_pcppages_bulk(struct zone *zone, int count,
*/
list_for_each_entry_safe(page, tmp, &head, lru) {
int mt = get_pcppage_migratetype(page);
+ struct page *merged_page;
+
/* MIGRATE_ISOLATE page should not go to pcplists */
VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
/* Pageblock could have been isolated meanwhile */
if (unlikely(isolated_pageblocks))
mt = get_pageblock_migratetype(page);
- __free_one_page(page, page_to_pfn(page), zone, 0, mt);
+ order = 0;
+ merged_page = do_merge(page, page_to_pfn(page), zone, &order, mt);
+ list_add(&merged_page->lru, this_cpu_ptr(&merge_lists[order][mt]));
trace_mm_page_pcpu_drain(page, 0, mt);
}
+
+ for_each_migratetype_order(order, migratetype) {
+ unsigned long n;
+ struct list_head *entry;
+
+ list = this_cpu_ptr(&merge_lists[order][migratetype]);
+ if (list_empty(list))
+ continue;
+
+ smp_list_splice(list, &zone->free_area[order].free_list[migratetype]);
+
+ /* Add to list first before setting PageBuddy flag */
+ smp_wmb();
+
+ n = 0;
+ entry = list;
+ do {
+ entry = entry->next;
+ page = list_entry(entry, struct page, lru);
+ set_page_order(page, order);
+ n++;
+ } while (entry != list->prev);
+ INIT_LIST_HEAD(list);
+
+ atomic_long_add(n, &zone->free_area[order].nr_free);
+ }
+
read_unlock(&zone->lock);
}
@@ -6280,18 +6328,6 @@ void __ref free_area_init_core_hotplug(int nid)
}
#endif
-static void __init setup_range_locks(struct zone *zone)
-{
- unsigned long nr = (zone->spanned_pages >> (MAX_ORDER - 1)) + 1;
- unsigned long size = nr * sizeof(spinlock_t);
- unsigned long i;
-
- zone->range_locks = memblock_virt_alloc_node_nopanic(size,
- zone->zone_pgdat->node_id);
- for (i = 0; i < nr; i++)
- spin_lock_init(&zone->range_locks[i]);
-}
-
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -6363,7 +6399,6 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
setup_usemap(pgdat, zone, zone_start_pfn, size);
init_currently_empty_zone(zone, zone_start_pfn, size);
memmap_init(size, nid, j, zone_start_pfn);
- setup_range_locks(zone);
}
}
--
2.17.1
Daniel Jordan's patches have made it possible for multiple threads to
operate on a global list, with smp_list_del() at any position and
smp_list_add/splice() at the head, concurrently and without taking a lock.
This patch applies that technique to the free lists.
To make this work, add_to_buddy_tail() is removed, since only adding at the
list head is safe together with smp_list_del(); only add_to_buddy() is used.
Once the free path runs concurrently, multiple threads can free pages at the
same time, and if 2 pages being freed are buddies, they can miss the
opportunity to be merged.
For this reason, introduce range locks to protect the merge operation:
inside one range only one merge can happen at a time, and a page's Buddy
status is properly set under the lock. The range is chosen as an order
(MAX_ORDER-1) block of pages since merging cannot exceed that order.
Signed-off-by: Aaron Lu <[email protected]>
---
include/linux/list.h | 1 +
include/linux/mmzone.h | 3 ++
lib/list.c | 23 ++++++++++
mm/page_alloc.c | 95 +++++++++++++++++++++++-------------------
4 files changed, 78 insertions(+), 44 deletions(-)
diff --git a/include/linux/list.h b/include/linux/list.h
index 5f203fb55939..608e40f6489e 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -49,6 +49,7 @@ static inline bool __list_del_entry_valid(struct list_head *entry)
extern void smp_list_del(struct list_head *entry);
extern void smp_list_splice(struct list_head *list, struct list_head *head);
+extern void smp_list_add(struct list_head *entry, struct list_head *head);
/*
* Insert a new entry between two known consecutive entries.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e66b8c63d5d1..0ea52e9bb610 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -467,6 +467,9 @@ struct zone {
/* Primarily protects free_area */
rwlock_t lock;
+ /* Protects merge operation for a range of order=(MAX_ORDER-1) pages */
+ spinlock_t *range_locks;
+
/* Write-intensive fields used by compaction and vmstats. */
ZONE_PADDING(_pad2_)
diff --git a/lib/list.c b/lib/list.c
index 104faa144abf..3ecf62b88c86 100644
--- a/lib/list.c
+++ b/lib/list.c
@@ -202,3 +202,26 @@ void smp_list_splice(struct list_head *list, struct list_head *head)
/* Simultaneously complete the splice and unlock the head node. */
WRITE_ONCE(head->next, first);
}
+
+void smp_list_add(struct list_head *entry, struct list_head *head)
+{
+ struct list_head *succ;
+
+ /*
+ * Lock the front of @head by replacing its next pointer with NULL.
+ * Should another thread be adding to the front, wait until it's done.
+ */
+ succ = READ_ONCE(head->next);
+ while (succ == NULL || cmpxchg(&head->next, succ, NULL) != succ) {
+ cpu_relax();
+ succ = READ_ONCE(head->next);
+ }
+
+ entry->next = succ;
+ entry->prev = head;
+ succ->prev = entry;
+
+ smp_wmb();
+
+ WRITE_ONCE(head->next, entry);
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dff3edc60d71..5f5cc671bcf7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -339,6 +339,17 @@ static inline bool update_defer_init(pg_data_t *pgdat,
}
#endif
+/* Return a pointer to the spinlock for a pageblock this page belongs to */
+static inline spinlock_t *get_range_lock(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+ unsigned long zone_start_pfn = zone->zone_start_pfn;
+ unsigned long range = (page_to_pfn(page) - zone_start_pfn) >>
+ (MAX_ORDER - 1);
+
+ return &zone->range_locks[range];
+}
+
/* Return a pointer to the bitmap storing bits affecting a block of pages */
static inline unsigned long *get_pageblock_bitmap(struct page *page,
unsigned long pfn)
@@ -697,25 +708,12 @@ static inline void set_page_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
}
-static inline void add_to_buddy_common(struct page *page, struct zone *zone,
- unsigned int order)
+static inline void add_to_buddy(struct page *page, struct zone *zone,
+ unsigned int order, int mt)
{
set_page_order(page, order);
atomic_long_inc(&zone->free_area[order].nr_free);
-}
-
-static inline void add_to_buddy_head(struct page *page, struct zone *zone,
- unsigned int order, int mt)
-{
- add_to_buddy_common(page, zone, order);
- list_add(&page->lru, &zone->free_area[order].free_list[mt]);
-}
-
-static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
- unsigned int order, int mt)
-{
- add_to_buddy_common(page, zone, order);
- list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
+ smp_list_add(&page->lru, &zone->free_area[order].free_list[mt]);
}
static inline void rmv_page_order(struct page *page)
@@ -724,12 +722,25 @@ static inline void rmv_page_order(struct page *page)
set_page_private(page, 0);
}
+static inline void remove_from_buddy_common(struct page *page,
+ struct zone *zone, unsigned int order)
+{
+ atomic_long_dec(&zone->free_area[order].nr_free);
+ rmv_page_order(page);
+}
+
static inline void remove_from_buddy(struct page *page, struct zone *zone,
unsigned int order)
{
list_del(&page->lru);
- atomic_long_dec(&zone->free_area[order].nr_free);
- rmv_page_order(page);
+ remove_from_buddy_common(page, zone, order);
+}
+
+static inline void remove_from_buddy_concurrent(struct page *page,
+ struct zone *zone, unsigned int order)
+{
+ smp_list_del(&page->lru);
+ remove_from_buddy_common(page, zone, order);
}
/*
@@ -806,6 +817,7 @@ static inline void __free_one_page(struct page *page,
unsigned long uninitialized_var(buddy_pfn);
struct page *buddy;
unsigned int max_order;
+ spinlock_t *range_lock;
max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
@@ -819,6 +831,8 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
+ range_lock = get_range_lock(page);
+ spin_lock(range_lock);
continue_merging:
while (order < max_order - 1) {
buddy_pfn = __find_buddy_pfn(pfn, order);
@@ -835,7 +849,7 @@ static inline void __free_one_page(struct page *page,
if (page_is_guard(buddy))
clear_page_guard(zone, buddy, order, migratetype);
else
- remove_from_buddy(buddy, zone, order);
+ remove_from_buddy_concurrent(buddy, zone, order);
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
@@ -867,28 +881,8 @@ static inline void __free_one_page(struct page *page,
}
done_merging:
- /*
- * If this is not the largest possible page, check if the buddy
- * of the next-highest order is free. If it is, it's possible
- * that pages are being freed that will coalesce soon. In case,
- * that is happening, add the free page to the tail of the list
- * so it's less likely to be used soon and more likely to be merged
- * as a higher order page
- */
- if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
- struct page *higher_page, *higher_buddy;
- combined_pfn = buddy_pfn & pfn;
- higher_page = page + (combined_pfn - pfn);
- buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
- higher_buddy = higher_page + (buddy_pfn - combined_pfn);
- if (pfn_valid_within(buddy_pfn) &&
- page_is_buddy(higher_page, higher_buddy, order + 1)) {
- add_to_buddy_tail(page, zone, order, migratetype);
- return;
- }
- }
-
- add_to_buddy_head(page, zone, order, migratetype);
+ add_to_buddy(page, zone, order, migratetype);
+ spin_unlock(range_lock);
}
/*
@@ -1154,7 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (--count && --batch_free && !list_empty(list));
}
- write_lock(&zone->lock);
+ read_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);
/*
@@ -1172,7 +1166,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
}
- write_unlock(&zone->lock);
+ read_unlock(&zone->lock);
}
static void free_one_page(struct zone *zone,
@@ -1826,7 +1820,7 @@ static inline void expand(struct zone *zone, struct page *page,
if (set_page_guard(zone, &page[size], high, migratetype))
continue;
- add_to_buddy_head(&page[size], zone, high, migratetype);
+ add_to_buddy(&page[size], zone, high, migratetype);
}
}
@@ -6286,6 +6280,18 @@ void __ref free_area_init_core_hotplug(int nid)
}
#endif
+static void __init setup_range_locks(struct zone *zone)
+{
+ unsigned long nr = (zone->spanned_pages >> (MAX_ORDER - 1)) + 1;
+ unsigned long size = nr * sizeof(spinlock_t);
+ unsigned long i;
+
+ zone->range_locks = memblock_virt_alloc_node_nopanic(size,
+ zone->zone_pgdat->node_id);
+ for (i = 0; i < nr; i++)
+ spin_lock_init(&zone->range_locks[i]);
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -6357,6 +6363,7 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
setup_usemap(pgdat, zone, zone_start_pfn, size);
init_currently_empty_zone(zone, zone_start_pfn, size);
memmap_init(size, nid, j, zone_start_pfn);
+ setup_range_locks(zone);
}
}
--
2.17.1
This patch converts the zone lock from a spinlock to a rwlock and always
takes the lock in write mode, so there is no functional change.
This is a preparation for the free path to take the lock in read mode so
that frees can run concurrently.
The compact_trylock and compact_unlock_should_abort changes are taken from
Daniel Jordan's patch.
Signed-off-by: Aaron Lu <[email protected]>
---
include/linux/mmzone.h | 2 +-
mm/compaction.c | 90 +++++++++++++++++++++---------------------
mm/hugetlb.c | 8 ++--
mm/page_alloc.c | 52 ++++++++++++------------
mm/page_isolation.c | 12 +++---
mm/vmstat.c | 4 +-
6 files changed, 85 insertions(+), 83 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1e22d96734e0..84cfa56e2d19 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -465,7 +465,7 @@ struct zone {
unsigned long flags;
/* Primarily protects free_area */
- spinlock_t lock;
+ rwlock_t lock;
/* Write-intensive fields used by compaction and vmstats. */
ZONE_PADDING(_pad2_)
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..6ecf74d8e287 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -347,20 +347,20 @@ static inline void update_pageblock_skip(struct compact_control *cc,
* Returns true if the lock is held
* Returns false if the lock is not held and compaction should abort
*/
-static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
- struct compact_control *cc)
-{
- if (cc->mode == MIGRATE_ASYNC) {
- if (!spin_trylock_irqsave(lock, *flags)) {
- cc->contended = true;
- return false;
- }
- } else {
- spin_lock_irqsave(lock, *flags);
- }
-
- return true;
-}
+#define compact_trylock(lock, flags, cc, lockf, trylockf) \
+({ \
+ bool __ret = true; \
+ if ((cc)->mode == MIGRATE_ASYNC) { \
+ if (!trylockf((lock), *(flags))) { \
+ (cc)->contended = true; \
+ __ret = false; \
+ } \
+ } else { \
+ lockf((lock), *(flags)); \
+ } \
+ \
+ __ret; \
+})
/*
* Compaction requires the taking of some coarse locks that are potentially
@@ -377,29 +377,29 @@ static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
* Returns false when compaction can continue (sync compaction might have
* scheduled)
*/
-static bool compact_unlock_should_abort(spinlock_t *lock,
- unsigned long flags, bool *locked, struct compact_control *cc)
-{
- if (*locked) {
- spin_unlock_irqrestore(lock, flags);
- *locked = false;
- }
-
- if (fatal_signal_pending(current)) {
- cc->contended = true;
- return true;
- }
-
- if (need_resched()) {
- if (cc->mode == MIGRATE_ASYNC) {
- cc->contended = true;
- return true;
- }
- cond_resched();
- }
-
- return false;
-}
+#define compact_unlock_should_abort(lock, flags, locked, cc, unlockf) \
+({ \
+ bool __ret = false; \
+ \
+ if (*(locked)) { \
+ unlockf((lock), (flags)); \
+ *(locked) = false; \
+ } \
+ \
+ if (fatal_signal_pending(current)) { \
+ (cc)->contended = true; \
+ __ret = true; \
+ } else if (need_resched()) { \
+ if ((cc)->mode == MIGRATE_ASYNC) { \
+ (cc)->contended = true; \
+ __ret = true; \
+ } else { \
+ cond_resched(); \
+ } \
+ } \
+ \
+ __ret; \
+})
/*
* Aside from avoiding lock contention, compaction also periodically checks
@@ -457,7 +457,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
*/
if (!(blockpfn % SWAP_CLUSTER_MAX)
&& compact_unlock_should_abort(&cc->zone->lock, flags,
- &locked, cc))
+ &locked, cc, write_unlock_irqrestore))
break;
nr_scanned++;
@@ -502,8 +502,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
* spin on the lock and we acquire the lock as late as
* possible.
*/
- locked = compact_trylock_irqsave(&cc->zone->lock,
- &flags, cc);
+ locked = compact_trylock(&cc->zone->lock, &flags, cc,
+ write_lock_irqsave,
+ write_trylock_irqsave);
if (!locked)
break;
@@ -541,7 +542,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
}
if (locked)
- spin_unlock_irqrestore(&cc->zone->lock, flags);
+ write_unlock_irqrestore(&cc->zone->lock, flags);
/*
* There is a tiny chance that we have read bogus compound_order(),
@@ -758,7 +759,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (!(low_pfn % SWAP_CLUSTER_MAX)
&& compact_unlock_should_abort(zone_lru_lock(zone), flags,
- &locked, cc))
+ &locked, cc, spin_unlock_irqrestore))
break;
if (!pfn_valid_within(low_pfn))
@@ -847,8 +848,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
/* If we already hold the lock, we can skip some rechecking */
if (!locked) {
- locked = compact_trylock_irqsave(zone_lru_lock(zone),
- &flags, cc);
+ locked = compact_trylock(zone_lru_lock(zone), &flags, cc,
+ spin_lock_irqsave,
+ spin_trylock_irqsave);
if (!locked)
break;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c21775f196b..18fde0139f4a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1113,7 +1113,7 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
zonelist = node_zonelist(nid, gfp_mask);
for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nodemask) {
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
pfn = ALIGN(zone->zone_start_pfn, nr_pages);
while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
@@ -1125,16 +1125,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
* spinning on this lock, it may win the race
* and cause alloc_contig_range() to fail...
*/
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
ret = __alloc_gigantic_page(pfn, nr_pages, gfp_mask);
if (!ret)
return pfn_to_page(pfn);
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
}
pfn += nr_pages;
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
}
return NULL;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 05e983f42316..38e39ccdd6d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1133,7 +1133,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (--count && --batch_free && !list_empty(list));
}
- spin_lock(&zone->lock);
+ write_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);
/*
@@ -1151,7 +1151,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
}
- spin_unlock(&zone->lock);
+ write_unlock(&zone->lock);
}
static void free_one_page(struct zone *zone,
@@ -1159,13 +1159,13 @@ static void free_one_page(struct zone *zone,
unsigned int order,
int migratetype)
{
- spin_lock(&zone->lock);
+ write_lock(&zone->lock);
if (unlikely(has_isolate_pageblock(zone) ||
is_migrate_isolate(migratetype))) {
migratetype = get_pfnblock_migratetype(page, pfn);
}
__free_one_page(page, pfn, zone, order, migratetype);
- spin_unlock(&zone->lock);
+ write_unlock(&zone->lock);
}
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
@@ -2251,7 +2251,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
if (zone->nr_reserved_highatomic >= max_managed)
return;
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
/* Recheck the nr_reserved_highatomic limit under the lock */
if (zone->nr_reserved_highatomic >= max_managed)
@@ -2267,7 +2267,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
}
out_unlock:
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
}
/*
@@ -2300,7 +2300,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
pageblock_nr_pages)
continue;
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
struct free_area *area = &(zone->free_area[order]);
@@ -2343,11 +2343,11 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
ret = move_freepages_block(zone, page, ac->migratetype,
NULL);
if (ret) {
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
return ret;
}
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
}
return false;
@@ -2465,7 +2465,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
{
int i, alloced = 0;
- spin_lock(&zone->lock);
+ write_lock(&zone->lock);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
@@ -2498,7 +2498,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages added to the pcp list.
*/
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
- spin_unlock(&zone->lock);
+ write_unlock(&zone->lock);
return alloced;
}
@@ -2687,7 +2687,7 @@ void mark_free_pages(struct zone *zone)
if (zone_is_empty(zone))
return;
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++)
@@ -2721,7 +2721,7 @@ void mark_free_pages(struct zone *zone)
}
}
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
}
#endif /* CONFIG_PM */
@@ -2990,7 +2990,7 @@ struct page *rmqueue(struct zone *preferred_zone,
* allocate greater than order-1 page units with __GFP_NOFAIL.
*/
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
do {
page = NULL;
@@ -3002,7 +3002,7 @@ struct page *rmqueue(struct zone *preferred_zone,
if (!page)
page = __rmqueue(zone, order, migratetype);
} while (page && check_new_pages(page, order));
- spin_unlock(&zone->lock);
+ write_unlock(&zone->lock);
if (!page)
goto failed;
__mod_zone_freepage_state(zone, -(1 << order),
@@ -5009,7 +5009,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
show_node(zone);
printk(KERN_CONT "%s: ", zone->name);
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
struct free_area *area = &zone->free_area[order];
int type;
@@ -5023,7 +5023,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
types[order] |= 1 << type;
}
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
printk(KERN_CONT "%lu*%lukB ",
nr[order], K(1UL) << order);
@@ -6247,7 +6247,7 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
zone_set_nid(zone, nid);
zone->name = zone_names[idx];
zone->zone_pgdat = NODE_DATA(nid);
- spin_lock_init(&zone->lock);
+ rwlock_init(&zone->lock);
zone_seqlock_init(zone);
zone_pcp_init(zone);
}
@@ -7239,7 +7239,7 @@ static void __setup_per_zone_wmarks(void)
for_each_zone(zone) {
u64 tmp;
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
tmp = (u64)pages_min * zone->managed_pages;
do_div(tmp, lowmem_pages);
if (is_highmem(zone)) {
@@ -7277,7 +7277,7 @@ static void __setup_per_zone_wmarks(void)
zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
}
/* update totalreserve_pages */
@@ -8041,7 +8041,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
return;
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
pfn = start_pfn;
while (pfn < end_pfn) {
if (!pfn_valid(pfn)) {
@@ -8073,7 +8073,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
SetPageReserved((page+i));
pfn += (1 << order);
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
}
#endif
@@ -8084,14 +8084,14 @@ bool is_free_buddy_page(struct page *page)
unsigned long flags;
unsigned int order;
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
if (PageBuddy(page_head) && page_order(page_head) >= order)
break;
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
return order < MAX_ORDER;
}
@@ -8110,7 +8110,7 @@ bool set_hwpoison_free_buddy_page(struct page *page)
unsigned int order;
bool hwpoisoned = false;
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
@@ -8120,7 +8120,7 @@ bool set_hwpoison_free_buddy_page(struct page *page)
break;
}
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
return hwpoisoned;
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 43e085608846..5c99fc2a1616 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -26,7 +26,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
zone = page_zone(page);
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
/*
* We assume the caller intended to SET migrate type to isolate.
@@ -82,7 +82,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
__mod_zone_freepage_state(zone, -nr_pages, mt);
}
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
if (!ret)
drain_all_pages(zone);
return ret;
@@ -98,7 +98,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
struct page *buddy;
zone = page_zone(page);
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
if (!is_migrate_isolate_page(page))
goto out;
@@ -137,7 +137,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
set_pageblock_migratetype(page, migratetype);
zone->nr_isolate_pageblock--;
out:
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
if (isolated_page) {
post_alloc_hook(page, order, __GFP_MOVABLE);
__free_pages(page, order);
@@ -299,10 +299,10 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
return -EBUSY;
/* Check all pages are free or marked as ISOLATED */
zone = page_zone(page);
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
pfn = __test_page_isolated_in_pageblock(start_pfn, end_pfn,
skip_hwpoisoned_pages);
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
trace_test_pages_isolated(start_pfn, end_pfn, pfn);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ba0870ecddd..06d79271a8ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1337,10 +1337,10 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
continue;
if (!nolock)
- spin_lock_irqsave(&zone->lock, flags);
+ write_lock_irqsave(&zone->lock, flags);
print(m, pgdat, zone);
if (!nolock)
- spin_unlock_irqrestore(&zone->lock, flags);
+ write_unlock_irqrestore(&zone->lock, flags);
}
}
#endif
--
2.17.1
Since we will make the free path run concurrently, free_area[].nr_free
has to be atomic.
Signed-off-by: Aaron Lu <[email protected]>
---
include/linux/mmzone.h | 2 +-
mm/page_alloc.c | 12 ++++++------
mm/vmstat.c | 4 ++--
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 84cfa56e2d19..e66b8c63d5d1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -95,7 +95,7 @@ extern int page_group_by_mobility_disabled;
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
- unsigned long nr_free;
+ atomic_long_t nr_free;
};
struct pglist_data;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0b954783f1d..dff3edc60d71 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -701,7 +701,7 @@ static inline void add_to_buddy_common(struct page *page, struct zone *zone,
unsigned int order)
{
set_page_order(page, order);
- zone->free_area[order].nr_free++;
+ atomic_long_inc(&zone->free_area[order].nr_free);
}
static inline void add_to_buddy_head(struct page *page, struct zone *zone,
@@ -728,7 +728,7 @@ static inline void remove_from_buddy(struct page *page, struct zone *zone,
unsigned int order)
{
list_del(&page->lru);
- zone->free_area[order].nr_free--;
+ atomic_long_dec(&zone->free_area[order].nr_free);
rmv_page_order(page);
}
@@ -2225,7 +2225,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
int i;
int fallback_mt;
- if (area->nr_free == 0)
+ if (atomic_long_read(&area->nr_free) == 0)
return -1;
*can_steal = false;
@@ -3178,7 +3178,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
struct free_area *area = &z->free_area[o];
int mt;
- if (!area->nr_free)
+ if (atomic_long_read(&area->nr_free) == 0)
continue;
for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
@@ -5029,7 +5029,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
struct free_area *area = &zone->free_area[order];
int type;
- nr[order] = area->nr_free;
+ nr[order] = atomic_long_read(&area->nr_free);
total += nr[order] << order;
types[order] = 0;
@@ -5562,7 +5562,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
unsigned int order, t;
for_each_migratetype_order(order, t) {
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
- zone->free_area[order].nr_free = 0;
+ atomic_long_set(&zone->free_area[order].nr_free, 0);
}
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 06d79271a8ae..c1985550bb9f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1030,7 +1030,7 @@ static void fill_contig_page_info(struct zone *zone,
unsigned long blocks;
/* Count number of free blocks */
- blocks = zone->free_area[order].nr_free;
+ blocks = atomic_long_read(&zone->free_area[order].nr_free);
info->free_blocks_total += blocks;
/* Count free base pages */
@@ -1353,7 +1353,7 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_printf(m, "%6lu ", atomic_long_read(&zone->free_area[order].nr_free));
seq_putc(m, '\n');
}
--
2.17.1
On Tue, Sep 11, 2018 at 01:36:07PM +0800, Aaron Lu wrote:
> Daniel Jordan and others proposed an innovative technique to make
> multiple threads concurrently use list_del() at any position of the
> list and list_add() at head position of the list without taking a lock
> in this year's MM summit[0].
>
> People think this technique may be useful to improve zone lock
> scalability so here is my try.
Nice, this uses the smp_list_* functions well in spite of the limitations you
encountered with them here.
> Performance wise on 56 cores/112 threads Intel Skylake 2 sockets server
> using will-it-scale/page_fault1 process mode(higher is better):
>
> kernel performance zone lock contention
> patch1 9219349 76.99%
> patch7 2461133 -73.3% 54.46%(another 34.66% on smp_list_add())
> patch8 11712766 +27.0% 68.14%
> patch9 11386980 +23.5% 67.18%
Is "zone lock contention" the percentage that readers and writers combined
spent waiting? I'm curious to see read and write wait time broken out, since
it seems there are writers (very likely on the allocation side) spoiling the
parallelism we get with the read lock.
If the contention is from allocation, I wonder whether it's feasible to make
that path use SMP list functions. Something like smp_list_cut_position
combined with using page clusters from [*] to cut off a chunk of list. Many
details to keep in mind there, though, like having to unset PageBuddy in that
list chunk when other tasks can be concurrently merging pages that are part of
it.
Or maybe what's needed is a more scalable data structure than an array of
lists, since contention on the heads seems to be the limiting factor. A simple
list that keeps the pages in most-recently-used order (except when adding to
the list tail) is good for cache warmth, but I wonder how helpful that is when
all CPUs can allocate from the front. Having multiple places to put pages of a
given order/mt would ease the contention.
> Though lock contention reduced a lot for patch7, the performance dropped
> considerably due to severe cache bouncing on free list head among
> multiple threads doing page free at the same time, because every page free
> will need to add the page to the free list head.
Could be beneficial to take an MCS-style approach in smp_list_splice/add so
that multiple waiters aren't bouncing the same cacheline around. This is
something I'm planning to try on lru_lock.
Daniel
[*] https://lkml.kernel.org/r/[email protected]
On Fri, Sep 21, 2018 at 10:45:36AM -0700, Daniel Jordan wrote:
> On Tue, Sep 11, 2018 at 01:36:07PM +0800, Aaron Lu wrote:
> > Daniel Jordan and others proposed an innovative technique to make
> > multiple threads concurrently use list_del() at any position of the
> > list and list_add() at head position of the list without taking a lock
> > in this year's MM summit[0].
> >
> > People think this technique may be useful to improve zone lock
> > scalability so here is my try.
>
> Nice, this uses the smp_list_* functions well in spite of the limitations you
> encountered with them here.
>
> > Performance wise on 56 cores/112 threads Intel Skylake 2 sockets server
> > using will-it-scale/page_fault1 process mode(higher is better):
> >
> > kernel performance zone lock contention
> > patch1 9219349 76.99%
> > patch7 2461133 -73.3% 54.46%(another 34.66% on smp_list_add())
> > patch8 11712766 +27.0% 68.14%
> > patch9 11386980 +23.5% 67.18%
>
> Is "zone lock contention" the percentage that readers and writers combined
> spent waiting? I'm curious to see read and write wait time broken out, since
> it seems there are writers (very likely on the allocation side) spoiling the
> parallelism we get with the read lock.
The lock contention number is combined: the read side accounts for about
31% and the write side for about 35%. The write side is definitely
blocking the read side.
I also tried not taking the lock in read mode on the free path at all, to
avoid the free path blocking on the allocation path, but that caused other
unpleasant consequences for the allocation path: free_list head->next can
be NULL when allocating pages, because the free path may be in the middle
of adding pages to the list with smp_list_add()/smp_list_splice(), so I
had to fetch pages from free_list head->prev instead on the allocation
path. Also, the fetched page can be merged by the free path in the
meantime, so the allocation path has to confirm the page is really still
usable, etc. This complicated the allocation path and didn't deliver good
results, so I gave up on this idea.
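Roughly, the allocation side of that experiment looked like the sketch
below (heavily simplified and only illustrative: the function name is made
up, and the remaining synchronization and accounting details are left out;
the helpers reused here are the ones already in mm/page_alloc.c):

/* context: mm/page_alloc.c on top of this series, illustrative only */
static struct page *fetch_page_from_tail(struct free_area *area,
					 int migratetype, unsigned int order)
{
	struct list_head *head = &area->free_list[migratetype];
	struct page *page;

	if (list_empty(head))
		return NULL;

	/*
	 * head->next may be transiently NULL while another CPU is inside
	 * smp_list_add()/smp_list_splice(), so take pages from the tail
	 * (head->prev) instead.
	 */
	page = list_last_entry(head, struct page, lru);

	/*
	 * The page may have been merged into a higher order buddy by a
	 * concurrent free in the meantime, so re-check before using it.
	 */
	if (!PageBuddy(page) || page_order(page) != order)
		return NULL;

	list_del(&page->lru);
	rmv_page_order(page);
	atomic_long_dec(&area->nr_free);
	return page;
}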
> If the contention is from allocation, I wonder whether it's feasible to make
> that path use SMP list functions. Something like smp_list_cut_position
> combined with using page clusters from [*] to cut off a chunk of list. Many
> details to keep in mind there, though, like having to unset PageBuddy in that
> list chunk when other tasks can be concurrently merging pages that are part of
> it.
As you put it here, the PageBuddy flag is a problem. If I cut off a batch
of pages from the free_list and then drop the lock, these pages still have
the PageBuddy flag set, so the free path can attempt a merge with any of
them and cause problems.
The PageBuddy flag cannot be cleared with the lock held, since that would
require touching the 'struct page's of all these pages, and that is the
most time consuming part of everything the allocation path does under the
zone lock.
This is doable in your referenced no_merge+cluster_alloc approach because
merging is skipped most of the time there. And when merging really needs
to happen, as in compaction, cluster_alloc is disabled.
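Just to make the window concrete, a cut-a-batch allocation path would look
something like the sketch below (purely illustrative: made-up helper name,
the stock list_cut_position() instead of an smp_ variant, and nr_free
accounting omitted). Everything after the unlock races with concurrent
merging on the free path, which is exactly the problem:

/* illustrative only: cut a batch under the lock, fix up PageBuddy after */
static unsigned int cut_free_batch(struct zone *zone, struct free_area *area,
				   int migratetype, struct list_head *batch,
				   unsigned int nr)
{
	struct page *page, *tmp;
	struct list_head *cut = NULL;
	unsigned int n = 0;

	write_lock(&zone->lock);
	list_for_each_entry(page, &area->free_list[migratetype], lru) {
		cut = &page->lru;
		if (++n == nr)
			break;
	}
	if (cut)
		list_cut_position(batch, &area->free_list[migratetype], cut);
	write_unlock(&zone->lock);

	/*
	 * Clearing PageBuddy here avoids touching all these 'struct page's
	 * under zone->lock, but until it is done the free path can still
	 * pick any of these pages as a merge candidate.
	 */
	list_for_each_entry_safe(page, tmp, batch, lru)
		__ClearPageBuddy(page);

	return n;
}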
> Or maybe what's needed is a more scalable data structure than an array of
> lists, since contention on the heads seems to be the limiting factor. A simple
> list that keeps the pages in most-recently-used order (except when adding to
> the list tail) is good for cache warmth, but I wonder how helpful that is when
> all CPUs can allocate from the front. Having multiple places to put pages of a
> given order/mt would ease the contention.
Agree.
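Just to illustrate "multiple places" (all names invented, shard count
arbitrary): something shaped like the sketch below would let freeing CPUs
spread their list head updates over several cachelines, at the price of
the allocation side having to scan the shards:

#define FREE_LIST_SHARDS	8	/* arbitrary */

struct sharded_free_area {
	/* several heads per migratetype so the free path can pick by CPU */
	struct list_head free_list[MIGRATE_TYPES][FREE_LIST_SHARDS];
	atomic_long_t nr_free;
};

static inline struct list_head *shard_head(struct sharded_free_area *area,
					   int migratetype)
{
	int shard = raw_smp_processor_id() % FREE_LIST_SHARDS;

	return &area->free_list[migratetype][shard];
}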
> > Though lock contention reduced a lot for patch7, the performance dropped
> > considerably due to severe cache bouncing on free list head among
> > multiple threads doing page free at the same time, because every page free
> > will need to add the page to the free list head.
>
> Could be beneficial to take an MCS-style approach in smp_list_splice/add so
> that multiple waiters aren't bouncing the same cacheline around. This is
> something I'm planning to try on lru_lock.
That's a good idea.
If that is done, we can at least parallelise the free path and gain
something by not paying the penalty of cache bouncing on the list head.
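For reference, the classic MCS queue is sketched below (memory ordering
details omitted; the kernel already carries this pattern in
kernel/locking/mcs_spinlock.h): every waiter spins on its own node rather
than on the shared head, which is what would keep the free list head
cacheline from bouncing among all the waiters. How to fold this into
smp_list_add()/smp_list_splice() is of course the hard part.

struct mcs_node {
	struct mcs_node *next;
	int locked;
};

static void mcs_lock(struct mcs_node **tail, struct mcs_node *node)
{
	struct mcs_node *prev;

	node->next = NULL;
	node->locked = 0;

	/* the only atomic RMW on shared state is on the tail pointer */
	prev = xchg(tail, node);
	if (!prev)
		return;		/* queue was empty, we own the lock */

	WRITE_ONCE(prev->next, node);
	/* spin on our own node, not on a shared cacheline */
	while (!READ_ONCE(node->locked))
		cpu_relax();
}

static void mcs_unlock(struct mcs_node **tail, struct mcs_node *node)
{
	struct mcs_node *next = READ_ONCE(node->next);

	if (!next) {
		/* no successor visible: try to release by clearing the tail */
		if (cmpxchg(tail, node, NULL) == node)
			return;
		/* a successor is queueing right now, wait until it links in */
		while (!(next = READ_ONCE(node->next)))
			cpu_relax();
	}
	WRITE_ONCE(next->locked, 1);
}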
>
> Daniel
>
> [*] https://lkml.kernel.org/r/[email protected]
And thanks a lot for the comments!