This series is meant to improve zone->lock scalability for order-0 pages.
With the will-it-scale/page_fault1 workload on a 2-socket Intel Skylake
server with 112 CPUs, the CPUs spend about 80% of their time spinning on
zone->lock. A perf profile shows the most time consuming part under
zone->lock is the cache misses on "struct page", so this series tries to
avoid those cache misses.
Patch 1/5 adds some wrapper functions for adding/removing a page to/from
buddy and has no functional change.
Patch 2/5 skips merging for order-0 pages to avoid cache misses on the
buddy's "struct page". On a 2-socket Intel Skylake, this has a very good
effect on the free path for will-it-scale/page_fault1 at full load in
that it reduced zone->lock contention on the free path from 35% to 1.1%.
It also shows a good result on the parallel free(*) workload by reducing
zone->lock contention from 90% to almost zero (lru_lock contention
increased from almost 0 to 90% though).
Patch 3/5 deals with allocation-path zone->lock contention by not
touching pages on the free_list one by one inside zone->lock. Together
with patch 2/5, zone->lock contention is entirely eliminated for
will-it-scale/page_fault1 at full load, though this patch adds some
overhead to manage clusters on the free path and hurts the parallel free
workload in that it increased zone->lock contention there from almost 0
to 25%.
Patch 4/5 is an optimization of the free path's cluster operation. It
decreases the number of times add_to_cluster() has to be called and
restores performance for the parallel free workload by reducing
zone->lock contention to almost 0% again.
Patch 5/5 relaxes the conditions under which no_merge and cluster_alloc
can happen.
The good thing about this patchset is that it eliminates zone->lock
contention for will-it-scale/page_fault1 and parallel free on big
servers (the contention shifted to lru_lock).
Tariq Toukan has kindly given this patchset a go using netperf and here
are the numbers. Quoting him:
"
I ran TCP multistream tests before and after your changes to mm.
In order to stress the zone lock, I made sure there are always PCPs that
continuously strive to allocate pages (CPUs where driver allocates
memory), and PCPs that continuously free pages back to buddy
(application/stack release the socket buffer).
This was done by configuring less queues than CPUs, and running more TCP
streams than queues.
In addition, to make the test more page consuming, I modified the driver
as follows:
- disabled the rx page cache mechanism.
- disabled the page-reuse optimization, so that every page serves a
single packet, instead of two (each is of default MTU size: 1500B).
NIC: ConnectX-5 (100Gbps)
lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping: 2
CPU MHz: 3306.118
BogoMIPS: 4999.28
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s):
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):
1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Test:
run multiple netperf instances (TCP_STREAM) for 10 secs, repeat 3 times.
Number of streams: 48, 96, 192, 240, 360, and 480.
Collect stats for:
- BW (in Mbps)
- CPU usage (sum of percentages for all 48 cores).
- perf profile
24 queues:
Before:
STREAMS: 48 BW: 72890.68 CPU: 4457 (44.57)
STREAMS: 48 BW: 70518.78 CPU: 4577 (45.77)
STREAMS: 48 BW: 73354.36 CPU: 4160 (41.60)
STREAMS: 96 BW: 71048.86 CPU: 4809 (48.09)
STREAMS: 96 BW: 70849.41 CPU: 4801 (48.01)
STREAMS: 96 BW: 71156.13 CPU: 4804 (48.04)
STREAMS: 192 BW: 69967.00 CPU: 4803 (48.03)
STREAMS: 192 BW: 67814.98 CPU: 4805 (48.05)
STREAMS: 192 BW: 67767.61 CPU: 4803 (48.03)
STREAMS: 240 BW: 68131.36 CPU: 4805 (48.05)
STREAMS: 240 BW: 67128.16 CPU: 4804 (48.04)
STREAMS: 240 BW: 71137.51 CPU: 4804 (48.04)
STREAMS: 360 BW: 71613.75 CPU: 4804 (48.04)
STREAMS: 360 BW: 72516.28 CPU: 4803 (48.03)
STREAMS: 360 BW: 69121.28 CPU: 4803 (48.03)
STREAMS: 480 BW: 73367.51 CPU: 4805 (48.05)
STREAMS: 480 BW: 74699.93 CPU: 4804 (48.04)
STREAMS: 480 BW: 71192.96 CPU: 4809 (48.09)
zone lock bottleneck (queued_spin_lock_slowpath)
perf top (~similar for all num of streams):
72.36% [kernel] [k] queued_spin_lock_slowpath
10.23% [kernel] [k] copy_user_enhanced_fast_string
2.10% [kernel] [k] copy_page_to_iter
1.36% [kernel] [k] __list_del_entry_valid
After:
STREAMS: 48 BW: 94049.28 CPU: 1651 (16.51)
STREAMS: 48 BW: 94279.97 CPU: 1939 (19.39)
STREAMS: 48 BW: 94247.40 CPU: 1653 (16.53)
STREAMS: 96 BW: 94292.01 CPU: 1905 (19.05)
STREAMS: 96 BW: 94296.22 CPU: 1908 (19.08)
STREAMS: 96 BW: 94301.79 CPU: 1850 (18.50)
STREAMS: 192 BW: 93225.68 CPU: 2034 (20.34)
STREAMS: 192 BW: 93408.97 CPU: 1919 (19.19)
STREAMS: 192 BW: 94486.73 CPU: 2051 (20.51)
STREAMS: 240 BW: 92829.74 CPU: 2055 (20.55)
STREAMS: 240 BW: 94280.25 CPU: 2120 (21.20)
STREAMS: 240 BW: 94504.60 CPU: 2052 (20.52)
STREAMS: 360 BW: 94715.63 CPU: 2087 (20.87)
STREAMS: 360 BW: 94536.92 CPU: 2361 (23.61)
STREAMS: 360 BW: 96327.03 CPU: 2254 (22.54)
STREAMS: 480 BW: 95101.56 CPU: 2404 (24.04)
STREAMS: 480 BW: 95250.94 CPU: 2372 (23.72)
STREAMS: 480 BW: 99479.32 CPU: 2630 (26.30)
bottleneck is released, linerate is reached, significantly lower cpu.
perf top (~similar for all num of streams):
25.52% [kernel] [k] copy_user_enhanced_fast_string
5.93% [kernel] [k] queued_spin_lock_slowpath
3.72% [kernel] [k] copy_page_to_iter
3.58% [kernel] [k] intel_idle
3.24% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
2.23% [kernel] [k] get_page_from_freelist
1.94% [kernel] [k] build_skb
1.84% [kernel] [k] tcp_gro_receive
1.81% [kernel] [k] poll_idle
1.40% [kernel] [k] __list_del_entry_valid
1.07% [kernel] [k] dev_gro_receive
12 queues:
Before:
STREAMS: 48 BW: 61766.86 CPU: 4158 (41.58)
STREAMS: 48 BW: 64199.99 CPU: 3661 (36.61)
STREAMS: 48 BW: 63818.20 CPU: 3929 (39.29)
STREAMS: 96 BW: 56918.94 CPU: 4779 (47.79)
STREAMS: 96 BW: 58083.64 CPU: 4733 (47.33)
STREAMS: 96 BW: 57821.44 CPU: 4711 (47.11)
STREAMS: 192 BW: 58394.17 CPU: 4795 (47.95)
STREAMS: 192 BW: 56975.54 CPU: 4800 (48.00)
STREAMS: 192 BW: 57661.05 CPU: 4798 (47.98)
STREAMS: 240 BW: 56555.59 CPU: 4801 (48.01)
STREAMS: 240 BW: 58227.32 CPU: 4799 (47.99)
STREAMS: 240 BW: 57478.13 CPU: 4805 (48.05)
STREAMS: 360 BW: 59316.66 CPU: 4804 (48.04)
STREAMS: 360 BW: 62893.67 CPU: 4803 (48.03)
STREAMS: 360 BW: 59385.07 CPU: 4804 (48.04)
STREAMS: 480 BW: 66586.20 CPU: 4805 (48.05)
STREAMS: 480 BW: 59929.05 CPU: 4803 (48.03)
STREAMS: 480 BW: 61451.14 CPU: 4804 (48.04)
STREAMS: 960 BW: 73923.86 CPU: 4805 (48.05)
STREAMS: 960 BW: 61479.10 CPU: 4804 (48.04)
STREAMS: 960 BW: 73230.86 CPU: 4804 (48.04)
bottleneck is more severe.
perf top:
78.58% [kernel] [k] queued_spin_lock_slowpath
6.56% [kernel] [k] copy_user_enhanced_fast_string
1.49% [kernel] [k] __list_del_entry_valid
1.10% [kernel] [k] copy_page_to_iter
1.05% [kernel] [k] free_pcppages_bulk
After:
STREAMS: 48 BW: 94114.63 CPU: 1961 (19.61)
STREAMS: 48 BW: 94865.69 CPU: 1838 (18.38)
STREAMS: 48 BW: 94222.46 CPU: 2164 (21.64)
STREAMS: 96 BW: 94307.39 CPU: 2184 (21.84)
STREAMS: 96 BW: 93282.46 CPU: 2765 (27.65)
STREAMS: 96 BW: 93642.15 CPU: 2743 (27.43)
STREAMS: 192 BW: 92575.63 CPU: 3093 (30.93)
STREAMS: 192 BW: 92868.66 CPU: 3028 (30.28)
STREAMS: 192 BW: 92749.76 CPU: 3069 (30.69)
STREAMS: 240 BW: 92793.80 CPU: 3131 (31.31)
STREAMS: 240 BW: 93138.46 CPU: 3022 (30.22)
STREAMS: 240 BW: 92520.57 CPU: 3215 (32.15)
STREAMS: 360 BW: 93200.99 CPU: 3328 (33.28)
STREAMS: 360 BW: 92822.61 CPU: 3254 (32.54)
STREAMS: 360 BW: 93138.66 CPU: 3229 (32.29)
STREAMS: 480 BW: 93484.17 CPU: 3184 (31.84)
STREAMS: 480 BW: 92930.23 CPU: 3372 (33.72)
bottleneck released.
28.09% [kernel] [k] copy_user_enhanced_fast_string
3.46% [kernel] [k] copy_page_to_iter
2.71% [kernel] [k] intel_idle
2.44% [kernel] [k] queued_spin_lock_slowpath
2.30% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
1.85% [kernel] [k] mlx5e_sq_xmit
1.69% [kernel] [k] get_page_from_freelist
1.59% [kernel] [k] __list_del_entry_valid
1.51% [kernel] [k] __slab_free
1.43% [kernel] [k] __tcp_transmit_skb
1.37% [kernel] [k] tcp_rcv_established
1.19% [kernel] [k] _raw_spin_lock
1.16% [kernel] [k] tcp_recvmsg
1.05% [kernel] [k] _raw_spin_lock_bh
In addition, I tested single/small number of streams (no serious
contention on zone lock), and observed 'similar or better' results for
all cases.
"
The bad things are:
- it adds some overhead in the compaction path, which will do merging
for those merge-skipped order-0 pages;
- it is unfriendly to high-order page allocation since we no longer do
merging for order-0 pages.
To see how much effect it has on compaction success rate,
mmtests/stress-highalloc is used on a desktop machine with 8 CPUs and
4G memory. (mmtests/stress-highalloc: make N copies of the kernel tree
and start building them to consume almost all memory with reclaimable
file page cache. This page cache will not be returned to buddy, which
effectively makes this a worst case for a high-order page workload.
Then, after 5 minutes, start allocating X order-9 pages to see how well
compaction works.)
With a delay of 100ms between allocations:
kernel      success_rate   average_time_of_alloc_one_hugepage
base        58%            3.95927e+06 ns
patch2/5    58%            5.45935e+06 ns
patch4/5    57%            6.59174e+06 ns
With a delay of 1ms between allocations:
kernel      success_rate   average_time_of_alloc_one_hugepage
base        53%            3.17362e+06 ns
patch2/5    44%            2.31637e+06 ns
patch4/5    59%            2.73029e+06 ns
If we compare patch4/5's result with base, it performed OK I think.
This is probably because compaction is a heavy job, so the added
overhead doesn't matter much.
Please note that for workloads that use only order-0 pages, this
patchset will be a win; for workloads that use only huge pages, this
patchset will not have much impact since fragmentation only happens when
a lot of order-0 pages get allocated and freed. The only workloads that
will get hurt are those that use both order-0 pages and huge pages at
the same time.
To see how much effect it has on workloads that use both order-0 pages
and huge pages, I did the following test on a 2-socket Intel Skylake
with 112 CPUs/64G memory:
1 Break all high-order pages by starting a program that consumes almost
all memory with anonymous pages and then exits. This creates an
extremely bad case for this patchset compared to vanilla, which always
does merging;
2 Start 56 processes of will-it-scale/page_fault1 that use huge pages
through calling madvise(MADV_HUGEPAGE) (sketched below). To make things
worse for this patchset, start another 56 processes of
will-it-scale/page_fault1 that use order-0 pages to continually cause
trouble for the 56 THP users. Let them run for 5 minutes.
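For reference, each THP-using process does roughly the following. This
is a minimal sketch under my assumptions, not the exact will-it-scale
source; the 128M per-iteration mapping size is illustrative:

        #define _GNU_SOURCE             /* for MADV_HUGEPAGE */
        #include <sys/mman.h>
        #include <string.h>
        #include <stdlib.h>

        #define LEN (128UL << 20)       /* assumed mapping size per iteration */

        int main(void)
        {
                for (;;) {
                        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                        if (p == MAP_FAILED)
                                exit(1);
                        /* hint the kernel to back faults with huge pages */
                        madvise(p, LEN, MADV_HUGEPAGE);
                        memset(p, 1, LEN);      /* fault the pages in */
                        munmap(p, LEN);
                }
        }

The plain order-0 processes are the same loop without the madvise()
call.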
Score result (higher is better):
kernel      order-0             THP
base        1522246             10540254
patch2/5    5266247 (+246%)     3309816 (-69%)
patch4/5    2234073 (+47%)      9610295 (-8.8%)
Real workloads will differ of course.
(*) Parallel free is a workload I used to see how fast freeing a large
VMA in parallel can be. I tested this on a 4-socket Intel Skylake
machine with 768G memory. The test program does a 512G anonymous memory
allocation with mmap() and then exits, to see how fast the exit can be
(a sketch follows below). The parallelism is implemented inside the
kernel and has been posted before:
http://lkml.kernel.org/r/[email protected]
A branch is maintained here in case someone wants to give it a try:
https://github.com/aaronlu/linux no_merge_cluster_alloc_4.19-rc5
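The userspace side of that test is roughly the following. This is a
minimal sketch under my assumptions; the exit time is measured
externally, e.g. with time(1):

        #include <sys/mman.h>
        #include <string.h>
        #include <stdio.h>

        int main(void)
        {
                size_t len = 512UL << 30;       /* 512G anonymous mapping */
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                memset(p, 1, len);      /* populate every page */
                /*
                 * Exiting frees the whole 512G VMA; how long this teardown
                 * takes is what the in-kernel parallel free work speeds up.
                 */
                return 0;
        }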
v4:
- rebased to v4.19-rc5;
- added numbers from netperf (courtesy of Tariq Toukan)
Aaron Lu (5):
mm/page_alloc: use helper functions to add/remove a page to/from buddy
mm/__free_one_page: skip merge for order-0 page unless compaction
failed
mm/rmqueue_bulk: alloc without touching individual page structure
mm/free_pcppages_bulk: reduce overhead of cluster operation on free
path
mm/can_skip_merge(): make it more aggressive to attempt cluster
alloc/free
include/linux/mm_types.h | 10 +-
include/linux/mmzone.h | 35 +++
mm/compaction.c | 17 +-
mm/internal.h | 57 +++++
mm/page_alloc.c | 490 +++++++++++++++++++++++++++++++++++----
5 files changed, 556 insertions(+), 53 deletions(-)
--
2.17.2
Running the will-it-scale/page_fault1 process mode workload on a
2-socket Intel Skylake server showed severe zone->lock contention: as
much as about 80% (42% on the allocation path and 35% on the free path)
of CPU cycles are burnt spinning. According to perf, the most time
consuming part inside that lock on the free path is cache misses on page
structures, mostly on the to-be-freed page's buddy due to merging.
One way to avoid this overhead is to not do any merging at all for
order-0 pages. With this approach, zone->lock contention on the free
path dropped to 1.1%, but the allocation side still has as high as 42%
lock contention. Meanwhile, the dropped lock contention on the free side
doesn't translate into a performance increase; instead, it is consumed
by increased contention on the per-node lru_lock (which rose from 5% to
37%) and the final performance dropped slightly, by about 1%.
Though performance dropped a little, this almost eliminates zone->lock
contention on the free path and is the foundation for the next patch,
which eliminates zone->lock contention on the allocation path.
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
include/linux/mm_types.h | 9 +++-
mm/compaction.c | 13 +++++-
mm/internal.h | 27 ++++++++++++
mm/page_alloc.c | 88 ++++++++++++++++++++++++++++++++++------
4 files changed, 121 insertions(+), 16 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..aed93053ef6e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -179,8 +179,13 @@ struct page {
int units; /* SLOB */
};
- /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
- atomic_t _refcount;
+ union {
+ /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
+ atomic_t _refcount;
+
+ /* For pages in Buddy: if skipped merging when added to Buddy */
+ bool buddy_merge_skipped;
+ };
#ifdef CONFIG_MEMCG
struct mem_cgroup *mem_cgroup;
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..0c9c7a30dde3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -777,8 +777,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
* potential isolation targets.
*/
if (PageBuddy(page)) {
- unsigned long freepage_order = page_order_unsafe(page);
+ unsigned long freepage_order;
+ /*
+ * If this is a merge_skipped page, do merge now
+ * since high-order pages are needed. zone lock
+ * isn't taken for the merge_skipped check so the
+ * check could be wrong but the worst case is we
+ * lose a merge opportunity.
+ */
+ if (page_merge_was_skipped(page))
+ try_to_merge_page(page);
+
+ freepage_order = page_order_unsafe(page);
/*
* Without lock, we cannot be sure that what we got is
* a valid page order. Consider only values in the
diff --git a/mm/internal.h b/mm/internal.h
index 87256ae1bef8..c166735a559e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -527,4 +527,31 @@ static inline bool is_migrate_highatomic_page(struct page *page)
void setup_zone_pageset(struct zone *zone);
extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+
+static inline bool page_merge_was_skipped(struct page *page)
+{
+ return page->buddy_merge_skipped;
+}
+
+void try_to_merge_page(struct page *page);
+
+#ifdef CONFIG_COMPACTION
+static inline bool can_skip_merge(struct zone *zone, int order)
+{
+ /* Compaction has failed in this zone, we shouldn't skip merging */
+ if (zone->compact_considered)
+ return false;
+
+ /* Only consider no_merge for order 0 pages */
+ if (order)
+ return false;
+
+ return true;
+}
+#else /* CONFIG_COMPACTION */
+static inline bool can_skip_merge(struct zone *zone, int order)
+{
+ return false;
+}
+#endif /* CONFIG_COMPACTION */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 14c20bb3a3da..76d471e0ab24 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -691,6 +691,16 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype) {}
#endif
+static inline void set_page_merge_skipped(struct page *page)
+{
+ page->buddy_merge_skipped = true;
+}
+
+static inline void clear_page_merge_skipped(struct page *page)
+{
+ page->buddy_merge_skipped = false;
+}
+
static inline void set_page_order(struct page *page, unsigned int order)
{
set_page_private(page, order);
@@ -700,6 +710,7 @@ static inline void set_page_order(struct page *page, unsigned int order)
static inline void add_to_buddy_common(struct page *page, struct zone *zone,
unsigned int order)
{
+ clear_page_merge_skipped(page);
set_page_order(page, order);
zone->free_area[order].nr_free++;
}
@@ -730,6 +741,7 @@ static inline void remove_from_buddy(struct page *page, struct zone *zone,
list_del(&page->lru);
zone->free_area[order].nr_free--;
rmv_page_order(page);
+ clear_page_merge_skipped(page);
}
/*
@@ -797,7 +809,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
* -- nyc
*/
-static inline void __free_one_page(struct page *page,
+static inline void do_merge(struct page *page,
unsigned long pfn,
struct zone *zone, unsigned int order,
int migratetype)
@@ -809,16 +821,6 @@ static inline void __free_one_page(struct page *page,
max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
- VM_BUG_ON(!zone_is_initialized(zone));
- VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
-
- VM_BUG_ON(migratetype == -1);
- if (likely(!is_migrate_isolate(migratetype)))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
-
- VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
- VM_BUG_ON_PAGE(bad_range(zone, page), page);
-
continue_merging:
while (order < max_order - 1) {
buddy_pfn = __find_buddy_pfn(pfn, order);
@@ -891,6 +893,61 @@ static inline void __free_one_page(struct page *page,
add_to_buddy_head(page, zone, order, migratetype);
}
+void try_to_merge_page(struct page *page)
+{
+ unsigned long pfn, buddy_pfn, flags;
+ struct page *buddy;
+ struct zone *zone;
+
+ /*
+ * No need to do merging if buddy is not free.
+ * zone lock isn't taken so this could be wrong but worst case
+ * is we lose a merge opportunity.
+ */
+ pfn = page_to_pfn(page);
+ buddy_pfn = __find_buddy_pfn(pfn, 0);
+ buddy = page + (buddy_pfn - pfn);
+ if (!PageBuddy(buddy))
+ return;
+
+ zone = page_zone(page);
+ spin_lock_irqsave(&zone->lock, flags);
+ /* Verify again after taking the lock */
+ if (likely(PageBuddy(page) && page_merge_was_skipped(page) &&
+ PageBuddy(buddy))) {
+ int mt = get_pageblock_migratetype(page);
+
+ remove_from_buddy(page, zone, 0);
+ do_merge(page, pfn, zone, 0, mt);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static inline void __free_one_page(struct page *page,
+ unsigned long pfn,
+ struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ VM_BUG_ON(!zone_is_initialized(zone));
+ VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
+
+ VM_BUG_ON(migratetype == -1);
+ if (likely(!is_migrate_isolate(migratetype)))
+ __mod_zone_freepage_state(zone, 1 << order, migratetype);
+
+ VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
+ VM_BUG_ON_PAGE(bad_range(zone, page), page);
+
+ if (can_skip_merge(zone, order)) {
+ add_to_buddy_head(page, zone, 0, migratetype);
+ set_page_merge_skipped(page);
+ return;
+ }
+
+ do_merge(page, pfn, zone, order, migratetype);
+}
+
+
/*
* A bad page could be due to a number of fields. Instead of multiple branches,
* try and check multiple fields with one check. The caller must do a detailed
@@ -1148,9 +1205,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
* can be offset by reduced memory latency later. To
* avoid excessive prefetching due to large count, only
* prefetch buddy for the first pcp->batch nr of pages.
+ *
+ * If merge can be skipped, no need to prefetch buddy.
*/
- if (prefetch_nr++ < pcp->batch)
- prefetch_buddy(page);
+ if (can_skip_merge(zone, 0) || prefetch_nr > pcp->batch)
+ continue;
+
+ prefetch_buddy(page);
+ prefetch_nr++;
} while (--count && --batch_free && !list_empty(list));
}
--
2.17.2
After "no_merge for order 0", the biggest overhead in free path for
order 0 pages is now add_to_cluster(). As pages are freed one by one,
it caused frequent operation of add_to_cluster().
Ideally, if only one migratetype pcp list has pages to free and
count=pcp->batch in free_pcppages_bulk(), we can avoid calling
add_to_cluster() one time per page but adding them in one go as
a single cluster so this patch just did this.
This optimization brings zone->lock contention down from 25% to
almost zero again using the parallel free workload.
Signed-off-by: Aaron Lu <[email protected]>
---
mm/page_alloc.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e60a248030dc..204696f6c2f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1242,6 +1242,36 @@ static inline void prefetch_buddy(struct page *page)
prefetch(buddy);
}
+static inline bool free_cluster_pages(struct zone *zone, struct list_head *list,
+ int mt, int count)
+{
+ struct cluster *c;
+ struct page *page, *n;
+
+ if (!can_skip_merge(zone, 0))
+ return false;
+
+ if (count != this_cpu_ptr(zone->pageset)->pcp.batch)
+ return false;
+
+ c = new_cluster(zone, count, list_first_entry(list, struct page, lru));
+ if (unlikely(!c))
+ return false;
+
+ list_for_each_entry_safe(page, n, list, lru) {
+ set_page_order(page, 0);
+ set_page_merge_skipped(page);
+ page->cluster = c;
+ list_add(&page->lru, &zone->free_area[0].free_list[mt]);
+ }
+
+ INIT_LIST_HEAD(list);
+ zone->free_area[0].nr_free += count;
+ __mod_zone_page_state(zone, NR_FREE_PAGES, count);
+
+ return true;
+}
+
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
@@ -1256,10 +1286,10 @@ static inline void prefetch_buddy(struct page *page)
static void free_pcppages_bulk(struct zone *zone, int count,
struct per_cpu_pages *pcp)
{
- int migratetype = 0;
- int batch_free = 0;
+ int migratetype = 0, i, count_mt[MIGRATE_PCPTYPES] = {0};
+ int batch_free = 0, saved_count = count;
int prefetch_nr = 0;
- bool isolated_pageblocks;
+ bool isolated_pageblocks, single_mt = false;
struct page *page, *tmp;
LIST_HEAD(head);
@@ -1283,6 +1313,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* This is the only non-empty list. Free them all. */
if (batch_free == MIGRATE_PCPTYPES)
batch_free = count;
+ count_mt[migratetype] += batch_free;
do {
page = list_last_entry(list, struct page, lru);
@@ -1314,12 +1345,24 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (--count && --batch_free && !list_empty(list));
}
+ for (i = 0; i < MIGRATE_PCPTYPES; i++) {
+ if (count_mt[i] == saved_count) {
+ single_mt = true;
+ break;
+ }
+ }
+
spin_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);
+ if (!isolated_pageblocks && single_mt)
+ free_cluster_pages(zone, &head, migratetype, saved_count);
+
/*
* Use safe version since after __free_one_page(),
* page->lru.next will not point to original list.
+ *
+ * If free_cluster_pages() succeeds, head will be an empty list here.
*/
list_for_each_entry_safe(page, tmp, &head, lru) {
int mt = get_pcppage_migratetype(page);
--
2.17.2
A profile on an Intel Skylake server shows the most time consuming part
under zone->lock on the allocation path is accessing the to-be-returned
pages' "struct page" on the free_list inside zone->lock. One explanation
is that different CPUs are releasing pages to the head of the free_list,
so those pages' 'struct page' may very well be cache cold for the
allocating CPU when it grabs them from the free_list's head. The purpose
here is to avoid touching these pages one by one inside zone->lock.
The idea is to take the requested number of pages off the free_list with
something like list_cut_position() and adjust nr_free of the free_area
accordingly inside zone->lock, while other operations like clearing the
PageBuddy flag for these pages are done outside of zone->lock.
list_cut_position() needs to know where to cut; that is what the new
'struct cluster' is meant to provide. All pages on order 0's free_list
belong to a cluster, so when a number of pages is needed, the cluster to
which the head page of the free_list belongs is checked and the tail
page of that cluster can be found. With the tail page,
list_cut_position() can be used to drop the whole cluster off the
free_list. 'struct cluster' also has 'nr' to tell how many pages the
cluster has, so nr_free of the free_area can be adjusted inside the lock
too.
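To illustrate, here is a simplified sketch of the allocation-side idea.
It is illustrative only: grab_order0_cluster() is a made-up name and
most of the bookkeeping is omitted; the real code is in the diff below.

        /* With zone->lock held: detach one whole cluster in O(1). */
        static int grab_order0_cluster(struct zone *zone, int mt,
                                       struct list_head *list)
        {
                struct list_head *head = &zone->free_area[0].free_list[mt];
                struct page *first = list_first_entry_or_null(head,
                                                        struct page, lru);
                struct cluster *c;
                LIST_HEAD(tmp);

                if (!first || !first->cluster)
                        return 0;

                c = first->cluster;
                /* cut from the list head up to and including the tail page */
                list_cut_position(&tmp, head, &c->tail->lru);
                list_splice_tail(&tmp, list);
                zone->free_area[0].nr_free -= c->nr;
                return c->nr;
        }

After dropping zone->lock, the caller clears each page's PageBuddy flag
(rmv_page_order()) one by one, outside the lock.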
This opens a race window though: from the moment zone->lock is dropped
till these pages' PageBuddy flags get cleared, these pages are not in
buddy but still have the PageBuddy flag set.
This doesn't cause problems for users that access buddy pages through
the free_list. But there are other users, like move_freepages(), which
is used to move a pageblock's pages from one migratetype to another in
the fallback allocation path and tests the PageBuddy flag of pages
derived from a PFN. The end result could be that pages in the race
window get moved back to the free_list of another migratetype. For this
reason, a synchronization function zone_wait_cluster_alloc() is
introduced to wait till all pages are in the correct state. It is meant
to be called with zone->lock held, so after it returns, we do not need
to worry about new pages entering the racy state.
Another user is compaction, which scans a pageblock for migratable
candidates. In this process, pages derived from a PFN are checked for
the PageBuddy flag to decide whether they are merge-skipped pages. To
avoid a racy page getting merged back into buddy, the
zone_wait_and_disable_cluster_alloc() function is introduced to:
1 disable clustered allocation by increasing zone->cluster.disable_depth;
2 wait till the race window has passed by calling zone_wait_cluster_alloc().
This function is also meant to be called with zone->lock held, so after
it returns, all pages are in the correct state and no more cluster
allocations will be attempted till zone_enable_cluster_alloc() is called
to decrease zone->cluster.disable_depth.
Together, the two patches can eliminate zone->lock contention entirely,
but at the same time pgdat->lru_lock contention rose to 82%. Final
performance increased by about 8.3%.
Suggested-by: Ying Huang <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
include/linux/mm_types.h | 19 +--
include/linux/mmzone.h | 35 +++++
mm/compaction.c | 4 +
mm/internal.h | 34 +++++
mm/page_alloc.c | 288 +++++++++++++++++++++++++++++++++++++--
5 files changed, 363 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index aed93053ef6e..3abe1515502e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -85,8 +85,14 @@ struct page {
*/
struct list_head lru;
/* See page-flags.h for PAGE_MAPPING_FLAGS */
- struct address_space *mapping;
- pgoff_t index; /* Our offset within mapping. */
+ union {
+ struct address_space *mapping;
+ struct cluster *cluster;
+ };
+ union {
+ pgoff_t index; /* Our offset within mapping. */
+ bool buddy_merge_skipped;
+ };
/**
* @private: Mapping-private opaque data.
* Usually used for buffer_heads if PagePrivate.
@@ -179,13 +185,8 @@ struct page {
int units; /* SLOB */
};
- union {
- /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
- atomic_t _refcount;
-
- /* For pages in Buddy: if skipped merging when added to Buddy */
- bool buddy_merge_skipped;
- };
+ /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
+ atomic_t _refcount;
#ifdef CONFIG_MEMCG
struct mem_cgroup *mem_cgroup;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1e22d96734e0..765567366ddb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -356,6 +356,40 @@ enum zone_type {
#ifndef __GENERATING_BOUNDS_H
+struct cluster {
+ struct page *tail; /* tail page of the cluster */
+ int nr; /* how many pages are in this cluster */
+};
+
+struct order0_cluster {
+ /* order 0 cluster array, dynamically allocated */
+ struct cluster *array;
+ /*
+ * order 0 cluster array length, also used to indicate if cluster
+ * allocation is enabled for this zone(cluster allocation is disabled
+ * for small zones whose batch size is smaller than 1, like DMA zone)
+ */
+ int len;
+ /*
+ * smallest position from where we search for an
+ * empty cluster from the cluster array
+ */
+ int zero_bit;
+ /* bitmap used to quickly locate an empty cluster from cluster array */
+ unsigned long *bitmap;
+
+ /* disable cluster allocation to avoid new pages becoming racy state. */
+ unsigned long disable_depth;
+
+ /*
+ * used to indicate if there are pages allocated in cluster mode
+ * still in racy state. Caller with zone->lock held could use helper
+ * function zone_wait_cluster_alloc() to wait all such pages to exit
+ * the race window.
+ */
+ atomic_t in_progress;
+};
+
struct zone {
/* Read-mostly fields */
@@ -460,6 +494,7 @@ struct zone {
/* free areas of different sizes */
struct free_area free_area[MAX_ORDER];
+ struct order0_cluster cluster;
/* zone flags, see below */
unsigned long flags;
diff --git a/mm/compaction.c b/mm/compaction.c
index 0c9c7a30dde3..b732136dfc4c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1601,6 +1601,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
migrate_prep_local();
+ zone_wait_and_disable_cluster_alloc(zone);
+
while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
int err;
@@ -1699,6 +1701,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
zone->compact_cached_free_pfn = free_pfn;
}
+ zone_enable_cluster_alloc(zone);
+
count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
diff --git a/mm/internal.h b/mm/internal.h
index c166735a559e..fb4e8f7976e5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -546,12 +546,46 @@ static inline bool can_skip_merge(struct zone *zone, int order)
if (order)
return false;
+ /*
+ * Clustered allocation is only disabled when high-order pages
+ * are needed, e.g. in compaction and CMA alloc, so we should
+ * also skip merging in that case.
+ */
+ if (zone->cluster.disable_depth)
+ return false;
+
return true;
}
+
+static inline void zone_wait_cluster_alloc(struct zone *zone)
+{
+ while (atomic_read(&zone->cluster.in_progress))
+ cpu_relax();
+}
+
+static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&zone->lock, flags);
+ zone->cluster.disable_depth++;
+ zone_wait_cluster_alloc(zone);
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static inline void zone_enable_cluster_alloc(struct zone *zone)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&zone->lock, flags);
+ zone->cluster.disable_depth--;
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
#else /* CONFIG_COMPACTION */
static inline bool can_skip_merge(struct zone *zone, int order)
{
return false;
}
+static inline void zone_wait_cluster_alloc(struct zone *zone) {}
+static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone) {}
+static inline void zone_enable_cluster_alloc(struct zone *zone) {}
#endif /* CONFIG_COMPACTION */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76d471e0ab24..e60a248030dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -707,6 +707,82 @@ static inline void set_page_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
}
+static inline struct cluster *new_cluster(struct zone *zone, int nr,
+ struct page *tail)
+{
+ struct order0_cluster *cluster = &zone->cluster;
+ int n = find_next_zero_bit(cluster->bitmap, cluster->len, cluster->zero_bit);
+ if (n == cluster->len) {
+ printk_ratelimited("node%d zone %s cluster used up\n",
+ zone->zone_pgdat->node_id, zone->name);
+ return NULL;
+ }
+ cluster->zero_bit = n;
+ set_bit(n, cluster->bitmap);
+ cluster->array[n].nr = nr;
+ cluster->array[n].tail = tail;
+ return &cluster->array[n];
+}
+
+static inline struct cluster *add_to_cluster_common(struct page *page,
+ struct zone *zone, struct page *neighbor)
+{
+ struct cluster *c;
+
+ if (neighbor) {
+ int batch = this_cpu_ptr(zone->pageset)->pcp.batch;
+ c = neighbor->cluster;
+ if (c && c->nr < batch) {
+ page->cluster = c;
+ c->nr++;
+ return c;
+ }
+ }
+
+ c = new_cluster(zone, 1, page);
+ if (unlikely(!c))
+ return NULL;
+
+ page->cluster = c;
+ return c;
+}
+
+/*
+ * Add this page to the cluster where the previous head page belongs.
+ * Called after page is added to free_list(and becoming the new head).
+ */
+static inline void add_to_cluster_head(struct page *page, struct zone *zone,
+ int order, int mt)
+{
+ struct page *neighbor;
+
+ if (order || !zone->cluster.len)
+ return;
+
+ neighbor = page->lru.next == &zone->free_area[0].free_list[mt] ?
+ NULL : list_entry(page->lru.next, struct page, lru);
+ add_to_cluster_common(page, zone, neighbor);
+}
+
+/*
+ * Add this page to the cluster where the previous tail page belongs.
+ * Called after page is added to free_list(and becoming the new tail).
+ */
+static inline void add_to_cluster_tail(struct page *page, struct zone *zone,
+ int order, int mt)
+{
+ struct page *neighbor;
+ struct cluster *c;
+
+ if (order || !zone->cluster.len)
+ return;
+
+ neighbor = page->lru.prev == &zone->free_area[0].free_list[mt] ?
+ NULL : list_entry(page->lru.prev, struct page, lru);
+ c = add_to_cluster_common(page, zone, neighbor);
+ c->tail = page;
+}
+
static inline void add_to_buddy_common(struct page *page, struct zone *zone,
unsigned int order)
{
@@ -720,6 +796,7 @@ static inline void add_to_buddy_head(struct page *page, struct zone *zone,
{
add_to_buddy_common(page, zone, order);
list_add(&page->lru, &zone->free_area[order].free_list[mt]);
+ add_to_cluster_head(page, zone, order, mt);
}
static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
@@ -727,6 +804,7 @@ static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
{
add_to_buddy_common(page, zone, order);
list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
+ add_to_cluster_tail(page, zone, order, mt);
}
static inline void rmv_page_order(struct page *page)
@@ -735,9 +813,29 @@ static inline void rmv_page_order(struct page *page)
set_page_private(page, 0);
}
+/* called before removed from free_list */
+static inline void remove_from_cluster(struct page *page, struct zone *zone)
+{
+ struct cluster *c = page->cluster;
+ if (!c)
+ return;
+
+ page->cluster = NULL;
+ c->nr--;
+ if (!c->nr) {
+ int bit = c - zone->cluster.array;
+ c->tail = NULL;
+ clear_bit(bit, zone->cluster.bitmap);
+ if (bit < zone->cluster.zero_bit)
+ zone->cluster.zero_bit = bit;
+ } else if (page == c->tail)
+ c->tail = list_entry(page->lru.prev, struct page, lru);
+}
+
static inline void remove_from_buddy(struct page *page, struct zone *zone,
unsigned int order)
{
+ remove_from_cluster(page, zone);
list_del(&page->lru);
zone->free_area[order].nr_free--;
rmv_page_order(page);
@@ -2098,6 +2196,17 @@ static int move_freepages(struct zone *zone,
if (num_movable)
*num_movable = 0;
+ /*
+ * Cluster alloced pages may have their PageBuddy flag unclear yet
+ * after dropping zone->lock in rmqueue_bulk() and steal here could
+ * move them back to free_list. So it's necessary to wait till all
+ * those pages have their flags properly cleared.
+ *
+ * We do not need to disable cluster alloc though since we already
+ * held zone->lock and no allocation could happen.
+ */
+ zone_wait_cluster_alloc(zone);
+
for (page = start_page; page <= end_page;) {
if (!pfn_valid_within(page_to_pfn(page))) {
page++;
@@ -2122,8 +2231,10 @@ static int move_freepages(struct zone *zone,
}
order = page_order(page);
+ remove_from_cluster(page, zone);
list_move(&page->lru,
&zone->free_area[order].free_list[migratetype]);
+ add_to_cluster_head(page, zone, order, migratetype);
page += 1 << order;
pages_moved += 1 << order;
}
@@ -2272,7 +2383,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
single_page:
area = &zone->free_area[current_order];
+ remove_from_cluster(page, zone);
list_move(&page->lru, &area->free_list[start_type]);
+ add_to_cluster_head(page, zone, current_order, start_type);
}
/*
@@ -2533,6 +2646,145 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
return page;
}
+static int __init zone_order0_cluster_init(void)
+{
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ int len, mt, batch;
+ unsigned long flags;
+ struct order0_cluster *cluster;
+
+ if (!managed_zone(zone))
+ continue;
+
+ /* no need to enable cluster allocation for batch<=1 zone */
+ preempt_disable();
+ batch = this_cpu_ptr(zone->pageset)->pcp.batch;
+ preempt_enable();
+ if (batch <= 1)
+ continue;
+
+ cluster = &zone->cluster;
+ /* FIXME: possible overflow of int type */
+ len = DIV_ROUND_UP(zone->managed_pages, batch);
+ cluster->array = vzalloc(len * sizeof(struct cluster));
+ if (!cluster->array)
+ return -ENOMEM;
+ cluster->bitmap = vzalloc(DIV_ROUND_UP(len, BITS_PER_LONG) *
+ sizeof(unsigned long));
+ if (!cluster->bitmap)
+ return -ENOMEM;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ cluster->len = len;
+ for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+ struct page *page;
+ list_for_each_entry_reverse(page,
+ &zone->free_area[0].free_list[mt], lru)
+ add_to_cluster_head(page, zone, 0, mt);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+
+ return 0;
+}
+subsys_initcall(zone_order0_cluster_init);
+
+static inline int __rmqueue_bulk_cluster(struct zone *zone, unsigned long count,
+ struct list_head *list, int mt)
+{
+ struct list_head *head = &zone->free_area[0].free_list[mt];
+ int nr = 0;
+
+ while (nr < count) {
+ struct page *head_page;
+ struct list_head *tail, tmp_list;
+ struct cluster *c;
+ int bit;
+
+ head_page = list_first_entry_or_null(head, struct page, lru);
+ if (!head_page || !head_page->cluster)
+ break;
+
+ c = head_page->cluster;
+ tail = &c->tail->lru;
+
+ /* drop the cluster off free_list and attach to list */
+ list_cut_position(&tmp_list, head, tail);
+ list_splice_tail(&tmp_list, list);
+
+ nr += c->nr;
+ zone->free_area[0].nr_free -= c->nr;
+
+ /* this cluster is empty now */
+ c->tail = NULL;
+ c->nr = 0;
+ bit = c - zone->cluster.array;
+ clear_bit(bit, zone->cluster.bitmap);
+ if (bit < zone->cluster.zero_bit)
+ zone->cluster.zero_bit = bit;
+ }
+
+ return nr;
+}
+
+static inline int rmqueue_bulk_cluster(struct zone *zone, unsigned int order,
+ unsigned long count, struct list_head *list,
+ int migratetype)
+{
+ int alloced;
+ struct page *page;
+
+ /*
+ * Cluster alloc races with merging so don't try cluster alloc when we
+ * can't skip merging. Note that can_skip_merge() keeps the same return
+ * value from here till all pages have their flags properly processed,
+ * i.e. the end of the function where in_progress is incremented, even
+ * we have dropped the lock in the middle because the only place that
+ * can change can_skip_merge()'s return value is compaction code and
+ * compaction needs to wait on in_progress.
+ */
+ if (!can_skip_merge(zone, 0))
+ return 0;
+
+ /* Cluster alloc is disabled, mostly compaction is already in progress */
+ if (zone->cluster.disable_depth)
+ return 0;
+
+ /* Cluster alloc is disabled for this zone */
+ if (unlikely(!zone->cluster.len))
+ return 0;
+
+ alloced = __rmqueue_bulk_cluster(zone, count, list, migratetype);
+ if (!alloced)
+ return 0;
+
+ /*
+ * Cache miss on page structure could slow things down
+ * dramatically so accessing these alloced pages without
+ * holding lock for better performance.
+ *
+ * Since these pages still have PageBuddy set, there is a race
+ * window between now and when PageBuddy is cleared for them
+ * below. Any operation that would scan a pageblock and check
+ * PageBuddy(page), e.g. compaction, will need to wait till all
+ * such pages are properly processed. in_progress is used for
+ * such purpose so increase it now before dropping the lock.
+ */
+ atomic_inc(&zone->cluster.in_progress);
+ spin_unlock(&zone->lock);
+
+ list_for_each_entry(page, list, lru) {
+ rmv_page_order(page);
+ page->cluster = NULL;
+ set_pcppage_migratetype(page, migratetype);
+ }
+ atomic_dec(&zone->cluster.in_progress);
+
+ return alloced;
+}
+
/*
* Obtain a specified number of elements from the buddy allocator, all under
* a single hold of the lock, for efficiency. Add them to the supplied list.
@@ -2542,17 +2794,23 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long count, struct list_head *list,
int migratetype)
{
- int i, alloced = 0;
+ int i, alloced;
+ struct page *page, *tmp;
spin_lock(&zone->lock);
- for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order, migratetype);
+ alloced = rmqueue_bulk_cluster(zone, order, count, list, migratetype);
+ if (alloced > 0) {
+ if (alloced >= count)
+ goto out;
+ else
+ spin_lock(&zone->lock);
+ }
+
+ for (; alloced < count; alloced++) {
+ page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
- if (unlikely(check_pcp_refill(page)))
- continue;
-
/*
* Split buddy pages returned by expand() are received here in
* physical page order. The page is added to the tail of
@@ -2564,7 +2822,18 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages are ordered properly.
*/
list_add_tail(&page->lru, list);
- alloced++;
+ }
+ spin_unlock(&zone->lock);
+
+out:
+ i = alloced;
+ list_for_each_entry_safe(page, tmp, list, lru) {
+ if (unlikely(check_pcp_refill(page))) {
+ list_del(&page->lru);
+ alloced--;
+ continue;
+ }
+
if (is_migrate_cma(get_pcppage_migratetype(page)))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-(1 << order));
@@ -2577,7 +2846,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages added to the pcp list.
*/
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
- spin_unlock(&zone->lock);
return alloced;
}
@@ -7925,6 +8193,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
unsigned long outer_start, outer_end;
unsigned int order;
int ret = 0;
+ struct zone *zone = page_zone(pfn_to_page(start));
struct compact_control cc = {
.nr_migratepages = 0,
@@ -7967,6 +8236,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
if (ret)
return ret;
+ zone_wait_and_disable_cluster_alloc(zone);
/*
* In case of -EBUSY, we'd like to know which page causes problem.
* So, just fall through. test_pages_isolated() has a tracepoint
@@ -8049,6 +8319,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
done:
undo_isolate_page_range(pfn_max_align_down(start),
pfn_max_align_up(end), migratetype);
+
+ zone_enable_cluster_alloc(zone);
return ret;
}
--
2.17.2
There are multiple places that add/remove a page to/from buddy;
introduce helper functions for them.
This also makes it easier to add code when a page is added to or
removed from buddy.
No functional change.
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
mm/page_alloc.c | 65 +++++++++++++++++++++++++++++--------------------
1 file changed, 39 insertions(+), 26 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89d2a2ab3fe6..14c20bb3a3da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,12 +697,41 @@ static inline void set_page_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
}
+static inline void add_to_buddy_common(struct page *page, struct zone *zone,
+ unsigned int order)
+{
+ set_page_order(page, order);
+ zone->free_area[order].nr_free++;
+}
+
+static inline void add_to_buddy_head(struct page *page, struct zone *zone,
+ unsigned int order, int mt)
+{
+ add_to_buddy_common(page, zone, order);
+ list_add(&page->lru, &zone->free_area[order].free_list[mt]);
+}
+
+static inline void add_to_buddy_tail(struct page *page, struct zone *zone,
+ unsigned int order, int mt)
+{
+ add_to_buddy_common(page, zone, order);
+ list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]);
+}
+
static inline void rmv_page_order(struct page *page)
{
__ClearPageBuddy(page);
set_page_private(page, 0);
}
+static inline void remove_from_buddy(struct page *page, struct zone *zone,
+ unsigned int order)
+{
+ list_del(&page->lru);
+ zone->free_area[order].nr_free--;
+ rmv_page_order(page);
+}
+
/*
* This function checks whether a page is free && is the buddy
* we can coalesce a page and its buddy if
@@ -803,13 +832,10 @@ static inline void __free_one_page(struct page *page,
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
*/
- if (page_is_guard(buddy)) {
+ if (page_is_guard(buddy))
clear_page_guard(zone, buddy, order, migratetype);
- } else {
- list_del(&buddy->lru);
- zone->free_area[order].nr_free--;
- rmv_page_order(buddy);
- }
+ else
+ remove_from_buddy(buddy, zone, order);
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
@@ -841,8 +867,6 @@ static inline void __free_one_page(struct page *page,
}
done_merging:
- set_page_order(page, order);
-
/*
* If this is not the largest possible page, check if the buddy
* of the next-highest order is free. If it is, it's possible
@@ -859,15 +883,12 @@ static inline void __free_one_page(struct page *page,
higher_buddy = higher_page + (buddy_pfn - combined_pfn);
if (pfn_valid_within(buddy_pfn) &&
page_is_buddy(higher_page, higher_buddy, order + 1)) {
- list_add_tail(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
- goto out;
+ add_to_buddy_tail(page, zone, order, migratetype);
+ return;
}
}
- list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
-out:
- zone->free_area[order].nr_free++;
+ add_to_buddy_head(page, zone, order, migratetype);
}
/*
@@ -1805,9 +1826,7 @@ static inline void expand(struct zone *zone, struct page *page,
if (set_page_guard(zone, &page[size], high, migratetype))
continue;
- list_add(&page[size].lru, &area->free_list[migratetype]);
- area->nr_free++;
- set_page_order(&page[size], high);
+ add_to_buddy_head(&page[size], zone, high, migratetype);
}
}
@@ -1951,9 +1970,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
struct page, lru);
if (!page)
continue;
- list_del(&page->lru);
- rmv_page_order(page);
- area->nr_free--;
+ remove_from_buddy(page, zone, current_order);
expand(zone, page, order, current_order, area, migratetype);
set_pcppage_migratetype(page, migratetype);
return page;
@@ -2871,9 +2888,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
}
/* Remove page from free list */
- list_del(&page->lru);
- zone->free_area[order].nr_free--;
- rmv_page_order(page);
+ remove_from_buddy(page, zone, order);
/*
* Set the pageblock if the isolated page is at least half of a
@@ -8070,9 +8085,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
pr_info("remove from free list %lx %d %lx\n",
pfn, 1 << order, end_pfn);
#endif
- list_del(&page->lru);
- rmv_page_order(page);
- zone->free_area[order].nr_free--;
+ remove_from_buddy(page, zone, order);
for (i = 0; i < (1 << order); i++)
SetPageReserved((page+i));
pfn += (1 << order);
--
2.17.2
After the system runs for a long time, it's easy for a zone to have no
suitable high-order page available, and in the current implementation
that stops cluster alloc and free due to compact_considered > 0.
To favour order-0 alloc/free, relax the condition to only disallow
cluster alloc/free when a problem would occur, e.g. when compaction is
in progress.
Signed-off-by: Aaron Lu <[email protected]>
---
mm/internal.h | 4 ----
1 file changed, 4 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index fb4e8f7976e5..309a3f43e613 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -538,10 +538,6 @@ void try_to_merge_page(struct page *page);
#ifdef CONFIG_COMPACTION
static inline bool can_skip_merge(struct zone *zone, int order)
{
- /* Compaction has failed in this zone, we shouldn't skip merging */
- if (zone->compact_considered)
- return false;
-
/* Only consider no_merge for order 0 pages */
if (order)
return false;
--
2.17.2
On Wed, Oct 17, 2018 at 02:33:26PM +0800, Aaron Lu wrote:
> There are multiple places that add/remove a page into/from buddy,
> introduce helper functions for them.
>
> This also makes it easier to add code when a page is added/removed
> to/from buddy.
>
> No functionality change.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Aaron Lu <[email protected]>
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Wed, Oct 17, 2018 at 02:33:27PM +0800, Aaron Lu wrote:
> Running will-it-scale/page_fault1 process mode workload on a 2 sockets
> Intel Skylake server showed severe lock contention of zone->lock, as
> high as about 80%(42% on allocation path and 35% on free path) CPU
> cycles are burnt spinning. With perf, the most time consuming part inside
> that lock on free path is cache missing on page structures, mostly on
> the to-be-freed page's buddy due to merging.
>
This confuses me slightly. The commit log for d8a759b57035 ("mm,
page_alloc: double zone's batchsize") indicates that the contention for
will-it-scale moved from the zone lock to the LRU lock. This appears to
contradict that although the exact test case is different (page_fault_1
vs page_fault2). Can you clarify why commit d8a759b57035 is
insufficient?
I'm wondering is this really about reducing the number of dirtied cache
lines due to struct page updates and less about the actual zone lock.
> One way to avoid this overhead is not do any merging at all for order-0
> pages. With this approach, the lock contention for zone->lock on free
> path dropped to 1.1% but allocation side still has as high as 42% lock
> contention. In the meantime, the dropped lock contention on free side
> doesn't translate to performance increase, instead, it's consumed by
> increased lock contention of the per node lru_lock(rose from 5% to 37%)
> and the final performance slightly dropped about 1%.
>
Although this implies it's really about contention.
> Though performance dropped a little, it almost eliminated zone lock
> contention on free path and it is the foundation for the next patch
> that eliminates zone lock contention for allocation path.
>
Can you clarify whether THP was enabled or not? As this is order-0 focused,
it would imply the series should have minimal impact due to limited merging.
> Suggested-by: Dave Hansen <[email protected]>
> Signed-off-by: Aaron Lu <[email protected]>
> ---
> include/linux/mm_types.h | 9 +++-
> mm/compaction.c | 13 +++++-
> mm/internal.h | 27 ++++++++++++
> mm/page_alloc.c | 88 ++++++++++++++++++++++++++++++++++------
> 4 files changed, 121 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..aed93053ef6e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -179,8 +179,13 @@ struct page {
> int units; /* SLOB */
> };
>
> - /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> - atomic_t _refcount;
> + union {
> + /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> + atomic_t _refcount;
> +
> + /* For pages in Buddy: if skipped merging when added to Buddy */
> + bool buddy_merge_skipped;
> + };
>
In some instances, bools within structs are frowned upon because of
differences in sizes across architectures. Because this is part of a
union, I don't think it's problematic but bear in mind in case someone
else spots it.
> #ifdef CONFIG_MEMCG
> struct mem_cgroup *mem_cgroup;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index faca45ebe62d..0c9c7a30dde3 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -777,8 +777,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> * potential isolation targets.
> */
> if (PageBuddy(page)) {
> - unsigned long freepage_order = page_order_unsafe(page);
> + unsigned long freepage_order;
>
> + /*
> + * If this is a merge_skipped page, do merge now
> + * since high-order pages are needed. zone lock
> + * isn't taken for the merge_skipped check so the
> + * check could be wrong but the worst case is we
> + * lose a merge opportunity.
> + */
> + if (page_merge_was_skipped(page))
> + try_to_merge_page(page);
> +
> + freepage_order = page_order_unsafe(page);
> /*
> * Without lock, we cannot be sure that what we got is
> * a valid page order. Consider only values in the
> diff --git a/mm/internal.h b/mm/internal.h
> index 87256ae1bef8..c166735a559e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -527,4 +527,31 @@ static inline bool is_migrate_highatomic_page(struct page *page)
>
> void setup_zone_pageset(struct zone *zone);
> extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
> +
> +static inline bool page_merge_was_skipped(struct page *page)
> +{
> + return page->buddy_merge_skipped;
> +}
> +
> +void try_to_merge_page(struct page *page);
> +
> +#ifdef CONFIG_COMPACTION
> +static inline bool can_skip_merge(struct zone *zone, int order)
> +{
> + /* Compaction has failed in this zone, we shouldn't skip merging */
> + if (zone->compact_considered)
> + return false;
> +
> + /* Only consider no_merge for order 0 pages */
> + if (order)
> + return false;
> +
> + return true;
> +}
> +#else /* CONFIG_COMPACTION */
> +static inline bool can_skip_merge(struct zone *zone, int order)
> +{
> + return false;
> +}
> +#endif /* CONFIG_COMPACTION */
> #endif /* __MM_INTERNAL_H */
Strictly speaking, lazy buddy merging does not need to be linked to
compaction. Lazy merging doesn't say anything about the mobility of
buddy pages that are still allocated.
When lazy buddy merging was last examined years ago, a consequence was
that high-order allocation success rates were reduced. I see you do the
merging when compaction has been recently considered but I don't see how
that is sufficient. If a high-order allocation fails, there is no
guarantee that compaction will find those unmerged buddies. There is
also no guarantee that a page free will find them. So, in the event of a
high-order allocation failure, what finds all those unmerged buddies and
puts them together to see if the allocation would succeed without
reclaim/compaction/etc.
--
Mel Gorman
SUSE Labs
On Wed, Oct 17, 2018 at 02:33:28PM +0800, Aaron Lu wrote:
> Profile on Intel Skylake server shows the most time consuming part
> under zone->lock on allocation path is accessing those to-be-returned
> page's "struct page" on the free_list inside zone->lock. One explanation
> is, different CPUs are releasing pages to the head of free_list and
> those page's 'struct page' may very well be cache cold for the allocating
> CPU when it grabs these pages from free_list' head. The purpose here
> is to avoid touching these pages one by one inside zone->lock.
>
I didn't read this one in depth because it's somewhat orthogonal to the
lazy buddy merging which I think would benefit from being finalised and
ensuring that there are no reductions in high-order allocation success
rates. Pages being allocated on one CPU and freed on another is not that
unusual -- ping-pong workloads or things like netperf used to exhibit
this sort of pattern.
However, this part stuck out
> +static inline void zone_wait_cluster_alloc(struct zone *zone)
> +{
> + while (atomic_read(&zone->cluster.in_progress))
> + cpu_relax();
> +}
> +
RT has had problems with cpu_relax in the past but more importantly, as
this delay for parallel compactions and allocations of contig ranges,
we could be stuck here for very long periods of time with interrupts
disabled. It gets even worse if it's from an interrupt context such as
jumbo frame allocation or a high-order slab allocation that is atomic.
These potentially large periods of time with interrupts disabled is very
hazardous. It may be necessary to consider instead minimising the number
of struct page update when merging to PCP and then either increasing the
size of the PCP or allowing it to exceed pcp->high for short periods of
time to batch the struct page updates.
I didn't read the rest of the series as it builds upon this patch.
--
Mel Gorman
SUSE Labs
On Wed, Oct 17, 2018 at 11:44:27AM +0100, Mel Gorman wrote:
> On Wed, Oct 17, 2018 at 02:33:27PM +0800, Aaron Lu wrote:
> > Running will-it-scale/page_fault1 process mode workload on a 2 sockets
> > Intel Skylake server showed severe lock contention of zone->lock, as
> > high as about 80%(42% on allocation path and 35% on free path) CPU
> > cycles are burnt spinning. With perf, the most time consuming part inside
> > that lock on free path is cache missing on page structures, mostly on
> > the to-be-freed page's buddy due to merging.
> >
>
> This confuses me slightly. The commit log for d8a759b57035 ("mm,
> page_alloc: double zone's batchsize") indicates that the contention for
> will-it-scale moved from the zone lock to the LRU lock. This appears to
> contradict that although the exact test case is different (page_fault_1
> vs page_fault2). Can you clarify why commit d8a759b57035 is
> insufficient?
commit d8a759b57035 helps zone lock scalability and while it reduced
zone lock scalability to some extent(but not entirely eliminated it),
the lock contention shifted to LRU lock in the meantime.
e.g. from commit d8a759b57035's changelog, with the same test case
will-it-scale/page_fault1:
4 sockets Skylake:
batch score change zone_contention lru_contention total_contention
31 15345900 +0.00% 64% 8% 72%
63 17992886 +17.25% 24% 45% 69%
4 sockets Broadwell:
batch score change zone_contention lru_contention total_contention
31 16703983 +0.00% 67% 7% 74%
63 18288885 +9.49% 38% 33% 71%
2 sockets Skylake:
batch score change zone_contention lru_contention total_contention
31 9554867 +0.00% 66% 3% 69%
63 9980145 +4.45% 62% 4% 66%
Please note that though zone lock contention for the 4 sockets server
reduced a lot with commit d8a759b57035, 2 sockets Skylake still suffered
a lot from zone lock contention even after we doubled batch size.
Also, the reduced zone lock contention will again get worse if the LRU
lock is optimized away by Daniel's work, or in cases where there is no
LRU in the picture, e.g. an in-kernel user of the page allocator, as
Tariq Toukan demonstrated with netperf.
> I'm wondering is this really about reducing the number of dirtied cache
> lines due to struct page updates and less about the actual zone lock.
Hmm...if we reduce the time it takes under the zone lock, aren't we
helping the zone lock? :-)
>
> > One way to avoid this overhead is not do any merging at all for order-0
> > pages. With this approach, the lock contention for zone->lock on free
> > path dropped to 1.1% but allocation side still has as high as 42% lock
> > contention. In the meantime, the dropped lock contention on free side
> > doesn't translate to performance increase, instead, it's consumed by
> > increased lock contention of the per node lru_lock(rose from 5% to 37%)
> > and the final performance slightly dropped about 1%.
> >
>
> Although this implies it's really about contention.
>
> > Though performance dropped a little, it almost eliminated zone lock
> > contention on free path and it is the foundation for the next patch
> > that eliminates zone lock contention for allocation path.
> >
>
> Can you clarify whether THP was enabled or not? As this is order-0 focused,
> it would imply the series should have minimal impact due to limited merging.
Sorry about this, I should have mentioned THP is not used here.
>
> > Suggested-by: Dave Hansen <[email protected]>
> > Signed-off-by: Aaron Lu <[email protected]>
> > ---
> > include/linux/mm_types.h | 9 +++-
> > mm/compaction.c | 13 +++++-
> > mm/internal.h | 27 ++++++++++++
> > mm/page_alloc.c | 88 ++++++++++++++++++++++++++++++++++------
> > 4 files changed, 121 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5ed8f6292a53..aed93053ef6e 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -179,8 +179,13 @@ struct page {
> > int units; /* SLOB */
> > };
> >
> > - /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> > - atomic_t _refcount;
> > + union {
> > + /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> > + atomic_t _refcount;
> > +
> > + /* For pages in Buddy: if skipped merging when added to Buddy */
> > + bool buddy_merge_skipped;
> > + };
> >
>
> In some instances, bools within structrs are frowned upon because of
> differences in sizes across architectures. Because this is part of a
> union, I don't think it's problematic but bear in mind in case someone
> else spots it.
OK, thanks for the reminder.
>
> > #ifdef CONFIG_MEMCG
> > struct mem_cgroup *mem_cgroup;
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index faca45ebe62d..0c9c7a30dde3 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -777,8 +777,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > * potential isolation targets.
> > */
> > if (PageBuddy(page)) {
> > - unsigned long freepage_order = page_order_unsafe(page);
> > + unsigned long freepage_order;
> >
> > + /*
> > + * If this is a merge_skipped page, do merge now
> > + * since high-order pages are needed. zone lock
> > + * isn't taken for the merge_skipped check so the
> > + * check could be wrong but the worst case is we
> > + * lose a merge opportunity.
> > + */
> > + if (page_merge_was_skipped(page))
> > + try_to_merge_page(page);
> > +
> > + freepage_order = page_order_unsafe(page);
> > /*
> > * Without lock, we cannot be sure that what we got is
> > * a valid page order. Consider only values in the
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 87256ae1bef8..c166735a559e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -527,4 +527,31 @@ static inline bool is_migrate_highatomic_page(struct page *page)
> >
> > void setup_zone_pageset(struct zone *zone);
> > extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
> > +
> > +static inline bool page_merge_was_skipped(struct page *page)
> > +{
> > + return page->buddy_merge_skipped;
> > +}
> > +
> > +void try_to_merge_page(struct page *page);
> > +
> > +#ifdef CONFIG_COMPACTION
> > +static inline bool can_skip_merge(struct zone *zone, int order)
> > +{
> > + /* Compaction has failed in this zone, we shouldn't skip merging */
> > + if (zone->compact_considered)
> > + return false;
> > +
> > + /* Only consider no_merge for order 0 pages */
> > + if (order)
> > + return false;
> > +
> > + return true;
> > +}
> > +#else /* CONFIG_COMPACTION */
> > +static inline bool can_skip_merge(struct zone *zone, int order)
> > +{
> > + return false;
> > +}
> > +#endif /* CONFIG_COMPACTION */
> > #endif /* __MM_INTERNAL_H */
>
> Strictly speaking, lazy buddy merging does not need to be linked to
> compaction. Lazy merging doesn't say anything about the mobility of
> buddy pages that are still allocated.
True.
I was thinking that if compaction isn't enabled, we probably shouldn't
enable this lazy buddy merging feature, as it would make the high-order
allocation success rate drop a lot.
I probably should have mentioned clearly somewhere in the changelog that
the merging of those unmerged order-0 pages is embedded in the compaction
code, in isolate_migratepages_block(), when isolation candidates are
scanned.
>
> When lazy buddy merging was last examined years ago, a consequence was
> that high-order allocation success rates were reduced. I see you do the
I tried mmtests/stress-highalloc on one desktop and didn't see the
high-order allocation success rate drop, as shown in patch 0's
changelog. But it could be that I didn't test enough machines or other
test cases. Any suggestions on how to uncover this problem?
> merging when compaction has been recently considered but I don't see how
> that is sufficient. If a high-order allocation fails, there is no
> guarantee that compaction will find those unmerged buddies. There is
Any unmerged buddy will have page->buddy_merge_skipped set, and during
compaction, when isolate_migratepages_block() iterates pages to find
isolation candidates, it will find these unmerged pages and do_merge()
them. Suppose an order-9 pageblock where every page is a merge-skipped
order-0 page: as isolate_migratepages_block() iterates them one by one
and calls do_merge() on each, higher-order pages are formed along the
way, and after the last unmerged order-0 page goes through do_merge(),
an order-9 buddy page is formed.
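The coalescing itself is essentially the standard buddy merge that
__free_one_page() does at free time; a simplified sketch of that loop
(merge_deferred_page() is a made-up name and this is not the patch's
actual do_merge() code; page_is_buddy(), rmv_page_order() and
set_page_order() are existing page_alloc.c internals; locking and
migratetype fixups are omitted):

/*
 * Illustrative only: merge one deferred order-0 page upwards as far as
 * its buddies allow, then put the result back on the free lists.
 */
static void merge_deferred_page(struct zone *zone, struct page *page,
                                unsigned long pfn, int migratetype)
{
        unsigned int order = 0;
        unsigned long buddy_pfn, combined_pfn;
        struct page *buddy;

        /* the deferred page currently sits on the order-0 free list */
        list_del(&page->lru);
        zone->free_area[0].nr_free--;
        rmv_page_order(page);

        while (order < MAX_ORDER - 1) {
                buddy_pfn = pfn ^ (1UL << order);
                buddy = page + (buddy_pfn - pfn);

                /* stop as soon as the buddy is not a free page of this order */
                if (!page_is_buddy(page, buddy, order))
                        break;

                /* take the buddy off its free list and combine the pair */
                list_del(&buddy->lru);
                zone->free_area[order].nr_free--;
                rmv_page_order(buddy);

                combined_pfn = buddy_pfn & pfn;
                page = page + (combined_pfn - pfn);
                pfn = combined_pfn;
                order++;
        }

        /* put the merged result on the free list of its final order */
        set_page_order(page, order);
        list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
        zone->free_area[order].nr_free++;
}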
> also no guarantee that a page free will find them. So, in the event of a
> high-order allocation failure, what finds all those unmerged buddies and
> puts them together to see if the allocation would succeed without
> reclaim/compaction/etc.
Compaction is needed to form a high-order page after a high-order
allocation fails; I think this is also true for the vanilla kernel?
On Wed, Oct 17, 2018 at 09:10:59PM +0800, Aaron Lu wrote:
> On Wed, Oct 17, 2018 at 11:44:27AM +0100, Mel Gorman wrote:
> > On Wed, Oct 17, 2018 at 02:33:27PM +0800, Aaron Lu wrote:
> > > Running will-it-scale/page_fault1 process mode workload on a 2 sockets
> > > Intel Skylake server showed severe lock contention of zone->lock, as
> > > high as about 80%(42% on allocation path and 35% on free path) CPU
> > > cycles are burnt spinning. With perf, the most time consuming part inside
> > > that lock on free path is cache missing on page structures, mostly on
> > > the to-be-freed page's buddy due to merging.
> > >
> >
> > This confuses me slightly. The commit log for d8a759b57035 ("mm,
> > page_alloc: double zone's batchsize") indicates that the contention for
> > will-it-scale moved from the zone lock to the LRU lock. This appears to
> > contradict that although the exact test case is different (page_fault_1
> > vs page_fault2). Can you clarify why commit d8a759b57035 is
> > insufficient?
>
> commit d8a759b57035 helps zone lock scalability and while it reduced
> zone lock scalability to some extent(but not entirely eliminated it),
> the lock contention shifted to LRU lock in the meantime.
>
I assume you meant "zone lock contention" in the second case.
> e.g. from commit d8a759b57035's changelog, with the same test case
> will-it-scale/page_fault1:
>
> 4 sockets Skylake:
> batch score change zone_contention lru_contention total_contention
> 31 15345900 +0.00% 64% 8% 72%
> 63 17992886 +17.25% 24% 45% 69%
>
> 4 sockets Broadwell:
> batch score change zone_contention lru_contention total_contention
> 31 16703983 +0.00% 67% 7% 74%
> 63 18288885 +9.49% 38% 33% 71%
>
> 2 sockets Skylake:
> batch score change zone_contention lru_contention total_contention
> 31 9554867 +0.00% 66% 3% 69%
> 63 9980145 +4.45% 62% 4% 66%
>
> Please note that though zone lock contention for the 4 sockets server
> reduced a lot with commit d8a759b57035, 2 sockets Skylake still suffered
> a lot from zone lock contention even after we doubled batch size.
>
Any particular reason why? I assume it's related to the number of zone
locks with the increased number of zones and the number of threads used
for the test.
> Also, the reduced zone lock contention will again get worse if LRU lock
> is optimized away by Daniel's work, or in cases there are no LRU in the
> picture, e.g. an in-kernel user of page allocator like Tariq Toukan
> demonstrated with netperf.
>
Vaguely understood, I never looked at the LRU lock patches.
> > I'm wondering is this really about reducing the number of dirtied cache
> > lines due to struct page updates and less about the actual zone lock.
>
> Hmm...if we reduce the time it takes under the zone lock, aren't we
> helping the zone lock? :-)
>
Indirectly yes but reducing cache line dirtying is useful in itself so
they should be at least considered separately as independent
optimisations.
> >
> > > One way to avoid this overhead is not do any merging at all for order-0
> > > pages. With this approach, the lock contention for zone->lock on free
> > > path dropped to 1.1% but allocation side still has as high as 42% lock
> > > contention. In the meantime, the dropped lock contention on free side
> > > doesn't translate to performance increase, instead, it's consumed by
> > > increased lock contention of the per node lru_lock(rose from 5% to 37%)
> > > and the final performance slightly dropped about 1%.
> > >
> >
> > Although this implies it's really about contention.
> >
> > > Though performance dropped a little, it almost eliminated zone lock
> > > contention on free path and it is the foundation for the next patch
> > > that eliminates zone lock contention for allocation path.
> > >
> >
> > Can you clarify whether THP was enabled or not? As this is order-0 focused,
> > it would imply the series should have minimal impact due to limited merging.
>
> Sorry about this, I should have mentioned THP is not used here.
>
That's important to know. It does reduce the utility of the patch
somewhat but not all arches support THP and THP is not always enabled on
x86.
> > compaction. Lazy merging doesn't say anything about the mobility of
> > buddy pages that are still allocated.
>
> True.
> I was thinking if compactions isn't enabled, we probably shouldn't
> enable this lazy buddy merging feature as it would make high order
> allocation success rate dropping a lot.
>
It probably is lower as reclaim is not that aggressive. Add a comment
with an explanation as to why it's compaction-specific.
> I probably should have mentioned clearly somewhere in the changelog that
> the function of merging those unmerged order0 pages are embedded in
> compaction code, in function isolate_migratepages_block() when isolate
> candidates are scanned.
>
Yes, but note that the concept is still problematic.
isolate_migratepages_block is not guaranteed to find a pageblock with
unmerged buddies in it. If there are pageblocks towards the end of the
zone with unmerged pages, they may never be found. This will be very hard
to detect at runtime because it's heavily dependent on the exact state
of the system.
> >
> > When lazy buddy merging was last examined years ago, a consequence was
> > that high-order allocation success rates were reduced. I see you do the
>
> I tried mmtests/stress-highalloc on one desktop and didn't see
> high-order allocation success rate dropping as shown in patch0's
> changelog. But it could be that I didn't test enough machines or using
> other test cases? Any suggestions on how to uncover this problem?
>
stress-highalloc is nowhere near as useful as it used to be
unfortunately. It was built at a time when 4G machines were unusual.
config-global-dhp__workload_thpscale can be sometimes useful but it's
variable. There is not a good modern example of detecting allocation success
rates of highly fragmented systems at the moment which is a real pity.
> > merging when compaction has been recently considered but I don't see how
> > that is sufficient. If a high-order allocation fails, there is no
> > guarantee that compaction will find those unmerged buddies. There is
>
> Any unmerged buddies will have page->buddy_merge_skipped set and during
> compaction, when isolate_migratepages_block() iterates pages to find
> isolate candidates, it will find these unmerged pages and will do_merge()
> for them. Suppose an order-9 pageblock, every page is merge_skipped
> order-0 page; after isolate_migratepages_block() iterates them one by one
> and calls do_merge() for them one by one, higher order page will be
> formed during this process and after the last unmerged order0 page goes
> through do_merge(), an order-9 buddy page will be formed.
>
Again, as compaction is not guaranteed to find the pageblocks, it would
be important to consider whether a) that matters or b) find an
alternative way of keeping unmerged buddies on separate lists so they
can be quickly discovered when a high-order allocation fails.
> > also no guarantee that a page free will find them. So, in the event of a
> > high-order allocation failure, what finds all those unmerged buddies and
> > puts them together to see if the allocation would succeed without
> > reclaim/compaction/etc.
>
> compaction is needed to form a high-order page after high-order
> allocation failed, I think this is also true for vanilla kernel?
It's needed to form them efficiently but excessive reclaim or writing 3
to drop_caches can also do it. Be careful of tying lazy buddy too
closely to compaction.
--
Mel Gorman
SUSE Labs
On Wed, Oct 17, 2018 at 12:20:42PM +0100, Mel Gorman wrote:
> On Wed, Oct 17, 2018 at 02:33:28PM +0800, Aaron Lu wrote:
> > Profile on Intel Skylake server shows the most time consuming part
> > under zone->lock on allocation path is accessing those to-be-returned
> > page's "struct page" on the free_list inside zone->lock. One explanation
> > is, different CPUs are releasing pages to the head of free_list and
> > those page's 'struct page' may very well be cache cold for the allocating
> > CPU when it grabs these pages from free_list' head. The purpose here
> > is to avoid touching these pages one by one inside zone->lock.
> >
>
> I didn't read this one in depth because it's somewhat ortogonal to the
> lazy buddy merging which I think would benefit from being finalised and
> ensuring that there are no reductions in high-order allocation success
> rates. Pages being allocated on one CPU and freed on another is not that
> unusual -- ping-pong workloads or things like netperf used to exhibit
> this sort of pattern.
>
> However, this part stuck out
>
> > +static inline void zone_wait_cluster_alloc(struct zone *zone)
> > +{
> > + while (atomic_read(&zone->cluster.in_progress))
> > + cpu_relax();
> > +}
> > +
>
> RT has had problems with cpu_relax in the past but more importantly, as
> this delay for parallel compactions and allocations of contig ranges,
> we could be stuck here for very long periods of time with interrupts
The longest possible time is one CPU accessing pcp->batch cold
cachelines. Reason:
when zone_wait_cluster_alloc() is called, we already hold the zone lock so
no more allocations are possible. Waiting for in_progress to become zero
means waiting for any CPU that increased in_progress to finish processing
its allocated pages. Since each will allocate at most pcp->batch pages,
and the worst case is that all those page structures are cache cold, the
longest wait time is one CPU accessing pcp->batch cold cache lines.
I have no idea if this time is too long though.
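For a rough sense of scale under this reasoning (both figures below are
assumptions: pcp->batch of 63 as set by commit d8a759b57035, and very
roughly 100ns per cold cache line miss):

    worst-case wait per in-progress CPU ~= pcp->batch * cold-miss latency
                                        ~= 63 * 100ns
                                        ~= 6-7 microseconds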
> disabled. It gets even worse if it's from an interrupt context such as
> jumbo frame allocation or a high-order slab allocation that is atomic.
My understanding is atomic allocation won't trigger compaction, no?
> These potentially large periods of time with interrupts disabled is very
> hazardous.
I see and agree, thanks for pointing this out.
Hopefully, the above mentioned worst case time won't be regarded as
unbound or too long.
> It may be necessary to consider instead minimising the number
> of struct page update when merging to PCP and then either increasing the
> size of the PCP or allowing it to exceed pcp->high for short periods of
> time to batch the struct page updates.
I don't quite follow this part. It doesn't seem possible we can exceed
pcp->high in allocation path, or are you talking about free path?
And thanks a lot for the review!
On Wed, Oct 17, 2018 at 02:58:07PM +0100, Mel Gorman wrote:
> On Wed, Oct 17, 2018 at 09:10:59PM +0800, Aaron Lu wrote:
> > On Wed, Oct 17, 2018 at 11:44:27AM +0100, Mel Gorman wrote:
> > > On Wed, Oct 17, 2018 at 02:33:27PM +0800, Aaron Lu wrote:
> > > > Running will-it-scale/page_fault1 process mode workload on a 2 sockets
> > > > Intel Skylake server showed severe lock contention of zone->lock, as
> > > > high as about 80%(42% on allocation path and 35% on free path) CPU
> > > > cycles are burnt spinning. With perf, the most time consuming part inside
> > > > that lock on free path is cache missing on page structures, mostly on
> > > > the to-be-freed page's buddy due to merging.
> > > >
> > >
> > > This confuses me slightly. The commit log for d8a759b57035 ("mm,
> > > page_alloc: double zone's batchsize") indicates that the contention for
> > > will-it-scale moved from the zone lock to the LRU lock. This appears to
> > > contradict that although the exact test case is different (page_fault_1
> > > vs page_fault2). Can you clarify why commit d8a759b57035 is
> > > insufficient?
> >
> > commit d8a759b57035 helps zone lock scalability and while it reduced
> > zone lock scalability to some extent(but not entirely eliminated it),
> > the lock contention shifted to LRU lock in the meantime.
> >
>
> I assume you meant "zone lock contention" in the second case.
Yes, that's right.
>
> > e.g. from commit d8a759b57035's changelog, with the same test case
> > will-it-scale/page_fault1:
> >
> > 4 sockets Skylake:
> > batch score change zone_contention lru_contention total_contention
> > 31 15345900 +0.00% 64% 8% 72%
> > 63 17992886 +17.25% 24% 45% 69%
> >
> > 4 sockets Broadwell:
> > batch score change zone_contention lru_contention total_contention
> > 31 16703983 +0.00% 67% 7% 74%
> > 63 18288885 +9.49% 38% 33% 71%
> >
> > 2 sockets Skylake:
> > batch score change zone_contention lru_contention total_contention
> > 31 9554867 +0.00% 66% 3% 69%
> > 63 9980145 +4.45% 62% 4% 66%
> >
> > Please note that though zone lock contention for the 4 sockets server
> > reduced a lot with commit d8a759b57035, 2 sockets Skylake still suffered
> > a lot from zone lock contention even after we doubled batch size.
> >
>
> Any particuular reason why? I assume it's related to the number of zone
> locks with the increase number of zones and the number of threads used
> for the test.
I think so too.
The 4 sockets server has 192 CPUs in total while the 2 sockets server
has 112 CPUs in total. Assuming only ZONE_NORMAL is used, the 4 sockets
server has 192/4=48 CPUs per zone while the 2 sockets server has
112/2=56 CPUs per zone. The test is started with nr_task=nr_cpu, so the
2 sockets server ends up with more CPUs contending on each zone.
>
> > Also, the reduced zone lock contention will again get worse if LRU lock
> > is optimized away by Daniel's work, or in cases there are no LRU in the
> > picture, e.g. an in-kernel user of page allocator like Tariq Toukan
> > demonstrated with netperf.
> >
>
> Vaguely understood, I never looked at the LRU lock patches.
>
> > > I'm wondering is this really about reducing the number of dirtied cache
> > > lines due to struct page updates and less about the actual zone lock.
> >
> > Hmm...if we reduce the time it takes under the zone lock, aren't we
> > helping the zone lock? :-)
> >
>
> Indirectly yes but reducing cache line dirtying is useful in itself so
> they should be at least considered separately as independent
> optimisations.
>
> > >
> > > > One way to avoid this overhead is not do any merging at all for order-0
> > > > pages. With this approach, the lock contention for zone->lock on free
> > > > path dropped to 1.1% but allocation side still has as high as 42% lock
> > > > contention. In the meantime, the dropped lock contention on free side
> > > > doesn't translate to performance increase, instead, it's consumed by
> > > > increased lock contention of the per node lru_lock(rose from 5% to 37%)
> > > > and the final performance slightly dropped about 1%.
> > > >
> > >
> > > Although this implies it's really about contention.
> > >
> > > > Though performance dropped a little, it almost eliminated zone lock
> > > > contention on free path and it is the foundation for the next patch
> > > > that eliminates zone lock contention for allocation path.
> > > >
> > >
> > > Can you clarify whether THP was enabled or not? As this is order-0 focused,
> > > it would imply the series should have minimal impact due to limited merging.
> >
> > Sorry about this, I should have mentioned THP is not used here.
> >
>
> That's important to know. It does reduce the utility of the patch
> somewhat but not all arches support THP and THP is not always enabled on
> x86.
I always wondered how systems make use of THP.
After all, when a system has been running for a while (days or months),
the file cache should have consumed a lot of memory and high-order pages
will become more and more scarce. If an order-9 page can't be reliably
allocated, will workloads rely on it?
Just a thought.
THP is of course pretty neat in that it reduces TLB cost, needs fewer
page tables, etc. I just wondered if people really rely on it, or use it
after their system has been up for a long time.
> > > compaction. Lazy merging doesn't say anything about the mobility of
> > > buddy pages that are still allocated.
> >
> > True.
> > I was thinking if compactions isn't enabled, we probably shouldn't
> > enable this lazy buddy merging feature as it would make high order
> > allocation success rate dropping a lot.
> >
>
> It probably is lower as reclaim is not that aggressive. Add a comment
> with an explanation as to why it's compaction-specific.
>
> > I probably should have mentioned clearly somewhere in the changelog that
> > the function of merging those unmerged order0 pages are embedded in
> > compaction code, in function isolate_migratepages_block() when isolate
> > candidates are scanned.
> >
>
> Yes, but note that the concept is still problematic.
> isolate_migratepages_block is not guaranteed to find a pageblock with
> unmerged buddies in it. If there are pageblocks towards the end of the
> zone with unmerged pages, they may never be found. This will be very hard
> to detect at runtime because it's heavily dependant on the exact state
> of the system.
Quite true.
The intent here, though, is not to have compaction merge back all
unmerged pages, but to do the merge for these unmerged pages in a
piggyback way, i.e. since isolate_migratepages_block() is doing the
scan anyway, why not let it handle these unmerged pages when it meets
them?
If for some reason isolate_migratepages_block() didn't meet a single
unmerged page before compaction succeeded, we probably don't need to
worry much yet, since compaction succeeded anyway.
> > >
> > > When lazy buddy merging was last examined years ago, a consequence was
> > > that high-order allocation success rates were reduced. I see you do the
> >
> > I tried mmtests/stress-highalloc on one desktop and didn't see
> > high-order allocation success rate dropping as shown in patch0's
> > changelog. But it could be that I didn't test enough machines or using
> > other test cases? Any suggestions on how to uncover this problem?
> >
>
> stress-highalloc is nowhere near as useful as it used to be
> unfortunately. It was built at a time when 4G machines were unusual.
> config-global-dhp__workload_thpscale can be sometimes useful but it's
Will take a look at this, thanks for the pointer.
> variable. There is not a good modern example of detecting allocation success
> rates of highly fragmented systems at the moment which is a real pity.
>
> > > merging when compaction has been recently considered but I don't see how
> > > that is sufficient. If a high-order allocation fails, there is no
> > > guarantee that compaction will find those unmerged buddies. There is
> >
> > Any unmerged buddies will have page->buddy_merge_skipped set and during
> > compaction, when isolate_migratepages_block() iterates pages to find
> > isolate candidates, it will find these unmerged pages and will do_merge()
> > for them. Suppose an order-9 pageblock, every page is merge_skipped
> > order-0 page; after isolate_migratepages_block() iterates them one by one
> > and calls do_merge() for them one by one, higher order page will be
> > formed during this process and after the last unmerged order0 page goes
> > through do_merge(), an order-9 buddy page will be formed.
> >
>
> Again, as compaction is not guaranteed to find the pageblocks, it would
> be important to consider whether a) that matters or b) find an
> alternative way of keeping unmerged buddies on separate lists so they
> can be quickly discovered when a high-order allocation fails.
That's a good question.
I tend to think it doesn't matter whether we can find all unmerged pages.
Letting compaction do its job as before and do_merge() any unmerged
pages when it scans them is probably enough. But as you said, we don't
have a good enough test case for this yet.
> > > also no guarantee that a page free will find them. So, in the event of a
> > > high-order allocation failure, what finds all those unmerged buddies and
> > > puts them together to see if the allocation would succeed without
> > > reclaim/compaction/etc.
> >
> > compaction is needed to form a high-order page after high-order
> > allocation failed, I think this is also true for vanilla kernel?
>
> It's needed to form them efficiently but excessive reclaim or writing 3
> to drop_caches can also do it. Be careful of tying lazy buddy too
> closely to compaction.
That's the current design of this patchset; do you see any immediate
problem with it? Is it that you are worried about the high-order
allocation success rate with this design?
On 10/17/18 3:58 PM, Mel Gorman wrote:
> Again, as compaction is not guaranteed to find the pageblocks, it would
> be important to consider whether a) that matters or b) find an
> alternative way of keeping unmerged buddies on separate lists so they
> can be quickly discovered when a high-order allocation fails.
Agree, unmerged buddies could be on separate freelist from regular
order-0 freelist. That list could be also preferred to allocations
before the regular one. Then one could e.g. try "direct merging" via
this list when compaction fails, or prefer direct merging to compaction
for non-costly-order allocations, do direct merging when allocation
context doesn't even allow compaction (atomic etc).
Also I would definitely consider always merging pages freed to
non-MOVABLE pageblocks. We really don't want to increase the
fragmentation in those. However that means it probably won't help the
netperf case?
Vlastimil
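To make the direct-merging idea above concrete, a minimal sketch
(direct_merge_unmerged_pages() and the zone->unmerged_free_list field are
hypothetical and not in the posted patches; try_to_merge_page() is the
helper the patches declare in mm/internal.h, and its behaviour with
respect to this list is an assumption):

/*
 * Illustrative sketch only: drain a hypothetical per-zone list of
 * merge-skipped pages, e.g. when compaction fails or when the allocation
 * context does not allow compaction.  Locking is glossed over; a real
 * version would have to coordinate with zone->lock and with the
 * compaction-time merging.
 */
static void direct_merge_unmerged_pages(struct zone *zone)
{
        while (!list_empty(&zone->unmerged_free_list)) {
                struct page *page;

                page = list_first_entry(&zone->unmerged_free_list,
                                        struct page, lru);
                /*
                 * Assumed behaviour: try_to_merge_page() coalesces the
                 * page with its free buddies and removes it (and any
                 * merge-skipped buddies it absorbs) from
                 * unmerged_free_list, so the loop terminates.
                 */
                try_to_merge_page(page);
        }
}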
On Wed, Oct 17, 2018 at 07:03:30PM +0200, Vlastimil Babka wrote:
> On 10/17/18 3:58 PM, Mel Gorman wrote:
> > Again, as compaction is not guaranteed to find the pageblocks, it would
> > be important to consider whether a) that matters or b) find an
> > alternative way of keeping unmerged buddies on separate lists so they
> > can be quickly discovered when a high-order allocation fails.
>
> Agree, unmerged buddies could be on separate freelist from regular
> order-0 freelist. That list could be also preferred to allocations
> before the regular one. Then one could e.g. try "direct merging" via
> this list when compaction fails, or prefer direct merging to compaction
> for non-costly-order allocations, do direct merging when allocation
> context doesn't even allow compaction (atomic etc).
One concern regarding "direct merging" these unmerged pages via this
separate freelist(let's call it unmerged_free_list) is: adjacent
unmerged pages on the unmerged_free_list could be far away from each
other regarding their physical positions, so during the process of
merging them, the needed high order page may not be able to be formed
in a short time. Actually, the time could be unbound in a bad condition
when:
1 unmerged pages adjacent on the unmerged_free_list happen to be far
away from each other regarding their physical positions; and
2 there are a lot of unmerged pages on unmerged_free_list.
That's the reason I hooked the merging of unmerged pages in compaction
when isolate_migratepages_block() is scanning every page of a pageblock
in PFN order.
OTOH, if there is a kernel thread trying to reduce fragmentation by
doing merges for these unmerged pages, I think it's perfectly fine to let
it iterate all unmerged pages of that list and do_merge() for all of
them.
So what about this: if kcompactd is running, let it handle these
unmerged pages on the list and after that, do its usual job of
compaction. If direct compaction is running, do not handle unmerged
pages on that list but rely on isolate_migratepages_block() to do the
merging as is done in this patchset.
This of course has the effect of tying compaction with 'lazy merging'.
If it is not desirable, what about creating a new kernel thread to do
the merging of unmerged pages on the list while keeping the behaviour of
isolate_migratepages_block() in this patchset to improve compaction
success rate.
> Also I would definitely consider always merging pages freed to
> non-MOVABLE pageblocks. We really don't want to increase the
> fragmentation in those. However that means it probably won't help the
> netperf case?
Yes, that would be unfortunate for all in-kernel users of page
allocator...
On 10/18/18 8:48 AM, Aaron Lu wrote:
> On Wed, Oct 17, 2018 at 07:03:30PM +0200, Vlastimil Babka wrote:
>> On 10/17/18 3:58 PM, Mel Gorman wrote:
>>> Again, as compaction is not guaranteed to find the pageblocks, it would
>>> be important to consider whether a) that matters or b) find an
>>> alternative way of keeping unmerged buddies on separate lists so they
>>> can be quickly discovered when a high-order allocation fails.
>>
>> Agree, unmerged buddies could be on separate freelist from regular
>> order-0 freelist. That list could be also preferred to allocations
>> before the regular one. Then one could e.g. try "direct merging" via
>> this list when compaction fails, or prefer direct merging to compaction
>> for non-costly-order allocations, do direct merging when allocation
>> context doesn't even allow compaction (atomic etc).
>
> One concern regarding "direct merging" these unmerged pages via this
> separate freelist(let's call it unmerged_free_list) is: adjacent
> unmerged pages on the unmerged_free_list could be far away from each
> other regarding their physical positions, so during the process of
> merging them, the needed high order page may not be able to be formed
> in a short time. Actually, the time could be unbound in a bad condition
> when:
> 1 unmerged pages adjacent on the unmerged_free_list happen to be far
> away from each other regarding their physical positions; and
I'm not sure I understand. Why should it matter for merging if pages are
adjacent on the unmerged_free_list? The buddy for merging is found the
usual way, no?
> 2 there are a lot of unmerged pages on unmerged_free_list.
That will affect allocation latency, yeah. Still might be faster than
direct compaction. And possible to do in GFP_ATOMIC context, unlike
direct compaction.
> That's the reason I hooked the merging of unmerged pages in compaction
> when isolate_migratepages_block() is scanning every page of a pageblock
> in PFN order.
>
> OTOH, if there is a kernel thread trying to reduce fragmentation by
> doing merges for these unmerged pages, I think it's perfect fine to let
> it iterate all unmerged pages of that list and do_merge() for all of
> them.
>
> So what about this: if kcompactd is running, let it handle these
> unmerged pages on the list and after that, do its usual job of
> compaction. If direct compaction is running, do not handle unmerged
> pages on that list but rely on isolate_migratepages_block() to do the
> merging as is done in this patchset.
>
> This of course has the effect of tying compaction with 'lazy merging'.
> If it is not desirable, what about creating a new kernel thread to do
> the merging of unmerged pages on the list while keeping the behaviour of
> isolate_migratepages_block() in this patchset to improve compaction
> success rate.
Note that anything based on daemons will seem like reducing latency for
allocations, but if we delay merging and then later do it from a daemon,
the overall zone lock times will be essentially the same, right? The
reduced zone lock benefits only happen when the unmerged pages get
reallocated.
>> Also I would definitely consider always merging pages freed to
>> non-MOVABLE pageblocks. We really don't want to increase the
>> fragmentation in those. However that means it probably won't help the
>> netperf case?
>
> Yes, that would be unfortunate for all in-kernel users of page
> allocator...
In that case there should definitely be a direct merging possibility
IMHO, even if only as a last resort before stealing from another pageblock.
On Thu, Oct 18, 2018 at 10:23:22AM +0200, Vlastimil Babka wrote:
> On 10/18/18 8:48 AM, Aaron Lu wrote:
> > On Wed, Oct 17, 2018 at 07:03:30PM +0200, Vlastimil Babka wrote:
> >> On 10/17/18 3:58 PM, Mel Gorman wrote:
> >>> Again, as compaction is not guaranteed to find the pageblocks, it would
> >>> be important to consider whether a) that matters or b) find an
> >>> alternative way of keeping unmerged buddies on separate lists so they
> >>> can be quickly discovered when a high-order allocation fails.
> >>
> >> Agree, unmerged buddies could be on separate freelist from regular
> >> order-0 freelist. That list could be also preferred to allocations
> >> before the regular one. Then one could e.g. try "direct merging" via
> >> this list when compaction fails, or prefer direct merging to compaction
> >> for non-costly-order allocations, do direct merging when allocation
> >> context doesn't even allow compaction (atomic etc).
> >
> > One concern regarding "direct merging" these unmerged pages via this
> > separate freelist(let's call it unmerged_free_list) is: adjacent
> > unmerged pages on the unmerged_free_list could be far away from each
> > other regarding their physical positions, so during the process of
> > merging them, the needed high order page may not be able to be formed
> > in a short time. Actually, the time could be unbound in a bad condition
> > when:
> > 1 unmerged pages adjacent on the unmerged_free_list happen to be far
> > away from each other regarding their physical positions; and
>
> I'm not sure I understand. Why should it matter for merging if pages are
> adjacent on the unmerged_free_list? The buddy for merging is found the
> usual way, no?
Yes, it's found the usual way. I probably didn't state it clearly, let me
try again.
Consider a pageblock that is initially a free order-9 page. Let's assume
this order-9 page is expand()ed into 512 order-0 pages during different
allocation requests and they go to different applications running on
different CPUs. After some time, all of them are freed back, but each
of them is freed back at a different time, so they are not adjacent on
the unmerged_free_list (they could be far away from each other).
In the above scenario, merging pages on the unmerged_free_list one by one
may not be an efficient way to form a high-order page, but scanning a
pageblock PFN-wise could be.
Of course, the above scenario is imagined by me as a worst case; the
normal case could be much better.
>
> > 2 there are a lot of unmerged pages on unmerged_free_list.
>
> That will affect allocation latency, yeah. Still might be faster than
> direct compaction. And possible to do in GFP_ATOMIC context, unlike
> direct compaction.
I see, but I'm not sure if it is OK to do 'direct merging' in GFP_ATOMIC
context - it is better for cases where failing the high-order
allocation is very bad, but it might not be a good idea if the caller
has a fallback mechanism, i.e. if the high-order allocation fails, it
can work with order-0.
>
> > That's the reason I hooked the merging of unmerged pages in compaction
> > when isolate_migratepages_block() is scanning every page of a pageblock
> > in PFN order.
> >
> > OTOH, if there is a kernel thread trying to reduce fragmentation by
> > doing merges for these unmerged pages, I think it's perfect fine to let
> > it iterate all unmerged pages of that list and do_merge() for all of
> > them.
> >
> > So what about this: if kcompactd is running, let it handle these
> > unmerged pages on the list and after that, do its usual job of
> > compaction. If direct compaction is running, do not handle unmerged
> > pages on that list but rely on isolate_migratepages_block() to do the
> > merging as is done in this patchset.
> >
> > This of course has the effect of tying compaction with 'lazy merging'.
> > If it is not desirable, what about creating a new kernel thread to do
> > the merging of unmerged pages on the list while keeping the behaviour of
> > isolate_migratepages_block() in this patchset to improve compaction
> > success rate.
>
> Note that anything based on daemons will seem like reducing latency for
> allocations, but if we delay merging and then later do it from a daemon,
> the overall zone lock times will be essentially the same, right? The
> reduced zone lock benefits only happen when the unmerged pages get
> reallocated.
Agree.
> >> Also I would definitely consider always merging pages freed to
> >> non-MOVABLE pageblocks. We really don't want to increase the
> >> fragmentation in those. However that means it probably won't help the
> >> netperf case?
> >
> > Yes, that would be unfortunate for all in-kernel users of page
> > allocator...
>
> In that case there should definitely be a direct merging possibility
> IMHO, even if only as a last resort before stealing from another pageblock.
On Wed, Oct 17, 2018 at 10:59:04PM +0800, Aaron Lu wrote:
> > Any particuular reason why? I assume it's related to the number of zone
> > locks with the increase number of zones and the number of threads used
> > for the test.
>
> I think so too.
>
> The 4 sockets server has 192 CPUs in total while the 2 sockets server
> has 112 CPUs in total. Assume only ZONE_NORMAL are used, for the 4
> sockets server it would be 192/4=48(CPUs per zone) while for the 2
> sockets server it is 112/2=56(CPUs per zone). The test is started with
> nr_task=nr_cpu so for the 2 sockets servers, it ends up having more CPUs
> consuming one zone.
>
Nice that the prediction is accurate. It brings us to another option --
breaking up the zone lock by either hash or address space ranges. The
address space ranges would probably be easier to implement. Where it
gets hairy is that PFN walkers would need different zone locks. However,
overall it might be a better option because it's not order-0 specific.
It would be a lot of legwork because all uses of the zone lock would
have to be audited to see which ones protect the free lists and which
ones protect "something else".
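As a rough illustration of the address-space-range variant (everything
here is made up for the example -- the number of ranges, the structure
and the helper are not from any patch):

#define FREE_LOCK_RANGES 8      /* e.g. split the zone span into 8 ranges */

struct zone_range_locks {
        spinlock_t lock[FREE_LOCK_RANGES];
};

/* pick the lock covering a given pfn's slice of the zone */
static inline spinlock_t *range_lock_for_pfn(struct zone *zone,
                                             struct zone_range_locks *locks,
                                             unsigned long pfn)
{
        unsigned long range_size, idx;

        range_size = DIV_ROUND_UP(zone->spanned_pages, FREE_LOCK_RANGES);
        idx = (pfn - zone->zone_start_pfn) / range_size;
        if (idx >= FREE_LOCK_RANGES)
                idx = FREE_LOCK_RANGES - 1;
        return &locks->lock[idx];
}

A free or allocation would then take only the lock for the range its pfn
falls in; PFN walkers that cross ranges are where it gets hairy, as noted
above.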
> > That's important to know. It does reduce the utility of the patch
> > somewhat but not all arches support THP and THP is not always enabled on
> > x86.
>
> I always wondered how systems are making use of THP.
> After all, when system has been runing a while(days or months), file
> cache should consumed a lot of memory and high order pages will become
> more and more scare. If order9 page can't be reliably allocated, will
> workload rely on it?
> Just a thought.
>
File cache can usually be trivially reclaimed and moved. It's a "how
long is a piece of string" to determine at what point a system can get
fragmented and whether that can be prevented. It's somewhat outside the
scope of this patch but anecdotally I'm looking at a machine with 20 days
uptime and it still has 2390GB worth of THPs free after a large amount
of reclaim activity over the system lifetime so fragmentation avoidance
does work in some cases.
> THP is of course pretty neat that it reduced TLB cost, needs fewer page
> table etc. I just wondered if people really rely on it, or using it
> after their system has been up for a long time.
>
If people didn't rely on it then we might as well delete THP and
declare the whole tmpfs-backed-THP as worthless.
> > Yes, but note that the concept is still problematic.
> > isolate_migratepages_block is not guaranteed to find a pageblock with
> > unmerged buddies in it. If there are pageblocks towards the end of the
> > zone with unmerged pages, they may never be found. This will be very hard
> > to detect at runtime because it's heavily dependant on the exact state
> > of the system.
>
> Quite true.
>
> The intent here though, is not to have compaction merge back all
> unmerged pages, but did the merge for these unmerged pages in a
> piggyback way, i.e. since isolate_migratepages_block() is doing the
> scan, why don't we let it handle these unmerged pages when it meets
> them?
>
> If for some reason isolate_migratepages_block() didn't meet a single
> unmerged page before compaction succeed, we probably do not need worry
> much yet since compaction succeeded anyway.
>
I don't think this is the right way of thinking about it because it's
possible to have the system split in such a way so that the migration
scanner only encounters unmovable pages before it meets the free scanner
where unmerged buddies were in the higher portion of the address space.
You either need to keep unmerged buddies on a separate list or search
the order-0 free list for merge candidates prior to compaction.
> > It's needed to form them efficiently but excessive reclaim or writing 3
> > to drop_caches can also do it. Be careful of tying lazy buddy too
> > closely to compaction.
>
> That's the current design of this patchset, do you see any immediate
> problem of this? Is it that you are worried about high-order allocation
> success rate using this design?
I've pointed out what I see are the design flaws but yes, in general, I'm
worried about the high order allocation success rate using this design,
the reliance on compaction and the fact that the primary motivation is
when THP is disabled.
--
Mel Gorman
SUSE Labs
On Wed, Oct 17, 2018 at 10:23:27PM +0800, Aaron Lu wrote:
> > RT has had problems with cpu_relax in the past but more importantly, as
> > this delay for parallel compactions and allocations of contig ranges,
> > we could be stuck here for very long periods of time with interrupts
>
> The longest possible time is one CPU accessing pcp->batch number cold
> cachelines. Reason:
> When zone_wait_cluster_alloc() is called, we already held zone lock so
> no more allocations are possible. Waiting in_progress to become zero
> means waiting any CPU that increased in_progress to finish processing
> their allocated pages. Since they will at most allocate pcp->batch pages
> and worse case are all these page structres are cache cold, so the
> longest wait time is one CPU accessing pcp->batch number cold cache lines.
>
> I have no idea if this time is too long though.
>
But compact_zone calls zone_wait_and_disable_cluster_alloc so how is the
disabled time there bound by pcp->batch?
> > disabled. It gets even worse if it's from an interrupt context such as
> > jumbo frame allocation or a high-order slab allocation that is atomic.
>
> My understanding is atomic allocation won't trigger compaction, no?
>
No, they can't. I didn't check properly but be wary of any possibility
whereby interrupts can get delayed in zone_wait_cluster_alloc. I didn't
go back and check if it can -- partially because I'm more focused on the
lazy buddy aspect at the moment.
> > It may be necessary to consider instead minimising the number
> > of struct page update when merging to PCP and then either increasing the
> > size of the PCP or allowing it to exceed pcp->high for short periods of
> > time to batch the struct page updates.
>
> I don't quite follow this part. It doesn't seem possible we can exceed
> pcp->high in allocation path, or are you talking about free path?
>
I'm talking about the free path.
> And thanks a lot for the review!
My pleasure, hope it helps.
--
Mel Gorman
SUSE Labs
On Thu, Oct 18, 2018 at 12:20:55PM +0100, Mel Gorman wrote:
> On Wed, Oct 17, 2018 at 10:23:27PM +0800, Aaron Lu wrote:
> > > RT has had problems with cpu_relax in the past but more importantly, as
> > > this delay for parallel compactions and allocations of contig ranges,
> > > we could be stuck here for very long periods of time with interrupts
> >
> > The longest possible time is one CPU accessing pcp->batch number cold
> > cachelines. Reason:
> > When zone_wait_cluster_alloc() is called, we already held zone lock so
> > no more allocations are possible. Waiting in_progress to become zero
> > means waiting any CPU that increased in_progress to finish processing
> > their allocated pages. Since they will at most allocate pcp->batch pages
> > and worse case are all these page structres are cache cold, so the
> > longest wait time is one CPU accessing pcp->batch number cold cache lines.
> >
> > I have no idea if this time is too long though.
> >
>
> But compact_zone calls zone_wait_and_disable_cluster_alloc so how is the
> disabled time there bound by pcp->batch?
My mistake, I misunderstood spin_lock_irqsave() and thought the lock
would need to be acquired before irqs are disabled...
So yeah, your concern about a possibly excessively long irq-disabled
time here is valid.
On Thu, Oct 18, 2018 at 12:16:32PM +0100, Mel Gorman wrote:
> On Wed, Oct 17, 2018 at 10:59:04PM +0800, Aaron Lu wrote:
> > > Any particuular reason why? I assume it's related to the number of zone
> > > locks with the increase number of zones and the number of threads used
> > > for the test.
> >
> > I think so too.
> >
> > The 4 sockets server has 192 CPUs in total while the 2 sockets server
> > has 112 CPUs in total. Assume only ZONE_NORMAL are used, for the 4
> > sockets server it would be 192/4=48(CPUs per zone) while for the 2
> > sockets server it is 112/2=56(CPUs per zone). The test is started with
> > nr_task=nr_cpu so for the 2 sockets servers, it ends up having more CPUs
> > consuming one zone.
> >
>
> Nice that the prediction is accurate. It brings us to another option --
> breaking up the zone lock by either hash or address space ranges. The
> address space ranges would probably be easier to implement. Where it
> gets hairy is that PFN walkers would need different zone locks. However,
> overall it might be a better option because it's not order-0 specific.
I think the 'address space range' lock is worth a try.
> It would be a lot of legwork because all uses of the zone lock would
> have to be audited to see which ones protect the free lists and which
> ones protect "something else".
Yes a lot of details.
> > > That's important to know. It does reduce the utility of the patch
> > > somewhat but not all arches support THP and THP is not always enabled on
> > > x86.
> >
> > I always wondered how systems are making use of THP.
> > After all, when system has been runing a while(days or months), file
> > cache should consumed a lot of memory and high order pages will become
> > more and more scare. If order9 page can't be reliably allocated, will
> > workload rely on it?
> > Just a thought.
> >
>
> File cache can usually be trivially reclaimed and moved. It's a "how
> long is a piece of string" to determine at what point a system can get
> fragmented and whether than can be prevented. It's somewhat outside the
> scope of this patch but anecdotally I'm looking at a machine with 20 days
> uptime and it still has 2390GB worth of THPs free after a large amount
> of reclaim activity over the system lifetime so fragmentation avoidance
> does work in some cases.
Good to know, thanks.
>
> > THP is of course pretty neat that it reduced TLB cost, needs fewer page
> > table etc. I just wondered if people really rely on it, or using it
> > after their system has been up for a long time.
> >
>
> If people didn't rely on it then we might as well delete THP and the
> declare the whole tmpfs-backed-THP as worthless.
>
> > > Yes, but note that the concept is still problematic.
> > > isolate_migratepages_block is not guaranteed to find a pageblock with
> > > unmerged buddies in it. If there are pageblocks towards the end of the
> > > zone with unmerged pages, they may never be found. This will be very hard
> > > to detect at runtime because it's heavily dependant on the exact state
> > > of the system.
> >
> > Quite true.
> >
> > The intent here though, is not to have compaction merge back all
> > unmerged pages, but did the merge for these unmerged pages in a
> > piggyback way, i.e. since isolate_migratepages_block() is doing the
> > scan, why don't we let it handle these unmerged pages when it meets
> > them?
> >
> > If for some reason isolate_migratepages_block() didn't meet a single
> > unmerged page before compaction succeed, we probably do not need worry
> > much yet since compaction succeeded anyway.
> >
>
> I don't think this is the right way of thinking about it because it's
> possible to have the system split in such a way so that the migration
> scanner only encounters unmovable pages before it meets the free scanner
> where unmerged buddies were in the higher portion of the address space.
Yes it is possible unmerged pages are in the higher portion.
My understanding is, when the two scanners meet, all unmerged pages will
be either used by the free scanner as migrate targets or sent to merge
by the migration scanner.
>
> You either need to keep unmerged buddies on a separate list or search
> the order-0 free list for merge candidates prior to compaction.
>
> > > It's needed to form them efficiently but excessive reclaim or writing 3
> > > to drop_caches can also do it. Be careful of tying lazy buddy too
> > > closely to compaction.
> >
> > That's the current design of this patchset, do you see any immediate
> > problem of this? Is it that you are worried about high-order allocation
> > success rate using this design?
>
> I've pointed out what I see are the design flaws but yes, in general, I'm
> worried about the high order allocation success rate using this design,
> the reliance on compaction and the fact that the primary motivation is
> when THP is disabled.
When THP is in use, zone lock contention is pretty much nowhere :-)
I'll see what I can get with 'address space range' lock first and will
come back to 'lazy buddy' if it doesn't work out. Thank you and
Vlastimil for all the suggestions.
On Fri, Oct 19, 2018 at 01:57:03PM +0800, Aaron Lu wrote:
> >
> > I don't think this is the right way of thinking about it because it's
> > possible to have the system split in such a way so that the migration
> > scanner only encounters unmovable pages before it meets the free scanner
> > where unmerged buddies were in the higher portion of the address space.
>
> Yes it is possible unmerged pages are in the higher portion.
>
> My understanding is, when the two scanners meet, all unmerged pages will
> be either used by the free scanner as migrate targets or sent to merge
> by the migration scanner.
>
It's not guaranteed if the lower portion of the address space consisted
entirely of pages that cannot migrate (because they are unmovable or because
migration failed due to pins). It's actually a fundamental limitation
of compaction that it can miss migration and compaction opportunities
due to how the scanners are implemented. It was designed that way to
avoid pageblocks being migrated unnecessarily back and forth but the
downside is missed opportunities.
> > You either need to keep unmerged buddies on a separate list or search
> > the order-0 free list for merge candidates prior to compaction.
> >
> > > > It's needed to form them efficiently but excessive reclaim or writing 3
> > > > to drop_caches can also do it. Be careful of tying lazy buddy too
> > > > closely to compaction.
> > >
> > > That's the current design of this patchset, do you see any immediate
> > > problem of this? Is it that you are worried about high-order allocation
> > > success rate using this design?
> >
> > I've pointed out what I see are the design flaws but yes, in general, I'm
> > worried about the high order allocation success rate using this design,
> > the reliance on compaction and the fact that the primary motivation is
> > when THP is disabled.
>
> When THP is in use, zone lock contention is pretty much nowhere :-)
>
> I'll see what I can get with 'address space range' lock first and will
> come back to 'lazy buddy' if it doesn't work out. Thank you and
> Vlastimil for all the suggestions.
My pleasure.
--
Mel Gorman
SUSE Labs
On Fri, Oct 19, 2018 at 09:54:35AM +0100, Mel Gorman wrote:
> On Fri, Oct 19, 2018 at 01:57:03PM +0800, Aaron Lu wrote:
> > >
> > > I don't think this is the right way of thinking about it because it's
> > > possible to have the system split in such a way so that the migration
> > > scanner only encounters unmovable pages before it meets the free scanner
> > > where unmerged buddies were in the higher portion of the address space.
> >
> > Yes it is possible unmerged pages are in the higher portion.
> >
> > My understanding is, when the two scanners meet, all unmerged pages will
> > be either used by the free scanner as migrate targets or sent to merge
> > by the migration scanner.
> >
>
> It's not guaranteed if the lower portion of the address space consisted
> entirely of pages that cannot migrate (because they are unmovable or because
> migration failed due to pins). It's actually a fundamental limitation
> of compaction that it can miss migration and compaction opportunities
> due to how the scanners are implemented. It was designed that way to
> avoid pageblocks being migrated unnecessarily back and forth but the
> downside is missed opportunities.
>
> > > You either need to keep unmerged buddies on a separate list or search
> > > the order-0 free list for merge candidates prior to compaction.
> > >
> > > > > It's needed to form them efficiently but excessive reclaim or writing 3
> > > > > to drop_caches can also do it. Be careful of tying lazy buddy too
> > > > > closely to compaction.
> > > >
> > > > That's the current design of this patchset, do you see any immediate
> > > > problem of this? Is it that you are worried about high-order allocation
> > > > success rate using this design?
> > >
> > > I've pointed out what I see are the design flaws but yes, in general, I'm
> > > worried about the high order allocation success rate using this design,
> > > the reliance on compaction and the fact that the primary motivation is
> > > when THP is disabled.
> >
> > When THP is in use, zone lock contention is pretty much nowhere :-)
> >
> > I'll see what I can get with 'address space range' lock first and will
> > come back to 'lazy buddy' if it doesn't work out.
With the address space range idea, wouldn't the zone free_area require changes
too? Otherwise I can't see how locking by address range could synchronize it
as it exists now, with per-order/mt list heads.
One idea is to further subdivide the free area according to how the locking
works and find some reasonable way to handle having to search for pages of a
given order/mt in multiple places.
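To make the 'subdivide the free area' idea concrete, here is a purely
illustrative sketch, not taken from any posted patch: each address-space range
of a zone gets its own lock plus its own copy of the per-order/per-migratetype
free lists. The names range_free_area and RANGES_PER_ZONE are invented for
illustration; struct free_area and MAX_ORDER are the existing allocator
definitions.

#include <linux/mmzone.h>
#include <linux/spinlock.h>

/*
 * Hypothetical layout: one of these per address-space range of a zone,
 * so the range lock replaces zone->lock for pages in that range.
 */
#define RANGES_PER_ZONE	8	/* granularity is an open question */

struct range_free_area {
	spinlock_t		lock;			/* guards this range only */
	struct free_area	free_area[MAX_ORDER];	/* per-order/mt lists, as today */
};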
On Fri, Oct 19, 2018 at 08:00:53AM -0700, Daniel Jordan wrote:
> On Fri, Oct 19, 2018 at 09:54:35AM +0100, Mel Gorman wrote:
> > On Fri, Oct 19, 2018 at 01:57:03PM +0800, Aaron Lu wrote:
> > > >
> > > > I don't think this is the right way of thinking about it because it's
> > > > possible to have the system split in such a way so that the migration
> > > > scanner only encounters unmovable pages before it meets the free scanner
> > > > where unmerged buddies were in the higher portion of the address space.
> > >
> > > Yes it is possible unmerged pages are in the higher portion.
> > >
> > > My understanding is, when the two scanners meet, all unmerged pages will
> > > be either used by the free scanner as migrate targets or sent to merge
> > > by the migration scanner.
> > >
> >
> > It's not guaranteed if the lower portion of the address space consisted
> > entirely of pages that cannot migrate (because they are unmovable or because
> > migration failed due to pins). It's actually a fundamental limitation
> > of compaction that it can miss migration and compaction opportunities
> > due to how the scanners are implemented. It was designed that way to
> > avoid pageblocks being migrated unnecessarily back and forth but the
> > downside is missed opportunities.
> >
> > > > You either need to keep unmerged buddies on a separate list or search
> > > > the order-0 free list for merge candidates prior to compaction.
> > > >
> > > > > > It's needed to form them efficiently but excessive reclaim or writing 3
> > > > > > to drop_caches can also do it. Be careful of tying lazy buddy too
> > > > > > closely to compaction.
> > > > >
> > > > > That's the current design of this patchset, do you see any immediate
> > > > > problem of this? Is it that you are worried about high-order allocation
> > > > > success rate using this design?
> > > >
> > > > I've pointed out what I see are the design flaws but yes, in general, I'm
> > > > worried about the high order allocation success rate using this design,
> > > > the reliance on compaction and the fact that the primary motivation is
> > > > when THP is disabled.
> > >
> > > When THP is in use, zone lock contention is pretty much nowhere :-)
> > >
> > > I'll see what I can get with 'address space range' lock first and will
> > > come back to 'lazy buddy' if it doesn't work out.
>
> With the address space range idea, wouldn't the zone free_area require changes
> too? Otherwise I can't see how locking by address range could synchronize it
> as it exists now, with per-order/mt list heads.
>
> One idea is to further subdivide the free area according to how the locking
> works and find some reasonable way to handle having to search for pages of a
> given order/mt in multiple places.
I plan to create one free_area per 'address space range'. The challenge
will be how to quickly locate a free_area that has the required free
page on the allocation path. Other details, like how big the address
space range should be, will need to be explored with testing.
I think this approach is worth a try because it wouldn't cause
fragmentation.
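A naive allocation-side lookup over such per-range areas could simply try each
range in turn, as in the hedged sketch below; it reuses the hypothetical
range_free_area layout from the earlier sketch, range_rmqueue is an invented
name, and page-order/statistics bookkeeping is omitted. The challenge mentioned
above is precisely how to avoid this linear scan when most ranges have no page
of the wanted order/migratetype.

#include <linux/list.h>

/*
 * Purely illustrative: find a free page of the given order/migratetype
 * by scanning the zone's ranges one by one.
 */
static struct page *range_rmqueue(struct range_free_area *ranges,
				  unsigned int order, int migratetype)
{
	int i;

	for (i = 0; i < RANGES_PER_ZONE; i++) {
		struct range_free_area *rfa = &ranges[i];
		struct free_area *area = &rfa->free_area[order];
		struct page *page;

		spin_lock(&rfa->lock);
		page = list_first_entry_or_null(&area->free_list[migratetype],
						struct page, lru);
		if (page) {
			/* Real code would also clear the buddy order, etc. */
			list_del(&page->lru);
			area->nr_free--;
			spin_unlock(&rfa->lock);
			return page;
		}
		spin_unlock(&rfa->lock);
	}
	return NULL;
}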
On 10/17/18 8:33 AM, Aaron Lu wrote:
> Profile on Intel Skylake server shows the most time consuming part
> under zone->lock on allocation path is accessing those to-be-returned
> page's "struct page" on the free_list inside zone->lock. One explanation
> is, different CPUs are releasing pages to the head of free_list and
> those page's 'struct page' may very well be cache cold for the allocating
> CPU when it grabs these pages from free_list' head. The purpose here
> is to avoid touching these pages one by one inside zone->lock.
What about making the pages cache-hot first, without zone->lock, by
traversing them via page->lru? It would obviously need some safety checks
(maybe based on page_to_pfn + pfn_valid, or something) to make sure we
only read from real struct pages in case there's a racing update. The
worst case would be not populating the cache enough due to the race, and
thus not gaining the performance when doing the actual rmqueueing under
the lock.
Vlastimil
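For illustration only, an unlocked warm-up pass along the lines suggested above
might look like the sketch below; it is not part of the patchset. The function
name and the iteration bound are invented, and the walk deliberately uses
READ_ONCE and bails out on the first suspicious entry, because without
zone->lock the list can change underneath it; that is exactly the risk
discussed in the reply that follows.

#include <linux/mm.h>
#include <linux/prefetch.h>

/*
 * Hedged sketch: warm the cache for pages near the head of the order-0
 * free_list before taking zone->lock.  The lockless walk can race with
 * concurrent queueing/dequeueing; the pfn_valid() check and the bound
 * only try to keep a stale pointer from being treated as a page and to
 * keep the walk short.
 */
static void warm_free_list_head(struct zone *zone, int migratetype,
				unsigned int nr)
{
	struct list_head *head = &zone->free_area[0].free_list[migratetype];
	struct list_head *pos = READ_ONCE(head->next);

	while (nr-- && pos != head) {
		struct page *page = list_entry(pos, struct page, lru);

		/* A racing update may hand us something that is not a page. */
		if (!pfn_valid(page_to_pfn(page)))
			break;

		prefetch(page);
		pos = READ_ONCE(page->lru.next);
	}
}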
On Mon, Oct 22, 2018 at 11:37:53AM +0200, Vlastimil Babka wrote:
> On 10/17/18 8:33 AM, Aaron Lu wrote:
> > Profile on Intel Skylake server shows the most time consuming part
> > under zone->lock on allocation path is accessing those to-be-returned
> > pages' "struct page" on the free_list inside zone->lock. One explanation
> > is, different CPUs are releasing pages to the head of the free_list and
> > those pages' 'struct page' may very well be cache cold for the allocating
> > CPU when it grabs these pages from the free_list's head. The purpose here
> > is to avoid touching these pages one by one inside zone->lock.
>
> What about making the pages cache-hot first, without zone->lock, by
> traversing them via page->lru? It would obviously need some safety checks
> (maybe based on page_to_pfn + pfn_valid, or something) to make sure we
> only read from real struct pages in case there's a racing update. The
> worst case would be not populating the cache enough due to the race, and
> thus not gaining the performance when doing the actual rmqueueing under
> the lock.
Yes, there are 2 potential problems here, as you have pointed out:
1 we may prefetch something that isn't a page, because page->lru can be
reused for different things under different scenarios;
2 we may not be able to prefetch much because another CPU is doing
allocation inside the lock; it's possible we end up prefetching pages
that are already on another CPU's pcp list.
Considering the above 2 problems, I feel prefetching outside the lock is
a little risky and troublesome.
The allocation path is the hard part of improving page allocator
performance: on the free path, we can prefetch pages safely outside the
lock and even pre-merge them outside the lock to reduce the pressure on
the zone lock; but on the allocation path, there is pretty much nothing
we can do before acquiring the lock, except taking the risk of
prefetching pages without holding the lock, as you suggested here.
We can come back to this if 'address space range' lock doesn't work out.
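To illustrate the free-path point above: a page's buddy is derived from its pfn
alone, so the buddy's struct page can be prefetched before zone->lock is taken.
A minimal, hypothetical sketch follows; the helper name is invented, and
pfn ^ (1 << order) is the standard buddy computation.

#include <linux/mm.h>
#include <linux/prefetch.h>

/*
 * Hypothetical helper: warm the cache line of a page's buddy before the
 * free path takes zone->lock.  Whether the buddy is actually free and
 * mergeable is still decided later, under the lock.
 */
static inline void prefetch_buddy_of(struct page *page, unsigned int order)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long buddy_pfn = pfn ^ (1UL << order);

	/* Same idiom the allocator uses to locate a buddy's struct page. */
	prefetch(page + (buddy_pfn - pfn));
}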