2016-12-02 00:22:48

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 0/2] High-order per-cpu cache v5

The following is two patches that implement a per-cpu cache for high-order
allocations, primarily aimed at SLUB. The first patch is a bug fix that
is technically unrelated but was discovered by review and so batched
together. The second is the patch that implements the cache.

include/linux/mmzone.h | 20 +++++++-
mm/page_alloc.c | 122 +++++++++++++++++++++++++++++++------------------
2 files changed, 96 insertions(+), 46 deletions(-)

--
2.10.2


2016-12-02 00:22:52

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted

Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
defer debugging checks of pages allocated from the PCP") will allow the
per-cpu list counter to be out of sync with the per-cpu list contents
if a struct page is corrupted. This patch keeps the accounting in sync.

Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
Signed-off-by: Mel Gorman <[email protected]>
cc: [email protected] [4.7+]
---
mm/page_alloc.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440e3ae2..777ed59570df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long count, struct list_head *list,
int migratetype, bool cold)
{
- int i;
+ int i, alloced = 0;

spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
@@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
else
list_add_tail(&page->lru, list);
list = &page->lru;
+ alloced++;
if (is_migrate_cma(get_pcppage_migratetype(page)))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-(1 << order));
}
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
spin_unlock(&zone->lock);
- return i;
+ return alloced;
}

#ifdef CONFIG_NUMA
--
2.10.2

2016-12-02 00:23:10

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

Changelog since v4
o Avoid pcp->count getting out of sync if struct page gets corrupted

Changelog since v3
o Allow high-order atomic allocations to use reserves

Changelog since v2
o Correct initialisation to avoid -Woverflow warning

SLUB has been the default small kernel object allocator for quite some time
but it is not universally used due to performance concerns and a reliance
on high-order pages. The high-order concerns has two major components --
high-order pages are not always available and high-order page allocations
potentially contend on the zone->lock. This patch addresses some concerns
about the zone lock contention by extending the per-cpu page allocator to
cache high-order pages. The patch makes the following modifications

o New per-cpu lists are added to cache the high-order pages. This increases
the cache footprint of the per-cpu allocator and overall usage but for
some workloads, this will be offset by reduced contention on zone->lock.
The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The
remaining are high-order caches up to and including
PAGE_ALLOC_COSTLY_ORDER

o pcp accounting during free is now confined to free_pcppages_bulk as it's
impossible for the caller to know exactly how many pages were freed.
Due to the high-order caches, the number of pages drained for a request
is no longer precise.

o The high watermark for per-cpu pages is increased to reduce the probability
that a single refill causes a drain on the next free.

The benefit depends on both the workload and the machine as ultimately the
determining factor is whether cache line bounces on zone->lock or contention
is a problem. The patch was tested on a variety of workloads and machines,
some of which are reported here.

This is the result from netperf running UDP_STREAM on localhost. It was
selected on the basis that it is slab-intensive and has been the subject
of previous SLAB vs SLUB comparisons with the caveat that this is not
testing between two physical hosts.

2-socket modern machine
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v5
Hmean send-64 178.38 ( 0.00%) 260.54 ( 46.06%)
Hmean send-128 351.49 ( 0.00%) 518.56 ( 47.53%)
Hmean send-256 671.23 ( 0.00%) 1005.72 ( 49.83%)
Hmean send-1024 2663.60 ( 0.00%) 3880.54 ( 45.69%)
Hmean send-2048 5126.53 ( 0.00%) 7545.38 ( 47.18%)
Hmean send-3312 7949.99 ( 0.00%) 11324.34 ( 42.44%)
Hmean send-4096 9433.56 ( 0.00%) 12818.85 ( 35.89%)
Hmean send-8192 15940.64 ( 0.00%) 21404.98 ( 34.28%)
Hmean send-16384 26699.54 ( 0.00%) 32810.08 ( 22.89%)
Hmean recv-64 178.38 ( 0.00%) 260.52 ( 46.05%)
Hmean recv-128 351.49 ( 0.00%) 518.53 ( 47.53%)
Hmean recv-256 671.20 ( 0.00%) 1005.42 ( 49.79%)
Hmean recv-1024 2663.45 ( 0.00%) 3879.75 ( 45.67%)
Hmean recv-2048 5126.26 ( 0.00%) 7544.23 ( 47.17%)
Hmean recv-3312 7949.50 ( 0.00%) 11322.52 ( 42.43%)
Hmean recv-4096 9433.04 ( 0.00%) 12816.68 ( 35.87%)
Hmean recv-8192 15939.64 ( 0.00%) 21402.75 ( 34.27%)
Hmean recv-16384 26698.44 ( 0.00%) 32806.81 ( 22.88%)

1-socket 6 year old machine
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v4
Hmean send-64 87.47 ( 0.00%) 127.01 ( 45.21%)
Hmean send-128 174.36 ( 0.00%) 254.86 ( 46.17%)
Hmean send-256 347.52 ( 0.00%) 505.91 ( 45.58%)
Hmean send-1024 1363.03 ( 0.00%) 1962.49 ( 43.98%)
Hmean send-2048 2632.68 ( 0.00%) 3731.74 ( 41.75%)
Hmean send-3312 4123.19 ( 0.00%) 5859.08 ( 42.10%)
Hmean send-4096 5056.48 ( 0.00%) 7058.00 ( 39.58%)
Hmean send-8192 8784.22 ( 0.00%) 12134.53 ( 38.14%)
Hmean send-16384 15081.60 ( 0.00%) 19638.90 ( 30.22%)
Hmean recv-64 86.19 ( 0.00%) 126.34 ( 46.58%)
Hmean recv-128 173.93 ( 0.00%) 253.51 ( 45.75%)
Hmean recv-256 346.19 ( 0.00%) 503.34 ( 45.40%)
Hmean recv-1024 1358.28 ( 0.00%) 1951.63 ( 43.68%)
Hmean recv-2048 2623.45 ( 0.00%) 3701.67 ( 41.10%)
Hmean recv-3312 4108.63 ( 0.00%) 5817.75 ( 41.60%)
Hmean recv-4096 5037.25 ( 0.00%) 7004.79 ( 39.06%)
Hmean recv-8192 8762.32 ( 0.00%) 12059.83 ( 37.63%)
Hmean recv-16384 15042.36 ( 0.00%) 19514.33 ( 29.73%)

This is somewhat dramatic but it's also not universal. For example, it was
observed on an older HP machine using pcc-cpufreq that there was almost
no difference but pcc-cpufreq is also a known performance hazard.

These are quite different results but illustrate that the patch is
dependent on the CPU. The results are similar for TCP_STREAM on
the two-socket machine.

The observations on sockperf are different.

2-socket modern machine
sockperf-tcp-throughput
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v5
Hmean 14 93.90 ( 0.00%) 92.74 ( -1.23%)
Hmean 100 1211.02 ( 0.00%) 1284.36 ( 6.05%)
Hmean 300 6016.95 ( 0.00%) 6149.26 ( 2.20%)
Hmean 500 8846.20 ( 0.00%) 8988.84 ( 1.61%)
Hmean 850 12280.71 ( 0.00%) 12434.78 ( 1.25%)
Stddev 14 5.32 ( 0.00%) 4.79 ( 9.88%)
Stddev 100 35.32 ( 0.00%) 74.20 (-110.06%)
Stddev 300 132.63 ( 0.00%) 65.50 ( 50.61%)
Stddev 500 152.90 ( 0.00%) 188.67 (-23.40%)
Stddev 850 221.46 ( 0.00%) 257.61 (-16.32%)

sockperf-udp-throughput
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v5
Hmean 14 36.32 ( 0.00%) 51.25 ( 41.09%)
Hmean 100 258.41 ( 0.00%) 355.76 ( 37.67%)
Hmean 300 773.96 ( 0.00%) 1054.13 ( 36.20%)
Hmean 500 1291.07 ( 0.00%) 1758.21 ( 36.18%)
Hmean 850 2137.88 ( 0.00%) 2939.52 ( 37.50%)
Stddev 14 0.75 ( 0.00%) 1.21 (-61.36%)
Stddev 100 9.02 ( 0.00%) 11.53 (-27.89%)
Stddev 300 13.66 ( 0.00%) 31.24 (-128.62%)
Stddev 500 25.01 ( 0.00%) 53.44 (-113.67%)
Stddev 850 37.72 ( 0.00%) 70.05 (-85.71%)

Note that the improvements for TCP are nowhere near as dramatic as netperf,
there is a slight loss for small packets and it's much more variable. While
it's not presented here, it's known that running sockperf "under load"
that packet latency is generally lower but not universally so. On the
other hand, UDP improves performance but again, is much more variable.

This highlights that the patch is not necessarily a universal win and is
going to depend heavily on both the workload and the CPU used.

hackbench was also tested with both socket and pipes and both processes
and threads and the results are interesting in terms of how variability
is imapcted

1-socket machine
hackbench-process-pipes
4.9.0-rc5 4.9.0-rc5
vanilla highmark-v5
Amean 1 12.9637 ( 0.00%) 13.1807 ( -1.67%)
Amean 3 13.4770 ( 0.00%) 13.6803 ( -1.51%)
Amean 5 18.5333 ( 0.00%) 18.7383 ( -1.11%)
Amean 7 24.5690 ( 0.00%) 23.0550 ( 6.16%)
Amean 12 39.7990 ( 0.00%) 36.7207 ( 7.73%)
Amean 16 56.0520 ( 0.00%) 48.2890 ( 13.85%)
Stddev 1 0.3847 ( 0.00%) 0.5853 (-52.15%)
Stddev 3 0.2652 ( 0.00%) 0.0295 ( 88.89%)
Stddev 5 0.5589 ( 0.00%) 0.2466 ( 55.87%)
Stddev 7 0.5310 ( 0.00%) 0.6680 (-25.79%)
Stddev 12 1.0780 ( 0.00%) 0.3230 ( 70.04%)
Stddev 16 2.1138 ( 0.00%) 0.6835 ( 67.66%)

hackbench-process-sockets
Amean 1 4.8873 ( 0.00%) 4.7180 ( 3.46%)
Amean 3 14.1157 ( 0.00%) 14.3643 ( -1.76%)
Amean 5 22.5537 ( 0.00%) 23.1380 ( -2.59%)
Amean 7 30.3743 ( 0.00%) 31.1520 ( -2.56%)
Amean 12 49.1773 ( 0.00%) 50.3060 ( -2.30%)
Amean 16 64.0873 ( 0.00%) 66.2633 ( -3.40%)
Stddev 1 0.2360 ( 0.00%) 0.2201 ( 6.74%)
Stddev 3 0.0539 ( 0.00%) 0.0780 (-44.72%)
Stddev 5 0.1463 ( 0.00%) 0.1579 ( -7.90%)
Stddev 7 0.1260 ( 0.00%) 0.3091 (-145.31%)
Stddev 12 0.2169 ( 0.00%) 0.4822 (-122.36%)
Stddev 16 0.0529 ( 0.00%) 0.4513 (-753.20%)

It's not a universal win for pipes but the differences are within the
noise. What is interesting is that variability shows both gains and losses
in stark contrast to the sockperf results. On the other hand, sockets
generally show small losses albeit within the noise with more variability.
Once again, the workload and CPU gets different results.

fsmark was tested with zero-sized files to continually allocate slab objects
but didn't show any differences. This can be explained by the fact that the
workload is only allocating and does not have mix of allocs/frees that would
benefit from the caching. It was tested to ensure no major harm was done.

While it is recognised that this is a mixed bag of results, the patch
helps a lot more workloads than it hurts and intuitively, avoiding the
zone->lock in some cases is a good thing.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
---
include/linux/mmzone.h | 20 ++++++++-
mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++------------------
2 files changed, 93 insertions(+), 44 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3a2fed..54032ab2f4f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -255,6 +255,24 @@ enum zone_watermarks {
NR_WMARK
};

+/*
+ * One per migratetype for order-0 pages and one per high-order up to
+ * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
+ * allocations to contaminate reclaimable pageblocks if high-order
+ * pages are heavily used.
+ */
+#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)
+
+static inline unsigned int pindex_to_order(unsigned int pindex)
+{
+ return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
+}
+
+static inline unsigned int order_to_pindex(int migratetype, unsigned int order)
+{
+ return (order == 0) ? migratetype : MIGRATE_PCPTYPES + order - 1;
+}
+
#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
@@ -265,7 +283,7 @@ struct per_cpu_pages {
int batch; /* chunk size for buddy add/remove */

/* Lists of pages, one per migrate type stored on the pcp-lists */
- struct list_head lists[MIGRATE_PCPTYPES];
+ struct list_head lists[NR_PCP_LISTS];
};

struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 777ed59570df..5f1faaf0a5e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1050,9 +1050,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
}

#ifdef CONFIG_DEBUG_VM
-static inline bool free_pcp_prepare(struct page *page)
+static inline bool free_pcp_prepare(struct page *page, unsigned int order)
{
- return free_pages_prepare(page, 0, true);
+ return free_pages_prepare(page, order, true);
}

static inline bool bulkfree_pcp_prepare(struct page *page)
@@ -1060,9 +1060,9 @@ static inline bool bulkfree_pcp_prepare(struct page *page)
return false;
}
#else
-static bool free_pcp_prepare(struct page *page)
+static bool free_pcp_prepare(struct page *page, unsigned int order)
{
- return free_pages_prepare(page, 0, false);
+ return free_pages_prepare(page, order, false);
}

static bool bulkfree_pcp_prepare(struct page *page)
@@ -1085,8 +1085,9 @@ static bool bulkfree_pcp_prepare(struct page *page)
static void free_pcppages_bulk(struct zone *zone, int count,
struct per_cpu_pages *pcp)
{
- int migratetype = 0;
- int batch_free = 0;
+ unsigned int pindex = UINT_MAX; /* Reclaim will start at 0 */
+ unsigned int batch_free = 0;
+ unsigned int nr_freed = 0;
unsigned long nr_scanned;
bool isolated_pageblocks;

@@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (nr_scanned)
__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);

- while (count) {
+ while (count > 0) {
struct page *page;
struct list_head *list;
+ unsigned int order;

/*
* Remove pages from lists in a round-robin fashion. A
* batch_free count is maintained that is incremented when an
- * empty list is encountered. This is so more pages are freed
- * off fuller lists instead of spinning excessively around empty
- * lists
+ * empty list is encountered. This is not exact due to
+ * high-order but percision is not required.
*/
do {
batch_free++;
- if (++migratetype == MIGRATE_PCPTYPES)
- migratetype = 0;
- list = &pcp->lists[migratetype];
+ if (++pindex == NR_PCP_LISTS)
+ pindex = 0;
+ list = &pcp->lists[pindex];
} while (list_empty(list));

/* This is the only non-empty list. Free them all. */
- if (batch_free == MIGRATE_PCPTYPES)
+ if (batch_free == NR_PCP_LISTS)
batch_free = count;

+ order = pindex_to_order(pindex);
do {
int mt; /* migratetype of the to-be-freed page */

@@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (unlikely(isolated_pageblocks))
mt = get_pageblock_migratetype(page);

+ nr_freed += (1 << order);
+ count -= (1 << order);
if (bulkfree_pcp_prepare(page))
continue;

- __free_one_page(page, page_to_pfn(page), zone, 0, mt);
- trace_mm_page_pcpu_drain(page, 0, mt);
- } while (--count && --batch_free && !list_empty(list));
+ __free_one_page(page, page_to_pfn(page), zone, order, mt);
+ trace_mm_page_pcpu_drain(page, order, mt);
+ } while (count > 0 && --batch_free && !list_empty(list));
}
spin_unlock(&zone->lock);
+ pcp->count -= nr_freed;
}

static void free_one_page(struct zone *zone,
@@ -2244,10 +2249,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
local_irq_save(flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
- if (to_drain > 0) {
+ if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, pcp);
- pcp->count -= to_drain;
- }
local_irq_restore(flags);
}
#endif
@@ -2269,10 +2272,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
pset = per_cpu_ptr(zone->pageset, cpu);

pcp = &pset->pcp;
- if (pcp->count) {
+ if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);
- pcp->count = 0;
- }
local_irq_restore(flags);
}

@@ -2404,18 +2405,18 @@ void mark_free_pages(struct zone *zone)
#endif /* CONFIG_PM */

/*
- * Free a 0-order page
+ * Free a pcp page
* cold == true ? free a cold page : free a hot page
*/
-void free_hot_cold_page(struct page *page, bool cold)
+static void __free_hot_cold_page(struct page *page, bool cold, unsigned int order)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
unsigned long pfn = page_to_pfn(page);
- int migratetype;
+ int migratetype, pindex;

- if (!free_pcp_prepare(page))
+ if (!free_pcp_prepare(page, order))
return;

migratetype = get_pfnblock_migratetype(page, pfn);
@@ -2432,28 +2433,33 @@ void free_hot_cold_page(struct page *page, bool cold)
*/
if (migratetype >= MIGRATE_PCPTYPES) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(zone, page, pfn, 0, migratetype);
+ free_one_page(zone, page, pfn, order, migratetype);
goto out;
}
migratetype = MIGRATE_MOVABLE;
}

+ pindex = order_to_pindex(migratetype, order);
pcp = &this_cpu_ptr(zone->pageset)->pcp;
if (!cold)
- list_add(&page->lru, &pcp->lists[migratetype]);
+ list_add(&page->lru, &pcp->lists[pindex]);
else
- list_add_tail(&page->lru, &pcp->lists[migratetype]);
- pcp->count++;
+ list_add_tail(&page->lru, &pcp->lists[pindex]);
+ pcp->count += 1 << order;
if (pcp->count >= pcp->high) {
unsigned long batch = READ_ONCE(pcp->batch);
free_pcppages_bulk(zone, batch, pcp);
- pcp->count -= batch;
}

out:
local_irq_restore(flags);
}

+void free_hot_cold_page(struct page *page, bool cold)
+{
+ __free_hot_cold_page(page, cold, 0);
+}
+
/*
* Free a list of 0-order pages
*/
@@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
struct page *page;
bool cold = ((gfp_flags & __GFP_COLD) != 0);

- if (likely(order == 0)) {
+ if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
struct per_cpu_pages *pcp;
struct list_head *list;

local_irq_save(flags);
do {
+ unsigned int pindex;
+
+ pindex = order_to_pindex(migratetype, order);
pcp = &this_cpu_ptr(zone->pageset)->pcp;
- list = &pcp->lists[migratetype];
+ list = &pcp->lists[pindex];
if (list_empty(list)) {
- pcp->count += rmqueue_bulk(zone, 0,
+ int nr_pages = rmqueue_bulk(zone, order,
pcp->batch, list,
migratetype, cold);
- if (unlikely(list_empty(list)))
+ if (unlikely(list_empty(list))) {
+ /*
+ * Retry high-order atomic allocs
+ * from the buddy list which may
+ * use MIGRATE_HIGHATOMIC.
+ */
+ if (order && (alloc_flags & ALLOC_HARDER))
+ goto try_buddylist;
+
goto failed;
+ }
+ pcp->count += (nr_pages << order);
}

if (cold)
@@ -2611,10 +2630,11 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
page = list_first_entry(list, struct page, lru);

list_del(&page->lru);
- pcp->count--;
+ pcp->count -= (1 << order);

} while (check_new_pcp(page));
} else {
+try_buddylist:
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
@@ -3838,8 +3858,8 @@ EXPORT_SYMBOL(get_zeroed_page);
void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
- if (order == 0)
- free_hot_cold_page(page, false);
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ __free_hot_cold_page(page, false, order);
else
__free_pages_ok(page, order);
}
@@ -5161,20 +5181,31 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
/* a companion to pageset_set_high() */
static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
{
- pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
+ unsigned long high;
+
+ /*
+ * per-cpu refills occur when a per-cpu list for a migratetype
+ * or a high-order is depleted even if pages are free overall.
+ * Tune the high watermark such that it's unlikely, but not
+ * impossible, that a single refill event will trigger a
+ * shrink on the next free to the per-cpu list.
+ */
+ high = batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER);
+
+ pageset_update(&p->pcp, high, max(1UL, 1 * batch));
}

static void pageset_init(struct per_cpu_pageset *p)
{
struct per_cpu_pages *pcp;
- int migratetype;
+ unsigned int pindex;

memset(p, 0, sizeof(*p));

pcp = &p->pcp;
pcp->count = 0;
- for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
- INIT_LIST_HEAD(&pcp->lists[migratetype]);
+ for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
+ INIT_LIST_HEAD(&pcp->lists[pindex]);
}

static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
--
2.10.2

2016-12-02 03:54:46

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted

On Friday, December 02, 2016 8:23 AM Mel Gorman wrote:
> Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> defer debugging checks of pages allocated from the PCP") will allow the
> per-cpu list counter to be out of sync with the per-cpu list contents
> if a struct page is corrupted. This patch keeps the accounting in sync.
>
> Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> Signed-off-by: Mel Gorman <[email protected]>
> cc: [email protected] [4.7+]
> ---
> mm/page_alloc.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de9440e3ae2..777ed59570df 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> unsigned long count, struct list_head *list,
> int migratetype, bool cold)
> {
> - int i;
> + int i, alloced = 0;
>
> spin_lock(&zone->lock);
> for (i = 0; i < count; ++i) {
> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> else
> list_add_tail(&page->lru, list);
> list = &page->lru;
> + alloced++;
> if (is_migrate_cma(get_pcppage_migratetype(page)))
> __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> -(1 << order));
> }
> __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));

Now i is a pure index, yes?

> spin_unlock(&zone->lock);
> - return i;
> + return alloced;
> }
>
> #ifdef CONFIG_NUMA
> --
> 2.10.2

2016-12-02 06:01:36

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

Hello, Mel.

I didn't follow up previous discussion so what I raise here would be
duplicated. Please let me know the link if it is answered before.

On Fri, Dec 02, 2016 at 12:22:44AM +0000, Mel Gorman wrote:
> Changelog since v4
> o Avoid pcp->count getting out of sync if struct page gets corrupted
>
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
>
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
>
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
>
> o New per-cpu lists are added to cache the high-order pages. This increases
> the cache footprint of the per-cpu allocator and overall usage but for
> some workloads, this will be offset by reduced contention on zone->lock.
> The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The
> remaining are high-order caches up to and including
> PAGE_ALLOC_COSTLY_ORDER
>
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
> impossible for the caller to know exactly how many pages were freed.
> Due to the high-order caches, the number of pages drained for a request
> is no longer precise.
>
> o The high watermark for per-cpu pages is increased to reduce the probability
> that a single refill causes a drain on the next free.
>
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bounces on zone->lock or contention
> is a problem. The patch was tested on a variety of workloads and machines,
> some of which are reported here.
>
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
>
> 2-socket modern machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v5
> Hmean send-64 178.38 ( 0.00%) 260.54 ( 46.06%)
> Hmean send-128 351.49 ( 0.00%) 518.56 ( 47.53%)
> Hmean send-256 671.23 ( 0.00%) 1005.72 ( 49.83%)
> Hmean send-1024 2663.60 ( 0.00%) 3880.54 ( 45.69%)
> Hmean send-2048 5126.53 ( 0.00%) 7545.38 ( 47.18%)
> Hmean send-3312 7949.99 ( 0.00%) 11324.34 ( 42.44%)
> Hmean send-4096 9433.56 ( 0.00%) 12818.85 ( 35.89%)
> Hmean send-8192 15940.64 ( 0.00%) 21404.98 ( 34.28%)
> Hmean send-16384 26699.54 ( 0.00%) 32810.08 ( 22.89%)
> Hmean recv-64 178.38 ( 0.00%) 260.52 ( 46.05%)
> Hmean recv-128 351.49 ( 0.00%) 518.53 ( 47.53%)
> Hmean recv-256 671.20 ( 0.00%) 1005.42 ( 49.79%)
> Hmean recv-1024 2663.45 ( 0.00%) 3879.75 ( 45.67%)
> Hmean recv-2048 5126.26 ( 0.00%) 7544.23 ( 47.17%)
> Hmean recv-3312 7949.50 ( 0.00%) 11322.52 ( 42.43%)
> Hmean recv-4096 9433.04 ( 0.00%) 12816.68 ( 35.87%)
> Hmean recv-8192 15939.64 ( 0.00%) 21402.75 ( 34.27%)
> Hmean recv-16384 26698.44 ( 0.00%) 32806.81 ( 22.88%)
>
> 1-socket 6 year old machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v4
> Hmean send-64 87.47 ( 0.00%) 127.01 ( 45.21%)
> Hmean send-128 174.36 ( 0.00%) 254.86 ( 46.17%)
> Hmean send-256 347.52 ( 0.00%) 505.91 ( 45.58%)
> Hmean send-1024 1363.03 ( 0.00%) 1962.49 ( 43.98%)
> Hmean send-2048 2632.68 ( 0.00%) 3731.74 ( 41.75%)
> Hmean send-3312 4123.19 ( 0.00%) 5859.08 ( 42.10%)
> Hmean send-4096 5056.48 ( 0.00%) 7058.00 ( 39.58%)
> Hmean send-8192 8784.22 ( 0.00%) 12134.53 ( 38.14%)
> Hmean send-16384 15081.60 ( 0.00%) 19638.90 ( 30.22%)
> Hmean recv-64 86.19 ( 0.00%) 126.34 ( 46.58%)
> Hmean recv-128 173.93 ( 0.00%) 253.51 ( 45.75%)
> Hmean recv-256 346.19 ( 0.00%) 503.34 ( 45.40%)
> Hmean recv-1024 1358.28 ( 0.00%) 1951.63 ( 43.68%)
> Hmean recv-2048 2623.45 ( 0.00%) 3701.67 ( 41.10%)
> Hmean recv-3312 4108.63 ( 0.00%) 5817.75 ( 41.60%)
> Hmean recv-4096 5037.25 ( 0.00%) 7004.79 ( 39.06%)
> Hmean recv-8192 8762.32 ( 0.00%) 12059.83 ( 37.63%)
> Hmean recv-16384 15042.36 ( 0.00%) 19514.33 ( 29.73%)
>
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.
>
> These are quite different results but illustrate that the patch is
> dependent on the CPU. The results are similar for TCP_STREAM on
> the two-socket machine.
>
> The observations on sockperf are different.
>
> 2-socket modern machine
> sockperf-tcp-throughput
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v5
> Hmean 14 93.90 ( 0.00%) 92.74 ( -1.23%)
> Hmean 100 1211.02 ( 0.00%) 1284.36 ( 6.05%)
> Hmean 300 6016.95 ( 0.00%) 6149.26 ( 2.20%)
> Hmean 500 8846.20 ( 0.00%) 8988.84 ( 1.61%)
> Hmean 850 12280.71 ( 0.00%) 12434.78 ( 1.25%)
> Stddev 14 5.32 ( 0.00%) 4.79 ( 9.88%)
> Stddev 100 35.32 ( 0.00%) 74.20 (-110.06%)
> Stddev 300 132.63 ( 0.00%) 65.50 ( 50.61%)
> Stddev 500 152.90 ( 0.00%) 188.67 (-23.40%)
> Stddev 850 221.46 ( 0.00%) 257.61 (-16.32%)
>
> sockperf-udp-throughput
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v5
> Hmean 14 36.32 ( 0.00%) 51.25 ( 41.09%)
> Hmean 100 258.41 ( 0.00%) 355.76 ( 37.67%)
> Hmean 300 773.96 ( 0.00%) 1054.13 ( 36.20%)
> Hmean 500 1291.07 ( 0.00%) 1758.21 ( 36.18%)
> Hmean 850 2137.88 ( 0.00%) 2939.52 ( 37.50%)
> Stddev 14 0.75 ( 0.00%) 1.21 (-61.36%)
> Stddev 100 9.02 ( 0.00%) 11.53 (-27.89%)
> Stddev 300 13.66 ( 0.00%) 31.24 (-128.62%)
> Stddev 500 25.01 ( 0.00%) 53.44 (-113.67%)
> Stddev 850 37.72 ( 0.00%) 70.05 (-85.71%)
>
> Note that the improvements for TCP are nowhere near as dramatic as netperf,
> there is a slight loss for small packets and it's much more variable. While
> it's not presented here, it's known that running sockperf "under load"
> that packet latency is generally lower but not universally so. On the
> other hand, UDP improves performance but again, is much more variable.
>
> This highlights that the patch is not necessarily a universal win and is
> going to depend heavily on both the workload and the CPU used.
>
> hackbench was also tested with both socket and pipes and both processes
> and threads and the results are interesting in terms of how variability
> is imapcted
>
> 1-socket machine
> hackbench-process-pipes
> 4.9.0-rc5 4.9.0-rc5
> vanilla highmark-v5
> Amean 1 12.9637 ( 0.00%) 13.1807 ( -1.67%)
> Amean 3 13.4770 ( 0.00%) 13.6803 ( -1.51%)
> Amean 5 18.5333 ( 0.00%) 18.7383 ( -1.11%)
> Amean 7 24.5690 ( 0.00%) 23.0550 ( 6.16%)
> Amean 12 39.7990 ( 0.00%) 36.7207 ( 7.73%)
> Amean 16 56.0520 ( 0.00%) 48.2890 ( 13.85%)
> Stddev 1 0.3847 ( 0.00%) 0.5853 (-52.15%)
> Stddev 3 0.2652 ( 0.00%) 0.0295 ( 88.89%)
> Stddev 5 0.5589 ( 0.00%) 0.2466 ( 55.87%)
> Stddev 7 0.5310 ( 0.00%) 0.6680 (-25.79%)
> Stddev 12 1.0780 ( 0.00%) 0.3230 ( 70.04%)
> Stddev 16 2.1138 ( 0.00%) 0.6835 ( 67.66%)
>
> hackbench-process-sockets
> Amean 1 4.8873 ( 0.00%) 4.7180 ( 3.46%)
> Amean 3 14.1157 ( 0.00%) 14.3643 ( -1.76%)
> Amean 5 22.5537 ( 0.00%) 23.1380 ( -2.59%)
> Amean 7 30.3743 ( 0.00%) 31.1520 ( -2.56%)
> Amean 12 49.1773 ( 0.00%) 50.3060 ( -2.30%)
> Amean 16 64.0873 ( 0.00%) 66.2633 ( -3.40%)
> Stddev 1 0.2360 ( 0.00%) 0.2201 ( 6.74%)
> Stddev 3 0.0539 ( 0.00%) 0.0780 (-44.72%)
> Stddev 5 0.1463 ( 0.00%) 0.1579 ( -7.90%)
> Stddev 7 0.1260 ( 0.00%) 0.3091 (-145.31%)
> Stddev 12 0.2169 ( 0.00%) 0.4822 (-122.36%)
> Stddev 16 0.0529 ( 0.00%) 0.4513 (-753.20%)
>
> It's not a universal win for pipes but the differences are within the
> noise. What is interesting is that variability shows both gains and losses
> in stark contrast to the sockperf results. On the other hand, sockets
> generally show small losses albeit within the noise with more variability.
> Once again, the workload and CPU gets different results.
>
> fsmark was tested with zero-sized files to continually allocate slab objects
> but didn't show any differences. This can be explained by the fact that the
> workload is only allocating and does not have mix of allocs/frees that would
> benefit from the caching. It was tested to ensure no major harm was done.
>
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and intuitively, avoiding the
> zone->lock in some cases is a good thing.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: Jesper Dangaard Brouer <[email protected]>
> ---
> include/linux/mmzone.h | 20 ++++++++-
> mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++------------------
> 2 files changed, 93 insertions(+), 44 deletions(-)
>

> static bool bulkfree_pcp_prepare(struct page *page)
> @@ -1085,8 +1085,9 @@ static bool bulkfree_pcp_prepare(struct page *page)
> static void free_pcppages_bulk(struct zone *zone, int count,
> struct per_cpu_pages *pcp)
> {
> - int migratetype = 0;
> - int batch_free = 0;
> + unsigned int pindex = UINT_MAX; /* Reclaim will start at 0 */
> + unsigned int batch_free = 0;
> + unsigned int nr_freed = 0;
> unsigned long nr_scanned;
> bool isolated_pageblocks;
>
> @@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (nr_scanned)
> __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
> - while (count) {
> + while (count > 0) {
> struct page *page;
> struct list_head *list;
> + unsigned int order;
>
> /*
> * Remove pages from lists in a round-robin fashion. A
> * batch_free count is maintained that is incremented when an
> - * empty list is encountered. This is so more pages are freed
> - * off fuller lists instead of spinning excessively around empty
> - * lists
> + * empty list is encountered. This is not exact due to
> + * high-order but percision is not required.
> */
> do {
> batch_free++;
> - if (++migratetype == MIGRATE_PCPTYPES)
> - migratetype = 0;
> - list = &pcp->lists[migratetype];
> + if (++pindex == NR_PCP_LISTS)
> + pindex = 0;
> + list = &pcp->lists[pindex];
> } while (list_empty(list));
>
> /* This is the only non-empty list. Free them all. */
> - if (batch_free == MIGRATE_PCPTYPES)
> + if (batch_free == NR_PCP_LISTS)
> batch_free = count;
>
> + order = pindex_to_order(pindex);
> do {
> int mt; /* migratetype of the to-be-freed page */
>
> @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (unlikely(isolated_pageblocks))
> mt = get_pageblock_migratetype(page);
>
> + nr_freed += (1 << order);
> + count -= (1 << order);
> if (bulkfree_pcp_prepare(page))
> continue;
>
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> - trace_mm_page_pcpu_drain(page, 0, mt);
> - } while (--count && --batch_free && !list_empty(list));
> + __free_one_page(page, page_to_pfn(page), zone, order, mt);
> + trace_mm_page_pcpu_drain(page, order, mt);
> + } while (count > 0 && --batch_free && !list_empty(list));
> }
> spin_unlock(&zone->lock);
> + pcp->count -= nr_freed;
> }

I guess that this patch would cause following problems.

1. If pcp->batch is too small, high order page will not be freed
easily and survive longer. Think about following situation.

Batch count: 7
MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
-> order 2...

free count: 1 + 1 + 1 + 2 + 4 = 9
so order 3 would not be freed.

2. And, It seems that this logic penalties high order pages. One free
to high order page means 1 << order pages free rather than just
one page free. This logic do round-robin to choose the target page so
amount of freed page will be different by the order. I think that it
makes some sense because high order page are less important to cache
in pcp than lower order but I'd like to know if it is intended or not.
If intended, it deserves the comment.

3. I guess that order-0 file/anon page alloc/free is dominent in many
workloads. If this case happen, it invalidates effect of high order
cache in pcp since cached high order pages would be also freed to the
buddy when burst order-0 free happens.

> @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> struct page *page;
> bool cold = ((gfp_flags & __GFP_COLD) != 0);
>
> - if (likely(order == 0)) {
> + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> struct per_cpu_pages *pcp;
> struct list_head *list;
>
> local_irq_save(flags);
> do {
> + unsigned int pindex;
> +
> + pindex = order_to_pindex(migratetype, order);
> pcp = &this_cpu_ptr(zone->pageset)->pcp;
> - list = &pcp->lists[migratetype];
> + list = &pcp->lists[pindex];
> if (list_empty(list)) {
> - pcp->count += rmqueue_bulk(zone, 0,
> + int nr_pages = rmqueue_bulk(zone, order,
> pcp->batch, list,
> migratetype, cold);

Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
* (1 << order) pages and pcp->count can easily overflow pcp->high
* because list empty here doesn't mean that pcp->count is zero.

Thanks.

2016-12-02 06:19:23

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted

On 12/02/2016 04:47 AM, Hillf Danton wrote:
> On Friday, December 02, 2016 8:23 AM Mel Gorman wrote:
>> Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
>> defer debugging checks of pages allocated from the PCP") will allow the
>> per-cpu list counter to be out of sync with the per-cpu list contents
>> if a struct page is corrupted. This patch keeps the accounting in sync.
>>
>> Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
>> Signed-off-by: Mel Gorman <[email protected]>
>> cc: [email protected] [4.7+]

Acked-by: Vlastimil Babka <[email protected]>

>> ---
>> mm/page_alloc.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6de9440e3ae2..777ed59570df 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>> unsigned long count, struct list_head *list,
>> int migratetype, bool cold)
>> {
>> - int i;
>> + int i, alloced = 0;
>>
>> spin_lock(&zone->lock);
>> for (i = 0; i < count; ++i) {
>> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>> else
>> list_add_tail(&page->lru, list);
>> list = &page->lru;
>> + alloced++;
>> if (is_migrate_cma(get_pcppage_migratetype(page)))
>> __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>> -(1 << order));
>> }
>> __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>
> Now i is a pure index, yes?

No, even if a page fails the check_pcp_refill() check and is not
"allocated", it is also no longer a free page, so it's correct to
subtract it from NR_FREE_PAGES.

>> spin_unlock(&zone->lock);
>> - return i;
>> + return alloced;
>> }
>>
>> #ifdef CONFIG_NUMA
>> --
>> 2.10.2
>

2016-12-02 08:12:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted

On Fri 02-12-16 00:22:43, Mel Gorman wrote:
> Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> defer debugging checks of pages allocated from the PCP") will allow the
> per-cpu list counter to be out of sync with the per-cpu list contents
> if a struct page is corrupted. This patch keeps the accounting in sync.
>
> Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> Signed-off-by: Mel Gorman <[email protected]>
> cc: [email protected] [4.7+]

I am trying to think about what would happen if we did go out of sync
and cannot spot a problem. Vlastimil has mentioned something about
free_pcppages_bulk looping for ever but I cannot see it happening right
now. So why is this worth stable backport?

Anyway the patch looks correct
Acked-by: Michal Hocko <[email protected]>

> ---
> mm/page_alloc.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de9440e3ae2..777ed59570df 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> unsigned long count, struct list_head *list,
> int migratetype, bool cold)
> {
> - int i;
> + int i, alloced = 0;
>
> spin_lock(&zone->lock);
> for (i = 0; i < count; ++i) {
> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> else
> list_add_tail(&page->lru, list);
> list = &page->lru;
> + alloced++;
> if (is_migrate_cma(get_pcppage_migratetype(page)))
> __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> -(1 << order));
> }
> __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));

I guess this deserves a comment (i vs. alloced is confusing and I bet
somebody will come up with a cleanup...). We leak corrupted pages
intentionally so we should uncharge them from the NR_FREE_PAGES.

> spin_unlock(&zone->lock);
> - return i;
> + return alloced;
> }
>
> #ifdef CONFIG_NUMA
> --
> 2.10.2
>

--
Michal Hocko
SUSE Labs

2016-12-02 08:21:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

On Fri 02-12-16 15:03:46, Joonsoo Kim wrote:
[...]
> > o pcp accounting during free is now confined to free_pcppages_bulk as it's
> > impossible for the caller to know exactly how many pages were freed.
> > Due to the high-order caches, the number of pages drained for a request
> > is no longer precise.
> >
> > o The high watermark for per-cpu pages is increased to reduce the probability
> > that a single refill causes a drain on the next free.
[...]
> I guess that this patch would cause following problems.
>
> 1. If pcp->batch is too small, high order page will not be freed
> easily and survive longer. Think about following situation.
>
> Batch count: 7
> MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> -> order 2...
>
> free count: 1 + 1 + 1 + 2 + 4 = 9
> so order 3 would not be freed.

I guess the second paragraph above in the changelog tries to clarify
that...

> 2. And, It seems that this logic penalties high order pages. One free
> to high order page means 1 << order pages free rather than just
> one page free. This logic do round-robin to choose the target page so
> amount of freed page will be different by the order.

Yes this is indeed possible. The first paragraph above mentions this
problem.

> I think that it
> makes some sense because high order page are less important to cache
> in pcp than lower order but I'd like to know if it is intended or not.
> If intended, it deserves the comment.
>
> 3. I guess that order-0 file/anon page alloc/free is dominent in many
> workloads. If this case happen, it invalidates effect of high order
> cache in pcp since cached high order pages would be also freed to the
> buddy when burst order-0 free happens.

Yes this is true and I was wondering the same but I believe this can be
enahanced later on. E.g. we can check the order when crossing pcp->high
mark and only the given order portion of the batch. I just wouldn't over
optimize at this stage.
--
Michal Hocko
SUSE Labs

2016-12-02 08:25:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

On Fri 02-12-16 00:22:44, Mel Gorman wrote:
> Changelog since v4
> o Avoid pcp->count getting out of sync if struct page gets corrupted
>
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
>
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
>
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
>
> o New per-cpu lists are added to cache the high-order pages. This increases
> the cache footprint of the per-cpu allocator and overall usage but for
> some workloads, this will be offset by reduced contention on zone->lock.
> The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The
> remaining are high-order caches up to and including
> PAGE_ALLOC_COSTLY_ORDER
>
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
> impossible for the caller to know exactly how many pages were freed.
> Due to the high-order caches, the number of pages drained for a request
> is no longer precise.
>
> o The high watermark for per-cpu pages is increased to reduce the probability
> that a single refill causes a drain on the next free.
>
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bounces on zone->lock or contention
> is a problem. The patch was tested on a variety of workloads and machines,
> some of which are reported here.
>
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
>
> 2-socket modern machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v5
> Hmean send-64 178.38 ( 0.00%) 260.54 ( 46.06%)
> Hmean send-128 351.49 ( 0.00%) 518.56 ( 47.53%)
> Hmean send-256 671.23 ( 0.00%) 1005.72 ( 49.83%)
> Hmean send-1024 2663.60 ( 0.00%) 3880.54 ( 45.69%)
> Hmean send-2048 5126.53 ( 0.00%) 7545.38 ( 47.18%)
> Hmean send-3312 7949.99 ( 0.00%) 11324.34 ( 42.44%)
> Hmean send-4096 9433.56 ( 0.00%) 12818.85 ( 35.89%)
> Hmean send-8192 15940.64 ( 0.00%) 21404.98 ( 34.28%)
> Hmean send-16384 26699.54 ( 0.00%) 32810.08 ( 22.89%)
> Hmean recv-64 178.38 ( 0.00%) 260.52 ( 46.05%)
> Hmean recv-128 351.49 ( 0.00%) 518.53 ( 47.53%)
> Hmean recv-256 671.20 ( 0.00%) 1005.42 ( 49.79%)
> Hmean recv-1024 2663.45 ( 0.00%) 3879.75 ( 45.67%)
> Hmean recv-2048 5126.26 ( 0.00%) 7544.23 ( 47.17%)
> Hmean recv-3312 7949.50 ( 0.00%) 11322.52 ( 42.43%)
> Hmean recv-4096 9433.04 ( 0.00%) 12816.68 ( 35.87%)
> Hmean recv-8192 15939.64 ( 0.00%) 21402.75 ( 34.27%)
> Hmean recv-16384 26698.44 ( 0.00%) 32806.81 ( 22.88%)
>
> 1-socket 6 year old machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v4
> Hmean send-64 87.47 ( 0.00%) 127.01 ( 45.21%)
> Hmean send-128 174.36 ( 0.00%) 254.86 ( 46.17%)
> Hmean send-256 347.52 ( 0.00%) 505.91 ( 45.58%)
> Hmean send-1024 1363.03 ( 0.00%) 1962.49 ( 43.98%)
> Hmean send-2048 2632.68 ( 0.00%) 3731.74 ( 41.75%)
> Hmean send-3312 4123.19 ( 0.00%) 5859.08 ( 42.10%)
> Hmean send-4096 5056.48 ( 0.00%) 7058.00 ( 39.58%)
> Hmean send-8192 8784.22 ( 0.00%) 12134.53 ( 38.14%)
> Hmean send-16384 15081.60 ( 0.00%) 19638.90 ( 30.22%)
> Hmean recv-64 86.19 ( 0.00%) 126.34 ( 46.58%)
> Hmean recv-128 173.93 ( 0.00%) 253.51 ( 45.75%)
> Hmean recv-256 346.19 ( 0.00%) 503.34 ( 45.40%)
> Hmean recv-1024 1358.28 ( 0.00%) 1951.63 ( 43.68%)
> Hmean recv-2048 2623.45 ( 0.00%) 3701.67 ( 41.10%)
> Hmean recv-3312 4108.63 ( 0.00%) 5817.75 ( 41.60%)
> Hmean recv-4096 5037.25 ( 0.00%) 7004.79 ( 39.06%)
> Hmean recv-8192 8762.32 ( 0.00%) 12059.83 ( 37.63%)
> Hmean recv-16384 15042.36 ( 0.00%) 19514.33 ( 29.73%)
>
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.
>
> These are quite different results but illustrate that the patch is
> dependent on the CPU. The results are similar for TCP_STREAM on
> the two-socket machine.
>
> The observations on sockperf are different.
>
> 2-socket modern machine
> sockperf-tcp-throughput
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v5
> Hmean 14 93.90 ( 0.00%) 92.74 ( -1.23%)
> Hmean 100 1211.02 ( 0.00%) 1284.36 ( 6.05%)
> Hmean 300 6016.95 ( 0.00%) 6149.26 ( 2.20%)
> Hmean 500 8846.20 ( 0.00%) 8988.84 ( 1.61%)
> Hmean 850 12280.71 ( 0.00%) 12434.78 ( 1.25%)
> Stddev 14 5.32 ( 0.00%) 4.79 ( 9.88%)
> Stddev 100 35.32 ( 0.00%) 74.20 (-110.06%)
> Stddev 300 132.63 ( 0.00%) 65.50 ( 50.61%)
> Stddev 500 152.90 ( 0.00%) 188.67 (-23.40%)
> Stddev 850 221.46 ( 0.00%) 257.61 (-16.32%)
>
> sockperf-udp-throughput
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v5
> Hmean 14 36.32 ( 0.00%) 51.25 ( 41.09%)
> Hmean 100 258.41 ( 0.00%) 355.76 ( 37.67%)
> Hmean 300 773.96 ( 0.00%) 1054.13 ( 36.20%)
> Hmean 500 1291.07 ( 0.00%) 1758.21 ( 36.18%)
> Hmean 850 2137.88 ( 0.00%) 2939.52 ( 37.50%)
> Stddev 14 0.75 ( 0.00%) 1.21 (-61.36%)
> Stddev 100 9.02 ( 0.00%) 11.53 (-27.89%)
> Stddev 300 13.66 ( 0.00%) 31.24 (-128.62%)
> Stddev 500 25.01 ( 0.00%) 53.44 (-113.67%)
> Stddev 850 37.72 ( 0.00%) 70.05 (-85.71%)
>
> Note that the improvements for TCP are nowhere near as dramatic as netperf,
> there is a slight loss for small packets and it's much more variable. While
> it's not presented here, it's known that running sockperf "under load"
> that packet latency is generally lower but not universally so. On the
> other hand, UDP improves performance but again, is much more variable.
>
> This highlights that the patch is not necessarily a universal win and is
> going to depend heavily on both the workload and the CPU used.
>
> hackbench was also tested with both socket and pipes and both processes
> and threads and the results are interesting in terms of how variability
> is imapcted
>
> 1-socket machine
> hackbench-process-pipes
> 4.9.0-rc5 4.9.0-rc5
> vanilla highmark-v5
> Amean 1 12.9637 ( 0.00%) 13.1807 ( -1.67%)
> Amean 3 13.4770 ( 0.00%) 13.6803 ( -1.51%)
> Amean 5 18.5333 ( 0.00%) 18.7383 ( -1.11%)
> Amean 7 24.5690 ( 0.00%) 23.0550 ( 6.16%)
> Amean 12 39.7990 ( 0.00%) 36.7207 ( 7.73%)
> Amean 16 56.0520 ( 0.00%) 48.2890 ( 13.85%)
> Stddev 1 0.3847 ( 0.00%) 0.5853 (-52.15%)
> Stddev 3 0.2652 ( 0.00%) 0.0295 ( 88.89%)
> Stddev 5 0.5589 ( 0.00%) 0.2466 ( 55.87%)
> Stddev 7 0.5310 ( 0.00%) 0.6680 (-25.79%)
> Stddev 12 1.0780 ( 0.00%) 0.3230 ( 70.04%)
> Stddev 16 2.1138 ( 0.00%) 0.6835 ( 67.66%)
>
> hackbench-process-sockets
> Amean 1 4.8873 ( 0.00%) 4.7180 ( 3.46%)
> Amean 3 14.1157 ( 0.00%) 14.3643 ( -1.76%)
> Amean 5 22.5537 ( 0.00%) 23.1380 ( -2.59%)
> Amean 7 30.3743 ( 0.00%) 31.1520 ( -2.56%)
> Amean 12 49.1773 ( 0.00%) 50.3060 ( -2.30%)
> Amean 16 64.0873 ( 0.00%) 66.2633 ( -3.40%)
> Stddev 1 0.2360 ( 0.00%) 0.2201 ( 6.74%)
> Stddev 3 0.0539 ( 0.00%) 0.0780 (-44.72%)
> Stddev 5 0.1463 ( 0.00%) 0.1579 ( -7.90%)
> Stddev 7 0.1260 ( 0.00%) 0.3091 (-145.31%)
> Stddev 12 0.2169 ( 0.00%) 0.4822 (-122.36%)
> Stddev 16 0.0529 ( 0.00%) 0.4513 (-753.20%)
>
> It's not a universal win for pipes but the differences are within the
> noise. What is interesting is that variability shows both gains and losses
> in stark contrast to the sockperf results. On the other hand, sockets
> generally show small losses albeit within the noise with more variability.
> Once again, the workload and CPU gets different results.
>
> fsmark was tested with zero-sized files to continually allocate slab objects
> but didn't show any differences. This can be explained by the fact that the
> workload is only allocating and does not have mix of allocs/frees that would
> benefit from the caching. It was tested to ensure no major harm was done.
>
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and intuitively, avoiding the
> zone->lock in some cases is a good thing.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: Jesper Dangaard Brouer <[email protected]>

I though I already did but anyway
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs

2016-12-02 09:05:53

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > if (unlikely(isolated_pageblocks))
> > mt = get_pageblock_migratetype(page);
> >
> > + nr_freed += (1 << order);
> > + count -= (1 << order);
> > if (bulkfree_pcp_prepare(page))
> > continue;
> >
> > - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > - trace_mm_page_pcpu_drain(page, 0, mt);
> > - } while (--count && --batch_free && !list_empty(list));
> > + __free_one_page(page, page_to_pfn(page), zone, order, mt);
> > + trace_mm_page_pcpu_drain(page, order, mt);
> > + } while (count > 0 && --batch_free && !list_empty(list));
> > }
> > spin_unlock(&zone->lock);
> > + pcp->count -= nr_freed;
> > }
>
> I guess that this patch would cause following problems.
>
> 1. If pcp->batch is too small, high order page will not be freed
> easily and survive longer. Think about following situation.
>
> Batch count: 7
> MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> -> order 2...
>
> free count: 1 + 1 + 1 + 2 + 4 = 9
> so order 3 would not be freed.
>

You're relying on the batch count to be 7 where in a lot of cases it's
31. Even if low batch counts are common on another platform or you adjusted
the other counts to be higher values until they equal 30, it would be for
this drain that no order-3 pages were freed. It's not a permanent situation.

When or if it gets freed depends on the allocation request stream but the
same applies to the existing caches. If a high-order request arrives, it'll
be used. If all the requests are for the other orders, then eventually
the frees will hit the high watermark enough that the round-robin batch
freeing fill free the order-3 entry in the cache.

> 2. And, It seems that this logic penalties high order pages. One free
> to high order page means 1 << order pages free rather than just
> one page free. This logic do round-robin to choose the target page so
> amount of freed page will be different by the order. I think that it
> makes some sense because high order page are less important to cache
> in pcp than lower order but I'd like to know if it is intended or not.
> If intended, it deserves the comment.
>

It's intended but I'm not sure what else you want me to explain outside
the code itself in this case. The round-robin nature of the bulk drain
already doesn't attach any special important to the migrate type of the
list and there is no good reason to assume that high-order pages in the
cache when the high watermark is reached deserve special protection.

> 3. I guess that order-0 file/anon page alloc/free is dominent in many
> workloads. If this case happen, it invalidates effect of high order
> cache in pcp since cached high order pages would be also freed to the
> buddy when burst order-0 free happens.
>

A large burst of order-0 frees will free the high-order cache if it's not
being used but I don't see what your point is or why that is a problem.
It is pretty much guaranteed that there will be workloads that benefit
from protecting the high-order cache (SLUB-intensive alloc/free
intensive workloads) while others suffer (Fault-intensive map/unmap
workloads).

What's there at the moment behaves reasonably on a variety of workloads
across 8 machines.

> > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > struct page *page;
> > bool cold = ((gfp_flags & __GFP_COLD) != 0);
> >
> > - if (likely(order == 0)) {
> > + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > struct per_cpu_pages *pcp;
> > struct list_head *list;
> >
> > local_irq_save(flags);
> > do {
> > + unsigned int pindex;
> > +
> > + pindex = order_to_pindex(migratetype, order);
> > pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > - list = &pcp->lists[migratetype];
> > + list = &pcp->lists[pindex];
> > if (list_empty(list)) {
> > - pcp->count += rmqueue_bulk(zone, 0,
> > + int nr_pages = rmqueue_bulk(zone, order,
> > pcp->batch, list,
> > migratetype, cold);
>
> Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> * (1 << order) pages and pcp->count can easily overflow pcp->high
> * because list empty here doesn't mean that pcp->count is zero.
>

Potentially a refill can cause a drain on another list. However, I adjusted
the high watermark in pageset_set_batch to make it unlikely that a single
refill will cause a drain and added a comment about it. I say unlikely
because it's not guaranteed. A carefully created workload could potentially
bring all the order-0 and some of the high-order caches close to the
watermark and then trigger a drain due to a refill of order-3. The impact
is marginal and in itself does not warrent increasing the high watermark
to guarantee that no single refill can cause a drain on the next free.

--
Mel Gorman
SUSE Labs

2016-12-02 09:38:07

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted

On Friday, December 02, 2016 2:19 PM Vlastimil Babka wrote:
> On 12/02/2016 04:47 AM, Hillf Danton wrote:
> > On Friday, December 02, 2016 8:23 AM Mel Gorman wrote:
> >> Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> >> defer debugging checks of pages allocated from the PCP") will allow the
> >> per-cpu list counter to be out of sync with the per-cpu list contents
> >> if a struct page is corrupted. This patch keeps the accounting in sync.
> >>
> >> Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> >> Signed-off-by: Mel Gorman <[email protected]>
> >> cc: [email protected] [4.7+]
>
> Acked-by: Vlastimil Babka <[email protected]>
>
> >> ---
> >> mm/page_alloc.c | 5 +++--
> >> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 6de9440e3ae2..777ed59570df 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -2192,7 +2192,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >> unsigned long count, struct list_head *list,
> >> int migratetype, bool cold)
> >> {
> >> - int i;
> >> + int i, alloced = 0;
> >>
> >> spin_lock(&zone->lock);
> >> for (i = 0; i < count; ++i) {
> >> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >> else
> >> list_add_tail(&page->lru, list);
> >> list = &page->lru;
> >> + alloced++;
> >> if (is_migrate_cma(get_pcppage_migratetype(page)))
> >> __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> >> -(1 << order));
> >> }
> >> __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> >
> > Now i is a pure index, yes?
>
> No, even if a page fails the check_pcp_refill() check and is not
> "allocated", it is also no longer a free page, so it's correct to
> subtract it from NR_FREE_PAGES.
>
Yes, we can allocate free page next time.

thanks
Hillf


2016-12-02 09:49:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted

On Fri, Dec 02, 2016 at 09:12:17AM +0100, Michal Hocko wrote:
> On Fri 02-12-16 00:22:43, Mel Gorman wrote:
> > Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> > defer debugging checks of pages allocated from the PCP") will allow the
> > per-cpu list counter to be out of sync with the per-cpu list contents
> > if a struct page is corrupted. This patch keeps the accounting in sync.
> >
> > Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> > Signed-off-by: Mel Gorman <[email protected]>
> > cc: [email protected] [4.7+]
>
> I am trying to think about what would happen if we did go out of sync
> and cannot spot a problem. Vlastimil has mentioned something about
> free_pcppages_bulk looping for ever but I cannot see it happening right
> now.

free_pcppages_bulk can infinite loop if the page count is positive and
there are no pages. While I've only seen this during development, a
corrupted count loops here

do {
batch_free++;
if (++pindex == NR_PCP_LISTS)
pindex = 0;
list = &pcp->lists[pindex];
} while (list_empty(list));

It would only be seen in a situation where struct page corruption was
detected so it's rare.

--
Mel Gorman
SUSE Labs

2016-12-02 10:03:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted

On Fri 02-12-16 09:49:33, Mel Gorman wrote:
> On Fri, Dec 02, 2016 at 09:12:17AM +0100, Michal Hocko wrote:
> > On Fri 02-12-16 00:22:43, Mel Gorman wrote:
> > > Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
> > > defer debugging checks of pages allocated from the PCP") will allow the
> > > per-cpu list counter to be out of sync with the per-cpu list contents
> > > if a struct page is corrupted. This patch keeps the accounting in sync.
> > >
> > > Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > cc: [email protected] [4.7+]
> >
> > I am trying to think about what would happen if we did go out of sync
> > and cannot spot a problem. Vlastimil has mentioned something about
> > free_pcppages_bulk looping for ever but I cannot see it happening right
> > now.
>
> free_pcppages_bulk can infinite loop if the page count is positive and
> there are no pages. While I've only seen this during development, a
> corrupted count loops here
>
> do {
> batch_free++;
> if (++pindex == NR_PCP_LISTS)
> pindex = 0;
> list = &pcp->lists[pindex];
> } while (list_empty(list));
>
> It would only be seen in a situation where struct page corruption was
> detected so it's rare.

OK, I was apparently sleeping when responding. I focused on t he outer
loop and that should just converge. But it is true that this inner loop
can just runaway... Could you add that to the changelog please? This
definitely warrants stable backport.

Thanks!

--
Michal Hocko
SUSE Labs

2016-12-02 10:18:06

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted

On Fri 02-12-16 17:30:07, Hillf Danton wrote:
[...]
> > >> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> > >> else
> > >> list_add_tail(&page->lru, list);
> > >> list = &page->lru;
> > >> + alloced++;
> > >> if (is_migrate_cma(get_pcppage_migratetype(page)))
> > >> __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> > >> -(1 << order));
> > >> }
> > >> __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> > >
> > > Now i is a pure index, yes?
> >
> > No, even if a page fails the check_pcp_refill() check and is not
> > "allocated", it is also no longer a free page, so it's correct to
> > subtract it from NR_FREE_PAGES.
> >
> Yes, we can allocate free page next time.

No we cannot. The page is gone from the free list. We have effectively
leaked it.
--
Michal Hocko
SUSE Labs

2016-12-02 11:02:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm, page_alloc: Keep pcp count and list contents in sync if struct page is corrupted

On Fri, Dec 02, 2016 at 11:04:11AM +0100, Michal Hocko wrote:
> On Fri 02-12-16 17:30:07, Hillf Danton wrote:
> [...]
> > > >> @@ -2217,13 +2217,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> > > >> else
> > > >> list_add_tail(&page->lru, list);
> > > >> list = &page->lru;
> > > >> + alloced++;
> > > >> if (is_migrate_cma(get_pcppage_migratetype(page)))
> > > >> __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> > > >> -(1 << order));
> > > >> }
> > > >> __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> > > >
> > > > Now i is a pure index, yes?
> > >
> > > No, even if a page fails the check_pcp_refill() check and is not
> > > "allocated", it is also no longer a free page, so it's correct to
> > > subtract it from NR_FREE_PAGES.
> > >
> > Yes, we can allocate free page next time.
>
> No we cannot. The page is gone from the free list. We have effectively
> leaked it.

And deliberately so. It's in an unknown state, possibly due to memory
corruption or a use-after free bug. The machine can continue limping on
with warnings in the kernel log but the VM stops going near the page
itself as much as possible.

--
Mel Gorman
SUSE Labs

2016-12-05 03:02:55

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

On Fri, Dec 02, 2016 at 09:04:49AM +0000, Mel Gorman wrote:
> On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > > if (unlikely(isolated_pageblocks))
> > > mt = get_pageblock_migratetype(page);
> > >
> > > + nr_freed += (1 << order);
> > > + count -= (1 << order);
> > > if (bulkfree_pcp_prepare(page))
> > > continue;
> > >
> > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > - trace_mm_page_pcpu_drain(page, 0, mt);
> > > - } while (--count && --batch_free && !list_empty(list));
> > > + __free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > + trace_mm_page_pcpu_drain(page, order, mt);
> > > + } while (count > 0 && --batch_free && !list_empty(list));
> > > }
> > > spin_unlock(&zone->lock);
> > > + pcp->count -= nr_freed;
> > > }
> >
> > I guess that this patch would cause following problems.
> >
> > 1. If pcp->batch is too small, high order page will not be freed
> > easily and survive longer. Think about following situation.
> >
> > Batch count: 7
> > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > -> order 2...
> >
> > free count: 1 + 1 + 1 + 2 + 4 = 9
> > so order 3 would not be freed.
> >
>
> You're relying on the batch count to be 7 where in a lot of cases it's
> 31. Even if low batch counts are common on another platform or you adjusted
> the other counts to be higher values until they equal 30, it would be for
> this drain that no order-3 pages were freed. It's not a permanent situation.
>
> When or if it gets freed depends on the allocation request stream but the
> same applies to the existing caches. If a high-order request arrives, it'll
> be used. If all the requests are for the other orders, then eventually
> the frees will hit the high watermark enough that the round-robin batch
> freeing fill free the order-3 entry in the cache.

I know that it isn't a permanent situation and it depends on workload.
However, it is clearly an unfair situation. We don't have any good reason
to cache higher order freepage longer. Even, batch count 7 means
that it is a small system. In this kind of system, there is no reason
to keep high order freepage longer in the cache.

The other potential problem is that if we change
PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this 31 batch count also
doesn't guarantee that free_pcppages_bulk() will work fairly and we
will not notice it easily.

I think that it can be simply solved by maintaining a last pindex in
pcp. How about it?

>
> > 2. And, It seems that this logic penalties high order pages. One free
> > to high order page means 1 << order pages free rather than just
> > one page free. This logic do round-robin to choose the target page so
> > amount of freed page will be different by the order. I think that it
> > makes some sense because high order page are less important to cache
> > in pcp than lower order but I'd like to know if it is intended or not.
> > If intended, it deserves the comment.
> >
>
> It's intended but I'm not sure what else you want me to explain outside
> the code itself in this case. The round-robin nature of the bulk drain
> already doesn't attach any special important to the migrate type of the
> list and there is no good reason to assume that high-order pages in the
> cache when the high watermark is reached deserve special protection.

Non-trivial part is that round-robin approach penalties high-order
pages caching. We usually think that round-robin is fair, but, in this
case, it isn't. Some people can notice that amount of freepage in turn is
different but some may not. It is a different situation in the past that
amount of freepage in turn is same even if migratetype is different. I
think that it deserve some comment but I don't feel it strongly.

> > 3. I guess that order-0 file/anon page alloc/free is dominent in many
> > workloads. If this case happen, it invalidates effect of high order
> > cache in pcp since cached high order pages would be also freed to the
> > buddy when burst order-0 free happens.
> >
>
> A large burst of order-0 frees will free the high-order cache if it's not
> being used but I don't see what your point is or why that is a problem.
> It is pretty much guaranteed that there will be workloads that benefit
> from protecting the high-order cache (SLUB-intensive alloc/free
> intensive workloads) while others suffer (Fault-intensive map/unmap
> workloads).
>
> What's there at the moment behaves reasonably on a variety of workloads
> across 8 machines.

Yes, I see that this patch improves some workloads. What I like to say
is that I find some weakness and if it is fixed, we can get better
result. This patch implement unified pcp cache for migratetype and
high-order but if we separate them and manage number of cached items
separately, we would not have above problem. Could you teach me the
reason not to implement the separate cache for high order?

>
> > > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > > struct page *page;
> > > bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > >
> > > - if (likely(order == 0)) {
> > > + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > > struct per_cpu_pages *pcp;
> > > struct list_head *list;
> > >
> > > local_irq_save(flags);
> > > do {
> > > + unsigned int pindex;
> > > +
> > > + pindex = order_to_pindex(migratetype, order);
> > > pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > - list = &pcp->lists[migratetype];
> > > + list = &pcp->lists[pindex];
> > > if (list_empty(list)) {
> > > - pcp->count += rmqueue_bulk(zone, 0,
> > > + int nr_pages = rmqueue_bulk(zone, order,
> > > pcp->batch, list,
> > > migratetype, cold);
> >
> > Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> > * (1 << order) pages and pcp->count can easily overflow pcp->high
> > * because list empty here doesn't mean that pcp->count is zero.
> >
>
> Potentially a refill can cause a drain on another list. However, I adjusted
> the high watermark in pageset_set_batch to make it unlikely that a single
> refill will cause a drain and added a comment about it. I say unlikely
> because it's not guaranteed. A carefully created workload could potentially
> bring all the order-0 and some of the high-order caches close to the
> watermark and then trigger a drain due to a refill of order-3. The impact
> is marginal and in itself does not warrent increasing the high watermark
> to guarantee that no single refill can cause a drain on the next free.

Hmm... What makes me wonder is that alloc/free isn't symmetric.
Free in free_pcppages_bulk() are done until number of freed pages
becomes the batch. High order pages are counted as 1 << order. But, in
refill here, counting high order pages is single one rather than
1 << order. This asymmetric alloc/free is intended? Why do we cache same
number of high order page with the number of order-0 page in one
batch?

Thanks.

2016-12-05 03:06:46

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

On Fri, Dec 02, 2016 at 09:21:08AM +0100, Michal Hocko wrote:
> On Fri 02-12-16 15:03:46, Joonsoo Kim wrote:
> [...]
> > > o pcp accounting during free is now confined to free_pcppages_bulk as it's
> > > impossible for the caller to know exactly how many pages were freed.
> > > Due to the high-order caches, the number of pages drained for a request
> > > is no longer precise.
> > >
> > > o The high watermark for per-cpu pages is increased to reduce the probability
> > > that a single refill causes a drain on the next free.
> [...]
> > I guess that this patch would cause following problems.
> >
> > 1. If pcp->batch is too small, high order page will not be freed
> > easily and survive longer. Think about following situation.
> >
> > Batch count: 7
> > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > -> order 2...
> >
> > free count: 1 + 1 + 1 + 2 + 4 = 9
> > so order 3 would not be freed.
>
> I guess the second paragraph above in the changelog tries to clarify
> that...

It doesn't perfectly clarify my concern. This is a different problem.

>
> > 2. And, It seems that this logic penalties high order pages. One free
> > to high order page means 1 << order pages free rather than just
> > one page free. This logic do round-robin to choose the target page so
> > amount of freed page will be different by the order.
>
> Yes this is indeed possible. The first paragraph above mentions this
> problem.

Yes, it is mentioned simply but we cannot easily notice that the above
penalty for high order page is there.

Thanks.

2016-12-05 09:58:03

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

On Mon, Dec 05, 2016 at 12:06:19PM +0900, Joonsoo Kim wrote:
> On Fri, Dec 02, 2016 at 09:04:49AM +0000, Mel Gorman wrote:
> > On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > > > if (unlikely(isolated_pageblocks))
> > > > mt = get_pageblock_migratetype(page);
> > > >
> > > > + nr_freed += (1 << order);
> > > > + count -= (1 << order);
> > > > if (bulkfree_pcp_prepare(page))
> > > > continue;
> > > >
> > > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > > - trace_mm_page_pcpu_drain(page, 0, mt);
> > > > - } while (--count && --batch_free && !list_empty(list));
> > > > + __free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > > + trace_mm_page_pcpu_drain(page, order, mt);
> > > > + } while (count > 0 && --batch_free && !list_empty(list));
> > > > }
> > > > spin_unlock(&zone->lock);
> > > > + pcp->count -= nr_freed;
> > > > }
> > >
> > > I guess that this patch would cause following problems.
> > >
> > > 1. If pcp->batch is too small, high order page will not be freed
> > > easily and survive longer. Think about following situation.
> > >
> > > Batch count: 7
> > > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > > -> order 2...
> > >
> > > free count: 1 + 1 + 1 + 2 + 4 = 9
> > > so order 3 would not be freed.
> > >
> >
> > You're relying on the batch count to be 7 where in a lot of cases it's
> > 31. Even if low batch counts are common on another platform or you adjusted
> > the other counts to be higher values until they equal 30, it would be for
> > this drain that no order-3 pages were freed. It's not a permanent situation.
> >
> > When or if it gets freed depends on the allocation request stream but the
> > same applies to the existing caches. If a high-order request arrives, it'll
> > be used. If all the requests are for the other orders, then eventually
> > the frees will hit the high watermark enough that the round-robin batch
> > freeing fill free the order-3 entry in the cache.
>
> I know that it isn't a permanent situation and it depends on workload.
> However, it is clearly an unfair situation. We don't have any good reason
> to cache higher order freepage longer. Even, batch count 7 means
> that it is a small system. In this kind of system, there is no reason
> to keep high order freepage longer in the cache.
>

Without knowing the future allocation request stream, there is no reason
to favour one part of the per-cpu cache over another. To me, it's not
actually clear at all it's an unfair situation, particularly given that the
vanilla code is also unfair -- the vanilla code can artifically preserve
MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
The only deciding factor there was a fault-intensive workload would mask
overhead of the page allocator due to page zeroing cost which UNMOVABLE
allocations may or may not require. Even that is vague considering that
page-table allocations are zeroing even if many kernel allocations are not.

> The other potential problem is that if we change
> PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this 31 batch count also
> doesn't guarantee that free_pcppages_bulk() will work fairly and we
> will not notice it easily.
>

In the event the high-order cache is increased, then the high watermark
would also need to be adjusted to account for that just as this patch
does.

> I think that it can be simply solved by maintaining a last pindex in
> pcp. How about it?
>

That would rely on the previous allocation stream to drive the freeing
which is slightly related to the fact the per-cpu cache contents are
related to the previous request stream. It's still not guaranteed to be
related to the future request stream.

Adding a new pindex cache adds complexity to the free path without any
guarantee it benefits anything. The use of such a heuristic should be
driven by a workload demonstrating it's a problem. Granted, half of the
cost of a free operations is due to irq enable/disable but there is no
reason to make it unnecessarily expensive.

> > > 3. I guess that order-0 file/anon page alloc/free is dominent in many
> > > workloads. If this case happen, it invalidates effect of high order
> > > cache in pcp since cached high order pages would be also freed to the
> > > buddy when burst order-0 free happens.
> > >
> >
> > A large burst of order-0 frees will free the high-order cache if it's not
> > being used but I don't see what your point is or why that is a problem.
> > It is pretty much guaranteed that there will be workloads that benefit
> > from protecting the high-order cache (SLUB-intensive alloc/free
> > intensive workloads) while others suffer (Fault-intensive map/unmap
> > workloads).
> >
> > What's there at the moment behaves reasonably on a variety of workloads
> > across 8 machines.
>
> Yes, I see that this patch improves some workloads. What I like to say
> is that I find some weakness and if it is fixed, we can get better
> result. This patch implement unified pcp cache for migratetype and
> high-order but if we separate them and manage number of cached items
> separately, we would not have above problem. Could you teach me the
> reason not to implement the separate cache for high order?
>

It's additional complexity and a separate cache that would require separate
batch counts and high watermarks for order-0 and high-order for a problem
that is not demonstrated as being necessary by a workload on any platform.

> >
> > > > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > > > struct page *page;
> > > > bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > > >
> > > > - if (likely(order == 0)) {
> > > > + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > > > struct per_cpu_pages *pcp;
> > > > struct list_head *list;
> > > >
> > > > local_irq_save(flags);
> > > > do {
> > > > + unsigned int pindex;
> > > > +
> > > > + pindex = order_to_pindex(migratetype, order);
> > > > pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > > - list = &pcp->lists[migratetype];
> > > > + list = &pcp->lists[pindex];
> > > > if (list_empty(list)) {
> > > > - pcp->count += rmqueue_bulk(zone, 0,
> > > > + int nr_pages = rmqueue_bulk(zone, order,
> > > > pcp->batch, list,
> > > > migratetype, cold);
> > >
> > > Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> > > * (1 << order) pages and pcp->count can easily overflow pcp->high
> > > * because list empty here doesn't mean that pcp->count is zero.
> > >
> >
> > Potentially a refill can cause a drain on another list. However, I adjusted
> > the high watermark in pageset_set_batch to make it unlikely that a single
> > refill will cause a drain and added a comment about it. I say unlikely
> > because it's not guaranteed. A carefully created workload could potentially
> > bring all the order-0 and some of the high-order caches close to the
> > watermark and then trigger a drain due to a refill of order-3. The impact
> > is marginal and in itself does not warrent increasing the high watermark
> > to guarantee that no single refill can cause a drain on the next free.
>
> Hmm... What makes me wonder is that alloc/free isn't symmetric.
> Free in free_pcppages_bulk() are done until number of freed pages
> becomes the batch. High order pages are counted as 1 << order.
> But, in refill here, counting high order pages is single one rather than
> 1 << order. This asymmetric alloc/free is intended? Why do we cache same
> number of high order page with the number of order-0 page in one
> batch?
>

Because on the alloc side, we're batching the number of operations done
under the lock and on the free side, we're concerned with how much memory
is pinned by the per-cpu cache. There are different options that could
be taken such as accounting for the number of list elements instead of
order-0 pages but that will make the size of the per-cpu cache variable
without necessarily being beneficial. Pretty much anything related to the
per-cpu cache is an optimistic heuristic on what is beneficial to cache
and when with the view to reducing operations taken under the zone->lock.

--
Mel Gorman
SUSE Labs

2016-12-06 02:40:14

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

On Mon, Dec 05, 2016 at 09:57:39AM +0000, Mel Gorman wrote:
> On Mon, Dec 05, 2016 at 12:06:19PM +0900, Joonsoo Kim wrote:
> > On Fri, Dec 02, 2016 at 09:04:49AM +0000, Mel Gorman wrote:
> > > On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > > > > if (unlikely(isolated_pageblocks))
> > > > > mt = get_pageblock_migratetype(page);
> > > > >
> > > > > + nr_freed += (1 << order);
> > > > > + count -= (1 << order);
> > > > > if (bulkfree_pcp_prepare(page))
> > > > > continue;
> > > > >
> > > > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > > > - trace_mm_page_pcpu_drain(page, 0, mt);
> > > > > - } while (--count && --batch_free && !list_empty(list));
> > > > > + __free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > > > + trace_mm_page_pcpu_drain(page, order, mt);
> > > > > + } while (count > 0 && --batch_free && !list_empty(list));
> > > > > }
> > > > > spin_unlock(&zone->lock);
> > > > > + pcp->count -= nr_freed;
> > > > > }
> > > >
> > > > I guess that this patch would cause following problems.
> > > >
> > > > 1. If pcp->batch is too small, high order page will not be freed
> > > > easily and survive longer. Think about following situation.
> > > >
> > > > Batch count: 7
> > > > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > > > -> order 2...
> > > >
> > > > free count: 1 + 1 + 1 + 2 + 4 = 9
> > > > so order 3 would not be freed.
> > > >
> > >
> > > You're relying on the batch count to be 7 where in a lot of cases it's
> > > 31. Even if low batch counts are common on another platform or you adjusted
> > > the other counts to be higher values until they equal 30, it would be for
> > > this drain that no order-3 pages were freed. It's not a permanent situation.
> > >
> > > When or if it gets freed depends on the allocation request stream but the
> > > same applies to the existing caches. If a high-order request arrives, it'll
> > > be used. If all the requests are for the other orders, then eventually
> > > the frees will hit the high watermark enough that the round-robin batch
> > > freeing fill free the order-3 entry in the cache.
> >
> > I know that it isn't a permanent situation and it depends on workload.
> > However, it is clearly an unfair situation. We don't have any good reason
> > to cache higher order freepage longer. Even, batch count 7 means
> > that it is a small system. In this kind of system, there is no reason
> > to keep high order freepage longer in the cache.
> >
>
> Without knowing the future allocation request stream, there is no reason
> to favour one part of the per-cpu cache over another. To me, it's not

What I suggest is that. Don't favour one part of the per-cpu cache over
another.

> actually clear at all it's an unfair situation, particularly given that the
> vanilla code is also unfair -- the vanilla code can artifically preserve
> MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
> The only deciding factor there was a fault-intensive workload would mask
> overhead of the page allocator due to page zeroing cost which UNMOVABLE
> allocations may or may not require. Even that is vague considering that
> page-table allocations are zeroing even if many kernel allocations are not.

"Vanilla works like that" doesn't seem to be reasonable to justify
this change. Vanilla code works with three lists and it now become
six lists and each list can have different size of page. We need to
think that previous approach will also work fine with current one. I
think that there is a problem although it's not permanent and would be
minor. However, it's better to fix it when it is found.

> > The other potential problem is that if we change
> > PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this 31 batch count also
> > doesn't guarantee that free_pcppages_bulk() will work fairly and we
> > will not notice it easily.
> >
>
> In the event the high-order cache is increased, then the high watermark
> would also need to be adjusted to account for that just as this patch
> does.

pcp->high will be adjusted automatically when high-order cache is
increased by your change. What we miss is pcp->batch and there is no
information about that the number of high-order cache and pcp->batch
has some association.

> > I think that it can be simply solved by maintaining a last pindex in
> > pcp. How about it?
> >
>
> That would rely on the previous allocation stream to drive the freeing
> which is slightly related to the fact the per-cpu cache contents are
> related to the previous request stream. It's still not guaranteed to be
> related to the future request stream.
>
> Adding a new pindex cache adds complexity to the free path without any
> guarantee it benefits anything. The use of such a heuristic should be

It provides the benefit that prevents that high order page survives
longer in the cacahe in any pcp->batch, in any PAGE_ALLOC_COSTLY_ORDER
setup.

> driven by a workload demonstrating it's a problem. Granted, half of the
> cost of a free operations is due to irq enable/disable but there is no
> reason to make it unnecessarily expensive.

I think that it's not that complex. What we need all is that just one
variable in pcp and read/write it once in free_pcppages_bulk().

free_pcppages_bulk()
{
int pindex = pcp->last_pindex

...

pcp->last_pindex = pindex
return
}

>
> > > > 3. I guess that order-0 file/anon page alloc/free is dominent in many
> > > > workloads. If this case happen, it invalidates effect of high order
> > > > cache in pcp since cached high order pages would be also freed to the
> > > > buddy when burst order-0 free happens.
> > > >
> > >
> > > A large burst of order-0 frees will free the high-order cache if it's not
> > > being used but I don't see what your point is or why that is a problem.
> > > It is pretty much guaranteed that there will be workloads that benefit
> > > from protecting the high-order cache (SLUB-intensive alloc/free
> > > intensive workloads) while others suffer (Fault-intensive map/unmap
> > > workloads).
> > >
> > > What's there at the moment behaves reasonably on a variety of workloads
> > > across 8 machines.
> >
> > Yes, I see that this patch improves some workloads. What I like to say
> > is that I find some weakness and if it is fixed, we can get better
> > result. This patch implement unified pcp cache for migratetype and
> > high-order but if we separate them and manage number of cached items
> > separately, we would not have above problem. Could you teach me the
> > reason not to implement the separate cache for high order?
> >
>
> It's additional complexity and a separate cache that would require separate
> batch counts and high watermarks for order-0 and high-order for a problem
> that is not demonstrated as being necessary by a workload on any platform.

It is a potential problem of this implementation so no one can
demonstrated that separating is necessary without testing this patch widely.

The other problem that I can guess is that caching high order page in
pcp on highly fragmentation system will cause more compaction and negative
performance impact. It would be way expensive than merit of reducing
lock contention so we need to setup pcp's high/batch for high order
page carefully. Perhaps, separating them would be helpful here.

Anyway, nothing is proved so I don't insist more.

> > >
> > > > > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > > > > struct page *page;
> > > > > bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > > > >
> > > > > - if (likely(order == 0)) {
> > > > > + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > > > > struct per_cpu_pages *pcp;
> > > > > struct list_head *list;
> > > > >
> > > > > local_irq_save(flags);
> > > > > do {
> > > > > + unsigned int pindex;
> > > > > +
> > > > > + pindex = order_to_pindex(migratetype, order);
> > > > > pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > > > - list = &pcp->lists[migratetype];
> > > > > + list = &pcp->lists[pindex];
> > > > > if (list_empty(list)) {
> > > > > - pcp->count += rmqueue_bulk(zone, 0,
> > > > > + int nr_pages = rmqueue_bulk(zone, order,
> > > > > pcp->batch, list,
> > > > > migratetype, cold);
> > > >
> > > > Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> > > > * (1 << order) pages and pcp->count can easily overflow pcp->high
> > > > * because list empty here doesn't mean that pcp->count is zero.
> > > >
> > >
> > > Potentially a refill can cause a drain on another list. However, I adjusted
> > > the high watermark in pageset_set_batch to make it unlikely that a single
> > > refill will cause a drain and added a comment about it. I say unlikely
> > > because it's not guaranteed. A carefully created workload could potentially
> > > bring all the order-0 and some of the high-order caches close to the
> > > watermark and then trigger a drain due to a refill of order-3. The impact
> > > is marginal and in itself does not warrent increasing the high watermark
> > > to guarantee that no single refill can cause a drain on the next free.
> >
> > Hmm... What makes me wonder is that alloc/free isn't symmetric.
> > Free in free_pcppages_bulk() are done until number of freed pages
> > becomes the batch. High order pages are counted as 1 << order.
> > But, in refill here, counting high order pages is single one rather than
> > 1 << order. This asymmetric alloc/free is intended? Why do we cache same
> > number of high order page with the number of order-0 page in one
> > batch?
> >
>
> Because on the alloc side, we're batching the number of operations done
> under the lock and on the free side, we're concerned with how much memory
> is pinned by the per-cpu cache. There are different options that could
> be taken such as accounting for the number of list elements instead of
> order-0 pages but that will make the size of the per-cpu cache variable
> without necessarily being beneficial. Pretty much anything related to the
> per-cpu cache is an optimistic heuristic on what is beneficial to cache
> and when with the view to reducing operations taken under the zone->lock.

As far as I know, maximum memory pinned by pcp is controlled by
pcp->high. pcp->batch is all for reducing lock contention even when it
is used in free side. If meaning of pcp->batch is different in
alloc/free side, it's better to use another variable. It is really
complicate to think/implement different concepts by using a single
variable.

And, if alloc side allows to cache pcp->batch elements of each type,
we cannot control how much memory is pinned by pcp. I know that you
add some comment about it but breaking design assumption that
pcp->high is maximum memory pinned by pcp looks not good. Is there
any way to fix it?

Thanks.

2016-12-06 13:53:58

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

On Tue, Dec 06, 2016 at 11:43:45AM +0900, Joonsoo Kim wrote:
> > actually clear at all it's an unfair situation, particularly given that the
> > vanilla code is also unfair -- the vanilla code can artifically preserve
> > MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
> > The only deciding factor there was a fault-intensive workload would mask
> > overhead of the page allocator due to page zeroing cost which UNMOVABLE
> > allocations may or may not require. Even that is vague considering that
> > page-table allocations are zeroing even if many kernel allocations are not.
>
> "Vanilla works like that" doesn't seem to be reasonable to justify
> this change. Vanilla code works with three lists and it now become
> six lists and each list can have different size of page. We need to
> think that previous approach will also work fine with current one. I
> think that there is a problem although it's not permanent and would be
> minor. However, it's better to fix it when it is found.
>

This is going in circles. I prototyped the modification which increases
the per-cpu structure slightly and will evaluate. It takes about a day
to run through the full set of tests. If it causes no harm, I'll release
another version.

--
Mel Gorman
SUSE Labs