2016-11-27 13:20:00

by Mel Gorman

Subject: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

Changelog since v2
o Correct initialisation to avoid -Woverflow warning

SLUB has been the default small kernel object allocator for quite some time
but it is not universally used due to performance concerns and a reliance
on high-order pages. The high-order concern has two major components --
high-order pages are not always available and high-order page allocations
potentially contend on the zone->lock. This patch addresses some of the
concerns about zone lock contention by extending the per-cpu page allocator
to cache high-order pages. The patch makes the following modifications:

o New per-cpu lists are added to cache the high-order pages. This increases
the cache footprint of the per-cpu allocator and overall usage but, for
some workloads, this will be offset by reduced contention on zone->lock.
The first MIGRATE_PCPTYPES entries in the array are per-migratetype
order-0 lists. The remaining entries are high-order caches up to and
including PAGE_ALLOC_COSTLY_ORDER (see the illustrative sketch after
this list).

o pcp accounting during free is now confined to free_pcppages_bulk as it's
impossible for the caller to know exactly how many pages were freed.
Due to the high-order caches, the number of pages drained for a request
is no longer precise.

o The high watermark for per-cpu pages is increased to reduce the probability
that a single refill causes a drain on the next free.
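
As an illustration of the list layout (a sketch only; it assumes
MIGRATE_PCPTYPES == 3 and PAGE_ALLOC_COSTLY_ORDER == 3, giving
NR_PCP_LISTS == 6):

/*
 *   pindex 0..2 -> order-0 pages, one list per migratetype
 *   pindex 3    -> order-1 pages (any migratetype)
 *   pindex 4    -> order-2 pages (any migratetype)
 *   pindex 5    -> order-3 pages (any migratetype)
 *
 * Using the helpers added by the patch:
 *   order_to_pindex(MIGRATE_MOVABLE, 0)   == 1
 *   order_to_pindex(MIGRATE_UNMOVABLE, 2) == 4
 *   pindex_to_order(4)                    == 2
 */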

The benefit depends on both the workload and the machine, as ultimately the
determining factor is whether cache line bouncing on, or contention for,
zone->lock is a problem. The patch was tested on a variety of workloads and
machines, some of which are reported here.

This is the result from netperf running UDP_STREAM on localhost. It was
selected on the basis that it is slab-intensive and has been the subject
of previous SLAB vs SLUB comparisons with the caveat that this is not
testing between two physical hosts.

2-socket modern machine
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v3
Hmean send-64 178.38 ( 0.00%) 256.74 ( 43.93%)
Hmean send-128 351.49 ( 0.00%) 507.52 ( 44.39%)
Hmean send-256 671.23 ( 0.00%) 1004.19 ( 49.60%)
Hmean send-1024 2663.60 ( 0.00%) 3910.42 ( 46.81%)
Hmean send-2048 5126.53 ( 0.00%) 7562.13 ( 47.51%)
Hmean send-3312 7949.99 ( 0.00%) 11565.98 ( 45.48%)
Hmean send-4096 9433.56 ( 0.00%) 12929.67 ( 37.06%)
Hmean send-8192 15940.64 ( 0.00%) 21587.63 ( 35.43%)
Hmean send-16384 26699.54 ( 0.00%) 32013.79 ( 19.90%)
Hmean recv-64 178.38 ( 0.00%) 256.72 ( 43.92%)
Hmean recv-128 351.49 ( 0.00%) 507.47 ( 44.38%)
Hmean recv-256 671.20 ( 0.00%) 1003.95 ( 49.57%)
Hmean recv-1024 2663.45 ( 0.00%) 3909.70 ( 46.79%)
Hmean recv-2048 5126.26 ( 0.00%) 7560.67 ( 47.49%)
Hmean recv-3312 7949.50 ( 0.00%) 11564.63 ( 45.48%)
Hmean recv-4096 9433.04 ( 0.00%) 12927.48 ( 37.04%)
Hmean recv-8192 15939.64 ( 0.00%) 21584.59 ( 35.41%)
Hmean recv-16384 26698.44 ( 0.00%) 32009.77 ( 19.89%)

1-socket 6 year old machine
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v3
Hmean send-64 87.47 ( 0.00%) 127.14 ( 45.36%)
Hmean send-128 174.36 ( 0.00%) 256.42 ( 47.06%)
Hmean send-256 347.52 ( 0.00%) 509.41 ( 46.59%)
Hmean send-1024 1363.03 ( 0.00%) 1991.54 ( 46.11%)
Hmean send-2048 2632.68 ( 0.00%) 3759.51 ( 42.80%)
Hmean send-3312 4123.19 ( 0.00%) 5873.28 ( 42.45%)
Hmean send-4096 5056.48 ( 0.00%) 7072.81 ( 39.88%)
Hmean send-8192 8784.22 ( 0.00%) 12143.92 ( 38.25%)
Hmean send-16384 15081.60 ( 0.00%) 19812.71 ( 31.37%)
Hmean recv-64 86.19 ( 0.00%) 126.59 ( 46.87%)
Hmean recv-128 173.93 ( 0.00%) 255.21 ( 46.73%)
Hmean recv-256 346.19 ( 0.00%) 506.72 ( 46.37%)
Hmean recv-1024 1358.28 ( 0.00%) 1980.03 ( 45.77%)
Hmean recv-2048 2623.45 ( 0.00%) 3729.35 ( 42.15%)
Hmean recv-3312 4108.63 ( 0.00%) 5831.47 ( 41.93%)
Hmean recv-4096 5037.25 ( 0.00%) 7021.59 ( 39.39%)
Hmean recv-8192 8762.32 ( 0.00%) 12072.44 ( 37.78%)
Hmean recv-16384 15042.36 ( 0.00%) 19690.14 ( 30.90%)

This is somewhat dramatic but it's also not universal. For example, on an
older HP machine using pcc-cpufreq there was almost no difference, but
pcc-cpufreq is also a known performance hazard.

These are quite different results but they illustrate that the benefit of
the patch depends on the CPU. The results are similar for TCP_STREAM on
the two-socket machine.

The observations on sockperf are different.

2-socket modern machine
sockperf-tcp-throughput
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v3
Hmean 14 93.90 ( 0.00%) 92.79 ( -1.18%)
Hmean 100 1211.02 ( 0.00%) 1286.66 ( 6.25%)
Hmean 300 3081.59 ( 0.00%) 3347.84 ( 8.64%)
Hmean 500 4614.19 ( 0.00%) 4953.23 ( 7.35%)
Hmean 850 6521.74 ( 0.00%) 6951.72 ( 6.59%)
Stddev 14 0.89 ( 0.00%) 3.24 (-264.75%)
Stddev 100 5.95 ( 0.00%) 8.88 (-49.27%)
Stddev 300 11.16 ( 0.00%) 28.13 (-151.98%)
Stddev 500 36.32 ( 0.00%) 42.07 (-15.84%)
Stddev 850 29.61 ( 0.00%) 66.73 (-125.36%)

sockperf-udp-throughput
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v3
Hmean 14 16.82 ( 0.00%) 25.23 ( 50.03%)
Hmean 100 119.91 ( 0.00%) 180.63 ( 50.65%)
Hmean 300 358.11 ( 0.00%) 539.29 ( 50.59%)
Hmean 500 595.16 ( 0.00%) 892.15 ( 49.90%)
Hmean 850 989.44 ( 0.00%) 1496.01 ( 51.20%)
Stddev 14 0.05 ( 0.00%) 0.10 (-116.02%)
Stddev 100 0.53 ( 0.00%) 1.12 (-111.23%)
Stddev 300 1.43 ( 0.00%) 1.58 (-10.21%)
Stddev 500 3.93 ( 0.00%) 5.14 (-30.95%)
Stddev 850 4.02 ( 0.00%) 6.46 (-60.64%)

Note that the improvements for TCP are nowhere near as dramatic as with
netperf; there is a slight loss for small packets and the results are much
more variable. While not presented here, it is known that when running
sockperf "under load", packet latency is generally lower, but not
universally so. UDP, on the other hand, improves in throughput but is
again much more variable.

This highlights that the patch is not necessarily a universal win and is
going to depend heavily on both the workload and the CPU used.

hackbench was also tested with both sockets and pipes, and with both
processes and threads; the results are interesting in terms of how
variability is impacted.

1-socket machine
hackbench-process-pipes
4.9.0-rc5 4.9.0-rc5
vanilla highmark-v1r12
Amean 1 12.9637 ( 0.00%) 13.1807 ( -1.67%)
Amean 3 13.4770 ( 0.00%) 13.6803 ( -1.51%)
Amean 5 18.5333 ( 0.00%) 18.7383 ( -1.11%)
Amean 7 24.5690 ( 0.00%) 23.0550 ( 6.16%)
Amean 12 39.7990 ( 0.00%) 36.7207 ( 7.73%)
Amean 16 56.0520 ( 0.00%) 48.2890 ( 13.85%)
Stddev 1 0.3847 ( 0.00%) 0.5853 (-52.15%)
Stddev 3 0.2652 ( 0.00%) 0.0295 ( 88.89%)
Stddev 5 0.5589 ( 0.00%) 0.2466 ( 55.87%)
Stddev 7 0.5310 ( 0.00%) 0.6680 (-25.79%)
Stddev 12 1.0780 ( 0.00%) 0.3230 ( 70.04%)
Stddev 16 2.1138 ( 0.00%) 0.6835 ( 67.66%)

hackbench-process-sockets
Amean 1 4.8873 ( 0.00%) 4.7180 ( 3.46%)
Amean 3 14.1157 ( 0.00%) 14.3643 ( -1.76%)
Amean 5 22.5537 ( 0.00%) 23.1380 ( -2.59%)
Amean 7 30.3743 ( 0.00%) 31.1520 ( -2.56%)
Amean 12 49.1773 ( 0.00%) 50.3060 ( -2.30%)
Amean 16 64.0873 ( 0.00%) 66.2633 ( -3.40%)
Stddev 1 0.2360 ( 0.00%) 0.2201 ( 6.74%)
Stddev 3 0.0539 ( 0.00%) 0.0780 (-44.72%)
Stddev 5 0.1463 ( 0.00%) 0.1579 ( -7.90%)
Stddev 7 0.1260 ( 0.00%) 0.3091 (-145.31%)
Stddev 12 0.2169 ( 0.00%) 0.4822 (-122.36%)
Stddev 16 0.0529 ( 0.00%) 0.4513 (-753.20%)

It's not a universal win for pipes but the differences are within the
noise. What is interesting is that variability shows both gains and losses,
in stark contrast to the sockperf results. Sockets, on the other hand,
generally show small losses, albeit within the noise, along with more
variability. Once again, the results depend on the workload and the CPU.

fsmark was tested with zero-sized files to continually allocate slab objects
but didn't show any differences. This can be explained by the fact that the
workload only allocates and does not have a mix of allocations and frees
that would benefit from the caching. It was tested to ensure no major harm
was done.

While it is recognised that this is a mixed bag of results, the patch
helps a lot more workloads than it hurts and intuitively, avoiding the
zone->lock in some cases is a good thing.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 20 +++++++++-
mm/page_alloc.c | 105 +++++++++++++++++++++++++++++--------------------
2 files changed, 82 insertions(+), 43 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3a2fed..54032ab2f4f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -255,6 +255,24 @@ enum zone_watermarks {
NR_WMARK
};

+/*
+ * One per migratetype for order-0 pages and one per high-order up to
+ * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
+ * allocations to contaminate reclaimable pageblocks if high-order
+ * pages are heavily used.
+ */
+#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)
+
+static inline unsigned int pindex_to_order(unsigned int pindex)
+{
+ return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
+}
+
+static inline unsigned int order_to_pindex(int migratetype, unsigned int order)
+{
+ return (order == 0) ? migratetype : MIGRATE_PCPTYPES + order - 1;
+}
+
#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
@@ -265,7 +283,7 @@ struct per_cpu_pages {
int batch; /* chunk size for buddy add/remove */

/* Lists of pages, one per migrate type stored on the pcp-lists */
- struct list_head lists[MIGRATE_PCPTYPES];
+ struct list_head lists[NR_PCP_LISTS];
};

struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440e3ae2..91dc68c2a717 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1050,9 +1050,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
}

#ifdef CONFIG_DEBUG_VM
-static inline bool free_pcp_prepare(struct page *page)
+static inline bool free_pcp_prepare(struct page *page, unsigned int order)
{
- return free_pages_prepare(page, 0, true);
+ return free_pages_prepare(page, order, true);
}

static inline bool bulkfree_pcp_prepare(struct page *page)
@@ -1060,9 +1060,9 @@ static inline bool bulkfree_pcp_prepare(struct page *page)
return false;
}
#else
-static bool free_pcp_prepare(struct page *page)
+static bool free_pcp_prepare(struct page *page, unsigned int order)
{
- return free_pages_prepare(page, 0, false);
+ return free_pages_prepare(page, order, false);
}

static bool bulkfree_pcp_prepare(struct page *page)
@@ -1085,8 +1085,9 @@ static bool bulkfree_pcp_prepare(struct page *page)
static void free_pcppages_bulk(struct zone *zone, int count,
struct per_cpu_pages *pcp)
{
- int migratetype = 0;
- int batch_free = 0;
+ unsigned int pindex = UINT_MAX; /* Reclaim will start at 0 */
+ unsigned int batch_free = 0;
+ unsigned int nr_freed = 0;
unsigned long nr_scanned;
bool isolated_pageblocks;

@@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (nr_scanned)
__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);

- while (count) {
+ while (count > 0) {
struct page *page;
struct list_head *list;
+ unsigned int order;

/*
* Remove pages from lists in a round-robin fashion. A
* batch_free count is maintained that is incremented when an
- * empty list is encountered. This is so more pages are freed
- * off fuller lists instead of spinning excessively around empty
- * lists
+ * empty list is encountered. This is not exact due to
+ * high-order but precision is not required.
*/
do {
batch_free++;
- if (++migratetype == MIGRATE_PCPTYPES)
- migratetype = 0;
- list = &pcp->lists[migratetype];
+ if (++pindex == NR_PCP_LISTS)
+ pindex = 0;
+ list = &pcp->lists[pindex];
} while (list_empty(list));

/* This is the only non-empty list. Free them all. */
- if (batch_free == MIGRATE_PCPTYPES)
+ if (batch_free == NR_PCP_LISTS)
batch_free = count;

+ order = pindex_to_order(pindex);
do {
int mt; /* migratetype of the to-be-freed page */

@@ -1135,11 +1137,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (bulkfree_pcp_prepare(page))
continue;

- __free_one_page(page, page_to_pfn(page), zone, 0, mt);
- trace_mm_page_pcpu_drain(page, 0, mt);
- } while (--count && --batch_free && !list_empty(list));
+ __free_one_page(page, page_to_pfn(page), zone, order, mt);
+ trace_mm_page_pcpu_drain(page, order, mt);
+ nr_freed += (1 << order);
+ count -= (1 << order);
+ } while (count > 0 && --batch_free && !list_empty(list));
}
spin_unlock(&zone->lock);
+ pcp->count -= nr_freed;
}

static void free_one_page(struct zone *zone,
@@ -2243,10 +2248,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
local_irq_save(flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
- if (to_drain > 0) {
+ if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, pcp);
- pcp->count -= to_drain;
- }
local_irq_restore(flags);
}
#endif
@@ -2268,10 +2271,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
pset = per_cpu_ptr(zone->pageset, cpu);

pcp = &pset->pcp;
- if (pcp->count) {
+ if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);
- pcp->count = 0;
- }
local_irq_restore(flags);
}

@@ -2403,18 +2404,18 @@ void mark_free_pages(struct zone *zone)
#endif /* CONFIG_PM */

/*
- * Free a 0-order page
+ * Free a pcp page
* cold == true ? free a cold page : free a hot page
*/
-void free_hot_cold_page(struct page *page, bool cold)
+static void __free_hot_cold_page(struct page *page, bool cold, unsigned int order)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
unsigned long pfn = page_to_pfn(page);
- int migratetype;
+ int migratetype, pindex;

- if (!free_pcp_prepare(page))
+ if (!free_pcp_prepare(page, order))
return;

migratetype = get_pfnblock_migratetype(page, pfn);
@@ -2431,28 +2432,33 @@ void free_hot_cold_page(struct page *page, bool cold)
*/
if (migratetype >= MIGRATE_PCPTYPES) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(zone, page, pfn, 0, migratetype);
+ free_one_page(zone, page, pfn, order, migratetype);
goto out;
}
migratetype = MIGRATE_MOVABLE;
}

+ pindex = order_to_pindex(migratetype, order);
pcp = &this_cpu_ptr(zone->pageset)->pcp;
if (!cold)
- list_add(&page->lru, &pcp->lists[migratetype]);
+ list_add(&page->lru, &pcp->lists[pindex]);
else
- list_add_tail(&page->lru, &pcp->lists[migratetype]);
- pcp->count++;
+ list_add_tail(&page->lru, &pcp->lists[pindex]);
+ pcp->count += 1 << order;
if (pcp->count >= pcp->high) {
unsigned long batch = READ_ONCE(pcp->batch);
free_pcppages_bulk(zone, batch, pcp);
- pcp->count -= batch;
}

out:
local_irq_restore(flags);
}

+void free_hot_cold_page(struct page *page, bool cold)
+{
+ __free_hot_cold_page(page, cold, 0);
+}
+
/*
* Free a list of 0-order pages
*/
@@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
struct page *page;
bool cold = ((gfp_flags & __GFP_COLD) != 0);

- if (likely(order == 0)) {
+ if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
struct per_cpu_pages *pcp;
struct list_head *list;

local_irq_save(flags);
do {
+ unsigned int pindex;
+
+ pindex = order_to_pindex(migratetype, order);
pcp = &this_cpu_ptr(zone->pageset)->pcp;
- list = &pcp->lists[migratetype];
+ list = &pcp->lists[pindex];
if (list_empty(list)) {
- pcp->count += rmqueue_bulk(zone, 0,
+ int nr_pages = rmqueue_bulk(zone, order,
pcp->batch, list,
migratetype, cold);
+ pcp->count += (nr_pages << order);
if (unlikely(list_empty(list)))
goto failed;
}
@@ -2610,7 +2620,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
page = list_first_entry(list, struct page, lru);

list_del(&page->lru);
- pcp->count--;
+ pcp->count -= (1 << order);

} while (check_new_pcp(page));
} else {
@@ -3837,8 +3847,8 @@ EXPORT_SYMBOL(get_zeroed_page);
void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
- if (order == 0)
- free_hot_cold_page(page, false);
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ __free_hot_cold_page(page, false, order);
else
__free_pages_ok(page, order);
}
@@ -5160,20 +5170,31 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
/* a companion to pageset_set_high() */
static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
{
- pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
+ unsigned long high;
+
+ /*
+ * per-cpu refills occur when a per-cpu list for a migratetype
+ * or a high-order is depleted even if pages are free overall.
+ * Tune the high watermark such that it's unlikely, but not
+ * impossible, that a single refill event will trigger a
+ * shrink on the next free to the per-cpu list.
+ */
+ high = batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER);
+
+ pageset_update(&p->pcp, high, max(1UL, 1 * batch));
}

static void pageset_init(struct per_cpu_pageset *p)
{
struct per_cpu_pages *pcp;
- int migratetype;
+ unsigned int pindex;

memset(p, 0, sizeof(*p));

pcp = &p->pcp;
pcp->count = 0;
- for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
- INIT_LIST_HEAD(&pcp->lists[migratetype]);
+ for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
+ INIT_LIST_HEAD(&pcp->lists[pindex]);
}

static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
--
2.10.2


2016-11-28 11:00:53

by Vlastimil Babka

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On 11/27/2016 02:19 PM, Mel Gorman wrote:
>
> 2-socket modern machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v3
> Hmean send-64 178.38 ( 0.00%) 256.74 ( 43.93%)
> Hmean send-128 351.49 ( 0.00%) 507.52 ( 44.39%)
> Hmean send-256 671.23 ( 0.00%) 1004.19 ( 49.60%)
> Hmean send-1024 2663.60 ( 0.00%) 3910.42 ( 46.81%)
> Hmean send-2048 5126.53 ( 0.00%) 7562.13 ( 47.51%)
> Hmean send-3312 7949.99 ( 0.00%) 11565.98 ( 45.48%)
> Hmean send-4096 9433.56 ( 0.00%) 12929.67 ( 37.06%)
> Hmean send-8192 15940.64 ( 0.00%) 21587.63 ( 35.43%)
> Hmean send-16384 26699.54 ( 0.00%) 32013.79 ( 19.90%)
> Hmean recv-64 178.38 ( 0.00%) 256.72 ( 43.92%)
> Hmean recv-128 351.49 ( 0.00%) 507.47 ( 44.38%)
> Hmean recv-256 671.20 ( 0.00%) 1003.95 ( 49.57%)
> Hmean recv-1024 2663.45 ( 0.00%) 3909.70 ( 46.79%)
> Hmean recv-2048 5126.26 ( 0.00%) 7560.67 ( 47.49%)
> Hmean recv-3312 7949.50 ( 0.00%) 11564.63 ( 45.48%)
> Hmean recv-4096 9433.04 ( 0.00%) 12927.48 ( 37.04%)
> Hmean recv-8192 15939.64 ( 0.00%) 21584.59 ( 35.41%)
> Hmean recv-16384 26698.44 ( 0.00%) 32009.77 ( 19.89%)
>
> 1-socket 6 year old machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v3
> Hmean send-64 87.47 ( 0.00%) 127.14 ( 45.36%)
> Hmean send-128 174.36 ( 0.00%) 256.42 ( 47.06%)
> Hmean send-256 347.52 ( 0.00%) 509.41 ( 46.59%)
> Hmean send-1024 1363.03 ( 0.00%) 1991.54 ( 46.11%)
> Hmean send-2048 2632.68 ( 0.00%) 3759.51 ( 42.80%)
> Hmean send-3312 4123.19 ( 0.00%) 5873.28 ( 42.45%)
> Hmean send-4096 5056.48 ( 0.00%) 7072.81 ( 39.88%)
> Hmean send-8192 8784.22 ( 0.00%) 12143.92 ( 38.25%)
> Hmean send-16384 15081.60 ( 0.00%) 19812.71 ( 31.37%)
> Hmean recv-64 86.19 ( 0.00%) 126.59 ( 46.87%)
> Hmean recv-128 173.93 ( 0.00%) 255.21 ( 46.73%)
> Hmean recv-256 346.19 ( 0.00%) 506.72 ( 46.37%)
> Hmean recv-1024 1358.28 ( 0.00%) 1980.03 ( 45.77%)
> Hmean recv-2048 2623.45 ( 0.00%) 3729.35 ( 42.15%)
> Hmean recv-3312 4108.63 ( 0.00%) 5831.47 ( 41.93%)
> Hmean recv-4096 5037.25 ( 0.00%) 7021.59 ( 39.39%)
> Hmean recv-8192 8762.32 ( 0.00%) 12072.44 ( 37.78%)
> Hmean recv-16384 15042.36 ( 0.00%) 19690.14 ( 30.90%)

That looks way much better than the "v1" RFC posting. Was it just
because you stopped doing the "at first iteration, use migratetype as
index", and initializing pindex UINT_MAX hits so much quicker, or was
there something more subtle that I missed? There was no changelog
between "v1" and "v2".

>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>


2016-11-28 11:45:54

by Mel Gorman

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Mon, Nov 28, 2016 at 12:00:41PM +0100, Vlastimil Babka wrote:
> On 11/27/2016 02:19 PM, Mel Gorman wrote:
> >
> > 2-socket modern machine
> > 4.9.0-rc5 4.9.0-rc5
> > vanilla hopcpu-v3
> > Hmean send-64 178.38 ( 0.00%) 256.74 ( 43.93%)
> > Hmean send-128 351.49 ( 0.00%) 507.52 ( 44.39%)
> > Hmean send-256 671.23 ( 0.00%) 1004.19 ( 49.60%)
> > Hmean send-1024 2663.60 ( 0.00%) 3910.42 ( 46.81%)
> > Hmean send-2048 5126.53 ( 0.00%) 7562.13 ( 47.51%)
> > Hmean send-3312 7949.99 ( 0.00%) 11565.98 ( 45.48%)
> > Hmean send-4096 9433.56 ( 0.00%) 12929.67 ( 37.06%)
> > Hmean send-8192 15940.64 ( 0.00%) 21587.63 ( 35.43%)
> > Hmean send-16384 26699.54 ( 0.00%) 32013.79 ( 19.90%)
> > Hmean recv-64 178.38 ( 0.00%) 256.72 ( 43.92%)
> > Hmean recv-128 351.49 ( 0.00%) 507.47 ( 44.38%)
> > Hmean recv-256 671.20 ( 0.00%) 1003.95 ( 49.57%)
> > Hmean recv-1024 2663.45 ( 0.00%) 3909.70 ( 46.79%)
> > Hmean recv-2048 5126.26 ( 0.00%) 7560.67 ( 47.49%)
> > Hmean recv-3312 7949.50 ( 0.00%) 11564.63 ( 45.48%)
> > Hmean recv-4096 9433.04 ( 0.00%) 12927.48 ( 37.04%)
> > Hmean recv-8192 15939.64 ( 0.00%) 21584.59 ( 35.41%)
> > Hmean recv-16384 26698.44 ( 0.00%) 32009.77 ( 19.89%)
> >
> > 1-socket 6 year old machine
> > 4.9.0-rc5 4.9.0-rc5
> > vanilla hopcpu-v3
> > Hmean send-64 87.47 ( 0.00%) 127.14 ( 45.36%)
> > Hmean send-128 174.36 ( 0.00%) 256.42 ( 47.06%)
> > Hmean send-256 347.52 ( 0.00%) 509.41 ( 46.59%)
> > Hmean send-1024 1363.03 ( 0.00%) 1991.54 ( 46.11%)
> > Hmean send-2048 2632.68 ( 0.00%) 3759.51 ( 42.80%)
> > Hmean send-3312 4123.19 ( 0.00%) 5873.28 ( 42.45%)
> > Hmean send-4096 5056.48 ( 0.00%) 7072.81 ( 39.88%)
> > Hmean send-8192 8784.22 ( 0.00%) 12143.92 ( 38.25%)
> > Hmean send-16384 15081.60 ( 0.00%) 19812.71 ( 31.37%)
> > Hmean recv-64 86.19 ( 0.00%) 126.59 ( 46.87%)
> > Hmean recv-128 173.93 ( 0.00%) 255.21 ( 46.73%)
> > Hmean recv-256 346.19 ( 0.00%) 506.72 ( 46.37%)
> > Hmean recv-1024 1358.28 ( 0.00%) 1980.03 ( 45.77%)
> > Hmean recv-2048 2623.45 ( 0.00%) 3729.35 ( 42.15%)
> > Hmean recv-3312 4108.63 ( 0.00%) 5831.47 ( 41.93%)
> > Hmean recv-4096 5037.25 ( 0.00%) 7021.59 ( 39.39%)
> > Hmean recv-8192 8762.32 ( 0.00%) 12072.44 ( 37.78%)
> > Hmean recv-16384 15042.36 ( 0.00%) 19690.14 ( 30.90%)
>
> That looks way much better than the "v1" RFC posting. Was it just because
> you stopped doing the "at first iteration, use migratetype as index", and
> initializing pindex UINT_MAX hits so much quicker, or was there something
> more subtle that I missed? There was no changelog between "v1" and "v2".
>

The array is sized correctly, which avoids one useless check. The order-0
lists are always drained first, so in some rare cases only the fast
paths are used. There was a subtle correction in detecting when all of
one list should be drained. In combination, it happened to boost
performance a lot on the two machines I reported on. While 6 other
machines were tested, not all of them saw such a dramatic boost and, if
these machines are rebooted and retested, the high performance is not
always consistent; it all depends on how often the fast paths are used.
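
For reference, this is the list selection loop from the patch (annotated);
the unsigned wraparound of pindex is what makes the walk start at the
order-0 lists:

	/*
	 * pindex is initialised to UINT_MAX, so the first ++pindex wraps
	 * to 0 and the round-robin walk begins at the order-0
	 * per-migratetype lists at the front of the array before reaching
	 * the high-order caches at the end.
	 */
	do {
		batch_free++;
		if (++pindex == NR_PCP_LISTS)
			pindex = 0;
		list = &pcp->lists[pindex];
	} while (list_empty(list));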

> > Signed-off-by: Mel Gorman <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
>

Thanks.

--
Mel Gorman
SUSE Labs

by Christoph Lameter

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Sun, 27 Nov 2016, Mel Gorman wrote:

>
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications

Note that SLUB will only use high order pages when available and fall back
to order 0 if memory is fragmented. This means that the effect of this
patch is going to gradually vanish as memory becomes more and more
fragmented.

I think this patch is beneficial but we need to address long term the
issue of memory fragmentation. That is not only a SLUB issue but an
overall problem since we keep on having to maintain lists of 4k memory
blocks in various subsystems. And as memory increases these lists are
becoming larger and larger and more difficult to manage. Code complexity
increases and fragility too (look at transparent hugepages). Ultimately we
will need a clean way to manage the allocation and freeing of large
physically contiguous pages. Reserving memory at boot (CMA, giant
pages) is some sort of solution, but it all devolves into lots of knobs
that only insiders know how to tune and an overall fragile solution.

2016-11-28 16:21:35

by Mel Gorman

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Mon, Nov 28, 2016 at 09:39:19AM -0600, Christoph Lameter wrote:
> On Sun, 27 Nov 2016, Mel Gorman wrote:
>
> >
> > SLUB has been the default small kernel object allocator for quite some time
> > but it is not universally used due to performance concerns and a reliance
> > on high-order pages. The high-order concerns has two major components --
> > high-order pages are not always available and high-order page allocations
> > potentially contend on the zone->lock. This patch addresses some concerns
> > about the zone lock contention by extending the per-cpu page allocator to
> > cache high-order pages. The patch makes the following modifications
>
> Note that SLUB will only use high order pages when available and fall back
> to order 0 if memory is fragmented. This means that the effect of this
> patch is going to gradually vanish as memory becomes more and more
> fragmented.
>

Yes, that's a problem for SLUB with or without this patch. It's always
been the case that SLUB relying on high-order pages for performance is
problematic.

> I think this patch is beneficial but we need to address long term the
> issue of memory fragmentation. That is not only a SLUB issue but an
> overall problem since we keep on having to maintain lists of 4k memory
> blocks in variuos subsystems. And as memory increases these lists are
> becoming larger and larger and more difficult to manage. Code complexity
> increases and fragility too (look at transparent hugepages). Ultimately we
> will need a clean way to manage the allocation and freeing of large
> physically contiguous pages. Reserving memory at booting (CMA, giant
> pages) is some sort of solution but this all devolves into lots of knobs
> that only insiders know how to tune and an overall fragile solution.
>

While I agree with all of this, it's also a problem independent of this
patch.


--
Mel Gorman
SUSE Labs

by Christoph Lameter

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Mon, 28 Nov 2016, Mel Gorman wrote:

> Yes, that's a problem for SLUB with or without this patch. It's always
> been the case that SLUB relying on high-order pages for performance is
> problematic.

This is a general issue in the kernel. Performance often requires larger
contiguous ranges of memory.


> > that only insiders know how to tune and an overall fragile solution.
> While I agree with all of this, it's also a problem independent of this
> patch.

It is related. The fundamental issue with fragmentation remains and IMHO we
really need to tackle this.

2016-11-28 18:48:15

by Mel Gorman

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Mon, Nov 28, 2016 at 10:38:58AM -0600, Christoph Lameter wrote:
> > > that only insiders know how to tune and an overall fragile solution.
> > While I agree with all of this, it's also a problem independent of this
> > patch.
>
> It is related. The fundamental issue with fragmentation remain and IMHO we
> really need to tackle this.
>

Fragmentation is one issue. Allocation scalability is a separate issue.
This patch is about scaling parallel allocations of small contiguous
ranges. Even if there were fragmentation-related patches up for discussion,
they would not be directly affected by this patch.

If you have a series aimed at parts of the fragmentation problem or how
subsystems can avoid tracking 4K pages in some important cases then by
all means post them.

--
Mel Gorman
SUSE Labs

by Christoph Lameter

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Mon, 28 Nov 2016, Mel Gorman wrote:

> If you have a series aimed at parts of the fragmentation problem or how
> subsystems can avoid tracking 4K pages in some important cases then by
> all means post them.

I designed SLUB with defrag methods in mind. We could warm up some old
patchsets that were never merged:

https://lkml.org/lkml/2010/1/29/332

2016-11-28 19:54:43

by Johannes Weiner

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Sun, Nov 27, 2016 at 01:19:54PM +0000, Mel Gorman wrote:
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and intuitively, avoiding the
> zone->lock in some cases is a good thing.
>
> Signed-off-by: Mel Gorman <[email protected]>

This seems like a net gain to me, and the patch looks good too.

Acked-by: Johannes Weiner <[email protected]>

> @@ -255,6 +255,24 @@ enum zone_watermarks {
> NR_WMARK
> };
>
> +/*
> + * One per migratetype for order-0 pages and one per high-order up to
> + * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
> + * allocations to contaminate reclaimable pageblocks if high-order
> + * pages are heavily used.

I think that should be fine. Higher order allocations rely on being
able to compact movable blocks, not on reclaim freeing contiguous
blocks, so poisoning reclaimable blocks is much less of a concern than
poisoning movable blocks. And I'm not aware of any 0 < order < COSTLY
movable allocations that would put movable blocks into an HO cache.

2016-11-28 21:00:02

by Vlastimil Babka

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On 11/28/2016 07:54 PM, Christoph Lameter wrote:
> On Mon, 28 Nov 2016, Mel Gorman wrote:
>
>> If you have a series aimed at parts of the fragmentation problem or how
>> subsystems can avoid tracking 4K pages in some important cases then by
>> all means post them.
>
> I designed SLUB with defrag methods in mind. We could warm up some old
> patchsets that where never merged:
>
> https://lkml.org/lkml/2010/1/29/332

Note that some other solutions to the dentry cache problem (perhaps of a
more low-hanging fruit kind) were also discussed at KS/LPC MM panel
session: https://lwn.net/Articles/705758/

2016-11-30 08:56:10

by Mel Gorman

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Mon, Nov 28, 2016 at 12:00:41PM +0100, Vlastimil Babka wrote:
> > 1-socket 6 year old machine
> > 4.9.0-rc5 4.9.0-rc5
> > vanilla hopcpu-v3
> > Hmean send-64 87.47 ( 0.00%) 127.14 ( 45.36%)
> > Hmean send-128 174.36 ( 0.00%) 256.42 ( 47.06%)
> > Hmean send-256 347.52 ( 0.00%) 509.41 ( 46.59%)
> > Hmean send-1024 1363.03 ( 0.00%) 1991.54 ( 46.11%)
> > Hmean send-2048 2632.68 ( 0.00%) 3759.51 ( 42.80%)
> > Hmean send-3312 4123.19 ( 0.00%) 5873.28 ( 42.45%)
> > Hmean send-4096 5056.48 ( 0.00%) 7072.81 ( 39.88%)
> > Hmean send-8192 8784.22 ( 0.00%) 12143.92 ( 38.25%)
> > Hmean send-16384 15081.60 ( 0.00%) 19812.71 ( 31.37%)
> > Hmean recv-64 86.19 ( 0.00%) 126.59 ( 46.87%)
> > Hmean recv-128 173.93 ( 0.00%) 255.21 ( 46.73%)
> > Hmean recv-256 346.19 ( 0.00%) 506.72 ( 46.37%)
> > Hmean recv-1024 1358.28 ( 0.00%) 1980.03 ( 45.77%)
> > Hmean recv-2048 2623.45 ( 0.00%) 3729.35 ( 42.15%)
> > Hmean recv-3312 4108.63 ( 0.00%) 5831.47 ( 41.93%)
> > Hmean recv-4096 5037.25 ( 0.00%) 7021.59 ( 39.39%)
> > Hmean recv-8192 8762.32 ( 0.00%) 12072.44 ( 37.78%)
> > Hmean recv-16384 15042.36 ( 0.00%) 19690.14 ( 30.90%)
>
> That looks way much better than the "v1" RFC posting. Was it just because
> you stopped doing the "at first iteration, use migratetype as index", and
> initializing pindex UINT_MAX hits so much quicker, or was there something
> more subtle that I missed? There was no changelog between "v1" and "v2".
>

FYI, the LKP test robot reported the following so there is some
independent basis for picking this up.

---8<---

FYI, we noticed a +23.0% improvement of netperf.Throughput_Mbps due to
commit:

commit 79404c5a5c66481aa55c0cae685e49e0f44a0479 ("mm: page_alloc: High-order per-cpu page allocator")
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-pagealloc-highorder-percpu-v3r1


--
Mel Gorman
SUSE Labs

2016-11-30 12:40:48

by Jesper Dangaard Brouer

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3


On Sun, 27 Nov 2016 13:19:54 +0000 Mel Gorman <[email protected]> wrote:

[...]
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
>
> o New per-cpu lists are added to cache the high-order pages. This increases
> the cache footprint of the per-cpu allocator and overall usage but for
> some workloads, this will be offset by reduced contention on zone->lock.

This will also help the performance of NIC drivers that allocate
higher-order pages for their RX-ring queues (and chop them up into
MTU-sized buffers). I do like this patch, even though I'm working on
moving drivers away from allocating these high-order pages.

Acked-by: Jesper Dangaard Brouer <[email protected]>

[...]
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.

I do like that you are using a networking test to benchmark this. Looking
at the results, my initial response is that the improvements are basically
too good to be true.

Can you share how you tested this with netperf and the specific netperf
parameters?
e.g.
How do you configure the send/recv sizes?
Have you pinned netperf and netserver on different CPUs?

For localhost testing, when netperf and netserver run on the same CPU,
you observe half the performance, quite intuitively. When pinning
netperf and netserver (via e.g. option -T 1,2) you observe the most
stable results. When allowing netperf and netserver to migrate between
CPUs (the default setting), the real fun starts and results become
unstable, because now the CPU scheduler is also being tested, and in my
experience more "fun" memory situations also occur, as I guess we are
hopping between more per-CPU alloc caches (also affecting the SLUB
per-CPU usage pattern).

> 2-socket modern machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v3

The kernel from 4.9.0-rc5-vanilla to 4.9.0-rc5-hopcpu-v3 only contains
this single change, right?
Netdev/Paolo recently (in net-next) optimized the UDP code path
significantly, and I just want to make sure your results are not
affected by these changes.


> Hmean send-64 178.38 ( 0.00%) 256.74 ( 43.93%)
> Hmean send-128 351.49 ( 0.00%) 507.52 ( 44.39%)
> Hmean send-256 671.23 ( 0.00%) 1004.19 ( 49.60%)
> Hmean send-1024 2663.60 ( 0.00%) 3910.42 ( 46.81%)
> Hmean send-2048 5126.53 ( 0.00%) 7562.13 ( 47.51%)
> Hmean send-3312 7949.99 ( 0.00%) 11565.98 ( 45.48%)
> Hmean send-4096 9433.56 ( 0.00%) 12929.67 ( 37.06%)
> Hmean send-8192 15940.64 ( 0.00%) 21587.63 ( 35.43%)
> Hmean send-16384 26699.54 ( 0.00%) 32013.79 ( 19.90%)
> Hmean recv-64 178.38 ( 0.00%) 256.72 ( 43.92%)
> Hmean recv-128 351.49 ( 0.00%) 507.47 ( 44.38%)
> Hmean recv-256 671.20 ( 0.00%) 1003.95 ( 49.57%)
> Hmean recv-1024 2663.45 ( 0.00%) 3909.70 ( 46.79%)
> Hmean recv-2048 5126.26 ( 0.00%) 7560.67 ( 47.49%)
> Hmean recv-3312 7949.50 ( 0.00%) 11564.63 ( 45.48%)
> Hmean recv-4096 9433.04 ( 0.00%) 12927.48 ( 37.04%)
> Hmean recv-8192 15939.64 ( 0.00%) 21584.59 ( 35.41%)
> Hmean recv-16384 26698.44 ( 0.00%) 32009.77 ( 19.89%)
>
> 1-socket 6 year old machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v3
> Hmean send-64 87.47 ( 0.00%) 127.14 ( 45.36%)
> Hmean send-128 174.36 ( 0.00%) 256.42 ( 47.06%)
> Hmean send-256 347.52 ( 0.00%) 509.41 ( 46.59%)
> Hmean send-1024 1363.03 ( 0.00%) 1991.54 ( 46.11%)
> Hmean send-2048 2632.68 ( 0.00%) 3759.51 ( 42.80%)
> Hmean send-3312 4123.19 ( 0.00%) 5873.28 ( 42.45%)
> Hmean send-4096 5056.48 ( 0.00%) 7072.81 ( 39.88%)
> Hmean send-8192 8784.22 ( 0.00%) 12143.92 ( 38.25%)
> Hmean send-16384 15081.60 ( 0.00%) 19812.71 ( 31.37%)
> Hmean recv-64 86.19 ( 0.00%) 126.59 ( 46.87%)
> Hmean recv-128 173.93 ( 0.00%) 255.21 ( 46.73%)
> Hmean recv-256 346.19 ( 0.00%) 506.72 ( 46.37%)
> Hmean recv-1024 1358.28 ( 0.00%) 1980.03 ( 45.77%)
> Hmean recv-2048 2623.45 ( 0.00%) 3729.35 ( 42.15%)
> Hmean recv-3312 4108.63 ( 0.00%) 5831.47 ( 41.93%)
> Hmean recv-4096 5037.25 ( 0.00%) 7021.59 ( 39.39%)
> Hmean recv-8192 8762.32 ( 0.00%) 12072.44 ( 37.78%)
> Hmean recv-16384 15042.36 ( 0.00%) 19690.14 ( 30.90%)
>
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2016-11-30 13:05:59

by Michal Hocko

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Sun 27-11-16 13:19:54, Mel Gorman wrote:
[...]
> @@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> struct page *page;
> bool cold = ((gfp_flags & __GFP_COLD) != 0);
>
> - if (likely(order == 0)) {
> + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> struct per_cpu_pages *pcp;
> struct list_head *list;
>
> local_irq_save(flags);
> do {
> + unsigned int pindex;
> +
> + pindex = order_to_pindex(migratetype, order);
> pcp = &this_cpu_ptr(zone->pageset)->pcp;
> - list = &pcp->lists[migratetype];
> + list = &pcp->lists[pindex];
> if (list_empty(list)) {
> - pcp->count += rmqueue_bulk(zone, 0,
> + int nr_pages = rmqueue_bulk(zone, order,
> pcp->batch, list,
> migratetype, cold);
> + pcp->count += (nr_pages << order);
> if (unlikely(list_empty(list)))
> goto failed;

Just a nit: we can reorder the check and the count update because nobody
could have stolen pages allocated by rmqueue_bulk. I would also consider
nr_pages a bit misleading because we get the number of allocated elements.
Nothing to lose sleep over...
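
I.e. something along these lines (just a sketch of the reordering, not
tested):

	if (list_empty(list)) {
		int nr_pages = rmqueue_bulk(zone, order,
				pcp->batch, list,
				migratetype, cold);
		if (unlikely(list_empty(list)))
			goto failed;
		/*
		 * Accounting after the check is fine; IRQs are disabled
		 * so nobody can steal the pages rmqueue_bulk() just added.
		 */
		pcp->count += (nr_pages << order);
	}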

> }

But... Unless I am missing something, this effectively means that we do
not exercise high-order atomic reserves. Shouldn't we fall back to
the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
order > 0 && ALLOC_HARDER? Or is this just hidden in some other code
path which I am not seeing?

Other than that the patch looks reasonable to me. Keeping some portion
of !costly pages on pcp lists sounds useful from the fragmentation
point of view as well AFAICS, because they would normally be dissolved
for order-0 requests whereas right now we push more on reclaim.

--
Michal Hocko
SUSE Labs

2016-11-30 14:06:26

by Mel Gorman

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Wed, Nov 30, 2016 at 01:40:34PM +0100, Jesper Dangaard Brouer wrote:
>
> On Sun, 27 Nov 2016 13:19:54 +0000 Mel Gorman <[email protected]> wrote:
>
> [...]
> > SLUB has been the default small kernel object allocator for quite some time
> > but it is not universally used due to performance concerns and a reliance
> > on high-order pages. The high-order concerns has two major components --
> > high-order pages are not always available and high-order page allocations
> > potentially contend on the zone->lock. This patch addresses some concerns
> > about the zone lock contention by extending the per-cpu page allocator to
> > cache high-order pages. The patch makes the following modifications
> >
> > o New per-cpu lists are added to cache the high-order pages. This increases
> > the cache footprint of the per-cpu allocator and overall usage but for
> > some workloads, this will be offset by reduced contention on zone->lock.
>
> This will also help performance of NIC driver that allocator
> higher-order pages for their RX-ring queue (and chop it up for MTU).
> I do like this patch, even-though I'm working on moving drivers away
> from allocation these high-order pages.
>
> Acked-by: Jesper Dangaard Brouer <[email protected]>
>

Thanks.

> [...]
> > This is the result from netperf running UDP_STREAM on localhost. It was
> > selected on the basis that it is slab-intensive and has been the subject
> > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > testing between two physical hosts.
>
> I do like you are using a networking test to benchmark this. Looking at
> the results, my initial response is that the improvements are basically
> too good to be true.
>

FWIW, LKP independently measured the boost to be 23% so it's expected
there will be different results depending on exact configuration and CPU.

> Can you share how you tested this with netperf and the specific netperf
> parameters?

The mmtests config file used is
configs/config-global-dhp__network-netperf-unbound so all details can be
extrapolated or reproduced from that.

> e.g.
> How do you configure the send/recv sizes?

Static range of sizes specified in the config file.

> Have you pinned netperf and netserver on different CPUs?
>

No. While it's possible to do a pinned test which helps stability, it
also tends to be less reflective of what happens in a variety of
workloads so I took the "harder" option.

> For localhost testing, when netperf and netserver run on the same CPU,
> you observer half the performance, very intuitively. When pinning
> netperf and netserver (via e.g. option -T 1,2) you observe the most
> stable results. When allowing netperf and netserver to migrate between
> CPUs (default setting), the real fun starts and unstable results,
> because now the CPU scheduler is also being tested, and my experience
> is also more "fun" memory situations occurs, as I guess we are hopping
> between more per CPU alloc caches (also affecting the SLUB per CPU usage
> pattern).
>

Yes which is another reason why I used an unbound configuration. I didn't
want to get an artificial boost from pinned server/client using the same
per-cpu caches. As a side-effect, it may mean that machines with fewer
CPUs get a greater boost as there are fewer per-cpu caches being used.

> > 2-socket modern machine
> > 4.9.0-rc5 4.9.0-rc5
> > vanilla hopcpu-v3
>
> The kernel from 4.9.0-rc5-vanilla to 4.9.0-rc5-hopcpu-v3 only contains
> this single change right?

Yes.

--
Mel Gorman
SUSE Labs

2016-11-30 14:16:26

by Mel Gorman

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Wed, Nov 30, 2016 at 02:05:50PM +0100, Michal Hocko wrote:
> On Sun 27-11-16 13:19:54, Mel Gorman wrote:
> [...]
> > @@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > struct page *page;
> > bool cold = ((gfp_flags & __GFP_COLD) != 0);
> >
> > - if (likely(order == 0)) {
> > + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > struct per_cpu_pages *pcp;
> > struct list_head *list;
> >
> > local_irq_save(flags);
> > do {
> > + unsigned int pindex;
> > +
> > + pindex = order_to_pindex(migratetype, order);
> > pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > - list = &pcp->lists[migratetype];
> > + list = &pcp->lists[pindex];
> > if (list_empty(list)) {
> > - pcp->count += rmqueue_bulk(zone, 0,
> > + int nr_pages = rmqueue_bulk(zone, order,
> > pcp->batch, list,
> > migratetype, cold);
> > + pcp->count += (nr_pages << order);
> > if (unlikely(list_empty(list)))
> > goto failed;
>
> just a nit, we can reorder the check and the count update because nobody
> could have stolen pages allocated by rmqueue_bulk.

Ok, it's minor but I can do that.

> I would also consider
> nr_pages a bit misleading because we get a number or allocated elements.
> Nothing to lose sleep over...
>

I didn't think of a clearer name because in this sort of context, I consider
a high-order page to be a single page.

> > }
>
> But... Unless I am missing something this effectively means that we do
> not exercise high order atomic reserves. Shouldn't we fallback to
> the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
> order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
> path which I am not seeing?
>

Good spot, would this be acceptable to you?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91dc68c2a717..94808f565f74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2609,9 +2609,18 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
int nr_pages = rmqueue_bulk(zone, order,
pcp->batch, list,
migratetype, cold);
- pcp->count += (nr_pages << order);
- if (unlikely(list_empty(list)))
+ if (unlikely(list_empty(list))) {
+ /*
+ * Retry high-order atomic allocs
+ * from the buddy list which may
+ * use MIGRATE_HIGHATOMIC.
+ */
+ if (order && (alloc_flags & ALLOC_HARDER))
+ goto try_buddylist;
+
goto failed;
+ }
+ pcp->count += (nr_pages << order);
}

if (cold)
@@ -2624,6 +2633,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,

} while (check_new_pcp(page));
} else {
+try_buddylist:
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
--
Mel Gorman
SUSE Labs

2016-11-30 15:00:04

by Michal Hocko

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Wed 30-11-16 14:16:13, Mel Gorman wrote:
> On Wed, Nov 30, 2016 at 02:05:50PM +0100, Michal Hocko wrote:
[...]
> > But... Unless I am missing something this effectively means that we do
> > not exercise high order atomic reserves. Shouldn't we fallback to
> > the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
> > order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
> > path which I am not seeing?
> >
>
> Good spot, would this be acceptable to you?

It's not a queen of beauty but it works. A more elegant solution would
require more surgery I guess which is probably not worth it at this
stage.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91dc68c2a717..94808f565f74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2609,9 +2609,18 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> int nr_pages = rmqueue_bulk(zone, order,
> pcp->batch, list,
> migratetype, cold);
> - pcp->count += (nr_pages << order);
> - if (unlikely(list_empty(list)))
> + if (unlikely(list_empty(list))) {
> + /*
> + * Retry high-order atomic allocs
> + * from the buddy list which may
> + * use MIGRATE_HIGHATOMIC.
> + */
> + if (order && (alloc_flags & ALLOC_HARDER))
> + goto try_buddylist;
> +
> goto failed;
> + }
> + pcp->count += (nr_pages << order);
> }
>
> if (cold)
> @@ -2624,6 +2633,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>
> } while (check_new_pcp(page));
> } else {
> +try_buddylist:
> /*
> * We most definitely don't want callers attempting to
> * allocate greater than order-1 page units with __GFP_NOFAIL.
> --
> Mel Gorman
> SUSE Labs

--
Michal Hocko
SUSE Labs

2016-11-30 15:06:27

by Jesper Dangaard Brouer

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Wed, 30 Nov 2016 14:06:15 +0000
Mel Gorman <[email protected]> wrote:

> On Wed, Nov 30, 2016 at 01:40:34PM +0100, Jesper Dangaard Brouer wrote:
> >
> > On Sun, 27 Nov 2016 13:19:54 +0000 Mel Gorman <[email protected]> wrote:
> >
> > [...]
> > > SLUB has been the default small kernel object allocator for quite some time
> > > but it is not universally used due to performance concerns and a reliance
> > > on high-order pages. The high-order concerns has two major components --
> > > high-order pages are not always available and high-order page allocations
> > > potentially contend on the zone->lock. This patch addresses some concerns
> > > about the zone lock contention by extending the per-cpu page allocator to
> > > cache high-order pages. The patch makes the following modifications
> > >
> > > o New per-cpu lists are added to cache the high-order pages. This increases
> > > the cache footprint of the per-cpu allocator and overall usage but for
> > > some workloads, this will be offset by reduced contention on zone->lock.
> >
> > This will also help performance of NIC driver that allocator
> > higher-order pages for their RX-ring queue (and chop it up for MTU).
> > I do like this patch, even-though I'm working on moving drivers away
> > from allocation these high-order pages.
> >
> > Acked-by: Jesper Dangaard Brouer <[email protected]>
> >
>
> Thanks.
>
> > [...]
> > > This is the result from netperf running UDP_STREAM on localhost. It was
> > > selected on the basis that it is slab-intensive and has been the subject
> > > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > > testing between two physical hosts.
> >
> > I do like you are using a networking test to benchmark this. Looking at
> > the results, my initial response is that the improvements are basically
> > too good to be true.
> >
>
> FWIW, LKP independently measured the boost to be 23% so it's expected
> there will be different results depending on exact configuration and CPU.

Yes, noticed that, nice (which was a SCTP test)
https://lists.01.org/pipermail/lkp/2016-November/005210.html

It is of course great. It is just strange that I cannot reproduce it on my
high-end box with manual testing. I'll try your test suite and try to
figure out what is wrong with my setup.


> > Can you share how you tested this with netperf and the specific netperf
> > parameters?
>
> The mmtests config file used is
> configs/config-global-dhp__network-netperf-unbound so all details can be
> extrapolated or reproduced from that.

I didn't know of mmtests: https://github.com/gormanm/mmtests

It looks nice and quite comprehensive! :-)


> > e.g.
> > How do you configure the send/recv sizes?
>
> Static range of sizes specified in the config file.

I'll figure it out... reading your shell code :-)

export NETPERF_BUFFER_SIZES=64,128,256,1024,2048,3312,4096,8192,16384
https://github.com/gormanm/mmtests/blob/master/configs/config-global-dhp__network-netperf-unbound#L72

I see you are using netperf 2.4.5 and setting both the send and recv
sizes (-- -m and -M), which is fine.

I don't quite get why you are setting the socket recv size (with -- -s
and -S) to such a small number, size + 256.

SOCKETSIZE_OPT="-s $((SIZE+256)) -S $((SIZE+256))"

netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
-- -s 320 -S 320 -m 64 -M 64 -P 15895

netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
-- -s 384 -S 384 -m 128 -M 128 -P 15895

netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
-- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

> > Have you pinned netperf and netserver on different CPUs?
> >
>
> No. While it's possible to do a pinned test which helps stability, it
> also tends to be less reflective of what happens in a variety of
> workloads so I took the "harder" option.

Agree.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2016-11-30 16:35:24

by Mel Gorman

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Wed, Nov 30, 2016 at 04:06:12PM +0100, Jesper Dangaard Brouer wrote:
> > > [...]
> > > > This is the result from netperf running UDP_STREAM on localhost. It was
> > > > selected on the basis that it is slab-intensive and has been the subject
> > > > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > > > testing between two physical hosts.
> > >
> > > I do like you are using a networking test to benchmark this. Looking at
> > > the results, my initial response is that the improvements are basically
> > > too good to be true.
> > >
> >
> > FWIW, LKP independently measured the boost to be 23% so it's expected
> > there will be different results depending on exact configuration and CPU.
>
> Yes, noticed that, nice (which was a SCTP test)
> https://lists.01.org/pipermail/lkp/2016-November/005210.html
>
> It is of-cause great. It is just strange I cannot reproduce it on my
> high-end box, with manual testing. I'll try your test suite and try to
> figure out what is wrong with my setup.
>

That would be great. I had seen the boost on multiple machines and LKP
verifying it is helpful.

>
> > > Can you share how you tested this with netperf and the specific netperf
> > > parameters?
> >
> > The mmtests config file used is
> > configs/config-global-dhp__network-netperf-unbound so all details can be
> > extrapolated or reproduced from that.
>
> I didn't know of mmtests: https://github.com/gormanm/mmtests
>
> It looks nice and quite comprehensive! :-)
>

Thanks.

> > > e.g.
> > > How do you configure the send/recv sizes?
> >
> > Static range of sizes specified in the config file.
>
> I'll figure it out... reading your shell code :-)
>
> export NETPERF_BUFFER_SIZES=64,128,256,1024,2048,3312,4096,8192,16384
> https://github.com/gormanm/mmtests/blob/master/configs/config-global-dhp__network-netperf-unbound#L72
>
> I see you are using netperf 2.4.5 and setting both the send an recv
> size (-- -m and -M) which is fine.
>

Ok.

> I don't quite get why you are setting the socket recv size (with -- -s
> and -S) to such a small number, size + 256.
>

Maybe I missed something at the time I wrote that but why would it need
to be larger?

--
Mel Gorman
SUSE Labs

2016-12-01 17:34:18

by Jesper Dangaard Brouer

Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

(Cc. netdev, we might have an issue with Paolo's UDP accounting and
small socket queues)

On Wed, 30 Nov 2016 16:35:20 +0000
Mel Gorman <[email protected]> wrote:

> > I don't quite get why you are setting the socket recv size
> > (with -- -s and -S) to such a small number, size + 256.
> >
>
> Maybe I missed something at the time I wrote that but why would it
> need to be larger?

Well, to me it is quite obvious that we need some queue to avoid packet
drops. We have two processes, netperf and netserver, that are sending
packets between each other (for UDP_STREAM mostly netperf -> netserver).
These PIDs are getting scheduled and migrated between CPUs, and thus do
not get executed equally fast, so a queue is needed to absorb the
fluctuations.

The network stack is even partly catching your config "mistake" and
increases the socket queue size so that we can at minimum handle one
max-sized frame (due to the skb "truesize" concept, approx PAGE_SIZE +
overhead).
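
Roughly, this is what the kernel's SO_SNDBUF/SO_RCVBUF handling does (a
paraphrased sketch, not verbatim kernel code), which is also why netperf
below reports 4608 and 2560 byte socket sizes rather than the 1280 bytes
requested:

	/*
	 * The requested size is doubled to leave room for skb overhead
	 * ("truesize") and clamped to a minimum that can hold at least
	 * one maximum-sized skb.
	 */
	sk->sk_sndbuf = max(val * 2, SOCK_MIN_SNDBUF);
	sk->sk_rcvbuf = max(val * 2, SOCK_MIN_RCVBUF);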

Hopefully, for localhost testing, a small queue would not result in
packet drops. Testing... oops, it does result in packet drops.

Test command extracted from mmtests, UDP_STREAM size 1024:

netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
-- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

4608 1024 60.00 50024301 0 6829.98
2560 60.00 46133211 6298.72

Dropped packets: 50024301-46133211=3891090

To get a better drop indication, during the test I run a command to get
system-wide network counters from the last second, so the numbers below
are per second.

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 885162 0.0
IpInDelivers 885161 0.0
IpOutRequests 885162 0.0
UdpInDatagrams 776105 0.0
UdpInErrors 109056 0.0
UdpOutDatagrams 885160 0.0
UdpRcvbufErrors 109056 0.0
IpExtInOctets 931190476 0.0
IpExtOutOctets 931189564 0.0
IpExtInNoECTPkts 885162 0.0

So, 885Kpps sent but only 776Kpps delivered and 109Kpps dropped. Note that
UdpInErrors and UdpRcvbufErrors are equal (109056/sec). The drop
happens kernel side in __udp_queue_rcv_skb [1], because the receiving
process didn't empty its queue fast enough, see [2].
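
Roughly, the check in [2] that causes these drops looks like this (a
paraphrased sketch, not the exact upstream code):

	/*
	 * If the socket receive buffer cannot absorb the skb's truesize,
	 * the skb is rejected with -ENOMEM; the UDP code then counts this
	 * as UdpRcvbufErrors/UdpInErrors, which is what nstat shows above.
	 */
	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
	    (unsigned int)sk->sk_rcvbuf) {
		atomic_inc(&sk->sk_drops);
		return -ENOMEM;
	}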

Upstream changes are coming in this area, though: [2] is replaced with
__udp_enqueue_schedule_skb, which is what I actually tested with... hmm.

Retesting with kernel 4.7.0-baseline+ ... shows something else.
Paolo, you might want to look into this. It could also explain why
I've not seen the mentioned speedup from the mm change, as I've been
testing this patch on top of net-next (at 93ba2222550) with Paolo's
UDP changes.

netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
-- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895
AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

4608 1024 60.00 47248301 0 6450.97
2560 60.00 47245030 6450.52

Only dropped 47248301-47245030=3271

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 810566 0.0
IpInDelivers 810566 0.0
IpOutRequests 810566 0.0
UdpInDatagrams 810468 0.0
UdpInErrors 99 0.0
UdpOutDatagrams 810566 0.0
UdpRcvbufErrors 99 0.0
IpExtInOctets 852713328 0.0
IpExtOutOctets 852713328 0.0
IpExtInNoECTPkts 810563 0.0

And nstat also looks much better, with only 99 drops/sec.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

[1] http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.8#L1454
[2] http://lxr.free-electrons.com/source/net/core/sock.c?v=4.8#L413


Extra: with net-next at 93ba2222550

If I use netperf's default socket queue sizes, then there is not a
single packet drop:

netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1
-- -m 1024 -M 1024 -P 15895

UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

212992 1024 60.00 48485642 0 6619.91
212992 60.00 48485642 6619.91


$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 821723 0.0
IpInDelivers 821722 0.0
IpOutRequests 821723 0.0
UdpInDatagrams 821722 0.0
UdpOutDatagrams 821722 0.0
IpExtInOctets 864457856 0.0
IpExtOutOctets 864458908 0.0
IpExtInNoECTPkts 821729 0.0




2016-12-01 22:17:55

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> small socket queues)
>
> On Wed, 30 Nov 2016 16:35:20 +0000
> Mel Gorman <[email protected]> wrote:
>
> > > I don't quite get why you are setting the socket recv size
> > > (with -- -s and -S) to such a small number, size + 256.
> > >
> >
> > Maybe I missed something at the time I wrote that but why would it
> > need to be larger?
>
> Well, to me it is quite obvious that we need some queue to avoid packet
> drops. We have two processes, netperf and netserver, that are sending
> packets to each other (for UDP_STREAM, mostly netperf -> netserver).
> These PIDs get scheduled and migrated between CPUs, and thus do not
> execute equally fast, so a queue is needed to absorb the fluctuations.
>
> The network stack even partly catches your config "mistake" and
> increases the socket queue size, so we can at minimum handle one max
> frame (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
>
> For localhost testing a small queue should hopefully not result in
> packet drops. Testing... oops, this does result in packet drops.
>
> Test command extracted from mmtests, UDP_STREAM size 1024:
>
> netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
> -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
>
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> Socket Message Elapsed Messages
> Size Size Time Okay Errors Throughput
> bytes bytes secs # # 10^6bits/sec
>
> 4608 1024 60.00 50024301 0 6829.98
> 2560 60.00 46133211 6298.72
>
> Dropped packets: 50024301-46133211=3891090
>
> To get a better drop indication, I run a command during the test that
> reports system-wide network counters from the last second, so the
> numbers below are per second.
>
> $ nstat > /dev/null && sleep 1 && nstat
> #kernel
> IpInReceives 885162 0.0
> IpInDelivers 885161 0.0
> IpOutRequests 885162 0.0
> UdpInDatagrams 776105 0.0
> UdpInErrors 109056 0.0
> UdpOutDatagrams 885160 0.0
> UdpRcvbufErrors 109056 0.0
> IpExtInOctets 931190476 0.0
> IpExtOutOctets 931189564 0.0
> IpExtInNoECTPkts 885162 0.0
>
> So, 885Kpps sent but only 776Kpps delivered and 109Kpps dropped. Note
> that UdpInErrors and UdpRcvbufErrors are equal (109056/sec). The drop
> happens kernel side in __udp_queue_rcv_skb[1], because the receiving
> process didn't empty its queue fast enough, see [2].
>
> Upstream changes are coming in this area, though: [2] is replaced with
> __udp_enqueue_schedule_skb, which is what I actually tested with... hmm.
>
> Retesting with kernel 4.7.0-baseline+ ... shows something else.
> Paolo, you might want to look into this. It could also explain why
> I've not seen the mentioned speedup from the mm change, as I've been
> testing this patch on top of net-next (at 93ba2222550) with Paolo's
> UDP changes.

Thank you for reporting this.

It seems that commit 123b4a633580 ("udp: use it's own memory
accounting schema") is too strict when checking the rcvbuf.

For very small values of rcvbuf, it allows only a single skb to be
enqueued, while previously we allowed 2 of them to enter the queue,
even if the first one's truesize exceeded rcvbuf, as in your test case.
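
To make the difference concrete, here is a small standalone comparison
of the check the patch below removes and the one it adds (the 2300 byte
truesize for a 1024 byte loopback datagram is only an assumption for
illustration; 2560 matches the receive socket size netperf reported):

#include <stdio.h>

/* the two admission checks, lifted out as plain predicates: "size" is
 * the skb truesize, "rmem" the truesize already charged to the socket */
static int drop_before_fix(int rmem, int size, int rcvbuf)
{
        return rmem && (rmem + size > rcvbuf);  /* the removed '-' line */
}

static int drop_after_fix(int rmem, int size, int rcvbuf)
{
        return rmem > size + rcvbuf;    /* the added '+' lines, folded */
}

int main(void)
{
        int rcvbuf = 2560, size = 2300;

        /* before the fix only the first datagram gets in; with the fix
         * a second one is admitted as well */
        printf("1st: before=%s after=%s\n",
               drop_before_fix(0, size, rcvbuf) ? "drop" : "queue",
               drop_after_fix(0, size, rcvbuf) ? "drop" : "queue");
        printf("2nd: before=%s after=%s\n",
               drop_before_fix(size, size, rcvbuf) ? "drop" : "queue",
               drop_after_fix(size, size, rcvbuf) ? "drop" : "queue");
        return 0;
}

In other words, with the new limit an incoming skb is only refused once
the memory already charged exceeds sk_rcvbuf by more than the skb's own
truesize, which is what gives the second datagram room even with such a
tiny rcvbuf.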

Can you please try the following patch?

Thank you,

Paolo
---
net/ipv4/udp.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..2f5dc92 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
struct sk_buff_head *list = &sk->sk_receive_queue;
int rmem, delta, amt, err = -ENOMEM;
int size = skb->truesize;
+ int limit;

/* try to avoid the costly atomic add/sub pair when the receive
* queue is full; always allow at least a packet
*/
rmem = atomic_read(&sk->sk_rmem_alloc);
- if (rmem && (rmem + size > sk->sk_rcvbuf))
+ limit = size + sk->sk_rcvbuf;
+ if (rmem > limit)
goto drop;

/* we drop only if the receive buf is full and the receive
* queue contains some other skb
*/
rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
- if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+ if (rmem > limit)
goto uncharge_drop;

spin_lock(&list->lock);





2016-12-02 15:39:27

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Thu, 01 Dec 2016 23:17:48 +0100
Paolo Abeni <[email protected]> wrote:

> On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > small socket queues)
> >
> > On Wed, 30 Nov 2016 16:35:20 +0000
> > Mel Gorman <[email protected]> wrote:
> >
> > > > I don't quite get why you are setting the socket recv size
> > > > (with -- -s and -S) to such a small number, size + 256.
> > > >
> > >
> > > Maybe I missed something at the time I wrote that but why would it
> > > need to be larger?
> >
> > Well, to me it is quite obvious that we need some queue to avoid packet
> > drops. We have two processes, netperf and netserver, that are sending
> > packets to each other (for UDP_STREAM, mostly netperf -> netserver).
> > These PIDs get scheduled and migrated between CPUs, and thus do not
> > execute equally fast, so a queue is needed to absorb the fluctuations.
> >
> > The network stack even partly catches your config "mistake" and
> > increases the socket queue size, so we can at minimum handle one max
> > frame (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> >
> > For localhost testing a small queue should hopefully not result in
> > packet drops. Testing... oops, this does result in packet drops.
> >
> > Test command extracted from mmtests, UDP_STREAM size 1024:
> >
> > netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
> > -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> >
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> > port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> > Socket Message Elapsed Messages
> > Size Size Time Okay Errors Throughput
> > bytes bytes secs # # 10^6bits/sec
> >
> > 4608 1024 60.00 50024301 0 6829.98
> > 2560 60.00 46133211 6298.72
> >
> > Dropped packets: 50024301-46133211=3891090
> >
> > To get a better drop indication, I run a command during the test that
> > reports system-wide network counters from the last second, so the
> > numbers below are per second.
> >
> > $ nstat > /dev/null && sleep 1 && nstat
> > #kernel
> > IpInReceives 885162 0.0
> > IpInDelivers 885161 0.0
> > IpOutRequests 885162 0.0
> > UdpInDatagrams 776105 0.0
> > UdpInErrors 109056 0.0
> > UdpOutDatagrams 885160 0.0
> > UdpRcvbufErrors 109056 0.0
> > IpExtInOctets 931190476 0.0
> > IpExtOutOctets 931189564 0.0
> > IpExtInNoECTPkts 885162 0.0
> >
> > So, 885Kpps sent but only 776Kpps delivered and 109Kpps dropped. Note
> > that UdpInErrors and UdpRcvbufErrors are equal (109056/sec). The drop
> > happens kernel side in __udp_queue_rcv_skb[1], because the receiving
> > process didn't empty its queue fast enough, see [2].
> >
> > Upstream changes are coming in this area, though: [2] is replaced with
> > __udp_enqueue_schedule_skb, which is what I actually tested with... hmm.
> >
> > Retesting with kernel 4.7.0-baseline+ ... shows something else.
> > Paolo, you might want to look into this. It could also explain why
> > I've not seen the mentioned speedup from the mm change, as I've been
> > testing this patch on top of net-next (at 93ba2222550) with Paolo's
> > UDP changes.
>
> Thank you for reporting this.
>
> It seems that commit 123b4a633580 ("udp: use it's own memory
> accounting schema") is too strict when checking the rcvbuf.
>
> For very small values of rcvbuf, it allows only a single skb to be
> enqueued, while previously we allowed 2 of them to enter the queue,
> even if the first one's truesize exceeded rcvbuf, as in your test case.
>
> Can you please try the following patch?

Sure, it looks much better with this patch.


$ /home/jbrouer/git/mmtests/work/testdisk/sources/netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

4608 1024 60.00 50191555 0 6852.82
2560 60.00 50189872 6852.59

Only 50191555-50189872=1683 drops, approx 1683/60 = 28/sec

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 885417 0.0
IpInDelivers 885416 0.0
IpOutRequests 885417 0.0
UdpInDatagrams 885382 0.0
UdpInErrors 29 0.0
UdpOutDatagrams 885410 0.0
UdpRcvbufErrors 29 0.0
IpExtInOctets 931534428 0.0
IpExtOutOctets 931533376 0.0
IpExtInNoECTPkts 885488 0.0


> Thank you,
>
> Paolo
> ---
> net/ipv4/udp.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index e1d0bf8..2f5dc92 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> struct sk_buff_head *list = &sk->sk_receive_queue;
> int rmem, delta, amt, err = -ENOMEM;
> int size = skb->truesize;
> + int limit;
>
> /* try to avoid the costly atomic add/sub pair when the receive
> * queue is full; always allow at least a packet
> */
> rmem = atomic_read(&sk->sk_rmem_alloc);
> - if (rmem && (rmem + size > sk->sk_rcvbuf))
> + limit = size + sk->sk_rcvbuf;
> + if (rmem > limit)
> goto drop;
>
> /* we drop only if the receive buf is full and the receive
> * queue contains some other skb
> */
> rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> - if ((rmem > sk->sk_rcvbuf) && (rmem > size))
> + if (rmem > limit)
> goto uncharge_drop;
>
> spin_lock(&list->lock);
>



--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2016-12-02 15:44:39

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

On Fri, 2016-12-02 at 16:37 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 23:17:48 +0100
> Paolo Abeni <[email protected]> wrote:
>
> > On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > > small socket queues)
> > >
> > > On Wed, 30 Nov 2016 16:35:20 +0000
> > > Mel Gorman <[email protected]> wrote:
> > >
> > > > > I don't quite get why you are setting the socket recv size
> > > > > (with -- -s and -S) to such a small number, size + 256.
> > > > >
> > > >
> > > > Maybe I missed something at the time I wrote that but why would it
> > > > need to be larger?
> > >
> > > Well, to me it is quite obvious that we need some queue to avoid packet
> > > drops. We have two processes, netperf and netserver, that are sending
> > > packets to each other (for UDP_STREAM, mostly netperf -> netserver).
> > > These PIDs get scheduled and migrated between CPUs, and thus do not
> > > execute equally fast, so a queue is needed to absorb the fluctuations.
> > >
> > > The network stack even partly catches your config "mistake" and
> > > increases the socket queue size, so we can at minimum handle one max
> > > frame (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> > >
> > > For localhost testing a small queue should hopefully not result in
> > > packet drops. Testing... oops, this does result in packet drops.
> > >
> > > Test command extracted from mmtests, UDP_STREAM size 1024:
> > >
> > > netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
> > > -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> > >
> > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> > > port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> > > Socket Message Elapsed Messages
> > > Size Size Time Okay Errors Throughput
> > > bytes bytes secs # # 10^6bits/sec
> > >
> > > 4608 1024 60.00 50024301 0 6829.98
> > > 2560 60.00 46133211 6298.72
> > >
> > > Dropped packets: 50024301-46133211=3891090
> > >
> > > To get a better drop indication, I run a command during the test that
> > > reports system-wide network counters from the last second, so the
> > > numbers below are per second.
> > >
> > > $ nstat > /dev/null && sleep 1 && nstat
> > > #kernel
> > > IpInReceives 885162 0.0
> > > IpInDelivers 885161 0.0
> > > IpOutRequests 885162 0.0
> > > UdpInDatagrams 776105 0.0
> > > UdpInErrors 109056 0.0
> > > UdpOutDatagrams 885160 0.0
> > > UdpRcvbufErrors 109056 0.0
> > > IpExtInOctets 931190476 0.0
> > > IpExtOutOctets 931189564 0.0
> > > IpExtInNoECTPkts 885162 0.0
> > >
> > > So, 885Kpps sent but only 776Kpps delivered and 109Kpps dropped. Note
> > > that UdpInErrors and UdpRcvbufErrors are equal (109056/sec). The drop
> > > happens kernel side in __udp_queue_rcv_skb[1], because the receiving
> > > process didn't empty its queue fast enough, see [2].
> > >
> > > Upstream changes are coming in this area, though: [2] is replaced with
> > > __udp_enqueue_schedule_skb, which is what I actually tested with... hmm.
> > >
> > > Retesting with kernel 4.7.0-baseline+ ... shows something else.
> > > Paolo, you might want to look into this. It could also explain why
> > > I've not seen the mentioned speedup from the mm change, as I've been
> > > testing this patch on top of net-next (at 93ba2222550) with Paolo's
> > > UDP changes.
> >
> > Thank you for reporting this.
> >
> > It seems that commit 123b4a633580 ("udp: use it's own memory
> > accounting schema") is too strict when checking the rcvbuf.
> >
> > For very small values of rcvbuf, it allows only a single skb to be
> > enqueued, while previously we allowed 2 of them to enter the queue,
> > even if the first one's truesize exceeded rcvbuf, as in your test case.
> >
> > Can you please try the following patch?
>
> Sure, it looks much better with this patch.

Thank you for testing. I'll send a formal patch to David soon.

BTW I see a nice performance improvement compared to 4.7...

Cheers,

Paolo