After discussions with Joonsoo, I added a guarantee that high-order
lists will be drained regardless of batch size. While I maintained it was
unnecessary, it also did little harm other than increasing the size of
the per-cpu structure. There were slight variations in performance, but they
were a mix of gains and losses within the noise relative to the previous release.
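To make the drain guarantee concrete, here is a minimal userspace sketch of
the round-robin drain that resumes from the list where the previous drain
stopped. It is a model, not the kernel code: struct pcp_model, pcp_add() and
pcp_drain() are hypothetical names, each list is reduced to a count of its
entries, the constants are assumed to be 3 and 3, and the clamp on count is
a guard added only for the sketch.

```c
#include <assert.h>

/* Assumed values for this sketch; the kernel defines these elsewhere. */
#define MIGRATE_PCPTYPES        3
#define PAGE_ALLOC_COSTLY_ORDER 3
#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

/* Hypothetical model: each list is reduced to a count of its entries. */
struct pcp_model {
	int count;                      /* base pages cached across all lists */
	unsigned int last_pindex;       /* list the previous drain stopped at */
	unsigned int nr[NR_PCP_LISTS];  /* entries on each list */
};

unsigned int pindex_to_order(unsigned int pindex)
{
	return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
}

/* Cache one entry on the given list, accounting for it in base pages. */
void pcp_add(struct pcp_model *pcp, unsigned int pindex)
{
	assert(pindex < NR_PCP_LISTS);
	pcp->nr[pindex]++;
	pcp->count += 1 << pindex_to_order(pindex);
}

/* Round-robin drain; returns the number of base pages actually freed. */
unsigned int pcp_drain(struct pcp_model *pcp, int count)
{
	/* The pre-increment in the search loop resumes at last_pindex. */
	unsigned int pindex = pcp->last_pindex - 1;
	unsigned int batch_free = 0, nr_freed = 0;

	/* Guard added for the sketch: never search for pages that are not there. */
	if (count > pcp->count)
		count = pcp->count;

	while (count > 0) {
		unsigned int order;

		/* Skip empty lists, counting how many lists were visited. */
		do {
			batch_free++;
			if (++pindex == NR_PCP_LISTS)
				pindex = 0;
		} while (pcp->nr[pindex] == 0);

		/* A full lap was needed to find this list: drain from it freely. */
		if (batch_free == NR_PCP_LISTS)
			batch_free = (unsigned int)count;

		order = pindex_to_order(pindex);
		do {
			pcp->nr[pindex]--;
			nr_freed += 1u << order;
			count -= 1 << order;   /* may go negative for high orders */
		} while (count > 0 && --batch_free && pcp->nr[pindex] > 0);
		pcp->last_pindex = pindex;
	}
	pcp->count -= (int)nr_freed;
	return nr_freed;
}
```

With four order-0 entries on list 0 and two order-3 entries on list 5, a
request to drain 4 pages frees one order-0 page and one order-3 page, 9 base
pages in total, and leaves last_pindex at 5. A second request for 4 pages
resumes at list 5 and frees the remaining order-3 entry. This is also why the
count has to be adjusted by the drain routine rather than the caller.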
Changelog since v6
o Guarantee that per-cpu lists are drained regardless of batch size
o Dropped patch 1 as Andrew already picked it up
Changelog since v5
o Changelog clarification in patch 1
o Additional comments in patch 2
Changelog since v4
o Avoid pcp->count getting out of sync if struct page gets corrupted
Changelog since v3
o Allow high-order atomic allocations to use reserves
Changelog since v2
o Correct initialisation to avoid -Woverflow warning
SLUB has been the default small kernel object allocator for quite some time
but it is not universally used due to performance concerns and a reliance
on high-order pages. The high-order concerns have two major components --
high-order pages are not always available and high-order page allocations
potentially contend on the zone->lock. This patch addresses some concerns
about the zone lock contention by extending the per-cpu page allocator to
cache high-order pages. The patch makes the following modifications
o New per-cpu lists are added to cache the high-order pages. This increases
the cache footprint of the per-cpu allocator and overall usage but for
some workloads, this will be offset by reduced contention on zone->lock.
The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The
remaining are high-order caches up to and including
PAGE_ALLOC_COSTLY_ORDER
o pcp accounting during free is now confined to free_pcppages_bulk as it's
impossible for the caller to know exactly how many pages were freed.
Due to the high-order caches, the number of pages drained for a request
is no longer precise.
o The high watermark for per-cpu pages is increased to reduce the probability
that a single refill causes a drain on the next free.
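As a rough illustration of the list layout and the new high watermark, the
following standalone sketch reproduces the index mapping and the watermark
calculation outside the kernel. The two constants are assumed to be 3 and 3
here, and pcp_high() and check_layout() are hypothetical helper names used
only for this example.

```c
#include <assert.h>

/* Assumed values for this sketch; the kernel defines these elsewhere. */
#define MIGRATE_PCPTYPES        3
#define PAGE_ALLOC_COSTLY_ORDER 3
#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

/* Lists 0..MIGRATE_PCPTYPES-1 hold order-0 pages, one list per migratetype. */
unsigned int pindex_to_order(unsigned int pindex)
{
	return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
}

/* High-order pages share one list per order, regardless of migratetype. */
unsigned int order_to_pindex(int migratetype, unsigned int order)
{
	return order == 0 ? (unsigned int)migratetype
			  : MIGRATE_PCPTYPES + order - 1;
}

/* High watermark: one batch per order-0 list plus one costly-order batch. */
unsigned long pcp_high(unsigned long batch)
{
	return batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER);
}

/* Sanity check: every high order maps to a list that maps back to it. */
void check_layout(void)
{
	unsigned int order;

	for (order = 1; order <= PAGE_ALLOC_COSTLY_ORDER; order++)
		assert(pindex_to_order(order_to_pindex(0, order)) == order);
	assert(order_to_pindex(0, PAGE_ALLOC_COSTLY_ORDER) == NR_PCP_LISTS - 1);
}
```

For a hypothetical batch of 31, pcp_high() gives 341 pages, where the previous
6 * batch calculation gave 186.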
The benefit depends on both the workload and the machine as ultimately the
determining factor is whether cache line bounces on zone->lock or contention
is a problem. The patch was tested on a variety of workloads and machines,
some of which are reported here.
This is the result from netperf running UDP_STREAM on localhost. It was
selected on the basis that it is slab-intensive and has been the subject
of previous SLAB vs SLUB comparisons with the caveat that this is not
testing between two physical hosts.
2-socket modern machine
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v7
Hmean send-64 178.38 ( 0.00%) 263.48 ( 47.71%)
Hmean send-128 351.49 ( 0.00%) 523.69 ( 48.99%)
Hmean send-256 671.23 ( 0.00%) 1021.92 ( 52.24%)
Hmean send-1024 2663.60 ( 0.00%) 3909.75 ( 46.78%)
Hmean send-2048 5126.53 ( 0.00%) 7365.98 ( 43.68%)
Hmean send-3312 7949.99 ( 0.00%) 11077.98 ( 39.35%)
Hmean send-4096 9433.56 ( 0.00%) 12715.42 ( 34.79%)
Hmean send-8192 15940.64 ( 0.00%) 22322.39 ( 40.03%)
Hmean send-16384 26699.54 ( 0.00%) 32918.05 ( 23.29%)
Hmean recv-64 178.38 ( 0.00%) 263.46 ( 47.70%)
Hmean recv-128 351.49 ( 0.00%) 523.65 ( 48.98%)
Hmean recv-256 671.20 ( 0.00%) 1021.54 ( 52.20%)
Hmean recv-1024 2663.45 ( 0.00%) 3909.13 ( 46.77%)
Hmean recv-2048 5126.26 ( 0.00%) 7364.61 ( 43.66%)
Hmean recv-3312 7949.50 ( 0.00%) 11076.31 ( 39.33%)
Hmean recv-4096 9433.04 ( 0.00%) 12713.49 ( 34.78%)
Hmean recv-8192 15939.64 ( 0.00%) 22320.05 ( 40.03%)
Hmean recv-16384 26698.44 ( 0.00%) 32913.66 ( 23.28%)
1-socket 6 year old machine
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v7
Hmean send-64 87.47 ( 0.00%) 127.41 ( 45.67%)
Hmean send-128 174.36 ( 0.00%) 256.71 ( 47.23%)
Hmean send-256 347.52 ( 0.00%) 506.40 ( 45.72%)
Hmean send-1024 1363.03 ( 0.00%) 1968.24 ( 44.40%)
Hmean send-2048 2632.68 ( 0.00%) 3742.86 ( 42.17%)
Hmean send-3312 4123.19 ( 0.00%) 5849.80 ( 41.88%)
Hmean send-4096 5056.48 ( 0.00%) 7119.10 ( 40.79%)
Hmean send-8192 8784.22 ( 0.00%) 12161.53 ( 38.45%)
Hmean send-16384 15081.60 ( 0.00%) 19418.36 ( 28.76%)
Hmean recv-64 86.19 ( 0.00%) 126.84 ( 47.16%)
Hmean recv-128 173.93 ( 0.00%) 255.62 ( 46.96%)
Hmean recv-256 346.19 ( 0.00%) 503.73 ( 45.51%)
Hmean recv-1024 1358.28 ( 0.00%) 1957.11 ( 44.09%)
Hmean recv-2048 2623.45 ( 0.00%) 3716.88 ( 41.68%)
Hmean recv-3312 4108.63 ( 0.00%) 5810.21 ( 41.41%)
Hmean recv-4096 5037.25 ( 0.00%) 7067.22 ( 40.30%)
Hmean recv-8192 8762.32 ( 0.00%) 12080.40 ( 37.87%)
Hmean recv-16384 15042.36 ( 0.00%) 19291.29 ( 28.25%)
This is somewhat dramatic but it's also not universal. For example, it was
observed on an older HP machine using pcc-cpufreq that there was almost
no difference but pcc-cpufreq is also a known performance hazard.
These are quite different results but illustrate that the patch is
dependent on the CPU. The results are similar for TCP_STREAM on
the two-socket machine.
The observations on sockperf are different.
2-socket modern machine
sockperf-tcp-throughput
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v7
Hmean 14 93.90 ( 0.00%) 92.74 ( -1.23%)
Hmean 100 1211.02 ( 0.00%) 1284.36 ( 6.05%)
Hmean 300 6016.95 ( 0.00%) 6149.26 ( 2.20%)
Hmean 500 8846.20 ( 0.00%) 8988.84 ( 1.61%)
Hmean 850 12280.71 ( 0.00%) 12434.78 ( 1.25%)
Stddev 14 5.32 ( 0.00%) 4.79 ( 9.88%)
Stddev 100 35.32 ( 0.00%) 74.20 (-110.06%)
Stddev 300 132.63 ( 0.00%) 65.50 ( 50.61%)
Stddev 500 152.90 ( 0.00%) 188.67 (-23.40%)
Stddev 850 221.46 ( 0.00%) 257.61 (-16.32%)
sockperf-udp-throughput
4.9.0-rc5 4.9.0-rc5
vanilla hopcpu-v5
Hmean 14 36.32 ( 0.00%) 51.47 ( 41.70%)
Hmean 100 258.41 ( 0.00%) 364.07 ( 40.89%)
Hmean 300 773.96 ( 0.00%) 1065.13 ( 37.62%)
Hmean 500 1291.07 ( 0.00%) 1769.25 ( 37.04%)
Hmean 850 2137.88 ( 0.00%) 2981.96 ( 39.48%)
Stddev 14 0.75 ( 0.00%) 0.84 (-11.79%)
Stddev 100 9.02 ( 0.00%) 10.53 (-16.77%)
Stddev 300 13.66 ( 0.00%) 20.02 (-46.53%)
Stddev 500 25.01 ( 0.00%) 45.16 (-80.56%)
Stddev 850 37.72 ( 0.00%) 67.67 (-79.41%)
Note that the improvements for TCP are nowhere near as dramatic as with netperf;
there is a slight loss for small packets and the results are much more variable.
While it's not presented here, it's known that when running sockperf "under load",
packet latency is generally lower but not universally so. On the
other hand, UDP throughput improves but, again, is much more variable.
This highlights that the patch is not necessarily a universal win and is
going to depend heavily on both the workload and the CPU used.
hackbench was also tested with both sockets and pipes and both processes
and threads and the results are interesting in terms of how variability
is impacted.
1-socket machine
hackbench-process-pipes
4.9.0-rc5 4.9.0-rc5
vanilla highmark-v7
Amean 1 12.9637 ( 0.00%) 13.1807 ( -1.67%)
Amean 3 13.4770 ( 0.00%) 13.6803 ( -1.51%)
Amean 5 18.5333 ( 0.00%) 18.7383 ( -1.11%)
Amean 7 24.5690 ( 0.00%) 23.0550 ( 6.16%)
Amean 12 39.7990 ( 0.00%) 36.7207 ( 7.73%)
Amean 16 56.0520 ( 0.00%) 48.2890 ( 13.85%)
Stddev 1 0.3847 ( 0.00%) 0.5853 (-52.15%)
Stddev 3 0.2652 ( 0.00%) 0.0295 ( 88.89%)
Stddev 5 0.5589 ( 0.00%) 0.2466 ( 55.87%)
Stddev 7 0.5310 ( 0.00%) 0.6680 (-25.79%)
Stddev 12 1.0780 ( 0.00%) 0.3230 ( 70.04%)
Stddev 16 2.1138 ( 0.00%) 0.6835 ( 67.66%)
hackbench-process-sockets
Amean 1 4.8873 ( 0.00%) 4.7180 ( 3.46%)
Amean 3 14.1157 ( 0.00%) 14.3643 ( -1.76%)
Amean 5 22.5537 ( 0.00%) 23.1380 ( -2.59%)
Amean 7 30.3743 ( 0.00%) 31.1520 ( -2.56%)
Amean 12 49.1773 ( 0.00%) 50.3060 ( -2.30%)
Amean 16 64.0873 ( 0.00%) 66.2633 ( -3.40%)
Stddev 1 0.2360 ( 0.00%) 0.2201 ( 6.74%)
Stddev 3 0.0539 ( 0.00%) 0.0780 (-44.72%)
Stddev 5 0.1463 ( 0.00%) 0.1579 ( -7.90%)
Stddev 7 0.1260 ( 0.00%) 0.3091 (-145.31%)
Stddev 12 0.2169 ( 0.00%) 0.4822 (-122.36%)
Stddev 16 0.0529 ( 0.00%) 0.4513 (-753.20%)
It's not a universal win for pipes but the differences are within the
noise. What is interesting is that variability shows both gains and losses
in stark contrast to the sockperf results. On the other hand, sockets
generally show small losses albeit within the noise with more variability.
Once again, different workloads and CPUs give different results.
fsmark was tested with zero-sized files to continually allocate slab objects
but didn't show any differences. This can be explained by the fact that the
workload is only allocating and does not have a mix of allocs/frees that would
benefit from the caching. It was tested to ensure no major harm was done.
While it is recognised that this is a mixed bag of results, the patch
helps a lot more workloads than it hurts and intuitively, avoiding the
zone->lock in some cases is a good thing.
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
include/linux/mmzone.h | 25 +++++++++-
mm/page_alloc.c | 121 +++++++++++++++++++++++++++++++------------------
2 files changed, 101 insertions(+), 45 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3a2fed..9c44aa96533d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -255,6 +255,24 @@ enum zone_watermarks {
NR_WMARK
};
+/*
+ * One per migratetype for order-0 pages and one per high-order up to
+ * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
+ * allocations to contaminate reclaimable pageblocks if high-order
+ * pages are heavily used.
+ */
+#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)
+
+static inline unsigned int pindex_to_order(unsigned int pindex)
+{
+ return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
+}
+
+static inline unsigned int order_to_pindex(int migratetype, unsigned int order)
+{
+ return (order == 0) ? migratetype : MIGRATE_PCPTYPES + order - 1;
+}
+
#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
@@ -263,9 +281,14 @@ struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
+ int last_pindex; /* last pindex bulk reclaimed.
+ * This guarantees that all lists
+ * will eventually be shrunk regardless
+ * of batch size.
+ */
/* Lists of pages, one per migrate type stored on the pcp-lists */
- struct list_head lists[MIGRATE_PCPTYPES];
+ struct list_head lists[NR_PCP_LISTS];
};
struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34ada718ef47..0a4692ba2370 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1050,9 +1050,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
}
#ifdef CONFIG_DEBUG_VM
-static inline bool free_pcp_prepare(struct page *page)
+static inline bool free_pcp_prepare(struct page *page, unsigned int order)
{
- return free_pages_prepare(page, 0, true);
+ return free_pages_prepare(page, order, true);
}
static inline bool bulkfree_pcp_prepare(struct page *page)
@@ -1060,9 +1060,9 @@ static inline bool bulkfree_pcp_prepare(struct page *page)
return false;
}
#else
-static bool free_pcp_prepare(struct page *page)
+static bool free_pcp_prepare(struct page *page, unsigned int order)
{
- return free_pages_prepare(page, 0, false);
+ return free_pages_prepare(page, order, false);
}
static bool bulkfree_pcp_prepare(struct page *page)
@@ -1084,9 +1084,10 @@ static bool bulkfree_pcp_prepare(struct page *page)
*/
static void free_pcppages_bulk(struct zone *zone, int count,
struct per_cpu_pages *pcp)
-{
- int migratetype = 0;
- int batch_free = 0;
+{ /* Reclaim will start at the same index as it previously stopped at */
+ unsigned int pindex = pcp->last_pindex - 1;
+ unsigned int batch_free = 0;
+ unsigned int nr_freed = 0;
unsigned long nr_scanned;
bool isolated_pageblocks;
@@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (nr_scanned)
__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
- while (count) {
+ while (count > 0) {
struct page *page;
struct list_head *list;
+ unsigned int order;
/*
* Remove pages from lists in a round-robin fashion. A
* batch_free count is maintained that is incremented when an
- * empty list is encountered. This is so more pages are freed
- * off fuller lists instead of spinning excessively around empty
- * lists
+ * empty list is encountered. This is not exact due to
+ * high-order but precision is not required.
*/
do {
batch_free++;
- if (++migratetype == MIGRATE_PCPTYPES)
- migratetype = 0;
- list = &pcp->lists[migratetype];
+ if (++pindex == NR_PCP_LISTS)
+ pindex = 0;
+ list = &pcp->lists[pindex];
} while (list_empty(list));
/* This is the only non-empty list. Free them all. */
- if (batch_free == MIGRATE_PCPTYPES)
+ if (batch_free == NR_PCP_LISTS)
batch_free = count;
+ order = pindex_to_order(pindex);
do {
int mt; /* migratetype of the to-be-freed page */
@@ -1132,14 +1134,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (unlikely(isolated_pageblocks))
mt = get_pageblock_migratetype(page);
+ nr_freed += (1 << order);
+ count -= (1 << order);
if (bulkfree_pcp_prepare(page))
continue;
- __free_one_page(page, page_to_pfn(page), zone, 0, mt);
- trace_mm_page_pcpu_drain(page, 0, mt);
- } while (--count && --batch_free && !list_empty(list));
+ __free_one_page(page, page_to_pfn(page), zone, order, mt);
+ trace_mm_page_pcpu_drain(page, order, mt);
+ } while (count > 0 && --batch_free && !list_empty(list));
+ pcp->last_pindex = pindex;
}
spin_unlock(&zone->lock);
+ pcp->count -= nr_freed;
}
static void free_one_page(struct zone *zone,
@@ -2251,10 +2257,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
local_irq_save(flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
- if (to_drain > 0) {
+ if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, pcp);
- pcp->count -= to_drain;
- }
local_irq_restore(flags);
}
#endif
@@ -2276,10 +2280,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
pset = per_cpu_ptr(zone->pageset, cpu);
pcp = &pset->pcp;
- if (pcp->count) {
+ if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);
- pcp->count = 0;
- }
local_irq_restore(flags);
}
@@ -2411,18 +2413,18 @@ void mark_free_pages(struct zone *zone)
#endif /* CONFIG_PM */
/*
- * Free a 0-order page
+ * Free a pcp page
* cold == true ? free a cold page : free a hot page
*/
-void free_hot_cold_page(struct page *page, bool cold)
+static void __free_hot_cold_page(struct page *page, bool cold, unsigned int order)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
unsigned long pfn = page_to_pfn(page);
- int migratetype;
+ int migratetype, pindex;
- if (!free_pcp_prepare(page))
+ if (!free_pcp_prepare(page, order))
return;
migratetype = get_pfnblock_migratetype(page, pfn);
@@ -2439,28 +2441,33 @@ void free_hot_cold_page(struct page *page, bool cold)
*/
if (migratetype >= MIGRATE_PCPTYPES) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(zone, page, pfn, 0, migratetype);
+ free_one_page(zone, page, pfn, order, migratetype);
goto out;
}
migratetype = MIGRATE_MOVABLE;
}
+ pindex = order_to_pindex(migratetype, order);
pcp = &this_cpu_ptr(zone->pageset)->pcp;
if (!cold)
- list_add(&page->lru, &pcp->lists[migratetype]);
+ list_add(&page->lru, &pcp->lists[pindex]);
else
- list_add_tail(&page->lru, &pcp->lists[migratetype]);
- pcp->count++;
+ list_add_tail(&page->lru, &pcp->lists[pindex]);
+ pcp->count += 1 << order;
if (pcp->count >= pcp->high) {
unsigned long batch = READ_ONCE(pcp->batch);
free_pcppages_bulk(zone, batch, pcp);
- pcp->count -= batch;
}
out:
local_irq_restore(flags);
}
+void free_hot_cold_page(struct page *page, bool cold)
+{
+ __free_hot_cold_page(page, cold, 0);
+}
+
/*
* Free a list of 0-order pages
*/
@@ -2596,20 +2603,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
struct page *page;
bool cold = ((gfp_flags & __GFP_COLD) != 0);
- if (likely(order == 0)) {
+ if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
struct per_cpu_pages *pcp;
struct list_head *list;
local_irq_save(flags);
do {
+ unsigned int pindex;
+
+ pindex = order_to_pindex(migratetype, order);
pcp = &this_cpu_ptr(zone->pageset)->pcp;
- list = &pcp->lists[migratetype];
+ list = &pcp->lists[pindex];
if (list_empty(list)) {
- pcp->count += rmqueue_bulk(zone, 0,
+ int nr_pages = rmqueue_bulk(zone, order,
pcp->batch, list,
migratetype, cold);
- if (unlikely(list_empty(list)))
+ if (unlikely(list_empty(list))) {
+ /*
+ * Retry high-order atomic allocs
+ * from the buddy list which may
+ * use MIGRATE_HIGHATOMIC.
+ */
+ if (order && (alloc_flags & ALLOC_HARDER))
+ goto try_buddylist;
+
goto failed;
+ }
+ pcp->count += (nr_pages << order);
}
if (cold)
@@ -2618,10 +2638,11 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
page = list_first_entry(list, struct page, lru);
list_del(&page->lru);
- pcp->count--;
+ pcp->count -= (1 << order);
} while (check_new_pcp(page));
} else {
+try_buddylist:
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
@@ -3845,8 +3866,8 @@ EXPORT_SYMBOL(get_zeroed_page);
void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
- if (order == 0)
- free_hot_cold_page(page, false);
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ __free_hot_cold_page(page, false, order);
else
__free_pages_ok(page, order);
}
@@ -5168,20 +5189,32 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
/* a companion to pageset_set_high() */
static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
{
- pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
+ unsigned long high;
+
+ /*
+ * per-cpu refills occur when a per-cpu list for a migratetype
+ * or a high-order is depleted even if pages are free overall.
+ * Tune the high watermark such that it's unlikely, but not
+ * impossible, that a single refill event will trigger a
+ * shrink on the next free to the per-cpu list.
+ */
+ high = batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER);
+
+ pageset_update(&p->pcp, high, max(1UL, 1 * batch));
}
static void pageset_init(struct per_cpu_pageset *p)
{
struct per_cpu_pages *pcp;
- int migratetype;
+ unsigned int pindex;
memset(p, 0, sizeof(*p));
pcp = &p->pcp;
pcp->count = 0;
- for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
- INIT_LIST_HEAD(&pcp->lists[migratetype]);
+ pcp->last_pindex = 0;
+ for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
+ INIT_LIST_HEAD(&pcp->lists[pindex]);
}
static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
--
2.10.2
On Wed, 7 Dec 2016, Mel Gorman wrote:
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
SLUB does not rely on high order pages. It falls back to lower order if
the higher orders are not available. It's a performance concern.
This is also an issue for various other kernel subsystems that really
would like to have larger contiguous memory area. We are often seeing
performance constraints due to the high number of 4k segments when doing
large scale block I/O f.e.
Otherwise I really like what I am seeing here.
On Wed, Dec 07, 2016 at 08:52:27AM -0600, Christoph Lameter wrote:
> On Wed, 7 Dec 2016, Mel Gorman wrote:
>
> > SLUB has been the default small kernel object allocator for quite some time
> > but it is not universally used due to performance concerns and a reliance
> > on high-order pages. The high-order concerns has two major components --
>
> SLUB does not rely on high order pages. It falls back to lower order if
> the higher orders are not available. Its a performance concern.
>
Ok -- While SLUB does not rely on high-order pages for functional
correctness, it performs better if high-order pages are available.
> This is also an issue for various other kernel subsystems that really
> would like to have larger contiguous memory area. We are often seeing
> performance constraints due to the high number of 4k segments when doing
> large scale block I/O f.e.
>
Which is related to the fundamentals of fragmentation control in
general. At some point there will have to be a revisit to get back to
the type of reliability that existed in 3.0-era without the massive
overhead it incurred. As stated before, I agree it's important but
outside the scope of this patch.
--
Mel Gorman
SUSE Labs
On Wed, 7 Dec 2016, Mel Gorman wrote:
> Which is related to the fundamentals of fragmentation control in
> general. At some point there will have to be a revisit to get back to
> the type of reliability that existed in 3.0-era without the massive
> overhead it incurred. As stated before, I agree it's important but
> outside the scope of this patch.
What reliability issues are there? 3.X kernels were better in what
way? Which overhead are we talking about?
Fragmentation has been a problem for a long time and the issue gets worse
as memory sizes increase, the hardware improves and the expectations on
throughput and reliability increase.
On Wed, Dec 07, 2016 at 10:40:47AM -0600, Christoph Lameter wrote:
> On Wed, 7 Dec 2016, Mel Gorman wrote:
>
> > Which is related to the fundamentals of fragmentation control in
> > general. At some point there will have to be a revisit to get back to
> > the type of reliability that existed in 3.0-era without the massive
> > overhead it incurred. As stated before, I agree it's important but
> > outside the scope of this patch.
>
> What reliability issues are there? 3.X kernels were better in what
> way? Which overhead are we talking about?
>
3.0-era kernels had better fragmentation control, higher success rates at
allocation etc. I vaguely recall that it had fewer sources of high-order
allocations but I don't remember specifics and part of that could be the
lack of THP at the time. The overhead was massive due to massive stalls
and excessive reclaim -- hours to complete some high-allocation stress
tests even if the success rate was high.
--
Mel Gorman
SUSE Labs
On Wed, 7 Dec 2016, Mel Gorman wrote:
> 3.0-era kernels had better fragmentation control, higher success rates at
> allocation etc. I vaguely recall that it had fewer sources of high-order
> allocations but I don't remember specifics and part of that could be the
> lack of THP at the time. The overhead was massive due to massive stalls
> and excessive reclaim -- hours to complete some high-allocation stress
> tests even if the success rate was high.
There were a couple of high order page reclaim improvements implemented
at that time that were later abandoned. I think higher order pages were
more available than now. SLUB was regularly able to get higher order pages.
On Wed, Dec 07, 2016 at 11:11:08AM -0600, Christoph Lameter wrote:
> On Wed, 7 Dec 2016, Mel Gorman wrote:
>
> > 3.0-era kernels had better fragmentation control, higher success rates at
> > allocation etc. I vaguely recall that it had fewer sources of high-order
> > allocations but I don't remember specifics and part of that could be the
> > lack of THP at the time. The overhead was massive due to massive stalls
> > and excessive reclaim -- hours to complete some high-allocation stress
> > tests even if the success rate was high.
>
> There were a couple of high order page reclaim improvements implemented
> at that time that were later abandoned. I think higher order pages were
> more available than now.
There were, the cost was high -- lumpy reclaim was a major source of the
cost but not the only one. The cost of allocation offset any benefit of
having them. At least for hugepages it did, I don't know about SLUB because
I didn't quantify if the benefit of SLUB using huge pages was offset by
the allocation cost (I doubt it). The cost later became intolerable when
THP started hitting those paths routinely.
It's not simply a case of going back to how fragmentation control was
managed then because it'll simply reintroduce excessive stalls in
allocation paths.
--
Mel Gorman
SUSE Labs
On Wed, 2016-12-07 at 10:12 +0000, Mel Gorman wrote:
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
>
Interesting results.
netperf UDP_STREAM is not really slab intensive : (for large sendsizes
like 16KB)
Bulk of the storage should be allocated from alloc_skb_with_frags(),
ie using pages.
And I am not sure we enabled high order pages in this path ?
ip_make_skb()
__ip_append_data()
sock_alloc_send_skb()
sock_alloc_send_pskb (..., max_page_order=0)
alloc_skb_with_frags ( max_page_order=0)
So far, I believe net/unix/af_unix.c uses PAGE_ALLOC_COSTLY_ORDER as
max_order, but UDP does not do that yet.
We probably could enable high-order pages there, if we believe this is
okay.
Or maybe I missed and this already happened ? ;)
Thanks.
On Wed, 2016-12-07 at 11:00 -0800, Eric Dumazet wrote:
>
> So far, I believe net/unix/af_unix.c uses PAGE_ALLOC_COSTLY_ORDER as
> max_order, but UDP does not do that yet.
For af_unix, it happened in
https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=28d6427109d13b0f447cba5761f88d3548e83605
This came to fix a regression, since we had a gigantic slab allocation
in af_unix before
https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=eb6a24816b247c0be6b2e97e68933072874bbe54
On Wed, Dec 07, 2016 at 11:00:49AM -0800, Eric Dumazet wrote:
> On Wed, 2016-12-07 at 10:12 +0000, Mel Gorman wrote:
>
> > This is the result from netperf running UDP_STREAM on localhost. It was
> > selected on the basis that it is slab-intensive and has been the subject
> > of previous SLAB vs SLUB comparisons with the caveat that this is not
> > testing between two physical hosts.
> >
>
> Interesting results.
>
> netperf UDP_STREAM is not really slab intensive : (for large sendsizes
> like 16KB)
>
Interesting because it didn't match what I previously measured but then
again, when I established that netperf on localhost was slab intensive,
it was also an older kernel. Can you tell me if SLAB or SLUB was enabled
in your test kernel?
Either that or the baseline I used has since been changed from what you
are testing and we're not hitting the same paths.
> Bulk of the storage should be allocated from alloc_skb_with_frags(),
> ie using pages.
>
> And I am not sure we enabled high order pages in this path ?
>
> ip_make_skb()
> __ip_append_data()
> sock_alloc_send_skb()
> sock_alloc_send_pskb (..., max_page_order=0)
> alloc_skb_with_frags ( max_page_order=0)
>
It doesn't look like it. While it's not directly related to this patch,
can you give the full stack? I'm particularly curious to see if these
allocations are in an IRQ path or not.
> We probably could enable high-order pages there, if we believe this is
> okay.
>
Ultimately, not a great idea unless you want variable performance depending
on whether high-order pages are available or not. The motivation for the
patch was primarily for SLUB-intensive workloads.
--
Mel Gorman
SUSE Labs
On Wed, 2016-12-07 at 19:48 +0000, Mel Gorman wrote:
>
>
> Interesting because it didn't match what I previous measured but then
> again, when I established that netperf on localhost was slab intensive,
> it was also an older kernel. Can you tell me if SLAB or SLUB was enabled
> in your test kernel?
>
> Either that or the baseline I used has since been changed from what you
> are testing and we're not hitting the same paths.
lpaa6:~# uname -a
Linux lpaa6 4.9.0-smp-DEV #429 SMP @1481125332 x86_64 GNU/Linux
lpaa6:~# perf record -g ./netperf -t UDP_STREAM -l 3 -- -m 16384
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
localhost () port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
212992 16384 3.00 654644 0 28601.04
212992 3.00 654592 28598.77
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.888 MB perf.data (~82481 samples) ]
perf report --stdio
...
1.92% netperf [kernel.kallsyms] [k]
cache_alloc_refill
|
--- cache_alloc_refill
|
|--82.22%-- kmem_cache_alloc_node_trace
| __kmalloc_node_track_caller
| __alloc_skb
| alloc_skb_with_frags
| sock_alloc_send_pskb
| sock_alloc_send_skb
| __ip_append_data.isra.50
| ip_make_skb
| udp_sendmsg
| inet_sendmsg
| sock_sendmsg
| SYSC_sendto
| sys_sendto
| entry_SYSCALL_64_fastpath
| __sendto_nocancel
| |
| --100.00%-- 0x0
|
Oh wait, sock_alloc_send_skb() requests all the bytes in skb->head :
struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
int noblock, int *errcode)
{
return sock_alloc_send_pskb(sk, size, 0, noblock, errcode, 0);
}
Maybe one day we will avoid doing order-4 (or even order-5 in extreme
cases !) allocations for loopback as we did for af_unix :P
I mean, maybe some applications are sending 64KB UDP messages over
loopback right now...
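To put numbers on the orders mentioned above, here is a small userspace
sketch that computes the smallest page order able to hold a linear buffer.
It assumes 4 KiB pages; size_to_order() and order_uses_pcp() are hypothetical
names for this example, and any per-skb overhead added to the message size
by a caller is a made-up figure, not a value taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT              12               /* assumes 4 KiB pages */
#define PAGE_SIZE               (1UL << PAGE_SHIFT)
#define PAGE_ALLOC_COSTLY_ORDER 3                /* assumed value */
#define SKETCH_MAX_ORDER        10               /* bound for this sketch only */

/* Smallest order such that (PAGE_SIZE << order) is at least size bytes. */
unsigned int size_to_order(size_t size)
{
	unsigned int order = 0;

	assert(size <= (PAGE_SIZE << SKETCH_MAX_ORDER));
	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Orders the per-cpu lists would cache with the patch under discussion. */
int order_uses_pcp(unsigned int order)
{
	return order <= PAGE_ALLOC_COSTLY_ORDER;
}
```

With a hypothetical 320 bytes of overhead, a 16384-byte message needs an
order-3 allocation, which the per-cpu lists in the patch would cache, while
a 64KB message needs order-5 and would still go to the buddy lists.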
On Wed, Dec 07, 2016 at 12:10:24PM -0800, Eric Dumazet wrote:
> On Wed, 2016-12-07 at 19:48 +0000, Mel Gorman wrote:
> >
> >
> > Interesting because it didn't match what I previous measured but then
> > again, when I established that netperf on localhost was slab intensive,
> > it was also an older kernel. Can you tell me if SLAB or SLUB was enabled
> > in your test kernel?
> >
> > Either that or the baseline I used has since been changed from what you
> > are testing and we're not hitting the same paths.
>
>
> lpaa6:~# uname -a
> Linux lpaa6 4.9.0-smp-DEV #429 SMP @1481125332 x86_64 GNU/Linux
>
> lpaa6:~# perf record -g ./netperf -t UDP_STREAM -l 3 -- -m 16384
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost () port 0 AF_INET
> Socket Message Elapsed Messages
> Size Size Time Okay Errors Throughput
> bytes bytes secs # # 10^6bits/sec
>
> 212992 16384 3.00 654644 0 28601.04
> 212992 3.00 654592 28598.77
>
I'm seeing parts of the disconnect. The load is slab intensive but not
necessarily page allocator intensive depending on a variety of factors. While
the motivation of the patch was initially SLUB, any path that is high-order
page allocator intensive benefits so;
1. If the workload is slab intensive and SLUB is used then it may benefit
if SLUB happens to frequently require new pages, particularly if there
is a pattern of growing/shrinking slabs frequently.
2. If the workload is high-order page allocator intensive but bypassing
SLUB and SLAB, then it'll benefit anyway
So you say you don't see much slab activity for some configuration and
it's hitting the page allocator. For the purposes of this patch, that's
fine albeit useless for a SLAB vs SLUB comparison.
Anything else I saw for the moment is probably not surprising;
At small packet sizes on localhost, I see relatively low page allocator
activity except during the socket setup and other unrelated activity
(khugepaged, irqbalance, some btrfs stuff) which is curious as it's
less clear why the performance was improved in that case. I considered
the possibility that it was cache hotness of pages but that's not a
good fit. If it was true then the first test would be slow and the rest
relatively fast and I'm not seeing that. The other side-effect is that
all the high-order pages that are allocated at the start are physically
close together but that shouldn't have that big an impact. So for now,
the gain is unexplained even though it happens consistently.
At larger message sizes to localhost, it's page allocator intensive through
paths like this
netperf-3887 [032] .... 393.246420: mm_page_alloc: page=ffffea0021272200 pfn=8690824 order=3 migratetype=0 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_REPEAT|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_NOTRACK
netperf-3887 [032] .... 393.246421: <stack trace>
=> kmalloc_large_node+0x60/0x8d <ffffffff812101c3>
=> __kmalloc_node_track_caller+0x245/0x280 <ffffffff811f0415>
=> __kmalloc_reserve.isra.35+0x31/0x90 <ffffffff81674b61>
=> __alloc_skb+0x7e/0x280 <ffffffff81676bce>
=> alloc_skb_with_frags+0x5a/0x1c0 <ffffffff81676e2a>
=> sock_alloc_send_pskb+0x19e/0x200 <ffffffff816721fe>
=> sock_alloc_send_skb+0x18/0x20 <ffffffff81672278>
=> __ip_append_data.isra.46+0x61d/0xa00 <ffffffff816cf78d>
=> ip_make_skb+0xc2/0x110 <ffffffff816d1c72>
=> udp_sendmsg+0x2c0/0xa40 <ffffffff816f9930>
=> inet_sendmsg+0x7f/0xb0 <ffffffff8170655f>
=> sock_sendmsg+0x38/0x50 <ffffffff8166d9f8>
=> SYSC_sendto+0x102/0x190 <ffffffff8166de92>
=> SyS_sendto+0xe/0x10 <ffffffff8166e94e>
=> do_syscall_64+0x5b/0xd0 <ffffffff8100293b>
=> return_from_SYSCALL_64+0x0/0x6a <ffffffff8178e7af>
It's going through the SLUB paths but finding the allocation is too large
and hitting the page allocator instead. This is using 4.9-rc5 as a baseline
so fixes might be missing.
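To make the order=3 in the trace above concrete, here is a minimal sketch of how a byte size maps to a page order. It assumes 4096-byte pages, and the 320 bytes of per-skb overhead is a made-up illustrative figure, not something measured in this thread; page_order is a hypothetical helper, not a kernel interface.

```shell
#!/bin/sh
# Assumption: 4 KiB base pages (the x86-64 default).
PAGE_SIZE=4096

# Print the smallest order such that (PAGE_SIZE << order) >= size.
page_order() {
    size=$1
    order=0
    span=$PAGE_SIZE
    while [ "$span" -lt "$size" ]; do
        span=$((span * 2))
        order=$((order + 1))
    done
    echo "$order"
}

# A 16384-byte message plus a hypothetical 320 bytes of overhead no
# longer fits in four pages, so the request rounds up to eight pages.
page_order $((16384 + 320))   # prints 3
page_order 16384              # prints 2
page_order 4096               # prints 0
```

The point is only that a payload at or just above a power-of-two boundary tips the request into the next order once any overhead is added.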
If using small messages to a remote host, I again see intense page
allocator activity via
netperf-4326 [047] .... 994.978387: mm_page_alloc: page=ffffea0041413400 pfn=17106128 order=2 migratetype=0 gfp_flags=__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_REPEAT|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_NOTRACK
netperf-4326 [047] .... 994.978387: <stack trace>
=> alloc_pages_current+0x88/0x120 <ffffffff811e1678>
=> new_slab+0x33f/0x580 <ffffffff811eb77f>
=> ___slab_alloc+0x352/0x4d0 <ffffffff811ec6a2>
=> __slab_alloc.isra.73+0x43/0x5e <ffffffff812105d0>
=> __kmalloc_node_track_caller+0xba/0x280 <ffffffff811f028a>
=> __kmalloc_reserve.isra.35+0x31/0x90 <ffffffff81674b61>
=> __alloc_skb+0x7e/0x280 <ffffffff81676bce>
=> alloc_skb_with_frags+0x5a/0x1c0 <ffffffff81676e2a>
=> sock_alloc_send_pskb+0x19e/0x200 <ffffffff816721fe>
=> sock_alloc_send_skb+0x18/0x20 <ffffffff81672278>
=> __ip_append_data.isra.46+0x61d/0xa00 <ffffffff816cf78d>
=> ip_make_skb+0xc2/0x110 <ffffffff816d1c72>
=> udp_sendmsg+0x2c0/0xa40 <ffffffff816f9930>
=> inet_sendmsg+0x7f/0xb0 <ffffffff8170655f>
=> sock_sendmsg+0x38/0x50 <ffffffff8166d9f8>
=> SYSC_sendto+0x102/0x190 <ffffffff8166de92>
=> SyS_sendto+0xe/0x10 <ffffffff8166e94e>
=> do_syscall_64+0x5b/0xd0 <ffffffff8100293b>
=> return_from_SYSCALL_64+0x0/0x6a <ffffffff8178e7af>
This is a slab path, but at different orders.
So while the patch was motivated by SLUB, a workload like this still
benefits because it generates intense page allocator activity.
> Maybe one day we will avoid doing order-4 (or even order-5 in extreme
> cases !) allocations for loopback as we did for af_unix :P
>
> I mean, maybe some applications are sending 64KB UDP messages over
> loopback right now...
>
Maybe but it's clear that even running "networking" workloads does not
necessarily mean that paths interesting to this patch are hit. Not
necessarily bad but it was always expected that the benefit of the patch
would be workload and configuration dependent.
--
Mel Gorman
SUSE Labs
On Wed, Dec 07, 2016 at 09:19:58PM +0000, Mel Gorman wrote:
> At small packet sizes on localhost, I see relatively low page allocator
> activity except during the socket setup and other unrelated activity
> (khugepaged, irqbalance, some btrfs stuff) which is curious as it's
> less clear why the performance was improved in that case. I considered
> the possibility that it was cache hotness of pages but that's not a
> good fit. If it was true then the first test would be slow and the rest
> relatively fast and I'm not seeing that. The other side-effect is that
> all the high-order pages that are allocated at the start are physically
> close together but that shouldn't have that big an impact. So for now,
> the gain is unexplained even though it happens consistently.
>
Further investigation led me to conclude that the netperf automation on
my side had some methodology errors that could account for an artificially
low score in some cases. The netperf automation is years old and would
have been developed against a much older and smaller machine, which may be
why I missed it until I went back and looked at exactly what the automation
was doing. At a minimum, in a server/client test on a remote machine there
was potentially higher packet loss than is acceptable. This would account
for why some machines "benefitted" while others did not: there would be
boot-to-boot variations in which some machines happened to be "lucky". I
believe I've corrected the errors, discarded all the old data and scheduled
a retest to see what falls out.
--
Mel Gorman
SUSE Labs
On Wed, 7 Dec 2016 23:25:31 +0000
Mel Gorman <[email protected]> wrote:
> On Wed, Dec 07, 2016 at 09:19:58PM +0000, Mel Gorman wrote:
> > At small packet sizes on localhost, I see relatively low page allocator
> > activity except during the socket setup and other unrelated activity
> > (khugepaged, irqbalance, some btrfs stuff) which is curious as it's
> > less clear why the performance was improved in that case. I considered
> > the possibility that it was cache hotness of pages but that's not a
> > good fit. If it was true then the first test would be slow and the rest
> > relatively fast and I'm not seeing that. The other side-effect is that
> > all the high-order pages that are allocated at the start are physically
> > close together but that shouldn't have that big an impact. So for now,
> > the gain is unexplained even though it happens consistently.
> >
>
> Further investigation led me to conclude that the netperf automation on
> my side had some methodology errors that could account for an artifically
> low score in some cases. The netperf automation is years old and would
> have been developed against a much older and smaller machine which may be
> why I missed it until I went back looking at exactly what the automation
> was doing. Minimally in a server/client test on remote maching there was
> potentially higher packet loss than is acceptable. This would account why
> some machines "benefitted" while others did not -- there would be boot to
> boot variations that some machines happened to be "lucky". I believe I've
> corrected the errors, discarded all the old data and scheduled a rest to
> see what falls out.
I guess you are talking about setting the netperf socket queue low
(+256 bytes above msg size), that I pointed out in[1]. I can see from
GitHub-mmtests-commit[2] "netperf: Set remote and local socket max
buffer sizes", that you have removed that, good! :-)
From the same commit[2] I can see you explicitly set (local+remote):
sysctl net.core.rmem_max=16777216
sysctl net.core.wmem_max=16777216
Eric do you have any advice on this setting?
And later[4] you further increase this to 32MiB. Notice that the
netperf UDP_STREAM test will still use the default value from:
net.core.rmem_default = 212992.
(To Eric) Mel's small UDP queues also interacted badly with Eric and
Paolo's UDP improvements, which was fixed in net-next commit[3]
363dc73acacb ("udp: be less conservative with sock rmem accounting").
[1] http://lkml.kernel.org/r/[email protected]
[2] https://github.com/gormanm/mmtests/commit/7f16226577b
[3] https://git.kernel.org/davem/net-next/c/363dc73acacb
[4] https://github.com/gormanm/mmtests/commit/777d1f5cd08
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On Thu, Dec 08, 2016 at 09:22:31AM +0100, Jesper Dangaard Brouer wrote:
> On Wed, 7 Dec 2016 23:25:31 +0000
> Mel Gorman <[email protected]> wrote:
>
> > On Wed, Dec 07, 2016 at 09:19:58PM +0000, Mel Gorman wrote:
> > > At small packet sizes on localhost, I see relatively low page allocator
> > > activity except during the socket setup and other unrelated activity
> > > (khugepaged, irqbalance, some btrfs stuff) which is curious as it's
> > > less clear why the performance was improved in that case. I considered
> > > the possibility that it was cache hotness of pages but that's not a
> > > good fit. If it was true then the first test would be slow and the rest
> > > relatively fast and I'm not seeing that. The other side-effect is that
> > > all the high-order pages that are allocated at the start are physically
> > > close together but that shouldn't have that big an impact. So for now,
> > > the gain is unexplained even though it happens consistently.
> > >
> >
> > Further investigation led me to conclude that the netperf automation on
> > my side had some methodology errors that could account for an artifically
> > low score in some cases. The netperf automation is years old and would
> > have been developed against a much older and smaller machine which may be
> > why I missed it until I went back looking at exactly what the automation
> > was doing. Minimally in a server/client test on remote maching there was
> > potentially higher packet loss than is acceptable. This would account why
> > some machines "benefitted" while others did not -- there would be boot to
> > boot variations that some machines happened to be "lucky". I believe I've
> > corrected the errors, discarded all the old data and scheduled a rest to
> > see what falls out.
>
> I guess you are talking about setting the netperf socket queue low
> (+256 bytes above msg size), that I pointed out in[1].
Primarily, yes.
> From the same commit[2] I can see you explicitly set (local+remote):
>
> sysctl net.core.rmem_max=16777216
> sysctl net.core.wmem_max=16777216
>
Yes, I set it for higher speed networks as a starting point to remind me
to examine rmem_default or socket configurations if any significant packet
loss is observed.
> Eric do you have any advice on this setting?
>
> And later[4] you further increase this to 32MiB. Notice that the
> netperf UDP_STREAM test will still use the default value from:
> net.core.rmem_default = 212992.
>
That's expected. In the initial sniff-test, I saw negligible packet loss.
I'm waiting to see what the full set of network tests look like before
doing any further adjustments.
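As a rough way to put a number on "negligible", the loss rate can be computed from the two message counts that a netperf UDP_STREAM run reports. The sketch below uses the counts from the output quoted earlier in this thread (654644 and 654592), on the assumption that the first is messages sent and the second is messages received; loss_percent is a hypothetical helper, not part of netperf or mmtests.

```shell
#!/bin/sh
# Percentage of messages that were sent but not received.
loss_percent() {
    sent=$1
    received=$2
    awk -v s="$sent" -v r="$received" \
        'BEGIN { printf "%.4f\n", (s - r) * 100 / s }'
}

# Counts taken from the netperf output quoted earlier in the thread.
loss_percent 654644 654592   # prints 0.0079
```

That is 52 lost messages out of 654644, which is well under a hundredth of a percent.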
--
Mel Gorman
SUSE Labs
On Thu, 8 Dec 2016 09:18:06 +0000
Mel Gorman <[email protected]> wrote:
> On Thu, Dec 08, 2016 at 09:22:31AM +0100, Jesper Dangaard Brouer wrote:
> > On Wed, 7 Dec 2016 23:25:31 +0000
> > Mel Gorman <[email protected]> wrote:
> >
> > > On Wed, Dec 07, 2016 at 09:19:58PM +0000, Mel Gorman wrote:
> > > > At small packet sizes on localhost, I see relatively low page allocator
> > > > activity except during the socket setup and other unrelated activity
> > > > (khugepaged, irqbalance, some btrfs stuff) which is curious as it's
> > > > less clear why the performance was improved in that case. I considered
> > > > the possibility that it was cache hotness of pages but that's not a
> > > > good fit. If it was true then the first test would be slow and the rest
> > > > relatively fast and I'm not seeing that. The other side-effect is that
> > > > all the high-order pages that are allocated at the start are physically
> > > > close together but that shouldn't have that big an impact. So for now,
> > > > the gain is unexplained even though it happens consistently.
> > > >
> > >
> > > Further investigation led me to conclude that the netperf automation on
> > > my side had some methodology errors that could account for an artifically
> > > low score in some cases. The netperf automation is years old and would
> > > have been developed against a much older and smaller machine which may be
> > > why I missed it until I went back looking at exactly what the automation
> > > was doing. Minimally in a server/client test on remote maching there was
> > > potentially higher packet loss than is acceptable. This would account why
> > > some machines "benefitted" while others did not -- there would be boot to
> > > boot variations that some machines happened to be "lucky". I believe I've
> > > corrected the errors, discarded all the old data and scheduled a rest to
> > > see what falls out.
> >
> > I guess you are talking about setting the netperf socket queue low
> > (+256 bytes above msg size), that I pointed out in[1].
>
> Primarily, yes.
>
> > From the same commit[2] I can see you explicitly set (local+remote):
> >
> > sysctl net.core.rmem_max=16777216
> > sysctl net.core.wmem_max=16777216
> >
>
> Yes, I set it for higher speed networks as a starting point to remind me
> to examine rmem_default or socket configurations if any significant packet
> loss is observed.
>
> > Eric do you have any advice on this setting?
> >
> > And later[4] you further increase this to 32MiB. Notice that the
> > netperf UDP_STREAM test will still use the default value from:
> > net.core.rmem_default = 212992.
> >
>
> That's expected. In the initial sniff-test, I saw negligible packet loss.
> I'm waiting to see what the full set of network tests look like before
> doing any further adjustments.
For netperf I would not recommend adjusting the global default
/proc/sys/net/core/rmem_default, as netperf has means of adjusting this
value from the application (these were the options you had set too low
and just removed). I think you should keep this at the default for now
(unless Eric says something else), as this should cover most users.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On Thu, Dec 08, 2016 at 11:43:08AM +0100, Jesper Dangaard Brouer wrote:
> > That's expected. In the initial sniff-test, I saw negligible packet loss.
> > I'm waiting to see what the full set of network tests look like before
> > doing any further adjustments.
>
> For netperf I will not recommend adjusting the global default
> /proc/sys/net/core/rmem_default as netperf have means of adjusting this
> value from the application (which were the options you setup too low
> and just removed). I think you should keep this as the default for now
> (unless Eric says something else), as this should cover most users.
>
Ok, the current state is that buffer sizes are only set for netperf
UDP_STREAM and only when running over a real network. The values selected
were specific to the network I had available so mileage may vary.
localhost is left at the defaults.
--
Mel Gorman
SUSE Labs
On Thu, 8 Dec 2016 11:06:56 +0000
Mel Gorman <[email protected]> wrote:
> On Thu, Dec 08, 2016 at 11:43:08AM +0100, Jesper Dangaard Brouer wrote:
> > > That's expected. In the initial sniff-test, I saw negligible packet loss.
> > > I'm waiting to see what the full set of network tests look like before
> > > doing any further adjustments.
> >
> > For netperf I will not recommend adjusting the global default
> > /proc/sys/net/core/rmem_default as netperf have means of adjusting this
> > value from the application (which were the options you setup too low
> > and just removed). I think you should keep this as the default for now
> > (unless Eric says something else), as this should cover most users.
> >
>
> Ok, the current state is that buffer sizes are only set for netperf
> UDP_STREAM and only when running over a real network. The values selected
> were specific to the network I had available so milage may vary.
> localhost is left at the defaults.
Looks like you made a mistake when re-implementing using buffer sizes
for netperf. See patch below signature.
Besides I think you misunderstood me, you can adjust:
sysctl net.core.rmem_max
sysctl net.core.wmem_max
And you should if you plan to use/set 851968 as socket size for UDP
remote tests, else you will be limited to the "max" values (212992, or
actually 425984, which is 2x the default value, for reasons I cannot remember)
https://github.com/gormanm/mmtests/commit/de9f8cdb7146021
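A minimal sketch of the clamping described above, assuming the behaviour documented in socket(7): a requested SO_RCVBUF is capped at net.core.rmem_max and the kernel then doubles the result for bookkeeping overhead. effective_rcvbuf is a hypothetical helper, and the sketch ignores the kernel's minimum buffer size.

```shell
#!/bin/sh
# Approximate the receive buffer size the kernel ends up using for
# a given request and a given net.core.rmem_max.
effective_rcvbuf() {
    requested=$1
    rmem_max=$2
    # Requests above rmem_max are silently capped.
    if [ "$requested" -gt "$rmem_max" ]; then
        requested=$rmem_max
    fi
    # The kernel doubles the value it was asked for.
    echo $((requested * 2))
}

effective_rcvbuf 851968 212992     # prints 425984
effective_rcvbuf 851968 16777216   # prints 1703936
```

With rmem_max left at 212992 the 851968 request has no effect beyond the cap, which is why the "max" sysctls need raising before the netperf socket size options can do anything.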
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[PATCH] mmtests: actually use variable SOCKETSIZE_OPT
From: Jesper Dangaard Brouer <[email protected]>
commit 7f16226577b2 ("netperf: Set remote and local socket max buffer
sizes") removed netperf's setting of the socket buffer sizes and
instead used global /proc/sys settings.
commit de9f8cdb7146 ("netperf: Only adjust socket sizes for
UDP_STREAM") re-added explicit netperf setting socket buffer sizes for
remote-host testing (saved in SOCKETSIZE_OPT). Only problem is this
variable is not used after commit 7f16226577b2.
Simply use $SOCKETSIZE_OPT when invoking netperf command.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
shellpack_src/src/netperf/netperf-bench | 2 +-
shellpacks/shellpack-bench-netperf | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/shellpack_src/src/netperf/netperf-bench b/shellpack_src/src/netperf/netperf-bench
index 8e7d02864c4a..b2820610936e 100755
--- a/shellpack_src/src/netperf/netperf-bench
+++ b/shellpack_src/src/netperf/netperf-bench
@@ -93,7 +93,7 @@ mmtests_server_ctl start --serverside-name $PROTOCOL-$SIZE
-t $PROTOCOL \
-i 3,3 -I 95,5 \
-H $SERVER_HOST \
- -- $MSGSIZE_OPT $EXTRA \
+ -- $SOCKETSIZE_OPT $MSGSIZE_OPT $EXTRA \
2>&1 | tee $LOGDIR_RESULTS/$PROTOCOL-${SIZE}.$ITERATION \
|| die Failed to run netperf
monitor_post_hook $LOGDIR_RESULTS $SIZE
diff --git a/shellpacks/shellpack-bench-netperf b/shellpacks/shellpack-bench-netperf
index 2ce26ba39f1b..7356082d5a78 100755
--- a/shellpacks/shellpack-bench-netperf
+++ b/shellpacks/shellpack-bench-netperf
@@ -190,7 +190,7 @@ for ITERATION in `seq 1 $ITERATIONS`; do
-t $PROTOCOL \
-i 3,3 -I 95,5 \
-H $SERVER_HOST \
- -- $MSGSIZE_OPT $EXTRA \
+ -- $SOCKETSIZE_OPT $MSGSIZE_OPT $EXTRA \
2>&1 | tee $LOGDIR_RESULTS/$PROTOCOL-${SIZE}.$ITERATION \
|| die Failed to run netperf
monitor_post_hook $LOGDIR_RESULTS $SIZE
On Thu, Dec 08, 2016 at 03:48:13PM +0100, Jesper Dangaard Brouer wrote:
> On Thu, 8 Dec 2016 11:06:56 +0000
> Mel Gorman <[email protected]> wrote:
>
> > On Thu, Dec 08, 2016 at 11:43:08AM +0100, Jesper Dangaard Brouer wrote:
> > > > That's expected. In the initial sniff-test, I saw negligible packet loss.
> > > > I'm waiting to see what the full set of network tests look like before
> > > > doing any further adjustments.
> > >
> > > For netperf I will not recommend adjusting the global default
> > > /proc/sys/net/core/rmem_default as netperf have means of adjusting this
> > > value from the application (which were the options you setup too low
> > > and just removed). I think you should keep this as the default for now
> > > (unless Eric says something else), as this should cover most users.
> > >
> >
> > Ok, the current state is that buffer sizes are only set for netperf
> > UDP_STREAM and only when running over a real network. The values selected
> > were specific to the network I had available so milage may vary.
> > localhost is left at the defaults.
>
> Looks like you made a mistake when re-implementing using buffer sizes
> for netperf.
We appear to have a disconnect. This was reintroduced in response to your
comment "For netperf I will not recommend adjusting the global default
/proc/sys/net/core/rmem_default as netperf have means of adjusting this
value from the application".
My understanding was that netperf's means were the -s and -S switches for
the local and remote socket buffers, so I reintroduced them and avoided
altering [r|w]mem_default.
Leaving the defaults resulted in some UDP packet loss on a 10GbE network
so some upward adjustment was needed.
From my perspective, either adjusting [r|w]mem_default or specifying -s
-S works for the UDP_STREAM issue but using the switches meant only this
is affected and other loads like sockperf and netpipe will need to be
evaluated separately which I don't mind doing.
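For illustration, the intended behaviour can be sketched as a small function that only adds the socket size switches for UDP_STREAM against a remote host. This is not the mmtests code: build_netperf_args and its parameters are hypothetical, and the use of -m for the message size is an assumption about how MSGSIZE_OPT is built.

```shell
#!/bin/sh
# Build the netperf arguments for one test run.
#   $1 protocol, $2 message size, $3 "yes" if the server is remote,
#   $4 socket buffer size in bytes (only used for remote UDP_STREAM)
build_netperf_args() {
    protocol=$1
    size=$2
    remote=$3
    sockbuf=$4

    args="-t $protocol --"
    # Socket sizes are only set for UDP_STREAM over a real network;
    # everything else is left at the system defaults.
    if [ "$protocol" = "UDP_STREAM" ] && [ "$remote" = "yes" ]; then
        args="$args -s $sockbuf -S $sockbuf"
    fi
    args="$args -m $size"
    echo "$args"
}

build_netperf_args UDP_STREAM 16384 yes 851968
build_netperf_args UDP_STREAM 16384 no 851968
build_netperf_args TCP_STREAM 16384 yes 851968
```

Only the first call emits -s and -S; the localhost and TCP cases fall through with just the message size.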
> See patch below signature.
>
> Besides I think you misunderstood me, you can adjust:
> sysctl net.core.rmem_max
> sysctl net.core.wmem_max
>
> And you should if you plan to use/set 851968 as socket size for UDP
> remote tests, else you will be limited to the "max" values (212992 well
> actually 425984 2x default value, for reasons I cannot remember)
>
The intent is to use the larger values to avoid packet loss on
UDP_STREAM.
--
Mel Gorman
SUSE Labs
On Thu, 2016-12-08 at 09:18 +0000, Mel Gorman wrote:
> Yes, I set it for higher speed networks as a starting point to remind me
> to examine rmem_default or socket configurations if any significant packet
> loss is observed.
Note that your page allocator changes might show more impact with
netperf and af_unix (instead of udp).
On Thu, 8 Dec 2016 15:11:01 +0000
Mel Gorman <[email protected]> wrote:
> On Thu, Dec 08, 2016 at 03:48:13PM +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 8 Dec 2016 11:06:56 +0000
> > Mel Gorman <[email protected]> wrote:
> >
> > > On Thu, Dec 08, 2016 at 11:43:08AM +0100, Jesper Dangaard Brouer wrote:
> > > > > That's expected. In the initial sniff-test, I saw negligible packet loss.
> > > > > I'm waiting to see what the full set of network tests look like before
> > > > > doing any further adjustments.
> > > >
> > > > For netperf I will not recommend adjusting the global default
> > > > /proc/sys/net/core/rmem_default as netperf have means of adjusting this
> > > > value from the application (which were the options you setup too low
> > > > and just removed). I think you should keep this as the default for now
> > > > (unless Eric says something else), as this should cover most users.
> > > >
> > >
> > > Ok, the current state is that buffer sizes are only set for netperf
> > > UDP_STREAM and only when running over a real network. The values selected
> > > were specific to the network I had available so milage may vary.
> > > localhost is left at the defaults.
> >
> > Looks like you made a mistake when re-implementing using buffer sizes
> > for netperf.
>
> We appear to have a disconnect. This was reintroduced in response to your
> comment "For netperf I will not recommend adjusting the global default
> /proc/sys/net/core/rmem_default as netperf have means of adjusting this
> value from the application".
>
> My understanding was that netperfs means was the -s and -S switches for
> send and recv buffers so I reintroduced them and avoided altering
> [r|w]mem_default.
>
> Leaving the defaults resulted in some UDP packet loss on a 10GbE network
> so some upward adjustment.
>
> From my perspective, either adjusting [r|w]mem_default or specifying -s
> -S works for the UDP_STREAM issue but using the switches meant only this
> is affected and other loads like sockperf and netpipe will need to be
> evaluated separately which I don't mind doing.
>
> > See patch below signature.
> >
> > Besides I think you misunderstood me, you can adjust:
> > sysctl net.core.rmem_max
> > sysctl net.core.wmem_max
> >
> > And you should if you plan to use/set 851968 as socket size for UDP
> > remote tests, else you will be limited to the "max" values (212992 well
> > actually 425984 2x default value, for reasons I cannot remember)
> >
>
> The intent is to use the larger values to avoid packet loss on
> UDP_STREAM.
We do seem to misunderstand each other.
I was just pointing out two things:
1. Notice the difference between "max" and "default" proc setting.
Only adjust the "max" setting.
2. There was a simple Bash shell script error in your commit.
The patch below fixes it.
[PATCH] mmtests: actually use variable SOCKETSIZE_OPT
From: Jesper Dangaard Brouer <[email protected]>
commit 7f16226577b2 ("netperf: Set remote and local socket max buffer
sizes") removed netperf's setting of the socket buffer sizes and
instead used global /proc/sys settings.
commit de9f8cdb7146 ("netperf: Only adjust socket sizes for
UDP_STREAM") re-added explicit netperf setting socket buffer sizes for
remote-host testing (saved in SOCKETSIZE_OPT). Only problem is this
variable is not used after commit 7f16226577b2.
Simply use $SOCKETSIZE_OPT when invoking netperf command.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
shellpack_src/src/netperf/netperf-bench | 2 +-
shellpacks/shellpack-bench-netperf | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/shellpack_src/src/netperf/netperf-bench b/shellpack_src/src/netperf/netperf-bench
index 8e7d02864c4a..b2820610936e 100755
--- a/shellpack_src/src/netperf/netperf-bench
+++ b/shellpack_src/src/netperf/netperf-bench
@@ -93,7 +93,7 @@ mmtests_server_ctl start --serverside-name $PROTOCOL-$SIZE
-t $PROTOCOL \
-i 3,3 -I 95,5 \
-H $SERVER_HOST \
- -- $MSGSIZE_OPT $EXTRA \
+ -- $SOCKETSIZE_OPT $MSGSIZE_OPT $EXTRA \
2>&1 | tee $LOGDIR_RESULTS/$PROTOCOL-${SIZE}.$ITERATION \
|| die Failed to run netperf
monitor_post_hook $LOGDIR_RESULTS $SIZE
diff --git a/shellpacks/shellpack-bench-netperf b/shellpacks/shellpack-bench-netperf
index 2ce26ba39f1b..7356082d5a78 100755
--- a/shellpacks/shellpack-bench-netperf
+++ b/shellpacks/shellpack-bench-netperf
@@ -190,7 +190,7 @@ for ITERATION in `seq 1 $ITERATIONS`; do
-t $PROTOCOL \
-i 3,3 -I 95,5 \
-H $SERVER_HOST \
- -- $MSGSIZE_OPT $EXTRA \
+ -- $SOCKETSIZE_OPT $MSGSIZE_OPT $EXTRA \
2>&1 | tee $LOGDIR_RESULTS/$PROTOCOL-${SIZE}.$ITERATION \
|| die Failed to run netperf
monitor_post_hook $LOGDIR_RESULTS $SIZE
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On Thu, Dec 08, 2016 at 06:19:51PM +0100, Jesper Dangaard Brouer wrote:
> > > See patch below signature.
> > >
> > > Besides I think you misunderstood me, you can adjust:
> > > sysctl net.core.rmem_max
> > > sysctl net.core.wmem_max
> > >
> > > And you should if you plan to use/set 851968 as socket size for UDP
> > > remote tests, else you will be limited to the "max" values (212992 well
> > > actually 425984 2x default value, for reasons I cannot remember)
> > >
> >
> > The intent is to use the larger values to avoid packet loss on
> > UDP_STREAM.
>
> We do seem to misunderstand each-other.
> I was just pointing out two things:
>
> 1. Notice the difference between "max" and "default" proc setting.
> Only adjust the "max" setting.
>
> 2. There was simple BASH-shell script error in your commit.
> Patch below fix it.
>
Understood now.
> [PATCH] mmtests: actually use variable SOCKETSIZE_OPT
>
> From: Jesper Dangaard Brouer <[email protected]>
>
Applied, thanks!
--
Mel Gorman
SUSE Labs