2016-10-13 08:08:59

by Joonsoo Kim

Subject: [RFC PATCH 0/5] Reduce fragmentation

From: Joonsoo Kim <[email protected]>

Hello,

This is a patchset to reduce fragmentation. Patches 1 ~ 3 change the
allocation/free logic to reduce fragmentation. Patches 4 ~ 5 let the
user manually control the number of unmovable/reclaimable pageblocks.
Users usually have more knowledge about their system, and if the number
of unmovable/reclaimable pageblocks is pre-defined properly, fragmentation
can be reduced a lot.
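
For example, the pre-defined pageblocks are set with the boot parameters
introduced in patch 5. A rough sketch of such a command line (the exact
ratios are system-specific and only illustrative here) would be:

    ratio_unmovable=15 ratio_reclaimable=15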

I found that this patchset reduces fragmentation in my test.

System: 512 MB
Workload: Kernel build test (make -j12, 5 times)
Result: number of mixed movable pageblocks / number of movable pageblocks

Base: 50 / 205
Patch 1 ~ 3: 20 / 205
Patchset + 15% Pre-defined unmovable/reclaimable pageblock: 0 / 176

Note that I haven't tested this heavily, so I'm not sure whether there are
side-effects. If there is no disagreement, I will do more testing and repost
the patchset.

Johannes, this patchset would not help find the root cause of
your regression, but it would help mitigate the symptom.

This patchset is based on next-20161006.

Thanks.

Joonsoo Kim (5):
mm/page_alloc: always add freeing page at the tail of the buddy list
mm/page_alloc: use smallest fallback page first in movable allocation
mm/page_alloc: stop instantly reusing freed page
mm/page_alloc: add fixed migratetype pageblock infrastructure
mm/page_alloc: support fixed migratetype pageblock

include/linux/mmzone.h | 6 +-
include/linux/pageblock-flags.h | 3 +-
mm/page_alloc.c | 224 ++++++++++++++++++++++++++++++----------
mm/vmstat.c | 7 +-
4 files changed, 179 insertions(+), 61 deletions(-)

--
1.9.1


2016-10-13 08:08:48

by Joonsoo Kim

Subject: [RFC PATCH 2/5] mm/page_alloc: use smallest fallback page first in movable allocation

From: Joonsoo Kim <[email protected]>

When we try to find a freepage in the fallback buddy list, we always search
for the largest one. This helps against fragmentation when we process an
unmovable/reclaimable allocation request, because such a request could cause
permanent fragmentation on a movable pageblock, and spreading such
allocations out would cause even more fragmentation. A movable allocation
request is rather different: it will simply be freed or migrated later, so it
doesn't contribute to fragmentation on the other pageblock. In this case, it
is better not to break the precious highest-order freepage, so we should
search for the smallest suitable freepage first.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c4f7d05..70427bf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2121,15 +2121,31 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
int fallback_mt;
bool can_steal;

- /* Find the largest possible block of pages in the other list */
- for (current_order = MAX_ORDER-1;
- current_order >= order && current_order <= MAX_ORDER-1;
- --current_order) {
+ if (start_migratetype == MIGRATE_MOVABLE)
+ current_order = order;
+ else
+ current_order = MAX_ORDER - 1;
+
+ /*
+ * Find the appropriate block of pages in the other list.
+ * If start_migratetype is MIGRATE_UNMOVABLE/MIGRATE_RECLAIMABLE,
+ * it is better to find the largest pageblock since the request could
+ * cause permanent fragmentation. However, for MIGRATE_MOVABLE there is
+ * no such fragmentation risk, so it is better to use the smallest one.
+ */
+ while (current_order >= order && current_order <= MAX_ORDER - 1) {
+
area = &(zone->free_area[current_order]);
fallback_mt = find_suitable_fallback(area, current_order,
start_migratetype, false, &can_steal);
- if (fallback_mt == -1)
+ if (fallback_mt == -1) {
+ if (start_migratetype == MIGRATE_MOVABLE)
+ current_order++;
+ else
+ current_order--;
+
continue;
+ }

page = list_first_entry(&area->free_list[fallback_mt],
struct page, lru);
--
1.9.1

2016-10-13 08:08:54

by Joonsoo Kim

Subject: [RFC PATCH 1/5] mm/page_alloc: always add freeing page at the tail of the buddy list

From: Joonsoo Kim <[email protected]>

Currently, a freed page can stay longer in the buddy list if the next
higher-order page is already in the buddy list, in order to help coalescing.
However, this doesn't work for the simplest sequential free case. For
example, consider the situation where 8 consecutive pages are freed in
sequential order.

page 0: attached at the head of order 0 list
page 1: merged with page 0, attached at the head of order 1 list
page 2: attached at the tail of order 0 list
page 3: merged with page 2 and then merged with page 0, attached at
the head of order 2 list
page 4: attached at the head of order 0 list
page 5: merged with page 4, attached at the tail of order 1 list
page 6: attached at the tail of order 0 list
page 7: merged with page 6 and then merged with page 4. Lastly, merged
with page 0 and we get order 3 freepage.

Excluding the page 0 case, there are three cases in this example where the
freed page is attached at the head of the buddy list, and if just one
allocation request of the corresponding order comes in at that moment, the
page that was on its way to becoming a high-order page will be allocated and
we would fail to make an order-3 freepage.

Allocation usually happens in sequential order, and so does freeing. So it
is important to detect such a situation and give the pages a chance
to be coalesced.

I think a simple and effective heuristic for this case is to just attach the
freed page at the tail of the buddy list unconditionally. If the freed page
isn't merged during one rotation of the list, it represents actual
fragmentation and we don't need to care about it for coalescing.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 25 ++-----------------------
1 file changed, 2 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1790391..c4f7d05 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -858,29 +858,8 @@ static inline void __free_one_page(struct page *page,
done_merging:
set_page_order(page, order);

- /*
- * If this is not the largest possible page, check if the buddy
- * of the next-highest order is free. If it is, it's possible
- * that pages are being freed that will coalesce soon. In case,
- * that is happening, add the free page to the tail of the list
- * so it's less likely to be used soon and more likely to be merged
- * as a higher order page
- */
- if ((order < MAX_ORDER-2) && pfn_valid_within(page_to_pfn(buddy))) {
- struct page *higher_page, *higher_buddy;
- combined_idx = buddy_idx & page_idx;
- higher_page = page + (combined_idx - page_idx);
- buddy_idx = __find_buddy_index(combined_idx, order + 1);
- higher_buddy = higher_page + (buddy_idx - combined_idx);
- if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
- list_add_tail(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
- goto out;
- }
- }
-
- list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
-out:
+ list_add_tail(&page->lru,
+ &zone->free_area[order].free_list[migratetype]);
zone->free_area[order].nr_free++;
}

--
1.9.1

2016-10-13 08:09:25

by Joonsoo Kim

Subject: [RFC PATCH 5/5] mm/page_alloc: support fixed migratetype pageblock

From: Joonsoo Kim <[email protected]>

We have the migratetype facility to minimise fragmentation. It dynamically
changes the migratetype of a pageblock based on some criteria, but it can
never be perfect. Pages of one migratetype are often placed in a pageblock
of another migratetype. We call such a pageblock a mixed pageblock.

There are two types of mixed pageblock: a movable page on an unmovable
pageblock, and an unmovable page on a movable pageblock. (I simply ignore
the reclaimable migratetype/pageblock to keep the explanation easy.) The
former case is not a big problem because a movable page is reclaimable or
migratable. We can reclaim/migrate it when necessary, so it usually doesn't
contribute to fragmentation. The actual problem is caused by the latter
case. We don't have any way to reclaim/migrate such a page, and it prevents
us from making a high-order freepage.

This latter case happens when there are too few unmovable freepages. When
unmovable freepages run out, fallback allocation happens and unmovable
allocations are served from movable pageblocks.

To solve/prevent this problem, we need enough unmovable freepages to
satisfy all unmovable allocation requests from unmovable pageblocks.
If we set aside enough unmovable pageblocks at boot and fix their migratetype
until power off, we will have more unmovable freepages during runtime and
mitigate the above problem.

This patch provides a way to set a minimum number of unmovable pageblocks
at boot time. In my test, with a proper setup, I don't see any mixed
pageblock where an unmovable allocation stays on a movable pageblock.
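
As a rough sizing sketch (the numbers are illustrative assumptions, not part
of the patch): with 4 KB pages and 2 MB pageblocks (pageblock_nr_pages = 512),
a zone with 131072 managed pages (512 MB) booted with ratio_unmovable=15
reserves 131072 * 15 / 100 / 512 = 38 pageblocks, i.e. about 76 MB, as
permanently MIGRATE_UNMOVABLE.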

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 90 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6b60e26..846c8c7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1406,6 +1406,89 @@ void clear_zone_contiguous(struct zone *zone)
zone->contiguous = false;
}

+static unsigned long ratio_unmovable, ratio_reclaimable;
+
+static int __init early_ratio_unmovable(char *buf)
+{
+ if (!buf)
+ return -EINVAL;
+
+ return kstrtoul(buf, 0, &ratio_unmovable);
+}
+early_param("ratio_unmovable", early_ratio_unmovable);
+
+static int __init early_ratio_reclaimable(char *buf)
+{
+ if (!buf)
+ return -EINVAL;
+
+ return kstrtoul(buf, 0, &ratio_reclaimable);
+}
+early_param("ratio_reclaimable", early_ratio_reclaimable);
+
+static void __reserve_zone_fixed_pageblock(struct zone *zone,
+ int migratetype, int nr)
+{
+ unsigned long block_start_pfn = zone->zone_start_pfn;
+ unsigned long block_end_pfn;
+ struct page *page;
+ int count = 0;
+ int pageblocks = MAX_ORDER_NR_PAGES / pageblock_nr_pages;
+ int i;
+
+ block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
+ for (; block_start_pfn < zone_end_pfn(zone) &&
+ count + pageblocks <= nr;
+ block_start_pfn = block_end_pfn,
+ block_end_pfn += pageblock_nr_pages) {
+
+ block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+ if (!__pageblock_pfn_to_page(block_start_pfn,
+ block_end_pfn, zone))
+ continue;
+
+ page = pfn_to_page(block_start_pfn);
+ if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE)
+ continue;
+
+ if (!PageBuddy(page))
+ continue;
+
+ if (page_order(page) != MAX_ORDER - 1)
+ continue;
+
+ move_freepages_block(zone, page, migratetype);
+ i = pageblocks;
+ do {
+ set_pageblock_migratetype(page, migratetype);
+ set_pageblock_flags_group(page, 1,
+ PB_migrate_fixed, PB_migrate_fixed);
+ count++;
+ page += pageblock_nr_pages;
+ } while (--i);
+ }
+
+ pr_info("Node %d %s %d pageblocks are permanently reserved for migratetype %d\n",
+ zone_to_nid(zone), zone->name, count, migratetype);
+}
+
+static void reserve_zone_fixed_pageblock(struct zone *zone)
+{
+ unsigned long nr_unmovable, nr_reclaimable;
+
+ nr_unmovable = (zone->managed_pages * ratio_unmovable / 100);
+ nr_unmovable /= pageblock_nr_pages;
+
+ nr_reclaimable = (zone->managed_pages * ratio_reclaimable / 100);
+ nr_reclaimable /= pageblock_nr_pages;
+
+ __reserve_zone_fixed_pageblock(zone,
+ MIGRATE_UNMOVABLE, nr_unmovable);
+ __reserve_zone_fixed_pageblock(zone,
+ MIGRATE_RECLAIMABLE, nr_reclaimable);
+}
+
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
static void __init deferred_free_range(struct page *page,
unsigned long pfn, int nr_pages)
@@ -1567,6 +1650,7 @@ static int __init deferred_init_memmap(void *data)
void __init page_alloc_init_late(void)
{
struct zone *zone;
+ unsigned long flags;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
int nid;
@@ -1584,6 +1668,12 @@ void __init page_alloc_init_late(void)
files_maxfiles_init();
#endif

+ for_each_populated_zone(zone) {
+ spin_lock_irqsave(&zone->lock, flags);
+ reserve_zone_fixed_pageblock(zone);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+
for_each_populated_zone(zone)
set_zone_contiguous(zone);
}
--
1.9.1

2016-10-13 08:09:35

by Joonsoo Kim

Subject: [RFC PATCH 3/5] mm/page_alloc: stop instantly reusing freed page

From: Joonsoo Kim <[email protected]>

The allocation/free pattern is usually sequential. If pages are freed to
the buddy list, they can be coalesced. However, we first keep these freed
pages on the pcp list and try to reuse them until a threshold is reached,
so we don't have enough chance to get a high-order freepage. This reuse
provides some performance advantage since we don't need to take the zone
lock and we don't pay the cost of checking for buddy merging.
But less fragmentation and more high-order freepages compensate for
this overhead in other ways. First, we would trigger less direct
compaction, which has high overhead. And there are use cases that use
high-order pages to boost their performance.

Instantly reusing a freed page seems to provide a computational benefit,
but the alternative affects more precious things such as I/O performance and
memory consumption, so I think it's a good idea to weight the
latter advantage more.

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 6 +++--
mm/page_alloc.c | 71 ++++++++++++++++++++++++++++++++------------------
mm/vmstat.c | 7 ++---
3 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7f2ae99..75a92d1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -260,12 +260,14 @@ enum zone_watermarks {
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])

struct per_cpu_pages {
- int count; /* number of pages in the list */
+ int alloc_count; /* number of pages in the list */
+ int free_count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */

/* Lists of pages, one per migrate type stored on the pcp-lists */
- struct list_head lists[MIGRATE_PCPTYPES];
+ struct list_head alloc_lists[MIGRATE_PCPTYPES];
+ struct list_head free_lists[MIGRATE_PCPTYPES];
};

struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 70427bf..a167754 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1091,7 +1091,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
batch_free++;
if (++migratetype == MIGRATE_PCPTYPES)
migratetype = 0;
- list = &pcp->lists[migratetype];
+ list = &pcp->free_lists[migratetype];
} while (list_empty(list));

/* This is the only non-empty list. Free them all. */
@@ -2258,10 +2258,10 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)

local_irq_save(flags);
batch = READ_ONCE(pcp->batch);
- to_drain = min(pcp->count, batch);
+ to_drain = min(pcp->free_count, batch);
if (to_drain > 0) {
free_pcppages_bulk(zone, to_drain, pcp);
- pcp->count -= to_drain;
+ pcp->free_count -= to_drain;
}
local_irq_restore(flags);
}
@@ -2279,14 +2279,24 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
unsigned long flags;
struct per_cpu_pageset *pset;
struct per_cpu_pages *pcp;
+ int mt;

local_irq_save(flags);
pset = per_cpu_ptr(zone->pageset, cpu);

pcp = &pset->pcp;
- if (pcp->count) {
- free_pcppages_bulk(zone, pcp->count, pcp);
- pcp->count = 0;
+ if (pcp->alloc_count) {
+ for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+ list_splice_init(&pcp->alloc_lists[mt],
+ &pcp->free_lists[mt]);
+ }
+ pcp->free_count += pcp->alloc_count;
+ pcp->alloc_count = 0;
+ }
+
+ if (pcp->free_count) {
+ free_pcppages_bulk(zone, pcp->free_count, pcp);
+ pcp->free_count = 0;
}
local_irq_restore(flags);
}
@@ -2357,12 +2367,13 @@ void drain_all_pages(struct zone *zone)

if (zone) {
pcp = per_cpu_ptr(zone->pageset, cpu);
- if (pcp->pcp.count)
+ if (pcp->pcp.alloc_count || pcp->pcp.free_count)
has_pcps = true;
} else {
for_each_populated_zone(z) {
pcp = per_cpu_ptr(z->pageset, cpu);
- if (pcp->pcp.count) {
+ if (pcp->pcp.alloc_count ||
+ pcp->pcp.free_count) {
has_pcps = true;
break;
}
@@ -2454,15 +2465,12 @@ void free_hot_cold_page(struct page *page, bool cold)
}

pcp = &this_cpu_ptr(zone->pageset)->pcp;
- if (!cold)
- list_add(&page->lru, &pcp->lists[migratetype]);
- else
- list_add_tail(&page->lru, &pcp->lists[migratetype]);
- pcp->count++;
- if (pcp->count >= pcp->high) {
+ list_add(&page->lru, &pcp->free_lists[migratetype]);
+ pcp->free_count++;
+ if (pcp->free_count >= pcp->batch) {
unsigned long batch = READ_ONCE(pcp->batch);
free_pcppages_bulk(zone, batch, pcp);
- pcp->count -= batch;
+ pcp->free_count -= batch;
}

out:
@@ -2611,9 +2619,9 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
local_irq_save(flags);
do {
pcp = &this_cpu_ptr(zone->pageset)->pcp;
- list = &pcp->lists[migratetype];
+ list = &pcp->alloc_lists[migratetype];
if (list_empty(list)) {
- pcp->count += rmqueue_bulk(zone, 0,
+ pcp->alloc_count += rmqueue_bulk(zone, 0,
pcp->batch, list,
migratetype, cold);
if (unlikely(list_empty(list)))
@@ -2626,7 +2634,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
page = list_first_entry(list, struct page, lru);

list_del(&page->lru);
- pcp->count--;
+ pcp->alloc_count--;

} while (check_new_pcp(page));
} else {
@@ -4258,13 +4266,17 @@ void show_free_areas(unsigned int filter)
int cpu;
struct zone *zone;
pg_data_t *pgdat;
+ struct per_cpu_pages *pcp;

for_each_populated_zone(zone) {
if (skip_free_areas_node(filter, zone_to_nid(zone)))
continue;

- for_each_online_cpu(cpu)
- free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
+ for_each_online_cpu(cpu) {
+ pcp = &per_cpu_ptr(zone->pageset, cpu)->pcp;
+ free_pcp += pcp->alloc_count;
+ free_pcp += pcp->free_count;
+ }
}

printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
@@ -4347,8 +4359,11 @@ void show_free_areas(unsigned int filter)
continue;

free_pcp = 0;
- for_each_online_cpu(cpu)
- free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
+ for_each_online_cpu(cpu) {
+ pcp = &per_cpu_ptr(zone->pageset, cpu)->pcp;
+ free_pcp += pcp->alloc_count;
+ free_pcp += pcp->free_count;
+ }

show_node(zone);
printk("%s"
@@ -4394,7 +4409,8 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_PAGETABLE)),
K(zone_page_state(zone, NR_BOUNCE)),
K(free_pcp),
- K(this_cpu_read(zone->pageset->pcp.count)),
+ K(this_cpu_read(zone->pageset->pcp.alloc_count) +
+ this_cpu_read(zone->pageset->pcp.free_count)),
K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
@@ -5251,9 +5267,12 @@ static void pageset_init(struct per_cpu_pageset *p)
memset(p, 0, sizeof(*p));

pcp = &p->pcp;
- pcp->count = 0;
- for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
- INIT_LIST_HEAD(&pcp->lists[migratetype]);
+ pcp->alloc_count = 0;
+ pcp->free_count = 0;
+ for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) {
+ INIT_LIST_HEAD(&pcp->alloc_lists[migratetype]);
+ INIT_LIST_HEAD(&pcp->free_lists[migratetype]);
+ }
}

static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 604f26a..dbb9836 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -676,7 +676,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
* if not then there is nothing to expire.
*/
if (!__this_cpu_read(p->expire) ||
- !__this_cpu_read(p->pcp.count))
+ !__this_cpu_read(p->pcp.free_count))
continue;

/*
@@ -690,7 +690,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
if (__this_cpu_dec_return(p->expire))
continue;

- if (__this_cpu_read(p->pcp.count)) {
+ if (__this_cpu_read(p->pcp.free_count)) {
drain_zone_pages(zone, this_cpu_ptr(&p->pcp));
changes++;
}
@@ -1408,7 +1408,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n high: %i"
"\n batch: %i",
i,
- pageset->pcp.count,
+ pageset->pcp.alloc_count +
+ pageset->pcp.free_count,
pageset->pcp.high,
pageset->pcp.batch);
#ifdef CONFIG_SMP
--
1.9.1

2016-10-13 08:09:41

by Joonsoo Kim

Subject: [RFC PATCH 4/5] mm/page_alloc: add fixed migratetype pageblock infrastructure

From: Joonsoo Kim <[email protected]>

The following patch will support permanent-migratetype pageblocks set via a
kernel boot parameter. In preparation, this patch adds the infrastructure
for it. Once fixed, the migratetype of a pageblock cannot be changed anymore
until power off.

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/pageblock-flags.h | 3 ++-
mm/page_alloc.c | 12 +++++++++++-
2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index e942558..0cf2c1f 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -31,12 +31,13 @@ enum pageblock_bits {
PB_migrate_end = PB_migrate + 3 - 1,
/* 3 bits required for migrate types */
PB_migrate_skip,/* If set the block is skipped by compaction */
+ PB_migrate_fixed,

/*
* Assume the bits will always align on a word. If this assumption
* changes then get/set pageblock needs updating.
*/
- NR_PAGEBLOCK_BITS
+ NR_PAGEBLOCK_BITS = 8,
};

#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a167754..6b60e26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -427,7 +427,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
unsigned long bitidx, word_bitidx;
unsigned long old_word, word;

- BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
+ BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 8);

bitmap = get_pageblock_bitmap(page, pfn);
bitidx = pfn_to_bitidx(page, pfn);
@@ -451,10 +451,17 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,

void set_pageblock_migratetype(struct page *page, int migratetype)
{
+ int fixed;
+
if (unlikely(page_group_by_mobility_disabled &&
migratetype < MIGRATE_PCPTYPES))
migratetype = MIGRATE_UNMOVABLE;

+ fixed = get_pageblock_flags_group(page,
+ PB_migrate_fixed, PB_migrate_fixed);
+ if (fixed)
+ return;
+
set_pageblock_flags_group(page, (unsigned long)migratetype,
PB_migrate, PB_migrate_end);
}
@@ -2026,6 +2033,9 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
int mt;
unsigned long max_managed, flags;

+ /* FIXME: disable highatomic pageblock reservation for test */
+ return;
+
/*
* Limit the number reserved to 1 pageblock or roughly 1% of a zone.
* Check is race-prone but harmless.
--
1.9.1

2016-10-13 09:12:20

by Vlastimil Babka

Subject: Re: [RFC PATCH 2/5] mm/page_alloc: use smallest fallback page first in movable allocation

On 10/13/2016 10:08 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> When we try to find freepage in fallback buddy list, we always serach
> the largest one. This would help for fragmentation if we process
> unmovable/reclaimable allocation request because it could cause permanent
> fragmentation on movable pageblock and spread out such allocations would
> cause more fragmentation. But, movable allocation request is
> rather different. It would be simply freed or migrated so it doesn't
> contribute to fragmentation on the other pageblock. In this case, it would
> be better not to break the precious highest order freepage so we need to
> search the smallest freepage first.

I've also pondered this, but then found a lower-hanging fruit that
should hopefully be a clear win and mitigate most cases of breaking
high-order pages unnecessarily:

http://marc.info/?l=linux-mm&m=147582914330198&w=2

So I would try that first, and then test your patch on top? In your
patch there's a risk that we make it harder for unmovable/reclaimable
pageblocks to become movable again (we start with the smallest page
which means there's lower chance that move_freepages_block() will
convert more than half of the block). And Johannes's report seems to be
about a regression in exactly this aspect of the heuristics.

> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/page_alloc.c | 26 +++++++++++++++++++++-----
> 1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c4f7d05..70427bf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2121,15 +2121,31 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> int fallback_mt;
> bool can_steal;
>
> - /* Find the largest possible block of pages in the other list */
> - for (current_order = MAX_ORDER-1;
> - current_order >= order && current_order <= MAX_ORDER-1;
> - --current_order) {
> + if (start_migratetype == MIGRATE_MOVABLE)
> + current_order = order;
> + else
> + current_order = MAX_ORDER - 1;
> +
> + /*
> + * Find the appropriate block of pages in the other list.
> + * If start_migratetype is MIGRATE_UNMOVABLE/MIGRATE_RECLAIMABLE,
> + * it would be better to find largest pageblock since it could cause
> + * fragmentation. However, in case of MIGRATE_MOVABLE, there is no
> + * risk about fragmentation so it would be better to use smallest one.
> + */
> + while (current_order >= order && current_order <= MAX_ORDER - 1) {
> +
> area = &(zone->free_area[current_order]);
> fallback_mt = find_suitable_fallback(area, current_order,
> start_migratetype, false, &can_steal);
> - if (fallback_mt == -1)
> + if (fallback_mt == -1) {
> + if (start_migratetype == MIGRATE_MOVABLE)
> + current_order++;
> + else
> + current_order--;
> +
> continue;
> + }
>
> page = list_first_entry(&area->free_list[fallback_mt],
> struct page, lru);
>

2016-10-13 09:23:51

by Vlastimil Babka

Subject: Re: [RFC PATCH 1/5] mm/page_alloc: always add freeing page at the tail of the buddy list

On 10/13/2016 10:08 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Currently, freeing page can stay longer in the buddy list if next higher
> order page is in the buddy list in order to help coalescence. However,
> it doesn't work for the simplest sequential free case. For example, think
> about the situation that 8 consecutive pages are freed in sequential
> order.
>
> page 0: attached at the head of order 0 list
> page 1: merged with page 0, attached at the head of order 1 list
> page 2: attached at the tail of order 0 list
> page 3: merged with page 2 and then merged with page 0, attached at
> the head of order 2 list
> page 4: attached at the head of order 0 list
> page 5: merged with page 4, attached at the tail of order 1 list
> page 6: attached at the tail of order 0 list
> page 7: merged with page 6 and then merged with page 4. Lastly, merged
> with page 0 and we get order 3 freepage.
>
> With excluding page 0 case, there are three cases that freeing page is
> attached at the head of buddy list in this example and if just one
> corresponding ordered allocation request comes at that moment, this page
> in being a high order page will be allocated and we would fail to make
> order-3 freepage.
>
> Allocation usually happens in sequential order and free also does. So, it

Are you sure this is true except after the system is freshly booted? As
soon as it becomes fragmented, a stream of order-0 allocations will
likely grab them randomly from all over the place, and it's unlikely to
recover except at small orders.

> would be important to detect such a situation and to give some chance
> to be coalesced.
>
> I think that simple and effective heuristic about this case is just
> attaching freeing page at the tail of the buddy list unconditionally.
> If freeing isn't merged during one rotation, it would be actual
> fragmentation and we don't need to care about it for coalescence.

I'm not against removing this heuristic, but not without some
benchmarks. The disadvantage of putting pages at the tail of the lists is
that they become cache-cold by the time they are allocated again. We should
check how large that problem is.

> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/page_alloc.c | 25 ++-----------------------
> 1 file changed, 2 insertions(+), 23 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1790391..c4f7d05 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -858,29 +858,8 @@ static inline void __free_one_page(struct page *page,
> done_merging:
> set_page_order(page, order);
>
> - /*
> - * If this is not the largest possible page, check if the buddy
> - * of the next-highest order is free. If it is, it's possible
> - * that pages are being freed that will coalesce soon. In case,
> - * that is happening, add the free page to the tail of the list
> - * so it's less likely to be used soon and more likely to be merged
> - * as a higher order page
> - */
> - if ((order < MAX_ORDER-2) && pfn_valid_within(page_to_pfn(buddy))) {
> - struct page *higher_page, *higher_buddy;
> - combined_idx = buddy_idx & page_idx;
> - higher_page = page + (combined_idx - page_idx);
> - buddy_idx = __find_buddy_index(combined_idx, order + 1);
> - higher_buddy = higher_page + (buddy_idx - combined_idx);
> - if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
> - list_add_tail(&page->lru,
> - &zone->free_area[order].free_list[migratetype]);
> - goto out;
> - }
> - }
> -
> - list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
> -out:
> + list_add_tail(&page->lru,
> + &zone->free_area[order].free_list[migratetype]);
> zone->free_area[order].nr_free++;
> }
>
>

2016-10-13 11:05:26

by Vlastimil Babka

Subject: Re: [RFC PATCH 5/5] mm/page_alloc: support fixed migratetype pageblock

On 10/13/2016 10:08 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> We have migratetype facility to minimise fragmentation. It dynamically
> changes migratetype of pageblock based on some criterias but it never
> be perfect. Some migratetype pages are often placed in the other
> migratetype pageblock. We call this pageblock as mixed pageblock.
>
> There are two types of mixed pageblock. Movable page on unmovable
> pageblock and unmovable page on movable pageblock. (I simply ignore
> reclaimble migratetype/pageblock for easy explanation.) Earlier case is
> not a big problem because movable page is reclaimable or migratable. We can
> reclaim/migrate it when necessary so it usually doesn't contribute
> fragmentation. Actual problem is caused by later case. We don't have
> any way to reclaim/migrate this page and it prevents to make high order
> freepage.
>
> This later case happens when there is too less unmovable freepage. When
> unmovable freepage runs out, fallback allocation happens and unmovable
> allocation would be served by movable pageblock.
>
> To solve/prevent this problem, we need to have enough unmovable freepage
> to satisfy all unmovable allocation request by unmovable pageblock.
> If we set enough unmovable pageblock at boot and fix it's migratetype
> until power off, we would have more unmovable freepage during runtime and
> mitigate above problem.
>
> This patch provides a way to set minimum number of unmovable pageblock
> at boot time. In my test, with proper setup, I can't see any mixed
> pageblock where unmovable allocation stay on movable pageblock.

So if I get this correctly, the fixed-as-unmovable bit doesn't actually
prevent fallbacks to such pageblocks? Then I'm surprised that's enough
to make any difference. Also Johannes's problem is that there are too
many unmovable pageblocks, so I'm a bit skeptical that simply
preallocating some will help his workload. But we'll see...

In any case I wouldn't pursue a solution that requires user
configuration, except as a last resort. Hopefully we can make the
heuristics good enough so that's not necessary. Sorry for my mostly
negative feedback on your series; I'm glad you're pursuing this as well, and
hope we'll eventually find a good solution :)

Vlastimil

2016-10-13 11:14:24

by Vlastimil Babka

Subject: Re: [RFC PATCH 3/5] mm/page_alloc: stop instantly reusing freed page

On 10/13/2016 10:08 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Allocation/free pattern is usually sequantial. If they are freed to
> the buddy list, they can be coalesced. However, we first keep these freed
> pages at the pcp list and try to reuse them until threshold is reached
> so we don't have enough chance to get a high order freepage. This reusing
> would provide us some performance advantages since we don't need to
> get the zone lock and we don't pay the cost to check buddy merging.
> But, less fragmentation and more high order freepage would compensate
> this overhead in other ways. First, we would trigger less direct
> compaction which has high overhead. And, there are usecases that uses
> high order page to boost their performance.
>
> Instantly resuing freed page seems to provide us computational benefit
> but the other affects more precious things like as I/O performance and
> memory consumption so I think that it's a good idea to weight
> later advantage more.

Again, there's also cache hotness to consider. And whether the
sequential pattern is still real on a system with higher uptime. Should
be possible to evaluate with tracepoints?


2016-10-14 01:02:08

by Joonsoo Kim

Subject: Re: [RFC PATCH 1/5] mm/page_alloc: always add freeing page at the tail of the buddy list

On Thu, Oct 13, 2016 at 11:04:39AM +0200, Vlastimil Babka wrote:
> On 10/13/2016 10:08 AM, [email protected] wrote:
> >From: Joonsoo Kim <[email protected]>
> >
> >Currently, freeing page can stay longer in the buddy list if next higher
> >order page is in the buddy list in order to help coalescence. However,
> >it doesn't work for the simplest sequential free case. For example, think
> >about the situation that 8 consecutive pages are freed in sequential
> >order.
> >
> >page 0: attached at the head of order 0 list
> >page 1: merged with page 0, attached at the head of order 1 list
> >page 2: attached at the tail of order 0 list
> >page 3: merged with page 2 and then merged with page 0, attached at
> > the head of order 2 list
> >page 4: attached at the head of order 0 list
> >page 5: merged with page 4, attached at the tail of order 1 list
> >page 6: attached at the tail of order 0 list
> >page 7: merged with page 6 and then merged with page 4. Lastly, merged
> > with page 0 and we get order 3 freepage.
> >
> >With excluding page 0 case, there are three cases that freeing page is
> >attached at the head of buddy list in this example and if just one
> >corresponding ordered allocation request comes at that moment, this page
> >in being a high order page will be allocated and we would fail to make
> >order-3 freepage.
> >
> >Allocation usually happens in sequential order and free also does. So, it
>
> Are you sure this is true except after the system is freshly booted?
> As soon as it becomes fragmented, a stream of order-0 allocations
> will likely grab them randomly from all over the place and it's
> unlikely to recover except small orders.

What we should really focus on is small-order (non-costly-order) pages,
and this patch would help to make them. Even if the system runs for a long
time, I have seen that there are many small-order freepages, so there would
be enough chance to alloc/free in sequential order.

>
> >would be important to detect such a situation and to give some chance
> >to be coalesced.
> >
> >I think that simple and effective heuristic about this case is just
> >attaching freeing page at the tail of the buddy list unconditionally.
> >If freeing isn't merged during one rotation, it would be actual
> >fragmentation and we don't need to care about it for coalescence.
>
> I'm not against removing this heuristic, but not without some
> benchmarks. The disadvantage of putting pages to tail lists is that

I can do more testing. But I'd like to point out that this is not removing
the heuristic but expanding it. Before this heuristic was added, all freed
pages were added at the head of the buddy list.

> they become cache-cold until allocated again. We should check how
> large that problem is.

Yes, your concern is fair enough. There are some reasons to justify
this patch, but it should be checked.

If merging happens, we cannot be sure whether the merged page is
cache-hot or not. There is a possibility that part of the merged page stayed
in the buddy list for a long time and is cache-cold. I guess we could
apply the above heuristic only to merged pages, for which we cannot be
sure whether they are cache-hot or not.

And there is no guarantee that a freed page is cache-hot. If it was used
for the file cache and then reclaimed, it would be cache-cold. And if the
next allocation is requested by the file cache, it asks for a cache-cold
page. The benefit of keeping cache-hot pages at the head of the buddy
list would then be weakened.

Thanks.

2016-10-14 01:26:31

by Joonsoo Kim

Subject: Re: [RFC PATCH 2/5] mm/page_alloc: use smallest fallback page first in movable allocation

On Thu, Oct 13, 2016 at 11:12:10AM +0200, Vlastimil Babka wrote:
> On 10/13/2016 10:08 AM, [email protected] wrote:
> >From: Joonsoo Kim <[email protected]>
> >
> >When we try to find freepage in fallback buddy list, we always serach
> >the largest one. This would help for fragmentation if we process
> >unmovable/reclaimable allocation request because it could cause permanent
> >fragmentation on movable pageblock and spread out such allocations would
> >cause more fragmentation. But, movable allocation request is
> >rather different. It would be simply freed or migrated so it doesn't
> >contribute to fragmentation on the other pageblock. In this case, it would
> >be better not to break the precious highest order freepage so we need to
> >search the smallest freepage first.
>
> I've also pondered this, but then found a lower hanging fruit that
> should be hopefully clear win and mitigate most cases of breaking
> high-order pages unnecessarily:
>
> http://marc.info/?l=linux-mm&m=147582914330198&w=2

Yes, I agree with that change. It's similar to a patch I tried
before.

"mm/page_alloc: don't break highest order freepage if steal"
http://marc.info/?l=linux-mm&m=143011930520417&w=2

>
> So I would try that first, and then test your patch on top? In your
> patch there's a risk that we make it harder for
> unmovable/reclaimable pageblocks to become movable again (we start
> with the smallest page which means there's lower chance that
> move_freepages_block() will convert more than half of the block).

Indeed, but with your "count movable pages when stealing" change, that risk
would disappear. :)

> And Johannes's report seems to be about a regression in exactly this
> aspect of the heuristics.

Even if your change slows down the breaking of high-order freepages, it
only provides a small delay. The high-order freepage would be broken soon
anyway, and we cannot prevent the number of high-order freepages in the
system from decreasing. With my approach, high-order freepages would
stay intact for a longer time.

For Johannes's case, my approach doesn't aim at recovering from that
situation. Instead, it tries to prevent the situation where the
migratetype of a pageblock is changed.

Thanks.

2016-10-14 01:27:59

by Joonsoo Kim

Subject: Re: [RFC PATCH 3/5] mm/page_alloc: stop instantly reusing freed page

On Thu, Oct 13, 2016 at 12:59:14PM +0200, Vlastimil Babka wrote:
> On 10/13/2016 10:08 AM, [email protected] wrote:
> >From: Joonsoo Kim <[email protected]>
> >
> >Allocation/free pattern is usually sequantial. If they are freed to
> >the buddy list, they can be coalesced. However, we first keep these freed
> >pages at the pcp list and try to reuse them until threshold is reached
> >so we don't have enough chance to get a high order freepage. This reusing
> >would provide us some performance advantages since we don't need to
> >get the zone lock and we don't pay the cost to check buddy merging.
> >But, less fragmentation and more high order freepage would compensate
> >this overhead in other ways. First, we would trigger less direct
> >compaction which has high overhead. And, there are usecases that uses
> >high order page to boost their performance.
> >
> >Instantly resuing freed page seems to provide us computational benefit
> >but the other affects more precious things like as I/O performance and
> >memory consumption so I think that it's a good idea to weight
> >later advantage more.
>
> Again, there's also cache hotness to consider. And whether the
> sequential pattern is still real on a system with higher uptime.
> Should be possible to evaluate with tracepoints?

I answered this in the previous e-mail. Anyway, we should evaluate the
cache effect. Tracepoints or perf's cache events would show some
evidence. I will do it soon and report again.

Thanks.

2016-10-14 01:57:46

by Joonsoo Kim

Subject: Re: [RFC PATCH 5/5] mm/page_alloc: support fixed migratetype pageblock

On Thu, Oct 13, 2016 at 01:05:11PM +0200, Vlastimil Babka wrote:
> On 10/13/2016 10:08 AM, [email protected] wrote:
> >From: Joonsoo Kim <[email protected]>
> >
> >We have migratetype facility to minimise fragmentation. It dynamically
> >changes migratetype of pageblock based on some criterias but it never
> >be perfect. Some migratetype pages are often placed in the other
> >migratetype pageblock. We call this pageblock as mixed pageblock.
> >
> >There are two types of mixed pageblock. Movable page on unmovable
> >pageblock and unmovable page on movable pageblock. (I simply ignore
> >reclaimble migratetype/pageblock for easy explanation.) Earlier case is
> >not a big problem because movable page is reclaimable or migratable. We can
> >reclaim/migrate it when necessary so it usually doesn't contribute
> >fragmentation. Actual problem is caused by later case. We don't have
> >any way to reclaim/migrate this page and it prevents to make high order
> >freepage.
> >
> >This later case happens when there is too less unmovable freepage. When
> >unmovable freepage runs out, fallback allocation happens and unmovable
> >allocation would be served by movable pageblock.
> >
> >To solve/prevent this problem, we need to have enough unmovable freepage
> >to satisfy all unmovable allocation request by unmovable pageblock.
> >If we set enough unmovable pageblock at boot and fix it's migratetype
> >until power off, we would have more unmovable freepage during runtime and
> >mitigate above problem.
> >
> >This patch provides a way to set minimum number of unmovable pageblock
> >at boot time. In my test, with proper setup, I can't see any mixed
> >pageblock where unmovable allocation stay on movable pageblock.
>
> So if I get this correctly, the fixed-as-unmovable bit doesn't
> actually prevent fallbacks to such pageblocks? Then I'm surprised
> that's enough to make any difference. Also Johannes's problem is
> that there are too many unmovable pageblocks, so I'm a bit skeptical
> that simply preallocating some will help his workload. But we'll
> see...

This patch standalone would not help Johannes's problem, but with the
whole series it would make some difference.

I started this series motivated by Johannes's report, but it doesn't
focus entirely on his problem. Our Android systems also have a
long-standing fragmentation problem, and I hope that this patchset will
help them, too.

>
> In any case I wouldn't pursue a solution that requires user
> configuration, until as a last resort. Hopefully we can make the
> heuristics good enough so that's not necessary. Sorry for my mostly
> negative feedback to your series, I'm glad you pursuit this as well,
> and hope we'll eventually find a good solution :)

I'm fine with your feedback. It's valuable. I also don't want to pursue a
method that requires user configuration, but there may be cases where it is
necessary. The amount of allocation requests with a specific
migratetype on our systems varies a lot. The migratetype of a pageblock
would be changed frequently in this situation, and frequently changing
migratetypes would increase mixed pageblocks and cause permanent
fragmentation.

Thanks.

2016-10-14 10:52:45

by Vlastimil Babka

Subject: Re: [RFC PATCH 2/5] mm/page_alloc: use smallest fallback page first in movable allocation

On 10/14/2016 03:26 AM, Joonsoo Kim wrote:
> On Thu, Oct 13, 2016 at 11:12:10AM +0200, Vlastimil Babka wrote:
>> On 10/13/2016 10:08 AM, [email protected] wrote:
>> >From: Joonsoo Kim <[email protected]>
>> >
>> >When we try to find freepage in fallback buddy list, we always serach
>> >the largest one. This would help for fragmentation if we process
>> >unmovable/reclaimable allocation request because it could cause permanent
>> >fragmentation on movable pageblock and spread out such allocations would
>> >cause more fragmentation. But, movable allocation request is
>> >rather different. It would be simply freed or migrated so it doesn't
>> >contribute to fragmentation on the other pageblock. In this case, it would
>> >be better not to break the precious highest order freepage so we need to
>> >search the smallest freepage first.
>>
>> I've also pondered this, but then found a lower hanging fruit that
>> should be hopefully clear win and mitigate most cases of breaking
>> high-order pages unnecessarily:
>>
>> http://marc.info/?l=linux-mm&m=147582914330198&w=2
>
> Yes, I agree with that change. That's the similar patch what I tried
> before.
>
> "mm/page_alloc: don't break highest order freepage if steal"
> http://marc.info/?l=linux-mm&m=143011930520417&w=2

Ah, indeed, I forgot about it and had to rediscover :)

>
>>
>> So I would try that first, and then test your patch on top? In your
>> patch there's a risk that we make it harder for
>> unmovable/reclaimable pageblocks to become movable again (we start
>> with the smallest page which means there's lower chance that
>> move_freepages_block() will convert more than half of the block).
>
> Indeed, but, with your "count movable pages when stealing", risk would
> disappear. :)

Hmm, but that counting is only triggered when we attempt to steal a whole
pageblock. For a movable allocation, can_steal_fallback() allows that only
for (order >= pageblock_order / 2), and since your patch makes "order" as
small as possible for movable allocations, the chances are lower?
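
For reference, the heuristic referred to here looks roughly like this
(paraphrased from mm/page_alloc.c of that time; details may differ):

static bool can_steal_fallback(unsigned int order, int start_mt)
{
	/* A request spanning a whole pageblock can always steal it. */
	if (order >= pageblock_order)
		return true;

	/*
	 * Otherwise allow stealing (and possibly converting the block)
	 * only for large-enough requests or unmovable/reclaimable ones.
	 */
	if (order >= pageblock_order / 2 ||
	    start_mt == MIGRATE_RECLAIMABLE ||
	    start_mt == MIGRATE_UNMOVABLE ||
	    page_group_by_mobility_disabled)
		return true;

	return false;
}

With the patch 2 change, a MIGRATE_MOVABLE request starts at the smallest
suitable order, so the (order >= pageblock_order / 2) condition is satisfied
less often.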

>> And Johannes's report seems to be about a regression in exactly this
>> aspect of the heuristics.
>
> Even if your change slows down the breaking high order freepage, but,
> it would provide just a small delay to break. High order freepage
> would be broken soon and we cannot prevent to decrease high order
> freepage in the system. With my approach, high order freepage would
> stay longer time.
>
> For Johannes case, my approach doesn't aim at recovering from that
> situation. Instead, it tries to prevent such situation that
> migratetype of pageblock is changed.
>
> Thanks.
>

2016-10-17 09:25:29

by Xishi Qiu

Subject: Re: [RFC PATCH 1/5] mm/page_alloc: always add freeing page at the tail of the buddy list

On 2016/10/13 16:08, [email protected] wrote:

> From: Joonsoo Kim <[email protected]>
>
> Currently, freeing page can stay longer in the buddy list if next higher
> order page is in the buddy list in order to help coalescence. However,
> it doesn't work for the simplest sequential free case. For example, think
> about the situation that 8 consecutive pages are freed in sequential
> order.
>
> page 0: attached at the head of order 0 list
> page 1: merged with page 0, attached at the head of order 1 list
> page 2: attached at the tail of order 0 list
> page 3: merged with page 2 and then merged with page 0, attached at
> the head of order 2 list
> page 4: attached at the head of order 0 list
> page 5: merged with page 4, attached at the tail of order 1 list
> page 6: attached at the tail of order 0 list
> page 7: merged with page 6 and then merged with page 4. Lastly, merged
> with page 0 and we get order 3 freepage.
>
> With excluding page 0 case, there are three cases that freeing page is
> attached at the head of buddy list in this example and if just one
> corresponding ordered allocation request comes at that moment, this page
> in being a high order page will be allocated and we would fail to make
> order-3 freepage.
>
> Allocation usually happens in sequential order and free also does. So, it
> would be important to detect such a situation and to give some chance
> to be coalesced.
>
> I think that simple and effective heuristic about this case is just
> attaching freeing page at the tail of the buddy list unconditionally.
> If freeing isn't merged during one rotation, it would be actual
> fragmentation and we don't need to care about it for coalescence.
>

Hi Joonsoo,

I found another two places where fragmentation could be reduced.

1)
__rmqueue_fallback
steal_suitable_fallback
move_freepages_block
move_freepages
list_move
If we steal some free pages, we add these pages at the head of the
start_migratetype list. This will make the migratetype more fixed, because
these pages will be allocated more easily. So how about using
list_move_tail instead of list_move?

2)
__rmqueue_fallback
expand
list_add
How about using list_add_tail instead of list_add? If we add to the tail, the
remaining pages will be harder to allocate and we can merge them again as
soon as the page is freed.

Thanks,
Xishi Qiu

2016-10-26 04:36:34

by Joonsoo Kim

Subject: Re: [RFC PATCH 1/5] mm/page_alloc: always add freeing page at the tail of the buddy list

On Mon, Oct 17, 2016 at 05:21:54PM +0800, Xishi Qiu wrote:
> On 2016/10/13 16:08, [email protected] wrote:
>
> > From: Joonsoo Kim <[email protected]>
> >
> > Currently, freeing page can stay longer in the buddy list if next higher
> > order page is in the buddy list in order to help coalescence. However,
> > it doesn't work for the simplest sequential free case. For example, think
> > about the situation that 8 consecutive pages are freed in sequential
> > order.
> >
> > page 0: attached at the head of order 0 list
> > page 1: merged with page 0, attached at the head of order 1 list
> > page 2: attached at the tail of order 0 list
> > page 3: merged with page 2 and then merged with page 0, attached at
> > the head of order 2 list
> > page 4: attached at the head of order 0 list
> > page 5: merged with page 4, attached at the tail of order 1 list
> > page 6: attached at the tail of order 0 list
> > page 7: merged with page 6 and then merged with page 4. Lastly, merged
> > with page 0 and we get order 3 freepage.
> >
> > With excluding page 0 case, there are three cases that freeing page is
> > attached at the head of buddy list in this example and if just one
> > corresponding ordered allocation request comes at that moment, this page
> > in being a high order page will be allocated and we would fail to make
> > order-3 freepage.
> >
> > Allocation usually happens in sequential order and free also does. So, it
> > would be important to detect such a situation and to give some chance
> > to be coalesced.
> >
> > I think that simple and effective heuristic about this case is just
> > attaching freeing page at the tail of the buddy list unconditionally.
> > If freeing isn't merged during one rotation, it would be actual
> > fragmentation and we don't need to care about it for coalescence.
> >
>
> Hi Joonsoo,
>
> I find another two places to reduce fragmentation.
>
> 1)
> __rmqueue_fallback
> steal_suitable_fallback
> move_freepages_block
> move_freepages
> list_move
> If we steal some free pages, we will add these page at the head of start_migratetype list,
> this will cause more fixed migratetype, because this pages will be allocated more easily.
> So how about use list_move_tail instead of list_move?

Yeah... I haven't thought about it deeply, but at a glance it looks helpful.

>
> 2)
> __rmqueue_fallback
> expand
> list_add
> How about use list_add_tail instead of list_add? If add the tail, then the rest of pages
> will be hard to be allocated and we can merge them again as soon as the page freed.

I guess it has no effect. When we do __rmqueue_fallback() and
expand(), we don't have any freepage at this or a higher order. So
list_add and list_add_tail will show the same result.

Thanks.

2016-10-26 04:40:19

by Joonsoo Kim

Subject: Re: [RFC PATCH 2/5] mm/page_alloc: use smallest fallback page first in movable allocation

On Fri, Oct 14, 2016 at 12:52:26PM +0200, Vlastimil Babka wrote:
> On 10/14/2016 03:26 AM, Joonsoo Kim wrote:
> >On Thu, Oct 13, 2016 at 11:12:10AM +0200, Vlastimil Babka wrote:
> >>On 10/13/2016 10:08 AM, [email protected] wrote:
> >>>From: Joonsoo Kim <[email protected]>
> >>>
> >>>When we try to find freepage in fallback buddy list, we always serach
> >>>the largest one. This would help for fragmentation if we process
> >>>unmovable/reclaimable allocation request because it could cause permanent
> >>>fragmentation on movable pageblock and spread out such allocations would
> >>>cause more fragmentation. But, movable allocation request is
> >>>rather different. It would be simply freed or migrated so it doesn't
> >>>contribute to fragmentation on the other pageblock. In this case, it would
> >>>be better not to break the precious highest order freepage so we need to
> >>>search the smallest freepage first.
> >>
> >>I've also pondered this, but then found a lower hanging fruit that
> >>should be hopefully clear win and mitigate most cases of breaking
> >>high-order pages unnecessarily:
> >>
> >>http://marc.info/?l=linux-mm&m=147582914330198&w=2
> >
> >Yes, I agree with that change. That's the similar patch what I tried
> >before.
> >
> >"mm/page_alloc: don't break highest order freepage if steal"
> >http://marc.info/?l=linux-mm&m=143011930520417&w=2
>
> Ah, indeed, I forgot about it and had to rediscover :)
>
> >
> >>
> >>So I would try that first, and then test your patch on top? In your
> >>patch there's a risk that we make it harder for
> >>unmovable/reclaimable pageblocks to become movable again (we start
> >>with the smallest page which means there's lower chance that
> >>move_freepages_block() will convert more than half of the block).
> >
> >Indeed, but, with your "count movable pages when stealing", risk would
> >disappear. :)
>
> Hmm, but that counting is only triggered when we attempt to steal
> whole pageblock. For movable allocation, can_steal_fallback() allows
> that only for
> (order >= pageblock_order / 2), and since your patch makes "order"
> as small as possible for movable allocations, the chances are lower?

The chances are lower than now, but we would eventually try to steal such a
(order >= pageblock_order / 2) freepage from an unmovable pageblock, and
your logic would then result in changing the pageblock migratetype from
unmovable to movable.

Thanks.

2016-10-26 05:52:17

by Xishi Qiu

Subject: Re: [RFC PATCH 1/5] mm/page_alloc: always add freeing page at the tail of the buddy list

On 2016/10/26 12:37, Joonsoo Kim wrote:

> On Mon, Oct 17, 2016 at 05:21:54PM +0800, Xishi Qiu wrote:
>> On 2016/10/13 16:08, [email protected] wrote:
>>
>>> From: Joonsoo Kim <[email protected]>
>>>
>>> Currently, freeing page can stay longer in the buddy list if next higher
>>> order page is in the buddy list in order to help coalescence. However,
>>> it doesn't work for the simplest sequential free case. For example, think
>>> about the situation that 8 consecutive pages are freed in sequential
>>> order.
>>>
>>> page 0: attached at the head of order 0 list
>>> page 1: merged with page 0, attached at the head of order 1 list
>>> page 2: attached at the tail of order 0 list
>>> page 3: merged with page 2 and then merged with page 0, attached at
>>> the head of order 2 list
>>> page 4: attached at the head of order 0 list
>>> page 5: merged with page 4, attached at the tail of order 1 list
>>> page 6: attached at the tail of order 0 list
>>> page 7: merged with page 6 and then merged with page 4. Lastly, merged
>>> with page 0 and we get order 3 freepage.
>>>
>>> With excluding page 0 case, there are three cases that freeing page is
>>> attached at the head of buddy list in this example and if just one
>>> corresponding ordered allocation request comes at that moment, this page
>>> in being a high order page will be allocated and we would fail to make
>>> order-3 freepage.
>>>
>>> Allocation usually happens in sequential order and free also does. So, it
>>> would be important to detect such a situation and to give some chance
>>> to be coalesced.
>>>
>>> I think that simple and effective heuristic about this case is just
>>> attaching freeing page at the tail of the buddy list unconditionally.
>>> If freeing isn't merged during one rotation, it would be actual
>>> fragmentation and we don't need to care about it for coalescence.
>>>
>>
>> Hi Joonsoo,
>>
>> I find another two places to reduce fragmentation.
>>
>> 1)
>> __rmqueue_fallback
>> steal_suitable_fallback
>> move_freepages_block
>> move_freepages
>> list_move
>> If we steal some free pages, we will add these page at the head of start_migratetype list,
>> this will cause more fixed migratetype, because this pages will be allocated more easily.
>> So how about use list_move_tail instead of list_move?
>
> Yeah... I don't think deeply but, at a glance, it would be helpful.
>
>>
>> 2)
>> __rmqueue_fallback
>> expand
>> list_add
>> How about use list_add_tail instead of list_add? If add the tail, then the rest of pages
>> will be hard to be allocated and we can merge them again as soon as the page freed.
>
> I guess that it has no effect. When we do __rmqueue_fallback() and
> expand(), we don't have any freepage on this or more order. So,
> list_add or list_add_tail will show the same result.
>

Hi Joonsoo,

Usually this list is empty, but in the following case, the list is not empty.

__rmqueue_fallback
steal_suitable_fallback
move_freepages_block // move to the list of start_migratetype
expand // split the largest order first
list_add // add to the list of start_migratetype
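
For reference, expand() here is roughly the following (paraphrased,
with the debug checks and guard-page handling omitted, so it may not
match the tree exactly):

static inline void expand(struct zone *zone, struct page *page,
	int low, int high, struct free_area *area,
	int migratetype)
{
	unsigned long size = 1 << high;

	while (high > low) {
		area--;
		high--;
		size >>= 1;

		/*
		 * The unused halves of the split page go back to
		 * free_list[migratetype] here.  With list_add() they are
		 * placed at the head, in front of whatever
		 * move_freepages_block() has just moved onto that list.
		 */
		list_add(&page[size].lru, &area->free_list[migratetype]);
		area->nr_free++;
		set_page_order(&page[size], high);
	}
}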

Thanks,
Xishi Qiu



2016-10-26 05:58:24

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] mm/page_alloc: always add freeing page at the tail of the buddy list

On Wed, Oct 26, 2016 at 01:50:37PM +0800, Xishi Qiu wrote:
> On 2016/10/26 12:37, Joonsoo Kim wrote:
>
> > On Mon, Oct 17, 2016 at 05:21:54PM +0800, Xishi Qiu wrote:
> >> On 2016/10/13 16:08, [email protected] wrote:
> >>
> >>> From: Joonsoo Kim <[email protected]>
> >>>
> >>> Currently, freeing page can stay longer in the buddy list if next higher
> >>> order page is in the buddy list in order to help coalescence. However,
> >>> it doesn't work for the simplest sequential free case. For example, think
> >>> about the situation that 8 consecutive pages are freed in sequential
> >>> order.
> >>>
> >>> page 0: attached at the head of order 0 list
> >>> page 1: merged with page 0, attached at the head of order 1 list
> >>> page 2: attached at the tail of order 0 list
> >>> page 3: merged with page 2 and then merged with page 0, attached at
> >>> the head of order 2 list
> >>> page 4: attached at the head of order 0 list
> >>> page 5: merged with page 4, attached at the tail of order 1 list
> >>> page 6: attached at the tail of order 0 list
> >>> page 7: merged with page 6 and then merged with page 4. Lastly, merged
> >>> with page 0 and we get order 3 freepage.
> >>>
> >>> With excluding page 0 case, there are three cases that freeing page is
> >>> attached at the head of buddy list in this example and if just one
> >>> corresponding ordered allocation request comes at that moment, this page
> >>> in being a high order page will be allocated and we would fail to make
> >>> order-3 freepage.
> >>>
> >>> Allocation usually happens in sequential order and free also does. So, it
> >>> would be important to detect such a situation and to give some chance
> >>> to be coalesced.
> >>>
> >>> I think that simple and effective heuristic about this case is just
> >>> attaching freeing page at the tail of the buddy list unconditionally.
> >>> If freeing isn't merged during one rotation, it would be actual
> >>> fragmentation and we don't need to care about it for coalescence.
> >>>
> >>
> >> Hi Joonsoo,
> >>
> >> I find another two places to reduce fragmentation.
> >>
> >> 1)
> >> __rmqueue_fallback
> >> steal_suitable_fallback
> >> move_freepages_block
> >> move_freepages
> >> list_move
> >> If we steal some free pages, we will add these page at the head of start_migratetype list,
> >> this will cause more fixed migratetype, because this pages will be allocated more easily.
> >> So how about use list_move_tail instead of list_move?
> >
> > Yeah... I don't think deeply but, at a glance, it would be helpful.
> >
> >>
> >> 2)
> >> __rmqueue_fallback
> >> expand
> >> list_add
> >> How about use list_add_tail instead of list_add? If add the tail, then the rest of pages
> >> will be hard to be allocated and we can merge them again as soon as the page freed.
> >
> > I guess that it has no effect. When we do __rmqueue_fallback() and
> > expand(), we don't have any freepage on this or more order. So,
> > list_add or list_add_tail will show the same result.
> >
>
> Hi Joonsoo,
>
> Usually this list is empty, but in the following case, the list is not empty.
>
> __rmqueue_fallback
> steal_suitable_fallback
> move_freepages_block // move to the list of start_migratetype
> expand // split the largest order first
> list_add // add to the list of start_migratetype

In this case, the freepages stolen in steal_suitable_fallback() and the
freepage split off in expand() would come from the same pageblock. So it
doesn't matter which of the list_add* functions we use.
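
To illustrate, the flow I mean is roughly the following (ignoring the
can_steal checks; details are from memory, so they may not exactly
match the code):

__rmqueue_fallback(zone, order, start_migratetype)
	page = list_first_entry(&area->free_list[fallback_mt], ...)
	steal_suitable_fallback(zone, page, start_migratetype)
		move_freepages_block(zone, page, start_migratetype)
			// every freepage of page's pageblock, including
			// 'page' itself, is moved to the start_migratetype list
	list_del(&page->lru)	// take 'page' itself off that list
	expand(zone, page, order, current_order, area, start_migratetype)
		// the remainder that gets list_add()ed is split off 'page',
		// so it belongs to the very same pageblock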

Thanks.

2016-10-26 06:09:23

by Xishi Qiu

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] mm/page_alloc: always add freeing page at the tail of the buddy list

On 2016/10/26 13:59, Joonsoo Kim wrote:

> On Wed, Oct 26, 2016 at 01:50:37PM +0800, Xishi Qiu wrote:
>> On 2016/10/26 12:37, Joonsoo Kim wrote:
>>
>>> On Mon, Oct 17, 2016 at 05:21:54PM +0800, Xishi Qiu wrote:
>>>> On 2016/10/13 16:08, [email protected] wrote:
>>>>
>>>>> From: Joonsoo Kim <[email protected]>
>>>>>
>>>>> Currently, freeing page can stay longer in the buddy list if next higher
>>>>> order page is in the buddy list in order to help coalescence. However,
>>>>> it doesn't work for the simplest sequential free case. For example, think
>>>>> about the situation that 8 consecutive pages are freed in sequential
>>>>> order.
>>>>>
>>>>> page 0: attached at the head of order 0 list
>>>>> page 1: merged with page 0, attached at the head of order 1 list
>>>>> page 2: attached at the tail of order 0 list
>>>>> page 3: merged with page 2 and then merged with page 0, attached at
>>>>> the head of order 2 list
>>>>> page 4: attached at the head of order 0 list
>>>>> page 5: merged with page 4, attached at the tail of order 1 list
>>>>> page 6: attached at the tail of order 0 list
>>>>> page 7: merged with page 6 and then merged with page 4. Lastly, merged
>>>>> with page 0 and we get order 3 freepage.
>>>>>
>>>>> With excluding page 0 case, there are three cases that freeing page is
>>>>> attached at the head of buddy list in this example and if just one
>>>>> corresponding ordered allocation request comes at that moment, this page
>>>>> in being a high order page will be allocated and we would fail to make
>>>>> order-3 freepage.
>>>>>
>>>>> Allocation usually happens in sequential order and free also does. So, it
>>>>> would be important to detect such a situation and to give some chance
>>>>> to be coalesced.
>>>>>
>>>>> I think that simple and effective heuristic about this case is just
>>>>> attaching freeing page at the tail of the buddy list unconditionally.
>>>>> If freeing isn't merged during one rotation, it would be actual
>>>>> fragmentation and we don't need to care about it for coalescence.
>>>>>
>>>>
>>>> Hi Joonsoo,
>>>>
>>>> I find another two places to reduce fragmentation.
>>>>
>>>> 1)
>>>> __rmqueue_fallback
>>>> steal_suitable_fallback
>>>> move_freepages_block
>>>> move_freepages
>>>> list_move
>>>> If we steal some free pages, we will add these page at the head of start_migratetype list,
>>>> this will cause more fixed migratetype, because this pages will be allocated more easily.
>>>> So how about use list_move_tail instead of list_move?
>>>
>>> Yeah... I don't think deeply but, at a glance, it would be helpful.
>>>
>>>>
>>>> 2)
>>>> __rmqueue_fallback
>>>> expand
>>>> list_add
>>>> How about use list_add_tail instead of list_add? If add the tail, then the rest of pages
>>>> will be hard to be allocated and we can merge them again as soon as the page freed.
>>>
>>> I guess that it has no effect. When we do __rmqueue_fallback() and
>>> expand(), we don't have any freepage on this or more order. So,
>>> list_add or list_add_tail will show the same result.
>>>
>>
>> Hi Joonsoo,
>>
>> Usually this list is empty, but in the following case, the list is not empty.
>>
>> __rmqueue_fallback
>> steal_suitable_fallback
>> move_freepages_block // move to the list of start_migratetype
>> expand // split the largest order first
>> list_add // add to the list of start_migratetype
>
> In this case, the freepages stolen in steal_suitable_fallback() and the
> freepage split off in expand() would come from the same pageblock. So it
> doesn't matter which of the list_add* functions we use.
>

Yes, they are from the same pageblock; the stolen freepages are moved to
the start_migratetype list, and expand() adds the remainder to the same
migratetype list too, but that list may not be empty because of the
stolen freepages. So when we split the largest order, adding the
remainder at the tail would make it less likely to be allocated, right?
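
If so, the change I mean for 2) would be just something like this in
expand() (untested sketch):

-		list_add(&page[size].lru, &area->free_list[migratetype]);
+		list_add_tail(&page[size].lru, &area->free_list[migratetype]);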

Thanks,
Xishi Qiu

> Thanks.
>