2014-07-04 07:52:47

by Joonsoo Kim

Subject: [PATCH 00/10] fix freepage count problems due to memory isolation

Hello,

This patchset aims at fixing problems due to memory isolation that were
found by testing my patchset [1].

These are really subtle problems, so I may be wrong. If you find
something I am missing, please let me know.

Before describing the bugs themselves, I first explain the definition of a freepage.

1. Pages on a buddy list are counted as freepages.
2. Pages on an isolate-migratetype buddy list are *not* counted as freepages.
3. Pages on a CMA buddy list are counted as CMA freepages, too.
4. Pages used as guard pages are *not* counted as freepages.
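
These rules map onto the zone counters roughly like this; the helper
below follows the shape of __mod_zone_freepage_state() (see
include/linux/vmstat.h), with comments keyed to the rules added by me:

static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
					     int migratetype)
{
	/* rule 1: buddy pages count toward NR_FREE_PAGES */
	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

	/* rule 3: CMA pages additionally count toward NR_FREE_CMA_PAGES */
	if (is_migrate_cma(migratetype))
		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}

Rule 2 is enforced by the callers, which skip the helper entirely for
MIGRATE_ISOLATE, e.g. in free_one_page():

	if (!is_migrate_isolate(migratetype))
		__mod_zone_freepage_state(zone, 1 << order, migratetype);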

Now I describe the problems and the related patches.

1. Patch 2: If a guard page is cleared and merged into the isolate buddy
list, we should not increase the freepage count.

2. Patch 3: When a page returns from the pcp to the buddy allocator, we
should account for it in the freepage counter. In this case, we should
check the pageblock migratetype of the page and insert the page into the
appropriate buddy list. Although the current code checks the migratetype,
it doesn't insert the page into the appropriate buddy list, so the
freepage counting can be wrong.

3. Patch 4: There is a race condition through which some freepages can
end up on the isolate buddy list. If that happens, we can't use these
pages until the next isolation attempt on this pageblock.

4. Patch 5: There is a race condition through which a page on an isolate
pageblock can go onto a non-isolate buddy list. If that happens, the
buddy allocator may merge a page on a non-isolate buddy list with one on
an isolate buddy list, and the freepage count will be wrong.

5. Patch 9: move_freepages(_block) does *not* return the number of moved
pages. Instead, it returns the number of pages linked into that
migratetype's buddy list, so accounting with this return value makes the
freepage count wrong (see the abridged loop after this list).

6. Patch 10: The buddy allocator may merge a page on a non-isolate buddy
list with one on an isolate buddy list. This leads to a freepage counting
problem, so fix it by stopping the merge in this case.
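
To make problem 5 concrete, here is the core loop of the current
move_freepages(), lightly abridged; the comment marking the problem is
mine:

	for (page = start_page; page <= end_page;) {
		if (!PageBuddy(page)) {
			page++;
			continue;
		}

		order = page_order(page);
		/*
		 * A free page that already sits on the target list is
		 * counted as well (list_move() onto the same list is
		 * harmless), so pages_moved ends up as the number of
		 * free pages *linked* in the range, not the number
		 * actually moved.
		 */
		list_move(&page->lru,
			  &zone->free_area[order].free_list[migratetype]);
		set_freepage_migratetype(page, migratetype);
		page += 1 << order;
		pages_moved += 1 << order;
	}

	return pages_moved;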

Without patchset [1], the above problems don't happen in my CMA
allocation test, because the CMA reserved pages aren't used at all, so
there is no chance for the races above.

With patchset [1], I did a simple CMA allocation test and got the result below.

- Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
- kernel build (make -j16) running in the background
- 30 CMA allocation attempts (8MB * 30 = 240MB) at 5 sec intervals
- Result: more than 5000 freepages went missing from the count

With patchset [1] and this patchset, I found that no freepage counts go
missing, so I conclude that the problems are solved.

These problems may also affect memory hot-remove users, although
I didn't check that further.

The other patches are either groundwork for these fixes or simple
clean-ups. Please see the individual patches for more information.

This patchset is based on linux-next-20140703.

Thanks.

[1]: Aggressively allocate the pages on cma reserved memory
https://lkml.org/lkml/2014/5/30/291


Joonsoo Kim (10):
mm/page_alloc: remove unlikely macro on free_one_page()
mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
mm/page_alloc: handle page on pcp correctly if its pageblock is
isolated
mm/page_alloc: carefully free the page on isolate pageblock
mm/page_alloc: optimize and unify pageblock migratetype check in free
path
mm/page_alloc: separate freepage migratetype interface
mm/page_alloc: store migratetype of the buddy list into freepage
correctly
mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
mm/page_alloc: fix possible wrongly calculated freepage counter
mm/page_alloc: Stop merging pages on non-isolate and isolate buddy
list

include/linux/mm.h | 30 +++++++--
include/linux/mmzone.h | 5 ++
include/linux/page-isolation.h | 8 +++
mm/page_alloc.c | 138 +++++++++++++++++++++++++++++-----------
mm/page_isolation.c | 18 ++----
5 files changed, 147 insertions(+), 52 deletions(-)

--
1.7.9.5


2014-07-04 07:52:53

by Joonsoo Kim

Subject: [PATCH 07/10] mm/page_alloc: store migratetype of the buddy list into freepage correctly

Whenever a page is split or merged, we don't set the migratetype on the
new page, so there is no accurate onbuddy_migratetype information. To
maintain it correctly, we should reset it whenever the page order
changes. I think that set_page_order() is the best place to do this,
because it is called whenever a page is merged or split. Hence, this
patch adds a set_onbuddy_migratetype() call to set_page_order().

This patch also makes set/get_onbuddy_migratetype() enabled only if
memory isolation is enabled, because they aren't needed otherwise.

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mm.h | 6 ++++++
mm/page_alloc.c | 9 +++++----
2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 278ecfd..b35bd3b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -281,6 +281,7 @@ struct inode;
#define page_private(page) ((page)->private)
#define set_page_private(page, v) ((page)->private = (v))

+#if defined(CONFIG_MEMORY_ISOLATION)
static inline void set_onbuddy_migratetype(struct page *page, int migratetype)
{
page->index = migratetype;
@@ -294,6 +295,11 @@ static inline int get_onbuddy_migratetype(struct page *page)
{
return page->index;
}
+#else
+static inline void set_onbuddy_migratetype(struct page *page,
+ int migratetype) {}
+static inline int get_onbuddy_migratetype(struct page *page) { return 0; }
+#endif

static inline void set_onpcp_migratetype(struct page *page, int migratetype)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d8ba2d..e1c4c3e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -465,9 +465,11 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype) {}
#endif

-static inline void set_page_order(struct page *page, unsigned int order)
+static inline void set_page_order(struct page *page, unsigned int order,
+ int migratetype)
{
set_page_private(page, order);
+ set_onbuddy_migratetype(page, migratetype);
__SetPageBuddy(page);
}

@@ -633,7 +635,7 @@ static inline void __free_one_page(struct page *page,
page_idx = combined_idx;
order++;
}
- set_page_order(page, order);
+ set_page_order(page, order, migratetype);

/*
* If this is not the largest possible page, check if the buddy
@@ -797,7 +799,6 @@ static void __free_pages_ok(struct page *page, unsigned int order)
migratetype = get_pfnblock_migratetype(page, pfn);
local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order);
- set_onbuddy_migratetype(page, migratetype);
free_one_page(page_zone(page), page, pfn, order, migratetype);
local_irq_restore(flags);
}
@@ -943,7 +944,7 @@ static inline void expand(struct zone *zone, struct page *page,
#endif
list_add(&page[size].lru, &area->free_list[migratetype]);
area->nr_free++;
- set_page_order(&page[size], high);
+ set_page_order(&page[size], high, migratetype);
}
}

--
1.7.9.5

2014-07-04 07:53:03

by Joonsoo Kim

Subject: [PATCH 06/10] mm/page_alloc: separate freepage migratetype interface

Currently, we use (set/get)_freepage_migratetype for two use cases.
One use case is to know the migratetype of the buddy list where the page
will be linked later. The other is to know the migratetype of the buddy
list where the page is linked now.

Although the information is incomplete for the latter case, there is no
serious problem, because it is only used in limited contexts, such as
memory isolation. But now I'm preparing to fix many freepage counting
bugs, and accurate information about the migratetype of the buddy list
where a page is linked now is really needed, so this incompleteness
becomes a problem.

Before fixing this incompleteness, a separation of the interface is
needed, because the accurate variant is only used if
CONFIG_MEMORY_ISOLATION is enabled and maintaining it has some overhead.

So this patch just separates the interface. There is no functional
change; the following patch will describe what we are missing and fix it.

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mm.h | 24 ++++++++++++++++++++----
mm/page_alloc.c | 18 +++++++++---------
mm/page_isolation.c | 4 ++--
3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e03dd29..278ecfd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -281,14 +281,30 @@ struct inode;
#define page_private(page) ((page)->private)
#define set_page_private(page, v) ((page)->private = (v))

-/* It's valid only if the page is free path or free_list */
-static inline void set_freepage_migratetype(struct page *page, int migratetype)
+static inline void set_onbuddy_migratetype(struct page *page, int migratetype)
{
page->index = migratetype;
}

-/* It's valid only if the page is free path or free_list */
-static inline int get_freepage_migratetype(struct page *page)
+/*
+ * It's valid only if the page is on buddy list. It represents
+ * migratetype of the buddy list where page is linked now.
+ */
+static inline int get_onbuddy_migratetype(struct page *page)
+{
+ return page->index;
+}
+
+static inline void set_onpcp_migratetype(struct page *page, int migratetype)
+{
+ page->index = migratetype;
+}
+
+/*
+ * It's valid only if the page is on pcp. It represents migratetype of
+ * the buddy list where page will be linked later when going buddy.
+ */
+static inline int get_onpcp_migratetype(struct page *page)
{
return page->index;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dcc2f08..9d8ba2d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -737,7 +737,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
page = list_entry(list->prev, struct page, lru);
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
- mt = get_freepage_migratetype(page);
+ mt = get_onpcp_migratetype(page);

/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
@@ -797,7 +797,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
migratetype = get_pfnblock_migratetype(page, pfn);
local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order);
- set_freepage_migratetype(page, migratetype);
+ set_onbuddy_migratetype(page, migratetype);
free_one_page(page_zone(page), page, pfn, order, migratetype);
local_irq_restore(flags);
}
@@ -1023,7 +1023,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
rmv_page_order(page);
area->nr_free--;
expand(zone, page, order, current_order, area, migratetype);
- set_freepage_migratetype(page, migratetype);
+ set_onpcp_migratetype(page, migratetype);
return page;
}

@@ -1091,7 +1091,7 @@ int move_freepages(struct zone *zone,
order = page_order(page);
list_move(&page->lru,
&zone->free_area[order].free_list[migratetype]);
- set_freepage_migratetype(page, migratetype);
+ set_onbuddy_migratetype(page, migratetype);
page += 1 << order;
pages_moved += 1 << order;
}
@@ -1221,12 +1221,12 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)

expand(zone, page, order, current_order, area,
new_type);
- /* The freepage_migratetype may differ from pageblock's
+ /* The onpcp_migratetype may differ from pageblock's
* migratetype depending on the decisions in
* try_to_steal_freepages. This is OK as long as it does
* not differ for MIGRATE_CMA type.
*/
- set_freepage_migratetype(page, new_type);
+ set_onpcp_migratetype(page, new_type);

trace_mm_page_alloc_extfrag(page, order, current_order,
start_migratetype, migratetype, new_type);
@@ -1344,7 +1344,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
else
list_add_tail(&page->lru, list);
list = &page->lru;
- if (is_migrate_cma(get_freepage_migratetype(page)))
+ if (is_migrate_cma(get_onpcp_migratetype(page)))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-(1 << order));
}
@@ -1510,7 +1510,7 @@ void free_hot_cold_page(struct page *page, bool cold)
return;

migratetype = get_pfnblock_migratetype(page, pfn);
- set_freepage_migratetype(page, migratetype);
+ set_onpcp_migratetype(page, migratetype);
local_irq_save(flags);
__count_vm_event(PGFREE);

@@ -1710,7 +1710,7 @@ again:
if (!page)
goto failed;
__mod_zone_freepage_state(zone, -(1 << order),
- get_freepage_migratetype(page));
+ get_onpcp_migratetype(page));
}

__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 1fa4a4d..6e4e86b 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -192,7 +192,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
* is MIGRATE_ISOLATE. Catch it and move the page into
* MIGRATE_ISOLATE list.
*/
- if (get_freepage_migratetype(page) != MIGRATE_ISOLATE) {
+ if (get_onbuddy_migratetype(page) != MIGRATE_ISOLATE) {
struct page *end_page;

end_page = page + (1 << page_order(page)) - 1;
@@ -202,7 +202,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
pfn += 1 << page_order(page);
}
else if (page_count(page) == 0 &&
- get_freepage_migratetype(page) == MIGRATE_ISOLATE)
+ get_onpcp_migratetype(page) == MIGRATE_ISOLATE)
pfn += 1;
else if (skip_hwpoisoned_pages && PageHWPoison(page)) {
/*
--
1.7.9.5

2014-07-04 07:52:59

by Joonsoo Kim

Subject: [PATCH 05/10] mm/page_alloc: optimize and unify pageblock migratetype check in free path

Currently, when we free a page from the pcp list to the buddy allocator,
we check the pageblock of the page in order to free the page onto the
isolate buddy list if its pageblock is isolated. Although this rarely
happens and checking the migratetype of a pageblock is somewhat
expensive, we do the check on the free fast path. I think that this is
undesirable. To prevent this situation, I introduce a new field,
nr_isolate_pageblock, in struct zone and use it to determine whether we
should check the pageblock migratetype. Isolation of a pageblock rarely
happens, so we can mostly avoid this pageblock migratetype check.

Additionally, unify the freepage counting code, because it can be done in
the common part, __free_one_page(). This unification provides the extra
guarantee that a page on an isolate pageblock doesn't go onto a
non-isolate buddy list. This is similar to the situation described in the
previous patch, so refer to it if you need more explanation.

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 5 +++++
include/linux/page-isolation.h | 8 ++++++++
mm/page_alloc.c | 40 +++++++++++++++++++---------------------
mm/page_isolation.c | 2 ++
4 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fd48890..e9c194f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -374,6 +374,11 @@ struct zone {
/* pfn where async and sync compaction migration scanner should start */
unsigned long compact_cached_migrate_pfn[2];
#endif
+
+#ifdef CONFIG_MEMORY_ISOLATION
+ unsigned long nr_isolate_pageblock;
+#endif
+
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 3fff8e7..2dc1e16 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -2,6 +2,10 @@
#define __LINUX_PAGEISOLATION_H

#ifdef CONFIG_MEMORY_ISOLATION
+static inline bool has_isolate_pageblock(struct zone *zone)
+{
+ return zone->nr_isolate_pageblock;
+}
static inline bool is_migrate_isolate_page(struct page *page)
{
return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
@@ -11,6 +15,10 @@ static inline bool is_migrate_isolate(int migratetype)
return migratetype == MIGRATE_ISOLATE;
}
#else
+static inline bool has_isolate_pageblock(struct zone *zone)
+{
+ return false;
+}
static inline bool is_migrate_isolate_page(struct page *page)
{
return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d8feedc..dcc2f08 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -594,6 +594,24 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);

+ /*
+ * pcp pages could go normal buddy list due to stale pageblock
+ * migratetype so re-check it if there is isolate pageblock.
+ *
+ * And, we got migratetype without holding the lock so it could be
+ * racy. If some pages go on the isolate migratetype buddy list
+ * by this race, we can't allocate this page anymore until next
+ * isolation attempt on this pageblock. To prevent this
+ * possibility, re-check migratetype with holding the lock.
+ */
+ if (unlikely(has_isolate_pageblock(zone) ||
+ is_migrate_isolate(migratetype))) {
+ migratetype = get_pfnblock_migratetype(page, pfn);
+ }
+
+ if (!is_migrate_isolate(migratetype))
+ __mod_zone_freepage_state(zone, 1 << order, migratetype);
+
while (order < MAX_ORDER-1) {
buddy_idx = __find_buddy_index(page_idx, order);
buddy = page + (buddy_idx - page_idx);
@@ -719,13 +737,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
page = list_entry(list->prev, struct page, lru);
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
-
- if (unlikely(is_migrate_isolate_page(page))) {
- mt = MIGRATE_ISOLATE;
- } else {
- mt = get_freepage_migratetype(page);
- __mod_zone_freepage_state(zone, 1, mt);
- }
+ mt = get_freepage_migratetype(page);

/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
@@ -742,21 +754,7 @@ static void free_one_page(struct zone *zone,
{
spin_lock(&zone->lock);
zone->pages_scanned = 0;
-
- if (unlikely(is_migrate_isolate(migratetype))) {
- /*
- * We got migratetype without holding the lock so it could be
- * racy. If some pages go on the isolate migratetype buddy list
- * by this race, we can't allocate this page anymore until next
- * isolation attempt on this pageblock. To prevent this
- * possibility, re-check migratetype with holding the lock.
- */
- migratetype = get_pfnblock_migratetype(page, pfn);
- }
-
__free_one_page(page, pfn, zone, order, migratetype);
- if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
spin_unlock(&zone->lock);
}

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index d1473b2..1fa4a4d 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -60,6 +60,7 @@ out:
int migratetype = get_pageblock_migratetype(page);

set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+ zone->nr_isolate_pageblock++;
nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);

__mod_zone_freepage_state(zone, -nr_pages, migratetype);
@@ -83,6 +84,7 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
nr_pages = move_freepages_block(zone, page, migratetype);
__mod_zone_freepage_state(zone, nr_pages, migratetype);
set_pageblock_migratetype(page, migratetype);
+ zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
}
--
1.7.9.5

2014-07-04 07:53:10

by Joonsoo Kim

Subject: [PATCH 04/10] mm/page_alloc: carefully free the page on isolate pageblock

We get the migratetype without holding the lock, so it could be racy. If
some pages go onto the isolate migratetype buddy list through this race,
we can't allocate these pages anymore until the next isolation attempt on
this pageblock. Below is a possible scenario of this race.

pageblock 1 has isolate migratetype.

CPU1                                    CPU2
- get_pfnblock_migratetype(pageblock 1),
  so MIGRATE_ISOLATE is returned
- call free_one_page() with
  MIGRATE_ISOLATE
                                        - grab the zone lock
                                        - unisolate pageblock 1
                                        - release the zone lock
- grab the zone lock
- call __free_one_page() with
  MIGRATE_ISOLATE
- the freed page goes onto the isolate
  buddy list and we can't use it anymore

To prevent this possibility, re-check the migratetype while holding the lock.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 99c05f7..d8feedc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -743,6 +743,17 @@ static void free_one_page(struct zone *zone,
spin_lock(&zone->lock);
zone->pages_scanned = 0;

+ if (unlikely(is_migrate_isolate(migratetype))) {
+ /*
+ * We got migratetype without holding the lock so it could be
+ * racy. If some pages go on the isolate migratetype buddy list
+ * by this race, we can't allocate this page anymore until next
+ * isolation attempt on this pageblock. To prevent this
+ * possibility, re-check migratetype with holding the lock.
+ */
+ migratetype = get_pfnblock_migratetype(page, pfn);
+ }
+
__free_one_page(page, pfn, zone, order, migratetype);
if (!is_migrate_isolate(migratetype))
__mod_zone_freepage_state(zone, 1 << order, migratetype);
--
1.7.9.5

2014-07-04 07:53:53

by Joonsoo Kim

Subject: [PATCH 10/10] mm/page_alloc: Stop merging pages on non-isolate and isolate buddy list

If we merge a page on a non-isolate buddy list with one on an isolate
buddy list, we should fix up the freepage count, because we don't regard
pages on the isolate buddy list as freepages. But this would impose some
overhead on __free_one_page(), which is a core function of the page free
path, so the overhead looks undesirable to me. Instead, we can stop
merging in this case. With this approach, we can skip the freepage count
fixup at low overhead.

The side-effect of this change is that some buddies of order equal to or
larger than pageblock_order aren't merged if one of the buddies is on an
isolate pageblock. But I think that this is no problem, because isolation
means that we will use the pages on the isolate pageblock specially, so
they would be split soon in any case.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80c9bd8..da4da66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -648,6 +648,24 @@ static inline void __free_one_page(struct page *page,
break;

/*
+ * Stop merging between page on non-isolate buddy list and
+ * isolate buddy list, respectively. This case is only possible
+ * for pages equal or larger than pageblock_order, because
+ * pageblock migratetype can be changed in this granularity.
+ */
+ if (unlikely(order >= pageblock_order &&
+ has_isolate_pageblock(zone))) {
+ int buddy_mt = get_onbuddy_migratetype(buddy);
+
+ if (migratetype != buddy_mt) {
+ if (is_migrate_isolate(migratetype))
+ break;
+ else if (is_migrate_isolate(buddy_mt))
+ break;
+ }
+ }
+
+ /*
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
*/
--
1.7.9.5

2014-07-04 07:53:57

by Joonsoo Kim

Subject: [PATCH 09/10] mm/page_alloc: fix possible wrongly calculated freepage counter

When isolating/unisolating a pageblock, we add/subtract the number of
pages returned by move_freepages_block() to/from the freepage count. But
this value is invalid for that calculation, because it is the number of
pages now linked into that migratetype's buddy list rather than the
number of *moved* pages. So the freepage count can become more and more
incorrect with every call to these functions.

And there is one more counting problem, in
__test_page_isolated_in_pageblock(): move_freepages() is called there,
but the freepage count fixup is missed. I think that the counting should
be done in move_freepages() itself; otherwise, future users of this
function will miss the fixup again.

Now we have the proper infrastructure, get_onbuddy_migratetype(), which
can be used to get the current migratetype of the buddy list a page is
on. So fix this situation.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++++++---
mm/page_isolation.c | 12 +++---------
2 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d9fb8bb..80c9bd8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -465,6 +465,33 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype) {}
#endif

+static inline void count_freepage_nr(struct page *page, int order,
+ int *nr_isolate, int *nr_cma, int *nr_others)
+{
+ int nr = 1 << order;
+ int migratetype = get_onbuddy_migratetype(page);
+
+ if (is_migrate_isolate(migratetype))
+ *nr_isolate += nr;
+ else if (is_migrate_cma(migratetype))
+ *nr_cma += nr;
+ else
+ *nr_others += nr;
+}
+
+static void fixup_freepage_nr(struct zone *zone, int migratetype,
+ int nr_isolate, int nr_cma, int nr_others)
+{
+ int nr_free = nr_cma + nr_others;
+
+ if (is_migrate_isolate(migratetype)) {
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -nr_free);
+ __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, -nr_cma);
+ } else {
+ __mod_zone_freepage_state(zone, nr_isolate, migratetype);
+ }
+}
+
static inline void set_page_order(struct page *page, unsigned int order,
int migratetype)
{
@@ -619,6 +646,7 @@ static inline void __free_one_page(struct page *page,
buddy = page + (buddy_idx - page_idx);
if (!page_is_buddy(page, buddy, order))
break;
+
/*
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
@@ -1062,7 +1090,7 @@ int move_freepages(struct zone *zone,
{
struct page *page;
unsigned long order;
- int pages_moved = 0;
+ int nr_pages = 0, nr_isolate = 0, nr_cma = 0, nr_others = 0;

#ifndef CONFIG_HOLES_IN_ZONE
/*
@@ -1090,14 +1118,17 @@ int move_freepages(struct zone *zone,
}

order = page_order(page);
+ count_freepage_nr(page, order,
+ &nr_isolate, &nr_cma, &nr_others);
list_move(&page->lru,
&zone->free_area[order].free_list[migratetype]);
set_onbuddy_migratetype(page, migratetype);
page += 1 << order;
- pages_moved += 1 << order;
+ nr_pages += 1 << order;
}

- return pages_moved;
+ fixup_freepage_nr(zone, migratetype, nr_isolate, nr_cma, nr_others);
+ return nr_pages;
}

int move_freepages_block(struct zone *zone, struct page *page,
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 6e4e86b..62676de 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -56,14 +56,9 @@ int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages)

out:
if (!ret) {
- unsigned long nr_pages;
- int migratetype = get_pageblock_migratetype(page);
-
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
zone->nr_isolate_pageblock++;
- nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
-
- __mod_zone_freepage_state(zone, -nr_pages, migratetype);
+ move_freepages_block(zone, page, MIGRATE_ISOLATE);
}

spin_unlock_irqrestore(&zone->lock, flags);
@@ -75,14 +70,13 @@ out:
void unset_migratetype_isolate(struct page *page, unsigned migratetype)
{
struct zone *zone;
- unsigned long flags, nr_pages;
+ unsigned long flags;

zone = page_zone(page);
spin_lock_irqsave(&zone->lock, flags);
if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
goto out;
- nr_pages = move_freepages_block(zone, page, migratetype);
- __mod_zone_freepage_state(zone, nr_pages, migratetype);
+ move_freepages_block(zone, page, migratetype);
set_pageblock_migratetype(page, migratetype);
zone->nr_isolate_pageblock--;
out:
--
1.7.9.5

2014-07-04 07:54:40

by Joonsoo Kim

Subject: [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type

When isolating a free page, what we want to know is which list the page
is linked on. If it is linked on the isolate migratetype buddy list, we
can skip the watermark check and the freepage accounting. And if it is
linked on the CMA migratetype buddy list, we need to fix up the CMA
freepage count. For this purpose, get_onbuddy_migratetype() is a better
fit, and cheaper, than get_pageblock_migratetype(), so use it.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e1c4c3e..d9fb8bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1597,7 +1597,7 @@ static int __isolate_free_page(struct page *page, unsigned int order)
BUG_ON(!PageBuddy(page));

zone = page_zone(page);
- mt = get_pageblock_migratetype(page);
+ mt = get_onbuddy_migratetype(page);

if (!is_migrate_isolate(mt)) {
/* Obey watermarks as if the page was being allocated */
--
1.7.9.5

2014-07-04 08:07:52

by Joonsoo Kim

Subject: [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page()

Isolation is a really rare case, so !is_migrate_isolate() is the
likely case. Remove the unlikely macro.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8dac0f0..0d4cf7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -735,7 +735,7 @@ static void free_one_page(struct zone *zone,
zone->pages_scanned = 0;

__free_one_page(page, pfn, zone, order, migratetype);
- if (unlikely(!is_migrate_isolate(migratetype)))
+ if (!is_migrate_isolate(migratetype))
__mod_zone_freepage_state(zone, 1 << order, migratetype);
spin_unlock(&zone->lock);
}
--
1.7.9.5

2014-07-04 08:07:49

by Joonsoo Kim

Subject: [PATCH 02/10] mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC

In __free_one_page(), we check whether the buddy page is a guard page
and, if so, we should clear the guard attribute on the buddy page. But
currently we clear the original page's order rather than the buddy's.
This doesn't cause any problem, because resetting the buddy's order is
useless anyway and the original page's order is re-assigned soon, but it
is better to correct the code.

Additionally, I rename (set/clear)_page_guard_flag() to
(set/clear)_page_guard() and make these functions do all the work needed
for a guard page. This may make the code more understandable.

One more thing I do in this patch is fix the freepage accounting: if we
clear a guard page and link it onto the isolate buddy list, we should
not increase the freepage count.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d4cf7a..aeb51d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -441,18 +441,28 @@ static int __init debug_guardpage_minorder_setup(char *buf)
}
__setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup);

-static inline void set_page_guard_flag(struct page *page)
+static inline void set_page_guard(struct zone *zone, struct page *page,
+ unsigned int order, int migratetype)
{
__set_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+ set_page_private(page, order);
+ /* Guard pages are not available for any usage */
+ __mod_zone_freepage_state(zone, -(1 << order), migratetype);
}

-static inline void clear_page_guard_flag(struct page *page)
+static inline void clear_page_guard(struct zone *zone, struct page *page,
+ unsigned int order, int migratetype)
{
__clear_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+ set_page_private(page, 0);
+ if (!is_migrate_isolate(migratetype))
+ __mod_zone_freepage_state(zone, (1 << order), migratetype);
}
#else
-static inline void set_page_guard_flag(struct page *page) { }
-static inline void clear_page_guard_flag(struct page *page) { }
+static inline void set_page_guard(struct zone *zone, struct page *page,
+ unsigned int order, int migratetype) {}
+static inline void clear_page_guard(struct zone *zone, struct page *page,
+ unsigned int order, int migratetype) {}
#endif

static inline void set_page_order(struct page *page, unsigned int order)
@@ -594,10 +604,7 @@ static inline void __free_one_page(struct page *page,
* merge with it and move up one order.
*/
if (page_is_guard(buddy)) {
- clear_page_guard_flag(buddy);
- set_page_private(page, 0);
- __mod_zone_freepage_state(zone, 1 << order,
- migratetype);
+ clear_page_guard(zone, buddy, order, migratetype);
} else {
list_del(&buddy->lru);
zone->free_area[order].nr_free--;
@@ -919,11 +926,7 @@ static inline void expand(struct zone *zone, struct page *page,
* pages will stay not present in virtual address space
*/
INIT_LIST_HEAD(&page[size].lru);
- set_page_guard_flag(&page[size]);
- set_page_private(&page[size], high);
- /* Guard pages are not available for any usage */
- __mod_zone_freepage_state(zone, -(1 << high),
- migratetype);
+ set_page_guard(zone, &page[size], high, migratetype);
continue;
}
#endif
--
1.7.9.5

2014-07-04 08:09:21

by Joonsoo Kim

Subject: [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if its pageblock is isolated

If the pageblock of a page on the pcp list is isolated, we should free
the page onto the isolate buddy list to prevent future allocation from
it. But the current code doesn't do this.

Moreover, there is a freepage counting problem in the current code.
Although the pageblock of a page on the pcp list may be isolated, the
page could go onto a normal buddy list, because get_onpcp_migratetype()
will return a non-isolate migratetype. In this case, we should either
add to the freepage count or change the migratetype to MIGRATE_ISOLATE,
but the current code does neither.

This patch fixes these two problems by handling the pageblock migratetype
before calling __free_one_page(). And, if we find a page on an isolated
pageblock, change its migratetype to MIGRATE_ISOLATE to prevent future
allocation of the page and the freepage counting problem.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aeb51d1..99c05f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -719,15 +719,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
page = list_entry(list->prev, struct page, lru);
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
- mt = get_freepage_migratetype(page);
+
+ if (unlikely(is_migrate_isolate_page(page))) {
+ mt = MIGRATE_ISOLATE;
+ } else {
+ mt = get_freepage_migratetype(page);
+ __mod_zone_freepage_state(zone, 1, mt);
+ }
+
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
- if (likely(!is_migrate_isolate_page(page))) {
- __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
- if (is_migrate_cma(mt))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
- }
} while (--to_free && --batch_free && !list_empty(list));
}
spin_unlock(&zone->lock);
--
1.7.9.5

2014-07-04 12:03:20

by Vlastimil Babka

Subject: Re: [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page()

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> Isolation is a really rare case, so !is_migrate_isolate() is the
> likely case. Remove the unlikely macro.

Good catch. Why not replace it with likely then? Any difference in the assembly?


2014-07-04 12:53:17

by Michal Nazarewicz

Subject: Re: [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page()

On Fri, Jul 04 2014, Joonsoo Kim <[email protected]> wrote:
> Isolation is a really rare case, so !is_migrate_isolate() is the
> likely case. Remove the unlikely macro.
>
> Signed-off-by: Joonsoo Kim <[email protected]>

Acked-by: Michal Nazarewicz <[email protected]>


--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michał “mina86” Nazarewicz (o o)
ooo +--<[email protected]>--<xmpp:[email protected]>--ooO--(_)--Ooo--

2014-07-04 15:33:37

by Vlastimil Babka

Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> Hello,

Hi Joonsoo,

please CC me on further updates, this is relevant to me.

> This patchset aims at fixing problems due to memory isolation that were
> found by testing my patchset [1].
>
> These are really subtle problems, so I may be wrong. If you find
> something I am missing, please let me know.
>
> Before describing the bugs themselves, I first explain the definition of
> a freepage.
>
> 1. Pages on a buddy list are counted as freepages.
> 2. Pages on an isolate-migratetype buddy list are *not* counted as freepages.

I think the second point is causing us a lot of trouble. And I wonder if it's really
justified! We already have some is_migrate_isolate() checks in the fast path and now you
would add more, mostly just to keep this accounting correct.

So the question is, does it have to be correct? And (admittedly not after a completely
exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)

Well, I of course don't mean that the freepage accounting could go completely random, but
what if we allowed it to drift a bit, limiting both the max error and the timeframe
where errors are possible? After all, watermark checking is already racy, so I don't think
it would hurt that much.

Now if we look at what both CMA and memory hot-remove do, it is:

1. Mark a MAX_ORDER-aligned bunch of pageblocks as MIGRATE_ISOLATE through
start_isolate_page_range(). As part of this, all free pages in that area are
moved onto the isolate freelist through move_freepages_block().

2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
caches.

3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
undo pageblock isolation if not.

So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
allocator, which would therefore account free pages on the isolate freelist as normal free
pages.
The accounting of isolated pages would be instead done only on the top level of CMA/hot
remove in the three steps outlined above, which would be modified as follows:

1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
as usual. Calculate X as the number of pages that move_freepages_block() has moved.
Subtract X from freepages (this is the same as is done now), but also *remember the
value of X*.

2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
the isolate freelist, or not. We don't care; they will be accounted as freepages either way.
This is where some inaccuracy in the accounted freepages would build up.

3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
that we have an isolated range of N pages that nobody can steal now, as everything is on
the isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
X, then N-X). So the accounting matches reality.

If we have to undo, we undo the isolation and as part of this, we use
move_freepages_block() to move pages from the isolate freelist to the normal ones. But we
don't care how many pages were moved. We simply add the remembered value of X to the
number of freepages, undoing the change from step 1. Again, the accounting matches reality.


The final point is that if we do this per MAX_ORDER block, the error in accounting cannot
ever be larger than 4MB and will be visible only during the time a single MAX_ORDER block is
handled.

As a possible improvement, we can assume during phase 2 that every page freed by migration
will end up correctly on the isolate free list. So we create M free pages by migration, and
subtract M from the freepage account. Then in phase 3 we either subtract (N - X - M), or add X
+ M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
pages on pcplists and the possible races it will be less). I think with this improvement,
any error would be negligible.
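
Roughly, in code (the names nr_target (N), nr_moved (X), nr_migrated (M)
and isolation_succeeded are made up for illustration, and the CMA
counter is omitted):

	/* step 1: X free pages were found and moved to the isolate list */
	nr_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
	__mod_zone_page_state(zone, NR_FREE_PAGES, -nr_moved);

	/* step 2: migrate non-free pages away and drain pcplists; with
	 * the improvement above, pre-subtract the M pages freed by
	 * migration, assuming they land on the isolate list */
	__mod_zone_page_state(zone, NR_FREE_PAGES, -nr_migrated);

	if (isolation_succeeded) {
		/* step 3: subtract whatever is left of the N pages */
		__mod_zone_page_state(zone, NR_FREE_PAGES,
				      -(nr_target - nr_moved - nr_migrated));
	} else {
		/* undo: move pages back, ignore the return value, and
		 * simply re-add exactly what was subtracted before */
		move_freepages_block(zone, page, migratetype);
		__mod_zone_page_state(zone, NR_FREE_PAGES,
				      nr_moved + nr_migrated);
	}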

Thoughts?


2014-07-07 04:44:09

by Joonsoo Kim

Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

CCing Lisa, because there was a bug report last Saturday that may be
related to this topic.

http://www.spinics.net/lists/linux-mm/msg75741.html

On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> > Hello,
>
> Hi Joonsoo,
>
> please CC me on further updates, this is relevant to me.

Hello, Vlastimil.

Sorry for missing you. :)

>
> > This patchset aims at fixing problems due to memory isolation that were
> > found by testing my patchset [1].
> >
> > These are really subtle problems, so I may be wrong. If you find
> > something I am missing, please let me know.
> >
> > Before describing the bugs themselves, I first explain the definition of
> > a freepage.
> >
> > 1. Pages on a buddy list are counted as freepages.
> > 2. Pages on an isolate-migratetype buddy list are *not* counted as freepages.
>
> I think the second point is causing us a lot of trouble. And I wonder if it's really
> justified! We already have some is_migrate_isolate() checks in the fast path and now you
> would add more, mostly just to keep this accounting correct.

It's not just for keeping the accounting correct. There is another
purpose for is_migrate_isolate(): it forces freed pages to go onto the
isolate buddy list. Without it, freed pages would go onto the other
buddy lists and be used soon. So memory isolation can't work well
without the is_migrate_isolate() checks, and the success rate could
decrease.

And I just added three more is_migrate_isolate() checks in the fast
path, but two of the checks are in the same *unlikely* branch and I can
remove the other one easily. Therefore it's not much of a problem, I
guess. (It even becomes a no-op if MEMORY_ISOLATION is disabled.)
Moreover, I removed one unconditional get_pageblock_migratetype() call
in free_pcppages_bulk(), so, from a performance point of view, the free
path would even be improved.

>
> So the question is, does it have to be correct? And (admittedly not after a completely
> exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
>
> Well, I of course don't mean that the freepage accounting could go completely random, but
> what if we allowed it to drift a bit, limiting both the max error and the timeframe
> where errors are possible? After all, watermark checking is already racy, so I don't think
> it would hurt that much.

I understand your suggestion. I once thought the same way, but gave up
on that idea. Watermark checking is already racy, but it's the *only*
protection we have against over-allocating memory. After passing that
check, there is no other means to prevent us from allocating memory. So
it should be as accurate as possible. If we allowed someone to get
memory without considering memory isolation, free memory could reach a
really low state and the system could break occasionally.

>
> Now if we look at what both CMA and memory hot-remove do, it is:
>
> 1. Mark a MAX_ORDER-aligned bunch of pageblocks as MIGRATE_ISOLATE through
> start_isolate_page_range(). As part of this, all free pages in that area are
> moved onto the isolate freelist through move_freepages_block().
>
> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
> caches.
>
> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
> undo pageblock isolation if not.
>
> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
> allocator, which would therefore account free pages on the isolate freelist as normal free
> pages.
> The accounting of isolated pages would be instead done only on the top level of CMA/hot
> remove in the three steps outlined above, which would be modified as follows:
>
> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
> as usual. Calculate X as the number of pages that move_freepages_block() has moved.
> Subtract X from freepages (this is the same as is done now), but also *remember the
> value of X*.
>
> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
> the isolate freelist, or not. We don't care; they will be accounted as freepages either way.
> This is where some inaccuracy in the accounted freepages would build up.
>
> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
> that we have an isolated range of N pages that nobody can steal now, as everything is on
> the isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
> X, then N-X). So the accounting matches reality.
>
> If we have to undo, we undo the isolation and as part of this, we use
> move_freepages_block() to move pages from the isolate freelist to the normal ones. But we
> don't care how many pages were moved. We simply add the remembered value of X to the
> number of freepages, undoing the change from step 1. Again, the accounting matches reality.
>
>
> The final point is that if we do this per MAX_ORDER block, the error in accounting cannot
> ever be larger than 4MB and will be visible only during the time a single MAX_ORDER block is
> handled.

The 4MB error in accounting looks serious to me. min_free_kbytes is 4MB
on a 1GB system (the default is int_sqrt(16 * lowmem_kbytes), which
works out to 4096 kB for 1GB), so this 4MB error could break everything
on such systems. Moreover, some architectures have a larger
pageblock_order than MAX_ORDER; in that case, the error would be larger
than 4MB.

In addition, I have a plan to extend CMA to work in parallel. It means
that there could be several parallel memory isolation users at the same
time, rather than just one, so we cannot easily bound the error to any
given degree.

>
> As a possible improvement, we can assume during phase 2 that every page freed by migration
> will end up correctly on the isolate free list. So we create M free pages by migration, and
> subtract M from the freepage account. Then in phase 3 we either subtract (N - X - M), or add X
> + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
> pages on pcplists and the possible races it will be less). I think with this improvement,
> any error would be negligible.
>
> Thoughts?

Thanks for suggestion. :)
It is really good topic to think deeply.

For now, I'd like to fix these problems without the side-effects your
suggestion would bring. Your suggestion changes the meaning of freepage
so that isolated pages are included in nr_freepage, and there could be a
possible regression in the success rate of memory hotplug and CMA.
Possibly it is the way we have to go, but, IMHO, it isn't the time to go
there. Before going that way, we should fix the current implementation
first, so that the fixes can be backported to old kernels if someone
needs them. Interestingly, last Saturday, Lisa Du reported CMA
accounting bugs:

http://www.spinics.net/lists/linux-mm/msg75741.html

I haven't looked at it in detail, but maybe it is related to these
problems and we should fix it without side-effects.

So, in conclusion, I think that your suggestion is beyond the scope of
this patchset, for the following two reasons:

1. I'd like to fix these problems without side-effects (a possible
regression in the success rate of memory hotplug and CMA, and a change
in the meaning of nr_freepage), to preserve the backport possibility.
2. nr_freepage without considering memory isolation is somewhat dangerous
and not suitable for some systems.

If you have any objections, please let me know. But I am going on
vacation for a week, so I can't answer further comments until I return.
I will reply to them next week. :)

Thanks.


2014-07-07 04:53:30

by Joonsoo Kim

Subject: Re: [PATCH 01/10] mm/page_alloc: remove unlikely macro on free_one_page()

On Fri, Jul 04, 2014 at 02:03:10PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> > Isolation is a really rare case, so !is_migrate_isolate() is the
> > likely case. Remove the unlikely macro.
>
> Good catch. Why not replace it with likely then? Any difference in the assembly?
>

I didn't investigate it. I will check it in the next spin.
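
For reference, the branch in question would then read something like
this (just a sketch of the likely() variant, matching the
free_one_page() hunk quoted later in this thread):

	/* isolation is rare, so hint the common case instead */
	if (likely(!is_migrate_isolate(migratetype)))
		__mod_zone_freepage_state(zone, 1 << order, migratetype);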

Thanks.

2014-07-07 08:25:39

by Gioh Kim

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if its pageblock is isolated



On 2014-07-04 4:57 PM, Joonsoo Kim wrote:
> If the pageblock of a page on the pcp is isolated now, we should free
> the page to the isolate buddy list to prevent future allocation from
> it. But the current code doesn't do this.

I think it is strange that the pcp can have an isolated page.
I remember that the purpose of the pcp is to hold frequently used pages (hot for cache),
but an isolated page is not frequently allocated and freed.

(The above is my guess, and it can be wrong. I know the CMA and hot-plug features use isolated pages,
so I guess isolated pages are not meant for frequently allocated and freed memory.)

Anyway, if an isolated page is not good for the pcp, what about preventing isolated pages from being inserted into the pcp?
I think fixing free_hot_cold_page() could be one way to prevent it.
free_hot_cold_page() checks the migratetype and inserts CMA pages into the pcp.
Am I correct?

If I am correct, I think inserting CMA pages into the pcp is harmful for both CMA and the pcp.
If a CMA page is on the pcp, it will be used frequently and prevent making contiguous memory.
What about keeping CMA pages off the pcp, like the following?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8c9eeec..1cbb816 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1413,13 +1413,8 @@ void free_hot_cold_page(struct page *page, bool cold)
* areas back if necessary. Otherwise, we may have to free
* excessively into the page allocator
*/
- if (migratetype >= MIGRATE_PCPTYPES) {
- if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(zone, page, pfn, 0, migratetype);
- goto out;
- }
- migratetype = MIGRATE_MOVABLE;
- }
+ if (migratetype > MIGRATE_PCPTYPES)
+ goto out;

pcp = &this_cpu_ptr(zone->pageset)->pcp;
if (!cold)


>
> Moreover, there is a freepage counting problem in the current code.
> Although the pageblock of a page on the pcp is isolated now, the page
> could go to a normal buddy list, because get_onpcp_migratetype() will
> return a non-isolate migratetype.

I cannot find get_onpcp_migratetype() in v3.16.0-rc3 and also cannot google it.
Is it a typo?

> In this case, we should either add the freepage count or change the
> migratetype to MIGRATE_ISOLATE, but the current code does neither.
>
> This patch fixes these two problems by handling the pageblock
> migratetype before calling __free_one_page(). And, if we find the page
> on an isolated pageblock, we change the migratetype to MIGRATE_ISOLATE
> to prevent future allocation of this page and the freepage counting
> problem.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/page_alloc.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index aeb51d1..99c05f7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -719,15 +719,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> page = list_entry(list->prev, struct page, lru);
> /* must delete as __free_one_page list manipulates */
> list_del(&page->lru);
> - mt = get_freepage_migratetype(page);
> +
> + if (unlikely(is_migrate_isolate_page(page))) {
> + mt = MIGRATE_ISOLATE;
> + } else {
> + mt = get_freepage_migratetype(page);
> + __mod_zone_freepage_state(zone, 1, mt);
> + }
> +
> /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> trace_mm_page_pcpu_drain(page, 0, mt);
> - if (likely(!is_migrate_isolate_page(page))) {
> - __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
> - if (is_migrate_cma(mt))
> - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
> - }
> } while (--to_free && --batch_free && !list_empty(list));
> }
> spin_unlock(&zone->lock);
>

2014-07-07 14:33:20

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
> Ccing Lisa, because there was a bug report last Saturday that may be
> related to this topic.
>
> http://www.spinics.net/lists/linux-mm/msg75741.html
>
> On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>> Hello,
>>
>> Hi Joonsoo,
>>
>> please CC me on further updates, this is relevant to me.
>
> Hello, Vlastimil.
>
> Sorry for missing you. :)
>
>>
>>> This patchset aims at fixing problems due to memory isolation found by
>>> testing my patchset [1].
>>>
>>> These are really subtle problems so I can be wrong. If you find what I am
>>> missing, please let me know.
>>>
>>> Before describing bugs itself, I first explain definition of freepage.
>>>
>>> 1. pages on buddy list are counted as freepage.
>>> 2. pages on isolate migratetype buddy list are *not* counted as freepage.
>>
>> I think the second point is causing us a lot of trouble. And I wonder if it's really
>> justified! We already have some is_migrate_isolate() checks in the fast path and now you
>> would add more, mostly just to keep this accounting correct.
>
> It's not just for keeping the accounting correct. There is another
> purpose for is_migrate_isolate(). It forces freed pages to go onto the
> isolate buddy list. Without it, freed pages would go onto other
> buddy lists and be used soon. So memory isolation can't work well
> without the is_migrate_isolate() checks, and the success rate could decrease.

Well I think that could be solved also by doing an lru/pcplists drain
right after marking the pageblock as MIGRATE_ISOLATE. After that moment,
anything newly put on the pcplists should observe the new migratetype and go
to the correct pcplists. As putting stuff on pcplists is done with
disabled interrupts, and draining is done by IPI, I think it should work
correctly if we put the migratetype determination under the disabled-irq
part in free_hot_cold_page().
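
Roughly something like this (an untested sketch against the current
free_hot_cold_page(), just to illustrate the reordering; the rest of
the function would stay as it is today):

	void free_hot_cold_page(struct page *page, bool cold)
	{
		struct zone *zone = page_zone(page);
		unsigned long pfn = page_to_pfn(page);
		unsigned long flags;
		int migratetype;

		if (!free_pages_prepare(page, 0))
			return;

		local_irq_save(flags);
		/*
		 * Read the pageblock type only after irqs are off, so a
		 * drain-by-IPI sent right after the pageblock is marked
		 * MIGRATE_ISOLATE cannot race with a stale value here.
		 */
		migratetype = get_pfnblock_migratetype(page, pfn);
		set_freepage_migratetype(page, migratetype);
		__count_vm_event(PGFREE);

		/* ... rest unchanged (pcplist insertion etc.) ... */

		local_irq_restore(flags);
	}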

> And, I just added three more is_migrate_isolate() checks in the fast
> path, but two checks are in the same *unlikely* branch and I can remove
> another one easily. Therefore it's not quite a problem, I guess. (It is
> even a no-op if MEMORY_ISOLATION is disabled.)
> Moreover, I removed one unconditional get_pageblock_migratetype() in
> free_pcppages_bulk(), so, from a performance point of view, the free
> path would be improved.

I haven't checked the individual patches yet, but I'll try.

>>
>> So the question is, does it have to be correct? And (admittedly not after a completely
>> exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
>>
>> Well I of course don't mean that the freepage accounts could go random completely, but
>> what if we allowed them to drift a bit, limiting both the max error and the timeframe
>> where errors are possible? After all, watermark checking is already racy, so I don't think
>> it would be hurt that much.
>
> I understand your suggestion. I once thought like you, but gave up on
> that idea. Watermark checking is already racy, but it's the *only*
> protection to prevent memory allocation. After passing that check,
> there is no means to prevent us from allocating memory. So it should
> be as accurate as possible. If we allow someone to get the
> memory without considering memory isolation, free memory could be in
> a really low state and the system could break occasionally.

It would definitely require a more thorough review, but I think with the
pcplists draining changes outlined above, there shouldn't be any more
inaccuracy happening.

>>
>> Now if we look at what both CMA and memory hot-remove does is:
>>
>> 1. Mark a MAX_ORDER-aligned bunch of pageblocks as MIGRATE_ISOLATE through
>> start_isolate_page_range(). As part of this, all free pages in that area are
>> moved on the isolate freelist through move_freepages_block().
>>
>> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
>> caches.
>>
>> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
>> undo pageblock isolation if not.
>>
>> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
>> allocator, which would therefore account free pages on the isolate freelist as normal free
>> pages.
>> The accounting of isolated pages would be instead done only on the top level of CMA/hot
>> remove in the three steps outlined above, which would be modified as follows:
>>
>> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
>> as usual. Calculate X as the number of pages that move_freepages_block() has moved.
>> Subtract X from freepages (this is the same as it is done now), but also *remember the
>> value of X*
>>
>> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
>> isolate freelist, or not. We don't care, they will be accounted as freepages either way.
>> This is where some inaccuracy in accounted freepages would build up.
>>
>> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
>> that we have an isolated range of N pages that nobody can steal now as everything is on
>> isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
>> X, then N-X). So the accounting matches reality.
>>
>> If we have to undo, we undo the isolation and as part of this, we use
>> move_freepages_block() to move pages from isolate freelist to the normal ones. But we
>> don't care how many pages were moved. We simply add the remembered value of X to the
>> number of freepages, undoing the change from step 1. Again, the accounting matches reality.
>>
>>
>> The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
>> be ever larger than 4MB and will be visible only during time a single MAX_ORDER block is
>> handled.
>
> The 4MB error in accounting looks serious to me. min_free_kbytes is
> 4MB in a 1GB system. So this 4MB error would make all things broken in
> such systems. Moreover, there are some arches having a larger
> pageblock_order than MAX_ORDER. In this case, the error will be larger
> than 4MB.
>
> In addition, I have a plan to extend CMA to work in parallel. It means
> that there could be parallel memory isolation users rather than just
> one user at the same time, so we cannot easily keep the error bounded
> under some threshold.

OK. I had thought that the error would be much smaller in practice, but
probably due to how buddy merging works, it would be enough if just the
very last freed page in the MAX_ORDER block was misplaced, and that
would trigger a cascading merge that will end with single page at
MAX_ORDER size becoming misplaced. So let's probably forget this approach.

>> As a possible improvement, we can assume during phase 2 that every page freed by migration
>> will end up correctly on isolate free list. So we create M free pages by migration, and
>> subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
>> + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
>> pages on pcplists and the possible races it will be less). I think with this improvement,
>> any error would be negligible.
>>
>> Thoughts?
>
> Thanks for suggestion. :)
> It is really good topic to think deeply.
>
> For now, I'd like to fix these problems without the side effects that
> your suggestion would bring. Your suggestion changes the meaning of
> freepage so that isolated pages are included in nr_freepage, and there
> could be a possible regression in the success rate of memory hotplug
> and CMA. Possibly, it is the way we have to go, but, IMHO, it isn't the
> time to go. Before going that way, we should fix the current
> implementation first so that fixes can be backported to old kernels if
> someone needs them.

Adding the extra pcplist drains and reordering
get_pageblock_migratetype() under disabled IRQs should be small enough
to be backportable and not bring any regressions to success rates. There
will be some cost for the extra drains, but it is paid only when memory
isolation actually happens. Extending the scope of disabled irqs would
affect the fast path somewhat, but I guess less than the extra checks?

> Interestingly,
> last Saturday Lisa Du reported CMA accounting bugs.
>
> http://www.spinics.net/lists/linux-mm/msg75741.html
>
> I haven't looked at it in detail, but maybe it is related to these
> problems, and we should fix it without side effects.
>
> So, in conclusion, I think that your suggestion is beyond the scope of
> this patchset for the following two reasons.
>
> 1. I'd like to fix these problems without side effects (a possible
> regression in the success rate of memory hotplug and CMA, and a change
> in the meaning of nr_freepage) to keep the backport possibility.
> 2. nr_freepage without considering memory isolation is somewhat dangerous
> and not suitable for some systems.
>
> If you have any objection, please let me know. But I will go on
> vacation for a week, so I can't answer your further comments
> during that time. I will reply to them next week. :)
>
> Thanks.
>
>>> [...]

2014-07-07 14:51:08

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> In __free_one_page(), we check whether the buddy page is a guard page
> and, if so, we should clear the guard attribute on the buddy page. But,
> currently, we clear the original page's order rather than the buddy's.
> This doesn't cause any problem, because resetting the buddy's order
> is useless

Well it might theoretically confuse someone examining a crash dump, to
see an unexpected page->private value. So I agree it should be corrected.

> and the original page's order is re-assigned soon anyway.
> But it is better to correct the code.
>
> Additionally, I change (set/clear)_page_guard_flag() to
> (set/clear)_page_guard() and make these functions do all the work
> needed for guard pages. This may make the code more understandable.
>
> One more thing I did in this patch is fixing the freepage accounting.
> If we clear a guard page and link it onto the isolate buddy list, we
> should not increase the freepage count.

OK, since this is under CONFIG_DEBUG_PAGEALLOC, and I don't see another
solution in this case, I agree.

> Signed-off-by: Joonsoo Kim <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/page_alloc.c | 29 ++++++++++++++++-------------
> 1 file changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0d4cf7a..aeb51d1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -441,18 +441,28 @@ static int __init debug_guardpage_minorder_setup(char *buf)
> }
> __setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup);
>
> -static inline void set_page_guard_flag(struct page *page)
> +static inline void set_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype)
> {
> __set_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
> + set_page_private(page, order);
> + /* Guard pages are not available for any usage */
> + __mod_zone_freepage_state(zone, -(1 << order), migratetype);
> }
>
> -static inline void clear_page_guard_flag(struct page *page)
> +static inline void clear_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype)
> {
> __clear_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
> + set_page_private(page, 0);
> + if (!is_migrate_isolate(migratetype))
> + __mod_zone_freepage_state(zone, (1 << order), migratetype);
> }
> #else
> -static inline void set_page_guard_flag(struct page *page) { }
> -static inline void clear_page_guard_flag(struct page *page) { }
> +static inline void set_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype) {}
> +static inline void clear_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype) {}
> #endif
>
> static inline void set_page_order(struct page *page, unsigned int order)
> @@ -594,10 +604,7 @@ static inline void __free_one_page(struct page *page,
> * merge with it and move up one order.
> */
> if (page_is_guard(buddy)) {
> - clear_page_guard_flag(buddy);
> - set_page_private(page, 0);
> - __mod_zone_freepage_state(zone, 1 << order,
> - migratetype);
> + clear_page_guard(zone, buddy, order, migratetype);
> } else {
> list_del(&buddy->lru);
> zone->free_area[order].nr_free--;
> @@ -919,11 +926,7 @@ static inline void expand(struct zone *zone, struct page *page,
> * pages will stay not present in virtual address space
> */
> INIT_LIST_HEAD(&page[size].lru);
> - set_page_guard_flag(&page[size]);
> - set_page_private(&page[size], high);
> - /* Guard pages are not available for any usage */
> - __mod_zone_freepage_state(zone, -(1 << high),
> - migratetype);
> + set_page_guard(zone, &page[size], high, migratetype);
> continue;
> }
> #endif
>

2014-07-07 15:19:56

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if its pageblock is isolated

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> If the pageblock of a page on the pcp is isolated now, we should free
> the page to the isolate buddy list to prevent future allocation from
> it. But the current code doesn't do this.
>
> Moreover, there is a freepage counting problem in the current code.
> Although the pageblock of a page on the pcp is isolated now, the page
> could go to a normal buddy list, because get_onpcp_migratetype() will
> return a non-isolate migratetype.

get_onpcp_migratetype() is only introduced in a later patch.

> In this case, we should either add the freepage count or change the
> migratetype to MIGRATE_ISOLATE, but the current code does neither.

I wouldn't say it "does neither". It already limits the freepage counting
to the !MIGRATE_ISOLATE case (and it's not converted to
__mod_zone_freepage_state for some reason). So there's an accounting
mismatch in addition to the buddy list misplacement.
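
(For reference, the converted check would be just this -- a sketch:)

	if (likely(!is_migrate_isolate_page(page)))
		/* bumps NR_FREE_PAGES and, for CMA pages, also
		 * NR_FREE_CMA_PAGES, like the open-coded version */
		__mod_zone_freepage_state(zone, 1, mt);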

> This patch fixes these two problems by handling the pageblock migratetype
> before calling __free_one_page(). And, if we find the page on an isolated
> pageblock, we change the migratetype to MIGRATE_ISOLATE to prevent future
> allocation of this page and the freepage counting problem.

So although this is not an addition of a new pageblock migratetype check
to the fast path (the check is already there), I would prefer removing
the check :) With the approach of pcplists draining outlined in my reply
to 00/10, we would allow a misplacement to happen (and the page to be
accounted as a freepage), immediately followed by move_freepages_block()
which would place the page onto the isolate freelist with the rest.
Anything newly freed will get the isolate migratetype determined in
free_hot_cold_page() or __free_pages_ok() (which would need the
migratetype check moved under the disabled-irq part) and be placed and
buddy-merged properly.

> [...]

2014-07-07 15:43:39

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm/page_alloc: carefully free the page on isolate pageblock

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> We get the migratetype without holding the lock, so it could be
> racy. If some pages go onto the isolate migratetype buddy list
> due to this race, we can't allocate those pages anymore until the next
> isolation attempt on this pageblock. Below is a possible
> scenario of this race.
>
> pageblock 1 is isolate migratetype.
>
> CPU1                                            CPU2
> - get_pfnblock_migratetype(pageblock 1),
>   so MIGRATE_ISOLATE is returned
> - call free_one_page() with MIGRATE_ISOLATE
>                                                 - grab the zone lock
>                                                 - unisolate pageblock 1
>                                                 - release the zone lock
> - grab the zone lock
> - call __free_one_page() with MIGRATE_ISOLATE
> - the free page goes onto the isolate buddy list
>   and we can't use it anymore
>
> To prevent this possibility, re-check the migratetype while holding the lock.

This could also be solved similarly to the other races: during
unisolation, CPU2 would send a drain_all_pages() IPI and only then use
move_freepages_block(). Again, get_pfnblock_migratetype() on CPU1 would
need to be moved under disabled irqs.
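
Something like this in unset_migratetype_isolate() (an untested sketch
of the ordering I mean, ignoring the accounting questions discussed
elsewhere in the thread):

	spin_lock_irqsave(&zone->lock, flags);
	set_pageblock_migratetype(page, migratetype);
	spin_unlock_irqrestore(&zone->lock, flags);

	/*
	 * Flush pages that were freed with a stale MIGRATE_ISOLATE
	 * pageblock type off the pcplists (by IPI) before touching the
	 * freelists; the zone lock must not be held across the IPI.
	 */
	drain_all_pages();

	spin_lock_irqsave(&zone->lock, flags);
	nr_pages = move_freepages_block(zone, page, migratetype);
	__mod_zone_freepage_state(zone, nr_pages, migratetype);
	spin_unlock_irqrestore(&zone->lock, flags);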

> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/page_alloc.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 99c05f7..d8feedc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -743,6 +743,17 @@ static void free_one_page(struct zone *zone,
> spin_lock(&zone->lock);
> zone->pages_scanned = 0;
>
> + if (unlikely(is_migrate_isolate(migratetype))) {
> + /*
> + * We got migratetype without holding the lock so it could be
> + * racy. If some pages go on the isolate migratetype buddy list
> + * by this race, we can't allocate this page anymore until next
> + * isolation attempt on this pageblock. To prevent this
> + * possibility, re-check migratetype with holding the lock.
> + */
> + migratetype = get_pfnblock_migratetype(page, pfn);
> + }
> +
> __free_one_page(page, pfn, zone, order, migratetype);
> if (!is_migrate_isolate(migratetype))
> __mod_zone_freepage_state(zone, 1 << order, migratetype);
>

2014-07-07 15:50:15

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 05/10] mm/page_alloc: optimize and unify pageblock migratetype check in free path

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> Currently, when we free the page from pcp list to buddy, we check
> pageblock of the page in order to isolate the page on isolated
> pageblock. Although this could rarely happen and to check migratetype of
> pageblock is somewhat expensive, we check it on free fast path. I think
> that this is undesirable. To prevent this situation, I introduce new
> variable, nr_isolate_pageblock on struct zone and use it to determine
> if we should check pageblock migratetype. Isolation on pageblock rarely
> happens so we can mostly avoid this pageblock migratetype check.

Better, but still there's a zone flag check and maintenance. So if it
could be avoided, it would be better.

> Additionally, unify the freepage counting code, because it can be done in
> the common part, __free_one_page(). This unification provides an extra
> guarantee that pages on an isolate pageblock don't go onto a non-isolate
> buddy list. This is similar to the situation described in the previous
> patch, so refer to it if you need more explanation.

You should make it clearer that you are solving misplacement of the type
"page should be placed on isolated freelist but it's not" through
free_one_page(), which was solved only for free_pcppages_bulk() in patch
03/10. Mentioning patch 04/10 here, which solves the opposite problem
"page shouldn't be placed on isolated freelist, but it is", only
confuses the situation. Also this patch undoes everything of 04/10 and
moves it elsewhere, so that would make it harder to git blame etc. I
would reorder 04 and 05.

> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> include/linux/mmzone.h | 5 +++++
> include/linux/page-isolation.h | 8 ++++++++
> mm/page_alloc.c | 40 +++++++++++++++++++---------------------
> mm/page_isolation.c | 2 ++
> 4 files changed, 34 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fd48890..e9c194f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -374,6 +374,11 @@ struct zone {
> /* pfn where async and sync compaction migration scanner should start */
> unsigned long compact_cached_migrate_pfn[2];
> #endif
> +
> +#ifdef CONFIG_MEMORY_ISOLATION
> + unsigned long nr_isolate_pageblock;
> +#endif
> +
> #ifdef CONFIG_MEMORY_HOTPLUG
> /* see spanned/present_pages for more description */
> seqlock_t span_seqlock;
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 3fff8e7..2dc1e16 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -2,6 +2,10 @@
> #define __LINUX_PAGEISOLATION_H
>
> #ifdef CONFIG_MEMORY_ISOLATION
> +static inline bool has_isolate_pageblock(struct zone *zone)
> +{
> + return zone->nr_isolate_pageblock;
> +}
> static inline bool is_migrate_isolate_page(struct page *page)
> {
> return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
> @@ -11,6 +15,10 @@ static inline bool is_migrate_isolate(int migratetype)
> return migratetype == MIGRATE_ISOLATE;
> }
> #else
> +static inline bool has_isolate_pageblock(struct zone *zone)
> +{
> + return false;
> +}
> static inline bool is_migrate_isolate_page(struct page *page)
> {
> return false;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d8feedc..dcc2f08 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -594,6 +594,24 @@ static inline void __free_one_page(struct page *page,
> VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
> VM_BUG_ON_PAGE(bad_range(zone, page), page);
>
> + /*
> + * pcp pages could go normal buddy list due to stale pageblock
> + * migratetype so re-check it if there is isolate pageblock.
> + *
> + * And, we got migratetype without holding the lock so it could be
> + * racy. If some pages go on the isolate migratetype buddy list
> + * by this race, we can't allocate this page anymore until next
> + * isolation attempt on this pageblock. To prevent this
> + * possibility, re-check migratetype with holding the lock.
> + */
> + if (unlikely(has_isolate_pageblock(zone) ||
> + is_migrate_isolate(migratetype))) {
> + migratetype = get_pfnblock_migratetype(page, pfn);
> + }
> +
> + if (!is_migrate_isolate(migratetype))
> + __mod_zone_freepage_state(zone, 1 << order, migratetype);
> +
> while (order < MAX_ORDER-1) {
> buddy_idx = __find_buddy_index(page_idx, order);
> buddy = page + (buddy_idx - page_idx);
> @@ -719,13 +737,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> page = list_entry(list->prev, struct page, lru);
> /* must delete as __free_one_page list manipulates */
> list_del(&page->lru);
> -
> - if (unlikely(is_migrate_isolate_page(page))) {
> - mt = MIGRATE_ISOLATE;
> - } else {
> - mt = get_freepage_migratetype(page);
> - __mod_zone_freepage_state(zone, 1, mt);
> - }
> + mt = get_freepage_migratetype(page);
>
> /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> @@ -742,21 +754,7 @@ static void free_one_page(struct zone *zone,
> {
> spin_lock(&zone->lock);
> zone->pages_scanned = 0;
> -
> - if (unlikely(is_migrate_isolate(migratetype))) {
> - /*
> - * We got migratetype without holding the lock so it could be
> - * racy. If some pages go on the isolate migratetype buddy list
> - * by this race, we can't allocate this page anymore until next
> - * isolation attempt on this pageblock. To prevent this
> - * possibility, re-check migratetype with holding the lock.
> - */
> - migratetype = get_pfnblock_migratetype(page, pfn);
> - }
> -
> __free_one_page(page, pfn, zone, order, migratetype);
> - if (!is_migrate_isolate(migratetype))
> - __mod_zone_freepage_state(zone, 1 << order, migratetype);
> spin_unlock(&zone->lock);
> }
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index d1473b2..1fa4a4d 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -60,6 +60,7 @@ out:
> int migratetype = get_pageblock_migratetype(page);
>
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> + zone->nr_isolate_pageblock++;
> nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
>
> __mod_zone_freepage_state(zone, -nr_pages, migratetype);
> @@ -83,6 +84,7 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> nr_pages = move_freepages_block(zone, page, migratetype);
> __mod_zone_freepage_state(zone, nr_pages, migratetype);
> set_pageblock_migratetype(page, migratetype);
> + zone->nr_isolate_pageblock--;
> out:
> spin_unlock_irqrestore(&zone->lock, flags);
> }
>

2014-07-07 15:57:55

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type

On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> When isolating a free page, what we want to know is which list
> the page is linked on. If it is linked on the isolate migratetype buddy list,
> we can skip the watermark check and freepage counting. And if it is linked
> on the CMA migratetype buddy list, we need to fix up the freepage counting. For
> this purpose, get_onbuddy_migratetype() is a better fit and cheaper than
> get_pageblock_migratetype(). So use it.

Hm but you made get_onbuddy_migratetype() work only with
CONFIG_MEMORY_ISOLATION. And __isolate_free_page is (despite the name)
not at all limited to CONFIG_MEMORY_ISOLATION.

> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/page_alloc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e1c4c3e..d9fb8bb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1597,7 +1597,7 @@ static int __isolate_free_page(struct page *page, unsigned int order)
> BUG_ON(!PageBuddy(page));
>
> zone = page_zone(page);
> - mt = get_pageblock_migratetype(page);
> + mt = get_onbuddy_migratetype(page);
>
> if (!is_migrate_isolate(mt)) {
> /* Obey watermarks as if the page was being allocated */
>

2014-07-08 01:01:39

by Gioh Kim

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type



On 2014-07-08 12:57 AM, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>> When isolating a free page, what we want to know is which list
>> the page is linked on. If it is linked on the isolate migratetype buddy list,
>> we can skip the watermark check and freepage counting. And if it is linked
>> on the CMA migratetype buddy list, we need to fix up the freepage counting. For
>> this purpose, get_onbuddy_migratetype() is a better fit and cheaper than
>> get_pageblock_migratetype(). So use it.
>
> Hm but you made get_onbuddy_migratetype() work only with CONFIG_MEMORY_ISOLATION. And __isolate_free_page is (despite the name) not at all limited to CONFIG_MEMORY_ISOLATION.
>

Currently __isolate_free_page is called only by split_free_page, and split_free_page only by isolate_freepages_block.
split_free_page is called only for isolated pages now, but that can change someday.
I think get_onbuddy_migratetype should work in any situation.

And I think the name get_onbuddy_migratetype is confusing.
Because of _onbuddy_, it might look like the pages are buddy pages.
I think the original name, _freepage_, is the proper one.


>> [...]

2014-07-08 07:19:08

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if its pageblock is isolated

On 07/07/2014 10:25 AM, Gioh Kim wrote:
>
>
> On 2014-07-04 4:57 PM, Joonsoo Kim wrote:
>> If the pageblock of a page on the pcp is isolated now, we should free
>> the page to the isolate buddy list to prevent future allocation from
>> it. But the current code doesn't do this.
>
> I think it is strange that the pcp can have an isolated page.
> I remember that the purpose of the pcp is to hold frequently used pages (hot for cache),
> but an isolated page is not frequently allocated and freed.

It can happen that a page was placed on a pcplist *before* the pageblock
was isolated.

> (The above is my guess, and it can be wrong. I know the CMA and hot-plug features use isolated pages,
> so I guess isolated pages are not meant for frequently allocated and freed memory.)
>
> Anyway, if an isolated page is not good for the pcp, what about preventing isolated pages from being inserted into the pcp?
> I think fixing free_hot_cold_page() could be one way to prevent it.
> free_hot_cold_page() checks the migratetype and inserts CMA pages into the pcp.
> Am I correct?
>
> If I am correct, I think inserting CMA pages into the pcp is harmful for both CMA and the pcp.
> If a CMA page is on the pcp, it will be used frequently and prevent making contiguous memory.
> What about keeping CMA pages off the pcp, like the following?

CMA is not the only user of memory isolation. There's also memory
hot-remove. Also, CMA tries to make the CMA-marked pages usable as normal
MOVABLE pages until a CMA allocation is needed. Here you would restrict
their usage by keeping them off the pcplists.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8c9eeec..1cbb816 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1413,13 +1413,8 @@ void free_hot_cold_page(struct page *page, bool cold)
> * areas back if necessary. Otherwise, we may have to free
> * excessively into the page allocator
> */
> - if (migratetype >= MIGRATE_PCPTYPES) {
> - if (unlikely(is_migrate_isolate(migratetype))) {
> - free_one_page(zone, page, pfn, 0, migratetype);
> - goto out;
> - }
> - migratetype = MIGRATE_MOVABLE;
> - }
> + if (migratetype > MIGRATE_PCPTYPES)
> + goto out;

Certainly not, for the reasons above, and also:
- by changing >= to > you stopped handling RESERVE pages specially
- by removing the free_one_page() call you are now leaking the CMA and
ISOLATE pages.

> pcp = &this_cpu_ptr(zone->pageset)->pcp;
> if (!cold)
>
>
>>
>> Moreover, there is a freepage counting problem in the current code. Although
>> the pageblock of a page on the pcp is isolated now, the page could go to a
>> normal buddy list, because get_onpcp_migratetype() will return a
>> non-isolate migratetype.
>
> I cannot find get_onpcp_migratetype() in v3.16.0-rc3 and also cannot google it.
> Is it a typo?
>
>> In this case, we should either add the freepage count or change the
>> migratetype to MIGRATE_ISOLATE, but the current code does neither.
>>
>> This patch fixes these two problems by handling the pageblock migratetype
>> before calling __free_one_page(). And, if we find the page on an isolated
>> pageblock, we change the migratetype to MIGRATE_ISOLATE to prevent future
>> allocation of this page and the freepage counting problem.
>>
>> [...]

2014-07-08 07:23:16

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type

On 07/08/2014 03:01 AM, Gioh Kim wrote:
>
>
> On 2014-07-08 12:57 AM, Vlastimil Babka wrote:
>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>> When isolating a free page, what we want to know is which list
>>> the page is linked on. If it is linked on the isolate migratetype buddy list,
>>> we can skip the watermark check and freepage counting. And if it is linked
>>> on the CMA migratetype buddy list, we need to fix up the freepage counting. For
>>> this purpose, get_onbuddy_migratetype() is a better fit and cheaper than
>>> get_pageblock_migratetype(). So use it.
>>
>> Hm but you made get_onbuddy_migratetype() work only with CONFIG_MEMORY_ISOLATION. And __isolate_free_page is (despite the name) not at all limited to CONFIG_MEMORY_ISOLATION.
>>
>
> Currently __isolate_free_page is called only by split_free_page, and split_free_page only by isolate_freepages_block.
> split_free_page is called only for isolated pages now, but that can change someday.

Yeah, but isolate_freepages_block is also not part of
CONFIG_MEMORY_ISOLATION. Unfortunately, there are two distinct concepts,
memory isolation and LRU isolation, that are not distinguished in
function names.

> [...]

2014-07-14 06:16:43

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On Mon, Jul 07, 2014 at 04:33:09PM +0200, Vlastimil Babka wrote:
> On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
> > [...]
>
> Well I think that could be solved also by doing an lru/pcplists drain
> right after marking the pageblock as MIGRATE_ISOLATE. After that
> moment, anything newly put on the pcplists should observe the new
> migratetype and go to the correct pcplists. As putting stuff on
> pcplists is done with disabled interrupts, and draining is done by
> IPI, I think it should work correctly if we put the migratetype
> determination under the disabled-irq part in free_hot_cold_page().

Hello, sorry for the late reply.

Yes, this can close the race on the migratetype buddy list, but there is
a problem. See below.

>
> [...]
>
> Adding the extra pcplist drains and reordering
> get_pageblock_migratetype() under disabled IRQs should be small
> enough to be backportable and not bring any regressions to success
> rates. There will be some cost for the extra drains, but it is paid
> only when memory isolation actually happens. Extending the scope of
> disabled irqs would affect the fast path somewhat, but I guess less
> than the extra checks?

Your approach would close the race on the migratetype buddy list. But
I'm not sure how we can count the number of freepages correctly. IIUC,
unlike my approach, this approach allows merging between the isolate buddy
list and a normal buddy list, and that would result in miscounting the
number of freepages. You probably think that move_freepages_block() could
fix up the whole situation, but I don't think so.

After applying your approach,

CPU1                                    CPU2
set_migratetype_isolate()               free_one_page()
- lock
- set_pageblock_migratetype()
- unlock
- drain...
                                        - lock
                                        - __free_one_page() with MIGRATE_ISOLATE
                                        - merged with a normal page and linked
                                          onto the isolate buddy list
                                        - skip counting the freepage, because of
                                          MIGRATE_ISOLATE
                                        - unlock
- lock
- move_freepages_block()
  tries to fix up the freepage count
- unlock

We want to fix up the counting in move_freepages_block(), but we can't,
because we don't have enough information about whether this page is already
accounted in the number of freepages or not. Could you help me understand
the whole mechanism of your approach?
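
(To make the gap concrete -- a rough sketch of the relevant
move_freepages() loop in this era, abbreviated:)

	for (page = start_page; page <= end_page;) {
		if (!PageBuddy(page)) {
			page++;
			continue;
		}
		order = page_order(page);
		/* relink the free page onto the target freelist... */
		list_move(&page->lru,
			  &zone->free_area[order].free_list[migratetype]);
		set_freepage_migratetype(page, migratetype);
		page += 1 << order;
		/*
		 * ...but nothing records whether this page was already
		 * counted in NR_FREE_PAGES when it was freed, so the
		 * caller cannot derive the correct accounting delta
		 * from the return value alone.
		 */
		pages_moved += 1 << order;
	}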

And, IIUC, your approach can remove only one get_pageblock_migratetype()
in free_pcppages_bulk(). free_hot_cold_page() and free_one_page()
would still need is_migrate_isolate() branches to free the page to the
correct list and count the number of freepages correctly. Additionally,
it extends the scope of disabled irqs. So it doesn't look much better
than mine, because my approach also removes the
get_pageblock_migratetype() in free_pcppages_bulk().

Please let me know what I am missing. :)

Thanks.

>
> > [...]

2014-07-14 06:18:35

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm/page_alloc: handle page on pcp correctly if it's pageblock is isolated

On Mon, Jul 07, 2014 at 05:19:48PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> >If pageblock of page on pcp are isolated now, we should free it to isolate
> >buddy list to prevent future allocation on it. But current code doesn't
> >do this.
> >
> >Moreover, there is a freepage counting problem on current code. Although
> >pageblock of page on pcp are isolated now, it could go normal buddy list,
> >because get_onpcp_migratetype() will return non-isolate migratetype.
>
> get_onpcp_migratetype() is only introduced in later patch.

Yes, I will fix it.

>
> >In this case, we should do either adding freepage count or changing
> >migratetype to MIGRATE_ISOLATE, but, current code do neither.
>
> I wouldn't say it "do neither". It already limits the freepage
> counting to !MIGRATE_ISOLATE case (and it's not converted to
> __mod_zone_freepage_state for some reason). So there's accounting
> mismatch in addition to buddy list misplacement.

Okay.

>
> >This patch fixes these two problems by handling pageblock migratetype
> >before calling __free_one_page(). And, if we find the page on isolated
> >pageblock, change migratetype to MIGRATE_ISOLATE to prevent future
> >allocation of this page and freepage counting problem.
>
> So although this is not an addition of a new pageblock migratetype
> check to the fast path (the check is already there), I would prefer
> removing the check :)

Yes, I want to do it if possible. :)

> With the approach of pcplists draining
> outlined in my reply to 00/10, we would allow a misplacement to
> happen (and the page accounted as freepage) immediately followed by
> move_frepages_block which would place the page onto isolate freelist
> with the rest. Anything newly freed will get isolate_migratetype
> determined in free_hot_cold_page or __free_pages_ok (where it would
> need moving the migratepage check under the disabled irq part) and
> be placed and buddy-merged properly.

I explained the problem of this approach in 00/10.

Thanks.

2014-07-14 06:23:09

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 05/10] mm/page_alloc: optimize and unify pageblock migratetype check in free path

On Mon, Jul 07, 2014 at 05:50:09PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> >Currently, when we free the page from pcp list to buddy, we check
> >pageblock of the page in order to isolate the page on isolated
> >pageblock. Although this could rarely happen and to check migratetype of
> >pageblock is somewhat expensive, we check it on free fast path. I think
> >that this is undesirable. To prevent this situation, I introduce new
> >variable, nr_isolate_pageblock on struct zone and use it to determine
> >if we should check pageblock migratetype. Isolation on pageblock rarely
> >happens so we can mostly avoid this pageblock migratetype check.
>
> Better, but still there's a zone flag check and maintenance. So if
> it could be avoided, it would be better.
>
> >Additionally, unify freepage counting code, because it can be done in
> >common part, __free_one_page(). This unifying provides extra guarantee
> >that the page on isolate pageblock don't go into non-isolate buddy list.
> >This is similar situation describing in previous patch so refer it
> >if you need more explanation.
>
> You should make it clearer that you are solving misplacement of the
> type "page should be placed on isolated freelist but it's not"
> through free_one_page(), which was solved only for
> free_pcppages_bulk() in patch 03/10. Mentioning patch 04/10 here,
> which solves the opposite problem "page shouldn't be placed on
> isolated freelist, but it is", only confuses the situation. Also
> this patch undoes everything of 04/10 and moves it elsewhere, so
> that would make it harder to git blame etc. I would reorder 04 and
> 05.

Okay. I will clarify what I am solving in the commit description and
reorder the patches appropriately.
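
For reference, the scheme described above amounts to roughly the
following in free_one_page() (just a sketch to illustrate the idea,
not the exact patch code; nr_isolate_pageblock is the new zone field):

static void free_one_page(struct zone *zone, struct page *page,
			unsigned long pfn, unsigned int order,
			int migratetype)
{
	spin_lock(&zone->lock);

	/*
	 * Pay for the pageblock lookup only while some pageblock in
	 * this zone is actually isolated. That is rare, so the free
	 * fast path usually skips it.
	 */
	if (unlikely(zone->nr_isolate_pageblock))
		migratetype = get_pfnblock_migratetype(page, pfn);

	__free_one_page(page, pfn, zone, order, migratetype);
	if (!is_migrate_isolate(migratetype))
		__mod_zone_freepage_state(zone, 1 << order, migratetype);

	spin_unlock(&zone->lock);
}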

Thanks.

2014-07-14 06:28:22

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type

On Mon, Jul 07, 2014 at 05:57:49PM +0200, Vlastimil Babka wrote:
> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
> >When isolating free page, what we want to know is which list
> >the page is linked. If it is linked in isolate migratetype buddy list,
> >we can skip watermark check and freepage counting. And if it is linked
> >in CMA migratetype buddy list, we need to fixup freepage counting. For
> >this purpose, get_onbuddy_migratetype() is more fit and cheap than
> >get_pageblock_migratetype(). So use it.
>
> Hm but you made get_onbuddy_migratetype() work only with
> CONFIG_MEMORY_ISOLATION. And __isolate_free_page is (despite the
> name) not at all limited to CONFIG_MEMORY_ISOLATION.

get_onbuddy_migratetype() is only used to determine whether the page is
on the isolate buddy list or not. So if !CONFIG_MEMORY_ISOLATION, the
default value of get_onbuddy_migratetype() keeps things correct.

But I should add a code comment.
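Something like this, maybe (a rough sketch of the stub):

#ifdef CONFIG_MEMORY_ISOLATION
int get_onbuddy_migratetype(struct page *page);
#else
/*
 * Without CONFIG_MEMORY_ISOLATION there is no isolate buddy list, so
 * returning any non-isolate migratetype keeps the callers correct:
 * __isolate_free_page() will always do the watermark check and the
 * freepage accounting.
 */
static inline int get_onbuddy_migratetype(struct page *page)
{
	return MIGRATE_UNMOVABLE;
}
#endif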

Thanks.

2014-07-14 09:49:41

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On 07/14/2014 08:22 AM, Joonsoo Kim wrote:
> On Mon, Jul 07, 2014 at 04:33:09PM +0200, Vlastimil Babka wrote:
>> On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
>>> Ccing Lisa, because there was bug report it may be related this
>>> topic last Saturday.
>>>
>>> http://www.spinics.net/lists/linux-mm/msg75741.html
>>>
>>> On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
>>>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>>>> Hello,
>>>>
>>>> Hi Joonsoo,
>>>>
>>>> please CC me on further updates, this is relevant to me.
>>>
>>> Hello, Vlastimil.
>>>
>>> Sorry for missing you. :)
>>>
>>>>
>>>>> This patchset aims at fixing problems due to memory isolation found by
>>>>> testing my patchset [1].
>>>>>
>>>>> These are really subtle problems so I can be wrong. If you find what I am
>>>>> missing, please let me know.
>>>>>
>>>>> Before describing bugs itself, I first explain definition of freepage.
>>>>>
>>>>> 1. pages on buddy list are counted as freepage.
>>>>> 2. pages on isolate migratetype buddy list are *not* counted as freepage.
>>>>
>>>> I think the second point is causing us a lot of trouble. And I wonder if it's really
>>>> justified! We already have some is_migrate_isolate() checks in the fast path and now you
>>>> would add more, mostly just to keep this accounting correct.
>>>
>>> It's not just for keeping accouting correct. There is another
>>> purpose for is_migrate_isolate(). It forces freed pages to go into
>>> isolate buddy list. Without it, freed pages would go into other
>>> buddy list and will be used soon. So memory isolation can't work well
>>> without is_migrate_isolate() checks and success rate could decrease.
>>
>> Well I think that could be solved also by doing a lru/pcplists drain
>> right after marking pageblock as MIGRATETYPE_ISOLATE. After that
>> moment, anything newly put on pcplists should observe the new
>> migratetype and go to the correct pcplists. As putting stuff on
>> pcplists is done with disabled interrupts, and draining is done by
>> IPI, I think it should work correctly if we put the migratetype
>> determination under the disabled irq part in free_hot_cold_page().
>
> Hello, sorry for late.
>
> Yes, this can close the race on migratetype buddy list, but, there is
> a problem. See the below.
>
>>
>>> And, I just added three more is_migrate_isolate() in the fast
>>> path, but, two checks are in same *unlikely* branch and I can remove
>>> another one easily. Therefore it's not quite problem I guess. (It even
>>> does no-op if MEMORY_ISOLATION is disabled.)
>>> Moreover, I removed one unconditional get_pageblock_migratetype() in
>>> free_pcppages_bulk() so, in performance point or view, freepath would
>>> be improved.
>>
>> I haven't checked the individual patches yet, but I'll try.
>>
>>>>
>>>> So the question is, does it have to be correct? And (admiteddly not after a completely
>>>> exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
>>>>
>>>> Well I of course don't mean that the freepage accounts could go random completely, but
>>>> what if we allowed them to drift a bit, limiting both the max error and the timeframe
>>>> where errors are possible? After all, watermarks checking is already racy so I don't think
>>>> it would be hurt that much.
>>>
>>> I understand your suggestion. I once thought like as you, but give up
>>> that idea. Watermark checking is already racy, but, it's *only*
>>> protection to prevent memory allocation. After passing that check,
>>> there is no mean to prevent us from allocating memory. So it should
>>> be accurate as much as possible. If we allow someone to get the
>>> memory without considering memory isolation, free memory could be in
>>> really low state and system could be broken occasionally.
>>
>> It would definitely require more thorough review, but I think with
>> the pcplists draining changes outlined above, there shouldn't be any
>> more inaccuracy happening.
>>
>>>>
>>>> Now if we look at what both CMA and memory hot-remove does is:
>>>>
>>>> 1. Mark a MAX_ORDER-aligned buch of pageblocks as MIGRATE_ISOLATE through
>>>> start_isolate_page_range(). As part of this, all free pages in that area are
>>>> moved on the isolate freelist through move_freepages_block().
>>>>
>>>> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add
>>>> caches.
>>>>
>>>> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or
>>>> undo pageblock isolation if not.
>>>>
>>>> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy
>>>> allocator, which would therefore account free pages on the isolate freelist as normal free
>>>> pages.
>>>> The accounting of isolated pages would be instead done only on the top level of CMA/hot
>>>> remove in the three steps outlined above, which would be modified as follows:
>>>>
>>>> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1
>>>> as usual. Calculate X as the number of pages that move_freepages_block() has moved.
>>>> Subtract X from freepages (this is the same as it is done now), but also *remember the
>>>> value of X*
>>>>
>>>> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on
>>>> isolate freelist, or not. We don't care, they will be accounted as freepages either way.
>>>> This is where some inaccuracy in accounted freepages would build up.
>>>>
>>>> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is
>>>> that we have a isolated range of N pages that nobody can steal now as everything is on
>>>> isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first
>>>> X, then N-X). So the accounting matches reality.
>>>>
>>>> If we have to undo, we undo the isolation and as part of this, we use
>>>> move_freepages_block() to move pages from isolate freelist to the normal ones. But we
>>>> don't care how many pages were moved. We simply add the remembered value of X to the
>>>> number of freepages, undoing the change from step 1. Again, the accounting matches reality.
>>>>
>>>>
>>>> The final point is that if we do this per MAX_ORDER blocks, the error in accounting cannot
>>>> be ever larger than 4MB and will be visible only during time a single MAX_ORDER block is
>>>> handled.
>>>
>>> The 4MB error in accounting looks serious for me. min_free_kbytes is
>>> 4MB in 1GB system. So this 4MB error would makes all things broken in
>>> such systems. Moreover, there are some ARCH having larger
>>> pageblock_order than MAX_ORDER. In this case, the error will be larger
>>> than 4MB.
>>>
>>> In addition, I have a plan to extend CMA to work in parallel. It means
>>> that there could be parallel memory isolation users rather than just
>>> one user at the same time, so, we cannot easily bound the error under
>>> some degree.
>>
>> OK. I had thought that the error would be much smaller in practice,
>> but probably due to how buddy merging works, it would be enough if
>> just the very last freed page in the MAX_ORDER block was misplaced,
>> and that would trigger a cascading merge that will end with single
>> page at MAX_ORDER size becoming misplaced. So let's probably forget
>> this approach.
>>
>>>> As a possible improvement, we can assume during phase 2 that every page freed by migration
>>>> will end up correctly on isolate free list. So we create M free pages by migration, and
>>>> subtract M from freepage account. Then in phase 3 we either subtract (N - X - M), or add X
>>>> + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to
>>>> pages on pcplists and the possible races it will be less). I think with this improvement,
>>>> any error would be negligible.
>>>>
>>>> Thoughts?
>>>
>>> Thanks for suggestion. :)
>>> It is really good topic to think deeply.
>>>
>>> For now, I'd like to fix these problems without side-effect as you
>>> suggested. Your suggestion changes the meaning of freepage that
>>> isolated pages are included in nr_freepage and there could be possible
>>> regression in success rate of memory hotplug and CMA. Possibly, it
>>> is the way we have to go, but, IMHO, it isn't the time to go. Before
>>> going that way, we should fix current implementation first so that
>>> fixes can be backported to old kernel if someone needs.
>>
>> Adding the extra pcplist drains and reordering
>> get_pageblock_migratetype() under disabled IRQ's should be small
>> enough to be backportable and not bring any regressions to success
>> rates. There will be some cost for the extra drains, but paid only
>> when the memory isolation actually happens. Extending the scope of
>> disabled irq's would affect fast path somewhat, but I guess less
>> than extra checks?
>
> Your approach would close the race on migratetype buddy list. But,
> I'm not sure how we can count number of freepages correctly. IIUC,
> unlike my approach, this approach allows merging between isolate buddy
> list and normal buddy list and it would results in miscounting number of
> freepages. You probably think that move_freepages_block() could fixup
> all the situation, but, I don't think so.

No, I didn't think that, I simply overlooked this scenario. Good catch.

> After applying your approach,
>
> CPU1                                 CPU2
> set_migratetype_isolate()            free_one_page()
> - lock
> - set_pageblock_migratetype()
> - unlock
> - drain...
>                                      - lock
>                                      - __free_one_page() with MIGRATE_ISOLATE
>                                      - merged with normal page and linked with
>                                        isolate buddy list
>                                      - skip to count freepage, because of
>                                        MIGRATE_ISOLATE
>                                      - unlock
> - lock
> - move_freepages_block()
>   try to fixup freepages count
> - unlock
>
> We want to fixup counting in move_freepages_block(), but, we can't
> because we don't have enough information whether this page is already
> accounted on number of freepage or not. Could you help me to find out
> the whole mechanism of your approach?

I can't think of an easy fix to this situation.

A non-trivial fix that comes to mind (and I might have overlooked
something) is something like:

- distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
- CPU1 first sets MIGRATETYPE_ISOLATING before the drain
- when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on special
unbounded pcplist and that's it
- CPU1 does the drain as usual, potentially misplacing some pages that
move_freepages_block() will then fix. But no wrong merging can occur.
- after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING to
MIGRATETYPE_ISOLATED
- CPU2 can then start freeing directly on isolate buddy list. There
might be some pages still on the special pcplist of CPU2/CPUx but that
means they won't merge yet.
- CPU1 executes on all CPU's a new operation that flushes the special
pcplist on isolate buddy list and merge as needed.
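
In (pseudo)code the free path would then branch like this; note that
MIGRATETYPE_ISOLATING/ISOLATED and pcp->isolate_list are just the
hypothetical names from the steps above, nothing that exists today:

/* Sketch of free_hot_cold_page() under this scheme. */
void free_hot_cold_page(struct page *page, bool cold)
{
	struct zone *zone = page_zone(page);
	unsigned long flags, pfn = page_to_pfn(page);
	struct per_cpu_pages *pcp;
	int migratetype;

	local_irq_save(flags);
	/* determined with IRQs disabled, as discussed above */
	migratetype = get_pfnblock_migratetype(page, pfn);
	pcp = &this_cpu_ptr(zone->pageset)->pcp;

	if (unlikely(migratetype == MIGRATETYPE_ISOLATING)) {
		/*
		 * Isolation in progress: park the page on the special
		 * unbounded list where it cannot buddy-merge. The
		 * isolating CPU flushes it later.
		 */
		list_add(&page->lru, &pcp->isolate_list);
	} else if (unlikely(migratetype == MIGRATETYPE_ISOLATED)) {
		/* Isolation done: free straight to the isolate buddy list. */
		free_one_page(zone, page, pfn, 0, MIGRATE_ISOLATE);
	} else {
		/* Normal fast path, unchanged. */
		list_add(&page->lru, &pcp->lists[migratetype]);
		pcp->count++;
	}
	local_irq_restore(flags);
}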



2014-07-15 08:22:35

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On Mon, Jul 14, 2014 at 11:49:25AM +0200, Vlastimil Babka wrote:
> On 07/14/2014 08:22 AM, Joonsoo Kim wrote:
> >Your approach would close the race on migratetype buddy list. But,
> >I'm not sure how we can count number of freepages correctly. IIUC,
> >unlike my approach, this approach allows merging between isolate buddy
> >list and normal buddy list and it would results in miscounting number of
> >freepages. You probably think that move_freepages_block() could fixup
> >all the situation, but, I don't think so.
>
> No, I didn't think that, I simply overlooked this scenario. Good catch.
>
> >After applying your approach,
> >
> >CPU1                                CPU2
> >set_migratetype_isolate()           free_one_page()
> > - lock
> > - set_pageblock_migratetype()
> > - unlock
> > - drain...
> >                                    - lock
> >                                    - __free_one_page() with MIGRATE_ISOLATE
> >                                    - merged with normal page and linked with
> >                                      isolate buddy list
> >                                    - skip to count freepage, because of
> >                                      MIGRATE_ISOLATE
> >                                    - unlock
> > - lock
> > - move_freepages_block()
> >   try to fixup freepages count
> > - unlock
> >
> >We want to fixup counting in move_freepages_block(), but, we can't
> >because we don't have enough information whether this page is already
> >accounted on number of freepage or not. Could you help me to find out
> >the whole mechanism of your approach?
>
> I can't think of an easy fix to this situation.
>
> A non-trivial fix that comes to mind (and I might have overlooked
> something) is something like:
>
> - distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
> - CPU1 first sets MIGRATETYPE_ISOLATING before the drain
> - when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on
> special unbounded pcplist and that's it
> - CPU1 does the drain as usual, potentially misplacing some pages
> that move_freepages_block() will then fix. But no wrong merging can
> occur.
> - after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING
> to MIGRATETYPE_ISOLATED
> - CPU2 can then start freeing directly on isolate buddy list. There
> might be some pages still on the special pcplist of CPU2/CPUx but
> that means they won't merge yet.
> - CPU1 executes on all CPU's a new operation that flushes the
> special pcplist on isolate buddy list and merge as needed.
>

Really thanks for sharing idea.

It looks possible but I guess that it needs more branches related to
pageblock isolation. Now I have a quick thought to prevent merging,
but, I'm not sure that it is better than current patchset. After more
thinking, I will post rough idea here.

Thanks.

2014-07-15 08:36:50

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On 07/15/2014 10:28 AM, Joonsoo Kim wrote:
> On Mon, Jul 14, 2014 at 11:49:25AM +0200, Vlastimil Babka wrote:
>> On 07/14/2014 08:22 AM, Joonsoo Kim wrote:
>>> Your approach would close the race on migratetype buddy list. But,
>>> I'm not sure how we can count number of freepages correctly. IIUC,
>>> unlike my approach, this approach allows merging between isolate buddy
>>> list and normal buddy list and it would results in miscounting number of
>>> freepages. You probably think that move_freepages_block() could fixup
>>> all the situation, but, I don't think so.
>>
>> No, I didn't think that, I simply overlooked this scenario. Good catch.
>>
>>> After applying your approach,
>>>
>>> CPU1                               CPU2
>>> set_migratetype_isolate()          free_one_page()
>>> - lock
>>> - set_pageblock_migratetype()
>>> - unlock
>>> - drain...
>>>                                    - lock
>>>                                    - __free_one_page() with MIGRATE_ISOLATE
>>>                                    - merged with normal page and linked with
>>>                                      isolate buddy list
>>>                                    - skip to count freepage, because of
>>>                                      MIGRATE_ISOLATE
>>>                                    - unlock
>>> - lock
>>> - move_freepages_block()
>>>   try to fixup freepages count
>>> - unlock
>>>
>>> We want to fixup counting in move_freepages_block(), but, we can't
>>> because we don't have enough information whether this page is already
>>> accounted on number of freepage or not. Could you help me to find out
>>> the whole mechanism of your approach?
>>
>> I can't think of an easy fix to this situation.
>>
>> A non-trivial fix that comes to mind (and I might have overlooked
>> something) is something like:
>>
>> - distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
>> - CPU1 first sets MIGRATETYPE_ISOLATING before the drain
>> - when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on
>> special unbounded pcplist and that's it
>> - CPU1 does the drain as usual, potentially misplacing some pages
>> that move_freepages_block() will then fix. But no wrong merging can
>> occur.
>> - after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING
>> to MIGRATETYPE_ISOLATED
>> - CPU2 can then start freeing directly on isolate buddy list. There
>> might be some pages still on the special pcplist of CPU2/CPUx but
>> that means they won't merge yet.
>> - CPU1 executes on all CPU's a new operation that flushes the
>> special pcplist on isolate buddy list and merge as needed.
>>
>
> Really thanks for sharing idea.

Ah, you didn't find a hole yet, good sign :D

> It looks possible but I guess that it needs more branches related to
> pageblock isolation. Now I have a quick thought to prevent merging,
> but, I'm not sure that it is better than current patchset. After more
> thinking, I will post rough idea here.

I was thinking about it more and maybe it wouldn't need a new
migratetype after all. But it would always need to free isolate pages on
the special pcplist. That means this pcplist would be used not only
during the call to start_isolate_page_range, but all the way until
undo_isolate_page_range(). I don't think it's a problem and it
simplifies things. The only way to move to isolate freelist is through
the new isolate pcplist flush operation initiated by a single CPU at
well defined time.

The undo would look like:
- (migratetype is still set to MIGRATETYPE_ISOLATE, CPU2 frees affected
pages to the special freelist)
- CPU1 does move_freepages_block() to put pages back from isolate
freelist to e.g. MOVABLE or CMA. At this point, nobody will put new
pages on isolate freelist.
- CPU1 changes migratetype of the pageblock to e.g. MOVABLE. CPU2 and
others start freeing normally. Merging can occur only on the MOVABLE
freelist, as isolate freelist is empty and nobody puts pages there.
- CPU1 flushes the isolate pcplists of all CPU's on the MOVABLE
freelist. Merging is again correct.

I think your plan of multiple parallel CMA allocations (and thus
multiple parallel isolations) is also possible. The isolate pcplists can
be shared by pages coming from multiple parallel isolations. But the
flush operation needs a pfn start/end parameters to only flush pages
belonging to the given isolation. That might mean a bit of inefficient
list traversing, but I don't think it's a problem.

Makes sense?
Vlastimil

> Thanks.
>

2014-07-16 08:37:48

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On Tue, Jul 15, 2014 at 10:36:35AM +0200, Vlastimil Babka wrote:
> >>A non-trivial fix that comes to mind (and I might have overlooked
> >>something) is something like:
> >>
> >>- distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
> >>- CPU1 first sets MIGRATETYPE_ISOLATING before the drain
> >>- when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on
> >>special unbounded pcplist and that's it
> >>- CPU1 does the drain as usual, potentially misplacing some pages
> >>that move_freepages_block() will then fix. But no wrong merging can
> >>occur.
> >>- after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING
> >>to MIGRATETYPE_ISOLATED
> >>- CPU2 can then start freeing directly on isolate buddy list. There
> >>might be some pages still on the special pcplist of CPU2/CPUx but
> >>that means they won't merge yet.
> >>- CPU1 executes on all CPU's a new operation that flushes the
> >>special pcplist on isolate buddy list and merge as needed.
> >>
> >
> >Really thanks for sharing idea.
>
> Ah, you didn't find a hole yet, good sign :D
>
> >It looks possible but I guess that it needs more branches related to
> >pageblock isolation. Now I have a quick thought to prevent merging,
> >but, I'm not sure that it is better than current patchset. After more
> >thinking, I will post rough idea here.
>
> I was thinking about it more and maybe it wouldn't need a new
> migratetype after all. But it would always need to free isolate
> pages on the special pcplist. That means this pcplist would be used
> not only during the call to start_isolate_page_range, but all the
> way until undo_isolate_page_range(). I don't think it's a problem
> and it simplifies things. The only way to move to isolate freelist
> is through the new isolate pcplist flush operation initiated by a
> single CPU at well defined time.
>
> The undo would look like:
> - (migratetype is still set to MIGRATETYPE_ISOLATE, CPU2 frees
> affected pages to the special freelist)
> - CPU1 does move_freepages_block() to put pages back from isolate
> freelist to e.g. MOVABLE or CMA. At this point, nobody will put new
> pages on isolate freelist.
> - CPU1 changes migratetype of the pageblock to e.g. MOVABLE. CPU2
> and others start freeing normally. Merging can occur only on the
> MOVABLE freelist, as isolate freelist is empty and nobody puts pages
> there.
> - CPU1 flushes the isolate pcplists of all CPU's on the MOVABLE
> freelist. Merging is again correct.
>
> I think your plan of multiple parallel CMA allocations (and thus
> multiple parallel isolations) is also possible. The isolate pcplists
> can be shared by pages coming from multiple parallel isolations. But
> the flush operation needs a pfn start/end parameters to only flush
> pages belonging to the given isolation. That might mean a bit of
> inefficient list traversing, but I don't think it's a problem.

I think that the special pcplist would cause a problem if we have to
check a pfn range. If there are too many pages on this pcplist, moving
pages from this pcplist to the isolate freelist takes too long in irq
context and the system could be broken. This operation cannot be easily
stopped, because it is initiated by an IPI on another cpu, and the
starter of this IPI expects that all pages on the other cpus' pcplists
are moved properly when returning from on_each_cpu().

And, if there are so many pages, serious lock contention would happen
in this case.

Anyway, my idea's key point is using PageIsolated() to distinguish
isolated page, instead of using PageBuddy(). If page is PageIsolated(),
it isn't handled as freepage although it is in buddy allocator. During free,
page with MIGRATETYPE_ISOLATE will be marked as PageIsolated() and
won't be merged and counted for freepage.

When we move pages from normal buddy list to isolate buddy
list, we check PageBuddy() and subtract number of PageBuddy() pages
from number of freepage. And, change page from PageBuddy() to PageIsolated()
since it is handled as isolated page at this point. In this way, freepage
count will be correct.

Unisolation can be done by similar approach.

I made a prototype of this approach and it isn't intrusive to the core
allocator compared to my previous patchset.
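
The move step of the prototype looks roughly like this (simplified;
the PageIsolated() helpers are the new, not-yet-existing part):

/* Caller holds zone->lock; [start_pfn, end_pfn) is one pageblock. */
static int move_to_isolate_list(struct zone *zone,
				unsigned long start_pfn,
				unsigned long end_pfn)
{
	unsigned long pfn = start_pfn;
	int moved = 0;

	while (pfn < end_pfn) {
		struct page *page = pfn_to_page(pfn);
		unsigned int order;

		/* PageBuddy() identifies pages linked in a normal list. */
		if (!PageBuddy(page)) {
			pfn++;
			continue;
		}

		order = page_order(page);
		list_move(&page->lru,
			  &zone->free_area[order].free_list[MIGRATE_ISOLATE]);

		/*
		 * The page was counted as freepage until now: subtract
		 * it, and re-mark it so that it is no longer treated
		 * (or merged) as a free page.
		 */
		__mod_zone_freepage_state(zone, -(1 << order),
					  get_freepage_migratetype(page));
		rmv_page_order(page);		/* clears PageBuddy() */
		__SetPageIsolated(page);	/* hypothetical helper */
		set_page_private(page, order);	/* keep the order */

		moved += 1 << order;
		pfn += 1 << order;
	}
	return moved;
}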

Make sense?

Thanks.

2014-07-16 11:14:37

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On 07/16/2014 10:43 AM, Joonsoo Kim wrote:
>> I think your plan of multiple parallel CMA allocations (and thus
>> multiple parallel isolations) is also possible. The isolate pcplists
>> can be shared by pages coming from multiple parallel isolations. But
>> the flush operation needs a pfn start/end parameters to only flush
>> pages belonging to the given isolation. That might mean a bit of
>> inefficient list traversing, but I don't think it's a problem.
>
> I think that the special pcplist would cause a problem if we have to
> check a pfn range. If there are too many pages on this pcplist, moving
> pages from this pcplist to the isolate freelist takes too long in irq
> context and the system could be broken. This operation cannot be easily
> stopped, because it is initiated by an IPI on another cpu, and the
> starter of this IPI expects that all pages on the other cpus' pcplists
> are moved properly when returning from on_each_cpu().
>
> And, if there are so many pages, serious lock contention would happen
> in this case.

Hm I see. So what if it wasn't a special pcplist, but a special "free list"
where the pages would be just linked together as on pcplist, regardless of
order, and would not merge until the CPU that drives the memory isolation
process decides it is safe to flush them away. That would remove the need for
IPI's and provide the same guarantees I think.
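
Concretely, something like this (hypothetical sketch; the list and the
helper are new names):

/*
 * Pages freed while their pageblock is isolated get parked here,
 * linked through page->lru regardless of order and never merged.
 * Only the CPU driving the isolation flushes this list, so no IPIs
 * are needed. Assumes a new field in struct zone:
 *
 *	struct list_head isolated_free_list;
 */
static void free_to_isolated_list(struct zone *zone, struct page *page,
				  unsigned int order)
{
	/* called under zone->lock from the free path */
	set_page_private(page, order);	/* remember order for the flush */
	list_add(&page->lru, &zone->isolated_free_list);
}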

> Anyway, my idea's key point is using PageIsolated() to distinguish
> isolated page, instead of using PageBuddy(). If page is PageIsolated(),

Is PageIsolated a completely new page flag? Those are a limited resource so I
would expect some resistance to such approach. Or a new special page->_mapcount
value? That could maybe work.

> it isn't handled as freepage although it is in buddy allocator. During free,
> page with MIGRATETYPE_ISOLATE will be marked as PageIsolated() and
> won't be merged and counted for freepage.

OK. Preventing wrong merging is the key point and this should work.

> When we move pages from normal buddy list to isolate buddy
> list, we check PageBuddy() and subtract number of PageBuddy() pages

Do we really need to check PageBuddy()? Could a page get marked as PageIsolate()
but still go to normal list instead of isolate list?

> from number of freepage. And, change page from PageBuddy() to PageIsolated()
> since it is handled as isolated page at this point. In this way, freepage
> count will be correct.
>
> Unisolation can be done by similar approach.
>
> I made a prototype of this approach and it isn't intrusive to the core
> allocator compared to my previous patchset.
>
> Make sense?

I think so :)

> Thanks.
>

2014-07-17 06:06:03

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On Wed, Jul 16, 2014 at 01:14:26PM +0200, Vlastimil Babka wrote:
> On 07/16/2014 10:43 AM, Joonsoo Kim wrote:
> >> I think your plan of multiple parallel CMA allocations (and thus
> >> multiple parallel isolations) is also possible. The isolate pcplists
> >> can be shared by pages coming from multiple parallel isolations. But
> >> the flush operation needs a pfn start/end parameters to only flush
> >> pages belonging to the given isolation. That might mean a bit of
> >> inefficient list traversing, but I don't think it's a problem.
> >
> > I think that a special pcplist would cause a problem if we have to check
> > the pfn range. If there are too many pages on this pcplist, moving pages
> > from it to the isolate freelist takes too long in irq context and the
> > system could break. This operation cannot be easily stopped because it
> > is initiated by an IPI on another cpu, and the initiator of that IPI
> > expects that all pages on the other cpus' pcplists have been moved
> > properly when on_each_cpu() returns.
> >
> > And if there are that many pages, serious lock contention would happen
> > in this case.
>
> Hm I see. So what if it wasn't a special pcplist, but a special "free list"
> where the pages would be just linked together as on a pcplist, regardless of
> order, and would not merge until the CPU that drives the memory isolation
> process decides it is safe to flush them away? That would remove the need for
> IPIs and provide the same guarantees, I think.

Looks good. It would work. I think that your solution is better than mine.
I will implement and test it.

>
> > Anyway, the key point of my idea is using PageIsolated() to distinguish
> > isolated pages, instead of using PageBuddy(). If a page is PageIsolated(),
>
> Is PageIsolated a completely new page flag? Those are a limited resource, so I
> would expect some resistance to such an approach. Or is it a new special
> page->_mapcount value? That could maybe work.

Yes, it is a new special page->_mapcount value.

> > it isn't handled as a freepage although it is in the buddy allocator. During
> > free, a page with migratetype MIGRATE_ISOLATE will be marked as
> > PageIsolated() and won't be merged or counted as a freepage.
>
> OK. Preventing wrong merging is the key point and this should work.
>
> > When we move pages from the normal buddy list to the isolate buddy
> > list, we check PageBuddy() and subtract the number of PageBuddy() pages
>
> Do we really need to check PageBuddy()? Could a page get marked as
> PageIsolated() but still go to the normal list instead of the isolate list?

Checking PageBuddy() is how we identify pages still linked in the normal
lists.
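
To be clear, the move is a pfn walk over the pageblock, something like
this (sketch; the _sketch helpers are illustrative):

static unsigned long move_to_isolate_sketch(struct zone *zone,
                                            unsigned long start_pfn)
{
        unsigned long pfn = start_pfn, moved = 0;

        while (pfn < start_pfn + pageblock_nr_pages) {
                struct page *page = pfn_to_page(pfn);
                unsigned int order;

                /* Not PageBuddy(): allocated, on a pcplist, or already
                 * isolated, so there is nothing to move here. */
                if (!PageBuddy(page)) {
                        pfn++;
                        continue;
                }

                order = page_order_sketch(page);
                list_move(&page->lru,
                          &zone->free_area[order].free_list[MIGRATE_ISOLATE]);
                clear_page_buddy_sketch(page);
                __SetPageIsolated(page);
                moved += 1UL << order;
                pfn += 1UL << order;
        }

        /* The caller subtracts 'moved' from the zone freepage counter. */
        return moved;
}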

Thanks.

2014-07-17 09:28:26

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation

On 07/17/2014 08:12 AM, Joonsoo Kim wrote:
>>
>> Hm I see. So what if it wasn't a special pcplist, but a special "free list"
>> where the pages would be just linked together as on a pcplist, regardless of
>> order, and would not merge until the CPU that drives the memory isolation
>> process decides it is safe to flush them away? That would remove the need for
>> IPIs and provide the same guarantees, I think.
>
> Looks good. It would work. I think that your solution is better than mine.
> I will implement and test it.

Thanks. But maybe there's still a good use for marking pages specially
as isolated, rather than as PageBuddy() with freepage_migratetype set to
MIGRATE_ISOLATE. Why do we buddy-merge them anyway? I don't think CMA or
memory offlining benefits from that, and it only makes the code more
complex.
So maybe we could split isolated buddy pages into order-0 pages marked as
PageIsolated and keep a single per-zone list instead of the order-based
freelists?
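
Sketch of what I mean (zone->isolated_pages is a made-up per-zone list):

/* Free an isolated page as loose order-0 pages: no merging and no
 * order-based freelists to maintain. */
static void free_isolated_as_order0(struct zone *zone, struct page *page,
                                    unsigned int order)
{
        unsigned long i;

        for (i = 0; i < (1UL << order); i++) {
                struct page *p = page + i;

                __SetPageIsolated(p);
                list_add(&p->lru, &zone->isolated_pages);
        }
}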

>> Do we really need to check PageBuddy()? Could a page get marked as
>> PageIsolated() but still go to the normal list instead of the isolate list?
>
> Checking PageBuddy() is how we identify pages still linked in the normal
> lists.

Ah right, I forgot it walks by pfn scanning, not by traversing the free list.

> Thanks.