Changes from v4 to v5
* Patch 3: Only the freepage counting logic is moved. Everything else remains as is.
(Vlastimil)
* Patch 4: Consider merging during the un-isolation process. (Minchan)
* Add some Ack tags
Changes from v3 to v4
* Patch 1: Add a code comment on nr_isolate_pageblock in struct zone (Naoya)
Add one more check in free_one_page() that checks whether the
migratetype is MIGRATE_ISOLATE or not.
* Patch 4: Use min() to prevent overflow of buddy merge order (Naoya)
* Remove RFC tag
* Add stable tag on all patches
Changes from v1, v2 to v3
* A lot of comments that led this patchset in the right direction
(Vlastimil and Minchan)
This is the version 5 patchset, an improved and minimized version of
version 1, fixing the freepage accounting problem during memory isolation.
I tried a different approach in version 2, but it turned out to be really
complicated, so I changed my mind and improved version 1 instead. You can
see versions 1 and 2 in the following links [1] [2], respectively.
IMO, this v5 is better than v2 because it is simpler than v2, hence
better for maintenance, and it doesn't change the pageblock isolation
logic, so it is much easier to backport.
These problems were found by testing my patchset [3]. There are some race
conditions in pageblock isolation, and these races cause an incorrect
freepage count.
Before describing the bugs themselves, I first define what counts as a freepage.
1. Pages on a buddy list are counted as freepages.
2. Pages on the isolate migratetype buddy list are *not* counted as freepages.
3. Pages on a CMA buddy list are also counted as CMA freepages.
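For reference, definitions 1 and 3 above are implemented by a single counting
helper; below is a minimal sketch of __mod_zone_freepage_state() as it appears
in v3.18, with comments added here for illustration. Definition 2 is enforced
by the callers, which skip this helper for MIGRATE_ISOLATE pages.

static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
					     int migratetype)
{
	/* Every page on a buddy list adjusts NR_FREE_PAGES (definition 1). */
	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
	/* CMA pages are additionally tracked as CMA freepages (definition 3). */
	if (is_migrate_cma(migratetype))
		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}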
Now, I describe the problems and the related patches.
Patch 1: There are race conditions when getting the pageblock migratetype
that result in misplacement of freepages on the buddy list, an incorrect
freepage count, and unavailability of freepages.
Patch 2: Freepages on the pcp list could carry stale cached information
used to determine the migratetype of the buddy list they go to. This causes
misplacement of freepages on the buddy list and an incorrect freepage count.
Patch 4: Merging between freepages on pageblocks of different migratetypes
causes a freepage accounting problem. This patch fixes it.
Without patchset [3], the above problems don't happen in my CMA allocation
test, because CMA reserved pages aren't used at all, so there is no
chance for the races above.
With patchset [3], I ran a simple CMA allocation test and got the result below.
- Virtual machine, 4 CPUs, 1024 MB memory, 256 MB CMA reservation
- run kernel build (make -j16) in the background
- 30 CMA allocation attempts (8MB * 30 = 240MB) at 5 sec intervals
- Result: more than 5000 freepages missing from the count
With patchset [3] and this patchset, I found that no freepage counts were
missed, so I conclude that the problems are solved.
These problems also occur in my simple memory offlining test environment.
This patchset is based on v3.18-rc2.
Please see individual patches for more information.
Thanks.
Joonsoo Kim (4):
mm/page_alloc: fix incorrect isolation behavior by rechecking
migratetype
mm/page_alloc: add freepage on isolate pageblock to correct buddy
list
mm/page_alloc: move freepage counting logic to __free_one_page()
mm/page_alloc: restrict max order of merging on isolated pageblock
include/linux/mmzone.h | 9 +++++++
include/linux/page-isolation.h | 8 ++++++
mm/internal.h | 25 +++++++++++++++++++
mm/page_alloc.c | 54 ++++++++++++++++------------------------
mm/page_isolation.c | 33 ++++++++++++++++++++++++
5 files changed, 96 insertions(+), 33 deletions(-)
--
1.7.9.5
There are two paths to reach the core free function of the buddy allocator,
__free_one_page(): one is free_one_page()->__free_one_page() and the
other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
Each path has a race condition causing serious problems. This patch
focuses on the first type of freepath; the following patch will solve
the problem in the second type of freepath.
In the first type of freepath, we get the migratetype of the freeing page
without holding the zone lock, so it could be racy. There are two cases of
this race.
1. Pages are added to the isolate buddy list after restoring the original
migratetype
CPU1 CPU2
get migratetype => return MIGRATE_ISOLATE
call free_one_page() with MIGRATE_ISOLATE
grab the zone lock
unisolate pageblock
release the zone lock
grab the zone lock
call __free_one_page() with MIGRATE_ISOLATE
freepage go into isolate buddy list,
although pageblock is already unisolated
This may cause two problems. One is that we can't use this page anymore
until the next isolation attempt on this pageblock, because the freepage is
on the isolate buddy list. The other is that freepage accounting could be
wrong due to merging between different buddy lists. Freepages on the isolate
buddy list aren't counted as freepages, but ones on a normal buddy list are
counted as freepages. If a merge happens, the buddy freepage on the normal
buddy list is inevitably moved to the isolate buddy list without any
consideration of freepage accounting, so the count could become incorrect,
as the abridged merge loop below shows.
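For illustration, here is an abridged sketch of the merge loop in
__free_one_page() as of v3.18 (guard-page handling and the tail-placement
heuristic are omitted). The buddy is unlinked from whichever buddy list it
sits on, and the merged page lands on the list selected by the caller's
migratetype, with no accounting fixup:

	while (order < MAX_ORDER-1) {
		buddy_idx = __find_buddy_index(page_idx, order);
		buddy = page + (buddy_idx - page_idx);
		if (!page_is_buddy(page, buddy, order))
			break;
		/* Unlink the buddy no matter which buddy list
		 * (normal or isolate) it currently sits on. */
		list_del(&buddy->lru);
		zone->free_area[order].nr_free--;
		rmv_page_order(buddy);
		combined_idx = buddy_idx & page_idx;
		page = page + (combined_idx - page_idx);
		page_idx = combined_idx;
		order++;
	}
	set_page_order(page, order);
	/* The merged page goes to the list chosen by the (possibly
	 * stale) migratetype passed in by the caller. */
	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
	zone->free_area[order].nr_free++;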
2. Pages are added to a normal buddy list while the pageblock is isolated.
This is similar to the case above.
This may also cause two problems. One is that we can't keep these
freepages from being allocated. Although this pageblock is isolated, the
freepage would be added to a normal buddy list so that it could be
allocated without any restriction. And the other problem is the same as in
case 1, that is, incorrect freepage accounting.
This race condition can be prevented by checking the migratetype again
while holding the zone lock. Because that is a somewhat heavy operation
and isn't needed in the common case, we want to avoid the recheck as much
as possible. So this patch introduces a new variable, nr_isolate_pageblock,
in struct zone, to check whether there is any isolated pageblock.
With this, we can avoid re-checking the migratetype in the common case and
do it only if there is an isolated pageblock or the migratetype is
MIGRATE_ISOLATE. This solves the problems mentioned above.
Changes from v3:
Add one more check in free_one_page() that checks whether the migratetype
is MIGRATE_ISOLATE or not. Without this, case 1 above could still happen.
Cc: <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 9 +++++++++
include/linux/page-isolation.h | 8 ++++++++
mm/page_alloc.c | 11 +++++++++--
mm/page_isolation.c | 2 ++
4 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4593567..3d090af 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -431,6 +431,15 @@ struct zone {
*/
int nr_migrate_reserve_block;
+#ifdef CONFIG_MEMORY_ISOLATION
+ /*
+ * Number of isolated pageblock. It is used to solve incorrect
+ * freepage counting problem due to racy retrieving migratetype
+ * of pageblock. Protected by zone->lock.
+ */
+ unsigned long nr_isolate_pageblock;
+#endif
+
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 3fff8e7..2dc1e16 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -2,6 +2,10 @@
#define __LINUX_PAGEISOLATION_H
#ifdef CONFIG_MEMORY_ISOLATION
+static inline bool has_isolate_pageblock(struct zone *zone)
+{
+ return zone->nr_isolate_pageblock;
+}
static inline bool is_migrate_isolate_page(struct page *page)
{
return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
@@ -11,6 +15,10 @@ static inline bool is_migrate_isolate(int migratetype)
return migratetype == MIGRATE_ISOLATE;
}
#else
+static inline bool has_isolate_pageblock(struct zone *zone)
+{
+ return false;
+}
static inline bool is_migrate_isolate_page(struct page *page)
{
return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d3ed723..f7a867e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -749,9 +749,16 @@ static void free_one_page(struct zone *zone,
if (nr_scanned)
__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+ if (unlikely(has_isolate_pageblock(zone) ||
+ is_migrate_isolate(migratetype))) {
+ migratetype = get_pfnblock_migratetype(page, pfn);
+ if (is_migrate_isolate(migratetype))
+ goto skip_counting;
+ }
+ __mod_zone_freepage_state(zone, 1 << order, migratetype);
+
+skip_counting:
__free_one_page(page, pfn, zone, order, migratetype);
- if (unlikely(!is_migrate_isolate(migratetype)))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
spin_unlock(&zone->lock);
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index d1473b2..1fa4a4d 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -60,6 +60,7 @@ out:
int migratetype = get_pageblock_migratetype(page);
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+ zone->nr_isolate_pageblock++;
nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
__mod_zone_freepage_state(zone, -nr_pages, migratetype);
@@ -83,6 +84,7 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
nr_pages = move_freepages_block(zone, page, migratetype);
__mod_zone_freepage_state(zone, nr_pages, migratetype);
set_pageblock_migratetype(page, migratetype);
+ zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
}
--
1.7.9.5
In free_pcppages_bulk(), we use the cached migratetype of a freepage
to determine the type of buddy list where the freepage will be added.
This information is stored when the freepage is added to the pcp list, so
if isolation of this freepage's pageblock begins after it is stored,
the cached information could be stale. In other words, it holds the
original migratetype rather than MIGRATE_ISOLATE.
There are two problems caused by this stale information. One is that
we can't keep these freepages from being allocated. Although this
pageblock is isolated, the freepage will be added to a normal buddy list
so that it could be allocated without any restriction. And the other
problem is incorrect freepage accounting: freepages on an isolated
pageblock should not be counted in the number of freepages.
The following is the code snippet from free_pcppages_bulk().
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
if (likely(!is_migrate_isolate_page(page))) {
__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
if (is_migrate_cma(mt))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
}
As you can see in the snippet above, the current code already handles
the second problem, incorrect freepage accounting, by re-fetching the
pageblock migratetype through is_migrate_isolate_page(page). But, because
this re-fetched information isn't used for __free_one_page(), the first
problem is not solved. This patch solves the situation by re-fetching the
pageblock migratetype before __free_one_page() and using it for
__free_one_page().
In addition to moving this re-fetch up, this patch applies an
optimization: re-fetch the migratetype only if there is an isolated
pageblock. Pageblock isolation is a rare event, so with this optimization
we can avoid the re-fetch in the common case.
This patch also corrects the migratetype in the tracepoint output.
Cc: <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f7a867e..6df23fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -725,14 +725,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
mt = get_freepage_migratetype(page);
+ if (unlikely(has_isolate_pageblock(zone))) {
+ mt = get_pageblock_migratetype(page);
+ if (is_migrate_isolate(mt))
+ goto skip_counting;
+ }
+ __mod_zone_freepage_state(zone, 1, mt);
+
+skip_counting:
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
- if (likely(!is_migrate_isolate_page(page))) {
- __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
- if (is_migrate_cma(mt))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
- }
} while (--to_free && --batch_free && !list_empty(list));
}
spin_unlock(&zone->lock);
--
1.7.9.5
All the callers of __free_one_page() have similar freepage counting logic,
so we can move it into __free_one_page(). This reduces the lines of code and
helps future maintenance. It is also a preparation step for "mm/page_alloc:
restrict max order of merging on isolated pageblock", which fixes the
freepage counting problem for freepages with more than pageblock order.
Changes from v4:
Only the freepage counting logic is moved. Everything else remains as is.
Cc: <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6df23fe..2bc7768 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -579,6 +579,8 @@ static inline void __free_one_page(struct page *page,
return;
VM_BUG_ON(migratetype == -1);
+ if (!is_migrate_isolate(migratetype))
+ __mod_zone_freepage_state(zone, 1 << order, migratetype);
page_idx = pfn & ((1 << MAX_ORDER) - 1);
@@ -725,14 +727,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
mt = get_freepage_migratetype(page);
- if (unlikely(has_isolate_pageblock(zone))) {
+ if (unlikely(has_isolate_pageblock(zone)))
mt = get_pageblock_migratetype(page);
- if (is_migrate_isolate(mt))
- goto skip_counting;
- }
- __mod_zone_freepage_state(zone, 1, mt);
-skip_counting:
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
@@ -755,12 +752,7 @@ static void free_one_page(struct zone *zone,
if (unlikely(has_isolate_pageblock(zone) ||
is_migrate_isolate(migratetype))) {
migratetype = get_pfnblock_migratetype(page, pfn);
- if (is_migrate_isolate(migratetype))
- goto skip_counting;
}
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
-
-skip_counting:
__free_one_page(page, pfn, zone, order, migratetype);
spin_unlock(&zone->lock);
}
--
1.7.9.5
The current pageblock isolation logic can isolate each pageblock
individually. This causes a freepage accounting problem if a freepage with
pageblock order on an isolated pageblock is merged with another freepage on
a normal pageblock. We can prevent the merging by restricting the max order
of merging to pageblock order if the freepage is on an isolated pageblock.
A side-effect of this change is that there can be non-merged buddy
freepages even after pageblock isolation finishes, because undoing pageblock
isolation just moves freepages from the isolate buddy list to a normal buddy
list rather than considering merging. So, the patch also makes undoing
pageblock isolation consider freepage merging: on un-isolation, a freepage
with more than pageblock order and its buddy are checked. If they are on a
normal pageblock, instead of just moving them, we isolate the freepage and
free it in order to get it merged.
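As a toy illustration of the pageblock math involved, the following
hypothetical standalone program (assuming pageblock_order == 9, the typical
x86 value with 4KB pages) shows that the buddy of a freepage with order >=
pageblock_order always lies in a different pageblock, which is why the
buddy's migratetype must be checked on un-isolation:

#include <stdio.h>

#define PAGEBLOCK_ORDER	9	/* assumption: typical x86 configuration */

/* Same math as __find_buddy_index() in the patch below. */
static unsigned long find_buddy_index(unsigned long page_idx, unsigned int order)
{
	return page_idx ^ (1UL << order);
}

int main(void)
{
	unsigned int order;

	/* Page index 0 is the first page of pageblock 0. */
	for (order = 8; order <= 10; order++) {
		unsigned long buddy_idx = find_buddy_index(0, order);

		printf("order %2u: buddy index %4lu lies in pageblock %lu\n",
		       order, buddy_idx, buddy_idx >> PAGEBLOCK_ORDER);
	}
	/* Prints pageblock 0 for order 8, but pageblocks 1 and 2 for
	 * orders 9 and 10: from pageblock_order upward the buddy is
	 * always outside the pageblock being un-isolated. */
	return 0;
}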
Changes from v4:
Consider merging during the un-isolation process.
Cc: <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/internal.h | 25 +++++++++++++++++++++++++
mm/page_alloc.c | 40 +++++++++++++---------------------------
mm/page_isolation.c | 31 +++++++++++++++++++++++++++++++
3 files changed, 69 insertions(+), 27 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 8293040..a4f90ba 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -108,6 +108,31 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
/*
* in mm/page_alloc.c
*/
+
+/*
+ * Locate the struct page for both the matching buddy in our
+ * pair (buddy1) and the combined O(n+1) page they form (page).
+ *
+ * 1) Any buddy B1 will have an order O twin B2 which satisfies
+ * the following equation:
+ * B2 = B1 ^ (1 << O)
+ * For example, if the starting buddy (buddy2) is #8 its order
+ * 1 buddy is #10:
+ * B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
+ *
+ * 2) Any buddy B will have an order O+1 parent P which
+ * satisfies the following equation:
+ * P = B & ~(1 << O)
+ *
+ * Assumption: *_mem_map is contiguous at least up to MAX_ORDER
+ */
+static inline unsigned long
+__find_buddy_index(unsigned long page_idx, unsigned int order)
+{
+ return page_idx ^ (1 << order);
+}
+
+extern int __isolate_free_page(struct page *page, unsigned int order);
extern void __free_pages_bootmem(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2bc7768..eb60d18 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -469,29 +469,6 @@ static inline void rmv_page_order(struct page *page)
}
/*
- * Locate the struct page for both the matching buddy in our
- * pair (buddy1) and the combined O(n+1) page they form (page).
- *
- * 1) Any buddy B1 will have an order O twin B2 which satisfies
- * the following equation:
- * B2 = B1 ^ (1 << O)
- * For example, if the starting buddy (buddy2) is #8 its order
- * 1 buddy is #10:
- * B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
- *
- * 2) Any buddy B will have an order O+1 parent P which
- * satisfies the following equation:
- * P = B & ~(1 << O)
- *
- * Assumption: *_mem_map is contiguous at least up to MAX_ORDER
- */
-static inline unsigned long
-__find_buddy_index(unsigned long page_idx, unsigned int order)
-{
- return page_idx ^ (1 << order);
-}
-
-/*
* This function checks whether a page is free && is the buddy
* we can do coalesce a page and its buddy if
* (a) the buddy is not in a hole &&
@@ -571,6 +548,7 @@ static inline void __free_one_page(struct page *page,
unsigned long combined_idx;
unsigned long uninitialized_var(buddy_idx);
struct page *buddy;
+ int max_order = MAX_ORDER;
VM_BUG_ON(!zone_is_initialized(zone));
@@ -579,15 +557,23 @@ static inline void __free_one_page(struct page *page,
return;
VM_BUG_ON(migratetype == -1);
- if (!is_migrate_isolate(migratetype))
+ if (is_migrate_isolate(migratetype)) {
+ /*
+ * We restrict max order of merging to prevent merge
+ * between freepages on isolate pageblock and normal
+ * pageblock. Without this, pageblock isolation
+ * could cause incorrect freepage accounting.
+ */
+ max_order = min(MAX_ORDER, pageblock_order + 1);
+ } else
__mod_zone_freepage_state(zone, 1 << order, migratetype);
- page_idx = pfn & ((1 << MAX_ORDER) - 1);
+ page_idx = pfn & ((1 << max_order) - 1);
VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
- while (order < MAX_ORDER-1) {
+ while (order < max_order - 1) {
buddy_idx = __find_buddy_index(page_idx, order);
buddy = page + (buddy_idx - page_idx);
if (!page_is_buddy(page, buddy, order))
@@ -1590,7 +1576,7 @@ void split_page(struct page *page, unsigned int order)
}
EXPORT_SYMBOL_GPL(split_page);
-static int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order)
{
unsigned long watermark;
struct zone *zone;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 1fa4a4d..df61c93 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -76,17 +76,48 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
{
struct zone *zone;
unsigned long flags, nr_pages;
+ struct page *isolated_page = NULL;
+ unsigned int order;
+ unsigned long page_idx, buddy_idx;
+ struct page *buddy;
+ int mt;
zone = page_zone(page);
spin_lock_irqsave(&zone->lock, flags);
if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
goto out;
+
+ /*
+ * Because freepage with more than pageblock_order on isolated
+ * pageblock is restricted to merge due to freepage counting problem,
+ * it is possible that there is free buddy page.
+ * move_freepages_block() doesn't care of merge so we need other
+ * approach in order to merge them. Isolation and free will make
+ * these pages to be merged.
+ */
+ if (PageBuddy(page)) {
+ order = page_order(page);
+ if (order >= pageblock_order) {
+ page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+ buddy_idx = __find_buddy_index(page_idx, order);
+ buddy = page + (buddy_idx - page_idx);
+ mt = get_pageblock_migratetype(buddy);
+
+ if (!is_migrate_isolate(mt)) {
+ __isolate_free_page(page, order);
+ set_page_refcounted(page);
+ isolated_page = page;
+ }
+ }
+ }
nr_pages = move_freepages_block(zone, page, migratetype);
__mod_zone_freepage_state(zone, nr_pages, migratetype);
set_pageblock_migratetype(page, migratetype);
zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
+ if (isolated_page)
+ __free_pages(isolated_page, order);
}
static inline struct page *
--
1.7.9.5
On 10/31/2014 08:25 AM, Joonsoo Kim wrote:
> All the callers of __free_one_page() have similar freepage counting logic,
> so we can move it into __free_one_page(). This reduces the lines of code and
> helps future maintenance. It is also a preparation step for "mm/page_alloc:
> restrict max order of merging on isolated pageblock", which fixes the
> freepage counting problem for freepages with more than pageblock order.
>
> Changes from v4:
> Only the freepage counting logic is moved. Everything else remains as is.
>
> Cc: <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
Looks like most of the cleanup was still possible; getting rid of the
skip_counting labels is especially nice.
Acked-by: Vlastimil Babka <[email protected]>
> ---
> mm/page_alloc.c | 14 +++-----------
> 1 file changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6df23fe..2bc7768 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -579,6 +579,8 @@ static inline void __free_one_page(struct page *page,
> return;
>
> VM_BUG_ON(migratetype == -1);
> + if (!is_migrate_isolate(migratetype))
> + __mod_zone_freepage_state(zone, 1 << order, migratetype);
>
> page_idx = pfn & ((1 << MAX_ORDER) - 1);
>
> @@ -725,14 +727,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> /* must delete as __free_one_page list manipulates */
> list_del(&page->lru);
> mt = get_freepage_migratetype(page);
> - if (unlikely(has_isolate_pageblock(zone))) {
> + if (unlikely(has_isolate_pageblock(zone)))
> mt = get_pageblock_migratetype(page);
> - if (is_migrate_isolate(mt))
> - goto skip_counting;
> - }
> - __mod_zone_freepage_state(zone, 1, mt);
>
> -skip_counting:
> /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> trace_mm_page_pcpu_drain(page, 0, mt);
> @@ -755,12 +752,7 @@ static void free_one_page(struct zone *zone,
> if (unlikely(has_isolate_pageblock(zone) ||
> is_migrate_isolate(migratetype))) {
> migratetype = get_pfnblock_migratetype(page, pfn);
> - if (is_migrate_isolate(migratetype))
> - goto skip_counting;
> }
> - __mod_zone_freepage_state(zone, 1 << order, migratetype);
> -
> -skip_counting:
> __free_one_page(page, pfn, zone, order, migratetype);
> spin_unlock(&zone->lock);
> }
>
On 10/31/2014 08:25 AM, Joonsoo Kim wrote:
> @@ -571,6 +548,7 @@ static inline void __free_one_page(struct page *page,
> unsigned long combined_idx;
> unsigned long uninitialized_var(buddy_idx);
> struct page *buddy;
> + int max_order = MAX_ORDER;
>
> VM_BUG_ON(!zone_is_initialized(zone));
>
> @@ -579,15 +557,23 @@ static inline void __free_one_page(struct page *page,
> return;
>
> VM_BUG_ON(migratetype == -1);
> - if (!is_migrate_isolate(migratetype))
> + if (is_migrate_isolate(migratetype)) {
> + /*
> + * We restrict max order of merging to prevent merge
> + * between freepages on isolate pageblock and normal
> + * pageblock. Without this, pageblock isolation
> + * could cause incorrect freepage accounting.
> + */
> + max_order = min(MAX_ORDER, pageblock_order + 1);
> + } else
> __mod_zone_freepage_state(zone, 1 << order, migratetype);
Please add { } to the else branch, this looks ugly :)
> - page_idx = pfn & ((1 << MAX_ORDER) - 1);
> + page_idx = pfn & ((1 << max_order) - 1);
>
> VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
> VM_BUG_ON_PAGE(bad_range(zone, page), page);
>
> - while (order < MAX_ORDER-1) {
> + while (order < max_order - 1) {
> buddy_idx = __find_buddy_index(page_idx, order);
> buddy = page + (buddy_idx - page_idx);
> if (!page_is_buddy(page, buddy, order))
> @@ -1590,7 +1576,7 @@ void split_page(struct page *page, unsigned int order)
> }
> EXPORT_SYMBOL_GPL(split_page);
>
> -static int __isolate_free_page(struct page *page, unsigned int order)
> +int __isolate_free_page(struct page *page, unsigned int order)
> {
> unsigned long watermark;
> struct zone *zone;
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 1fa4a4d..df61c93 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -76,17 +76,48 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> {
> struct zone *zone;
> unsigned long flags, nr_pages;
> + struct page *isolated_page = NULL;
> + unsigned int order;
> + unsigned long page_idx, buddy_idx;
> + struct page *buddy;
> + int mt;
>
> zone = page_zone(page);
> spin_lock_irqsave(&zone->lock, flags);
> if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
> goto out;
> +
> + /*
> + * Because freepage with more than pageblock_order on isolated
> + * pageblock is restricted to merge due to freepage counting problem,
> + * it is possible that there is free buddy page.
> + * move_freepages_block() doesn't care of merge so we need other
> + * approach in order to merge them. Isolation and free will make
> + * these pages to be merged.
> + */
> + if (PageBuddy(page)) {
> + order = page_order(page);
> + if (order >= pageblock_order) {
> + page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
> + buddy_idx = __find_buddy_index(page_idx, order);
> + buddy = page + (buddy_idx - page_idx);
> + mt = get_pageblock_migratetype(buddy);
> +
> + if (!is_migrate_isolate(mt)) {
You could use is_migrate_isolate_page(buddy) and save a variable.
> + __isolate_free_page(page, order);
> + set_page_refcounted(page);
> + isolated_page = page;
> + }
> + }
> + }
> nr_pages = move_freepages_block(zone, page, migratetype);
- this is a costly no-op when the whole pageblock is an isolated page,
right?
> __mod_zone_freepage_state(zone, nr_pages, migratetype);
- with isolated_page set, this means you increase freepage_state here,
and then the second time in __free_pages() below? __isolate_free_page()
won't decrease it as the pageblock is still MIGRATE_ISOLATE, so the net
result is overcounting?
> set_pageblock_migratetype(page, migratetype);
> zone->nr_isolate_pageblock--;
> out:
> spin_unlock_irqrestore(&zone->lock, flags);
> + if (isolated_page)
> + __free_pages(isolated_page, order);
> }
>
> static inline struct page *
>
On Fri, Oct 31, 2014 at 03:39:13PM +0100, Vlastimil Babka wrote:
> On 10/31/2014 08:25 AM, Joonsoo Kim wrote:
> >@@ -571,6 +548,7 @@ static inline void __free_one_page(struct page *page,
> > unsigned long combined_idx;
> > unsigned long uninitialized_var(buddy_idx);
> > struct page *buddy;
> >+ int max_order = MAX_ORDER;
> >
> > VM_BUG_ON(!zone_is_initialized(zone));
> >
> >@@ -579,15 +557,23 @@ static inline void __free_one_page(struct page *page,
> > return;
> >
> > VM_BUG_ON(migratetype == -1);
> >- if (!is_migrate_isolate(migratetype))
> >+ if (is_migrate_isolate(migratetype)) {
> >+ /*
> >+ * We restrict max order of merging to prevent merge
> >+ * between freepages on isolate pageblock and normal
> >+ * pageblock. Without this, pageblock isolation
> >+ * could cause incorrect freepage accounting.
> >+ */
> >+ max_order = min(MAX_ORDER, pageblock_order + 1);
> >+ } else
> > __mod_zone_freepage_state(zone, 1 << order, migratetype);
>
> Please add { } to the else branch, this looks ugly :)
Sure.
>
> >- page_idx = pfn & ((1 << MAX_ORDER) - 1);
> >+ page_idx = pfn & ((1 << max_order) - 1);
> >
> > VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
> > VM_BUG_ON_PAGE(bad_range(zone, page), page);
> >
> >- while (order < MAX_ORDER-1) {
> >+ while (order < max_order - 1) {
> > buddy_idx = __find_buddy_index(page_idx, order);
> > buddy = page + (buddy_idx - page_idx);
> > if (!page_is_buddy(page, buddy, order))
> >@@ -1590,7 +1576,7 @@ void split_page(struct page *page, unsigned int order)
> > }
> > EXPORT_SYMBOL_GPL(split_page);
> >
> >-static int __isolate_free_page(struct page *page, unsigned int order)
> >+int __isolate_free_page(struct page *page, unsigned int order)
> > {
> > unsigned long watermark;
> > struct zone *zone;
> >diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> >index 1fa4a4d..df61c93 100644
> >--- a/mm/page_isolation.c
> >+++ b/mm/page_isolation.c
> >@@ -76,17 +76,48 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> > {
> > struct zone *zone;
> > unsigned long flags, nr_pages;
> >+ struct page *isolated_page = NULL;
> >+ unsigned int order;
> >+ unsigned long page_idx, buddy_idx;
> >+ struct page *buddy;
> >+ int mt;
> >
> > zone = page_zone(page);
> > spin_lock_irqsave(&zone->lock, flags);
> > if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
> > goto out;
> >+
> >+ /*
> >+ * Because freepage with more than pageblock_order on isolated
> >+ * pageblock is restricted to merge due to freepage counting problem,
> >+ * it is possible that there is free buddy page.
> >+ * move_freepages_block() doesn't care of merge so we need other
> >+ * approach in order to merge them. Isolation and free will make
> >+ * these pages to be merged.
> >+ */
> >+ if (PageBuddy(page)) {
> >+ order = page_order(page);
> >+ if (order >= pageblock_order) {
> >+ page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
> >+ buddy_idx = __find_buddy_index(page_idx, order);
> >+ buddy = page + (buddy_idx - page_idx);
> >+ mt = get_pageblock_migratetype(buddy);
> >+
> >+ if (!is_migrate_isolate(mt)) {
>
> You could use is_migrate_isolate_page(buddy) and save a variable.
Okay.
> >+ __isolate_free_page(page, order);
> >+ set_page_refcounted(page);
> >+ isolated_page = page;
> >+ }
> >+ }
> >+ }
> > nr_pages = move_freepages_block(zone, page, migratetype);
>
> - this is a costly no-op when the whole pageblock is an isolated
> page, right?
Okay. I will fix it.
>
> > __mod_zone_freepage_state(zone, nr_pages, migratetype);
>
> - with isolated_page set, this means you increase freepage_state
> here, and then the second time in __free_pages() below?
> __isolate_free_page() won't decrease it as the pageblock is still
> MIGRATE_ISOLATE, so the net result is overcounting?
After __isolate_free_page(), the freepage has no buddy flag, so
move_freepages_block() doesn't move or count it. The freepage state
therefore only increases in __free_pages(), and the net result is correct.
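To spell that out, an annotated sketch of the v5 unset_migratetype_isolate()
flow for the isolated_page case (comments added here for illustration):

	__isolate_free_page(page, order);
		/* unlinks the page from the isolate buddy list and clears
		 * PageBuddy; does not decrease the freepage count because
		 * the pageblock is still MIGRATE_ISOLATE */
	nr_pages = move_freepages_block(zone, page, migratetype);
		/* finds no PageBuddy page left, so nr_pages == 0 */
	__mod_zone_freepage_state(zone, nr_pages, migratetype);
		/* a no-op with nr_pages == 0 */
	set_pageblock_migratetype(page, migratetype);
	...
	__free_pages(isolated_page, order);
		/* frees into the now-normal pageblock and increases the
		 * freepage count exactly once */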
Below is the update for your comment.
Thanks.
------------>8----------------
From 4bf298908aba16935c7699589c60d00fa0cf389c Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Mon, 25 Aug 2014 09:52:13 +0900
Subject: [PATCH v6 4/4] mm/page_alloc: restrict max order of merging on isolated
pageblock
The current pageblock isolation logic can isolate each pageblock
individually. This causes a freepage accounting problem if a freepage with
pageblock order on an isolated pageblock is merged with another freepage on
a normal pageblock. We can prevent the merging by restricting the max order
of merging to pageblock order if the freepage is on an isolated pageblock.
A side-effect of this change is that there can be non-merged buddy
freepages even after pageblock isolation finishes, because undoing pageblock
isolation just moves freepages from the isolate buddy list to a normal buddy
list rather than considering merging. So, the patch also makes undoing
pageblock isolation consider freepage merging: on un-isolation, a freepage
with more than pageblock order and its buddy are checked. If they are on a
normal pageblock, instead of just moving them, we isolate the freepage and
free it in order to get it merged.
Changes from v5:
Avoid the costly move_freepages_block() if there is no freepage to move.
Some cosmetic changes.
Changes from v4:
Consider merging during the un-isolation process.
Cc: <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/internal.h | 25 +++++++++++++++++++++++++
mm/page_alloc.c | 41 ++++++++++++++---------------------------
mm/page_isolation.c | 41 +++++++++++++++++++++++++++++++++++++++--
3 files changed, 78 insertions(+), 29 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 8293040..a4f90ba 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -108,6 +108,31 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
/*
* in mm/page_alloc.c
*/
+
+/*
+ * Locate the struct page for both the matching buddy in our
+ * pair (buddy1) and the combined O(n+1) page they form (page).
+ *
+ * 1) Any buddy B1 will have an order O twin B2 which satisfies
+ * the following equation:
+ * B2 = B1 ^ (1 << O)
+ * For example, if the starting buddy (buddy2) is #8 its order
+ * 1 buddy is #10:
+ * B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
+ *
+ * 2) Any buddy B will have an order O+1 parent P which
+ * satisfies the following equation:
+ * P = B & ~(1 << O)
+ *
+ * Assumption: *_mem_map is contiguous at least up to MAX_ORDER
+ */
+static inline unsigned long
+__find_buddy_index(unsigned long page_idx, unsigned int order)
+{
+ return page_idx ^ (1 << order);
+}
+
+extern int __isolate_free_page(struct page *page, unsigned int order);
extern void __free_pages_bootmem(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2bc7768..a009d0a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -469,29 +469,6 @@ static inline void rmv_page_order(struct page *page)
}
/*
- * Locate the struct page for both the matching buddy in our
- * pair (buddy1) and the combined O(n+1) page they form (page).
- *
- * 1) Any buddy B1 will have an order O twin B2 which satisfies
- * the following equation:
- * B2 = B1 ^ (1 << O)
- * For example, if the starting buddy (buddy2) is #8 its order
- * 1 buddy is #10:
- * B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
- *
- * 2) Any buddy B will have an order O+1 parent P which
- * satisfies the following equation:
- * P = B & ~(1 << O)
- *
- * Assumption: *_mem_map is contiguous at least up to MAX_ORDER
- */
-static inline unsigned long
-__find_buddy_index(unsigned long page_idx, unsigned int order)
-{
- return page_idx ^ (1 << order);
-}
-
-/*
* This function checks whether a page is free && is the buddy
* we can do coalesce a page and its buddy if
* (a) the buddy is not in a hole &&
@@ -571,6 +548,7 @@ static inline void __free_one_page(struct page *page,
unsigned long combined_idx;
unsigned long uninitialized_var(buddy_idx);
struct page *buddy;
+ int max_order = MAX_ORDER;
VM_BUG_ON(!zone_is_initialized(zone));
@@ -579,15 +557,24 @@ static inline void __free_one_page(struct page *page,
return;
VM_BUG_ON(migratetype == -1);
- if (!is_migrate_isolate(migratetype))
+ if (is_migrate_isolate(migratetype)) {
+ /*
+ * We restrict max order of merging to prevent merge
+ * between freepages on isolate pageblock and normal
+ * pageblock. Without this, pageblock isolation
+ * could cause incorrect freepage accounting.
+ */
+ max_order = min(MAX_ORDER, pageblock_order + 1);
+ } else {
__mod_zone_freepage_state(zone, 1 << order, migratetype);
+ }
- page_idx = pfn & ((1 << MAX_ORDER) - 1);
+ page_idx = pfn & ((1 << max_order) - 1);
VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
- while (order < MAX_ORDER-1) {
+ while (order < max_order - 1) {
buddy_idx = __find_buddy_index(page_idx, order);
buddy = page + (buddy_idx - page_idx);
if (!page_is_buddy(page, buddy, order))
@@ -1590,7 +1577,7 @@ void split_page(struct page *page, unsigned int order)
}
EXPORT_SYMBOL_GPL(split_page);
-static int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order)
{
unsigned long watermark;
struct zone *zone;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 1fa4a4d..c8778f7 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -76,17 +76,54 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
{
struct zone *zone;
unsigned long flags, nr_pages;
+ struct page *isolated_page = NULL;
+ unsigned int order;
+ unsigned long page_idx, buddy_idx;
+ struct page *buddy;
zone = page_zone(page);
spin_lock_irqsave(&zone->lock, flags);
if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
goto out;
- nr_pages = move_freepages_block(zone, page, migratetype);
- __mod_zone_freepage_state(zone, nr_pages, migratetype);
+
+ /*
+ * Because freepage with more than pageblock_order on isolated
+ * pageblock is restricted to merge due to freepage counting problem,
+ * it is possible that there is free buddy page.
+ * move_freepages_block() doesn't care of merge so we need other
+ * approach in order to merge them. Isolation and free will make
+ * these pages to be merged.
+ */
+ if (PageBuddy(page)) {
+ order = page_order(page);
+ if (order >= pageblock_order) {
+ page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+ buddy_idx = __find_buddy_index(page_idx, order);
+ buddy = page + (buddy_idx - page_idx);
+
+ if (!is_migrate_isolate_page(buddy)) {
+ __isolate_free_page(page, order);
+ set_page_refcounted(page);
+ isolated_page = page;
+ }
+ }
+ }
+
+ /*
+ * If we isolate freepage with more than pageblock_order, there
+ * should be no freepage in the range, so we could avoid costly
+ * pageblock scanning for freepage moving.
+ */
+ if (!isolated_page) {
+ nr_pages = move_freepages_block(zone, page, migratetype);
+ __mod_zone_freepage_state(zone, nr_pages, migratetype);
+ }
set_pageblock_migratetype(page, migratetype);
zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
+ if (isolated_page)
+ __free_pages(isolated_page, order);
}
static inline struct page *
--
1.7.9.5
Hello,
On 10/31/2014 04:25 PM, Joonsoo Kim wrote:
> In free_pcppages_bulk(), we use the cached migratetype of a freepage
> to determine the type of buddy list where the freepage will be added.
> This information is stored when the freepage is added to the pcp list, so
> if isolation of this freepage's pageblock begins after it is stored,
> the cached information could be stale. In other words, it holds the
> original migratetype rather than MIGRATE_ISOLATE.
>
> There are two problems caused by this stale information. One is that
> we can't keep these freepages from being allocated. Although this
> pageblock is isolated, the freepage will be added to a normal buddy list
> so that it could be allocated without any restriction. And the other
> problem is incorrect freepage accounting: freepages on an isolated
> pageblock should not be counted in the number of freepages.
>
> The following is the code snippet from free_pcppages_bulk().
>
> /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> trace_mm_page_pcpu_drain(page, 0, mt);
> if (likely(!is_migrate_isolate_page(page))) {
> __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
> if (is_migrate_cma(mt))
> __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
> }
>
> As you can see in the snippet above, the current code already handles
> the second problem, incorrect freepage accounting, by re-fetching the
> pageblock migratetype through is_migrate_isolate_page(page). But, because
> this re-fetched information isn't used for __free_one_page(), the first
> problem is not solved. This patch solves the situation by re-fetching the
> pageblock migratetype before __free_one_page() and using it for
> __free_one_page().
>
> In addition to moving this re-fetch up, this patch applies an
> optimization: re-fetch the migratetype only if there is an isolated
> pageblock. Pageblock isolation is a rare event, so with this optimization
> we can avoid the re-fetch in the common case.
>
> This patch also corrects the migratetype in the tracepoint output.
>
> Cc: <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> Acked-by: Michal Nazarewicz <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/page_alloc.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f7a867e..6df23fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -725,14 +725,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> /* must delete as __free_one_page list manipulates */
> list_del(&page->lru);
> mt = get_freepage_migratetype(page);
> + if (unlikely(has_isolate_pageblock(zone))) {
How about adding an additional check for 'mt == MIGRATE_MOVABLE' here?
Then, most of the get_pageblock_migratetype() calls could be avoided while
isolation is in progress. I am not sure whether this is the case for memory
offlining. What do you think?
> + mt = get_pageblock_migratetype(page);
> + if (is_migrate_isolate(mt))
> + goto skip_counting;
> + }
> + __mod_zone_freepage_state(zone, 1, mt);
> +
> +skip_counting:
> /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> trace_mm_page_pcpu_drain(page, 0, mt);
> - if (likely(!is_migrate_isolate_page(page))) {
> - __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
> - if (is_migrate_cma(mt))
> - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
> - }
> } while (--to_free && --batch_free && !list_empty(list));
> }
> spin_unlock(&zone->lock);
>
On 11/03/2014 09:10 AM, Joonsoo Kim wrote:
> On Fri, Oct 31, 2014 at 03:39:13PM +0100, Vlastimil Babka wrote:
>>> + __isolate_free_page(page, order);
>>> + set_page_refcounted(page);
>>> + isolated_page = page;
>>> + }
>>> + }
>>> + }
>>> nr_pages = move_freepages_block(zone, page, migratetype);
>>
>> - this is a costly no-op when the whole pageblock is an isolated
>> page, right?
>
> Okay. I will fix it.
>
>>
>>> __mod_zone_freepage_state(zone, nr_pages, migratetype);
>>
>> - with isolated_page set, this means you increase freepage_state
>> here, and then the second time in __free_pages() below?
>> __isolate_free_page() won't decrease it as the pageblock is still
>> MIGRATE_ISOLATE, so the net result is overcounting?
>
> After __isolate_free_page(), freepage has no buddy flag and
> move_freepages_block() doesn't move and count it. So, freepage_state
> only increase in __free_pages(). So net result will be correct.
Ah right, I forgot that it gets nr_pages from the move_freepages_block()
result (which is 0 in this case).
> Below is the update for your comment.
>
> Thanks.
>
> ------------>8----------------
> From 4bf298908aba16935c7699589c60d00fa0cf389c Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <[email protected]>
> Date: Mon, 25 Aug 2014 09:52:13 +0900
> Subject: [PATCH v6 4/4] mm/page_alloc: restrict max order of merging on isolated
> pageblock
>
> The current pageblock isolation logic can isolate each pageblock
> individually. This causes a freepage accounting problem if a freepage with
> pageblock order on an isolated pageblock is merged with another freepage on
> a normal pageblock. We can prevent the merging by restricting the max order
> of merging to pageblock order if the freepage is on an isolated pageblock.
>
> A side-effect of this change is that there can be non-merged buddy
> freepages even after pageblock isolation finishes, because undoing pageblock
> isolation just moves freepages from the isolate buddy list to a normal buddy
> list rather than considering merging. So, the patch also makes undoing
> pageblock isolation consider freepage merging: on un-isolation, a freepage
> with more than pageblock order and its buddy are checked. If they are on a
> normal pageblock, instead of just moving them, we isolate the freepage and
> free it in order to get it merged.
>
> Changes from v5:
> Avoid the costly move_freepages_block() if there is no freepage to move.
> Some cosmetic changes.
>
> Changes from v4:
> Consider merging during the un-isolation process.
>
> Cc: <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On Mon, Nov 3, 2014 at 4:22 PM, Heesub Shin <[email protected]> wrote:
> Hello,
>
>
> On 10/31/2014 04:25 PM, Joonsoo Kim wrote:
>>
>> In free_pcppages_bulk(), we use the cached migratetype of a freepage
>> to determine the type of buddy list where the freepage will be added.
>> This information is stored when the freepage is added to the pcp list, so
>> if isolation of this freepage's pageblock begins after it is stored,
>> the cached information could be stale. In other words, it holds the
>> original migratetype rather than MIGRATE_ISOLATE.
>>
>> There are two problems caused by this stale information. One is that
>> we can't keep these freepages from being allocated. Although this
>> pageblock is isolated, the freepage will be added to a normal buddy list
>> so that it could be allocated without any restriction. And the other
>> problem is incorrect freepage accounting: freepages on an isolated
>> pageblock should not be counted in the number of freepages.
>>
>> The following is the code snippet from free_pcppages_bulk().
>>
>> /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
>> __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> trace_mm_page_pcpu_drain(page, 0, mt);
>> if (likely(!is_migrate_isolate_page(page))) {
>> __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
>> if (is_migrate_cma(mt))
>> __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
>> }
>>
>> As you can see in the snippet above, the current code already handles
>> the second problem, incorrect freepage accounting, by re-fetching the
>> pageblock migratetype through is_migrate_isolate_page(page). But, because
>> this re-fetched information isn't used for __free_one_page(), the first
>> problem is not solved. This patch solves the situation by re-fetching the
>> pageblock migratetype before __free_one_page() and using it for
>> __free_one_page().
>>
>> In addition to moving this re-fetch up, this patch applies an
>> optimization: re-fetch the migratetype only if there is an isolated
>> pageblock. Pageblock isolation is a rare event, so with this optimization
>> we can avoid the re-fetch in the common case.
>>
>> This patch also corrects the migratetype in the tracepoint output.
>>
>> Cc: <[email protected]>
>> Acked-by: Minchan Kim <[email protected]>
>> Acked-by: Michal Nazarewicz <[email protected]>
>> Acked-by: Vlastimil Babka <[email protected]>
>> Signed-off-by: Joonsoo Kim <[email protected]>
>> ---
>> mm/page_alloc.c | 13 ++++++++-----
>> 1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index f7a867e..6df23fe 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -725,14 +725,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>> /* must delete as __free_one_page list manipulates */
>> list_del(&page->lru);
>> mt = get_freepage_migratetype(page);
>> + if (unlikely(has_isolate_pageblock(zone))) {
>
>
> How about adding an additional check for 'mt == MIGRATE_MOVABLE' here?
> Then, most of the get_pageblock_migratetype() calls could be avoided while
> isolation is in progress. I am not sure whether this is the case for memory
> offlining. What do you think?
I think the reason is that this "mt" may not be the right value for this page.
It is set without zone->lock held.
Thanks,
Hui
>
>> + mt = get_pageblock_migratetype(page);
>> + if (is_migrate_isolate(mt))
>> + goto skip_counting;
>> + }
>> + __mod_zone_freepage_state(zone, 1, mt);
>> +
>> +skip_counting:
>> /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
>> __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> trace_mm_page_pcpu_drain(page, 0, mt);
>> - if (likely(!is_migrate_isolate_page(page))) {
>> - __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
>> - if (is_migrate_cma(mt))
>> - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
>> - }
>> } while (--to_free && --batch_free && !list_empty(list));
>> }
>> spin_unlock(&zone->lock);
>>
On Mon, Nov 03, 2014 at 05:22:19PM +0900, Heesub Shin wrote:
> Hello,
>
> On 10/31/2014 04:25 PM, Joonsoo Kim wrote:
> >In free_pcppages_bulk(), we use the cached migratetype of a freepage
> >to determine the type of buddy list where the freepage will be added.
> >This information is stored when the freepage is added to the pcp list, so
> >if isolation of this freepage's pageblock begins after it is stored,
> >the cached information could be stale. In other words, it holds the
> >original migratetype rather than MIGRATE_ISOLATE.
> >
> >There are two problems caused by this stale information. One is that
> >we can't keep these freepages from being allocated. Although this
> >pageblock is isolated, the freepage will be added to a normal buddy list
> >so that it could be allocated without any restriction. And the other
> >problem is incorrect freepage accounting: freepages on an isolated
> >pageblock should not be counted in the number of freepages.
> >
> >The following is the code snippet from free_pcppages_bulk().
> >
> >/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> >__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> >trace_mm_page_pcpu_drain(page, 0, mt);
> >if (likely(!is_migrate_isolate_page(page))) {
> > __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
> > if (is_migrate_cma(mt))
> > __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
> >}
> >
> >As you can see in the snippet above, the current code already handles
> >the second problem, incorrect freepage accounting, by re-fetching the
> >pageblock migratetype through is_migrate_isolate_page(page). But, because
> >this re-fetched information isn't used for __free_one_page(), the first
> >problem is not solved. This patch solves the situation by re-fetching the
> >pageblock migratetype before __free_one_page() and using it for
> >__free_one_page().
> >
> >In addition to moving this re-fetch up, this patch applies an
> >optimization: re-fetch the migratetype only if there is an isolated
> >pageblock. Pageblock isolation is a rare event, so with this optimization
> >we can avoid the re-fetch in the common case.
> >
> >This patch also corrects the migratetype in the tracepoint output.
> >
> >Cc: <[email protected]>
> >Acked-by: Minchan Kim <[email protected]>
> >Acked-by: Michal Nazarewicz <[email protected]>
> >Acked-by: Vlastimil Babka <[email protected]>
> >Signed-off-by: Joonsoo Kim <[email protected]>
> >---
> > mm/page_alloc.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index f7a867e..6df23fe 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -725,14 +725,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > /* must delete as __free_one_page list manipulates */
> > list_del(&page->lru);
> > mt = get_freepage_migratetype(page);
> >+ if (unlikely(has_isolate_pageblock(zone))) {
>
> How about adding an additional check for 'mt == MIGRATE_MOVABLE'
> here? Then, most of the get_pageblock_migratetype() calls could be
> avoided while isolation is in progress. I am not sure whether this is
> the case for memory offlining. What do you think?
Hello,
Isolation can be invoked on pageblocks of other migratetypes; you can
refer to has_unmovable_pages() in page_alloc.c. So, the additional
'mt == MIGRATE_MOVABLE' check should not be added.
Thanks.
On Fri, Oct 31, 2014 at 3:25 PM, Joonsoo Kim <[email protected]> wrote:
> There are two paths to reach the core free function of the buddy allocator,
> __free_one_page(): one is free_one_page()->__free_one_page() and the
> other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
> Each path has a race condition causing serious problems. This patch
> focuses on the first type of freepath; the following patch will solve
> the problem in the second type of freepath.
>
> In the first type of freepath, we get the migratetype of the freeing page
> without holding the zone lock, so it could be racy. There are two cases of
> this race.
>
> 1. Pages are added to the isolate buddy list after restoring the original
> migratetype
>
> CPU1 CPU2
>
> get migratetype => return MIGRATE_ISOLATE
> call free_one_page() with MIGRATE_ISOLATE
>
> grab the zone lock
> unisolate pageblock
> release the zone lock
>
> grab the zone lock
> call __free_one_page() with MIGRATE_ISOLATE
> freepage go into isolate buddy list,
> although pageblock is already unisolated
>
> This may cause two problems. One is that we can't use this page anymore
> until the next isolation attempt on this pageblock, because the freepage is
> on the isolate buddy list. The other is that freepage accounting could be
> wrong due to merging between different buddy lists. Freepages on the isolate
> buddy list aren't counted as freepages, but ones on a normal buddy list are
> counted as freepages. If a merge happens, the buddy freepage on the normal
> buddy list is inevitably moved to the isolate buddy list without any
> consideration of freepage accounting, so the count could become incorrect.
>
> 2. Pages are added to a normal buddy list while the pageblock is isolated.
> This is similar to the case above.
>
> This may also cause two problems. One is that we can't keep these
> freepages from being allocated. Although this pageblock is isolated, the
> freepage would be added to a normal buddy list so that it could be
> allocated without any restriction. And the other problem is the same as in
> case 1, that is, incorrect freepage accounting.
>
> This race condition can be prevented by checking the migratetype again
> while holding the zone lock. Because that is a somewhat heavy operation
> and isn't needed in the common case, we want to avoid the recheck as much
> as possible. So this patch introduces a new variable, nr_isolate_pageblock,
> in struct zone, to check whether there is any isolated pageblock.
> With this, we can avoid re-checking the migratetype in the common case and
> do it only if there is an isolated pageblock or the migratetype is
> MIGRATE_ISOLATE. This solves the problems mentioned above.
>
> Changes from v3:
> Add one more check in free_one_page() that checks whether the migratetype
> is MIGRATE_ISOLATE or not. Without this, case 1 above could still happen.
>
> Cc: <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> Acked-by: Michal Nazarewicz <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> include/linux/mmzone.h | 9 +++++++++
> include/linux/page-isolation.h | 8 ++++++++
> mm/page_alloc.c | 11 +++++++++--
> mm/page_isolation.c | 2 ++
> 4 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4593567..3d090af 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -431,6 +431,15 @@ struct zone {
> */
> int nr_migrate_reserve_block;
>
> +#ifdef CONFIG_MEMORY_ISOLATION
> + /*
> + * Number of isolated pageblock. It is used to solve incorrect
> + * freepage counting problem due to racy retrieving migratetype
> + * of pageblock. Protected by zone->lock.
> + */
> + unsigned long nr_isolate_pageblock;
> +#endif
> +
First, sorry for this deferred reply; I see these patches have been merged
into the mainline.
However, I still have a tiny question:
Why does ZONE_PADDING(_pad1_) separate it from zone->lock?
How about moving it to the same cacheline as zone->lock, since it is
accessed under zone->lock?
> #ifdef CONFIG_MEMORY_HOTPLUG
> /* see spanned/present_pages for more description */
> seqlock_t span_seqlock;
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 3fff8e7..2dc1e16 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -2,6 +2,10 @@
> #define __LINUX_PAGEISOLATION_H
>
> #ifdef CONFIG_MEMORY_ISOLATION
> +static inline bool has_isolate_pageblock(struct zone *zone)
> +{
> + return zone->nr_isolate_pageblock;
> +}
> static inline bool is_migrate_isolate_page(struct page *page)
> {
> return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
> @@ -11,6 +15,10 @@ static inline bool is_migrate_isolate(int migratetype)
> return migratetype == MIGRATE_ISOLATE;
> }
> #else
> +static inline bool has_isolate_pageblock(struct zone *zone)
> +{
> + return false;
> +}
> static inline bool is_migrate_isolate_page(struct page *page)
> {
> return false;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d3ed723..f7a867e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -749,9 +749,16 @@ static void free_one_page(struct zone *zone,
> if (nr_scanned)
> __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
>
> + if (unlikely(has_isolate_pageblock(zone) ||
> + is_migrate_isolate(migratetype))) {
> + migratetype = get_pfnblock_migratetype(page, pfn);
> + if (is_migrate_isolate(migratetype))
> + goto skip_counting;
> + }
> + __mod_zone_freepage_state(zone, 1 << order, migratetype);
> +
> +skip_counting:
> __free_one_page(page, pfn, zone, order, migratetype);
> - if (unlikely(!is_migrate_isolate(migratetype)))
> - __mod_zone_freepage_state(zone, 1 << order, migratetype);
> spin_unlock(&zone->lock);
> }
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index d1473b2..1fa4a4d 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -60,6 +60,7 @@ out:
> int migratetype = get_pageblock_migratetype(page);
>
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> + zone->nr_isolate_pageblock++;
> nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
>
> __mod_zone_freepage_state(zone, -nr_pages, migratetype);
> @@ -83,6 +84,7 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> nr_pages = move_freepages_block(zone, page, migratetype);
> __mod_zone_freepage_state(zone, nr_pages, migratetype);
> set_pageblock_migratetype(page, migratetype);
> + zone->nr_isolate_pageblock--;
> out:
> spin_unlock_irqrestore(&zone->lock, flags);
> }
> --
> 1.7.9.5
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 4593567..3d090af 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -431,6 +431,15 @@ struct zone {
> > */
> > int nr_migrate_reserve_block;
> >
> > +#ifdef CONFIG_MEMORY_ISOLATION
> > + /*
> > + * Number of isolated pageblock. It is used to solve incorrect
> > + * freepage counting problem due to racy retrieving migratetype
> > + * of pageblock. Protected by zone->lock.
> > + */
> > + unsigned long nr_isolate_pageblock;
> > +#endif
> > +
>
> First, sorry for this deferred reply; I see these patches have been merged
> into the mainline.
> However, I still have a tiny question:
> Why does ZONE_PADDING(_pad1_) separate it from zone->lock?
> How about moving it to the same cacheline as zone->lock, since it is
> accessed under zone->lock?
>
zone->lock currently shares cache lines with the data that is frequently
updated under the zone lock, and some of the dirty-data cacheline bouncing
has already completed by the time the lock is acquired. nr_isolate_pageblock
is a read-mostly field that in some cases will never be used. It's fine
where it is, beside other read-mostly fields.
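For illustration, an abridged view of the v3.18 struct zone layout (only the
relevant fields are shown; this is a sketch, not the exact definition):

struct zone {
	/* Read-mostly fields */
	...
	int			nr_migrate_reserve_block;
#ifdef CONFIG_MEMORY_ISOLATION
	unsigned long		nr_isolate_pageblock;	/* read-mostly */
#endif
	...
	/* Write-intensive fields used from the page allocator */
	ZONE_PADDING(_pad1_)
	struct free_area	free_area[MAX_ORDER];
	unsigned long		flags;
	spinlock_t		lock;	/* shares cache lines with hot write data */
	...
};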
--
Mel Gorman
SUSE Labs
On Fri, Nov 14, 2014 at 10:33:01AM +0000, Mel Gorman wrote:
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 4593567..3d090af 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -431,6 +431,15 @@ struct zone {
> > > */
> > > int nr_migrate_reserve_block;
> > >
> > > +#ifdef CONFIG_MEMORY_ISOLATION
> > > + /*
> > > + * Number of isolated pageblock. It is used to solve incorrect
> > > + * freepage counting problem due to racy retrieving migratetype
> > > + * of pageblock. Protected by zone->lock.
> > > + */
> > > + unsigned long nr_isolate_pageblock;
> > > +#endif
> > > +
> >
> > First, sorry for this deferred reply; I see these patches have been merged
> > into the mainline.
> > However, I still have a tiny question:
> > Why does ZONE_PADDING(_pad1_) separate it from zone->lock?
> > How about moving it to the same cacheline as zone->lock, since it is
> > accessed under zone->lock?
> >
>
> zone->lock currently shares cache lines with the data that is frequently
> updated under the zone lock, and some of the dirty-data cacheline bouncing
> has already completed by the time the lock is acquired. nr_isolate_pageblock
> is a read-mostly field that in some cases will never be used. It's fine
> where it is, beside other read-mostly fields.
>
My bad...
I don't remember why I decided on that placement. :/
It seems better to move nr_isolate_pageblock to the same cacheline as
zone->lock but, as Mel said, it is a rarely used field, so the improvement
would be marginal.
Thanks.