Changes from v3 to v4
* Patch 1: Add a code comment on nr_isolate_pageblock in struct zone (Naoya)
Add one more check in free_one_page() that checks whether the
migratetype is MIGRATE_ISOLATE or not.
* Patch 4: Use min() to prevent overflow of buddy merge order (Naoya)
* Remove RFC tag
* Add stable tag on all patches
Changes from v1, v2 to v3
* A lot of comments that led this patchset in the right direction
(Vlastimil and Minchan)
This is the version 4 patchset, an improved and minimized version of
version 1, to fix the freepage accounting problem during memory isolation.
I tried a different approach in version 2, but it looked really
complicated, so I changed my mind and improved version 1 instead. You can
see versions 1 and 2 at the following links [1] [2], respectively.

IMO, this v3 approach is better than v2: it is simpler, so it is better
for maintenance, and it doesn't change the pageblock isolation logic, so
it is much easier to backport.
These problems were found by testing my patchset [3]. There are some
race conditions in pageblock isolation, and these races cause an
incorrect freepage count.
Before describing the bugs themselves, I first explain the definition of
a freepage.

1. Pages on a buddy list are counted as freepages.
2. Pages on an isolate migratetype buddy list are *not* counted as
freepages.
3. Pages on a CMA buddy list are also counted as CMA freepages.
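For reference, rules 1 and 3 are embodied in the accounting helper used
throughout this series. Roughly, as it appears in include/linux/vmstat.h
around v3.18 (check your tree for the exact form):

    static inline void __mod_zone_freepage_state(struct zone *zone,
                                                 int nr_pages,
                                                 int migratetype)
    {
            /* rule 1: account the pages in NR_FREE_PAGES */
            __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
            /* rule 3: CMA pages also count as NR_FREE_CMA_PAGES */
            if (is_migrate_cma(migratetype))
                    __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
                                          nr_pages);
    }

Rule 2 is enforced by the callers: they must skip this helper entirely
when the migratetype is MIGRATE_ISOLATE, and that is exactly where the
bugs below creep in.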
Now, I describe the problems and the related patches.

Patch 1: There are race conditions when getting the pageblock
migratetype; they result in misplacement of freepages on the buddy list,
an incorrect freepage count, and unavailability of freepages.

Patch 2: Freepages on the pcp list can carry stale cached information
that is used to determine the migratetype of the buddy list they go to.
This causes misplacement of freepages on the buddy list and an incorrect
freepage count.

Patch 4: Merging between freepages on pageblocks of different
migratetypes causes a freepage accounting problem. This patch fixes it.
Without patchset [3], the above problems don't happen in my CMA
allocation test, because CMA reserved pages aren't used at all, so there
is no chance for the races to occur.

With patchset [3], I did a simple CMA allocation test and got the result
below.
- Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
- run kernel build (make -j16) in the background
- 30 CMA allocation attempts (8 MB * 30 = 240 MB) at 5 sec intervals
- Result: more than 5000 pages went missing from the freepage count
With patchset [3] and this patchset, no freepage counts were missed, so
I conclude that the problems are solved.

These problems also occur in my simple memory offlining test
environment.
This patchset is based on v3.18-rc1.
Please see individual patches for more information.
Thanks.
[1]: https://lkml.org/lkml/2014/7/4/79
[2]: https://lkml.org/lkml/2014/8/6/52
[3]: Aggressively allocate the pages on cma reserved memory
https://lkml.org/lkml/2014/5/30/291
Joonsoo Kim (4):
mm/page_alloc: fix incorrect isolation behavior by rechecking
migratetype
mm/page_alloc: add freepage on isolate pageblock to correct buddy
list
mm/page_alloc: move migratetype recheck logic to __free_one_page()
mm/page_alloc: restrict max order of merging on isolated pageblock
include/linux/mmzone.h | 9 +++++++++
include/linux/page-isolation.h | 8 ++++++++
mm/page_alloc.c | 29 ++++++++++++++++++++---------
mm/page_isolation.c | 2 ++
4 files changed, 39 insertions(+), 9 deletions(-)
--
1.7.9.5
All the callers of __free_one_page() have similar migratetype recheck
logic, so we can move it into __free_one_page(). This reduces the line
count and helps future maintenance. It is also a preparation step for
"mm/page_alloc: restrict max order of merging on isolated pageblock",
which fixes the freepage accounting problem for freepages larger than
pageblock order.
Cc: <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 24 ++++++++----------------
1 file changed, 8 insertions(+), 16 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5d2f807..433f92c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -579,7 +579,15 @@ static inline void __free_one_page(struct page *page,
return;
VM_BUG_ON(migratetype == -1);
+ if (unlikely(has_isolate_pageblock(zone) ||
+ is_migrate_isolate(migratetype))) {
+ migratetype = get_pfnblock_migratetype(page, pfn);
+ if (is_migrate_isolate(migratetype))
+ goto skip_counting;
+ }
+ __mod_zone_freepage_state(zone, 1 << order, migratetype);
+skip_counting:
page_idx = pfn & ((1 << MAX_ORDER) - 1);
VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
@@ -725,14 +733,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
mt = get_freepage_migratetype(page);
- if (unlikely(has_isolate_pageblock(zone))) {
- mt = get_pageblock_migratetype(page);
- if (is_migrate_isolate(mt))
- goto skip_counting;
- }
- __mod_zone_freepage_state(zone, 1, mt);
-skip_counting:
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
@@ -752,15 +753,6 @@ static void free_one_page(struct zone *zone,
if (nr_scanned)
__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
- if (unlikely(has_isolate_pageblock(zone) ||
- is_migrate_isolate(migratetype))) {
- migratetype = get_pfnblock_migratetype(page, pfn);
- if (is_migrate_isolate(migratetype))
- goto skip_counting;
- }
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
-
-skip_counting:
__free_one_page(page, pfn, zone, order, migratetype);
spin_unlock(&zone->lock);
}
--
1.7.9.5
The current pageblock isolation logic can isolate each pageblock
individually. This causes a freepage accounting problem if a freepage of
pageblock order on an isolate pageblock is merged with another freepage
on a normal pageblock. We can prevent such merging by restricting the
max merge order to pageblock order when the freepage is on an isolate
pageblock. For example, with pageblock_order == 9 and MAX_ORDER == 11
(the default on x86), an order-9 freepage on an isolate pageblock could
otherwise merge with its order-9 buddy on a normal pageblock into an
order-10 page spanning both, corrupting the freepage count.

A side effect of this change is that there can be non-merged buddy
freepages even after pageblock isolation finishes, because undoing
pageblock isolation just moves freepages from the isolate buddy list to
the normal buddy list rather than considering merging. But I think it
doesn't matter, because 1) almost all allocation requests are for
pageblock order or below, 2) the caller of pageblock isolation will use
these freepages, so they will be split in any case, and 3) the merge
would happen soon after some alloc/free activity on these and the buddy
pageblocks.
Cc: <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 433f92c..3ec58db 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -571,6 +571,7 @@ static inline void __free_one_page(struct page *page,
unsigned long combined_idx;
unsigned long uninitialized_var(buddy_idx);
struct page *buddy;
+ int max_order = MAX_ORDER;
VM_BUG_ON(!zone_is_initialized(zone));
@@ -582,18 +583,26 @@ static inline void __free_one_page(struct page *page,
if (unlikely(has_isolate_pageblock(zone) ||
is_migrate_isolate(migratetype))) {
migratetype = get_pfnblock_migratetype(page, pfn);
- if (is_migrate_isolate(migratetype))
+ if (is_migrate_isolate(migratetype)) {
+ /*
+ * We restrict max order of merging to prevent merge
+ * between freepages on isolate pageblock and normal
+ * pageblock. Without this, pageblock isolation
+ * could cause incorrect freepage accounting.
+ */
+ max_order = min(MAX_ORDER, pageblock_order + 1);
goto skip_counting;
+ }
}
__mod_zone_freepage_state(zone, 1 << order, migratetype);
skip_counting:
- page_idx = pfn & ((1 << MAX_ORDER) - 1);
+ page_idx = pfn & ((1 << max_order) - 1);
VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
- while (order < MAX_ORDER-1) {
+ while (order < max_order - 1) {
buddy_idx = __find_buddy_index(page_idx, order);
buddy = page + (buddy_idx - page_idx);
if (!page_is_buddy(page, buddy, order))
--
1.7.9.5
There are two paths to reach the core free function of the buddy
allocator, __free_one_page(): one is free_one_page()->__free_one_page()
and the other is
free_hot_cold_page()->free_pcppages_bulk()->__free_one_page(). Each path
has a race condition causing serious problems. This patch focuses on the
first type of freepath; the following patch will solve the problem in
the second type.

In the first type of freepath, we get the migratetype of the freeing
page without holding the zone lock, so it can be racy. There are two
cases of this race.
1. Pages are added to the isolate buddy list after restoring the
original migratetype.

CPU1                                    CPU2

get migratetype => MIGRATE_ISOLATE
call free_one_page() with MIGRATE_ISOLATE

                                        grab the zone lock
                                        unisolate pageblock
                                        release the zone lock

grab the zone lock
call __free_one_page() with MIGRATE_ISOLATE
freepage goes onto the isolate buddy list,
although the pageblock is already unisolated
This may cause two problems. One is that we can't use this page anymore
until the next isolation attempt on this pageblock, because the freepage
is on the isolate buddy list. The other is that the freepage accounting
could be wrong due to merging between different buddy lists. Freepages
on the isolate buddy list aren't counted as freepages, but ones on
normal buddy lists are. If a merge happens, a buddy freepage on a normal
buddy list is inevitably moved onto the isolate buddy list without any
consideration of freepage accounting, so the count can become incorrect.
2. Pages are added to a normal buddy list while the pageblock is
isolated. This is similar to the above case.

This also may cause two problems. One is that we can't keep these
freepages from being allocated: although the pageblock is isolated, the
freepage would be added to a normal buddy list, so it could be allocated
without any restriction. The other problem is the same as in case 1,
that is, incorrect freepage accounting.
These race conditions can be prevented by checking the migratetype
again while holding the zone lock. Because that is a somewhat heavy
operation and isn't needed in the common case, we want to avoid the
recheck as much as possible. So this patch introduces a new field,
nr_isolate_pageblock, in struct zone, to track whether the zone contains
any isolated pageblock. With it, we can avoid re-checking the
migratetype in the common case and do it only if there is an isolated
pageblock or the migratetype is MIGRATE_ISOLATE. This solves the
above-mentioned problems.
Changes from v3:
Add one more check in free_one_page() that checks whether the
migratetype is MIGRATE_ISOLATE or not. Without this, the above-mentioned
case 1 could happen.
Cc: <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 9 +++++++++
include/linux/page-isolation.h | 8 ++++++++
mm/page_alloc.c | 11 +++++++++--
mm/page_isolation.c | 2 ++
4 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4593567..3d090af 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -431,6 +431,15 @@ struct zone {
*/
int nr_migrate_reserve_block;
+#ifdef CONFIG_MEMORY_ISOLATION
+ /*
+ * Number of isolated pageblock. It is used to solve incorrect
+ * freepage counting problem due to racy retrieving migratetype
+ * of pageblock. Protected by zone->lock.
+ */
+ unsigned long nr_isolate_pageblock;
+#endif
+
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 3fff8e7..2dc1e16 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -2,6 +2,10 @@
#define __LINUX_PAGEISOLATION_H
#ifdef CONFIG_MEMORY_ISOLATION
+static inline bool has_isolate_pageblock(struct zone *zone)
+{
+ return zone->nr_isolate_pageblock;
+}
static inline bool is_migrate_isolate_page(struct page *page)
{
return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
@@ -11,6 +15,10 @@ static inline bool is_migrate_isolate(int migratetype)
return migratetype == MIGRATE_ISOLATE;
}
#else
+static inline bool has_isolate_pageblock(struct zone *zone)
+{
+ return false;
+}
static inline bool is_migrate_isolate_page(struct page *page)
{
return false;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 70027da..4a5d8e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -749,9 +749,16 @@ static void free_one_page(struct zone *zone,
if (nr_scanned)
__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+ if (unlikely(has_isolate_pageblock(zone) ||
+ is_migrate_isolate(migratetype))) {
+ migratetype = get_pfnblock_migratetype(page, pfn);
+ if (is_migrate_isolate(migratetype))
+ goto skip_counting;
+ }
+ __mod_zone_freepage_state(zone, 1 << order, migratetype);
+
+skip_counting:
__free_one_page(page, pfn, zone, order, migratetype);
- if (unlikely(!is_migrate_isolate(migratetype)))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
spin_unlock(&zone->lock);
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index d1473b2..1fa4a4d 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -60,6 +60,7 @@ out:
int migratetype = get_pageblock_migratetype(page);
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+ zone->nr_isolate_pageblock++;
nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
__mod_zone_freepage_state(zone, -nr_pages, migratetype);
@@ -83,6 +84,7 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
nr_pages = move_freepages_block(zone, page, migratetype);
__mod_zone_freepage_state(zone, nr_pages, migratetype);
set_pageblock_migratetype(page, migratetype);
+ zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
}
--
1.7.9.5
In free_pcppages_bulk(), we use the cached migratetype of a freepage to
determine the type of buddy list where the freepage will be added. This
information is stored when the freepage is added to the pcp list, so if
isolation of this freepage's pageblock begins after that, the cached
information could be stale. In other words, it holds the original
migratetype rather than MIGRATE_ISOLATE.

There are two problems caused by this stale information. One is that we
can't keep these freepages from being allocated: although the pageblock
is isolated, the freepage will be added to a normal buddy list, so it
could be allocated without any restriction. The other problem is
incorrect freepage accounting: freepages on an isolate pageblock should
not be counted in the number of freepages.
The following is the code snippet in free_pcppages_bulk():
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
if (likely(!is_migrate_isolate_page(page))) {
__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
if (is_migrate_cma(mt))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
}
As you can see in the above snippet, the current code already handles
the second problem, incorrect freepage accounting, by re-fetching the
pageblock migratetype through is_migrate_isolate_page(page). But,
because this re-fetched information isn't used for __free_one_page(),
the first problem remains unsolved. This patch solves the situation by
re-fetching the pageblock migratetype before __free_one_page() and using
it for __free_one_page().

In addition to moving the re-fetch up, this patch applies an
optimization: the migratetype is re-fetched only if the zone has an
isolate pageblock. Pageblock isolation is a rare event, so with this
optimization we can avoid the re-fetch in the common case.
Cc: <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a5d8e5..5d2f807 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -725,14 +725,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
mt = get_freepage_migratetype(page);
+ if (unlikely(has_isolate_pageblock(zone))) {
+ mt = get_pageblock_migratetype(page);
+ if (is_migrate_isolate(mt))
+ goto skip_counting;
+ }
+ __mod_zone_freepage_state(zone, 1, mt);
+
+skip_counting:
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
- if (likely(!is_migrate_isolate_page(page))) {
- __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
- if (is_migrate_cma(mt))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
- }
} while (--to_free && --batch_free && !list_empty(list));
}
spin_unlock(&zone->lock);
--
1.7.9.5
Hi Joonsoo,
I know you spent much effort to investigate and fix this subtle
problem. You should be a hero.

Thanks for the really nice work!
On Thu, Oct 23, 2014 at 05:10:17PM +0900, Joonsoo Kim wrote:
> Changes from v3 to v4
> [...]
>
> Joonsoo Kim (4):
> mm/page_alloc: fix incorrect isolation behavior by rechecking
> migratetype
> mm/page_alloc: add freepage on isolate pageblock to correct buddy
> list
> mm/page_alloc: move migratetype recheck logic to __free_one_page()
So, [1-3],
Acked-by: Minchan Kim <[email protected]>
> mm/page_alloc: restrict max order of merging on isolated pageblock
As you noted in the description, this patch has the side effect of
leaving buddies unmerged. I mostly agree with your assumptions, but they
are not always true.

Who knows whether there is a driver that wants a page of higher order
than a pageblock? Who knows whether there is no allocation/free activity
on the isolated range right before such a highest-order allocation
request?

Moreover, your patch introduces a new exception rule for the page
allocator:

"Hey, allocator, from now on, you could have unmerged buddies
in your list, so please advertise it to your customers."

So all users of the allocator would have to consider that exception, and
it might hit us sometime.

I want to fix that at isolation undo time.

Thanks, again!
--
Kind regards,
Minchan Kim
On Fri, Oct 24, 2014 at 11:27:49AM +0900, Minchan Kim wrote:
> Hi Joonsoo,
>
> I know you spent much effort to investigate and fix this subtle
> problem. You should be a hero.
>
> Thanks for the really nice work!
Hello,
Thanks. :)
> >
> > Joonsoo Kim (4):
> > mm/page_alloc: fix incorrect isolation behavior by rechecking
> > migratetype
> > mm/page_alloc: add freepage on isolate pageblock to correct buddy
> > list
> > mm/page_alloc: move migratetype recheck logic to __free_one_page()
>
> So, [1-3],
> Acked-by: Minchan Kim <[email protected]>
Thanks, too.
> > mm/page_alloc: restrict max order of merging on isolated pageblock
>
> As you noted in the description, this patch has the side effect of
> leaving buddies unmerged. I mostly agree with your assumptions, but
> they are not always true.
>
> Who knows whether there is a driver that wants a page of higher order
> than a pageblock? Who knows whether there is no allocation/free
> activity on the isolated range right before such a highest-order
> allocation request?
>
> Moreover, your patch introduces a new exception rule for the page
> allocator:
>
> "Hey, allocator, from now on, you could have unmerged buddies
> in your list, so please advertise it to your customers."
>
> So all users of the allocator would have to consider that exception,
> and it might hit us sometime.
>
> I want to fix that at isolation undo time.
>
> Thanks, again!
Okay, I will try it. The reason I implemented it the current way is
that fixing it at undo time makes the isolation/un-isolation process
asymmetric and needs some copied-and-pasted code to handle this special
case, which could result in maintenance overhead. But, yes, an exception
to the buddy property is also a bad situation. I will implement it and
send it soon; a rough sketch of the idea follows.
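This is only an untested, hypothetical sketch of what
unset_migratetype_isolate() could do, assuming __isolate_free_page() and
__find_buddy_index() are made visible to mm/page_isolation.c; the exact
helpers and locking details would need care:

    if (PageBuddy(page)) {
            unsigned int order = page_order(page);

            if (order >= pageblock_order) {
                    unsigned long page_idx, buddy_idx;
                    struct page *buddy;

                    /*
                     * A freepage of at least pageblock order could not
                     * merge while the pageblock was isolated, so its
                     * buddy may still be free. If the buddy is not
                     * isolated, take the page out of the buddy
                     * allocator now and free it again after the
                     * migratetype is restored, so __free_one_page()
                     * performs the merge.
                     */
                    page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
                    buddy_idx = __find_buddy_index(page_idx, order);
                    buddy = page + (buddy_idx - page_idx);

                    if (!is_migrate_isolate_page(buddy) &&
                        __isolate_free_page(page, order))
                            isolated_page = page;
            }
    }

Here isolated_page is a hypothetical local that would be re-freed with
__free_pages() after set_pageblock_migratetype() restores the
migratetype.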
Thanks.
On 2014-10-23 5:10 PM, Joonsoo Kim wrote:
> Changes from v3 to v4
> [...]
>
Thanks a lot. v4 looks more elegant than the previous one.
I'm looking forward to this patchset and [3] being applied.
On 10/23/2014 10:10 AM, Joonsoo Kim wrote:
> Changes from v3:
> Add one more check in free_one_page() that checks whether the
> migratetype is MIGRATE_ISOLATE or not. Without this, the
> above-mentioned case 1 could happen.
Good catch.
> Cc: <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
(minor suggestion below)
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -749,9 +749,16 @@ static void free_one_page(struct zone *zone,
> if (nr_scanned)
> __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
>
> + if (unlikely(has_isolate_pageblock(zone) ||
Would it make any difference if this was read just once and not in each
loop iteration?
On 10/23/2014 10:10 AM, Joonsoo Kim wrote:
> [...]
>
> Cc: <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 10/23/2014 10:10 AM, Joonsoo Kim wrote:
> [...]
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5d2f807..433f92c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -579,7 +579,15 @@ static inline void __free_one_page(struct page *page,
> return;
>
> VM_BUG_ON(migratetype == -1);
> + if (unlikely(has_isolate_pageblock(zone) ||
> + is_migrate_isolate(migratetype))) {
Since the v4 change of patch 1, this now adds
is_migrate_isolate(migratetype) also for the free_pcppages_bulk path,
where it's not needed?
> + migratetype = get_pfnblock_migratetype(page, pfn);
> + if (is_migrate_isolate(migratetype))
> + goto skip_counting;
> + }
> + __mod_zone_freepage_state(zone, 1 << order, migratetype);
>
> +skip_counting:
> page_idx = pfn & ((1 << MAX_ORDER) - 1);
>
> VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
> @@ -725,14 +733,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> /* must delete as __free_one_page list manipulates */
> list_del(&page->lru);
> mt = get_freepage_migratetype(page);
> - if (unlikely(has_isolate_pageblock(zone))) {
> - mt = get_pageblock_migratetype(page);
> - if (is_migrate_isolate(mt))
> - goto skip_counting;
> - }
> - __mod_zone_freepage_state(zone, 1, mt);
>
> -skip_counting:
> /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> trace_mm_page_pcpu_drain(page, 0, mt);
The 'mt' here for the tracepoint is now different. I know it's the same
as before patch 2, but the value introduced by patch 2 is more correct
than reverting to the pre-patch-2 value as done here.

This and the introduced check above are maybe minor things, but they
make me question the value of unifying the check when the conditions in
the two call paths are not completely the same...

I understand this is also a prerequisite for patch 4 in some sense, but
if you are reworking it anyway, then maybe this won't be needed in the
end?
Thanks for the effort!
Vlastimil
On Thu, Oct 23 2014, Joonsoo Kim wrote:
> [...]
>
> Cc: <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michał “mina86” Nazarewicz (o o)
ooo +--<[email protected]>--<xmpp:[email protected]>--ooO--(_)--Ooo--
On Thu, Oct 23 2014, Joonsoo Kim wrote:
> [...]
>
> Cc: <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michał “mina86” Nazarewicz (o o)
ooo +--<[email protected]>--<xmpp:[email protected]>--ooO--(_)--Ooo--
On Mon, Oct 27, 2014 at 11:33:20AM +0100, Vlastimil Babka wrote:
> On 10/23/2014 10:10 AM, Joonsoo Kim wrote:
> [...]
>
> > + if (unlikely(has_isolate_pageblock(zone) ||
>
> Would it make any difference if this was read just once and not in each
> loop iteration?
>
>
I guess you'd like to say this about patch 2. I can do it, but it
doesn't make any difference in terms of performance, because we access
the zone's member variables in each loop iteration in __free_one_page()
anyway.
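For illustration, the hoisted version would look roughly like this in
free_pcppages_bulk() (an untested sketch; has_isolated is a hypothetical
local):

    bool has_isolated = has_isolate_pageblock(zone); /* read once */

    do {
            ...
            mt = get_freepage_migratetype(page);
            if (unlikely(has_isolated)) {
                    mt = get_pageblock_migratetype(page);
                    if (is_migrate_isolate(mt))
                            goto skip_counting;
            }
            __mod_zone_freepage_state(zone, 1, mt);
    skip_counting:
            ...
    } while (--to_free && --batch_free && !list_empty(list));

Since __free_one_page() touches zone fields for every page anyway, the
zone cacheline stays hot and the extra load per iteration is in the
noise.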
Thanks.
On Mon, Oct 27, 2014 at 11:40:23AM +0100, Vlastimil Babka wrote:
> On 10/23/2014 10:10 AM, Joonsoo Kim wrote:
> > [...]
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 5d2f807..433f92c 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -579,7 +579,15 @@ static inline void __free_one_page(struct page *page,
> > return;
> >
> > VM_BUG_ON(migratetype == -1);
> > + if (unlikely(has_isolate_pageblock(zone) ||
> > + is_migrate_isolate(migratetype))) {
>
> Since the v4 change of patch 1, this now adds
> is_migrate_isolate(migratetype) also for the free_pcppages_bulk path,
> where it's not needed?
Yes, you are right. But patch 4 needs the is_migrate_isolate() check
in __free_one_page().
>
> > + migratetype = get_pfnblock_migratetype(page, pfn);
> > + if (is_migrate_isolate(migratetype))
> > + goto skip_counting;
> > + }
> > + __mod_zone_freepage_state(zone, 1 << order, migratetype);
> >
> > +skip_counting:
> > page_idx = pfn & ((1 << MAX_ORDER) - 1);
> >
> > VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
> > @@ -725,14 +733,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > /* must delete as __free_one_page list manipulates */
> > list_del(&page->lru);
> > mt = get_freepage_migratetype(page);
> > - if (unlikely(has_isolate_pageblock(zone))) {
> > - mt = get_pageblock_migratetype(page);
> > - if (is_migrate_isolate(mt))
> > - goto skip_counting;
> > - }
> > - __mod_zone_freepage_state(zone, 1, mt);
> >
> > -skip_counting:
> > /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> > __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > trace_mm_page_pcpu_drain(page, 0, mt);
>
> The 'mt' here for the tracepoint is now different. I know it's the same
> as before patch 2, but the value introduced by patch 2 is more correct
> than reverting to the pre-patch-2 value as done here.
Yes, you are right. I didn't notice that.
> This and the introduced check above are maybe minor things, but they
> make me question the value of unifying the check when the conditions in
> the two call paths are not completely the same...
>
> I understand this is also a prerequisite for patch 4 in some sense, but
> if you are reworking it anyway, then maybe this won't be needed in the
> end?
As mentioned above, the is_migrate_isolate() check is needed in
__free_one_page() in any case; the reworked patch also needs it.

Hmm... I'd like to check the isolate migratetype in just one place, but
it seems to be impossible. To keep the tracepoint correct, I will keep
the check in each function. :/
Thanks.