2010-07-02 05:51:32

by Naoya Horiguchi

Subject: [PATCH 0/7] hugepage migration

Hi,

This is a patchset for hugepage migration.

There are many users of page migration such as soft offlining,
memory hotplug, memory policy and memory compaction,
but this patchset adds hugepage support only for soft offlining
as the first step.

This patchset is based on 2.6.35-rc3 applied with "HWPOISON for
hugepage" patchset I previously posted (see Andi's git tree.)
http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=summary

I tested this patchset with 'make func' in libhugetlbfs and
have gotten the same result as one from 2.6.35-rc3.

[PATCH 1/7] hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
[PATCH 2/7] hugetlb, HWPOISON: move PG_HWPoison bit check
[PATCH 3/7] hugetlb: add allocate function for hugepage migration
[PATCH 4/7] hugetlb: add hugepage check in mem_cgroup_{register,end}_migration()
[PATCH 5/7] hugetlb: pin oldpage in page migration
[PATCH 6/7] hugetlb: hugepage migration core
[PATCH 7/7] hugetlb, HWPOISON: soft offlining for hugepage

fs/hugetlbfs/inode.c | 2 +
include/linux/hugetlb.h | 6 ++
mm/hugetlb.c | 138 +++++++++++++++++++++++++++++++++++-----------
mm/memcontrol.c | 5 ++
mm/memory-failure.c | 57 +++++++++++++++-----
mm/migrate.c | 70 ++++++++++++++++++++++--
6 files changed, 226 insertions(+), 52 deletions(-)

Thanks,
Naoya Horiguchi


2010-07-02 05:49:42

by Naoya Horiguchi

Subject: [PATCH 7/7] hugetlb, HWPOISON: soft offlining for hugepage

This patch extends the soft offlining framework to support hugepages.
When corrected memory errors occur repeatedly on a hugepage,
we can choose to stop using it by migrating its data onto another hugepage
and disabling the original (possibly half-broken) one.
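
For illustration only (not part of this patch), a user-space test could
exercise the new path roughly as sketched below, assuming CONFIG_MEMORY_FAILURE,
root privileges, reserved hugepages and MAP_HUGETLB support; the MADV_SOFT_OFFLINE
and MAP_HUGETLB values below are the kernel's, defined locally in case the libc
headers lack them:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGETLB
    #define MAP_HUGETLB       0x40000   /* x86 value */
    #endif
    #ifndef MADV_SOFT_OFFLINE
    #define MADV_SOFT_OFFLINE 101
    #endif

    #define HPAGE_SIZE (2UL * 1024 * 1024)   /* assumes 2MB hugepages */

    int main(void)
    {
            char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(p, 1, HPAGE_SIZE);   /* fault the hugepage in */

            /* ask the kernel to migrate the data and retire the hugepage */
            if (madvise(p, 4096, MADV_SOFT_OFFLINE))
                    perror("madvise(MADV_SOFT_OFFLINE)");

            return 0;
    }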

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
---
include/linux/hugetlb.h | 2 +
mm/hugetlb.c | 7 +++++
mm/memory-failure.c | 57 +++++++++++++++++++++++++++++++++++-----------
3 files changed, 52 insertions(+), 14 deletions(-)

diff --git v2.6.35-rc3-hwpoison/include/linux/hugetlb.h v2.6.35-rc3-hwpoison/include/linux/hugetlb.h
index 952e3ce..cb3c373 100644
--- v2.6.35-rc3-hwpoison/include/linux/hugetlb.h
+++ v2.6.35-rc3-hwpoison/include/linux/hugetlb.h
@@ -44,6 +44,7 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
int acctflags);
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
void __isolate_hwpoisoned_huge_page(struct page *page);
+void isolate_hwpoisoned_huge_page(struct page *page);

extern unsigned long hugepages_treat_as_movable;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
@@ -102,6 +103,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
#define hugetlb_fault(mm, vma, addr, flags) ({ BUG(); 0; })
#define huge_pte_offset(mm, address) 0
#define __isolate_hwpoisoned_huge_page(page) 0
+#define isolate_hwpoisoned_huge_page(page) 0

#define hugetlb_change_protection(vma, address, end, newprot)

diff --git v2.6.35-rc3-hwpoison/mm/hugetlb.c v2.6.35-rc3-hwpoison/mm/hugetlb.c
index 6e7f5f2..fe01ff2 100644
--- v2.6.35-rc3-hwpoison/mm/hugetlb.c
+++ v2.6.35-rc3-hwpoison/mm/hugetlb.c
@@ -2947,3 +2947,10 @@ void __isolate_hwpoisoned_huge_page(struct page *hpage)
h->free_huge_pages_node[nid]--;
spin_unlock(&hugetlb_lock);
}
+
+void isolate_hwpoisoned_huge_page(struct page *hpage)
+{
+ lock_page(hpage);
+ __isolate_hwpoisoned_huge_page(hpage);
+ unlock_page(hpage);
+}
diff --git v2.6.35-rc3-hwpoison/mm/memory-failure.c v2.6.35-rc3-hwpoison/mm/memory-failure.c
index d0b420a..c6516df 100644
--- v2.6.35-rc3-hwpoison/mm/memory-failure.c
+++ v2.6.35-rc3-hwpoison/mm/memory-failure.c
@@ -1186,7 +1186,10 @@ EXPORT_SYMBOL(unpoison_memory);
static struct page *new_page(struct page *p, unsigned long private, int **x)
{
int nid = page_to_nid(p);
- return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
+ if (PageHuge(p))
+ return alloc_huge_page_node(page_hstate(compound_head(p)), nid);
+ else
+ return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}

/*
@@ -1214,7 +1217,20 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
* was free.
*/
set_migratetype_isolate(p);
- if (!get_page_unless_zero(compound_head(p))) {
+ /*
+ * When the target page is a free hugepage, just remove it
+ * from free hugepage list.
+ */
+ if (PageHuge(p)) {
+ struct page *hpage = compound_head(p);
+ if (!get_page_unless_zero(hpage)) {
+ pr_debug("get_any_page: %#lx free huge page\n", pfn);
+ isolate_hwpoisoned_huge_page(hpage);
+ set_page_hwpoison_huge_page(hpage);
+ ret = 0;
+ } else
+ ret = 1;
+ } else if (!get_page_unless_zero(compound_head(p))) {
if (is_free_buddy_page(p)) {
pr_debug("get_any_page: %#lx free buddy page\n", pfn);
/* Set hwpoison bit while page is still isolated */
@@ -1260,6 +1276,7 @@ int soft_offline_page(struct page *page, int flags)
{
int ret;
unsigned long pfn = page_to_pfn(page);
+ struct page *hpage = compound_head(page);

ret = get_any_page(page, pfn, flags);
if (ret < 0)
@@ -1270,7 +1287,7 @@ int soft_offline_page(struct page *page, int flags)
/*
* Page cache page we can handle?
*/
- if (!PageLRU(page)) {
+ if (!PageLRU(page) && !PageHuge(page)) {
/*
* Try to free it.
*/
@@ -1286,21 +1303,21 @@ int soft_offline_page(struct page *page, int flags)
if (ret == 0)
goto done;
}
- if (!PageLRU(page)) {
+ if (!PageLRU(page) && !PageHuge(page)) {
pr_debug("soft_offline: %#lx: unknown non LRU page type %lx\n",
pfn, page->flags);
return -EIO;
}

- lock_page(page);
- wait_on_page_writeback(page);
+ lock_page(hpage);
+ wait_on_page_writeback(hpage);

/*
* Synchronized using the page lock with memory_failure()
*/
- if (PageHWPoison(page)) {
- unlock_page(page);
- put_page(page);
+ if (PageHWPoison(page) || (PageTail(page) && PageHWPoison(hpage))) {
+ unlock_page(hpage);
+ put_page(hpage);
pr_debug("soft offline: %#lx page already poisoned\n", pfn);
return -EBUSY;
}
@@ -1310,7 +1327,7 @@ int soft_offline_page(struct page *page, int flags)
* non dirty unmapped page cache pages.
*/
ret = invalidate_inode_page(page);
- unlock_page(page);
+ unlock_page(hpage);

/*
* Drop count because page migration doesn't like raised
@@ -1319,7 +1336,16 @@ int soft_offline_page(struct page *page, int flags)
* RED-PEN would be better to keep it isolated here, but we
* would need to fix isolation locking first.
*/
- put_page(page);
+ put_page(hpage);
+
+ /*
+ * Hugepage is not involved in LRU list, so skip LRU isolation.
+ */
+ if (PageHuge(page)) {
+ ret = 0;
+ goto do_migrate;
+ }
+
if (ret == 1) {
ret = 0;
pr_debug("soft_offline: %#lx: invalidated\n", pfn);
@@ -1332,10 +1358,11 @@ int soft_offline_page(struct page *page, int flags)
* handles a large number of cases for us.
*/
ret = isolate_lru_page(page);
+do_migrate:
if (!ret) {
LIST_HEAD(pagelist);

- list_add(&page->lru, &pagelist);
+ list_add(&hpage->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0);
if (ret) {
pr_debug("soft offline: %#lx: migration failed %d, type %lx\n",
@@ -1343,6 +1370,8 @@ int soft_offline_page(struct page *page, int flags)
if (ret > 0)
ret = -EIO;
}
+ if (!ret && PageHuge(hpage))
+ isolate_hwpoisoned_huge_page(hpage);
} else {
pr_debug("soft offline: %#lx: isolation failed: %d, page count %d, type %lx\n",
pfn, ret, page_count(page), page->flags);
@@ -1351,8 +1380,8 @@ int soft_offline_page(struct page *page, int flags)
return ret;

done:
- atomic_long_add(1, &mce_bad_pages);
- SetPageHWPoison(page);
+ atomic_long_add(1 << compound_order(hpage), &mce_bad_pages);
+ set_page_hwpoison_huge_page(hpage);
/* keep elevated page count for bad page */
return ret;
}
--
1.7.1

2010-07-02 05:49:47

by Naoya Horiguchi

Subject: [PATCH 4/7] hugetlb: add hugepage check in mem_cgroup_{register,end}_migration()

Currently the memory cgroup doesn't charge hugepages,
so avoid calling these functions in the hugepage migration context.

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
---
mm/memcontrol.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git v2.6.35-rc3-hwpoison/mm/memcontrol.c v2.6.35-rc3-hwpoison/mm/memcontrol.c
index c6ece0a..fed32de 100644
--- v2.6.35-rc3-hwpoison/mm/memcontrol.c
+++ v2.6.35-rc3-hwpoison/mm/memcontrol.c
@@ -2504,6 +2504,8 @@ int mem_cgroup_prepare_migration(struct page *page,

if (mem_cgroup_disabled())
return 0;
+ if (PageHuge(page))
+ return 0;

pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
@@ -2591,6 +2593,9 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,

if (!mem)
return;
+ if (PageHuge(oldpage))
+ return;
+
/* blocks rmdir() */
cgroup_exclude_rmdir(&mem->css);
/* at migration success, oldpage->mapping is NULL. */
--
1.7.1

2010-07-02 05:49:58

by Naoya Horiguchi

Subject: [PATCH 2/7] hugetlb, HWPOISON: move PG_HWPoison bit check

In order to handle metadata correctly, we should check whether the hugepage
we are going to access is HWPOISONed *before* incrementing its mapcount,
adding the hugepage to the pagecache, or constructing the anon_vma.
This patch also adds retry code for the race between
alloc_huge_page() and memory failure.

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
---
mm/hugetlb.c | 34 +++++++++++++++++++++-------------
1 files changed, 21 insertions(+), 13 deletions(-)

diff --git v2.6.35-rc3-hwpoison/mm/hugetlb.c v2.6.35-rc3-hwpoison/mm/hugetlb.c
index a26c24a..5c77a73 100644
--- v2.6.35-rc3-hwpoison/mm/hugetlb.c
+++ v2.6.35-rc3-hwpoison/mm/hugetlb.c
@@ -2490,8 +2490,15 @@ retry:
int err;
struct inode *inode = mapping->host;

- err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+ lock_page(page);
+ if (unlikely(PageHWPoison(page))) {
+ unlock_page(page);
+ goto retry;
+ }
+ err = add_to_page_cache_locked(page, mapping,
+ idx, GFP_KERNEL);
if (err) {
+ unlock_page(page);
put_page(page);
if (err == -EEXIST)
goto retry;
@@ -2504,6 +2511,10 @@ retry:
page_dup_rmap(page);
} else {
lock_page(page);
+ if (unlikely(PageHWPoison(page))) {
+ unlock_page(page);
+ goto retry;
+ }
if (unlikely(anon_vma_prepare(vma))) {
ret = VM_FAULT_OOM;
goto backout_unlocked;
@@ -2511,22 +2522,19 @@ retry:
hugepage_add_new_anon_rmap(page, vma, address);
}
} else {
+ /*
+ * If a memory error occurs between mmap() and fault, some processes
+ * don't have a hwpoisoned swap entry for the errored virtual address.
+ * So we need to block hugepage fault by PG_hwpoison bit check.
+ */
+ if (unlikely(PageHWPoison(page))) {
+ ret = VM_FAULT_HWPOISON;
+ goto backout_unlocked;
+ }
page_dup_rmap(page);
}

/*
- * Since memory error handler replaces pte into hwpoison swap entry
- * at the time of error handling, a process which reserved but not have
- * the mapping to the error hugepage does not have hwpoison swap entry.
- * So we need to block accesses from such a process by checking
- * PG_hwpoison bit here.
- */
- if (unlikely(PageHWPoison(page))) {
- ret = VM_FAULT_HWPOISON;
- goto backout_unlocked;
- }
-
- /*
* If we are going to COW a private mapping later, we examine the
* pending reservations for this page now. This will ensure that
* any allocations necessary to record that reservation occur outside
--
1.7.1

2010-07-02 05:50:03

by Naoya Horiguchi

Subject: [PATCH 5/7] hugetlb: pin oldpage in page migration

This patch introduces pinning the old page during page migration
to avoid freeing it before we complete copying.
This race condition can happen for privately mapped or anonymous hugepage.

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
---
mm/migrate.c | 26 +++++++++++++++++++++++---
1 files changed, 23 insertions(+), 3 deletions(-)

diff --git v2.6.35-rc3-hwpoison/mm/migrate.c v2.6.35-rc3-hwpoison/mm/migrate.c
index 4205b1d..e4a381c 100644
--- v2.6.35-rc3-hwpoison/mm/migrate.c
+++ v2.6.35-rc3-hwpoison/mm/migrate.c
@@ -214,7 +214,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,

if (!mapping) {
/* Anonymous page without mapping */
- if (page_count(page) != 1)
+ if (page_count(page) != 2 - PageHuge(page))
return -EAGAIN;
return 0;
}
@@ -224,7 +224,11 @@ static int migrate_page_move_mapping(struct address_space *mapping,
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));

- expected_count = 2 + page_has_private(page);
+ /*
+ * Hugepages are expected to have only one remaining reference
+ * from pagecache, because hugepages are not linked to LRU list.
+ */
+ expected_count = 3 + page_has_private(page) - PageHuge(page);
if (page_count(page) != expected_count ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
spin_unlock_irq(&mapping->tree_lock);
@@ -561,7 +565,11 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
if (!newpage)
return -ENOMEM;

- if (page_count(page) == 1) {
+ /*
+ * For anonymous hugepages, reference count is equal to mapcount,
+ * so don't consider migration done for an anonymous hugepage.
+ */
+ if (page_count(page) == 1 && !(PageHuge(page) && PageAnon(page))) {
/* page was freed from under us. So we are done. */
goto move_newpage;
}
@@ -644,6 +652,16 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
}

/*
+ * It's reasonable to pin the old page until unmapping and copying
+ * complete, because when the original page is an anonymous hugepage,
+ * it will be freed in try_to_unmap() due to the fact that
+ * all references of anonymous hugepage come from mapcount.
+ * Although in the other cases no problem comes out without pinning,
+ * it looks logically correct to do it.
+ */
+ get_page(page);
+
+ /*
* Corner case handling:
* 1. When a new swap-cache page is read into, it is added to the LRU
* and treated as swapcache but it has no rmap yet.
@@ -697,6 +715,8 @@ uncharge:
unlock:
unlock_page(page);

+ put_page(page);
+
if (rc != -EAGAIN) {
/*
* A page that has been migrated has all references
--
1.7.1

2010-07-02 05:50:12

by Naoya Horiguchi

Subject: [PATCH 1/7] hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()

This patch fixes possible deadlock in hugepage lock_page()
by adding missing unlock_page().

libhugetlbfs test will hit this bug when the next patch in this
patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied.

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
---
mm/hugetlb.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git v2.6.35-rc3-hwpoison/mm/hugetlb.c v2.6.35-rc3-hwpoison/mm/hugetlb.c
index abf249d..a26c24a 100644
--- v2.6.35-rc3-hwpoison/mm/hugetlb.c
+++ v2.6.35-rc3-hwpoison/mm/hugetlb.c
@@ -2323,9 +2323,11 @@ retry_avoidcopy:
* and just make the page writable */
avoidcopy = (page_mapcount(old_page) == 1);
if (avoidcopy) {
- if (!trylock_page(old_page))
+ if (!trylock_page(old_page)) {
if (PageAnon(old_page))
page_move_anon_rmap(old_page, vma, address);
+ } else
+ unlock_page(old_page);
set_huge_ptep_writable(vma, address, ptep);
return 0;
}
--
1.7.1

2010-07-02 05:50:18

by Naoya Horiguchi

Subject: [PATCH 6/7] hugetlb: hugepage migration core

This patch extends the page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining, which
is triggered by corrected memory errors (added by the next patch.)

Todo: there are other users of page migration such as memory policy,
memory hotplug and memory compaction.
They do not support hugepages yet.

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
---
fs/hugetlbfs/inode.c | 2 ++
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 10 ++++++++--
mm/migrate.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 53 insertions(+), 4 deletions(-)

diff --git v2.6.35-rc3-hwpoison/fs/hugetlbfs/inode.c v2.6.35-rc3-hwpoison/fs/hugetlbfs/inode.c
index a4e9a7e..8fd5967 100644
--- v2.6.35-rc3-hwpoison/fs/hugetlbfs/inode.c
+++ v2.6.35-rc3-hwpoison/fs/hugetlbfs/inode.c
@@ -31,6 +31,7 @@
#include <linux/statfs.h>
#include <linux/security.h>
#include <linux/magic.h>
+#include <linux/migrate.h>

#include <asm/uaccess.h>

@@ -675,6 +676,7 @@ static const struct address_space_operations hugetlbfs_aops = {
.write_begin = hugetlbfs_write_begin,
.write_end = hugetlbfs_write_end,
.set_page_dirty = hugetlbfs_set_page_dirty,
+ .migratepage = migrate_page,
};


diff --git v2.6.35-rc3-hwpoison/include/linux/hugetlb.h v2.6.35-rc3-hwpoison/include/linux/hugetlb.h
index 0b73c53..952e3ce 100644
--- v2.6.35-rc3-hwpoison/include/linux/hugetlb.h
+++ v2.6.35-rc3-hwpoison/include/linux/hugetlb.h
@@ -320,6 +320,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
{
return 1;
}
+#define page_hstate(p) NULL
#endif

#endif /* _LINUX_HUGETLB_H */
diff --git v2.6.35-rc3-hwpoison/mm/hugetlb.c v2.6.35-rc3-hwpoison/mm/hugetlb.c
index d7c462b..6e7f5f2 100644
--- v2.6.35-rc3-hwpoison/mm/hugetlb.c
+++ v2.6.35-rc3-hwpoison/mm/hugetlb.c
@@ -2640,8 +2640,14 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, address);
if (ptep) {
entry = huge_ptep_get(ptep);
- if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
- return VM_FAULT_HWPOISON;
+ if (!(huge_pte_none(entry) || pte_present(entry))) {
+ if (is_migration_entry(pte_to_swp_entry(entry))) {
+ migration_entry_wait(mm, (pmd_t *)ptep,
+ address);
+ return 0;
+ } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+ return VM_FAULT_HWPOISON;
+ }
}

ptep = huge_pte_alloc(mm, address, huge_page_size(h));
diff --git v2.6.35-rc3-hwpoison/mm/migrate.c v2.6.35-rc3-hwpoison/mm/migrate.c
index e4a381c..e7af148 100644
--- v2.6.35-rc3-hwpoison/mm/migrate.c
+++ v2.6.35-rc3-hwpoison/mm/migrate.c
@@ -32,6 +32,7 @@
#include <linux/security.h>
#include <linux/memcontrol.h>
#include <linux/syscalls.h>
+#include <linux/hugetlb.h>
#include <linux/gfp.h>

#include "internal.h"
@@ -74,6 +75,8 @@ void putback_lru_pages(struct list_head *l)
struct page *page2;

list_for_each_entry_safe(page, page2, l, lru) {
+ if (PageHuge(page))
+ break;
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
@@ -95,6 +98,12 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
pte_t *ptep, pte;
spinlock_t *ptl;

+ if (unlikely(PageHuge(new))) {
+ ptep = huge_pte_offset(mm, addr);
+ ptl = &mm->page_table_lock;
+ goto check;
+ }
+
pgd = pgd_offset(mm, addr);
if (!pgd_present(*pgd))
goto out;
@@ -115,6 +124,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
}

ptl = pte_lockptr(mm, pmd);
+check:
spin_lock(ptl);
pte = *ptep;
if (!is_swap_pte(pte))
@@ -130,10 +140,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
if (is_write_migration_entry(entry))
pte = pte_mkwrite(pte);
+ if (PageHuge(new))
+ pte = pte_mkhuge(pte);
flush_cache_page(vma, addr, pte_pfn(pte));
set_pte_at(mm, addr, ptep, pte);

- if (PageAnon(new))
+ if (PageHuge(new)) {
+ if (PageAnon(new))
+ hugepage_add_anon_rmap(new, vma, addr);
+ else
+ page_dup_rmap(new);
+ } else if (PageAnon(new))
page_add_anon_rmap(new, vma, addr);
else
page_add_file_rmap(new);
@@ -267,7 +284,14 @@ static int migrate_page_move_mapping(struct address_space *mapping,
* Note that anonymous pages are accounted for
* via NR_FILE_PAGES and NR_ANON_PAGES if they
* are mapped to swap space.
+ *
+ * Not account hugepage here for now because hugepage has
+ * separate accounting rule.
*/
+ if (PageHuge(newpage)) {
+ spin_unlock_irq(&mapping->tree_lock);
+ return 0;
+ }
__dec_zone_page_state(page, NR_FILE_PAGES);
__inc_zone_page_state(newpage, NR_FILE_PAGES);
if (PageSwapBacked(page)) {
@@ -284,7 +308,17 @@ static int migrate_page_move_mapping(struct address_space *mapping,
*/
static void migrate_page_copy(struct page *newpage, struct page *page)
{
- copy_highpage(newpage, page);
+ int i;
+ struct hstate *h;
+ if (!PageHuge(newpage))
+ copy_highpage(newpage, page);
+ else {
+ h = page_hstate(newpage);
+ for (i = 0; i < pages_per_huge_page(h); i++) {
+ cond_resched();
+ copy_highpage(newpage + i, page + i);
+ }
+ }

if (PageError(page))
SetPageError(newpage);
@@ -718,6 +752,11 @@ unlock:
put_page(page);

if (rc != -EAGAIN) {
+ if (PageHuge(newpage)) {
+ put_page(newpage);
+ goto out;
+ }
+
/*
* A page that has been migrated has all references
* removed and will be freed. A page that has not been
@@ -738,6 +777,7 @@ move_newpage:
*/
putback_lru_page(newpage);

+out:
if (result) {
if (rc)
*result = rc;
--
1.7.1

2010-07-02 05:50:54

by Naoya Horiguchi

Subject: [PATCH 3/7] hugetlb: add allocate function for hugepage migration

We can't use existing hugepage allocation functions to allocate hugepage
for page migration, because page migration can happen asynchronously with
the running processes and page migration users should call the allocation
function with physical addresses (not virtual addresses) as arguments.
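
As a usage illustration (a sketch, not part of this patch), a migration caller
that only knows the physical page can pick the node and hstate from the old
page, much like the soft-offline allocation callback added in patch 7/7:

    static struct page *new_hugepage(struct page *old, unsigned long private,
                                     int **result)
    {
            int nid = page_to_nid(old);

            /* allocate a hugepage of the same size on the same node */
            return alloc_huge_page_node(page_hstate(compound_head(old)), nid);
    }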

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
---
include/linux/hugetlb.h | 3 ++
mm/hugetlb.c | 83 +++++++++++++++++++++++++++++++++++++---------
2 files changed, 69 insertions(+), 17 deletions(-)

diff --git v2.6.35-rc3-hwpoison/include/linux/hugetlb.h v2.6.35-rc3-hwpoison/include/linux/hugetlb.h
index f479700..0b73c53 100644
--- v2.6.35-rc3-hwpoison/include/linux/hugetlb.h
+++ v2.6.35-rc3-hwpoison/include/linux/hugetlb.h
@@ -228,6 +228,8 @@ struct huge_bootmem_page {
struct hstate *hstate;
};

+struct page *alloc_huge_page_node(struct hstate *h, int nid);
+
/* arch callback */
int __init alloc_bootmem_huge_page(struct hstate *h);

@@ -303,6 +305,7 @@ static inline struct hstate *page_hstate(struct page *page)

#else
struct hstate {};
+#define alloc_huge_page_node(h, nid) NULL
#define alloc_bootmem_huge_page(h) NULL
#define hstate_file(f) NULL
#define hstate_vma(v) NULL
diff --git v2.6.35-rc3-hwpoison/mm/hugetlb.c v2.6.35-rc3-hwpoison/mm/hugetlb.c
index 5c77a73..d7c462b 100644
--- v2.6.35-rc3-hwpoison/mm/hugetlb.c
+++ v2.6.35-rc3-hwpoison/mm/hugetlb.c
@@ -466,6 +466,18 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
h->free_huge_pages_node[nid]++;
}

+static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
+{
+ struct page *page;
+ if (list_empty(&h->hugepage_freelists[nid]))
+ return NULL;
+ page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
+ list_del(&page->lru);
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
+ return page;
+}
+
static struct page *dequeue_huge_page_vma(struct hstate *h,
struct vm_area_struct *vma,
unsigned long address, int avoid_reserve)
@@ -497,18 +509,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
for_each_zone_zonelist_nodemask(zone, z, zonelist,
MAX_NR_ZONES - 1, nodemask) {
nid = zone_to_nid(zone);
- if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
- !list_empty(&h->hugepage_freelists[nid])) {
- page = list_entry(h->hugepage_freelists[nid].next,
- struct page, lru);
- list_del(&page->lru);
- h->free_huge_pages--;
- h->free_huge_pages_node[nid]--;
-
- if (!avoid_reserve)
- decrement_hugepage_resv_vma(h, vma);
-
- break;
+ if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
+ page = dequeue_huge_page_node(h, nid);
+ if (page) {
+ if (!avoid_reserve)
+ decrement_hugepage_resv_vma(h, vma);
+ break;
+ }
}
}
err:
@@ -616,7 +623,7 @@ int PageHuge(struct page *page)
}
EXPORT_SYMBOL_GPL(PageHuge);

-static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
+static struct page *__alloc_huge_page_node(struct hstate *h, int nid)
{
struct page *page;

@@ -627,14 +634,56 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN,
huge_page_order(h));
+ if (page && arch_prepare_hugepage(page)) {
+ __free_pages(page, huge_page_order(h));
+ return NULL;
+ }
+
+ return page;
+}
+
+static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
+{
+ struct page *page = __alloc_huge_page_node(h, nid);
+ if (page)
+ prep_new_huge_page(h, page, nid);
+ return page;
+}
+
+static struct page *alloc_buddy_huge_page_node(struct hstate *h, int nid)
+{
+ struct page *page = __alloc_huge_page_node(h, nid);
if (page) {
- if (arch_prepare_hugepage(page)) {
- __free_pages(page, huge_page_order(h));
+ set_compound_page_dtor(page, free_huge_page);
+ spin_lock(&hugetlb_lock);
+ h->nr_huge_pages++;
+ h->nr_huge_pages_node[nid]++;
+ spin_unlock(&hugetlb_lock);
+ put_page_testzero(page);
+ }
+ return page;
+}
+
+struct page *alloc_huge_page_node(struct hstate *h, int nid)
+{
+ struct page *page;
+
+ spin_lock(&hugetlb_lock);
+ get_mems_allowed();
+ page = dequeue_huge_page_node(h, nid);
+ put_mems_allowed();
+ spin_unlock(&hugetlb_lock);
+
+ if (!page) {
+ page = alloc_buddy_huge_page_node(h, nid);
+ if (!page) {
+ __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
return NULL;
- }
- prep_new_huge_page(h, page, nid);
+ } else
+ __count_vm_event(HTLB_BUDDY_PGALLOC);
}

+ set_page_refcounted(page);
return page;
}

--
1.7.1

2010-07-02 08:30:30

by Andi Kleen

Subject: Re: [PATCH 0/7] hugepage migration

On Fri, Jul 02, 2010 at 02:47:19PM +0900, Naoya Horiguchi wrote:
> This is a patchset for hugepage migration.

Thanks for working on this.
>
> There are many users of page migration such as soft offlining,
> memory hotplug, memory policy and memory compaction,
> but this patchset adds hugepage support only for soft offlining
> as the first step.

Is that simply because the callers are not hooked up yet
for the other cases, or more fundamental issues?


> I tested this patchset with 'make func' in libhugetlbfs and
> have gotten the same result as one from 2.6.35-rc3.

Is there a tester for the functionality too? (e.g. mce-test test cases)

-Andi

2010-07-02 08:31:47

by Andi Kleen

Subject: Re: [PATCH 1/7] hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()

On Fri, Jul 02, 2010 at 02:47:20PM +0900, Naoya Horiguchi wrote:
> This patch fixes possible deadlock in hugepage lock_page()
> by adding missing unlock_page().
>
> libhugetlbfs test will hit this bug when the next patch in this
> patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied.

This looks like a general bug fix that should be merged ASAP?

Or do you think this cannot be hit at all without the other patches?

-Andi

2010-07-02 09:08:57

by Andi Kleen

Subject: Re: [PATCH 3/7] hugetlb: add allocate function for hugepage migration

On Fri, Jul 02, 2010 at 02:47:22PM +0900, Naoya Horiguchi wrote:
> We can't use existing hugepage allocation functions to allocate hugepage
> for page migration, because page migration can happen asynchronously with
> the running processes and page migration users should call the allocation
> function with physical addresses (not virtual addresses) as arguments.

I looked through this patch and didn't see anything bad. Some more
eyes familiar with hugepages would be good though.

Since there are now so many different allocation functions some
comments on when they should be used may be useful too

-Andi

2010-07-02 09:11:27

by Andi Kleen

Subject: Re: [PATCH 4/7] hugetlb: add hugepage check in mem_cgroup_{register,end}_migration()

On Fri, Jul 02, 2010 at 02:47:23PM +0900, Naoya Horiguchi wrote:
> Currently memory cgroup doesn't charge hugepage,
> so avoid calling these functions in hugepage migration context.

Sounds like something that really should be fixed, but ok for now
I guess...

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-05 08:48:12

by Naoya Horiguchi

Subject: Re: [PATCH 3/7] hugetlb: add allocate function for hugepage migration

On Fri, Jul 02, 2010 at 11:08:54AM +0200, Andi Kleen wrote:
> On Fri, Jul 02, 2010 at 02:47:22PM +0900, Naoya Horiguchi wrote:
> > We can't use existing hugepage allocation functions to allocate hugepage
> > for page migration, because page migration can happen asynchronously with
> > the running processes and page migration users should call the allocation
> > function with physical addresses (not virtual addresses) as arguments.
>
> I looked through this patch and didn't see anything bad. Some more
> eyes familiar with hugepages would be good though.

Yes.

> Since there are now so many different allocation functions some
> comments on when they should be used may be useful too

OK. How about this?

+/*
+ * This allocation function is useful in the context where vma is irrelevant.
+ * E.g. soft-offlining uses this function because it only cares about the
+ * physical address of the error page.
+ */
+struct page *alloc_huge_page_node(struct hstate *h, int nid)
+{

BTW, I don't like this function name very much.
Since the most significant difference between this function and alloc_huge_page()
is the lack of a vma argument, I'm going to change the name to
alloc_huge_page_no_vma_node() in the next version if that's no problem.

Or, since a postfix like "_no_vma" is verbose, I think it might be
a good idea to rename the present alloc_huge_page() to alloc_huge_page_vma().
Is this worthwhile?

Thanks,
Naoya Horiguchi

2010-07-05 08:49:46

by Naoya Horiguchi

Subject: Re: [PATCH 0/7] hugepage migration

On Fri, Jul 02, 2010 at 10:30:26AM +0200, Andi Kleen wrote:
> On Fri, Jul 02, 2010 at 02:47:19PM +0900, Naoya Horiguchi wrote:
> > This is a patchset for hugepage migration.
>
> Thanks for working on this.
> >
> > There are many users of page migration such as soft offlining,
> > memory hotplug, memory policy and memory compaction,
> > but this patchset adds hugepage support only for soft offlining
> > as the first step.
>
> Is that simply because the callers are not hooked up yet
> for the other cases, or more fundamental issues?

Yes, it's just that the other callers aren't hooked up yet.
I hope there will be no critical problems in implementing the other cases.

>
> > I tested this patchset with 'make func' in libhugetlbfs and
> > have gotten the same result as one from 2.6.35-rc3.
>
> Is there a tester for the functionality too? (e.g. mce-test test cases)

Yes.
I'll send patches on mce-test suite later.

Thanks,
Naoya Horiguchi

2010-07-05 08:50:04

by Naoya Horiguchi

Subject: Re: [PATCH 1/7] hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()

On Fri, Jul 02, 2010 at 10:31:43AM +0200, Andi Kleen wrote:
> On Fri, Jul 02, 2010 at 02:47:20PM +0900, Naoya Horiguchi wrote:
> > This patch fixes possible deadlock in hugepage lock_page()
> > by adding missing unlock_page().
> >
> > libhugetlbfs test will hit this bug when the next patch in this
> > patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied.
>
> This looks like a general bug fix that should be merged ASAP?
>
> Or do you think this cannot be hit at all without the other patches?

This bug was introduced by the patch "hugetlb, rmap: add reverse mapping for
hugepage" in the previous patchset (currently in linux-next), and it's not
merged into mainline yet.
So it's OK as long as this patch is in linux-next by the time of the merge to mainline.

Thanks,
Naoya Horiguchi

2010-07-05 09:28:26

by Andi Kleen

Subject: Re: [PATCH 3/7] hugetlb: add allocate function for hugepage migration

On Mon, Jul 05, 2010 at 05:46:29PM +0900, Naoya Horiguchi wrote:
> On Fri, Jul 02, 2010 at 11:08:54AM +0200, Andi Kleen wrote:
> > On Fri, Jul 02, 2010 at 02:47:22PM +0900, Naoya Horiguchi wrote:
> > > We can't use existing hugepage allocation functions to allocate hugepage
> > > for page migration, because page migration can happen asynchronously with
> > > the running processes and page migration users should call the allocation
> > > function with physical addresses (not virtual addresses) as arguments.
> >
> > I looked through this patch and didn't see anything bad. Some more
> > eyes familiar with hugepages would be good though.
>
> Yes.
>
> > Since there are now so many different allocation functions some
> > comments on when they should be used may be useful too
>
> OK. How about this?
>
> +/*
> + * This allocation function is useful in the context where vma is irrelevant.
> + * E.g. soft-offlining uses this function because it only cares physical
> + * address of error page.
> + */

Looks good thanks.

> +struct page *alloc_huge_page_node(struct hstate *h, int nid)
> +{
>
> BTW, I don't like this function name very much.
> Since the most significant difference of this function to alloc_huge_page()
> is lack of vma argument, so I'm going to change the name to
> alloc_huge_page_no_vma_node() in the next version if it is no problem.
>
> Or, since the postfix like "_no_vma" is verbose, I think it might be
> a good idea to rename present alloc_huge_page() to alloc_huge_page_vma().
> Is this worthwhile?

Yes, in a separate patch

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-05 09:45:21

by Andi Kleen

Subject: Re: [PATCH 5/7] hugetlb: pin oldpage in page migration

On Fri, Jul 02, 2010 at 02:47:24PM +0900, Naoya Horiguchi wrote:
> This patch introduces pinning the old page during page migration
> to avoid freeing it before we complete copying.
> This race condition can happen for privately mapped or anonymous hugepage.

Patch looks good.
-Andi

2010-07-05 09:59:33

by Andi Kleen

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

On Fri, Jul 02, 2010 at 02:47:25PM +0900, Naoya Horiguchi wrote:
> diff --git v2.6.35-rc3-hwpoison/mm/migrate.c v2.6.35-rc3-hwpoison/mm/migrate.c
> index e4a381c..e7af148 100644
> --- v2.6.35-rc3-hwpoison/mm/migrate.c
> +++ v2.6.35-rc3-hwpoison/mm/migrate.c
> @@ -32,6 +32,7 @@
> #include <linux/security.h>
> #include <linux/memcontrol.h>
> #include <linux/syscalls.h>
> +#include <linux/hugetlb.h>
> #include <linux/gfp.h>
>
> #include "internal.h"
> @@ -74,6 +75,8 @@ void putback_lru_pages(struct list_head *l)
> struct page *page2;
>
> list_for_each_entry_safe(page, page2, l, lru) {
> + if (PageHuge(page))
> + break;

Why is this a break and not a continue? Couldn't you have small and large
pages in the same list?

There's more code that handles LRU in this file. Do they all handle huge pages
correctly?

I also noticed we do not always lock all sub pages in the huge page. Now if
IO happens it will lock on subpages, not the head page. But this code
handles all subpages as a unit. Could this cause locking problems?
Perhaps it would be safer to lock all sub pages always? Or would
need to audit other page users to make sure they always lock on the head
and do the same here.

Hmm page reference counts may have the same issue?

> @@ -95,6 +98,12 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> pte_t *ptep, pte;
> spinlock_t *ptl;
>
> + if (unlikely(PageHuge(new))) {
> + ptep = huge_pte_offset(mm, addr);
> + ptl = &mm->page_table_lock;
> + goto check;
> + }
> +
> pgd = pgd_offset(mm, addr);
> if (!pgd_present(*pgd))
> goto out;
> @@ -115,6 +124,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> }
>
> ptl = pte_lockptr(mm, pmd);
> +check:

I think I would prefer a proper if else over a goto here.

The lookup should probably just call a helper to make this function more readable
(like lookup_address(), unfortunately that's x86 specific right now)


> @@ -284,7 +308,17 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> */
> static void migrate_page_copy(struct page *newpage, struct page *page)
> {
> - copy_highpage(newpage, page);
> + int i;
> + struct hstate *h;
> + if (!PageHuge(newpage))
> + copy_highpage(newpage, page);
> + else {
> + h = page_hstate(newpage);
> + for (i = 0; i < pages_per_huge_page(h); i++) {
> + cond_resched();
> + copy_highpage(newpage + i, page + i);

Better reuse copy_huge_page() instead of open coding.


-Andi
--
[email protected] -- Speaking for myself only.

2010-07-05 10:28:48

by Andi Kleen

Subject: Re: [PATCH 7/7] hugetlb, HWPOISON: soft offlining for hugepage

> +void isolate_hwpoisoned_huge_page(struct page *hpage)
> +{
> + lock_page(hpage);
> + __isolate_hwpoisoned_huge_page(hpage);
> + unlock_page(hpage);
> +}

This assumes all other users (even outside this file)
who lock always do so on the head page too. Needs some double-checking?

> } else {
> pr_debug("soft offline: %#lx: isolation failed: %d, page count %d, type %lx\n",
> pfn, ret, page_count(page), page->flags);
> @@ -1351,8 +1380,8 @@ int soft_offline_page(struct page *page, int flags)
> return ret;
>
> done:
> - atomic_long_add(1, &mce_bad_pages);
> - SetPageHWPoison(page);
> + atomic_long_add(1 << compound_order(hpage), &mce_bad_pages);

Probably should add a separate counter too?

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-06 03:35:51

by Naoya Horiguchi

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

On Mon, Jul 05, 2010 at 11:59:28AM +0200, Andi Kleen wrote:
> On Fri, Jul 02, 2010 at 02:47:25PM +0900, Naoya Horiguchi wrote:
> > diff --git v2.6.35-rc3-hwpoison/mm/migrate.c v2.6.35-rc3-hwpoison/mm/migrate.c
> > index e4a381c..e7af148 100644
> > --- v2.6.35-rc3-hwpoison/mm/migrate.c
> > +++ v2.6.35-rc3-hwpoison/mm/migrate.c
> > @@ -32,6 +32,7 @@
> > #include <linux/security.h>
> > #include <linux/memcontrol.h>
> > #include <linux/syscalls.h>
> > +#include <linux/hugetlb.h>
> > #include <linux/gfp.h>
> >
> > #include "internal.h"
> > @@ -74,6 +75,8 @@ void putback_lru_pages(struct list_head *l)
> > struct page *page2;
> >
> > list_for_each_entry_safe(page, page2, l, lru) {
> > + if (PageHuge(page))
> > + break;
>
> Why is this a break and not a continue? Couldn't you have small and large
> pages in the same list?

Hmm, this chunk needs to be fixed because I made too specific an assumption.
The list passed to migrate_pages() contains only one page or one hugepage when
page migration is kicked by soft offline, but that's not true in the general case.
Since hugepages are not linked to the LRU list, we had better simply skip
putback_lru_pages().

> There's more code that handles LRU in this file. Do they all handle huge pages
> correctly?
>
> I also noticed we do not always lock all sub pages in the huge page. Now if
> IO happens it will lock on subpages, not the head page. But this code
> handles all subpages as a unit. Could this cause locking problems?
> Perhaps it would be safer to lock all sub pages always? Or would
> need to audit other page users to make sure they always lock on the head
> and do the same here.
>
> Hmm page reference counts may have the same issue?

If we try to implement paging out of hugepage in the future, we need to
solve all these problems straightforwardly. But at least for now we can
skirt them by not touching LRU code for hugepage extension.

> > @@ -95,6 +98,12 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> > pte_t *ptep, pte;
> > spinlock_t *ptl;
> >
> > + if (unlikely(PageHuge(new))) {
> > + ptep = huge_pte_offset(mm, addr);
> > + ptl = &mm->page_table_lock;
> > + goto check;
> > + }
> > +
> > pgd = pgd_offset(mm, addr);
> > if (!pgd_present(*pgd))
> > goto out;
> > @@ -115,6 +124,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> > }
> >
> > ptl = pte_lockptr(mm, pmd);
> > +check:
>
> I think I would prefer a proper if else over a goto here.
>
> The lookup should probably just call a helper to make this function more readable
> (like lookup_address(), unfortunately that's x86 specific right now)

OK.
I'll move common code to helper function.

> > @@ -284,7 +308,17 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> > */
> > static void migrate_page_copy(struct page *newpage, struct page *page)
> > {
> > - copy_highpage(newpage, page);
> > + int i;
> > + struct hstate *h;
> > + if (!PageHuge(newpage))
> > + copy_highpage(newpage, page);
> > + else {
> > + h = page_hstate(newpage);
> > + for (i = 0; i < pages_per_huge_page(h); i++) {
> > + cond_resched();
> > + copy_highpage(newpage + i, page + i);
>
> Better reuse copy_huge_page() instead of open coding.

Agreed.

Thanks,
Naoya Horiguchi

2010-07-06 07:13:44

by Andi Kleen

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

On Tue, Jul 06, 2010 at 12:33:42PM +0900, Naoya Horiguchi wrote:
> > There's more code that handles LRU in this file. Do they all handle huge pages
> > correctly?
> >
> > I also noticed we do not always lock all sub pages in the huge page. Now if
> > IO happens it will lock on subpages, not the head page. But this code
> > handles all subpages as a unit. Could this cause locking problems?
> > Perhaps it would be safer to lock all sub pages always? Or would
> > need to audit other page users to make sure they always lock on the head
> > and do the same here.
> >
> > Hmm page reference counts may have the same issue?
>
> If we try to implement paging out of hugepage in the future, we need to
> solve all these problems straightforwardly. But at least for now we can
> skirt them by not touching LRU code for hugepage extension.

We need the page lock to avoid migrating pages that are currently
under IO. This can happen even without swapping when the process
manually starts IO.

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-06 15:55:26

by Christoph Lameter

Subject: Re: [PATCH 5/7] hugetlb: pin oldpage in page migration

On Fri, 2 Jul 2010, Naoya Horiguchi wrote:

> This patch introduces pinning the old page during page migration
> to avoid freeing it before we complete copying.

The old page is already pinned due to the reference count that is taken
when the page is put onto the list of pages to be migrated. See
do_move_pages() f.e.

Huge pages use a different scheme?

> This race condition can happen for privately mapped or anonymous hugepage.

It cannot happen unless you come up with your own scheme of managing pages
to be migrated and bypass migrate_pages(). There you should take the
refcount.

> /*
> + * It's reasonable to pin the old page until unmapping and copying
> + * complete, because when the original page is an anonymous hugepage,
> + * it will be freed in try_to_unmap() due to the fact that
> + * all references of anonymous hugepage come from mapcount.
> + * Although in the other cases no problem comes out without pinning,
> + * it looks logically correct to do it.
> + */
> + get_page(page);
> +
> + /*

Its already pinned. Dont do this. migrate_pages() relies on the caller
having pinned the page already.

2010-07-06 16:01:08

by Christoph Lameter

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

On Fri, 2 Jul 2010, Naoya Horiguchi wrote:

> --- v2.6.35-rc3-hwpoison/mm/migrate.c
> +++ v2.6.35-rc3-hwpoison/mm/migrate.c
> @@ -32,6 +32,7 @@
> #include <linux/security.h>
> #include <linux/memcontrol.h>
> #include <linux/syscalls.h>
> +#include <linux/hugetlb.h>
> #include <linux/gfp.h>
>
> #include "internal.h"
> @@ -74,6 +75,8 @@ void putback_lru_pages(struct list_head *l)
> struct page *page2;
>
> list_for_each_entry_safe(page, page2, l, lru) {
> + if (PageHuge(page))
> + break;
> list_del(&page->lru);

Argh. Hugepages in putback_lru_pages()? Huge pages are not on the lru.
Come up with something cleaner here.

> @@ -267,7 +284,14 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> * Note that anonymous pages are accounted for
> * via NR_FILE_PAGES and NR_ANON_PAGES if they
> * are mapped to swap space.
> + *
> + * Not account hugepage here for now because hugepage has
> + * separate accounting rule.
> */
> + if (PageHuge(newpage)) {
> + spin_unlock_irq(&mapping->tree_lock);
> + return 0;
> + }
> __dec_zone_page_state(page, NR_FILE_PAGES);
> __inc_zone_page_state(newpage, NR_FILE_PAGES);
> if (PageSwapBacked(page)) {

This looks wrong here. Too many special casing added to basic migration
functionality.

> @@ -284,7 +308,17 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> */
> static void migrate_page_copy(struct page *newpage, struct page *page)
> {
> - copy_highpage(newpage, page);
> + int i;
> + struct hstate *h;
> + if (!PageHuge(newpage))
> + copy_highpage(newpage, page);
> + else {
> + h = page_hstate(newpage);
> + for (i = 0; i < pages_per_huge_page(h); i++) {
> + cond_resched();
> + copy_highpage(newpage + i, page + i);
> + }
> + }
>
> if (PageError(page))
> SetPageError(newpage);

Could you generalize this for migrating an order N page?

> @@ -718,6 +752,11 @@ unlock:
> put_page(page);
>
> if (rc != -EAGAIN) {
> + if (PageHuge(newpage)) {
> + put_page(newpage);
> + goto out;
> + }
> +

I dont like this kind of inconsistency with the refcounting. Page
migration is complicated enough already.

2010-07-06 16:02:39

by Christoph Lameter

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

On Tue, 6 Jul 2010, Naoya Horiguchi wrote:

> Hmm, this chunk need to be fixed because I had too specific assumption.
> The list passed to migrate_pages() has only one page or one hugepage in
> page migration kicked by soft offline, but it's not the case in general case.
> Since hugepage is not linked to LRU list, we had better simply skip
> putback_lru_pages().

Maybe write a migrate_huge_page() function instead? The functionality is
materially different since we are not juggling things with the lru.
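
For illustration, a separate entry point along those lines might look roughly
like this (a sketch only; unmap_and_move_huge_page() is a hypothetical hugepage
counterpart of unmap_and_move(), not code from this thread, and the caller is
assumed to already hold a reference on the hugepage):

    int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
                          unsigned long private, int offlining)
    {
            int rc, pass;

            for (pass = 0; pass < 10; pass++) {
                    rc = unmap_and_move_huge_page(get_new_page, private,
                                                  hpage, offlining);
                    switch (rc) {
                    case -ENOMEM:
                            return rc;          /* allocation failed, give up */
                    case -EAGAIN:
                            cond_resched();     /* transient extra reference, retry */
                            break;
                    case 0:
                            return 0;           /* migrated successfully */
                    default:
                            return -EIO;        /* permanent failure */
                    }
            }
            return rc;
    }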

2010-07-07 06:07:06

by Naoya Horiguchi

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

On Tue, Jul 06, 2010 at 09:13:37AM +0200, Andi Kleen wrote:
> On Tue, Jul 06, 2010 at 12:33:42PM +0900, Naoya Horiguchi wrote:
> > > There's more code that handles LRU in this file. Do they all handle huge pages
> > > correctly?
> > >
> > > I also noticed we do not always lock all sub pages in the huge page. Now if
> > > IO happens it will lock on subpages, not the head page. But this code
> > > handles all subpages as a unit. Could this cause locking problems?
> > > Perhaps it would be safer to lock all sub pages always? Or would
> > > need to audit other page users to make sure they always lock on the head
> > > and do the same here.
> > >
> > > Hmm page reference counts may have the same issue?
> >
> > If we try to implement paging out of hugepage in the future, we need to
> > solve all these problems straightforwardly. But at least for now we can
> > skirt them by not touching LRU code for hugepage extension.
>
> We need the page lock to avoid migrating pages that are currently
> under IO. This can happen even without swapping when the process
> manually starts IO.

I see. I understand we should work on the locking problem now.
I dug into this and learned that hugepage IO can happen with direct IO from/to
a hugepage or with a coredump of a hugepage user.

We can resolve race between memory failure and IO by checking
page lock and writeback flag, right?

BTW I surveyed direct IO code, but page lock seems not to be taken.
Am I missing something?
(Before determining whether we lock all subpages or only headpage,
I want to clarify how current code for non-hugepage resolves this problem.)

Thanks,
Naoya Horiguchi

2010-07-07 06:46:41

by Naoya Horiguchi

Subject: Re: [PATCH 5/7] hugetlb: pin oldpage in page migration

Hi,

Thank you for your reviewing.

On Tue, Jul 06, 2010 at 10:54:38AM -0500, Christoph Lameter wrote:
> On Fri, 2 Jul 2010, Naoya Horiguchi wrote:
>
> > This patch introduces pinning the old page during page migration
> > to avoid freeing it before we complete copying.
>
> The old page is already pinned due to the reference count that is taken
> when the page is put onto the list of pages to be migrated. See
> do_move_pages() f.e.

OK.

> Huge pages use a different scheme?

A different scheme is used in soft offline, where the target page is not pinned
before migration. So I should have pinned it on the soft offline side.
I'll fix it.

> > This race condition can happen for privately mapped or anonymous hugepage.
>
> It cannot happen unless you come up with your own scheme of managing pages
> to be migrated and bypass migrate_pages(). There you should take the
> refcount.

Yes.

> > /*
> > + * It's reasonable to pin the old page until unmapping and copying
> > + * complete, because when the original page is an anonymous hugepage,
> > + * it will be freed in try_to_unmap() due to the fact that
> > + * all references of anonymous hugepage come from mapcount.
> > + * Although in the other cases no problem comes out without pinning,
> > + * it looks logically correct to do it.
> > + */
> > + get_page(page);
> > +
> > + /*
>
> Its already pinned. Dont do this. migrate_pages() relies on the caller
> having pinned the page already.

I agree.

Thanks,
Naoya Horiguchi

2010-07-07 06:46:50

by Naoya Horiguchi

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

On Tue, Jul 06, 2010 at 11:02:04AM -0500, Christoph Lameter wrote:
> On Tue, 6 Jul 2010, Naoya Horiguchi wrote:
>
> > Hmm, this chunk need to be fixed because I had too specific assumption.
> > The list passed to migrate_pages() has only one page or one hugepage in
> > page migration kicked by soft offline, but it's not the case in general case.
> > Since hugepage is not linked to LRU list, we had better simply skip
> > putback_lru_pages().
>
> Maybe write a migrate_huge_page() function instead? The functionality is
> materially different since we are not juggling things with the lru.

OK. I'll try to divide out separate functions for hugepages in the next version.

Thanks,
Naoya Horiguchi

2010-07-07 06:47:08

by Naoya Horiguchi

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

On Tue, Jul 06, 2010 at 11:00:27AM -0500, Christoph Lameter wrote:
> On Fri, 2 Jul 2010, Naoya Horiguchi wrote:
>
> > --- v2.6.35-rc3-hwpoison/mm/migrate.c
> > +++ v2.6.35-rc3-hwpoison/mm/migrate.c
> > @@ -32,6 +32,7 @@
> > #include <linux/security.h>
> > #include <linux/memcontrol.h>
> > #include <linux/syscalls.h>
> > +#include <linux/hugetlb.h>
> > #include <linux/gfp.h>
> >
> > #include "internal.h"
> > @@ -74,6 +75,8 @@ void putback_lru_pages(struct list_head *l)
> > struct page *page2;
> >
> > list_for_each_entry_safe(page, page2, l, lru) {
> > + if (PageHuge(page))
> > + break;
> > list_del(&page->lru);
>
> Argh. Hugepages in putback_lru_pages()? Huge pages are not on the lru.
> Come up with something cleaner here.

OK.

> > @@ -267,7 +284,14 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> > * Note that anonymous pages are accounted for
> > * via NR_FILE_PAGES and NR_ANON_PAGES if they
> > * are mapped to swap space.
> > + *
> > + * Not account hugepage here for now because hugepage has
> > + * separate accounting rule.
> > */
> > + if (PageHuge(newpage)) {
> > + spin_unlock_irq(&mapping->tree_lock);
> > + return 0;
> > + }
> > __dec_zone_page_state(page, NR_FILE_PAGES);
> > __inc_zone_page_state(newpage, NR_FILE_PAGES);
> > if (PageSwapBacked(page)) {
>
> This looks wrong here. Too many special casing added to basic migration
> functionality.

Agreed.
As I replied in another email, I'll move to migrate_huge_page() and
avoid adding ugly changes like this.

> > @@ -284,7 +308,17 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> > */
> > static void migrate_page_copy(struct page *newpage, struct page *page)
> > {
> > - copy_highpage(newpage, page);
> > + int i;
> > + struct hstate *h;
> > + if (!PageHuge(newpage))
> > + copy_highpage(newpage, page);
> > + else {
> > + h = page_hstate(newpage);
> > + for (i = 0; i < pages_per_huge_page(h); i++) {
> > + cond_resched();
> > + copy_highpage(newpage + i, page + i);
> > + }
> > + }
> >
> > if (PageError(page))
> > SetPageError(newpage);
>
> Could you generalize this for migrating an order N page?

Yes.
I'll define a helper function to handle order N case.

> > @@ -718,6 +752,11 @@ unlock:
> > put_page(page);
> >
> > if (rc != -EAGAIN) {
> > + if (PageHuge(newpage)) {
> > + put_page(newpage);
> > + goto out;
> > + }
> > +
>
> I dont like this kind of inconsistency with the refcounting. Page
> migration is complicated enough already.

OK.

Thanks,
Naoya Horiguchi

2010-07-07 09:27:30

by Andi Kleen

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

> I see. I understood we should work on locking problem in now.
> I digged and learned hugepage IO can happen in direct IO from/to
> hugepage or coredump of hugepage user.
>
> We can resolve race between memory failure and IO by checking
> page lock and writeback flag, right?

Yes, but we have to make sure it's in the same page.

As I understand it, the IO locking does not use the head page, which
means migration may need to lock all the sub pages.

Or fix IO locking to use head pages?

>
> BTW I surveyed direct IO code, but page lock seems not to be taken.
> Am I missing something?

That's expected I believe because applications are supposed to coordinate
for direct IO (but then direct IO also drops page cache).

But page lock is used to coordinate in the page cache for buffered IO.


-Andi
--
[email protected] -- Speaking for myself only.

2010-07-07 22:54:14

by Andi Kleen

Subject: Re: [PATCH 1/7] hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()

On Fri, Jul 02, 2010 at 02:47:20PM +0900, Naoya Horiguchi wrote:
> This patch fixes possible deadlock in hugepage lock_page()
> by adding missing unlock_page().
>
> libhugetlbfs test will hit this bug when the next patch in this
> patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied.

I merged this patch into the hwpoison tree now.

For the other patches in the series waiting until the open issues
are fixed.

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-08 05:46:04

by Naoya Horiguchi

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

On Wed, Jul 07, 2010 at 11:27:19AM +0200, Andi Kleen wrote:
> > I see. I understood we should work on locking problem in now.
> > I digged and learned hugepage IO can happen in direct IO from/to
> > hugepage or coredump of hugepage user.
> >
> > We can resolve race between memory failure and IO by checking
> > page lock and writeback flag, right?
>
> Yes, but we have to make sure it's in the same page.
>
> As I understand the IO locking does not use the head page, that
> means migration may need to lock all the sub pages.
>
> Or fix IO locking to use head pages?
> >
> > BTW I surveyed direct IO code, but page lock seems not to be taken.
> > Am I missing something?
>
> That's expected I believe because applications are supposed to coordinate
> for direct IO (but then direct IO also drops page cache).
>
> But page lock is used to coordinate in the page cache for buffered IO.

This page cache is located on non-hugepages, isn't it?
If so, buffered IO is handled in the same manner as for non-hugepages.
I think "hugepage under IO" happens only with direct IO for now.

Direct IO is issued in page-size units even if the target page is in a hugepage,
so locking each subpage separately looks more natural to me than auditing all
users to lock the head page.

2010-07-08 06:59:13

by Andi Kleen

Subject: Re: [PATCH 6/7] hugetlb: hugepage migration core

> This page cache is located on non-hugepage, isn't it?

Yes.

> If so, buffered IO is handled in the same manner as done for non-hugepage.
> I think "hugepage under IO" is realized only in direct IO for now.
>
> Direct IO is issued in page unit even if the target page is in hugepage,
> so locking each subpages separately looks natural for me than auditing
> head page.

Ok. Would need to make sure lock ordering is correctly handled all the time.

If there's any code that locks multiple pages "backwards" and the migration
code locks it forward there might be a problem. Maybe it's not a problem
though.
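
If subpage locking is ever needed, one way to keep the ordering consistent
(a sketch under that assumption, not code from this thread) is to always take
the subpage locks in ascending page order and release them in reverse:

    static void lock_huge_page_subpages(struct page *hpage)
    {
            int i, nr = 1 << compound_order(hpage);

            for (i = 0; i < nr; i++)
                    lock_page(hpage + i);   /* ascending order everywhere */
    }

    static void unlock_huge_page_subpages(struct page *hpage)
    {
            int i, nr = 1 << compound_order(hpage);

            for (i = nr - 1; i >= 0; i--)
                    unlock_page(hpage + i);
    }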

-Andi
--
[email protected] -- Speaking for myself only.