From: Zi Yan <[email protected]>
Hi Naoya,
Here is soft-offline support for THP migration. I would like comments, since
the series changes the interface of soft_offline_page() (Patch 2) and the
behavior of migrate_pages() (Patch 3). soft_offline_page() is called from
store_soft_offline_page() in drivers/base/memory.c.
The patchset is on top of mmotm-2017-08-10-15-33.
The patchset is tested with:
1. simple madvise() call program (https://github.com/x-y-z/soft-offline-test) and
2. a local kernel change that intentionally fails THP allocations for soft offline,
which causes to-be-soft-offlined THPs to be split by Patch 3.
Patch 1: obtain the size of an offlined page before it is offlined. The size is
used as the step value of the for-loop inside madvise_inject_error().
Originally, the for-loop read the size from the page after it was offlined,
which used to be OK. But once a THP is offlined as a whole, it is split
afterwards, so the size read after offlining is PAGE_SIZE instead of the THP
size, which causes a single THP to be offlined 512 times.
Patch 2: when offlining a THP, there are two situations: a) the THP is offlined
as a whole, or b) the THP is split and only the raw error page is offlined.
Thus, we need soft_offline_page() to tell us whether a THP was split during
offlining, which leads to a new output parameter.
Patch 3: as Naoya suggested, if a THP fails to be offlined as a whole, we
should retry the raw error subpage. This patch implements that. It also
requires migrate_pages() not to split a THP when migration fails with reason
MR_MEMORY_FAILURE.
Patch 4: enable thp migration support for soft offline.
Any suggestions and comments are welcome.
Thanks.
Zi Yan (4):
mm: madvise: read loop's step size beforehand in
madvise_inject_error(), prepare for THP support.
mm: soft-offline: Change soft_offline_page() interface to tell if the
page is split or not.
mm: soft-offline: retry to split and soft-offline the raw error if the
original THP offlining fails.
mm: hwpoison: soft offline supports thp migration
drivers/base/memory.c | 2 +-
include/linux/mm.h | 2 +-
mm/madvise.c | 24 ++++++++++--
mm/memory-failure.c | 103 +++++++++++++++++++++++++++++---------------------
mm/migrate.c | 16 ++++++++
5 files changed, 97 insertions(+), 50 deletions(-)
--
2.13.2
From: Zi Yan <[email protected]>
This prepares for THP migration support. During soft-offlining, if a THP is
migrated without being split, we should continue with the page after the THP.
Otherwise, we should continue with the next subpage within that THP.
The newly added output parameter of soft_offline_page() helps distinguish
the two cases above. Since THPs are split after they are freed, we cannot
tell the two cases apart by examining the soft-offlined pages afterwards.
Signed-off-by: Zi Yan <[email protected]>
---
drivers/base/memory.c | 2 +-
include/linux/mm.h | 2 +-
mm/madvise.c | 9 +++++----
mm/memory-failure.c | 11 +++++++----
4 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4e3b61cda520..3ab25b05d5e8 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -551,7 +551,7 @@ store_soft_offline_page(struct device *dev,
pfn >>= PAGE_SHIFT;
if (!pfn_valid(pfn))
return -ENXIO;
- ret = soft_offline_page(pfn_to_page(pfn), 0);
+ ret = soft_offline_page(pfn_to_page(pfn), 0, NULL);
return ret == 0 ? count : ret;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a47fc92..d392fa090ab5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2467,7 +2467,7 @@ extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
extern void shake_page(struct page *p, int access);
extern atomic_long_t num_poisoned_pages;
-extern int soft_offline_page(struct page *page, int flags);
+extern int soft_offline_page(struct page *page, int flags, int *split);
/*
diff --git a/mm/madvise.c b/mm/madvise.c
index 49f6774db259..857255db404a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -634,17 +634,18 @@ static int madvise_inject_error(int behavior,
}
if (behavior == MADV_SOFT_OFFLINE) {
+ int split = 0;
+
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
page_to_pfn(page), start);
- ret = soft_offline_page(page, MF_COUNT_INCREASED);
+ ret = soft_offline_page(page, MF_COUNT_INCREASED, &split);
if (ret)
return ret;
/*
- * Non hugetlb pages either have PAGE_SIZE
- * or are split into PAGE_SIZE
+ * If the page is split, page_size should be changed.
*/
- if (!PageHuge(page))
+ if (split)
page_size = PAGE_SIZE;
continue;
}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8b8ff29412b6..8a9ac6f9e1b0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1361,7 +1361,7 @@ static void memory_failure_work_func(struct work_struct *work)
if (!gotten)
break;
if (entry.flags & MF_SOFT_OFFLINE)
- soft_offline_page(pfn_to_page(entry.pfn), entry.flags);
+ soft_offline_page(pfn_to_page(entry.pfn), entry.flags, NULL);
else
memory_failure(entry.pfn, entry.trapno, entry.flags);
}
@@ -1678,7 +1678,7 @@ static int __soft_offline_page(struct page *page, int flags)
return ret;
}
-static int soft_offline_in_use_page(struct page *page, int flags)
+static int soft_offline_in_use_page(struct page *page, int flags, int *split)
{
int ret;
struct page *hpage = compound_head(page);
@@ -1694,6 +1694,8 @@ static int soft_offline_in_use_page(struct page *page, int flags)
put_hwpoison_page(hpage);
return -EBUSY;
}
+ if (split)
+ *split = 1;
unlock_page(hpage);
get_hwpoison_page(page);
put_hwpoison_page(hpage);
@@ -1722,6 +1724,7 @@ static void soft_offline_free_page(struct page *page)
* soft_offline_page - Soft offline a page.
* @page: page to offline
* @flags: flags. Same as memory_failure().
+ * @split: output. Tells if page is split or not.
*
* Returns 0 on success, otherwise negated errno.
*
@@ -1740,7 +1743,7 @@ static void soft_offline_free_page(struct page *page)
* This is not a 100% solution for all memory, but tries to be
* ``good enough'' for the majority of memory.
*/
-int soft_offline_page(struct page *page, int flags)
+int soft_offline_page(struct page *page, int flags, int *split)
{
int ret;
unsigned long pfn = page_to_pfn(page);
@@ -1757,7 +1760,7 @@ int soft_offline_page(struct page *page, int flags)
put_online_mems();
if (ret > 0)
- ret = soft_offline_in_use_page(page, flags);
+ ret = soft_offline_in_use_page(page, flags, split);
else if (ret == 0)
soft_offline_free_page(page);
--
2.13.2
From: Zi Yan <[email protected]>
The loop in madvise_inject_error() reads its step size from a page
after it is soft-offlined. It works because the page is:
1) a hugetlb page: the page size does not change;
2) a base page: the page size does not change;
3) a THP: soft-offline always splits THPs, thus, it is OK to use
PAGE_SIZE as step size.
It will become a problem when soft-offline supports THP migration.
When a THP is migrated without being split during soft-offlining, the THP
is split after migration, so the page sizes before and after soft-offlining
do not match. This causes a THP to be unnecessarily soft-offlined up to
511 extra times, wasting free memory.
Signed-off-by: Zi Yan <[email protected]>
---
mm/madvise.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 47d8d8a25eae..49f6774db259 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -612,19 +612,22 @@ static long madvise_remove(struct vm_area_struct *vma,
static int madvise_inject_error(int behavior,
unsigned long start, unsigned long end)
{
- struct page *page;
+ struct page *page = NULL;
+ unsigned long page_size = PAGE_SIZE;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
- for (; start < end; start += PAGE_SIZE <<
- compound_order(compound_head(page))) {
+ for (; start < end; start += page_size) {
int ret;
ret = get_user_pages_fast(start, 1, 0, &page);
if (ret != 1)
return ret;
+ page_size = (PAGE_SIZE << compound_order(compound_head(page))) -
+ (PAGE_SIZE * (page - compound_head(page)));
+
if (PageHWPoison(page)) {
put_page(page);
continue;
@@ -637,6 +640,12 @@ static int madvise_inject_error(int behavior,
ret = soft_offline_page(page, MF_COUNT_INCREASED);
if (ret)
return ret;
+ /*
+ * Non hugetlb pages either have PAGE_SIZE
+ * or are split into PAGE_SIZE
+ */
+ if (!PageHuge(page))
+ page_size = PAGE_SIZE;
continue;
}
pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
@@ -645,6 +654,12 @@ static int madvise_inject_error(int behavior,
ret = memory_failure(page_to_pfn(page), 0, MF_COUNT_INCREASED);
if (ret)
return ret;
+ /*
+ * Non hugetlb pages either have PAGE_SIZE
+ * or are split into PAGE_SIZE
+ */
+ if (!PageHuge(page))
+ page_size = PAGE_SIZE;
}
return 0;
}
--
2.13.2
From: Zi Yan <[email protected]>
This patch enables THP migration for soft offline.
Signed-off-by: Zi Yan <[email protected]>
---
mm/memory-failure.c | 19 -------------------
1 file changed, 19 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c05107548d72..02ae1aff51a4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1712,25 +1712,6 @@ static int __soft_offline_page(struct page *page, int flags, int *split)
static int soft_offline_in_use_page(struct page *page, int flags, int *split)
{
int ret;
- struct page *hpage = compound_head(page);
-
- if (!PageHuge(page) && PageTransHuge(hpage)) {
- lock_page(hpage);
- if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
- unlock_page(hpage);
- if (!PageAnon(hpage))
- pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
- else
- pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
- put_hwpoison_page(hpage);
- return -EBUSY;
- }
- if (split)
- *split = 1;
- unlock_page(hpage);
- get_hwpoison_page(page);
- put_hwpoison_page(hpage);
- }
if (PageHuge(page))
ret = soft_offline_huge_page(page, flags);
--
2.13.2
From: Zi Yan <[email protected]>
For THP soft-offline support, we first try to migrate a THP without
splitting. If the migration fails, we split the THP and migrate the
raw error page.
migrate_pages() does not split a THP if the migration reason is
MR_MEMORY_FAILURE.
Signed-off-by: Zi Yan <[email protected]>
---
mm/memory-failure.c | 77 +++++++++++++++++++++++++++++++++++++----------------
mm/migrate.c | 16 +++++++++++
2 files changed, 70 insertions(+), 23 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8a9ac6f9e1b0..c05107548d72 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1598,10 +1598,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
return ret;
}
-static int __soft_offline_page(struct page *page, int flags)
+static int __soft_offline_page(struct page *page, int flags, int *split)
{
int ret;
- unsigned long pfn = page_to_pfn(page);
+ struct page *hpage = compound_head(page);
+ unsigned long pfn = page_to_pfn(hpage);
/*
* Check PageHWPoison again inside page lock because PageHWPoison
@@ -1609,11 +1610,11 @@ static int __soft_offline_page(struct page *page, int flags)
* memory_failure() also double-checks PageHWPoison inside page lock,
* so there's no race between soft_offline_page() and memory_failure().
*/
- lock_page(page);
- wait_on_page_writeback(page);
- if (PageHWPoison(page)) {
- unlock_page(page);
- put_hwpoison_page(page);
+ lock_page(hpage);
+ wait_on_page_writeback(hpage);
+ if (PageHWPoison(hpage)) {
+ unlock_page(hpage);
+ put_hwpoison_page(hpage);
pr_info("soft offline: %#lx page already poisoned\n", pfn);
return -EBUSY;
}
@@ -1621,14 +1622,14 @@ static int __soft_offline_page(struct page *page, int flags)
* Try to invalidate first. This should work for
* non dirty unmapped page cache pages.
*/
- ret = invalidate_inode_page(page);
- unlock_page(page);
+ ret = invalidate_inode_page(hpage);
+ unlock_page(hpage);
/*
* RED-PEN would be better to keep it isolated here, but we
* would need to fix isolation locking first.
*/
if (ret == 1) {
- put_hwpoison_page(page);
+ put_hwpoison_page(hpage);
pr_info("soft_offline: %#lx: invalidated\n", pfn);
SetPageHWPoison(page);
num_poisoned_pages_inc();
@@ -1640,15 +1641,15 @@ static int __soft_offline_page(struct page *page, int flags)
* Try to migrate to a new page instead. migrate.c
* handles a large number of cases for us.
*/
- if (PageLRU(page))
- ret = isolate_lru_page(page);
+ if (PageLRU(hpage))
+ ret = isolate_lru_page(hpage);
else
- ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
+ ret = isolate_movable_page(hpage, ISOLATE_UNEVICTABLE);
/*
* Drop page reference which is came from get_any_page()
* successful isolate_lru_page() already took another one.
*/
- put_hwpoison_page(page);
+ put_hwpoison_page(hpage);
if (!ret) {
LIST_HEAD(pagelist);
/*
@@ -1657,23 +1658,53 @@ static int __soft_offline_page(struct page *page, int flags)
* cannot have PAGE_MAPPING_MOVABLE.
*/
if (!__PageMovable(page))
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
- list_add(&page->lru, &pagelist);
+ mod_node_page_state(page_pgdat(hpage), NR_ISOLATED_ANON +
+ page_is_file_cache(hpage), hpage_nr_pages(hpage));
+retry_subpage:
+ list_add(&hpage->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
- if (!list_empty(&pagelist))
- putback_movable_pages(&pagelist);
-
+ if (!list_empty(&pagelist)) {
+ if (!PageTransHuge(hpage))
+ putback_movable_pages(&pagelist);
+ else {
+ lock_page(hpage);
+ if (split_huge_page_to_list(hpage, &pagelist)) {
+ unlock_page(hpage);
+ goto failed;
+ }
+ unlock_page(hpage);
+
+ if (split)
+ *split = 1;
+ /*
+ * Pull the raw error page out and put back other subpages.
+ * Then retry the raw error page.
+ */
+ list_del(&page->lru);
+ putback_movable_pages(&pagelist);
+ hpage = page;
+ goto retry_subpage;
+ }
+ }
+failed:
pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n",
- pfn, ret, page->flags, &page->flags);
+ pfn, ret, hpage->flags, &hpage->flags);
if (ret > 0)
ret = -EIO;
}
+ /*
+ * Set PageHWPoison on the raw error page.
+ *
+ * If the page is a THP, PageHWPoison is set then cleared
+ * in its head page in migrate_pages(). So we need to set the raw error
+ * page here. Otherwise, setting PageHWPoison again is fine.
+ */
+ SetPageHWPoison(page);
} else {
pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx (%pGp)\n",
- pfn, ret, page_count(page), page->flags, &page->flags);
+ pfn, ret, page_count(hpage), hpage->flags, &hpage->flags);
}
return ret;
}
@@ -1704,7 +1735,7 @@ static int soft_offline_in_use_page(struct page *page, int flags, int *split)
if (PageHuge(page))
ret = soft_offline_huge_page(page, flags);
else
- ret = __soft_offline_page(page, flags);
+ ret = __soft_offline_page(page, flags, split);
return ret;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index f7b69282d216..b44df9cf72fd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1118,6 +1118,15 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
}
if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
+ /*
+ * soft-offline wants to retry the raw error subpage, if the THP
+ * migration fails. So we do not split the THP here and exit directly.
+ */
+ if (reason == MR_MEMORY_FAILURE) {
+ rc = -ENOMEM;
+ goto put_new;
+ }
+
lock_page(page);
rc = split_huge_page(page);
unlock_page(page);
@@ -1164,6 +1173,13 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
*/
if (!test_set_page_hwpoison(page))
num_poisoned_pages_inc();
+
+ /*
+ * Clear PageHWPoison in the head page. The caller
+ * is responsible for setting the raw error page.
+ */
+ if (PageTransHuge(page))
+ ClearPageHWPoison(page);
}
} else {
if (rc != -EAGAIN) {
--
2.13.2
On Mon, Aug 14, 2017 at 09:52:13PM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> The loop in madvise_inject_error() reads its step size from a page
> after it is soft-offlined. It works because the page is:
> 1) a hugetlb page: the page size does not change;
> 2) a base page: the page size does not change;
> 3) a THP: soft-offline always splits THPs, thus, it is OK to use
> PAGE_SIZE as step size.
>
> It will be a problem when soft-offline supports THP migrations.
> When a THP is migrated without split during soft-offlining, the THP
> is split after migration, thus, before and after soft-offlining page
> sizes do not match. This causes a THP to be unnecessarily soft-lined,
> at most, 511 times, wasting free space.
Hi Zi Yan,
Thank you for the suggestion.
I think that when madvise(MADV_SOFT_OFFLINE) is called on a range spanning
more than one 4kB page, the caller clearly intends soft_offline_page() to be
called on every 4kB page within the range, in order to simulate multiple
soft-offline events. Please note that the caller only knows that specific
pages are half-broken, and expects all such pages to be offlined. So the end
result should be the same whether the given range is backed by a THP or not.
>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> mm/madvise.c | 21 ++++++++++++++++++---
> 1 file changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 47d8d8a25eae..49f6774db259 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -612,19 +612,22 @@ static long madvise_remove(struct vm_area_struct *vma,
> static int madvise_inject_error(int behavior,
> unsigned long start, unsigned long end)
> {
> - struct page *page;
> + struct page *page = NULL;
> + unsigned long page_size = PAGE_SIZE;
>
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> - for (; start < end; start += PAGE_SIZE <<
> - compound_order(compound_head(page))) {
> + for (; start < end; start += page_size) {
> int ret;
>
> ret = get_user_pages_fast(start, 1, 0, &page);
> if (ret != 1)
> return ret;
>
> + page_size = (PAGE_SIZE << compound_order(compound_head(page))) -
> + (PAGE_SIZE * (page - compound_head(page)));
> +
Assigning a value that is neither 4kB nor some hugepage size to page_size
might be confusing, because that is not what the name says. You could
introduce a 'next' virtual address; ALIGN() might be helpful to calculate it.
Thanks,
Naoya Horiguchi
> if (PageHWPoison(page)) {
> put_page(page);
> continue;
> @@ -637,6 +640,12 @@ static int madvise_inject_error(int behavior,
> ret = soft_offline_page(page, MF_COUNT_INCREASED);
> if (ret)
> return ret;
> + /*
> + * Non hugetlb pages either have PAGE_SIZE
> + * or are split into PAGE_SIZE
> + */
> + if (!PageHuge(page))
> + page_size = PAGE_SIZE;
> continue;
> }
> pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
> @@ -645,6 +654,12 @@ static int madvise_inject_error(int behavior,
> ret = memory_failure(page_to_pfn(page), 0, MF_COUNT_INCREASED);
> if (ret)
> return ret;
> + /*
> + * Non hugetlb pages either have PAGE_SIZE
> + * or are split into PAGE_SIZE
> + */
> + if (!PageHuge(page))
> + page_size = PAGE_SIZE;
> }
> return 0;
> }
> --
> 2.13.2
>
>
On 23 Aug 2017, at 3:49, Naoya Horiguchi wrote:
> On Mon, Aug 14, 2017 at 09:52:13PM -0400, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> The loop in madvise_inject_error() reads its step size from a page
>> after it is soft-offlined. It works because the page is:
>> 1) a hugetlb page: the page size does not change;
>> 2) a base page: the page size does not change;
>> 3) a THP: soft-offline always splits THPs, thus, it is OK to use
>> PAGE_SIZE as step size.
>>
>> It will be a problem when soft-offline supports THP migrations.
>> When a THP is migrated without split during soft-offlining, the THP
>> is split after migration, thus, before and after soft-offlining page
>> sizes do not match. This causes a THP to be unnecessarily soft-lined,
>> at most, 511 times, wasting free space.
>
> Hi Zi Yan,
>
> Thank you for the suggestion.
>
> I think that when madvise(MADV_SOFT_OFFLINE) is called with some range
> over more than one 4kB page, the caller clearly intends to call
> soft_offline_page() over all 4kB pages within the range in order to
> simulate the multiple soft-offline events. Please note that the caller
> only knows that specific pages are half-broken, and expect that all such
> pages are offlined. So the end result should be same, whether the given
> range is backed by thp or not.
>
But if the given virtual address is backed by a THP and the THP is
soft-offlined without splitting (enabled by the following patches), the old
for-loop will cause an extra 511 THPs to be soft-offlined.
For example, the caller wants to offline VPN 0-511, which is backed by a THP
spanning PFN 0-511. In the first iteration of the for-loop,
get_user_pages_fast(VPN0, ...) returns the THP, and soft_offline_page()
offlines it, replacing it with a new THP, say PFN 512-1023, so VPN 0-511 is
now backed by PFN 512-1023. But the original THP is split after it is freed,
so the for-loop does not end there; it continues to offline VPN1, which
causes PFN 512-1023 to be offlined and replaced by yet another THP, say
PFN 1024-1535. This goes on and ends up offlining 511 extra THPs. That is why
we need this patch to tell whether the THP was offlined as a whole or only
its head page.
Let me know if it is still not clear, or if I missed something.
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> mm/madvise.c | 21 ++++++++++++++++++---
>> 1 file changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 47d8d8a25eae..49f6774db259 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -612,19 +612,22 @@ static long madvise_remove(struct vm_area_struct *vma,
>> static int madvise_inject_error(int behavior,
>> unsigned long start, unsigned long end)
>> {
>> - struct page *page;
>> + struct page *page = NULL;
>> + unsigned long page_size = PAGE_SIZE;
>>
>> if (!capable(CAP_SYS_ADMIN))
>> return -EPERM;
>>
>> - for (; start < end; start += PAGE_SIZE <<
>> - compound_order(compound_head(page))) {
>> + for (; start < end; start += page_size) {
>> int ret;
>>
>> ret = get_user_pages_fast(start, 1, 0, &page);
>> if (ret != 1)
>> return ret;
>>
>> + page_size = (PAGE_SIZE << compound_order(compound_head(page))) -
>> + (PAGE_SIZE * (page - compound_head(page)));
>> +
>
> Assigning a value which is not 4kB or some hugepage size into page_size
> might be confusing because that's not what the name says. You can introduce
> 'next' virtual address and ALIGN() might be helpful to calculate it.
Like:
next = ALIGN(start, PAGE_SIZE<<compound_order(compound_head(page))) - start;
I think it works. Thanks.
--
Best Regards
Yan Zi
On Wed, Aug 23, 2017 at 10:20:02AM -0400, Zi Yan wrote:
> On 23 Aug 2017, at 3:49, Naoya Horiguchi wrote:
>
> > On Mon, Aug 14, 2017 at 09:52:13PM -0400, Zi Yan wrote:
> >> From: Zi Yan <[email protected]>
> >>
> >> The loop in madvise_inject_error() reads its step size from a page
> >> after it is soft-offlined. It works because the page is:
> >> 1) a hugetlb page: the page size does not change;
> >> 2) a base page: the page size does not change;
> >> 3) a THP: soft-offline always splits THPs, thus, it is OK to use
> >> PAGE_SIZE as step size.
> >>
> >> It will be a problem when soft-offline supports THP migrations.
> >> When a THP is migrated without split during soft-offlining, the THP
> >> is split after migration, thus, before and after soft-offlining page
> >> sizes do not match. This causes a THP to be unnecessarily soft-lined,
> >> at most, 511 times, wasting free space.
> >
> > Hi Zi Yan,
> >
> > Thank you for the suggestion.
> >
> > I think that when madvise(MADV_SOFT_OFFLINE) is called with some range
> > over more than one 4kB page, the caller clearly intends to call
> > soft_offline_page() over all 4kB pages within the range in order to
> > simulate the multiple soft-offline events. Please note that the caller
> > only knows that specific pages are half-broken, and expect that all such
> > pages are offlined. So the end result should be same, whether the given
> > range is backed by thp or not.
> >
>
> But if the given virtual address is backed by a THP and the THP is soft-offlined
> without splitting (enabled by following patches), the old for-loop will cause extra
> 511 THPs being soft-offlined.
>
> For example, the caller wants to offline VPN 0-511, which is backed by a THP whose
> address range is PFN 0-511. In the first iteration of the for-loop,
> get_user_pages_fast(VPN0, ...) will return the THP and soft_offline_page() will offline the THP,
> replacing it with a new THP, say PFN 512-1023, so VPN 0-511 is backed by PFN 512-1023.
> But the original THP will be split after it is freed, thus, for-loop will not end
> at this moment, but continues to offline VPN1, which leads to PFN 512-1023 being offlined
> and replaced by another THP, say 1024-1535. This will go on and end up with
> 511 extra THPs are offlined. That is why we need to this patch to tell
> whether the THP is offlined as a whole or just its head page is offlined.
Thanks for elaborating this. I understand your point.
But I am still not sure what the best behavior is.
madvise(MADV_SOFT_OFFLINE) is a test feature, and calling it on a multi-page
range works like a stress test. So multiple THP migrations seem to me to be
the expected behavior. At least it behaves in the same manner as calling
madvise(MADV_SOFT_OFFLINE) 512 times on VPN0-VPN511 separately, which is
consistent.
So I would prefer to keep the current behavior, as long as your later
patches work without this change.
Thanks,
Naoya Horiguchi
>
> Let me know if it is still not clear to you. Or I missed something.
>
> >>
> >> Signed-off-by: Zi Yan <[email protected]>
> >> ---
> >> mm/madvise.c | 21 ++++++++++++++++++---
> >> 1 file changed, 18 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 47d8d8a25eae..49f6774db259 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -612,19 +612,22 @@ static long madvise_remove(struct vm_area_struct *vma,
> >> static int madvise_inject_error(int behavior,
> >> unsigned long start, unsigned long end)
> >> {
> >> - struct page *page;
> >> + struct page *page = NULL;
> >> + unsigned long page_size = PAGE_SIZE;
> >>
> >> if (!capable(CAP_SYS_ADMIN))
> >> return -EPERM;
> >>
> >> - for (; start < end; start += PAGE_SIZE <<
> >> - compound_order(compound_head(page))) {
> >> + for (; start < end; start += page_size) {
> >> int ret;
> >>
> >> ret = get_user_pages_fast(start, 1, 0, &page);
> >> if (ret != 1)
> >> return ret;
> >>
> >> + page_size = (PAGE_SIZE << compound_order(compound_head(page))) -
> >> + (PAGE_SIZE * (page - compound_head(page)));
> >> +
> >
> > Assigning a value which is not 4kB or some hugepage size into page_size
> > might be confusing because that's not what the name says. You can introduce
> > 'next' virtual address and ALIGN() might be helpful to calculate it.
>
> Like:
>
> next = ALIGN(start, PAGE_SIZE<<compound_order(compound_head(page))) - start;
>
> I think it works. Thanks.
>
>
> --
> Best Regards
> Yan Zi
On Mon, Aug 14, 2017 at 09:52:15PM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> For THP soft-offline support, we first try to migrate a THP without
> splitting. If the migration fails, we split the THP and migrate the
> raw error page.
>
> migrate_pages() does not split a THP if the migration reason is
> MR_MEMORY_FAILURE.
>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> mm/memory-failure.c | 77 +++++++++++++++++++++++++++++++++++++----------------
> mm/migrate.c | 16 +++++++++++
> 2 files changed, 70 insertions(+), 23 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 8a9ac6f9e1b0..c05107548d72 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
...
> @@ -1657,23 +1658,53 @@ static int __soft_offline_page(struct page *page, int flags)
> * cannot have PAGE_MAPPING_MOVABLE.
> */
> if (!__PageMovable(page))
> - inc_node_page_state(page, NR_ISOLATED_ANON +
> - page_is_file_cache(page));
> - list_add(&page->lru, &pagelist);
> + mod_node_page_state(page_pgdat(hpage), NR_ISOLATED_ANON +
> + page_is_file_cache(hpage), hpage_nr_pages(hpage));
> +retry_subpage:
> + list_add(&hpage->lru, &pagelist);
> ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> MIGRATE_SYNC, MR_MEMORY_FAILURE);
> if (ret) {
> - if (!list_empty(&pagelist))
> - putback_movable_pages(&pagelist);
> -
> + if (!list_empty(&pagelist)) {
> + if (!PageTransHuge(hpage))
> + putback_movable_pages(&pagelist);
> + else {
> + lock_page(hpage);
> + if (split_huge_page_to_list(hpage, &pagelist)) {
> + unlock_page(hpage);
> + goto failed;
Don't you need to call putback_movable_pages() before returning?
> + }
> + unlock_page(hpage);
> +
> + if (split)
> + *split = 1;
> + /*
> + * Pull the raw error page out and put back other subpages.
> + * Then retry the raw error page.
> + */
> + list_del(&page->lru);
> + putback_movable_pages(&pagelist);
If putback_movable_pages() is not called for the raw error page, doesn't the
NR_ISOLATED_ANON or NR_ISOLATED_FILE stat remain incremented by 1 after return?
> + hpage = page;
> + goto retry_subpage;
> + }
> + }
> +failed:
> pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n",
> - pfn, ret, page->flags, &page->flags);
> + pfn, ret, hpage->flags, &hpage->flags);
> if (ret > 0)
> ret = -EIO;
> }
> + /*
> + * Set PageHWPoison on the raw error page.
> + *
> + * If the page is a THP, PageHWPoison is set then cleared
> + * in its head page in migrate_pages(). So we need to set the raw error
> + * page here. Otherwise, setting PageHWPoison again is fine.
> + */
> + SetPageHWPoison(page);
When we fail to migrate the page, we don't have to set PageHWPoison, because
we can still continue to use the page (it tends to cause corrected errors but
is still usable) and we have a chance to retry soft-offline later. So please
set the flag only when page/THP migration succeeds.
...
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7b69282d216..b44df9cf72fd 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1118,6 +1118,15 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> }
>
> if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
> + /*
> + * soft-offline wants to retry the raw error subpage, if the THP
> + * migration fails. So we do not split the THP here and exit directly.
> + */
> + if (reason == MR_MEMORY_FAILURE) {
> + rc = -ENOMEM;
> + goto put_new;
> + }
> +
This might imply that the existing THP migration code has room for improvement.
> lock_page(page);
> rc = split_huge_page(page);
> unlock_page(page);
This code probably doesn't work. Even when split_huge_page() succeeds
(rc = 0), __unmap_and_move() migrates only the head page and all tail pages
are ignored (note that split_huge_page_to_list() sends tail pages to the LRU
list when the 'list' parameter is NULL). IOW, the retry logic in
migrate_pages() doesn't work for THP migration.
What I think is OK is that
- when thp migration from soft offline fails,
1. split the thp
2. send only raw error page to migration page list and send the other
healthy subpages to LRU list
3. retry loop in migrate_pages()
- when thp migration from other callers fails,
1. split the thp
2. send all subpages to migration page list
3. retry loop in migrate_pages()
> @@ -1164,6 +1173,13 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> */
> if (!test_set_page_hwpoison(page))
> num_poisoned_pages_inc();
> +
> + /*
> + * Clear PageHWPoison in the head page. The caller
> + * is responsible for setting the raw error page.
> + */
> + if (PageTransHuge(page))
> + ClearPageHWPoison(page);
I think this can be written more simply by passing a pointer to the raw
error page in from the caller. That would also help with the
above-mentioned split-retry issue.
unmap_and_move() has a 'private' parameter which is currently only used by
get/put_new_page(), so I think we can extend it for this purpose.
Thanks,
Naoya Horiguchi
On 24 Aug 2017, at 0:26, Naoya Horiguchi wrote:
> On Wed, Aug 23, 2017 at 10:20:02AM -0400, Zi Yan wrote:
>> On 23 Aug 2017, at 3:49, Naoya Horiguchi wrote:
>>
>>> On Mon, Aug 14, 2017 at 09:52:13PM -0400, Zi Yan wrote:
>>>> From: Zi Yan <[email protected]>
>>>>
>>>> The loop in madvise_inject_error() reads its step size from a page
>>>> after it is soft-offlined. It works because the page is:
>>>> 1) a hugetlb page: the page size does not change;
>>>> 2) a base page: the page size does not change;
>>>> 3) a THP: soft-offline always splits THPs, thus, it is OK to use
>>>> PAGE_SIZE as step size.
>>>>
>>>> It will be a problem when soft-offline supports THP migrations.
>>>> When a THP is migrated without split during soft-offlining, the THP
>>>> is split after migration, thus, before and after soft-offlining page
>>>> sizes do not match. This causes a THP to be unnecessarily soft-lined,
>>>> at most, 511 times, wasting free space.
>>>
>>> Hi Zi Yan,
>>>
>>> Thank you for the suggestion.
>>>
>>> I think that when madvise(MADV_SOFT_OFFLINE) is called with some range
>>> over more than one 4kB page, the caller clearly intends to call
>>> soft_offline_page() over all 4kB pages within the range in order to
>>> simulate the multiple soft-offline events. Please note that the caller
>>> only knows that specific pages are half-broken, and expect that all such
>>> pages are offlined. So the end result should be same, whether the given
>>> range is backed by thp or not.
>>>
>>
>> But if the given virtual address is backed by a THP and the THP is soft-offlined
>> without splitting (enabled by following patches), the old for-loop will cause extra
>> 511 THPs being soft-offlined.
>>
>> For example, the caller wants to offline VPN 0-511, which is backed by a THP whose
>> address range is PFN 0-511. In the first iteration of the for-loop,
>> get_user_pages_fast(VPN0, ...) will return the THP and soft_offline_page() will offline the THP,
>> replacing it with a new THP, say PFN 512-1023, so VPN 0-511 is backed by PFN 512-1023.
>> But the original THP will be split after it is freed, thus, for-loop will not end
>> at this moment, but continues to offline VPN1, which leads to PFN 512-1023 being offlined
>> and replaced by another THP, say 1024-1535. This will go on and end up with
>> 511 extra THPs are offlined. That is why we need to this patch to tell
>> whether the THP is offlined as a whole or just its head page is offlined.
>
> Thanks for elaborating this. I understand your point.
> But I still not sure what the best behavior is.
>
> madvise(MADV_SOFT_OFFLINE) is a test feature and giving multi-page range
> on the call works like some stress testing. So multiple thp migrations
> seem to me an expected behavior. At least it behaves in the same manner
> as calling madvise(MADV_SOFT_OFFLINE) 512 times on VPN0-VPN511 separately,
> which is consistent.
>
> So I still feel like leaving the current behavior as long as your later
> patches work without this change.
Sure. I will drop Patches 1 and 2 in the next version.
Thanks for your explanation.
--
Best Regards
Yan Zi