Testing the latest Linux git tree turned up a few bugs with page
migration to and from ZONE_DEVICE private memory and anonymous pages.
Hopefully this series also clarifies how a ZONE_DEVICE private struct
page reuses the mapping and index fields from the source anonymous
page mapping.
Patch #3 was sent earlier and this is v2 with an updated change log.
http://lkml.kernel.org/r/[email protected]
Ralph Campbell (3):
mm: document zone device struct page reserved fields
mm/hmm: fix ZONE_DEVICE anon page mapping reuse
mm/hmm: Fix bad subpage pointer in try_to_unmap_one
include/linux/mm_types.h | 9 ++++++++-
kernel/memremap.c | 4 ++++
mm/rmap.c | 1 +
3 files changed, 13 insertions(+), 1 deletion(-)
--
2.20.1
Struct page for ZONE_DEVICE private pages uses the reserved fields when
anonymous pages are migrated to device private memory. This is so
the page->mapping and page->index fields are preserved and the page can
be migrated back to system memory.
Document this in comments so it is more clear.
Signed-off-by: Ralph Campbell <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
---
include/linux/mm_types.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3a37a89eb7a7..d6ea74e20306 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -159,7 +159,14 @@ struct page {
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
void *zone_device_data;
- unsigned long _zd_pad_1; /* uses mapping */
+ /*
+ * The following fields are used to hold the source
+ * page anonymous mapping information while it is
+ * migrated to device memory. See migrate_page().
+ */
+ unsigned long _zd_pad_1; /* aliases mapping */
+ unsigned long _zd_pad_2; /* aliases index */
+ unsigned long _zd_pad_3; /* aliases private */
};
/** @rcu_head: You can use this to free a page by RCU. */
--
2.20.1
When a ZONE_DEVICE private page is freed, the page->mapping field may
still be set. If this page is then reused as an anonymous page, the
stale value can prevent the page from being inserted into the CPU's
anon rmap table.
For example, when migrating a pte_none() page to device memory:
migrate_vma(ops, vma, start, end, src, dst, private)
migrate_vma_collect()
src[] = MIGRATE_PFN_MIGRATE
migrate_vma_prepare()
/* no page to lock or isolate so OK */
migrate_vma_unmap()
/* no page to unmap so OK */
ops->alloc_and_copy()
/* driver allocates ZONE_DEVICE page for dst[] */
migrate_vma_pages()
migrate_vma_insert_page()
page_add_new_anon_rmap()
__page_set_anon_rmap()
/* This check sees the page's stale mapping field */
if (PageAnon(page))
return
/* page->mapping is not updated */
The result is that the migration appears to succeed but a subsequent CPU
fault will be unable to migrate the page back to system memory or worse.
Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
pointer data doesn't affect future page use.
Fixes: b7a523109fb5c9d2d6dd ("mm: don't clear ->mapping in hmm_devmem_free")
Cc: [email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
---
kernel/memremap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index bea6f887adad..238ae5d0ae8a 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -408,6 +408,10 @@ void __put_devmap_managed_page(struct page *page)
mem_cgroup_uncharge(page);
+ /* Clear anonymous page mapping to prevent stale pointers */
+ if (is_device_private_page(page))
+ page->mapping = NULL;
+
page->pgmap->ops->page_free(page);
} else if (!count)
__put_page(page);
--
2.20.1
When migrating an anonymous private page to a ZONE_DEVICE private page,
the source page->mapping and page->index fields are copied to the
destination ZONE_DEVICE struct page and the page_mapcount() is increased.
This is so rmap_walk() can be used to unmap and migrate the page back to
system memory. However, try_to_unmap_one() computes the subpage pointer
from a swap pte, which yields an invalid page pointer, and a kernel panic
results, such as:
BUG: unable to handle page fault for address: ffffea1fffffffc8
Currently, only single pages can be migrated to device private memory so
no subpage computation is needed and it can be set to "page".
Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Signed-off-by: Ralph Campbell <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/rmap.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..ec1af8b60423 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1476,6 +1476,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* No need to invalidate here it will synchronize on
* against the special swap migration pte.
*/
+ subpage = page;
goto discard;
}
--
2.20.1
On 7/16/19 5:14 PM, Ralph Campbell wrote:
> Struct page for ZONE_DEVICE private pages uses the reserved fields when
> anonymous pages are migrated to device private memory. This is so
> the page->mapping and page->index fields are preserved and the page can
> be migrated back to system memory.
> Document this in comments so it is more clear.
>
> Signed-off-by: Ralph Campbell <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Jérôme Glisse <[email protected]>
> Cc: "Kirill A . Shutemov" <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Andrey Ryabinin <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> ---
> include/linux/mm_types.h | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3a37a89eb7a7..d6ea74e20306 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -159,7 +159,14 @@ struct page {
> /** @pgmap: Points to the hosting device page map. */
> struct dev_pagemap *pgmap;
> void *zone_device_data;
> - unsigned long _zd_pad_1; /* uses mapping */
> + /*
> + * The following fields are used to hold the source
> + * page anonymous mapping information while it is
> + * migrated to device memory. See migrate_page().
> + */
> + unsigned long _zd_pad_1; /* aliases mapping */
> + unsigned long _zd_pad_2; /* aliases index */
> + unsigned long _zd_pad_3; /* aliases private */
Actually, I do think this helps. It's hard to document these fields, and
the ZONE_DEVICE pages have a really complicated situation during migration
to a device.
Additionally, I'm not sure, but should we go even further, and do this on the
other side of the alias:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6ea74e20306..c5ce5989d8a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -83,7 +83,12 @@ struct page {
* by the page owner.
*/
struct list_head lru;
- /* See page-flags.h for PAGE_MAPPING_FLAGS */
+ /*
+ * See page-flags.h for PAGE_MAPPING_FLAGS.
+ *
+ * Also: the next three fields (mapping, index and
+ * private) are all used by ZONE_DEVICE pages.
+ */
struct address_space *mapping;
pgoff_t index; /* Our offset within mapping. */
/**
?
Either way, you can add:
Reviewed-by: John Hubbard <[email protected]>
thanks,
--
John Hubbard
NVIDIA
On 7/16/19 5:14 PM, Ralph Campbell wrote:
> When a ZONE_DEVICE private page is freed, the page->mapping field can be
> set. If this page is reused as an anonymous page, the previous value can
> prevent the page from being inserted into the CPU's anon rmap table.
> For example, when migrating a pte_none() page to device memory:
> migrate_vma(ops, vma, start, end, src, dst, private)
> migrate_vma_collect()
> src[] = MIGRATE_PFN_MIGRATE
> migrate_vma_prepare()
> /* no page to lock or isolate so OK */
> migrate_vma_unmap()
> /* no page to unmap so OK */
> ops->alloc_and_copy()
> /* driver allocates ZONE_DEVICE page for dst[] */
> migrate_vma_pages()
> migrate_vma_insert_page()
> page_add_new_anon_rmap()
> __page_set_anon_rmap()
> /* This check sees the page's stale mapping field */
> if (PageAnon(page))
> return
> /* page->mapping is not updated */
>
> The result is that the migration appears to succeed but a subsequent CPU
> fault will be unable to migrate the page back to system memory or worse.
>
> Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
> pointer data doesn't affect future page use.
>
> Fixes: b7a523109fb5c9d2d6dd ("mm: don't clear ->mapping in hmm_devmem_free")
> Cc: [email protected]
> Signed-off-by: Ralph Campbell <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Logan Gunthorpe <[email protected]>
> Cc: Ira Weiny <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Mike Kravetz <[email protected]>
> Cc: "Jérôme Glisse" <[email protected]>
> ---
> kernel/memremap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index bea6f887adad..238ae5d0ae8a 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -408,6 +408,10 @@ void __put_devmap_managed_page(struct page *page)
>
> mem_cgroup_uncharge(page);
>
> + /* Clear anonymous page mapping to prevent stale pointers */
This is sufficiently complex that some concise form of the documentation
you've put in the commit description needs to also exist right here, as
a comment.
How's this read:
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 238ae5d0ae8a..e52e9da5d0a7 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -408,7 +408,27 @@ void __put_devmap_managed_page(struct page *page)
mem_cgroup_uncharge(page);
- /* Clear anonymous page mapping to prevent stale pointers */
+ /*
+ * When a device_private page is freed, the page->mapping field
+ * may still contain a (stale) mapping value. For example, the
+ * lower bits of page->mapping may still identify the page as an
+ * anonymous page. Ultimately, this entire field is just stale
+ * and wrong, and it will cause errors if not cleared. One
+ * example is:
+ *
+ * migrate_vma_pages()
+ * migrate_vma_insert_page()
+ * page_add_new_anon_rmap()
+ * __page_set_anon_rmap()
+ * ...checks page->mapping, via PageAnon(page) call,
+ * and incorrectly concludes that the page is an
+ * anonymous page. Therefore, it incorrectly,
+ * silently fails to set up the new anon rmap.
+ *
+ * For other types of ZONE_DEVICE pages, migration is either
+ * handled differently or not done at all, so there is no need
+ * to clear page->mapping.
+ */
if (is_device_private_page(page))
page->mapping = NULL;
?
thanks,
--
John Hubbard
NVIDIA
On 7/16/19 5:14 PM, Ralph Campbell wrote:
> When migrating an anonymous private page to a ZONE_DEVICE private page,
> the source page->mapping and page->index fields are copied to the
> destination ZONE_DEVICE struct page and the page_mapcount() is increased.
> This is so rmap_walk() can be used to unmap and migrate the page back to
> system memory. However, try_to_unmap_one() computes the subpage pointer
> from a swap pte which computes an invalid page pointer and a kernel panic
> results such as:
>
> BUG: unable to handle page fault for address: ffffea1fffffffc8
>
> Currently, only single pages can be migrated to device private memory so
> no subpage computation is needed and it can be set to "page".
>
> Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
> Signed-off-by: Ralph Campbell <[email protected]>
> Cc: "Jérôme Glisse" <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Mike Kravetz <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
> mm/rmap.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e5dfe2ae6b0d..ec1af8b60423 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1476,6 +1476,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> * No need to invalidate here it will synchronize on
> * against the special swap migration pte.
> */
> + subpage = page;
> goto discard;
> }
The problem is clear, but the solution still leaves the code ever so slightly
more confusing, and it was already pretty difficult to begin with.
I still hold out hope for some comment documentation at least, and maybe
even just removing the subpage variable (as Jerome mentioned, offline) as
well.
Jerome?
thanks,
--
John Hubbard
NVIDIA
On Tue, Jul 16, 2019 at 06:20:23PM -0700, John Hubbard wrote:
> > - unsigned long _zd_pad_1; /* uses mapping */
> > + /*
> > + * The following fields are used to hold the source
> > + * page anonymous mapping information while it is
> > + * migrated to device memory. See migrate_page().
> > + */
> > + unsigned long _zd_pad_1; /* aliases mapping */
> > + unsigned long _zd_pad_2; /* aliases index */
> > + unsigned long _zd_pad_3; /* aliases private */
>
> Actually, I do think this helps. It's hard to document these fields, and
> the ZONE_DEVICE pages have a really complicated situation during migration
> to a device.
>
> Additionally, I'm not sure, but should we go even further, and do this on the
> other side of the alias:
The _zd_pad_* fields obviously are NOT used anywhere in the source tree.
So these comments are very misleading. If we still keep
using ->mapping, ->index and ->private we really should clean up the
definition of struct page to make that obvious instead of trying to
doctor around it using comments.
On 7/16/19 9:22 PM, Christoph Hellwig wrote:
> On Tue, Jul 16, 2019 at 06:20:23PM -0700, John Hubbard wrote:
>>> - unsigned long _zd_pad_1; /* uses mapping */
>>> + /*
>>> + * The following fields are used to hold the source
>>> + * page anonymous mapping information while it is
>>> + * migrated to device memory. See migrate_page().
>>> + */
>>> + unsigned long _zd_pad_1; /* aliases mapping */
>>> + unsigned long _zd_pad_2; /* aliases index */
>>> + unsigned long _zd_pad_3; /* aliases private */
>>
>> Actually, I do think this helps. It's hard to document these fields, and
>> the ZONE_DEVICE pages have a really complicated situation during migration
>> to a device.
>>
>> Additionally, I'm not sure, but should we go even further, and do this on the
>> other side of the alias:
>
> The _zd_pad_* field obviously are NOT used anywhere in the source tree.
> So these comments are very misleading. If we still keep
> using ->mapping, ->index and ->private we really should clean up the
> definition of struct page to make that obvious instead of trying to
> doctor around it using comments.
>
OK, so just delete all the _zd_pad_* fields? Works for me. It's misleading
to call something padding if it's actually unavailable because it's used
in the other union, so deleting would be even better than commenting.
In that case, it would still be nice to have this new snippet, right?:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6ea74e20306..c5ce5989d8a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -83,7 +83,12 @@ struct page {
* by the page owner.
*/
struct list_head lru;
- /* See page-flags.h for PAGE_MAPPING_FLAGS */
+ /*
+ * See page-flags.h for PAGE_MAPPING_FLAGS.
+ *
+ * Also: the next three fields (mapping, index and
+ * private) are all used by ZONE_DEVICE pages.
+ */
struct address_space *mapping;
pgoff_t index; /* Our offset within mapping. */
/**
thanks,
--
John Hubbard
NVIDIA
On Tue, Jul 16, 2019 at 09:31:33PM -0700, John Hubbard wrote:
> OK, so just delete all the _zd_pad_* fields? Works for me. It's misleading to
> calling something padding, if it's actually unavailable because it's used
> in the other union, so deleting would be even better than commenting.
>
> In that case, it would still be nice to have this new snippet, right?:
I hope willy can chime in a bit with his thoughts on what the union in
struct page should look like. The padding at the end of the sub-structs
certainly looks pointless, and other places don't use it either. But if
we are using the other fields it almost seems to me like we only want to
union the lru field in the first sub-struct instead of overlaying most
of it.
On 7/16/19 9:38 PM, Christoph Hellwig wrote:
> On Tue, Jul 16, 2019 at 09:31:33PM -0700, John Hubbard wrote:
>> OK, so just delete all the _zd_pad_* fields? Works for me. It's misleading to
>> calling something padding, if it's actually unavailable because it's used
>> in the other union, so deleting would be even better than commenting.
>>
>> In that case, it would still be nice to have this new snippet, right?:
>
> I hope willy can chime in a bit on his thoughts about how the union in
> struct page should look like. The padding at the end of the sub-structs
> certainly looks pointless, and other places don't use it either. But if
> we are using the other fields it almost seems to me like we only want to
> union the lru field in the first sub-struct instead of overlaying most
> of it.
>
I like this approach.
I'll work on an updated patch that makes "struct list_head lru" part
of a union with the ZONE_DEVICE struct without the padding and update
the comments and change log.
I will also wait a day or two for others to add their comments.