2013-08-13 07:05:05

by Minchan Kim

Subject: [RFC 0/3] Pin page control subsystem

!! NOTICE !!
This patchset is totally untested, so please AVOID real testing.
I'd just like to show the concept and discuss it at a very early stage.
(So there isn't much description, but I think the code is simple enough
that the intention is not hard to understand.)

This patchset aims to solve the *kernel* pinned-page migration problem in
a more general way. Currently zswap, zram and the rest of the z* family,
and whatever similar solutions are upcoming, use memory in a way that
doesn't live in harmony with the VM.
(I don't remember the details of balloon compaction, but we might be able
to unify balloon compaction with this.)

The VM sometimes wants to migrate and/or reclaim pages for CMA,
memory-hotplug, THP and so on, but at the moment it can handle only
userspace pages. So if one of the subsystems above has pinned a page in a
range the VM wants to migrate, the migration fails and features such as
CMA can't work well.

This patchset provides a basic facility for that role.

Patch 1 introduces a new page flag and patch 2 introduces the pinpage
control subsystem. Subsystems that want to control pinned pages should
implement their own pinpage_xxx functions, because each subsystem has
different characteristics, so the data structure used for managing pinpage
information depends on the subsystem. Otherwise, they can use the general
functions defined in the pinpage subsystem. Patch 3 hacks mm/migrate.c so
that migration is aware of pinned pages and migrates them through the
pinpage subsystem.
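
To make the dispatch concrete, here is a minimal userspace model of it. It is
only a sketch, not kernel code: "struct page" is a toy stand-in, the list_head
is replaced by a plain next pointer, the per-subsystem tree is reduced to one
owned page, and toy_find_page()/toy_migrate() are hypothetical callbacks
invented for illustration. Only the overall shape (register, then ask each
subsystem whether it owns the page and call its migrate hook) follows patch 2.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page (assumption, not the real layout). */
struct page { int pinned; void *owner_data; };

struct pinpage_system {
	int (*migrate)(struct pinpage_system *psys, struct page *page,
		       struct page *newpage);
	int (*find_page)(struct pinpage_system *psys, struct page *page);
	struct pinpage_system *next;	/* stands in for the list_head */
	struct page *owned;		/* toy "tree": one page per subsystem */
	int migrated;			/* counts successful migrate calls */
};

static struct pinpage_system *pinpage_system_list;

/* Add a subsystem to the global list; both callbacks are mandatory here. */
static void register_pinpage(struct pinpage_system *psys)
{
	assert(psys->migrate && psys->find_page);
	psys->next = pinpage_system_list;
	pinpage_system_list = psys;
}

/* Hypothetical find_page callback: does this subsystem own the page? */
static int toy_find_page(struct pinpage_system *psys, struct page *page)
{
	return psys->owned == page;
}

/* Hypothetical migrate callback: move the metadata and the pin to newpage. */
static int toy_migrate(struct pinpage_system *psys, struct page *page,
		       struct page *newpage)
{
	newpage->owner_data = page->owner_data;
	newpage->pinned = 1;
	page->pinned = 0;
	psys->owned = newpage;
	psys->migrated++;
	return 0;
}

/* Same idea as migrate_pinpage(): ask each subsystem in turn. */
static int migrate_pinpage(struct page *page, struct page *newpage)
{
	struct pinpage_system *psys;

	for (psys = pinpage_system_list; psys; psys = psys->next)
		if (psys->find_page(psys, page))
			return psys->migrate(psys, page, newpage);
	return -1;	/* no owner found; the RFC code returns 0 in this case */
}
```

A subsystem would fill in a struct pinpage_system, call register_pinpage()
once, and from then on migrate_pinpage(old, new) finds the owner and lets it
move its own metadata.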

It introduces a new rule: users of the pinpage control subsystem shouldn't
freely use the struct page->flags and struct page->lru fields, because the
lru field is used by mm/migrate.c and the flags field is used for lock_page
in the pinpage control subsystem. I think it's not a big problem because a
subsystem can use other fields of the page descriptor instead.

The limitation of this patchset is that it can't be applied to userspace
pages, although I'd REALLY REALLY like to unify them.
IOW, it can't handle pages pinned for a long time by get_user_pages.
The basic hurdle is how to handle the nesting case where several
subsystems pin the same page with GUP but have different migrate methods.
Handling that would add some complexity and overhead, and I'm not sure
it's worth it, because the only proven culprit so far is AIO ring pages,
and Gu and Benjamin have approached that in another way, so I'd like to
hear their opinions.

Minchan Kim (3):
mm: Introduce new page flag
pinpage control subsystem
mm: migrate pinned page

include/linux/page-flags.h | 2 +
include/linux/pinpage.h | 39 +++++++++++++
mm/Makefile | 2 +-
mm/compaction.c | 26 ++++++++-
mm/migrate.c | 58 ++++++++++++++++---
mm/page_alloc.c | 1 +
mm/pinpage.c | 134 ++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 252 insertions(+), 10 deletions(-)
create mode 100644 include/linux/pinpage.h
create mode 100644 mm/pinpage.c

--
1.7.9.5


2013-08-13 07:05:04

by Minchan Kim

Subject: [RFC 1/3] mm: Introduce new page flag

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/page-flags.h | 2 ++
mm/page_alloc.c | 1 +
2 files changed, 3 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..75ce843 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,7 @@ enum pageflags {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
+ PG_pin,
__NR_PAGEFLAGS,

/* Filesystems */
@@ -197,6 +198,7 @@ struct page; /* forward declaration */

TESTPAGEFLAG(Locked, locked)
PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
+PAGEFLAG(Pin, pin) TESTCLEARFLAG(Pin, pin)
PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..5dd8b43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6345,6 +6345,7 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
{1UL << PG_compound_lock, "compound_lock" },
#endif
+ {1UL << PG_pin, "pin" },
};

static void dump_page_flags(unsigned long flags)
--
1.7.9.5

2013-08-13 07:05:37

by Minchan Kim

Subject: [RFC 3/3] mm: migrate pinned page

Signed-off-by: Minchan Kim <[email protected]>
---
mm/compaction.c | 26 +++++++++++++++++++++++--
mm/migrate.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..16b80e6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -396,8 +396,10 @@ static void acct_isolated(struct zone *zone, bool locked, struct compact_control
struct page *page;
unsigned int count[2] = { 0, };

- list_for_each_entry(page, &cc->migratepages, lru)
- count[!!page_is_file_cache(page)]++;
+ list_for_each_entry(page, &cc->migratepages, lru) {
+ if (!PagePin(page))
+ count[!!page_is_file_cache(page)]++;
+ }

/* If locked we can use the interrupt unsafe versions */
if (locked) {
@@ -535,6 +537,25 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
}

/*
+ * Pinned kernel page(ex, zswap) could be isolated.
+ */
+ if (PagePin(page)) {
+ if (!get_page_unless_zero(page))
+ continue;
+ /*
+ * Subsystems that want to use pinpage should not
+ * use the page->lru field.
+ */
+ VM_BUG_ON(!list_empty(&page->lru));
+ if (!trylock_page(page)) {
+ put_page(page);
+ continue;
+ }
+
+ goto isolated;
+ }
+
+ /*
* Check may be lockless but that's ok as we recheck later.
* It's possible to migrate LRU pages and balloon pages
* Skip any other type of page
@@ -601,6 +622,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
/* Successfully isolated */
cc->finished_update_migrate = true;
del_page_from_lru_list(page, lruvec, page_lru(page));
+isolated:
list_add(&page->lru, migratelist);
cc->nr_migratepages++;
nr_isolated++;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f0c244..4d28049 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
#include <linux/hugetlb_cgroup.h>
#include <linux/gfp.h>
#include <linux/balloon_compaction.h>
+#include <linux/pinpage.h>

#include <asm/tlbflush.h>

@@ -101,12 +102,17 @@ void putback_movable_pages(struct list_head *l)

list_for_each_entry_safe(page, page2, l, lru) {
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
- if (unlikely(balloon_page_movable(page)))
- balloon_page_putback(page);
- else
- putback_lru_page(page);
+ if (!PagePin(page)) {
+ dec_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
+ if (unlikely(balloon_page_movable(page)))
+ balloon_page_putback(page);
+ else
+ putback_lru_page(page);
+ } else {
+ unlock_page(page);
+ put_page(page);
+ }
}
}

@@ -855,6 +861,39 @@ out:
return rc;
}

+static int unmap_and_move_pinpage(new_page_t get_new_page,
+ unsigned long private, struct page *page, int force,
+ enum migrate_mode mode)
+{
+ int *result = NULL;
+ int rc = 0;
+ struct page *newpage = get_new_page(page, private, &result);
+ if (!newpage)
+ return -ENOMEM;
+
+ VM_BUG_ON(!PageLocked(page));
+ if (page_count(page) == 1) {
+ /* page was freed from under us. So we are done. */
+ goto out;
+ }
+
+ rc = migrate_pinpage(page, newpage);
+out:
+ if (rc != -EAGAIN) {
+ list_del(&page->lru);
+ unlock_page(page);
+ put_page(page);
+ }
+
+ if (result) {
+ if (rc)
+ *result = rc;
+ else
+ *result = page_to_nid(newpage);
+ }
+ return rc;
+
+}
/*
* Obtain the lock on page, remove all ptes and migrate the page
* to the newly allocated page in newpage.
@@ -1025,8 +1064,13 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
list_for_each_entry_safe(page, page2, from, lru) {
cond_resched();

- rc = unmap_and_move(get_new_page, private,
+ if (PagePin(page)) {
+ rc = unmap_and_move_pinpage(get_new_page, private,
page, pass > 2, mode);
+ } else {
+ rc = unmap_and_move(get_new_page, private,
+ page, pass > 2, mode);
+ }

switch(rc) {
case -ENOMEM:
--
1.7.9.5

2013-08-13 07:05:52

by Minchan Kim

Subject: [RFC 2/3] pinpage control subsystem

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/pinpage.h | 39 ++++++++++++++
mm/Makefile | 2 +-
mm/pinpage.c | 134 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 174 insertions(+), 1 deletion(-)
create mode 100644 include/linux/pinpage.h
create mode 100644 mm/pinpage.c

diff --git a/include/linux/pinpage.h b/include/linux/pinpage.h
new file mode 100644
index 0000000..42fbdc7
--- /dev/null
+++ b/include/linux/pinpage.h
@@ -0,0 +1,39 @@
+#ifndef _LINUX_PINPAGE_H
+#define _LINUX_PINPAGE_H
+
+#include <linux/radix-tree.h>
+
+/*
+ * NOTE : pinpage_system user shouldn't use page->lru and page->flags
+ * fields.
+ */
+struct pinpage_system {
+ struct radix_tree_root page_tree;
+ spinlock_t tree_lock;
+
+ int (*create_subsys)(struct pinpage_system *psys);
+ int (*destroy_subsys)(struct pinpage_system *psys);
+ int (*migrate)(struct pinpage_system *psys, struct page *page,
+ struct page *newpage);
+ int (*add_page)(struct pinpage_system *psys, struct page *page,
+ void *private);
+ int (*del_page)(struct pinpage_system *psys, struct page *page);
+ int (*find_page)(struct pinpage_system *psys, struct page *page);
+
+ struct list_head list;
+};
+
+extern int general_create_subsys(struct pinpage_system *psys);
+extern int general_destroy_subsys(struct pinpage_system *psys);
+extern int general_add_page(struct pinpage_system *psys, struct page *page,
+ void *private);
+extern int general_del_page(struct pinpage_system *psys, struct page *page);
+extern int general_find_page(struct pinpage_system *psys, struct page *page);
+
+extern int set_pinpage(struct pinpage_system *psys, struct page *page,
+ void *private);
+extern int register_pinpage(struct pinpage_system *psys);
+extern int migrate_pinpage(struct page *page, struct page *newpage);
+
+#endif
+
diff --git a/mm/Makefile b/mm/Makefile
index f008033..bf4a2d9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o pagewalk.o pgtable-generic.o
+ vmalloc.o pagewalk.o pgtable-generic.o pinpage.o

ifdef CONFIG_CROSS_MEMORY_ATTACH
mmu-$(CONFIG_MMU) += process_vm_access.o
diff --git a/mm/pinpage.c b/mm/pinpage.c
new file mode 100644
index 0000000..0833204
--- /dev/null
+++ b/mm/pinpage.c
@@ -0,0 +1,134 @@
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/pinpage.h>
+#include <linux/pagemap.h>
+
+static DEFINE_SPINLOCK(pinpage_system_lock);
+static LIST_HEAD(pinpage_system_list);
+
+struct pinpage_info {
+ unsigned long pfn;
+ void *private;
+};
+
+int general_create_subsys(struct pinpage_system *psys)
+{
+ INIT_RADIX_TREE(&psys->page_tree, GFP_KERNEL);
+ spin_lock_init(&psys->tree_lock);
+ return 0;
+}
+EXPORT_SYMBOL(general_create_subsys);
+
+int general_destroy_subsys(struct pinpage_system *psys)
+{
+ return 0;
+}
+EXPORT_SYMBOL(general_destroy_subsys);
+
+int general_add_page(struct pinpage_system *psys, struct page *page,
+ void *private)
+{
+ int ret = -ENOMEM;
+ unsigned long pfn = page_to_pfn(page);
+ struct pinpage_info *pinfo = kmalloc(sizeof(*pinfo), GFP_KERNEL);
+ if (!pinfo)
+ return ret;
+
+ pinfo->pfn = pfn;
+ pinfo->private = private;
+
+ spin_lock(&psys->tree_lock);
+ ret = radix_tree_insert(&psys->page_tree, pfn, pinfo);
+ spin_unlock(&psys->tree_lock);
+ return ret;
+}
+EXPORT_SYMBOL(general_add_page);
+
+int general_del_page(struct pinpage_system *psys, struct page *page)
+{
+ struct pinpage_info *pinfo;
+ spin_lock(&psys->tree_lock);
+ pinfo = radix_tree_lookup(&psys->page_tree, page_to_pfn(page));
+ if (!pinfo) {
+ spin_unlock(&psys->tree_lock);
+ return -EINVAL;
+ }
+ radix_tree_delete(&psys->page_tree, page_to_pfn(page));
+ spin_unlock(&psys->tree_lock);
+ return 0;
+}
+EXPORT_SYMBOL(general_del_page);
+
+int general_find_page(struct pinpage_system *psys, struct page *page)
+{
+ struct pinpage_info *pinfo;
+ spin_lock(&psys->tree_lock);
+ pinfo = radix_tree_lookup(&psys->page_tree, page_to_pfn(page));
+ spin_unlock(&psys->tree_lock);
+ return pinfo ? 1 : 0;
+}
+EXPORT_SYMBOL(general_find_page);
+
+int set_pinpage(struct pinpage_system *psys, struct page *page, void *private)
+{
+ int ret;
+ ret = psys->add_page(psys, page, private);
+ if (!ret) {
+ lock_page(page);
+ /* Doesn't allow nesting */
+ VM_BUG_ON(PagePin(page));
+ SetPagePin(page);
+ unlock_page(page);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(set_pinpage);
+
+int clear_pinpage(struct pinpage_system *psys, struct page *page)
+{
+ int ret;
+ ret = psys->del_page(psys, page);
+ if (!ret) {
+ lock_page(page);
+ ClearPagePin(page);
+ unlock_page(page);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(clear_pinpage);
+
+int register_pinpage(struct pinpage_system *psys)
+{
+ /* register pinpage_subsystem to global list */
+ spin_lock(&pinpage_system_lock);
+ list_add(&psys->list, &pinpage_system_list);
+ spin_unlock(&pinpage_system_lock);
+ return psys->create_subsys(psys);
+}
+EXPORT_SYMBOL(register_pinpage);
+
+int unregister_pinpage(struct pinpage_system *psys)
+{
+ /* unregister pinpage_subsystem from global list */
+ spin_lock(&pinpage_system_lock);
+ list_del(&psys->list);
+ spin_unlock(&pinpage_system_lock);
+ return psys->destroy_subsys(psys);
+}
+EXPORT_SYMBOL(unregister_pinpage);
+
+int migrate_pinpage(struct page *page, struct page *newpage)
+{
+ int err = 0;
+ struct pinpage_system *psys;
+
+ spin_lock(&pinpage_system_lock);
+ list_for_each_entry(psys, &pinpage_system_list, list) {
+ if (psys->find_page(psys, page)) {
+ err = psys->migrate(psys, page, newpage);
+ break;
+ }
+ }
+ spin_unlock(&pinpage_system_lock);
+ return err;
+}
--
1.7.9.5

2013-08-13 09:46:57

by Krzysztof Kozlowski

Subject: Re: [RFC 0/3] Pin page control subsystem

Hi Minchan,

On wto, 2013-08-13 at 16:04 +0900, Minchan Kim wrote:
> patch 2 introduce pinpage control
> subsystem. So, subsystems want to control pinpage should implement own
> pinpage_xxx functions because each subsystem would have other character
> so what kinds of data structure for managing pinpage information depends
> on them. Otherwise, they can use general functions defined in pinpage
> subsystem. patch 3 hacks migration.c so that migration is
> aware of pinpage now and migrate them with pinpage subsystem.

I wonder why we don't use page->mapping and a_ops. Is there any
disadvantage to such a mapping/a_ops approach?

Best regards,
Krzysztof

2013-08-13 14:23:42

by Benjamin LaHaise

Subject: Re: [RFC 0/3] Pin page control subsystem

On Tue, Aug 13, 2013 at 11:46:42AM +0200, Krzysztof Kozlowski wrote:
> Hi Minchan,
>
> On wto, 2013-08-13 at 16:04 +0900, Minchan Kim wrote:
> > patch 2 introduce pinpage control
> > subsystem. So, subsystems want to control pinpage should implement own
> > pinpage_xxx functions because each subsystem would have other character
> > so what kinds of data structure for managing pinpage information depends
> > on them. Otherwise, they can use general functions defined in pinpage
> > subsystem. patch 3 hacks migration.c so that migration is
> > aware of pinpage now and migrate them with pinpage subsystem.
>
> I wonder why don't we use page->mapping and a_ops? Is there any
> disadvantage of such mapping/a_ops?

That's what the pending aio patches do, and I think this is a better
approach for those use-cases that the technique works for.

The biggest problem I see with the pinpage approach is that it's based on a
single page at a time. I'd venture a guess that many pinned pages are done
in groups of pages, not single ones.

-ben

> Best regards,
> Krzysztof

--
"Thought is the essence of where you are now."

Subject: Re: [RFC 0/3] Pin page control subsystem

On Tue, 13 Aug 2013, Minchan Kim wrote:

> VM sometime want to migrate and/or reclaim pages for CMA, memory-hotplug,
> THP and so on but at the moment, it could handle only userspace pages
> so if above example subsystem have pinned a some page in a range VM want
> to migrate, migration is failed so above exmaple couldn't work well.

Don't we have the mmu_notifiers that could help in that case? You could get
a callback which could prepare the pages for migration?

2013-08-13 23:54:25

by Minchan Kim

Subject: Re: [RFC 0/3] Pin page control subsystem

Hello Krzysztof,

On Tue, Aug 13, 2013 at 11:46:42AM +0200, Krzysztof Kozlowski wrote:
> Hi Minchan,
>
> On wto, 2013-08-13 at 16:04 +0900, Minchan Kim wrote:
> > patch 2 introduce pinpage control
> > subsystem. So, subsystems want to control pinpage should implement own
> > pinpage_xxx functions because each subsystem would have other character
> > so what kinds of data structure for managing pinpage information depends
> > on them. Otherwise, they can use general functions defined in pinpage
> > subsystem. patch 3 hacks migration.c so that migration is
> > aware of pinpage now and migrate them with pinpage subsystem.
>
> I wonder why don't we use page->mapping and a_ops? Is there any
> disadvantage of such mapping/a_ops?

The main concern with that approach is how to handle the nested pin case.
For example, driver A and driver B coincidentally pin the same
file-backed page with get_user_pages.
For the migration, we need the following operations.

1. [buffer]'s migrate_page for the file-backed page
2. [driver A]'s migrate_page
3. [driver B]'s migrate_page

But a page has only one mapping. How can we handle that?
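
One way to picture the nesting problem is a per-page chain of pinners, each
with its own migrate hook, where the page moves only if every pinner agrees.
The following is a purely hypothetical userspace sketch: nothing like
struct pinner or struct pin_chain exists in this RFC (which forbids nesting),
and the rollback is deliberately simplistic.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PINNERS 4

struct page { int id; };	/* toy stand-in for the kernel's struct page */

/* Hypothetical per-pinner hook; the RFC has no such structure. */
struct pinner {
	const char *name;
	int (*migrate_page)(struct pinner *self, struct page *oldpage,
			    struct page *newpage);
	struct page *tracked;	/* page this pinner currently refers to */
};

struct pin_chain {
	struct pinner *pinners[MAX_PINNERS];
	size_t nr;
};

/* Record one more pinner of the page; fails when the chain is full. */
static int chain_add(struct pin_chain *c, struct pinner *p)
{
	if (c->nr == MAX_PINNERS)
		return -1;
	c->pinners[c->nr++] = p;
	return 0;
}

/* A cooperative pinner: simply retargets its reference to the new page. */
static int retarget(struct pinner *self, struct page *oldpage,
		    struct page *newpage)
{
	assert(self->tracked == oldpage);
	self->tracked = newpage;
	return 0;
}

/* A pinner that cannot let go of the page right now. */
static int refuse(struct pinner *self, struct page *oldpage,
		  struct page *newpage)
{
	(void)self; (void)oldpage; (void)newpage;
	return -1;
}

/* Call every pinner; on failure, point the earlier ones back at oldpage. */
static int chain_migrate(struct pin_chain *c, struct page *oldpage,
			 struct page *newpage)
{
	size_t i;

	for (i = 0; i < c->nr; i++) {
		struct pinner *p = c->pinners[i];

		if (p->migrate_page(p, oldpage, newpage)) {
			while (i--)
				c->pinners[i]->tracked = oldpage;
			return -1;
		}
	}
	return 0;
}
```

With a single page->mapping there is room for only one such hook, which is
why the chain has to live somewhere else.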

If we give up on having the pinpage subsystem unify userspace pages (e.g.
GUP) and kernel pages (e.g. zswap, zram and zcache), we can go with
address_space's migratepage, but we would lose the abstraction, so every
user would have to implement its own pinpage manager. That's not hard,
I guess, but it's more error-prone and less maintainable in the future.

>
> Best regards,
> Krzysztof
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2013-08-14 00:08:50

by Minchan Kim

Subject: Re: [RFC 0/3] Pin page control subsystem

Hello Benjamin,

On Tue, Aug 13, 2013 at 10:23:38AM -0400, Benjamin LaHaise wrote:
> On Tue, Aug 13, 2013 at 11:46:42AM +0200, Krzysztof Kozlowski wrote:
> > Hi Minchan,
> >
> > On wto, 2013-08-13 at 16:04 +0900, Minchan Kim wrote:
> > > patch 2 introduce pinpage control
> > > subsystem. So, subsystems want to control pinpage should implement own
> > > pinpage_xxx functions because each subsystem would have other character
> > > so what kinds of data structure for managing pinpage information depends
> > > on them. Otherwise, they can use general functions defined in pinpage
> > > subsystem. patch 3 hacks migration.c so that migration is
> > > aware of pinpage now and migrate them with pinpage subsystem.
> >
> > I wonder why don't we use page->mapping and a_ops? Is there any
> > disadvantage of such mapping/a_ops?
>
> That's what the pending aio patches do, and I think this is a better
> approach for those use-cases that the technique works for.

I took a rough look at your implementation and I think it's not a generic
solution. How could it handle the example I mentioned in my reply to
Krzysztof?

>
> The biggest problem I see with the pinpage approach is that it's based on a
> single page at a time. I'd venture a guess that many pinned pages are done
> in groups of pages, not single ones.

In the case of the z* family, most allocations are single pages, but I
agree that many GUP users would allocate groups of pages. We can cover
that by expanding the API like this.

int set_pinpage(struct pinpage_system *psys, struct page **pages,
unsigned long nr_pages, void **privates);

Then we can handle pages in batches, and the subsystem can manage
pinpage_info with an interval tree rather than the default radix tree.
That's why the pinpage control subsystem leaves room for subsystem-specific
metadata handling.
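
As a rough illustration of the batched variant, here is a userspace sketch
with all-or-nothing semantics. Everything in it is an assumption made for the
example: the psys argument is dropped, "struct page" is a toy type, and the
name set_pinpage_batch() is hypothetical. It only shows the undo-on-conflict
pattern a batch API would likely need.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page (assumption). */
struct page { int pinned; void *private; };

/*
 * Pin nr_pages pages or none of them: if any page is already pinned
 * (nesting is not allowed), undo the pages pinned so far and fail.
 */
static int set_pinpage_batch(struct page **pages, unsigned long nr_pages,
			     void **privates)
{
	unsigned long i;

	assert(pages && privates);

	for (i = 0; i < nr_pages; i++) {
		if (pages[i]->pinned)
			goto undo;
		pages[i]->pinned = 1;
		pages[i]->private = privates[i];
	}
	return 0;
undo:
	/* Walk back over pages[0..i-1], which this call pinned. */
	while (i--) {
		pages[i]->pinned = 0;
		pages[i]->private = NULL;
	}
	return -1;
}
```

A real implementation would also have to decide how the per-subsystem
add_page hook sees the batch, for example as one interval rather than
nr_pages separate entries.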

>
> -ben
>
> > Best regards,
> > Krzysztof
>
> --
> "Thought is the essence of where you are now."

--
Kind regards,
Minchan Kim

2013-08-14 00:12:33

by Minchan Kim

Subject: Re: [RFC 0/3] Pin page control subsystem

Hello Christoph,

On Tue, Aug 13, 2013 at 04:21:30PM +0000, Christoph Lameter wrote:
> On Tue, 13 Aug 2013, Minchan Kim wrote:
>
> > VM sometime want to migrate and/or reclaim pages for CMA, memory-hotplug,
> > THP and so on but at the moment, it could handle only userspace pages
> > so if above example subsystem have pinned a some page in a range VM want
> > to migrate, migration is failed so above exmaple couldn't work well.
>
> Dont we have the mmu_notifiers that could help in that case? You could get
> a callback which could prepare the pages for migration?

I'm not familiar with mmu_notifier yet, so could you please elaborate a
bit so I can dive into it?

Thanks!


--
Kind regards,
Minchan Kim

2013-08-14 16:47:16

by Minchan Kim

Subject: Re: [RFC 0/3] Pin page control subsystem

Hi Christoph,

On Wed, Aug 14, 2013 at 04:36:44PM +0000, Christoph Lameter wrote:
> On Wed, 14 Aug 2013, Minchan Kim wrote:
>
> > On Tue, Aug 13, 2013 at 04:21:30PM +0000, Christoph Lameter wrote:
> > > On Tue, 13 Aug 2013, Minchan Kim wrote:
> > >
> > > > VM sometime want to migrate and/or reclaim pages for CMA, memory-hotplug,
> > > > THP and so on but at the moment, it could handle only userspace pages
> > > > so if above example subsystem have pinned a some page in a range VM want
> > > > to migrate, migration is failed so above exmaple couldn't work well.
> > >
> > > Dont we have the mmu_notifiers that could help in that case? You could get
> > > a callback which could prepare the pages for migration?
> >
> > Now I'm not familiar with mmu_notifier so please could you elaborate it
> > a bit for me to dive into that?
>
> Add a notifier callback for unpinning pages to the mmu notifier subsystem
> and then your drivers could register with the subsystem to get
> notifications when migration needs to occur etc.
>

When I look at the mmu_notifier API, it takes an mm_struct, so I guess it
works only for user processes. Right?
If so, I need to register it without user context, because zram, zswap
and zcache work only on the kernel side.

--
Kind regards,
Minchan Kim

Subject: Re: [RFC 0/3] Pin page control subsystem

On Wed, 14 Aug 2013, Minchan Kim wrote:

> On Tue, Aug 13, 2013 at 04:21:30PM +0000, Christoph Lameter wrote:
> > On Tue, 13 Aug 2013, Minchan Kim wrote:
> >
> > > VM sometime want to migrate and/or reclaim pages for CMA, memory-hotplug,
> > > THP and so on but at the moment, it could handle only userspace pages
> > > so if above example subsystem have pinned a some page in a range VM want
> > > to migrate, migration is failed so above exmaple couldn't work well.
> >
> > Dont we have the mmu_notifiers that could help in that case? You could get
> > a callback which could prepare the pages for migration?
>
> Now I'm not familiar with mmu_notifier so please could you elaborate it
> a bit for me to dive into that?

Add a notifier callback for unpinning pages to the mmu notifier subsystem
and then your drivers could register with the subsystem to get
notifications when migration needs to occur etc.
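
A userspace sketch of what such a callback chain could look like is below.
It is only an illustration of the idea: struct unpin_notifier,
unpin_notifier_register() and notify_unpin() are hypothetical names and are
not part of the existing mmu_notifier API, and "struct page" is a toy type.

```c
#include <assert.h>
#include <stddef.h>

struct page { int pinned; };	/* toy stand-in for the kernel's struct page */

/* Hypothetical callback type; not part of the existing mmu_notifier API. */
struct unpin_notifier {
	int (*unpin)(struct unpin_notifier *nb, struct page *page);
	struct unpin_notifier *next;
};

static struct unpin_notifier *unpin_chain;

/* A driver registers once to hear about pages that need to move. */
static void unpin_notifier_register(struct unpin_notifier *nb)
{
	assert(nb->unpin);
	nb->next = unpin_chain;
	unpin_chain = nb;
}

/*
 * Ask every registered driver to drop its pin before migration.
 * Returns 0 if all callbacks succeeded, -1 at the first refusal.
 */
static int notify_unpin(struct page *page)
{
	struct unpin_notifier *nb;

	for (nb = unpin_chain; nb; nb = nb->next)
		if (nb->unpin(nb, page))
			return -1;
	return 0;
}

/* Example driver callback: releases its pin unconditionally. */
static int driver_unpin(struct unpin_notifier *nb, struct page *page)
{
	(void)nb;
	page->pinned = 0;
	return 0;
}
```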

Subject: Re: [RFC 0/3] Pin page control subsystem

On Thu, 15 Aug 2013, Minchan Kim wrote:

> When I look API of mmu_notifier, it has mm_struct so I guess it works
> for only user process. Right?

Correct. A process must have mapped the pages. If you can get a
kernel "process" to work then that process could map the pages.

> If so, I need to register it without user conext because zram, zswap
> and zcache works for only kernel side.

Hmmm... Ok but that now gets the complexity of page pinning up to a very
weird level. Is there some way we can have a common way to deal with the
various ways that pinning is needed? Just off the top of my head (I may
miss some use cases) we have

1. mlock from user space
2. page pinning for reclaim
3. Page pinning for I/O from device drivers (like f.e. the RDMA subsystem)
4. Page pinning for low latency operations
5. Page pinning for migration
6. Page pinning for the perf buffers.
7. Page pinning for cross system access (XPMEM, GRU SGI)

Now we have another subsystem wanting different semantics of pinning. Is
there any way we can come up with a pinning mechanism that fits all use
cases, that is easily understandable and maintainable?

2013-08-15 04:48:45

by Minchan Kim

Subject: Re: [RFC 0/3] Pin page control subsystem

Hey Christoph,

On Wed, Aug 14, 2013 at 04:58:36PM +0000, Christoph Lameter wrote:
> On Thu, 15 Aug 2013, Minchan Kim wrote:
>
> > When I look API of mmu_notifier, it has mm_struct so I guess it works
> > for only user process. Right?
>
> Correct. A process must have mapped the pages. If you can get a
> kernel "process" to work then that process could map the pages.
>
> > If so, I need to register it without user conext because zram, zswap
> > and zcache works for only kernel side.
>
> Hmmm... Ok but that now gets the complexity of page pinnning up to a very
> weird level. Is there some way we can have a common way to deal with the
> various ways that pinning is needed? Just off the top of my head (I may
> miss some use cases) we have
>
> 1. mlock from user space

mlocked pages can now be migrated in the CMA case, so I think it's not a
big problem to migrate them in other cases.
I remember you and Peter argued about the semantics of mlock from the
pinning point of view and, if I remember correctly, Peter said mlock
doesn't mean pin so we could migrate such pages, but you didn't agree. Right?
Anyway, it's off-topic, but technically it's not a problem.

> 2. page pinning for reclaim

Reclaim pins a page for a while. Of course, "for a while" is rather
vague, so it could be really long for someone but really short for
others. But at least a reclaim pin should be short, and we should try
to fix it if that's not true.

> 3. Page pinning for I/O from device drivers (like f.e. the RDMA subsystem)

This is one of my big concerns. Several drivers might even pin the same
page at the same time. But normally a driver knows whether it will pin a
page for a long or a short time, so if it wants to pin a page for a long
time, like aio or some GPU driver doing zero-copy, it should use the
pinpage control subsystem to release pinned pages when the VM asks.

> 4. Page pinning for low latency operations

I have no idea, but I guess most of them pin a page for a short time?
Otherwise, they should use the pinpage control subsystem, too.

> 5. Page pinning for migration

It's like 2. A migration pin should be short.

> 6. Page pinning for the perf buffers.

I'm not familiar with that, but my gut feeling is that it pins pages
for a long time, so it should use the pinpage control subsystem.

> 7. Page pinning for cross system access (XPMEM, GRU SGI)

If it's a really long pin, it should use the pinpage control subsystem.

>
> Now we have another subsystem wanting different semantics of pinning. Is
> there any way we can come up with a pinning mechanism that fits all use
> cases, that is easyly understandable and maintainable?

I agree it's not easy, but we should go that way rather than adding ad-hoc,
subsystem-specific implementations. If we allow the subsystem-specific way,
everybody may want to touch migrate.c, so it would become very complicated
and bloated, even unmaintainable in the future. If we go another way,
like a_ops->migratepage, it can't handle complex nested pin cases, so it
can't guarantee pinned-page migration.

The hardest part is what "for a while" means. It depends on the system
workload, so on some systems it means 3ms while on others it means 3s. :(
Sigh, for now I have no idea how to handle it in general.

Thanks for the comment, Christoph!

>

--
Kind regards,
Minchan Kim

Subject: Re: [RFC 0/3] Pin page control subsystem

On Thu, 15 Aug 2013, Minchan Kim wrote:

> Now mlock pages could be migrated in case of CMA so I think it's not a
> big problem to migrate it for other cases.
> I remember You and Peter argued what's the mlock semainc of pin POV
> and as I remember correctly, Peter said mlock doesn't mean pin so
> we could migrate it but you didn't agree. Right?

mlock means it can be migrated. Pinning is currently done by increasing
the page count. Migration will be attempted but it will fail since the
references cannot be all removed. Peter proposed that mlock would work
like pinning so that a migration of the page would not be attempted.
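
The effect of pinning by reference count can be shown with a toy model. This
is a userspace sketch under a simplifying assumption: migration is allowed
only when the count equals one reference held by the migration code plus one
per mapping. The kernel's real accounting has more cases; the struct and the
helper names here are made up for illustration.

```c
#include <assert.h>

/* Toy page: total references and how many of them come from mappings. */
struct page { int count; int mapcount; };

/* A pin is just an extra reference nobody else can account for. */
static void pin_page(struct page *page)
{
	page->count++;
}

static void unpin_page(struct page *page)
{
	assert(page->count > 0);
	page->count--;
}

/*
 * Toy rule (assumption): migration may proceed only if every reference
 * is explained, i.e. one for the migration code plus one per mapping.
 */
static int can_migrate(const struct page *page)
{
	int expected = 1 + page->mapcount;

	return page->count == expected;
}
```

With one mapping and the isolation reference, can_migrate() is true; after
pin_page() it turns false and stays that way until the pinner calls
unpin_page(), which is the "attempted but fails" behaviour described above.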

My concern is not only about migration but about a general way of pinning
pages. Having mlock and pinning with different semantics is already an
issue as the conversation with Peter brought out. Now we are
adding yet another way that pinning is used.