2013-05-13 09:08:53

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

Hi Mel,

On 02/06/2013 05:56 PM, Mel Gorman wrote:
>
> There is the possibility that callbacks could be introduced for
> migrate_unpin() and migrate_pin() that takes a list of PFN pairs
> (old,new). The unpin callback should release the old PFNs and barrier
> against any operations until the migrate_pfn() callback is called with
> the updated pfns to be repinned. Again it would fully depend on subsystems
> implementing it properly.
>
> The callback interface would be more robust but puts a lot more work on
> the driver side where your milage will vary.
>

I'm very interested in the "callback" way you said.

For memory hot-remove case, the aio pages are pined in memory and making
the pages cannot be offlined, furthermore, the pages cannot be removed.

IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio
subsystem, and call them when hot-remove code tries to offline pages,
right ?

If so, I'm wondering where should we put this callback pointers ?
In struct page ?


It has been a long time since this topic was discussed. But to solve this
problem cleanly for hotplug guys and CMA guys, please give some more
comments.

Thanks. :)

>
> To guarantee CMA can migrate pages pinned by drivers I think you need
> migrate-related callsbacks to unpin, barrier the driver until migration
> completes and repin.
>
> I do not know, or at least have no heard, of anyone working on such a
> scheme.
>


2013-05-13 09:19:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote:
> Hi Mel,
>
> On 02/06/2013 05:56 PM, Mel Gorman wrote:
> >
> >There is the possibility that callbacks could be introduced for
> >migrate_unpin() and migrate_pin() that takes a list of PFN pairs
> >(old,new). The unpin callback should release the old PFNs and barrier
> >against any operations until the migrate_pfn() callback is called with
> >the updated pfns to be repinned. Again it would fully depend on subsystems
> >implementing it properly.
> >
> >The callback interface would be more robust but puts a lot more work on
> >the driver side where your milage will vary.
> >
>
> I'm very interested in the "callback" way you said.
>
> For memory hot-remove case, the aio pages are pined in memory and making
> the pages cannot be offlined, furthermore, the pages cannot be removed.
>
> IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio
> subsystem, and call them when hot-remove code tries to offline
> pages, right ?
>
> If so, I'm wondering where should we put this callback pointers ?
> In struct page ?
>

No, I would expect the callbacks to be part the address space operations
which can be found via page->mapping.

--
Mel Gorman
SUSE Labs

2013-05-13 14:54:45

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

Benjamin LaHaise <[email protected]> writes:

> On Mon, May 13, 2013 at 10:19:02AM +0100, Mel Gorman wrote:
>> On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote:
> ...
>> > If so, I'm wondering where should we put this callback pointers ?
>> > In struct page ?
>> >
>>
>> No, I would expect the callbacks to be part the address space operations
>> which can be found via page->mapping.
>
> If someone adds those callbacks and provides a means for testing them,
> it would be pretty trivial to change the aio code to migrate its pinned
> pages on demand.

How do you propose to move the ring pages?

Cheers,
Jeff

2013-05-13 15:01:49

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote:
> How do you propose to move the ring pages?

It's the same problem as doing a TLB shootdown: flush the old pages from
userspace's mapping, copy any existing data to the new pages, then
repopulate the page tables. It will likely require the addition of
address_space_operations for the mapping, but that's not too hard to do.

-ben
--
"Thought is the essence of where you are now."

2013-05-13 15:04:27

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

On Mon, May 13, 2013 at 10:19:02AM +0100, Mel Gorman wrote:
> On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote:
...
> > If so, I'm wondering where should we put this callback pointers ?
> > In struct page ?
> >
>
> No, I would expect the callbacks to be part the address space operations
> which can be found via page->mapping.

If someone adds those callbacks and provides a means for testing them,
it would be pretty trivial to change the aio code to migrate its pinned
pages on demand.

-ben
--
"Thought is the essence of where you are now."

2013-05-14 01:22:18

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

Hi Mel, Benjamin, Jeff,

On 05/13/2013 11:01 PM, Benjamin LaHaise wrote:
> On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote:
>> How do you propose to move the ring pages?
>
> It's the same problem as doing a TLB shootdown: flush the old pages from
> userspace's mapping, copy any existing data to the new pages, then
> repopulate the page tables. It will likely require the addition of
> address_space_operations for the mapping, but that's not too hard to do.
>

I think we add migrate_unpin() callback to decrease page->count if
necessary,
and migrate the page to a new page, and add migrate_pin() callback to pin
the new page again.

The migrate procedure will work just as before. We use callbacks to
decrease
the page->count before migration starts, and increase it when the migration
is done.

And migrate_pin() and migrate_unpin() callbacks will be added to
struct address_space_operations.

Is that right ?

If so, I'll be working on it.

Thanks. :)

2013-05-14 04:12:29

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

Hi Mel,

On 05/13/2013 05:19 PM, Mel Gorman wrote:
>> For memory hot-remove case, the aio pages are pined in memory and making
>> the pages cannot be offlined, furthermore, the pages cannot be removed.
>>
>> IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio
>> subsystem, and call them when hot-remove code tries to offline
>> pages, right ?
>>
>> If so, I'm wondering where should we put this callback pointers ?
>> In struct page ?
>>
>
> No, I would expect the callbacks to be part the address space operations
> which can be found via page->mapping.
>

Two more problems I don't quite understand.

1. For an anonymous page, it has no address_space, and no address space
operation. But the aio ring problem just happened when dealing with
anonymous pages. Please refer to:
(https://lkml.org/lkml/2012/11/29/69)

If we put the the callbacks in page->mapping->a_ops, the anonymous
pages
won't be able to use them.

And we cannot give a default callback because the situation we are
dealing
with is a special situation.

So where to put the callback for anonymous pages ?


2. How to find out the reason why page->count != 1 in
migrate_page_move_mapping() ?

In the problem we are dealing with, get_user_pages() is called to
pin the pages
in memory. And the pages are migratable. So we want to decrease the
page->count.

But get_user_pages() is not the only reason leading to page->count
increased.
How can I know when should decrease teh page->count or when should not ?

The way I can figure out is to assign the callback pointer in
get_user_pages()
because it is get_user_pages() who pins the pages.


Thanks. :)







2013-05-14 13:58:54

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote:
> Hi Mel, Benjamin, Jeff,
>
> On 05/13/2013 11:01 PM, Benjamin LaHaise wrote:
> >On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote:
> >>How do you propose to move the ring pages?
> >
> >It's the same problem as doing a TLB shootdown: flush the old pages from
> >userspace's mapping, copy any existing data to the new pages, then
> >repopulate the page tables. It will likely require the addition of
> >address_space_operations for the mapping, but that's not too hard to do.
> >
>
> I think we add migrate_unpin() callback to decrease page->count if
> necessary,
> and migrate the page to a new page, and add migrate_pin() callback to pin
> the new page again.

You can't just decrease the page count for this to work. The pages are
pinned because aio_complete() can occur at any time and needs to have a
place to write the completion events. When changing pages, aio has to
take the appropriate lock when changing one page for another.

> The migrate procedure will work just as before. We use callbacks to
> decrease
> the page->count before migration starts, and increase it when the migration
> is done.
>
> And migrate_pin() and migrate_unpin() callbacks will be added to
> struct address_space_operations.

I think the existing migratepage operation in address_space_operations can
be used. Does it get called when hot unplug occurs? That is: is testing
with the migrate_pages syscall similar enough to the memory removal case?

-ben

> Is that right ?
>
> If so, I'll be working on it.
>
> Thanks. :)

--
"Thought is the essence of where you are now."

2013-05-15 02:06:23

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

Hi Benjamin, Mel,

Please see below.

On 05/14/2013 09:58 PM, Benjamin LaHaise wrote:
> On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote:
>> Hi Mel, Benjamin, Jeff,
>>
>> On 05/13/2013 11:01 PM, Benjamin LaHaise wrote:
>>> On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote:
>>>> How do you propose to move the ring pages?
>>>
>>> It's the same problem as doing a TLB shootdown: flush the old pages from
>>> userspace's mapping, copy any existing data to the new pages, then
>>> repopulate the page tables. It will likely require the addition of
>>> address_space_operations for the mapping, but that's not too hard to do.
>>>
>>
>> I think we add migrate_unpin() callback to decrease page->count if
>> necessary,
>> and migrate the page to a new page, and add migrate_pin() callback to pin
>> the new page again.
>
> You can't just decrease the page count for this to work. The pages are
> pinned because aio_complete() can occur at any time and needs to have a
> place to write the completion events. When changing pages, aio has to
> take the appropriate lock when changing one page for another.

In aio_complete(),

aio_complete() {
......
spin_lock_irqsave(&ctx->completion_lock, flags);
//write the completion event.
spin_unlock_irqrestore(&ctx->completion_lock, flags);
......
}

So for this problem, I think we can hold ctx->completion_lock in the aio
callbacks to prevent aio subsystem accessing pages who are being migrated.

>
>> The migrate procedure will work just as before. We use callbacks to
>> decrease
>> the page->count before migration starts, and increase it when the migration
>> is done.
>>
>> And migrate_pin() and migrate_unpin() callbacks will be added to
>> struct address_space_operations.
>
> I think the existing migratepage operation in address_space_operations can
> be used. Does it get called when hot unplug occurs? That is: is testing
> with the migrate_pages syscall similar enough to the memory removal case?
>

But as I said, for anonymous pages such as aio ring buffer, they don't have
address_space_operations. So where should we put the callbacks' pointers ?

Add something like address_space_operations to struct anon_vma ?

Thanks. :)





2013-05-15 07:18:49

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

Hi Benjamin, Mel,

On 05/15/2013 10:09 AM, Tang Chen wrote:
> Hi Benjamin, Mel,
>
> Please see below.
>
> On 05/14/2013 09:58 PM, Benjamin LaHaise wrote:
>> On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote:
>>> Hi Mel, Benjamin, Jeff,
>>>
>>> On 05/13/2013 11:01 PM, Benjamin LaHaise wrote:
>>>> On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote:
>>>>> How do you propose to move the ring pages?
>>>>
>>>> It's the same problem as doing a TLB shootdown: flush the old pages
>>>> from
>>>> userspace's mapping, copy any existing data to the new pages, then
>>>> repopulate the page tables. It will likely require the addition of
>>>> address_space_operations for the mapping, but that's not too hard to
>>>> do.
>>>>
>>>
>>> I think we add migrate_unpin() callback to decrease page->count if
>>> necessary,
>>> and migrate the page to a new page, and add migrate_pin() callback to
>>> pin
>>> the new page again.
>>
>> You can't just decrease the page count for this to work. The pages are
>> pinned because aio_complete() can occur at any time and needs to have a
>> place to write the completion events. When changing pages, aio has to
>> take the appropriate lock when changing one page for another.
>
> In aio_complete(),
>
> aio_complete() {
> ......
> spin_lock_irqsave(&ctx->completion_lock, flags);
> //write the completion event.
> spin_unlock_irqrestore(&ctx->completion_lock, flags);
> ......
> }
>
> So for this problem, I think we can hold kioctx->completion_lock in the aio
> callbacks to prevent aio subsystem accessing pages who are being migrated.
>

Another problem here is:

We intend to call these callbacks in the page migrate path, and we need to
know which lock to hold. But there is no way for migrate path to know this
info.

The migrate path is common for all kinds of pages, so we cannot pass any
specific parameter to the callbacks in migrate path.

When we get a page, we cannot get any kioctx info from the page. So how can
the callback know which lock to require without any parameter ? Or do we
have
any other way to do so ?

Would you please give some more advice about this ?

BTW, we also need to update kioctx->ring_pages.

Thanks. :)

>>
>>> The migrate procedure will work just as before. We use callbacks to
>>> decrease
>>> the page->count before migration starts, and increase it when the
>>> migration
>>> is done.
>>>
>>> And migrate_pin() and migrate_unpin() callbacks will be added to
>>> struct address_space_operations.
>>
>> I think the existing migratepage operation in address_space_operations
>> can
>> be used. Does it get called when hot unplug occurs? That is: is testing
>> with the migrate_pages syscall similar enough to the memory removal case?
>>
>
> But as I said, for anonymous pages such as aio ring buffer, they don't have
> address_space_operations. So where should we put the callbacks' pointers ?
>
> Add something like address_space_operations to struct anon_vma ?
>
> Thanks. :)
>
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-05-15 13:25:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

On Tue, May 14, 2013 at 11:55:31AM +0800, Tang Chen wrote:
> Hi Mel,
>
> On 05/13/2013 05:19 PM, Mel Gorman wrote:
> >>For memory hot-remove case, the aio pages are pined in memory and making
> >>the pages cannot be offlined, furthermore, the pages cannot be removed.
> >>
> >>IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio
> >>subsystem, and call them when hot-remove code tries to offline
> >>pages, right ?
> >>
> >>If so, I'm wondering where should we put this callback pointers ?
> >>In struct page ?
> >>
> >
> >No, I would expect the callbacks to be part the address space operations
> >which can be found via page->mapping.
> >
>
> Two more problems I don't quite understand.
>

Bear in mind I've done no research on this particular problem. At best,
the migrate pin/unpin is the direction that I'd start with if I was tasked
with fixing this (which I'm not). Hence, I cannot answer your questions
at the level of detail you are looking for.

> 1. For an anonymous page, it has no address_space, and no address space
> operation. But the aio ring problem just happened when dealing with
> anonymous pages. Please refer to:
> (https://lkml.org/lkml/2012/11/29/69)
>

If it is to be an address space operations sturcture then you'll need a
pseudo mapping structure for anonymous pages that are pinned by aio --
similar in principal to how swapper_space is used for managing PageSwapCache
or how anon_vma structures can be associated with a page.

However, I warn you that you may find that the address_space is the
wrong level to register such callbacks, it just seemed like the obvious
first choice. A potential alternative implementation is to create a 1:1
association between pages and a long-lived holder that is stored on a hash
table (similar style of arrangement as page_waitqueue). A page is looked up
in the hash table and if an entry exists, it points to an callback structure
to the subsystem holding the pin. It's up to the subsystem to register the
callbacks when it is about to pin a page (get_user_pages_longlived(....,
&release_ops) and figure out how to release the pin safely.

> 2. How to find out the reason why page->count != 1 in
> migrate_page_move_mapping() ?
>
> In the problem we are dealing with, get_user_pages() is called to
> pin the pages
> in memory. And the pages are migratable. So we want to decrease
> the page->count.
>
> But get_user_pages() is not the only reason leading to
> page->count increased.
> How can I know when should decrease teh page->count or when should not ?
>

You cannot just arbitrarily drop the page->count without causing problems. It
has to be released by the subsystem holding the pin because only it can
know when it's safe.

--
Mel Gorman
SUSE Labs

2013-05-16 05:51:28

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

Hi Mel,

On 05/15/2013 09:24 PM, Mel Gorman wrote:
> If it is to be an address space operations sturcture then you'll need a
> pseudo mapping structure for anonymous pages that are pinned by aio --
> similar in principal to how swapper_space is used for managing PageSwapCache
> or how anon_vma structures can be associated with a page.
>
> However, I warn you that you may find that the address_space is the
> wrong level to register such callbacks, it just seemed like the obvious
> first choice. A potential alternative implementation is to create a 1:1
> association between pages and a long-lived holder that is stored on a hash
> table (similar style of arrangement as page_waitqueue). A page is looked up
> in the hash table and if an entry exists, it points to an callback structure
> to the subsystem holding the pin. It's up to the subsystem to register the
> callbacks when it is about to pin a page (get_user_pages_longlived(....,
> &release_ops) and figure out how to release the pin safely.
>

OK, I'll try to figure out a proper place to put the callbacks.
But I think we need to add something new to struct page. I'm just
not sure if it is OK. Maybe we can discuss more about it when I send
a RFC patch.

Thanks for the advices, and I'll try them.

Thanks. :)

2013-05-17 00:23:51

by Benjamin LaHaise

[permalink] [raw]
Subject: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On Thu, May 16, 2013 at 01:54:18PM +0800, Tang Chen wrote:
...
> OK, I'll try to figure out a proper place to put the callbacks.
> But I think we need to add something new to struct page. I'm just
> not sure if it is OK. Maybe we can discuss more about it when I send
> a RFC patch.
...

I ended up working on this a bit today, and managed to cobble together
something that somewhat works -- please see the patch below. It still is
not completely tested, and it has a rather nasty bug owing to the fact
that the file descriptors returned by anon_inode_getfile() all share the
same inode (read: more than one instance of aio does not work), but it
shows the basic idea. Also, bad things probably happen if someone does
an mremap() on the aio ring buffer. I'll polish this off sometime next
week after the long weekend if noone beats me to it.

-ben
--
"Thought is the essence of where you are now."

fs/aio.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/migrate.h | 3 +
mm/migrate.c | 2 -
3 files changed, 96 insertions(+), 4 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index c5b1a8c..dbad23e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -35,6 +35,9 @@
#include <linux/eventfd.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
+#include <linux/anon_inodes.h>
+#include <linux/migrate.h>
+#include <linux/ramfs.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -108,6 +111,7 @@ struct kioctx {
} ____cacheline_aligned_in_smp;

struct page *internal_pages[AIO_RING_PAGES];
+ struct file *ctx_file;
};

/*------ sysctl variables----*/
@@ -146,8 +150,59 @@ static void aio_free_ring(struct kioctx *ctx)

if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
kfree(ctx->ring_pages);
+
+ if (ctx->ctx_file) {
+ truncate_setsize(ctx->ctx_file->f_inode, 0);
+ fput(ctx->ctx_file);
+ ctx->ctx_file = NULL;
+ }
+}
+
+static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ vma->vm_ops = &generic_file_vm_ops;
+ return 0;
+}
+
+static const struct file_operations aio_ctx_fops = {
+ .mmap = aio_ctx_mmap,
+};
+
+static int aio_set_page_dirty(struct page *page)
+{
+ return 0;
+}
+
+static int aio_migratepage(struct address_space *mapping, struct page *new,
+ struct page *old, enum migrate_mode mode)
+{
+ struct kioctx *ctx = mapping->private_data;
+ unsigned long flags;
+ unsigned idx = old->index;
+ int rc;
+
+ BUG_ON(PageWriteback(old)); /* Writeback must be complete */
+ put_page(old);
+ rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+ if (rc != MIGRATEPAGE_SUCCESS) {
+ get_page(old);
+ return rc;
+ }
+ get_page(new);
+
+ spin_lock_irqsave(&ctx->completion_lock, flags);
+ migrate_page_copy(new, old);
+ ctx->ring_pages[idx] = new;
+ spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+ return MIGRATEPAGE_SUCCESS;
}

+static const struct address_space_operations aio_ctx_aops = {
+ .set_page_dirty = aio_set_page_dirty,
+ .migratepage = aio_migratepage,
+};
+
static int aio_setup_ring(struct kioctx *ctx)
{
struct aio_ring *ring;
@@ -155,6 +210,7 @@ static int aio_setup_ring(struct kioctx *ctx)
struct mm_struct *mm = current->mm;
unsigned long size, populate;
int nr_pages;
+ int i;

/* Compensate for the ring buffer's head/tail overlap entry */
nr_events += 2; /* 1 is required, 2 for good luck */
@@ -166,6 +222,31 @@ static int aio_setup_ring(struct kioctx *ctx)
if (nr_pages < 0)
return -EINVAL;

+ ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
+ if (IS_ERR(ctx->ctx_file)) {
+ ctx->ctx_file = NULL;
+ return -EAGAIN;
+ }
+ ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
+ ctx->ctx_file->f_inode->i_mapping->private_data = ctx;
+ ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
+
+ for (i=0; i<nr_pages; i++) {
+ struct page *page;
+ void *ptr;
+ page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping,
+ i, GFP_KERNEL);
+ if (!page) {
+ break;
+ }
+ ptr = kmap(page);
+ clear_page(ptr);
+ kunmap(page);
+ SetPageUptodate(page);
+ SetPageDirty(page);
+ unlock_page(page);
+ }
+
nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);

ctx->nr_events = 0;
@@ -180,20 +261,25 @@ static int aio_setup_ring(struct kioctx *ctx)
ctx->mmap_size = nr_pages * PAGE_SIZE;
pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
down_write(&mm->mmap_sem);
- ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size,
+ ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size,
PROT_READ|PROT_WRITE,
- MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate);
+ MAP_SHARED|MAP_POPULATE, 0,
+ &populate);
if (IS_ERR((void *)ctx->mmap_base)) {
up_write(&mm->mmap_sem);
ctx->mmap_size = 0;
aio_free_ring(ctx);
return -EAGAIN;
}
+ up_write(&mm->mmap_sem);
+ mm_populate(ctx->mmap_base, populate);

pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
1, 0, ctx->ring_pages, NULL);
- up_write(&mm->mmap_sem);
+ for (i=0; i<ctx->nr_pages; i++) {
+ put_page(ctx->ring_pages[i]);
+ }

if (unlikely(ctx->nr_pages != nr_pages)) {
aio_free_ring(ctx);
@@ -403,6 +489,8 @@ out_cleanup:
err = -EAGAIN;
aio_free_ring(ctx);
out_freectx:
+ if (ctx->ctx_file)
+ fput(ctx->ctx_file);
kmem_cache_free(kioctx_cachep, ctx);
pr_debug("error allocating ioctx %d\n", err);
return ERR_PTR(err);
@@ -852,6 +940,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
ioctx = ioctx_alloc(nr_events);
ret = PTR_ERR(ioctx);
if (!IS_ERR(ioctx)) {
+ ctx = ioctx->user_id;
ret = put_user(ioctx->user_id, ctxp);
if (ret)
kill_ioctx(ioctx);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..b6f3289 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm,
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
+extern int migrate_page_move_mapping(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ struct buffer_head *head, enum migrate_mode mode);
#else

static inline void putback_lru_pages(struct list_head *l) {}
diff --git a/mm/migrate.c b/mm/migrate.c
index 27ed225..ac9c3a9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
* 2 for pages with a mapping
* 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
*/
-static int migrate_page_move_mapping(struct address_space *mapping,
+int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode)
{

2013-05-17 03:26:06

by Tang Chen

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

Hi Benjamin,

Thank you very much for your idea. :)

I have no objection to your idea, but seeing from your patch, this only
works for aio subsystem because you changed the way to allocate the aio
ring pages, with a file mapping.

So far as I know, not only aio, but also other subsystems, such CMA, will
also have problem like this. The page cannot be migrated because it is
pinned in memory. So I think we should work out a common way to solve how
to migrate pinned pages.

I'm working in the way Mel has said, migrate_unpin() and migrate_pin()
callbacks. But as you saw, I met some problems, like I don't where to put
these two callbacks. And discussed with you guys, I want to try this:

1. Add a new member to struct page, used to remember the pin holders of
this page, including the pin and unpin callbacks and the necessary data.
This is more like a callback chain.
(I'm worry about this step, I'm not sure if it is good enough. After
all,
we need a good place to put the callbacks.)

And then, like Mel said,

2. Implement the callbacks in the subsystems, and register them to the
new member in struct page.

3. Call these callbacks before and after migration.


I think I'll send a RFC patch next week when I finished the outline. I'm
just thinking of finding a common way to solve this problem that all the
other subsystems will benefit.

Thanks. :)


On 05/17/2013 08:23 AM, Benjamin LaHaise wrote:
> On Thu, May 16, 2013 at 01:54:18PM +0800, Tang Chen wrote:
> ...
>> OK, I'll try to figure out a proper place to put the callbacks.
>> But I think we need to add something new to struct page. I'm just
>> not sure if it is OK. Maybe we can discuss more about it when I send
>> a RFC patch.
> ...
>
> I ended up working on this a bit today, and managed to cobble together
> something that somewhat works -- please see the patch below. It still is
> not completely tested, and it has a rather nasty bug owing to the fact
> that the file descriptors returned by anon_inode_getfile() all share the
> same inode (read: more than one instance of aio does not work), but it
> shows the basic idea. Also, bad things probably happen if someone does
> an mremap() on the aio ring buffer. I'll polish this off sometime next
> week after the long weekend if noone beats me to it.
>
> -ben

2013-05-17 14:37:22

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On Fri, May 17, 2013 at 11:28:52AM +0800, Tang Chen wrote:
> Hi Benjamin,
>
> Thank you very much for your idea. :)
>
> I have no objection to your idea, but seeing from your patch, this only
> works for aio subsystem because you changed the way to allocate the aio
> ring pages, with a file mapping.

That is correct. There is no way you're going to be able to solve this
problem without dealing with the issue on a subsystem by subsystem basis.

> So far as I know, not only aio, but also other subsystems, such CMA, will
> also have problem like this. The page cannot be migrated because it is
> pinned in memory. So I think we should work out a common way to solve how
> to migrate pinned pages.

A generic approach would require hardware support, but I doubt that is
going to happen.

> I'm working in the way Mel has said, migrate_unpin() and migrate_pin()
> callbacks. But as you saw, I met some problems, like I don't where to put
> these two callbacks. And discussed with you guys, I want to try this:
>
> 1. Add a new member to struct page, used to remember the pin holders of
> this page, including the pin and unpin callbacks and the necessary data.
> This is more like a callback chain.
> (I'm worry about this step, I'm not sure if it is good enough. After
> all,
> we need a good place to put the callbacks.)

Putting function pointers into struct page is not going to happen. You'd
be adding a significant amount of memory overhead for something that is
never going to be used on the vast majority of systems (2 function pointers
would be 16 bytes per page on a 64 bit system). Keep in mind that distro
kernels tend to enable almost all config options on their kernels, so the
overhead of any approach has to make sense for the users of the kernel that
will never make use of this kind of migration.

> And then, like Mel said,
>
> 2. Implement the callbacks in the subsystems, and register them to the
> new member in struct page.

No, the hook should be in the address_space_operations. We already have
a pointer to an address space in struct page. This avoids adding more
overhead to struct page.

> 3. Call these callbacks before and after migration.

How is that better than using the existing hook in address_space_operations?

> I think I'll send a RFC patch next week when I finished the outline. I'm
> just thinking of finding a common way to solve this problem that all the
> other subsystems will benefit.

Before pursuing this approach, make sure you've got buy-in for all of the
overhead you're adding to the system. I don't think that growing struct
page is going to be an acceptable design choice given the amount of
overhead it will incur.

> Thanks. :)

Cheers,

-ben
--
"Thought is the essence of where you are now."

2013-05-17 18:17:34

by Zach Brown

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

> I ended up working on this a bit today, and managed to cobble together
> something that somewhat works -- please see the patch below.

Just some quick observations:

> + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
> + if (IS_ERR(ctx->ctx_file)) {
> + ctx->ctx_file = NULL;
> + return -EAGAIN;
> + }

It's too bad that aio contexts will now be accounted against the filp
limits (get_empty_filp -> files_stat.max_files, etc).

> + for (i=0; i<nr_pages; i++) {
> + struct page *page;
> + void *ptr;
> + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping,
> + i, GFP_KERNEL);
> + if (!page) {
> + break;
> + }
> + ptr = kmap(page);
> + clear_page(ptr);
> + kunmap(page);
> + SetPageUptodate(page);
> + SetPageDirty(page);
> + unlock_page(page);
> + }

If they're GFP_KERNEL then you don't need to kmap them. But we probably
want to allocate with GFP_HIGHUSER and then use clear_user_highpage() to
zero them?

- z

2013-05-17 18:30:06

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On Fri, May 17, 2013 at 11:17:08AM -0700, Zach Brown wrote:
> > I ended up working on this a bit today, and managed to cobble together
> > something that somewhat works -- please see the patch below.
>
> Just some quick observations:
>
> > + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
> > + if (IS_ERR(ctx->ctx_file)) {
> > + ctx->ctx_file = NULL;
> > + return -EAGAIN;
> > + }
>
> It's too bad that aio contexts will now be accounted against the filp
> limits (get_empty_filp -> files_stat.max_files, etc).

Yeah, that is a downside of this approach. It would be possible to to
do it with only an inode/address_space, but that would mean bypassing
do_mmap(), which is not worth considering. If it is really an issue, we
could add a flag to bypass that limit since aio has its own.
anon_inode_getfile() as it stands is a major problem.

> > + for (i=0; i<nr_pages; i++) {
> > + struct page *page;
> > + void *ptr;
> > + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping,
> > + i, GFP_KERNEL);
> > + if (!page) {
> > + break;
> > + }
> > + ptr = kmap(page);
> > + clear_page(ptr);
> > + kunmap(page);
> > + SetPageUptodate(page);
> > + SetPageDirty(page);
> > + unlock_page(page);
> > + }
>
> If they're GFP_KERNEL then you don't need to kmap them. But we probably
> want to allocate with GFP_HIGHUSER and then use clear_user_highpage() to
> zero them?

Adding __GFP_ZERO would fix that too. The next respin will include that
change. I also have to properly handle the mremap() case as well.

-ben
--
"Thought is the essence of where you are now."

2013-05-21 02:12:39

by Tang Chen

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

Hi Benjamin,

Sorry for the late. Please see below.

On 05/17/2013 10:37 PM, Benjamin LaHaise wrote:
> On Fri, May 17, 2013 at 11:28:52AM +0800, Tang Chen wrote:
>> Hi Benjamin,
>>
>> Thank you very much for your idea. :)
>>
>> I have no objection to your idea, but seeing from your patch, this only
>> works for aio subsystem because you changed the way to allocate the aio
>> ring pages, with a file mapping.
>
> That is correct. There is no way you're going to be able to solve this
> problem without dealing with the issue on a subsystem by subsystem basis.
>

Yes, I understand that. We need subsystem work anyway.


>> I'm working in the way Mel has said, migrate_unpin() and migrate_pin()
>> callbacks. But as you saw, I met some problems, like I don't where to put
>> these two callbacks. And discussed with you guys, I want to try this:
>>
>> 1. Add a new member to struct page, used to remember the pin holders of
>> this page, including the pin and unpin callbacks and the necessary data.
>> This is more like a callback chain.
>> (I'm worry about this step, I'm not sure if it is good enough. After
>> all,
>> we need a good place to put the callbacks.)
>
> Putting function pointers into struct page is not going to happen. You'd
> be adding a significant amount of memory overhead for something that is
> never going to be used on the vast majority of systems (2 function pointers
> would be 16 bytes per page on a 64 bit system). Keep in mind that distro
> kernels tend to enable almost all config options on their kernels, so the
> overhead of any approach has to make sense for the users of the kernel that
> will never make use of this kind of migration.

True. But I just cannot find a place to hold the callbacks.

>
>> 3. Call these callbacks before and after migration.
>
> How is that better than using the existing hook in address_space_operations?

I'm not saying using two callbacks before and after migration is better.
I don't want to use address_space_operations is because there is no such
member
for anonymous pages.

In your idea, using a file mapping will create a
address_space_operations. But
I really don't think we can modify the way of memory allocation for all the
subsystems who has this problem. Maybe not just aio and cma. That means if
you want to pin pages in memory, you have to use a file mapping. This makes
the memory allocation more complicated. And the idea should be known by all
the subsystem developers. Is that going to happen ?


I also thought about reuse one field of struct page. But as you said, there
may not be many users of this functionality. Reusing a field of struct page
will make things more complicated and lead to high coupling.


So, how about the other idea that Mel mentioned ?

We create a 1-1 mapping of pinned page ranges and the pinner (subsystem
callbacks and data), maybe a global list or a hash table. And then, we can
find the callbacks.


Thanks. :)

2013-05-21 02:27:36

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On Tue, May 21, 2013 at 10:07:52AM +0800, Tang Chen wrote:
....
> I'm not saying using two callbacks before and after migration is better.
> I don't want to use address_space_operations is because there is no such
> member
> for anonymous pages.

That depends on the nature of the pinning. For the general case of
get_user_pages(), you're correct that it won't work for anonymous memory.

> In your idea, using a file mapping will create a
> address_space_operations. But
> I really don't think we can modify the way of memory allocation for all the
> subsystems who has this problem. Maybe not just aio and cma. That means if
> you want to pin pages in memory, you have to use a file mapping. This makes
> the memory allocation more complicated. And the idea should be known by all
> the subsystem developers. Is that going to happen ?

Different subsystems will need to use different approaches to fixing the
issue. I doubt any single approach will work for everything.

> I also thought about reuse one field of struct page. But as you said, there
> may not be many users of this functionality. Reusing a field of struct page
> will make things more complicated and lead to high coupling.

What happens when more than one subsystem tries to pin a particular page?
What if it's a shared page rather than an anonymous page?

> So, how about the other idea that Mel mentioned ?
>
> We create a 1-1 mapping of pinned page ranges and the pinner (subsystem
> callbacks and data), maybe a global list or a hash table. And then, we can
> find the callbacks.

Maybe that is the simplest approach, but it's going to make get_user_pages()
slower and more complicated (as if it wasn't already). Maybe with all the
bells and whistles of per-cpu data structures and such you can make it work,
but I'm pretty sure someone running the large unmentionable benchmark will
complain about the performance regressions you're going to introduce. At
least in the case of the AIO ring buffer, using the address_space approach
doesn't introduce any new performance issues. There's also the bigger
question of if you can or cannot exclude get_user_pages_fast() from this.
In short: you've got a lot more work on your hands to do.

> Thanks. :)

Cheers,

-ben
--
"Thought is the essence of where you are now."

2013-06-11 09:39:42

by Tang Chen

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

Hi Benjamin,

Are you still working on this problem ?

Thanks. :)

On 05/21/2013 10:27 AM, Benjamin LaHaise wrote:
> On Tue, May 21, 2013 at 10:07:52AM +0800, Tang Chen wrote:
> ....
>> I'm not saying using two callbacks before and after migration is better.
>> I don't want to use address_space_operations is because there is no such
>> member
>> for anonymous pages.
>
> That depends on the nature of the pinning. For the general case of
> get_user_pages(), you're correct that it won't work for anonymous memory.
>
>> In your idea, using a file mapping will create a
>> address_space_operations. But
>> I really don't think we can modify the way of memory allocation for all the
>> subsystems who has this problem. Maybe not just aio and cma. That means if
>> you want to pin pages in memory, you have to use a file mapping. This makes
>> the memory allocation more complicated. And the idea should be known by all
>> the subsystem developers. Is that going to happen ?
>
> Different subsystems will need to use different approaches to fixing the
> issue. I doubt any single approach will work for everything.
>
>> I also thought about reuse one field of struct page. But as you said, there
>> may not be many users of this functionality. Reusing a field of struct page
>> will make things more complicated and lead to high coupling.
>
> What happens when more than one subsystem tries to pin a particular page?
> What if it's a shared page rather than an anonymous page?
>
>> So, how about the other idea that Mel mentioned ?
>>
>> We create a 1-1 mapping of pinned page ranges and the pinner (subsystem
>> callbacks and data), maybe a global list or a hash table. And then, we can
>> find the callbacks.
>
> Maybe that is the simplest approach, but it's going to make get_user_pages()
> slower and more complicated (as if it wasn't already). Maybe with all the
> bells and whistles of per-cpu data structures and such you can make it work,
> but I'm pretty sure someone running the large unmentionable benchmark will
> complain about the performance regressions you're going to introduce. At
> least in the case of the AIO ring buffer, using the address_space approach
> doesn't introduce any new performance issues. There's also the bigger
> question of if you can or cannot exclude get_user_pages_fast() from this.
> In short: you've got a lot more work on your hands to do.
>
>> Thanks. :)
>
> Cheers,
>
> -ben

2013-06-11 14:45:28

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

Hi Tang,

On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote:
> Hi Benjamin,
>
> Are you still working on this problem ?
>
> Thanks. :)

Below is a copy of the most recent version of this patch I have worked
on. This version works and stands up to my testing using move_pages() to
force the migration of the aio ring buffer. A test program is available
at http://www.kvack.org/~bcrl/aio/aio-numa-test.c . Please note that
this version is not suitable for mainline as the modifactions to the
anon inode code are undesirable, so that part needs reworking.

-ben


fs/aio.c | 113 ++++++++++++++++++++++++++++++++++++++++++++----
fs/anon_inodes.c | 14 ++++-
include/linux/migrate.h | 3 +
mm/migrate.c | 2
mm/swap.c | 1
5 files changed, 121 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c5b1a8c..a951690 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -35,6 +35,9 @@
#include <linux/eventfd.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
+#include <linux/anon_inodes.h>
+#include <linux/migrate.h>
+#include <linux/ramfs.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -108,6 +111,7 @@ struct kioctx {
} ____cacheline_aligned_in_smp;

struct page *internal_pages[AIO_RING_PAGES];
+ struct file *ctx_file;
};

/*------ sysctl variables----*/
@@ -136,18 +140,80 @@ __initcall(aio_setup);

static void aio_free_ring(struct kioctx *ctx)
{
- long i;
-
- for (i = 0; i < ctx->nr_pages; i++)
- put_page(ctx->ring_pages[i]);
+ int i;

if (ctx->mmap_size)
vm_munmap(ctx->mmap_base, ctx->mmap_size);

+ if (ctx->ctx_file)
+ truncate_setsize(ctx->ctx_file->f_inode, 0);
+
+ for (i = 0; i < ctx->nr_pages; i++) {
+ pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
+ page_count(ctx->ring_pages[i]));
+ put_page(ctx->ring_pages[i]);
+ }
+
if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
kfree(ctx->ring_pages);
+
+ if (ctx->ctx_file) {
+ truncate_setsize(ctx->ctx_file->f_inode, 0);
+ pr_debug("pid(%d) i_nlink=%u d_count=%d, d_unhashed=%d i_count=%d\n",
+ current->pid, ctx->ctx_file->f_inode->i_nlink,
+ ctx->ctx_file->f_path.dentry->d_count,
+ d_unhashed(ctx->ctx_file->f_path.dentry),
+ atomic_read(&ctx->ctx_file->f_path.dentry->d_inode->i_count));
+ fput(ctx->ctx_file);
+ ctx->ctx_file = NULL;
+ }
+}
+
+static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ vma->vm_ops = &generic_file_vm_ops;
+ return 0;
+}
+
+static const struct file_operations aio_ctx_fops = {
+ .mmap = aio_ctx_mmap,
+};
+
+static int aio_set_page_dirty(struct page *page)
+{
+ return 0;
+}
+
+static int aio_migratepage(struct address_space *mapping, struct page *new,
+ struct page *old, enum migrate_mode mode)
+{
+ struct kioctx *ctx = mapping->private_data;
+ unsigned long flags;
+ unsigned idx = old->index;
+ int rc;
+
+ BUG_ON(PageWriteback(old)); /* Writeback must be complete */
+ put_page(old);
+ rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+ if (rc != MIGRATEPAGE_SUCCESS) {
+ get_page(old);
+ return rc;
+ }
+ get_page(new);
+
+ spin_lock_irqsave(&ctx->completion_lock, flags);
+ migrate_page_copy(new, old);
+ ctx->ring_pages[idx] = new;
+ spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+ return MIGRATEPAGE_SUCCESS;
}

+static const struct address_space_operations aio_ctx_aops = {
+ .set_page_dirty = aio_set_page_dirty,
+ .migratepage = aio_migratepage,
+};
+
static int aio_setup_ring(struct kioctx *ctx)
{
struct aio_ring *ring;
@@ -155,6 +221,7 @@ static int aio_setup_ring(struct kioctx *ctx)
struct mm_struct *mm = current->mm;
unsigned long size, populate;
int nr_pages;
+ int i;

/* Compensate for the ring buffer's head/tail overlap entry */
nr_events += 2; /* 1 is required, 2 for good luck */
@@ -166,6 +233,28 @@ static int aio_setup_ring(struct kioctx *ctx)
if (nr_pages < 0)
return -EINVAL;

+ ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
+ if (IS_ERR(ctx->ctx_file)) {
+ ctx->ctx_file = NULL;
+ return -EAGAIN;
+ }
+ ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
+ ctx->ctx_file->f_inode->i_mapping->private_data = ctx;
+ ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
+
+ for (i=0; i<nr_pages; i++) {
+ struct page *page;
+ page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping,
+ i, GFP_HIGHUSER | __GFP_ZERO);
+ if (!page)
+ break;
+ pr_debug("pid(%d) page[%d]->count=%d\n",
+ current->pid, i, page_count(page));
+ SetPageUptodate(page);
+ SetPageDirty(page);
+ unlock_page(page);
+ }
+
nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);

ctx->nr_events = 0;
@@ -180,20 +269,25 @@ static int aio_setup_ring(struct kioctx *ctx)
ctx->mmap_size = nr_pages * PAGE_SIZE;
pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
down_write(&mm->mmap_sem);
- ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size,
- PROT_READ|PROT_WRITE,
- MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate);
+ ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, 0,
+ &populate);
if (IS_ERR((void *)ctx->mmap_base)) {
up_write(&mm->mmap_sem);
ctx->mmap_size = 0;
aio_free_ring(ctx);
return -EAGAIN;
}
+ up_write(&mm->mmap_sem);
+ mm_populate(ctx->mmap_base, populate);

pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
1, 0, ctx->ring_pages, NULL);
- up_write(&mm->mmap_sem);
+ for (i=0; i<ctx->nr_pages; i++) {
+ put_page(ctx->ring_pages[i]);
+ }

if (unlikely(ctx->nr_pages != nr_pages)) {
aio_free_ring(ctx);
@@ -403,6 +497,8 @@ out_cleanup:
err = -EAGAIN;
aio_free_ring(ctx);
out_freectx:
+ if (ctx->ctx_file)
+ fput(ctx->ctx_file);
kmem_cache_free(kioctx_cachep, ctx);
pr_debug("error allocating ioctx %d\n", err);
return ERR_PTR(err);
@@ -852,6 +948,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
ioctx = ioctx_alloc(nr_events);
ret = PTR_ERR(ioctx);
if (!IS_ERR(ioctx)) {
+ ctx = ioctx->user_id;
ret = put_user(ioctx->user_id, ctxp);
if (ret)
kill_ioctx(ioctx);
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 47a65df..376d289 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -131,6 +131,7 @@ struct file *anon_inode_getfile(const char *name,
struct qstr this;
struct path path;
struct file *file;
+ struct inode *inode;

if (IS_ERR(anon_inode_inode))
return ERR_PTR(-ENODEV);
@@ -138,6 +139,12 @@ struct file *anon_inode_getfile(const char *name,
if (fops->owner && !try_module_get(fops->owner))
return ERR_PTR(-ENOENT);

+ inode = anon_inode_mkinode(anon_inode_inode->i_sb);
+ if (IS_ERR(inode)) {
+ file = ERR_PTR(-ENOMEM);
+ goto err_module;
+ }
+
/*
* Link the inode to a directory entry by creating a unique name
* using the inode sequence number.
@@ -155,17 +162,18 @@ struct file *anon_inode_getfile(const char *name,
* We know the anon_inode inode count is always greater than zero,
* so ihold() is safe.
*/
- ihold(anon_inode_inode);
+ //ihold(inode);

- d_instantiate(path.dentry, anon_inode_inode);
+ d_instantiate(path.dentry, inode);

file = alloc_file(&path, OPEN_FMODE(flags), fops);
if (IS_ERR(file))
goto err_dput;
- file->f_mapping = anon_inode_inode->i_mapping;
+ file->f_mapping = inode->i_mapping;

file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
file->private_data = priv;
+ drop_nlink(inode);

return file;

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..b6f3289 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm,
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
+extern int migrate_page_move_mapping(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ struct buffer_head *head, enum migrate_mode mode);
#else

static inline void putback_lru_pages(struct list_head *l) {}
diff --git a/mm/migrate.c b/mm/migrate.c
index 27ed225..ac9c3a9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
* 2 for pages with a mapping
* 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
*/
-static int migrate_page_move_mapping(struct address_space *mapping,
+int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode)
{
diff --git a/mm/swap.c b/mm/swap.c
index dfd7d71..bbfba0a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -160,6 +160,7 @@ skip_lock_tail:

void put_page(struct page *page)
{
+ BUG_ON(page_count(page) <= 0);
if (unlikely(PageCompound(page)))
put_compound_page(page);
else if (put_page_testzero(page))

2013-06-28 09:27:39

by Gu Zheng

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On 06/11/2013 10:45 PM, Benjamin LaHaise wrote:

> Hi Tang,
>
> On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote:
>> Hi Benjamin,
>>
>> Are you still working on this problem ?
>>
>> Thanks. :)
>
> Below is a copy of the most recent version of this patch I have worked
> on. This version works and stands up to my testing using move_pages() to
> force the migration of the aio ring buffer. A test program is available
> at http://www.kvack.org/~bcrl/aio/aio-numa-test.c . Please note that
> this version is not suitable for mainline as the modifactions to the

> anon inode code are undesirable, so that part needs reworking.

Hi Ben,
Are you still working on this patch?
As you know, using the current anon inode will lead to more than one instance of
aio can not work. Have you found a way to fix this issue? Or can we use some
other ones to replace the anon inode?

Thanks,
Gu

>
> -ben
>
>
> fs/aio.c | 113 ++++++++++++++++++++++++++++++++++++++++++++----
> fs/anon_inodes.c | 14 ++++-
> include/linux/migrate.h | 3 +
> mm/migrate.c | 2
> mm/swap.c | 1
> 5 files changed, 121 insertions(+), 12 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index c5b1a8c..a951690 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -35,6 +35,9 @@
> #include <linux/eventfd.h>
> #include <linux/blkdev.h>
> #include <linux/compat.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/migrate.h>
> +#include <linux/ramfs.h>
>
> #include <asm/kmap_types.h>
> #include <asm/uaccess.h>
> @@ -108,6 +111,7 @@ struct kioctx {
> } ____cacheline_aligned_in_smp;
>
> struct page *internal_pages[AIO_RING_PAGES];
> + struct file *ctx_file;
> };
>
> /*------ sysctl variables----*/
> @@ -136,18 +140,80 @@ __initcall(aio_setup);
>
> static void aio_free_ring(struct kioctx *ctx)
> {
> - long i;
> -
> - for (i = 0; i < ctx->nr_pages; i++)
> - put_page(ctx->ring_pages[i]);
> + int i;
>
> if (ctx->mmap_size)
> vm_munmap(ctx->mmap_base, ctx->mmap_size);
>
> + if (ctx->ctx_file)
> + truncate_setsize(ctx->ctx_file->f_inode, 0);
> +
> + for (i = 0; i < ctx->nr_pages; i++) {
> + pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
> + page_count(ctx->ring_pages[i]));
> + put_page(ctx->ring_pages[i]);
> + }
> +
> if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
> kfree(ctx->ring_pages);
> +
> + if (ctx->ctx_file) {
> + truncate_setsize(ctx->ctx_file->f_inode, 0);
> + pr_debug("pid(%d) i_nlink=%u d_count=%d, d_unhashed=%d i_count=%d\n",
> + current->pid, ctx->ctx_file->f_inode->i_nlink,
> + ctx->ctx_file->f_path.dentry->d_count,
> + d_unhashed(ctx->ctx_file->f_path.dentry),
> + atomic_read(&ctx->ctx_file->f_path.dentry->d_inode->i_count));
> + fput(ctx->ctx_file);
> + ctx->ctx_file = NULL;
> + }
> +}
> +
> +static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + vma->vm_ops = &generic_file_vm_ops;
> + return 0;
> +}
> +
> +static const struct file_operations aio_ctx_fops = {
> + .mmap = aio_ctx_mmap,
> +};
> +
> +static int aio_set_page_dirty(struct page *page)
> +{
> + return 0;
> +}
> +
> +static int aio_migratepage(struct address_space *mapping, struct page *new,
> + struct page *old, enum migrate_mode mode)
> +{
> + struct kioctx *ctx = mapping->private_data;
> + unsigned long flags;
> + unsigned idx = old->index;
> + int rc;
> +
> + BUG_ON(PageWriteback(old)); /* Writeback must be complete */
> + put_page(old);
> + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
> + if (rc != MIGRATEPAGE_SUCCESS) {
> + get_page(old);
> + return rc;
> + }
> + get_page(new);
> +
> + spin_lock_irqsave(&ctx->completion_lock, flags);
> + migrate_page_copy(new, old);
> + ctx->ring_pages[idx] = new;
> + spin_unlock_irqrestore(&ctx->completion_lock, flags);
> +
> + return MIGRATEPAGE_SUCCESS;
> }
>
> +static const struct address_space_operations aio_ctx_aops = {
> + .set_page_dirty = aio_set_page_dirty,
> + .migratepage = aio_migratepage,
> +};
> +
> static int aio_setup_ring(struct kioctx *ctx)
> {
> struct aio_ring *ring;
> @@ -155,6 +221,7 @@ static int aio_setup_ring(struct kioctx *ctx)
> struct mm_struct *mm = current->mm;
> unsigned long size, populate;
> int nr_pages;
> + int i;
>
> /* Compensate for the ring buffer's head/tail overlap entry */
> nr_events += 2; /* 1 is required, 2 for good luck */
> @@ -166,6 +233,28 @@ static int aio_setup_ring(struct kioctx *ctx)
> if (nr_pages < 0)
> return -EINVAL;
>
> + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
> + if (IS_ERR(ctx->ctx_file)) {
> + ctx->ctx_file = NULL;
> + return -EAGAIN;
> + }
> + ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
> + ctx->ctx_file->f_inode->i_mapping->private_data = ctx;
> + ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
> +
> + for (i=0; i<nr_pages; i++) {
> + struct page *page;
> + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping,
> + i, GFP_HIGHUSER | __GFP_ZERO);
> + if (!page)
> + break;
> + pr_debug("pid(%d) page[%d]->count=%d\n",
> + current->pid, i, page_count(page));
> + SetPageUptodate(page);
> + SetPageDirty(page);
> + unlock_page(page);
> + }
> +
> nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
>
> ctx->nr_events = 0;
> @@ -180,20 +269,25 @@ static int aio_setup_ring(struct kioctx *ctx)
> ctx->mmap_size = nr_pages * PAGE_SIZE;
> pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
> down_write(&mm->mmap_sem);
> - ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size,
> - PROT_READ|PROT_WRITE,
> - MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate);
> + ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size,
> + PROT_READ | PROT_WRITE,
> + MAP_SHARED | MAP_POPULATE, 0,
> + &populate);
> if (IS_ERR((void *)ctx->mmap_base)) {
> up_write(&mm->mmap_sem);
> ctx->mmap_size = 0;
> aio_free_ring(ctx);
> return -EAGAIN;
> }
> + up_write(&mm->mmap_sem);
> + mm_populate(ctx->mmap_base, populate);
>
> pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
> ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
> 1, 0, ctx->ring_pages, NULL);
> - up_write(&mm->mmap_sem);
> + for (i=0; i<ctx->nr_pages; i++) {
> + put_page(ctx->ring_pages[i]);
> + }
>
> if (unlikely(ctx->nr_pages != nr_pages)) {
> aio_free_ring(ctx);
> @@ -403,6 +497,8 @@ out_cleanup:
> err = -EAGAIN;
> aio_free_ring(ctx);
> out_freectx:
> + if (ctx->ctx_file)
> + fput(ctx->ctx_file);
> kmem_cache_free(kioctx_cachep, ctx);
> pr_debug("error allocating ioctx %d\n", err);
> return ERR_PTR(err);
> @@ -852,6 +948,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
> ioctx = ioctx_alloc(nr_events);
> ret = PTR_ERR(ioctx);
> if (!IS_ERR(ioctx)) {
> + ctx = ioctx->user_id;
> ret = put_user(ioctx->user_id, ctxp);
> if (ret)
> kill_ioctx(ioctx);
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 47a65df..376d289 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -131,6 +131,7 @@ struct file *anon_inode_getfile(const char *name,
> struct qstr this;
> struct path path;
> struct file *file;
> + struct inode *inode;
>
> if (IS_ERR(anon_inode_inode))
> return ERR_PTR(-ENODEV);
> @@ -138,6 +139,12 @@ struct file *anon_inode_getfile(const char *name,
> if (fops->owner && !try_module_get(fops->owner))
> return ERR_PTR(-ENOENT);
>
> + inode = anon_inode_mkinode(anon_inode_inode->i_sb);
> + if (IS_ERR(inode)) {
> + file = ERR_PTR(-ENOMEM);
> + goto err_module;
> + }
> +
> /*
> * Link the inode to a directory entry by creating a unique name
> * using the inode sequence number.
> @@ -155,17 +162,18 @@ struct file *anon_inode_getfile(const char *name,
> * We know the anon_inode inode count is always greater than zero,
> * so ihold() is safe.
> */
> - ihold(anon_inode_inode);
> + //ihold(inode);
>
> - d_instantiate(path.dentry, anon_inode_inode);
> + d_instantiate(path.dentry, inode);
>
> file = alloc_file(&path, OPEN_FMODE(flags), fops);
> if (IS_ERR(file))
> goto err_dput;
> - file->f_mapping = anon_inode_inode->i_mapping;
> + file->f_mapping = inode->i_mapping;
>
> file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
> file->private_data = priv;
> + drop_nlink(inode);
>
> return file;
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index a405d3dc..b6f3289 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm,
> extern void migrate_page_copy(struct page *newpage, struct page *page);
> extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page);
> +extern int migrate_page_move_mapping(struct address_space *mapping,
> + struct page *newpage, struct page *page,
> + struct buffer_head *head, enum migrate_mode mode);
> #else
>
> static inline void putback_lru_pages(struct list_head *l) {}
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 27ed225..ac9c3a9 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
> * 2 for pages with a mapping
> * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
> */
> -static int migrate_page_move_mapping(struct address_space *mapping,
> +int migrate_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page,
> struct buffer_head *head, enum migrate_mode mode)
> {
> diff --git a/mm/swap.c b/mm/swap.c
> index dfd7d71..bbfba0a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -160,6 +160,7 @@ skip_lock_tail:
>
> void put_page(struct page *page)
> {
> + BUG_ON(page_count(page) <= 0);
> if (unlikely(PageCompound(page)))
> put_compound_page(page);
> else if (put_page_testzero(page))
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-07-01 07:26:58

by Gu Zheng

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On 06/11/2013 10:45 PM, Benjamin LaHaise wrote:

> Hi Tang,
>
> On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote:
>> Hi Benjamin,
>>
>> Are you still working on this problem ?
>>
>> Thanks. :)
>
> Below is a copy of the most recent version of this patch I have worked
> on. This version works and stands up to my testing using move_pages() to
> force the migration of the aio ring buffer. A test program is available
> at http://www.kvack.org/~bcrl/aio/aio-numa-test.c . Please note that
> this version is not suitable for mainline as the modifactions to the
> anon inode code are undesirable, so that part needs reworking.



Hi Ben,
Are you still working on this patch?
As you know, using the current anon inode will lead to more than one instance of
aio can not work. Have you found a way to fix this issue? Or can we use some
other ones to replace the anon inode?

Thanks,
Gu

>
> -ben
>
>
> fs/aio.c | 113 ++++++++++++++++++++++++++++++++++++++++++++----
> fs/anon_inodes.c | 14 ++++-
> include/linux/migrate.h | 3 +
> mm/migrate.c | 2
> mm/swap.c | 1
> 5 files changed, 121 insertions(+), 12 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index c5b1a8c..a951690 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -35,6 +35,9 @@
> #include <linux/eventfd.h>
> #include <linux/blkdev.h>
> #include <linux/compat.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/migrate.h>
> +#include <linux/ramfs.h>
>
> #include <asm/kmap_types.h>
> #include <asm/uaccess.h>
> @@ -108,6 +111,7 @@ struct kioctx {
> } ____cacheline_aligned_in_smp;
>
> struct page *internal_pages[AIO_RING_PAGES];
> + struct file *ctx_file;
> };
>
> /*------ sysctl variables----*/
> @@ -136,18 +140,80 @@ __initcall(aio_setup);
>
> static void aio_free_ring(struct kioctx *ctx)
> {
> - long i;
> -
> - for (i = 0; i < ctx->nr_pages; i++)
> - put_page(ctx->ring_pages[i]);
> + int i;
>
> if (ctx->mmap_size)
> vm_munmap(ctx->mmap_base, ctx->mmap_size);
>
> + if (ctx->ctx_file)
> + truncate_setsize(ctx->ctx_file->f_inode, 0);
> +
> + for (i = 0; i < ctx->nr_pages; i++) {
> + pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
> + page_count(ctx->ring_pages[i]));
> + put_page(ctx->ring_pages[i]);
> + }
> +
> if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
> kfree(ctx->ring_pages);
> +
> + if (ctx->ctx_file) {
> + truncate_setsize(ctx->ctx_file->f_inode, 0);
> + pr_debug("pid(%d) i_nlink=%u d_count=%d, d_unhashed=%d i_count=%d\n",
> + current->pid, ctx->ctx_file->f_inode->i_nlink,
> + ctx->ctx_file->f_path.dentry->d_count,
> + d_unhashed(ctx->ctx_file->f_path.dentry),
> + atomic_read(&ctx->ctx_file->f_path.dentry->d_inode->i_count));
> + fput(ctx->ctx_file);
> + ctx->ctx_file = NULL;
> + }
> +}
> +
> +static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + vma->vm_ops = &generic_file_vm_ops;
> + return 0;
> +}
> +
> +static const struct file_operations aio_ctx_fops = {
> + .mmap = aio_ctx_mmap,
> +};
> +
> +static int aio_set_page_dirty(struct page *page)
> +{
> + return 0;
> +}
> +
> +static int aio_migratepage(struct address_space *mapping, struct page *new,
> + struct page *old, enum migrate_mode mode)
> +{
> + struct kioctx *ctx = mapping->private_data;
> + unsigned long flags;
> + unsigned idx = old->index;
> + int rc;
> +
> + BUG_ON(PageWriteback(old)); /* Writeback must be complete */
> + put_page(old);
> + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
> + if (rc != MIGRATEPAGE_SUCCESS) {
> + get_page(old);
> + return rc;
> + }
> + get_page(new);
> +
> + spin_lock_irqsave(&ctx->completion_lock, flags);
> + migrate_page_copy(new, old);
> + ctx->ring_pages[idx] = new;
> + spin_unlock_irqrestore(&ctx->completion_lock, flags);
> +
> + return MIGRATEPAGE_SUCCESS;
> }
>
> +static const struct address_space_operations aio_ctx_aops = {
> + .set_page_dirty = aio_set_page_dirty,
> + .migratepage = aio_migratepage,
> +};
> +
> static int aio_setup_ring(struct kioctx *ctx)
> {
> struct aio_ring *ring;
> @@ -155,6 +221,7 @@ static int aio_setup_ring(struct kioctx *ctx)
> struct mm_struct *mm = current->mm;
> unsigned long size, populate;
> int nr_pages;
> + int i;
>
> /* Compensate for the ring buffer's head/tail overlap entry */
> nr_events += 2; /* 1 is required, 2 for good luck */
> @@ -166,6 +233,28 @@ static int aio_setup_ring(struct kioctx *ctx)
> if (nr_pages < 0)
> return -EINVAL;
>
> + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
> + if (IS_ERR(ctx->ctx_file)) {
> + ctx->ctx_file = NULL;
> + return -EAGAIN;
> + }
> + ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
> + ctx->ctx_file->f_inode->i_mapping->private_data = ctx;
> + ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
> +
> + for (i=0; i<nr_pages; i++) {
> + struct page *page;
> + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping,
> + i, GFP_HIGHUSER | __GFP_ZERO);
> + if (!page)
> + break;
> + pr_debug("pid(%d) page[%d]->count=%d\n",
> + current->pid, i, page_count(page));
> + SetPageUptodate(page);
> + SetPageDirty(page);
> + unlock_page(page);
> + }
> +
> nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
>
> ctx->nr_events = 0;
> @@ -180,20 +269,25 @@ static int aio_setup_ring(struct kioctx *ctx)
> ctx->mmap_size = nr_pages * PAGE_SIZE;
> pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
> down_write(&mm->mmap_sem);
> - ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size,
> - PROT_READ|PROT_WRITE,
> - MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate);
> + ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size,
> + PROT_READ | PROT_WRITE,
> + MAP_SHARED | MAP_POPULATE, 0,
> + &populate);
> if (IS_ERR((void *)ctx->mmap_base)) {
> up_write(&mm->mmap_sem);
> ctx->mmap_size = 0;
> aio_free_ring(ctx);
> return -EAGAIN;
> }
> + up_write(&mm->mmap_sem);
> + mm_populate(ctx->mmap_base, populate);
>
> pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
> ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
> 1, 0, ctx->ring_pages, NULL);
> - up_write(&mm->mmap_sem);
> + for (i=0; i<ctx->nr_pages; i++) {
> + put_page(ctx->ring_pages[i]);
> + }
>
> if (unlikely(ctx->nr_pages != nr_pages)) {
> aio_free_ring(ctx);
> @@ -403,6 +497,8 @@ out_cleanup:
> err = -EAGAIN;
> aio_free_ring(ctx);
> out_freectx:
> + if (ctx->ctx_file)
> + fput(ctx->ctx_file);
> kmem_cache_free(kioctx_cachep, ctx);
> pr_debug("error allocating ioctx %d\n", err);
> return ERR_PTR(err);
> @@ -852,6 +948,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
> ioctx = ioctx_alloc(nr_events);
> ret = PTR_ERR(ioctx);
> if (!IS_ERR(ioctx)) {
> + ctx = ioctx->user_id;
> ret = put_user(ioctx->user_id, ctxp);
> if (ret)
> kill_ioctx(ioctx);
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 47a65df..376d289 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -131,6 +131,7 @@ struct file *anon_inode_getfile(const char *name,
> struct qstr this;
> struct path path;
> struct file *file;
> + struct inode *inode;
>
> if (IS_ERR(anon_inode_inode))
> return ERR_PTR(-ENODEV);
> @@ -138,6 +139,12 @@ struct file *anon_inode_getfile(const char *name,
> if (fops->owner && !try_module_get(fops->owner))
> return ERR_PTR(-ENOENT);
>
> + inode = anon_inode_mkinode(anon_inode_inode->i_sb);
> + if (IS_ERR(inode)) {
> + file = ERR_PTR(-ENOMEM);
> + goto err_module;
> + }
> +
> /*
> * Link the inode to a directory entry by creating a unique name
> * using the inode sequence number.
> @@ -155,17 +162,18 @@ struct file *anon_inode_getfile(const char *name,
> * We know the anon_inode inode count is always greater than zero,
> * so ihold() is safe.
> */
> - ihold(anon_inode_inode);
> + //ihold(inode);
>
> - d_instantiate(path.dentry, anon_inode_inode);
> + d_instantiate(path.dentry, inode);
>
> file = alloc_file(&path, OPEN_FMODE(flags), fops);
> if (IS_ERR(file))
> goto err_dput;
> - file->f_mapping = anon_inode_inode->i_mapping;
> + file->f_mapping = inode->i_mapping;
>
> file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
> file->private_data = priv;
> + drop_nlink(inode);
>
> return file;
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index a405d3dc..b6f3289 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm,
> extern void migrate_page_copy(struct page *newpage, struct page *page);
> extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page);
> +extern int migrate_page_move_mapping(struct address_space *mapping,
> + struct page *newpage, struct page *page,
> + struct buffer_head *head, enum migrate_mode mode);
> #else
>
> static inline void putback_lru_pages(struct list_head *l) {}
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 27ed225..ac9c3a9 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
> * 2 for pages with a mapping
> * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
> */
> -static int migrate_page_move_mapping(struct address_space *mapping,
> +int migrate_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page,
> struct buffer_head *head, enum migrate_mode mode)
> {
> diff --git a/mm/swap.c b/mm/swap.c
> index dfd7d71..bbfba0a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -160,6 +160,7 @@ skip_lock_tail:
>
> void put_page(struct page *page)
> {
> + BUG_ON(page_count(page) <= 0);
> if (unlikely(PageCompound(page)))
> put_compound_page(page);
> else if (put_page_testzero(page))
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>



2013-07-02 18:00:13

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote:
> Hi Ben,
> Are you still working on this patch?
> As you know, using the current anon inode will lead to more than one instance of
> aio can not work. Have you found a way to fix this issue? Or can we use some
> other ones to replace the anon inode?

This patch hasn't been a high priority for me. I would really appreciate
it if someone could confirm that this patch does indeed fix the hotplug
page migration issue by testing it in a system that hits the bug. Removing
the anon_inode bits isn't too much work, but I'd just like to have some
confirmation that this fix is considered to be "good enough" for the
problem at hand before spending any further time on it. There was talk of
using another approach, but it's not clear if there was any progress.

-ben
--
"Thought is the essence of where you are now."

2013-07-03 01:56:55

by Gu Zheng

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On 07/03/2013 02:00 AM, Benjamin LaHaise wrote:

> On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote:
>> Hi Ben,
>> Are you still working on this patch?
>> As you know, using the current anon inode will lead to more than one instance of
>> aio can not work. Have you found a way to fix this issue? Or can we use some
>> other ones to replace the anon inode?
>
> This patch hasn't been a high priority for me. I would really appreciate
> it if someone could confirm that this patch does indeed fix the hotplug
> page migration issue by testing it in a system that hits the bug. Removing
> the anon_inode bits isn't too much work, but I'd just like to have some
> confirmation that this fix is considered to be "good enough" for the
> problem at hand before spending any further time on it. There was talk of
> using another approach, but it's not clear if there was any progress.

Yeah, we have not seen anyone try to fix this issue using the other approach we
talked. I'm not sure whether your patch can indeed fix the problem, but I'll
carry out a complete test to confirm it, and I'll be very glad to continue this
job based on your patch if you do not have enough time working on it.:)

Thanks,
Gu

>
> -ben

2013-07-04 06:54:45

by Gu Zheng

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On 07/03/2013 02:00 AM, Benjamin LaHaise wrote:

> On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote:
>> Hi Ben,
>> Are you still working on this patch?
>> As you know, using the current anon inode will lead to more than one instance of
>> aio can not work. Have you found a way to fix this issue? Or can we use some
>> other ones to replace the anon inode?
>
> This patch hasn't been a high priority for me. I would really appreciate
> it if someone could confirm that this patch does indeed fix the hotplug
> page migration issue by testing it in a system that hits the bug. Removing
> the anon_inode bits isn't too much work, but I'd just like to have some
> confirmation that this fix is considered to be "good enough" for the
> problem at hand before spending any further time on it. There was talk of
> using another approach, but it's not clear if there was any progress.

Hi Ben,
When I test your patch on kernel 3.10, the kernel panic when aio job
complete or exit, exactly in aio_free_ring(), the following is a part of dmesg.

Thanks,
Gu

kernel BUG at mm/swap.c:163!

invalid opcode: 0000 [#1] SMP

Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4
nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand
ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT
nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net
macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table
mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich
mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp
pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F)
mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F)

CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF 3.10.0-aio-migrate+
#107
Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS
Version 89.32 DP Proto 08/16/2012
Workqueue: events kill_ioctx_work

task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000

RIP: 0010:[<ffffffff8111a9a8>] [<ffffffff8111a9a8>] put_page+0x48/0x60

RSP: 0018:ffff8807dda99cd8 EFLAGS: 00010246

RAX: 0000000000000000 RBX: ffff8807be1f1e00 RCX: 0000000000000001

RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea001b196c80

RBP: ffff8807dda99cd8 R08: 0000000000000000 R09: 0000000000000000

R10: ffff8807ffbb5f00 R11: 000000000000005a R12: 0000000000000001

R13: 0000000000000000 R14: ffff8807dda974e0 R15: ffff8807be1f1ec8

FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000

CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b

CR2: 0000003b826dc7d0 CR3: 0000000001a0b000 CR4: 00000000000007e0

DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Stack:

ffff8807dda99d18 ffffffff811b11f6 0000000000000000 0000000200000000

ffff8807be1f1e00 ffff8807be1f1e80 000000000000000c 0000000000000000

ffff8807dda99dc8 ffffffff811b21a2 00000001000438ec ffff8807fd692d00

Call Trace:

[<ffffffff811b11f6>] aio_free_ring+0x96/0x1c0

[<ffffffff811b21a2>] free_ioctx+0x1f2/0x250

[<ffffffff81081a5d>] ? idle_balance+0xed/0x140

[<ffffffff811b221a>] put_ioctx+0x1a/0x30

[<ffffffff811b24af>] kill_ioctx_work+0x2f/0x40

[<ffffffff81060933>] process_one_work+0x183/0x490

[<ffffffff81061ac0>] worker_thread+0x120/0x3a0

[<ffffffff810619a0>] ? manage_workers+0x160/0x160

[<ffffffff8106786e>] kthread+0xce/0xe0

[<ffffffff810677a0>] ? kthread_freezable_should_stop+0x70/0x70

[<ffffffff8154b79c>] ret_from_fork+0x7c/0xb0

[<ffffffff810677a0>] ? kthread_freezable_should_stop+0x70/0x70

Code: 07 00 c0 75 1f f0 ff 4f 1c 0f 94 c0 84 c0 75 0b c9 66 90 c3 0f 1f 80 00 00
00 00 e8 53 fe ff ff c9 66 90 c3 e8 7a fe ff ff c9 c3 <0f> 0b 66 0f 1f 44 00 00
eb f8 48 8b 47 30 eb bc 0f 1f 84 00 00
RIP [<ffffffff8111a9a8>] put_page+0x48/0x60

RSP <ffff8807dda99cd8>

---[ end trace b5e2c17407c840d8 ]---

Jul 4 15:49:50 BUG: unable to handle kernel paging request at ffffffffffffffd8

IP: [<ffffffff81067140>] kthread_data+0x10/0x20

PGD 1a0c067 PUD 1a0e067 PMD 0

Oops: 0000 [#2] SMP

Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4
nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand
ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT
nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net
macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table
mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich
mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp
pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F)
mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F)

CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF D 3.10.0-aio-migrate+
#107
Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS
Version 89.32 DP Proto 08/16/2012
task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000

RIP: 0010:[<ffffffff81067140>] [<ffffffff81067140>] kthread_data+0x10/0x20

RSP: 0018:ffff8807dda999b8 EFLAGS: 00010092

RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffffffff81da3ea0

RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8807dda974e0

RBP: ffff8807dda999b8 R08: ffff8807dda97550 R09: 0000000000000006

R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004

R13: ffff8807dda97ab8 R14: 0000000000000001 R15: 0000000000000006

FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000

CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b

CR2: 0000000000000028 CR3: 0000000001a0b000 CR4: 00000000000007e0

DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Stack:

ffff8807dda999d8 ffffffff8105e155 ffff8807dda999d8 ffff8807fd692d00

ffff8807dda99a68 ffffffff8154168b ffff8807dda99fd8 0000000000012d00

ffff8807dda98010 0000000000012d00 0000000000012d00 0000000000012d00

Call Trace:

[<ffffffff8105e155>] wq_worker_sleeping+0x15/0xa0

[<ffffffff8154168b>] __schedule+0x5ab/0x6f0

[<ffffffff81239992>] ? put_io_context_active+0xc2/0xf0

[<ffffffff815419a9>] schedule+0x29/0x70

[<ffffffff81047795>] do_exit+0x2d5/0x480

[<ffffffff81544029>] oops_end+0xa9/0xf0

[<ffffffff810058eb>] die+0x5b/0x90

[<ffffffff81543b8b>] do_trap+0xcb/0x170

[<ffffffff81546e22>] ? __atomic_notifier_call_chain+0x12/0x20

[<ffffffff81003565>] do_invalid_op+0x95/0xb0

[<ffffffff8111a9a8>] ? put_page+0x48/0x60

[<ffffffff8111c411>] ? truncate_inode_pages_range+0x201/0x500

[<ffffffff8154c8e8>] invalid_op+0x18/0x20

[<ffffffff8111a9a8>] ? put_page+0x48/0x60

[<ffffffff8111c829>] ? truncate_setsize+0x19/0x20

[<ffffffff811b11f6>] aio_free_ring+0x96/0x1c0

[<ffffffff811b21a2>] free_ioctx+0x1f2/0x250

[<ffffffff81081a5d>] ? idle_balance+0xed/0x140

[<ffffffff811b221a>] put_ioctx+0x1a/0x30

[<ffffffff811b24af>] kill_ioctx_work+0x2f/0x40

[<ffffffff81060933>] process_one_work+0x183/0x490

[<ffffffff81061ac0>] worker_thread+0x120/0x3a0

[<ffffffff810619a0>] ? manage_workers+0x160/0x160

[<ffffffff8106786e>] kthread+0xce/0xe0

[<ffffffff810677a0>] ? kthread_freezable_should_stop+0x70/0x70

[<ffffffff8154b79c>] ret_from_fork+0x7c/0xb0

[<ffffffff810677a0>] ? kthread_freezable_should_stop+0x70/0x70

Code: 80 05 00 00 48 8b 40 c8 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00
00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 80 05 00 00 <48> 8b 40 d8 c9 c3 66 2e
0f 1f 84 00 00 00 00 00 55 48 89 e5 66
RIP [<ffffffff81067140>] kthread_data+0x10/0x20

RSP <ffff8807dda999b8>

CR2: ffffffffffffffd8

---[ end trace b5e2c17407c840d9 ]---

DP kernel: -----Fixing recursive fault but reboot is needed!

-------[ cut here ]------------

Jul 4 15:49:50 DP kernel: kernel BUG at mm/swap.c:163!

Jul 4 15:49:50 DP kernel: invalid opcode: 0000 [#1] SMP

Jul 4 15:49:50 DP kernel: Modules linked in: ebtable_nat ebtables
ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge
stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4
nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror
dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt
iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm
crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac
edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F)
mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F) mptsas(F) mptscsih(F)
mptbase(F) scsi_transport_sas(F)
Jul 4 15:49:50 DP kernel: CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF
3.10.0-aio-migrate+ #107
Jul 4 15:49:50 DP kernel: Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS
PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012

Jul 4 15:49:50 DP kernel: Workqueue: events kill_ioctx_work

Jul 4 15:49:50 DP kernel: task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti:
ffff8807dda98000
Jul 4 15:49:50 DP kernel: RIP: 0010:[<ffffffff8111a9a8>] [<ffffffff8111a9a8>]
put_page+0x48/0x60
Jul 4 15:49:50 DP kernel: RSP: 0018:ffff8807dda99cd8 EFLAGS: 00010246

Jul 4 15:49:50 DP kernel: RAX: 0000000000000000 RBX: ffff8807be1f1e00 RCX:
0000000000000001
Jul 4 15:49:50 DP kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffffea001b196c80
Jul 4 15:49:50 DP kernel: RBP: ffff8807dda99cd8 R08: 0000000000000000 R09:
0000000000000000
Jul 4 15:49:50 DP kernel: R10: ffff8807ffbb5f00 R11: 000000000000005a R12:
0000000000000001
Jul 4 15:49:50 DP kernel: R13: 0000000000000000 R14: ffff8807dda974e0 R15:
ffff8807be1f1ec8
Jul 4 15:49:50 DP kernel: FS: 0000000000000000(0000) GS:ffff8807fd680000(0000)
knlGS:0000000000000000
Jul 4 15:49:50 DP kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b

Jul 4 15:49:50 DP kernel: CR2: 0000003b826dc7d0 CR3: 0000000001a0b000 CR4:
00000000000007e0
Jul 4 15:49:50 DP kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
Jul 4 15:49:50 DP kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
Jul 4 15:49:50 DP kernel: Stack:

Jul 4 15:49:50 DP kernel: ffff8807dda99d18 ffffffff811b11f6 0000000000000000
0000000200000000
Jul 4 15:49:50 DP kernel: ffff8807be1f1e00 ffff8807be1f1e80 000000000000000c
0000000000000000
Jul 4 15:49:50 DP kernel: ffff8807dda99dc8 ffffffff811b21a2 00000001000438ec
ffff8807fd692d00
Jul 4 15:49:50 DP kernel: Call Trace:

Jul 4 15:49:50 DP kernel: [<ffffffff811b11f6>] aio_free_ring+0x96/0x1c0

Jul 4 15:49:50 DP kernel: [<ffffffff811b21a2>] free_ioctx+0x1f2/0x250

Jul 4 15:49:50 DP kernel: [<ffffffff81081a5d>] ? idle_balance+0xed/0x140

Jul 4 15:49:50 DP kernel: [<ffffffff811b221a>] put_ioctx+0x1a/0x30

Jul 4 15:49:50 DP kernel: [<ffffffff811b24af>] kill_ioctx_work+0x2f/0x40

Jul 4 15:49:50 DP kernel: [<ffffffff81060933>] process_one_work+0x183/0x490

Jul 4 15:49:50 DP kernel: [<ffffffff81061ac0>] worker_thread+0x120/0x3a0

Jul 4 15:49:50 DP kernel: [<ffffffff810619a0>] ? manage_workers+0x160/0x160

Jul 4 15:49:50 DP kernel: [<ffffffff8106786e>] kthread+0xce/0xe0

Jul 4 15:49:50 DP kernel: [<ffffffff810677a0>] ?
kthread_freezable_should_stop+0x70/0x70
Jul 4 15:49:50 DP kernel: [<ffffffff8154b79c>] ret_from_fork+0x7c/0xb0

Jul 4 15:49:50 DP kernel: [<ffffffff810677a0>] ?
kthread_freezable_should_stop+0x70/0x70
Jul 4 15:49:50 DP kernel: Code: 07 00 c0 75 1f f0 ff 4f 1c 0f 94 c0 84 c0 75 0b
c9 66 90 c3 0f 1f 80 00 00 00 00 e8 53 fe ff ff c9 66 90 c3 e8 7a fe ff ff c9 c3
<0f> 0b 66 0f 1f 44 00 00 eb f8 48 8b 47 30 eb bc 0f 1f 84 00 00
Jul 4 15:49:50 DP kernel: RIP [<ffffffff8111a9a8>] put_page+0x48/0x60

Jul 4 15:49:50 DP kernel: RSP <ffff8807dda99cd8>

Jul 4 15:49:50 DP kernel: ---[ end trace b5e2c17407c840d8 ]---

INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 9, t=21056
jiffies, g=4158, c=4157, q=1040)
sending NMI to all CPUs:

NMI backtrace for cpu 4

CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF D 3.10.0-aio-migrate+
#107
Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS
Version 89.32 DP Proto 08/16/2012
task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000

RIP: 0010:[<ffffffff81542cd2>] [<ffffffff81542cd2>]
_raw_spin_lock_irq+0x22/0x30
RSP: 0018:ffff8807dda99618 EFLAGS: 00000002

RAX: 000000000000497c RBX: ffff8807fd692d00 RCX: ffff8807dda98010

RDX: 000000000000497e RSI: ffffffff815419a9 RDI: ffff8807fd692d00

RBP: ffff8807dda99618 R08: 0000000000000004 R09: 0000000000000100

R10: 00000000000009fe R11: 00000000000009fe R12: 0000000000000004

R13: 0000000000000009 R14: 0000000000000009 R15: 0000000000000000

FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000

CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b

CR2: 0000000000000028 CR3: 0000000001a0b000 CR4: 00000000000007e0

DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Stack:

ffff8807dda996a8 ffffffff815411b6 ffff8807dda99fd8 0000000000012d00

ffff8807dda98010 0000000000012d00 0000000000012d00 0000000000012d00

ffff8807dda99fd8 0000000000012d00 ffff8807dda974e0 ffff8807dda996c8

Call Trace:

[<ffffffff815411b6>] __schedule+0xd6/0x6f0

[<ffffffff815419a9>] schedule+0x29/0x70

[<ffffffff810478ea>] do_exit+0x42a/0x480

[<ffffffff81544029>] oops_end+0xa9/0xf0

[<ffffffff81035c3e>] no_context+0x11e/0x1f0

[<ffffffff81035e2d>] __bad_area_nosemaphore+0x11d/0x220

[<ffffffff81035f43>] bad_area_nosemaphore+0x13/0x20

[<ffffffff815469b5>] __do_page_fault+0xc5/0x490

[<ffffffff810d4a97>] ? call_rcu_sched+0x17/0x20

[<ffffffff8125d3da>] ? strlcpy+0x4a/0x60

[<ffffffff81546d8e>] do_page_fault+0xe/0x10

[<ffffffff815434f2>] page_fault+0x22/0x30

[<ffffffff81067140>] ? kthread_data+0x10/0x20

[<ffffffff8105e155>] wq_worker_sleeping+0x15/0xa0

[<ffffffff8154168b>] __schedule+0x5ab/0x6f0

[<ffffffff81239992>] ? put_io_context_active+0xc2/0xf0

[<ffffffff815419a9>] schedule+0x29/0x70

[<ffffffff81047795>] do_exit+0x2d5/0x480

[<ffffffff81544029>] oops_end+0xa9/0xf0

[<ffffffff810058eb>] die+0x5b/0x90

[<ffffffff81543b8b>] do_trap+0xcb/0x170

[<ffffffff81546e22>] ? __atomic_notifier_call_chain+0x12/0x20

[<ffffffff81003565>] do_invalid_op+0x95/0xb0

[<ffffffff8111a9a8>] ? put_page+0x48/0x60

[<ffffffff8111c411>] ? truncate_inode_pages_range+0x201/0x500

[<ffffffff8154c8e8>] invalid_op+0x18/0x20

[<ffffffff8111a9a8>] ? put_page+0x48/0x60

[<ffffffff8111c829>] ? truncate_setsize+0x19/0x20

[<ffffffff811b11f6>] aio_free_ring+0x96/0x1c0

[<ffffffff811b21a2>] free_ioctx+0x1f2/0x250

[<ffffffff81081a5d>] ? idle_balance+0xed/0x140

[<ffffffff811b221a>] put_ioctx+0x1a/0x30

[<ffffffff811b24af>] kill_ioctx_work+0x2f/0x40

[<ffffffff81060933>] process_one_work+0x183/0x490

[<ffffffff81061ac0>] worker_thread+0x120/0x3a0

[<ffffffff810619a0>] ? manage_workers+0x160/0x160

[<ffffffff8106786e>] kthread+0xce/0xe0

[<ffffffff810677a0>] ? kthread_freezable_should_stop+0x70/0x70

[<ffffffff8154b79c>] ret_from_fork+0x7c/0xb0

[<ffffffff810677a0>] ? kthread_freezable_should_stop+0x70/0x70

Code: 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 fa b8 00 00 01 00 f0
0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0d 0f 1f 00 f3 90 <0f> b7 07 66 39 c2 75 f6
c9 c3 0f 1f 40 00 55 48 89 e5 66 66 66
NMI backtrace for cpu 1


>
> -ben

2013-07-04 11:41:57

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On Thu, Jul 04, 2013 at 02:51:18PM +0800, Gu Zheng wrote:
> Hi Ben,
> When I test your patch on kernel 3.10, the kernel panic when aio job
> complete or exit, exactly in aio_free_ring(), the following is a part of dmesg.

What is your test case?

-ben

2013-07-05 03:24:21

by Gu Zheng

[permalink] [raw]
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

On 07/04/2013 07:41 PM, Benjamin LaHaise wrote:

> On Thu, Jul 04, 2013 at 02:51:18PM +0800, Gu Zheng wrote:
>> Hi Ben,
>> When I test your patch on kernel 3.10, the kernel panic when aio job
>> complete or exit, exactly in aio_free_ring(), the following is a part of dmesg.
>
> What is your test case?

Just the one you mentioned in the previous mail:
http://www.kvack.org/~bcrl/aio/aio-numa-test.c

Thanks,
Gu

>
> -ben
>