2004-10-01 19:58:18

by Marcelo Tosatti

[permalink] [raw]
Subject: [RFC] memory defragmentation to satisfy high order allocations

Hi fellows,

So I've been playing with memory defragmentation for the last couple
of weeks.

The following patch implements a "coalesce_memory()" function
which takes "zone" and "order" as a parameter.

It tries to move enough physically nearby pages to form a free area
of "order" size.

It does that by checking whether the page can be moved, allocating a new page,
unmapping the pte's to it, copying data to new page, remapping the ptes,
and reinserting the page on the radix/LRU.

It's very incomplete yet - for one, concurrent radix tree lookups on SMP will screw up
file page unmapping (the swapcache lookup should be safe), and there are lots of other bugs inside.
For example, it doesn't re-establish pte's once it has unmapped them.

I'm working on those.

But it works fine on UP (for a few minutes :)), and easily creates large
physically contiguous areas of memory.

With such a thing in place we can build a mechanism for kswapd
(or a separate kernel thread, if needed) to notice when we are low on
high order pages, and use the coalescing algorithm instead of blindly
freeing individual pages from the LRU in the hope of building large physically
contiguous memory areas.
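
As a rough sketch of that mechanism (zone_high_order_low() and
reclaim_some_pages() are made-up names standing in for "this zone is short on
2^order blocks" and for the normal LRU reclaim path; only coalesce_memory()
is from the patch below):

	static void balance_high_order(struct zone *zone, unsigned int order)
	{
		if (!zone_high_order_low(zone, order))
			return;

		/* first try to build a free 2^order block by moving pages around */
		if (coalesce_memory(order, zone))
			return;

		/* only fall back to plain freeing if coalescing failed */
		reclaim_some_pages(zone);
	}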

Comments appreciated.

Lots of this has been copied from rmap.c/etc.

Yes, the code needs to be cleaned up.

--- page_alloc.c.orig 2004-09-19 16:53:52.000000000 -0300
+++ page_alloc.c 2004-10-01 16:26:21.602387344 -0300
@@ -33,6 +33,8 @@
#include <linux/cpu.h>
#include <linux/cpuset.h>
#include <linux/nodemask.h>
+#include <linux/rmap.h>
+#include <linux/mm_inline.h>

#include <asm/tlbflush.h>

@@ -97,7 +99,471 @@
page->mapping = NULL;
}

-#ifndef CONFIG_HUGETLB_PAGE
+#define REMAP_FAIL 0
+#define REMAP_SUCCESS 1
+
+
+void page_remove_rmap(struct page *page);
+void page_add_anon_rmap(struct page *page,
+ struct vm_area_struct *vma, unsigned long address);
+struct anon_vma *page_lock_anon_vma(struct page *page);
+inline unsigned long avma_address(struct page *page, struct vm_area_struct *vma);
+
+inline unsigned long
+avma_address(struct page *page, struct vm_area_struct *vma)
+{
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ unsigned long address;
+
+ address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+
+ if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+ /* page should be within any vma from prio_tree_next */
+ printk(KERN_ERR "address: %x pgoff:%x vma->start:%x vma->end:%x\n",
+ address, pgoff,vma->vm_start, vma->vm_end );
+ BUG_ON(!PageAnon(page));
+ return -EFAULT;
+ }
+ return address;
+}
+
+
+
+int try_to_remap_file(struct page *page, struct page *newpage, struct vm_area_struct *vma)
+{
+ unsigned long address;
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pte_t *pte;
+ pte_t pteval;
+ int ret;
+
+ printk(KERN_ERR "try_to_remap_file!\n");
+
+ if (!mm->rss)
+ return REMAP_FAIL;
+
+ address = avma_address(page, vma);
+
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out_unlock;
+
+ pmd = pmd_offset(pgd, address);
+ if (!pmd_present(*pmd))
+ goto out_unlock;
+
+
+ pte = pte_offset_map(pmd, address);
+ if (!pte_present(*pte))
+ goto out_unlock;
+
+ if (page_to_pfn(page) != pte_pfn(*pte))
+ goto out_unlock;
+
+ if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)))
+ goto out_unlock;
+
+ /* Nuke the pte */
+
+ flush_cache_page(vma, address);
+
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ page_remove_rmap(page);
+
+ /* transfer the dirty bit to the new page */
+ if (pte_dirty(pteval))
+ set_page_dirty(newpage);
+
+ pteval = mk_pte(newpage, vma->vm_page_prot);
+
+ set_pte(pte, pteval);
+
+ page_add_file_rmap(newpage);
+
+ return REMAP_SUCCESS;
+
+out_unlock:
+ return REMAP_FAIL;
+}
+
+
+
+
+int try_to_remap_anon(struct page *page, struct page *newpage, struct vm_area_struct *vma)
+{
+ unsigned long address;
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pte_t *pte;
+ pte_t pteval;
+ int ret;
+
+
+ if (!vma)
+ printk(KERN_ERR "!vma\n");
+
+ spin_lock(&mm->page_table_lock);
+
+ address = avma_address(page, vma);
+
+ if (address == -EFAULT)
+ return REMAP_FAIL;
+
+ if (!mm)
+ return REMAP_FAIL;
+
+ if (!mm->rss)
+ return REMAP_FAIL;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out_unlock;
+
+ pmd = pmd_offset(pgd, address);
+ if (!pmd_present(*pmd))
+ goto out_unlock;
+
+ pte = pte_offset_map(pmd, address);
+ if (!pte_present(*pte))
+ goto out_unlock;
+
+ if (page_to_pfn(page) != pte_pfn(*pte))
+ goto out_unlock;
+
+ if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)))
+ ret = REMAP_FAIL;
+
+ /* Nuke the pte */
+
+ flush_cache_page(vma, address);
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ page_remove_rmap(page);
+
+ /* transfer the dirty bit to the new page */
+ if (pte_dirty(pteval))
+ set_page_dirty(newpage);
+
+ pteval = mk_pte(newpage, vma->vm_page_prot);
+
+ set_pte(pte, pteval);
+
+ page_add_anon_rmap(newpage, vma, address);
+
+ spin_unlock(&mm->page_table_lock);
+
+ return REMAP_SUCCESS;
+
+out_unlock:
+ spin_unlock(&mm->page_table_lock);
+ return REMAP_FAIL;
+
+}
+
+/* Move LRU pages to other locations, undo the remapping operation
+* if any of the mapped pte's fails to be remapped.
+*
+*/
+
+int can_move_page(struct page *page)
+{
+ int ret;
+ int ptes_unmapped = 0;
+ struct page *newpage;
+
+ if (PageLocked(page))
+ return 0;
+
+ if (PageReserved(page))
+ return 0;
+
+ if (PageWriteback(page))
+ return 0;
+
+ if (page_count(page) == 0)
+ return 1;
+
+ if (PageLRU(page)) {
+ if (PageAnon(page) && page_count(page) == 1 + PageSwapCache(page)) {
+ struct anon_vma *anon_vma;
+ struct vm_area_struct *vma;
+ unsigned long anon_mapping = (unsigned long) page->mapping;
+ unsigned long savedindex;
+ int error;
+
+ newpage = alloc_pages(GFP_HIGHUSER, 0);
+
+ if (PageSwapCache(page) &&
+ page_count(page) != page_mapcount(page) + 1) {
+ free_page(newpage);
+ goto out;
+ }
+
+ if (!PageAnon(page) || anon_mapping != page->mapping) {
+ free_page (newpage);
+ goto out;
+ }
+
+ page_cache_get(page);
+
+ if (TestSetPageLocked(page)) {
+ free_page(newpage);
+ page_cache_release(page);
+ goto out;
+ }
+
+ if (PageSwapCache(page)) {
+ write_lock_irq(&swapper_space.tree_lock);
+ /* recheck under swapper address space tree lock */
+ if (!PageSwapCache(page) || page_count(page) != 3) {
+ write_unlock_irq(&swapper_space.tree_lock);
+ free_page(newpage);
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ radix_tree_delete(&swapper_space.page_tree, page->private);
+ savedindex = page->private;
+ }
+
+ anon_vma = page_lock_anon_vma(page);
+
+ if (!anon_vma) {
+ free_page(newpage);
+ page_cache_release(page);
+ goto out;
+ }
+
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ ret = try_to_remap_anon(page, newpage, vma);
+ if (ret == REMAP_FAIL) {
+ if (PageSwapCache(page))
+ write_unlock_irq(&swapper_space
+ .tree_lock);
+ spin_unlock(&anon_vma->lock);
+ free_page(newpage);
+ unlock_page(page);
+ page_cache_release(page);
+ goto redo_unmaps;
+ }
+ ptes_unmapped++;
+ }
+
+ copy_highpage(newpage, page);
+
+ unlock_page(page);
+
+ page_cache_release(page);
+ page_cache_release(page);
+
+ newpage->private = savedindex;
+
+ if (PageSwapCache(page)) {
+ error = radix_tree_insert(&swapper_space.page_tree,
+ savedindex, newpage);
+
+ //if (error)
+ }
+
+
+ spin_unlock(&anon_vma->lock);
+ write_unlock_irq(&swapper_space.tree_lock);
+
+ return 1;
+
+ } else if (!PageAnon(page) &&
+ page_count(page) == 1) {
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ struct zone *zone = page_zone(page);
+ struct address_space *mapping = page->mapping;
+ struct page *testpage;
+ int mapped;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT -
+ PAGE_SHIFT);
+ pgoff_t savedindex = page->index;
+
+ if (!mapping)
+ goto out;
+
+ if (!list_empty(&mapping->i_mmap_nonlinear)) {
+ spin_unlock(&mapping->i_mmap_lock);
+ goto out;
+ }
+
+ if (PagePrivate(page))
+ printk(KERN_ERR "PagePrivate!\n");
+ if (PageWriteback(page)) {
+ printk(KERN_ERR "PageWriteback! quitting\n");
+ goto out;
+ }
+
+ newpage = alloc_pages(GFP_HIGHUSER, 0);
+
+ if (page_count(page) != 1 ||
+ !PageLRU(page) || PageAnon(page) ||
+ page->mapping != mapping ||
+ page->index != savedindex) {
+ free_page(newpage);
+ goto out;
+ }
+
+ page_cache_get(page);
+
+ if (TestSetPageLocked(page)) {
+ page_cache_release(page);
+ printk(KERN_ERR "page locked!!!\n");
+ goto out;
+ }
+
+ // remove radix entry and block page faults for SMP systems
+
+ spin_lock(&mapping->i_mmap_lock);
+
+ mapped = page_mapcount(page);
+
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap,
+ pgoff, pgoff)
+ {
+ ret = try_to_remap_file(page, newpage, vma);
+ if (ret == REMAP_FAIL) {
+ unlock_page(page);
+ goto redo_unmaps;
+ }
+ ptes_unmapped++;
+ mapped--;
+ if (!mapped)
+ break;
+
+ }
+
+ if (TestClearPageLRU(page))
+ del_page_from_lru(zone, page);
+
+ remove_from_page_cache(page);
+
+ copy_highpage(newpage, page);
+
+ newpage->flags = page->flags;
+
+ unlock_page(page);
+
+ add_to_page_cache_lru(newpage, mapping, savedindex,
+ GFP_KERNEL);
+
+ page_cache_release(page);
+ page_cache_release(page);
+
+ unlock_page(newpage);
+
+ spin_unlock(&mapping->i_mmap_lock);
+ return 1;
+ }
+
+ }
+
+
+out:
+ preempt_enable();
+ return 0;
+
+redo_unmaps:
+ free_page(newpage);
+ printk(KERN_ERR "unmap PTE failed!@#$^5! ptes_unmapped:%d\n", ptes_unmapped);
+ return 0;
+}
+
+#define MAX_ORDER_DEC 3 /* maximum order decrease */
+
+int coalesce_memory(unsigned int order, struct zone *zone)
+{
+ unsigned int torder;
+ unsigned int nr_freed_pages = 0, nr_pages = 0;
+
+ if (order < 1) {
+ printk(KERN_ERR "order <= 2");
+ return -1;
+ }
+
+ preempt_disable();
+
+ for (torder = order - 1; torder > order - MAX_ORDER_DEC; torder--) {
+ struct list_head *entry;
+ struct page *pwalk, *page;
+ int walkcount = 0;
+ struct free_area *area = zone->free_area + torder;
+ nr_pages = (1UL << order) - (1UL << torder);
+
+ entry = area->free_list.next;
+
+ while (entry != &area->free_list) {
+ int ret;
+ page = list_entry(entry, struct page, lru);
+ entry = entry->next;
+
+ pwalk = page;
+
+ /* Look backwards */
+
+ for (walkcount = 1; walkcount<nr_pages; walkcount++) {
+ pwalk = page-walkcount;
+
+ ret = can_move_page(pwalk);
+ if (ret)
+ nr_freed_pages++;
+ else
+ goto forward;
+
+ if (nr_freed_pages == nr_pages)
+ goto success;
+
+ }
+
+forward:
+
+ pwalk = page;
+
+ /* Look forward, skipping the page frames from this
+ high order page we are looking at */
+
+ for (walkcount = (1UL << torder); walkcount<nr_pages;
+ walkcount++) {
+ pwalk = page+walkcount;
+
+ ret = can_move_page(pwalk);
+
+ if (ret)
+ nr_freed_pages++;
+ else
+ goto loopey;
+
+ if (nr_freed_pages == nr_pages)
+ goto success;
+ }
+
+loopey:
+
+// goto bailout;
+ }
+ }
+
+bailout:
+ preempt_enable();
+ printk(KERN_ERR "failure nr_pages:%d nr_freed_pages:%d!\n", nr_pages,
+nr_freed_pages);
+return 0;
+
+success:
+printk(KERN_ERR "SUCCESS coalesced %d pages!\n", nr_freed_pages);
+return 1;
+
+}
+
+#ifndef CONFIG_HUGETLB_PAGE
#define prep_compound_page(page, order) do { } while (0)
#define destroy_compound_page(page, order) do { } while (0)
#else


2004-10-01 20:15:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Marcelo Tosatti <[email protected]> wrote:
>
> The following patch implements a "coalesce_memory()" function
> which takes "zone" and "order" as a parameter.
>
> It tries to move enough physically nearby pages to form a free area
> of "order" size.
>
> It does that by checking whether the page can be moved, allocating a new page,
> unmapping the pte's to it, copying data to new page, remapping the ptes,
> and reinserting the page on the radix/LRU.

Presumably this duplicates some of the memory hot-remove patches.

Apparently Dave Hansen has working and sane-looking hot remove code
which is in a close-to-submittable state.

2004-10-01 20:32:28

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Fri, Oct 01, 2004 at 01:11:47PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <[email protected]> wrote:
> >
> > The following patch implements a "coalesce_memory()" function
> > which takes "zone" and "order" as a parameter.
> >
> > It tries to move enough physically nearby pages to form a free area
> > of "order" size.
> >
> > It does that by checking whether the page can be moved, allocating a new page,
> > unmapping the pte's to it, copying data to new page, remapping the ptes,
> > and reinserting the page on the radix/LRU.
>
> Presumably this duplicates some of the memory hot-remove patches.

As far as I have researched, the memory moving/remapping code
in the hot remove patches doesn't work correctly. Please correct me if I'm wrong.

And what I've seen (from the Fujitsu guys) was quite ugly IMHO.

> Apparently Dave Hansen has working and sane-looking hot remove code
> which is in a close-to-submittable state.

Dave?

2004-10-01 21:53:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Marcelo Tosatti <[email protected]> wrote:
>
> As far as I have researched, the memory moving/remapping code
> on the hot remove patches dont work correctly. Please correct me.
>
> And what I've seen (from the Fujitsu guys) was quite ugly IMHO.

That's a totally different patch.

2004-10-01 22:06:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Fri, 2004-10-01 at 12:04, Marcelo Tosatti wrote:
> On Fri, Oct 01, 2004 at 01:11:47PM -0700, Andrew Morton wrote:
> > Presumably this duplicates some of the memory hot-remove patches.
>
> As far as I have researched, the memory moving/remapping code
> on the hot remove patches dont work correctly. Please correct me.

I definitely see some commonality, but Marcelo's approach has handling
for the different kinds of pages broken out much more nicely. Can't
tell yet if this produces extra code, or is just plain better.

We worked pretty hard to try and copy as little code as possible. Was
there any reason that there was so much stuff copied out of rmap.c?
Just for proof-of-concept?

Here's one of the recent patch sets that we're working on:

http://sprucegoose.sr71.net/patches/2.6.9-rc2-mm4-mhp-test2/

In that directory, the K* patches hijack some of the swap code (but
require memory pressure to work last time I tried), and the p000*
patches (by Hirokazu Takahashi) actively migrate pages around. Both
approaches work, but the K* one is smaller and less intrusive, while the
p000* one is much more complete. They may end up being able to coexist
in the end.

> And what I've seen (from the Fujitsu guys) was quite ugly IMHO.

I don't work for Fujitsu :) Please take a look at the patches in the
above directory and see what you think. I'm sure you have some very
good stuff in your patch, but I need to take a closer look.

I'm just about to head out of town for the weekend, but I'll take a much
more detailed look on Monday.

--
Dave Hansen
[email protected]

2004-10-02 01:08:32

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Fri, Oct 01, 2004 at 02:57:03PM -0700, Dave Hansen wrote:
> On Fri, 2004-10-01 at 12:04, Marcelo Tosatti wrote:
> > On Fri, Oct 01, 2004 at 01:11:47PM -0700, Andrew Morton wrote:
> > > Presumably this duplicates some of the memory hot-remove patches.
> >
> > As far as I have researched, the memory moving/remapping code
> > on the hot remove patches dont work correctly. Please correct me.
>
> I definitely see some commonality, but Marcelo's approach has handling
> for the different kinds of pages broken out much more nicely. Can't
> tell yet if this produces extra code, or is just plain better.
>
> We worked pretty hard to try and copy as little code as possible. Was
> there any reason that there was so much stuff copied out of rmap.c?
> Just for proof-of-concept?

Just proof of concept really, to have an equivalent of "try_to_unmap()" -
which you call from the migrate page code.

Just that "try_to_remap_{file,anon}" do the pte clearing + remapping in
one function.

> Here's one of the recent patch sets that we're working on:
>
> http://sprucegoose.sr71.net/patches/2.6.9-rc2-mm4-mhp-test2/
>
> In that directory, the K* patches hijack some of the swap code (but
> require memory pressure to work last time I tried), and the p000*
> patches (by Hirokazu Takahashi) actively migrate pages around. Both
> approaches work, but the K* one is smaller and less intrusive, while the
> p000* one is much more complete. They may end up being able to coexist
> in the end.

The page migration code (p000*) looks nice - quite complete indeed (nice error
handling, etc) but somewhat specific to the migration procedure, which is more
critical (it cannot fail as easily) than the remapping for high-order allocations.

For example, this in migrate_page_common:


+ switch (ret) {
+ case 0:
+ case -ENOENT:
+ copy_highpage(newpage, page);
+ return ret;
+ case -EBUSY:
+ return ret;
+ case -EAGAIN:
+ writeback_and_free_buffers(page);
+ unlock_page(page);
+ msleep(10);
+ timeout -= 10;
+ lock_page(page);
+ continue;

This retries indefinitely to migrate the page.

For the "defragmentation" operation we want to do an "easy" try - ie if we
can't remap giveup.

I feel we should try to "untie" the code which checks for remapping availability /
does the remapping from the page migration - so to be able to share the most
code between it and other users of the same functionality.

Curiosity: How did you guys test the migration operation? Several threads on
several processors operating on the memory, etc?

> I don't work for Fujitsu :) Please take a look at the patches in the
> above directory and see what you think. I'm sure you have some very
> good stuff in your patch, but I need to take a closer look.
>
> I'm just about to head out of town for the weekend, but I'll take a much
> more detailed look on Monday.

Cool. I'll take a closer look at the relevant parts of memory hotplug patches
this weekend, hopefully. See if I can help with testing of these patches too.

Andrew, what are your thoughts wrt merging this to mainline?

2004-10-02 01:20:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Marcelo Tosatti <[email protected]> wrote:
>
> > Here's one of the recent patch sets that we're working on:
> >
> > http://sprucegoose.sr71.net/patches/2.6.9-rc2-mm4-mhp-test2/
> >
> ...
> Andrew, what are your thoughts wrt merging this to mainline?

It's the first I've seen of it. I guess I'd be looking for testing results
as well as the outcome of discussions/review with the ia64 guys whose
hardware is not quite as cooperative as that on the ppc64 machines.

2004-10-02 02:30:14

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations



Marcelo Tosatti wrote:

>
>With such a thing in place we can build a mechanism for kswapd
>(or a separate kernel thread, if needed) to notice when we are low on
>high order pages, and use the coalescing algorithm instead blindly
>freeing unique pages from LRU in the hope to build large physically
>contiguous memory areas.
>
>Comments appreciated.
>
>

Hi Marcelo,
Seems like a good idea... even with regular dumb kswapd "merging",
you may easily get stuck for example on systems without swap...

Anyway, I'd like to get those beat kswapd patches in first. Then
your mechanism just becomes something like:

if order-0 pages are low {
	try to free memory
}
else if order-1 or higher pages are low {
	try to coalesce_memory
	if that fails, try to free memory
}

2004-10-02 02:41:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations



Marcelo Tosatti wrote:

>
>For example it doesnt re establishes pte's once it has unmapped them.
>
>

Another thing - I don't know if I'd bother re-establishing ptes....
I'd say just leave it to happen lazily at fault time.

2004-10-02 03:50:29

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Hello,

> >For example it doesnt re establishes pte's once it has unmapped them.
> >
> >
>
> Another thing - I don't know if I'd bother re-establishing ptes....
> I'd say just leave it to happen lazily at fault time.

I think the reason is that his current implementation doesn't assign
a swap entry to an anonymous page to move.


Thank you,
Hirokazu Takahashi.

2004-10-02 04:35:37

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Sat, Oct 02, 2004 at 12:30:01PM +1000, Nick Piggin wrote:
>
>
> Marcelo Tosatti wrote:
>
> >
> >With such a thing in place we can build a mechanism for kswapd
> >(or a separate kernel thread, if needed) to notice when we are low on
> >high order pages, and use the coalescing algorithm instead blindly
> >freeing unique pages from LRU in the hope to build large physically
> >contiguous memory areas.
> >
> >Comments appreciated.
> >
> >
>
> Hi Marcelo,
> Seems like a good idea... even with regular dumb kswapd "merging",
> you may easily get stuck for example on systems without swap...
>
> Anyway, I'd like to get those beat kswapd patches in first. Then
> your mechanism just becomes something like:
>
> if order-0 pages are low {
> try to free memory
> }
> else if order-1 or higher pages are low {
> try to coalesce_memory
> if that fails, try to free memory
> }

Hi Nick!

I understand that kswapd is broken, and that it needs to go into the page reclaim path
to free pages when we are out of high order pages (which is what your
"beat kswapd" patches do, fixing high-order failures by doing so), but
Linus's argument against it seems to be that it potentially frees too many pages,
causing harm to the system. He also says this has been tried in the past,
with not nice results.

And that is why it has not been merged into mainline.

Is my interpretation correct?

But right, kswapd needs to get fixed to honour high order
pages.

2004-10-02 09:30:10

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Hello, Marcelo.

Generic memory defragmentation will be very nice for me to implement
hugetlbpage migration, as allocating a new hugetlbpage is a hard job.

> For the "defragmentation" operation we want to do an "easy" try - ie if we
> can't remap giveup.
>
> I feel we should try to "untie" the code which checks for remapping availability /
> does the remapping from the page migration - so to be able to share the most
> code between it and other users of the same functionality.

I think it's possible to introduce non-wait mode to the migration code,
as you may expect. Shall I implement it?

> Curiosity: How did you guys test the migration operation? Several threads on
> several processors operating on the memory, etc?

I always test it with the zone hotplug emulation patch, which Mr.Iwamoto
has made. I usually run following jobs concurrently while zones are added
and removed repeatedly on a SMP machine.
- making linux kernel
- copying file trees.
- overwriting file trees.
- removing file trees
- some pages are swapped out automatically:)

And Mr.Iwamoto has some small programs to check any kind of page
can be migrated. The programs repeat one of following actions:
- read/write files .
- use MAP_SHARED and MAP_PRIVATE mmap()'s and read/write there.
- use Direct I/O.
- use AIO.
- fork to have COW pages.
- use shmem.
- use sendfile.

> Cool. I'll take a closer look at the relevant parts of memory hotplug patches
> this weekend, hopefully. See if I can help with testing of these patches too.

Any comments are very welcome.

Thank you,
Hirokazu Takahashi.






2004-10-02 17:34:43

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Sat, Oct 02, 2004 at 12:41:14PM +1000, Nick Piggin wrote:
>
>
> Marcelo Tosatti wrote:
>
> >
> >For example it doesnt re establishes pte's once it has unmapped them.
> >
> >
>
> Another thing - I don't know if I'd bother re-establishing ptes....
> I'd say just leave it to happen lazily at fault time.

Indeed it should work lazily.


2004-10-02 20:02:20

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Sat, Oct 02, 2004 at 06:30:15PM +0900, Hirokazu Takahashi wrote:
> Hello, Marcelo.
>
> Generic memory defragmentation will be very nice for me to implement
> hugetlbpage migration, as allocating a new hugetlbpage is a hard job.
>
> > For the "defragmentation" operation we want to do an "easy" try - ie if we
> > can't remap giveup.
> >
> > I feel we should try to "untie" the code which checks for remapping availability /
> > does the remapping from the page migration - so to be able to share the most
> > code between it and other users of the same functionality.
>
> I think it's possible to introduce non-wait mode to the migration code,
> as you may expect. Shall I implement it?
>
> > Curiosity: How did you guys test the migration operation? Several threads on
> > several processors operating on the memory, etc?
>
> I always test it with the zone hotplug emulation patch, which Mr.Iwamoto
> has made. I usually run following jobs concurrently while zones are added
> and removed repeatedly on a SMP machine.
> - making linux kernel
> - copying file trees.
> - overwriting file trees.
> - removing file trees
> - some pages are swapped out automatically:)
>
> And Mr.Iwamoto has some small programs to check any kind of page
> can be migrated. The programs repeat one of following actions:
> - read/write files .
> - use MAP_SHARED and MAP_PRIVATE mmap()'s and read/write there.
> - use Direct I/O.
> - use AIO.
> - fork to have COW pages.
> - use shmem.
> - use sendfile.
>
> > Cool. I'll take a closer look at the relevant parts of memory hotplug patches
> > this weekend, hopefully. See if I can help with testing of these patches too.
>
> Any comments are very welcome.


I have a few comments about the code:

1)
I'm pretty sure you should transfer the radix tree tag at radix_tree_replace().
If for example you transfer a dirty tagged page to another zone, an mpage_writepages()
will miss it (because it uses pagevec_lookup_tag(PAGECACHE_DIRTY_TAG)).

Should be quite trivial to do (save tags before deleting and set to new entry,
all in radix_tree_replace).

My implementation also contained the same bug.
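
To make that concrete, here is a minimal sketch of what I mean (the helper
name is made up, the mapping's tree_lock is assumed to be held for writing,
and the page flags are used as a stand-in for saving the tags):

	static void radix_tree_replace_page(struct address_space *mapping,
					    struct page *old, struct page *new)
	{
		unsigned long index = old->index;
		int dirty = PageDirty(old);
		int writeback = PageWriteback(old);

		radix_tree_delete(&mapping->page_tree, index);
		radix_tree_insert(&mapping->page_tree, index, new);

		/* re-tag the new slot so pagevec_lookup_tag() still finds it */
		if (dirty)
			radix_tree_tag_set(&mapping->page_tree, index,
					   PAGECACHE_TAG_DIRTY);
		if (writeback)
			radix_tree_tag_set(&mapping->page_tree, index,
					   PAGECACHE_TAG_WRITEBACK);
	}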

2)
At migrate_onepage you add anonymous pages which aren't swap allocated
to the swap cache
+ /*
+ * Put the page in a radix tree if it isn't in the tree yet.
+ */
+#ifdef CONFIG_SWAP
+ if (PageAnon(page) && !PageSwapCache(page))
+ if (!add_to_swap(page, GFP_KERNEL)) {
+ unlock_page(page);
+ return ERR_PTR(-ENOSPC);
+ }
+#endif /* CONFIG_SWAP */

Why's that? You can copy anonymous pages without adding them to swap (thats
what the patch I posted does).

3) At migrate_page_common you assume additional page references
(page_migratable returning -EAGAIN) means the code should try to writeout
the page.

Is that assumption always valid?

In theory there is no need to writeout pages when migrating them to
other zones - they will be copied and the dirty information retained (either
in the PageDirty bit or radix tree tag).

I just noticed you do that on further patches (migrate_page_buffer), but AFAICS
the writeout remains. Why arent you using migrate_page_buffer yet?

I think the final aim should be to remove the need for "pageout()"
completely.

4)
About implementing a nonblocking version of it. The easier way, it
seems to me, is to pass a "block" argument to generic_migrate_page() and
use that.
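
For point 4, I mean something along these lines (generic_migrate_page() is
from your patch set and its real signature is probably different; this only
shows where a nonblocking caller would bail out instead of sleeping):

	int generic_migrate_page(struct page *page, struct page *newpage, int block)
	{
		/* a nonblocking caller just gives up on this page */
		if (block)
			lock_page(page);
		else if (TestSetPageLocked(page))
			return -EAGAIN;

		/* ... unmap the ptes, copy, reinsert into the radix tree as today ... */

		unlock_page(page);
		return 0;
	}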

Questions: are there any documents on the memory hotplug userspace tools?
Where can I find them?

Are Iwamoto's test programs available?

In general the code looks nice to me! I'll jump in and help with
testing.

2004-10-03 04:13:22

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Hi,

> > > Cool. I'll take a closer look at the relevant parts of memory hotplug patches
> > > this weekend, hopefully. See if I can help with testing of these patches too.
> >
> > Any comments are very welcome.
>
>
> I have a few comments about the code:
>
> 1)
> I'm pretty sure you should transfer the radix tree tag at radix_tree_replace().
> If for example you transfer a dirty tagged page to another zone, an mpage_writepages()
> will miss it (because it uses pagevec_lookup_tag(PAGECACHE_DIRTY_TAG)).
>
> Should be quite trivial to do (save tags before deleting and set to new entry,
> all in radix_tree_replace).
>
> My implementation also contained the same bug.

Yes, it's one of the remaining issues to address. The tag should be transferred in
radix_tree_replace() as you pointed out. The current implementation
sets the tag in set_page_dirty(newpage).

> 2)
> At migrate_onepage you add anonymous pages which aren't swap allocated
> to the swap cache
> + /*
> + * Put the page in a radix tree if it isn't in the tree yet.
> + */
> +#ifdef CONFIG_SWAP
> + if (PageAnon(page) && !PageSwapCache(page))
> + if (!add_to_swap(page, GFP_KERNEL)) {
> + unlock_page(page);
> + return ERR_PTR(-ENOSPC);
> + }
> +#endif /* CONFIG_SWAP */
>
> Why's that? You can copy anonymous pages without adding them to swap (thats
> what the patch I posted does).

The reason is to guarantee that any anonymous page can be migrated anytime.
I want to block newly occurred accesses to the page during the migration
because it can't be migrated if there remain some references on it by
system calls, direct I/O and page faults.

Your approach will work fine on most of anonymous pages, which aren't
heavily accessed. I think it will be enough for memory defragmentation.

> 3) At migrate_page_common you assume additional page references
> (page_migratable returning -EAGAIN) means the code should try to writeout
> the page.
>
> Is that assumption always valid?

-EAGAIN means that the page may need to be written back, or that we should
just wait for a while since the page is only being referenced by a system call
or the page fault handler.

> In theory there is no need to writeout pages when migrating them to
> other zones - they will be copied and the dirty information retained (either
> in the PageDirty bit or radix tree tag).
>
> I just noticed you do that on further patches (migrate_page_buffer), but AFAICS
> the writeout remains. Why arent you using migrate_page_buffer yet?

I've designed migrate_page_buffer() for this purpose.
At the moment only ext2 uses it.

> I think the final aim should be to remove the need for "pageout()"
> completly.

Yes!

> 4)
> About implementing a nonblocking version of it. The easier way, it
> seems to me, is to pass a "block" argument to generic_migrate_page() and
> use that.

Yes.

> Questions: are there any documents on the memory hotplug userspace tools?
> Where can I find them?

The IBM guys and the Fujitsu guys are designing user interfaces independently.
The IBM team is implementing memory section hotplug while the Fujitsu team is
trying to implement NUMA node hotplug. But both of the designs use the
regular hot-plug mechanism, which kicks the /sbin/hotplug script to control
devices via sysfs.

Dave, would you explain about it?

> Are Iwamoto's test programs available?

Ok, I'll ask him to post them.

> In general the code looks nice to me! I'll jump in and help with
> testing.

I appreciate your offer. I'm very happy with that.

Thank you,
Hirokazu Takahashi.



2004-10-03 15:36:35

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Sun, Oct 03, 2004 at 01:13:38PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> > > > Cool. I'll take a closer look at the relevant parts of memory hotplug patches
> > > > this weekend, hopefully. See if I can help with testing of these patches too.
> > >
> > > Any comments are very welcome.
> >
> >
> > I have a few comments about the code:
> >
> > 1)
> > I'm pretty sure you should transfer the radix tree tag at radix_tree_replace().
> > If for example you transfer a dirty tagged page to another zone, an mpage_writepages()
> > will miss it (because it uses pagevec_lookup_tag(PAGECACHE_DIRTY_TAG)).
> >
> > Should be quite trivial to do (save tags before deleting and set to new entry,
> > all in radix_tree_replace).
> >
> > My implementation also contained the same bug.
>
> Yes, it's one of the issues to do. The tag should be transferred in
> radix_tree_replace() as you pointed out. The current implementation
> sets the tag in set_page_dirty(newpage).

Oh I missed that, right.

But yes, anyway, the tag should be transferred at radix_tree_replace (earlier)
or pagevec_lookup_tag() can miss those pages.

> > 2)
> > At migrate_onepage you add anonymous pages which aren't swap allocated
> > to the swap cache
> > + /*
> > + * Put the page in a radix tree if it isn't in the tree yet.
> > + */
> > +#ifdef CONFIG_SWAP
> > + if (PageAnon(page) && !PageSwapCache(page))
> > + if (!add_to_swap(page, GFP_KERNEL)) {
> > + unlock_page(page);
> > + return ERR_PTR(-ENOSPC);
> > + }
> > +#endif /* CONFIG_SWAP */
> >
> > Why's that? You can copy anonymous pages without adding them to swap (thats
> > what the patch I posted does).
>
> The reason is to guarantee that any anonymous page can be migrated anytime.
> I want to block newly occurred accesses to the page during the migration
> because it can't be migrated if there remain some references on it by
> system calls, direct I/O and page faults.

It would be nice if we could block pte faults in a way such that we don't need
to add each anonymous page to swap. It can be too costly if you have a lot of memory,
and it makes the whole operation dependent on swap size (if you don't have enough
swap, you're dead).

Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, but since
migration is not a common operation anyway), or create a semaphore?

> Your approach will work fine on most of anonymous pages, which aren't
> heavily accessed. I think it will be enough for memory defragmentation.

Yes...

> > 3) At migrate_page_common you assume additional page references
> > (page_migratable returning -EAGAIN) means the code should try to writeout
> > the page.
> >
> > Is that assumption always valid?
>
> -EAGAIN means that the page may require to be written back

But why is it necessary to write out pages? We shouldn't need to. At least
from what I can understand.


> or
> just to wait for a while since the page is just referred by system call
> or pagefault handler.

I'm not sure if making that assumption is always valid.

Kernel code can have an additional count on the page meaning "this page is pinned,
dont move it". At least that should be valid.

Any piece of code which holds a reference on a page for a long
time is going to be a pain for the algorithm right?

> > In theory there is no need to writeout pages when migrating them to
> > other zones - they will be copied and the dirty information retained (either
> > in the PageDirty bit or radix tree tag).
> >
> > I just noticed you do that on further patches (migrate_page_buffer), but AFAICS
> > the writeout remains. Why arent you using migrate_page_buffer yet?
>
> I've designed migrate_page_buffer() for this purpose.
> At this moment ext2 only uses this yet.

Ah ok I haven't looked at those patches.

> > I think the final aim should be to remove the need for "pageout()"
> > completly.
>
> Yes!
>
> > 4)
> > About implementing a nonblocking version of it. The easier way, it
> > seems to me, is to pass a "block" argument to generic_migrate_page() and
> > use that.
>
> Yes.

OK. I'll try to implement it this week (plus the radix_tree_replace
tag thingie).

> > Questions: are there any documents on the memory hotplug userspace tools?
> > Where can I find them?
>
> IBM guys and Fujitsu guys are designing user interface independently.
> IBM team is implementing memory section hotplug while Fujitsu team
> try to implement NUMA node hotplug. But both of the designs use
> regular hot-plug mechanism, which kicks /sbin/hotplug script to control
> devices via sysfs.
>
> Dave, would you explain about it?

Please :)

> > Are Iwamoto's test programs available?
>
> Ok, I'll notice him to post them.
>
> > In general the code looks nice to me! I'll jump in and help with
> > testing.
>
> I appreciate your offer. I'm very happy with that.

Me too! :)

2004-10-03 18:40:40

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Hi, Marcelo

> > > 2)
> > > At migrate_onepage you add anonymous pages which aren't swap allocated
> > > to the swap cache
> > > + /*
> > > + * Put the page in a radix tree if it isn't in the tree yet.
> > > + */
> > > +#ifdef CONFIG_SWAP
> > > + if (PageAnon(page) && !PageSwapCache(page))
> > > + if (!add_to_swap(page, GFP_KERNEL)) {
> > > + unlock_page(page);
> > > + return ERR_PTR(-ENOSPC);
> > > + }
> > > +#endif /* CONFIG_SWAP */
> > >
> > > Why's that? You can copy anonymous pages without adding them to swap (thats
> > > what the patch I posted does).
> >
> > The reason is to guarantee that any anonymous page can be migrated anytime.
> > I want to block newly occurred accesses to the page during the migration
> > because it can't be migrated if there remain some references on it by
> > system calls, direct I/O and page faults.
>
> It would be nice if we could block pte faults in a way such to not need
> adding each anonymous page to swap. It can be too costly if you have a lot memory
> and it makes the whole operation dependable on swap size (if you dont have enough
> swap, you're dead).
>
> Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, but since
> migration is not a common operation anyway), or create a semaphore?

I think the problem with the approach of holding mm->page_table_lock is
that it doesn't allow the migration code to block. The semaphore
approach would be better.

I have another idea: each anonymous page can detach its swap entry
after its migration. This can be done by remove_exclusive_swap_page()
if the page is forcibly remapped to the same addresses by
touch_unmapped_address(), which I made.

> > Your approach will work fine on most of anonymous pages, which aren't
> > heavily accessed. I think it will be enough for memory defragmentation.
>
> Yes...
>
> > > 3) At migrate_page_common you assume additional page references
> > > (page_migratable returning -EAGAIN) means the code should try to writeout
> > > the page.
> > >
> > > Is that assumption always valid?
> >
> > -EAGAIN means that the page may require to be written back
>
> But why is it needed to writeout pages? We shouldnt need to. At least
> from what I can understand.

The migration code allows each filesystem to implement its own
migration code, or to just use migrate_page_buffer() or
migrate_page_common().

migrate_page_common() is the default function if a filesystem doesn't
implement anything. The function is the most generic and it tries
to write back pages only if they are dirty and have buffers.

> > or
> > just to wait for a while since the page is just referred by system call
> > or pagefault handler.
>
> I'm not sure if making that assumption is always valid.
>
> Kernel code can have an additional count on the page meaning "this page is pinned,
> dont move it". At least that should be valid.

Yes, I know. I have checked all of the code.

AIO event buffers are pinned, therefore the memory-hotplug team plans
to make pages for the event buffers assigned to non-hotpluggable
memory regions.

And pages in sendfile() might be pinned for a while in case of network
problems. I think there may be some workarounds. The easiest way
is just waiting its timeout, and another way is changing the mode
of sendfile() to copy pages in advance.

Pages for NFS also might be pinned due to network problems.
One of the ideas is to restrict NFS to allocating pages from a
specific memory region, so that all memory except that region
can be hot-removed. And it's possible to implement a whole
migrate_page method, which may handle stuck pages.

If the migration code is used for memory defragmentation, pinned pages
must be avoided. I think it can be done with the non-blocking mode.

> Any piece of code which holds a reference on a page for a long
> time is going to be a pain for the algorithm right?
>

> > > 4)
> > > About implementing a nonblocking version of it. The easier way, it
> > > seems to me, is to pass a "block" argument to generic_migrate_page() and
> > > use that.
> >
> > Yes.
>
> OK. I'll try to implement it this week (plus the radix_tree_replace
> tag thingie).

Thank you for that.

Hirokazu Takahashi.

2004-10-03 19:22:29

by Trond Myklebust

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Sun, 03/10/2004 at 20:35, Hirokazu Takahashi wrote:

> Pages for NFS also might be pinned with network problems.
> One of the ideas is to restrict NFS to allocate pages from
> specific memory region, sot that all memory except the region
> can be hot-removed. And it's possible to implementing whole
> migrate_page method, which may handled stuck pages.

Why do you want to special-case this?

The above is a generic condition: any filesystem can suffer from the
equivalent problem of a failure or slow response in the underlying
device. Making an NFS-specific hack is just counter-productive to
solving the generic problem.

Cheers,
Trond

2004-10-03 20:03:08

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Hello,

> > Pages for NFS also might be pinned with network problems.
> > One of the ideas is to restrict NFS to allocate pages from
> > specific memory region, sot that all memory except the region
> > can be hot-removed. And it's possible to implementing whole
> > migrate_page method, which may handled stuck pages.
>
> Why do you want to special-case this?
>
> The above is a generic condition: any filesystem can suffer from the
> equivalent problem of a failure or slow response in the underlying
> device. Making an NFS-specific hack is just counter-productive to
> solving the generic problem.

However, while the network is down, network/cluster filesystems might never
release pages, unlike block devices, which may
time out or return an error in case of failure.

Each filesystem can control what the migration code does.
If it doesn't have anything to help memory migration, it's possible
to wait for the network to come up before starting memory migration,
or to give up if the network happens to be down. That's no problem.

Thank you,
Hirokazu Takahashi.

2004-10-03 20:44:48

by Trond Myklebust

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Sun, 03/10/2004 at 22:03, Hirokazu Takahashi wrote:

> However, while network is down network/cluster filesystems might not
> release pages forever unlike in the case of block devices, which may
> timeout or returns a error in case of failure.

Where is the difference? As far as the VM is concerned, it is a latency
problem. The fact of whether or not it is a permanent hang, a hang with
a long timeout, or just a slow device is irrelevant because the VM
doesn't actually know about these devices.

> Each filesystem can control what the migration code does.
> If it doesn't have anything to help memory migration, it's possible
> to wait for the network coming up before starting memory migration,
> or give up it if the network happen to be down. That's no problem.

Wrong. It *is* a problem: Filesystems aren't required to know anything
about the particulars of the underlying block/network/... device timeout
semantics either.

Think, for instance about EXT2. Where in the current code do you see
that it is required to detect that it is running on top of something
like the NBD device? Where does it figure out what the latencies of this
device is?

AFAICS, most filesystems in linux/fs/* have no knowledge whatsoever
about the underlying block/network/... devices and their timeout values.
Basing your decision about whether or not you need to manage high
latency situations just by inspecting the filesystem type is therefore
not going to give very reliable results.

Cheers,
Trond

2004-10-04 02:23:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Sat, 2004-10-02 at 21:13, Hirokazu Takahashi wrote:
> > Questions: are there any documents on the memory hotplug userspace tools?
> > Where can I find them?
>
> IBM guys and Fujitsu guys are designing user interface independently.
> IBM team is implementing memory section hotplug while Fujitsu team
> try to implement NUMA node hotplug. But both of the designs use
> regular hot-plug mechanism, which kicks /sbin/hotplug script to control
> devices via sysfs.
>
> Dave, would you explain about it?

First of all, we're still on the first set of these APIs. So, either
we're really, really smart (unlikely) or we have a few revisions and
rewrites to go before everybody is happy.

ls /sys/devices/system/memory/ gives you each memory area, with
arbitrary numbers like this:
memory0
memory1
memory2
memory8953

We haven't decided whether to make each of those represent a constant
sized area, or let them be variable. In any case, there will either be
a range inside of each or a global block size something like here:

/sys/devices/system/memory/block_size

Each memory device would have a directory like this:

# ls /sys/devices/system/memory/memory8953/
node -> ../../node/node4 (for the NUMA case)
state
phys_start_addr

To take a memory section offline, you

echo offline > /sys/devices/system/memory/memory8953/state

For now, that takes the section offline by allocating all of its pages
and migrating the rest. It also removes the sysfs node, triggering a
/sbin/hotplug event for the device removal. We might make this 2
different states in the future (offline and removal). This could also
potentially be triggered by hardware alone.

For now, you can also add memory, but it's hackish and will certainly
change:

echo 0x8000000 > /sys/devices/system/memory/probe

will add SECTION_SIZE amount of memory at 2GB. Yes, SECTION_SIZE is
hard-coded, but this is only for testing. We'll eventually take ranges
and maybe NUMA information into there somehow. Why can't the hardware
just do this? It's a long story :)

--
Dave Hansen
[email protected]

2004-10-04 02:33:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations


How about inserting this if statement?

-- Kame

Marcelo Tosatti wrote:
> +int coalesce_memory(unsigned int order, struct zone *zone)
> +{
<snip>

> + while (entry != &area->free_list) {
> + int ret;
> + page = list_entry(entry, struct page, lru);
> + entry = entry->next;
> +

+ if ((page_to_pfn(page) - zone->zone_start_pfn) & (1 << torder)) {

> + pwalk = page;
> +
> + /* Look backwards */
> +
> + for (walkcount = 1; walkcount<nr_pages; walkcount++) {
..................
> + }
> +
+ } else {
> +forward:
> +
> + pwalk = page;
> +
> + /* Look forward, skipping the page frames from this
> + high order page we are looking at */
> +
> + for (walkcount = (1UL << torder); walkcount<nr_pages;
> + walkcount++) {
> + pwalk = page+walkcount;
> +
> + ret = can_move_page(pwalk);
> +
> + if (ret)
> + nr_freed_pages++;
> + else
> + goto loopey;
> +
> + if (nr_freed_pages == nr_pages)
> + goto success;
> + }
> +
+ }

2004-10-04 03:24:29

by IWAMOTO Toshihiro

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

At Sun, 3 Oct 2004 11:07:23 -0300,
Marcelo Tosatti wrote:
>
> On Sun, Oct 03, 2004 at 01:13:38PM +0900, Hirokazu Takahashi wrote:
> > > 2)
> > > At migrate_onepage you add anonymous pages which aren't swap allocated
> > > to the swap cache
> > > + /*
> > > + * Put the page in a radix tree if it isn't in the tree yet.
> > > + */
> > > +#ifdef CONFIG_SWAP
> > > + if (PageAnon(page) && !PageSwapCache(page))
> > > + if (!add_to_swap(page, GFP_KERNEL)) {
> > > + unlock_page(page);
> > > + return ERR_PTR(-ENOSPC);
> > > + }
> > > +#endif /* CONFIG_SWAP */
> > >
> > > Why's that? You can copy anonymous pages without adding them to swap (thats
> > > what the patch I posted does).
> >
> > The reason is to guarantee that any anonymous page can be migrated anytime.
> > I want to block newly occurred accesses to the page during the migration
> > because it can't be migrated if there remain some references on it by
> > system calls, direct I/O and page faults.
>
> It would be nice if we could block pte faults in a way such to not need
> adding each anonymous page to swap. It can be too costly if you have a lot memory
> and it makes the whole operation dependable on swap size (if you dont have enough
> swap, you're dead).
>
> Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, but since
> migration is not a common operation anyway), or create a semaphore?

I chose the swap cache based implementation in order to minimize
slowdown of the normal code path. (I thought there's zero code
addition on the normal pagefault path when I designed this, but it's
no longer true...)

If we can agree on adding a new lock, there might be a better
implementation.

--
IWAMOTO Toshihiro

2004-10-04 04:09:24

by IWAMOTO Toshihiro

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

At Sat, 2 Oct 2004 15:33:49 -0300,
Marcelo Tosatti wrote:
>
> On Sat, Oct 02, 2004 at 06:30:15PM +0900, Hirokazu Takahashi wrote:
> 3) At migrate_page_common you assume additional page references
> (page_migratable returning -EAGAIN) means the code should try to writeout
> the page.
>
> Is that assumption always valid?
>
> In theory there is no need to writeout pages when migrating them to
> other zones - they will be copied and the dirty information retained (either
> in the PageDirty bit or radix tree tag).

It's true only when page->private is NULL. Otherwise writeback is
necessary to free buffer_head.

> I just noticed you do that on further patches (migrate_page_buffer), but AFAICS
> the writeout remains. Why arent you using migrate_page_buffer yet?
>
> I think the final aim should be to remove the need for "pageout()"
> completly.

Are you going to implement migrate_page_buffer for every file system?
I don't think it's worthwhile.

> Questions: are there any documents on the memory hotplug userspace tools?
> Where can I find them?
>
> Are Iwamoto's test programs available?

I've put them at the following URL, but I doubt they are useful for
you; there is no documentation for them.

http://people.valinux.co.jp/~iwamoto/mh/tests/

--
IWAMOTO Toshihiro

2004-10-04 06:52:47

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations


Hi,

Marcelo Tosatti wrote:

> +int can_move_page(struct page *page)
> +{
<snip>
> + if (page_count(page) == 0)
> + return 1;

I think there are 3 cases when page_count(page) == 0.

1. a page is free and in the buddy allocator.
2. a page is free and in per-cpu-pages list.
3. a page is in pagevec .

I think only case 1 pages meet your requirements.

I used the PG_private flag to distinguish case 1 from cases 2 and 3
in my no-bitmap buddy allocator posted before.
I added the PG_private flag to pages which are on the buddy allocator's free_list.
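
Something like this, just to illustrate the check (with my no-bitmap buddy
allocator applied, so that PG_private really means "on the buddy free_list"):

	/* only treat a count-zero page as truly free when it sits on the
	 * buddy free lists, not in a per-cpu list or a pagevec */
	static int page_is_buddy_free(struct page *page)
	{
		return page_count(page) == 0 && PagePrivate(page);
	}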

Regards

-- Kame
<[email protected]>



2004-10-04 08:17:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations



Marcelo Tosatti wrote:

>On Sat, Oct 02, 2004 at 12:30:01PM +1000, Nick Piggin wrote:
>
>>
>>Marcelo Tosatti wrote:
>>
>>
>>>With such a thing in place we can build a mechanism for kswapd
>>>(or a separate kernel thread, if needed) to notice when we are low on
>>>high order pages, and use the coalescing algorithm instead blindly
>>>freeing unique pages from LRU in the hope to build large physically
>>>contiguous memory areas.
>>>
>>>Comments appreciated.
>>>
>>>
>>>
>>Hi Marcelo,
>>Seems like a good idea... even with regular dumb kswapd "merging",
>>you may easily get stuck for example on systems without swap...
>>
>>Anyway, I'd like to get those beat kswapd patches in first. Then
>>your mechanism just becomes something like:
>>
>> if order-0 pages are low {
>> try to free memory
>> }
>> else if order-1 or higher pages are low {
>> try to coalesce_memory
>> if that fails, try to free memory
>> }
>>
>
>Hi Nick!
>
>

Sorry, I'd been away for the weekend which is why I didn't get a
chance to reply to you.

>I understand that kswapd is broken, and it needs to go into the page reclaim path
>to free pages when we are out of high order pages (what your
>"beat kswapd" patches do and fix high-order failures by doing so), but
>Linus's argument against it seems to be that "it potentially frees too much pages"
>causing harm to the system. He also says this has been tried in the past,
>with not nice results.
>
>

Not quite. I think a (the) big thing with my patch is that it will
check order-0...n watermarks when an order-n allocation is made.

So if there is no order >2 allocations happening, it won't attempt
to keep higher order memory available (until someone attempts an
allocation).

Basically, it gets kswapd doing the work when it would otherwise
have to be done in direct reclaim, *OR* otherwise indefinitely fail
if the allocations aren't blockable.
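
Roughly like this (all the helper names here are invented, just to show the
shape of the check):

	static int zone_watermarks_ok(struct zone *zone, unsigned int order)
	{
		unsigned int o;

		/* an order-n allocation checks every order up to n, so kswapd
		 * only works on higher orders once someone actually asks */
		for (o = 0; o <= order; o++)
			if (free_blocks_of_order(zone, o) < order_watermark(zone, o))
				return 0;
		return 1;
	}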

>And that is why its has not been merged into mainline.
>
>Is my interpretation correct?
>
>But right, kswapd needs to get fixed to honour high order
>pages.
>
>

Well Linus was silent on the issue after I answered his concerns.
I mailed him privately and he basically said that it seems sane,
and he is waiting for patches. Of course, by that stage it was
fairly late into 2.6.9, and the current behaviour isn't a regression,
so I'm shooting for 2.6.10.

Your defragmentor should sit very nicely on top of it, of course.


2004-10-04 13:02:00

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Hello,

Yes, I know what you're talking about.
The current kernel doesn't have any features for this.

So I've been wondering whether there might be any good solution
to help memory hot-removal. It would be nice if there were support
from filesystems and block devices.

> > However, while network is down network/cluster filesystems might not
> > release pages forever unlike in the case of block devices, which may
> > timeout or returns a error in case of failure.
>
> Where is the difference? As far as the VM is concerned, it is a latency
> problem. The fact of whether or not it is a permanent hang, a hang with
> a long timeout, or just a slow device is irrelevant because the VM
> doesn't actually know about these devices.
>
> > Each filesystem can control what the migration code does.
> > If it doesn't have anything to help memory migration, it's possible
> > to wait for the network coming up before starting memory migration,
> > or give up it if the network happen to be down. That's no problem.
>
> Wrong. It *is* a problem: Filesystems aren't required to know anything
> about the particulars of the underlying block/network/... device timeout
> semantics either.
>
> Think, for instance about EXT2. Where in the current code do you see
> that it is required to detect that it is running on top of something
> like the NBD device? Where does it figure out what the latencies of this
> device is?
>
> AFAICS, most filesystems in linux/fs/* have no knowledge whatsoever
> about the underlying block/network/... devices and their timeout values.
> Basing your decision about whether or not you need to manage high
> latency situations just by inspecting the filesystem type is therefore
> not going to give very reliable results.

Thank you,
Hirokazu Takahashi.

2004-10-04 19:02:38

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Mon, Oct 04, 2004 at 03:35:59AM +0900, Hirokazu Takahashi wrote:
> Hi, Marcelo
>
> > > > 2)
> > > > At migrate_onepage you add anonymous pages which aren't swap allocated
> > > > to the swap cache
> > > > + /*
> > > > + * Put the page in a radix tree if it isn't in the tree yet.
> > > > + */
> > > > +#ifdef CONFIG_SWAP
> > > > + if (PageAnon(page) && !PageSwapCache(page))
> > > > + if (!add_to_swap(page, GFP_KERNEL)) {
> > > > + unlock_page(page);
> > > > + return ERR_PTR(-ENOSPC);
> > > > + }
> > > > +#endif /* CONFIG_SWAP */
> > > >
> > > > Why's that? You can copy anonymous pages without adding them to swap (thats
> > > > what the patch I posted does).
> > >
> > > The reason is to guarantee that any anonymous page can be migrated anytime.
> > > I want to block newly occurred accesses to the page during the migration
> > > because it can't be migrated if there remain some references on it by
> > > system calls, direct I/O and page faults.
> >
> > It would be nice if we could block pte faults in a way that doesn't require
> > adding each anonymous page to swap. It can be too costly if you have a lot of memory,
> > and it makes the whole operation dependent on swap size (if you don't have enough
> > swap, you're dead).
> >
> > Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, though
> > migration is not a common operation anyway), or create a semaphore?
>
> I think the problem with the holding-mm->page_table_lock approach is
> that it doesn't allow the migration code to block. The semaphore
> approach would be better.

OK, I think the problem is that there can be more than one thread, with different address
spaces (i.e. different "current->mm" after fork), accessing the page.

Adding a waitqueue to the "anon_vma" structure (to be slept on at do_swap_page time,
and woken after copy-page-and-flags/unlock) can do the job, I think.
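
Something like this, completely untested and with made-up names, is
what I have in mind:

#include <linux/wait.h>
#include <linux/rmap.h>

/*
 * Made-up fields added to struct anon_vma:
 *
 *      wait_queue_head_t migrate_wait;
 *      int migrating;
 */

/* fault path (do_swap_page and friends): stall while a migration is in flight */
static inline void anon_vma_wait_migration(struct anon_vma *anon_vma)
{
        wait_event(anon_vma->migrate_wait, !anon_vma->migrating);
}

/* migration path: after copy-page-and-flags/unlock, let the faulters go */
static inline void anon_vma_end_migration(struct anon_vma *anon_vma)
{
        anon_vma->migrating = 0;
        wake_up_all(&anon_vma->migrate_wait);
}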

> I have another idea that each anonymous page can detach its swap entry
> after its migration.

Yes, that's a nice idea.

> It can be done by remove_exclusive_swap_page()
> if the page is forcibly remapped to the same address spaces by
> the touch_unmapped_address() I made.

touch_unmapped_address() ?

> > > Your approach will work fine on most anonymous pages, which aren't
> > > heavily accessed. I think it will be enough for memory defragmentation.
> >
> > Yes...
> >
> > > > 3) At migrate_page_common you assume additional page references
> > > > (page_migratable returning -EAGAIN) means the code should try to writeout
> > > > the page.
> > > >
> > > > Is that assumption always valid?
> > >
> > > -EAGAIN means that the page may require to be written back
> >
> > But why is it needed to write out pages? We shouldn't need to. At least,
> > from what I can understand.
>
> The migration code allows each filesystem to implement its own
> migration code or just use migrate_page_buffer() or
> migrate_page_common().
>
> migrate_page_common() is a default function used if the filesystem doesn't
> implement anything. The function is the most generic, and it tries
> to write back pages only if they are dirty and have buffers.

The thing is: What is the point of writing out pages?

We're just trying to migrate pages to another zone.

If it's under writeout, wait; if it's dirty, just move it to the other
zone.

Can you enlighten me?
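
What I have in mind is roughly the sketch below (locking and the radix
tree replacement are left out, and the helper name is made up):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>

/* Sketch: move a (possibly dirty) page to a new one without writing it out. */
static void copy_page_and_state(struct page *newpage, struct page *page)
{
        /* if it is under writeout, just wait for that to finish */
        wait_on_page_writeback(page);

        copy_highpage(newpage, page);           /* copy the data */

        /* keep the dirty information instead of calling pageout();
         * the radix tree dirty tag would need the same treatment */
        if (PageDirty(page))
                SetPageDirty(newpage);
        if (PageUptodate(page))
                SetPageUptodate(newpage);
        if (PageReferenced(page))
                SetPageReferenced(newpage);

        newpage->index = page->index;
        newpage->mapping = page->mapping;
}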

> > > or
> > > just to wait for a while since the page is just referred by system call
> > > or pagefault handler.
> >
> > I'm not sure if making that assumption is always valid.
> >
> > Kernel code can have an additional count on the page meaning "this page is pinned,
> > don't move it". At least that should be valid.
>
> Yes, I know. I have checked all of the code.
>
> AIO event buffers are pinned, therefore the memory-hotplug team plans
> to make pages for the event buffers assigned to non-hotpluggable
> memory regions.
>
> And pages in sendfile() might be pinned for a while in case of network
> problems. I think there may be some workarounds. The easiest way
> is just waiting for its timeout, and another way is changing the mode
> of sendfile() to copy pages in advance.
>
> Pages for NFS might also be pinned due to network problems.
> One of the ideas is to restrict NFS to allocating pages from a
> specific memory region, so that all memory except that region
> can be hot-removed. And it's possible to implement a whole
> migrate_page method, which may handle stuck pages.
>
> If the migration code is used for memory defragmentation, pinned pages
> must be avoided. I think it can be done with the non-blocking mode.

Right.

> > Any piece of code which holds a reference on a page for a long
> > time is going to be a pain for the algorithm right?
> >
>
> > > > 4)
> > > > About implementing a nonblocking version of it. The easier way, it
> > > > seems to me, is to pass a "block" argument to generic_migrate_page() and
> > > > use that.
> > >
> > > Yes.
> >
> > OK. I'll try to implement it this week (plus the radix_tree_replace
> > tag thingie).
>
> Thank you for that.

Any news about Iwamoto's test programs? :)

2004-10-04 19:07:50

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

On Mon, Oct 04, 2004 at 01:09:10PM +0900, IWAMOTO Toshihiro wrote:
> At Sat, 2 Oct 2004 15:33:49 -0300,
> Marcelo Tosatti wrote:
> >
> > On Sat, Oct 02, 2004 at 06:30:15PM +0900, Hirokazu Takahashi wrote:
> > 3) At migrate_page_common you assume additional page references
> > (page_migratable returning -EAGAIN) means the code should try to writeout
> > the page.
> >
> > Is that assumption always valid?
> >
> > In theory there is no need to write out pages when migrating them to
> > other zones - they will be copied and the dirty information retained (either
> > in the PageDirty bit or radix tree tag).
>
> It's true only when page->private is NULL. Otherwise writeback is
> necessary to free buffer_head.

You can move the buffer_heads too, can't you? Adjusting bh->b_page etc.

That's what migrate_page_buffer does, no?

Writing out pages which contain buffer_heads during memory migration
is really very bad.

Imagine gigabytes of pages with buffer_heads.
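
Something along these lines (a rough sketch; locking and refcounting
are omitted, and set_bh_page() does the b_page/b_data adjustment):

#include <linux/buffer_head.h>
#include <linux/mm.h>

/* Sketch: retarget a page's buffer_heads at the new page instead of
 * writing the page out first. */
static void move_buffers(struct page *page, struct page *newpage)
{
        struct buffer_head *bh, *head;

        head = page_buffers(page);
        bh = head;
        do {
                /* repoint b_page/b_data at the same offset in the new page */
                set_bh_page(bh, newpage, bh_offset(bh));
                bh = bh->b_this_page;
        } while (bh != head);

        /* hand the whole buffer ring over to the new page */
        newpage->private = page->private;
        page->private = 0;
        SetPagePrivate(newpage);
        ClearPagePrivate(page);
}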

> > I just noticed you do that in further patches (migrate_page_buffer), but AFAICS
> > the writeout remains. Why aren't you using migrate_page_buffer yet?
> >
> > I think the final aim should be to remove the need for "pageout()"
> > completely.
>
> Are you going to implement migrate_page_buffer for every file system?
> I don't think it's worthwhile.
>
> > Questions: are there any documents on the memory hotplug userspace tools?
> > Where can I find them?
> >
> > Are Iwamoto's test programs available?
>
> I've put them at the following URL, but I doubt they are useful for
> you; there is no documentation for them.
>
> http://people.valinux.co.jp/~iwamoto/mh/tests/

I'll take a look, thanks.

2004-10-04 19:10:37

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations


Yeap that is a nice optimization, thanks Hiroyuki.

On Mon, Oct 04, 2004 at 11:38:32AM +0900, Hiroyuki KAMEZAWA wrote:
>
> how about inserting this if statement?

2004-10-05 02:53:45

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [RFC] memory defragmentation to satisfy high order allocations

Hi, Marcelo

> > > > > Why's that? You can copy anonymous pages without adding them to swap (thats
> > > > > what the patch I posted does).
> > > >
> > > > The reason is to guarantee that any anonymous page can be migrated anytime.
> > > > I want to block newly occurred accesses to the page during the migration
> > > > because it can't be migrated if there remain some references on it by
> > > > system calls, direct I/O and page faults.
> > >
> > > It would be nice if we could block pte faults in a way that doesn't require
> > > adding each anonymous page to swap. It can be too costly if you have a lot of memory,
> > > and it makes the whole operation dependent on swap size (if you don't have enough
> > > swap, you're dead).
> > >
> > > Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, though
> > > migration is not a common operation anyway), or create a semaphore?
> >
> > I think the problem with the holding-mm->page_table_lock approach is
> > that it doesn't allow the migration code to block. The semaphore
> > approach would be better.
>
> OK, I think the problem is that there can be more than one thread, with different address
> spaces (i.e. different "current->mm" after fork), accessing the page.

Yes.

> Adding a waitqueue to the "anon_vma" structure (to be slept on at do_swap_page time,
> and woken after copy-page-and-flags/unlock) can do the job, I think.
>
> > I have another idea that each anonymous page can detach its swap entry
> > after its migration.
>
> Yes, that's a nice idea.

I'll do it in a week.
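
The detach itself should be as simple as something like this (sketch
only; it assumes the page has already been remapped into every address
space that uses it):

        lock_page(page);
        if (PageSwapCache(page))
                /* frees the swap entry and swap cache if nobody else uses it */
                remove_exclusive_swap_page(page);
        unlock_page(page);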

> > It can be done by remove_exclusive_swap_page()
> > if the page is forcibly remapped to the same address spaces by
> > the touch_unmapped_address() I made.
>
> touch_unmapped_address() ?

I've made two functions.

record_unmapped_address() records the mm and the address where a target page
has been mapped. touch_unmapped_address() remaps the page there again.
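
Very roughly, and much simplified (this is not the real code; the
"remap" here is just a simulated fault through get_user_pages(), which
may differ from what the actual implementation does):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/pagemap.h>

/* Hypothetical record of where the page used to be mapped. */
struct unmapped_rec {
        struct mm_struct *mm;
        unsigned long address;
};

static void record_unmapped_address(struct unmapped_rec *rec,
                                    struct mm_struct *mm, unsigned long address)
{
        rec->mm = mm;
        rec->address = address;
}

static void touch_unmapped_address(struct unmapped_rec *rec)
{
        struct page *page = NULL;
        int ret;

        down_read(&rec->mm->mmap_sem);
        /* fault the (migrated) page back into the recorded address space */
        ret = get_user_pages(current, rec->mm, rec->address, 1, 0, 0,
                             &page, NULL);
        up_read(&rec->mm->mmap_sem);
        if (ret == 1)
                page_cache_release(page);
}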

> > > > Your approach will work fine on most anonymous pages, which aren't
> > > > heavily accessed. I think it will be enough for memory defragmentation.
> > >
> > > Yes...
> > >
> > > > > 3) At migrate_page_common you assume additional page references
> > > > > (page_migratable returning -EAGAIN) means the code should try to writeout
> > > > > the page.
> > > > >
> > > > > Is that assumption always valid?
> > > >
> > > > -EAGAIN means that the page may require to be written back
> > >
> > > But why is it needed to write out pages? We shouldn't need to. At least,
> > > from what I can understand.
> >
> > The migration code allows each filesystem to implement its own
> > migration code or just use migrate_page_buffer() or
> > migrate_page_common().
> >
> > migrate_page_common() is a default function used if the filesystem doesn't
> > implement anything. The function is the most generic, and it tries
> > to write back pages only if they are dirty and have buffers.
>
> The thing is: What is the point of writing out pages?

It was the easiest way to handle pages with buffers when Iwamoto
and I started to implement it. We thought it was slow but it would
work for all kinds of filesystems.

> We're just trying to migrate pages to another zone.
>
> If it's under writeout, wait; if it's dirty, just move it to the other
> zone.
>
> Can you enlighten me?

Yes, I also realize that.
migrate_page_buffer() will do this, but I'm not certain it will work
for all kinds of filesystems. I guess there might be some exceptions.
We may need a special operation to handle pages on a filesystem
which has a releasepage method.


Thanks,
Hirokazu Takahashi.