This is a patchset intended to introduce page migration into the kernel
through a simple implementation of swap based page migration.
The aim is to be minimally intrusive in order to have some hopes for inclusion
into 2.6.15. A separate direct page migration patch is being developed that
applies on top of this patch. The direct migration patch is being discussed on
<[email protected]>.
Much of the code is based on code that the memory hotplug project and Ray Bryant
have been working on for a long time. See http://sourceforge.net/projects/lhms/
Changes from V4 to V5:
- Use existing lru_add caches to return pages to the active and inactive lists.
- Some cleanup (const attribute for sys_migrate_pages etc)
Changes from V3 to V4:
- patch against 2.6.14-rc5-mm1.
- Correctly gather pages in migrate_page_add()
- Restructure swapout code for easy later application of the direct migration
patches. Rename swapout() to migrate_pages().
- Add PF_SWAPWRITE support to allow a process to write to swap. Save
and restore the earlier state to allow nesting of PF_SWAPWRITE use.
- Fix sys_migrate_pages permission check (thanks Ray).
Changes from V2 to V3:
- Break out common code for page eviction (Thanks to a patch by Magnus Damm)
- Add check to avoid MPOL_MF_MOVE moving pages that are also accessed from
another address space. Add support for MPOL_MF_MOVE_ALL to override this
(requires superuser privileges).
- Update overview regarding the direct page migration patchset following soon and
cut long-winded explanations.
- Add the sys_migrate_pages patch.
- Check cpuset restrictions in sys_migrate_pages.
Changes from V1 to V2:
- Patch against 2.6.14-rc4-mm1
- Remove move_pages() function
- Code cleanup to make it less invasive.
- Fix missing lru_add_drain() invocation from isolate_lru_page()
In a NUMA system it is often beneficial to be able to move the memory
in use by a process to different nodes in order to enhance performance.
Currently Linux simply does not support this facility. This patchset
implements page migration via a new syscall sys_migrate_pages and via
the memory policy layer with the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL
flags.
Page migration is also useful for other purposes:
1. Memory hotplug. Migrating processes off a memory node that is going
to be disconnected.
2. Remapping of bad pages. These could be detected through soft ECC errors
and other mechanisms.
migrate_pages() can only migrate pages under certain conditions. These other
uses may require additional measures to ensure that pages are migratable. The
hotplug project, for example, restricts allocations to removable memory.
The patchset consists of five patches:
1. LRU operations
Add basic operations to remove pages from the LRU lists and return
them back to it.
2. PF_SWAPWRITE
Allow a process to set PF_SWAPWRITE in its flags in order to be allowed
to write pages to swap space.
3. migrate_pages() implementation
Adds a function to mm/vmscan.c called migrate_pages(). The functionality
of that function is restricted to swapping out pages. An additional patch
is necessary for direct page migration.
4. MPOL_MF_MOVE flag for memory policies.
This implements MPOL_MF_MOVE in addition to MPOL_MF_STRICT. MPOL_MF_STRICT
allows checking whether all pages in a memory area obey the memory policy.
MPOL_MF_MOVE will migrate all pages that do not conform to the memory policy.
If pages are evicted then the system will allocate pages conforming to the
policy on swap-in.
5. sys_migrate_pages system call and cpuset API
Adds a new system call
sys_migrate_pages(pid, maxnode, from_nodes, to_nodes)
to migrate the pages of a process to a different set of nodes, and a function
do_migrate_pages(struct mm_struct *, from_nodes, to_nodes, move_flags)
for use of the migration mechanism by cpusets.
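As a rough, user-space illustration only (not part of the patchset): the sketch below
invokes the new system call through syscall(2), assuming the x86_64 syscall number 256
that this patchset assigns; the node numbers are made up for the example. The return
value is the number of pages that could not be moved.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_migrate_pages
#define __NR_migrate_pages 256	/* x86_64 number assigned by this patchset */
#endif

int main(void)
{
	/* Move the calling process's pages from node 0 to node 1. */
	unsigned long from = 1UL << 0;
	unsigned long to = 1UL << 1;
	unsigned long maxnode = 8 * sizeof(unsigned long);
	long left;

	left = syscall(__NR_migrate_pages, getpid(), maxnode, &from, &to);
	if (left < 0)
		perror("migrate_pages");
	else
		printf("%ld pages could not be moved\n", left);
	return 0;
}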
=====
URLs referring to the discussion regarding the initial version of these
patches.
Page eviction: http://marc.theaimsgroup.com/?l=linux-mm&m=112922756730989&w=2
Numa policy : http://marc.theaimsgroup.com/?l=linux-mm&m=112922756724715&w=2
Discussion of V2 of the patchset:
http://marc.theaimsgroup.com/?t=112959680300007&r=1&w=2
Discussion of V3:
http://marc.theaimsgroup.com/?t=112984939600003&r=1&w=2
Add page migration support via swap to the NUMA policy layer
This patch adds page migration support to the NUMA policy layer. An additional
flag MPOL_MF_MOVE is introduced for mbind. If MPOL_MF_MOVE is specified then
pages that do not conform to the memory policy will be evicted from memory.
When they are paged back in, new pages will be allocated following the NUMA policy.
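For illustration only, here is a user-space sketch of how the new flag might be used
together with mbind(). It assumes the mbind() wrapper from the numactl library's
<numaif.h>; MPOL_MF_MOVE is defined locally from the value this patch introduces in
case the installed header does not know the flag yet.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numaif.h>

#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)	/* value introduced by this patch */
#endif

int main(void)
{
	size_t len = 16 * 4096;
	unsigned long nodemask = 1UL << 1;	/* bind to node 1 */
	void *buf;

	if (posix_memalign(&buf, 4096, len))
		return 1;
	memset(buf, 0, len);	/* fault the pages in somewhere */

	/*
	 * Rebind the range to node 1. Pages already faulted in on other
	 * nodes are evicted and will be allocated on node 1 when they are
	 * paged back in (swap-based migration).
	 */
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  8 * sizeof(nodemask), MPOL_MF_MOVE))
		perror("mbind");

	free(buf);
	return 0;
}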
Changes V4->V5
- make nodemask_t * parameter const
Changes V3->V4
- migrate_page_add: Do pagelist processing directly instead
of doing it via isolate_lru_page().
- Use the migrate_pages() to evict the pages.
Changes V2->V3
- Add check to not migrate pages shared with other processes (but allow
migration of memory shared between threads having a common mm_struct)
- MPOL_MF_MOVE_ALL to override and move even pages shared with other
processes. This only works if the process issuing this call has
CAP_SYS_RESOURCE because this enables the moving of pages owned
by other processes.
- MPOL_MF_DISCONTIG_OK (internal use only) to not check for continuous VMAs.
Enable MPOL_MF_DISCONTIG_OK if policy to be set is NULL (default policy).
Changes V1->V2
- Add vma_migratable() function for future enhancements.
- No side effects on WARN_ON
- Remove move_pages for now
- Make patch fit 2.6.14-rc4-mm1
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.14-rc5-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/mempolicy.c 2005-10-31 14:10:53.000000000 -0800
+++ linux-2.6.14-rc5-mm1/mm/mempolicy.c 2005-10-31 18:49:32.000000000 -0800
@@ -83,9 +83,13 @@
#include <linux/init.h>
#include <linux/compat.h>
#include <linux/mempolicy.h>
+#include <linux/swap.h>
#include <asm/tlbflush.h>
#include <asm/uaccess.h>
+/* Internal MPOL_MF_xxx flags */
+#define MPOL_MF_DISCONTIG_OK (1<<20) /* Skip checks for continuous vmas */
+
static kmem_cache_t *policy_cache;
static kmem_cache_t *sn_cache;
@@ -179,9 +183,62 @@ static struct mempolicy *mpol_new(int mo
return policy;
}
+/* Check if we are the only process mapping the page in question */
+static inline int single_mm_mapping(struct mm_struct *mm,
+ struct address_space *mapping)
+{
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ int rc = 1;
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX)
+ if (mm != vma->vm_mm) {
+ rc = 0;
+ goto out;
+ }
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
+ if (mm != vma->vm_mm) {
+ rc = 0;
+ goto out;
+ }
+out:
+ spin_unlock(&mapping->i_mmap_lock);
+ return rc;
+}
+
+/*
+ * Add a page to be migrated to the pagelist
+ */
+static void migrate_page_add(struct vm_area_struct *vma,
+ struct page *page, struct list_head *pagelist, unsigned long flags)
+{
+ /*
+ * Avoid migrating a page that is shared by others and not writable.
+ */
+ if ((flags & MPOL_MF_MOVE_ALL) ||
+ PageAnon(page) ||
+ mapping_writably_mapped(page->mapping) ||
+ single_mm_mapping(vma->vm_mm, page->mapping)
+ ) {
+ int rc = isolate_lru_page(page);
+
+ if (rc == 1)
+ list_add(&page->lru, pagelist);
+ /*
+ * If the isolate attempt was not successful
+ * then we just encountered an unswappable
+ * page. Something must be wrong.
+ */
+ WARN_ON(rc == 0);
+ }
+}
+
/* Ensure all existing pages follow the policy. */
static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ const nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pte_t *orig_pte;
pte_t *pte;
@@ -200,15 +257,23 @@ static int check_pte_range(struct vm_are
continue;
}
nid = pfn_to_nid(pfn);
- if (!node_isset(nid, *nodes))
- break;
+ if (!node_isset(nid, *nodes)) {
+ if (pagelist) {
+ struct page *page = pfn_to_page(pfn);
+
+ migrate_page_add(vma, page, pagelist, flags);
+ } else
+ break;
+ }
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap_unlock(orig_pte, ptl);
return addr != end;
}
static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ const nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pmd_t *pmd;
unsigned long next;
@@ -218,14 +283,17 @@ static inline int check_pmd_range(struct
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
- if (check_pte_range(vma, pmd, addr, next, nodes))
+ if (check_pte_range(vma, pmd, addr, next, nodes,
+ flags, pagelist))
return -EIO;
} while (pmd++, addr = next, addr != end);
return 0;
}
static inline int check_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ const nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pud_t *pud;
unsigned long next;
@@ -235,14 +303,17 @@ static inline int check_pud_range(struct
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- if (check_pmd_range(vma, pud, addr, next, nodes))
+ if (check_pmd_range(vma, pud, addr, next, nodes,
+ flags, pagelist))
return -EIO;
} while (pud++, addr = next, addr != end);
return 0;
}
static inline int check_pgd_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ const nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pgd_t *pgd;
unsigned long next;
@@ -252,16 +323,35 @@ static inline int check_pgd_range(struct
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- if (check_pud_range(vma, pgd, addr, next, nodes))
+ if (check_pud_range(vma, pgd, addr, next, nodes,
+ flags, pagelist))
return -EIO;
} while (pgd++, addr = next, addr != end);
return 0;
}
-/* Step 1: check the range */
+/* Check if a vma is migratable */
+static inline int vma_migratable(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & (
+ VM_LOCKED |
+ VM_IO |
+ VM_RESERVED |
+ VM_DENYWRITE |
+ VM_SHM
+ ))
+ return 0;
+ return 1;
+}
+
+/*
+ * Check if all pages in a range are on a set of nodes.
+ * If pagelist != NULL then isolate pages from the LRU and
+ * put them on the pagelist.
+ */
static struct vm_area_struct *
check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
- nodemask_t *nodes, unsigned long flags)
+ const nodemask_t *nodes, unsigned long flags, struct list_head *pagelist)
{
int err;
struct vm_area_struct *first, *vma, *prev;
@@ -273,17 +363,24 @@ check_range(struct mm_struct *mm, unsign
return ERR_PTR(-EACCES);
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
- if (!vma->vm_next && vma->vm_end < end)
- return ERR_PTR(-EFAULT);
- if (prev && prev->vm_end < vma->vm_start)
- return ERR_PTR(-EFAULT);
- if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
+ if (!(flags & MPOL_MF_DISCONTIG_OK)) {
+ if (!vma->vm_next && vma->vm_end < end)
+ return ERR_PTR(-EFAULT);
+ if (prev && prev->vm_end < vma->vm_start)
+ return ERR_PTR(-EFAULT);
+ }
+ if (!is_vm_hugetlb_page(vma) &&
+ ((flags & MPOL_MF_STRICT) ||
+ ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
+ vma_migratable(vma)
+ ))) {
unsigned long endvma = vma->vm_end;
if (endvma > end)
endvma = end;
if (vma->vm_start > start)
start = vma->vm_start;
- err = check_pgd_range(vma, start, endvma, nodes);
+ err = check_pgd_range(vma, start, endvma, nodes,
+ flags, pagelist);
if (err) {
first = ERR_PTR(err);
break;
@@ -357,33 +454,59 @@ long do_mbind(unsigned long start, unsig
struct mempolicy *new;
unsigned long end;
int err;
+ LIST_HEAD(pagelist);
- if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
+ if ((flags & ~(unsigned long)(MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ || mode > MPOL_MAX)
return -EINVAL;
+ if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+
if (start & ~PAGE_MASK)
return -EINVAL;
+
if (mode == MPOL_DEFAULT)
flags &= ~MPOL_MF_STRICT;
+
len = (len + PAGE_SIZE - 1) & PAGE_MASK;
end = start + len;
+
if (end < start)
return -EINVAL;
if (end == start)
return 0;
+
if (mpol_check_policy(mode, nmask))
return -EINVAL;
+
new = mpol_new(mode, nmask);
if (IS_ERR(new))
return PTR_ERR(new);
+ /*
+ * If we are using the default policy then operation
+ * on discontinuous address spaces is okay after all
+ */
+ if (!new)
+ flags |= MPOL_MF_DISCONTIG_OK;
+
PDprintk("mbind %lx-%lx mode:%ld nodes:%lx\n",start,start+len,
mode,nodes_addr(nodes)[0]);
down_write(&mm->mmap_sem);
- vma = check_range(mm, start, end, nmask, flags);
+ vma = check_range(mm, start, end, nmask, flags,
+ (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ? &pagelist : NULL);
err = PTR_ERR(vma);
- if (!IS_ERR(vma))
+ if (!IS_ERR(vma)) {
err = mbind_range(vma, start, end, new);
+ if (!list_empty(&pagelist))
+ migrate_pages(&pagelist, NULL);
+ if (!err && !list_empty(&pagelist) && (flags & MPOL_MF_STRICT))
+ err = -EIO;
+ }
+ if (!list_empty(&pagelist))
+ putback_lru_pages(&pagelist);
+
up_write(&mm->mmap_sem);
mpol_free(new);
return err;
Index: linux-2.6.14-rc5-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/mempolicy.h 2005-10-31 14:10:52.000000000 -0800
+++ linux-2.6.14-rc5-mm1/include/linux/mempolicy.h 2005-10-31 18:47:59.000000000 -0800
@@ -22,6 +22,8 @@
/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
+#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */
+#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */
#ifdef __KERNEL__
Isolation of pages from the LRU
Implement functions to isolate pages from the LRU and put them back later.
An earlier implementation was provided by
Hirokazu Takahashi <[email protected]> and
IWAMOTO Toshihiro <[email protected]> for the memory
hotplug project.
From Magnus:
This patch for 2.6.14-rc4-mm1 breaks out isolate_lru_page() and
putback_lru_page() and makes them inline. I'd like to build my code on
top of this patch, and I think your page eviction code could be built
on top of this patch too - without introducing too much duplicated
code.
Changes V4-V5:
- Use the caches to return inactive and active pages instead
of __putback_lru. Remove putback_lru macro
- Add move_to_lru function
Changes V3-V4:
- Remove obsolete second parameter from isolate_lru_page
- Mention the original authors
Signed-off-by: Magnus Damm <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.14-rc5-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/mm_inline.h 2005-10-19 23:23:05.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/mm_inline.h 2005-10-31 13:22:12.000000000 -0800
@@ -38,3 +38,32 @@ del_page_from_lru(struct zone *zone, str
zone->nr_inactive--;
}
}
+
+/*
+ * Isolate one page from the LRU lists.
+ *
+ * - zone->lru_lock must be held
+ */
+static inline int
+__isolate_lru_page(struct zone *zone, struct page *page)
+{
+ if (TestClearPageLRU(page)) {
+ if (get_page_testone(page)) {
+ /*
+ * It is being freed elsewhere
+ */
+ __put_page(page);
+ SetPageLRU(page);
+ return -ENOENT;
+ } else {
+ if (PageActive(page))
+ del_page_from_active_list(zone, page);
+ else
+ del_page_from_inactive_list(zone, page);
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-10-24 10:27:30.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-10-31 13:21:57.000000000 -0800
@@ -578,43 +578,75 @@ keep:
*
* Appropriate locks must be held before calling this function.
*
+ * @zone: The zone where lru_lock is held.
* @nr_to_scan: The number of pages to look through on the list.
* @src: The LRU list to pull pages off.
* @dst: The temp list to put pages on to.
- * @scanned: The number of pages that were scanned.
*
- * returns how many pages were moved onto *@dst.
+ * returns the number of pages that were scanned.
*/
-static int isolate_lru_pages(int nr_to_scan, struct list_head *src,
- struct list_head *dst, int *scanned)
+static int isolate_lru_pages(struct zone *zone, int nr_to_scan,
+ struct list_head *src, struct list_head *dst)
{
- int nr_taken = 0;
struct page *page;
- int scan = 0;
+ int scanned = 0;
- while (scan++ < nr_to_scan && !list_empty(src)) {
+ while (scanned++ < nr_to_scan && !list_empty(src)) {
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);
- if (!TestClearPageLRU(page))
- BUG();
- list_del(&page->lru);
- if (get_page_testone(page)) {
- /*
- * It is being freed elsewhere
- */
- __put_page(page);
- SetPageLRU(page);
- list_add(&page->lru, src);
- continue;
- } else {
+ switch (__isolate_lru_page(zone, page)) {
+ case 1:
+ /* Succeeded to isolate page */
list_add(&page->lru, dst);
- nr_taken++;
+ break;
+ case -1:
+ /* Not possible to isolate */
+ list_move(&page->lru, src);
+ break;
+ default:
+ BUG();
}
}
- *scanned = scan;
- return nr_taken;
+ return scanned;
+}
+
+static void lru_add_drain_per_cpu(void *dummy)
+{
+ lru_add_drain();
+}
+
+/*
+ * Isolate one page from the LRU lists and put it on the
+ * indicated list. Do necessary cache draining if the
+ * page is not on the LRU lists yet.
+ *
+ * Result:
+ * 0 = page not on LRU list
+ * 1 = page removed from LRU list and added to the specified list.
+ * -1 = page is being freed elsewhere.
+ */
+int isolate_lru_page(struct page *page)
+{
+ int rc = 0;
+ struct zone *zone = page_zone(page);
+
+redo:
+ spin_lock_irq(&zone->lru_lock);
+ rc = __isolate_lru_page(zone, page);
+ spin_unlock_irq(&zone->lru_lock);
+ if (rc == 0) {
+ /*
+ * Maybe this page is still waiting for a cpu to drain it
+ * from one of the lru lists?
+ */
+ smp_call_function(&lru_add_drain_per_cpu, NULL, 0 , 1);
+ lru_add_drain();
+ if (PageLRU(page))
+ goto redo;
+ }
+ return rc;
}
/*
@@ -632,18 +664,15 @@ static void shrink_cache(struct zone *zo
spin_lock_irq(&zone->lru_lock);
while (max_scan > 0) {
struct page *page;
- int nr_taken;
int nr_scan;
int nr_freed;
- nr_taken = isolate_lru_pages(sc->swap_cluster_max,
- &zone->inactive_list,
- &page_list, &nr_scan);
- zone->nr_inactive -= nr_taken;
+ nr_scan = isolate_lru_pages(zone, sc->swap_cluster_max,
+ &zone->inactive_list, &page_list);
zone->pages_scanned += nr_scan;
spin_unlock_irq(&zone->lru_lock);
- if (nr_taken == 0)
+ if (list_empty(&page_list))
goto done;
max_scan -= nr_scan;
@@ -682,6 +711,39 @@ done:
pagevec_release(&pvec);
}
+static inline void move_to_lru(struct *page)
+{
+ list_del(&page->lru);
+ if (PageActive(page)) {
+ /*
+ * lru_cache_add_active checks that
+ * the PG_active bit is off.
+ */
+ ClearPageActive(page);
+ lru_cache_add_activE(page);
+ } else {
+ lru_cache_add(page);
+ put_page(page);
+}
+
+/*
+ * Add isolated pages on the list back to the LRU
+ *
+ * returns the number of pages put back.
+ */
+int putback_lru_pages(struct list_head *l)
+{
+ struct page * page;
+ struct page * page2;
+ int count = 0;
+
+ list_for_each_entry_safe(page, page2, l, lru) {
+ move_to_lru(page);
+ count++;
+ }
+ return count;
+}
+
/*
* This moves pages from the active list to the inactive list.
*
@@ -718,10 +780,9 @@ refill_inactive_zone(struct zone *zone,
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
- &l_hold, &pgscanned);
+ pgscanned = isolate_lru_pages(zone, nr_pages,
+ &zone->active_list, &l_hold);
zone->pages_scanned += pgscanned;
- zone->nr_active -= pgmoved;
spin_unlock_irq(&zone->lru_lock);
/*
Index: linux-2.6.14-rc5-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/swap.h 2005-10-24 10:27:13.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/swap.h 2005-10-31 13:21:57.000000000 -0800
@@ -176,6 +176,9 @@ extern int zone_reclaim(struct zone *, u
extern int shrink_all_memory(int);
extern int vm_swappiness;
+extern int isolate_lru_page(struct page *p);
+extern int putback_lru_pages(struct list_head *l);
+
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
Add PF_SWAPWRITE to control a process's permission to write to swap.
- Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd
and pdflush
- Set PF_SWAPWRITE flag for kswapd and pdflush
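Other kernel code that wants to write to swap temporarily would set the flag around
the operation and restore the previous state so that nested users do not clear a flag
somebody else set. A minimal sketch of that pattern, mirroring what migrate_pages()
in the next patch does (the function name here is hypothetical):

#include <linux/sched.h>

/* Hypothetical helper: do some work that may write pages to swap. */
static void do_swap_writing_work(void)
{
	int had_swapwrite = current->flags & PF_SWAPWRITE;

	if (!had_swapwrite)
		current->flags |= PF_SWAPWRITE;

	/* ... writeback to swap is now permitted by may_write_to_queue() ... */

	if (!had_swapwrite)
		current->flags &= ~PF_SWAPWRITE;
}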
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.14-rc5-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/sched.h 2005-10-24 10:27:29.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/sched.h 2005-10-31 13:30:48.000000000 -0800
@@ -914,6 +914,7 @@ do { if (atomic_dec_and_test(&(tsk)->usa
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
#define PF_RANDOMIZE 0x00800000 /* randomize virtual address space */
+#define PF_SWAPWRITE 0x01000000 /* the process is allowed to write to swap */
/*
* Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-10-31 13:21:57.000000000 -0800
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-10-31 13:30:48.000000000 -0800
@@ -263,9 +263,7 @@ static inline int is_page_cache_freeable
static int may_write_to_queue(struct backing_dev_info *bdi)
{
- if (current_is_kswapd())
- return 1;
- if (current_is_pdflush()) /* This is unlikely, but why not... */
+ if (current->flags & PF_SWAPWRITE)
return 1;
if (!bdi_write_congested(bdi))
return 1;
@@ -1289,7 +1287,7 @@ static int kswapd(void *p)
* us from recursively trying to free more memory as we're
* trying to free the first piece of memory in the first place).
*/
- tsk->flags |= PF_MEMALLOC|PF_KSWAPD;
+ tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
order = 0;
for ( ; ; ) {
Index: linux-2.6.14-rc5-mm1/mm/pdflush.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/pdflush.c 2005-10-24 10:27:21.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/pdflush.c 2005-10-31 13:30:48.000000000 -0800
@@ -90,7 +90,7 @@ struct pdflush_work {
static int __pdflush(struct pdflush_work *my_work)
{
- current->flags |= PF_FLUSHER;
+ current->flags |= PF_FLUSHER | PF_SWAPWRITE;
my_work->fn = NULL;
my_work->who = current;
INIT_LIST_HEAD(&my_work->list);
Page migration support in vmscan.c
This patch adds the basic page migration function with a minimal implementation
that only allows the eviction of pages to swap space.
Page eviction and migration may be used to move pages to other nodes, to suspend
programs, or to remap single pages (useful for faulty pages or pages with soft ECC
failures).
The process is as follows:
The function wanting to migrate pages must first build a list of pages to be
migrated or evicted and take them off the lru lists via isolate_lru_page().
isolate_lru_page() determines whether a page can be isolated based on its LRU bit being set.
The actual migration or swapout then happens by calling migrate_pages().
migrate_pages() does its best to migrate or swap out the pages and makes multiple passes
over the list. Some pages can only be swapped out once they are no longer dirty, so
migrate_pages() may start writing out dirty pages in its initial passes.
However, migrate_pages may not be able to migrate or evict all pages for a variety
of reasons.
The remaining pages may be returned to the LRU lists using putback_lru_pages().
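To make the calling convention concrete, here is a sketch of a hypothetical kernel
user of the interface; evict_page_array() is made up, while isolate_lru_page(),
migrate_pages() and putback_lru_pages() are the functions added by this series.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/swap.h>

/* Hypothetical example: try to evict an array of pages to swap. */
static int evict_page_array(struct page **pages, int nr)
{
	LIST_HEAD(pagelist);
	int i, failed;

	/* Step 1: take the pages off the LRU lists. */
	for (i = 0; i < nr; i++)
		if (isolate_lru_page(pages[i]) == 1)
			list_add(&pages[i]->lru, &pagelist);

	/* Step 2: swap out whatever can be moved (second list unused here). */
	failed = migrate_pages(&pagelist, NULL);

	/* Step 3: return anything that could not be evicted to the LRU. */
	if (!list_empty(&pagelist))
		putback_lru_pages(&pagelist);

	return failed;
}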
Changelog V4->V5:
- Use the lru caches to return pages to the LRU
Changelog V3->V4:
- Restructure code so that applying the patches to support full migration will
only require minimal changes. Rename swapout_pages() to migrate_pages().
Changelog V2->V3:
- Extract common code from shrink_list() and swapout_pages()
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.14-rc5-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/swap.h 2005-10-31 14:11:20.000000000 -0800
+++ linux-2.6.14-rc5-mm1/include/linux/swap.h 2005-10-31 18:38:14.000000000 -0800
@@ -179,6 +179,8 @@ extern int vm_swappiness;
extern int isolate_lru_page(struct page *p);
extern int putback_lru_pages(struct list_head *l);
+extern int migrate_pages(struct list_head *l, struct list_head *t);
+
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-10-31 14:11:20.000000000 -0800
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-10-31 18:41:13.000000000 -0800
@@ -368,6 +368,45 @@ static pageout_t pageout(struct page *pa
return PAGE_CLEAN;
}
+static inline int remove_mapping(struct address_space *mapping,
+ struct page *page)
+{
+ if (!mapping)
+ return 0; /* truncate got there first */
+
+ write_lock_irq(&mapping->tree_lock);
+
+ /*
+ * The non-racy check for busy page. It is critical to check
+ * PageDirty _after_ making sure that the page is freeable and
+ * not in use by anybody. (pagecache + us == 2)
+ */
+ if (unlikely(page_count(page) != 2))
+ goto cannot_free;
+ smp_rmb();
+ if (unlikely(PageDirty(page)))
+ goto cannot_free;
+
+ if (PageSwapCache(page)) {
+ swp_entry_t swap = { .val = page_private(page) };
+ add_to_swapped_list(swap.val);
+ __delete_from_swap_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
+ swap_free(swap);
+ __put_page(page); /* The pagecache ref */
+ return 1;
+ }
+
+ __remove_from_page_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
+ __put_page(page);
+ return 1;
+
+cannot_free:
+ write_unlock_irq(&mapping->tree_lock);
+ return 0;
+}
+
/*
* shrink_list adds the number of reclaimed pages to sc->nr_reclaimed
*/
@@ -506,37 +545,8 @@ static int shrink_list(struct list_head
goto free_it;
}
- if (!mapping)
- goto keep_locked; /* truncate got there first */
-
- write_lock_irq(&mapping->tree_lock);
-
- /*
- * The non-racy check for busy page. It is critical to check
- * PageDirty _after_ making sure that the page is freeable and
- * not in use by anybody. (pagecache + us == 2)
- */
- if (unlikely(page_count(page) != 2))
- goto cannot_free;
- smp_rmb();
- if (unlikely(PageDirty(page)))
- goto cannot_free;
-
-#ifdef CONFIG_SWAP
- if (PageSwapCache(page)) {
- swp_entry_t swap = { .val = page_private(page) };
- add_to_swapped_list(swap.val);
- __delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
- swap_free(swap);
- __put_page(page); /* The pagecache ref */
- goto free_it;
- }
-#endif /* CONFIG_SWAP */
-
- __remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
+ if (!remove_mapping(mapping, page))
+ goto keep_locked;
free_it:
unlock_page(page);
@@ -545,10 +555,6 @@ free_it:
__pagevec_release_nonlru(&freed_pvec);
continue;
-cannot_free:
- write_unlock_irq(&mapping->tree_lock);
- goto keep_locked;
-
activate_locked:
SetPageActive(page);
pgactivate++;
@@ -567,6 +573,156 @@ keep:
}
/*
+ * swapout a single page
+ * page is locked upon entry, unlocked on exit
+ *
+ * return codes:
+ * 0 = complete
+ * 1 = retry
+ */
+static int swap_page(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+
+ if (page_mapped(page) && mapping)
+ if (try_to_unmap(page) != SWAP_SUCCESS)
+ goto unlock_retry;
+
+ if (PageDirty(page)) {
+ /* Page is dirty, try to write it out here */
+ switch(pageout(page, mapping)) {
+ case PAGE_KEEP:
+ case PAGE_ACTIVATE:
+ goto unlock_retry;
+ case PAGE_SUCCESS:
+ goto retry;
+ case PAGE_CLEAN:
+ ; /* try to free the page below */
+ }
+ }
+
+ if (PagePrivate(page)) {
+ if (!try_to_release_page(page, GFP_KERNEL))
+ goto unlock_retry;
+ if (!mapping && page_count(page) == 1)
+ goto free_it;
+ }
+
+ if (!remove_mapping(mapping, page))
+ goto unlock_retry; /* truncate got there first */
+
+free_it:
+ /*
+ * We may free pages that were taken off the active list
+ * by isolate_lru_page. However, free_hot_cold_page will check
+ * if the active bit is set. So clear it.
+ */
+ ClearPageActive(page);
+
+ list_del(&page->lru);
+ unlock_page(page);
+ put_page(page);
+ return 0;
+
+unlock_retry:
+ unlock_page(page);
+
+retry:
+ return 1;
+}
+/*
+ * migrate_pages
+ *
+ * Two lists are passed to this function. The first list
+ * contains the pages isolated from the LRU to be migrated.
+ * The second list contains new pages that the pages isolated
+ * can be moved to. If the second list is NULL then all
+ * pages are swapped out.
+ *
+ * The function returns after 10 attempts or if no pages
+ * are movable anymore because t has become empty
+ * or no retryable pages exist anymore.
+ *
+ * SIMPLIFIED VERSION: This implementation of migrate_pages
+ * is only swapping out pages and never touches the second
+ * list. The direct migration patchset
+ * extends this function to avoid the use of swap.
+ */
+int migrate_pages(struct list_head *l, struct list_head *t)
+{
+ int retry;
+ LIST_HEAD(failed);
+ int nr_failed = 0;
+ int pass = 0;
+ struct page *page;
+ struct page *page2;
+ int swapwrite = current->flags & PF_SWAPWRITE;
+
+ if (!swapwrite)
+ current->flags |= PF_SWAPWRITE;
+
+redo:
+ retry = 0;
+
+ list_for_each_entry_safe(page, page2, l, lru) {
+ cond_resched();
+
+ /*
+ * Skip locked pages during the first two passes to give the
+ * functions holding the lock time to release the page. Later we use
+ * lock_page to have a higher chance of acquiring the lock.
+ */
+ if (pass > 2)
+ lock_page(page);
+ else
+ if (TestSetPageLocked(page))
+ goto retry_later;
+
+ /*
+ * Only wait on writeback if we have already done a pass where
+ * we we may have triggered writeouts for lots of pages.
+ */
+ if (pass > 0)
+ wait_on_page_writeback(page);
+ else
+ if (PageWriteback(page)) {
+ unlock_page(page);
+ goto retry_later;
+ }
+
+#ifdef CONFIG_SWAP
+ if (PageAnon(page) && !PageSwapCache(page)) {
+ if (!add_to_swap(page)) {
+ unlock_page(page);
+ list_move(&page->lru, &failed);
+ nr_failed++;
+ continue;
+ }
+ }
+#endif /* CONFIG_SWAP */
+
+ /*
+ * Page is properly locked and writeback is complete.
+ * Try to migrate the page.
+ */
+ if (swap_page(page)) {
+retry_later:
+ retry++;
+ }
+ }
+ if (retry && pass++ < 10)
+ goto redo;
+
+ if (!swapwrite)
+ current->flags &= ~PF_SWAPWRITE;
+
+ if (!list_empty(&failed))
+ list_splice(&failed, l);
+
+ return nr_failed + retry;
+}
+
+/*
* zone->lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.
@@ -709,7 +865,7 @@ done:
pagevec_release(&pvec);
}
-static inline void move_to_lru(struct *page)
+static inline void move_to_lru(struct page *page)
{
list_del(&page->lru);
if (PageActive(page)) {
@@ -718,8 +874,8 @@ static inline void move_to_lru(struct *p
* the PG_active bit is off.
*/
ClearPageActive(page);
- lru_cache_add_activE(page);
- } else {
+ lru_cache_add_active(page);
+ } else
lru_cache_add(page);
put_page(page);
}
sys_migrate_pages implementation using swap based page migration
This is the original API proposed by Ray Bryant in his posts during the
first half of 2005 on [email protected] and [email protected].
The intent of sys_migrate_pages is to migrate the memory of a process. A process may
have been migrated to another node and its memory was allocated optimally for the prior
location. sys_migrate_pages allows that memory to be shifted to the new node.
sys_migrate_pages is also useful for manually moving a process's memory when the
process's available memory nodes have changed through cpuset operations. Paul
Jackson is working on an automated mechanism that will allow automatic
migration when the cpuset of a process is changed. However, a user may decide
to control the migration manually.
This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends. The patch also provides
a do_migrate_pages() function that may be useful for cpusets to automatically move
memory. In contrast to Ray's implementation, sys_migrate_pages does not modify policies.
The current code is based on the swap-based page migration capability and is thus
not able to preserve the physical layout relative to the containing nodeset (which
may be a cpuset). When direct page migration becomes available, the
implementation will need to be changed to do an isomorphic move of pages between
the two nodesets. The current implementation simply evicts all pages on nodes
that are in the source nodeset but not in the target nodeset.
The patch supports ia64, i386, x86_64 and ppc64. It has not been tested on ppc64.
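For illustration, a sketch of how a cpuset-side caller (such as the automated path
Paul Jackson is working on) might use do_migrate_pages(); everything except
do_migrate_pages() itself is an assumption made up for the example.

#include <linux/mempolicy.h>
#include <linux/nodemask.h>
#include <linux/sched.h>

/* Hypothetical: move a task's memory after its allowed nodes have changed. */
static void move_task_memory(struct task_struct *task,
			     const nodemask_t *old_nodes,
			     const nodemask_t *new_nodes)
{
	struct mm_struct *mm = get_task_mm(task);

	if (!mm)
		return;
	/*
	 * Pages on nodes that are in *old_nodes but not in *new_nodes are
	 * evicted and will be faulted back in on the allowed nodes later
	 * (swap-based for now).
	 */
	do_migrate_pages(mm, old_nodes, new_nodes, MPOL_MF_MOVE);
	mmput(mm);
}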
Changes V4->V5:
- Follow Paul's suggestion to make parameters const.
Changes V3->V4:
- Add Ray's permissions check based on check_kill_permission().
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.14-rc5-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/mempolicy.c 2005-10-31 18:49:32.000000000 -0800
+++ linux-2.6.14-rc5-mm1/mm/mempolicy.c 2005-10-31 18:50:03.000000000 -0800
@@ -631,11 +631,41 @@ long do_get_mempolicy(int *policy, nodem
}
/*
+ * For now migrate_pages simply swaps out the pages from nodes that are in
+ * the source set but not in the target set. In the future, we would
+ * want a function that moves pages between the two nodesets in such
+ * a way as to preserve the physical layout as much as possible.
+ *
+ * Returns the number of page that could not be moved.
+ */
+int do_migrate_pages(struct mm_struct *mm,
+ const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags)
+{
+ LIST_HEAD(pagelist);
+ int count = 0;
+ nodemask_t nodes;
+
+ nodes_andnot(nodes, *from_nodes, *to_nodes);
+ nodes_complement(nodes, nodes);
+
+ down_read(&mm->mmap_sem);
+ check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nodes,
+ flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+ if (!list_empty(&pagelist)) {
+ migrate_pages(&pagelist, NULL);
+ if (!list_empty(&pagelist))
+ count = putback_lru_pages(&pagelist);
+ }
+ up_read(&mm->mmap_sem);
+ return count;
+}
+
+/*
* User space interface with variable sized bitmaps for nodelists.
*/
/* Copy a node mask from user space. */
-static int get_nodes(nodemask_t *nodes, unsigned long __user *nmask,
+static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
unsigned long maxnode)
{
unsigned long k;
@@ -724,6 +754,67 @@ asmlinkage long sys_set_mempolicy(int mo
return do_set_mempolicy(mode, &nodes);
}
+/* Macro needed until Paul implements this function in kernel/cpusets.c */
+#define cpuset_mems_allowed(task) node_online_map
+
+asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
+ const unsigned long __user *old_nodes,
+ const unsigned long __user *new_nodes)
+{
+ struct mm_struct *mm;
+ struct task_struct *task;
+ nodemask_t old;
+ nodemask_t new;
+ nodemask_t task_nodes;
+ int err;
+
+ err = get_nodes(&old, old_nodes, maxnode);
+ if (err)
+ return err;
+
+ err = get_nodes(&new, new_nodes, maxnode);
+ if (err)
+ return err;
+
+ /* Find the mm_struct */
+ read_lock(&tasklist_lock);
+ task = pid ? find_task_by_pid(pid) : current;
+ if (!task) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ mm = get_task_mm(task);
+ read_unlock(&tasklist_lock);
+
+ if (!mm)
+ return -EINVAL;
+
+ /*
+ * Permissions check like for signals.
+ * See check_kill_permission()
+ */
+ if ((current->euid ^ task->suid) && (current->euid ^ task->uid) &&
+ (current->uid ^ task->suid) && (current->uid ^ task->uid) &&
+ !capable(CAP_SYS_ADMIN)) {
+ err = -EPERM;
+ goto out;
+ }
+
+ task_nodes = cpuset_mems_allowed(task);
+ /* Is the user allowed to access the target nodes? */
+ if (!nodes_subset(new, task_nodes) &&
+ !capable(CAP_SYS_ADMIN)) {
+ err= -EPERM;
+ goto out;
+ }
+
+ err = do_migrate_pages(mm, &old, &new, MPOL_MF_MOVE);
+out:
+ mmput(mm);
+ return err;
+}
+
+
/* Retrieve NUMA policy */
asmlinkage long sys_get_mempolicy(int __user *policy,
unsigned long __user *nmask,
Index: linux-2.6.14-rc5-mm1/kernel/sys_ni.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/kernel/sys_ni.c 2005-10-19 23:23:05.000000000 -0700
+++ linux-2.6.14-rc5-mm1/kernel/sys_ni.c 2005-10-31 18:50:03.000000000 -0800
@@ -82,6 +82,7 @@ cond_syscall(compat_sys_socketcall);
cond_syscall(sys_inotify_init);
cond_syscall(sys_inotify_add_watch);
cond_syscall(sys_inotify_rm_watch);
+cond_syscall(sys_migrate_pages);
/* arch-specific weak syscall entries */
cond_syscall(sys_pciconfig_read);
Index: linux-2.6.14-rc5-mm1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.14-rc5-mm1.orig/arch/ia64/kernel/entry.S 2005-10-19 23:23:05.000000000 -0700
+++ linux-2.6.14-rc5-mm1/arch/ia64/kernel/entry.S 2005-10-31 18:50:03.000000000 -0800
@@ -1600,5 +1600,6 @@ sys_call_table:
data8 sys_inotify_init
data8 sys_inotify_add_watch
data8 sys_inotify_rm_watch
+ data8 sys_migrate_pages
.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
Index: linux-2.6.14-rc5-mm1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/asm-ia64/unistd.h 2005-10-31 14:11:01.000000000 -0800
+++ linux-2.6.14-rc5-mm1/include/asm-ia64/unistd.h 2005-10-31 18:50:03.000000000 -0800
@@ -269,12 +269,12 @@
#define __NR_inotify_init 1277
#define __NR_inotify_add_watch 1278
#define __NR_inotify_rm_watch 1279
-
+#define __NR_migrate_pages 1280
#ifdef __KERNEL__
#include <linux/config.h>
-#define NR_syscalls 256 /* length of syscall table */
+#define NR_syscalls 257 /* length of syscall table */
#define __ARCH_WANT_SYS_RT_SIGACTION
Index: linux-2.6.14-rc5-mm1/arch/ppc64/kernel/misc.S
===================================================================
--- linux-2.6.14-rc5-mm1.orig/arch/ppc64/kernel/misc.S 2005-10-31 14:10:56.000000000 -0800
+++ linux-2.6.14-rc5-mm1/arch/ppc64/kernel/misc.S 2005-10-31 18:50:03.000000000 -0800
@@ -1581,3 +1581,4 @@ _GLOBAL(sys_call_table)
.llong .sys_inotify_init /* 275 */
.llong .sys_inotify_add_watch
.llong .sys_inotify_rm_watch
+ .llong .sys_migrate_pages
Index: linux-2.6.14-rc5-mm1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.14-rc5-mm1.orig/arch/i386/kernel/syscall_table.S 2005-10-19 23:23:05.000000000 -0700
+++ linux-2.6.14-rc5-mm1/arch/i386/kernel/syscall_table.S 2005-10-31 18:50:03.000000000 -0800
@@ -294,3 +294,5 @@ ENTRY(sys_call_table)
.long sys_inotify_init
.long sys_inotify_add_watch
.long sys_inotify_rm_watch
+ .long sys_migrate_pages
+
Index: linux-2.6.14-rc5-mm1/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/asm-x86_64/unistd.h 2005-10-31 14:11:01.000000000 -0800
+++ linux-2.6.14-rc5-mm1/include/asm-x86_64/unistd.h 2005-10-31 18:50:03.000000000 -0800
@@ -571,8 +571,10 @@ __SYSCALL(__NR_inotify_init, sys_inotify
__SYSCALL(__NR_inotify_add_watch, sys_inotify_add_watch)
#define __NR_inotify_rm_watch 255
__SYSCALL(__NR_inotify_rm_watch, sys_inotify_rm_watch)
+#define __NR_migrate_pages 256
+__SYSCALL(__NR_migrate_pages, sys_migrate_pages)
-#define __NR_syscall_max __NR_inotify_rm_watch
+#define __NR_syscall_max __NR_migrate_pages
#ifndef __NO_STUBS
/* user-visible error numbers are in the range -1 - -4095 */
Index: linux-2.6.14-rc5-mm1/include/linux/syscalls.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/syscalls.h 2005-10-31 14:11:01.000000000 -0800
+++ linux-2.6.14-rc5-mm1/include/linux/syscalls.h 2005-10-31 18:50:03.000000000 -0800
@@ -511,5 +511,7 @@ asmlinkage long sys_ioprio_set(int which
asmlinkage long sys_ioprio_get(int which, int who);
asmlinkage long sys_set_mempolicy(int mode, unsigned long __user *nmask,
unsigned long maxnode);
+asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
+ const unsigned long __user *from, const unsigned long __user *to);
#endif
Index: linux-2.6.14-rc5-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/mempolicy.h 2005-10-31 18:47:59.000000000 -0800
+++ linux-2.6.14-rc5-mm1/include/linux/mempolicy.h 2005-10-31 18:50:03.000000000 -0800
@@ -158,6 +158,9 @@ extern void numa_default_policy(void);
extern void numa_policy_init(void);
extern struct mempolicy default_policy;
+int do_migrate_pages(struct mm_struct *mm,
+ const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
+
#else
struct mempolicy {};
Christoph Lameter <[email protected]> wrote:
>
> This is a patchset intended to introduce page migration into the kernel
> through a simple implementation of swap based page migration.
> The aim is to be minimally intrusive in order to have some hopes for inclusion
> into 2.6.15. A separate direct page migration patch is being developed that
> applies on top of this patch. The direct migration patch is being discussed on
> <[email protected]>.
I remain concerned that it hasn't been demonstrated that the infrastructure
which this patch provides will be adequate for all future applications -
especially memory hot-remove.
So I'll queue this up for -mm, but I think we need to see an entire
hot-remove implementation based on this, and have all the interested
parties signed up to it before we can start moving the infrastructure into
mainline.
Do you think the features which these patches add should be Kconfigurable?
Christoph Lameter <[email protected]> wrote:
>
> ...
> Changes V3->V4:
> - Add Ray's permissions check based on check_kill_permission().
>
> ...
> + /*
> + * Permissions check like for signals.
> + * See check_kill_permission()
> + */
> + if ((current->euid ^ task->suid) && (current->euid ^ task->uid) &&
> + (current->uid ^ task->suid) && (current->uid ^ task->uid) &&
> + !capable(CAP_SYS_ADMIN)) {
> + err = -EPERM;
> + goto out;
> + }
Obscure. Can you please explain the thinking behind putting this check in
here? Preferably via a comment...
On Monday 31 October 2005 21:12, Christoph Lameter wrote:
> Page migration support in vmscan.c
This has no #ifdef SWAP:
> + if (PageSwapCache(page)) {
> + swp_entry_t swap = { .val = page_private(page) };
> + add_to_swapped_list(swap.val);
> + __delete_from_swap_cache(page);
> + write_unlock_irq(&mapping->tree_lock);
> + swap_free(swap);
> + __put_page(page); /* The pagecache ref */
> + return 1;
> + }
But what you removed did:
> -#ifdef CONFIG_SWAP
> - if (PageSwapCache(page)) {
> - swp_entry_t swap = { .val = page_private(page) };
> - add_to_swapped_list(swap.val);
> - __delete_from_swap_cache(page);
> - write_unlock_irq(&mapping->tree_lock);
> - swap_free(swap);
> - __put_page(page); /* The pagecache ref */
> - goto free_it;
> - }
> -#endif /* CONFIG_SWAP */
What happens if you build without swap?
Rob
On Monday 31 October 2005 21:25, Andrew Morton wrote:
> So I'll queue this up for -mm, but I think we need to see an entire
> hot-remove implementation based on this, and have all the interested
> parties signed up to it before we can start moving the infrastructure into
> mainline.
>
> Do you think the features which these patches add should be Kconfigurable?
Yes please. At least something under CONFIG_EMBEDDED to save poor Matt the
trouble of chopping it out himself. :)
Rob
On Tue, 2005-11-01 at 02:06 -0600, Rob Landley wrote:
> On Monday 31 October 2005 21:12, Christoph Lameter wrote:
> > Page migration support in vmscan.c
>
> This has no #ifdef SWAP:
>
> > + if (PageSwapCache(page)) {
> > + swp_entry_t swap = { .val = page_private(page) };
> > + add_to_swapped_list(swap.val);
> > + __delete_from_swap_cache(page);
> > + write_unlock_irq(&mapping->tree_lock);
> > + swap_free(swap);
> > + __put_page(page); /* The pagecache ref */
> > + return 1;
> > + }
>
> But what you removed did:
>
> > -#ifdef CONFIG_SWAP
> > - if (PageSwapCache(page)) {
> > - swp_entry_t swap = { .val = page_private(page) };
> > - add_to_swapped_list(swap.val);
> > - __delete_from_swap_cache(page);
> > - write_unlock_irq(&mapping->tree_lock);
> > - swap_free(swap);
> > - __put_page(page); /* The pagecache ref */
> > - goto free_it;
> > - }
> > -#endif /* CONFIG_SWAP */
>
> What happens if you build without swap?
You don't need an explicit #ifdef.
PageSwapCache() has an #ifdef for its declaration which gets it down to
'0'. That should get gcc to completely kill the if(){} block, with no
explicit #ifdef.
-- Dave
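A sketch of the pattern Dave describes, assuming the usual page-flags style of
definition (the exact test is whatever include/linux/page-flags.h uses):

/*
 * With CONFIG_SWAP off the predicate is a compile-time 0, so gcc
 * discards the whole if () block and no explicit #ifdef is needed
 * at the call site.
 */
#ifdef CONFIG_SWAP
#define PageSwapCache(page)	test_bit(PG_swapcache, &(page)->flags)
#else
#define PageSwapCache(page)	0
#endif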
Andrew Morton wrote:
> Christoph Lameter <[email protected]> wrote:
>
>>This is a patchset intended to introduce page migration into the kernel
>> through a simple implementation of swap based page migration.
>> The aim is to be minimally intrusive in order to have some hopes for inclusion
>> into 2.6.15. A separate direct page migration patch is being developed that
>> applies on top of this patch. The direct migration patch is being discussed on
>> <[email protected]>.
>
>
> I remain concerned that it hasn't been demonstrated that the infrastructure
> which this patch provides will be adequate for all future applications -
> especially memory hot-remove.
>
> So I'll queue this up for -mm, but I think we need to see an entire
> hot-remove implementation based on this, and have all the interested
> parties signed up to it before we can start moving the infrastructure into
> mainline.
>
It looks like Christoph didn't use the (direct) memory migration core in this set,
so memory hot-remove will not be affected by this.
At first look, the memory hotplugger will just replace swap_page()
with migrate_onepage().
Comparing swap-based migration and memory hot-remove: hot-remove
has to support a wider range of pages than anon, file-cache and
swap-cache pages (mlock()ed pages, pages used by direct I/O, HugeTLB pages) and achieve
close to a 100% guarantee. Ignoring the fact that migration and hot-remove will
share code, what they have to do is very different.
The swap-based approach looks intended just for process migration,
and in itself it does not look bad.
I think your point is that hot-remove and migration will share some amount of code.
We are now discussing *direct* page migration and will share code for anon pages.
It is being discussed on -lhms. We'd like to create a good one.
Thanks,
-- KAMEZAWA Hiroyuki <[email protected]>
On Tue, 1 Nov 2005, Andrew Morton wrote:
> Christoph Lameter <[email protected]> wrote:
>>
>> ...
>> Changes V3->V4:
>> - Add Ray's permissions check based on check_kill_permission().
>>
>> ...
>> + /*
>> + * Permissions check like for signals.
>> + * See check_kill_permission()
>> + */
>> + if ((current->euid ^ task->suid) && (current->euid ^ task->uid) &&
>> + (current->uid ^ task->suid) && (current->uid ^ task->uid) &&
>> + !capable(CAP_SYS_ADMIN)) {
>> + err = -EPERM;
>> + goto out;
>> + }
>
> Obscure. Can you please explain the thinking behind putting this check in
> here? Preferably via a comment...
Also XOR is not a good substitute for a compare. Except in some
strange corner cases, the code will always take more CPU cycles
because XOR modifies operands while compares don't need to.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.
On Mon, 31 Oct 2005, Andrew Morton wrote:
> So I'll queue this up for -mm, but I think we need to see an entire
> hot-remove implementation based on this, and have all the interested
> parties signed up to it before we can start moving the infrastructure into
> mainline.
There are multiple components involved here. This includes user space,
cpusets and kernel support for various forms of migrating pages.
> Do you think the features which these patches add should be Kconfigurable?
The policy layer stuff is already Kconfigurable via CONFIG_NUMA.
I think the lower layer in vmscan.c could always be included since it's
currently small and it's just a variation on swap. However, if we add the
rest of the features contained in the hotplug patches then the code size becomes
significant and we may want to have a config variable for that.
On Mon, 31 Oct 2005, Andrew Morton wrote:
> Christoph Lameter <[email protected]> wrote:
> > + * Permissions check like for signals.
> > + * See check_kill_permission()
> Obscure. Can you please explain the thinking behind putting this check in
> here? Preferably via a comment...
>
Index: linux-2.6.14-rc5-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/mempolicy.c 2005-11-01 09:32:46.000000000 -0800
+++ linux-2.6.14-rc5-mm1/mm/mempolicy.c 2005-11-01 09:38:46.000000000 -0800
@@ -790,8 +790,15 @@ asmlinkage long sys_migrate_pages(pid_t
return -EINVAL;
/*
- * Permissions check like for signals.
- * See check_kill_permission()
+ * We only allow a process to move the pages of another
+ * if the process issuing sys_migrate has the right to send a kill
+ * signal to the process to be moved. Moving another processes
+ * memory may impact the performance of that process. If the
+ * process issuing sys_migrate_pages has the right to kill the
+ * target process then obviously that process has the right to
+ * impact the performance of the target process.
+ *
+ * The permission check was taken from check_kill_permission()
*/
if ((current->euid ^ task->suid) && (current->euid ^ task->uid) &&
(current->uid ^ task->suid) && (current->uid ^ task->uid) &&
On Tue, 1 Nov 2005, Rob Landley wrote:
> On Monday 31 October 2005 21:25, Andrew Morton wrote:
> > So I'll queue this up for -mm, but I think we need to see an entire
> > hot-remove implementation based on this, and have all the interested
> > parties signed up to it before we can start moving the infrastructure into
> > mainline.
> >
> > Do you think the features which these patches add should be Kconfigurable?
>
> Yes please. At least something under CONFIG_EMBEDDED to save poor Matt the
> trouble of chopping it out himself. :)
Ok. We will think of something to switch this off.
On Tue, 1 Nov 2005, linux-os (Dick Johnson) wrote:
> >> + */
> >> + if ((current->euid ^ task->suid) && (current->euid ^ task->uid) &&
> >> + (current->uid ^ task->suid) && (current->uid ^ task->uid) &&
> >
> > Obscure. Can you please explain the thinking behind putting this check in
> > here? Preferably via a comment...
>
> Also XOR is not a good substitute for a compare. Except in some
> strange corner cases, the code will always take more CPU cycles
> because XOR modifies operands while compares don't need to.
May I submit a patch that removes these strange checks both for
check_kill_permission and sys_migrate_pages?
Hi,
> > > So I'll queue this up for -mm, but I think we need to see an entire
> > > hot-remove implementation based on this, and have all the interested
> > > parties signed up to it before we can start moving the infrastructure into
> > > mainline.
> > >
> > > Do you think the features which these patches add should be Kconfigurable?
This code looks no help for hot-remove. It seems able to handle only
pages easily to migrate, while hot-remove has to guarantee all pages
can be migrated.
> > Yes please. At least something under CONFIG_EMBEDDED to save poor Matt the
> > trouble of chopping it out himself. :)
>
> Ok. We will think of something to switch this off.
Hi Christoph, sorry I've been off from lhms for long time.
Shall I port the generic memory migration code for hot-remove to -mm tree
directly, and add some new interface like migrate_page_to(struct page *from,
struct page *to) so this may probably fit for your purpose.
The code is still in Dave's mhp1 tree waiting for being merged to -mm tree.
The port will be easy because the migration code is independent to the
memory hotplug code. The core code isn't so big.
Thanks,
Hirokazu Takahashi.
On Wed, 2 Nov 2005, Hirokazu Takahashi wrote:
> > > > Do you think the features which these patches add should be Kconfigurable?
>
> This code looks no help for hot-remove. It seems able to handle only
> pages easily to migrate, while hot-remove has to guarantee all pages
> can be migrated.
Right.
> Hi Christoph, sorry I've been off from lhms for long time.
>
> Shall I port the generic memory migration code for hot-remove to -mm tree
> directly, and add some new interface like migrate_page_to(struct page *from,
> struct page *to) so this may probably fit for your purpose.
>
> The code is still in Dave's mhp1 tree waiting for being merged to -mm tree.
> The port will be easy because the migration code is independent to the
> memory hotplug code. The core code isn't so big.
Please follow the discussion on lhms-devel. I am trying to bring these two
things together.
Hi Christoph,
> > > > > Do you think the features which these patches add should be Kconfigurable?
> >
> > This code looks no help for hot-remove. It seems able to handle only
> > pages easily to migrate, while hot-remove has to guarantee all pages
> > can be migrated.
>
> Right.
>
> > Hi Christoph, sorry I've been off from lhms for long time.
> >
> > Shall I port the generic memory migration code for hot-remove to -mm tree
> > directly, and add some new interface like migrate_page_to(struct page *from,
> > struct page *to) so this may probably fit for your purpose.
> >
> > The code is still in Dave's mhp1 tree waiting for being merged to -mm tree.
> > The port will be easy because the migration code is independent to the
> > memory hotplug code. The core code isn't so big.
>
> Please follow the discussion on lhms-devel. I am trying to bring these two
> things together.
OK, I'll look over it.
Thanks.
Hi Christoph,
>> > > > Do you think the features which these patches add should be Kconfigurable?
>>
>> This code looks no help for hot-remove. It seems able to handle only
>> pages easily to migrate, while hot-remove has to guarantee all pages
>> can be migrated.
>
>Right.
>
>> Hi Christoph, sorry I've been off from lhms for long time.
>>
>> Shall I port the generic memory migration code for hot-remove to -mm tree
>> directly, and add some new interface like migrate_page_to(struct page *from,
>> struct page *to) so this may probably fit for your purpose.
>>
>> The code is still in Dave's mhp1 tree waiting for being merged to -mm tree.
>> The port will be easy because the migration code is independent to the
>> memory hotplug code. The core code isn't so big.
>
>Please follow the discussion on lhms-devel. I am trying to bring these two
>things together.
I've read the archive of lhms-devel.
You're going to take in most of the original migration code
except for some tricks to migrate pages which are hard to move.
I think this is what you said the complexity, which you
want to remove forever.
I have to explain that this complexity came from making the code
guarantee to be able to migrate any pages. So the code is designed:
- to migrate heavily accessed pages.
- to migrate pages without backing-store.
- to migrate pages without I/O's.
- to migrate pages of which status may be changed during the migration
correctly.
This have to be implemented if the hotplug memory use it.
It seems to become a reinvention of the wheel to me.
It's easy to add a new interface to the code for memory policy aware
migration. It will be wonderful doing process migration prior to
a planned memory hot-remove. This decision should be made outside the kernel.
Thanks,
Hirokazu Takahashi.
Hi Christoph,
The page migration code has been waiting for something other than memory
hot-remove to use it. I thought it would be memory defragmentation or
process migration.
> >> > > > Do you think the features which these patches add should be Kconfigurable?
> >>
> >> This code looks no help for hot-remove. It seems able to handle only
> >> pages easily to migrate, while hot-remove has to guarantee all pages
> >> can be migrated.
> >
> >Right.
> >
> >> Hi Christoph, sorry I've been off from lhms for long time.
> >>
> >> Shall I port the generic memory migration code for hot-remove to -mm tree
> >> directly, and add some new interface like migrate_page_to(struct page *from,
> >> struct page *to) so this may probably fit for your purpose.
> >>
> >> The code is still in Dave's mhp1 tree waiting for being merged to -mm tree.
> >> The port will be easy because the migration code is independent to the
> >> memory hotplug code. The core code isn't so big.
> >
> >Please follow the discussion on lhms-devel. I am trying to bring these two
> >things together.
>
> I've read the archive of lhms-devel.
> You're going to take in most of the original migration code,
> except for some of the tricks that migrate pages which are hard to move.
> I think this is what you called the complexity, which you
> want to remove for good.
If you don't like that the code is divided into a lot of small pieces,
I can merge the patches into a few larger ones.
> I have to explain that this complexity comes from making the code
> guarantee that it can migrate any page. So the code is designed:
> - to migrate heavily accessed pages.
> - to migrate pages without backing-store.
> - to migrate pages without I/O's.
> - to migrate pages correctly even if their status changes during the migration.
>
> These have to be implemented if memory hotplug is to use it.
> It seems like a reinvention of the wheel to me.
>
> It's easy to add a new interface to the code for memory-policy-aware
> migration. It would be wonderful to do process migration prior to a
> planned memory hot-remove. That decision should be made outside the kernel.
If you really want to skip the complex part, I can easily add a non-wait
mode to the migration code.
Thanks,
Hirokazu Takahashi.
Hirokazu Takahashi wrote:
> Hi Christoph,
> I've read the archive of lhms-devel.
> You're going to take in most of the original migration code,
> except for some of the tricks that migrate pages which are hard to move.
> I think this is what you called the complexity, which you
> want to remove for good.
>
> I have to explain that this complexity comes from making the code
> guarantee that it can migrate any page. So the code is designed:
> - to migrate heavily accessed pages.
> - to migrate pages without backing-store.
> - to migrate pages without I/O's.
> - to migrate pages correctly even if their status changes during the migration.
>
> These have to be implemented if memory hotplug is to use it.
yes.
> It seems like a reinvention of the wheel to me.
>
Christoph, I think you should make clear the advantages of your code
over the -mhp tree's. I think we can easily add migrate_page_to() to the
-mhp tree's code, as Takahashi said.
BTW, could you explain what is done and what is not in your patch set?
-- Kame
On Wed, 2 Nov 2005, Hirokazu Takahashi wrote:
> >Please follow the discussion on lhms-devel. I am trying to bring these two
> >things together.
>
> I've read the archive of lhms-devel.
> You're going to take in most of the original migration code,
> except for some of the tricks that migrate pages which are hard to move.
> I think this is what you called the complexity, which you
> want to remove for good.
No, I just want to bring this in in stages for easier review.
> These have to be implemented if memory hotplug is to use it.
> It seems like a reinvention of the wheel to me.
It's a reorganization of the code in order to get this in.
> It's easy to add a new interface to the code for memory-policy-aware
> migration. It would be wonderful to do process migration prior to a
> planned memory hot-remove. That decision should be made outside the kernel.
There are a couple of things that need to build on top of page migration.
One is hotplug and the other is the remapping of bad memory. Both are
similar. Hotplug is mainly IBM's interest, whereas remapping is also SGI's.
We plan to get the remapping functionality in as soon as possible. For
that we also need to be able to move pages of all sorts.
On Tue, 1 Nov 2005, Rob Landley wrote:
> > Do you think the features which these patches add should be Kconfigurable?
>
> Yes please. At least something under CONFIG_EMBEDDED to save poor Matt the
> trouble of chopping it out himself. :)
I hope this fits the bill?
---
Add CONFIG_MIGRATION for page migration support
Include page migration if the system is NUMA or has a memory model
that allows distinct areas of memory (SPARSEMEM, DISCONTIGMEM).
And:
- Only include lru_add_drain_per_cpu if building for an SMP system.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.14-rc5-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/swap.h 2005-11-02 11:39:01.000000000 -0800
+++ linux-2.6.14-rc5-mm1/include/linux/swap.h 2005-11-02 11:42:00.000000000 -0800
@@ -179,7 +179,9 @@ extern int vm_swappiness;
extern int isolate_lru_page(struct page *p);
extern int putback_lru_pages(struct list_head *l);
+#ifdef CONFIG_MIGRATION
extern int migrate_pages(struct list_head *l, struct list_head *t);
+#endif
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
Index: linux-2.6.14-rc5-mm1/mm/Kconfig
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/Kconfig 2005-10-31 14:10:53.000000000 -0800
+++ linux-2.6.14-rc5-mm1/mm/Kconfig 2005-11-02 11:44:57.000000000 -0800
@@ -132,3 +132,11 @@ config SPLIT_PTLOCK_CPUS
default "4096" if ARM && !CPU_CACHE_VIPT
default "4096" if PARISC && DEBUG_SPINLOCK && !64BIT
default "4"
+
+#
+# support for page migration
+#
+config MIGRATION
+ def_bool y if NUMA || SPARSEMEM || DISCONTIGMEM
+
+
Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-11-02 11:39:01.000000000 -0800
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-11-02 11:54:00.000000000 -0800
@@ -572,6 +572,7 @@ keep:
return reclaimed;
}
+#ifdef CONFIG_MIGRATION
/*
* swapout a single page
* page is locked upon entry, unlocked on exit
@@ -721,6 +722,7 @@ retry_later:
return nr_failed + retry;
}
+#endif
/*
* zone->lru_lock is heavily contended. Some of the functions that
@@ -766,10 +768,12 @@ static int isolate_lru_pages(struct zone
return scanned;
}
+#ifdef CONFIG_SMP
static void lru_add_drain_per_cpu(void *dummy)
{
lru_add_drain();
}
+#endif
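One detail the swap.h hunk leaves open: callers of migrate_pages() now need
their own #ifdef CONFIG_MIGRATION unless the header also provides a stub for
the disabled case. A possible shape of such a stub, assuming an error-return
convention for it (this is not part of the patch above):

#ifdef CONFIG_MIGRATION
extern int migrate_pages(struct list_head *l, struct list_head *t);
#else
/*
 * Hypothetical stub for CONFIG_MIGRATION=n so that callers need no #ifdefs.
 * The exact return convention would have to match the real migrate_pages().
 */
static inline int migrate_pages(struct list_head *l, struct list_head *t)
{
	return -ENOSYS;
}
#endif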
/*
* Isolate one page from the LRU lists and put it on the
On 11/1/05, Christoph Lameter <[email protected]> wrote:
[snip]
> +static inline int
> +__isolate_lru_page(struct zone *zone, struct page *page)
> +{
> + if (TestClearPageLRU(page)) {
> + if (get_page_testone(page)) {
> + /*
> + * It is being freed elsewhere
> + */
> + __put_page(page);
> + SetPageLRU(page);
> + return -ENOENT;
Ok, -ENOENT..
> -static int isolate_lru_pages(int nr_to_scan, struct list_head *src,
> - struct list_head *dst, int *scanned)
> +static int isolate_lru_pages(struct zone *zone, int nr_to_scan,
> + struct list_head *src, struct list_head *dst)
[snip]
> + switch (__isolate_lru_page(zone, page)) {
> + case 1:
> + /* Succeeded to isolate page */
> list_add(&page->lru, dst);
> - nr_taken++;
> + break;
> + case -1:
> + /* Not possible to isolate */
> + list_move(&page->lru, src);
> + break;
> + default:
> + BUG();
Huh, -1?
It looks like the V4 to V5 upgrade added -ENOENT as a return value to
__isolate_lru_page(), but did not change the code in
isolate_lru_pages().
The fix for this is simple, but maybe something else needs to be
changed too or I'm misunderstanding what is happening here.
Andrew, this looks like a showstopper for 2.6.14-mm1.
/ magnus
On Mon, 7 Nov 2005, Magnus Damm wrote:
> It looks like the V4 to V5 upgrade added -ENOENT as return value to
> __isolate_lru_page(), but did not change the code in
> isolate_lru_pages().
Yuck. You are right.
> The fix for this is simple, but maybe something else needs to be
> changed too or I'm misunderstanding what is happening here.
No, the fix should simply be replacing the -1 by -ENOENT.
Fix check in isolate_lru_pages
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-11-02 11:39:01.000000000 -0800
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-11-07 09:50:34.000000000 -0800
@@ -754,7 +754,7 @@ static int isolate_lru_pages(struct zone
/* Succeeded to isolate page */
list_add(&page->lru, dst);
break;
- case -1:
+ case -ENOENT:
/* Not possible to isolate */
list_move(&page->lru, src);
break;
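Pieced together from the hunks quoted in this thread, the loop in
isolate_lru_pages() should then read roughly as follows (a reconstruction, so
the surrounding context is approximate):

	switch (__isolate_lru_page(zone, page)) {
	case 1:
		/* Succeeded to isolate page */
		list_add(&page->lru, dst);
		break;
	case -ENOENT:
		/* Not possible to isolate */
		list_move(&page->lru, src);
		break;
	default:
		BUG();
	}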
Christoph Lameter <[email protected]> wrote:
>
> +static void lru_add_drain_per_cpu(void *dummy)
> +{
> + lru_add_drain();
> +}
> +
> +/*
> + * Isolate one page from the LRU lists and put it on the
> + * indicated list. Do necessary cache draining if the
> + * page is not on the LRU lists yet.
> + *
> + * Result:
> + * 0 = page not on LRU list
> + * 1 = page removed from LRU list and added to the specified list.
> + * -1 = page is being freed elsewhere.
> + */
> +int isolate_lru_page(struct page *page)
> +{
> + int rc = 0;
> + struct zone *zone = page_zone(page);
> +
> +redo:
> + spin_lock_irq(&zone->lru_lock);
> + rc = __isolate_lru_page(zone, page);
> + spin_unlock_irq(&zone->lru_lock);
> + if (rc == 0) {
> + /*
> + * Maybe this page is still waiting for a cpu to drain it
> + * from one of the lru lists?
> + */
> + smp_call_function(&lru_add_drain_per_cpu, NULL, 0 , 1);
lru_add_drain() ends up doing spin_unlock_irq(), so we'll enable interrupts
within the smp_call_function() handler. Is that legal on all
architectures?
On Mon, 14 Nov 2005, Andrew Morton wrote:
> > +int isolate_lru_page(struct page *page)
> > +{
> > + int rc = 0;
> > + struct zone *zone = page_zone(page);
> > +
> > +redo:
> > + spin_lock_irq(&zone->lru_lock);
> > + rc = __isolate_lru_page(zone, page);
> > + spin_unlock_irq(&zone->lru_lock);
> > + if (rc == 0) {
> > + /*
> > + * Maybe this page is still waiting for a cpu to drain it
> > + * from one of the lru lists?
> > + */
> > + smp_call_function(&lru_add_drain_per_cpu, NULL, 0 , 1);
>
> lru_add_drain() ends up doing spin_unlock_irq(), so we'll enable interrupts
> within the smp_call_function() handler. Is that legal on all
> architectures?
isolate_lru_pages() is only called within a process context in the swap
migration patches. The hotplug folks may have to address this if they want
to isolate pages from interrupts etc.
Christoph Lameter <[email protected]> wrote:
>
> On Mon, 14 Nov 2005, Andrew Morton wrote:
>
> > > +int isolate_lru_page(struct page *page)
> > > +{
> > > + int rc = 0;
> > > + struct zone *zone = page_zone(page);
> > > +
> > > +redo:
> > > + spin_lock_irq(&zone->lru_lock);
> > > + rc = __isolate_lru_page(zone, page);
> > > + spin_unlock_irq(&zone->lru_lock);
> > > + if (rc == 0) {
> > > + /*
> > > + * Maybe this page is still waiting for a cpu to drain it
> > > + * from one of the lru lists?
> > > + */
> > > + smp_call_function(&lru_add_drain_per_cpu, NULL, 0 , 1);
> >
> > lru_add_drain() ends up doing spin_unlock_irq(), so we'll enable interrupts
> > within the smp_call_function() handler. Is that legal on all
> > architectures?
>
> isolate_lru_pages() is only called within a process context in the swap
> migration patches. The hotplug folks may have to address this if they want
> to isolate pages from interrupts etc.
But lru_add_drain_per_cpu() will be called from interrupt context: the IPI
handler.
I'm asking whether it is safe for the IPI handler to re-enable interrupts on
all architectures. It might be so, but I don't recall ever having seen it
discussed, nor have I seen code which does it.
On Tue, 2005-11-15 at 08:38 -0800, Christoph Lameter wrote:
> On Mon, 14 Nov 2005, Andrew Morton wrote:
>
> > > +int isolate_lru_page(struct page *page)
> > > +{
> > > + int rc = 0;
> > > + struct zone *zone = page_zone(page);
> > > +
> > > +redo:
> > > + spin_lock_irq(&zone->lru_lock);
> > > + rc = __isolate_lru_page(zone, page);
> > > + spin_unlock_irq(&zone->lru_lock);
> > > + if (rc == 0) {
> > > + /*
> > > + * Maybe this page is still waiting for a cpu to drain it
> > > + * from one of the lru lists?
> > > + */
> > > + smp_call_function(&lru_add_drain_per_cpu, NULL, 0 , 1);
> >
> > lru_add_drain() ends up doing spin_unlock_irq(), so we'll enable interrupts
> > within the smp_call_function() handler. Is that legal on all
> > architectures?
>
> isolate_lru_pages() is only called within a process context in the swap
> migration patches. The hotplug folks may have to address this if they want
> to isolate pages from interrupts etc.
>
I believe Andrew is referring to the calls from the interprocessor
interrupt handlers triggered by the smp_call_function(). Looks like
ia64 runs IPI handlers with interrupts enabled [SA_INTERRUPT], so should
be OK there, but maybe not for all archs?
Lee
On Tue, 15 Nov 2005, Andrew Morton wrote:
> But lru_add_drain_per_cpu() will be called from interrupt context: the IPI
> handler.
Ahh.. thought you meant the lru_add_drain run on the local processor.
> I'm asking whether it is safe for the IPI handler to re-enable interrupts on
> all architectures. It might be so, but I don't recall ever having seen it
> discussed, nor have I seen code which does it.
smp_call_function is also used by the slab allocator to drain the
pages. All the spinlocks in there and those of the page allocator (called
for freeing pages) use spin_lock_irqsave. Why is this not used for
lru_add_drain() and friends?
Maybe we need to start a new thread so that others see it?
Christoph Lameter <[email protected]> wrote:
>
> On Tue, 15 Nov 2005, Andrew Morton wrote:
>
> > But lru_add_drain_per_cpu() will be called from interrupt context: the IPI
> > handler.
>
> Ahh.. thought you meant the lru_add_drain run on the local processor.
>
> > I'm asking whether it is safe for the IPI handler to re-enable interrupts on
> > all architectures. It might be so, but I don't recall ever having seen it
> > discussed, nor have I seen code which does it.
>
> smp_call_function is also used by the slab allocator to drain the
> pages. All the spinlocks in there and those of the page allocator (called
> for freeing pages) use spin_lock_irqsave. Why is this not used for
> lru_add_drain() and friends?
It's a microoptimisation - lru_add_drain() is always called with local irqs
enabled, so no need for irqsave.
I don't think spin_lock_irqsave() is notably more expensive than
spin_lock_irq() - the cost is in the irq disabling and in the atomic
operation.
> Maybe we need to start a new thread so that others see it?
Spose so. If we cannot convince ourselves that local_irq_enable() in an
ipi handler is safe, we need to convert any called functions to use
irqsave.
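A minimal sketch of that conversion, using the zone LRU lock as an example.
The function names here are made up and the real drain path goes through the
pagevec code; the point is only the locking difference: irqsave/irqrestore
preserves whatever interrupt state the caller, possibly an IPI handler, was
running with.

#include <linux/mmzone.h>
#include <linux/spinlock.h>

static void drain_one_zone_irq(struct zone *zone)
{
	/* Current style: unconditionally re-enables interrupts on unlock. */
	spin_lock_irq(&zone->lru_lock);
	/* ... move pages from the per-cpu pagevec onto the LRU ... */
	spin_unlock_irq(&zone->lru_lock);
}

static void drain_one_zone_irqsave(struct zone *zone)
{
	unsigned long flags;

	/*
	 * irqsave variant: restores the caller's interrupt state, so it
	 * stays correct even when invoked with interrupts disabled.
	 */
	spin_lock_irqsave(&zone->lru_lock, flags);
	/* ... move pages from the per-cpu pagevec onto the LRU ... */
	spin_unlock_irqrestore(&zone->lru_lock, flags);
}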
On Tue, Nov 15, 2005 at 10:46:05AM -0800, Andrew Morton wrote:
> Christoph Lameter <[email protected]> wrote:
> >
> > On Tue, 15 Nov 2005, Andrew Morton wrote:
> >
> > > But lru_add_drain_per_cpu() will be called from interrupt context: the IPI
> > > handler.
> >
> > Ahh.. thought you meant the lru_add_drain run on the local processor.
> >
> > > I'm asking whether it is safe for the IPI handler to re-enable interrupts on
> > > all architectures. It might be so, but I don't recall ever having seen it
> > > discussed, nor have I seen code which does it.
> >
> > smp_call_function is also used by the slab allocator to drain the
> > pages. All the spinlocks in there and those of the page allocator (called
> > for freeing pages) use spin_lock_irqsave. Why is this not used for
> > lru_add_drain() and friends?
>
> It's a microoptimisation - lru_add_drain() is always called with local irqs
> enabled, so no need for irqsave.
>
> I don't think spin_lock_irqsave() is notably more expensive than
> spin_lock_irq() - the cost is in the irq disabling and in the atomic
> operation.
Pardon me, but spin_lock_irqsave() needs to write data to the stack,
which is likely to be cache-cold, so you have to fault the cacheline in
from slow memory.
And that's much slower than the atomic operation, isn't it?