Hello,
This patch avoids the allocation of rmap for shared memory; it uses
the objrmap framework to find the mapping ptes starting from a
page_t, which is zero memory cost (and zero cpu cost for the fast paths).
The patch applies cleanly to linux-2.5 CVS. I suggest it for merging into
mainline.
Without this patch not even the 4:4 tlb overhead would allow intensive
shm (shmfs+IPC) workloads to survive on 32bit archs. Basically without
this fix it's like 2.6 is running w/o pte-highmem. 700 tasks with 2.7G
of shm mapped each would run the box out of zone-normal even with 4:4.
With 3:1 100 tasks would be enough. Math is easy:
2.7*1024*1024*1024/4096*8*100/1024/1024/1024
2.7*1024*1024*1024/4096*8*700/1024/1024/1024
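(For reference, these evaluate to roughly 0.53G for 100 tasks and roughly
3.7G for 700 tasks of per-pte rmap overhead that has to sit in zone-normal,
taking the *8 above as ~8 bytes of rmap overhead per mapped pte.)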
But the real reason for this work is huge 64bit archs, so we speed up
and avoid wasting tons of ram. On 32-ways the scalability is hurt
very badly by rmap, so it has to be removed (Martin can provide the
numbers, I think).
Even with this fix removing rmap for the file mappings, anonymous
memory will still pay for the rmap slowdown (still very relevant for
various critical apps), so I just finished designing a new method for
unmapping ptes of anonymous mappings too. It's not Hugh's anobjrmap
patch, because (despite being very useful to get the right mindset) its
design was flawed: it was tracking mms, not vmas, and it used page->index
as an absolute address, not an offset, so it broke with mremap
(forcing him to reinstantiate rmap during mremap in the anobjrmap-5
patch), and it had several other implementation issues. But all my
further work will be done against the below objrmap-core. The below patch
just fixes the most serious bottlenecks, so I recommend it for
inclusion; the rest of the work, for anonymous memory and nonlinear
vmas, is orthogonal to this.
Credit for this patch goes entirely to Dave McCracken (the original idea
of using the i_mmap lists for the vm instead of only using them for
truncate is, as usual, from David Miller); I only fixed two bugs in his
version before submitting it to you.
I speculate that because of rmap some people have been forced to use 4:4,
generating >30% slowdowns in critical common server linux workloads even
on boxes with as little as 8G of ram.
I'm very convinced that it would be a huge mistake to force the
userbase with <=16G of ram into the 4:4 slowdown, but to avoid that we
have to drop rmap.
As part of my current anon_vma_chain vm work I'm also shrinking the
page_t to 40 bytes, and eventually it will be 32 bytes with further
patches. That, combined with the use of remap_file_pages (avoiding tons
of vmas) and the bio work no longer requiring a flood of bhs (more
powerful than the 2.4 varyio), should further reduce the normal-zone
needs of high end workloads, allowing at least 16G boxes to run
perfectly fine with the 3:1 design. Today with 2.4 we already run
huge shm workloads on 16G boxes with plenty of zone-normal margin in
production; even 32G seems to work fine (though the margin is not huge
there). With 2.6 I expect to raise the margin significantly (for
safety) on 32G boxes too with the most efficient 3:1 kernel split. Only
64G boxes will require either 2.5:1.5 or 4:4, and I think it's ok to
use either 4:4 or 2.5:1.5 there since they're less than 1% of the
userbase, and with AMD64 hitting the market already I doubt the x86 64G
userbase will increase anytime soon.
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/fs/exec.c sles-objrmap/fs/exec.c
--- sles-ref/fs/exec.c 2004-02-29 17:47:21.000000000 +0100
+++ sles-objrmap/fs/exec.c 2004-03-03 06:45:38.716636864 +0100
@@ -323,6 +323,7 @@ void put_dirty_page(struct task_struct *
}
lru_cache_add_active(page);
flush_dcache_page(page);
+ SetPageAnon(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(page, prot))));
pte_chain = page_add_rmap(page, pte, pte_chain);
pte_unmap(pte);
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/include/linux/mm.h sles-objrmap/include/linux/mm.h
--- sles-ref/include/linux/mm.h 2004-02-29 17:47:30.000000000 +0100
+++ sles-objrmap/include/linux/mm.h 2004-03-03 06:45:38.000000000 +0100
@@ -180,6 +180,7 @@ struct page {
struct pte_chain *chain;/* Reverse pte mapping pointer.
* protected by PG_chainlock */
pte_addr_t direct;
+ int mapcount;
} pte;
unsigned long private; /* mapping-private opaque data */
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/include/linux/page-flags.h sles-objrmap/include/linux/page-flags.h
--- sles-ref/include/linux/page-flags.h 2004-01-15 18:36:24.000000000 +0100
+++ sles-objrmap/include/linux/page-flags.h 2004-03-03 06:45:38.808622880 +0100
@@ -75,6 +75,7 @@
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
+#define PG_anon 20 /* Anonymous page */
/*
@@ -270,6 +271,10 @@ extern void get_full_page_state(struct p
#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)
+#define PageAnon(page) test_bit(PG_anon, &(page)->flags)
+#define SetPageAnon(page) set_bit(PG_anon, &(page)->flags)
+#define ClearPageAnon(page) clear_bit(PG_anon, &(page)->flags)
+
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
* but it may again do so one day.
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/include/linux/swap.h sles-objrmap/include/linux/swap.h
--- sles-ref/include/linux/swap.h 2004-02-04 16:07:05.000000000 +0100
+++ sles-objrmap/include/linux/swap.h 2004-03-03 06:45:38.830619536 +0100
@@ -185,6 +185,8 @@ struct pte_chain *FASTCALL(page_add_rmap
void FASTCALL(page_remove_rmap(struct page *, pte_t *));
int FASTCALL(try_to_unmap(struct page *));
+int page_convert_anon(struct page *);
+
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
#else
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/filemap.c sles-objrmap/mm/filemap.c
--- sles-ref/mm/filemap.c 2004-02-29 17:47:33.000000000 +0100
+++ sles-objrmap/mm/filemap.c 2004-03-03 06:45:38.915606616 +0100
@@ -73,6 +73,9 @@
* ->mmap_sem
* ->i_sem (msync)
*
+ * ->lock_page
+ * ->i_shared_sem (page_convert_anon)
+ *
* ->inode_lock
* ->sb_lock (fs/fs-writeback.c)
* ->mapping->page_lock (__sync_single_inode)
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/fremap.c sles-objrmap/mm/fremap.c
--- sles-ref/mm/fremap.c 2004-02-29 17:47:26.000000000 +0100
+++ sles-objrmap/mm/fremap.c 2004-03-03 06:45:38.936603424 +0100
@@ -61,10 +61,26 @@ int install_page(struct mm_struct *mm, s
pmd_t *pmd;
pte_t pte_val;
struct pte_chain *pte_chain;
+ unsigned long pgidx;
pte_chain = pte_chain_alloc(GFP_KERNEL);
if (!pte_chain)
goto err;
+
+ /*
+ * Convert this page to anon for objrmap if it's nonlinear
+ */
+ pgidx = (addr - vma->vm_start) >> PAGE_SHIFT;
+ pgidx += vma->vm_pgoff;
+ pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
+ if (!PageAnon(page) && (page->index != pgidx)) {
+ lock_page(page);
+ err = page_convert_anon(page);
+ unlock_page(page);
+ if (err < 0)
+ goto err_free;
+ }
+
pgd = pgd_offset(mm, addr);
spin_lock(&mm->page_table_lock);
@@ -85,12 +101,11 @@ int install_page(struct mm_struct *mm, s
pte_val = *pte;
pte_unmap(pte);
update_mmu_cache(vma, addr, pte_val);
- spin_unlock(&mm->page_table_lock);
- pte_chain_free(pte_chain);
- return 0;
+ err = 0;
err_unlock:
spin_unlock(&mm->page_table_lock);
+err_free:
pte_chain_free(pte_chain);
err:
return err;
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/memory.c sles-objrmap/mm/memory.c
--- sles-ref/mm/memory.c 2004-02-29 17:47:33.000000000 +0100
+++ sles-objrmap/mm/memory.c 2004-03-03 06:45:38.965599016 +0100
@@ -1071,6 +1071,7 @@ static int do_wp_page(struct mm_struct *
++mm->rss;
page_remove_rmap(old_page, page_table);
break_cow(vma, new_page, address, page_table);
+ SetPageAnon(new_page);
pte_chain = page_add_rmap(new_page, page_table, pte_chain);
lru_cache_add_active(new_page);
@@ -1310,6 +1311,7 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte(page_table, pte);
+ SetPageAnon(page);
pte_chain = page_add_rmap(page, page_table, pte_chain);
/* No need to invalidate - it was non-present before */
@@ -1377,6 +1379,7 @@ do_anonymous_page(struct mm_struct *mm,
vma);
lru_cache_add_active(page);
mark_page_accessed(page);
+ SetPageAnon(page);
}
set_pte(page_table, entry);
@@ -1444,6 +1447,10 @@ retry:
if (!pte_chain)
goto oom;
+ /* See if nopage returned an anon page */
+ if (!new_page->mapping || PageSwapCache(new_page))
+ SetPageAnon(new_page);
+
/*
* Should we do an early C-O-W break?
*/
@@ -1454,6 +1461,7 @@ retry:
copy_user_highpage(page, new_page, address);
page_cache_release(new_page);
lru_cache_add_active(page);
+ SetPageAnon(page);
new_page = page;
}
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/mmap.c sles-objrmap/mm/mmap.c
--- sles-ref/mm/mmap.c 2004-02-29 17:47:30.000000000 +0100
+++ sles-objrmap/mm/mmap.c 2004-03-03 06:53:46.000000000 +0100
@@ -267,9 +267,7 @@ static void vma_link(struct mm_struct *m
if (mapping)
down(&mapping->i_shared_sem);
- spin_lock(&mm->page_table_lock);
__vma_link(mm, vma, prev, rb_link, rb_parent);
- spin_unlock(&mm->page_table_lock);
if (mapping)
up(&mapping->i_shared_sem);
@@ -318,6 +316,22 @@ static inline int is_mergeable_vma(struc
return 1;
}
+/* requires that the relevant i_shared_sem be held by the caller */
+static void move_vma_start(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct inode *inode = NULL;
+
+ if (vma->vm_file)
+ inode = vma->vm_file->f_dentry->d_inode;
+ if (inode)
+ __remove_shared_vm_struct(vma, inode);
+ /* If no vm_file, perhaps we should always keep vm_pgoff at 0?? */
+ vma->vm_pgoff += (long)(addr - vma->vm_start) >> PAGE_SHIFT;
+ vma->vm_start = addr;
+ if (inode)
+ __vma_link_file(vma);
+}
+
/*
* Return true if we can merge this (vm_flags,file,vm_pgoff,size)
* in front of (at a lower virtual address and file offset than) the vma.
@@ -370,7 +384,6 @@ static int vma_merge(struct mm_struct *m
unsigned long end, unsigned long vm_flags,
struct file *file, unsigned long pgoff)
{
- spinlock_t *lock = &mm->page_table_lock;
struct inode *inode = file ? file->f_dentry->d_inode : NULL;
struct semaphore *i_shared_sem;
@@ -402,7 +415,6 @@ static int vma_merge(struct mm_struct *m
down(i_shared_sem);
need_up = 1;
}
- spin_lock(lock);
prev->vm_end = end;
/*
@@ -415,7 +427,6 @@ static int vma_merge(struct mm_struct *m
prev->vm_end = next->vm_end;
__vma_unlink(mm, next, prev);
__remove_shared_vm_struct(next, inode);
- spin_unlock(lock);
if (need_up)
up(i_shared_sem);
if (file)
@@ -425,7 +436,6 @@ static int vma_merge(struct mm_struct *m
kmem_cache_free(vm_area_cachep, next);
return 1;
}
- spin_unlock(lock);
if (need_up)
up(i_shared_sem);
return 1;
@@ -443,10 +453,7 @@ static int vma_merge(struct mm_struct *m
if (end == prev->vm_start) {
if (file)
down(i_shared_sem);
- spin_lock(lock);
- prev->vm_start = addr;
- prev->vm_pgoff -= (end - addr) >> PAGE_SHIFT;
- spin_unlock(lock);
+ move_vma_start(prev, addr);
if (file)
up(i_shared_sem);
return 1;
@@ -905,19 +912,16 @@ int expand_stack(struct vm_area_struct *
*/
address += 4 + PAGE_SIZE - 1;
address &= PAGE_MASK;
- spin_lock(&vma->vm_mm->page_table_lock);
grow = (address - vma->vm_end) >> PAGE_SHIFT;
/* Overcommit.. */
if (security_vm_enough_memory(grow)) {
- spin_unlock(&vma->vm_mm->page_table_lock);
return -ENOMEM;
}
if (address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur ||
((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
current->rlim[RLIMIT_AS].rlim_cur) {
- spin_unlock(&vma->vm_mm->page_table_lock);
vm_unacct_memory(grow);
return -ENOMEM;
}
@@ -925,7 +929,6 @@ int expand_stack(struct vm_area_struct *
vma->vm_mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
- spin_unlock(&vma->vm_mm->page_table_lock);
return 0;
}
@@ -959,19 +962,16 @@ int expand_stack(struct vm_area_struct *
* the spinlock only before relocating the vma range ourself.
*/
address &= PAGE_MASK;
- spin_lock(&vma->vm_mm->page_table_lock);
grow = (vma->vm_start - address) >> PAGE_SHIFT;
/* Overcommit.. */
if (security_vm_enough_memory(grow)) {
- spin_unlock(&vma->vm_mm->page_table_lock);
return -ENOMEM;
}
if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
current->rlim[RLIMIT_AS].rlim_cur) {
- spin_unlock(&vma->vm_mm->page_table_lock);
vm_unacct_memory(grow);
return -ENOMEM;
}
@@ -980,7 +980,6 @@ int expand_stack(struct vm_area_struct *
vma->vm_mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
- spin_unlock(&vma->vm_mm->page_table_lock);
return 0;
}
@@ -1147,8 +1146,6 @@ static void unmap_region(struct mm_struc
/*
* Create a list of vma's touched by the unmap, removing them from the mm's
* vma list as we go..
- *
- * Called with the page_table_lock held.
*/
static void
detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1211,10 +1208,9 @@ int split_vma(struct mm_struct * mm, str
down(&mapping->i_shared_sem);
spin_lock(&mm->page_table_lock);
- if (new_below) {
- vma->vm_start = addr;
- vma->vm_pgoff += ((addr - new->vm_start) >> PAGE_SHIFT);
- } else
+ if (new_below)
+ move_vma_start(vma, addr);
+ else
vma->vm_end = addr;
__insert_vm_struct(mm, new);
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/page_alloc.c sles-objrmap/mm/page_alloc.c
--- sles-ref/mm/page_alloc.c 2004-02-29 17:47:36.000000000 +0100
+++ sles-objrmap/mm/page_alloc.c 2004-03-03 06:45:38.992594912 +0100
@@ -230,6 +230,8 @@ static inline void free_pages_check(cons
bad_page(function, page);
if (PageDirty(page))
ClearPageDirty(page);
+ if (PageAnon(page))
+ ClearPageAnon(page);
}
/*
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/rmap.c sles-objrmap/mm/rmap.c
--- sles-ref/mm/rmap.c 2004-02-29 17:47:33.000000000 +0100
+++ sles-objrmap/mm/rmap.c 2004-03-03 07:01:39.200621104 +0100
@@ -102,6 +102,136 @@ pte_chain_encode(struct pte_chain *pte_c
**/
/**
+ * find_pte - Find a pte pointer given a vma and a struct page.
+ * @vma: the vma to search
+ * @page: the page to find
+ *
+ * Determine if this page is mapped in this vma. If it is, map and return
+ * the pte pointer associated with it. Return null if the page is not
+ * mapped in this vma for any reason.
+ *
+ * This is strictly an internal helper function for the object-based rmap
+ * functions.
+ *
+ * It is the caller's responsibility to unmap the pte if it is returned.
+ */
+static inline pte_t *
+find_pte(struct vm_area_struct *vma, struct page *page, unsigned long *addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pte_t *pte;
+ unsigned long loffset;
+ unsigned long address;
+
+ loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
+ address = vma->vm_start + ((loffset - vma->vm_pgoff) << PAGE_SHIFT);
+ if (address < vma->vm_start || address >= vma->vm_end)
+ goto out;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pmd = pmd_offset(pgd, address);
+ if (!pmd_present(*pmd))
+ goto out;
+
+ pte = pte_offset_map(pmd, address);
+ if (!pte_present(*pte))
+ goto out_unmap;
+
+ if (page_to_pfn(page) != pte_pfn(*pte))
+ goto out_unmap;
+
+ if (addr)
+ *addr = address;
+
+ return pte;
+
+out_unmap:
+ pte_unmap(pte);
+out:
+ return NULL;
+}
+
+/**
+ * page_referenced_obj_one - referenced check for object-based rmap
+ * @vma: the vma to look in.
+ * @page: the page we're working on.
+ *
+ * Find a pte entry for a page/vma pair, then check and clear the referenced
+ * bit.
+ *
+ * This is strictly a helper function for page_referenced_obj.
+ */
+static int
+page_referenced_obj_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t *pte;
+ int referenced = 0;
+
+ if (!spin_trylock(&mm->page_table_lock))
+ return 1;
+
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ if (ptep_test_and_clear_young(pte))
+ referenced++;
+ pte_unmap(pte);
+ }
+
+ spin_unlock(&mm->page_table_lock);
+ return referenced;
+}
+
+/**
+ * page_referenced_obj - referenced check for object-based rmap
+ * @page: the page we're checking references on.
+ *
+ * For an object-based mapped page, find all the places it is mapped and
+ * check/clear the referenced flag. This is done by following the page->mapping
+ * pointer, then walking the chain of vmas it holds. It returns the number
+ * of references it found.
+ *
+ * This function is only called from page_referenced for object-based pages.
+ *
+ * The semaphore address_space->i_shared_sem is tried. If it can't be gotten,
+ * assume a reference count of 1.
+ */
+static int
+page_referenced_obj(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct vm_area_struct *vma;
+ int referenced = 0;
+
+ if (!page->pte.mapcount)
+ return 0;
+
+ if (!mapping)
+ BUG();
+
+ if (PageSwapCache(page))
+ BUG();
+
+ if (down_trylock(&mapping->i_shared_sem))
+ return 1;
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared)
+ referenced += page_referenced_obj_one(vma, page);
+
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared)
+ referenced += page_referenced_obj_one(vma, page);
+
+ up(&mapping->i_shared_sem);
+
+ return referenced;
+}
+
+/**
* page_referenced - test if the page was referenced
* @page: the page to test
*
@@ -123,6 +253,10 @@ int fastcall page_referenced(struct page
if (TestClearPageReferenced(page))
referenced++;
+ if (!PageAnon(page)) {
+ referenced += page_referenced_obj(page);
+ goto out;
+ }
if (PageDirect(page)) {
pte_t *pte = rmap_ptep_map(page->pte.direct);
if (ptep_test_and_clear_young(pte))
@@ -154,6 +288,7 @@ int fastcall page_referenced(struct page
__pte_chain_free(pc);
}
}
+out:
return referenced;
}
@@ -176,6 +311,21 @@ page_add_rmap(struct page *page, pte_t *
pte_chain_lock(page);
+ /*
+ * If this is an object-based page, just count it. We can
+ * find the mappings by walking the object vma chain for that object.
+ */
+ if (!PageAnon(page)) {
+ if (!page->mapping)
+ BUG();
+ if (PageSwapCache(page))
+ BUG();
+ if (!page->pte.mapcount)
+ inc_page_state(nr_mapped);
+ page->pte.mapcount++;
+ goto out;
+ }
+
if (page->pte.direct == 0) {
page->pte.direct = pte_paddr;
SetPageDirect(page);
@@ -232,8 +382,25 @@ void fastcall page_remove_rmap(struct pa
pte_chain_lock(page);
if (!page_mapped(page))
- goto out_unlock; /* remap_page_range() from a driver? */
+ goto out_unlock;
+ /*
+ * If this is an object-based page, just uncount it. We can
+ * find the mappings by walking the object vma chain for that object.
+ */
+ if (!PageAnon(page)) {
+ if (!page->mapping)
+ BUG();
+ if (PageSwapCache(page))
+ BUG();
+ if (!page->pte.mapcount)
+ BUG();
+ page->pte.mapcount--;
+ if (!page->pte.mapcount)
+ dec_page_state(nr_mapped);
+ goto out_unlock;
+ }
+
if (PageDirect(page)) {
if (page->pte.direct == pte_paddr) {
page->pte.direct = 0;
@@ -280,6 +447,102 @@ out_unlock:
}
/**
+ * try_to_unmap_obj_one - unmap a page from a single vma
+ * @page: the page to unmap
+ *
+ * Determine whether a page is mapped in a given vma and unmap it if it's found.
+ *
+ * This function is strictly a helper function for try_to_unmap_obj.
+ */
+static inline int
+try_to_unmap_obj_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+ pte_t *pte;
+ pte_t pteval;
+ int ret = SWAP_AGAIN;
+
+ if (!spin_trylock(&mm->page_table_lock))
+ return ret;
+
+ pte = find_pte(vma, page, &address);
+ if (!pte)
+ goto out;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+
+ flush_cache_page(vma, address);
+ pteval = ptep_get_and_clear(pte);
+ flush_tlb_page(vma, address);
+
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
+
+ if (!page->pte.mapcount)
+ BUG();
+
+ mm->rss--;
+ page->pte.mapcount--;
+ page_cache_release(page);
+
+out_unmap:
+ pte_unmap(pte);
+
+out:
+ spin_unlock(&mm->page_table_lock);
+ return ret;
+}
+
+/**
+ * try_to_unmap_obj - unmap a page using the object-based rmap method
+ * @page: the page to unmap
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the address_space struct it points to.
+ *
+ * This function is only called from try_to_unmap for object-based pages.
+ *
+ * The semaphore address_space->i_shared_sem is tried. If it can't be gotten,
+ * return a temporary error.
+ */
+static int
+try_to_unmap_obj(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct vm_area_struct *vma;
+ int ret = SWAP_AGAIN;
+
+ if (!mapping)
+ BUG();
+
+ if (PageSwapCache(page))
+ BUG();
+
+ if (down_trylock(&mapping->i_shared_sem))
+ return ret;
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ ret = try_to_unmap_obj_one(vma, page);
+ if (ret == SWAP_FAIL || !page->pte.mapcount)
+ goto out;
+ }
+
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ ret = try_to_unmap_obj_one(vma, page);
+ if (ret == SWAP_FAIL || !page->pte.mapcount)
+ goto out;
+ }
+
+out:
+ up(&mapping->i_shared_sem);
+ return ret;
+}
+
+/**
* try_to_unmap_one - worker function for try_to_unmap
* @page: page to unmap
* @ptep: page table entry to unmap from page
@@ -397,6 +660,15 @@ int fastcall try_to_unmap(struct page *
if (!page->mapping)
BUG();
+ /*
+ * If it's an object-based page, use the object vma chain to find all
+ * the mappings.
+ */
+ if (!PageAnon(page)) {
+ ret = try_to_unmap_obj(page);
+ goto out;
+ }
+
if (PageDirect(page)) {
ret = try_to_unmap_one(page, page->pte.direct);
if (ret == SWAP_SUCCESS) {
@@ -453,12 +725,115 @@ int fastcall try_to_unmap(struct page *
}
}
out:
- if (!page_mapped(page))
+ if (!page_mapped(page)) {
dec_page_state(nr_mapped);
+ ret = SWAP_SUCCESS;
+ }
return ret;
}
/**
+ * page_convert_anon - Convert an object-based mapped page to pte_chain-based.
+ * @page: the page to convert
+ *
+ * Find all the mappings for an object-based page and convert them
+ * to 'anonymous', ie create a pte_chain and store all the pte pointers there.
+ *
+ * This function takes the address_space->i_shared_sem, sets the PageAnon flag,
+ * then sets the mm->page_table_lock for each vma and calls page_add_rmap. This
+ * means there is a period when PageAnon is set, but still has some mappings
+ * with no pte_chain entry. This is in fact safe, since page_remove_rmap will
+ * simply not find it. try_to_unmap might erroneously return success, but it
+ * will never be called because the page_convert_anon() caller has locked the
+ * page.
+ *
+ * page_referenced() may fail to scan all the appropriate pte's and may return
+ * an inaccurate result. This is so rare that it does not matter.
+ */
+int page_convert_anon(struct page *page)
+{
+ struct address_space *mapping;
+ struct vm_area_struct *vma;
+ struct pte_chain *pte_chain = NULL;
+ pte_t *pte;
+ int err = 0;
+
+ mapping = page->mapping;
+ if (mapping == NULL)
+ goto out; /* truncate won the lock_page() race */
+
+ down(&mapping->i_shared_sem);
+ pte_chain_lock(page);
+
+ /*
+ * Has someone else done it for us before we got the lock?
+ * If so, pte.direct or pte.chain has replaced pte.mapcount.
+ */
+ if (PageAnon(page)) {
+ pte_chain_unlock(page);
+ goto out_unlock;
+ }
+
+ SetPageAnon(page);
+ if (page->pte.mapcount == 0) {
+ pte_chain_unlock(page);
+ goto out_unlock;
+ }
+ /* This is gonna get incremented by page_add_rmap */
+ dec_page_state(nr_mapped);
+ page->pte.mapcount = 0;
+
+ /*
+ * Now that the page is marked as anon, unlock it. page_add_rmap will
+ * lock it as necessary.
+ */
+ pte_chain_unlock(page);
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ if (!pte_chain) {
+ pte_chain = pte_chain_alloc(GFP_KERNEL);
+ if (!pte_chain) {
+ err = -ENOMEM;
+ goto out_unlock;
+ }
+ }
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ /* Make sure this isn't a duplicate */
+ page_remove_rmap(page, pte);
+ pte_chain = page_add_rmap(page, pte, pte_chain);
+ pte_unmap(pte);
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ }
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ if (!pte_chain) {
+ pte_chain = pte_chain_alloc(GFP_KERNEL);
+ if (!pte_chain) {
+ err = -ENOMEM;
+ goto out_unlock;
+ }
+ }
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ /* Make sure this isn't a duplicate */
+ page_remove_rmap(page, pte);
+ pte_chain = page_add_rmap(page, pte, pte_chain);
+ pte_unmap(pte);
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ }
+
+out_unlock:
+ pte_chain_free(pte_chain);
+ up(&mapping->i_shared_sem);
+out:
+ return err;
+}
+
+/**
** No more VM stuff below this comment, only pte_chain helper
** functions.
**/
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/swapfile.c sles-objrmap/mm/swapfile.c
--- sles-ref/mm/swapfile.c 2004-02-20 17:26:54.000000000 +0100
+++ sles-objrmap/mm/swapfile.c 2004-03-03 07:03:33.128301464 +0100
@@ -390,6 +390,7 @@ unuse_pte(struct vm_area_struct *vma, un
vma->vm_mm->rss++;
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
+ SetPageAnon(page);
*pte_chainp = page_add_rmap(page, dir, *pte_chainp);
swap_free(entry);
}
Andrew,
I certainly prefer this to the 4:4 horrors. So it sounds worth it to put
it into -mm if everybody else is ok with it.
Linus
Andrea Arcangeli <[email protected]> wrote:
>
> without this patch not even the 4:4 tlb overhead would allow intensive
> shm (shmfs+IPC) workloads to survive on 32bit archs. Basically without
> this fix it's like 2.6 is running w/o pte-highmem.
yes.
> But the real reason for this work is huge 64bit archs, so we speed up
> and avoid wasting tons of ram.
pte_chain space consumption is approximately equal to pagetable page space
consumption. Sometimes a bit more, sometimes a lot less, approximately
equal.
So why do you say it saves "tons of ram"?
> On 32-ways the scalability is hurt
> very badly by rmap, so it has to be removed (Martin can provide the
> numbers, I think).
I don't recall that the objrmap patches ever significantly affected CPU
utilisation.
I'm not saying that I'm averse to the patches, but I do suspect that this is
a case of large highmem boxes dragging the rest of the kernel along behind
them, and nothing else.
Linus Torvalds <[email protected]> wrote:
>
>
> Andrew,
> I certainly prefer this to the 4:4 horrors. So it sounds worth it to put
> it into -mm if everybody else is ok with it.
Sure. To my amazement it applies without rejects, so we get both ;)
Hopefully the regression which this patch adds (having to search across
vma's which do not cover the pages which we're trying to unmap) will not
impact too many workloads. It will take some time to find out. If it
_does_ impact workloads then we have a case where 64-bit machines are
suffering because of monster highmem requirements, which needs a judgement
call.
There is an architectural concern: we're now treating anonymous pages
differently from file-backed ones. But we already do that in some places
anyway and the implementation is pretty straightforward.
Other issues are how it will play with remap_file_pages(), and how it
impacts Ingo's work to permit remap_file_pages() to set page permissions on
a per-page basis. This change provides large performance improvements to
UML, making it more viable for various virtual-hosting applications. I
don't immediately see any reason why objrmap should kill that off, but if
it does we're in the position of trading off UML virtual server performance
against monster highmem viability. That's less clear.
> . Basically without
> this fix it's like 2.6 is running w/o pte-highmem. 700 tasks with 2.7G
> of shm mapped each would run the box out of zone-normal even with 4:4.
> With 3:1 100 tasks would be enough. Math is easy:
>
> 2.7*1024*1024*1024/4096*8*100/1024/1024/1024
> 2.7*1024*1024*1024/4096*8*700/1024/1024/1024
not saying your patch is not useful or anything, but there is a less
invasive shortcut possible. Oracle wants to mlock() its shared area, and
for mlock()'d pages we don't need a pte chain *at all*. So we could get
rid of a lot of this overhead that way.
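For illustration only (not a tested patch): at the rmap attachment points,
e.g. the do_anonymous_page()/do_no_page() call sites, the shortcut would
amount to something like this, ignoring munlock() handling and the
nr_mapped accounting:

	if (!(vma->vm_flags & VM_LOCKED))
		pte_chain = page_add_rmap(page, page_table, pte_chain);
	/* else: mlocked, never reclaimed, so keep no reverse mapping */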
Now your patch might well be useful for a lot of other reasons too, but
if this is the only one there are potentially less invasive solutions for
2.6.
On Mon, Mar 08, 2004 at 01:02:31PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > without this patch not even the 4:4 tlb overhead would allow intensive
> > shm (shmfs+IPC) workloads to survive on 32bit archs. Basically without
> > this fix it's like 2.6 is running w/o pte-highmem.
>
> yes.
>
> > But the real reason for this work is huge 64bit archs, so we speed up
> > and avoid wasting tons of ram.
>
> pte_chain space consumption is approximately equal to pagetable page space
> consumption. Sometimes a bit more, sometimes a lot less, approximately
> equal.
exactly.
>
> So why do you say it saves "tons of ram"?
because in most high end workloads several gigabytes of ram are
allocated in the pagetables, and without this patch we would waste
another several gigabytes for rmap too (basically doubling the memory
cost of the pagetables). And several gigabytes of ram saved is "tons of
ram" in my vocabulary. I'm talking 64bit here (ignoring the fact the
several gigabytes doesn't fit anyways in the max 4G of zone-normal with
4:4)
> > On 32-ways the scalability is hurt
> > very badly by rmap, so it has to be removed (Martin can provide the
> > numbers, I think).
>
> I don't recall that the objrmap patches ever significantly affected CPU
> utilisation.
it does; the precise number is a 30% slowdown in kernel compiles.
Also check readprofile on any of your boxes: rmap is at the very
top.
> I'm not saying that I'm averse to the patches, but I do suspect that this is
> a case of large highmem boxes dragging the rest of the kernel along behind
> them, and nothing else.
highmem has nothing to do with this. Saving several gigs of ram and
speedups of 30% on 32-ways are the only real reasons.
On Mon, Mar 08, 2004 at 01:23:05PM -0800, Andrew Morton wrote:
> There is an architectural concern: we're now treating anonymous pages
> differently from file-backed ones. But we already do that in some places
I'm working on that.
> Other issues are how it will play with remap_file_pages(), and how it
> impacts Ingo's work to permit remap_file_pages() to set page permissions on
> a per-page basis. This change provides large performance improvements to
in the current form it should be using pte_chains still for nonlinear
vmas, see the function that pretends to convert the page to be like
anonymous memory (which simply means to use pte_chains for the reverse
mappings). I admit I didn't focus much on that part though, I trust
Dave on that ;), since I want to drop it.
What I want to do with the nonlinear vmas is to scan all the ptes in
every nonlinear vma, so I don't have to allocate the pte_chain and the
swapping procedure will simply be more cpu hungry under nonlinear vmas.
I'm not interested in providing optimal performance in swapping nonlinear
vmas, I prefer the fast path to be as fast as possible and without
memory overhead. nonlinear vmas are supposed to speedup the workload. If
one needs to swap efficiently, the vmas will do it (by carrying some
memory overhead, like pte_chains would carry too).
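To make the idea concrete, a sketch only (not code from the patch above,
and with locking, TLB flushing and dirty/rss accounting left out), the
brute-force scan over one nonlinear vma would look roughly like this:

/*
 * hypothetical helper, not in the patch: brute-force scan of one
 * nonlinear vma for 'page'; locking, flushing and accounting omitted
 */
static int nonlinear_vma_unmap_page(struct vm_area_struct *vma,
				    struct page *page)
{
	unsigned long addr;
	int unmapped = 0;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
		pmd_t *pmd;
		pte_t *pte;

		if (!pgd_present(*pgd))
			continue;
		pmd = pmd_offset(pgd, addr);
		if (!pmd_present(*pmd))
			continue;
		pte = pte_offset_map(pmd, addr);
		if (pte_present(*pte) && pte_pfn(*pte) == page_to_pfn(page)) {
			/* here: ptep_get_and_clear(), flush, rss--, etc. */
			unmapped++;
		}
		pte_unmap(pte);
	}
	return unmapped;
}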
> UML, making it more viable for various virtual-hosting applications. I
> don't immediately see any reason why objrmap should kill that off, but if
> it does we're in the position of trading off UML virtual server performance
> against monster highmem viability. That's less clear.
I still don't see objrmap as a highmem related thing; as a side effect of
being more efficient it avoids the need for 4:4 too, but that's just a side
effect.
the potential additional cpu consumption with many vmas from the same
file at different offsets is something I'm slightly concerned about too,
but my priority is to optimize the fast path, and the slowdown is not
something I worry about too much since it should still be a lot better
than the pagetable walk of 2.4, where one had to throw away the whole
address space before one could free a shared page (and still it was far
from being cpu bound). I'm also using objrmap in 2.4 to actually swap
lots of shm, some of it with tons of vmas for the same file, and
while the objrmap function remains at the top of the profile, at least
the swap_out loop is gone and we avoid throwing away the whole address
space. I mean, it's still a lot more efficient than having to scan all
the ptes in the system to find the right page, or than throwing away the
whole address space to swap 4k like stock 2.4 does.
And you know well that 2.6 swaps slower (or anyway not faster) than
2.4; see the posts on linux-mm or the complaints from the lowmem users.
There are various bits involved in swapping and paging that are
more important than saving cpu during swapping, which is normally an I/O
bound thing.
On Mon, Mar 08, 2004 at 10:28:38PM +0100, Arjan van de Ven wrote:
> > . Basically without
> > this fix it's like 2.6 is running w/o pte-highmem. 700 tasks with 2.7G
> > of shm mapped each would run the box out of zone-normal even with 4:4.
> > With 3:1 100 tasks would be enough. Math is easy:
> >
> > 2.7*1024*1024*1024/4096*8*100/1024/1024/1024
> > 2.7*1024*1024*1024/4096*8*700/1024/1024/1024
>
>
> not saying your patch is not useful or anything, but there is a less
> invasive shortcut possible. Oracle wants to mlock() its shared area, and
> for mlock()'d pages we don't need a pte chain *at all*. So we could get
> rid of a lot of this overhead that way.
I agree that works fine for Oracle; that's because Oracle is an extreme
special case since most of this shared memory is an I/O cache. This is
not the case for other apps, and those other apps really depend on the
kernel vm paging algorithms for more than instantiating a pte (or
a pmd if it's a largepage). Other apps can't use mlock. Some of these
apps work closely with Oracle too.
dropping pte_chains through mlock was suggested around april 2003
originally by Wli and I didn't like that idea since we really want to
allow swapping if we run short of ram. And it doesn't solve the
scalability slowdown on the 32-way for kernel compiles either.
Andrea Arcangeli <[email protected]> wrote:
>
> > Other issues are how it will play with remap_file_pages(), and how it
> > impacts Ingo's work to permit remap_file_pages() to set page permissions on
> > a per-page basis. This change provides large performance improvements to
>
> in the current form it should be using pte_chains still for nonlinear
> vmas, see the function that pretends to convert the page to be like
> anonymous memory (which simply means to use pte_chains for the reverse
> mappings). I admit I didn't focus much on that part though, I trust
> Dave on that ;), since I want to drop it.
>
> What I want to do with the nonlinear vmas is to scan all the ptes in
> every nonlinear vma, so I don't have to allocate the pte_chain and the
> swapping procedure will simply be more cpu hungry under nonlinear vmas.
> I'm not interested in providing optimal performance in swapping nonlinear
> vmas, I prefer the fast path to be as fast as possible and without
> memory overhead.
OK. There was talk some months ago about making the non-linear vma's
effectively mlocked and unswappable. That would reduce their usefulness
significantly. It looks like that's off the table now, which is good.
btw, mincore() has always been broken with nonlinear vma's. If you could
fix that up some time using that pagetable walker it would be nice. It's
not very important though.
On Mon, Mar 08, 2004 at 03:21:26PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > > Other issues are how it will play with remap_file_pages(), and how it
> > > impacts Ingo's work to permit remap_file_pages() to set page permissions on
> > > a per-page basis. This change provides large performance improvements to
> >
> > in the current form it should be using pte_chains still for nonlinear
> > vmas, see the function that pretends to convert the page to be like
> > anonymous memory (which simply means to use pte_chains for the reverse
> > mappings). I admit I didn't focus much on that part though, I trust
> > Dave on that ;), since I want to drop it.
> >
> > What I want to do with the nonlinear vmas is to scan all the ptes in
> > every nonlinear vma, so I don't have to allocate the pte_chain and the
> > swapping procedure will simply be more cpu hungry under nonlinear vmas.
> > I'm not interested in providing optimal performance in swapping nonlinear
> > vmas, I prefer the fast path to be as fast as possible and without
> > memory overhead.
>
> OK. There was talk some months ago about making the non-linear vma's
> effectively mlocked and unswappable. That would reduce their usefulness
> significantly. It looks like that's off the table now, which is good.
I sure remember it well since I was the one suggesting it ;). I've now
figured out that throwing some cpu at the problem would make them swappable
without hurting any fast path, so I'm opting for that. It won't be efficient
swapping but at least it swaps ;). If it turns out to be way too inefficient
we have two options: 1) take the 2.4 way of destroying the whole vma
(still better than destroying the whole address space), or 2) make them
mlocked as I originally suggested, which will annoy the JIT emulation
usage mentioned.
> btw, mincore() has always been broken with nonlinear vma's. If you could
> fix that up some time using that pagetable walker it would be nice. It's
> not very important though.
Ok! I'm still late at this though, I wish I would be working on the
nonlinear stuff by now ;), I'm still stuck at the anon_vma_chain...
If I understand well, vmtruncate will also need the pagetable walker to
nuke all mappings of the last pages of the files before we free them
from the pagecache. So it should be a library call that mincore can use
too then, I don't see problems.
btw (for completeness), about the cpu consumption concerns about objrmap
w.r.t. security (that was Ingo's only argument against objrmap),
whatever malicious waste of cpu could happen during paging can already
be triggered in any kernel out there by using truncate on the same
mappings instead of swapping them out. Truncate 1 page at a time and you
can have the kernel walk all the vmas in the huge list for every page in
the mapping. It must be mandated by a syscall but I don't see much
difference. I don't think it's very different from a for(;;) loop in
userspace, except this will have a higher scheduler latency, but if we
implement it right the scheduler latency won't be higher than the one
triggered by truncate today. So I can't see any security related issue.
Swapping a page with objrmap or truncating it with the same objrmap (as
every kernel out there does) carries exactly the same objrmap cpu cost.
And this is only a matter of local security, and wasting cpu is quite
easy anyway if you can write your own app.
Andrea Arcangeli <[email protected]> wrote:
>
> > btw, mincore() has always been broken with nonlinear vma's. If you could
> > fix that up some time using that pagetable walker it would be nice. It's
> > not very important though.
>
> Ok! I'm still late at this though, I wish I would be working on the
> nonlinear stuff by now ;), I'm still stuck at the anon_vma_chain...
As I say, broken mincore() on nonlinear mappings isn't a showstopper ;)
> If I understand well, vmtruncate will also need the pagetable walker to
> nuke all mappings of the last pages of the files before we free them
> from the pagecache. So it should be a library call that mincore can use
> too then, I don't see problems.
If we want to bother with the traditional truncate-causes-SIGBUS semantics
on nonlinear mappings, yes. I guess it would be best to do that if
possible.
>
> btw (for completeness), about the cpu consumption concerns about objrmap
> w.r.t. security (that was Ingo's only argument against objrmap),
> whatever malicious waste of cpu could happen during paging can already
> be triggered in any kernel out there by using truncate on the same
> mappings instead of swapping them out.
Yes, malicious apps can DoS the machine in many ways. I'm more concerned
about non-malicious ones getting hurt by the new search activity. Say, a
single-threaded app which uses a huge number of vma's to map discontiguous
parts of a file. The 2.4-style virtual scan would handle that OK, and the
2.6-style pte_chain walk would handle it OK too. People do weird things.
(objrmap could perhaps terminate the vma walk after it sees the page->count
fall to a value which means there are no more pte's mapping the page - that
would halve the search cost on average).
On Mon, Mar 08, 2004 at 04:10:46PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > > btw, mincore() has always been broken with nonlinear vma's. If you could
> > > fix that up some time using that pagetable walker it would be nice. It's
> > > not very important though.
> >
> > Ok! I'm still late at this though, I wish I would be working on the
> > nonlinear stuff by now ;), I'm still stuck at the anon_vma_chain...
>
> As I say, broken mincore() on nonlinear mappings isn't a showstopper ;)
>
> > If I understand well, vmtruncate will also need the pagetable walker to
> > nuke all mappings of the last pages of the files before we free them
> > from the pagecache. So it should be a library call that mincore can use
> > too then, I don't see problems.
>
> If we want to bother with the traditional truncate-causes-SIGBUS semantics
> on nonlinear mappings, yes. I guess it would be best to do that if
> possible.
yes, this was my object.
btw, this reminds me of another problem, that is what to do in the case
where in 2.4 we convert file-mapped pages into anonymous pages while
they're still mapped (I don't remember exactly what could do that but it
could happen, do you remember the details? I think this is the case that
Hugh calls the Morton pages, he also had troubles in his anobjrmap
attempt but I think it was more a fixme comment). In 2.4 the swap_out
had to deal with that somehow, but with my anobjrmap the vm will now
lose track of those pages, so they will become unswappable. Not sure if
they were unswappable in 2.4 too and/or if 2.6-rmap could leave them
visible to the vm or not.
Also these pages should be swapped to the swap device, if anything;
they lost their reference to the inode.
Input on the Morton pages is appreciated ;)
> > btw (for completeness), about the cpu consumption concerns about objrmap
> > w.r.t. security (that was Ingo's only argument against objrmap),
> > whatever malicious waste of cpu could happen during paging can already
> > be triggered in any kernel out there by using truncate on the same
> > mappings instead of swapping them out.
>
> Yes, malicious apps can DoS the machine in many ways. I'm more concerned
> about non-malicious ones getting hurt by the new search activity. Say, a
> single-threaded app which uses a huge number of vma's to map discontiguous
> parts of a file. The 2.4-style virtual scan would handle that OK, and the
> 2.6-style pte_chain walk would handle it OK too. People do weird things.
That's the db normal scenario and it's running fine on 16G boxes today
in 2.4 with objrmap. Note that we're talking about swapping here, and
compared to swap_out that cpu load in the vma chains is little compared
to throwing away hundred gigs of address space before swapping the first
4k (infact objrmap was the 1st showstopper fix to make huge-shm-swap work
properly). So I'm not very concerned. Also for some db those pages
should be mlocked, so to be optimal we should remove them from the lru
while they're mlocked as Martin suggested. and normal pure db (not
applications) don't swap, so they will only run faster.
Longer term on 64bit those weird setups will be all but common as far as
I can tell.
Overall it sounds the best trade-off.
> (objrmap could perhaps terminate the vma walk after it sees the page->count
> fall to a value which means there are no more pte's mapping the page - that
> would halve the search cost on average).
It really should! Agreed. Feel free to go ahead and fix it. I checked
the page_count before starting the loop in my 2.4 implementation, but I
forgot to do that during the core of the loop. I was only breaking the
loop at the first pte_young I would find (my objrmap isn't capable of
clearing the pte_young bit, I leave that task to the pagetable walk, you
know my 2.4 objrmap is a hybrid between objrmap and the swap_out loop,
2.6 handles all this differently).
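As a sketch of what that early termination could look like against the
patch above (illustrative only, using the mapcount the patch already keeps
rather than page->count, and assuming a page_referenced_obj_one() variant
that also reports through a third argument whether it actually found a pte
in that vma):

static int
page_referenced_obj(struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct vm_area_struct *vma;
	int referenced = 0;
	int left = page->pte.mapcount;	/* mapping ptes not yet seen */
	int found;

	if (!left)
		return 0;

	if (down_trylock(&mapping->i_shared_sem))
		return 1;

	list_for_each_entry(vma, &mapping->i_mmap, shared) {
		/* hypothetical variant sets 'found' if a pte was located */
		referenced += page_referenced_obj_one(vma, page, &found);
		if (found && !--left)
			goto out;	/* every mapping accounted for */
	}
	list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
		referenced += page_referenced_obj_one(vma, page, &found);
		if (found && !--left)
			goto out;
	}
out:
	up(&mapping->i_shared_sem);
	return referenced;
}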
Andrea Arcangeli <[email protected]> wrote:
>
> btw, this reminds me of another problem, that is what to do in the case
> where in 2.4 we convert file-mapped pages into anonymous pages while
> they're still mapped (I don't remember exactly what could do that but it
> could happen, do you remember the details? I think this is the case that
> Hugh calls the Morton pages, he also had troubles in his anobjrmap
> attempt but I think it was more a fixme comment). In 2.4 the swap_out
> had to deal with that somehow, but with my anobjrmap the vm will now
> lose track of those pages, so they will become unswappable. Not sure if
> they were unswappable in 2.4 too and/or if 2.6-rmap could leave them
> visible to the vm or not.
>
> Also these pages should be swapped to the swap device, if anything;
> they lost their reference to the inode.
>
> Input on the Morton pages is appreciated ;)
You mean Dickens pages ;)
They were caused by a race between truncate and filemap_nopage(), iirc.
nopage was sleeping on the read I/O and truncate would come in and tear
down the pagetables. Then the read I/O completes and nopage reinstantiates
the page outside i_size after truncate ripped it off the mapping. truncate
was unable to free the page because ext3 happened to have a ref via the
page's buffer_heads. Something like that.
But these pages should no longer exist, due to the truncate_count logic in
do_no_page().
However I'm not sure that this (truly revolting) problem which Rajesh
identified:
http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/1155.html
cannot cause them to come back.
I really do want to fix that problem via locking: say taking i_shared_sem
inside mremap(). i_shared_sem is a very innermost lock and the ranking
with mmap_sem is all wrong.
For now, it would be sufficient to put a debug printk in there somewhere to
see if we are still getting Dickens pages.
Andrea Arcangeli <[email protected]> wrote:
>
> > I don't recall that the objrmap patches ever significantly affected CPU
> > utilisation.
>
> it does; the precise number is a 30% slowdown in kernel compiles.
I think this might have been increased system time, not increased runtime.
> Also check readprofile on any of your boxes: rmap is at the very
> top.
With super-forky workloads, yes. But we were somewhat disappointed in the
(lack of) improvements which [an]objrmap offered.
It sounds like a bunch of remeasuring is needed. No doubt someone will do
this as we move these patches along.
* Andrea Arcangeli <[email protected]> wrote:
> I agree that works fine for Oracle; that's because Oracle is an extreme
> special case since most of this shared memory is an I/O cache. This is
> not the case for other apps, and those other apps really depend on the
> kernel vm paging algorithms for more than instantiating a pte
> (or a pmd if it's a largepage). Other apps can't use mlock. Some of
> these apps work closely with Oracle too.
what other apps use gigs of shared memory where that shared memory is
not an IO cache?
> dropping pte_chains through mlock was suggested around april 2003
> originally by Wli and I didn't like that idea since we really want to
> allow swapping if we run short of ram. [...]
dropping pte_chains on mlock() we implemented in RHEL3 and it works fine
to reduce the pte_chain overhead for those extreme shm users.
mind you, it still doesnt make high-end DB workloads viable on 32 GB
systems. (and no, not due to the pte_chain overhead.) 3:1 is simply not
enough at 32 GB and higher [possibly much earlier, for other workloads].
Trying to argue otherwise is sticking your head into the sand.
most of the anti-rmap sentiment (not this patch - this patch looks OK at
first sight, except the increase in struct page) is really backwards.
The right solution is to have rmap which is a _per page_ overhead and
the clear path to a mostly O(1) VM logic. Then we can increase the page
size (pgcl) to scale down the rmap overhead (both the per-page and the
locking overhead). What's so hard about this concept? Simple and
flexible data structure.
the x86 highmem issues are a quickly fading transient in history.
Ingo
* Andrea Arcangeli <[email protected]> wrote:
> btw (for completeness), about the cpu consumption concerns about
> objrmap w.r.t. security (that was Ingo's only argument against
> objrmap),
ugh? This is not my main argument against 'objrmap'. My main (and pretty
much only) argument against it is the linear searching it reintroduces.
I.e. in your patch you do this, in 3 key areas:
+int page_convert_anon(struct page *page)
...
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
...
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
...
+static int
+page_referenced_obj(struct page *page)
...
+ list_for_each_entry(vma, &mapping->i_mmap, shared)
...
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared)
...
+static int
+try_to_unmap_obj(struct page *page)
...
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
...
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
...
the length of these lists ~equals the number of processes mapping
said inode (the 'sharing factor'). I.e. the more processes, the bigger
box, the longer lists. The more advanced computers get, the longer the
lists get too. A scalability nightmare in the making.
with rmap we do have the ability to make it truly O(1) all across, by
making the pte chains a doubly linked list. Moreover, we can freely
reduce the rmap overhead (both the memory, algorithmic and locking
overhead) by increasing the page size - a natural thing to do on big
boxes anyway. The increasing of the page size also linearly _reduces_
the RAM footprint of rmap. So rmap and pgcl are a natural fit and the
thing of the future.
now, the linear searching of vmas does not reduce with increased
page-size. In fact, it will increase in time as the sharing factor
increases.
do you see what i'm worried about?
Ingo
On Tue, Mar 09, 2004 at 09:31:03AM +0100, Ingo Molnar wrote:
> with rmap we do have the ability to make it truly O(1) all across, by
> making the pte chains a doubly linked list.
> reduce the rmap overhead (both the memory, algorithmic and locking
> overhead) by increasing the page size - a natural thing to do on big
> boxes anyway. The increasing of the page size also linearly _reduces_
> the RAM footprint of rmap. So rmap and pgcl are a natural fit and the
> thing of the future.
> now, the linear searching of vmas does not reduce with increased
> page-size. In fact, it will increase in time as the sharing factor
> increases.
This is getting bandied about rather frequently. I should make some
kind of attack on an implementation. The natural implementation is
to add one pte per contiguous and aligned group of PAGE_MMUCOUNT ptes
to the pte_chain and search the area surrounding any pte_chain element.
But the linear search you're pointing at is unnecessary to begin with.
Only the single nonlinear mappings' pte needs to be added to the
pte_chain there; one need only also scan the vma lists at reclaim-time.
This would also make page_convert_anon() a misnomer and SetPageAnon()
on nonlinearly-mapped file-backed pages a bug.
-- wli
* Ingo Molnar <[email protected]> wrote:
> ugh? This is not my main argument against 'objrmap'. My main (and
> pretty much only) argument against it is the linear searching it
> reintroduces.
to clarify this somewhat. objrmap works fine (and roughly equivalently
to rmap) when processes map files via one vma mostly. (e.g. shared
libraries.)
objrmap falls apart badly if there are tons of _disjunct_ vmas to the
same inode. One such workload is Oracle's 'indirect buffer cache'. It is
a ~512 MB virtual memory window with 32KB mapsize, mapping to a much
larger shmfs page, featuring 16 thousand vmas per process.
The problem is that the ->i_mmap and ->i_mmap_shared lists 'merge' _all_
the vmas that somehow belong to the given inode. Ie. in the above case
it has a length of 16 thousand entries. So while possibly none of those
vmas shares any physical page with each other, your
try_to_unmap_obj(page) function will loop through possibly thousands of
vmas, killing the VM's ability to reclaim. (and also presenting the
'kswapd is stuck and eating up CPU time' phenomenon to users.)
Andrea, did you know about this property of your patch? If yes, why
didnt you mention it in the announcement, as a tradeoff to take care of?
it's ironic that precisely the workload you cite (shmfs for IO cache,
when the shared memory size is larger than what the process can map) is
the one that would hurt most from objrmap. In that workload there can be
possibly tens of thousands of disjunct vmas mapping the same shmfs inode
and kswapd would loop endlessly without achieving anything useful.
(remap_file_pages() will handle such workloads fine, but it's still a
big regression for those applications that happen to have more than a
handful of disjunct vmas per inode. I obviously like fremap, but i dont
want to force it down anyone's throat.)
Ingo
* Andrea Arcangeli <[email protected]> wrote:
> This patch avoids the allocation of rmap for shared memory; it uses
> the objrmap framework to find the mapping ptes starting from a
> page_t, which is zero memory cost (and zero cpu cost for the fast
> paths)
this patch locks up the VM.
To reproduce, run the attached, very simple test-mmap.c code (as
unprivileged user) which maps 80MB worth of shared memory in a
finegrained way, creating ~19K vmas, and sleeps. Keep this process
around.
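The test-mmap.c attachment isn't reproduced in this excerpt; a rough
reconstruction of that kind of test (not necessarily Ingo's exact code;
this one uses a tmpfs-backed shm_open() file, link with -lrt) would map
~80MB of shared memory one page at a time in an order that defeats vma
merging, so every page gets its own vma (roughly 20000 of them), then
sleep:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>

#define SHM_BYTES (80UL * 1024 * 1024)

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	long pages = SHM_BYTES / psize;
	long i, pass;
	int fd;

	/* tmpfs-backed file, so the mappings behave like shmfs shm */
	fd = shm_open("/test-mmap", O_CREAT | O_RDWR, 0600);
	if (fd < 0) { perror("shm_open"); return 1; }
	shm_unlink("/test-mmap");
	if (ftruncate(fd, SHM_BYTES) < 0) { perror("ftruncate"); return 1; }

	/*
	 * Map even-numbered file pages first, then odd-numbered ones, so
	 * mappings that end up adjacent in the address space never have
	 * contiguous file offsets and can never be merged into one vma:
	 * one vma per page, ~20000 vmas in total.
	 */
	for (pass = 0; pass < 2; pass++) {
		for (i = pass; i < pages; i += 2) {
			char *p = mmap(NULL, psize, PROT_READ | PROT_WRITE,
				       MAP_SHARED, fd, (off_t)i * psize);
			if (p == MAP_FAILED) { perror("mmap"); return 1; }
			*p = 1;		/* instantiate the page */
		}
	}

	pause();	/* hold the vmas and sleep, as described above */
	return 0;
}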
Then try to create any sort of VM swap pressure. (start a few desktop
apps or generate pagecache pressure.) [the 500 MHz P3 system i tried
this on has 256 MB of RAM and 300 MB of swap.]
stock 2.6.4-rc2-mm1 handles it just fine - it starts swapping and
recovers. The system is responsive and behaves just fine.
with 2.6.4-rc2-mm1 + your objrmap patch the box in essence locks up and
it's not possible to do anything. The VM is looping within the objrmap
functions. (a sample trace attached.)
Note that the test-mmap.c app does nothing that a normal user cannot do.
In fact it's not even hostile - it only has lots of vmas but is
otherwise not actively pushing the VM, it's just sleeping. (Also, the
test is a very far cry from Oracle's workload of gigabytes of shm mapped
in a finegrained way to hundreds of processes.) All in all, currently i
believe the patch is pretty unacceptable in its present form.
Ingo
Pid: 7, comm: kswapd0
EIP: 0060:[<c013ee6d>] CPU: 0
EIP is at page_referenced_obj+0xdd/0x120
EFLAGS: 00000246 Not tainted
EAX: cb311808 EBX: cb311820 ECX: 40a2d000 EDX: cb311848
ESI: cfe202fc EDI: cfe2033c EBP: cfdf9dc4 DS: 007b ES: 007b
CR0: 8005003b CR2: 40507000 CR3: 0b11e000 CR4: 00000290
Call Trace:
[<c013ef71>] page_referenced+0xc1/0xd0
[<c0137bad>] refill_inactive_zone+0x3fd/0x4c0
[<c01376bc>] shrink_cache+0x26c/0x360
[<c0137d11>] shrink_zone+0xa1/0xb0
[<c01380d7>] balance_pgdat+0x1a7/0x200
[<c013820b>] kswapd+0xdb/0xe0
[<c01180b0>] autoremove_wake_function+0x0/0x50
[<c01180b0>] autoremove_wake_function+0x0/0x50
[<c0138130>] kswapd+0x0/0xe0
[<c01050f9>] kernel_thread_helper+0x5/0xc
* Ingo Molnar <[email protected]> wrote:
> To reproduce, run the attached, very simple test-mmap.c code (as
> unprivileged user) which maps 80MB worth of shared memory in a
> finegrained way, creating ~19K vmas, and sleeps. Keep this process
> around.
or run the attached test-mmap2.c code, which simulates a very small DB
app using only 1800 vmas per process: it only maps 8 MB of shm and
spawns 32 processes. This has an even more lethal effect than the
previous code.
Ingo
Ingo Molnar <[email protected]> wrote:
>
> or run the attached test-mmap2.c code, which simulates a very small DB
> app using only 1800 vmas per process: it only maps 8 MB of shm and
> spawns 32 processes. This has an even more lethal effect than the
> previous code.
Do these tests actually make any forward progress at all, or is it some bug
which has sent the kernel into a loop?
* Andrew Morton <[email protected]> wrote:
> > or run the attached test-mmap2.c code, which simulates a very small DB
> > app using only 1800 vmas per process: it only maps 8 MB of shm and
> > spawns 32 processes. This has an even more lethal effect than the
> > previous code.
>
> Do these tests actually make any forward progress at all, or is it some bug
> which has sent the kernel into a loop?
I think they make forward progress, so it's more of a DoS - but a very
effective one, especially considering that I didn't even try hard ...
What worries me is that there are apps that generate such vma patterns
(for various reasons).
I do believe that scanning ->i_mmap & ->i_mmap_shared is fundamentally
flawed.
Ingo
* Andrew Morton <[email protected]> wrote:
>> Do these tests actually make any forward progress at all, or is it
>> some bug which has sent the kernel into a loop?
On Tue, Mar 09, 2004 at 12:49:24PM +0100, Ingo Molnar wrote:
> i think they make a forward progress so it's more of a DoS - but a very
> effective one, especially considering that i didnt even try hard ...
> what worries me is that there are apps that generate such vma patterns
> (for various reasons).
> I do believe that scanning ->i_mmap & ->i_mmap_shared is fundamentally
> flawed.
Whatever's going on, this looks like objrmap will turn into a quagmire.
I was vaguely holding out for anobjrmap to come in and get rid of the
dependency of the pte_chain-based ptov resolution on struct page. So,
any ideas on how to kick pte_chains off the habit of shoving information
into pagetable nodes' struct pages, or am I (worst case) stuck eating
grossly oversized pagetable nodes and horrific internal fragmentation
(<= 20% pagetable utilization with 4K already) no matter what?
I guess I could allocate an array of the things pte_chains want in
struct pages and attach it to ->private at allocation-time, but that's
even worse wrt. cache and space footprint than the current state of
affairs, worse still on 32-bit, and scales poorly to small PAGE_MMUCOUNT.
I guess ->lru and ->list may handle it up to 4, but that smells bad.
My second guess is that with PAGE_MMUCOUNT >= 2 and only using one
pte_chain entry per PAGE_MMUCOUNT aligned and contiguous ptes, it's
still a net space win to just put information directly beside the
(potentially physical) pte pointers in the pte_chains.
Do either of these sound desirable? Any other ideas?
-- wli
On Tue, Mar 09, 2004 at 10:03:26AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > ugh? This is not my main argument against 'objrmap'. My main (and
> > pretty much only) argument against it is the linear searching it
> > reintroduces.
>
> to clarify this somewhat. objrmap works fine (and roughly equivalently
> to rmap) when processes map files via one vma mostly. (e.g. shared
> libraries.)
>
> objrmap falls apart badly if there are tons but _disjunct_ vmas to the
> same inode. One such workload is Oracle's 'indirect buffer cache'. It is
> a ~512 MB virtual memory window with 32KB mapsize, mapping to a much
> larger shmfs page, featuring 16 thousand vmas per process.
>
> The problem is that the ->i_mmap and ->i_mmap_shared lists 'merge' _all_
> the vmas that somehow belong to the given inode. Ie. in the above case
> it has a length of 16 thousand entries. So while possibly none of those
> vmas shares any physical page with each other, your
> try_to_unmap_obj(page) function will loop through possibly thousands of
> vmas, killing the VM's ability to reclaim. (and also presenting the
> 'kswapd is stuck and eating up CPU time' phenomenon to users.)
>
> Andrea, did you know about this property of your patch? If yes, why
> didnt you mention it in the announcement, as a tradeoff to take care of?
First of all, this algorithm is running in production just fine in
the workloads you're talking about; it's not like I didn't even try it,
including the ones that have to swap (see the end of the email).
You should specify that they have to swap to actually pay this cost,
and you said that the memory you're talking about is mlocked. In the 2.4
objrmap I don't pay the cost for mlocked memory; the 2.6 patch isn't
smart enough yet, but it soon will be.
The patch I posted is the building block; all the development is on top
of that. That patch alone I agree is not optimal, and my next target is
to remove all pte_chains from the kernel. I'm not far from that; it will
avoid the rmap overhead for anon memory and secondly (less critical) it
will give us back 8 bytes per page_t (reducing the page_t by 4 bytes
compared to current 2.6 mainline). I can shrink it further, but I'd need
to drop some functionality, so I'm doing the anon_vma work in a
self-contained way using the same APIs that are in rmap.c today.
Regardless of the mlock thing (and I'm not even sure the 2.4 code is
running with mlocked vmas, so it may be fine even w/o mlock), you should
mention that this "cpu" cost only happens by the time you become I/O
bound. And the workload that I'm testing on 2.4 is very swap intensive,
not like a normal db.
There are three things you're missing:
1) cpus and memory are getting faster and faster with respect to storage
   (and there is so much ram these days that one can argue whether
   doing any swap-oriented optimization will ever pay off)
2) the only benefit you get is less cpu cost during swapping, and 2.6
   is slower than 2.4 at swap bandwidth
3) if you use those gigabytes of ram as shmfs backing store, then I
   expect a lot more benefit than saving cpu during swapping, since
   you'll have some more gigs of ram before you start swapping
>
> it's ironic that precisely the workload you cite (shmfs for IO cache,
> when the shared memory size is larger than what the process can map) is
> the one that would hurt most from objrmap. In that workload there can be
> possibly tens of thousands of disjunct vmas mapping the same shmfs inode
> and kswapd would loop endlessly without achieving anything useful.
It's the opposite: without objrmap the machine falls apart, because
before we can swap 4k we have to destroy the entire address space, so we
keep generating minor-fault floods. The cpu cost of following some
thousand pointers is little compared to destroying the whole hundred
gigabytes or terabytes of address space.
I'm not kidding: people are now able to use the machine, previously they
couldn't (it's not just objrmap of course, that's one of three bits that
fixed it).
And in this whole email you're still thinking only with a 32bit mindset;
for 64bit, objrmap is obviously an order of magnitude superior in the
same workload that you complained about above for 32bit archs.
* Andrea Arcangeli <[email protected]> wrote:
> first of all that this algorithm is running in production just fine in
> the workloads you're talking about, it's not like I didn't even try
> it, even the ones that have to swap (see the end of the email).
could you just try test-mmap2.c on such a box, and hit swap?
Ingo
On Tue, Mar 09, 2004 at 08:47:47AM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > I agree that works fine for Oracle, that's becase Oracle is an extreme
> > special case since most of this shared memory is an I/O cache, this is
> > not the case of other apps, and those other apps really depends on the
> > kernel vm paging algorithms for things more than istantiating a pte
> > (or a pmd if it's a largepage). Other apps can't use mlock. Some of
> > these apps works closely with oracle too.
>
> what other apps use gigs of shared memory where that shared memory is
> not an IO cache?
Another >50b company; there aren't that many.
> > dropping pte_chains through mlock was suggested around april 2003
> > originally by Wli and I didn't like that idea since we really want to
> > allow swapping if we run short of ram. [...]
>
> dropping pte_chains on mlock() we implemented in RHEL3 and it works fine
> to reduce the pte_chain overhead for those extreme shm users.
That's OK for Oracle, but it's far from being an acceptable solution
for all critical apps.
> mind you, it still doesnt make high-end DB workloads viable on 32 GB
> systems. (and no, not due to the pte_chain overhead.) 3:1 is simply not
> enough at 32 GB and higher [possibly much ealier, for other workloads].
> Trying to argue otherwise is sticking your head into the sand.
Then either we run different software or your vm is inferior, it's as
simple as that. I'm not just guessing: this stuff has run for a very
long time, we've even triggered bugs outside the kernel that were never
reproducible until we did, now fixed (not a Linux issue), and there was
no zone-normal shortage at all that I can remember; not a single kernel
glitch was experienced during the 32G tests.
Only if you can reproduce your zone-normal shortage with 2.4-aa can I
care about it, so give it a try please. I can't care what happens with
the rmap vm in the RHEL3 kernels, I don't know the vm of those kernels,
and in my tree I've got quite a ton of add-ons for making 3:1 work
(beneficial on 64bit too of course; just to mention one, the vma isn't
hardware aligned, and that alone gives back dozens of mbytes of normal
zone etc..).
Go read this:
http://www.oracle.com/apps_benchmark/html/index.html?0325B_Report1.html
CPUs: 4 x Intel Xeon MP 2.8GHz
Processor caches: 12 KB L1; 512 KB L2; 2MB L3
Memory: 32GB
Operating System: SuSE SLES8
Disks: 2 x 72.8GB (Ultra 3)
User Count 7504 users
There was absolutely no zone-normal shortage going on here, the machine
was perfectly fine; the 7.5k user limit is purely a cpu-bound matter,
that's the maximum the cpus could handle. Of course this was with SLES8;
mainline would run into tons of trouble due to the lack of pte-highmem,
but everything else these days seems to be in mainline too.
2.6 mainline would lock up quickly with a zone-normal shortage too in
the above workload, due to rmap (I think regardless of 4:4 or 3:1).
But with my 2.6 work I expect to give even more margin to those 32G
boxes using 3:1 as usual, thanks to a reduced page_t and thanks to
remap_file_pages and some other bits, so they're even more generic.
There's absolutely no swap going on on those machines, and if they swap
it will be a few megs, so walking long i_mmap lists a few times is
perfectly fine if we really have to do I/O. (If they use mlock and we
teach objrmap to remove the locked pages from the lru, they won't risk
any list-walking either; but while you're forced to use mlock anyway to
make RHEL work with rmap, objrmap doesn't need any mlock to work
optimally, since normally there's no swap at all and so no cpu time is
spent in the lists. mlock is more a hint than a requirement with objrmap
in these workloads.)
And note that running out of zone-normal shouldn't lead to a
kernel crash like in 2.6; in my 2.4-aa it simply generates an -ENOMEM
retval from syscalls, that's it, no task killed, nothing really bad
happening. Running out of zone-normal is no different from running out
of highmem on a machine without swap. So if 3:1 were to run out of
zone-normal at 8.5k users (possible, but we couldn't reach that since,
as said, it's cpu bound at 7.5k), it could be that not all ram would be
perfectly utilized, but it'd be like running out of highmem, except the
tasks will not be killed. An admin should monitor the vm over time,
lowmem and highmem free levels, to see if his workload risks running the
box out of memory and if he needs a different architecture or a
different memory model. In the tests I did so far with the 2.4-aa vm,
32G works fine with 3:1 after all the work done to make it work.
A 32-way with 32G of ram may hit a zone-normal shortage earlier due to
the per-cpu reservations, that's true, just like a 2-way 48G box will
run out of normal-zone quicker. At some point 3:1 becomes a limitation,
we agree on that, but I definitely would use 3:1 for workloads like the
above one with hardware like the above one; using any other model for
this workload is wrong and will only hurt.
And about 32G we can argue; about 8G boxes we don't really want to
argue, 3:1 is fine for 8G boxes.
As a matter of fact, the single reason you have to ship all PAE
kernels with 4:4 is rmap, no other reason. If you didn't have rmap, you
could leave the choice to the user.
> most of the anti-rmap sentiment (not this patch - this patch looks OK at
> first sight, except the increase in struct page) is really backwards.
The increase in struct page will be fixed by the further patches; their
first effect will be to make the page_t 4 bytes smaller than in 2.6 and
2.4 mainline. This patch is just the building block of the objrmap vm,
it's not definitive, just a transitional phase before the pte_chain goes
away completely, releasing its 8 bytes from the page_t.
> The right solution is to have rmap which is a _per page_ overhead and
> the clear path to a mostly O(1) VM logic. Then we can increase the page
> size (pgcl) to scale down the rmap overhead (both the per-page and the
> locking overhead). What's so hard about this concept? Simple and
> flexible data structure.
>
> the x86 highmem issues are a quickly fading transient in history.
pte_chains hurt the same on 32bit and 64bit.
On Tue, Mar 09, 2004 at 04:09:42PM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > first of all that this algorithm is running in production just fine in
> > the workloads you're talking about, it's not like I didn't even try
> > it, even the ones that have to swap (see the end of the email).
>
> could you just try test-mmap2.c on such a box, and hit swap?
I will try, to see what happens. But please write an exploit for
truncate too, since you obviously can; blaming the vm is a red herring.
If the vm has an issue, truncate has always had an issue in every kernel
out there since 1997 (the first time I remember).
Unless it crashes the machine I don't care; it's totally wrong in my
opinion to hurt everything useful just to save cpu while running an
exploit. There are easier ways to waste cpu (rewrite the exploit with
truncate, please!!!)
* Andrea Arcangeli <[email protected]> wrote:
> http://www.oracle.com/apps_benchmark/html/index.html?0325B_Report1.html
OASB is special and pushes the DB less than e.g. TPC-C does. How big was
the SGA? I bet the setup didn't have use_indirect_data_buffers=true.
(OASB is not a full-disclosure benchmark so I have no way to check
this.) All you have proven is that workloads with a limited number of
per-inode vmas can perform well. Which completely ignores my point.
Ingo
On Tue, Mar 09, 2004 at 11:52:26AM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > This patch avoids the allocation of rmap for shared memory and it uses
> > the objrmap framework to do find the mapping-ptes starting from a
> > page_t which is zero memory cost, (and zero cpu cost for the fast
> > paths)
>
> this patch locks up the VM.
>
> To reproduce, run the attached, very simple test-mmap.c code (as
> unprivileged user) which maps 80MB worth of shared memory in a
> finegrained way, creating ~19K vmas, and sleeps. Keep this process
> around.
>
> Then try to create any sort of VM swap pressure. (start a few desktop
> apps or generate pagecache pressure.) [the 500 MHz P3 system i tried
> this on has 256 MB of RAM and 300 MB of swap.]
>
> stock 2.6.4-rc2-mm1 handles it just fine - it starts swapping and
> recovers. The system is responsive and behaves just fine.
>
> with 2.6.4-rc2-mm1 + your objrmap patch the box in essence locks up and
> it's not possible to do anything. The VM is looping within the objrmap
> functions. (a sample trace attached.)
>
> Note that the test-mmap.c app does nothing that a normal user cannot do.
> In fact it's not even hostile - it only has lots of vmas but is
> otherwise not actively pushing the VM, it's just sleeping. (Also, the
> test is a very far cry from Oracle's workload of gigabytes of shm mapped
> in a finegrained way to hundreds of processes.) All in one, currently i
> believe the patch is pretty unacceptable in its present form.
This doesn't lock up for me (on 2.6 + objrmap), but the machine is not
responsive while pushing 1G into swap. Here is a trace from the middle
of the swapping; pressing Ctrl-C on your program doesn't respond for
half a minute. Would you mind leaving it running a bit longer before
claiming a lockup?
1 206 615472 4032 84 879332 11248 16808 16324 16808 2618 20311 0 43 0 57
1 204 641740 1756 96 878476 2852 16980 4928 16980 5066 60228 0 35 1 64
1 205 650936 2508 100 875604 2248 9928 3772 9928 1364 21052 0 34 2 64
2 204 658212 2656 104 876904 3564 12052 4988 12052 2074 19647 0 32 1 67
1 204 674260 1628 104 878528 3236 12924 5608 12928 2062 27114 0 47 0 53
1 204 678248 1988 96 879004 3540 4664 4360 4664 1988 20728 0 31 0 69
1 203 683748 4024 96 878132 2844 5036 3724 5036 1513 18173 0 38 0 61
0 206 687312 1732 112 879056 3396 4260 4424 4272 1704 13222 0 32 0 68
1 204 690164 1936 116 880364 2844 3400 3496 3404 1422 18214 0 35 0 64
0 205 696572 4348 112 877676 2956 6620 3788 6620 1281 11544 0 37 1 62
0 204 699244 4168 108 878272 3140 3528 3892 3528 1467 11464 0 28 0 72
1 206 704296 1820 112 878604 2576 4980 3592 4980 1386 11710 0 26 0 74
1 205 710452 1972 104 876760 2256 6684 3092 6684 1308 20947 0 34 1 66
2 203 714512 1632 108 877564 2332 4876 3068 4876 1295 9792 0 20 0 80
0 204 719804 3720 112 878128 2536 6352 3100 6368 1441 20714 0 39 0 61
124 200 724708 1636 100 879548 3376 5308 3912 5308 1516 20732 0 38 0 62
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 204 730908 4344 100 877528 2592 6356 3672 6356 1819 15894 0 35 0 65
0 204 733556 3836 104 878256 2312 3132 3508 3132 1294 10905 0 33 0 67
0 205 736380 3388 100 877376 3084 3364 3832 3364 1322 11550 0 30 0 70
1 206 747016 2032 100 877760 2780 13144 4272 13144 1564 17486 0 37 0 63
1 205 756664 2192 96 878004 1704 7704 2116 7704 1341 20056 0 32 0 67
9 203 759084 3200 92 878516 2748 3168 3676 3168 1330 18252 0 45 0 54
0 205 761752 3928 96 877208 2604 2984 3284 2984 1330 10395 0 35 0 65
Most of the time is spent in "wa", though it's a 4-way, so it means at
least two cpus are spinning. I'm pushing the box hard into swap. 2.6
swaps extremely slowly w/ or w/o objrmap; not much difference really w/
or w/o your exploit.
Now the Ctrl-C hit, and I got the prompt back, no lockup.
Note that my swap workload was very heavy too, with 200 tasks all
swapping in the shm segment, so stalls have to be expected.
And if Oracle really mlocks the ram (required anyway if you use rmap, as
you admitted) this is a non-issue for Oracle.
As Andrew said, we have room for improvements too; just checking
page_mapped in the middle of the vma walk (to break out of it) will make
a lot of difference in the average cpu cost.
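As an illustration of that page_mapped-style check, here is a
self-contained userspace sketch (the structures and helper are
simplified stand-ins, not code from the posted patch): the walk breaks
out as soon as the page has no mappings left, instead of unconditionally
visiting every vma on the inode's list.

#include <stdio.h>

struct page { int mapcount; };
struct vma  { int maps_this_page; struct vma *i_mmap_next; };

/* stand-in for unmapping the page's pte inside one vma, if it has one */
static void try_to_unmap_one(struct page *page, struct vma *vma)
{
	if (vma->maps_this_page) {
		vma->maps_this_page = 0;
		page->mapcount--;
	}
}

/* walk the inode's vma list, but stop once the page is fully unmapped */
static unsigned long try_to_unmap_obj(struct page *page, struct vma *i_mmap)
{
	unsigned long visited = 0;
	struct vma *vma;

	for (vma = i_mmap; vma; vma = vma->i_mmap_next) {
		visited++;
		try_to_unmap_one(page, vma);
		if (!page->mapcount)	/* the early exit discussed above */
			break;
	}
	return visited;
}

int main(void)
{
	static struct vma vmas[10000];		/* one inode, 10000 vmas */
	struct page page = { .mapcount = 3 };	/* mapped by only 3 of them */
	unsigned long i;

	for (i = 0; i < 9999; i++)
		vmas[i].i_mmap_next = &vmas[i + 1];
	vmas[100].maps_this_page = vmas[2500].maps_this_page =
		vmas[6000].maps_this_page = 1;

	/* stops after the last mapper instead of walking all 10000 vmas */
	printf("visited %lu of 10000 vmas\n", try_to_unmap_obj(&page, vmas));
	return 0;
}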
On Tue, Mar 09, 2004 at 12:02:33PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > To reproduce, run the attached, very simple test-mmap.c code (as
> > unprivileged user) which maps 80MB worth of shared memory in a
> > finegrained way, creating ~19K vmas, and sleeps. Keep this process
> > around.
>
> or run the attached test-mmap2.c code, which simulates a very small DB
> app using only 1800 vmas per process: it only maps 8 MB of shm and
> spawns 32 processes. This has an even more lethal effect than the
> previous code.
This uses more cpu than the previous one, but no other differences.
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
8 1 14660 978284 972 17016 387 350 453 353 308 1129 1 12 48 40
33 1 14660 759788 972 231692 0 0 0 0 1087 16282 12 88 0 0
40 2 14660 655220 972 332332 0 0 0 0 1087 96 15 85 0 0
47 0 14660 562372 972 421208 0 0 0 0 1086 97 15 85 0 0
52 0 14660 476412 1048 502656 76 0 300 0 1119 267 17 83 0 0
55 0 14660 397092 1064 578256 0 0 0 112 1089 97 15 85 0 0
62 1 14660 332844 1064 638436 0 0 40 0 1088 95 17 83 0 0
68 0 14648 260072 1072 707732 0 0 76 0 1093 179 15 85 0 0
68 0 14648 198184 1072 765804 0 0 0 0 1088 82 16 84 0 0
75 0 14648 136496 1072 823468 0 0 0 0 1086 84 16 84 0 0
75 0 14648 98544 1072 857604 0 0 0 0 1087 71 17 83 0 0
82 0 14648 30732 1084 921376 0 0 0 76 1089 90 16 84 0 0
84 5 14648 2104 444 947844 0 76 0 192 1130 76 15 85 0 0
83 27 18028 2464 228 944140 428 3216 428 3216 1142 577 10 90 0 0
71 75 22800 3744 224 943120 912 5560 1168 5560 1502 3351 6 91 0 3
82 55 25424 3464 236 940624 760 2848 856 2848 1222 764 12 88 0 0
84 59 27012 2104 240 939040 1128 1796 1172 1796 1182 762 10 90 0 0
73 80 29308 2480 164 938476 2364 3212 2364 3236 1526 5685 4 74 0 21
81 81 33296 2656 144 937492 2456 4920 3356 4920 2275 7953 2 62 0 36
81 81 36172 2576 144 935168 3300 4484 4364 4484 1751 5622 5 86 0 9
88 83 38828 2884 136 933532 2592 3828 3376 3828 1690 8162 1 57 0 42
62 84 42196 3368 132 932992 1472 3864 2136 3864 1291 4127 4 78 0 18
74 71 46624 3660 112 929492 1828 4972 2916 4972 1395 3104 6 83 0 11
1 89 48572 2920 112 929436 2036 2852 2752 2852 1355 5284 8 76 0 16
31 86 52428 3588 104 926436 1416 4432 1620 4432 1253 4271 0 43 0 57
46 86 58288 1988 108 926460 1740 6644 2872 6644 1309 5233 9 88 0 3
56 87 61452 2376 96 927664 2332 4032 3460 4032 1443 9227 1 73 0 26
3 118 73588 2484 88 924492 4128 14928 5576 14928 2357 33401 0 59 1 40
36 137 78656 2532 88 925692 1804 4356 2520 4356 1420 29642 0 60 2 37
1 153 80380 2180 88 926112 2676 5644 3700 5644 1798 17355 0 77 0 22
90 170 86396 2588 88 925000 3104 4208 3872 4208 2179 33189 0 76 0 24
58 174 90768 2172 88 925016 4816 5624 6600 5624 2884 31681 0 75 0 24
82 179 94680 2912 88 923424 8772 10016 10568 10016 2625 30269 0 74 0 26
14 184 101480 2260 84 923388 4752 5992 6456 5992 4369 49544 0 70 0 30
3 206 110620 2208 92 921608 8396 12016 11276 12016 4993 81573 0 71 0 29
2 207 114788 2984 88 921720 2196 5180 3348 5180 1423 18939 0 62 0 38
13 204 117344 2348 88 923060 3960 3608 5276 3608 2807 20612 0 90 0 10
145 202 123920 2092 88 922752 9108 11316 12584 11316 3131 34221 0 72 0 28
3 206 131008 2024 84 920800 7948 10888 9828 10888 5424 57225 0 78 0 21
2 207 140124 2144 88 922312 8968 9368 12512 9368 6789 75225 0 71 0 28
37 208 148108 2468 80 921396 14540 15120 20632 15120 8226 74565 0 82 0 18
4 205 157620 2184 108 921120 5592 7908 8468 7908 5713 56264 0 72 0 28
2 206 160540 2792 100 920836 2132 3736 4312 3736 1752 13193 0 79 0 21
2 207 168176 2564 96 920332 10680 14340 14300 14340 5805 46868 0 81 0 19
195 207 183436 2684 88 919632 9056 13756 14824 13756 7322 73112 0 74 0 26
1 210 188696 2152 108 920092 5620 8792 9124 8816 2539 30646 0 65 0 35
2 205 196888 2760 92 918844 4584 6512 6128 6512 3842 47524 0 69 0 31
123 203 198992 2648 92 919996 2776 3292 3564 3292 1637 17687 0 77 0 23
2 204 203276 2100 92 919012 2848 5100 4092 5100 1682 20360 0 57 0 42
2 206 206956 2244 84 921068 6724 7744 10060 7744 3257 25261 0 80 0 20
4 205 218928 2612 96 917692 10124 13580 13968 13580 6812 57570 0 79 0 20
1 205 226656 1948 96 919004 7460 10504 9888 10504 4342 78518 0 62 0 38
2 204 235688 2292 96 918884 4640 7472 6540 7472 2570 31259 0 63 0 37
1 206 239712 2244 92 919104 2348 3436 3060 3436 1542 12147 0 69 0 30
No lockup at all. The swap rate wasn't horrible either.
Anyway, we should try again after we've made the code smarter; there's
some room for improvement. And if page_referenced is being hit more
frequently, there may be a fundamental issue in the caller, not in the
method. Could be that we call it too frequently. We can also join the
two things into a single pass, so we don't call it twice if none of the
ptes is young. Currently we'd call it twice before we run the unmap
pass, i.e. three walks in total; if we could free with two passes we
would reduce the overhead by 33%.
Overall this is working not too badly for me; I can stop any task fine
and things keep running. As soon as the swap load stops, the cpu goes
back to idle.
Note that even if we have to swap several gigabytes, most of the time
the period during which we "swap" those gigabytes is pretty small. A
machine constantly swapping several gigabytes in a loop would be hardly
usable; what matters is that the box is fast on the _workingset_ after
the unused part of the memory has been moved into swap, and wasting gigs
of ram on pte_chains will be worse on 64bit than using more cpu while
moving the unused part of ram into swap. If moving into swap is slow,
it's not a big problem. If the machine thrashes all the time like in the
above, then there's little hope that it will perform well, w/ or w/o cpu
system load. The important thing is that this cpu load during swap
doesn't destroy the whole address space in a flood like the pagetable
walk does, so the machine remains responsive even if we hit a long vma
walk once in a while to swap 1M.
On Tue, Mar 09, 2004 at 12:49:24PM +0100, Ingo Molnar wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > > or run the attached test-mmap2.c code, which simulates a very small DB
> > > app using only 1800 vmas per process: it only maps 8 MB of shm and
> > > spawns 32 processes. This has an even more lethal effect than the
> > > previous code.
> >
> > Do these tests actually make any forward progress at all, or is it some bug
> > which has sent the kernel into a loop?
>
> i think they make a forward progress so it's more of a DoS - but a very
> effective one, especially considering that i didnt even try hard ...
>
> what worries me is that there are apps that generate such vma patterns
> (for various reasons).
Those vmas in those apps are forced to be mlocked with the rmap VM, so
it's hard for me to buy that rmap is any better. You can't even allow
those vmas to be non-mlocked or you'll run out of zone-normal even with
4:4.
On 64bit those apps will work _absolutely_best_ with objrmap, and
they'll waste tons of ram (and some amount of cpu too) with rmap.
objrmap is the absolute best model for those apps on any 64bit arch.
The arguments you're making about those apps are all in favour of
objrmap IMO.
> I do believe that scanning ->i_mmap & ->i_mmap_shared is fundamentally
> flawed.
If it's the DoS that you worry about, vmtruncate will do the trick too.
Overall the machine remains usable for me, despite the increased cpu load.
* Andrea Arcangeli <[email protected]> wrote:
> > or run the attached test-mmap2.c code, which simulates a very small DB
> > app using only 1800 vmas per process: it only maps 8 MB of shm and
> > spawns 32 processes. This has an even more lethal effect than the
> > previous code.
>
> this use more cpu than the previous one, but no other differences.
How fast is the system you tried this on? If it's faster than the 500
MHz box I tried it on then please try the attached test-mmap3.c. (which
is still not doing anything extreme.)
Ingo
* Ingo Molnar <[email protected]> wrote:
> > this use more cpu than the previous one, but no other differences.
>
> how fast is the system you tried this on? If it's faster than the 500
> MHz box i tried it on then please try the attached test-mmap3.c.
> (which is still not doing anything extreme.)
Also, please run it on a UP kernel.
Ingo
* Andrea Arcangeli <[email protected]> wrote:
> > could you just try test-mmap2.c on such a box, and hit swap?
> Unless it crashes the machine I don't care, it's totally wrong in my
> opinion to hurt everything useful to save cpu while running an
> exploit. there are easier ways to waste cpu (rewrite the exploit with
> truncate please!!!)
I'm not sure I follow. "truncate being slow" is not the same order of
magnitude of a problem as "the VM being incapable of getting work done".
Ingo
On Tue, Mar 09, 2004 at 04:36:20PM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > http://www.oracle.com/apps_benchmark/html/index.html?0325B_Report1.html
>
> OASB is special and pushes the DB less than e.g. TPC-C does. How big was
> the SGA? I bet the setup didnt have use_indirect_data_buffers=true.
I don't know the answer to this, but it was the usual top configuration
with the very large memory model; I'm not aware of a superior config on
x86, and this triggered bugs for the first time ever, which could mean
we were the first ever to push the db that far.
> (OASB is not a full-disclosure benchmark so i have no way to check
> this.) All you have proven is that workloads with a limited number of
> per-inode vmas can perform well. Which completely ignores my point.
What is your point, that OASB is a worthless workload and the only thing
that matters is TPC-C? Maybe you should discuss your point with Oracle,
not with me, since I don't know what the two benchmarks are doing
differently. TPC-C was tested too of course, but maybe not on 32G boxes;
frankly I thought OASB was harder than TPC-C, as I think Martin
mentioned too two days ago.
About the limited number of vmas per inode, that's not the case: there
were tons of vmas allocated, at least a 512m SGA window per each of the
7.5k tasks; in fact without the vma-file-merging there's no way to make
that work. But the number of vmas isn't relevant with 2.6 and
remap_file_pages, so whatever your point is, I don't see it.
On Tue, Mar 09, 2004 at 05:10:51PM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > > could you just try test-mmap2.c on such a box, and hit swap?
>
> > Unless it crashes the machine I don't care, it's totally wrong in my
> > opinion to hurt everything useful to save cpu while running an
> > exploit. there are easier ways to waste cpu (rewrite the exploit with
> > truncate please!!!)
>
> i'm not sure i follow. "truncate being slow" is not the same order of
> magnitude of a problem as "the VM being incapable of getting work done".
The vm has limits, with rmap or without; if you ask to map 1 million
vmas in the same task on a 64bit arch, the rbtree will slow down to a
crawl too. The vm is a trade-off, we have to optimize for good apps.
On Tue, Mar 09, 2004 at 05:07:09PM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > > or run the attached test-mmap2.c code, which simulates a very small DB
> > > app using only 1800 vmas per process: it only maps 8 MB of shm and
> > > spawns 32 processes. This has an even more lethal effect than the
> > > previous code.
> >
> > this use more cpu than the previous one, but no other differences.
>
> how fast is the system you tried this on? If it's faster than the 500
Xeon 4-way, 2.5GHz
> MHz box i tried it on then please try the attached test-mmap3.c. (which
> is still not doing anything extreme.)
It's not attached, but I guess I can hack mmap2 myself too by just
increasing the number of tasks and the number of mmaps ;).
But before doing more tests I think I will finish my anon_vma work and
the objrmap optimizations, then I can concentrate on the testing. At
the moment we already know various bits that can be optimized, so I
prefer to get those implemented first.
Another important thing is that we should have a reschedule point for
every different page we unmap; I'm not sure if that's the case right now
(I didn't concentrate much on the callers yet).
On Tue, Mar 09, 2004 at 05:08:07PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > > this use more cpu than the previous one, but no other differences.
> >
> > how fast is the system you tried this on? If it's faster than the 500
> > MHz box i tried it on then please try the attached test-mmap3.c.
> > (which is still not doing anything extreme.)
>
> also, please run it on an UP kernel.
I will, thanks for the hint.
On Tue, 9 Mar 2004, Ingo Molnar wrote:
> i think they make a forward progress so it's more of a DoS - but a very
> effective one, especially considering that i didnt even try hard ...
Ugh. I kind of like objrmap and things may be fixable...
> what worries me is that there are apps that generate such vma patterns
> (for various reasons).
>
> I do believe that scanning ->i_mmap & ->i_mmap_shared is fundamentally
> flawed.
Andrea may want to try a kd-tree instead of the linked
lists; that could well fix the problem you're running
into.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
> what is your point, that OASB is a worthless workload and the only thing
> that matters is TPC-C? Maybe you should discuss your point with Oracle
> not with me, since I don't know what the two benchmarks are doing
> differently. TCP-C was tested too of course, but maybe not in 32G boxes,
> frankly I thought OASB was harder than TCP-C, as I think Martin
> mentioned too two days ago.
OASB seems harder on the VM than TPC-C, yes. It seems to create thousands
of processes, and fill the user address space up completely as well (2GB
shared segments or whatever).
M.
> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Martin J. Bligh
> Sent: Tuesday, March 09, 2004 12:23 PM
> To: Andrea Arcangeli; Ingo Molnar
> Cc: Arjan van de Ven; Linus Torvalds; Andrew Morton; linux-
> [email protected]
> Subject: Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4
> in <=16G machines)
>
> > what is your point, that OASB is a worthless workload and the only thing
> > that matters is TPC-C? Maybe you should discuss your point with Oracle
> > not with me, since I don't know what the two benchmarks are doing
> > differently. TPC-C was tested too of course, but maybe not in 32G boxes,
> > frankly I thought OASB was harder than TPC-C, as I think Martin
> > mentioned too two days ago.
>
> OASB seems harder on the VM than TPC-C, yes. It seems to create thousands
> of processes, and fill the user address space up completely as well (2GB
> shared segments or whatever).
>
> M.
>
Both the OASB and TPC-C workloads put pressure on the VM subsystem, but
in different ways.
The OASB environment has a small (compared to TPC-C) shared memory area,
but thousands of Oracle user processes will be created that attach to
this shared memory area. The goal here is to push the maximum number of
users onto the server.
The TPC-C environment will have a very large shared memory area
(typically the maximum a machine will allow) that may generate a large
number of vmas. However, there are very few (maybe a few hundred) Oracle
user processes.
Experience has been that the OASB benchmarks will tend to push the VM
into system lockup conditions more than TPC-C.
Andy
On Tue, Mar 09, 2004 at 12:22:07PM -0500, Rik van Riel wrote:
> Andrea may want to try a kd-tree instead of the linked
> lists, that could well fix the problem you're running
> into.
Yep.
Martin's idea of splitting the i_mmap into multiple lists, each covering
a certain range, is one of the possibilities for making objrmap scale.
We've got lots of room for improvement.
The basic idea of objrmap vs rmap is that one single object (the vma)
allows us to index tons and tons of ptes, instead of requiring the
per-pte overhead of the pte_chains.
Right now we're not very efficient at finding the "interesting vmas",
especially for file mappings, but we can make that more finegrained over
time. For the anon_vma work I'm doing, that's already quite well
finegrained, since it's as if they all belong to different inodes, so
the problem is minor there.
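As a toy illustration of the range-splitting idea (an assumption about
how it could look, not code from any posted patch), the vmas of an inode
can be linked onto several sub-lists keyed by the file-page range they
cover, so the lookup for one page only walks a short sub-list instead of
the whole i_mmap:

#include <stdio.h>
#include <stdlib.h>

#define NR_BUCKETS   64
#define BUCKET_SPAN  256UL	/* file pages per bucket (wraps via %) */

struct vma {
	unsigned long pgoff_start, pgoff_end;
};

struct bucket_node {
	struct vma *vma;
	struct bucket_node *next;
};

static struct bucket_node *buckets[NR_BUCKETS];

static unsigned long bucket_of(unsigned long pgoff)
{
	return (pgoff / BUCKET_SPAN) % NR_BUCKETS;
}

static void link_vma(struct vma *vma)
{
	unsigned long b;

	/* link the vma into every bucket its file range overlaps */
	for (b = vma->pgoff_start / BUCKET_SPAN;
	     b <= (vma->pgoff_end - 1) / BUCKET_SPAN; b++) {
		struct bucket_node *n = malloc(sizeof(*n));
		n->vma = vma;
		n->next = buckets[b % NR_BUCKETS];
		buckets[b % NR_BUCKETS] = n;
	}
}

static void lookup(unsigned long pgoff)
{
	unsigned long visited = 0, mappers = 0;
	struct bucket_node *n;

	for (n = buckets[bucket_of(pgoff)]; n; n = n->next) {
		visited++;
		if (pgoff >= n->vma->pgoff_start && pgoff < n->vma->pgoff_end)
			mappers++;
	}
	printf("pgoff %lu: %lu mapper(s), %lu vmas visited\n",
	       pgoff, mappers, visited);
}

int main(void)
{
	unsigned long i;

	/* 20000 disjunct single-page vmas, as in the fine-grained shm test */
	for (i = 0; i < 20000; i++) {
		struct vma *v = malloc(sizeof(*v));
		v->pgoff_start = i;
		v->pgoff_end = i + 1;
		link_vma(v);
	}
	lookup(12345);	/* walks only one short sub-list, not all 20000 vmas */
	return 0;
}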
* Andrea Arcangeli <[email protected]> wrote:
> > > how fast is the system you tried this on? If it's faster than the 500
> > > MHz box i tried it on then please try the attached test-mmap3.c.
> > > (which is still not doing anything extreme.)
> >
> > also, please run it on an UP kernel.
>
> I will, thanks for the hint.
test-mmap3.c attached. It locked up my UP box so hard that I couldn't
even switch consoles - I turned the box off after 30 minutes.
* Andrea Arcangeli <[email protected]> wrote:
> > (OASB is not a full-disclosure benchmark so i have no way to check
> > this.) All you have proven is that workloads with a limited number of
> > per-inode vmas can perform well. Which completely ignores my point.
>
> what is your point, that OASB is a worthless workload and the only
> thing that matters is TPC-C? [...]
Not at all. I pointed out specific workloads that create tons of vmas,
which would perform very badly when swapping. OASB is not one of those
workloads. [I could also mention UML, which currently creates a vma per
virtualized page and which, with a low-end UML setup, generates tens of
thousands of vmas as well.]
(If the linear search is fixed then I have no objections, but for the
current code to hit any mainline kernel we would first need to redefine
'enterprise quality'. My main worry is that we are now at a dozen emails
on this topic and you still don't seem to be aware of the severity of
this quality-of-implementation problem.)
Sure, remap_file_pages() fixes such problems - while I'm happy if more
people use remap_file_pages(), apps are not (and should not be) forced
to use remap_file_pages(), and I refuse to concede that the VM must
inevitably get wedged by just a couple of thousand vmas created on a
256 MB 500 MHz box ... I don't know how to put this point in a simpler
way. This stuff must not be added (to mainline) until it can take the
load.
Ingo
On Tue, Mar 09, 2004 at 08:57:52PM +0100, Ingo Molnar wrote:
> 'enterprise quality'. My main worry is that we are now at a dozen emails
> regarding this topic and you still dont seem to be aware of the severity
> of this quality of implementation problem.)
The quality of the objrmap patch is still better than rmap's. The DoS
thing is doable with vmtruncate too, in any kernel out there.
Merging objrmap is the first step. Any other effort happens on top of
it.
I never said the work was finished with objrmap; I said from the first
place that it was the building block.
> way. This stuff must not be added (to mainline) until it can take the
> load.
Mainline is worthless without objrmap even if you don't run into swap;
at least with objrmap it works unless you push the machine into swap.
On Tue, Mar 09, 2004 at 05:03:07PM +0100, Andrea Arcangeli wrote:
> those vmas in those apps are forced to be mlocked with the rmap VM, so
> it's hard for me to buy that rmap is any better. You can't even allow
Btw, try your exploit while keeping the stuff mlocked. You'll see we
already stop following the i_mmap the first time we run into a VM_LOCKED
vma. We could be even more efficient by removing mlocked pages from the
lru, but it's definitely not required to get that workload right, and
that workload needs mlock with rmap anyway, to remove the pte_chains!
So even now objrmap seems a lot better than rmap for that workload: it
doesn't even require mlock, it only requires it if you want to pageout
heavily (rmap requires it regardless of whether you pageout or not). And
it can be fixed with an rbtree at worst, while the rmap overhead is not
fixable (other than by removing rmap entirely like I'm doing).
BTW, my current anon_vma work is going really well; the code is so much
nicer, and it's quite a bit smaller too.
include/linux/mm.h | 76 +++
include/linux/objrmap.h | 74 +++
include/linux/page-flags.h | 4
include/linux/rmap.h | 53 --
init/main.c | 4
mm/memory.c | 15
mm/mmap.c | 4
mm/nommu.c | 2
mm/objrmap.c | 480 +++++++++++++++++++++++
mm/page_alloc.c | 6
mm/rmap.c | 908 ---------------------------------------------
12 files changed, 636 insertions(+), 990 deletions(-)
And this doesn't remove all the pte_chains everywhere yet.
objrmap.c already seems fully complete; what's missing now is the
removal of all the pte_chains from memory.c and friends, and later the
anon_vma tracking with fork and munmap (I've only covered
do_anonymous_page so far; see how cool it looks now):
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
unsigned long addr)
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
int ret;
/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
/* ..except if it's a write access */
if (write_access) {
/* Allocate our own private page. */
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
page = alloc_page(GFP_HIGHUSER);
if (!page)
goto no_mem;
clear_user_highpage(page, addr);
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);
if (!pte_none(*page_table)) {
pte_unmap(page_table);
page_cache_release(page);
spin_unlock(&mm->page_table_lock);
ret = VM_FAULT_MINOR;
goto out;
}
mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
vma->vm_page_prot)),
vma);
lru_cache_add_active(page);
mark_page_accessed(page);
SetPageAnon(page);
}
set_pte(page_table, entry);
/* ignores ZERO_PAGE */
page_add_rmap(page, vma);
pte_unmap(page_table);
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
spin_unlock(&mm->page_table_lock);
ret = VM_FAULT_MINOR;
goto out;
no_mem:
ret = VM_FAULT_OOM;
out:
return ret;
}
no pte_chains anywhere.
and here the page_add_rmap from objrmap.c:
/* this needs the page->flags PG_map_lock held */
static void inline anon_vma_page_link(struct page * page,
				      struct vm_area_struct * vma)
{
SetPageDirect(page);
page->as.vma = vma;
}
/**
* page_add_rmap - add reverse mapping entry to a page
* @page: the page to add the mapping to
* @vma: the vma that is covering the page
*
* Add a new pte reverse mapping to a page.
* The caller needs to hold the mm->page_table_lock.
*/
void fastcall
page_add_rmap(struct page *page, struct vm_area_struct * vma)
{
if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
return;
page_map_lock(page);
if (!page->mapcount++)
inc_page_state(nr_mapped);
if (PageAnon(page))
anon_vma_page_link(page, vma);
else {
/*
* If this is an object-based page, just count it.
* We can find the mappings by walking the object
* vma chain for that object.
*/
BUG_ON(!page->as.mapping);
BUG_ON(PageSwapCache(page));
}
page_map_unlock(page);
}
here page_remove_rmap:
/* this needs the page->flags PG_map_lock held */
static void inline anon_vma_page_unlink(struct page * page)
{
/*
* Cleanup if this anon page is gone
* as far as the vm is concerned.
*/
if (!page->mapcount) {
page->as.vma = 0;
#if 0
/*
* The above clears page->as.anon_vma too
* if the page wasn't direct.
*/
page->as.anon_vma = 0;
#endif
ClearPageDirect(page);
}
}
/**
* page_remove_rmap - take down reverse mapping to a page
* @page: page to remove mapping from
*
 * Removes a reverse pte mapping from the page,
* after that the caller can clear the page table entry and free
* the page.
* Caller needs to hold the mm->page_table_lock.
*/
void fastcall page_remove_rmap(struct page *page)
{
if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
return;
page_map_lock(page);
if (!page_mapped(page))
goto out_unlock;
if (!--page->mapcount)
dec_page_state(nr_mapped);
if (PageAnon(page))
		anon_vma_page_unlink(page);
else {
/*
* If this is an object-based page, just uncount it.
* We can find the mappings by walking the object vma
* chain for that object.
*/
BUG_ON(!page->as.mapping);
BUG_ON(PageSwapCache(page));
}
page_map_unlock(page);
return;
}
here the paging code that unmaps the ptes:
static int
try_to_unmap_anon(struct page * page)
{
int ret = SWAP_AGAIN;
page_map_lock(page);
if (PageDirect(page)) {
ret = try_to_unmap_inode_one(page->as.vma, page);
} else {
struct vm_area_struct * vma;
anon_vma_t * anon_vma = page->as.anon_vma;
list_for_each_entry(vma, &anon_vma->anon_vma_head, anon_vma_node) {
ret = try_to_unmap_inode_one(vma, page);
if (ret == SWAP_FAIL || !page->mapcount)
goto out;
}
}
out:
page_map_unlock(page);
return ret;
}
/**
* try_to_unmap - try to remove all page table mappings to a page
* @page: the page to get unmapped
*
* Tries to remove all the page table entries which are mapping this
* page, used in the pageout path. Caller must hold the page lock
 * and the page_map_lock (PG_maplock). Return values are:
*
* SWAP_SUCCESS - we succeeded in removing all mappings
* SWAP_AGAIN - we missed a trylock, try again later
* SWAP_FAIL - the page is unswappable
*/
int fastcall try_to_unmap(struct page * page)
{
int ret = SWAP_SUCCESS;
/* This page should not be on the pageout lists. */
BUG_ON(PageReserved(page));
BUG_ON(!PageLocked(page));
/*
* We need backing store to swap out a page.
* Subtle: this checks for page->as.anon_vma too ;).
*/
BUG_ON(!page->as.mapping);
if (!PageAnon(page))
ret = try_to_unmap_inode(page);
else
ret = try_to_unmap_anon(page);
if (!page_mapped(page)) {
dec_page_state(nr_mapped);
ret = SWAP_SUCCESS;
}
return ret;
}
In my first attempt I was nuking page->mapcount++ (that's pure locking
overhead for the file mappings and it wastes 4 bytes per page_t), but
then I backed off since nr_mapped was expanding everywhere in the vm and
the modifications were growing too fast at the same time, so I'll think
about it later; for now I will do anon_vma only, plus the nonlinear
pagetable walk, so the work is as self-contained as possible and it'll
drop all pte_chains from the kernel.
The only single reason I need page->mapcount is that if the page is an
inode mapping, page->as.mapping won't be enough to tell if it was
already mapped or not. So my current anon_vma patch (incremental with
objrmap) only reduces the page_t by 4 bytes compared to mainline 2.4 and
mainline 2.6.
With PageDirect and the page->as.vma field I'm deferring _all_ anon_vma
object allocations to fork(); even when a MAP_PRIVATE vma is already
tracked by an inode and by an anon_vma (generated by an old fork), newly
allocated anonymous pages are still "direct". So the same vma will have
direct anon pages, anon_vma indirect cow pages, and finally it will have
inode pages too (readonly/writeprotected). I plan to teach the cow fault
to convert anon_vma indirect pages to direct pages if page->mapcount ==
1 (no, I don't strictly need page->mapcount for that, I could use
page_count, but since I've got page->mapcount I use it so the unlikely
races are converted to direct mode too). However a vma can't revert to
"direct"; only the page can go back to direct. The reason is that I've
no way to reach _only_ the pages pointing to an anon_vma starting from
the vma (the only way would be a pagetable walk, but I don't want to do
that, and leaving the anon_vma around is perfectly fine; I will garbage
collect it when the vma goes away too).
Overall this means anonymous page faults will be blazing fast, with no
allocation ever in the fast paths; just fork will have to allocate 12
more bytes per anonymous vma to track the cows (not a big deal compared
to 8 bytes per pte with rmap ;).
Here below (most important of all to understand my proposed anon_vma
design) is a preview of the data structure layout.
I think this is close to DaveM's original approach to handling anonymous
memory, though the last time I read his patch was a few years ago so I
don't remember exactly; the only thing I remember (because I disliked
it) was that he was doing slab allocations from page faults, something
that I definitely want to avoid with the highest priority. Hugh's
approach as well was not usable, since it was tracking the mm and it
broke with mremap, unfortunately.
The way I designed the garbage collection of the transient anon_vma
objects is, I think, also extremely optimized: I don't need a list of
pages or a counter of the pages, I simply garbage collect the anon_vma
during vma destruction, checking vma->anon_vma &&
list_empty(&vma->anon_vma->anon_vma_head). I use the invariant that for
a page to point to an anon_vma there must be a vma still queued in the
anon_vma. That should work reliably, and it allows me to only point to
anon_vmas from pages; I never know from an anon_vma (or a vma) whether
any page is pointing to it (I only need to know that no page is pointing
to it when no vma is queued into the anon_vma).
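A minimal userspace sketch of that garbage-collection invariant
(illustration only, not the patch's code): pages may point at an
anon_vma only while at least one vma is queued on it, so the anon_vma
can be freed as soon as the last vma is unlinked, with no page counting.

#include <stdio.h>
#include <stdlib.h>

struct anon_vma;

struct vma {
	struct anon_vma *anon_vma;
	struct vma *anon_vma_next, *anon_vma_prev;	/* node on the anon_vma list */
};

struct anon_vma {
	struct vma *head;				/* list of "related" vmas */
};

static void anon_vma_unlink(struct vma *vma)
{
	struct anon_vma *anon_vma = vma->anon_vma;

	if (!anon_vma)
		return;

	/* remove the vma from the anon_vma's list */
	if (vma->anon_vma_prev)
		vma->anon_vma_prev->anon_vma_next = vma->anon_vma_next;
	else
		anon_vma->head = vma->anon_vma_next;
	if (vma->anon_vma_next)
		vma->anon_vma_next->anon_vma_prev = vma->anon_vma_prev;

	/*
	 * The invariant: no page can point at this anon_vma unless a vma
	 * is still queued on it, so an empty list means it can be freed.
	 */
	if (!anon_vma->head) {
		free(anon_vma);
		printf("anon_vma freed at last vma unlink\n");
	}
}

int main(void)
{
	struct anon_vma *av = calloc(1, sizeof(*av));
	struct vma parent = { .anon_vma = av }, child = { .anon_vma = av };

	/* two vmas (e.g. parent and child after fork) share one anon_vma */
	parent.anon_vma_next = &child;
	child.anon_vma_prev = &parent;
	av->head = &parent;

	anon_vma_unlink(&child);	/* list not empty: anon_vma stays */
	anon_vma_unlink(&parent);	/* last vma gone: anon_vma is freed */
	return 0;
}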
It took me a while to design this thing, but now I'm quite happy; I hope
not to find some huge design flaw at the last minute ;). This is why I'm
showing you all this right now, before it's finished: if you see any
design flaw please let me know ASAP, I need this thing working quickly!
Thanks.
--- sles-anobjrmap-2/include/linux/mm.h.~1~ 2004-03-03 06:45:38.000000000 +0100
+++ sles-anobjrmap-2/include/linux/mm.h 2004-03-10 10:25:55.955735680 +0100
@@ -39,6 +39,22 @@ extern int page_cluster;
* mmap() functions).
*/
+typedef struct anon_vma_s {
+ /* This serializes the accesses to the vma list. */
+ spinlock_t anon_vma_lock;
+
+ /*
+ * This is a list of anonymous "related" vmas,
+ * to scan if one of the pages pointing to this
+ * anon_vma needs to be unmapped.
+ * After we unlink the last vma we must garbage collect
+ * the object if the list is empty because we're
+ * guaranteed no page can be pointing to this anon_vma
+ * if there's no vma anymore.
+ */
+ struct list_head anon_vma_head;
+} anon_vma_t;
+
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -69,6 +85,19 @@ struct vm_area_struct {
*/
struct list_head shared;
+ /*
+ * The same vma can be both queued into the i_mmap and in a
+ * anon_vma too, for example after a cow in
+ * a MAP_PRIVATE file mapping. However only the MAP_PRIVATE
+ * will go both in the i_mmap and anon_vma. A MAP_SHARED
+ * will only be in the i_mmap_shared and a MAP_ANONYMOUS (file = 0)
+ * will only be queued only in the anon_vma.
+ * The list is serialized by the anon_vma->lock.
+ */
+ struct list_head anon_vma_node;
+ /* Serialized by the vma->vm_mm->page_table_lock */
+ anon_vma_t * anon_vma;
+
/* Function pointers to deal with this struct. */
struct vm_operations_struct * vm_ops;
@@ -172,16 +201,51 @@ struct page {
updated asynchronously */
atomic_t count; /* Usage count, see below. */
struct list_head list; /* ->mapping has some page lists. */
- struct address_space *mapping; /* The inode (or ...) we belong to. */
unsigned long index; /* Our offset within mapping. */
struct list_head lru; /* Pageout list, eg. active_list;
protected by zone->lru_lock !! */
+
+ /*
+ * Address space of this page.
+ * A page can be either mapped to a file or to be anonymous
+ * memory, so using the union is optimal here. The PG_anon
+ * bitflag tells if this is anonymous or a file-mapping.
+ * If PG_anon is clear we use the as.mapping, if PG_anon is
+ * set and PG_direct is not set we use the as.anon_vma,
+ * if PG_anon is set and PG_direct is set we use the as.vma.
+ */
union {
- struct pte_chain *chain;/* Reverse pte mapping pointer.
- * protected by PG_chainlock */
- pte_addr_t direct;
- int mapcount;
- } pte;
+ /* The inode address space if it's a file mapping. */
+ struct address_space * mapping;
+
+ /*
+ * This points to an anon_vma object.
+ * The anon_vma can't go away under us if
+ * we hold the PG_maplock.
+ */
+ anon_vma_t * anon_vma;
+
+ /*
+ * Before the first fork we avoid anon_vma object allocation
+ * and we set PG_direct. anon_vma objects are only created
+ * via fork(), and the vm then stop using the page->as.vma
+ * and it starts using the as.anon_vma object instead.
+ * After the first fork(), even if the child exit, the pages
+ * cannot be downgraded to PG_direct anymore (even if we
+ * wanted to) because there's no way to reach pages starting
+ * from an anon_vma object.
+ */
+ struct vm_struct * vma;
+ } as;
+
+ /*
+ * Number of ptes mapping this page.
+ * It's serialized by PG_maplock.
+ * This is needed only to maintain the nr_mapped global info
+ * so it would be nice to drop it.
+ */
+ unsigned long mapcount;
+
unsigned long private; /* mapping-private opaque data */
/*
--- sles-anobjrmap-2/include/linux/page-flags.h.~1~ 2004-03-03 06:45:38.000000000 +0100
+++ sles-anobjrmap-2/include/linux/page-flags.h 2004-03-10 10:20:59.324830432 +0100
@@ -69,9 +69,9 @@
#define PG_private 12 /* Has something at ->private */
#define PG_writeback 13 /* Page is under writeback */
#define PG_nosave 14 /* Used for system suspend/resume */
-#define PG_chainlock 15 /* lock bit for ->pte_chain */
+#define PG_maplock 15 /* lock bit for ->as.anon_vma and ->mapcount */
-#define PG_direct 16 /* ->pte_chain points directly at pte */
+#define PG_direct 16 /* if set it must use page->as.vma */
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
* Andrea Arcangeli <[email protected]> wrote:
> btw, try your exploit by keeping the stuff mlocked. [...]
Btw, why do you insist on calling it an 'exploit'? It's a testcase - it
does things that real applications do too.
Ingo
* Andrea Arcangeli <[email protected]> wrote:
> the quality of such objrmap patch is still better than rmap. The DoS
> thing is doable with vmtruncate too in any kernel out there.
objrmap for now has a serious problem: test-mmap3.c locked up my box (I
couldn't switch text consoles for 30 minutes, until I turned the box
off).
I'm sure you'll fix it and I'm looking forward to seeing it. However,
I'd like to see the full fix instead of a promise to have this fixed
sometime in the future. There are valid application workloads that
trigger _worse_ vma patterns than test-mmap3.c does (UML being one such
thing, Oracle with the indirect buffer cache another - I'm sure there
are other apps too). Calling these applications 'exploits' doesn't help
in getting this thing fixed. There's no problem with keeping this
patchset separate until it's regression-free.
> merging objrmap is the first step. Any other effort happens on top of
> it.
I'd like to see that effort combined with this code, and the full
picture. Since this 'DoS property' is created by the current concept of
the patch, it's not a 'bug' that is easily fixed, so we must not (and
cannot) sign up for it blindly, without seeing the full impact. But
yes, it might be fixable. Anyway - the 2.6 kernel is a stable tree and
I'm sure you know that avoiding regressions is more important than
anything else.
Ingo
On Wed, Mar 10, 2004 at 12:35:01PM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > the quality of such objrmap patch is still better than rmap. The DoS
> > thing is doable with vmtruncate too in any kernel out there.
>
> objrmap for now has a serious problem: test-mmap3.c locked up my box (i
> couldnt switch text consoles for 30 minutes when i turned the box off).
>
> I'm sure you'll fix it and i'm looking forward seeing it. However, i'd
> like to see the full fix instead of a promise to have this fixed
> sometime the future. There are valid application workloads that trigger
> _worse_ vma patterns than test-mmap3.c does (UML being one such thing,
> Oracle with indirect buffer-cache another - i'm sure there are other
> apps too.). Calling these applications 'exploits' doesnt help in
> getting this thing fixed. There's no problem with keeping this patchset
> separate until it's regression-free.
>
> > merging objrmap is the first step. Any other effort happens on top of
> > it.
>
> i'd like to see that effort combined with this code, and the full
> picture. Since this 'DoS property' is created by the current concept of
> the patch, it's not a 'bug' that is easily fixed so we must not (and
> cannot) sign up for it blindly, without seeing the full impact. But
> yes, it might be fixable. Anyway - the 2.6 kernel is a stable tree and
> i'm sure you know that avoiding regression is more important than
> anything else.
I'm fine with waiting for the whole work to be finished and merging it
all at once (still as separate incremental patches) instead of merging
it into mainline in steps, and your longer-term confidence in our work
is promising, thanks.
Since I need this fixed fast, I may have to go the rbtree way to be
safe (mainline could go with prio_trees in the long run instead).
However, I still disagree that the objrmap I posted is a regression for
applications like Oracle (dunno about UML). It's an obvious regression
for your test-mmap3.c, and that's why I call test-mmap3.c an exploit and
not a "real app". Nobody would map 1 page per vma; get real, you'll have
a hard time convincing me a real app is going to scatter vmas with a 4k
aperture each. You wrote the very worst case that everybody is aware of;
a real app scenario would not do that. Note that there's normally quite
a huge amount of merging of file-vmas, and you completely prevent that
too.
Furthermore, you said Oracle needs mlock to work "safe" with rmap. But
with 2.6, if you use mlock it will still not work. If you use
2.6+objrmap, mlock will fix your DoS scenario too, and Oracle will work
as fast as rmap+mlock in your rmap 2.4 implementation.
Also, you're advocating "merging in steps" and keeping "2.6 optimal",
but you're ignoring the single reason you are forced to ship a 2.4
kernel with 4:4 for every >4G machine. 2.6 mainline (the current 2.6.3
step) has no way to be compiled with the 4:4 model. So the current great
2.6 kernel has no way to work with any machine >4G (if you ship all PAE
kernels with rmap compiled with 4:4, you must agree 2.6 mainline has no
way to work on any machine with >4G of ram, so you should not be
surprised that I'm dealing with those issues currently). Is 2.6 a
high-end kernel with rmap? I supported 4G on x86 for the first time with
bigmem in 2.2.
Solving the problem by merging 4:4 instead of removing rmap is not the
way to go IMHO since it doesn't fix the memory waste for 64bit archs
compared to what we can do with 2.4 _mainline_ (64bit doesn't need
pte-highmem and there are no highmem issues to solve there).
At least with objrmap applied to 2.6, there would be a chance to survive
the load on >4G boxes in a 2.6 mainline kernel. Sure, you'd better be
careful not to swap out heavily or it would risk hanging badly (if the
app isn't using mlock; if the app uses mlock 2.6 will fly), but without
objrmap it would lock up before you can worry about reaching swap (mlock
or not).
So in practice I think it would have been ok to merge objrmap as an
intermediate step (it's not that I didn't evaluate those possibilities
when I submitted it).
As for the DoS thing in security terms, truncate has the same issue. It
may be easier to kill the "exploit" there, since it returns to userspace
every time and userspace is not swapped out when it happens, but it
would still waste an indefinite amount of time in kernel space. So
providing an efficient means of doing the i_mmap vma lookup is a problem
independent of the objrmap patch for the vm; I think we agree on this.
Doing that will fix all users (so the vm too).
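To make the comparison concrete, here is a simplified sketch of the
linear i_mmap walk that truncate already does today (illustrative only,
not a verbatim copy of the mainline code); the cost is one pass over
every vma mapping the inode, which is exactly the complexity the
objrmap paging path shares:

static void vmtruncate_list_sketch(struct list_head *head, unsigned long pgoff)
{
        struct vm_area_struct *vma;

        list_for_each_entry(vma, head, shared) {
                unsigned long start = vma->vm_start;
                unsigned long len = vma->vm_end - vma->vm_start;
                unsigned long diff;

                /* vma wholly beyond the new end of file: zap it all */
                if (vma->vm_pgoff >= pgoff) {
                        zap_page_range(vma, start, len);
                        continue;
                }

                /* vma straddling the new end of file: zap only the tail */
                diff = pgoff - vma->vm_pgoff;
                if (diff < (len >> PAGE_SHIFT))
                        zap_page_range(vma, start + (diff << PAGE_SHIFT),
                                       len - (diff << PAGE_SHIFT));
        }
}

This runs once for i_mmap and once for i_mmap_shared under
i_shared_sem, just like the page_referenced/try_to_unmap inode methods
in the objrmap patch.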
On Wed, 10 Mar 2004, Andrea Arcangeli wrote:
> On Tue, Mar 09, 2004 at 06:56:50PM +0100, Andrea Arcangeli wrote:
> > We've lot of room for improvements.
>
> Rajesh has a smart idea on how to fix the complexity issue (for both
> truncate and vm) and it involves a new non-trivial data structure.
>
> I trust he will make it, but if there will be any trouble with his
> approach for safety I'm currently planning on a simpler fallback solution
> that I can manage without having to design a new tree data structure.
>
> Sharing his "tree and sorting" idea, the fallback I propose is to simply
> index the vmas in a rbtree too.
That simply results in looking up fewer VMAs for low file
indexes, but still needing to check all of them for high
file indexes.
You really want to sort on both the start and end offset
of the VMA, as can be done with a kd-tree or kdb-tree.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Wed, Mar 10, 2004 at 08:01:15AM -0500, Rik van Riel wrote:
> On Wed, 10 Mar 2004, Andrea Arcangeli wrote:
> > On Tue, Mar 09, 2004 at 06:56:50PM +0100, Andrea Arcangeli wrote:
> > > We've lot of room for improvements.
> >
> > Rajesh has a smart idea on how to fix the complexity issue (for both
> > truncate and vm) and it involves a new non-trivial data structure.
> >
> > I trust he will make it, but if there will be any trouble with his
> > approach for safety I'm currently planning on a simpler fallback solution
> > that I can manage without having to design a new tree data structure.
> >
> > Sharing his "tree and sorting" idea, the fallback I propose is to simply
> > index the vmas in a rbtree too.
>
> That simply results in looking up fewer VMAs for low file
> indexes, but still needing to check all of them for high
> file indexes.
>
> You really want to sort on both the start and end offset
> of the VMA, as can be done with a kd-tree or kdb-tree.
yes. But the single reason for me to even consider using the rbtree was
to avoid having to introduce another data structure and to feel very
safe in terms of risks of memory corruption in the short term ;). The
rbtree is extremely well exercised; that's the only reason I suggested
it. Rajesh is currently working on another data structure that is
efficient at finding a "range" (not sure if it is what you're
suggesting; he called it a prio_tree, a mix between hashes and radix
trees). That's optimal, though in practice the rbtree would work too
(perhaps one could still work out an exploit ;) but real life apps would
definitely be covered by the rbtree too (since all the vmas are of the
same size and they're all naturally aligned).
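Just to illustrate (this is not code from any posted patch; the
i_mmap_rb tree and the shared_rb node are hypothetical fields), keying
an rbtree on vm_pgoff alone would let the lookup stop early only for
low file indexes, which is exactly your point:

static int page_referenced_inode_rb(struct address_space *mapping,
                                    struct page *page)
{
        struct rb_node *node;
        int referenced = 0;

        /* walk the vmas in vm_pgoff order, lowest start offset first */
        for (node = rb_first(&mapping->i_mmap_rb); node; node = rb_next(node)) {
                struct vm_area_struct *vma;

                vma = rb_entry(node, struct vm_area_struct, shared_rb);
                if (vma->vm_pgoff > page->index)
                        break;  /* every later vma starts past this page */
                if (page->index - vma->vm_pgoff <
                    ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT))
                        referenced += page_referenced_inode_one(vma, page);
        }
        return referenced;
}

A prio_tree (or kd-tree) instead indexes on both start and size, so only
the vmas actually covering page->index get visited even for high
offsets.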
Hello,
this is the full current status of my anon_vma work. Now fork() and all
the other page_add/remove_rmap callers in memory.c plus the paging
routines seem fully covered, and I'm now dealing with the vma merging
and the anon_vma garbage collection (the latter is easy but I need to
track all the kmem_cache_free calls).
There is just one minor limitation with the vma merging of anonymous
memory that I didn't consider during the design phase (I figured it out
while coding). In short this is only an issue with the mremap syscall
(and sometimes with mmap too while filling a hole). The vma merging
happening during mmap/brk (not filling a hole) is always going to work
fine, since the newly created vma has vma->anon_vma == NULL and I have
the guarantee from the caller that no page is yet mapped to this vma, so
I can merge it just fine and it'll become part of whatever pre-existing
anon_vma object (after possibly fixing up the vma->pg_off of the newly
created vma).
Only when filling a hole (with mmap or brk) may I be unable to merge the
three anon vmas together, if their pg_off disagrees. However their
pg_off may disagree only if somebody previously used mremap on those
vmas, since I set up the pg_off of anonymous memory in such a way that
if you only use mmap/brk, even filling the holes is guaranteed to do
full merging.
The problem in mremap is not only the pgoff; the problem is that I can
merge anonymous vmas only if (!vma1->anon_vma || !vma2->anon_vma) is
true. If vma1 and vma2 each have a different anon_vma I cannot merge
them together (even if the pg_off agrees), because the pages under vma2
may point to vma2->anon_vma and the pages under vma1 may point to
vma1->anon_vma in their page->as.anon_vma, and there is no way to
efficiently reach the pages pointing to a certain anon_vma. As I said
yesterday, the invariant I use to garbage collect the anon_vma is to
wait for all vmas to be unlinked from the anon_vma, but as long as there
are vmas queued into the anon_vma object I cannot release those anon_vma
objects, and in turn I cannot do merging either.
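A minimal sketch of that merge rule (illustrative only, this helper is
not part of the patch below):

static int can_merge_anon_vmas(struct vm_area_struct *prev,
                               struct vm_area_struct *next)
{
        /* pg_off must stay linear across the boundary */
        if (next->vm_pgoff != prev->vm_pgoff +
            ((prev->vm_end - prev->vm_start) >> PAGE_SHIFT))
                return 0;
        /*
         * If both vmas already have an anon_vma object, pages under
         * each one may already point at its own object through
         * page->as.anon_vma, and there is no efficient way to re-point
         * them, so the merge has to be refused.
         */
        if (prev->anon_vma && next->anon_vma)
                return 0;
        return 1;
}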
The only way to allow 100% merging through mremap would be to have a
list with the head in the anon_vma and the nodes in the page_t; that
would be very easy, but it would waste 4 bytes per page_t for an
hlist_node (the 4-byte waste in the anon_vma is not a problem). And the
merging would be very expensive too, since I would need to run a
for_each_page_in_the_list loop to first fix up all the page->index
values according to the offset between vma1->pg_off and vma2->pg_off,
and second to reset the page->as.anon_vma (or page->as.vma for direct
pages) to point to the other anon_vma (or the other vma for direct
pages).
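The rejected alternative would look roughly like this (sketch only,
none of these extra fields exist in the patch below):

typedef struct anon_vma_s {
        spinlock_t anon_vma_lock;
        struct list_head anon_vma_head; /* vmas, as in the patch below */
        struct hlist_head page_head;    /* extra: every anon page in the object */
} anon_vma_t;

/* plus, inside the page_t: */
        struct hlist_node anon_node;    /* threads the page into page_head */

Full mremap-time merging would then be a walk of page_head fixing up
page->index and page->as.anon_vma, which is the expensive loop described
above.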
So I think I will go ahead with the current data structures despite the
small regression in vma merging. I doubt it's an issue, but please let
me know if you think it is and that I should add an hlist_node to the
page_t and an hlist_head to the anon_vma_t. btw, it's something I can
always do later if it's really necessary. Even with the additional
4 bytes per page_t, the page_t size would not be bigger than in mainline
2.4 and mainline 2.6.
include/linux/mm.h | 79 +++
include/linux/objrmap.h | 66 +++
include/linux/page-flags.h | 4
include/linux/rmap.h | 53 --
init/main.c | 4
kernel/fork.c | 10
mm/Makefile | 2
mm/memory.c | 129 +-----
mm/mmap.c | 9
mm/nommu.c | 2
mm/objrmap.c | 575 ++++++++++++++++++++++++++++
mm/page_alloc.c | 6
mm/rmap.c | 908 ---------------------------------------------
14 files changed, 772 insertions(+), 1075 deletions(-)
--- sles-anobjrmap-2/include/linux/mm.h.~1~ 2004-03-03 06:45:38.000000000 +0100
+++ sles-anobjrmap-2/include/linux/mm.h 2004-03-10 18:59:14.000000000 +0100
@@ -39,6 +39,22 @@ extern int page_cluster;
* mmap() functions).
*/
+typedef struct anon_vma_s {
+ /* This serializes the accesses to the vma list. */
+ spinlock_t anon_vma_lock;
+
+ /*
+ * This is a list of anonymous "related" vmas,
+ * to scan if one of the pages pointing to this
+ * anon_vma needs to be unmapped.
+ * After we unlink the last vma we must garbage collect
+ * the object if the list is empty because we're
+ * guaranteed no page can be pointing to this anon_vma
+ * if there's no vma anymore.
+ */
+ struct list_head anon_vma_head;
+} anon_vma_t;
+
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -69,6 +85,19 @@ struct vm_area_struct {
*/
struct list_head shared;
+ /*
+ * The same vma can be both queued into the i_mmap and in a
+ * anon_vma too, for example after a cow in
+ * a MAP_PRIVATE file mapping. However only the MAP_PRIVATE
+ * will go both in the i_mmap and anon_vma. A MAP_SHARED
+ * will only be in the i_mmap_shared and a MAP_ANONYMOUS (file = 0)
+ * will only be queued only in the anon_vma.
+ * The list is serialized by the anon_vma->lock.
+ */
+ struct list_head anon_vma_node;
+ /* Serialized by the vma->vm_mm->page_table_lock */
+ anon_vma_t * anon_vma;
+
/* Function pointers to deal with this struct. */
struct vm_operations_struct * vm_ops;
@@ -172,16 +201,51 @@ struct page {
updated asynchronously */
atomic_t count; /* Usage count, see below. */
struct list_head list; /* ->mapping has some page lists. */
- struct address_space *mapping; /* The inode (or ...) we belong to. */
unsigned long index; /* Our offset within mapping. */
struct list_head lru; /* Pageout list, eg. active_list;
protected by zone->lru_lock !! */
+
+ /*
+ * Address space of this page.
+ * A page can be either mapped to a file or to be anonymous
+ * memory, so using the union is optimal here. The PG_anon
+ * bitflag tells if this is anonymous or a file-mapping.
+ * If PG_anon is clear we use the as.mapping, if PG_anon is
+ * set and PG_direct is not set we use the as.anon_vma,
+ * if PG_anon is set and PG_direct is set we use the as.vma.
+ */
union {
- struct pte_chain *chain;/* Reverse pte mapping pointer.
- * protected by PG_chainlock */
- pte_addr_t direct;
- int mapcount;
- } pte;
+ /* The inode address space if it's a file mapping. */
+ struct address_space * mapping;
+
+ /*
+ * This points to an anon_vma object.
+ * The anon_vma can't go away under us if
+ * we hold the PG_maplock.
+ */
+ anon_vma_t * anon_vma;
+
+ /*
+ * Before the first fork we avoid anon_vma object allocation
+ * and we set PG_direct. anon_vma objects are only created
+ * via fork(), and the vm then stop using the page->as.vma
+ * and it starts using the as.anon_vma object instead.
+ * After the first fork(), even if the child exit, the pages
+ * cannot be downgraded to PG_direct anymore (even if we
+ * wanted to) because there's no way to reach pages starting
+ * from an anon_vma object.
+ */
+ struct vm_struct * vma;
+ } as;
+
+ /*
+ * Number of ptes mapping this page.
+ * It's serialized by PG_maplock.
+ * This is needed only to maintain the nr_mapped global info
+ * so it would be nice to drop it.
+ */
+ unsigned long mapcount;
+
unsigned long private; /* mapping-private opaque data */
/*
@@ -440,7 +504,8 @@ void unmap_page_range(struct mmu_gather
unsigned long address, unsigned long size);
void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr);
int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
- struct vm_area_struct *vma);
+ struct vm_area_struct *vma, struct vm_area_struct *orig_vma,
+ anon_vma_t ** anon_vma);
int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
unsigned long size, pgprot_t prot);
--- sles-anobjrmap-2/include/linux/page-flags.h.~1~ 2004-03-03 06:45:38.000000000 +0100
+++ sles-anobjrmap-2/include/linux/page-flags.h 2004-03-10 10:20:59.000000000 +0100
@@ -69,9 +69,9 @@
#define PG_private 12 /* Has something at ->private */
#define PG_writeback 13 /* Page is under writeback */
#define PG_nosave 14 /* Used for system suspend/resume */
-#define PG_chainlock 15 /* lock bit for ->pte_chain */
+#define PG_maplock 15 /* lock bit for ->as.anon_vma and ->mapcount */
-#define PG_direct 16 /* ->pte_chain points directly at pte */
+#define PG_direct 16 /* if set it must use page->as.vma */
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
--- sles-anobjrmap-2/include/linux/objrmap.h.~1~ 2004-03-05 05:27:41.000000000 +0100
+++ sles-anobjrmap-2/include/linux/objrmap.h 2004-03-10 20:48:57.000000000 +0100
@@ -1,8 +1,7 @@
#ifndef _LINUX_RMAP_H
#define _LINUX_RMAP_H
/*
- * Declarations for Reverse Mapping functions in mm/rmap.c
- * Its structures are declared within that file.
+ * Declarations for Object Reverse Mapping functions in mm/objrmap.c
*/
#include <linux/config.h>
@@ -10,32 +9,46 @@
#include <linux/linkage.h>
#include <linux/slab.h>
+#include <linux/kernel.h>
-struct pte_chain;
-extern kmem_cache_t *pte_chain_cache;
+extern kmem_cache_t * anon_vma_cachep;
-#define pte_chain_lock(page) bit_spin_lock(PG_chainlock, &page->flags)
-#define pte_chain_unlock(page) bit_spin_unlock(PG_chainlock, &page->flags)
+#define page_map_lock(page) bit_spin_lock(PG_maplock, &page->flags)
+#define page_map_unlock(page) bit_spin_unlock(PG_maplock, &page->flags)
-struct pte_chain *pte_chain_alloc(int gfp_flags);
-void __pte_chain_free(struct pte_chain *pte_chain);
+static inline void anon_vma_free(anon_vma_t * anon_vma)
+{
+ kmem_cache_free(anon_vma);
+}
-static inline void pte_chain_free(struct pte_chain *pte_chain)
+static inline anon_vma_t * anon_vma_alloc(void)
{
- if (pte_chain)
- __pte_chain_free(pte_chain);
+ might_sleep();
+
+ return kmem_cache_alloc(anon_vma_cachep, SLAB_KERNEL);
}
-int FASTCALL(page_referenced(struct page *));
-struct pte_chain *FASTCALL(page_add_rmap(struct page *, pte_t *,
- struct pte_chain *));
-void FASTCALL(page_remove_rmap(struct page *, pte_t *));
-int page_convert_anon(struct page *);
+static inline void anon_vma_unlink(struct vm_area_struct * vma)
+{
+ anon_vma_t * anon_vma = vma->anon_vma;
+
+ if (anon_vma) {
+ spin_lock(&anon_vma->anon_vma_lock);
+ list_del(&vma->anon_vm_node);
+ spin_unlock(&anon_vma->anon_vma_lock);
+ }
+}
+
+void FASTCALL(page_add_rmap(struct page *, struct vm_struct *));
+void FASTCALL(page_add_rmap_fork(struct page *, struct vm_area_struct *,
+ struct vm_area_struct *, anon_vma_t **));
+void FASTCALL(page_remove_rmap(struct page *));
/*
* Called from mm/vmscan.c to handle paging out
*/
int FASTCALL(try_to_unmap(struct page *));
+int FASTCALL(page_referenced(struct page *));
/*
* Return values of try_to_unmap
--- sles-anobjrmap-2/init/main.c.~1~ 2004-02-29 17:47:36.000000000 +0100
+++ sles-anobjrmap-2/init/main.c 2004-03-09 05:32:34.000000000 +0100
@@ -85,7 +85,7 @@ extern void signals_init(void);
extern void buffer_init(void);
extern void pidhash_init(void);
extern void pidmap_init(void);
-extern void pte_chain_init(void);
+extern void anon_vma_init(void);
extern void radix_tree_init(void);
extern void free_initmem(void);
extern void populate_rootfs(void);
@@ -495,7 +495,7 @@ asmlinkage void __init start_kernel(void
calibrate_delay();
pidmap_init();
pgtable_cache_init();
- pte_chain_init();
+ anon_vma_init();
#ifdef CONFIG_KDB
kdb_init();
--- sles-anobjrmap-2/kernel/fork.c.~1~ 2004-02-29 17:47:33.000000000 +0100
+++ sles-anobjrmap-2/kernel/fork.c 2004-03-10 18:58:29.000000000 +0100
@@ -276,6 +276,7 @@ static inline int dup_mmap(struct mm_str
struct vm_area_struct * mpnt, *tmp, **pprev;
int retval;
unsigned long charge = 0;
+ anon_vma_t * anon_vma = NULL;
down_write(&oldmm->mmap_sem);
flush_cache_mm(current->mm);
@@ -310,6 +311,11 @@ static inline int dup_mmap(struct mm_str
goto fail_nomem;
charge += len;
}
+ if (!anon_vma) {
+ anon_vma = anon_vma_alloc();
+ if (!anon_vma)
+ goto fail_nomem;
+ }
tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
if (!tmp)
goto fail_nomem;
@@ -339,7 +345,7 @@ static inline int dup_mmap(struct mm_str
*pprev = tmp;
pprev = &tmp->vm_next;
mm->map_count++;
- retval = copy_page_range(mm, current->mm, tmp);
+ retval = copy_page_range(mm, current->mm, tmp, mpnt, &anon_vma);
spin_unlock(&mm->page_table_lock);
if (tmp->vm_ops && tmp->vm_ops->open)
@@ -354,6 +360,8 @@ static inline int dup_mmap(struct mm_str
out:
flush_tlb_mm(current->mm);
up_write(&oldmm->mmap_sem);
+ if (anon_vma)
+ anon_vma_free(anon_vma);
return retval;
fail_nomem:
retval = -ENOMEM;
--- sles-anobjrmap-2/mm/mmap.c.~1~ 2004-03-03 06:53:46.000000000 +0100
+++ sles-anobjrmap-2/mm/mmap.c 2004-03-11 07:43:32.158221568 +0100
@@ -325,7 +325,7 @@ static void move_vma_start(struct vm_are
inode = vma->vm_file->f_dentry->d_inode;
if (inode)
__remove_shared_vm_struct(vma, inode);
- /* If no vm_file, perhaps we should always keep vm_pgoff at 0?? */
+ /* we must update pgoff even if no vm_file for the anon_vma_chain */
vma->vm_pgoff += (long)(addr - vma->vm_start) >> PAGE_SHIFT;
vma->vm_start = addr;
if (inode)
@@ -576,6 +576,7 @@ unsigned long __do_mmap_pgoff(struct mm_
case MAP_SHARED:
break;
}
+ pgoff = addr << PAGE_SHIFT;
}
error = security_file_mmap(file, prot, flags);
@@ -639,6 +640,8 @@ munmap_back:
vma->vm_private_data = NULL;
vma->vm_next = NULL;
INIT_LIST_HEAD(&vma->shared);
+ INIT_LIST_HEAD(&vma->anon_vma_node);
+ vma->anon_vma = NULL;
if (file) {
error = -EINVAL;
@@ -1381,10 +1384,12 @@ unsigned long do_brk(unsigned long addr,
vma->vm_flags = flags;
vma->vm_page_prot = protection_map[flags & 0x0f];
vma->vm_ops = NULL;
- vma->vm_pgoff = 0;
+ vma->vm_pgoff = addr << PAGE_SHIFT;
vma->vm_file = NULL;
vma->vm_private_data = NULL;
INIT_LIST_HEAD(&vma->shared);
+ INIT_LIST_HEAD(&vma->anon_vma_node);
+ vma->anon_vma = NULL;
vma_link(mm, vma, prev, rb_link, rb_parent);
--- sles-anobjrmap-2/mm/page_alloc.c.~1~ 2004-03-03 06:45:38.000000000 +0100
+++ sles-anobjrmap-2/mm/page_alloc.c 2004-03-10 10:28:26.000000000 +0100
@@ -91,6 +91,7 @@ static void bad_page(const char *functio
1 << PG_writeback);
set_page_count(page, 0);
page->mapping = NULL;
+ page->mapcount = 0;
}
#if !defined(CONFIG_HUGETLB_PAGE) && !defined(CONFIG_CRASH_DUMP) \
@@ -216,8 +217,7 @@ static inline void __free_pages_bulk (st
static inline void free_pages_check(const char *function, struct page *page)
{
- if ( page_mapped(page) ||
- page->mapping != NULL ||
+ if ( page->as.mapping != NULL ||
page_count(page) != 0 ||
(page->flags & (
1 << PG_lru |
@@ -329,7 +329,7 @@ static inline void set_page_refs(struct
*/
static void prep_new_page(struct page *page, int order)
{
- if (page->mapping || page_mapped(page) ||
+ if (page->as.mapping ||
(page->flags & (
1 << PG_private |
1 << PG_locked |
--- sles-anobjrmap-2/mm/nommu.c.~1~ 2004-02-04 16:07:06.000000000 +0100
+++ sles-anobjrmap-2/mm/nommu.c 2004-03-09 05:32:41.000000000 +0100
@@ -568,6 +568,6 @@ unsigned long get_unmapped_area(struct f
return -ENOMEM;
}
-void pte_chain_init(void)
+void anon_vma_init(void)
{
}
--- sles-anobjrmap-2/mm/memory.c.~1~ 2004-03-05 05:24:35.000000000 +0100
+++ sles-anobjrmap-2/mm/memory.c 2004-03-10 19:25:27.000000000 +0100
@@ -43,12 +43,11 @@
#include <linux/swap.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
-#include <linux/rmap.h>
+#include <linux/objrmap.h>
#include <linux/module.h>
#include <linux/init.h>
#include <asm/pgalloc.h>
-#include <asm/rmap.h>
#include <asm/uaccess.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
@@ -105,7 +104,6 @@ static inline void free_one_pmd(struct m
}
page = pmd_page(*dir);
pmd_clear(dir);
- pgtable_remove_rmap(page);
pte_free_tlb(tlb, page);
}
@@ -164,7 +162,6 @@ pte_t fastcall * pte_alloc_map(struct mm
pte_free(new);
goto out;
}
- pgtable_add_rmap(new, mm, address);
pmd_populate(mm, pmd, new);
}
out:
@@ -190,7 +187,6 @@ pte_t fastcall * pte_alloc_kernel(struct
pte_free_kernel(new);
goto out;
}
- pgtable_add_rmap(virt_to_page(new), mm, address);
pmd_populate_kernel(mm, pmd, new);
}
out:
@@ -211,26 +207,17 @@ out:
* but may be dropped within pmd_alloc() and pte_alloc_map().
*/
int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma, struct vm_area_struct *orig_vma,
+ anon_vma_t ** anon_vma)
{
pgd_t * src_pgd, * dst_pgd;
unsigned long address = vma->vm_start;
unsigned long end = vma->vm_end;
unsigned long cow;
- struct pte_chain *pte_chain = NULL;
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst, src, vma);
- pte_chain = pte_chain_alloc(GFP_ATOMIC);
- if (!pte_chain) {
- spin_unlock(&dst->page_table_lock);
- pte_chain = pte_chain_alloc(GFP_KERNEL);
- spin_lock(&dst->page_table_lock);
- if (!pte_chain)
- goto nomem;
- }
-
cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
src_pgd = pgd_offset(src, address)-1;
dst_pgd = pgd_offset(dst, address)-1;
@@ -299,7 +286,7 @@ skip_copy_pte_range:
pfn = pte_pfn(pte);
/* the pte points outside of valid memory, the
* mapping is assumed to be good, meaningful
- * and not mapped via rmap - duplicate the
+ * and not mapped via objrmap - duplicate the
* mapping as is.
*/
page = NULL;
@@ -331,30 +318,20 @@ skip_copy_pte_range:
dst->rss++;
set_pte(dst_pte, pte);
- pte_chain = page_add_rmap(page, dst_pte,
- pte_chain);
- if (pte_chain)
- goto cont_copy_pte_range_noset;
- pte_chain = pte_chain_alloc(GFP_ATOMIC);
- if (pte_chain)
- goto cont_copy_pte_range_noset;
+ page_add_rmap_fork(page, vma, orig_vma, anon_vma);
+
+ if (need_resched()) {
+ pte_unmap_nested(src_pte);
+ pte_unmap(dst_pte);
+ spin_unlock(&src->page_table_lock);
+ spin_unlock(&dst->page_table_lock);
+ __cond_resched();
+ spin_lock(&dst->page_table_lock);
+ spin_lock(&src->page_table_lock);
+ dst_pte = pte_offset_map(dst_pmd, address);
+ src_pte = pte_offset_map_nested(src_pmd, address);
+ }
- /*
- * pte_chain allocation failed, and we need to
- * run page reclaim.
- */
- pte_unmap_nested(src_pte);
- pte_unmap(dst_pte);
- spin_unlock(&src->page_table_lock);
- spin_unlock(&dst->page_table_lock);
- pte_chain = pte_chain_alloc(GFP_KERNEL);
- spin_lock(&dst->page_table_lock);
- if (!pte_chain)
- goto nomem;
- spin_lock(&src->page_table_lock);
- dst_pte = pte_offset_map(dst_pmd, address);
- src_pte = pte_offset_map_nested(src_pmd,
- address);
cont_copy_pte_range_noset:
address += PAGE_SIZE;
if (address >= end) {
@@ -377,10 +354,9 @@ cont_copy_pmd_range:
out_unlock:
spin_unlock(&src->page_table_lock);
out:
- pte_chain_free(pte_chain);
return 0;
+
nomem:
- pte_chain_free(pte_chain);
return -ENOMEM;
}
@@ -421,7 +397,7 @@ zap_pte_range(struct mmu_gather *tlb, pm
!PageSwapCache(page))
mark_page_accessed(page);
tlb->freed++;
- page_remove_rmap(page, ptep);
+ page_remove_rmap(page);
tlb_remove_page(tlb, page);
}
}
@@ -1014,7 +990,6 @@ static int do_wp_page(struct mm_struct *
{
struct page *old_page, *new_page;
unsigned long pfn = pte_pfn(pte);
- struct pte_chain *pte_chain;
pte_t entry;
if (unlikely(!pfn_valid(pfn))) {
@@ -1053,9 +1028,6 @@ static int do_wp_page(struct mm_struct *
page_cache_get(old_page);
spin_unlock(&mm->page_table_lock);
- pte_chain = pte_chain_alloc(GFP_KERNEL);
- if (!pte_chain)
- goto no_pte_chain;
new_page = alloc_page(GFP_HIGHUSER);
if (!new_page)
goto no_new_page;
@@ -1069,10 +1041,10 @@ static int do_wp_page(struct mm_struct *
if (pte_same(*page_table, pte)) {
if (PageReserved(old_page))
++mm->rss;
- page_remove_rmap(old_page, page_table);
+ page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
SetPageAnon(new_page);
- pte_chain = page_add_rmap(new_page, page_table, pte_chain);
+ page_add_rmap(new_page, vma);
lru_cache_add_active(new_page);
/* Free the old page.. */
@@ -1082,12 +1054,9 @@ static int do_wp_page(struct mm_struct *
page_cache_release(new_page);
page_cache_release(old_page);
spin_unlock(&mm->page_table_lock);
- pte_chain_free(pte_chain);
return VM_FAULT_MINOR;
no_new_page:
- pte_chain_free(pte_chain);
-no_pte_chain:
page_cache_release(old_page);
return VM_FAULT_OOM;
}
@@ -1245,7 +1214,6 @@ static int do_swap_page(struct mm_struct
swp_entry_t entry = pte_to_swp_entry(orig_pte);
pte_t pte;
int ret = VM_FAULT_MINOR;
- struct pte_chain *pte_chain = NULL;
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
@@ -1275,11 +1243,6 @@ static int do_swap_page(struct mm_struct
}
mark_page_accessed(page);
- pte_chain = pte_chain_alloc(GFP_KERNEL);
- if (!pte_chain) {
- ret = VM_FAULT_OOM;
- goto out;
- }
lock_page(page);
/*
@@ -1312,14 +1275,13 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte(page_table, pte);
SetPageAnon(page);
- pte_chain = page_add_rmap(page, page_table, pte_chain);
+ page_add_rmap(page, vma);
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, address, pte);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
out:
- pte_chain_free(pte_chain);
return ret;
}
@@ -1335,20 +1297,8 @@ do_anonymous_page(struct mm_struct *mm,
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
- struct pte_chain *pte_chain;
int ret;
- pte_chain = pte_chain_alloc(GFP_ATOMIC);
- if (!pte_chain) {
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
- pte_chain = pte_chain_alloc(GFP_KERNEL);
- if (!pte_chain)
- goto no_mem;
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
- }
-
/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
@@ -1359,8 +1309,8 @@ do_anonymous_page(struct mm_struct *mm,
spin_unlock(&mm->page_table_lock);
page = alloc_page(GFP_HIGHUSER);
- if (!page)
- goto no_mem;
+ if (unlikely(!page))
+ return VM_FAULT_OOM;
clear_user_highpage(page, addr);
spin_lock(&mm->page_table_lock);
@@ -1370,8 +1320,7 @@ do_anonymous_page(struct mm_struct *mm,
pte_unmap(page_table);
page_cache_release(page);
spin_unlock(&mm->page_table_lock);
- ret = VM_FAULT_MINOR;
- goto out;
+ return VM_FAULT_MINOR;
}
mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
@@ -1383,20 +1332,16 @@ do_anonymous_page(struct mm_struct *mm,
}
set_pte(page_table, entry);
- /* ignores ZERO_PAGE */
- pte_chain = page_add_rmap(page, page_table, pte_chain);
pte_unmap(page_table);
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
spin_unlock(&mm->page_table_lock);
ret = VM_FAULT_MINOR;
- goto out;
-no_mem:
- ret = VM_FAULT_OOM;
-out:
- pte_chain_free(pte_chain);
+ /* ignores ZERO_PAGE */
+ page_add_rmap(page, vma);
+
return ret;
}
@@ -1419,7 +1364,6 @@ do_no_page(struct mm_struct *mm, struct
struct page * new_page;
struct address_space *mapping = NULL;
pte_t entry;
- struct pte_chain *pte_chain;
int sequence = 0;
int ret = VM_FAULT_MINOR;
@@ -1443,10 +1387,6 @@ retry:
if (new_page == NOPAGE_OOM)
return VM_FAULT_OOM;
- pte_chain = pte_chain_alloc(GFP_KERNEL);
- if (!pte_chain)
- goto oom;
-
/* See if nopage returned an anon page */
if (!new_page->mapping || PageSwapCache(new_page))
SetPageAnon(new_page);
@@ -1476,7 +1416,6 @@ retry:
sequence = atomic_read(&mapping->truncate_count);
spin_unlock(&mm->page_table_lock);
page_cache_release(new_page);
- pte_chain_free(pte_chain);
goto retry;
}
page_table = pte_offset_map(pmd, address);
@@ -1500,7 +1439,7 @@ retry:
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
set_pte(page_table, entry);
- pte_chain = page_add_rmap(new_page, page_table, pte_chain);
+ page_add_rmap(new_page, vma);
pte_unmap(page_table);
} else {
/* One of our sibling threads was faster, back out. */
@@ -1513,13 +1452,13 @@ retry:
/* no need to invalidate: a not-present page shouldn't be cached */
update_mmu_cache(vma, address, entry);
spin_unlock(&mm->page_table_lock);
- goto out;
-oom:
+ out:
+ return ret;
+
+ oom:
page_cache_release(new_page);
ret = VM_FAULT_OOM;
-out:
- pte_chain_free(pte_chain);
- return ret;
+ goto out;
}
/*
--- sles-anobjrmap-2/mm/objrmap.c.~1~ 2004-03-05 05:40:21.000000000 +0100
+++ sles-anobjrmap-2/mm/objrmap.c 2004-03-10 20:29:20.000000000 +0100
@@ -1,105 +1,27 @@
/*
- * mm/rmap.c - physical to virtual reverse mappings
- *
- * Copyright 2001, Rik van Riel <[email protected]>
- * Released under the General Public License (GPL).
+ * mm/objrmap.c
*
+ * Provides methods for unmapping all sort of mapped pages
+ * using the vma objects, the brainer part of objrmap is the
+ * tracking of the vma to analyze for every given mapped page.
+ * The anon_vma methods are tracking anonymous pages,
+ * and the inode methods are tracking pages belonging
+ * to an inode.
*
- * Simple, low overhead pte-based reverse mapping scheme.
- * This is kept modular because we may want to experiment
- * with object-based reverse mapping schemes. Please try
- * to keep this thing as modular as possible.
+ * anonymous methods by Andrea Arcangeli <[email protected]> 2004
+ * inode methods by Dave McCracken <[email protected]> 2003, 2004
*/
/*
- * Locking:
- * - the page->pte.chain is protected by the PG_chainlock bit,
- * which nests within the the mm->page_table_lock,
- * which nests within the page lock.
- * - because swapout locking is opposite to the locking order
- * in the page fault path, the swapout path uses trylocks
- * on the mm->page_table_lock
- */
-#include <linux/mm.h>
-#include <linux/pagemap.h>
-#include <linux/swap.h>
-#include <linux/swapops.h>
-#include <linux/slab.h>
-#include <linux/init.h>
-#include <linux/rmap.h>
-#include <linux/cache.h>
-#include <linux/percpu.h>
-
-#include <asm/pgalloc.h>
-#include <asm/rmap.h>
-#include <asm/tlb.h>
-#include <asm/tlbflush.h>
-
-/* #define DEBUG_RMAP */
-
-/*
- * Shared pages have a chain of pte_chain structures, used to locate
- * all the mappings to this page. We only need a pointer to the pte
- * here, the page struct for the page table page contains the process
- * it belongs to and the offset within that process.
- *
- * We use an array of pte pointers in this structure to minimise cache misses
- * while traversing reverse maps.
- */
-#define NRPTE ((L1_CACHE_BYTES - sizeof(unsigned long))/sizeof(pte_addr_t))
-
-/*
- * next_and_idx encodes both the address of the next pte_chain and the
- * offset of the highest-index used pte in ptes[].
+ * try_to_unmap/page_referenced/page_add_rmap/page_remove_rmap
+ * inherit from the rmap design mm/rmap.c under
+ * Copyright 2001, Rik van Riel <[email protected]>
+ * Released under the General Public License (GPL).
*/
-struct pte_chain {
- unsigned long next_and_idx;
- pte_addr_t ptes[NRPTE];
-} ____cacheline_aligned;
-
-kmem_cache_t *pte_chain_cache;
-static inline struct pte_chain *pte_chain_next(struct pte_chain *pte_chain)
-{
- return (struct pte_chain *)(pte_chain->next_and_idx & ~NRPTE);
-}
-
-static inline struct pte_chain *pte_chain_ptr(unsigned long pte_chain_addr)
-{
- return (struct pte_chain *)(pte_chain_addr & ~NRPTE);
-}
-
-static inline int pte_chain_idx(struct pte_chain *pte_chain)
-{
- return pte_chain->next_and_idx & NRPTE;
-}
-
-static inline unsigned long
-pte_chain_encode(struct pte_chain *pte_chain, int idx)
-{
- return (unsigned long)pte_chain | idx;
-}
-
-/*
- * pte_chain list management policy:
- *
- * - If a page has a pte_chain list then it is shared by at least two processes,
- * because a single sharing uses PageDirect. (Well, this isn't true yet,
- * coz this code doesn't collapse singletons back to PageDirect on the remove
- * path).
- * - A pte_chain list has free space only in the head member - all succeeding
- * members are 100% full.
- * - If the head element has free space, it occurs in its leading slots.
- * - All free space in the pte_chain is at the start of the head member.
- * - Insertion into the pte_chain puts a pte pointer in the last free slot of
- * the head member.
- * - Removal from a pte chain moves the head pte of the head member onto the
- * victim pte and frees the head member if it became empty.
- */
+#include <linux/mm.h>
-/**
- ** VM stuff below this comment
- **/
+kmem_cache_t * anon_vma_cachep;
/**
* find_pte - Find a pte pointer given a vma and a struct page.
@@ -157,17 +79,17 @@ out:
}
/**
- * page_referenced_obj_one - referenced check for object-based rmap
+ * page_referenced_inode_one - referenced check for object-based rmap
* @vma: the vma to look in.
* @page: the page we're working on.
*
* Find a pte entry for a page/vma pair, then check and clear the referenced
* bit.
*
- * This is strictly a helper function for page_referenced_obj.
+ * This is strictly a helper function for page_referenced_inode.
*/
static int
-page_referenced_obj_one(struct vm_area_struct *vma, struct page *page)
+page_referenced_inode_one(struct vm_area_struct *vma, struct page *page)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte;
@@ -188,11 +110,11 @@ page_referenced_obj_one(struct vm_area_s
}
/**
- * page_referenced_obj_one - referenced check for object-based rmap
+ * page_referenced_inode_one - referenced check for object-based rmap
* @page: the page we're checking references on.
*
* For an object-based mapped page, find all the places it is mapped and
- * check/clear the referenced flag. This is done by following the page->mapping
+ * check/clear the referenced flag. This is done by following the page->as.mapping
* pointer, then walking the chain of vmas it holds. It returns the number
* of references it found.
*
@@ -202,29 +124,54 @@ page_referenced_obj_one(struct vm_area_s
* assume a reference count of 1.
*/
static int
-page_referenced_obj(struct page *page)
+page_referenced_inode(struct page *page)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page->as.mapping;
struct vm_area_struct *vma;
- int referenced = 0;
+ int referenced;
- if (!page->pte.mapcount)
+ if (!page->mapcount)
return 0;
- if (!mapping)
- BUG();
+ BUG_ON(!mapping);
+ BUG_ON(PageSwapCache(page));
- if (PageSwapCache(page))
- BUG();
+ if (down_trylock(&mapping->i_shared_sem))
+ return 1;
+
+ referenced = 0;
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared)
+ referenced += page_referenced_inode_one(vma, page);
+
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared)
+ referenced += page_referenced_inode_one(vma, page);
+
+ up(&mapping->i_shared_sem);
+
+ return referenced;
+}
+
+static int page_referenced_anon(struct page *page)
+{
+ int referenced;
+
+ if (!page->mapcount)
+ return 0;
+
+ BUG_ON(!mapping);
+ BUG_ON(PageSwapCache(page));
if (down_trylock(&mapping->i_shared_sem))
return 1;
-
+
+ referenced = 0;
+
list_for_each_entry(vma, &mapping->i_mmap, shared)
- referenced += page_referenced_obj_one(vma, page);
+ referenced += page_referenced_inode_one(vma, page);
list_for_each_entry(vma, &mapping->i_mmap_shared, shared)
- referenced += page_referenced_obj_one(vma, page);
+ referenced += page_referenced_inode_one(vma, page);
up(&mapping->i_shared_sem);
@@ -244,7 +191,6 @@ page_referenced_obj(struct page *page)
*/
int fastcall page_referenced(struct page * page)
{
- struct pte_chain *pc;
int referenced = 0;
if (page_test_and_clear_young(page))
@@ -253,209 +199,179 @@ int fastcall page_referenced(struct page
if (TestClearPageReferenced(page))
referenced++;
- if (!PageAnon(page)) {
- referenced += page_referenced_obj(page);
- goto out;
- }
- if (PageDirect(page)) {
- pte_t *pte = rmap_ptep_map(page->pte.direct);
- if (ptep_test_and_clear_young(pte))
- referenced++;
- rmap_ptep_unmap(pte);
- } else {
- int nr_chains = 0;
+ if (!PageAnon(page))
+ referenced += page_referenced_inode(page);
+ else
+ referenced += page_referenced_anon(page);
- /* Check all the page tables mapping this page. */
- for (pc = page->pte.chain; pc; pc = pte_chain_next(pc)) {
- int i;
-
- for (i = pte_chain_idx(pc); i < NRPTE; i++) {
- pte_addr_t pte_paddr = pc->ptes[i];
- pte_t *p;
-
- p = rmap_ptep_map(pte_paddr);
- if (ptep_test_and_clear_young(p))
- referenced++;
- rmap_ptep_unmap(p);
- nr_chains++;
- }
- }
- if (nr_chains == 1) {
- pc = page->pte.chain;
- page->pte.direct = pc->ptes[NRPTE-1];
- SetPageDirect(page);
- pc->ptes[NRPTE-1] = 0;
- __pte_chain_free(pc);
- }
- }
-out:
return referenced;
}
+/* this needs the page->flags PG_map_lock held */
+static void inline anon_vma_page_link(struct page * page, struct vm_area_struct * vma)
+{
+ BUG_ON(page->mapcount != 1);
+ BUG_ON(PageDirect(page));
+
+ SetPageDirect(page);
+ page->as.vma = vma;
+}
+
+/* this needs the page->flags PG_map_lock held */
+static void inline anon_vma_page_link_fork(struct page * page, struct vm_area_struct * vma,
+ struct vm_area_struct * orig_vma, anon_vma_t ** anon_vma)
+{
+ anon_vma_t * anon_vma = orig_vma->anon_vma;
+
+ BUG_ON(page->mapcount <= 1);
+ BUG_ON(!PageDirect(page));
+
+ if (!anon_vma) {
+ anon_vma = *anon_vma;
+ *anon_vma = NULL;
+
+ /* it's single threaded here, avoid the anon_vma->anon_vma_lock */
+ list_add(&vma->anon_vma_node, &anon_vma->anon_vma_head);
+ list_add(&orig_vma->anon_vma_node, &anon_vma->anon_vma_head);
+
+ orig_vma->anon_vma = vma->anon_vma = anon_vma;
+ } else {
+ /* multithreaded here, anon_vma existed already in other mm */
+ spin_lock(&anon_vma->anon_vma_lock);
+ list_add(&vma->anon_vma_node, &anon_vma->anon_vma_head);
+ spin_unlock(&anon_vma->anon_vma_lock);
+ }
+
+ ClearPageDirect(page);
+ page->as.anon_vma = anon_vma;
+}
+
/**
* page_add_rmap - add reverse mapping entry to a page
* @page: the page to add the mapping to
- * @ptep: the page table entry mapping this page
+ * @vma: the vma that is covering the page
*
* Add a new pte reverse mapping to a page.
- * The caller needs to hold the mm->page_table_lock.
*/
-struct pte_chain * fastcall
-page_add_rmap(struct page *page, pte_t *ptep, struct pte_chain *pte_chain)
+void fastcall page_add_rmap(struct page *page, struct vm_area_struct * vma)
{
- pte_addr_t pte_paddr = ptep_to_paddr(ptep);
- struct pte_chain *cur_pte_chain;
+ if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
+ return;
- if (PageReserved(page))
- return pte_chain;
+ page_map_lock(page);
- pte_chain_lock(page);
+ if (!page->mapcount++)
+ inc_page_state(nr_mapped);
- /*
- * If this is an object-based page, just count it. We can
- * find the mappings by walking the object vma chain for that object.
- */
- if (!PageAnon(page)) {
- if (!page->mapping)
- BUG();
- if (PageSwapCache(page))
- BUG();
- if (!page->pte.mapcount)
- inc_page_state(nr_mapped);
- page->pte.mapcount++;
- goto out;
+ if (PageAnon(page))
+ anon_vma_page_link(page, vma);
+ else {
+ /*
+ * If this is an object-based page, just count it.
+ * We can find the mappings by walking the object
+ * vma chain for that object.
+ */
+ BUG_ON(!page->as.mapping);
+ BUG_ON(PageSwapCache(page));
}
- if (page->pte.direct == 0) {
- page->pte.direct = pte_paddr;
- SetPageDirect(page);
+ page_map_unlock(page);
+}
+
+/* called from fork() */
+void fastcall page_add_rmap_fork(struct page *page, struct vm_area_struct * vma,
+ struct vm_area_struct * orig_vma, anon_vma_t ** anon_vma)
+{
+ if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
+ return;
+
+ page_map_lock(page);
+
+ if (!page->mapcount++)
inc_page_state(nr_mapped);
- goto out;
- }
- if (PageDirect(page)) {
- /* Convert a direct pointer into a pte_chain */
- ClearPageDirect(page);
- pte_chain->ptes[NRPTE-1] = page->pte.direct;
- pte_chain->ptes[NRPTE-2] = pte_paddr;
- pte_chain->next_and_idx = pte_chain_encode(NULL, NRPTE-2);
- page->pte.direct = 0;
- page->pte.chain = pte_chain;
- pte_chain = NULL; /* We consumed it */
- goto out;
+ if (PageAnon(page))
+ anon_vma_page_link_fork(page, vma, orig_vma, anon_vma);
+ else {
+ /*
+ * If this is an object-based page, just count it.
+ * We can find the mappings by walking the object
+ * vma chain for that object.
+ */
+ BUG_ON(!page->as.mapping);
+ BUG_ON(PageSwapCache(page));
}
- cur_pte_chain = page->pte.chain;
- if (cur_pte_chain->ptes[0]) { /* It's full */
- pte_chain->next_and_idx = pte_chain_encode(cur_pte_chain,
- NRPTE - 1);
- page->pte.chain = pte_chain;
- pte_chain->ptes[NRPTE-1] = pte_paddr;
- pte_chain = NULL; /* We consumed it */
- goto out;
+ page_map_unlock(page);
+}
+
+/* this needs the page->flags PG_map_lock held */
+static void inline anon_vma_page_unlink(struct page * page)
+{
+ /*
+ * Cleanup if this anon page is gone
+ * as far as the vm is concerned.
+ */
+ if (!page->mapcount) {
+ page->as.vma = 0;
+#if 0
+ /*
+ * The above clears page->as.anon_vma too
+ * if the page wasn't direct.
+ */
+ page->as.anon_vma = 0;
+#endif
+ ClearPageDirect(page);
}
- cur_pte_chain->ptes[pte_chain_idx(cur_pte_chain) - 1] = pte_paddr;
- cur_pte_chain->next_and_idx--;
-out:
- pte_chain_unlock(page);
- return pte_chain;
}
/**
* page_remove_rmap - take down reverse mapping to a page
* @page: page to remove mapping from
- * @ptep: page table entry to remove
*
* Removes the reverse mapping from the pte_chain of the page,
* after that the caller can clear the page table entry and free
* the page.
- * Caller needs to hold the mm->page_table_lock.
*/
-void fastcall page_remove_rmap(struct page *page, pte_t *ptep)
+void fastcall page_remove_rmap(struct page *page)
{
- pte_addr_t pte_paddr = ptep_to_paddr(ptep);
- struct pte_chain *pc;
-
if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
return;
- pte_chain_lock(page);
+ page_map_lock(page);
if (!page_mapped(page))
goto out_unlock;
- /*
- * If this is an object-based page, just uncount it. We can
- * find the mappings by walking the object vma chain for that object.
- */
- if (!PageAnon(page)) {
- if (!page->mapping)
- BUG();
- if (PageSwapCache(page))
- BUG();
- if (!page->pte.mapcount)
- BUG();
- page->pte.mapcount--;
- if (!page->pte.mapcount)
- dec_page_state(nr_mapped);
- goto out_unlock;
+ if (!--page->mapcount)
+ dec_page_state(nr_mapped);
+
+ if (PageAnon(page))
+ anon_vma_page_unlink(page, vma);
+ else {
+ /*
+ * If this is an object-based page, just uncount it.
+ * We can find the mappings by walking the object vma
+ * chain for that object.
+ */
+ BUG_ON(!page->as.mapping);
+ BUG_ON(PageSwapCache(page));
}
- if (PageDirect(page)) {
- if (page->pte.direct == pte_paddr) {
- page->pte.direct = 0;
- ClearPageDirect(page);
- goto out;
- }
- } else {
- struct pte_chain *start = page->pte.chain;
- struct pte_chain *next;
- int victim_i = pte_chain_idx(start);
-
- for (pc = start; pc; pc = next) {
- int i;
-
- next = pte_chain_next(pc);
- if (next)
- prefetch(next);
- for (i = pte_chain_idx(pc); i < NRPTE; i++) {
- pte_addr_t pa = pc->ptes[i];
-
- if (pa != pte_paddr)
- continue;
- pc->ptes[i] = start->ptes[victim_i];
- start->ptes[victim_i] = 0;
- if (victim_i == NRPTE-1) {
- /* Emptied a pte_chain */
- page->pte.chain = pte_chain_next(start);
- __pte_chain_free(start);
- } else {
- start->next_and_idx++;
- }
- goto out;
- }
- }
- }
-out:
- if (page->pte.direct == 0 && page_test_and_clear_dirty(page))
- set_page_dirty(page);
- if (!page_mapped(page))
- dec_page_state(nr_mapped);
-out_unlock:
- pte_chain_unlock(page);
+ page_map_unlock(page);
return;
}
/**
- * try_to_unmap_obj - unmap a page using the object-based rmap method
+ * try_to_unmap_one - unmap a page using the object-based rmap method
* @page: the page to unmap
*
* Determine whether a page is mapped in a given vma and unmap it if it's found.
*
- * This function is strictly a helper function for try_to_unmap_obj.
+ * This function is strictly a helper function for try_to_unmap_inode.
*/
-static inline int
-try_to_unmap_obj_one(struct vm_area_struct *vma, struct page *page)
+static int
+try_to_unmap_one(struct vm_area_struct *vma, struct page *page)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -477,17 +393,39 @@ try_to_unmap_obj_one(struct vm_area_stru
}
flush_cache_page(vma, address);
- pteval = ptep_get_and_clear(pte);
- flush_tlb_page(vma, address);
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ if (PageSwapCache(page)) {
+ /*
+ * Store the swap location in the pte.
+ * See handle_pte_fault() ...
+ */
+ swp_entry_t entry = { .val = page->index };
+ swap_duplicate(entry);
+ set_pte(pte, swp_entry_to_pte(entry));
+ BUG_ON(pte_file(*pte));
+ } else {
+ unsigned long pgidx;
+ /*
+ * If a nonlinear mapping then store the file page offset
+ * in the pte.
+ */
+ pgidx = (address - vma->vm_start) >> PAGE_SHIFT;
+ pgidx += vma->vm_pgoff;
+ pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
+ if (page->index != pgidx) {
+ set_pte(pte, pgoff_to_pte(page->index));
+ BUG_ON(!pte_file(*pte));
+ }
+ }
if (pte_dirty(pteval))
set_page_dirty(page);
- if (!page->pte.mapcount)
- BUG();
+ BUG_ON(!page->mapcount);
mm->rss--;
- page->pte.mapcount--;
+ page->mapcount--;
page_cache_release(page);
out_unmap:
@@ -499,7 +437,7 @@ out:
}
/**
- * try_to_unmap_obj - unmap a page using the object-based rmap method
+ * try_to_unmap_inode - unmap a page using the object-based rmap method
* @page: the page to unmap
*
* Find all the mappings of a page using the mapping pointer and the vma chains
@@ -511,30 +449,26 @@ out:
* return a temporary error.
*/
static int
-try_to_unmap_obj(struct page *page)
+try_to_unmap_inode(struct page *page)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page->as.mapping;
struct vm_area_struct *vma;
int ret = SWAP_AGAIN;
- if (!mapping)
- BUG();
-
- if (PageSwapCache(page))
- BUG();
+ BUG_ON(PageSwapCache(page));
if (down_trylock(&mapping->i_shared_sem))
return ret;
list_for_each_entry(vma, &mapping->i_mmap, shared) {
- ret = try_to_unmap_obj_one(vma, page);
- if (ret == SWAP_FAIL || !page->pte.mapcount)
+ ret = try_to_unmap_one(vma, page);
+ if (ret == SWAP_FAIL || !page->mapcount)
goto out;
}
list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
- ret = try_to_unmap_obj_one(vma, page);
- if (ret == SWAP_FAIL || !page->pte.mapcount)
+ ret = try_to_unmap_one(vma, page);
+ if (ret == SWAP_FAIL || !page->mapcount)
goto out;
}
@@ -543,94 +477,33 @@ out:
return ret;
}
-/**
- * try_to_unmap_one - worker function for try_to_unmap
- * @page: page to unmap
- * @ptep: page table entry to unmap from page
- *
- * Internal helper function for try_to_unmap, called for each page
- * table entry mapping a page. Because locking order here is opposite
- * to the locking order used by the page fault path, we use trylocks.
- * Locking:
- * page lock shrink_list(), trylock
- * pte_chain_lock shrink_list()
- * mm->page_table_lock try_to_unmap_one(), trylock
- */
-static int FASTCALL(try_to_unmap_one(struct page *, pte_addr_t));
-static int fastcall try_to_unmap_one(struct page * page, pte_addr_t paddr)
-{
- pte_t *ptep = rmap_ptep_map(paddr);
- unsigned long address = ptep_to_address(ptep);
- struct mm_struct * mm = ptep_to_mm(ptep);
- struct vm_area_struct * vma;
- pte_t pte;
- int ret;
-
- if (!mm)
- BUG();
-
- /*
- * We need the page_table_lock to protect us from page faults,
- * munmap, fork, etc...
- */
- if (!spin_trylock(&mm->page_table_lock)) {
- rmap_ptep_unmap(ptep);
- return SWAP_AGAIN;
- }
-
-
- /* During mremap, it's possible pages are not in a VMA. */
- vma = find_vma(mm, address);
- if (!vma) {
- ret = SWAP_FAIL;
- goto out_unlock;
- }
-
- /* The page is mlock()d, we cannot swap it out. */
- if (vma->vm_flags & VM_LOCKED) {
- ret = SWAP_FAIL;
- goto out_unlock;
- }
+static int
+try_to_unmap_anon(struct page * page)
+{
+ int ret = SWAP_AGAIN;
- /* Nuke the page table entry. */
- flush_cache_page(vma, address);
- pte = ptep_clear_flush(vma, address, ptep);
+ page_map_lock(page);
- if (PageSwapCache(page)) {
- /*
- * Store the swap location in the pte.
- * See handle_pte_fault() ...
- */
- swp_entry_t entry = { .val = page->index };
- swap_duplicate(entry);
- set_pte(ptep, swp_entry_to_pte(entry));
- BUG_ON(pte_file(*ptep));
+ if (PageDirect(page)) {
+ vma = page->as.vma;
+ ret = try_to_unmap_one(page->as.vma, page);
} else {
- unsigned long pgidx;
- /*
- * If a nonlinear mapping then store the file page offset
- * in the pte.
- */
- pgidx = (address - vma->vm_start) >> PAGE_SHIFT;
- pgidx += vma->vm_pgoff;
- pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
- if (page->index != pgidx) {
- set_pte(ptep, pgoff_to_pte(page->index));
- BUG_ON(!pte_file(*ptep));
+ struct vm_area_struct * vma;
+ anon_vma_t * anon_vma = page->as.anon_vma;
+
+ spin_lock(&anon_vma->anon_vma_lock);
+ list_for_each_entry(vma, &anon_vma->anon_vma_head, anon_vma_node) {
+ ret = try_to_unmap_one(vma, page);
+ if (ret == SWAP_FAIL || !page->mapcount) {
+ spin_unlock(&anon_vma->anon_vma_lock);
+ goto out;
+ }
}
+ spin_unlock(&anon_vma->anon_vma_lock);
}
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pte))
- set_page_dirty(page);
-
- mm->rss--;
- page_cache_release(page);
- ret = SWAP_SUCCESS;
-
-out_unlock:
- rmap_ptep_unmap(ptep);
- spin_unlock(&mm->page_table_lock);
+out:
+ page_map_unlock(page);
return ret;
}
@@ -650,82 +523,22 @@ int fastcall try_to_unmap(struct page *
{
struct pte_chain *pc, *next_pc, *start;
int ret = SWAP_SUCCESS;
- int victim_i;
/* This page should not be on the pageout lists. */
- if (PageReserved(page))
- BUG();
- if (!PageLocked(page))
- BUG();
- /* We need backing store to swap out a page. */
- if (!page->mapping)
- BUG();
+ BUG_ON(PageReserved(page));
+ BUG_ON(!PageLocked(page));
/*
- * If it's an object-based page, use the object vma chain to find all
- * the mappings.
+ * We need backing store to swap out a page.
+ * Subtle: this checks for page->as.anon_vma too ;).
*/
- if (!PageAnon(page)) {
- ret = try_to_unmap_obj(page);
- goto out;
- }
+ BUG_ON(!page->as.mapping);
- if (PageDirect(page)) {
- ret = try_to_unmap_one(page, page->pte.direct);
- if (ret == SWAP_SUCCESS) {
- if (page_test_and_clear_dirty(page))
- set_page_dirty(page);
- page->pte.direct = 0;
- ClearPageDirect(page);
- }
- goto out;
- }
+ if (!PageAnon(page))
+ ret = try_to_unmap_inode(page);
+ else
+ ret = try_to_unmap_anon(page);
- start = page->pte.chain;
- victim_i = pte_chain_idx(start);
- for (pc = start; pc; pc = next_pc) {
- int i;
-
- next_pc = pte_chain_next(pc);
- if (next_pc)
- prefetch(next_pc);
- for (i = pte_chain_idx(pc); i < NRPTE; i++) {
- pte_addr_t pte_paddr = pc->ptes[i];
-
- switch (try_to_unmap_one(page, pte_paddr)) {
- case SWAP_SUCCESS:
- /*
- * Release a slot. If we're releasing the
- * first pte in the first pte_chain then
- * pc->ptes[i] and start->ptes[victim_i] both
- * refer to the same thing. It works out.
- */
- pc->ptes[i] = start->ptes[victim_i];
- start->ptes[victim_i] = 0;
- victim_i++;
- if (victim_i == NRPTE) {
- page->pte.chain = pte_chain_next(start);
- __pte_chain_free(start);
- start = page->pte.chain;
- victim_i = 0;
- } else {
- start->next_and_idx++;
- }
- if (page->pte.direct == 0 &&
- page_test_and_clear_dirty(page))
- set_page_dirty(page);
- break;
- case SWAP_AGAIN:
- /* Skip this pte, remembering status. */
- ret = SWAP_AGAIN;
- continue;
- case SWAP_FAIL:
- ret = SWAP_FAIL;
- goto out;
- }
- }
- }
-out:
if (!page_mapped(page)) {
dec_page_state(nr_mapped);
ret = SWAP_SUCCESS;
@@ -733,176 +546,30 @@ out:
return ret;
}
-/**
- * page_convert_anon - Convert an object-based mapped page to pte_chain-based.
- * @page: the page to convert
- *
- * Find all the mappings for an object-based page and convert them
- * to 'anonymous', ie create a pte_chain and store all the pte pointers there.
- *
- * This function takes the address_space->i_shared_sem, sets the PageAnon flag,
- * then sets the mm->page_table_lock for each vma and calls page_add_rmap. This
- * means there is a period when PageAnon is set, but still has some mappings
- * with no pte_chain entry. This is in fact safe, since page_remove_rmap will
- * simply not find it. try_to_unmap might erroneously return success, but it
- * will never be called because the page_convert_anon() caller has locked the
- * page.
- *
- * page_referenced() may fail to scan all the appropriate pte's and may return
- * an inaccurate result. This is so rare that it does not matter.
+/*
+ * No more VM stuff below this comment, only anon_vma helper
+ * functions.
*/
-int page_convert_anon(struct page *page)
-{
- struct address_space *mapping;
- struct vm_area_struct *vma;
- struct pte_chain *pte_chain = NULL;
- pte_t *pte;
- int err = 0;
-
- mapping = page->mapping;
- if (mapping == NULL)
- goto out; /* truncate won the lock_page() race */
-
- down(&mapping->i_shared_sem);
- pte_chain_lock(page);
-
- /*
- * Has someone else done it for us before we got the lock?
- * If so, pte.direct or pte.chain has replaced pte.mapcount.
- */
- if (PageAnon(page)) {
- pte_chain_unlock(page);
- goto out_unlock;
- }
-
- SetPageAnon(page);
- if (page->pte.mapcount == 0) {
- pte_chain_unlock(page);
- goto out_unlock;
- }
- /* This is gonna get incremented by page_add_rmap */
- dec_page_state(nr_mapped);
- page->pte.mapcount = 0;
-
- /*
- * Now that the page is marked as anon, unlock it. page_add_rmap will
- * lock it as necessary.
- */
- pte_chain_unlock(page);
-
- list_for_each_entry(vma, &mapping->i_mmap, shared) {
- if (!pte_chain) {
- pte_chain = pte_chain_alloc(GFP_KERNEL);
- if (!pte_chain) {
- err = -ENOMEM;
- goto out_unlock;
- }
- }
- spin_lock(&vma->vm_mm->page_table_lock);
- pte = find_pte(vma, page, NULL);
- if (pte) {
- /* Make sure this isn't a duplicate */
- page_remove_rmap(page, pte);
- pte_chain = page_add_rmap(page, pte, pte_chain);
- pte_unmap(pte);
- }
- spin_unlock(&vma->vm_mm->page_table_lock);
- }
- list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
- if (!pte_chain) {
- pte_chain = pte_chain_alloc(GFP_KERNEL);
- if (!pte_chain) {
- err = -ENOMEM;
- goto out_unlock;
- }
- }
- spin_lock(&vma->vm_mm->page_table_lock);
- pte = find_pte(vma, page, NULL);
- if (pte) {
- /* Make sure this isn't a duplicate */
- page_remove_rmap(page, pte);
- pte_chain = page_add_rmap(page, pte, pte_chain);
- pte_unmap(pte);
- }
- spin_unlock(&vma->vm_mm->page_table_lock);
- }
-
-out_unlock:
- pte_chain_free(pte_chain);
- up(&mapping->i_shared_sem);
-out:
- return err;
-}
-
-/**
- ** No more VM stuff below this comment, only pte_chain helper
- ** functions.
- **/
-
-static void pte_chain_ctor(void *p, kmem_cache_t *cachep, unsigned long flags)
-{
- struct pte_chain *pc = p;
-
- memset(pc, 0, sizeof(*pc));
-}
-
-DEFINE_PER_CPU(struct pte_chain *, local_pte_chain) = 0;
-/**
- * __pte_chain_free - free pte_chain structure
- * @pte_chain: pte_chain struct to free
- */
-void __pte_chain_free(struct pte_chain *pte_chain)
+static void
+anon_vma_ctor(void *data, kmem_cache_t *cachep, unsigned long flags)
{
- struct pte_chain **pte_chainp;
-
- pte_chainp = &get_cpu_var(local_pte_chain);
- if (pte_chain->next_and_idx)
- pte_chain->next_and_idx = 0;
- if (*pte_chainp)
- kmem_cache_free(pte_chain_cache, *pte_chainp);
- *pte_chainp = pte_chain;
- put_cpu_var(local_pte_chain);
-}
+ if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
+ SLAB_CTOR_CONSTRUCTOR) {
+ anon_vma_t * anon_vma = (anon_vma_t *) data;
-/*
- * pte_chain_alloc(): allocate a pte_chain structure for use by page_add_rmap().
- *
- * The caller of page_add_rmap() must perform the allocation because
- * page_add_rmap() is invariably called under spinlock. Often, page_add_rmap()
- * will not actually use the pte_chain, because there is space available in one
- * of the existing pte_chains which are attached to the page. So the case of
- * allocating and then freeing a single pte_chain is specially optimised here,
- * with a one-deep per-cpu cache.
- */
-struct pte_chain *pte_chain_alloc(int gfp_flags)
-{
- struct pte_chain *ret;
- struct pte_chain **pte_chainp;
-
- might_sleep_if(gfp_flags & __GFP_WAIT);
-
- pte_chainp = &get_cpu_var(local_pte_chain);
- if (*pte_chainp) {
- ret = *pte_chainp;
- *pte_chainp = NULL;
- put_cpu_var(local_pte_chain);
- } else {
- put_cpu_var(local_pte_chain);
- ret = kmem_cache_alloc(pte_chain_cache, gfp_flags);
+ spin_lock_init(&anon_vma->anon_vma_lock);
+ INIT_LIST_HEAD(&anon_vma->anon_vma_head);
}
- return ret;
}
-void __init pte_chain_init(void)
+void __init anon_vma_init(void)
{
- pte_chain_cache = kmem_cache_create( "pte_chain",
- sizeof(struct pte_chain),
- 0,
- SLAB_MUST_HWCACHE_ALIGN,
- pte_chain_ctor,
- NULL);
+ /* this is intentonally not hw aligned to avoid wasting ram */
+ anon_vma_cachep = kmem_cache_create("anon_vma",
+ sizeof(anon_vma_t), 0, 0,
+ anon_vma_ctor, NULL);
- if (!pte_chain_cache)
- panic("failed to create pte_chain cache!\n");
+ if(!anon_vma_cachep)
+ panic("Cannot create anon_vma SLAB cache");
}
--- sles-anobjrmap-2/mm/Makefile.~1~ 2004-02-29 17:47:30.000000000 +0100
+++ sles-anobjrmap-2/mm/Makefile 2004-03-10 20:26:16.000000000 +0100
@@ -4,7 +4,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
- mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
+ mlock.o mmap.o mprotect.o mremap.o msync.o objrmap.o \
shmem.o vmalloc.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
Hi Andrea,
On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
>
> this is the full current status of my anon_vma work. Now fork() and all
> the other page_add/remove_rmap in memory.c plus the paging routines
> seems fully covered and I'm now dealing with the vma merging and the
> anon_vma garbage collection (the latter is easy but I need to track all
> the kmem_cache_free).
I'm still making my way through all the relevant mails, and not even
glanced at your code yet: I hope later today. But to judge by the
length of your essay on vma merging, it strikes me that you've taken
a wrong direction in switching from my anon mm to your anon vma.
Go by vmas and you have tiresome problems as they are split and merged,
very commonly. Plus you have the overhead of new data structure per vma.
If your design magicked those problems away somehow, okay, but it seems
you're finding issues with it: I think you should go back to anon mms.
Go by mms, and there's only the exceedingly rare (does it ever occur
outside our testing?) awkward case of tracking pages in a private anon
vma inherited from parent, when parent or child mremaps it with MAYMOVE.
Which I reused the pte_chain code for, but it's probably better done
by conjuring up an imaginary tmpfs object as backing at that point
(that has its own little cost, since the object lives on at full size
until all its mappers unmap it, however small the portion they have
mapped). And the overhead of the new data structure is per mm only.
I'll get back to reading through the mails now: sorry if I'm about to
find the arguments against anonmm in my reading. (By the way, several
times you mention the size of a 2.6 struct page as larger than a 2.4
struct page: no, thanks to wli and others it's the 2.6 that's smaller.)
Hugh
Hi Hugh,
On Thu, Mar 11, 2004 at 01:23:24PM +0000, Hugh Dickins wrote:
> Hi Andrea,
>
> On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
> >
> > this is the full current status of my anon_vma work. Now fork() and all
> > the other page_add/remove_rmap in memory.c plus the paging routines
> > seems fully covered and I'm now dealing with the vma merging and the
> > anon_vma garbage collection (the latter is easy but I need to track all
> > the kmem_cache_free).
>
> I'm still making my way through all the relevant mails, and not even
> glanced at your code yet: I hope later today. But to judge by the
> length of your essay on vma merging, it strikes me that you've taken
> a wrong direction in switching from my anon mm to your anon vma.
>
> Go by vmas and you have tiresome problems as they are split and merged,
> very commonly. Plus you have the overhead of new data structure per vma.
it's more complicated because it's more fine-grained and it can handle
mremap too. I mean, the additional cost of tracking the vmas pays off
because then we have a tiny list of vmas to search for every page,
otherwise with the mm-wide model we'd need to search all of the vmas in
a mm. This is quite important during swapping with tons of vmas. Note
that in my common case the page will point directly to the vma
(PageDirect(page) == 1), no find_vma or anything else needed in between.
the per-vma overhead is 12 bytes, 2 pointers for the list node and 1
pointer to the anon_vma. As said above it provides several advantages,
but you're certainly right that the mm approach had no vma overhead.
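To make the 12 bytes concrete, here is a rough sketch of the layout (field
names follow my patch and the code further down; treat it as an
illustration only, not the final layout):
typedef struct anon_vma_s {
	spinlock_t anon_vma_lock;	/* serializes changes to the vma list */
	struct list_head anon_vma_head;	/* all the vmas sharing these anon pages */
} anon_vma_t;
struct vm_area_struct {
	/* ... the existing fields ... */
	struct list_head anon_vma_node;	/* 2 pointers: link into anon_vma_head */
	anon_vma_t * anon_vma;		/* 1 pointer: the shared anon_vma */
};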
I'm quite convinced the anon_vma is the optimal design, though it's not
running yet ;). However it's close to compiling. the whole vma and page
layer is finished (including the vma merging). I'm now dealing with the
swapcache stuff and I'm doing it slightly differently from your
anobjrmap-2 patch (obviously I also reinstantiate the PG_swapcache
bitflag, but the fundamental difference is that I don't drop the
swapper_space):
static inline struct address_space * page_mapping(struct page * page)
{
extern struct address_space swapper_space;
struct address_space * mapping = NULL;
if (PageSwapCache(page))
mapping = &swapper_space;
else if (!PageAnon(page))
mapping = page->as.mapping;
return mapping;
}
I want the same pagecache/swapcache code to work transparently, but I
free up the page->index and the page->mapping for the swapcache, so that
I can reuse them to track the anon_vma. I think the above is simpler than
killing the swapper_space completely as you did. My solution saves me from
hacks like this:
if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
return mapping->a_ops->sync_page(page);
+ if (PageSwapCache(page))
+ blk_run_queues();
return 0;
}
it also saves me from reworking set_page_dirty to call __set_page_dirty_buffers
by hand. I mean, it's less intrusive.
the cpu cost is similar, since I pay for an additional compare in
page_mapping, but the code looks cleaner. Could be just my opinion
though ;).
> If your design magicked those problems away somehow, okay, but it seems
> you're finding issues with it: I think you should go back to anon mms.
the only issue I found so far is that to track the stuff in a
fine-granular way I have to forbid merging sometimes. note that
forbidding merging is a feature too: if I went down the path of a pagetable
scan on the vma to fix up all page->as.vma/anon_vma and page->index I
would then lose some historic information on the origin of certain vmas,
and I would eventually fall back to the mm-wide information if I did
total merging.
I think the probability of forbidden merging is low enough that it
doesn't matter. Also it doesn't impact the file merging in any way.
It basically merges as well as the file merging does. Right now I'm also not
overriding the initial vm_pgoff given to brand new anonymous vmas, but
I could, to boost the merging with mremapped segments. Though I don't
think it's necessary.
Overall the main reason for forbidding merging and for keeping track of
vmas and not of mm, is to be able to handle mremap as efficiently as with
2.4: I mean, your anobjrmap-5 simply reinstantiates the pte_chains, so
the vm then has to deal with both pte_chains and anonmm too.
> Go by mms, and there's only the exceedingly rare (does it ever occur
> outside our testing?) awkward case of tracking pages in a private anon
> vma inherited from parent, when parent or child mremaps it with MAYMOVE.
>
> Which I reused the pte_chain code for, but it's probably better done
> by conjuring up an imaginary tmpfs object as backing at that point
> (that has its own little cost, since the object lives on at full size
> until all its mappers unmap it, however small the portion they have
> mapped). And the overhead of the new data structre is per mm only.
>
> I'll get back to reading through the mails now: sorry if I'm about to
> find the arguments against anonmm in my reading. (By the way, several
> times you mention the size of a 2.6 struct page as larger than a 2.4
> struct page: no, thanks to wli and others it's the 2.6 that's smaller.)
really? mainline 2.6 has the same size as mainline 2.4 (48 bytes), or am
I counting wrong? (at least my 2.4-aa tree is 48 bytes too, but I
think 2.4 mainline is too) objrmap adds 4 bytes (goes to 52 bytes), my patch
removes 8 bytes (i.e. the pte_chain) and the result of my patch is 4
bytes less than 2.4 and 2.6 (44 bytes instead of 48 bytes). I wanted to
nuke the mapcount too but that destroys the nr_mapped info, and that
spreads all over, so for now I keep the page->mapcount ;)
ok, it links and boots ;)
at the previous try, with slab debugging enabled, it was spawning tons
of errors, but I suspect it's a bug in the slab debugging: it was
complaining about red zone memory corruption, which could be due to the
tiny size of this object (only 8 bytes).
andrea@xeon:~> grep anon_vma /proc/slabinfo
anon_vma 1230 1500 12 250 1 : tunables 120 60 8 : slabdata 6 6 0
andrea@xeon:~>
now I need to try swapping... (I guess it won't work at the first try,
I'd be surprised if I didn't miss any s/index/private/)
>
>
>at the previous try, with slab debugging enabled, it was spawning tons
>of errors but I suspect it's a bug in the slab debugging, it was
>complaining with red zone memory corruption, could be due the tiny size
>of this object (only 8 bytes).
>
>andrea@xeon:~> grep anon_vma /proc/slabinfo
>anon_vma 1230 1500 12 250 1 : tunables 120 60 8 : slabdata 6 6 0
>
According to the slabinfo line, 12 bytes. The revoke_table is 12 bytes,
too, and I'm not aware of any problems with slab debugging enabled.
Could you send me the first few errors?
--
Manfred
On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2004 at 01:23:24PM +0000, Hugh Dickins wrote:
> >
> > Go by vmas and you have tiresome problems as they are split and merged,
> > very commonly. Plus you have the overhead of new data structure per vma.
>
> it's more complicated because it's more finegrined and it can handle
> mremap too. I mean, the additional cost of tracking the vmas payoffs
> because then we've a tiny list of vma to search for every page,
> otherwise with the mm-wide model we'd need to search all of the vmas in
> a mm. This is quite important during swapping with tons of vmas. Note
> that in my common case the page will point directly to the vma
> (PageDirect(page) == 1), no find_vma or whatever needed in between.
Nice if you can avoid the find_vma, but it is (or was) used in the
objrmap case, so I was happy to have it in the anobj case also.
Could you post a patch against 2.6.3 or 2.6.4? Your objrmap patch
applies with offsets, no problem, but your anobjrmap patch doesn't
apply cleanly on top of that, partly because you've renamed files
in between (revert that?), but there seem to be other untracked
changes too. I may not be seeing the whole story right.
Great to see the pte_chains gone, but I find what you have for anon
vmas strangely complicated: the continued existence of PageDirect etc.
I guess, having elected to go by vmas, you're trying to avoid some of
the overhead until fork. But that does make it messy to my eyes,
the anonmm way much cleaner and simpler in that regard.
> I want the same pagecache/swapcache code to work transparently, but I
> free up the page->index and the page->mapping for the swapcache, so that
> I can reuse it to track the anon_vma. I think the above is simpler than
> killing the swapper_space completely as you did. My solution avoids me
> hacks like this:
>
> if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
> return mapping->a_ops->sync_page(page);
> + if (PageSwapCache(page))
> + blk_run_queues();
> return 0;
> }
>
> it also avoids me rework set_page_dirty to call __set_page_dirty_buffers
> by hand too. I mean, it's less intrusive.
There may well be better ways of reassigning the page struct fields
than I had, making for less extensive changes, yes. Best to go with the
least intrusive for now (so long as not too ugly) and reappraise later.
> Overall the main reason for forbidding keeping track of vmas and not of
> mm, is to be able to handle mremap as efficiently as with 2.4, I mean
> your anobjrmap-5 simply reistantiate the pte_chains, so the vm then has
> to deal with both pte_chains and anonmm too.
Yes, I used pte_chains for that because we hadn't worked out how to
do remap_file_pages without them (I've not yet looked into how you're
handling those), so might as well put them to use here too. But if
nonlinear is now relieved of pte_chains, great, and as I said below,
the anonmm mremap case should be able to conjure a tmpfs backing object
- which probably amounts to your anon_vma, but only needed in that one
odd case, anon mm sufficient for all the rest, less overhead all round.
> > Go by mms, and there's only the exceedingly rare (does it ever occur
> > outside our testing?) awkward case of tracking pages in a private anon
> > vma inherited from parent, when parent or child mremaps it with MAYMOVE.
> >
> > Which I reused the pte_chain code for, but it's probably better done
> > by conjuring up an imaginary tmpfs object as backing at that point
> > (that has its own little cost, since the object lives on at full size
> > until all its mappers unmap it, however small the portion they have
> > mapped). And the overhead of the new data structre is per mm only.
> >
> > I'll get back to reading through the mails now: sorry if I'm about to
> > find the arguments against anonmm in my reading. (By the way, several
> > times you mention the size of a 2.6 struct page as larger than a 2.4
> > struct page: no, thanks to wli and others it's the 2.6 that's smaller.)
>
> really? mainline 2.6 has the same size of mainline 2.4 (48 bytes), or
> I'm counting wrong? (at least my 2.4-aa tree is 48 bytes too, but I
> think 2.4 mainline too) objrmap adds 4 bytes (goes to 52bytes), my patch
> removes 8 bytes (i.e. the pte_chain) and the result of my patch is 4
> bytes less than 2.4 and 2.6 (44 bytes instead of 48 bytes). I wanted to
> nuke the mapcount too but that destroy the nr_mapped info, and that
> spreads all over so for now I keep the page->mapcount ;)
I think you were counting wrong. Mainline 2.4 i386 48 bytes, agreed.
Mainline 2.6 i386 40 bytes, or 44 bytes if PAE & HIGHPTE. And today,
2.6.4-mm1 i386 32 bytes, or 36 bytes if PAE & HIGHPTE. Though of course
the vanished fields will often be countered by memory usage elsewhere.
Yes, keep mapcount for now: I went around that same loop, it
surely has the feel of something that can be disposed of in the end,
but there's no need to attempt that while doing this objrmap job,
it's better done after since it needs a different kind of care.
(Be aware that shmem_writepage will do the wrong thing, COWing what
should be a shared page, if it is ever given a still-mapped page:
but no need to worry about that now, and it may be easy to work it
differently once the rmap changes settle down. As to shmem_writepage
going directly to swap, by the way: I'm perfectly happy for you to
make that change, but I don't believe the old way was mistaken - it
intentionally gave tmpfs pages which should remain in memory another
go around. I was never convinced one way or the other: but the current
code works very badly for some loads, as you found, I doubt there are
any that will suffer so greatly from the change, so go ahead.)
Hugh
On Thu, 11 Mar 2004, Hugh Dickins wrote:
> length of your essay on vma merging, it strikes me that you've taken
> a wrong direction in switching from my anon mm to your anon vma.
>
> Go by vmas and you have tiresome problems as they are split and merged,
> very commonly. Plus you have the overhead of new data structure per vma.
There's of course a blindingly simple alternative.
Add every anonymous page to an "anon_memory" inode. Then
everything is in effect file backed. Using the same page
refcounting we already do, holes get shot into that "file".
The swap cache code provides a filesystem like mapping
from the anon_memory "files" to the on-disk stuff, or the
anon_memory file pages are resident in memory.
As a side effect, it also makes it possible to get rid
of the swapoff code, simply move the anon_memory file
pages from disk into memory...
We can avoid BSD memory object like code by simply having
multiple processes share the same anon_memory inode, allocating
extents of virtual space at once to reduce VMA count.
Not sure to which extent this is similar to what Hugh's stuff
already does though, or if it's just a different way of saying
how it's done ... I need to re-read the code ;)
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Thu, 11 Mar 2004, Rik van Riel wrote:
> On Thu, 11 Mar 2004, Hugh Dickins wrote:
>
> > length of your essay on vma merging, it strikes me that you've taken
> > a wrong direction in switching from my anon mm to your anon vma.
> >
> > Go by vmas and you have tiresome problems as they are split and merged,
> > very commonly. Plus you have the overhead of new data structure per vma.
>
> There's of course a blindingly simple alternative.
>
> Add every anonymous page to an "anon_memory" inode. Then
> everything is in effect file backed. Using the same page
> refcounting we already do, holes get shot into that "file".
Okay, Rik, the two extremes belong to you: one anon memory
object in total (above), and one per page (your original rmap);
whereas Andrea is betting on one per vma, and I go for one per mm.
Each way has its merits, I'm sure - and you've placed two bets!
> The swap cache code provides a filesystem like mapping
> from the anon_memory "files" to the on-disk stuff, or the
> anon_memory file pages are resident in memory.
For 2.7 something like that may well be reasonable.
But let's beware the fancy bloat of extra levels.
> As a side effect, it also makes it possible to get rid
> of the swapoff code, simply move the anon_memory file
> pages from disk into memory...
Wonderful if that code could disappear: but I somehow doubt
it'll fall out quite so easily - swapoff is inevitably
backwards from sanity, isn't it?
Hugh
On Thu, Mar 11, 2004 at 09:54:01PM +0000, Hugh Dickins wrote:
> Could you post a patch against 2.6.3 or 2.6.4? Your objrmap patch
I uploaded my latest status, there are three patches, the first is
Dave's objrmap, the second is your anobjrmap-1, the third is my anon_vma
work that removes the pte_chains all over the kernel.
my patch is not stable yet, it crashes during swapping and the debugging
code catches bugs even before swapping (which is good):
0 0 0 404468 11900 41276 0 0 0 0 1095 61 0 0 100 0
0 0 0 404468 11900 41276 0 0 0 0 1108 71 0 0 100 0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 404468 11908 41268 0 0 0 136 1102 59 0 0 100 0
1 0 0 310972 11908 41268 0 0 0 0 1100 50 2 7 91 0
1 0 0 66748 11908 41268 0 0 0 0 1085 30 6 19 75 0
1 1 128 2648 216 14132 0 128 0 256 1118 139 3 16 73 8
1 2 77084 1332 232 2188 0 76952 308 76952 1162 255 1 10 54 35
I hope to make it work tomorrow, then the next two things to do are the
pagetable walk in the nonlinear (currently it's pinned) and the rbtree
(or prio_tree) for the i_mmap{,shared}. Then it will be complete and
mergeable.
http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.3/objrmap
On Fri, Mar 12, 2004 at 02:47:10AM +0100, Andrea Arcangeli wrote:
> my patch is not stable yet, it crashes during swapping and the debugging
> code catches bug even before swapping (which is good):
I fixed some more bugs (s/index/private/), it's not stable yet but some basic
swapping works now (there is probably still some issue with shared swapcache,
since ps just oopsed, and ps may be sharing cow swapcache through
fork).
0 0 0 408712 7800 41160 0 0 0 0 1131 46 0 0 95 5
0 0 0 408712 7800 41160 0 0 0 0 1102 64 0 0 100 0
0 0 0 408712 7800 41160 0 0 0 0 1090 40 0 0 100 0
0 0 0 408712 7800 41160 0 0 0 0 1107 84 0 0 100 0
0 0 0 408712 7808 41152 0 0 0 84 1101 66 0 0 100 0
0 0 0 408712 7808 41152 0 0 0 0 1096 52 0 0 100 0
1 0 0 264808 7808 41152 0 0 0 0 1093 49 5 16 79 0
1 0 0 51636 7808 41152 0 0 0 0 1083 34 5 20 75 0
1 1 128 2384 212 14068 0 128 0 204 1106 178 1 7 73 19
1 2 82824 2332 200 2136 32 82668 40 82668 1221 1955 1 12 49 38
1 2 130000 2448 208 1868 32 47048 312 47048 1184 782 0 5 60 35
0 3 178700 1676 208 2428 10388 48700 11000 48700 1536 1291 0 4 55 40
0 3 205996 1780 216 1992 4264 27224 4424 27224 1312 549 1 4 41 55
2 2 238900 4148 240 2388 88 32980 684 32984 1190 1380 1 6 23 69
0 3 295124 1996 244 2392 92 56148 232 56148 1223 149 1 6 38 54
0 2 315204 2036 244 2356 0 19972 0 19972 1172 55 1 2 52 45
1 0 334052 3924 264 2592 192 18720 372 18720 1205 154 0 1 35 63
0 3 377208 2324 264 1928 64 42984 64 42984 1249 208 2 6 39 53
0 1 389856 3408 264 2032 128 12680 224 12680 1187 159 0 1 60 38
0 0 374032 263036 316 3504 920 0 2464 0 1258 224 0 2 76 23
0 0 374032 263036 316 3504 0 0 0 0 1087 27 0 0 100 0
0 0 374032 263036 316 3504 0 0 0 0 1083 25 0 0 100 0
0 0 374032 263040 316 3504 0 0 0 0 1086 25 0 0 100 0
0 0 374032 263040 316 3504 0 0 0 0 1084 27 0 0 100 0
0 0 374032 263128 316 3504 0 0 0 0 1086 23 0 0 100 0
0 0 374032 263164 316 3472 32 0 32 0 1086 23 0 0 100 0
0 0 374032 263212 316 3508 32 0 32 0 1086 25 0 0 100 0
I uploaded a new anon_vma patch in the same directory with the fixes to make
the basic swapping work. Tomorrow I'll look into the ps oops and into
heavy cow loads.
On Thu, 11 Mar 2004, Hugh Dickins wrote:
> Okay, Rik, the two extremes belong to you: one anon memory
> object in total (above), and one per page (your original rmap);
> whereas Andrea is betting on one per vma, and I go for one per mm.
> Each way has its merits, I'm sure - and you've placed two bets!
I suspect yours is the best mix.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
> it's more complicated because it's more finegrined and it can handle
> mremap too. I mean, the additional cost of tracking the vmas payoffs
> because then we've a tiny list of vma to search for every page,
> otherwise with the mm-wide model we'd need to search all of the vmas in
> a mm.
Actually, with the code Rajesh is working on there's
no search problem with Hugh's idea.
Considering the fact that we'll need Rajesh's code
anyway, to deal with Ingo's test program and the real
world programs that do similar things, I don't see how
your objection to Hugh's code is still valid.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
> On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
>
> > it's more complicated because it's more finegrined and it can handle
> > mremap too. I mean, the additional cost of tracking the vmas payoffs
> > because then we've a tiny list of vma to search for every page,
> > otherwise with the mm-wide model we'd need to search all of the vmas in
> > a mm.
>
> Actually, with the code Rajesh is working on there's
> no search problem with Hugh's idea.
you missed the fact that mremap doesn't work there; that's the fundamental
reason for the vma tracking, so you can use vm_pgoff.
if you take Hugh's anonmm, mremap will attach a persistent dynamic
overhead to the vma it touches. Currently it does so in the form of
pte_chains, which can be converted to other means of overhead, but I
simply don't like it.
I like all vmas to be symmetric to each other, without special hacks to
handle mremap right.
We have the vm_pgoff to handle mremap and I simply use that.
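To illustrate what using the vm_pgoff buys here, a rough sketch
(vma_address() is just an illustrative helper for this mail, not
necessarily what's in my patch):
static unsigned long vma_address(struct page * page, struct vm_area_struct * vma)
{
	/* with anon_vma the page->index is an offset, like for file maps */
	unsigned long pgoff = page->index;
	/* mremap(MAYMOVE) updates vm_start/vm_end/vm_pgoff together, so
	 * the ptes are still found without touching any page->index */
	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}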
> Considering the fact that we'll need Rajesh's code
> anyway, to deal with Ingo's test program and the real
Rajesh's code has nothing to do with the mremap breakage, Rajesh's code
can only boost the search of the interesting vmas in an anonmm, it
doesn't solve mremap.
> world programs that do similar things, I don't see how
> your objection to Hugh's code is still valid.
This was my objection, maybe you didn't read all my emails, I quote
again:
"Overall the main reason for forbidding merging and for keeping track of
vmas and not of mm, is to be able to handle mremap as efficiently as with
2.4: I mean, your anobjrmap-5 simply reinstantiates the pte_chains, so
the vm then has to deal with both pte_chains and anonmm too."
As said, one can convert the pte_chains to other means of overhead, but
it's still a hack and you'll need transient objects to track those if
you don't track fine-grained by vma as I'm doing.
It's not that I didn't read the anonmm patches from Hugh, I spent lots of
time on those; they just were flawed and they couldn't handle mremap,
as he well knows, see anobjrmap-5 for instance.
the vma merging isn't a problem, we need to rework the code anyway to
allow the file merging in both mprotect and mremap (currently only mmap
is capable of merging files, and in turn it's also the only one capable
of merging anon_vmas). Any merging code that is currently capable of
merging files is easy to teach about anon_vmas too, it's basically the
same merging problem.
On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
> > Actually, with the code Rajesh is working on there's
> > no search problem with Hugh's idea.
>
> you missed the fact mremap doesn't work, that's the fundamental reason
> for the vma tracking, so you can use vm_pgoff.
>
> if you take Hugh's anonmm, mremap will be attaching a persistent dynamic
> overhead to the vma it touches. Currently it does in form of pte_chains,
> that can be converted to other means of overhead, but I simply don't
> like it.
>
> I like all vmas to be symmetric to each other, without special hacks to
> handle mremap right.
>
> We have the vm_pgoff to handle mremap and I simply use that.
Would it be possible to get rid of that if we attached
a struct address_space to each mm_struct after exec(),
sharing the address_space between parent and child
processes after a fork() ?
Note that the page cache can handle up to 2^42 bytes
in one address_space on a 32 bit system, so there's
more than enough space to be shared between parent and
child processes.
Then the vmas can track vm_pgoff inside the address
space attached to the mm.
> > Considering the fact that we'll need Rajesh's code
> > anyway, to deal with Ingo's test program and the real
>
> Rajesh's code has nothing to do with the mremap breakage, Rajesh's code
> can only boost the search of the interesting vmas in an anonmm, it
> doesn't solve mremap.
If you mmap a file, then mremap part of that mmap, where's
the special case ?
> "Overall the main reason for forbidding keeping track of vmas and not of
> mm, is to be able to handle mremap as efficiently as with 2.4, I mean
> your anobjrmap-5 simply reistantiate the pte_chains, so the vm then has
> to deal with both pte_chains and anonmm too."
Yes, that's a problem indeed. I'm not sure it's fundamental
or just an implementation artifact, though...
> the vma merging isn't a problem, we need to rework the code anyways to
> allow the file merging in both mprotect and mremap
Agreed.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Fri, Mar 12, 2004 at 01:21:27PM +0100, Andrea Arcangeli wrote:
> Rajesh's code has nothing to do with the mremap breakage, Rajesh's code
> can only boost the search of the interesting vmas in an anonmm, it
> doesn't solve mremap.
btw, one more detail: Rajesh's code will fall apart while dealing with
the dynamic metadata attached to vmas relocated by mremap. his code is
usable out of the box only on top of anon_vma (where
vm_pgoff/vm_start/vm_end retain the same semantics as the file mappings
in the i_mmap list), not on top of anonmm, where you'll have to stack
some other dynamic structure (like the pte_chains today in anobjrmap-5).
Not sure how well his code could be modified to take into account the
dynamic data structure generated by mremap.
Also don't forget Rajesh's code doesn't come for free, it also adds overhead
to the vma, so if you need the tree in the anonmm too (not only in the
inode), you'll grow the vma size too (I grow it by 12 bytes with
anon_vma, but then I don't need complex metadata dynamically allocated later
in mremap and I don't need the rbtree search either since it's
fine-grained enough).
I also expect you'll still have significant problems merging two vmas, one
touched by mremap, the other not, since then the dynamic objects would
need to be "partial" for only a part of the vma, complicating even
further the "tree search" with ranges in the sub-metadata attached to
the vma.
On Fri, Mar 12, 2004 at 01:21:27PM +0100, Andrea Arcangeli wrote:
> you missed the fact mremap doesn't work, that's the fundamental reason
> for the vma tracking, so you can use vm_pgoff.
> if you take Hugh's anonmm, mremap will be attaching a persistent dynamic
> overhead to the vma it touches. Currently it does in form of pte_chains,
> that can be converted to other means of overhead, but I simply don't
> like it.
> I like all vmas to be symmetric to each other, without special hacks to
> handle mremap right.
> We have the vm_pgoff to handle mremap and I simply use that.
Absolute guarantees are nice but this characterization is too extreme.
The case where mremap() creates rmap_chains is so rare I never ever saw
it happen in 6 months of regular practical use and testing. Their
creation could be triggered only by remap_file_pages().
-- wli
On Fri, Mar 12, 2004 at 07:40:51AM -0500, Rik van Riel wrote:
> On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> > On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
>
> > > Actually, with the code Rajesh is working on there's
> > > no search problem with Hugh's idea.
> >
> > you missed the fact mremap doesn't work, that's the fundamental reason
> > for the vma tracking, so you can use vm_pgoff.
> >
> > if you take Hugh's anonmm, mremap will be attaching a persistent dynamic
> > overhead to the vma it touches. Currently it does in form of pte_chains,
> > that can be converted to other means of overhead, but I simply don't
> > like it.
> >
> > I like all vmas to be symmetric to each other, without special hacks to
> > handle mremap right.
> >
> > We have the vm_pgoff to handle mremap and I simply use that.
>
> Would it be possible to get rid of that if we attached
> a struct address_space to each mm_struct after exec(),
> sharing the address_space between parent and child
> processes after a fork() ?
> Note that the page cache can handle up to 2^42 bytes
> in one address_space on a 32 bit system, so there's
> more than enough space to be shared between parent and
> child processes.
>
> Then the vmas can track vm_pgoff inside the address
> space attached to the mm.
I can't understand, sorry.
I don't see what you mean by sharing the same address space between
parent and child; any _global_ mm-wide address space gets screwed by
mremap, and if you don't use the vm_pgoff to offset the page->index, the
vm_start/vm_end mean nothing.
I think the anonmm design is flawed and has no way to handle
mremap reasonably well, though feel free to keep doing research on that.
I would be happy to use a simpler and more efficient design; I
tried to reuse the anonmm, but it was overly complex in design and
also inefficient at dealing with mremap, so I had little doubt I had to
change that, and the anon_vma idea solved all the issues with anonmm, so
I started coding that.
If you don't track by vmas (like I'm doing), and you allow merging of
two different vmas, one touched by mremap and the other not, you'll end
up mixing the vm_pgoff and the whole anonmm falls apart, and the tree
search falls apart too after you've lost the vm_pgoff of the vma that got
merged.
Hugh solved this by simply saying that anonmm isn't capable of dealing
with mremap and by using the pte_chains as if it were the rmap vm, after
the first mremap. That's bad, and whatever solution more efficient than
the pte_chains (for example metadata tracking a range, not wasting
bytes for every single page in the range like rmap does) will still
be a mess in terms of vma merging, tracking and rbtree/prio_tree search,
and it won't be obviously more efficient at all, since you'll still
have to use the tree, and in all common cases my design will beat the
tree performance (even ignoring the mremap overhead with anonmm). the
way I defer the anon_vma allocation and instantiate direct pages
is likewise extremely efficient compared to the anonmm.
The only thing I disallow is the merging of two vmas with different
anon_vma or different vm_pgoff, but that's a feature: if you don't do
that in the anonmm design, you'll have to allocate dynamic structures on
top of the vma tracking partial ranges within each vma, which can be a
lot slower and is so messy to deal with that I never even remotely
considered writing anything like that, when I can use the pgoff with
the anon_vma_t.
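For illustration, the extra check in the merge path amounts to something
like this (just a sketch under the assumptions above; the helper name is
made up, it's not the exact code in my patch):
static inline int anon_vma_mergeable(struct vm_area_struct * prev,
				     struct vm_area_struct * next)
{
	/* refuse to merge if the two vmas track different anon_vma objects */
	if (prev->anon_vma != next->anon_vma)
		return 0;
	/* refuse to merge if the offsets don't line up contiguously,
	 * i.e. if one of the two was relocated by mremap */
	if (next->vm_pgoff != prev->vm_pgoff +
	    ((prev->vm_end - prev->vm_start) >> PAGE_SHIFT))
		return 0;
	return 1;
}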
> > > Considering the fact that we'll need Rajesh's code
> > > anyway, to deal with Ingo's test program and the real
> >
> > Rajesh's code has nothing to do with the mremap breakage, Rajesh's code
> > can only boost the search of the interesting vmas in an anonmm, it
> > doesn't solve mremap.
>
> If you mmap a file, then mremap part of that mmap, where's
> the special case ?
you miss that we already disallow the merging of vmas with mismatched
vm_pgoff if they belong to a file (vma->vm_file != NULL). In fact what my
code does is treat the anon vmas similarly to the file-vmas, and that's why
the merging probability is reduced a little bit. The simple fact that anonmm
allows merging of all anon vmas as if they were not vma-tracked tells
you anonmm is flawed w.r.t. mremap. Something has to be changed anyway
in the vma handling code (like the vma merging code) even with
anonmm, if your goal is to always pass through the vma to reach the
pagetables. Hugh solved this by not passing through the vma after the
first mremap; that works too of course, but I think my design is more
efficient. my whole effort is to avoid allocating per-page overhead and
to have a single metadata object (the vma) serving a range of pages,
which is a lot more efficient than the pte_chains and saves a load of
ram on 64bit and 32bit.
to put it another way, the problem you have with anonmm is that
after an mremap the page->index becomes invalid, and no, you can't fix up
the page->index by looping over all the pages pointed to by the vma because
those page->index values are still meaningful to other vmas in other
address spaces, where the address is still the original one (the one
before fork()).
> > "Overall the main reason for forbidding keeping track of vmas and not of
> > mm, is to be able to handle mremap as efficiently as with 2.4, I mean
> > your anobjrmap-5 simply reistantiate the pte_chains, so the vm then has
> > to deal with both pte_chains and anonmm too."
>
> Yes, that's a problem indeed. I'm not sure it's fundamental
> or just an implementation artifact, though...
I think it's fundamental, but again, if you can find a solution to that
it's more than welcome. I just don't see how you can ever handle mremap
if you treat all the vmas the same before and after mremap: if you
treat all the vmas the same you lose the vm_pgoff, and in turn you break
mremap and you can forget using the vmas for reaching the pagetables,
since you can do nothing with just the vm_start/vm_end and page->index
then.
You can still treat all of them the same by allocating dynamic stuff on
top of the vma, but that will complicate everything, including the tree
search and the vma merging too. So the few lines I had to add to the vma
merging to teach the vma layer about the anon_vma should be a whole lot
simpler and a whole lot more efficient than the ones you'd have to add to
allocate those dynamic objects sitting on top of the vmas and telling you
the right vm_pgoff per range (not to mention the handling of the oom
conditions while allocating those dynamic objects in super-spinlocked
paths; even the GFP_ATOMIC abuses from the pte_chains were nasty,
GFP_ATOMIC should be reserved for irqs and bhs since they have no way to
unlock and sleep!...).
On Fri, Mar 12, 2004 at 04:46:38AM -0800, William Lee Irwin III wrote:
> On Fri, Mar 12, 2004 at 01:21:27PM +0100, Andrea Arcangeli wrote:
> > you missed the fact mremap doesn't work, that's the fundamental reason
> > for the vma tracking, so you can use vm_pgoff.
> > if you take Hugh's anonmm, mremap will be attaching a persistent dynamic
> > overhead to the vma it touches. Currently it does in form of pte_chains,
> > that can be converted to other means of overhead, but I simply don't
> > like it.
> > I like all vmas to be symmetric to each other, without special hacks to
> > handle mremap right.
> > We have the vm_pgoff to handle mremap and I simply use that.
>
> Absolute guarantees are nice but this characterization is too extreme.
> The case where mremap() creates rmap_chains is so rare I never ever saw
> it happen in 6 months of regular practical use and testing. Their
> creation could be triggered only by remap_file_pages().
did you try specweb with apache? that's super heavy mremap as far as I
know (and it may be using anon memory, and even if not, I certainly cannot
exclude that other apps are using mremap on significant amounts of anonymous
ram). To the point that the kmap_lock for the persistent kmaps I used
originally in mremap (at least it has never been racy) was a showstopper
bottleneck spending most of the system time there (the profiling was horrible
in the kmap_lock) and I had to fix it up the 2.6 way with the per-cpu atomic
kmaps to avoid being an order of magnitude slower than the small
boxes w/o highmem.
the single reason I'm doing this work is to avoid allocating the
pte_chains and to always use the vma instead. If I have to use the
pte_chains again for mremap (hoping that no application is using mremap)
then I'm not at all happy, since people could still fall into the pte_chain
trap with some app.
Admittedly the pte_chains make perfect sense only for nonlinear vmas,
since the vma is meaningless for the nonlinear vmas and a
per-page cost really makes sense there, but I'm not going to add 8 bytes
per page to swap out the nonlinear vmas efficiently, and I'll let the cpu
pay for that if you really need to swap the nonlinear mappings (i.e. the
pagetable walk). An alternate way would have been to dynamically allocate
the per-pte pointer, but that would throw a whole lot of memory at the
problem too, and one of the main points of using nonlinear maps is to
avoid the allocation of the vmas, so I doubt people really want to
allocate lots of ram to handle nonlinear efficiently, so I believe
saving all the ram at the expense of cpu cost during swapping will be ok.
On Fri, Mar 12, 2004 at 02:24:36PM +0100, Andrea Arcangeli wrote:
> did you try specweb with apache? that's super heavy mremap as far as I
> know (and it maybe using anon memory, and if not I certainly cannot
> exclude other apps are using mremap on significant amounts of anymous
> ram). To a point that the kmap_lock for the persistent kmaps I used
> originally in mremap (at least it has never been racy) was a showstopper
> bottleneck spending most of system time there (profiling was horrible in
> the kmap_lock) and I had to fixup the 2.6 way with the per-cpu atomic
> kmaps to avoid being an order of magnitude slower than in the small
> boxes w/o highmem.
No. I have never had access to systems set up for specweb.
-- wli
Thanks a lot for pointing us to your (last night's) patches, Andrea.
On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
>
> It's not that I didn't read anonmm patches from Hugh, I spent lots of
> time on those, they just were flawed and they couldn't handle mremap,
> he very well knows, see anobjrmap-5 for istance.
Flawed in what way? They handled mremap fine, but yes, used pte_chains
for that extraordinary case, just as pte_chains were used for nonlinear.
With pte_chains gone (hurrah! though nonlinear handling yet to come),
as you know, I've already suggested a better way to handle that case
(use tmpfs-style backing object).
> the vma merging isn't a problem, we need to rework the code anyways to
> allow the file merging in both mprotect and mremap (currently only mmap
> is capable of merging files, and in turn it's also the only one capable
> of merging anon_vmas). Any merging code that is currently capable of
> merging files is easy to teach about anon_vmas too, it's basically the
> same problem at merging.
You're paying too much attention to the (almost optional, though it can
have a devastating effect on vma usage, yes) issue of vma merging, but
what about the (mandatory) vma splitting? I see no sign of the tiresome
code I said you'd need for anonvma rather than anonmm, walking the pages
updating as.vma whenever vma changes e.g. when mprotecting or munmapping
some pages in the middle of a vma. Surely move_vma_start is not enough?
That's what led me to choose anonmm, which seems a lot simpler: the real
argument for anonvma is that it saves a find_vma per pte in try_to_unmap
(page_referenced doesn't need it): a good saving, but is it worth the
complication of the faster paths?
Hugh
On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Fri, Mar 12, 2004 at 04:46:38AM -0800, William Lee Irwin III wrote:
> >
> > The case where mremap() creates rmap_chains is so rare I never ever saw
> > it happen in 6 months of regular practical use and testing. Their
> > creation could be triggered only by remap_file_pages().
>
> did you try specweb with apache? that's super heavy mremap as far as I
> know (and it maybe using anon memory, and if not I certainly cannot
> exclude other apps are using mremap on significant amounts of anymous
> ram).
anonmm has no problem with most mremaps: the special case is for
mremap MAYMOVE of anon vmas _inherited from parent_ (same page at
different addresses in the different mms). As I said before, it's
quite conceivable that this case never arises outside our testing
(but I'd be glad to be shown wrong, would make effort worthwhile).
Hugh
On Fri, Mar 12, 2004 at 01:43:23PM +0000, Hugh Dickins wrote:
> Thanks a lot for pointing us to your (last night's) patches, Andrea.
>
> On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> > On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
> >
> > It's not that I didn't read anonmm patches from Hugh, I spent lots of
> > time on those, they just were flawed and they couldn't handle mremap,
> > he very well knows, see anobjrmap-5 for istance.
>
> Flawed in what way? They handled mremap fine, but yes, used pte_chains
> for that extraordinary case, just as pte_chains were used for nonlinear.
"using pte_chains for the extraordinary case" (which is a common case
for some apps) means it doesn't handle it, and you've to use rmap to
handle that case.
> With pte_chains gone (hurrah! though nonlinear handling yet to come),
> as you know, I've already suggested a better way to handle that case
> (use tmpfs-style backing object).
Do you realize the complexity of creating a tmpfs inode and attaching
all the vmas to it, stacked on top of anonmm? And after you fix mremap you
get the same disadvantages for vma merging (remember my
disadvantage of not merging after an mremap: you won't merge either), plus it
wastes a lot more ram since you need a fake inode for every anonymous
vma and it's ugly to create those objects inside mremap. My transient
object is 8 bytes per group of vmas. And you even need the prio_tree
search on top of the anonmm.
Don't forget you can't reuse the vma->shared for doing the tmpfs-style
thing, that's already queued in a true inode. so what you're suggesting would
become a huge mess to implement IMHO. the anon_vma sounds like a much
cleaner and more efficient design to me than stacking inode-like objects
on top of a vma already queued in an i_mmap list.
> > the vma merging isn't a problem, we need to rework the code anyways
> > to
> > allow the file merging in both mprotect and mremap (currently only mmap
> > is capable of merging files, and in turn it's also the only one capable
> > of merging anon_vmas). Any merging code that is currently capable of
> > merging files is easy to teach about anon_vmas too, it's basically the
> > same problem at merging.
>
> You're paying too much attention to the (almost optional, though it can
> have a devastating effect on vma usage, yes) issue of vma merging, but
> what about the (mandatory) vma splitting? I see no sign of the tiresome
> code I said you'd need for anonvma rather than anonmm, walking the pages
> updating as.vma whenever vma changes e.g. when mprotecting or munmapping
> some pages in the middle of a vma. Surely move_vma_start is not enough?
you're right about vma_split, the way I implemented it is wrong,
basically the as.vma/PageDirect idea is falling apart with vma_split.
I should simply allocate the anon_vma without passing through the direct
mode, that will fix it though it'll be a bit less efficient for the
first page fault in an anonymous vma (only the first one, for all the
other page faults it'll be as fast as the direct mode).
this is probably why the code was not stable yet btw ;) so I greatly
appreciate your comments about it, it's just the optimization I did that
was invalid.
I could retain the optimization with a list of pages attached to the vma
but it isn't worth it, allocating the anon_vma is way too cheap
compared to that. the PageDirect was a micro-optimization only, and any
additional complexity to retain it is worthless.
> That's what led me to choose anonmm, which seems a lot simpler: the real
> argument for anonvma is that it saves a find_vma per pte in try_to_unmap
> (page_referenced doesn't need it): a good saving, but is it worth the
> complication of the faster paths?
the only real argument is mremap, your tmpfs-like thing is overkill
compared to anon_vma, and secondly I don't need the prio_tree to scale.
On Fri, Mar 12, 2004 at 01:55:30PM +0000, Hugh Dickins wrote:
> On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> > On Fri, Mar 12, 2004 at 04:46:38AM -0800, William Lee Irwin III wrote:
> > >
> > > The case where mremap() creates rmap_chains is so rare I never ever saw
> > > it happen in 6 months of regular practical use and testing. Their
> > > creation could be triggered only by remap_file_pages().
> >
> > did you try specweb with apache? that's super heavy mremap as far as I
> > know (and it maybe using anon memory, and if not I certainly cannot
> > exclude other apps are using mremap on significant amounts of anymous
> > ram).
>
> anonmm has no problem with most mremaps: the special case is for
> mremap MAYMOVE of anon vmas _inherited from parent_ (same page at
> different addresses in the different mms). As I said before, it's
> quite conceivable that this case never arises outside our testing
> (but I'd be glad to be shown wrong, would make effort worthwhile).
the problem is that it _can_ arise, and fixing that is a huge mess
without using the pte_chains IMHO (no hope of using the vma->shared).
I also don't see how you can know whether a vma points only to "direct"
pages, and in turn whether you can move it somewhere else without the
pte_chains.
sure you can move all anon vmas freely after an execve, but after the
first fork (and in turn with cow pages going on) all mremaps will be
non-trackable with anonmm, right? lots of server processes use the fork()
model for their children, and they can run mremap inside the child on memory
malloced inside the child, and I don't think you can easily track whether
the malloc happened inside the child or inside the parent, though I may be
wrong on this.
On Fri, 12 Mar 2004, William Lee Irwin III wrote:
>
> Absolute guarantees are nice but this characterization is too extreme.
> The case where mremap() creates rmap_chains is so rare I never ever saw
> it happen in 6 months of regular practical use and testing. Their
> creation could be triggered only by remap_file_pages().
I have to _violently_ agree with Andrea on this one.
The absolute _LAST_ thing we want to have is a "remnant" rmap
infrastructure that only gets very occasional use. That's a GUARANTEED way
to get bugs, and really subtle behaviour.
I think Andrea is 100% right. Either do rmap for everything (like we do
now, modulo IO/mlock), or do it for _nothing_. No half measures with
"most of the time".
Quite frankly, the stuff I've seen suggested sounds absolutely _horrible_.
Special cases are not just a pain to work with, they definitely will cause
bugs. It's not a matter of "if", it's a matter of "when".
So let's make it clear: if we have an object-based reverse mapping, it
should cover all reasonable cases, and in particular, it should NOT have
rare fallbacks to code that thus never gets any real testing.
And if we have per-page rmap like now, it should _always_ be there.
You do have to realize that maintainability is a HELL of a lot more
important than scalability of performance can be. Please keep that in
mind.
Linus
On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Fri, Mar 12, 2004 at 01:43:23PM +0000, Hugh Dickins wrote:
>
> Don't forget you can't re-use the vma->shared for doing the tmpfs-style
> thing, that's already in a true inode.
Good point, I was overlooking that. I'll see if I can come up with
something, but that may well prove a killer.
> you're right about vma_split, the way I implemented it is wrong,
> basically the as.vma/PageDirect idea is falling apart with vma_split.
> I should simply allocate the anon_vma without passing through the direct
Yes, that'll take a lot of the branching out, all much simpler.
> mode, that will fix it though it'll be a bit less efficient for the
> first page fault in an anonymous vma (only the first one, for all the
> other page faults it'll be as fast as the direct mode).
Simpler still to allocate it earlier? Perhaps too wasteful.
Hugh
On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> I don't see what you mean with sharing the same address space between
> parent and child, whatever _global_ mm wide address space is screwed by
> mremap, if you don't use the pg_off to ofset the page->index, the
> vm_start/vm_end means nothing.
At mremap time, you don't change the page->index at all,
but only the vm_start/vm_end. Think of it as an mm_struct
pointing to a struct address_space with its anonymous
memory. On exec() the mm_struct gets a new address_space,
on fork parent and child share them.
Sharing is good enough, because there is PAGE_SIZE times
more space in a struct address_space than there's available
virtual memory in one single process. That means that for
a daemon like apache every child can simply get its own 4GB
subset of the address space for any new VMAs, while mapping
the inherited VMAs in the same way any other file is mapped.
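Roughly, the idea I have in mind looks like this (the field name is
invented for this sketch, there's no patch behind it yet):
struct mm_struct {
	/* ... */
	struct address_space * anon_space;	/* fresh at exec(), shared across fork() */
};
/* an anonymous page->index would be the offset inside mm->anon_space,
 * not a virtual address, so mremap only has to adjust the vma's
 * vm_start/vm_end/vm_pgoff and never rewrites any page->index */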
> I think the anonmm design is flawed and it has no way to handle
> mremap reasonably well,
There's no difference between mremap() of anonymous memory
and mremap() of part of an mmap() range of a file...
At least, there doesn't need to be.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Fri, Mar 12, 2004 at 04:12:10PM +0000, Hugh Dickins wrote:
> > you're right about vma_split, the way I implemented it is wrong,
> > basically the as.vma/PageDirect idea is falling apart with vma_split.
> > I should simply allocate the anon_vma without passing through the direct
>
> Yes, that'll take a lot of the branching out, all much simpler.
indeed.
> Simpler still to allocate it earlier? Perhaps too wasteful.
one trouble with allocating it earlier is that insert_vm_struct would need
to return an -ENOMEM retval, plus things like MAP_PRIVATE don't
necessarily ever need an anon_vma (true anon mappings tend to always need
it instead ;).
So I will have to add an anon_vma_prepare(vma) near all the SetPageAnon
calls. that's easy. In fact I may want to coalesce the two things together;
it will look like:
int anon_vma_prepare_page(struct vm_area_struct * vma, struct page * page)
{
	if (!vma->anon_vma) {
		vma->anon_vma = anon_vma_alloc();
		if (!vma->anon_vma)
			return -ENOMEM;
		/* single threaded, no locks needed here */
		list_add(&vma->anon_vma_node, &vma->anon_vma->anon_vma_head);
	}
	SetPageAnon(page);
	return 0;
}
I will have to handle a failure retval from there, that's the only
annoyance of removing the PageDirect optimization; I really did the
PageDirect thing mostly to leave all the anon_vma allocations to fork().
Now it's the exact opposite: fork will never need to allocate any
anon_vma anymore, it will only bump the page->mapcount.
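For example, the anonymous fault path would then do something like this
(just a sketch: the surrounding do_anonymous_page details and the error
label are omitted or made up here):
	page = alloc_page(GFP_HIGHUSER);
	if (!page)
		goto no_mem;
	/* attach the vma to an anon_vma (allocated on first use), mark the
	 * page anonymous, and propagate the new -ENOMEM case */
	if (anon_vma_prepare_page(vma, page)) {
		page_cache_release(page);
		goto no_mem;
	}
	/* ... then set the pte and do page_add_rmap() as usual ... */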
>> have a devastating effect on vma usage, yes) issue of vma merging, but
>> what about the (mandatory) vma splitting? ...[snip]
> you're right about vma_split, the way I implemented it is wrong,
> basically the as.vma/PageDirect idea is falling apart with vma_split.
Why do you have to fix up all the page structs' PageDirect and as.vma
fields when a vma_split or vma_merge occurs?
Can't you do it lazily on the next page_referenced or page_add_rmap,
etc.? Anyway, we can get to the anon_vma using as.vma->anon_vma.
I understand that currently your code assumes that if PageDirect is
set, then there cannot be an anon_vma corresponding to the page.
Rajesh
On Fri, Mar 12, 2004 at 11:25:27AM -0500, Rik van Riel wrote:
> pointing to a struct address_space with its anonymous
> memory. On exec() the mm_struct gets a new address_space,
> on fork parent and child share them.
isn't this what anonmm is already doing? are you suggesting something
different?
> There's no difference between mremap() of anonymous memory
> and mremap() of part of an mmap() range of a file...
>
> At least, there doesn't need to be.
the anonmm simply cannot work because it doesn't reach the vmas, it only
reaches the mm, and with an mm and a virtual address you cannot reach the
right vma if it was moved around by mremap; you don't even see any
vm_pgoff during the lookup, so there is no way to fix anonmm with a prio_tree.
something in between anon_vma and anonmm that could handle mremap too
would have been possible, but it has downsides not fixable with a prio_tree:
it consists of queueing all the _vmas_ (not the mm!) into an
anon_vma object, and then you have to fix up the vma merging code to
forbid merging with different vm_pgoff. That would be like anon_vma but
it would not be fine-grained like anon_vma is: you'd end up scanning very
old vma segments in other address spaces despite working with
direct memory now. Such a model (let's call it anon_vma_global) would save
8 bytes per vma of anon_vma objects. Maybe that's the model that DaveM
implemented originally? I think my anon_vma is superior because it's more
fine-grained (it also avoids the need of a prio_tree, even if in theory we
could stack a prio_tree on top of every anon_vma, but it's really not
needed) and the memory usage is minimal anyway (the per-vma memory cost
is the same for anon_vma and anon_vma_global, only the total number of
anon_vma objects varies). the prio_tree wouldn't fix the intermediate
model because the vma ranges could match fine in all address spaces, so
you would need the prio_tree adding another 12 bytes to each vma (on top
of the 12 bytes added by anon_vma_global), but the pages would be
different because the vma->vm_mm is different and there can be copy on
writes. this cannot happen with an inode, so the prio_tree fixes the
inode case completely while it doesn't fix the anon_vma_global design with
only one anon_vma allocated at fork for all children. anon_vma gets that
right instead (with an 8-byte cost). so overall I think anon_vma is a
much better utilization of the 12 bytes: rather than having a prio_tree
stacked on top of an anon_vma_global, I prefer to be fine-grained and to
track the stuff that not even a prio tree can track when the vma->vm_mm
has different pages for every vma in the same range.
On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Fri, Mar 12, 2004 at 11:25:27AM -0500, Rik van Riel wrote:
> > pointing to a struct address_space with its anonymous
> > memory. On exec() the mm_struct gets a new address_space,
> > on fork parent and child share them.
>
> isn't this what anonmm is already doing? are you suggesting something
> different?
I am suggesting a pointer from the mm_struct to a
struct address_space ...
> > There's no difference between mremap() of anonymous memory
> > and mremap() of part of an mmap() range of a file...
> >
> > At least, there doesn't need to be.
>
> the anonmm simply cannot work because it's not reaching vmas, it only
> reaches mm, and with an mm and a virtual address you cannot reach the
> right vma if it was moved around by mremap,
... and use the offset into the struct address_space as
the page->index, NOT the virtual address inside the mm.
On first creation of anonymous memory these addresses
could be the same, but on mremap inside a forked process
(with multiple processes sharing part of anonymous memory)
a page could have a different offset inside the struct
address space than its virtual address....
Then on mremap you only need to adjust the start and
end offsets inside the VMAs, not the page->index ...
> That would be like anon_vma but it would not be finegriend like anon_vma
> is, you'll end up scanning very old vma segments in other address spaces
Not really. On exec you can start with a new address
space entirely, so the sharing is limited only to
processes that really do share anonymous memory with
each other...
> I think my anon_vma is superior because more finegriend
Isn't being LESS finegrained the whole reason for moving
from pte based to object based reverse mapping ? ;))
> (it also avoids the need of a prio_tree even if in theory we could stack
> a prio_tree on top of every anon_vma, but it's really not needed)
We need the prio_tree anyway for files. I don't see
why we couldn't reuse that code for anonymous memory,
but instead reimplement something new...
Having the same code everywhere will definitely help
simplify things.
cheers,
Rik
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Fri, Mar 12, 2004 at 12:05:27PM -0500, Rajesh Venkatasubramanian wrote:
>
>
> >> have a devastating effect on vma usage, yes) issue of vma merging, but
> >> what about the (mandatory) vma splitting? ...[snip]
>
> > you're right about vma_split, the way I implemented it is wrong,
> > basically the as.vma/PageDirect idea is falling apart with vma_split.
>
> Why do you have to fix up all page structs' PageDirect and as.vma
> fields when a vma_split or vma_merge occurs?
>
> Can't you do it lazily on the next page_referenced or page_add_rmap,
Unfortunately I cannot do it lazily, because the paging routine will
start from the page, so if the page is not up to date it will go read
into nirvana.
> etc. Anyway we can get to the anon_vma using as.vma->anon_vma.
>
> I understand that currently your code assumes that if PageDirect is
> set, then there cannot be an anon_vma corresponding to the page.
Correct, though I will have to change that for the above problem ;(
Well, another way is to just do the pagetable walk and fix up the
page->as.vma to be a page->as.anon_vma during split/merge (actually
merge is already taken care of by forbidding merging in the interesting
cases; what I missed was the split, oh well ;). But preallocating the
anon_vma is such a small cost that it should be a lot better than
slowing down the split.
On Fri, Mar 12, 2004 at 12:23:22PM -0500, Rik van Riel wrote:
> On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> > On Fri, Mar 12, 2004 at 11:25:27AM -0500, Rik van Riel wrote:
> > > pointing to a struct address_space with its anonymous
> > > memory. On exec() the mm_struct gets a new address_space,
> > > on fork parent and child share them.
> >
> > isn't this what anonmm is already doing? are you suggesting something
> > different?
>
> I am suggesting a pointer from the mm_struct to a
> struct address_space ...
that's the anonmm:
+ mm->anonmm = anonmm;
> > > There's no difference between mremap() of anonymous memory
> > > and mremap() of part of an mmap() range of a file...
> > >
> > > At least, there doesn't need to be.
> >
> > the anonmm simply cannot work because it's not reaching vmas, it only
> > reaches mm, and with an mm and a virtual address you cannot reach the
> > right vma if it was moved around by mremap,
>
> ... and use the offset into the struct address_space as
> the page->index, NOT the virtual address inside the mm.
>
> On first creation of anonymous memory these addresses
> could be the same, but on mremap inside a forked process
> (with multiple processes sharing part of anonymous memory)
> a page could have a different offset inside the struct
> address space than its virtual address....
>
> Then on mremap you only need to adjust the start and
> end offsets inside the VMAs, not the page->index ...
I don't see how this can work: each vma needs its own vm_pgoff or a single
address space can't handle them all. Also, the page->index is the virtual
address (or the virtual offset with anon_vma); it cannot be replaced
with something global, it has to be per-page.
> Isn't being LESS finegrained the whole reason for moving
> from pte based to object based reverse mapping ? ;))
The point of the object is to cover ranges, instead of forcing per-page
overhead. Being finegrained at the vma level is fine; being finer-grained
than a vma is desirable only if there's no downside.
> > (it also avoids the need of a prio_tree even if in theory we could stack
> > a prio_tree on top of every anon_vma, but it's really not needed)
>
> We need the prio_tree anyway for files. I don't see
As I said in the last email, the prio_tree will not work for the
anon_vmas, because every vma in the same range will map to different
pages. So you'll find more vmas than the ones you're interested in.
This doesn't happen with inodes: with inodes every vma queued into the
i_mmap will be mapping to the right page _if_ it's pte_present == 1.
With your anonymous address space shared by children the prio_tree will
find lots of vmas in different vma->vm_mm, each one pointing to
different pages. So to unmap a direct page after a malloc, you may end
up scanning all the address spaces by mistake. This cannot happen with
anon_vma. Furthermore the prio_tree will waste 12 bytes per vma, while
the anon_vma design will waste _at_most_ 8 bytes per vma (actually less
if the anon_vmas are shared). And with anon_vma in practice you won't
need a prio_tree stacked on top of anon_vma. You could put one there if
you want, paying another 12 bytes per vma, but it isn't worth it. So
anon_vma takes less memory and it's more efficient as far as I can tell.
> Having the same code everywhere will definitely help
> simplify things.
Reusing the same code would be good, I agree, but I don't think it would
work as well as it does with the inodes, and with the inodes it's really
needed only for a special 32bit case, so normally the lookup would be
immediate, while here we would need it for really expensive lookups if
one has many anonymous vmas in the children, even in 64bit apps. So I
prefer a design where, prio_tree or not, the cost for well-behaved apps
on 64bit archs is the same. The prio_tree is not free, it's still
O(log(N)), and I prefer a design where the common case is N == 1 like
with anon_vma (with your address-space design N would be >1 normally in
a server app).
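For illustration, a rough sketch of what the resulting unmap walk could
look like, assuming page->index holds an offset in PAGE_SIZE units rather
than an absolute virtual address; the helpers named here are hypothetical:

        /* Sketch only: find the mapping ptes of an anonymous page via its
         * anon_vma.  page_anon_vma() and try_to_unmap_one() are
         * hypothetical helpers; page->index is an offset, so mremap only
         * has to adjust vm_pgoff, never the pages. */
        static void sketch_unmap_anon_page(struct page *page)
        {
                struct anon_vma *anon_vma = page_anon_vma(page);
                struct vm_area_struct *vma;

                spin_lock(&anon_vma->lock);
                list_for_each_entry(vma, &anon_vma->vma_head, anon_vma_node) {
                        /* in the common case exactly one vma is here (N == 1) */
                        unsigned long addr = vma->vm_start +
                                ((page->index - vma->vm_pgoff) << PAGE_SHIFT);

                        if (addr >= vma->vm_start && addr < vma->vm_end)
                                try_to_unmap_one(page, vma, addr);
                }
                spin_unlock(&anon_vma->lock);
        }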
On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Fri, Mar 12, 2004 at 12:23:22PM -0500, Rik van Riel wrote:
> > ... and use the offset into the struct address_space as
> > the page->index, NOT the virtual address inside the mm.
> As I said in the last email, the prio_tree will not work for the
> anon_vmas, because every vma in the same range will map to different
> pages. So you'll find more vmas than the ones you're interested in.
> This doesn't happen with inodes: with inodes every vma queued into the
> i_mmap will be mapping to the right page _if_ it's pte_present == 1.
You don't have multiple VMAs mapping to the same pages, but to
the same range in the address_space.
Note that the per-process virtual memory != per "fork-group"
backing address_space ...
> With your anonymous address space shared by children the prio_tree will
> find lots of vmas in different vma->vm_mm, each one pointing to
> different pages.
Nope. I wish I was better with graphical programs, or I'd
draw you a picture. ;)
> Having the same code everywhere will definitely help
> simplify things.
>
> Reusing the same code would be good I agree, but I don't think it would
> work as well as with the inodes,
> prio_tree is not free, it's still O(log(N)) and I prefer a design where
> the common case is N == 1 like with anon_vma (with your address-space
> design N would be >1 normally in a server app).
It's all a space-time tradeoff. Do you want more structures
allocated and a more complex mremap, or do you eat the O(log(N))
lookup?
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Fri, 12 Mar 2004, Rik van Riel wrote:
>
> I am suggesting a pointer from the mm_struct to a
> struct address_space ...
[ deleted ]
> Then on mremap you only need to adjust the start and
> end offsets inside the VMAs, not the page->index ...
One fundamental problem I see, maybe you can explain it to me...
- You need a _unique_ page->index start for each VMA, since each
anonymous page needs to have a unique index. Right?
- You can use the virtual address as that unique page index start
- when you mremap() an area, you leave the start indexes the same, so
that you can find the original pages (and create new ones in the old
mapping) by just searching the vma's, not by actually looking at the
page tables.
- HOWEVER, after a mremap(), when you now create a new vma (or expand an
old one) into the previously used page index area, you're now screwed.
How are you going to generate unique page indexes in this new area
without re-using the indexes that you allocated in the old (moved)
area?
I think your approach could work (reverse map by having separate address
spaces for unrelated processes), but I don't see any good "page->index"
allocation scheme that is implementable.
The "unique" page->index thing wouldn't need to have to have anything to
do with the virtual address (indeed, after a mremap it clearly cannot have
anything to do with that), but the thing is, you'd need to be able to
cover the virtual address space with whatever numbers you choose.
You'd want to allocate contiguous indexes within one "vma", since the
whole point would be to be able to try to quickly find the vma (and thus
the page) that contains one particular page, but there are no range
allocators that I can think of that allow growing the VMA after allocation
(needed for vma merging on mmap and brk()) and still keep the range of
indexes down to reasonable numbers.
Or did I totally mis-understand what you were proposing?
Linus
On Fri, 12 Mar 2004, Linus Torvalds wrote:
> I think your approach could work (reverse map by having separate address
> spaces for unrelated processes), but I don't see any good "page->index"
> allocation scheme that is implementable.
> Or did I totally mis-understand what you were proposing?
You're absolutely right. I am still trying to come up with
a way to do this.
Note that since we count page->index in PAGE_SIZE unit we
have PAGE_SIZE times as much space as a process can take,
so we definitely have enough address space to come up with
a creative allocation scheme.
I just can't think of any now ...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Rik van Riel wrote:
> On Fri, 12 Mar 2004, Linus Torvalds wrote:
>
>
>>I think your approach could work (reverse map by having separate address
>>spaces for unrelated processes), but I don't see any good "page->index"
>>allocation scheme that is implementable.
> Note that since we count page->index in PAGE_SIZE unit we
> have PAGE_SIZE times as much space as a process can take,
> so we definitely have enough address space to come up with
> a creative allocation scheme.
What happens when you have more than PAGE_SIZE processes running?
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
On Fri, 12 Mar 2004, Chris Friesen wrote:
> What happens when you have more than PAGE_SIZE processes running?
Forked off the same process ?
Without doing an exec ?
On a 32 bit system ?
You'd probably run out of space to put the VMAs,
mm_structs and pgds long before reaching this point ...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Fri, 12 Mar 2004, Chris Friesen wrote:
> I'm just thinking of the "fork 100000 kids to test 32-bit pids" sort of
> test cases.
Try that with a process that takes up 2GB of address
space ;) It won't work now and it'll fail for the
same reasons with the scheme I proposed.
Probably before the 2^44 bytes of space run out, too.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Rik van Riel wrote:
> On Fri, 12 Mar 2004, Chris Friesen wrote:
>
>
>>What happens when you have more than PAGE_SIZE processes running?
>
>
> Forked off the same process ?
> Without doing an exec ?
> On a 32 bit system ?
>
> You'd probably run out of space to put the VMAs,
> mm_structs and pgds long before reaching this point ...
I'm just thinking of the "fork 100000 kids to test 32-bit pids" sort of
test cases.
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
On Fri, Mar 12, 2004 at 02:06:17PM -0500, Rik van Riel wrote:
> On Fri, 12 Mar 2004, Chris Friesen wrote:
>
> > What happens when you have more than PAGE_SIZE processes running?
>
> Forked off the same process ?
> Without doing an exec ?
> On a 32 bit system ?
>
> You'd probably run out of space to put the VMAs,
> mm_structs and pgds long before reaching this point ...
7.5k users are being reached in a real workload with around 2 gigs mapped
per process and with tons of vmas per process. With 2.6 and faster cpus
I hope to go even further.
On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> 7.5k users are being reached in a real workload with around 2 gigs mapped
> per process and with tons of vmas per process. With 2.6 and faster cpus
> I hope to go even further.
That's not all anonymous memory, though ;)
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Fri, Mar 12, 2004 at 03:32:20PM -0500, Rik van Riel wrote:
> That's not all anonymous memory, though ;)
true, my point is that it's feasible (COW or shared is the same from a
memory footprint standpoint, actually less, since anon_vmas are a lot
cheaper than dummy shmfs inodes)
Linus Torvalds wrote:
> You'd want to allocate contiguous indexes within one "vma", since the
> whole point would be to be able to try to quickly find the vma (and thus
> the page) that contains one particular page, but there are no range
> allocators that I can think of that allow growing the VMA after allocation
> (needed for vma merging on mmap and brk()) and still keep the range of
> indexes down to reasonable numbers.
For growing, they don't have to be contiguous - it's just desirable.
When a vma is grown and the page->offset space it would like to occupy
is already taken, it can be split into two vmas.
Of course that alters mremap() semantics, which depend on vma
boundaries. (mmap, munmap and mprotect don't care). So add a vma
flag which indicates that it and the following vma(s) are a single
unit for the purpose of remapping. Call it the mremap-group flag.
Groups always have the same flags etc.; only the vm_offset varies.
In effect, I'm suggesting that instead of having vmas be the
user-visible unit, and some other finer-grained structures track page
mappings, let _vmas_ be the finer-grained structure, and make the
user-visible unit be whatever multiple consecutive vmas occur with
that flag set. (This is a good balance if the number of splits is
small; not if there are many).
It shouldn't lead to a proliferation of vmas, provided the
page->offset allocation algorithm is sufficiently sparse.
To keep the number of potential splits small, always allocate some
extra page->offset space so that a vma can grow into it. Only when it
cannot grow in page->offset space, do you create a new vma. The new
vma has extra page->offset space allocated too. That extra space
should be proportional to the size of the entire new mremap() region
(multiple vmas), not the new vma size.
In that way, I think it bounds the number of splits to O(log (n/m))
where n is the total mremap() region size, and m is the original size.
The constant in that expression is determined by the proportion that
is used for reserving extra space.
This has some consequences.
If each vma's page->offset allocation reserves space around it to
grow, then adjacent anonymous vmas won't be mergeable.
If they aren't mergeable, it begs the question of why not have an
address_space per vma, instead of per-mm, other than to save memory on
address_space structures?
Well we like them to be mergeable. Lots of reasons. So make initial
mmap() allocations not reserve page->offset space exclusively, but
make allocations done by mremap() reserve the extra space, to get that
O(log (n/m)) property.
Using the mremap-group flag, we are also able to give the appearance
of merged vmas when it would be difficult. If we want certain
anonymous vmas to be appear merged despite them having incompatible
vm_offset values, we can do that.
So going back to the question of address_space per-mm: you don't need
one, due to the mremap-group flag. It's good to use as few as
possible, but it's ok to use more than one per process or per
fork-group, when absolutely necessary.
That fixes the address_space limitation of 2^32 pages and makes
page->offset allocation _very_ simple:
1. Allocate by simply incrementing an address counter.
2. When it's about to wrap, allocate a new address_space.
3. When allocating, reserve extra space for growing.
The extra space should be proportional to the allocation, or
the total size of the region after mremap(), and clamped
to a sane maximum such as 4G minus size, and a sane minimum
such as 2^22 (room for a million reservations per address_space).
4. When allocating, look at the nearby preceding or following vma
in the virtual address space. If the amount of page->offset space
reserved by those vmas is large enough, we can claim some of that
reservation for the new allocation. If our good neighbour is
adjacent to the new vma, that means the neighbour vma is simply
grown. Otherwise, it means we create a new vma which is
vm_offset-compatible with its neighbour, allowing them to merge if
the hole between is filled.
5. By using large reservations, large regions of the virtual address
space become covered with vm_offset-compatible vmas that are mergeable
when the holes are filled.
6. When trying to merge adjacent anon vmas during ordinary
mmap/munmap/mprotect/mremap operations, if they are not
vm_offset-compatible (or their address_spaces aren't equal)
just use the mremap-group flag to make them appear merged. The
user-visible result is a single vma. The effect on the kernel
is a rare non-mergeable boundary, which will slow vma searching
marginally. The benefit is this simple allocation scheme.
This is like what we have today, with some occasional non-mergeable
vma boundaries (but only very few compared with the total
number of vmas in an mm). These boundaries are not
user-visible, and only affect the kernel algorithms - and in a
simple way.
Data structure changes required: one flag, VM_GROUP or something; each
vma needs a pointer to _its_ address_space (can share space with
vm_file or such); each vma needs to record how much page->offset space
it has reserved beyond its own size. VM_GROWSDOWN vmas might want to
record a reservation down rather than up.
-- Jamie
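As a concrete illustration of points 1-3 above, a minimal sketch of such a
bump allocator; the structure and helper names are invented here, and the
clamping of the reservation suggested in the text is left out for brevity:

        /* Sketch of the bump allocator outlined above; all names are
         * hypothetical.  Offsets are in PAGE_SIZE units, like page->index. */
        struct anon_offset_space {
                struct address_space *mapping;  /* current backing space */
                unsigned long next;             /* next free page->offset */
        };

        static unsigned long alloc_anon_offsets(struct anon_offset_space *s,
                                                unsigned long pages,
                                                unsigned long reserve)
        {
                unsigned long start;
                /* reserve extra room so the vma can grow without a split */
                unsigned long want = pages + reserve;

                if (s->next + want < s->next) {         /* about to wrap... */
                        s->mapping = new_anon_address_space();  /* hypothetical */
                        s->next = 0;                    /* ...start fresh */
                }
                start = s->next;
                s->next += want;
                return start;                           /* becomes vm_pgoff */
        }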
>> I think your approach could work (reverse map by having separate
>> address
>> spaces for unrelated processes), but I don't see any good "page->index"
>> allocation scheme that is implementable.
>> Or did I totally mis-understand what you were proposing?
> You're absolutely right. I am still trying to come up with
> a way to do this.
> [snip]
> I just can't think of any now ...
At least one solution exists. It may be just an academic solution, though.
Add a new prio_tree root "remap_address" to the anonmm address_space
structure.
struct anon_remap_address {
        unsigned long old_page_index_start;     /* start of the moved index range */
        unsigned long old_page_index_end;       /* end of the moved index range */
        unsigned long new_page_index;           /* where that range now starts */
        struct prio_tree_node prio_tree_node;
};
For each mremap that expands the area and moves the page tables, allocate
a new anon_remap_address struct and add it to the remap_address tree.
The page->index never changes. Take the page->index and walk the
remap_address tree to find all remapped addresses. Once the list of
all remapped addresses is found, it's easy to find the interesting
vmas (again using a different prio_tree). Finding all remapped addresses
may involve recursion, which is bad.
Rajesh
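To illustrate the lookup (and the recursion concern) just described, a
rough sketch; the remap_address field and the prio_tree iteration helper
used here are hypothetical:

        /* Sketch: resolve the current offsets of a page whose range was
         * moved by mremap(), using the remap_address tree described above.
         * for_each_remap_covering() is a hypothetical iterator over the
         * prio_tree entries whose old range covers 'index'. */
        static void sketch_collect_remapped(struct address_space *as,
                                            unsigned long index,
                                            void (*visit)(unsigned long))
        {
                struct anon_remap_address *r;

                visit(index);   /* the original offset itself */
                for_each_remap_covering(r, &as->remap_address, index) {
                        unsigned long new_index = r->new_page_index +
                                (index - r->old_page_index_start);
                        /* a remapped range may itself have been remapped */
                        sketch_collect_remapped(as, new_index, visit);
                }
        }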
On Fri, Mar 12, 2004 at 08:17:49AM -0800, Linus Torvalds wrote:
> I have to _violently_ agree with Andrea on this one.
> The absolute _LAST_ thing we want to have is a "remnant" rmap
> infrastructure that only gets very occasional use. That's a GUARANTEED way
> to get bugs, and really subtle behaviour.
> I think Andrea is 100% right. Either do rmap for everything (like we do
> now, modulo IO/mlock), or do it for _nothing_. No half measures with
> "most of the time".
> Quite frankly, the stuff I've seen suggested sounds absolutely _horrible_.
> Special cases are not just a pain to work with, they definitely will cause
> bugs. It's not a matter of "if", it's a matter of "when".
> So let's make it clear: if we have an object-based reverse mapping, it
> should cover all reasonable cases, and in particular, it should NOT have
> rare fallbacks to code that thus never gets any real testing.
> And if we have per-page rmap like now, it should _always_ be there.
> You do have to realize that maintainability is a HELL of a lot more
> important than scalability or performance can be. Please keep that in
> mind.
The sole point I had to make was against a performance/resource scalability
argument; the soft issues weren't part of that, though they may ultimately
be the deciding factor.
-- wli
On Fri, 12 Mar 2004, Linus Torvalds wrote:
> So let's make it clear: if we have an object-based reverse mapping, it
> should cover all reasonable cases, and in particular, it should NOT have
> rare fallbacks to code that thus never gets any real testing.
Absolutely agreed. And with Rajesh's code it should be possible
to get object-based rmap right, not vulnerable to the scalability
issues demonstrated by Ingo's test programs.
Whether we go with mm-based or vma-based, I don't particularly
care either. As long as the code is nice...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Ok, guys,
how about this anon-page suggestion?
I'm a bit nervous about the complexity issues in Andrea's current setup,
so I've been thinking about Rik's per-mm thing. And I think that there is
one very simple approach, which should work fine, and should have minimal
impact on the existing setup exactly because it is so simple.
Basic setup:
- each anonymous page is associated with exactly _one_ virtual address,
in a "anon memory group".
We put the virtual address (shifted down by PAGE_SHIFT) into
"page->index". We put the "anon memory group" pointer into
"page->mapping". We have a PAGE_ANONYMOUS flag to tell the
rest of the world about this.
- the anon memory group has a list of all mm's that it is associated
with.
- an "execve()" creates a new "anon memory group" and drops the old one.
- a mm copy operation just increments the reference count and adds the
new mm to the mm list for that anon memory group.
So now to do reverse mapping, we can take a page, and do
if (PageAnonymous(page)) {
struct anongroup *mmlist = (struct anongroup *)page->mapping;
unsigned long address = page->index << PAGE_SHIFT;
struct mm_struct *mm;
for_each_entry(mm, mmlist->anon_mms, anon_mm) {
.. look up page in page tables in "mm, address" ..
.. most of the time we may not even need to look ..
.. up the "vma" at all, just walk the page tables ..
}
} else {
/* Shared page */
.. look up page using the inode vma list ..
}
The above all works 99% of the time.
The only problem is mremap() after a fork(), and hell, we know that's a
special case anyway, and let's just add a few lines to copy_one_pte(),
which basically does:
if (PageAnonymous(page) && page->count > 1) {
newpage = alloc_page();
copy_page(page, newpage);
page = newpage;
}
/* Move the page to the new address */
page->index = address >> PAGE_SHIFT;
and now we have zero special cases.
The above should work very well. In most cases the "anongroup" will be
very small, and even when it's large (if somebody does a ton of forks
without any execve's), we only have _one_ address to check, and that is
pretty fast. A high-performance server would use threads, anyway. (And
quite frankly, _any_ algorithm will have this issue. Even rmap will have
exactly the same loop, although rmap skips any vm's where the page might
have been COW'ed or removed).
The extra COW in mremap() seems benign. Again, it should usually not even
trigger.
What do you think? To me, this seems to be a really simple approach..
Linus
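For reference, a minimal sketch of what the "anon memory group" referenced
as mmlist->anon_mms in the pseudocode above might look like; the structure,
fields and helper are hypothetical, not taken from any patch:

        /* Sketch only: hypothetical names, simplified locking. */
        struct anongroup {
                atomic_t count;                 /* one reference per mm */
                spinlock_t lock;
                struct list_head anon_mms;      /* every mm_struct sharing it */
        };

        /* fork(): the child joins the parent's group */
        static void anongroup_dup(struct mm_struct *child,
                                  struct mm_struct *parent)
        {
                struct anongroup *grp = parent->anongroup;  /* hypothetical field */

                atomic_inc(&grp->count);
                spin_lock(&grp->lock);
                list_add_tail(&child->anon_mm, &grp->anon_mms); /* hypothetical */
                spin_unlock(&grp->lock);
                child->anongroup = grp;
        }
        /* execve() would instead drop the old group and allocate a fresh,
         * empty one. */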
On Sat, 13 Mar 2004, Linus Torvalds wrote:
>
> Ok, guys,
> how about this anon-page suggestion?
What you describe is pretty much exactly what my anobjrmap patch
from a year ago did. I'm currently looking through that again
to bring it up to date.
> I'm a bit nervous about the complexity issues in Andrea's current setup,
> so I've been thinking about Rik's per-mm thing. And I think that there is
> one very simple approach, which should work fine, and should have minimal
> impact on the existing setup exactly because it is so simple.
>
> Basic setup:
> - each anonymous page is associated with exactly _one_ virtual address,
> in a "anon memory group".
>
> We put the virtual address (shifted down by PAGE_SHIFT) into
> "page->index". We put the "anon memory group" pointer into
> "page->mapping". We have a PAGE_ANONYMOUS flag to tell the
> rest of the world about this.
It's a bit more complicated because page->mapping currently contains
&swapper_space if PageSwapCache(page) - indeed, at present that's
exactly what PageSwapCache(page) tests. So I reintroduced a
PageSwapCache(page) flagbit, avoid the very few places where mapping
pointing to swapper_space was actually useful, and use page->private
instead of page->index for the swp_entry_t.
(Andrew did point out that we could reduce the scale of the mods by
reusing page->list fields instead of mapping/index; but mapping/index
are the natural fields to use, and Andrew now has other changes in
-mm which remove page->list: so the original choice looks right again.)
> for_each_entry(mm, mmlist->anon_mms, anon_mm) {
> .. look up page in page tables in "mm, address" ..
> .. most of the time we may not even need to look ..
> .. up the "vma" at all, just walk the page tables ..
> }
I believe page_referenced() can just walk the page tables,
but try_to_unmap() needs vma to check VM_LOCKED (we're thinking
of other ways to avoid that, but they needn't get mixed into this)
and for flushing cache and tlb (perhaps avoidable on some arches?
I've not checked, and again that would be an optimization to
consider later, not mix in at this stage).
> The only problem is mremap() after a fork(), and hell, we know that's a
> special case anyway, and let's just add a few lines to copy_one_pte(),
> which basically does:
>
> if (PageAnonymous(page) && page->count > 1) {
> newpage = alloc_page();
> copy_page(page, newpage);
> page = newpage;
> }
> /* Move the page to the new address */
> page->index = address >> PAGE_SHIFT;
>
> and now we have zero special cases.
That's always been a fallback solution, I was just a little too ashamed
to propose it originally - seems a little wrong to waste whole pages
rather than wasting a few bytes of data structure trying to track them:
though the pages are pageable unlike any data structure we come up with.
I think we have page_table_lock in copy_one_pte, so won't want to do
it quite like that. It won't matter at all if pages are transiently
untrackable. Might want to do something like make_pages_present
afterwards (but it should only be COWing instantiated pages; and
does need to COW pages currently on swap too).
There's probably an issue with Alan's strict commit memory accounting,
if the mapping is readonly; but so long as we get that counting right,
I don't think it's really going to matter at all if we sometimes fail
an mremap for that reason - but probably need to avoid mistaking the
common case (mremap of own area) for the rare case which needs this
copying (mremap of inherited area).
Hugh
On Sat, 13 Mar 2004, Hugh Dickins wrote:
> On Sat, 13 Mar 2004, Linus Torvalds wrote:
> > if (PageAnonymous(page) && page->count > 1) {
> > newpage = alloc_page();
> > copy_page(page, newpage);
> > page = newpage;
> > }
> > /* Move the page to the new address */
> > page->index = address >> PAGE_SHIFT;
> >
> > and now we have zero special cases.
>
> That's always been a fallback solution, I was just a little too ashamed
> to propose it originally - seems a little wrong to waste whole pages
> rather than wasting a few bytes of data structure trying to track them:
> though the pages are pageable unlike any data structure we come up with.
No, Linus is right.
If a child process uses mremap(), it stands to reason that
it's about to use those pages for something.
Think of it as taking the COW faults early, because chances
are you'd be taking them anyway, just a little bit later...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Sat, Mar 13, 2004 at 08:18:48AM -0800, Linus Torvalds wrote:
>
>
> Ok, guys,
> how about this anon-page suggestion?
>
> I'm a bit nervous about the complexity issues in Andrea's current setup,
> so I've been thinking about Rik's per-mm thing. And I think that there is
> one very simple approach, which should work fine, and should have minimal
> impact on the existing setup exactly because it is so simple.
>
> Basic setup:
> - each anonymous page is associated with exactly _one_ virtual address,
> in a "anon memory group".
>
> We put the virtual address (shifted down by PAGE_SHIFT) into
> "page->index". We put the "anon memory group" pointer into
> "page->mapping". We have a PAGE_ANONYMOUS flag to tell the
> rest of the world about this.
>
> - the anon memory group has a list of all mm's that it is associated
> with.
>
> - an "execve()" creates a new "anon memory group" and drops the old one.
>
> - a mm copy operation just increments the reference count and adds the
> new mm to the mm list for that anon memory group.
This is the anonmm from Hugh.
>
> So now to do reverse mapping, we can take a page, and do
>
> if (PageAnonymous(page)) {
> struct anongroup *mmlist = (struct anongroup *)page->mapping;
> unsigned long address = page->index << PAGE_SHIFT;
> struct mm_struct *mm;
>
> for_each_entry(mm, mmlist->anon_mms, anon_mm) {
> .. look up page in page tables in "mm, address" ..
> .. most of the time we may not even need to look ..
> .. up the "vma" at all, just walk the page tables ..
> }
> } else {
> /* Shared page */
> .. look up page using the inode vma list ..
> }
>
> The above all works 99% of the time.
this is again exactly the anonmm from Hugh.
BTW, (for completeness) I was thinking last night that the anonmm could
handle mremap correctly too in theory, without changes like the below
one, if it would walk the whole list of vmas reachable from the mm->mmap
for every mm in the anonmm (your anongroup; Hugh called it struct anonmm
instead of struct anongroup). Problem is that checking all the vmas is
expensive and a single find_vma is a lot faster, but find_vma has no
way to take vm_pgoff into the equation and in turn it breaks with
mremap.
> The only problem is mremap() after a fork(), and hell, we know that's a
> special case anyway, and let's just add a few lines to copy_one_pte(),
> which basically does:
>
> if (PageAnonymous(page) && page->count > 1) {
> newpage = alloc_page();
> copy_page(page, newpage);
> page = newpage;
> }
> /* Move the page to the new address */
> page->index = address >> PAGE_SHIFT;
>
> and now we have zero special cases.
you're basically saying here that you agree with Hugh that anonmm is the
way to go, and you're providing one of the possible ways to handle
mremap correctly with anonmm (without using pte_chains). I also provided
above another alternative way to handle mremap correctly with anonmm
(that is, to inefficiently walk all the mm->mmap and to try unmapping
from all vmas with vma->vm_file == NULL).
What I called anon_vma_global in an older email is the more efficient
version of checking all the vmas in the mm->mmap: a prio_tree could
index all the anon vmas in each mm, taking vm_pgoff into
consideration, unlike find_vma(page->index). That still takes memory for
each vma though, and it also still forces checking all unrelated mm
address spaces too (see later in the email for details on this).
But returning to your proposed solution to the mremap problem with
the anonmm design, that will certainly work: rather than trying to
handle that case correctly we just make it impossible for that
condition to happen. I don't much like unsharing pages, but it may
save more memory than it actually wastes. Problem is that it depends
on the workload.
The remaining downside of all the global anonmm designs vs my finegrained
anon_vma design is that if you execute a malloc in a child (that will
be direct memory with page->count == 1), you'll still have to try all
the mms in the anongroup (which can be on the order of thousands),
while the anon_vma design would immediately reach only the right vma in
the right mm and it would not try the wrong vmas in the other mms (i.e.
no find_vma). That isn't fixable with the anonmm design.
I think the only important thing is to avoid the _per-page_ overhead of the
pte_chains; a _per-vma_ 12 byte cost for the anon_vma doesn't sound like
an issue to me if it can save significant cpu in a setup with thousands
of tasks, each one executing a malloc. A single vma can cover
plenty of memory.
Note that even the i_mmap{,shared} methods (even with a prio_tree!) may
actually check vmas (and in turn mm_structs too) where the page has been
substituted with an anonymous copy during a COW fault, if the vma has
been mapped with MAP_PRIVATE. We cannot avoid checking unrelated
mm_structs with MAP_PRIVATE usages (since the only place where we have
that information is the pte itself, so by the time we find the answer
it's too late to avoid asking the question), but I can avoid that for
the anonymous memory with my anon_vma design. And my anon_vma gets
mremap right too, without the need for prio trees like the anon_vma_global
design I proposed requires, and while still allowing sharing of pages
through mremap.
The downsides of anon_vma vs anonmm+linus-unshare-during-mremap are
that anon_vma requires a 12 byte object per anonymous vma, plus
12 bytes per vma for the anon_vma_node list_head and the
anon_vma pointer. So it's a worst case 24 byte overhead per anonymous
vma (on average it will be slightly less since the anon_vmas can be
shared). Secondly, anon_vma forbids merging of vmas with a different
anon_vma or with a different vm_pgoff, though for all appends there will
be no problem at all: appends with mmap are guaranteed to work. A
munmap+mmap gap creation and gap fill is also guaranteed to work (since
split_vma will make both the prev and next vma share the same anon_vma).
The advantage of anon_vma is that it tracks all vmas in the most
finegrained way possible, keeping the unmapping code from walking mms
that for sure have nothing to do with the page that we want to unmap,
plus it handles mremap (allowing sharing and avoiding copies). It avoids
the find_vma cost too.
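As an illustration of the mremap point, a sketch of how a moved vma could
keep page->index valid purely by adjusting vm_pgoff; the helper and its
exact placement are hypothetical:

        /* Sketch: with page->index stored as an offset, an mremap() move
         * only needs to give the relocated vma a matching vm_pgoff; the
         * pages (and their page->index) are untouched. */
        static void sketch_move_vma_pgoff(struct vm_area_struct *vma,
                                          unsigned long old_addr,
                                          unsigned long new_addr,
                                          struct vm_area_struct *new_vma)
        {
                /* offset of the moved range inside the old vma, in pages */
                unsigned long shift = (old_addr - vma->vm_start) >> PAGE_SHIFT;

                new_vma->vm_pgoff = vma->vm_pgoff + shift;
                /* a page's virtual address is then still recoverable as
                 *   new_vma->vm_start +
                 *       ((page->index - new_vma->vm_pgoff) << PAGE_SHIFT)
                 * so the anon_vma walk keeps working after the move and the
                 * pages stay shared instead of being copied. */
        }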
I'm not sure if the pros and cons are worth the additional 24 bytes per
anonymous vma; the complexity doesn't worry me though. Also, when the
cost is truly 24 bytes we'll have the biggest advantage: if the
advantage is low it means the cost is less than 24 bytes, since the
anon_vma is shared.
> What do you think? To me, this seems to be a really simple approach..
I certainly agree it's simpler. I'm quite undecided whether to give up on
the anon_vma and to use anonmm plus your unshare during mremap at the
moment; while it's simpler it's also a definitely inferior solution,
since it uses the mremap hack to work safely and it will check all mms
in the group with find_pte no matter whether they're worth checking, but
at the same time if one never swaps and never uses mremap it will
save some memory compared to the anon_vma overhead (and it will also be
non-exploitable without the need of a prio_tree).
With anon_vma and without a prio_tree on top of it, one could try
executing a flood of vma_splits, and that could cause memory waste during
swapping, but all real applications would definitely swap better with
anon_vma than with anonmm.
I mean, I would expect the pte_chain advocates to agree anon_vma is a lot
better than anonmm: they were going to throw 8 bytes per pte at saving cpu
during swapping, while now I throw only 24 bytes per vma at the problem
(with each vma still being extendable with merging) and I still provide
optimal swapping with minimal complexity, so they should like the
finegrained way more than unsharing with mremap and not scaling during
swapping by checking all unrelated mms too. anon_vma basically sits in
between anonmm and pte_chains. It was more than enough for me to save all
the memory wasted in the pte_chains on the 64bit archs with huge
anonymous vma blocks, but I didn't want to give up the swap scalability
either with many processes (with i_mmap{,shared} we already have enough
trouble with scalability during swapping; I didn't want to
think about those issues with the anonymous memory too, with some
thousands of tasks like it will run in practice). If I go straight ahead
with anon_vma I'm basically guaranteed that I can forget about the anonymous
vma swapping and that all real life apps will scale _as_well_ as with the
pte_chains, and I'm guaranteed not to run into issues with mremap
(though I don't expect trouble there).
On Sat, 13 Mar 2004, Rik van Riel wrote:
>
> No, Linus is right.
>
> If a child process uses mremap(), it stands to reason that
> it's about to use those pages for something.
>
> Think of it as taking the COW faults early, because chances
> are you'd be taking them anyway, just a little bit later...
Makes perfect sense in the read-write case. The read-only
case is less satisfactory, but those will be even rarer.
Hugh
On Sat, Mar 13, 2004 at 05:24:12PM +0000, Hugh Dickins wrote:
> On Sat, 13 Mar 2004, Linus Torvalds wrote:
> >
> > Ok, guys,
> > how about this anon-page suggestion?
>
> What you describe is pretty much exactly what my anobjrmap patch
> from a year ago did. I'm currently looking through that again
it is. Linus simply provided a solution to the mremap issue, that is to
make it impossible to share anonymous pages through an mremap; that
indeed solves the problem, at some cpu and memory cost after an mremap.
I realized you could also solve it by walking the whole list of vmas in
every mm->mmap list, but that complexity would be way too high.
> > The only problem is mremap() after a fork(), and hell, we know that's a
> > special case anyway, and let's just add a few lines to copy_one_pte(),
> > which basically does:
> >
> > if (PageAnonymous(page) && page->count > 1) {
> > newpage = alloc_page();
> > copy_page(page, newpage);
> > page = newpage;
> > }
> > /* Move the page to the new address */
> > page->index = address >> PAGE_SHIFT;
> >
> > and now we have zero special cases.
>
> That's always been a fallback solution, I was just a little too ashamed
> to propose it originally - seems a little wrong to waste whole pages
> rather than wasting a few bytes of data structure trying to track them:
> though the pages are pageable unlike any data structure we come up with.
>
> I think we have page_table_lock in copy_one_pte, so won't want to do
> it quite like that. It won't matter at all if pages are transiently
> untrackable. Might want to do something like make_pages_present
> afterwards (but it should only be COWing instantiated pages; and
> does need to COW pages currently on swap too).
>
> There's probably an issue with Alan's strict commit memory accounting,
> if the mapping is readonly; but so long as we get that counting right,
> I don't think it's really going to matter at all if we sometimes fail
> an mremap for that reason - but probably need to avoid mistaking the
> common case (mremap of own area) for the rare case which needs this
> copying (mremap of inherited area).
It still looks like quite a hack to me, though I must agree that in a
desktop scenario with swapoff -a it will save around 24 bytes per
anonymous vma and 12 bytes per file vma, plus it doesn't restrict vma
merging in any way compared to my anon_vma, and it saves me from worrying
about people doing a flood of vma_splits that would generate a long list
of vmas for every anon_vma.
I still feel anon_vma is preferable to
anonmm+linus-unshare-mremap if one needs to swap, and while the
prio_tree on i_mmap{,shared} in practice is needed only for 32bit apps, I
know of apps with hundreds of processes each allocating huge chunks of
direct anon memory and swapping a lot at the same time.
On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
>
> I certainly agree it's simpler. I'm quite undecided whether to give up on
> the anon_vma and to use anonmm plus your unshare during mremap at the
> moment; while it's simpler it's also a definitely inferior solution
I think you should persist with anon_vma and I should resurrect
anonmm, and let others decide between those two and pte_chains.
But while in this trial phase, can we both do it in such a way as to
avoid too much trivial change all over the tree? For example, I'm
thinking I need to junk my irrelevant renaming of put_dirty_page to
put_stack_page, and for the moment it would help if you cut out your
mapping -> as.mapping changes (when I came to build yours, I had to
go through various filesystems I had in my config updating them
accordingly). It's a correct change (which I was too lazy to do,
used evil casting instead) but better left as a tidyup for later?
Hugh
On Sat, Mar 13, 2004 at 12:28:31PM -0500, Rik van Riel wrote:
> On Sat, 13 Mar 2004, Hugh Dickins wrote:
> > On Sat, 13 Mar 2004, Linus Torvalds wrote:
>
> > > if (PageAnonymous(page) && page->count > 1) {
> > > newpage = alloc_page();
> > > copy_page(page, newpage);
> > > page = newpage;
> > > }
> > > /* Move the page to the new address */
> > > page->index = address >> PAGE_SHIFT;
> > >
> > > and now we have zero special cases.
> >
> > That's always been a fallback solution, I was just a little too ashamed
> > to propose it originally - seems a little wrong to waste whole pages
> > rather than wasting a few bytes of data structure trying to track them:
> > though the pages are pageable unlike any data structure we come up with.
>
> No, Linus is right.
>
> If a child process uses mremap(), it stands to reason that
> it's about to use those pages for something.
>
> Think of it as taking the COW faults early, because chances
> are you'd be taking them anyway, just a little bit later...
using mremap to _move_ anonymous maps is simply not frequent. It's so
infrequent that it's hard to tell whether the child is going to _read_ or
to _write_. "Using those pages" means nothing; all that matters is whether
it will use those pages for reading or for writing, and I don't see how
you can assume it's going to write to them and how you can assume this is
an early-COW in the common case.
The only interesting point to me is that it's not frequent; with that I
certainly agree, but I don't see this as an early-COW.
What worries me most are things like KDE: they used the library design
with the sole objective of sharing readonly anonymous pages, which is very
smart since it still prevents one bug in one app from taking down the
whole GUI, but if they happen to use mremap to move those readonly pages
around after the for we'll screw them completely. I've no indication that
this may be the case or that they ever call mremap, but I cannot tell the
opposite either.
> The only problem is mremap() after a fork(), and hell, we know that's a
> special case anyway, and let's just add a few lines to copy_one_pte(),
> which basically does:
>
> if (PageAnonymous(page) && page->count > 1) {
> newpage = alloc_page();
> copy_page(page, newpage);
> page = newpage;
> }
> /* Move the page to the new address */
> page->index = address >> PAGE_SHIFT;
>
> and now we have zero special cases.
This part makes the problem so simple. If this is acceptable, then we
have many choices. Since we won't have many mms in the anonmm list,
I don't think we will have any search complexity problems. If we really
worry again about search complexity, we can consider using prio_tree
(adds 16 bytes per vma - we cannot share vma.shared.prio_tree_node).
The prio_tree easily fits for anonmm after linus-mremap-simplification.
Rajesh
On Sat, Mar 13, 2004 at 06:54:06PM +0100, Andrea Arcangeli wrote:
> after the for we'll screw them completely. I've no indication that this
^k
On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
> The remaining downside of all the global anonmm designs vs my finegrained
> anon_vma design is that if you execute a malloc in a child (that will
> be direct memory with page->count == 1), you'll still have to try all
> the mms in the anongroup (which can be on the order of thousands),
That's ok, you have a similar issue with very commonly
mmap()d files, where some pages haven't been faulted in
by most processes, or have been replaced by private pages
after a COW fault due to MAP_PRIVATE mapping.
You just increase the number of pages for which this
search is done, but I suspect that shouldn't be a big
worry...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Sat, Mar 13, 2004 at 05:41:37PM +0000, Hugh Dickins wrote:
> On Sat, 13 Mar 2004, Rik van Riel wrote:
> >
> > No, Linus is right.
> >
> > If a child process uses mremap(), it stands to reason that
> > it's about to use those pages for something.
> >
> > Think of it as taking the COW faults early, because chances
> > are you'd be taking them anyway, just a little bit later...
>
> Makes perfect sense in the read-write case. The read-only
> case is less satisfactory, but those will be even rarer.
overall it's not obvious to me that those will be even rarer. See the
last email about KDE-like usages that share data thread-like but with
memory protection; those won't write to the data. I mean, it may be the
way to go, but I think we should first get some OK from the major Linux
projects that we're not going to invalidate their smart optimizations,
and we should get this "misfeature" documented somehow.
I have to admit the simplicity is appealing, but besides its
coding simplicity, in practice I believe the only other appealing thing
will be the fact that it's not exploitable by people doing a flood of
vma_splits; to solve that with anon_vma I'd need a prio_tree on top of
every anon_vma, which means even more memory wasted both in the anon_vma
and the vma, though practically a prio_tree there wouldn't be necessary.
The anonmm solves the complexity issue using find_vma, sharing the rbtree
which already works; that's probably the part of anonmm I find most
appealing. One can still exploit the complexity with anonmm too, but not
from the same address space, so it's easier to limit with ulimit -u. I'm
really not sure what's best, which is not good since I hoped to get the
anon_vma implementation working on Monday evening (heck, it was already
swapping my test app fine despite the huge vma_split/PageDirect bug that
you noticed, which probably caused `ps` to oops; I bet `ps` is doing a
vma_split ;) but now I've gone back to wondering about the design issues
instead.
On Sat, Mar 13, 2004 at 05:53:36PM +0000, Hugh Dickins wrote:
> On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
> >
> > I certainly agree it's simpler. I'm quite undecided whether to give up on
> > the anon_vma and to use anonmm plus your unshare during mremap at the
> > moment; while it's simpler it's also a definitely inferior solution
>
> I think you should persist with anon_vma and I should resurrect
> anonmm, and let others decide between those two and pte_chains.
>
> But while in this trial phase, can we both do it in such a way as to
> avoid too much trivial change all over the tree? For example, I'm
> thinking I need to junk my irrelevant renaming of put_dirty_page to
> put_stack_page, and for the moment it would help if you cut out your
> mapping -> as.mapping changes (when I came to build yours, I had to
> go through various filesystems I had in my config updating them
> accordingly). It's a correct change (which I was too lazy to do,
> used evil casting instead) but better left as a tidyup for later?
yes, we should split this into two patches; one is the "preparation" for a
reused page->as.mapping. You know I did it differently, to retain
swapper_space and to avoid hooking explicit "if (PageSwapCache)" checks
into things like sync_page.
About using the union, I still prefer it. I've seen that Linus used an
explicit cast in the pseudocode too, but I don't feel safe with
explicit casts; I prefer more breakage to risking forgetting to
convert some page->mapping into page_mapping, or similar issues with
the casts ;)
I'll get back to working on this after the weekend. You can find my latest
status on the ftp site; if you extract any interesting "common" bit from
there just send it to me too. Thanks.
On Sat, Mar 13, 2004 at 12:55:09PM -0500, Rajesh Venkatasubramanian wrote:
>
> > The only problem is mremap() after a fork(), and hell, we know that's a
> > special case anyway, and let's just add a few lines to copy_one_pte(),
> > which basically does:
> >
> > if (PageAnonymous(page) && page->count > 1) {
> > newpage = alloc_page();
> > copy_page(page, newpage);
> > page = newpage;
> > }
> > /* Move the page to the new address */
> > page->index = address >> PAGE_SHIFT;
> >
> > and now we have zero special cases.
>
> This part makes the problem so simple. If this is acceptable, then we
> have many choices. Since we won't have many mms in the anonmm list,
> I don't think we will have any search complexity problems. If we really
> worry again about search complexity, we can consider using prio_tree
> (adds 16 bytes per vma - we cannot share vma.shared.prio_tree_node).
> The prio_tree easily fits for anonmm after linus-mremap-simplification.
prio_tree with linus-mremap-simplification makes no sense to me. You
cannot avoid checking all the mms with the prio_tree, and that is the only
complexity issue introduced by anonmm vs anon_vma.
A prio_tree can only sit on top of anon_vma, not on top of
anonmm+linus-unshare-mremap (and yes, I cannot share
vma.shared.prio_tree_node), but practically it's not needed for the
anon_vmas.
On Sat, 13 Mar 2004, Rik van Riel wrote:
>
> No, Linus is right.
>
> If a child process uses mremap(), it stands to reason that
> it's about to use those pages for something.
That's not necessarily true, since it's entirely possible that it's just a
realloc(), and the old part of the allocation would have been left alone.
That said, I suspect that
- mremap() isn't all _that_ common in the first place
- it's even more rare to do a fork() and then a mremap() (ie most of the
time I suspect the page count will be 1, and no COW is necessary). Most
apps tend to exec() after a fork.
- I agree that in at least part of the remaining cases we _would_ COW the
pages anyway.
I suspect that the only common "no execve after fork" usage is for a few
servers, especially the traditional UNIX kind (i.e. using processes as
fairly heavy-weight threads). It could be interesting to see numbers.
But basically I'm inclined to believe that the "unnecessary COW" case is
_so_ rare, that if it allows us to make other things simpler (and thus
more stable and likely faster) it is worth it. Especially the simplicity
just appeals to me.
I just think that if mremap() causes so many problems for reverse mapping,
we should make _that_ the expensive operation, instead of making
everything else more complicated. After all, if it turns out that the
"early COW" behaviour I suggest can be a performance problem for some
(rare) circumstances, then the fix for that is likely to just let
applications know that mremap() can be expensive.
(It's still likely to be a lot cheaper than actually doing a new
mmap+memcpy+munmap, so it's not like mremap would become pointless).
Linus
On Fri, 12 Mar 2004, Linus Torvalds wrote:
>
> The absolute _LAST_ thing we want to have is a "remnant" rmap
> infrastructure that only gets very occasional use. That's a GUARANTEED way
> to get bugs, and really subtle behaviour.
On Sat, 13 Mar 2004, Linus Torvalds wrote:
>
> I just think that if mremap() causes so many problems for reverse mapping,
> we should make _that_ the expensive operation, instead of making
> everything else more complicated.
Friday's Linus has a good point, but I agree more with Saturday's:
mremap MAYMOVE is a very special case, and I believe it would hurt
the whole to put it at the centre of the design. But all power to
Andrea to achieve that.
Hugh
On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
>
> yes, we should split this into two patches; one is the "preparation" for a
> reused page->as.mapping. You know I did it differently, to retain
> swapper_space and to avoid hooking explicit "if (PageSwapCache)" checks
> into things like sync_page.
>
> About using the union, I still prefer it. I've seen that Linus used an
> explicit cast in the pseudocode too, but I don't feel safe with
> explicit casts; I prefer more breakage to risking forgetting to
> convert some page->mapping into page_mapping, or similar issues with
> the casts ;)
Your union is right, and my casting lazy, no question of that.
It's just that we'd need to do a whole lot of cosmetic edits
to get fully building trees, distracting from the guts of it.
In my case, anyway, the number of places that actually use the
casting are very few (just rmap.c?), suspect it's same for you.
I'm certainly not arguing against sanity checks where needed,
just against treewide edits (or broken builds) for now.
> I'll get back to working on this after the weekend. You can find my latest
> status on the ftp site; if you extract any interesting "common" bit from
> there just send it to me too. Thanks.
Thanks a lot. I don't imagine you've done the nonlinear vma case
yet, but when you or Rajesh do, please may I just steal it, okay?
Hugh
> prio_tree can only sit on top of anon_vma, not on top of
> anonmm+linus-unshare-mremap (and yes, I cannot share
> vma.shared.prio_tree_node), but practically it's not needed for the
> anon_vmas.
Agreed. prio_tree is only useful for anon_vma. But, after
linus-unshare-mremap, the anon_vma patch can be modified
(simplified ?) a lot. You don't need any as.anon_vma, as.vma
pointers in the page struct. You just need the already existing
page->mapping and page->index, and a prio_tree of all anon vmas.
The prio_tree can be used to get to the "interesting vmas" without
walking all mms. However, the new prio_tree node adds 16 bytes
per vma. Considering there may not be much sharing of anon vmas
in the common case, I am not sure whether that is worthwhile. Maybe
we can wait for someone to write a program that locks the machine :)
Rajesh
On Sat, Mar 13, 2004 at 02:40:09PM -0500, Rajesh Venkatasubramanian wrote:
> Agreed. prio_tree is only useful for anon_vma. But, after
> linus-unshare-mremap, the anon_vma patch can be modified
> (simplified ?) a lot. You don't need any as.anon_vma, as.vma
> pointers in the page struct. You just need the already existing
> page->mapping and page->index, and a prio_tree of all anon vmas.
what you are missing is that we don't need a prio_tree at all with
anonmm+linus-unshare-mremap; a prio_tree can make sense only with
anon_vma, not with anonmm. The vm_pgoff is meaningless with anonmm.
find_vma (and the rbtree) already does the trick with anonmm: the
linus-unshare-mremap guarantees that a certain physical page will be
only at a certain virtual address in every mm, so prio_tree taking pgoff
into account isn't needed there, find_vma is more than enough.
No prio_tree can fix anyway the problem that anonmm will force
the vm to scan all mms at the page->index address, even for a newly
allocated malloc region. That is optimized away by anon_vma; plus
anon_vma avoids the early-COW in mremap. The relevant downside of
anon_vma is that it takes a few more bytes in the vma to provide those
features.
On Sun, 14 Mar 2004, Andrea Arcangeli wrote:
>
> linus-unshare-mremap guarantees that a certain physical page will be
> only at a certain virtual address in every mm, so prio_tree taking pgoff
> into account isn't needed there, find_vma is more than enough.
Yes. However, I'd at least personally hope that we don't even need the
find_vma() all the time.
When removing a page using the reverse mapping, there really is very
little reason to even look up the vma, although right now the
"flush_tlb_page()" interface is done for vma only so we'd need to change
that or at least add a "flush_tlb_page_mm(mm, virt)" flusher (and if any
architecture wants to look up the vma, they could do so).
It would be silly to look up the vma if we don't actually need it, and I
don't think we do. It's likely faster to just look up the page tables
directly than to even worry about anything else.
But find_vma() certainly would be sufficient.
Linus
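What "look up the page tables directly" amounts to on the three-level 2.6
layout, in the style of follow_page(); the helper name is made up,
flush_tlb_page_mm() is only the hypothetical flusher mentioned above, and
mm->page_table_lock is assumed held:

#include <linux/mm.h>
#include <asm/pgtable.h>

/*
 * Find the pte mapping "address" in "mm" without touching any vma.
 * Returns a mapped pte (caller must pte_unmap()) or NULL.
 */
static pte_t *lookup_pte(struct mm_struct *mm, unsigned long address)
{
	pgd_t *pgd = pgd_offset(mm, address);
	pmd_t *pmd;

	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return NULL;
	pmd = pmd_offset(pgd, address);
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		return NULL;
	return pte_offset_map(pmd, address);
}

After clearing the pte, something like flush_tlb_page_mm(mm, address) would
flush without a vma; the existing flush_tlb_page(vma, address) signature is
the only reason the vma is still needed at that point.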
On Sat, Mar 13, 2004 at 04:52:00PM -0800, Linus Torvalds wrote:
> Yes. However, I'd at least personally hope that we don't even need the
> find_vma() all the time.
> When removing a page using the reverse mapping, there really is very
> little reason to even look up the vma, although right now the
> "flush_tlb_page()" interface is done for vma only so we'd need to change
> that or at least add a "flush_tlb_page_mm(mm, virt)" flusher (and if any
> architecture wants to look up the vma, they could do so).
> It would be silly to look up the vma if we don't actually need it, and I
> don't think we do. It's likely faster to just look up the page tables
> directly than to even worry about anything else.
> But find_vma() certainly would be sufficient.
find_vma() is often necessary to determine whether the page is mlock()'d.
In schemes where mm's that may not map the page appear in searches, it
may also be necessary to determine if there's even a vma covering the
area at all or otherwise a normal vma, since pagetables outside normal
vmas may very well not be understood by the core (e.g. hugetlb).
-- wli
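Roughly the checks wli means, as an unmap path doing per-mm searches would
have to perform them; the helper name is made up, and the flag set is only
illustrative (VM_LOCKED covers the mlock() case, VM_RESERVED/VM_IO the "not a
normal vma" case):

#include <linux/mm.h>

/*
 * Before unmapping the page at "address" in "mm", make sure a normal,
 * non-mlock()ed vma actually covers it; hugetlb, VM_RESERVED and VM_IO
 * areas have mappings the generic code must not touch.
 */
static int vma_ok_to_unmap(struct mm_struct *mm, unsigned long address)
{
	struct vm_area_struct *vma = find_vma(mm, address);

	if (!vma || address < vma->vm_start)
		return 0;	/* no vma covers this address at all */
	if (vma->vm_flags & (VM_LOCKED | VM_RESERVED | VM_IO))
		return 0;	/* mlock()ed or not a normal mapping */
	return 1;
}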
On Sat, 13 Mar 2004, William Lee Irwin III wrote:
> On Sat, Mar 13, 2004 at 04:52:00PM -0800, Linus Torvalds wrote:
> > Yes. However, I'd at least personally hope that we don't even need the
> > find_vma() all the time.
>
> find_vma() is often necessary to determine whether the page is mlock()'d.
Alternatively, the mlock()d pages shouldn't appear on the LRU
at all, reusing one of the variables inside page->lru as a
counter to keep track of exactly how many times this page is
mlock()d.
> In schemes where mm's that may not map the page appear in searches,
> it may also be necessary to determine if there's even a vma covering the
> area at all or otherwise a normal vma, since pagetables outside normal
> vmas may very well not be understood by the core (e.g. hugetlb).
If the page is a normal page on the LRU, I suspect we don't
need to find the VMA, with the exception of mlock()d pages...
Good thing Christoph was already looking at the mlock()d page
counter idea.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
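Purely to illustrate Rik's idea -- nothing like this exists in any tree: once
an mlock()ed page has been pulled off the LRU, page->lru is unused, so one of
its words can hold the lock count (the word would have to be zeroed when the
page first leaves the LRU):

#include <linux/mm.h>

/*
 * Hypothetical: count how many vmas have the page mlock()ed, stored in
 * the page->lru.next word while the page is off the LRU.  Whatever
 * lock serialises LRU movement for the page is assumed held.
 */
#define page_mlock_count(page)	(*(unsigned long *)&(page)->lru.next)

static void page_add_mlock(struct page *page)
{
	page_mlock_count(page)++;
}

/* returns 1 when the last mlock() goes away and the page can rejoin the LRU */
static int page_remove_mlock(struct page *page)
{
	return --page_mlock_count(page) == 0;
}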
On Sat, 13 Mar 2004, William Lee Irwin III wrote:
>
> find_vma() is often necessary to determine whether the page is mlock()'d.
> In schemes where mm's that may not map the page appear in searches, it
> may also be necessary to determine if there's even a vma covering the
> area at all or otherwise a normal vma, since pagetables outside normal
> vmas may very well not be understood by the core (e.g. hugetlb).
Both excellent points. I guess we'll need the extra few cache misses.
Dang.
Linus
On Sat, 13 Mar 2004, William Lee Irwin III wrote:
>> find_vma() is often necessary to determine whether the page is mlock()'d.
On Sat, Mar 13, 2004 at 08:07:52PM -0500, Rik van Riel wrote:
> Alternatively, the mlock()d pages shouldn't appear on the LRU
> at all, reusing one of the variables inside page->lru as a
> counter to keep track of exactly how many times this page is
> mlock()d.
That would be the rare case where it's not necessary. =)
On Sat, 13 Mar 2004, William Lee Irwin III wrote:
>> In schemes where mm's that may not map the page appear in searches,
>> it may also be necessary to determine if there's even a vma covering the
>> area at all or otherwise a normal vma, since pagetables outside normal
>> vmas may very well not be understood by the core (e.g. hugetlb).
On Sat, Mar 13, 2004 at 08:07:52PM -0500, Rik van Riel wrote:
> If the page is a normal page on the LRU, I suspect we don't
> need to find the VMA, with the exception of mlock()d pages...
> Good thing Christoph was already looking at the mlock()d page
> counter idea.
That's not quite where the issue happens. Suppose you have a COW
sharing group (called variously struct anonmm, struct anon, and so on
by various codebases) where a page you're trying to unmap occurs at
some virtual address in several of them, but others may have hugetlb vmas
covering the address where that page would otherwise be expected. On i386 and potentially
others, the core may not understand present pmd's that are not mere
pointers to ptes and other machine-dependent hugetlb constructs, so
there is trouble. Searching the COW sharing group isn't how everything
works, but in those cases where additionally you can find mm's that
don't map the page at that virtual address and may have different vmas
cover it, this can arise.
-- wli
On Sat, 13 Mar 2004, William Lee Irwin III wrote:
> [hugetlb at same address]
Well, we can find this merely by looking at the page tables
themselves, so that shouldn't be a problem.
> Searching the COW sharing group isn't how everything works, but in those
> cases where additionally you can find mm's that don't map the page at
> that virtual address and may have different vmas cover it, this can
> arise.
This could only happen when you truncate a file that's
been mapped by various nonlinear VMAs, so truncate can't
get rid of the pages...
I suspect there are two ways to fix that:
1) on truncate, scan ALL the ptes inside nonlinear VMAs
and remove the pages
2) don't allow truncate on a file that's mapped with
nonlinear VMAs
Either would work.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
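Rik's first option amounts to a per-vma step like the following during
truncate; list walking and locking are elided, VM_NONLINEAR is assumed to be
how nonlinear vmas are flagged, and zap_page_range() is the existing
interface:

#include <linux/mm.h>

/*
 * A nonlinear vma can hold any page of the file at any offset, so the
 * only way truncate can be sure the pages are gone is to zap the whole
 * range of every such vma found on the mapping.
 */
static void truncate_zap_nonlinear(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_NONLINEAR)
		zap_page_range(vma, vma->vm_start,
			       vma->vm_end - vma->vm_start);
}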
On Sat, 13 Mar 2004, William Lee Irwin III wrote:
>> [hugetlb at same address]
On Sat, Mar 13, 2004 at 08:41:42PM -0500, Rik van Riel wrote:
> Well, we can find this merely by looking at the page tables
> themselves, so that shouldn't be a problem.
Pagetables of a kind the core understands may not be present there.
On ia32 one could in theory have a pmd_huge() check, which would in
turn not suffice for ia64 and sparc64 hugetlb. These were only examples.
Other unusual forms of mappings, e.g. VM_RESERVED and VM_IO, may also
be bad ideas to trip over by accident.
On Sat, 13 Mar 2004, William Lee Irwin III wrote:
>> Searching the COW sharing group isn't how everything works, but in those
>> cases where additionally you can find mm's that don't map the page at
>> that virtual address and may have different vmas cover it, this can
>> arise.
On Sat, Mar 13, 2004 at 08:41:42PM -0500, Rik van Riel wrote:
> This could only happen when you truncate a file that's
> been mapped by various nonlinear VMAs, so truncate can't
> get rid of the pages...
> I suspect there are two ways to fix that:
> 1) on truncate, scan ALL the ptes inside nonlinear VMAs
> and remove the pages
> 2) don't allow truncate on a file that's mapped with
> nonlinear VMAs
> Either would work.
I'm not sure how that came in. The issue I had in mind was strictly
a matter of tripping over things one can't make sense of from
pagetables alone in try_to_unmap().
COW-shared anonymous pages not unmappable via anonymous COW sharing
groups arising from truncate() vs. remap_file_pages() interactions and
failures to check for nonlinearly-mapped pages in pagetable walkers are
an issue in general of course, but they just aren't this issue.
-- wli
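Concretely, the guard wli is asking about is an extra check in a direct pte
lookup like the one sketched earlier in the thread; pmd_huge() is the ia32
check he mentions only "in theory", so both the check and the helper name are
assumptions here:

#include <linux/mm.h>
#include <asm/pgtable.h>

/*
 * As a plain pte lookup, but refuse to interpret a huge pmd as a
 * pointer to a pte page; ia64/sparc64 hugetlb would need their own
 * arch-specific checks, which is exactly the objection above.
 */
static pte_t *lookup_pte_careful(struct mm_struct *mm, unsigned long address)
{
	pgd_t *pgd = pgd_offset(mm, address);
	pmd_t *pmd;

	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return NULL;
	pmd = pmd_offset(pgd, address);
	if (pmd_none(*pmd))
		return NULL;
	if (pmd_huge(*pmd))
		return NULL;	/* hugetlb pmd: no pte page underneath */
	if (pmd_bad(*pmd))
		return NULL;
	return pte_offset_map(pmd, address);	/* pte_unmap() in the caller */
}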
On Tue, 9 Mar 2004, Andrea Arcangeli wrote:
> this doesn't lockup for me (in 2.6 + objrmap), but the machine is not
> responsive, while pushing 1G into swap. Here is a trace from the middle of the
> swapping; pressing ^C on your program doesn't get a response for half a minute.
>
> Mind leaving it running a bit longer before claiming a lockup?
>
> 1 206 615472 4032 84 879332 11248 16808 16324 16808 2618 20311 0 43 0 57
> 1 204 641740 1756 96 878476 2852 16980 4928 16980 5066 60228 0 35 1 64
> 1 205 650936 2508 100 875604 2248 9928 3772 9928 1364 21052 0 34 2 64
> 2 204 658212 2656 104 876904 3564 12052 4988 12052 2074 19647 0 32 1 67
> 1 204 674260 1628 104 878528 3236 12924 5608 12928 2062 27114 0 47 0 53
> 1 204 678248 1988 96 879004 3540 4664 4360 4664 1988 20728 0 31 0 69
> 1 203 683748 4024 96 878132 2844 5036 3724 5036 1513 18173 0 38 0 61
> 0 206 687312 1732 112 879056 3396 4260 4424 4272 1704 13222 0 32 0 68
> 1 204 690164 1936 116 880364 2844 3400 3496 3404 1422 18214 0 35 0 64
> 0 205 696572 4348 112 877676 2956 6620 3788 6620 1281 11544 0 37 1 62
> 0 204 699244 4168 108 878272 3140 3528 3892 3528 1467 11464 0 28 0 72
> 1 206 704296 1820 112 878604 2576 4980 3592 4980 1386 11710 0 26 0 74
> 1 205 710452 1972 104 876760 2256 6684 3092 6684 1308 20947 0 34 1 66
> 2 203 714512 1632 108 877564 2332 4876 3068 4876 1295 9792 0 20 0 80
> 0 204 719804 3720 112 878128 2536 6352 3100 6368 1441 20714 0 39 0 61
> 124 200 724708 1636 100 879548 3376 5308 3912 5308 1516 20732 0 38 0 62
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 1 204 730908 4344 100 877528 2592 6356 3672 6356 1819 15894 0 35 0 65
> 0 204 733556 3836 104 878256 2312 3132 3508 3132 1294 10905 0 33 0 67
> 0 205 736380 3388 100 877376 3084 3364 3832 3364 1322 11550 0 30 0 70
> 1 206 747016 2032 100 877760 2780 13144 4272 13144 1564 17486 0 37 0 63
> 1 205 756664 2192 96 878004 1704 7704 2116 7704 1341 20056 0 32 0 67
> 9 203 759084 3200 92 878516 2748 3168 3676 3168 1330 18252 0 45 0 54
> 0 205 761752 3928 96 877208 2604 2984 3284 2984 1330 10395 0 35 0 65
>
> most of the time is spent in "wa", though it's a 4-way, so it means at least
> two cpus are spinning. I'm pushing the box hard into swap. 2.6 swaps extremely
> slowly w/ or w/o objrmap, not much difference really w/ or w/o your exploit.
Andrea,
I did some swapping tests with 2.6 and found out that it was really slow,
too. Very unresponsive under heavy swapping.
-mm fixed things for me. Not sure which parts of it do the trick, though.
Can you be more specific about the "slow swap" comment you made?
Thank you!
On Mon, Mar 15, 2004 at 04:47:48PM -0300, Marcelo Tosatti wrote:
>
>
> On Tue, 9 Mar 2004, Andrea Arcangeli wrote:
>
> > this doesn't lockup for me (in 2.6 + objrmap), but the machine is not
> > responsive, while pushing 1G into swap. Here a trace in the middle of the
> > swapping while pressing C^c on your program doesn't respond for half a minute.
> >
> > Mind to leave it running a bit longer before claiming a lockup?
> >
> > [vmstat trace snipped -- identical to the numbers quoted above]
> >
> > most of the time is spent in "wa", though it's a 4-way, so it means at least
> > two cpus are spinning. I'm pushing the box hard into swap. 2.6 swap extremely
> > slow w/ or w/o objrmap, not much difference really w/o or w/o your exploit.
>
> Andrea,
>
> I did some swapping tests with 2.6 and found out that it was really slow,
> too. Very unresponsive under heavy swapping.
>
> -mm fixed things for me. Not sure parts of it do the trick, though.
>
> Can you be more specific on the "slow swap" comment you made ?
Well, it's just the swapin/swapout rate being too slow, as you noticed. I
didn't benchmark -mm in swap workloads, so it may very well be fixed in
-mm with Nick's patches. At this point in time I have more serious
troubles than the swap speed, and -mm can't help me with those troubles
(4:4 is a last resort I can take from the -mm tree, but I'm trying as
much as I can to avoid forcing people to 4:4 on the <=16G machines that
have huge margins with 3:1 and 2.4-aa; 32G boxes used to work fine too
with 3:1 on 2.4-aa, and in fact I'm trying to avoid 4:4 even on the 64G
machines).
On Mon, 15 Mar 2004, Andrea Arcangeli wrote:
> On Mon, Mar 15, 2004 at 04:47:48PM -0300, Marcelo Tosatti wrote:
> >
> >
> > On Tue, 9 Mar 2004, Andrea Arcangeli wrote:
> >
> > > this doesn't lockup for me (in 2.6 + objrmap), but the machine is not
> > > responsive, while pushing 1G into swap. Here a trace in the middle of the
> > > swapping while pressing C^c on your program doesn't respond for half a minute.
> > >
> > > Mind to leave it running a bit longer before claiming a lockup?
> > >
> > > [vmstat trace snipped -- identical to the numbers quoted above]
> > >
> > > most of the time is spent in "wa", though it's a 4-way, so it means at least
> > > two cpus are spinning. I'm pushing the box hard into swap. 2.6 swap extremely
> > > slow w/ or w/o objrmap, not much difference really w/o or w/o your exploit.
> >
> > Andrea,
> >
> > I did some swapping tests with 2.6 and found out that it was really slow,
> > too. Very unresponsive under heavy swapping.
> >
> > -mm fixed things for me. Not sure parts of it do the trick, though.
> >
> > Can you be more specific on the "slow swap" comment you made ?
>
> well, it's just the swapin/swapout rate being too slow as you noticed. I
> didn't benchmark -mm in swap workloads, so it may very well be fixed in
> -mm with Nick's patches. At this point in time I've more serious
> troubles than the swap speed, and -mm can't help me with those troubles
> (4:4 is a last resort I can take from the -mm tree, but I'm trying as
> much as I can to avoid forcing people to 4:4 on the <=16G machines that
> have huge margins with 3:1 and 2.4-aa, 32G are used to work fine too
> with 3:1 on 2.4-aa, infact I'm trying to avoid 4:4 even on the 64G
> machines).
What are the problems you are facing? Yes, I could read the previous
posts, etc., but a nice summary is always good, for me, for others, and for
you :)
Yes, 4:4 tlb flushing is, hum, not very cool.
On Tue, Mar 16, 2004 at 04:39:50AM -0300, Marcelo Tosatti wrote:
> What are the problems you are facing ? Yes, I could read the previous
> posts, etc. but a nice resume is always good, for me, for others, and for
> you :)
The primary problem with rmap is the memory consumption and the slowdown
during things like parallel compiles on 32-way machines, on both 32bit
and 64bit archs.
> Yes, 4:4 tlb flushing is, hum, not very cool.
and it can't help avoid wasting several gigs of ram on 64bit ;).