2002-10-18 23:02:05

by Dave McCracken

Subject: [PATCH 2.5.43-mm2] New shared page table patch


Ok, I've shaken the shared page table code hard, and fixed all the bugs I
could find. It now runs reliably on all the configurations I have access
to, including regression runs of LTP.

I've audited the usage of page_table_lock, and I believe I have covered all
the places that also need to hold the pte_page_lock.
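To illustrate what that audit covers, here is a minimal sketch (not part of the patch) of the pattern the audited call sites now follow, using the pte_page_lock/pte_page_unlock macros the patch adds to rmap-locking.h; the walker itself is hypothetical, compare the msync and mprotect hunks below:

#include <linux/mm.h>
#include <linux/rmap-locking.h>
#include <asm/pgtable.h>

/*
 * Hypothetical example: page_table_lock still guards the pgd/pmd
 * levels, while the pte page itself is guarded by pte_page_lock taken
 * on its struct page before the ptes are mapped and examined.
 */
static void walk_one_pmd(struct mm_struct *mm, pmd_t *pmd,
			 unsigned long address, unsigned long end)
{
	struct page *ptepage;
	pte_t *pte;

	spin_lock(&mm->page_table_lock);	/* pgd/pmd level */
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		goto out;

	ptepage = pmd_page(*pmd);
	pte_page_lock(ptepage);			/* new: per-pte-page lock */
	pte = pte_offset_map(pmd, address);
	do {
		/* ... inspect or update *pte here ... */
		address += PAGE_SIZE;
		pte++;
	} while (address < end);
	pte_unmap(pte - 1);
	pte_page_unlock(ptepage);
out:
	spin_unlock(&mm->page_table_lock);
}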

For reference, one of the tests was TPC-H. My code reduced the number of
allocated pte_chains from 5 million to 50 thousand.

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059


Attachments:
shpte-2.5.43-mm2-4.diff (62.66 kB)

2002-10-19 01:33:21

by Ed Tomlinson

Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On October 18, 2002 07:06 pm, Dave McCracken wrote:
> Ok, I've shaken the shared page table code hard, and fixed all the bugs I
> could find. It now runs reliably on all the configurations I have access
> to, including regression runs of LTP.
>
> I've audited the usage of page_table_lock, and I believe I have covered all
> the places that also need to hold the pte_page_lock.
>
> For reference, one of the tests was TPC-H. My code reduced the number of
> allocated pte_chains from 5 million to 50 thousand.

This is still failing here when starting kde3. Here is what looks different
in the massive straces of kde starting up:

---------- fails
[pid 983] getpid() = 983
[pid 983] getrlimit(0x3, 0xbffff724) = 0
[pid 983] close(5) = 0
[pid 983] close(3) = 0
[pid 983] rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 8) = 0
[pid 983] rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_DFL}, 8) = 0
[pid 983] chdir("/poola/home/ed") = 0
[pid 983] brk(0x805d000) = 0x805d000
[pid 983] open("/usr/lib/ksmserver.la", O_RDONLY) = 3
[pid 983] fstat64(3, {st_dev=makedev(33, 3), st_ino=125820, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=947, st_atime=2002/10/18-21:00:31, st_mtime=2002/08/13-08:32:59, st_ctime=2002/08/22-20:36:41}) = 0
[pid 983] old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40014000
[pid 983] read(3, "# ksmserver.la - a libtool libra"..., 4096) = 947
[pid 983] read(3, "", 4096) = 0
[pid 983] close(3) = 0
[pid 983] munmap(0x40014000, 4096) = 0
[pid 983] open("/usr/lib/ksmserver.so", O_RDONLY) = 3
[pid 983] read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200\262"..., 1024) = 1024
[pid 983] fstat64(3, {st_dev=makedev(33, 3), st_ino=125821, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=184, st_size=92000, st_atime=2002/10/18-21:09:24, st_mtime=2002/08/13-08:49:39, st_ctime=2002/08/22-20:36:41}) = 0
[pid 983] old_mmap(NULL, 95284, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x41029000
[pid 983] mprotect(0x4103f000, 5172, PROT_NONE) = 0
[pid 983] old_mmap(0x4103f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x15000) = 0x4103f000
[pid 983] close(3) = 0
[pid 983] --- SIGSEGV (Segmentation fault) ---
[pid 982] <... read resumed> "", 1) = 0
[pid 982] --- SIGCHLD (Child exited) ---
[pid 982] dup(2) = 6
[pid 982] fcntl64(6, F_GETFL) = 0x8801 (flags O_WRONLY|O_NONBLOCK|O_LARGEFILE)
[pid 982] close(6) = 0
[pid 982] write(2, "kdeinit: Pipe closed unexpectedl"..., 43kdeinit: Pipe closed unexpectedly: Success
---------- ok
[pid 1064] getpid() = 1064
[pid 1064] getrlimit(0x3, 0xbffff724) = 0
[pid 1064] close(5) = 0
[pid 1064] close(3) = 0
[pid 1064] rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 8) = 0
[pid 1064] rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_DFL}, 8) = 0
[pid 1064] chdir("/poola/home/ed") = 0
[pid 1064] brk(0x805d000) = 0x805d000
[pid 1064] open("/usr/lib/ksmserver.la", O_RDONLY) = 3
[pid 1064] fstat64(3, {st_dev=makedev(33, 3), st_ino=125820, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=947, st_atime=2002/10/18-20:55:52, st_mtime=2002/08/13-08:32:59, st_ctime=2002/08/22-20:36:41}) = 0
[pid 1064] old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40014000
[pid 1064] read(3, "# ksmserver.la - a libtool libra"..., 4096) = 947
[pid 1064] read(3, "", 4096) = 0
[pid 1064] close(3) = 0
[pid 1064] munmap(0x40014000, 4096) = 0
[pid 1064] open("/usr/lib/ksmserver.so", O_RDONLY) = 3
[pid 1064] read(3, <unfinished ...>
[pid 1016] <... select resumed> ) = 0 (Timeout)
[pid 1016] gettimeofday({1034989231, 109379}, NULL) = 0
[pid 1016] gettimeofday({1034989231, 109669}, NULL) = 0
[pid 1016] write(3, "=\0\4\0\22\0@\0\0\0\0\0\220\1<\0\232\6\5\0\23\0@\0\0\0"..., 356) = 356
[pid 1016] ioctl(3, 0x541b, [0]) = 0
[pid 1016] gettimeofday({1034989231, 113322}, NULL) = 0
[pid 1016] select(10, [3 5 7 9], NULL, NULL, {0, 395460} <unfinished ...>
[pid 1064] <... read resumed> "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200\262"..., 1024) = 1024
[pid 1064] fstat64(3, {st_dev=makedev(33, 3), st_ino=125821, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=184, st_size=92000, st_atime=2002/10/18-21:00:31, st_mtime=2002/08/13-08:49:39, st_ctime=2002/08/22-20:36:41}) = 0
[pid 1064] old_mmap(NULL, 95284, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x41029000
[pid 1064] mprotect(0x4103f000, 5172, PROT_NONE) = 0
[pid 1064] old_mmap(0x4103f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x15000) = 0x4103f000
[pid 1064] close(3) = 0
[pid 1064] write(6, "\0", 1 <unfinished ...>
[pid 1063] <... read resumed> "\0", 1) = 1
[pid 1064] <... write resumed> ) = 1
[pid 1063] close(5 <unfinished ...>
[pid 1064] close(6 <unfinished ...>
----------

I will send the complete straces to you and wli.

I'd really like to see this working!

Ed Tomlinson

2002-10-19 19:12:38

by Bill Davidsen

Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

--- 2.5.43-mm2/./fs/exec.c 2002-10-17 11:12:59.000000000 -0500
+++ 2.5.43-mm2-shpte/./fs/exec.c 2002-10-17 11:42:16.000000000 -0500
@@ -47,6 +47,7 @@
#include <asm/uaccess.h>
#include <asm/pgalloc.h>
#include <asm/mmu_context.h>
+#include <asm/rmap.h>

#ifdef CONFIG_KMOD
#include <linux/kmod.h>
@@ -311,8 +312,8 @@
flush_page_to_ram(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(page, PAGE_COPY))));
page_add_rmap(page, pte);
+ increment_rss(kmap_atomic_to_page(pte));
pte_unmap(pte);
- tsk->mm->rss++;
spin_unlock(&tsk->mm->page_table_lock);

/* no need for flush_tlb */
--- 2.5.43-mm2/./arch/i386/kernel/vm86.c 2002-10-15 22:27:15.000000000 -0500
+++ 2.5.43-mm2-shpte/./arch/i386/kernel/vm86.c 2002-10-18 13:35:44.000000000 -0500
@@ -39,6 +39,7 @@
#include <linux/smp.h>
#include <linux/smp_lock.h>
#include <linux/highmem.h>
+#include <linux/rmap-locking.h>

#include <asm/uaccess.h>
#include <asm/pgalloc.h>
@@ -120,6 +121,7 @@

static void mark_screen_rdonly(struct task_struct * tsk)
{
+ struct page *ptepage;
pgd_t *pgd;
pmd_t *pmd;
pte_t *pte, *mapped;
@@ -143,6 +145,8 @@
pmd_clear(pmd);
goto out;
}
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
pte = mapped = pte_offset_map(pmd, 0xA0000);
for (i = 0; i < 32; i++) {
if (pte_present(*pte))
@@ -150,6 +154,7 @@
pte++;
}
pte_unmap(mapped);
+ pte_page_unlock(ptepage);
out:
spin_unlock(&tsk->mm->page_table_lock);
preempt_enable();
--- 2.5.43-mm2/./arch/i386/Config.help 2002-10-15 22:27:14.000000000 -0500
+++ 2.5.43-mm2-shpte/./arch/i386/Config.help 2002-10-17 11:42:16.000000000 -0500
@@ -165,6 +165,13 @@
low memory. Setting this option will put user-space page table
entries in high memory.

+CONFIG_SHAREPTE
+ Normally each address space has its own complete page table for all
+ its mappings. This can mean many mappings of a set of shared data
+ pages. With this option, the VM will attempt to share the bottom
+ level of the page table between address spaces that are sharing data
+ pages.
+
CONFIG_HIGHMEM4G
Select this if you have a 32-bit processor and between 1 and 4
gigabytes of physical RAM.
--- 2.5.43-mm2/./arch/i386/config.in 2002-10-17 11:12:53.000000000 -0500
+++ 2.5.43-mm2-shpte/./arch/i386/config.in 2002-10-17 11:42:16.000000000 -0500
@@ -233,6 +233,7 @@
if [ "$CONFIG_HIGHMEM4G" = "y" -o "$CONFIG_HIGHMEM64G" = "y" ]; then
bool 'Allocate 3rd-level pagetables from highmem' CONFIG_HIGHPTE
fi
+bool 'Share 3rd-level pagetables' CONFIG_SHAREPTE y

bool 'Math emulation' CONFIG_MATH_EMULATION
bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR
--- 2.5.43-mm2/./include/linux/mm.h 2002-10-17 11:13:00.000000000 -0500
+++ 2.5.43-mm2-shpte/./include/linux/mm.h 2002-10-17 11:42:16.000000000 -0500
@@ -164,6 +164,8 @@
struct pte_chain *chain;/* Reverse pte mapping pointer.
* protected by PG_chainlock */
pte_addr_t direct;
+ struct mm_chain *mmchain;/* Reverse mm_struct mapping pointer */
+ struct mm_struct *mmdirect;
} pte;
unsigned long private; /* mapping-private opaque data */

@@ -358,7 +360,12 @@
extern int shmem_zero_setup(struct vm_area_struct *);

extern void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size);
+#ifdef CONFIG_SHAREPTE
+extern pte_t *pte_unshare(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
+extern int share_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma, pmd_t **prev_pmd);
+#else
extern int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma);
+#endif
extern int remap_page_range(struct vm_area_struct *vma, unsigned long from, unsigned long to, unsigned long size, pgprot_t prot);
extern int zeromap_page_range(struct vm_area_struct *vma, unsigned long from, unsigned long size, pgprot_t prot);

--- 2.5.43-mm2/./include/linux/rmap-locking.h 2002-10-15 22:27:16.000000000 -0500
+++ 2.5.43-mm2-shpte/./include/linux/rmap-locking.h 2002-10-17 11:42:19.000000000 -0500
@@ -23,6 +23,18 @@
#endif
}

+static inline int pte_chain_trylock(struct page *page)
+{
+ preempt_disable();
+#ifdef CONFIG_SMP
+ if (test_and_set_bit(PG_chainlock, &page->flags)) {
+ preempt_enable();
+ return 0;
+ }
+#endif
+ return 1;
+}
+
static inline void pte_chain_unlock(struct page *page)
{
#ifdef CONFIG_SMP
@@ -31,3 +43,7 @@
#endif
preempt_enable();
}
+
+#define pte_page_lock pte_chain_lock
+#define pte_page_trylock pte_chain_trylock
+#define pte_page_unlock pte_chain_unlock
--- 2.5.43-mm2/./include/linux/page-flags.h 2002-10-17 11:13:00.000000000 -0500
+++ 2.5.43-mm2-shpte/./include/linux/page-flags.h 2002-10-17 16:06:42.000000000 -0500
@@ -68,7 +68,7 @@
#define PG_chainlock 15 /* lock bit for ->pte_chain */

#define PG_direct 16 /* ->pte_chain points directly at pte */
-
+#define PG_ptepage 17 /* This page is a pte page */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
* allowed.
@@ -239,6 +239,12 @@
#define ClearPageDirect(page) clear_bit(PG_direct, &(page)->flags)
#define TestClearPageDirect(page) test_and_clear_bit(PG_direct, &(page)->flags)

+#define PagePtepage(page) test_bit(PG_ptepage, &(page)->flags)
+#define SetPagePtepage(page) set_bit(PG_ptepage, &(page)->flags)
+#define TestSetPagePtepage(page) test_and_set_bit(PG_ptepage, &(page)->flags)
+#define ClearPagePtepage(page) clear_bit(PG_ptepage, &(page)->flags)
+#define TestClearPagePtepage(page) test_and_clear_bit(PG_ptepage, &(page)->flags)
+
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
* but it may again do so one day.
--- 2.5.43-mm2/./include/asm-generic/rmap.h 2002-10-15 22:28:24.000000000 -0500
+++ 2.5.43-mm2-shpte/./include/asm-generic/rmap.h 2002-10-17 11:42:16.000000000 -0500
@@ -26,33 +26,6 @@
*/
#include <linux/mm.h>

-static inline void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
-{
-#ifdef BROKEN_PPC_PTE_ALLOC_ONE
- /* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
- extern int mem_init_done;
-
- if (!mem_init_done)
- return;
-#endif
- page->mapping = (void *)mm;
- page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
- inc_page_state(nr_page_table_pages);
-}
-
-static inline void pgtable_remove_rmap(struct page * page)
-{
- page->mapping = NULL;
- page->index = 0;
- dec_page_state(nr_page_table_pages);
-}
-
-static inline struct mm_struct * ptep_to_mm(pte_t * ptep)
-{
- struct page * page = kmap_atomic_to_page(ptep);
- return (struct mm_struct *) page->mapping;
-}
-
static inline unsigned long ptep_to_address(pte_t * ptep)
{
struct page * page = kmap_atomic_to_page(ptep);
@@ -87,4 +60,10 @@
}
#endif

+extern void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address);
+extern void pgtable_add_rmap_locked(struct page * page, struct mm_struct * mm, unsigned long address);
+extern void pgtable_remove_rmap(struct page * page, struct mm_struct *mm);
+extern void pgtable_remove_rmap_locked(struct page * page, struct mm_struct *mm);
+extern void increment_rss(struct page *ptepage);
+
#endif /* _GENERIC_RMAP_H */
--- 2.5.43-mm2/./include/asm-i386/rmap.h 2002-10-15 22:28:25.000000000 -0500
+++ 2.5.43-mm2-shpte/./include/asm-i386/rmap.h 2002-10-17 16:06:42.000000000 -0500
@@ -9,12 +9,15 @@
{
unsigned long pfn = (unsigned long)(pte_paddr >> PAGE_SHIFT);
unsigned long off = ((unsigned long)pte_paddr) & ~PAGE_MASK;
+
+ preempt_disable();
return (pte_t *)((char *)kmap_atomic(pfn_to_page(pfn), KM_PTE2) + off);
}

static inline void rmap_ptep_unmap(pte_t *pte)
{
kunmap_atomic(pte, KM_PTE2);
+ preempt_enable();
}
#endif

--- 2.5.43-mm2/./include/asm-i386/pgtable.h 2002-10-15 22:28:33.000000000 -0500
+++ 2.5.43-mm2-shpte/./include/asm-i386/pgtable.h 2002-10-17 11:42:16.000000000 -0500
@@ -124,6 +124,7 @@
#define _PAGE_PROTNONE 0x080 /* If not present */

#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define _PAGE_TABLE_RDONLY (_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY)

@@ -184,8 +185,8 @@
#define pmd_none(x) (!pmd_val(x))
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
-#define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
-
+#define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_RW)) != \
+ (_KERNPG_TABLE & ~_PAGE_RW))

#define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT))

@@ -209,6 +210,9 @@
static inline pte_t pte_mkdirty(pte_t pte) { (pte).pte_low |= _PAGE_DIRTY; return pte; }
static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; }
+static inline int pmd_write(pmd_t pmd) { return (pmd).pmd & _PAGE_RW; }
+static inline pmd_t pmd_wrprotect(pmd_t pmd) { (pmd).pmd &= ~_PAGE_RW; return pmd; }
+static inline pmd_t pmd_mkwrite(pmd_t pmd) { (pmd).pmd |= _PAGE_RW; return pmd; }

static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low); }
@@ -265,12 +269,20 @@
((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE0) + __pte_offset(address))
#define pte_offset_map_nested(dir, address) \
((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE1) + __pte_offset(address))
+#define pte_page_map(__page, address) \
+ ((pte_t *)kmap_atomic(__page,KM_PTE0) + __pte_offset(address))
+#define pte_page_map_nested(__page, address) \
+ ((pte_t *)kmap_atomic(__page,KM_PTE1) + __pte_offset(address))
#define pte_unmap(pte) kunmap_atomic(pte, KM_PTE0)
#define pte_unmap_nested(pte) kunmap_atomic(pte, KM_PTE1)
#else
#define pte_offset_map(dir, address) \
((pte_t *)page_address(pmd_page(*(dir))) + __pte_offset(address))
#define pte_offset_map_nested(dir, address) pte_offset_map(dir, address)
+#define pte_page_map(__page, address) \
+ ((pte_t *)page_address(__page) + __pte_offset(address))
+#define pte_page_map_nested(__page, address) \
+ ((pte_t *)page_address(__page) + __pte_offset(address))
#define pte_unmap(pte) do { } while (0)
#define pte_unmap_nested(pte) do { } while (0)
#endif
--- 2.5.43-mm2/./kernel/fork.c 2002-10-17 11:13:01.000000000 -0500
+++ 2.5.43-mm2-shpte/./kernel/fork.c 2002-10-17 11:42:16.000000000 -0500
@@ -210,6 +210,9 @@
struct vm_area_struct * mpnt, *tmp, **pprev;
int retval;
unsigned long charge = 0;
+#ifdef CONFIG_SHAREPTE
+ pmd_t *prev_pmd = 0;
+#endif

flush_cache_mm(current->mm);
mm->locked_vm = 0;
@@ -273,7 +276,11 @@
*pprev = tmp;
pprev = &tmp->vm_next;
mm->map_count++;
+#ifdef CONFIG_SHAREPTE
+ retval = share_page_range(mm, current->mm, tmp, &prev_pmd);
+#else
retval = copy_page_range(mm, current->mm, tmp);
+#endif
spin_unlock(&mm->page_table_lock);

if (tmp->vm_ops && tmp->vm_ops->open)
--- 2.5.43-mm2/./mm/swapfile.c 2002-10-17 11:13:01.000000000 -0500
+++ 2.5.43-mm2-shpte/./mm/swapfile.c 2002-10-18 13:36:14.000000000 -0500
@@ -15,8 +15,10 @@
#include <linux/shm.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/rmap-locking.h>

#include <asm/pgtable.h>
+#include <asm/rmap.h>
#include <linux/swapops.h>

spinlock_t swaplock = SPIN_LOCK_UNLOCKED;
@@ -371,7 +373,7 @@
*/
/* mmlist_lock and vma->vm_mm->page_table_lock are held */
static inline void unuse_pte(struct vm_area_struct * vma, unsigned long address,
- pte_t *dir, swp_entry_t entry, struct page* page)
+ pte_t *dir, swp_entry_t entry, struct page* page, pmd_t *pmd)
{
pte_t pte = *dir;

@@ -383,7 +385,7 @@
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_rmap(page, dir);
swap_free(entry);
- ++vma->vm_mm->rss;
+ increment_rss(pmd_page(*pmd));
}

/* mmlist_lock and vma->vm_mm->page_table_lock are held */
@@ -391,6 +393,7 @@
unsigned long address, unsigned long size, unsigned long offset,
swp_entry_t entry, struct page* page)
{
+ struct page *ptepage;
pte_t * pte;
unsigned long end;

@@ -401,6 +404,8 @@
pmd_clear(dir);
return;
}
+ ptepage = pmd_page(*dir);
+ pte_page_lock(ptepage);
pte = pte_offset_map(dir, address);
offset += address & PMD_MASK;
address &= ~PMD_MASK;
@@ -408,10 +413,11 @@
if (end > PMD_SIZE)
end = PMD_SIZE;
do {
- unuse_pte(vma, offset+address-vma->vm_start, pte, entry, page);
+ unuse_pte(vma, offset+address-vma->vm_start, pte, entry, page, dir);
address += PAGE_SIZE;
pte++;
} while (address && (address < end));
+ pte_page_unlock(ptepage);
pte_unmap(pte - 1);
}

--- 2.5.43-mm2/./mm/msync.c 2002-10-17 11:13:01.000000000 -0500
+++ 2.5.43-mm2-shpte/./mm/msync.c 2002-10-18 13:31:24.000000000 -0500
@@ -11,6 +11,7 @@
#include <linux/pagemap.h>
#include <linux/mm.h>
#include <linux/mman.h>
+#include <linux/rmap-locking.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -43,6 +44,7 @@
unsigned long address, unsigned long end,
struct vm_area_struct *vma, unsigned int flags)
{
+ struct page *ptepage;
pte_t *pte;
int error;

@@ -53,6 +55,8 @@
pmd_clear(pmd);
return 0;
}
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
pte = pte_offset_map(pmd, address);
if ((address & PMD_MASK) != (end & PMD_MASK))
end = (address & PMD_MASK) + PMD_SIZE;
@@ -64,6 +68,7 @@
} while (address && (address < end));

pte_unmap(pte - 1);
+ pte_page_unlock(ptepage);

return error;
}
--- 2.5.43-mm2/./mm/mprotect.c 2002-10-17 11:13:01.000000000 -0500
+++ 2.5.43-mm2-shpte/./mm/mprotect.c 2002-10-17 11:42:16.000000000 -0500
@@ -16,6 +16,7 @@
#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/security.h>
+#include <linux/rmap-locking.h>

#include <asm/uaccess.h>
#include <asm/pgalloc.h>
@@ -24,10 +25,11 @@
#include <asm/tlbflush.h>

static inline void
-change_pte_range(pmd_t *pmd, unsigned long address,
+change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address,
unsigned long size, pgprot_t newprot)
{
pte_t * pte;
+ struct page *ptepage;
unsigned long end;

if (pmd_none(*pmd))
@@ -37,11 +39,32 @@
pmd_clear(pmd);
return;
}
- pte = pte_offset_map(pmd, address);
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
address &= ~PMD_MASK;
end = address + size;
if (end > PMD_SIZE)
end = PMD_SIZE;
+
+#ifdef CONFIG_SHAREPTE
+ if (page_count(ptepage) > 1) {
+ if ((address == 0) && (end == PMD_SIZE)) {
+ pmd_t pmdval = *pmd;
+
+ if (vma->vm_flags & VM_MAYWRITE)
+ pmdval = pmd_mkwrite(pmdval);
+ else
+ pmdval = pmd_wrprotect(pmdval);
+ set_pmd(pmd, pmdval);
+ pte_page_unlock(ptepage);
+ return;
+ }
+ pte = pte_unshare(vma->vm_mm, pmd, address);
+ ptepage = pmd_page(*pmd);
+ } else
+#endif
+ pte = pte_offset_map(pmd, address);
+
do {
if (pte_present(*pte)) {
pte_t entry;
@@ -56,11 +79,12 @@
address += PAGE_SIZE;
pte++;
} while (address && (address < end));
+ pte_page_unlock(ptepage);
pte_unmap(pte - 1);
}

static inline void
-change_pmd_range(pgd_t *pgd, unsigned long address,
+change_pmd_range(struct vm_area_struct *vma, pgd_t *pgd, unsigned long address,
unsigned long size, pgprot_t newprot)
{
pmd_t * pmd;
@@ -79,7 +103,7 @@
if (end > PGDIR_SIZE)
end = PGDIR_SIZE;
do {
- change_pte_range(pmd, address, end - address, newprot);
+ change_pte_range(vma, pmd, address, end - address, newprot);
address = (address + PMD_SIZE) & PMD_MASK;
pmd++;
} while (address && (address < end));
@@ -98,7 +122,7 @@
BUG();
spin_lock(&current->mm->page_table_lock);
do {
- change_pmd_range(dir, start, end - start, newprot);
+ change_pmd_range(vma, dir, start, end - start, newprot);
start = (start + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (start && (start < end));
--- 2.5.43-mm2/./mm/memory.c 2002-10-17 11:13:01.000000000 -0500
+++ 2.5.43-mm2-shpte/./mm/memory.c 2002-10-18 16:57:25.000000000 -0500
@@ -36,6 +36,20 @@
* ([email protected])
*/

+/*
+ * A note on locking of the page table structure:
+ *
+ * The top level lock that protects the page table is the
+ * mm->page_table_lock. This lock protects the pgd and pmd layer.
+ * However, with the advent of shared pte pages, this lock is not
+ * sufficient. The pte layer is now protected by the pte_page_lock,
+ * set in the struct page of the pte page. Note that with this
+ * locking scheme, once the pgd and pmd layers have been set in the
+ * page fault path and the pte_page_lock has been taken, the
+ * page_table_lock can be released.
+ *
+ */
+
#include <linux/kernel_stat.h>
#include <linux/mm.h>
#include <linux/hugetlb.h>
@@ -44,6 +58,7 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/vcache.h>
+#include <linux/rmap-locking.h>

#include <asm/pgalloc.h>
#include <asm/rmap.h>
@@ -83,7 +98,7 @@
*/
static inline void free_one_pmd(mmu_gather_t *tlb, pmd_t * dir)
{
- struct page *page;
+ struct page *ptepage;

if (pmd_none(*dir))
return;
@@ -92,10 +107,20 @@
pmd_clear(dir);
return;
}
- page = pmd_page(*dir);
+ ptepage = pmd_page(*dir);
+
+ pte_page_lock(ptepage);
+
+ BUG_ON(page_count(ptepage) != 1);
+
pmd_clear(dir);
- pgtable_remove_rmap(page);
- pte_free_tlb(tlb, page);
+ pgtable_remove_rmap_locked(ptepage, tlb->mm);
+ dec_page_state(nr_page_table_pages);
+ ClearPagePtepage(ptepage);
+
+ pte_page_unlock(ptepage);
+
+ pte_free_tlb(tlb, ptepage);
}

static inline void free_one_pgd(mmu_gather_t *tlb, pgd_t * dir)
@@ -136,6 +161,318 @@
} while (--nr);
}

+#ifdef CONFIG_SHAREPTE
+/*
+ * This function makes the decision whether a pte page needs to be
+ * unshared or not. Note that page_count() == 1 isn't even tested
+ * here. The assumption is that if the pmd entry is marked writeable,
+ * then the page is either already unshared or doesn't need to be
+ * unshared. This catches the situation where task B unshares the pte
+ * page, then task A faults and needs to unprotect the pmd entry.
+ * This is actually done in pte_unshare.
+ *
+ * This function should be called with the page_table_lock held.
+ */
+static inline int pte_needs_unshare(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long address,
+ int write_access)
+{
+ struct page *ptepage;
+
+ /* It's not even there, nothing to unshare. */
+ if (!pmd_present(*pmd))
+ return 0;
+
+ /*
+ * If it's already writable, then it doesn't need to be unshared.
+ * It's either already not shared or it's part of a large shared
+ * region that will never need to be unshared.
+ */
+ if (pmd_write(*pmd))
+ return 0;
+
+ /* If this isn't a write fault we don't need to unshare. */
+ if (!write_access)
+ return 0;
+
+ /*
+ * If this page fits entirely inside a shared region, don't unshare it.
+ */
+ ptepage = pmd_page(*pmd);
+ if ((vma->vm_flags & VM_SHARED) &&
+ (vma->vm_start <= ptepage->index) &&
+ (vma->vm_end >= (ptepage->index + PMD_SIZE))) {
+ return 0;
+ }
+ /*
+ * Ok, we have to unshare.
+ */
+ return 1;
+}
+
+/**
+ * pte_unshare - Unshare a pte page
+ * @mm: the mm_struct that gets an unshared pte page
+ * @pmd: a pointer to the pmd entry that needs unsharing
+ * @address: the virtual address that triggered the unshare
+ *
+ * Here is where a pte page is actually unshared. It actually covers
+ * a couple of possible conditions. If the page_count() is already 1,
+ * then that means it just needs to be set writeable. Otherwise, a
+ * new page needs to be allocated.
+ *
+ * When each pte entry is copied, it is evaluated for COW protection,
+ * as well as checking whether the swap count needs to be incremented.
+ *
+ * This function must be called with the page_table_lock held. It
+ * will release and reacquire the lock when it allocates a new page.
+ *
+ * The function must also be called with the pte_page_lock held on the
+ * old page. This lock will also be dropped, then reacquired when we
+ * allocate a new page. The pte_page_lock will be taken on the new
+ * page. Whichever pte page is returned will have its pte_page_lock
+ * held.
+ */
+
+pte_t *pte_unshare(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+{
+ pte_t *src_ptb, *dst_ptb;
+ struct page *oldpage, *newpage;
+ struct vm_area_struct *vma;
+ int base, addr;
+ int end, page_end;
+ int src_unshare;
+
+ oldpage = pmd_page(*pmd);
+
+ /* If it's already unshared, we just need to set it writeable */
+ if (page_count(oldpage) == 1) {
+is_unshared:
+ pmd_populate(mm, pmd, oldpage);
+ flush_tlb_mm(mm);
+ goto out_map;
+ }
+
+ pte_page_unlock(oldpage);
+ spin_unlock(&mm->page_table_lock);
+ newpage = pte_alloc_one(mm, address);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!newpage))
+ return NULL;
+
+ /*
+ * Fetch the ptepage pointer again in case it changed while
+ * the lock was dropped.
+ */
+ oldpage = pmd_page(*pmd);
+ pte_page_lock(oldpage);
+
+ /* See if it got unshared while we dropped the lock */
+ if (page_count(oldpage) == 1) {
+ pte_free(newpage);
+ goto is_unshared;
+ }
+
+ pte_page_lock(newpage);
+
+ base = addr = oldpage->index;
+ page_end = base + PMD_SIZE;
+ vma = find_vma(mm, base);
+ src_unshare = page_count(oldpage) == 2;
+ dst_ptb = pte_page_map(newpage, base);
+
+ if (!vma || (page_end <= vma->vm_start)) {
+ goto no_vma;
+ }
+
+ if (vma->vm_start > addr)
+ addr = vma->vm_start;
+
+ if (vma->vm_end < page_end)
+ end = vma->vm_end;
+ else
+ end = page_end;
+
+ src_ptb = pte_page_map_nested(oldpage, base);
+
+ do {
+ unsigned int cow = 0;
+ pte_t *src_pte = src_ptb + __pte_offset(addr);
+ pte_t *dst_pte = dst_ptb + __pte_offset(addr);
+
+ cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
+
+ do {
+ pte_t pte = *src_pte;
+ struct page *page;
+
+ if (pte_none(pte))
+ goto unshare_skip_set;
+
+ if (!pte_present(pte)) {
+ swap_duplicate(pte_to_swp_entry(pte));
+ set_pte(dst_pte, pte);
+ goto unshare_skip_set;
+ }
+ page = pte_page(pte);
+ if (!PageReserved(page)) {
+ /* COW mappings require write protecting both sides */
+ if (cow) {
+ pte = pte_wrprotect(pte);
+ if (src_unshare)
+ set_pte(src_pte, pte);
+ }
+ /* If it's a shared mapping,
+ * mark it clean in the new mapping
+ */
+ if (vma->vm_flags & VM_SHARED)
+ pte = pte_mkclean(pte);
+ pte = pte_mkold(pte);
+ get_page(page);
+ }
+ set_pte(dst_pte, pte);
+ page_add_rmap(page, dst_pte);
+unshare_skip_set:
+ src_pte++;
+ dst_pte++;
+ addr += PAGE_SIZE;
+ } while (addr < end);
+
+ if (addr >= page_end)
+ break;
+
+ vma = vma->vm_next;
+ if (!vma)
+ break;
+
+ if (page_end <= vma->vm_start)
+ break;
+
+ addr = vma->vm_start;
+ if (vma->vm_end < page_end)
+ end = vma->vm_end;
+ else
+ end = page_end;
+ } while (1);
+
+ pte_unmap_nested(src_ptb);
+
+no_vma:
+ SetPagePtepage(newpage);
+ pgtable_remove_rmap_locked(oldpage, mm);
+ pgtable_add_rmap_locked(newpage, mm, base);
+ pmd_populate(mm, pmd, newpage);
+ inc_page_state(nr_page_table_pages);
+
+ /* Copy rss count */
+ newpage->private = oldpage->private;
+
+ flush_tlb_mm(mm);
+
+ put_page(oldpage);
+
+ pte_page_unlock(oldpage);
+
+ return dst_ptb + __pte_offset(address);
+
+out_map:
+ return pte_offset_map(pmd, address);
+}
+
+/**
+ * pte_try_to_share - Attempt to find a pte page that can be shared
+ * @mm: the mm_struct that needs a pte page
+ * @vma: the vm_area the address is in
+ * @pmd: a pointer to the pmd entry that needs filling
+ * @address: the address that caused the fault
+ *
+ * This function is called during a page fault. If there is no pte
+ * page for this address, it checks the vma to see if it is shared,
+ * and if it spans the pte page. If so, it goes to the address_space
+ * structure and looks through for matching vmas from other tasks that
+ * already have a pte page that can be shared. If it finds one, it
+ * attaches it and makes it a shared page.
+ */
+
+static pte_t *pte_try_to_share(struct mm_struct *mm, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long address)
+{
+ struct address_space *as;
+ struct vm_area_struct *lvma;
+ struct page *ptepage;
+ unsigned long base;
+ pte_t *pte = NULL;
+
+ /*
+ * It already has a pte page. No point in checking further.
+ * We can go ahead and return it now, since we know it's there.
+ */
+ if (pmd_present(*pmd)) {
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
+ return pte_page_map(ptepage, address);
+ }
+
+ /* It's not even shared memory. We definitely can't share the page. */
+ if (!(vma->vm_flags & VM_SHARED))
+ return NULL;
+
+ /* We can only share if the entire pte page fits inside the vma */
+ base = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
+ if ((base < vma->vm_start) || (vma->vm_end < (base + PMD_SIZE)))
+ return NULL;
+
+ as = vma->vm_file->f_dentry->d_inode->i_mapping;
+
+ spin_lock(&as->i_shared_lock);
+
+ list_for_each_entry(lvma, &as->i_mmap_shared, shared) {
+ pgd_t *lpgd;
+ pmd_t *lpmd;
+ pmd_t pmdval;
+
+ /* Skip the one we're working on */
+ if (lvma == vma)
+ continue;
+
+ /* It has to be mapping to the same address */
+ if ((lvma->vm_start != vma->vm_start) ||
+ (lvma->vm_end != vma->vm_end) ||
+ (lvma->vm_pgoff != vma->vm_pgoff))
+ continue;
+
+ lpgd = pgd_offset(lvma->vm_mm, address);
+ lpmd = pmd_offset(lpgd, address);
+
+ /* This page table doesn't have a pte page either, so skip it. */
+ if (!pmd_present(*lpmd))
+ continue;
+
+ /* Ok, we can share it. */
+
+ ptepage = pmd_page(*lpmd);
+ pte_page_lock(ptepage);
+ get_page(ptepage);
+ pgtable_add_rmap_locked(ptepage, mm, address);
+ /*
+ * If this vma is only mapping it read-only, set the
+ * pmd entry read-only to protect it from writes.
+ * Otherwise set it writeable.
+ */
+ if (vma->vm_flags & VM_MAYWRITE)
+ pmdval = pmd_mkwrite(*lpmd);
+ else
+ pmdval = pmd_wrprotect(*lpmd);
+ set_pmd(pmd, pmdval);
+ pte = pte_page_map(ptepage, address);
+ break;
+ }
+ spin_unlock(&as->i_shared_lock);
+ return pte;
+}
+#endif
+
pte_t * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
{
if (!pmd_present(*pmd)) {
@@ -155,13 +492,13 @@
pte_free(new);
goto out;
}
+ SetPagePtepage(new);
pgtable_add_rmap(new, mm, address);
pmd_populate(mm, pmd, new);
+ inc_page_state(nr_page_table_pages);
}
out:
- if (pmd_present(*pmd))
- return pte_offset_map(pmd, address);
- return NULL;
+ return pte_offset_map(pmd, address);
}

pte_t * pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
@@ -183,7 +520,6 @@
pte_free_kernel(new);
goto out;
}
- pgtable_add_rmap(virt_to_page(new), mm, address);
pmd_populate_kernel(mm, pmd, new);
}
out:
@@ -192,6 +528,111 @@
#define PTE_TABLE_MASK ((PTRS_PER_PTE-1) * sizeof(pte_t))
#define PMD_TABLE_MASK ((PTRS_PER_PMD-1) * sizeof(pmd_t))

+#ifdef CONFIG_SHAREPTE
+/**
+ * share_page_range - share a range of pages at the pte page level at fork time
+ * @dst: the mm_struct of the forked child
+ * @src: the mm_struct of the forked parent
+ * @vma: the vm_area to be shared
+ * @prev_pmd: A pointer to the pmd entry we did at last invocation
+ *
+ * This function shares pte pages between parent and child at fork.
+ * If the vm_area is shared and spans the page, it sets it
+ * writeable. Otherwise it sets it read-only. The prev_pmd parameter
+ * is used to keep track of pte pages we've already shared, since this
+ * function can be called with multiple vmas that point to the same
+ * pte page.
+ */
+int share_page_range(struct mm_struct *dst, struct mm_struct *src,
+ struct vm_area_struct *vma, pmd_t **prev_pmd)
+{
+ pgd_t *src_pgd, *dst_pgd;
+ unsigned long address = vma->vm_start;
+ unsigned long end = vma->vm_end;
+
+ if (is_vm_hugetlb_page(vma))
+ return copy_hugetlb_page_range(dst, src, vma);
+
+ src_pgd = pgd_offset(src, address)-1;
+ dst_pgd = pgd_offset(dst, address)-1;
+
+ for (;;) {
+ pmd_t * src_pmd, * dst_pmd;
+
+ src_pgd++; dst_pgd++;
+
+ if (pgd_none(*src_pgd))
+ goto skip_share_pmd_range;
+ if (pgd_bad(*src_pgd)) {
+ pgd_ERROR(*src_pgd);
+ pgd_clear(src_pgd);
+skip_share_pmd_range: address = (address + PGDIR_SIZE) & PGDIR_MASK;
+ if (!address || (address >= end))
+ goto out;
+ continue;
+ }
+
+ src_pmd = pmd_offset(src_pgd, address);
+ dst_pmd = pmd_alloc(dst, dst_pgd, address);
+ if (!dst_pmd)
+ goto nomem;
+
+ spin_lock(&src->page_table_lock);
+
+ do {
+ pmd_t pmdval = *src_pmd;
+ struct page *ptepage = pmd_page(pmdval);
+
+ if (pmd_none(pmdval))
+ goto skip_share_pte_range;
+ if (pmd_bad(pmdval)) {
+ pmd_ERROR(*src_pmd);
+ pmd_clear(src_pmd);
+ goto skip_share_pte_range;
+ }
+
+ /*
+ * We set the pmd read-only in both the parent and the
+ * child unless it's a writeable shared region that
+ * spans the entire pte page.
+ */
+ if ((((vma->vm_flags & (VM_SHARED|VM_MAYWRITE)) !=
+ (VM_SHARED|VM_MAYWRITE)) ||
+ (ptepage->index < vma->vm_start) ||
+ ((ptepage->index + PMD_SIZE) > vma->vm_end)) &&
+ pmd_write(pmdval)) {
+ pmdval = pmd_wrprotect(pmdval);
+ set_pmd(src_pmd, pmdval);
+ }
+ set_pmd(dst_pmd, pmdval);
+
+ /* Only do this if we haven't seen this pte page before */
+ if (src_pmd != *prev_pmd) {
+ get_page(ptepage);
+ pgtable_add_rmap(ptepage, dst, address);
+ *prev_pmd = src_pmd;
+ dst->rss += ptepage->private;
+ }
+
+skip_share_pte_range: address = (address + PMD_SIZE) & PMD_MASK;
+ if (address >= end)
+ goto out_unlock;
+
+ src_pmd++;
+ dst_pmd++;
+ } while ((unsigned long)src_pmd & PMD_TABLE_MASK);
+ spin_unlock(&src->page_table_lock);
+ }
+
+out_unlock:
+ spin_unlock(&src->page_table_lock);
+
+out:
+ return 0;
+nomem:
+ return -ENOMEM;
+}
+#else
/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
@@ -241,6 +682,7 @@
goto nomem;

do {
+ struct page *ptepage;
pte_t * src_pte, * dst_pte;

/* copy_pte_range */
@@ -260,10 +702,12 @@
if (!dst_pte)
goto nomem;
spin_lock(&src->page_table_lock);
+ ptepage = pmd_page(*src_pmd);
+ pte_page_lock(ptepage);
src_pte = pte_offset_map_nested(src_pmd, address);
do {
pte_t pte = *src_pte;
- struct page *ptepage;
+ struct page *page;
unsigned long pfn;

/* copy_one_pte */
@@ -276,12 +720,12 @@
set_pte(dst_pte, pte);
goto cont_copy_pte_range_noset;
}
- ptepage = pte_page(pte);
+ page = pte_page(pte);
pfn = pte_pfn(pte);
if (!pfn_valid(pfn))
goto cont_copy_pte_range;
- ptepage = pfn_to_page(pfn);
- if (PageReserved(ptepage))
+ page = pfn_to_page(pfn);
+ if (PageReserved(page))
goto cont_copy_pte_range;

/* If it's a COW mapping, write protect it both in the parent and the child */
@@ -294,13 +738,14 @@
if (vma->vm_flags & VM_SHARED)
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
- get_page(ptepage);
+ get_page(page);
dst->rss++;

cont_copy_pte_range: set_pte(dst_pte, pte);
- page_add_rmap(ptepage, dst_pte);
+ page_add_rmap(page, dst_pte);
cont_copy_pte_range_noset: address += PAGE_SIZE;
if (address >= end) {
+ pte_page_unlock(ptepage);
pte_unmap_nested(src_pte);
pte_unmap(dst_pte);
goto out_unlock;
@@ -308,6 +753,7 @@
src_pte++;
dst_pte++;
} while ((unsigned long)src_pte & PTE_TABLE_MASK);
+ pte_page_unlock(ptepage);
pte_unmap_nested(src_pte-1);
pte_unmap(dst_pte-1);
spin_unlock(&src->page_table_lock);
@@ -323,24 +769,22 @@
nomem:
return -ENOMEM;
}
+#endif

static void zap_pte_range(mmu_gather_t *tlb, pmd_t * pmd, unsigned long address, unsigned long size)
{
unsigned long offset;
+ struct mm_struct *mm = tlb->mm;
+ struct page *ptepage = pmd_page(*pmd);
pte_t *ptep;

- if (pmd_none(*pmd))
- return;
- if (pmd_bad(*pmd)) {
- pmd_ERROR(*pmd);
- pmd_clear(pmd);
- return;
- }
- ptep = pte_offset_map(pmd, address);
offset = address & ~PMD_MASK;
if (offset + size > PMD_SIZE)
size = PMD_SIZE - offset;
size &= PAGE_MASK;
+
+ ptep = pte_offset_map(pmd, address);
+
for (offset=0; offset < size; ptep++, offset += PAGE_SIZE) {
pte_t pte = *ptep;
if (pte_none(pte))
@@ -359,6 +803,8 @@
!PageSwapCache(page))
mark_page_accessed(page);
tlb->freed++;
+ ptepage->private--;
+ mm->rss--;
page_remove_rmap(page, ptep);
tlb_remove_page(tlb, page);
}
@@ -371,8 +817,12 @@
pte_unmap(ptep-1);
}

-static void zap_pmd_range(mmu_gather_t *tlb, pgd_t * dir, unsigned long address, unsigned long size)
+static void zap_pmd_range(mmu_gather_t **tlb, pgd_t * dir, unsigned long address, unsigned long size)
{
+ struct page *ptepage;
+#ifdef CONFIG_SHAREPTE
+ struct mm_struct *mm = (*tlb)->mm;
+#endif
pmd_t * pmd;
unsigned long end;

@@ -388,26 +838,59 @@
if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
end = ((address + PGDIR_SIZE) & PGDIR_MASK);
do {
- zap_pte_range(tlb, pmd, address, end - address);
+ if (pmd_none(*pmd))
+ goto skip_pmd;
+ if (pmd_bad(*pmd)) {
+ pmd_ERROR(*pmd);
+ pmd_clear(pmd);
+ goto skip_pmd;
+ }
+
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
+#ifdef CONFIG_SHAREPTE
+ if (page_count(ptepage) > 1) {
+ if ((address <= ptepage->index) &&
+ (end >= (ptepage->index + PMD_SIZE))) {
+ pmd_clear(pmd);
+ pgtable_remove_rmap_locked(ptepage, mm);
+ mm->rss -= ptepage->private;
+ put_page(ptepage);
+ pte_page_unlock(ptepage);
+ goto skip_pmd;
+ } else {
+ pte_t *pte;
+
+ tlb_finish_mmu(*tlb, address, end);
+ pte = pte_unshare(mm, pmd, address);
+ pte_unmap(pte);
+ *tlb = tlb_gather_mmu(mm, 0);
+ ptepage = pmd_page(*pmd);
+ }
+ }
+#endif
+ zap_pte_range(*tlb, pmd, address, end - address);
+ pte_page_unlock(ptepage);
+skip_pmd:
address = (address + PMD_SIZE) & PMD_MASK;
pmd++;
} while (address < end);
}

-void unmap_page_range(mmu_gather_t *tlb, struct vm_area_struct *vma, unsigned long address, unsigned long end)
+void unmap_page_range(mmu_gather_t **tlb, struct vm_area_struct *vma, unsigned long address, unsigned long end)
{
pgd_t * dir;

BUG_ON(address >= end);

dir = pgd_offset(vma->vm_mm, address);
- tlb_start_vma(tlb, vma);
+ tlb_start_vma(*tlb, vma);
do {
zap_pmd_range(tlb, dir, address, end - address);
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
- tlb_end_vma(tlb, vma);
+ tlb_end_vma(*tlb, vma);
}

/* Dispose of an entire mmu_gather_t per rescheduling point */
@@ -451,7 +934,7 @@

flush_cache_range(vma, address, end);
tlb = tlb_gather_mmu(mm, 0);
- unmap_page_range(tlb, vma, address, end);
+ unmap_page_range(&tlb, vma, address, end);
tlb_finish_mmu(tlb, address, end);

cond_resched_lock(&mm->page_table_lock);
@@ -463,6 +946,126 @@
spin_unlock(&mm->page_table_lock);
}

+/**
+ * unmap_all_pages - unmap all the pages for an mm_struct
+ * @mm: the mm_struct to unmap
+ *
+ * This function is only called when an mm_struct is about to be
+ * released. It walks through all vmas and removes their pages
+ * from the page table. It understands shared pte pages and will
+ * decrement the count appropriately.
+ */
+void unmap_all_pages(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+ struct page *ptepage;
+ mmu_gather_t *tlb;
+ pgd_t *pgd;
+ pmd_t *pmd;
+ unsigned long address;
+ unsigned long vm_end, pmd_end;
+
+ tlb = tlb_gather_mmu(mm, 1);
+
+ vma = mm->mmap;
+ for (;;) {
+ if (!vma)
+ goto out;
+
+ address = vma->vm_start;
+next_vma:
+ vm_end = vma->vm_end;
+ mm->map_count--;
+ /*
+ * Advance the vma pointer to the next vma.
+ * To facilitate coalescing adjacent vmas, the
+ * pointer always points to the next one
+ * beyond the range we're currently working
+ * on, which means vma will be null on the
+ * last iteration.
+ */
+ vma = vma->vm_next;
+ if (vma) {
+ /*
+ * Go ahead and include hugetlb vmas
+ * in the range we process. The pmd
+ * entry will be cleared by close, so
+ * we'll just skip over them. This is
+ * easier than trying to avoid them.
+ */
+ if (is_vm_hugetlb_page(vma))
+ vma->vm_ops->close(vma);
+
+ /*
+ * Coalesce adjacent vmas and process
+ * them all in one iteration.
+ */
+ if (vma->vm_start == vm_end) {
+ goto next_vma;
+ }
+ }
+ pgd = pgd_offset(mm, address);
+ do {
+ if (pgd_none(*pgd))
+ goto skip_pgd;
+
+ if (pgd_bad(*pgd)) {
+ pgd_ERROR(*pgd);
+ pgd_clear(pgd);
+skip_pgd:
+ address += PGDIR_SIZE;
+ if (address > vm_end)
+ address = vm_end;
+ goto next_pgd;
+ }
+ pmd = pmd_offset(pgd, address);
+ if (vm_end > ((address + PGDIR_SIZE) & PGDIR_MASK))
+ pmd_end = (address + PGDIR_SIZE) & PGDIR_MASK;
+ else
+ pmd_end = vm_end;
+
+ for (;;) {
+ if (pmd_none(*pmd))
+ goto next_pmd;
+ if (pmd_bad(*pmd)) {
+ pmd_ERROR(*pmd);
+ pmd_clear(pmd);
+ goto next_pmd;
+ }
+
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
+#ifdef CONFIG_SHAREPTE
+ if (page_count(ptepage) > 1) {
+ pmd_clear(pmd);
+ pgtable_remove_rmap_locked(ptepage, mm);
+ mm->rss -= ptepage->private;
+ put_page(ptepage);
+ } else
+#endif
+ zap_pte_range(tlb, pmd, address,
+ vm_end - address);
+
+ pte_page_unlock(ptepage);
+next_pmd:
+ address = (address + PMD_SIZE) & PMD_MASK;
+ if (address >= pmd_end) {
+ address = pmd_end;
+ break;
+ }
+ pmd++;
+ }
+next_pgd:
+ pgd++;
+ } while (address < vm_end);
+
+ }
+
+out:
+ clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
+ tlb_finish_mmu(tlb, 0, TASK_SIZE);
+}
+
/*
* Do a quick page-table lookup for a single page.
* mm->page_table_lock must be held.
@@ -790,6 +1393,7 @@
unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
{
struct page *old_page, *new_page;
+ struct page *ptepage = pmd_page(*pmd);
unsigned long pfn = pte_pfn(pte);

if (!pfn_valid(pfn))
@@ -803,7 +1407,7 @@
flush_cache_page(vma, address);
establish_pte(vma, address, page_table, pte_mkyoung(pte_mkdirty(pte_mkwrite(pte))));
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
return VM_FAULT_MINOR;
}
}
@@ -813,7 +1417,7 @@
* Ok, we need to copy. Oh, well..
*/
page_cache_get(old_page);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);

new_page = alloc_page(GFP_HIGHUSER);
if (!new_page)
@@ -823,11 +1427,12 @@
/*
* Re-check the pte - we dropped the lock
*/
- spin_lock(&mm->page_table_lock);
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
page_table = pte_offset_map(pmd, address);
if (pte_same(*page_table, pte)) {
if (PageReserved(old_page))
- ++mm->rss;
+ increment_rss(ptepage);
page_remove_rmap(old_page, page_table);
break_cow(vma, new_page, address, page_table);
page_add_rmap(new_page, page_table);
@@ -837,14 +1442,14 @@
new_page = old_page;
}
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
page_cache_release(new_page);
page_cache_release(old_page);
return VM_FAULT_MINOR;

bad_wp_page:
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n", address);
/*
* This should really halt the system so it can be debugged or
@@ -973,12 +1578,13 @@
pte_t *page_table, pmd_t *pmd, pte_t orig_pte, int write_access)
{
struct page *page;
+ struct page *ptepage = pmd_page(*pmd);
swp_entry_t entry = pte_to_swp_entry(orig_pte);
pte_t pte;
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry);
@@ -988,14 +1594,15 @@
* Back out if somebody else faulted in this pte while
* we released the page table lock.
*/
- spin_lock(&mm->page_table_lock);
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
page_table = pte_offset_map(pmd, address);
if (pte_same(*page_table, orig_pte))
ret = VM_FAULT_OOM;
else
ret = VM_FAULT_MINOR;
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
return ret;
}

@@ -1011,11 +1618,12 @@
* Back out if somebody else faulted in this pte while we
* released the page table lock.
*/
- spin_lock(&mm->page_table_lock);
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
page_table = pte_offset_map(pmd, address);
if (!pte_same(*page_table, orig_pte)) {
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
unlock_page(page);
page_cache_release(page);
return VM_FAULT_MINOR;
@@ -1027,7 +1635,7 @@
if (vm_swap_full())
remove_exclusive_swap_page(page);

- mm->rss++;
+ increment_rss(ptepage);
pte = mk_pte(page, vma->vm_page_prot);
if (write_access && can_share_swap_page(page))
pte = pte_mkdirty(pte_mkwrite(pte));
@@ -1041,7 +1649,7 @@
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, address, pte);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
return ret;
}

@@ -1054,6 +1662,7 @@
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
+ struct page *ptepage = pmd_page(*pmd);

/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
@@ -1062,23 +1671,24 @@
if (write_access) {
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);

page = alloc_page(GFP_HIGHUSER);
if (!page)
goto no_mem;
clear_user_highpage(page, addr);

- spin_lock(&mm->page_table_lock);
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
page_table = pte_offset_map(pmd, addr);

if (!pte_none(*page_table)) {
pte_unmap(page_table);
page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
return VM_FAULT_MINOR;
}
- mm->rss++;
+ increment_rss(ptepage);
flush_page_to_ram(page);
entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
lru_cache_add(page);
@@ -1091,7 +1701,7 @@

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
return VM_FAULT_MINOR;

no_mem:
@@ -1114,12 +1724,13 @@
unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
{
struct page * new_page;
+ struct page *ptepage = pmd_page(*pmd);
pte_t entry;

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table, pmd, write_access, address);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);

new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);

@@ -1144,7 +1755,8 @@
new_page = page;
}

- spin_lock(&mm->page_table_lock);
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
page_table = pte_offset_map(pmd, address);

/*
@@ -1159,7 +1771,7 @@
*/
/* Only go through if we didn't race with anybody else... */
if (pte_none(*page_table)) {
- ++mm->rss;
+ increment_rss(ptepage);
flush_page_to_ram(new_page);
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
@@ -1172,13 +1784,13 @@
/* One of our sibling threads was faster, back out. */
pte_unmap(page_table);
page_cache_release(new_page);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
return VM_FAULT_MINOR;
}

/* no need to invalidate: a not-present page shouldn't be cached */
update_mmu_cache(vma, address, entry);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(ptepage);
return VM_FAULT_MAJOR;
}

@@ -1230,7 +1842,7 @@
entry = pte_mkyoung(entry);
establish_pte(vma, address, pte, entry);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ pte_page_unlock(pmd_page(*pmd));
return VM_FAULT_MINOR;
}

@@ -1255,9 +1867,29 @@
pmd = pmd_alloc(mm, pgd, address);

if (pmd) {
- pte_t * pte = pte_alloc_map(mm, pmd, address);
+ pte_t * pte;
+
+#ifdef CONFIG_SHAREPTE
+ if (pte_needs_unshare(mm, vma, pmd, address, write_access)) {
+ pte_page_lock(pmd_page(*pmd));
+ pte = pte_unshare(mm, pmd, address);
+ } else {
+ pte = pte_try_to_share(mm, vma, pmd, address);
+ if (!pte) {
+ pte = pte_alloc_map(mm, pmd, address);
+ if (pte)
+ pte_page_lock(pmd_page(*pmd));
+ }
+ }
+#else
+ pte = pte_alloc_map(mm, pmd, address);
if (pte)
+ pte_page_lock(pmd_page(*pmd));
+#endif
+ if (pte) {
+ spin_unlock(&mm->page_table_lock);
return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+ }
}
spin_unlock(&mm->page_table_lock);
return VM_FAULT_OOM;
--- 2.5.43-mm2/./mm/mremap.c 2002-10-17 11:13:01.000000000 -0500
+++ 2.5.43-mm2-shpte/./mm/mremap.c 2002-10-18 11:58:43.000000000 -0500
@@ -15,6 +15,7 @@
#include <linux/swap.h>
#include <linux/fs.h>
#include <linux/highmem.h>
+#include <linux/rmap-locking.h>

#include <asm/uaccess.h>
#include <asm/pgalloc.h>
@@ -23,6 +24,7 @@

static pte_t *get_one_pte_map_nested(struct mm_struct *mm, unsigned long addr)
{
+ struct page *ptepage;
pgd_t * pgd;
pmd_t * pmd;
pte_t * pte = NULL;
@@ -45,8 +47,18 @@
goto end;
}

- pte = pte_offset_map_nested(pmd, addr);
+ ptepage = pmd_page(*pmd);
+ pte_page_lock(ptepage);
+#ifdef CONFIG_SHAREPTE
+ if (page_count(ptepage) > 1) {
+ pte = pte_unshare(mm, pmd, addr);
+ ptepage = pmd_page(*pmd);
+ } else
+#endif
+ pte = pte_offset_map_nested(pmd, addr);
+
if (pte_none(*pte)) {
+ pte_page_unlock(ptepage);
pte_unmap_nested(pte);
pte = NULL;
}
@@ -54,6 +66,32 @@
return pte;
}

+static inline void drop_pte_nested(struct mm_struct *mm, unsigned long addr, pte_t *pte)
+{
+ struct page *ptepage;
+ pgd_t *pgd;
+ pmd_t *pmd;
+
+ pgd = pgd_offset(mm, addr);
+ pmd = pmd_offset(pgd, addr);
+ ptepage = pmd_page(*pmd);
+ pte_page_unlock(ptepage);
+ pte_unmap_nested(pte);
+}
+
+static inline void drop_pte(struct mm_struct *mm, unsigned long addr, pte_t *pte)
+{
+ struct page *ptepage;
+ pgd_t *pgd;
+ pmd_t *pmd;
+
+ pgd = pgd_offset(mm, addr);
+ pmd = pmd_offset(pgd, addr);
+ ptepage = pmd_page(*pmd);
+ pte_page_unlock(ptepage);
+ pte_unmap(pte);
+}
+
#ifdef CONFIG_HIGHPTE /* Save a few cycles on the sane machines */
static inline int page_table_present(struct mm_struct *mm, unsigned long addr)
{
@@ -72,12 +110,24 @@

static inline pte_t *alloc_one_pte_map(struct mm_struct *mm, unsigned long addr)
{
+ struct page *ptepage;
pmd_t * pmd;
pte_t * pte = NULL;

pmd = pmd_alloc(mm, pgd_offset(mm, addr), addr);
- if (pmd)
+ if (pmd) {
+ ptepage = pmd_page(*pmd);
+#ifdef CONFIG_SHAREPTE
+ pte_page_lock(ptepage);
+ if (page_count(ptepage) > 1) {
+ pte_unshare(mm, pmd, addr);
+ ptepage = pmd_page(*pmd);
+ }
+ pte_page_unlock(ptepage);
+#endif
pte = pte_alloc_map(mm, pmd, addr);
+ pte_page_lock(ptepage);
+ }
return pte;
}

@@ -121,15 +171,15 @@
* atomic kmap
*/
if (!page_table_present(mm, new_addr)) {
- pte_unmap_nested(src);
+ drop_pte_nested(mm, old_addr, src);
src = NULL;
}
dst = alloc_one_pte_map(mm, new_addr);
if (src == NULL)
src = get_one_pte_map_nested(mm, old_addr);
error = copy_one_pte(mm, src, dst);
- pte_unmap_nested(src);
- pte_unmap(dst);
+ drop_pte_nested(mm, old_addr, src);
+ drop_pte(mm, new_addr, dst);
}
flush_tlb_page(vma, old_addr);
spin_unlock(&mm->page_table_lock);
--- 2.5.43-mm2/./mm/mmap.c 2002-10-17 11:13:01.000000000 -0500
+++ 2.5.43-mm2-shpte/./mm/mmap.c 2002-10-17 11:42:30.000000000 -0500
@@ -23,7 +23,11 @@
#include <asm/pgalloc.h>
#include <asm/tlb.h>

-extern void unmap_page_range(mmu_gather_t *,struct vm_area_struct *vma, unsigned long address, unsigned long size);
+extern void unmap_page_range(mmu_gather_t **,struct vm_area_struct *vma, unsigned long address, unsigned long size);
+#ifdef CONFIG_SHAREPTE
+extern void unmap_shared_range(struct mm_struct *mm, unsigned long address, unsigned long end);
+#endif
+extern void unmap_all_pages(struct mm_struct *mm);
extern void clear_page_tables(mmu_gather_t *tlb, unsigned long first, int nr);

/*
@@ -1013,7 +1017,7 @@
from = start < mpnt->vm_start ? mpnt->vm_start : start;
to = end > mpnt->vm_end ? mpnt->vm_end : end;

- unmap_page_range(tlb, mpnt, from, to);
+ unmap_page_range(&tlb, mpnt, from, to);

if (mpnt->vm_flags & VM_ACCOUNT) {
len = to - from;
@@ -1275,10 +1279,19 @@
}
}

+/*
+ * For small tasks, it's most efficient to unmap the pages for each
+ * vma. For larger tasks, it's better to just walk the entire address
+ * space in one pass, particularly with shared pte pages. This
+ * threshold determines the size where we switch from one method to
+ * the other.
+ */
+
+#define UNMAP_THRESHOLD 500
+
/* Release all mmaps. */
void exit_mmap(struct mm_struct * mm)
{
- mmu_gather_t *tlb;
struct vm_area_struct * mpnt;

profile_exit_mmap(mm);
@@ -1287,36 +1300,14 @@

spin_lock(&mm->page_table_lock);

- tlb = tlb_gather_mmu(mm, 1);
-
flush_cache_mm(mm);
- mpnt = mm->mmap;
- while (mpnt) {
- unsigned long start = mpnt->vm_start;
- unsigned long end = mpnt->vm_end;
-
- /*
- * If the VMA has been charged for, account for its
- * removal
- */
- if (mpnt->vm_flags & VM_ACCOUNT)
- vm_unacct_memory((end - start) >> PAGE_SHIFT);

- mm->map_count--;
- if (!(is_vm_hugetlb_page(mpnt)))
- unmap_page_range(tlb, mpnt, start, end);
- else
- mpnt->vm_ops->close(mpnt);
- mpnt = mpnt->vm_next;
- }
+ unmap_all_pages(mm);

/* This is just debugging */
if (mm->map_count)
BUG();

- clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
- tlb_finish_mmu(tlb, 0, TASK_SIZE);
-
mpnt = mm->mmap;
mm->mmap = mm->mmap_cache = NULL;
mm->mm_rb = RB_ROOT;
@@ -1332,6 +1323,14 @@
*/
while (mpnt) {
struct vm_area_struct * next = mpnt->vm_next;
+
+ /*
+ * If the VMA has been charged for, account for its
+ * removal
+ */
+ if (mpnt->vm_flags & VM_ACCOUNT)
+ vm_unacct_memory((mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT);
+
remove_shared_vm_struct(mpnt);
if (mpnt->vm_ops) {
if (mpnt->vm_ops->close)
--- 2.5.43-mm2/./mm/rmap.c 2002-10-15 22:29:02.000000000 -0500
+++ 2.5.43-mm2-shpte/./mm/rmap.c 2002-10-18 16:56:51.000000000 -0500
@@ -14,11 +14,11 @@
/*
* Locking:
* - the page->pte.chain is protected by the PG_chainlock bit,
- * which nests within the zone->lru_lock, then the
- * mm->page_table_lock, and then the page lock.
+ * which nests within the zone->lru_lock, then the pte_page_lock,
+ * and then the page lock.
* - because swapout locking is opposite to the locking order
* in the page fault path, the swapout path uses trylocks
- * on the mm->page_table_lock
+ * on the pte_page_lock.
*/
#include <linux/mm.h>
#include <linux/pagemap.h>
@@ -45,11 +45,17 @@
*/
#define NRPTE ((L1_CACHE_BYTES - sizeof(void *))/sizeof(pte_addr_t))

+struct mm_chain {
+ struct mm_chain *next;
+ struct mm_struct *mm;
+};
+
struct pte_chain {
struct pte_chain *next;
pte_addr_t ptes[NRPTE];
};

+static kmem_cache_t *mm_chain_cache;
static kmem_cache_t *pte_chain_cache;

/*
@@ -102,6 +108,25 @@
kmem_cache_free(pte_chain_cache, pte_chain);
}

+static inline struct mm_chain *mm_chain_alloc(void)
+{
+ struct mm_chain *ret;
+
+ ret = kmem_cache_alloc(mm_chain_cache, GFP_ATOMIC);
+ return ret;
+}
+
+static void mm_chain_free(struct mm_chain *mc,
+ struct mm_chain *prev_mc, struct page *page)
+{
+ if (prev_mc)
+ prev_mc->next = mc->next;
+ else if (page)
+ page->pte.mmchain = mc->next;
+
+ kmem_cache_free(mm_chain_cache, mc);
+}
+
/**
** VM stuff below this comment
**/
@@ -161,13 +186,139 @@
return referenced;
}

+/*
+ * pgtable_add_rmap_locked - Add an mm_struct to the chain for a pte page.
+ * @page: The pte page to add the mm_struct to
+ * @mm: The mm_struct to add
+ * @address: The address of the page we're mapping
+ *
+ * Pte pages maintain a chain of mm_structs that use it. This adds a new
+ * mm_struct to the chain.
+ *
+ * This function must be called with the pte_page_lock held for the page
+ */
+void pgtable_add_rmap_locked(struct page * page, struct mm_struct * mm,
+ unsigned long address)
+{
+ struct mm_chain *mc;
+
+#ifdef BROKEN_PPC_PTE_ALLOC_ONE
+ /* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
+ extern int mem_init_done;
+
+ if (!mem_init_done)
+ return;
+#endif
+#ifdef RMAP_DEBUG
+ BUG_ON(mm == NULL);
+ BUG_ON(!PagePtepage(page));
+#endif
+
+ if (PageDirect(page)) {
+ mc = mm_chain_alloc();
+ mc->mm = page->pte.mmdirect;
+ mc->next = NULL;
+ page->pte.mmchain = mc;
+ ClearPageDirect(page);
+ }
+ if (page->pte.mmchain) {
+ /* Hook up the mm_chain to the page. */
+ mc = mm_chain_alloc();
+ mc->mm = mm;
+ mc->next = page->pte.mmchain;
+ page->pte.mmchain = mc;
+ } else {
+ page->pte.mmdirect = mm;
+ SetPageDirect(page);
+ page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
+ }
+}
+
+/*
+ * pgtable_remove_rmap_locked - Remove an mm_struct from the chain for a pte page.
+ * @page: The pte page to remove the mm_struct from
+ * @mm: The mm_struct to remove
+ *
+ * Pte pages maintain a chain of mm_structs that use it. This removes an
+ * mm_struct from the chain.
+ *
+ * This function must be called with the pte_page_lock held for the page
+ */
+void pgtable_remove_rmap_locked(struct page * page, struct mm_struct *mm)
+{
+ struct mm_chain * mc, * prev_mc = NULL;
+
+#ifdef DEBUG_RMAP
+ BUG_ON(mm == NULL);
+ BUG_ON(!PagePtepage(page));
+#endif
+
+ if (PageDirect(page)) {
+ if (page->pte.mmdirect == mm) {
+ page->pte.mmdirect = NULL;
+ ClearPageDirect(page);
+ page->index = 0;
+ goto out;
+ }
+ } else {
+#ifdef DEBUG_RMAP
+ BUG_ON(page->pte.mmchain->next == NULL);
+#endif
+ for (mc = page->pte.mmchain; mc; prev_mc = mc, mc = mc->next) {
+ if (mc->mm == mm) {
+ mm_chain_free(mc, prev_mc, page);
+ /* Check whether we can convert to direct */
+ mc = page->pte.mmchain;
+ if (!mc->next) {
+ page->pte.mmdirect = mc->mm;
+ SetPageDirect(page);
+ mm_chain_free(mc, NULL, NULL);
+ }
+ goto out;
+ }
+ }
+ }
+ BUG();
+out:
+}
+
+/*
+ * pgtable_add_rmap - Add an mm_struct to the chain for a pte page.
+ * @page: The pte page to add the mm_struct to
+ * @mm: The mm_struct to add
+ * @address: The address of the page we're mapping
+ *
+ * This is a wrapper for pgtable_add_rmap_locked that takes the lock
+ */
+void pgtable_add_rmap(struct page * page, struct mm_struct * mm,
+ unsigned long address)
+{
+ pte_page_lock(page);
+ pgtable_add_rmap_locked(page, mm, address);
+ pte_page_unlock(page);
+}
+
+/*
+ * pgtable_remove_rmap_locked - Remove an mm_struct from the chain for a pte page.
+ * @page: The pte page to remove the mm_struct from
+ * @mm: The mm_struct to remove
+ *
+ * This is a wrapper for pgtable_remove_rmap_locked that takes the lock
+ */
+void pgtable_remove_rmap(struct page * page, struct mm_struct *mm)
+{
+ pte_page_lock(page);
+ pgtable_remove_rmap_locked(page, mm);
+ pte_page_unlock(page);
+}
+
/**
* page_add_rmap - add reverse mapping entry to a page
* @page: the page to add the mapping to
* @ptep: the page table entry mapping this page
*
* Add a new pte reverse mapping to a page.
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the pte_page_lock.
*/
void page_add_rmap(struct page * page, pte_t * ptep)
{
@@ -180,8 +331,7 @@
BUG();
if (!pte_present(*ptep))
BUG();
- if (!ptep_to_mm(ptep))
- BUG();
+ BUG_ON(PagePtepage(page));
#endif

if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
@@ -199,12 +349,15 @@
if (page->pte.direct == pte_paddr)
BUG();
} else {
+ int count = 0;
for (pc = page->pte.chain; pc; pc = pc->next) {
- for (i = 0; i < NRPTE; i++) {
+ for (i = 0; i < NRPTE; i++, count++) {
pte_addr_t p = pc->ptes[i];

- if (p && p == pte_paddr)
+ if (p && p == pte_paddr) {
+ printk(KERN_ERR "page_add_rmap: page %08lx (count %d), ptep %08lx, rmap count %d\n", page, page_count(page), ptep, count);
BUG();
+ }
}
}
}
@@ -263,7 +416,7 @@
* Removes the reverse mapping from the pte_chain of the page,
* after that the caller can clear the page table entry and free
* the page.
- * Caller needs to hold the mm->page_table_lock.
+ * Caller needs to hold the pte_page_lock.
*/
void page_remove_rmap(struct page * page, pte_t * ptep)
{
@@ -277,6 +430,10 @@
if (!page_mapped(page))
return; /* remap_page_range() from a driver? */

+#ifdef DEBUG_RMAP
+ BUG_ON(PagePtepage(page));
+#endif
+
pte_chain_lock(page);

if (PageDirect(page)) {
@@ -342,6 +499,130 @@
return;
}

+static int pgtable_check_mlocked_mm(struct mm_struct *mm, unsigned long address)
+{
+ struct vm_area_struct *vma;
+ int ret = SWAP_SUCCESS;
+
+ /* During mremap, it's possible pages are not in a VMA. */
+ vma = find_vma(mm, address);
+ if (!vma) {
+ ret = SWAP_FAIL;
+ goto out;
+ }
+
+ /* The page is mlock()d, we cannot swap it out. */
+ if (vma->vm_flags & VM_LOCKED) {
+ ret = SWAP_FAIL;
+ }
+out:
+ return ret;
+}
+
+static int pgtable_check_mlocked(struct page *ptepage, unsigned long address)
+{
+ struct mm_chain *mc;
+ int ret = SWAP_SUCCESS;
+
+#ifdef DEBUG_RMAP
+ BUG_ON(!PagePtepage(ptepage));
+#endif
+ if (PageDirect(ptepage)) {
+ ret = pgtable_check_mlocked_mm(ptepage->pte.mmdirect, address);
+ goto out;
+ }
+
+ for (mc = ptepage->pte.mmchain; mc; mc = mc->next) {
+#ifdef DEBUG_RMAP
+ BUG_ON(mc->mm == NULL);
+#endif
+ ret = pgtable_check_mlocked_mm(mc->mm, address);
+ if (ret != SWAP_SUCCESS)
+ goto out;
+ }
+out:
+ return ret;
+}
+
+/**
+ * pgtable_unmap_one_mm - Decrement the rss count and flush for an mm_struct
+ * @mm: - the mm_struct to decrement
+ * @address: - The address of the page we're removing
+ *
+ * All pte pages keep a chain of mm_struct that are using it. This does a flush
+ * of the address for that mm_struct and decrements the rss count.
+ */
+static int pgtable_unmap_one_mm(struct mm_struct *mm, unsigned long address)
+{
+ struct vm_area_struct *vma;
+ int ret = SWAP_SUCCESS;
+
+ /* During mremap, it's possible pages are not in a VMA. */
+ vma = find_vma(mm, address);
+ if (!vma) {
+ ret = SWAP_FAIL;
+ goto out;
+ }
+ flush_tlb_page(vma, address);
+ flush_cache_page(vma, address);
+ mm->rss--;
+
+out:
+ return ret;
+}
+
+/**
+ * pgtable_unmap_one - Decrement all rss counts and flush caches for a pte page
+ * @ptepage: the pte page to decrement the count for
+ * @address: the address of the page we're removing
+ *
+ * This decrements the rss counts of all mm_structs that map this pte page
+ * and flushes the tlb and cache for these mm_structs and address
+ */
+static int pgtable_unmap_one(struct page *ptepage, unsigned long address)
+{
+ struct mm_chain *mc;
+ int ret = SWAP_SUCCESS;
+
+#ifdef DEBUG_RMAP
+ BUG_ON(!PagePtepage(ptepage));
+#endif
+
+ if (PageDirect(ptepage)) {
+ ret = pgtable_unmap_one_mm(ptepage->pte.mmdirect, address);
+ if (ret != SWAP_SUCCESS)
+ goto out;
+ } else for (mc = ptepage->pte.mmchain; mc; mc = mc->next) {
+ ret = pgtable_unmap_one_mm(mc->mm, address);
+ if (ret != SWAP_SUCCESS)
+ goto out;
+ }
+ ptepage->private--;
+out:
+ return ret;
+}
+
+/**
+ * increment_rss - increment the rss count by one
+ * @ptepage: The pte page that's getting a new paged mapped
+ *
+ * Since mapping a page into a pte page can increment the rss
+ * for multiple mm_structs, this function iterates through all
+ * the mms and increments them. It also keeps an rss count
+ * per pte page.
+ */
+void increment_rss(struct page *ptepage)
+{
+ struct mm_chain *mc;
+
+ if (PageDirect(ptepage))
+ ptepage->pte.mmdirect->rss++;
+ else for (mc = ptepage->pte.mmchain; mc; mc = mc->next)
+ mc->mm->rss++;
+
+ ptepage->private++;
+}
+
/**
* try_to_unmap_one - worker function for try_to_unmap
* @page: page to unmap
@@ -354,48 +635,36 @@
* zone->lru_lock page_launder()
* page lock page_launder(), trylock
* pte_chain_lock page_launder()
- * mm->page_table_lock try_to_unmap_one(), trylock
+ * pte_page_lock try_to_unmap_one(), trylock
*/
static int FASTCALL(try_to_unmap_one(struct page *, pte_addr_t));
static int try_to_unmap_one(struct page * page, pte_addr_t paddr)
{
pte_t *ptep = rmap_ptep_map(paddr);
- unsigned long address = ptep_to_address(ptep);
- struct mm_struct * mm = ptep_to_mm(ptep);
- struct vm_area_struct * vma;
pte_t pte;
+ struct page *ptepage = kmap_atomic_to_page(ptep);
+ unsigned long address = ptep_to_address(ptep);
int ret;

- if (!mm)
- BUG();
-
- /*
- * We need the page_table_lock to protect us from page faults,
- * munmap, fork, etc...
- */
- if (!spin_trylock(&mm->page_table_lock)) {
+#ifdef DEBUG_RMAP
+ BUG_ON(!PagePtepage(ptepage));
+#endif
+ if (!pte_page_trylock(ptepage)) {
rmap_ptep_unmap(ptep);
return SWAP_AGAIN;
}

-
- /* During mremap, it's possible pages are not in a VMA. */
- vma = find_vma(mm, address);
- if (!vma) {
- ret = SWAP_FAIL;
+ ret = pgtable_check_mlocked(ptepage, address);
+ if (ret != SWAP_SUCCESS)
goto out_unlock;
- }
+ pte = ptep_get_and_clear(ptep);

- /* The page is mlock()d, we cannot swap it out. */
- if (vma->vm_flags & VM_LOCKED) {
- ret = SWAP_FAIL;
+ ret = pgtable_unmap_one(ptepage, address);
+ if (ret != SWAP_SUCCESS) {
+ set_pte(ptep, pte);
goto out_unlock;
}
-
- /* Nuke the page table entry. */
- pte = ptep_get_and_clear(ptep);
- flush_tlb_page(vma, address);
- flush_cache_page(vma, address);
+ pte_page_unlock(ptepage);

/* Store the swap location in the pte. See handle_pte_fault() ... */
if (PageSwapCache(page)) {
@@ -408,13 +677,15 @@
if (pte_dirty(pte))
set_page_dirty(page);

- mm->rss--;
page_cache_release(page);
ret = SWAP_SUCCESS;
+ goto out;

out_unlock:
+ pte_page_unlock(ptepage);
+
+out:
rmap_ptep_unmap(ptep);
- spin_unlock(&mm->page_table_lock);
return ret;
}

@@ -523,6 +794,17 @@

void __init pte_chain_init(void)
{
+
+ mm_chain_cache = kmem_cache_create( "mm_chain",
+ sizeof(struct mm_chain),
+ 0,
+ 0,
+ NULL,
+ NULL);
+
+ if (!mm_chain_cache)
+ panic("failed to create mm_chain cache!\n");
+
pte_chain_cache = kmem_cache_create( "pte_chain",
sizeof(struct pte_chain),
0,
--- 2.5.43-mm2/./mm/fremap.c 2002-10-17 11:13:01.000000000 -0500
+++ 2.5.43-mm2-shpte/./mm/fremap.c 2002-10-18 12:17:30.000000000 -0500
@@ -11,6 +11,8 @@
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/swapops.h>
+#include <linux/rmap-locking.h>
+
#include <asm/mmu_context.h>

static inline void zap_pte(struct mm_struct *mm, pte_t *ptep)
@@ -47,6 +49,7 @@
unsigned long addr, struct page *page, unsigned long prot)
{
int err = -ENOMEM;
+ struct page *ptepage;
pte_t *pte, entry;
pgd_t *pgd;
pmd_t *pmd;
@@ -58,10 +61,25 @@
if (!pmd)
goto err_unlock;

+#ifdef CONFIG_SHAREPTE
+ if (pmd_present(*pmd)) {
+ ptepage = pmd_page(*pmd);
+ if (page_count(ptepage) > 1) {
+ pte = pte_unshare(mm, pmd, addr);
+ ptepage = pmd_page(*pmd);
+ goto mapped;
+ }
+ }
+#endif
+
pte = pte_alloc_map(mm, pmd, addr);
if (!pte)
goto err_unlock;

+ pte_page_lock(ptepage);
+#ifdef CONFIG_SHAREPTE
+mapped:
+#endif
zap_pte(mm, pte);

mm->rss++;
@@ -75,11 +93,13 @@
pte_unmap(pte);
flush_tlb_page(vma, addr);

+ pte_page_unlock(ptepage);
spin_unlock(&mm->page_table_lock);

return 0;

err_unlock:
+ pte_page_unlock(ptepage);
spin_unlock(&mm->page_table_lock);
return err;
}
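
The rmap.c hunks above reuse the space-saving trick the existing rmap
code already applies to pte_chains for the new mm_chains as well: as
long as a single mm_struct uses a pte page, its pointer is stored
inline and flagged with PageDirect, and a chain node is only allocated
once a second mm_struct starts sharing the page table. The stand-alone
sketch below (simplified, hypothetical user-space C with error handling
omitted, not code from the patch) shows that bookkeeping in isolation:

#include <stdlib.h>

struct mm;                                /* stand-in for struct mm_struct */

struct mm_chain {
        struct mm_chain *next;
        struct mm *mm;
};

struct ptepage {
        int direct;                       /* stand-in for the PageDirect flag */
        union {
                struct mm *mmdirect;      /* single user: stored inline */
                struct mm_chain *mmchain; /* two or more users: chained */
        } pte;
};

/* Add one mm to a pte page's reverse map; assumes *page starts zeroed. */
static void ptepage_add_mm(struct ptepage *page, struct mm *mm)
{
        struct mm_chain *mc;

        if (page->direct) {
                /* A second user appeared: convert the inline pointer
                 * into a one-element chain before adding the new mm. */
                mc = malloc(sizeof(*mc));
                mc->mm = page->pte.mmdirect;
                mc->next = NULL;
                page->pte.mmchain = mc;
                page->direct = 0;
        }
        if (page->pte.mmchain) {
                /* Already chained: push the new mm on the front. */
                mc = malloc(sizeof(*mc));
                mc->mm = mm;
                mc->next = page->pte.mmchain;
                page->pte.mmchain = mc;
        } else {
                /* First user: store the pointer inline, allocate nothing. */
                page->pte.mmdirect = mm;
                page->direct = 1;
        }
}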

2002-10-19 19:30:31

by Dave McCracken

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch


--On Saturday, October 19, 2002 15:17:31 -0400 Bill Davidsen
<[email protected]> wrote:

> Don't tease, what did that do for performance? I see that someone has
> already posted a possible problem, and the code would pass for complex for
> most people, so is the gain worth the pain?

I posted some fork/exec/exit microbenchmark results last week, in which for
large processes fork becomes much faster, and exec/exit become much faster
for processes with lots of shared memory.

As for results from larger benchmarks, those haven't been done. The TPC-H
test we used was primarily for stability testing, and secondarily to see if
we could reduce page table/pte_chain memory overhead, which we did. The
pte_chain overhead was reduced by close to a factor of 100.

This patch isn't primarily a performance patch. It does help for some
things, notably the fork/exec/exit cases mentioned above. But its primary
goal is to reduce the amount of memory wasted in page tables mapping the
same pages into multiple processes. We have seen an application that
consumed on the order of 10 GB of page tables to map a single shared memory
chunk across hundreds of processes. Shared page tables would eliminate
this overhead.

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-10-20 04:16:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

Dave McCracken <[email protected]> writes:

> This patch isn't primarily a performance patch. It does help for some
> things, notably the fork/exec/exit cases mentioned above. But its primary
> goal is to reduce the amount of memory wasted in page tables mapping the
> same pages into multiple processes. We have seen an application that
> consumed on the order of 10 GB of page tables to map a single shared memory
> chunk across hundreds of processes. Shared page tables would eliminate
> this overhead.

Have you considered putting a fixed upper bound on the number of page
tables a page can be mapped into? This would result in the same amount
of memory reduction, with what should be very little complexity.

I admit there would be a few more demand paging hits, but they should be
controllable. And I suspect their performance impact would be lost
in the noise.

Eric

2002-10-20 06:18:20

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

>> For reference, one of the tests was TPC-H. My code reduced the number of
>> allocated pte_chains from 5 million to 50 thousand.
>
> Don't tease, what did that do for performance? I see that someone has
> already posted a possible problem, and the code would pass for complex for
> most people, so is the gain worth the pain?

In many cases, this will stop the box from falling over flat on its
face due to ZONE_NORMAL exhaustion (from pte-chains), or even total
RAM exhaustion (from PTEs). Thus the performance gain is infinite ;-)

M.

2002-10-21 14:52:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

"Martin J. Bligh" <[email protected]> writes:

> >> For reference, one of the tests was TPC-H. My code reduced the number of
> >> allocated pte_chains from 5 million to 50 thousand.
> >
> > Don't tease, what did that do for performance? I see that someone has
> > already posted a possible problem, and the code would pass for complex for
> > most people, so is the gain worth the pain?
>
> In many cases, this will stop the box from falling over flat on it's
> face due to ZONE_NORMAL exhaustion (from pte-chains), or even total
> RAM exhaustion (from PTEs). Thus the performance gain is infinite ;-)

So why has no one written a pte_chain reaper? It is perfectly sane
to allocate a swap entry and move an entire pte_chain to the swap
cache.

Eric

2002-10-21 15:20:52

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

>> >> For reference, one of the tests was TPC-H. My code reduced the number of
>> >> allocated pte_chains from 5 million to 50 thousand.
>> >
>> > Don't tease, what did that do for performance? I see that someone has
>> > already posted a possible problem, and the code would pass for complex for
>> > most people, so is the gain worth the pain?
>>
>> In many cases, this will stop the box from falling over flat on it's
>> face due to ZONE_NORMAL exhaustion (from pte-chains), or even total
>> RAM exhaustion (from PTEs). Thus the performance gain is infinite ;-)
>
> So why has no one written a pte_chain reaper? It is perfectly sane
> to allocate a swap entry and move an entire pte_chain to the swap
> cache.

I think the underlying subsystem does not easily allow for dynamic
regeneration, so it's non-trivial. wli was looking at doing pagetable
reclaim at some point, IIRC.

IMHO, it's better not to fill memory with crap in the first place than
to invent complex methods of managing and shrinking it afterwards. You
only get into pathological conditions under sharing situations, else
it's limited to about 1% of RAM (bad, but manageable) ... thus providing
this sort of sharing nixes the worst of it. Better cache warmth on
switches (for TLB misses), faster fork+exec, etc. are nice side-effects.

The ultimate solution is per-object reverse mappings, rather than per
page, but that's a 2.7 thingy now.

M.

2002-10-22 03:50:01

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

"Martin J. Bligh" <[email protected]> writes:

> >> In many cases, this will stop the box from falling over flat on it's
> >> face due to ZONE_NORMAL exhaustion (from pte-chains), or even total
> >> RAM exhaustion (from PTEs). Thus the performance gain is infinite ;-)
> >
> > So why has no one written a pte_chain reaper? It is perfectly sane
> > to allocate a swap entry and move an entire pte_chain to the swap
> > cache.
>
> I think the underlying subsystem does not easily allow for dynamic regeneration,
> so it's non-trivial.

We swap pages out all of the time in 2.4.x, and that is all I was
suggesting: swap out some, but not all, of the pages on a very long
pte_chain. And swapping out a page is not terribly complex, unless
something very drastic has changed.

> wli was looking at doing pagetable reclaim at some point,
> IIRC.
>
>
> IMHO, it's better not to fill memory with crap in the first place than
> to invent complex methods of managing and shrinking it afterwards. You
> only get into pathalogical conditions under sharing situation, else
> it's limited to about 1% of RAM (bad, but manageable) ... thus providing
> this sort of sharing nixes the worst of it. Better cache warmth on
> switches (for TLB misses), faster fork+exec, etc. are nice side-effects.

I will agree with that; if everything works so that the sharing happens,
this is a nice feature.

> The ultimate solution is per-object reverse mappings, rather than per
> page, but that's a 2.7 thingy now.
???

Last I checked we already had those in 2.4.x, and still in 2.5.x. The
list of places the address space is mapped.

Eric

2002-10-22 05:55:00

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

>> >> In many cases, this will stop the box from falling over flat on it's
>> >> face due to ZONE_NORMAL exhaustion (from pte-chains), or even total
>> >> RAM exhaustion (from PTEs). Thus the performance gain is infinite ;-)
>> >
>> > So why has no one written a pte_chain reaper? It is perfectly sane
>> > to allocate a swap entry and move an entire pte_chain to the swap
>> > cache.
>>
>> I think the underlying subsystem does not easily allow for dynamic regeneration,
>> so it's non-trivial.
>
> We swap pages out all of the time in 2.4.x, and that is all I was suggesting
> swap out some but not all of the pages, on a very long pte_chain. And swapping
> out a page is not terribly complex, unless something very drastic has changed.

Right, it's swapping out the controlling structures without swapping
out the pages themselves that's harder.

>> IMHO, it's better not to fill memory with crap in the first place than
>> to invent complex methods of managing and shrinking it afterwards. You
>> only get into pathalogical conditions under sharing situation, else
>> it's limited to about 1% of RAM (bad, but manageable) ... thus providing
>> this sort of sharing nixes the worst of it. Better cache warmth on
>> switches (for TLB misses), faster fork+exec, etc. are nice side-effects.
>
> I will agree with that if everything works so the sharing happens,
> this is a nice feature.

I think it will for most of the situations we run aground with now
(normally 5000 oracle tasks sharing a 2Gb shared segment, or some
such monster).

>> The ultimate solution is per-object reverse mappings, rather than per
>> page, but that's a 2.7 thingy now.
> ???
>
> Last I checked we already had those in 2.4.x, and still in 2.5.x. The
> list of place the address space is mapped.

It's more complicated than that ... I'll let Rik or one of the K42
guys who understand it better than I do explain it (yeah, I'm wimping
out on you ;-))

M.

2002-10-22 14:21:17

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On 21 Oct 2002, Eric W. Biederman wrote:
> "Martin J. Bligh" <[email protected]> writes:
>
> > > So why has no one written a pte_chain reaper? It is perfectly sane
> > > to allocate a swap entry and move an entire pte_chain to the swap
> > > cache.
> >
> > I think the underlying subsystem does not easily allow for dynamic regeneration,
> > so it's non-trivial.
>
> We swap pages out all of the time in 2.4.x, and that is all I was
> suggesting swap out some but not all of the pages, on a very long
> pte_chain. And swapping out a page is not terribly complex, unless
> something very drastic has changed.

Imagine a slightly larger than normal Oracle server.
Say 5000 processes with 1 GB of shared memory.

Just the page tables needed to map this memory would
take up 5 GB of RAM ... with shared page tables we
only need 1 MB of page tables.
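(Roughly, assuming 4 KB pages and 4-byte PTEs: 1 GB of mappings is
262,144 PTEs, i.e. 1 MB of page tables per process; 5000 processes make
that ~5 GB, while a single shared copy stays at 1 MB.)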

The corresponding reduction in rmaps is a nice bonus,
but hardly any more dramatic than the page table
overhead.

In short, we really really want shared page tables.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: [email protected]

2002-10-22 16:04:12

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Mon, 21 Oct 2002, Martin J. Bligh wrote:

> > I will agree with that if everything works so the sharing happens,
> > this is a nice feature.
>
> I think it will for most of the situations we run aground with now
> (normally 5000 oracle tasks sharing a 2Gb shared segment, or some
> such monster).

10 GB pagetable overhead, for 2 GB of data. No customer I
know would accept that much OS overhead.
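(Roughly, again with 4 KB pages and 4-byte PTEs: 2 GB is 524,288 PTEs,
about 2 MB of page tables per process, times 5000 tasks.)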

To reduce the overhead we could either reclaim the page
tables and reconstruct them when needed (lots of work) or
we could share the page tables (less runtime overhead).

> >> The ultimate solution is per-object reverse mappings, rather than per
> >> page, but that's a 2.7 thingy now.
> > ???
> >
> > Last I checked we already had those in 2.4.x, and still in 2.5.x. The
> > list of place the address space is mapped.
>
> It's more complicated than that ... I'll let Rik or one of the K42
> guys who understand it better than I do explain it (yeah, I'm
> wimping out on you ;-))

Actually, per-object reverse mappings are nowhere near as good
a solution as shared page tables. At least, not from the points
of view of space consumption and the overhead of tearing down
the mappings at pageout time.

Per-object reverse mappings are better for fork+exec+exit speed,
though.

It's a tradeoff: do we care more for a linear speedup of fork(),
exec() and exit() than we care about a possibly exponential
slowdown of the pageout code ?

(note that William Irwin and myself have a trick to turn the
possible exponential slowdown of the pageout code into something
better ... just a few details left)

regards,

Rik
--
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/ http://distro.conectiva.com/

2002-10-22 16:14:01

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

> Actually, per-object reverse mappings are nowhere near as good
> a solution as shared page tables. At least, not from the points
> of view of space consumption and the overhead of tearing down
> the mappings at pageout time.
>
> Per-object reverse mappings are better for fork+exec+exit speed,
> though.
>
> It's a tradeoff: do we care more for a linear speedup of fork(),
> exec() and exit() than we care about a possibly exponential
> slowdown of the pageout code ?

As long as the box doesn't fall flat on its face in a gibbering
heap, that's the first order of priority ... i.e. I don't care much
for now ;-)

M.

2002-10-22 17:03:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

Rik van Riel wrote:
>
> ...
> In short, we really really want shared page tables.

Or large pages. I confess to being a little perplexed as to
why we're pursuing both.

2002-10-22 17:09:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, 22 Oct 2002, Andrew Morton wrote:
> Rik van Riel wrote:
> >
> > ...
> > In short, we really really want shared page tables.
>
> Or large pages. I confess to being a little perplexed as to
> why we're pursuing both.

I guess that's due to two things.

1) shared pagetables can speed up fork()+exec() somewhat

2) if we have two options that fix the Oracle problem,
there's a better chance of getting at least one of
the two merged ;)

Rik
--
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/ http://distro.conectiva.com/

2002-10-22 17:13:23

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, Oct 22, 2002 at 02:09:47PM -0200, Rik van Riel wrote:
> On Mon, 21 Oct 2002, Martin J. Bligh wrote:
>
> > I think it will for most of the situations we run aground with now
> > (normally 5000 oracle tasks sharing a 2Gb shared segment, or some
> > such monster).
>
> 10 GB pagetable overhead, for 2 GB of data. No customer I
> know would accept that much OS overhead.
>
> To reduce the overhead we could either reclaim the page
> tables and reconstruct them when needed (lots of work) or
> we could share the page tables (less runtime overhead).

Or you use 4MB pages. That tends to work much better and has less
complexity. Shared page tables don't work well on x86 when you have
a database trying to access an SGA larger than the virtual address
space, as each process tends to map its own window into the buffer
pool. Highmem with 32 bit va just plain sucks. The right answer is
to change the architecture of the application to not run with 5000
unique processes.

-ben

2002-10-22 17:37:00

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

>> > I think it will for most of the situations we run aground with now
>> > (normally 5000 oracle tasks sharing a 2Gb shared segment, or some
>> > such monster).
>>
>> 10 GB pagetable overhead, for 2 GB of data. No customer I
>> know would accept that much OS overhead.
>>
>> To reduce the overhead we could either reclaim the page
>> tables and reconstruct them when needed (lots of work) or
>> we could share the page tables (less runtime overhead).
>
> Or you use 4MB pages. That tends to work much better and has less
> complexity. Shared page tables don't work well on x86 when you have
> a database trying to access an SGA larger than the virtual address
> space, as each process tends to map its own window into the buffer
> pool. Highmem with 32 bit va just plain sucks. The right answer is
> to change the architecture of the application to not run with 5000
> unique processes.

Bear in mind that large pages are neither swap backed nor file backed
(vetoed by Linus), for starters. There are other large app problem scenarios
apart from Oracle ;-)

M.

2002-10-22 17:41:09

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, Oct 22, 2002 at 10:36:40AM -0700, Martin J. Bligh wrote:
> Bear in mind that large pages are neither swap backed or file backed
> (vetoed by Linus), for starters. There are other large app problem scenarios
> apart from Oracle ;-)

I think the fact that large page support doesn't support mmap for users
that need it is utterly appalling; there are numerous places where it is
needed. The requirement for root-only access makes it useless for most
people, especially in HPC environments where it is most needed as such
machines are usually shared and accounts are non-privileged.

-ben
--
"Do you seek knowledge in time travel?"

2002-10-22 17:51:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

Benjamin LaHaise wrote:
>
> On Tue, Oct 22, 2002 at 10:36:40AM -0700, Martin J. Bligh wrote:
> > Bear in mind that large pages are neither swap backed or file backed
> > (vetoed by Linus), for starters. There are other large app problem scenarios
> > apart from Oracle ;-)
>
> I think the fact that large page support doesn't support mmap for users
> that need it is utterly appauling; there are numerous places where it is
> needed. The requirement for root-only access makes it useless for most
> people, especially in HPC environments where it is most needed as such
> machines are usually shared and accounts are non-priveledged.
>

Have you reviewed the hugetlbfs and hugetlbpage-backed-shm patches?

That code is still requiring CAP_IPC_LOCK, although I suspect it
would be better to allow hugetlbfs mmap to be purely administered
by file permissions.

2002-10-22 17:49:29

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, 22 Oct 2002, Rik van Riel wrote:

> On 21 Oct 2002, Eric W. Biederman wrote:
> > "Martin J. Bligh" <[email protected]> writes:

> > We swap pages out all of the time in 2.4.x, and that is all I was
> > suggesting swap out some but not all of the pages, on a very long
> > pte_chain. And swapping out a page is not terribly complex, unless
> > something very drastic has changed.
>
> Imagine a slightly larger than normal Oracle server.
> Say 5000 processes with 1 GB of shared memory.
>
> Just the page tables needed to map this memory would
> take up 5 GB of RAM ... with shared page tables we
> only need 1 MB of page tables.
>
> The corresponding reduction in rmaps is a nice bonus,
> but hardly any more dramatic than the page table
> overhead.
>
> In short, we really really want shared page tables.

Does using spt require mapping the pages at the same location in all
processes?

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-10-22 17:57:13

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, Oct 22, 2002 at 10:56:04AM -0700, Andrew Morton wrote:
> Have you reviewed the hugetlbfs and hugetlbpage-backed-shm patches?
>
> That code is still requiring CAP_IPC_LOCK, although I suspect it
> would be better to allow hugetlbfs mmap to be purely administered
> by file permissions.

Can we delete the specialty syscalls now?

-ben
--
"Do you seek knowledge in time travel?"

2002-10-22 18:03:33

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, 2002-10-22 at 18:45, Benjamin LaHaise wrote:
> On Tue, Oct 22, 2002 at 10:36:40AM -0700, Martin J. Bligh wrote:
> > Bear in mind that large pages are neither swap backed or file backed
> > (vetoed by Linus), for starters. There are other large app problem scenarios
> > apart from Oracle ;-)
>
> I think the fact that large page support doesn't support mmap for users
> that need it is utterly appauling; there are numerous places where it is
> needed. The requirement for root-only access makes it useless for most
> people, especially in HPC environments where it is most needed as such
> machines are usually shared and accounts are non-priveledged.

I was very surprised the large page crap went in, in the form it
currently exists. Merging pages makes sense, spotting and doing 4Mb page
allocations kernel side makes sense. The rest is very questionable.

2002-10-22 18:03:33

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, 22 Oct 2002, Martin J. Bligh wrote:

> > Actually, per-object reverse mappings are nowhere near as good
> > a solution as shared page tables. At least, not from the points
> > of view of space consumption and the overhead of tearing down
> > the mappings at pageout time.
> >
> > Per-object reverse mappings are better for fork+exec+exit speed,
> > though.
> >
> > It's a tradeoff: do we care more for a linear speedup of fork(),
> > exec() and exit() than we care about a possibly exponential
> > slowdown of the pageout code ?

That tradeoff makes the case for spt being a kbuild or /proc/sys option. A
linear speedup of fork/exec/exit is likely to be more generally useful;
most people just don't have huge shared areas. On the other hand, those
who do would get a vast improvement, and that would put Linux a major step
forward in the server competition.

> As long as the box doesn't fall flat on it's face in a jibbering
> heap, that's the first order of priority ... ie I don't care much
> for now ;-)

I'm just trying to decide what this might do for a news server with
hundreds of readers mmap()ing a GB history file. Benchmarks show that 2.5
has more latency than 2.4, and this is likely to make that more obvious.

Is there any way to have this only on processes which really need it?
Define that any way you wish, including hanging a capability on the
executable to get spt.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-10-22 18:31:21

by Dave McCracken

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch


--On Tuesday, October 22, 2002 15:15:29 -0200 Rik van Riel
<[email protected]> wrote:

>> Or large pages. I confess to being a little perplexed as to
>> why we're pursuing both.
>
> I guess that's due to two things.
>
> 1) shared pagetables can speed up fork()+exec() somewhat
>
> 2) if we have two options that fix the Oracle problem,
> there's a better chance of getting at least one of
> the two merged ;)

And
3) The current large page implementation is only for applications
that want anonymous *non-pageable* shared memory. Shared page
tables reduce resource usage for any shared area that's mapped
at a common address and is large enough to span entire pte pages.
Since all pte pages are shared on a COW basis at fork time, children
will continue to share all large read-only areas with their
parent, e.g. large executables.
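(On x86 a pte page covers PTRS_PER_PTE * PAGE_SIZE of virtual address
space, i.e. 1024 * 4 KB = 4 MB without PAE or 512 * 4 KB = 2 MB with
PAE, so "large enough to span entire pte pages" in practice means
shared regions of at least that size, aligned to that boundary.)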

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-10-22 18:43:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

Dave McCracken wrote:
>
> --On Tuesday, October 22, 2002 15:15:29 -0200 Rik van Riel
> <[email protected]> wrote:
>
> >> Or large pages. I confess to being a little perplexed as to
> >> why we're pursuing both.
> >
> > I guess that's due to two things.
> >
> > 1) shared pagetables can speed up fork()+exec() somewhat
> >
> > 2) if we have two options that fix the Oracle problem,
> > there's a better chance of getting at least one of
> > the two merged ;)
>
> And
> 3) The current large page implementation is only for applications
> that want anonymous *non-pageable* shared memory. Shared page
> tables reduce resource usage for any shared area that's mapped
> at a common address and is large enough to span entire pte pages.
> Since all pte pages are shared on a COW basis at fork time, children
> will continue to share all large read-only areas with their
> parent, eg large executables.
>

How important is that in practice?

Seems that large pages are the preferred solution to the "Oracle
and DB2 use gobs of pagetable" problem because large pages also
reduce tlb reload traffic.

So once that's out of the picture, what real-world, observed,
customers-are-hurting problem is solved by pagetable sharing?

2002-10-22 18:44:58

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

In message <[email protected]>, > : Alan Cox writes:
> > I think the fact that large page support doesn't support mmap for users
> > that need it is utterly appauling; there are numerous places where it is
> > needed. The requirement for root-only access makes it useless for most
> > people, especially in HPC environments where it is most needed as such
> > machines are usually shared and accounts are non-priveledged.
>
> I was very suprised the large page crap went in, in the form it
> currently exists. Merging pages makes sense, spotting and doing 4Mb page
> allocations kernel side makes sense. The rest is very questionable

Hmm. Isn't it great that 2.6/3.0 will be stable soon and we can
start working on this for 2.7/3.1?

gerrit

2002-10-22 18:40:51

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

In message <[email protected]>, > : Andrew Morton writes:
> Rik van Riel wrote:
> >
> > ...
> > In short, we really really want shared page tables.
>
> Or large pages. I confess to being a little perplexed as to
> why we're pursuing both.

Large pages benefit the performance of large applications which
explicitly take advantage of them (at least today - maybe in the
future, large pages will be automagically handed out to those that
can use them). And, as a side effect, they reduce KVA overhead.
Oh, and at the moment, they are non-pageable, i.e. permanently stuck
in memory.

On the other hand, shared page tables benefit any application that
shares data, including those that haven't been trained to roll over
and beg for large pages. Shared page tables are already showing large
space savings with at least one database.

gerrit

2002-10-22 18:49:06

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, Oct 22, 2002 at 11:47:37AM -0700, Gerrit Huizenga wrote:
> Hmm. Isn't it great that 2.6/3.0 will be stable soon and we can
> start working on this for 2.7/3.1?

Sure, but we should delete the syscalls now and just use the filesystem
as the intermediate API.

-ben
--
"Do you seek knowledge in time travel?"

2002-10-22 19:02:25

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

In message <[email protected]>, > : Andrew Morton writes:
> Dave McCracken wrote:
> >
> > And
> > 3) The current large page implementation is only for applications
> > that want anonymous *non-pageable* shared memory. Shared page
> > tables reduce resource usage for any shared area that's mapped
> > at a common address and is large enough to span entire pte pages.
> > Since all pte pages are shared on a COW basis at fork time, children
> > will continue to share all large read-only areas with their
> > parent, eg large executables.
> >
>
> How important is that in practice?
>
> Seems that large pages are the preferred solution to the "Oracle
> and DB2 use gobs of pagetable" problem because large pages also
> reduce tlb reload traffic.
>
> So once that's out of the picture, what real-world, observed,
> customers-are-hurting problem is solved by pagetable sharing?

If the shared pte patch had mmap support, then all shared libraries
would benefit. Might need to align them to 4 MB boundaries for best
results, which would also be easy for libraries with unspecified
attach addresses (e.g. most shared libraries).

gerrit

2002-10-22 19:04:27

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

>> Have you reviewed the hugetlbfs and hugetlbpage-backed-shm patches?
>>
>> That code is still requiring CAP_IPC_LOCK, although I suspect it
>> would be better to allow hugetlbfs mmap to be purely administered
>> by file permissions.
>
> Can we delete the specialty syscalls now?

I was led to believe that Linus designed them, so he may be emotionally
attached to them, but I think there would be few others that would cry
over the loss ...

M.

2002-10-22 19:03:31

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

>> > Actually, per-object reverse mappings are nowhere near as good
>> > a solution as shared page tables. At least, not from the points
>> > of view of space consumption and the overhead of tearing down
>> > the mappings at pageout time.
>> >
>> > Per-object reverse mappings are better for fork+exec+exit speed,
>> > though.
>> >
>> > It's a tradeoff: do we care more for a linear speedup of fork(),
>> > exec() and exit() than we care about a possibly exponential
>> > slowdown of the pageout code ?
>
> That tradeoff makes the case for spt being a kbuild or /proc/sys option. A
> linear speedup of fork/exec/exit is likely to be more generally useful,
> most people just don't have huge shared areas. On the other hand, those
> who do would get a vast improvement, and that would put Linux a major step
> forward in the server competition.
>
>> As long as the box doesn't fall flat on it's face in a jibbering
>> heap, that's the first order of priority ... ie I don't care much
>> for now ;-)
>
> I'm just trying to decide what this might do for a news server with
> hundreds of readers mmap()ing a GB history file. Benchmarks show the 2.5
> has more latency the 2.4, and this is likely to make that more obvious.
>
> Is there any way to to have this only on processes which really need it?
> define that any way you wish, including hanging a capability on the
> executable to get spt.

Uh, what are you referring to here? Large pages or shared pagetables?
You can't mmap filebacked stuff with large pages (nixed by Linus).
And yes, large pages have to be specified explicitly by the app.
On the other hand, I don't think shared pagetables have an mmap hook,
though that'd be easy enough to add. And if you're not reading the whole
history file, presumably the PTEs will only be sparsely instantiated anyway.

M.

2002-10-22 19:08:57

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

Dave McCracken wrote:

> 3) The current large page implementation is only for applications
> that want anonymous *non-pageable* shared memory. Shared page
> tables reduce resource usage for any shared area that's mapped
> at a common address and is large enough to span entire pte pages.


Does this happen automatically (i.e., without modifying the mmap call)?

In any case, a system using prelinking will likely have all users of a
DSO mapping the DSO at the same address. Will a system benefit in this
case? If not directly, perhaps with some help from ld.so since we do
know when we expect the same address to be used everywhere.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-10-22 19:15:32

by Dave McCracken

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch


--On Tuesday, October 22, 2002 12:02:27 -0700 "Martin J. Bligh"
<[email protected]> wrote:


>> I'm just trying to decide what this might do for a news server with
>> hundreds of readers mmap()ing a GB history file. Benchmarks show the 2.5
>> has more latency the 2.4, and this is likely to make that more obvious.
>
> On the other hand, I don't think shared pagetables have an mmap hook,
> though that'd be easy enough to add. And if you're not reading the whole
> history file, presumably the PTEs will only be sparsely instantiated
> anyway.

Actually shared page tables work on any shared memory area, no matter how
it was created. When a page fault occurs and there's no pte page already
allocated (the common case for any newly mapped region) it checks the vma
to see if it's shared. If it's shared, it gets the address_space for that
vma, then walks through all the shared vmas looking for one that's mapped
at the same address and offset and already has a pte page that can be
shared.
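
In rough code (hypothetical helper names, with locking and the
architecture details left out, so this is an illustration rather than
the code from the patch), that lookup is along these lines:

/* Find another shared mapping of the same file that is mapped at the
 * same virtual address and file offset and already has a pte page
 * instantiated covering this address. */
static pmd_t *find_shareable_pte_page(struct vm_area_struct *vma,
                                      unsigned long address)
{
        struct address_space *mapping =
                vma->vm_file->f_dentry->d_inode->i_mapping;
        struct vm_area_struct *other;

        /* for_each_shared_vma() stands in for walking the
         * address_space's list of shared vmas. */
        for_each_shared_vma(mapping, other) {
                pgd_t *pgd;
                pmd_t *pmd;

                if (other->vm_mm == vma->vm_mm)
                        continue;
                if (other->vm_start != vma->vm_start ||
                    other->vm_pgoff != vma->vm_pgoff)
                        continue;

                pgd = pgd_offset(other->vm_mm, address);
                if (pgd_none(*pgd))
                        continue;
                pmd = pmd_offset(pgd, address);
                if (pmd_present(*pmd))
                        return pmd;     /* existing pte page to share */
        }
        return NULL;
}

The caller would then presumably take a reference on pmd_page(*pmd) and
point its own pmd entry at that same pte page.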

So if your history file is mapped at the same address for all your
processes then it will use shared page tables. While it might be a nice
add-on to allow sharing if they're mapped on the same pte page boundary,
that doesn't seem likely enough to justify the extra work.

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-10-22 19:25:23

by Dave McCracken

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch


--On Tuesday, October 22, 2002 12:06:29 -0700 Gerrit Huizenga
<[email protected]> wrote:

> If the shared pte patch had mmap support, then all shared libraries
> would benefit. Might need to align them to 4 MB boundaries for best
> results, which would also be easy for libraries with unspecified
> attach addresses (e.g. most shared libraries).

Shared page tables do support mmap, but only for areas that are marked
shared. Private mappings are only shared at fork time. If shared
libraries are mapped shared, then shared page tables will actively share
pte pages for them.

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-10-22 19:23:35

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

In message <[email protected]>, > : Benjamin LaHaise writes:
> On Tue, Oct 22, 2002 at 11:47:37AM -0700, Gerrit Huizenga wrote:
> > Hmm. Isn't it great that 2.6/3.0 will be stable soon and we can
> > start working on this for 2.7/3.1?
>
> Sure, but we should delete the syscalls now and just use the filesystem
> as the intermediate API.
>
> -ben

That would be fine with me - we are only planning on people using
flags to shm*() or mmap(), not on the syscalls. I thought Oracle
was the one heavily dependent on the icky syscalls.

gerrit

2002-10-22 19:25:32

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

In message <[email protected]>, > : Ulrich Drepper writes:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Dave McCracken wrote:
>
> > 3) The current large page implementation is only for applications
> > that want anonymous *non-pageable* shared memory. Shared page
> > tables reduce resource usage for any shared area that's mapped
> > at a common address and is large enough to span entire pte pages.
>
>
> Does this happen automatically (i.e., without modifying th emmap call)?
>
> In any case, a system using prelinking will likely have all users of a
> DSO mapping the DSO at the same address. Will a system benefit in this
> case? If not directly, perhaps with some help from ld.so since we do
> know when we expect the same is used everywhere.

One important thing to watch out for is to make sure that the PLT
and GOT fixups (where typically pages are mprotected to RW, modified,
then back to RO) are not in the range of pages that are shared.
And, it helps if everything that is shared read-only is 4 MB aligned.
If ld.so could do that under linux, we'd have the biggest win.

gerrit

2002-10-22 19:23:38

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, Oct 22, 2002 at 12:27:57PM -0700, Gerrit Huizenga wrote:
> That would be fine with me - we are only planning on people using
> flags to shm*() or mmap(), not on the syscalls. I thought Oracle
> was the one heavily dependent on the icky syscalls.

You mean the wonderfully untested calls that never worked? At least
they'd tested and used Ingo's 2.4 based patches that made shmfs use
4MB pages.

-ben
--
"Do you seek knowledge in time travel?"

2002-10-22 20:03:02

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, 2002-10-22 at 20:03, Martin J. Bligh wrote:

> > Can we delete the specialty syscalls now?
>
> I was lead to believe that Linus designed them, so he may be emotionally attatched
> to them, but I think there would be few others that would cry over the loss ...

You mean like the wonderfully pointless sys_readahead. The sooner these
calls go the better.

2002-10-22 20:14:51

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, 2002-10-22 at 21:27, Gerrit Huizenga wrote:
> That would be fine with me - we are only planning on people using
> flags to shm*() or mmap(), not on the syscalls. I thought Oracle
> was the one heavily dependent on the icky syscalls.

the icky syscalls are unusable for databases.. I'd be *really* surprised
if oracle could use them at all on x86....



2002-10-22 21:28:54

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

In message <[email protected]>, > : Alan Cox writes:
> On Tue, 2002-10-22 at 20:03, Martin J. Bligh wrote:
>
> > > Can we delete the specialty syscalls now?
> >
> > I was lead to believe that Linus designed them, so he may be emotionally attatched
> > to them, but I think there would be few others that would cry over the loss ...
>
> You mean like the wonderfully pointless sys_readahead. The sooner these
> calls go the better.

No, the other icky syscalls - the {alloc,free}_hugepages.

gerrit

2002-10-23 03:04:55

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On 23 Oct 2002, Andi Kleen wrote:
> Gerrit Huizenga <[email protected]> writes:
>
> > If the shared pte patch had mmap support, then all shared libraries
> > would benefit. Might need to align them to 4 MB boundaries for best
> > results, which would also be easy for libraries with unspecified
> > attach addresses (e.g. most shared libraries).
>
> But only if the shared libraries are a multiple of 2/4MB, otherwise
> you'll waste memory. Or do you propose to link multiple mmap'ed
> libraries together into the same page ?

Using shared page tables all you'll waste is virtual space.

The shared page table for e.g. /lib/libc.so will eventually
end up mapping all of the libc pages that are used by the
system's workload, and processes won't pagefault on any libc
page that's already present in RAM.

Sounds like a win/win solution, cutting down both on pagetable
overhead and on pagefaults.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: [email protected]

2002-10-23 02:57:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

Gerrit Huizenga <[email protected]> writes:

> If the shared pte patch had mmap support, then all shared libraries
> would benefit. Might need to align them to 4 MB boundaries for best
> results, which would also be easy for libraries with unspecified
> attach addresses (e.g. most shared libraries).

But only if the shared libraries are a multiple of 2/4MB, otherwise you'll
waste memory. Or do you propose to link multiple mmap'ed libraries together
into the same page ?

But I agree it would be nice to have a chattr for files that tells
mmap() to use large pages for them.

-Andi

2002-10-23 17:44:06

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

In message <[email protected]>, > : Andi Kleen writes:
> Gerrit Huizenga <[email protected]> writes:
>
> > If the shared pte patch had mmap support, then all shared libraries
> > would benefit. Might need to align them to 4 MB boundaries for best
> > results, which would also be easy for libraries with unspecified
> > attach addresses (e.g. most shared libraries).
>
> But only if the shared libraries are a multiple of 2/4MB, otherwise you'll
> waste memory. Or do you propose to link multiple mmap'ed libraries together
> into the same page ?

Hmm. I didn't propose. Sounds cool. But that would have to happen
at the compiler's loader level, not the dynamic linker side of things,
which makes it less likely. Someone once proposed a mega-library where
the big/key shared objects were linked together which would make this
somewhat more practical.

But even wasting a bit of space for a few key libraries, even if they
are smaller than 4 MB (or 2 MB) on ia32 (ia32 PAE), might be worth it
for the TLB & general overhead savings (e.g. like the kernel text pages). And,
if shared, even a bigger win.

> But I agree it would be nice to have a chattr for files that tells
> mmap() to use large pages for them.

Yep - that would be ideal - like the old sticky flag on binaries.
As a patch, that would make it easy to compare performance diffs as
well. Probably good for things like Oracle or DB2 as well.

gerrit

2002-10-24 10:37:05

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, 22 Oct 2002, Martin J. Bligh wrote:

> > I'm just trying to decide what this might do for a news server with
> > hundreds of readers mmap()ing a GB history file. Benchmarks show the 2.5
> > has more latency the 2.4, and this is likely to make that more obvious.
> >
> > Is there any way to to have this only on processes which really need it?
> > define that any way you wish, including hanging a capability on the
> > executable to get spt.
>
> Uh, what are you referring to here? Large pages or shared pagetables?
> You can't mmap filebacked stuff with larged pages (nixed by Linus).

Shared pagetables, it's a non-root application.

> And yes, large pages have to be specified explicitly by the app.
> On the other hand, I don't think shared pagetables have an mmap hook,

That could be interesting... so if the server has a mmap()ed file and
forks a child when a connection comes in, the tables would get duplicated
even with the patch? That's not going to help me.

> though that'd be easy enough to add. And if you're not reading the whole
> history file, presumably the PTEs will only be sparsely instantiated anyway.

I have to go back and look at that code to be sure; you may be right.
There certainly are other things which are mmap()ed by all children (or
threads, depending on implementation) which might benefit since they're
moderately large and do have hundreds to thousands of copies.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-10-24 10:44:58

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Tue, 22 Oct 2002, Dave McCracken wrote:

>
> --On Tuesday, October 22, 2002 12:02:27 -0700 "Martin J. Bligh"
> <[email protected]> wrote:
>
>
> >Bill Davidsen wrote:
> >> I'm just trying to decide what this might do for a news server with
> >> hundreds of readers mmap()ing a GB history file. Benchmarks show the 2.5
> >> has more latency the 2.4, and this is likely to make that more obvious.
> >
> > On the other hand, I don't think shared pagetables have an mmap hook,
> > though that'd be easy enough to add. And if you're not reading the whole
> > history file, presumably the PTEs will only be sparsely instantiated
> > anyway.
>
> Actually shared page tables work on any shared memory area, no matter how
> it was created. When a page fault occurs and there's no pte page already
> allocated (the common case for any newly mapped region) it checks the vma
> to see if it's shared. If it's shared, it gets the address_space for that
> vma, then walks through all the shared vmas looking for one that's mapped
> at the same address and offset and already has a pte page that can be
> shared.

That's more encouraging.

> So if your history file is mapped at the same address for all your
> processes then it will use shared page tables. While it might be a nice
> add-on to allow sharing if they're mapped on the same pte page boundary,
> that doesn't seem likely enough to justify the extra work.

The reader processes are either forked (INN w/ daemon nnrpd) or pthreads
(twister, earthquake, diablo). So everything will be the same.

Another thought, how does this play with NUMA systems? I don't have the
problem, but presumably there are implications.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-10-24 14:22:18

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

> Another thought, how does this play with NUMA systems? I don't have the
> problem, but presumably there are implications.

At some point we'll probably only want one shared set per node.
Gets tricky when you migrate processes across nodes though - will
need more thought

M.

2002-10-24 14:32:36

by Dave McCracken

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch


--On Thursday, October 24, 2002 07:22:56 -0700 "Martin J. Bligh"
<[email protected]> wrote:

>> Another thought, how does this play with NUMA systems? I don't have the
>> problem, but presumably there are implications.
>
> At some point we'll probably only want one shared set per node.
> Gets tricky when you migrate processes across nodes though - will
> need more thought

Page tables can only be shared when they're pointing to the same data pages
anyway, so I think it's just part of the larger problem of node-local
memory.

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-10-24 14:54:49

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

>>> Another thought, how does this play with NUMA systems? I don't have the
>>> problem, but presumably there are implications.
>>
>> At some point we'll probably only want one shared set per node.
>> Gets tricky when you migrate processes across nodes though - will
>> need more thought
>
> Page tables can only be shared when they're pointing to the same
> data pages anyway, so I think it's just part of the larger problem
> of node-local memory.

Yes, same problem as text replication. You're right, it's probably not
worth solving otherwise - too small a percentage of the real problem.

M.

2002-10-25 17:26:11

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH 2.5.43-mm2] New shared page table patch

On Thu, 24 Oct 2002, Martin J. Bligh wrote:

> > Another thought, how does this play with NUMA systems? I don't have the
> > problem, but presumably there are implications.
>
> At some point we'll probably only want one shared set per node.
> Gets tricky when you migrate processes across nodes though - will
> need more thought

The whole issue of pages shared between nodes is a graduate thesis waiting
to happen.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.