2002-08-02 00:36:42

by Andrew Morton

[permalink] [raw]
Subject: large page patch


This is a large-page support patch from Rohit Seth, forwarded
with his permission (thanks!).



> Attached is the large_page support for IA-32. For the most part there are
> no changes over the IA-64 patch. System calls and their semantics remain
> the same, though there are still some small parts of the code that are
> arch-specific (for IA-64 there is a separate region for large_pages,
> whereas on IA-32 it is the same linear address space, etc.). I would
> appreciate it if you all could provide your input and any issues that you
> think we need to resolve.
>
> Attached is the large_page patch, including the following support: 1-
> Private and Shared Anonymous large pages (this is the earlier patch +
> Anonymous shared Large_page support). Private Anonymous large_pages stay
> with the particular process, and the vm segments corresponding to these
> get the VM_DONTCOPY attribute. Shared Anonymous pages get shared by
> children. (Children share the same physical large_pages with the parent.)
> Allocation and deallocation are done using the following two system calls:
>
> sys_get_large_pages (unsigned long addr, unsigned long len, int prot, int flags)
> where prot can be PROT_READ, PROT_WRITE, PROT_EXEC and flags
> is MAP_PRIVATE or MAP_SHARED
> sys_free_large_pages(unsigned long addr)
>
> 2- Shared Large Pages across different processes. Allocation and
> deallocation of large_pages that a process can share and unshare across
> different processes is done using the following two system calls:
>
> sys_share_large_pages(int key, unsigned long addr, unsigned long len, int prot, int flag)
>
> where key is the system-wide unique identifier that processes use to share
> pages. This should be a non-zero positive number. prot is the same as in
> the above cases. flag can be set to IPC_CREAT so that if the segment
> corresponding to key does not already exist it is created (otherwise
> -ENOENT is returned if there is no existing segment).
>
> sys_unshare_large_pages(unsigned long addr)
>
> is used to unshare the large_pages from the process's address space. The
> large_pages are put back on lpage_freelist only when the last user has
> requested unsharing (similar to the SHM_DEST attribute).
>
> Most of the support needed for above two cases (Anonymous and Sharing
> across processes) is quite similar in kernel except for binding of
> large_pages to key and temporary inode structure.
>
> 3) Currently the large_page memory is dynamically configurable through
> /proc/sys/kernel/numlargepages. The user specifies the amount (negative
> meaning shrink) by which the number of large_page pages should change.
> For example, a value of -2 will reduce the number of large_page pages
> currently configured in the system by 2. Note that this change depends on
> the availability of free large_pages; if none are available the value
> remains the same. (Any cleaner suggestions?)
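
For illustration only, a userspace caller of the proposed interface might
look roughly like the sketch below. This is not part of the patch: there
are no libc wrappers, so raw syscall() is used, and the syscall numbers are
inferred from the entry.S hunk further down (the four slots added after
fremovexattr) and may differ on other trees. LPAGE_SIZE here assumes
non-PAE IA-32.

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/ipc.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define __NR_get_large_pages     238	/* inferred from entry.S */
	#define __NR_free_large_pages    239
	#define __NR_share_large_pages   240
	#define __NR_unshare_large_pages 241

	#define LPAGE_SIZE (4UL * 1024 * 1024)	/* 4MB large pages */

	int main(void)
	{
		unsigned long addr, shared;

		/* Private anonymous large pages. */
		addr = syscall(__NR_get_large_pages, 0UL, 2 * LPAGE_SIZE,
			       PROT_READ | PROT_WRITE, MAP_PRIVATE);
		if ((long)addr < 0) {
			perror("get_large_pages");
			return 1;
		}
		memset((void *)addr, 0, 2 * LPAGE_SIZE);
		syscall(__NR_free_large_pages, addr);

		/* Large pages shared across processes under key 42. */
		shared = syscall(__NR_share_large_pages, 42, 0UL, LPAGE_SIZE,
				 PROT_READ | PROT_WRITE, IPC_CREAT);
		if ((long)shared >= 0)
			syscall(__NR_unshare_large_pages, shared);
		return 0;
	}

The pool itself would be grown or shrunk by writing a delta to
/proc/sys/kernel/numlargepages, as described in point 3 above.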

Some observations which have been made thus far:

- Minimal impact on the VM and MM layers

- Delegates most of it to the arch layer

- Generic code is not tied to pagetables so (for example) PPC could
implement the system calls with BAT registers

- The change to MAX_ORDER is unneeded

- swapping of large pages and making them pagecache-coherent is
unpopular.

- may be better to implement the shm API with fd's, not keys.

- an ia64 implementation is available


diff -Naru linux.org/arch/i386/config.in linux.lp/arch/i386/config.in
--- linux.org/arch/i386/config.in Mon Feb 25 11:37:52 2002
+++ linux.lp/arch/i386/config.in Tue Jul 2 17:49:15 2002
@@ -184,6 +184,8 @@

bool 'Math emulation' CONFIG_MATH_EMULATION
bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR
+bool 'IA-32 Large Page Support (if available on processor)' CONFIG_LARGE_PAGE
+
bool 'Symmetric multi-processing support' CONFIG_SMP
if [ "$CONFIG_SMP" != "y" ]; then
bool 'Local APIC support on uniprocessors' CONFIG_X86_UP_APIC
@@ -205,7 +207,6 @@

mainmenu_option next_comment
comment 'General setup'
-
bool 'Networking support' CONFIG_NET

# Visual Workstation support is utterly broken.
diff -Naru linux.org/arch/i386/kernel/entry.S linux.lp/arch/i386/kernel/entry.S
--- linux.org/arch/i386/kernel/entry.S Mon Feb 25 11:37:53 2002
+++ linux.lp/arch/i386/kernel/entry.S Tue Jul 2 15:12:23 2002
@@ -634,6 +634,10 @@
.long SYMBOL_NAME(sys_ni_syscall) /* 235 reserved for removexattr */
.long SYMBOL_NAME(sys_ni_syscall) /* reserved for lremovexattr */
.long SYMBOL_NAME(sys_ni_syscall) /* reserved for fremovexattr */
+ .long SYMBOL_NAME(sys_get_large_pages) /* Get large_page pages */
+ .long SYMBOL_NAME(sys_free_large_pages) /* Free large_page pages */
+ .long SYMBOL_NAME(sys_share_large_pages)/* Share large_page pages */
+ .long SYMBOL_NAME(sys_unshare_large_pages)/* UnShare large_page pages */

.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
diff -Naru linux.org/arch/i386/kernel/sys_i386.c linux.lp/arch/i386/kernel/sys_i386.c
--- linux.org/arch/i386/kernel/sys_i386.c Mon Mar 19 12:35:09 2001
+++ linux.lp/arch/i386/kernel/sys_i386.c Wed Jul 3 14:28:16 2002
@@ -254,3 +254,126 @@
return -ERESTARTNOHAND;
}

+#ifdef CONFIG_LARGE_PAGE
+#define LPAGE_ALIGN(x) (((unsigned long)x + (LPAGE_SIZE -1)) & LPAGE_MASK)
+extern long sys_munmap(unsigned long, size_t);
+
+/* get_addr() finds a currently unused virtual range in the
+ * current process's address space. On success it returns an
+ * LPAGE_SIZE-aligned address. The generic kernel routines can
+ * only guarantee that the allocated address is PAGE_SIZE aligned.
+ */
+unsigned long
+get_addr(unsigned long addr, unsigned long len)
+{
+ struct vm_area_struct *vma;
+ if (addr) {
+ addr = LPAGE_ALIGN(addr);
+ vma = find_vma(current->mm, addr);
+ if (((TASK_SIZE - len) >= addr) &&
+ (!vma || addr + len <= vma->vm_start))
+ goto found_addr;
+ }
+ addr = LPAGE_ALIGN(TASK_UNMAPPED_BASE);
+ for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
+ if (TASK_SIZE - len < addr)
+ return -ENOMEM;
+ if (!vma || ((addr + len) < vma->vm_start))
+ goto found_addr;
+ addr = vma->vm_end;
+ }
+found_addr:
+ addr = LPAGE_ALIGN(addr);
+ return addr;
+}
+
+asmlinkage unsigned long
+sys_get_large_pages(unsigned long addr, unsigned long len, int prot, int flags)
+{
+ extern int make_lpages_present(unsigned long, unsigned long, int);
+ int temp;
+
+ if (!(cpu_has_pse))
+ return -EINVAL;
+ if (len & (LPAGE_SIZE - 1))
+ return -EINVAL;
+ addr = get_addr(addr, len);
+ if (addr == -ENOMEM)
+ return addr;
+ temp = MAP_SHARED | MAP_ANONYMOUS |MAP_FIXED;
+ addr = do_mmap_pgoff(NULL, addr, len, prot, temp, 0);
+ printk("Returned addr %x\n", addr);
+ if (!(addr & (LPAGE_SIZE -1))) {
+ if (make_lpages_present(addr, (addr+len), flags) < 0) {
+ addr = sys_munmap(addr, len);
+ return -ENOMEM;
+ }
+ }
+ return addr;
+}
+
+asmlinkage unsigned long
+sys_share_large_pages(int key, unsigned long addr, unsigned long len, int prot, int flag)
+{
+ unsigned long raddr;
+ int retval;
+ extern int set_lp_shm_seg(int, unsigned long *, unsigned long, int, int);
+ if (!(cpu_has_pse))
+ return -EINVAL;
+ if (key <= 0)
+ return -EINVAL;
+ if (len & (LPAGE_SIZE - 1))
+ return -EINVAL;
+ raddr = get_addr(addr, len);
+ if (raddr == -ENOMEM)
+ return raddr;
+ retval = set_lp_shm_seg(key, &raddr, len, prot, flag);
+ if (retval < 0)
+ return (unsigned long) retval;
+ return raddr;
+}
+
+asmlinkage int
+sys_free_large_pages(unsigned long addr)
+{
+ struct vm_area_struct *vma;
+ extern int unmap_large_pages(struct vm_area_struct *);
+
+ vma = find_vma(current->mm, addr);
+ if ((!vma) || (!(vma->vm_flags & VM_LARGEPAGE)) ||
+ (vma->vm_start!=addr))
+ return -EINVAL;
+ return unmap_large_pages(vma);
+}
+
+asmlinkage int
+sys_unshare_large_pages(unsigned long addr)
+{
+ return sys_free_large_pages(addr);
+}
+
+#else
+asmlinkage unsigned long
+sys_get_large_pages(unsigned long addr, size_t len, int prot, int flags)
+{
+ return -ENOSYS;
+}
+
+asmlinkage unsigned long
+sys_share_large_pages(int key, unsigned long addr, size_t len, int prot, int flag)
+{
+ return -ENOSYS;
+}
+
+asmlinkage int
+sys_free_large_pages(unsigned long addr)
+{
+ return -ENOSYS;
+}
+
+asmlinkage int
+sys_unshare_large_pages(unsigned long addr)
+{
+ return -ENOSYS;
+}
+#endif
diff -Naru linux.org/arch/i386/mm/Makefile linux.lp/arch/i386/mm/Makefile
--- linux.org/arch/i386/mm/Makefile Fri Dec 29 14:07:20 2000
+++ linux.lp/arch/i386/mm/Makefile Tue Jul 2 16:55:53 2002
@@ -10,5 +10,6 @@
O_TARGET := mm.o

obj-y := init.o fault.o ioremap.o extable.o
+obj-$(CONFIG_LARGE_PAGE) += lpage.o

include $(TOPDIR)/Rules.make
diff -Naru linux.org/arch/i386/mm/init.c linux.lp/arch/i386/mm/init.c
--- linux.org/arch/i386/mm/init.c Fri Dec 21 09:41:53 2001
+++ linux.lp/arch/i386/mm/init.c Tue Jul 2 18:39:13 2002
@@ -447,6 +447,12 @@
return 0;
}

+#ifdef CONFIG_LARGE_PAGE
+long lpagemem = 0;
+int lp_max;
+long lpzone_pages;
+extern struct list_head lpage_freelist;
+#endif
void __init mem_init(void)
{
extern int ppro_with_ram_bug(void);
@@ -532,6 +538,32 @@
zap_low_mappings();
#endif

+#ifdef CONFIG_LARGE_PAGE
+ {
+ long i;
+ long j;
+ struct page *page, *map;
+
+ /*For now reserve quarter for large_pages.*/
+ lpzone_pages = (max_low_pfn >> ((LPAGE_SHIFT - PAGE_SHIFT) + 2)) ;
+ /*Will make this kernel command line. */
+ INIT_LIST_HEAD(&lpage_freelist);
+ for (i=0; i<lpzone_pages; i++) {
+ page = alloc_pages(GFP_ATOMIC, LARGE_PAGE_ORDER);
+ if (page == NULL)
+ break;
+ map = page;
+ for (j=0; j<(LPAGE_SIZE/PAGE_SIZE); j++) {
+ SetPageReserved(map);
+ map++;
+ }
+ list_add(&page->list, &lpage_freelist);
+ }
+ printk("Total Large_page memory pages allocated %ld\n", i);
+ lpzone_pages = lpagemem = i;
+ lp_max = i;
+ }
+#endif
}

/* Put this after the callers, so that it cannot be inlined */
diff -Naru linux.org/arch/i386/mm/lpage.c linux.lp/arch/i386/mm/lpage.c
--- linux.org/arch/i386/mm/lpage.c Wed Dec 31 16:00:00 1969
+++ linux.lp/arch/i386/mm/lpage.c Wed Jul 3 16:09:59 2002
@@ -0,0 +1,475 @@
+/*
+ * IA-32 Large Page Support for Kernel.
+ *
+ * Copyright (C) 2002, Rohit Seth <[email protected]>
+ */
+
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/devfs_fs_kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/string.h>
+#include <linux/locks.h>
+#include <linux/smp_lock.h>
+#include <linux/slab.h>
+
+#include <asm/uaccess.h>
+#include <asm/mman.h>
+
+static struct vm_operations_struct lp_vm_ops;
+struct list_head lpage_freelist;
+spinlock_t lpage_lock = SPIN_LOCK_UNLOCKED;
+extern long lpagemem;
+
+#define MAX_ID 32
+struct lpkey {
+ struct inode *in;
+ int key;
+} lpk[MAX_ID];
+
+static struct inode *
+find_key_inode(int key)
+{
+ int i;
+
+ for (i=0; i<MAX_ID; i++) {
+ if (lpk[i].key == key)
+ return (lpk[i].in);
+ }
+ return NULL;
+}
+static struct page *
+alloc_large_page(void)
+{
+ struct list_head *curr, *head;
+ struct page *page;
+
+ spin_lock(&lpage_lock);
+
+ head = &lpage_freelist;
+ curr = head->next;
+
+ if (curr == head) {
+ spin_unlock(&lpage_lock);
+ return NULL;
+ }
+ page = list_entry(curr, struct page, list);
+ list_del(curr);
+ lpagemem--;
+ spin_unlock(&lpage_lock);
+ set_page_count(page, 1);
+ memset(page_address(page), 0, LPAGE_SIZE);
+ return page;
+}
+
+static void
+free_large_page(struct page *page)
+{
+ if ((page->mapping != NULL) && (page_count(page) == 2)) {
+ struct inode *inode = page->mapping->host;
+ int i;
+
+ lru_cache_del(page);
+ remove_inode_page(page);
+ set_page_count(page, 1);
+ if ((inode->i_size -= LPAGE_SIZE) == 0) {
+ for (i=0;i<MAX_ID;i++)
+ if (lpk[i].key == inode->i_ino) {
+ lpk[i].key = 0;
+ break;
+ }
+ kfree(inode);
+ }
+ }
+ if (put_page_testzero(page)) {
+ spin_lock(&lpage_lock);
+ list_add(&page->list, &lpage_freelist);
+ lpagemem++;
+ spin_unlock(&lpage_lock);
+ }
+}
+
+static pte_t *
+lp_pte_alloc(struct mm_struct *mm, unsigned long addr)
+{
+ pgd_t *pgd;
+ pmd_t *pmd = NULL;
+
+ pgd = pgd_offset(mm, addr);
+ pmd = pmd_alloc(mm, pgd, addr);
+ return (pte_t *)pmd;
+}
+
+static pte_t *
+lp_pte_offset(struct mm_struct *mm, unsigned long addr)
+{
+ pgd_t *pgd;
+ pmd_t *pmd = NULL;
+
+ pgd =pgd_offset(mm, addr);
+ pmd = pmd_offset(pgd, addr);
+ return (pte_t *)pmd;
+}
+
+#define mk_pte_large(entry) {entry.pte_low |= (_PAGE_PRESENT | _PAGE_PSE);}
+
+static void
+set_lp_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page, pte_t *page_table, int write_access)
+{
+ pte_t entry;
+
+ mm->rss += (LPAGE_SIZE/PAGE_SIZE);
+ if (write_access) {
+ entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
+ } else
+ entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
+ entry = pte_mkyoung(entry);
+ mk_pte_large(entry);
+ set_pte(page_table, entry);
+ printk("VIRTUAL_ADDRESS_OF_LPAGE IS %p\n", page->virtual);
+ return;
+}
+
+static int
+anon_get_lpage(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, pte_t *page_table)
+{
+ struct page *page;
+
+ page = alloc_large_page();
+ if (page == NULL)
+ return -1;
+ set_lp_pte(mm, vma, page, page_table, write_access);
+ return 1;
+}
+
+int
+make_lpages_present(unsigned long addr, unsigned long end, int flags)
+{
+ int write;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct * vma;
+ pte_t *pte;
+
+ vma = find_vma(mm, addr);
+ if (!vma)
+ goto out_error1;
+
+ write = (vma->vm_flags & VM_WRITE) != 0;
+ if ((vma->vm_end - vma->vm_start) & (LPAGE_SIZE-1))
+ goto out_error1;
+ spin_lock(&mm->page_table_lock);
+ do {
+ pte = lp_pte_alloc(mm, addr);
+ if ((pte) && (pte_none(*pte))) {
+ if (anon_get_lpage(mm, vma,
+ write ? VM_WRITE : VM_READ, pte) == -1)
+ goto out_error;
+ } else
+ goto out_error;
+ addr += LPAGE_SIZE;
+ } while (addr < end);
+ spin_unlock(&mm->page_table_lock);
+ vma->vm_flags |= (VM_LARGEPAGE | VM_RESERVED);
+ if (flags & MAP_PRIVATE )
+ vma->vm_flags |= VM_DONTCOPY;
+ vma->vm_ops = &lp_vm_ops;
+ return 0;
+out_error: /*Error case, remove the partial lp_resources. */
+ if (addr > vma->vm_start) {
+ vma->vm_end = addr ;
+ zap_lp_resources(vma);
+ vma->vm_end = end;
+ }
+ spin_unlock(&mm->page_table_lock);
+out_error1:
+ return -1;
+}
+
+int
+copy_lpage_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma)
+{
+ pte_t *src_pte, *dst_pte, entry;
+ struct page *ptepage;
+ unsigned long addr = vma->vm_start;
+ unsigned long end = vma->vm_end;
+
+ while (addr < end) {
+ dst_pte = lp_pte_alloc(dst, addr);
+ if (!dst_pte)
+ goto nomem;
+ src_pte = lp_pte_offset(src, addr);
+ entry = *src_pte;
+ ptepage = pte_page(entry);
+ get_page(ptepage);
+ set_pte(dst_pte, entry);
+ dst->rss += (LPAGE_SIZE/PAGE_SIZE);
+ addr += LPAGE_SIZE;
+ }
+ return 0;
+
+nomem:
+ return -ENOMEM;
+}
+int
+follow_large_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, unsigned long *st, int *length, int i)
+{
+ pte_t *ptep, pte;
+ unsigned long start = *st;
+ unsigned long pstart;
+ int len = *length;
+ struct page *page;
+
+ do {
+ pstart = start;
+ ptep = lp_pte_offset(mm, start);
+ pte = *ptep;
+
+back1:
+ page = pte_page(pte);
+ if (pages) {
+ page += ((start & ~LPAGE_MASK) >> PAGE_SHIFT);
+ pages[i] = page;
+ page_cache_get(page);
+ }
+ if (vmas)
+ vmas[i] = vma;
+ i++;
+ len--;
+ start += PAGE_SIZE;
+ if (((start & LPAGE_MASK) == pstart) && len && (start < vma->vm_end))
+ goto back1;
+ } while (len && start < vma->vm_end);
+ *length = len;
+ *st = start;
+ return i;
+}
+
+static void
+zap_lp_resources(struct vm_area_struct *mpnt)
+{
+ struct mm_struct *mm = mpnt->vm_mm;
+ unsigned long len, addr, end;
+ pte_t *ptep;
+ struct page *page;
+
+ addr = mpnt->vm_start;
+ end = mpnt->vm_end;
+ len = end - addr;
+ do {
+ ptep = lp_pte_offset(mm, addr);
+ page = pte_page(*ptep);
+ pte_clear(ptep);
+ free_large_page(page);
+ addr += LPAGE_SIZE;
+ } while (addr < end);
+ mm->rss -= (len >> PAGE_SHIFT);
+}
+
+static void
+unlink_vma(struct vm_area_struct *mpnt)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
+ vma = mm->mmap;
+ if (vma == mpnt) {
+ mm->mmap = vma->vm_next;
+ }
+ else {
+ while (vma->vm_next != mpnt) {
+ vma = vma->vm_next;
+ }
+ vma->vm_next = mpnt->vm_next;
+ }
+ rb_erase(&mpnt->vm_rb, &mm->mm_rb);
+ mm->mmap_cache = NULL;
+ mm->map_count--;
+}
+
+int
+unmap_large_pages(struct vm_area_struct *mpnt)
+{
+ struct mm_struct *mm = current->mm;
+
+ unlink_vma(mpnt);
+ spin_lock(&mm->page_table_lock);
+ zap_lp_resources(mpnt);
+ spin_unlock(&mm->page_table_lock);
+ kmem_cache_free(vm_area_cachep, mpnt);
+ return 1;
+}
+
+static struct inode *
+set_new_inode(unsigned long len, int prot, int flag, int key)
+{
+ struct inode *inode;
+ int i;
+
+ for (i=0; i<MAX_ID; i++) {
+ if (lpk[i].key == 0)
+ break;
+ }
+ if (i == MAX_ID)
+ return NULL;
+ inode = kmalloc(sizeof(struct inode), GFP_ATOMIC);
+ if (inode == NULL)
+ return NULL;
+
+ memset(inode, 0, sizeof(struct inode));
+ INIT_LIST_HEAD(&inode->i_hash);
+ inode->i_mapping = &inode->i_data;
+ inode->i_mapping->host = inode;
+ INIT_LIST_HEAD(&inode->i_data.clean_pages);
+ INIT_LIST_HEAD(&inode->i_data.dirty_pages);
+ INIT_LIST_HEAD(&inode->i_data.locked_pages);
+ spin_lock_init(&inode->i_data.i_shared_lock);
+ inode->i_ino = (unsigned long)key;
+
+ lpk[i].key = key;
+ lpk[i].in = inode;
+ inode->i_uid = current->fsuid;
+ inode->i_gid = current->fsgid;
+ inode->i_mode = prot;
+ inode->i_size = len;
+ return inode;
+}
+
+static int
+check_size_prot(struct inode *inode, unsigned long len, int prot, int flag)
+{
+ if (inode->i_uid != current->fsuid)
+ return -1;
+ if (inode->i_gid != current->fsgid)
+ return -1;
+ if (inode->i_mode != prot)
+ return -1;
+ if (inode->i_size != len)
+ return -1;
+ return 0;
+}
+
+int
+set_lp_shm_seg(int key, unsigned long *raddr, unsigned long len, int prot, int flag)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ struct inode *inode;
+ struct address_space *mapping;
+ struct page *page;
+ unsigned long addr = *raddr;
+ int idx;
+ int retval = -ENOMEM;
+
+ if (len & (LPAGE_SIZE -1))
+ return -EINVAL;
+
+ inode = find_key_inode(key);
+ if (inode == NULL) {
+ if (!(flag & IPC_CREAT))
+ return -ENOENT;
+ inode = set_new_inode(len, prot, flag, key);
+ if (inode == NULL)
+ return -ENOMEM;
+ }
+ else
+ if (check_size_prot(inode, len, prot, flag) < 0)
+ return -EINVAL;
+ mapping = inode->i_mapping;
+
+ addr = do_mmap_pgoff(NULL, addr, len, (unsigned long)prot,
+ MAP_FIXED|MAP_PRIVATE | MAP_ANONYMOUS, 0);
+ if (IS_ERR((void *)addr))
+ return -ENOMEM;
+
+ vma = find_vma(mm, addr);
+ if (!vma)
+ return -EINVAL;
+
+ *raddr = addr;
+ spin_lock(&mm->page_table_lock);
+ do {
+ pte_t * pte = lp_pte_alloc(mm, addr);
+ if ((pte) && (pte_none(*pte))) {
+ idx = (addr - vma->vm_start) >> LPAGE_SHIFT;
+ page = find_get_page(mapping, idx);
+ if (page == NULL) {
+ page = alloc_large_page();
+ if (page == NULL)
+ goto out;
+ add_to_page_cache(page, mapping, idx);
+ }
+ set_lp_pte(mm, vma, page, pte, (vma->vm_flags & VM_WRITE));
+ } else
+ goto out;
+ addr += LPAGE_SIZE;
+ } while (addr < vma->vm_end);
+ retval = 0;
+ vma->vm_flags |= (VM_LARGEPAGE | VM_RESERVED);
+ vma->vm_ops = &lp_vm_ops;
+ spin_unlock(&mm->page_table_lock);
+ return retval;
+out:
+ if (addr > vma->vm_start) {
+ raddr = vma->vm_end;
+ vma->vm_end = addr;
+ zap_lp_resources(vma);
+ vma->vm_end = raddr;
+ }
+ spin_unlock(&mm->page_table_lock);
+ return retval;
+}
+
+int
+change_large_page_mem_size(int count)
+{
+ int j;
+ struct page *page, *map;
+ extern long lpzone_pages;
+ extern struct list_head lpage_freelist;
+
+ if (count == 0)
+ return (int)lpzone_pages;
+ if (count > 0) {/*Increase the mem size. */
+ while (count--) {
+ page = alloc_pages(GFP_ATOMIC, LARGE_PAGE_ORDER);
+ if (page == NULL)
+ break;
+ map = page;
+ for (j=0; j<(LPAGE_SIZE/PAGE_SIZE); j++) {
+ SetPageReserved(map);
+ map++;
+ }
+ spin_lock(&lpage_lock);
+ list_add(&page->list, &lpage_freelist);
+ lpagemem++;
+ lpzone_pages++;
+ spin_unlock(&lpage_lock);
+ }
+ return (int)lpzone_pages;
+ }
+ /*Shrink the memory size. */
+ while (count++) {
+ page = alloc_large_page();
+ if (page == NULL)
+ break;
+ spin_lock(&lpage_lock);
+ lpzone_pages--;
+ spin_unlock(&lpage_lock);
+ map = page;
+ for (j=0; j<(LPAGE_SIZE/PAGE_SIZE); j++) {
+ ClearPageReserved(map);
+ map++;
+ }
+ __free_pages(page, LARGE_PAGE_ORDER);
+ }
+ return (int)lpzone_pages;
+}
+static struct vm_operations_struct lp_vm_ops = {
+ close: zap_lp_resources,
+};
diff -Naru linux.org/fs/proc/array.c linux.lp/fs/proc/array.c
--- linux.org/fs/proc/array.c Thu Oct 11 09:00:01 2001
+++ linux.lp/fs/proc/array.c Wed Jul 3 16:59:09 2002
@@ -486,6 +486,17 @@
pgd_t *pgd = pgd_offset(mm, vma->vm_start);
int pages = 0, shared = 0, dirty = 0, total = 0;

+ if (is_vm_large_page(vma)) {
+ int num_pages = ((vma->vm_end - vma->vm_start)/PAGE_SIZE);
+ resident += num_pages;
+ if ((vma->vm_flags & VM_DONTCOPY))
+ share += num_pages;
+ if (vma->vm_flags & VM_WRITE)
+ dt += num_pages;
+ drs += num_pages;
+ vma = vma->vm_next;
+ continue;
+ }
statm_pgd_range(pgd, vma->vm_start, vma->vm_end, &pages, &shared, &dirty, &total);
resident += pages;
share += shared;
diff -Naru linux.org/fs/proc/proc_misc.c linux.lp/fs/proc/proc_misc.c
--- linux.org/fs/proc/proc_misc.c Tue Nov 20 21:29:09 2001
+++ linux.lp/fs/proc/proc_misc.c Wed Jul 3 10:48:21 2002
@@ -151,6 +151,14 @@
B(i.sharedram), B(i.bufferram),
B(pg_size), B(i.totalswap),
B(i.totalswap-i.freeswap), B(i.freeswap));
+#ifdef CONFIG_LARGE_PAGE
+ {
+ extern unsigned long lpagemem, lpzone_pages;
+ len += sprintf(page+len,"Total # of LargePages: %8lu\t\tAvailable: %8lu\n"
+ "LargePageSize: %8lu(0x%xKB)\n",
+ lpzone_pages, lpagemem, LPAGE_SIZE, (LPAGE_SIZE/1024));
+ }
+#endif
/*
* Tagged format, for easy grepping and expansion.
* The above will go away eventually, once the tools
diff -Naru linux.org/include/asm-i386/page.h linux.lp/include/asm-i386/page.h
--- linux.org/include/asm-i386/page.h Mon Feb 25 11:38:12 2002
+++ linux.lp/include/asm-i386/page.h Wed Jul 3 10:49:54 2002
@@ -41,14 +41,22 @@
typedef struct { unsigned long long pmd; } pmd_t;
typedef struct { unsigned long long pgd; } pgd_t;
#define pte_val(x) ((x).pte_low | ((unsigned long long)(x).pte_high << 32))
+#define LPAGE_SHIFT 21
#else
typedef struct { unsigned long pte_low; } pte_t;
typedef struct { unsigned long pmd; } pmd_t;
typedef struct { unsigned long pgd; } pgd_t;
#define pte_val(x) ((x).pte_low)
+#define LPAGE_SHIFT 22
#endif
#define PTE_MASK PAGE_MASK

+#ifdef CONFIG_LARGE_PAGE
+#define LPAGE_SIZE ((1UL) << LPAGE_SHIFT)
+#define LPAGE_MASK (~(LPAGE_SIZE - 1))
+#define LARGE_PAGE_ORDER (LPAGE_SHIFT - PAGE_SHIFT)
+#endif
+
typedef struct { unsigned long pgprot; } pgprot_t;

#define pmd_val(x) ((x).pmd)
diff -Naru linux.org/include/linux/mm.h linux.lp/include/linux/mm.h
--- linux.org/include/linux/mm.h Fri Dec 21 09:42:03 2001
+++ linux.lp/include/linux/mm.h Wed Jul 3 10:49:54 2002
@@ -103,6 +103,7 @@
#define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
#define VM_RESERVED 0x00080000 /* Don't unmap it from swap_out */

+#define VM_LARGEPAGE 0x00400000 /* Large_Page mapping. */
#define VM_STACK_FLAGS 0x00000177

#define VM_READHINTMASK (VM_SEQ_READ | VM_RAND_READ)
@@ -425,6 +426,16 @@
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);

+#ifdef CONFIG_LARGE_PAGE
+#define is_vm_large_page(vma) (vma->vm_flags & VM_LARGEPAGE)
+extern int copy_lpage_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
+extern int follow_large_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
+#else
+#define is_vm_large_page(vma) (0)
+#define follow_large_page(mm, vma, pages, vmas, start, len, i) (0)
+#define copy_lpage_range(dst, src, vma) (0)
+#endif
+
/*
* On a two-level page table, this ends up being trivial. Thus the
* inlining and the symmetry break with pte_alloc() that does all
diff -Naru linux.org/include/linux/mmzone.h linux.lp/include/linux/mmzone.h
--- linux.org/include/linux/mmzone.h Thu Nov 22 11:46:19 2001
+++ linux.lp/include/linux/mmzone.h Wed Jul 3 10:49:54 2002
@@ -13,7 +13,7 @@
*/

#ifndef CONFIG_FORCE_MAX_ZONEORDER
-#define MAX_ORDER 10
+#define MAX_ORDER 15
#else
#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
#endif
diff -Naru linux.org/include/linux/sysctl.h linux.lp/include/linux/sysctl.h
--- linux.org/include/linux/sysctl.h Mon Nov 26 05:29:17 2001
+++ linux.lp/include/linux/sysctl.h Wed Jul 3 10:49:54 2002
@@ -124,6 +124,7 @@
KERN_CORE_USES_PID=52, /* int: use core or core.%pid */
KERN_TAINTED=53, /* int: various kernel tainted flags */
KERN_CADPID=54, /* int: PID of the process to notify on CAD */
+ KERN_LARGE_PAGE_MEM=55, /* Number of large_page pages configured */
};


diff -Naru linux.org/kernel/sysctl.c linux.lp/kernel/sysctl.c
--- linux.org/kernel/sysctl.c Fri Dec 21 09:42:04 2001
+++ linux.lp/kernel/sysctl.c Tue Jul 2 14:07:28 2002
@@ -96,6 +96,10 @@
extern int acct_parm[];
#endif

+#ifdef CONFIG_LARGE_PAGE
+extern int lp_max;
+extern int change_large_page_mem_size(int );
+#endif
extern int pgt_cache_water[];

static int parse_table(int *, int, void *, size_t *, void *, size_t,
@@ -256,6 +260,10 @@
{KERN_S390_USER_DEBUG_LOGGING,"userprocess_debug",
&sysctl_userprocess_debug,sizeof(int),0644,NULL,&proc_dointvec},
#endif
+#ifdef CONFIG_LARGE_PAGE
+ {KERN_LARGE_PAGE_MEM, "numlargepages", &lp_max, sizeof(int), 0644, NULL,
+ &proc_dointvec},
+#endif
{0}
};

@@ -866,6 +874,10 @@
val = -val;
buffer += len;
left -= len;
+#if CONFIG_LARGE_PAGE
+ if (i == &lp_max)
+ val = change_large_page_mem_size(val);
+#endif
switch(op) {
case OP_SET: *i = val; break;
case OP_AND: *i &= val; break;
diff -Naru linux.org/mm/memory.c linux.lp/mm/memory.c
--- linux.org/mm/memory.c Mon Feb 25 11:38:13 2002
+++ linux.lp/mm/memory.c Wed Jul 3 16:14:01 2002
@@ -179,6 +179,9 @@
unsigned long end = vma->vm_end;
unsigned long cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

+ if (is_vm_large_page(vma) )
+ return copy_lpage_range(dst, src, vma);
+
src_pgd = pgd_offset(src, address)-1;
dst_pgd = pgd_offset(dst, address)-1;

@@ -471,6 +474,10 @@
if ( !vma || (pages && vma->vm_flags & VM_IO) || !(flags & vma->vm_flags) )
return i ? : -EFAULT;

+ if (is_vm_large_page(vma)) {
+ i += follow_large_page(mm, vma, pages, vmas, &start, &len, i);
+ continue;
+ }
spin_lock(&mm->page_table_lock);
do {
struct page *map;
@@ -1360,6 +1367,8 @@
{
pgd_t *pgd;
pmd_t *pmd;
+ if (is_vm_large_page(vma) )
+ return -1;

current->state = TASK_RUNNING;
pgd = pgd_offset(mm, address);
diff -Naru linux.org/mm/mmap.c linux.lp/mm/mmap.c
--- linux.org/mm/mmap.c Mon Feb 25 11:38:14 2002
+++ linux.lp/mm/mmap.c Tue Jul 2 14:15:50 2002
@@ -917,6 +917,9 @@
if (mpnt->vm_start >= addr+len)
return 0;

+ if (is_vm_large_page(mpnt)) /*Large pages can not be unmapped like this. */
+ return -EINVAL;
+
/* If we'll make "hole", check the vm areas limit */
if ((mpnt->vm_start < addr && mpnt->vm_end > addr+len)
&& mm->map_count >= MAX_MAP_COUNT)
diff -Naru linux.org/mm/mprotect.c linux.lp/mm/mprotect.c
--- linux.org/mm/mprotect.c Mon Sep 17 15:30:23 2001
+++ linux.lp/mm/mprotect.c Tue Jul 2 14:18:13 2002
@@ -287,6 +287,8 @@
error = -EFAULT;
if (!vma || vma->vm_start > start)
goto out;
+ if (is_vm_large_page(vma))
+ return -EINVAL; /* Can't change protections on large_page mappings. */

for (nstart = start ; ; ) {
unsigned int newflags;
diff -Naru linux.org/mm/mremap.c linux.lp/mm/mremap.c
--- linux.org/mm/mremap.c Thu Sep 20 20:31:26 2001
+++ linux.lp/mm/mremap.c Tue Jul 2 14:20:05 2002
@@ -267,6 +267,10 @@
vma = find_vma(current->mm, addr);
if (!vma || vma->vm_start > addr)
goto out;
+ if (is_vm_large_page(vma)) {
+ ret = -EINVAL; /* Can't remap large_page mappings. */
+ goto out;
+ }
/* We can't remap across vm area boundaries */
if (old_len > vma->vm_end - addr)
goto out;


2002-08-02 00:51:31

by David Miller

[permalink] [raw]
Subject: Re: large page patch

From: Andrew Morton <[email protected]>
Date: Thu, 01 Aug 2002 17:37:46 -0700

Some observations which have been made thus far:

- Minimal impact on the VM and MM layers

Well the downside of this is that it means it isn't transparent
to userspace. For example, specfp2000 results aren't going to
improve after installing these changes. Some of the other large
page implementations would.

- The change to MAX_ORDER is unneeded

This is probably done to increase the likelihood that 4MB page orders
are available. If we collapse 4MB pages deeper, they are less likely
to be broken up because smaller orders would be selected first.

Maybe it doesn't make a difference....

- swapping of large pages and making them pagecache-coherent is
unpopular.

Swapping them is easy, any time you hit a large PTE you unlarge it.
This is what some of other large page implementations do. Basically
the implementation is that set_pte() breaks apart large ptes when
necessary.

I agree on the pagecache side.

Actually to be honest the other implementations seemed less
intrusive and easier to add support for. The downside is that
handling of weird cases like x86 using pmd's for 4MB pages
was not complete last time I checked.

2002-08-02 01:08:05

by Martin J. Bligh

[permalink] [raw]
Subject: Re: large page patch

> - The change to MAX_ORDER is unneeded

It's not only unneeded, it's detrimental. Not only will we spend more
time merging stuff up and down to no effect, it also makes the
config_nonlinear stuff harder (or we have to #ifdef it, which just causes
more unnecessary differentiation). Please don't do that little bit ....

M.


2002-08-02 01:25:26

by Andrew Morton

[permalink] [raw]
Subject: Re: large page patch

"David S. Miller" wrote:
>
> From: Andrew Morton <[email protected]>
> Date: Thu, 01 Aug 2002 17:37:46 -0700
>
> Some observations which have been made thus far:
>
> - Minimal impact on the VM and MM layers
>
> Well the downside of this is that it means it isn't transparent
> to userspace. For example, specfp2000 results aren't going to
> improve after installing these changes. Some of the other large
> page implementations would.
>
> - The change to MAX_ORDER is unneeded
>
> This is probably done to increase the likelihood that 4MB page orders
> are available. If we collapse 4MB pages deeper, they are less likely
> to be broken up because smaller orders would be selected first.

This is leakage from ia64, which supports up to 256k pages.

> Maybe it doesn't make a difference....
>
> - swapping of large pages and making them pagecache-coherent is
> unpopular.
>
> Swapping them is easy, any time you hit a large PTE you unlarge it.
> This is what some of other large page implementations do. Basically
> the implementation is that set_pte() breaks apart large ptes when
> necessary.

As far as mm/*.c is concerned, there is no pte. It's just a vma
which is marked "don't touch". These pages aren't on the LRU, nothing
knows about them.

Apparently a page-table based representation could not be used by PPC.

2002-08-02 01:28:14

by David Miller

[permalink] [raw]
Subject: Re: large page patch

From: Andrew Morton <[email protected]>
Date: Thu, 01 Aug 2002 18:26:40 -0700

"David S. Miller" wrote:
> This is probably done to increase the likelihood that 4MB page orders
> are available. If we collapse 4MB pages deeper, they are less likely
> to be broken up because smaller orders would be selected first.

This is leakage from ia64, which supports up to 256k pages.

Ummm, 4MB > 256K and even with a 4K PAGE_SIZE MAX_ORDER coalesces
up to 4MB already :-)

Apparently a page-table based representation could not be used by PPC.

The page-table is just an abstraction, there is no reason dummy
"large" ptes could not be used which are just ignored by the HW TLB
reload code.

2002-08-02 01:30:46

by Rohit Seth

[permalink] [raw]
Subject: RE: large page patch

There is a typo in Andrew's mail. It is not 256K, it is 256MB.

-----Original Message-----
From: David S. Miller [mailto:[email protected]]
Sent: Thursday, August 01, 2002 6:20 PM
To: [email protected]
Cc: [email protected]; [email protected];
[email protected]; [email protected]; [email protected]
Subject: Re: large page patch


From: Andrew Morton <[email protected]>
Date: Thu, 01 Aug 2002 18:26:40 -0700

"David S. Miller" wrote:
> This is probably done to increase the likelihood that 4MB page orders
> are available. If we collapse 4MB pages deeper, they are less likely
> to be broken up because smaller orders would be selected first.

This is leakage from ia64, which supports up to 256k pages.

Ummm, 4MB > 256K and even with a 4K PAGE_SIZE MAX_ORDER coalesces
up to 4MB already :-)

Apparently a page-table based representation could not be used by PPC.

The page-table is just an abstraction, there is no reason dummy
"large" ptes could not be used which are just ignored by the HW TLB
reload code.

2002-08-02 01:52:12

by Rik van Riel

[permalink] [raw]
Subject: Re: large page patch

On Thu, 1 Aug 2002, David S. Miller wrote:
> From: Andrew Morton <[email protected]>

> - Minimal impact on the VM and MM layers
>
> Well the downside of this is that it means it isn't transparent
> to userspace. For example, specfp2000 results aren't going to
> improve after installing these changes. Some of the other large
> page implementations would.

It also means we can't automatically switch to large pages for
SHM segments, which is the number one area where we need large
pages...

We should also take into account that the main application that
needs large pages for its SHM segments is Oracle, which we don't
have the source code for so we can't recompile it to use the new
syscalls introduced by this patch ...

IMHO we shouldn't blindly decide for (or against!) this patch
but also carefully look at the large page patch from RHAS (which
got added to -aa recently) and the large page patch which IBM
is working on.

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/



2002-08-02 01:59:01

by David Miller

[permalink] [raw]
Subject: Re: large page patch

From: Rik van Riel <[email protected]>
Date: Thu, 1 Aug 2002 22:55:05 -0300 (BRT)

IMHO we shouldn't blindly decide for (or against!) this patch
but also carefully look at the large page patch from RHAS (which
got added to -aa recently) and the large page patch which IBM
is working on.

And the one from Naohiko Shimizu which is my personal favorite
because sparc64 support is there :)

http://shimizu-lab.dt.u-tokai.ac.jp/lsp.html

2002-08-02 02:28:33

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: large page patch

In message <[email protected]>, Rik van Riel writes:
> On Thu, 1 Aug 2002, David S. Miller wrote:
> > From: Andrew Morton <[email protected]>
>
> > - Minimal impact on the VM and MM layers
> >
> > Well the downside of this is that it means it isn't transparent
> > to userspace. For example, specfp2000 results aren't going to
> > improve after installing these changes. Some of the other large
> > page implementations would.
>
> We should also take into account that the main application that
> needs large pages for its SHM segments is Oracle, which we don't
> have the source code for so we can't recompile it to use the new
> syscalls introduced by this patch ...

There are quite a few other applications that can benefit from large
page support. IBM Watson Research published JVM and some scientific
workload results using large pages which showed substantial benefits.
Also, we believe DB2, Domino, other memory piggish apps (e.g. think
scientific) would benefit equally on many architectures.

It would sure be nice if the interface wasn't some kludgey back door
but more integrated with things like mmap() or shm*(), with semantics
and behaviors that were roughly more predictable. Other than that,
no comments as yet on the patch internals...

gerrit

2002-08-02 02:31:39

by David Miller

[permalink] [raw]
Subject: Re: large page patch

From: Gerrit Huizenga <[email protected]>
Date: Thu, 01 Aug 2002 19:29:52 -0700

other memory piggish apps (e.g. think scientific) would benefit

There are "benchmark'ish" examples on Naohiko Shimizu's
large page project page.

2002-08-02 02:52:09

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: large page patch

In message <[email protected]>, "David S. Miller" writes:
> From: Gerrit Huizenga <[email protected]>
> Date: Thu, 01 Aug 2002 19:29:52 -0700
>
> other memory piggish apps (e.g. think scientific) would benefit
>
> There are "benchmark'ish" examples on Naohiko Shimizu's
> large page project page.

No 2.5 code, though. :( But there *is* 2.2.16 code!

gerrit

2002-08-02 03:20:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: large page patch

In article <[email protected]>,
David S. Miller <[email protected]> wrote:
> From: Andrew Morton <[email protected]>
> Date: Thu, 01 Aug 2002 18:26:40 -0700
>
> "David S. Miller" wrote:
> > This is probably done to increase the likelihood that 4MB page orders
> > are available. If we collapse 4MB pages deeper, they are less likely
> > to be broken up because smaller orders would be selected first.
>
> This is leakage from ia64, which supports up to 256k pages.
>
>Ummm, 4MB > 256K and even with a 4K PAGE_SIZE MAX_ORDER coalesces
>up to 4MB already :-)

That should be 256_M_ pages (13 bits of page size + 15 bits of MAX_ORDER
gives you 256MB max).

Linus

2002-08-02 03:44:01

by William Lee Irwin III

[permalink] [raw]
Subject: Re: large page patch

On Thu, Aug 01, 2002 at 05:37:46PM -0700, Andrew Morton wrote:
> This is a large-page support patch from Rohit Seth, forwarded
> with his permission (thanks!).

Overall, the code looks very clean.

(1) So there are now 4 of these things. How do they compare to each
other? Where are the comparison benchmarks? How do their
features compare? Which one(s) do users want?

(2) The allocation policies for pagetables mapping the things may as
well do some kind of lookup, sharing, and cacheing; it's likely
a significant number of the users of the shm segment will be
mapping them more or less the same way given database usage
patterns. It's not a significant amount of space, but kernels
should be frugal about space, and with many tasks as is typical
of databases, the savings may well add up to a small but
respectable chunk of ZONE_NORMAL.

(3) As long as the interface is explicit, it might as well drop flags
into shm and mmap. There isn't even C library support for these
things as they are... time to int $0x80 again so I can test.

(4) Requiring app awareness of page alignment looks like an irritating
porting issue, which doesn't sound as trivial as it would
otherwise be in already extremely cramped 32-bit virtual
address spaces.

(5) What's in it for the average user? It's doubtful GNOME will be
registering memory blocks with these syscalls anytime soon.
Granted, the opportunities for reducing TLB load this way
are small on desktop systems, but it doesn't feel quite
right to just throw mappings of magic physical memory into the
hands of a few enlightened apps on machines with memory to burn
and leave all others in the cold.
By several accounts "scalability" is defined as "performing as
well on large machines as it does on small ones" ... but this
seems to be a method of circumventing the kernel's own memory
management as opposed to a method of improving it in all cases.

(6) As far as reconfiguring, I'm slightly concerned about the robustness
of change_large_page_mem_size() in terms of how likely it is to
succeed. Some on-demand defragmentation looks like it should be
implemented to make it more reliable (now possible thanks to
rmap). In general, the sysctl seems to lack some adaptivity.
Granting root privileges to the workload vs. perpetual
monitoring to find the ideal pool size sounds like a headache.

(7) I'm a little worried by the following:

zone(0): 4096 pages.
zone(1): 225280 pages.
BUG: wrong zone alignment, it will crash
zone(2): 3964928 pages.


My machine doesn't seem to care, but others' might.


Cheers,
Bill

2002-08-02 04:03:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: large page patch

In article <737220000.1028250590@flay>,
Martin J. Bligh <[email protected]> wrote:
>> - The change to MAX_ORDER is unneeded
>
>It's not only unneeded, it's detrimental. Not only will we spend more
>time merging stuff up and down to no effect

I doubt that. At least the naive math says that it should get
exponentially less likely(*) to merge up/down for each level, so by the
time you've reached order-10, any merging is already in the noise and
totally unmeasurable.

And the memory footprint of the bitmaps etc should be basically zero
(since they too shrink exponentially for each order).

((*) The "exponentially less likely" simply comes from doing the trivial
experiment of what would happen if you allocated all pages in-order one
at a time, and then free'd them one at a time. Obviously not a
realistic test, but on the other hand a realistic kernel load tends to
keep a fairly fixed fraction of memory free, which makes it sound
extremely unlikely to me that you'd get sudden collapses/buildups either.
The likelihood of being at just the right border for that to happen
_also_ happens to be decreasing as 2**-n)

Of course, if you can actually measure it, that would be interesting.
Naive math gives you a guess for the order of magnitude effect, but
nothing beats real numbers ;)

> It also makes the config_nonlinear stuff harder (or we have to
> #ifdef it, which just causes more unnecessary differentiation).

Hmm.. This sounds like a good point, but I thought we already did all
the math relative to the start of the zone, so that the alignment thing
implied by MAX_ORDER shouldn't be an issue.

Or were you thinking of some other effect?

Linus

2002-08-02 04:22:25

by David Miller

[permalink] [raw]
Subject: Re: large page patch

From: [email protected] (Linus Torvalds)
Date: Fri, 2 Aug 2002 04:07:10 +0000 (UTC)

Of course, if you can actually measure it, that would be
interesting. Naive math gives you a guess for the order of
magnitude effect, but nothing beats real numbers ;)

The SYSV folks actually did have a buddy allocator a long time ago and
they did implement lazy coalescing because is supposedly improved
performance.

See chapter 12 section 7 in "Unix Internals" by Uresh Vahalia.

2002-08-02 04:26:54

by Daniel Phillips

[permalink] [raw]
Subject: Re: large page patch

On Friday 02 August 2002 03:36, Andrew Morton wrote:
> Merged up to 2.5.30. It compiles.
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.30/lpp.patch

What was the original against?

--
Daniel

2002-08-02 04:27:25

by William Lee Irwin III

[permalink] [raw]
Subject: Re: large page patch

From: [email protected] (Linus Torvalds)
Date: Fri, 2 Aug 2002 04:07:10 +0000 (UTC)
> Of course, if you can actually measure it, that would be
> interesting. Naive math gives you a guess for the order of
> magnitude effect, but nothing beats real numbers ;)

On Thu, Aug 01, 2002 at 09:13:57PM -0700, David S. Miller wrote:
> The SYSV folks actually did have a buddy allocator a long time ago and
> they did implement lazy coalescing because is supposedly improved
> performance.
> See chapter 12 section 7 in "Unix Internals" by Uresh Vahalia.

And I've implemented it for Linux.

ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/lazy_buddy/


Cheers,
Bill

2002-08-02 04:29:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: large page patch


On Thu, 1 Aug 2002, David S. Miller wrote:
>
> Of course, if you can actually measure it, that would be
> interesting. Naive math gives you a guess for the order of
> magnitude effect, but nothing beats real numbers ;)
>
> The SYSV folks actually did have a buddy allocator a long time ago and
> they did implement lazy coalescing because it supposedly improved
> performance.

I bet that is mainly because of CPU scalability, and being able to avoid
touching the buddy lists from multiple CPU's - the same reason _we_ have
the per-CPU front-ends on various allocators.

I doubt it is because buddy matters past the 4MB mark. I just can't see
how you can avoid the naive math which says that it should be 1/512th as
common to coalesce to 4MB as it is to coalesce to 8kB.

Walking the buddy bitmaps for a few levels (ie up to order 3 or 4) is
probably quite common, and it's likely to be bad from a SMP cache
standpoint (touching a few bits with what must be fairly random patterns).
So avoiding the buddy with a simple front-end is likely to win you
something, without actually being meaningful at the MAX_ORDER point.

Linus

2002-08-02 04:34:57

by Andrew Morton

[permalink] [raw]
Subject: Re: large page patch

Daniel Phillips wrote:
>
> On Friday 02 August 2002 03:36, Andrew Morton wrote:
> > Merged up to 2.5.30. It compiles.
> >
> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.30/lpp.patch
>
> What was the original against?

2.4.18

2002-08-02 04:36:15

by Martin J. Bligh

[permalink] [raw]
Subject: Re: large page patch

Direct email seemed to get separated from the cc somewhere
along the line ... repeated for others on l-k (sorry Linus ;-))

> I doubt that. At least the naive math says that it should get
> exponentially less likely(*) to merge up/down for each level, so by the
> time you've reached order-10, any merging is already in the noise and
> totally unmeasurable.

Yeah, it's probably unmeasurable, just ugly ;-)
I guess it's more that it seems unnecessary ... if ia64 are the
only people that need it to be that ludicrously large, it'd be
better if they just did it in their arch tree. Just because they
could theoretically have 256Mb pages, do they really *need* them? ;-)

>> It also makes the config_nonlinear stuff harder (or we have to
>> # ifdef it, which just causes more unnecessary differentiation).
>
> Hmm.. This sounds like a good point, but I thought we already did all
> the math relative to the start of the zone, so that the alignment thing
> implied by MAX_ORDER shouldn't be an issue.
>
> Or were you thinking of some other effect?

The config_nonlinear stuff relies on a trick ... we shove physically
non-contig areas into the buddy allocator, but the buddy allocator
is guaranteed to return phys contig areas. That all works just fine
as long as the blocks we put in are of size greater than or equal to
2^MAX_ORDER * PAGE_SIZE, which is currently 4Mb. A 4Mb alignment is
not a problem for any known machine, but I think 256Mb may well be.
It's kind of a dirty trick, but it's a really neat, efficient
solution that gets rid of lots of zone balancing and pgdat proliferation.
It also lets me spread around ZONE_NORMAL across nodes for ia32 NUMA.

M.

2002-08-02 05:08:43

by William Lee Irwin III

[permalink] [raw]
Subject: Re: large page patch

On Thu, Aug 01, 2002 at 09:32:44PM -0700, Linus Torvalds wrote:
> I bet that is mainly because of CPU scalability, and being able to avoid
> touching the buddy lists from multiple CPU's - the same reason _we_ have
> the per-CPU front-ends on various allocators.
> I doubt it is because buddy matters past the 4MB mark. I just can't see
> how you can avoid the naive math which says that it should be 1/512th as
> common to coalesce to 4MB as it is to coalesce to 8kB.
> Walking the buddy bitmaps for a few levels (ie up to order 3 or 4) is
> probably quite common, and it's likely to be bad from a SMP cache
> standpoint (touching a few bits with what must be fairly random patterns).
> So avoiding the buddy with a simple front-end is likely to win you
> something, without actually being meaningful at the MAX_ORDER point.

This is actually part of my strategy.

By properly organizing the deferred queues into lists of lists and
maintaining a small per-cpu cache of pages, a "cache fill" involves
doing a single list deletion under the zone->lock and the remainder
of the work to fill a pagevec occurs outside the lock, reducing the
mean hold time down to ridiculous lows. And since the allocations
are batched, the arrival rate is then divided by the batch size.
Conversely, frees are also batched and the same effect achieved with
the dual operations.

i.e. magazines for the page-level allocator

This can't be achieved with a pure buddy system, as it must examine
individual pages one-by-one to keep the bitmaps updated. Vahalia
discusses the general approach in another section, and integration with
buddy systems (and other allocators) in an exercise.
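
Roughly, the magazine idea is the following (a hand-waving sketch, not the
actual lazy_buddy code; struct page_batch, the zone->batches list and the
helper names are all made up): batches of pages are pre-assembled, so
refilling a CPU's magazine detaches one whole batch under zone->lock with a
single list operation, and the per-page work happens outside the lock.

	/* Illustrative sketch only -- invented names, no error handling. */
	struct page_batch {
		struct list_head list;	/* links batches on the zone */
		struct list_head pages;	/* the pages in this batch */
		int count;
	};

	struct percpu_magazine {
		struct list_head pages;	/* private to this CPU, unlocked */
		int count;
	};

	static struct page *magazine_alloc(struct zone *zone,
					   struct percpu_magazine *mag)
	{
		struct page *page;

		if (!mag->count) {
			struct page_batch *batch;

			/* One list deletion under the lock... */
			spin_lock(&zone->lock);
			batch = list_entry(zone->batches.next,
					   struct page_batch, list);
			list_del(&batch->list);
			spin_unlock(&zone->lock);

			/* ...the per-page work happens outside it.
			 * (Freeing the empty batch header is omitted.) */
			list_splice(&batch->pages, &mag->pages);
			mag->count = batch->count;
		}
		page = list_entry(mag->pages.next, struct page, list);
		list_del(&page->list);
		mag->count--;
		return page;
	}

Frees go the other way: pages collect in the magazine and a full batch is
hung back on the zone with a single list operation under the lock.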


Cheers,
Bill

2002-08-02 05:21:00

by David Mosberger

[permalink] [raw]
Subject: Re: large page patch

>>>>> On Thu, 01 Aug 2002 19:29:52 -0700, Gerrit Huizenga <[email protected]> said:

Gerrit> It would sure be nice if the interface wasn't some kludgey
Gerrit> back door but more integrated with things like mmap() or
Gerrit> shm*(), with semantics and behaviors that were roughly more
Gerrit> predictable. Other than that, no comments as yet on the
Gerrit> patch internals...

In my opinion the proposed large-page patch addresses a relatively
pressing need for databases (primarily). Longer term, I'd hope that
it can be replaced by a transparent superpage scheme. But the
existing patch can also serve as a nice benchmark for transparent
schemes (and frankly, since it doesn't have to do anything smart
behind the scenes, it's likely that the existing patch, where
applicable, will always do slightly better than a transparent one).

In any case, the big issue of physical memory fragmentation can be
experimented with independent of what the user-level interface looks like.
So the existing patch is useful in that sense as well.

--david

2002-08-02 05:30:30

by David Miller

[permalink] [raw]
Subject: Re: large page patch

From: David Mosberger <[email protected]>
Date: Thu, 1 Aug 2002 22:24:05 -0700

In my opinion the proposed large-page patch addresses a relatively
pressing need for databases (primarily).

Databases want large pages with IPC_SHM, how can this special
syscal hack address that?

It's great for experimentation, but give up syscall slots for
this?

2002-08-02 06:22:51

by David Mosberger

[permalink] [raw]
Subject: Re: large page patch

>>>>> On Thu, 01 Aug 2002 22:20:53 -0700 (PDT), "David S. Miller" <[email protected]> said:

DaveM> From: David Mosberger <[email protected]> Date:
DaveM> Thu, 1 Aug 2002 22:24:05 -0700

DaveM> In my opinion the proposed large-page patch addresses a
DaveM> relatively pressing need for databases (primarily).

DaveM> Databases want large pages with IPC_SHM, how can this
DaveM> special syscall hack address that?

I believe the interface is OK in that regard. AFAIK, Oracle is happy
with it.

DaveM> It's great for experimentation, but give up syscall slots
DaveM> for this?

I'm a bit concerned about this, too. My preference would have been to
use the regular mmap() and shmat() syscalls with some
augmentation/hint as to what the preferred page size is (Simon
Winwood's OLS 2002 paper talks about some options here). I like this
because hints could be useful even with a transparent superpage
scheme.

The original Intel patch did use more of a hint-like approach (the
hint was a simple binary flag though: give me regular pages or give me
large pages), but Linus preferred a separate syscall interface, so the
Intel folks switched over to doing that.
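
For comparison, such a flag-based variant would look roughly like this from
userspace (MAP_LARGEPAGE and SHM_LARGEPAGE are invented names used purely
for illustration -- neither this patch nor any kernel of this era defines
them):

	#define _GNU_SOURCE
	#include <stddef.h>
	#include <sys/ipc.h>
	#include <sys/mman.h>
	#include <sys/shm.h>

	/* Hypothetical flag values, for illustration only. */
	#define MAP_LARGEPAGE	0x40000
	#define SHM_LARGEPAGE	04000

	static void *map_large_anon(size_t len)
	{
		/* The kernel would round len up to the large page size and
		 * could fall back to normal pages if the pool is empty. */
		return mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_LARGEPAGE,
			    -1, 0);
	}

	static void *attach_large_shm(size_t len)
	{
		int id = shmget(IPC_PRIVATE, len,
				IPC_CREAT | SHM_LARGEPAGE | 0600);
		return id < 0 ? (void *)-1 : shmat(id, NULL, 0);
	}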

--david

2002-08-02 06:31:45

by Martin J. Bligh

[permalink] [raw]
Subject: Re: large page patch

> DaveM> In my opinion the proposed large-page patch addresses a
> DaveM> relatively pressing need for databases (primarily).
>
> DaveM> Databases want large pages with IPC_SHM, how can this
> DaveM> special syscal hack address that?
>
> I believe the interface is OK in that regard. AFAIK, Oracle is happy
> with it.

Is Oracle now the world's only database? I think not.

> DaveM> It's great for experimentation, but give up syscall slots
> DaveM> for this?
>
> I'm a bit concerned about this, too. My preference would have been to
> use the regular mmap() and shmat() syscalls with some
> augmentation/hint as to what the preferred page size is

I think that's what most users would prefer, and I don't think it
adds a vast amount of kernel complexity. Linus doesn't seem to
be dead set against the shmem modifications at least ... so that's
half way there ;-)

M.

2002-08-02 06:41:41

by David Mosberger

[permalink] [raw]
Subject: Re: large page patch

>>>>> On Thu, 01 Aug 2002 23:33:26 -0700, "Martin J. Bligh" <[email protected]> said:

DaveM> In my opinion the proposed large-page patch addresses a
DaveM> relatively pressing need for databases (primarily).
>>
DaveM> Databases want large pages with IPC_SHM, how can this special
DaveM> syscall hack address that?

>> I believe the interface is OK in that regard. AFAIK, Oracle is
>> happy with it.

Martin> Is Oracle now the world's only database? I think not.

I didn't say such a thing. I just don't know what other db vendors/authors
think of the proposed interface. I'm sure their feedback would be welcome.

--david

2002-08-02 06:56:10

by Andrew Morton

[permalink] [raw]
Subject: Re: large page patch

"Martin J. Bligh" wrote:
>
> > DaveM> In my opinion the proposed large-page patch addresses a
> > DaveM> relatively pressing need for databases (primarily).
> >
> > DaveM> Databases want large pages with IPC_SHM, how can this
> > DaveM> special syscal hack address that?
> >
> > I believe the interface is OK in that regard. AFAIK, Oracle is happy
> > with it.
>
> Is Oracle now the world's only database? I think not.

Is a draft of Simon's patch available against 2.5?

-

2002-08-02 07:12:21

by William Lee Irwin III

[permalink] [raw]
Subject: Re: large page patch

At some point in the past, Dave Miller wrote:
DaveM> In my opinion the proposed large-page patch addresses a
DaveM> relatively pressing need for databases (primarily).
DaveM> Databases want large pages with IPC_SHM, how can this
DaveM> special syscall hack address that?

At some point in the past, David Mosberger wrote:
I believe the interface is OK in that regard. AFAIK, Oracle is happy
with it.

"Martin J. Bligh" wrote:
>> Is Oracle now the world's only database? I think not.

On Fri, Aug 02, 2002 at 12:08:50AM -0700, Andrew Morton wrote:
> Is a draft of Simon's patch available against 2.5?

Unless I can turn blood into wine, walk on water, and produce a working
2.5 version of the thing in < 6 hours (not that I'm not trying), this
will probably have to wait until Hubertus rematerializes tomorrow
morning (EDT) and further porting is done. I'll be up early.

Cheers,
Bill

2002-08-02 07:17:29

by Andrew Morton

[permalink] [raw]
Subject: Re: large page patch

Linus Torvalds wrote:
>
> On Thu, 1 Aug 2002, David S. Miller wrote:
> >
> > Of course, if you can actually measure it, that would be
> > interesting. Naive math gives you a guess for the order of
> > magnitude effect, but nothing beats real numbers ;)
> >
> > The SYSV folks actually did have a buddy allocator a long time ago and
> > they did implement lazy coalescing because it supposedly improved
> > performance.
>
> I bet that is mainly because of CPU scalability, and being able to avoid
> touching the buddy lists from multiple CPU's - the same reason _we_ have
> the per-CPU front-ends on various allocators.
>
> I doubt it is because buddy matters past the 4MB mark. I just can't see
> how you can avoid the naive math which says that it should be 1/512th as
> common to coalesce to 4MB as it is to coalesce to 8kB.

Buddy costs tend to be down in the noise compared with the cost
of the zone->lock.

I did a per-cpu pages patch a while back which, when it takes that
lock, grabs 16 pages or frees 16 pages. Anton tested it on the
12-way: http://samba.org/~anton/linux/2.5.9/ blue -> purple

The cost of rmqueue() and __free_pages_ok went from 13% of system
time down to 2%. So that 2% speedup is all that's available by fiddling
with the buddy algorithm (I think). And I bet most of that is still taking
the lock.

Didn't submit the patch because I think a per-cpu page buffer is a bit of
a dopey cop-out. I have patches here which make most of the page-intensive
fastpaths in the kernel stop using single pages and start using 16-page batches.

That will make a 16-page allocation request just a natural thing
to do. But we will need a per-cpu buffer to wring the last drops
out of anonymous pagefaults and generic_file_write(), which do not
lend themselves to gang allocation.

2002-08-02 08:30:18

by David Miller

[permalink] [raw]
Subject: Re: large page patch

From: David Mosberger <[email protected]>
Date: Thu, 1 Aug 2002 23:26:07 -0700

I'm a bit concerned about this, too. My preference would have been to
use the regular mmap() and shmat() syscalls with some
augmentation/hint as to what the preferred page size is (Simon
Winwood's OLS 2002 paper talks about some options here). I like this
because hints could be useful even with a transparent superpage
scheme.

A "hint" to use superpages? That's absurd.

Any time you are able to translate N pages instead of 1 page with 1
TLB entry it's always preferable.

I also don't buy the swapping complexity bit. The fact is, SHM and
anonymous pages are easy. Just stay away from the page cache and it
is pretty simple to just make the normal VM do this.

If set_pte sees a large page, you simply undo the large ptes in that
group and the complexity ends right there. This means the only maker
of large pages is the bit that creates the large mappings and has all
of the ideal conditions up front. Any time anything happens to that
pte you undo the large'ness of it so that you get normal PAGE_SIZE
ptes back.

Using superpages for anonymous+SHM pages is really the only area where I
still think Linux's MM can offer inferior performance compared to what
the hardware is actually capable of.
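
As a rough model of that demote-on-touch rule (all structures and helpers
below are hypothetical stand-ins, not kernel code), the only path that ever
builds a large mapping is the creation path; any later set_pte into that
range first shatters it back into base-size ptes:

#include <stdbool.h>
#include <stdint.h>

#define PTES_PER_LARGE  1024    /* 4MB large page / 4kB base page */

struct pte {
        uint64_t pfn;
        bool present;
        bool large;     /* set only on the first pte of a large group */
};

static void shatter_large(struct pte *group)
{
        /* Rewrite the group as PTES_PER_LARGE normal ptes covering the
         * same physical range; from here on the generic VM sees only
         * base-size pages. */
        uint64_t base = group[0].pfn;
        for (int i = 0; i < PTES_PER_LARGE; i++) {
                group[i].pfn = base + i;
                group[i].present = true;
                group[i].large = false;
        }
}

void set_pte(struct pte *ptep, struct pte *group_start, struct pte val)
{
        if (group_start[0].large)
                shatter_large(group_start);     /* complexity ends here */
        *ptep = val;
}

Once shattered, vmscan and the rest of the VM see ordinary pages, which is
how the page-aging question raised below resolves itself.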

2002-08-02 09:04:55

by Ryan Cumming

[permalink] [raw]
Subject: Re: large page patch

On August 2, 2002 01:20, David S. Miller wrote:
> From: David Mosberger <[email protected]>
> Date: Thu, 1 Aug 2002 23:26:07 -0700
>
> I'm a bit concerned about this, too. My preference would have been to
> use the regular mmap() and shmat() syscalls with some
> augmentation/hint as to what the preferred page size is (Simon
> Winwood's OLS 2002 paper talks about some options here). I like this
> because hints could be useful even with a transparent superpage
> scheme.
>
> A "hint" to use superpages? That's absurd.

What about applications that want fine-grained page aging? 4MB is a tad on the
coarse side for most desktop applications.

-Ryan

2002-08-02 09:16:32

by David Miller

[permalink] [raw]
Subject: Re: large page patch

From: Ryan Cumming <[email protected]>
Date: Fri, 2 Aug 2002 02:05:43 -0700

What about applications that want fine-grained page aging? 4MB is a
tad on the coarse side for most desktop applications.

Once vmscan sees the page and tries to liberate it, then it
will be unlarge'd and thus you'll get fine-grained page aging.

That's the beauty of my implementation suggestion.

2002-08-02 10:02:12

by Marcin Dalecki

[permalink] [raw]
Subject: Re: large page patch

David Mosberger wrote:
> On Thu, 01 Aug 2002 23:33:26 -0700, "Martin J. Bligh" <[email protected]> said:
>
> DaveM> In my opinion the proposed large-page patch addresses a
> DaveM> relatively pressing need for databases (primarily).
>
> DaveM> Databases want large pages with IPC_SHM, how can this special
> DaveM> syscall hack address that?
>
> >> I believe the interface is OK in that regard. AFAIK, Oracle is
> >> happy with it.
>
> Martin> Is Oracle now the world's only database? I think not.
>
> I didn't say such a thing. I just don't know what other db vendors/authors
> think of the proposed interface. I'm sure their feedback would be welcome.

You'd better not ask DB people, and especially the Oracle people,
for opinions on interface design, unless you want something
fscking ugly internally, looking like FORTRAN/COBOL coding.
They will always scrap portability/usability, use undocumented behaviour
and so on, whenever they can presumably increase their pet benchmark
values.
One of the reasons Solaris *feels* so slow is apparently that they asked
the Oracle people about opinions too frequently. In particular, they
forgot that there are other uses than DB servers ;-).

PS. I've been in touch with Oracle too much not to hate it...

2002-08-02 12:49:22

by Rik van Riel

[permalink] [raw]
Subject: Re: large page patch

On Fri, 2 Aug 2002, Ryan Cumming wrote:
> On August 2, 2002 01:20, David S. Miller wrote:

> > A "hint" to use superpages? That's absurd.
>
> What about applications that want fine-grained page aging? 4MB is a tad
> on the coarse side for most desktop applications.

Of course we wouldn't want to use superpages for VMAs smaller
than, say, 4 of these superpages.

That would fix this problem automagically.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-08-02 15:23:44

by David Mosberger

[permalink] [raw]
Subject: Re: large page patch

>>>>> On Fri, 02 Aug 2002 01:20:40 -0700 (PDT), "David S. Miller" <[email protected]> said:

Dave.M> A "hint" to use superpages? That's absurd.

Dave.M> Any time you are able to translate N pages instead of 1 page
Dave.M> with 1 TLB entry it's always preferable.

Yeah, right. So you think a 256MB page-size is optimal for all apps?

What you're missing is how you *get* to the point where you can map N
pages with a single TLB entry. For that to happen, you need to
allocate physically contiguous and properly aligned memory (at least
given the hw that's common today). Doing so has certain costs, no matter
what your approach is.

--david

2002-08-02 19:28:47

by Rohit Seth

[permalink] [raw]
Subject: RE: large page patch


We agree that there are a few different ways to get this support implemented
in the base kernel. The extent to which this support needs to go is also
debatable (like whether the large_pages could be made swappable etc.). Just
to give a little history, we also started by prototyping changes in the kernel
that would make the large page support transparent to the end user (as we wanted
to see the benefit of large apps like databases, SPEC benchmarks and HPC
applications using different page sizes on IA-64). Under some conditions the
user automagically starts using large pages for shm and private anonymous
pages. But we would call this at best a kludge, because there are quite a
number of conditions in these execution paths that one has to handle
differently for large_pages. For example,
make_pages_present/handle_mm_fault for anonymous or shmem type pages need
to be modified to embed the knowledge of different page sizes in the generic
kernel. Also, there are places where the semantics of the changes may not
completely match. For example, doing a shm_lock/unlock on these segments was
not exactly doing what was expected. All those extra changes add cost in the
normal execution path (severity could differ from app to app).

So, we needed to treat the large pages as a special case and wanted to make
sure that an application using the large pages understands that
these pages are special (avoiding a transparent usage model until the large
pages are treated the same way as normal pages). This led to a cleaner solution
(input for which also came from Linus himself). The new APIs enable the
kernel to keep the changes architecture-specific and limited to
very few places. And above all, it looks very portable. In fact,
the initial implementation was done for IA-64 and porting to x86 took a couple
of hours. One of the other key advantages is that this design does not tie
the supported large_page size(s) to any specific size in the generic mm
code. It supports all the page sizes the underlying architecture supports,
quite independently of the generic code. And the architecture-dependent code
could support multiple large_page sizes in the same kernel.

We presented our work to Oracle and they found the new APIs acceptable
(not saying Oracle is the only DB in the world that one has to worry about, but
it clearly indicates that the move from the shm APIs to these new APIs is easy.
Obviously, input from other big app vendors will be highly appreciated).


Scientific apps people who have the sources should also like this approach,
as their changes will be even more trivial (changes to malloc). And above
all, for those people who really want to get this extra buck transparently,
the changes could be made in user-land libraries that selectively map to these
new APIs. LD_PRELOAD could be another way to do it. Of course, there will be
changes that need to be done in user land, but they are self-contained
changes. And one of the key points is that the application knows what it is
demanding/getting from the kernel.
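
As a hedged illustration of that LD_PRELOAD route: the shim below intercepts
very large malloc(3) requests and satisfies them from large pages. The
syscall number, the 16MB threshold and the exact parameter order of
sys_get_large_pages are assumptions for the sketch, not values taken from
the patch:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define __NR_get_large_pages   238             /* placeholder number */
#define LARGE_THRESHOLD        (16UL << 20)     /* arbitrary cut-off */
#define LARGE_PAGE_SIZE        (4UL << 20)

void *malloc(size_t size)
{
        static void *(*real_malloc)(size_t);

        if (!real_malloc)
                real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

        if (size >= LARGE_THRESHOLD) {
                /* Round up to a whole number of large pages. */
                size_t len = (size + LARGE_PAGE_SIZE - 1) &
                             ~(LARGE_PAGE_SIZE - 1);
                long addr = syscall(__NR_get_large_pages, 0UL, len,
                                    PROT_READ | PROT_WRITE, MAP_PRIVATE);
                if (addr != -1)
                        return (void *)addr;
                /* No large pages left: fall back to the normal heap. */
        }
        return real_malloc(size);
}

A real shim would also have to track which pointers came from large pages so
that free() can release them appropriately; that bookkeeping is omitted here.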

Now to the point of whether the large_pages themselves could be made swappable.
In our opinion (and this may not be API dependent), it is not a good idea
to look at these pages as swappable candidates. Most of the big apps that are
going to use this feature will use them for the data that they really need
available all the time (preferably in RAM if not in caches :-)). And the
sysadmin could easily configure the size of the large-memory pool as per the
needs of a specific environment.

As for the point where the whole kernel starts supporting superpages (as David
Mosberger referred to), with support built into the kernel to basically treat
superpages as just another page size the whole kernel supports, that would be
great too. But it would need quite a lot of exhaustive changes in the kernel
layers as well as a lot of tuning.....maybe a little further away in the future.

thanks,
asit & rohit

2002-08-02 23:37:21

by Chris Wedgwood

[permalink] [raw]
Subject: Re: large page patch

On Thu, Aug 01, 2002 at 05:37:46PM -0700, Andrew Morton wrote:

diff -Naru linux.org/arch/i386/kernel/entry.S linux.lp/arch/i386/kernel/entry.S
--- linux.org/arch/i386/kernel/entry.S Mon Feb 25 11:37:53 2002
+++ linux.lp/arch/i386/kernel/entry.S Tue Jul 2 15:12:23 2002
@@ -634,6 +634,10 @@
.long SYMBOL_NAME(sys_ni_syscall) /* 235 reserved for removexattr */
.long SYMBOL_NAME(sys_ni_syscall) /* reserved for lremovexattr */
.long SYMBOL_NAME(sys_ni_syscall) /* reserved for fremovexattr */
+ .long SYMBOL_NAME(sys_get_large_pages) /* Get large_page pages */
+ .long SYMBOL_NAME(sys_free_large_pages) /* Free large_page pages */
+ .long SYMBOL_NAME(sys_share_large_pages)/* Share large_page pages */
+ .long SYMBOL_NAME(sys_unshare_large_pages)/* UnShare large_page pages */


Must large pages be allocated this way?

At some point I would like to see code that mmap's large amounts of
data (over 1GB) and has it take advantage of this, once the kernel is
potentially extended to deal with mapping large and/or variable
sized pages backed by disk.

Also, some scientific applications will malloc(3) gobs of RAM, again
in excess of 1GB. Is it unreasonable to expect that the kernel will
notice large allocations and try to provide large pages when sbrk is
invoked with suitably high values?



--cw

2002-08-06 20:47:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: large page patch

Some comments on Rohit's large page patch (looking at Andrew's version).

I agree with keeping the actual large page handling separate, per arch.

I agree that it's sensible to focus upon _large_ pages (e.g. 4MB) here;
grouping several pages together as superpages (e.g. 64KB), for non-x86
TLB or other reasons, better handled automatically in another project.

I agree that large pages be kept right away from swap (VM_RESERVE).

But I disagree with new interfaces distinct from known mmap/shm/tmpfs
usage, I think they will cause rather than save trouble. It's using
do_mmap_pgoff anyway, why not mmap itself? Much prefer MAP_LARGEPAGE,
SHM_LARGEPAGE - or might /dev/ZERO and TMPFS help, each on large pages?
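
For comparison, the known-interface route Hugh prefers might look like the
snippet below, where MAP_LARGEPAGE is a purely hypothetical flag (with a
made-up value) rather than anything the patch defines:

#define _GNU_SOURCE
#include <sys/mman.h>

#define MAP_LARGEPAGE   0x10000         /* hypothetical flag value */

int main(void)
{
        /* Request an 8MB anonymous mapping backed by large pages through
         * the ordinary mmap() interface instead of a new syscall. */
        void *p = mmap(NULL, 8UL << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_LARGEPAGE, -1, 0);
        return p == MAP_FAILED;
}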

munmap, mprotect, mremap patches are deficient: they just check whether
the first vma is VM_LARGEPAGE, but munmap and mprotect (and mremap's
do_munmap in MREMAP_MAYMOVE|MREMAP_FIXED case) may span several vmas.

So, must decide what to do when a VM_LARGEPAGE falls within do_munmap
span: pre-scan would waste time, we rely on unmap_region to unmap
at least length specified by user, we're not interested in splitting
VM_LARGEPAGE areas, so I suggest when a VM_LARGEPAGE area falls partly
or wholly within do_munmap span, it be wholly unmapped. In which case,
no need for sys_free_large_pages and sys_unshare_large_pages.
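
A toy model of that rule (hypothetical structures and a stub helper, not the
actual kernel code): any VM_LARGEPAGE vma that the unmap span touches is
torn down in full, while normal vmas are clipped to the span as usual.

#include <stdio.h>

#define VM_LARGEPAGE    0x01000000UL

struct vma {
        unsigned long start, end, flags;
        struct vma *next;       /* list sorted by start address */
};

/* Stand-in for unmap_region(): just report what would be unmapped. */
static void unmap_region(unsigned long start, unsigned long end)
{
        printf("unmap [%#lx, %#lx)\n", start, end);
}

void do_munmap_span(struct vma *vmas, unsigned long addr, unsigned long len)
{
        unsigned long end = addr + len;

        for (struct vma *v = vmas; v && v->start < end; v = v->next) {
                if (v->end <= addr)
                        continue;
                if (v->flags & VM_LARGEPAGE) {
                        /* Never split a large-page area: drop all of it. */
                        unmap_region(v->start, v->end);
                } else {
                        /* Normal vma: clip to the requested span. */
                        unmap_region(addr > v->start ? addr : v->start,
                                     end < v->end ? end : v->end);
                }
        }
}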

sys_get_large_pages, if retained as a separate syscall,
would be easier to understand if named sys_mmap_large_pages?

sys_share_large_pages: I'm having a lot of difficulty with this one,
and its set_lp_shm_seg. Share? but it says MAP_PRIVATE (whereas
sys_get_large_pages forces MAP_SHARED). Key? we got that from a
prior shmget? so already there's a tmpfs inode for this, and now
we allocate some other inode? No, I think it would be better off
integrated a little more within tmpfs (perhaps no SHM_LARGEPAGE
at all, just ordinary files in a TMPFS? Rohit mentioned wanting
ability to execute, straightforward from TMPFS file).

change_large_page_mem_size: wouldn't it be better as
set_large_page_mem_size, instead of by increments/decrements?

Whitespace offences, would benefit from a pass through Lindent.

Hugh

2002-08-07 00:07:48

by Rohit Seth

[permalink] [raw]
Subject: RE: large page patch



> -----Original Message-----
> From: Hugh Dickins [mailto:[email protected]]
> Sent: Tuesday, August 06, 2002 1:52 PM
> To: Linus Torvalds
> Cc: Andrew Morton; Seth, Rohit; [email protected]
> Subject: Re: large page patch
>
>
> Some comments on Rohit's large page patch (looking at
> Andrew's version).
>
Thanks.

> I agree with keeping the actual large page handling separate,
> per arch.
>
> I agree that it's sensible to focus upon _large_ pages (e.g.
> 4MB) here;
> grouping several pages together as superpages (e.g. 64KB), for non-x86
> TLB or other reasons, better handled automatically in another project.
>
I think lately at least we are all converging to the view that there is a
place for both large_TLB_pages and superpages.

> I agree that large pages be kept right away from swap (VM_RESERVE).
>
> But I disagree with new interfaces distinct from known mmap/shm/tmpfs
> usage, I think they will cause rather than save trouble. It's using
> do_mmap_pgoff anyway, why not mmap itself? Much prefer MAP_LARGEPAGE,
> SHM_LARGEPAGE - or might /dev/ZERO and TMPFS help, each on
> large pages?
>
In this design, there are a few key differences between the large_page related
calls and the mmap system call, even though they all eventually use
do_mmap_pgoff (just like shmat and mmap themselves). For example,
large_pages may or may not get shared across forks. Also, there is no
backing store for these pages (so fd and offset really don't apply; to some
extent that is also true for anonymous mappings). Their fault behavior and
handling are also quite different, no partial unmaps etc. It could be done
(by overloading something like MAP_LARGEPAGE with all the additional
features/attributes that are currently embedded in the new system calls) but
that would pollute some of the generic code.

> munmap, mprotect, mremap patches are deficient: they just
> check whether
> the first vma is VM_LARGEPAGE, but munmap and mprotect (and mremap's
> do_munmap in MREMAP_MAYMOVE|MREMAP_FIXED case) may span several vmas.
>

Good catch. See my comments below.

> So, must decide what to do when a VM_LARGEPAGE falls within do_munmap
> span: pre-scan would waste time, we rely on unmap_region to unmap
> at least length specified by user, we're not interested in splitting
> VM_LARGEPAGE areas, so I suggest when a VM_LARGEPAGE area falls partly
> or wholly within do_munmap span, it be wholly unmapped. In
> which case,
> no need for sys_free_large_pages and sys_unshare_large_pages.

It is obviously clear that we will have to take care of cases where a single
munmap (for example) request touches vmas that span large TLBs. I agree
with Hugh that a pre-scan would be costly. But at the same time, calls like
munmap have no knowledge of large_pages. One could potentially add checks in
these cases and effectively jump to what sys_free_large_pages is doing.
Or it would be nice if we took care of the normal cases and skipped over the
regions that map large_pages. I think the code in mprotect and mremap already
allows partial service; it is munmap where we don't already have an
error/partial way out. My preference would be to not touch large_pages and
skip over them.
Also, to some extent we already have things like munmap and
shmdt....effectively doing the same thing but co-existing (I think mainly
because of the different semantics that created those segments).
>
> sys_share_large_pages: I'm having a lot of difficulty with this one,
> and its set_lp_shm_seg. Share? but it says MAP_PRIVATE (whereas
> sys_get_large_pages forces MAP_SHARED). Key? we got that from a
> prior shmget? so already there's a tmpfs inode for this, and now
> we allocate some other inode? No, I think it would be better off
> integrated a little more within tmpfs (perhaps no SHM_LARGEPAGE
> at all, just ordinary files in a TMPFS? Rohit mentioned wanting
> ability to execute, straightforward from TMPFS file).
>
Let me clarify the sys_share_large_pages syscall: To start with, the syntax of
this call is sys_share_large_pages(int key, unsigned long addr, unsigned
long len, int prot, int flag) where
Key is a system-wide unique positive number (nothing to do with
shmget's key). The idea is that the user app decides how the numbers are
chosen (for sharing data across unrelated processes). Though the intent was
not to relate these calls to the shm* key, it seems quite a few people are
reading it that way. As Andrew once pointed out, we could change this
to something like a shared fd, and users could open a particular file to get
an fd that could be used to get at a particular chunk of large pages. So there
is no connection between these new system calls and the normal shm* related
system calls.

The parameters addr, len and prot have their usual meaning.

The flag parameter for sys_share_large_pages can only be 0 or IPC_CREAT.
Flag=IPC_CREAT tells the kernel that if the particular key is not already in
use, it should go ahead and create the large_page segment. Later references
to it are made using the same key. If the user does not specify IPC_CREAT in
flag and there is no large_page segment corresponding to key, the syscall
returns ENOENT.

Whereas the flag parameter of sys_get_large_pages can take the values
MAP_PRIVATE or MAP_SHARED. Here MAP_PRIVATE means that the large pages
will not be copied across fork to the new address space (of the child),
whereas MAP_SHARED means that the new child process will share the same
(physical) large_pages with the parent. (The intent is not to copy huge data
across the fork system call, and it is expected that a correct design would
want child and parent to share the same data. Also, if the child really needs
to be detached from the parent, it will create its own new segment and copy
the data.)
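
A minimal usage sketch of those semantics (the syscall numbers are
placeholders, not values from the patch, and error handling is omitted):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>            /* IPC_CREAT */
#include <sys/mman.h>           /* PROT_*, MAP_* */
#include <unistd.h>             /* syscall() */

#define __NR_get_large_pages    238     /* placeholder */
#define __NR_share_large_pages  240     /* placeholder */

#define LP_KEY  42                      /* app-chosen, system-wide positive key */
#define LP_LEN  (8UL << 20)             /* two 4MB large pages */

int main(int argc, char **argv)
{
        long addr;

        if (argc > 1 && strcmp(argv[1], "create") == 0) {
                /* Creator: IPC_CREAT builds the segment for this key if it
                 * is not already in use. */
                addr = syscall(__NR_share_large_pages, LP_KEY, 0UL, LP_LEN,
                               PROT_READ | PROT_WRITE, IPC_CREAT);
        } else if (argc > 1 && strcmp(argv[1], "attach") == 0) {
                /* Unrelated process: same key, flag 0; fails with ENOENT if
                 * nothing has been created under this key. */
                addr = syscall(__NR_share_large_pages, LP_KEY, 0UL, LP_LEN,
                               PROT_READ | PROT_WRITE, 0);
        } else {
                /* Anonymous large pages: MAP_SHARED lets a forked child share
                 * the same physical large pages; MAP_PRIVATE means they are
                 * not copied into the child's address space. */
                addr = syscall(__NR_get_large_pages, 0UL, LP_LEN,
                               PROT_READ | PROT_WRITE, MAP_SHARED);
        }

        printf("large page mapping at %#lx\n", (unsigned long)addr);
        return 0;
}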

> change_large_page_mem_size: wouldn't it be better as
> set_large_page_mem_size, instead of by increments/decrements?
>
Well, I've found specifying the change easier than specifying the new size.
This could easily be altered though.

> Whitespace offences, would benefit from a pass through Lindent.
>
Will do that clean up in next update.

> Hugh
>