the attached patch (against BK-curr) implements two brand-new VM features:
- a new syscall, remap_file_pages(), for arbitrary remapping of shared
mappings, within the same vma.
- explicit pagetable population (prefaulting) support for mmap().
[ MAP_POPULATE is a nice side-effect of the more generic ->populate()
handler approach suggested by Linus. Plus the patch implements
MAP_NONBLOCK for IO-less prefaulting as well.]
the patch and a test-utility can also be found at:
http://redhat.com/~mingo/remap-file-pages-patches/
first, a little bit of background about what this patch tries to achieve:
Linux mappings (vmas) have been 'linear' from day one, which means that
files can only be mmap()-ed linearly - ie. the file's first page is
mapped to the mapping's first page, the second file page to the second
mapping page, the third to the third, etc. [plus an optional constant
offset on the file side.] This is a basic property of mmap() on all
unices.
The main goal of this patch is to provide support for 'nonlinear'
mappings, ie. vmas where there's an arbitrary relationship between the
file and the virtual memory range that is mapped - while still keeping the
same single vma.
nonlinearly mapped virtual memory ranges of the same file are already
possible by using multiple separate mappings (vmas) via mmap() - an
approach that, while widely used, has lots of disadvantages:
- really complex remappings (used by databases or virtualizing
applications) create a *huge* number of vmas - and vmas are per-process,
which puts a really big load on kernel memory allocations, especially on
32-bit systems. I've seen applications whose mapping setup generated 128
*thousand* vmas per process, causing lots of problems.
- setting up separate mappings is expensive, causes one pagefault per page
and also causes TLB flushes.
- even on 64-bit systems, when mapping really large (terabyte size) and
really sparse files, sparse mappings can be a disadvantage - in the
worst case there can be as many as one extra pagetable page allocated
for every file page that is mapped in.
So the more compact mappings enabled by the new remap_file_pages()
syscall can be advantageous in a number of cases. The new syscall has the
following semantics:
* sys_remap_file_pages - remap arbitrary pages of a shared backing store
* file within an existing vma.
* @start: start of the remapped virtual memory range
* @size: size of the remapped virtual memory range
* @prot: new protection bits of the range
* @pgoff: page of the backing store file to be mapped
* @flags: 0 or MAP_NONBLOCK - the latter causes no IO.
the virtual memory range has to be mapped linearly before using it, and
all further remapping has to happen within a single vma. Since all the
mapping information of nonlinear vmas lives in the pagetables, they either
have to be MAP_LOCKED (for databases or virtualization software) or need a
special SIGBUS handler to reinstall mappings across swapping (for more
complex uses such as memory debuggers).
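To make these semantics concrete, here is a minimal usage sketch. It is
hypothetical: 253 is the i386 syscall number from this patch, there is no
glibc wrapper yet, and the file name, window size and page offset are made
up for illustration:

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <fcntl.h>

#define __NR_remap_file_pages	253	/* i386 number from this patch */
#define PAGE_SIZE		4096UL
#define WINDOW_SIZE		(1024UL*1024*1024)	/* a 1 GB window */

/* no glibc wrapper exists yet, so go through syscall(): */
static long remap_file_pages(unsigned long start, unsigned long size,
		unsigned long prot, unsigned long pgoff, unsigned long flags)
{
	return syscall(__NR_remap_file_pages, start, size, prot, pgoff, flags);
}

int main(void)
{
	int fd = open("/dev/shm/dbcache", O_RDWR);	/* hypothetical 16 GB file */
	char *win;

	if (fd < 0)
		return 1;
	/*
	 * The window must start out as an ordinary linear shared mapping;
	 * MAP_LOCKED keeps the pagetable-only mappings from being zapped
	 * by swapout:
	 */
	win = mmap(NULL, WINDOW_SIZE, PROT_READ|PROT_WRITE,
		   MAP_SHARED|MAP_LOCKED, fd, 0);
	if (win == MAP_FAILED)
		return 1;
	/* rewire the window's first page to file page 12345, within the
	   same vma - no munmap(), no new vma, no full TLB flush: */
	return remap_file_pages((unsigned long)win, PAGE_SIZE,
				PROT_READ|PROT_WRITE, 12345, 0);
}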
mmap(MAP_POPULATE) prefaulting support is a 'side-effect' of the support
code needed for nonlinear vmas.
Today's relational databases, when supporting large amounts of RAM on
IA32, typically map a big /dev/shm/ shared memory file (eg. 16 GB) into a
smaller window (eg. 1 GB) and heavily remap that window. Here is a list
of advantages of nonlinear mappings over the current mmap()-based
remapping method, in the case of such DB memory models:
- less vma tree lookup overhead: there's only one vma allocated for eg. a
1 GB window. In fact this vma would be cached in mm->mmap_cache most of
the time.
- dramatically reduced vma lowmem overhead: i've seen DBs where about
50%(!) of all lowmem allocations during a TPC run are due to the shmfs
vmas. remap_file_pages() is in essence using the pagetables as the only
mapping information - and it's thus the most memory-compact mapping
method possible.
- the vma lowmem reduction has another advantage: the number of concurrent
database processes can be increased significantly. With pagetables in
highmem and mass-vmas avoided, the lowmem pressure is reduced
significantly.
- less swapping overhead: a high number of vmas increases the swapping
overhead.
- finegrained caching for the application - it no longer has a lowmem
impact to map at 4K granularity.
- TLB flush avoidance: the MAP_FIXED overmapping of larger than 4K cache
units causes a TLB flush, greatly increasing the overhead of 'basic'
DB cache operations - both the direct overhead and the secondary costs
of repopulating the TLB cache are significant - and will only increase
with newer CPUs. remap_file_pages() uses the one-page invalidation
instruction, which does not destroy the TLB.
- reduced cost of cache remapping: remapping eg. a 16K cache element
causes a kernel entry, an unmapping, a TLB flush and a vma-install, plus
4 subsequent pagefaults. With remap_file_pages() it's a single kernel
entry per cache element - no followup pagefaults. (a sketch follows.)
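To illustrate the last point, a hypothetical fragment - win, slot, fd and
elem_pgoff stand in for application state, and remap_file_pages() is the
syscall()-based wrapper sketched earlier:

#include <sys/types.h>
#include <sys/mman.h>

#define SLOT_SIZE (4 * 4096UL)		/* a 16K, 4-page cache element */

extern long remap_file_pages(unsigned long start, unsigned long size,
		unsigned long prot, unsigned long pgoff, unsigned long flags);

/* old method: overmap the slot via MAP_FIXED - kernel entry, unmap,
 * TLB flush, vma install, plus 4 followup pagefaults: */
static void remap_slot_old(char *win, unsigned long slot, int fd,
		unsigned long elem_pgoff)
{
	mmap(win + slot * SLOT_SIZE, SLOT_SIZE, PROT_READ|PROT_WRITE,
	     MAP_SHARED|MAP_FIXED, fd, (off_t)elem_pgoff * 4096);
}

/* new method: a single kernel entry, single-page invalidations only,
 * and the pages are installed immediately - no followup faults: */
static void remap_slot_new(char *win, unsigned long slot,
		unsigned long elem_pgoff)
{
	remap_file_pages((unsigned long)(win + slot * SLOT_SIZE), SLOT_SIZE,
			 PROT_READ|PROT_WRITE, elem_pgoff, 0);
}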
the patch has got an initial review from Linus, Andrew Morton and Hugh
Dickins, and most of their suggestions are part of the patch already.
Implementation details:
- pagetable population is a per-vma op currently, not a file op. The
reason is that remapping itself is a per-vma concept, changing
sys_remap_file_pages() to work across multiple vmas does not make much
sense IMO.
- the fact that MAP_LOCKED has to be used is not an issue for things like
databases, which size their caches to available RAM anyway, and want to
have a separate chunk of memory that is never swapped out.
- i've added MAP_NONBLOCK as well, which enables applications to prefault
all pages that are in the pagecache already, but cause no further IO.
This can be useful for the dynamic linker, especially when prelinking is
enabled. (see the sketch after this list.)
- the lowlevel population handlers are currently simplified. They iterate
over the virtual memory range and look up the file page for the given
offset. It would be faster to iterate the pagecache mapping's radix tree
and the pagetables at once, but it's also *much* more complex. I have
tried to implement it and had to back the change out - mixing radix tree
walking and pagetable walking and getting all the VM details right is
really complex - especially considering all the re-lookup race checks
that have to occur upon IO. It might make sense to split out a fast
function for MAP_NONBLOCK. (everyone is welcome to implement these
things.)
- two backing store concepts have a ->populate function currently: shmfs
and the generic pagecache.
- readahead: currently filemap_populate() does not initiate further
readahead - this is mainly in recognition of the mostly random nature
of remap_file_pages() remappings. Or should we trust the readahead engine
and use one filemap_getpage() function for filemap_nopage() and
filemap_populate()? The readahead should not go beyond the window
specified by the populate() function, which is distinct from the mostly
guessing work filemap_nopage() has to do. This is the reason why i
separated the two codepaths, although they did similar things.
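A minimal sketch of the prefaulting API from userspace (the flag values
are the i386 ones from this patch; note that in this version the populate
path only triggers for shared mappings, so MAP_SHARED is used here, and
the file name and length are illustrative):

#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

#define MAP_POPULATE	0x8000		/* populate (prefault) pagetables */
#define MAP_NONBLOCK	0x10000		/* do not block on IO */

int prefault_file(const char *name, size_t len)
{
	int fd = open(name, O_RDONLY);
	void *p;

	if (fd < 0)
		return -1;
	/*
	 * Map and prefault in one step: every page already in the
	 * pagecache gets a pte installed at mmap() time, everything
	 * else faults in lazily later - and no IO is started:
	 */
	p = mmap(NULL, len, PROT_READ,
		 MAP_SHARED|MAP_POPULATE|MAP_NONBLOCK, fd, 0);
	return p == MAP_FAILED ? -1 : 0;
}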
i've also tested the patch by unconditionally enabling prefaulting in
mmap(), which produced a fully working system, so i'm confident that the
basic bits are right. The testcode on my webpage can be used to check the
nonlinear remapping functionality. I've tested x86 SMP and UP.
all in all, nonlinear mappings (not present in any other OS i'm aware of)
would IMO be a nice twist to the already excellent and incredibly flexible
Linux VM. Comments, suggestions are welcome!
Ingo
--- linux/arch/i386/kernel/entry.S.orig 2002-09-20 17:20:21.000000000 +0200
+++ linux/arch/i386/kernel/entry.S 2002-10-14 10:09:38.000000000 +0200
@@ -736,6 +736,7 @@
.long sys_alloc_hugepages /* 250 */
.long sys_free_hugepages
.long sys_exit_group
+ .long sys_remap_file_pages
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
--- linux/include/linux/mm.h.orig 2002-10-14 10:09:05.000000000 +0200
+++ linux/include/linux/mm.h 2002-10-14 13:02:10.000000000 +0200
@@ -130,6 +130,7 @@
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
+ int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, unsigned long prot, unsigned long pgoff, int nonblock);
};
/* forward declaration; pte_chain is meant to be internal to rmap.c */
@@ -366,9 +367,12 @@
extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
+extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, unsigned long prot);
extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access);
extern int make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
+extern int sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags);
+
extern struct page * follow_page(struct mm_struct *mm, unsigned long address, int write);
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
--- linux/include/asm-i386/mman.h.orig 2002-09-20 17:20:17.000000000 +0200
+++ linux/include/asm-i386/mman.h 2002-10-14 12:58:21.000000000 +0200
@@ -18,6 +18,8 @@
#define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
#define MAP_LOCKED 0x2000 /* pages are locked */
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
+#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
+#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_INVALIDATE 2 /* invalidate the caches */
--- linux/include/asm-i386/unistd.h.orig 2002-10-14 10:08:51.000000000 +0200
+++ linux/include/asm-i386/unistd.h 2002-10-14 10:09:38.000000000 +0200
@@ -257,6 +257,7 @@
#define __NR_alloc_hugepages 250
#define __NR_free_hugepages 251
#define __NR_exit_group 252
+#define __NR_remap_file_pages 253
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
--- linux/mm/fremap.c.orig 2002-10-14 10:09:38.000000000 +0200
+++ linux/mm/fremap.c 2002-10-14 13:01:47.000000000 +0200
@@ -0,0 +1,138 @@
+/*
+ * linux/mm/fremap.c
+ *
+ * Explicit pagetable population and nonlinear (random) mappings support.
+ *
+ * started by Ingo Molnar, Copyright (C) 2002
+ */
+
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/mman.h>
+#include <linux/pagemap.h>
+#include <linux/swapops.h>
+#include <asm/mmu_context.h>
+
+static inline void zap_pte(struct mm_struct *mm, pte_t *ptep)
+{
+ pte_t pte = *ptep;
+
+ if (pte_none(pte))
+ return;
+ if (pte_present(pte)) {
+ unsigned long pfn = pte_pfn(pte);
+
+ pte = ptep_get_and_clear(ptep);
+ if (pfn_valid(pfn)) {
+ struct page *page = pfn_to_page(pfn);
+ if (!PageReserved(page)) {
+ if (pte_dirty(pte))
+ set_page_dirty(page);
+ page_remove_rmap(page, ptep);
+ page_cache_release(page);
+ mm->rss--;
+ }
+ }
+ } else {
+ free_swap_and_cache(pte_to_swp_entry(pte));
+ pte_clear(ptep);
+ }
+}
+
+/*
+ * Install a page at a given virtual memory address, releasing any
+ * previously existing mapping.
+ */
+int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, struct page *page, unsigned long prot)
+{
+ int err = -ENOMEM;
+ pte_t *pte, entry;
+ pgd_t *pgd;
+ pmd_t *pmd;
+
+ pgd = pgd_offset(mm, addr);
+ spin_lock(&mm->page_table_lock);
+
+ pmd = pmd_alloc(mm, pgd, addr);
+ if (!pmd)
+ goto err_unlock;
+
+ pte = pte_alloc_map(mm, pmd, addr);
+ if (!pte)
+ goto err_unlock;
+
+ zap_pte(mm, pte);
+
+ mm->rss++;
+ flush_page_to_ram(page);
+ flush_icache_page(vma, page);
+ entry = mk_pte(page, protection_map[prot]);
+ if (prot & PROT_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));
+ set_pte(pte, entry);
+ page_add_rmap(page, pte);
+ pte_unmap(pte);
+ flush_tlb_page(vma, addr);
+
+ spin_unlock(&mm->page_table_lock);
+
+ return 0;
+
+err_unlock:
+ spin_unlock(&mm->page_table_lock);
+ return err;
+}
+
+/***
+ * sys_remap_file_pages - remap arbitrary pages of a shared backing store
+ * file within an existing vma.
+ * @start: start of the remapped virtual memory range
+ * @size: size of the remapped virtual memory range
+ * @prot: new protection bits of the range
+ * @pgoff: page of the backing store file to be mapped
+ * @flags: 0 or MAP_NONBLOCK - the latter causes no IO.
+ *
+ * this syscall works purely via pagetables, so it's the most efficient
+ * way to map the same (large) file into a given virtual window. Unlike
+ * mremap()/mmap() it does not create any new vmas.
+ *
+ * The new mappings do not live across swapout, so either use MAP_LOCKED
+ * or use PROT_NONE in the original linear mapping and add a special
+ * SIGBUS pagefault handler to reinstall zapped mappings.
+ */
+int sys_remap_file_pages(unsigned long start, unsigned long size,
+ unsigned long prot, unsigned long pgoff, unsigned long flags)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long end = start + size;
+ struct vm_area_struct *vma;
+ int err = -EINVAL;
+
+ /*
+ * Sanitize the syscall parameters:
+ */
+ start = PAGE_ALIGN(start);
+ size = PAGE_ALIGN(size);
+ prot &= 0xf;
+
+ down_read(&mm->mmap_sem);
+
+ vma = find_vma(mm, start);
+ /*
+ * Make sure the vma is shared, that it supports prefaulting,
+ * and that the remapped range is valid and fully within
+ * the single existing vma:
+ */
+ if (vma && (vma->vm_flags & VM_SHARED) &&
+ vma->vm_ops && vma->vm_ops->populate &&
+ end > start && start >= vma->vm_start &&
+ end <= vma->vm_end)
+ err = vma->vm_ops->populate(vma, start, size, prot,
+ pgoff, flags & MAP_NONBLOCK);
+
+ up_read(&mm->mmap_sem);
+
+ return err;
+}
+
--- linux/mm/Makefile.orig 2002-10-14 10:09:04.000000000 +0200
+++ linux/mm/Makefile 2002-10-14 10:09:38.000000000 +0200
@@ -4,7 +4,7 @@
export-objs := shmem.o filemap.o mempool.o page_alloc.o page-writeback.o
-obj-y := memory.o mmap.o filemap.o mprotect.o mlock.o mremap.o \
+obj-y := memory.o mmap.o filemap.o fremap.o mprotect.o mlock.o mremap.o \
vmalloc.o slab.o bootmem.o swap.o vmscan.o page_io.o \
page_alloc.o swap_state.o swapfile.o oom_kill.o \
shmem.o highmem.o mempool.o msync.o mincore.o readahead.o \
--- linux/mm/shmem.c.orig 2002-10-14 10:09:04.000000000 +0200
+++ linux/mm/shmem.c 2002-10-14 12:23:32.000000000 +0200
@@ -710,7 +710,10 @@
* vm. If we swap it in we mark it dirty since we also free the swap
* entry since a page cannot live in both the swap and page cache
*/
-static int shmem_getpage(struct inode *inode, unsigned long idx, struct page **pagep)
+static int shmem_getpage(struct inode *inode,
+ unsigned long idx,
+ struct page **pagep,
+ int nonblock)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -735,11 +738,9 @@
spin_unlock(&info->lock);
repeat:
- page = find_lock_page(mapping, idx);
- if (page) {
- *pagep = page;
+ *pagep = page = find_lock_page(mapping, idx);
+ if (page || nonblock)
return 0;
- }
spin_lock(&info->lock);
shmem_recalc_inode(inode);
@@ -879,7 +880,7 @@
}
spin_unlock(&info->lock);
if (swap.val) {
- if (shmem_getpage(inode, idx, &page) == 0)
+ if (shmem_getpage(inode, idx, &page, 0) == 0)
unlock_page(page);
}
return page;
@@ -899,7 +900,7 @@
if (((loff_t) idx << PAGE_CACHE_SHIFT) >= inode->i_size)
return NOPAGE_SIGBUS;
- error = shmem_getpage(inode, idx, &page);
+ error = shmem_getpage(inode, idx, &page, 0);
if (error)
return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS;
@@ -908,6 +909,43 @@
return page;
}
+static int shmem_populate(struct vm_area_struct *vma,
+ unsigned long addr,
+ unsigned long len,
+ unsigned long prot,
+ unsigned long pgoff,
+ int nonblock)
+{
+ struct inode *inode = vma->vm_file->f_dentry->d_inode;
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page;
+ int err;
+
+repeat:
+ if (((loff_t) pgoff << PAGE_CACHE_SHIFT) >= inode->i_size)
+ return -EFAULT;
+
+ err = shmem_getpage(inode, pgoff, &page, nonblock);
+ if (err)
+ return err;
+
+ if (page) {
+ unlock_page(page);
+
+ err = install_page(mm, vma, addr, page, prot);
+ if (err)
+ return err;
+ }
+
+ len -= PAGE_SIZE;
+ addr += PAGE_SIZE;
+ pgoff++;
+ if (len)
+ goto repeat;
+
+ return 0;
+}
+
void shmem_lock(struct file *file, int lock)
{
struct inode *inode = file->f_dentry->d_inode;
@@ -1111,7 +1149,7 @@
__get_user(dummy, buf+bytes-1);
}
- status = shmem_getpage(inode, index, &page);
+ status = shmem_getpage(inode, index, &page, 0);
if (status)
break;
@@ -1178,7 +1216,7 @@
break;
}
- desc->error = shmem_getpage(inode, index, &page);
+ desc->error = shmem_getpage(inode, index, &page, 0);
if (desc->error) {
if (desc->error == -EFAULT)
desc->error = 0;
@@ -1429,7 +1467,7 @@
iput(inode);
return -ENOMEM;
}
- error = shmem_getpage(inode, 0, &page);
+ error = shmem_getpage(inode, 0, &page, 0);
if (error) {
vm_unacct_memory(VM_ACCT(1));
iput(inode);
@@ -1466,7 +1504,7 @@
static int shmem_readlink(struct dentry *dentry, char *buffer, int buflen)
{
struct page *page;
- int res = shmem_getpage(dentry->d_inode, 0, &page);
+ int res = shmem_getpage(dentry->d_inode, 0, &page, 0);
if (res)
return res;
res = vfs_readlink(dentry, buffer, buflen, kmap(page));
@@ -1479,7 +1517,7 @@
static int shmem_follow_link(struct dentry *dentry, struct nameidata *nd)
{
struct page *page;
- int res = shmem_getpage(dentry->d_inode, 0, &page);
+ int res = shmem_getpage(dentry->d_inode, 0, &page, 0);
if (res)
return res;
res = vfs_follow_link(nd, kmap(page));
@@ -1746,6 +1784,7 @@
static struct vm_operations_struct shmem_vm_ops = {
.nopage = shmem_nopage,
+ .populate = shmem_populate,
};
static struct super_block *shmem_get_sb(struct file_system_type *fs_type,
--- linux/mm/mmap.c.orig 2002-10-14 10:08:55.000000000 +0200
+++ linux/mm/mmap.c 2002-10-14 12:58:57.000000000 +0200
@@ -601,6 +601,12 @@
mm->locked_vm += len >> PAGE_SHIFT;
make_pages_present(addr, addr + len);
}
+ if (flags & MAP_POPULATE) {
+ up_write(&mm->mmap_sem);
+ sys_remap_file_pages(addr, len, prot,
+ pgoff, flags & MAP_NONBLOCK);
+ down_write(&mm->mmap_sem);
+ }
return addr;
unmap_and_free_vma:
--- linux/mm/filemap.c.orig 2002-10-14 10:22:07.000000000 +0200
+++ linux/mm/filemap.c 2002-10-14 12:33:40.000000000 +0200
@@ -1154,8 +1154,155 @@
return NULL;
}
+static struct page * filemap_getpage(struct file *file, unsigned long pgoff,
+ int nonblock)
+{
+ struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+ struct page *page;
+ int error;
+
+ /*
+ * Do we have something in the page cache already?
+ */
+retry_find:
+ page = find_get_page(mapping, pgoff);
+ if (!page) {
+ if (nonblock)
+ return NULL;
+ goto no_cached_page;
+ }
+
+ /*
+ * Ok, found a page in the page cache, now we need to check
+ * that it's up-to-date.
+ */
+ if (!PageUptodate(page))
+ goto page_not_uptodate;
+
+success:
+ /*
+ * Found the page and have a reference on it, need to check sharing
+ * and possibly copy it over to another page..
+ */
+ mark_page_accessed(page);
+ flush_page_to_ram(page);
+
+ return page;
+
+no_cached_page:
+ error = page_cache_read(file, pgoff);
+
+ /*
+ * The page we want has now been added to the page cache.
+ * In the unlikely event that someone removed it in the
+ * meantime, we'll just come back here and read it again.
+ */
+ if (error >= 0)
+ goto retry_find;
+
+ /*
+ * An error return from page_cache_read can result if the
+ * system is low on memory, or a problem occurs while trying
+ * to schedule I/O.
+ */
+ return NULL;
+
+page_not_uptodate:
+ lock_page(page);
+
+ /* Did it get unhashed while we waited for it? */
+ if (!page->mapping) {
+ unlock_page(page);
+ goto err;
+ }
+
+ /* Did somebody else get it up-to-date? */
+ if (PageUptodate(page)) {
+ unlock_page(page);
+ goto success;
+ }
+
+ if (!mapping->a_ops->readpage(file, page)) {
+ wait_on_page_locked(page);
+ if (PageUptodate(page))
+ goto success;
+ }
+
+ /*
+ * Umm, take care of errors if the page isn't up-to-date.
+ * Try to re-read it _once_. We do this synchronously,
+ * because there really aren't any performance issues here
+ * and we need to check for errors.
+ */
+ lock_page(page);
+
+ /* Somebody truncated the page on us? */
+ if (!page->mapping) {
+ unlock_page(page);
+ goto err;
+ }
+ /* Somebody else successfully read it in? */
+ if (PageUptodate(page)) {
+ unlock_page(page);
+ goto success;
+ }
+
+ ClearPageError(page);
+ if (!mapping->a_ops->readpage(file, page)) {
+ wait_on_page_locked(page);
+ if (PageUptodate(page))
+ goto success;
+ }
+
+ /*
+ * Things didn't work out. Return zero to tell the
+ * mm layer so, possibly freeing the page cache page first.
+ */
+err:
+ page_cache_release(page);
+
+ return NULL;
+}
+
+static int filemap_populate(struct vm_area_struct *vma,
+ unsigned long addr,
+ unsigned long len,
+ unsigned long prot,
+ unsigned long pgoff,
+ int nonblock)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+ struct inode *inode = mapping->host;
+ unsigned long size;
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page;
+ int err;
+
+repeat:
+ size = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ if (pgoff + (len >> PAGE_CACHE_SHIFT) > size)
+ return -EINVAL;
+
+ page = filemap_getpage(file, pgoff, nonblock);
+ if (!page)
+ return -ENOMEM;
+ err = install_page(mm, vma, addr, page, prot);
+ if (err)
+ return err;
+
+ len -= PAGE_SIZE;
+ addr += PAGE_SIZE;
+ pgoff++;
+ if (len)
+ goto repeat;
+
+ return 0;
+}
+
static struct vm_operations_struct generic_file_vm_ops = {
.nopage = filemap_nopage,
+ .populate = filemap_populate,
};
/* This is used for a general mmap of a disk file */
From: Ingo Molnar <[email protected]>
Date: Mon, 14 Oct 2002 14:38:43 +0200 (CEST)
> - TLB flush avoidance: the MAP_FIXED overmapping of larger than 4K cache
> units causes a TLB flush, greatly increasing the overhead of 'basic'
> DB cache operations - both the direct overhead and the secondary costs
> of repopulating the TLB cache are significant - and will only increase
> with newer CPUs. remap_file_pages() uses the one-page invalidation
> instruction, which does not destroy the TLB.
Maybe on your cpu.
We created the range tlb flushes so that architectures have a chance
of optimizing such operations when possible.
If that isn't happening for small numbers of pages on x86 currently,
that isn't justification for special casing it here in this non-linear
mappings code.
If someone does a remap of 1GB of address space, I sure want the
option of doing a full MM flush if that is cheaper on my platform.
Currently, this part smells of an x86 performance hack, which might
even be suboptimal on x86 for remapping of huge ranges.
On Mon, 14 Oct 2002, David S. Miller wrote:
> We created the range tlb flushes so that architectures have a chance of
> optimizing such operations when possible.
yeah, agreed, we can change it to do the mmu_gather_t thing, and to
optimize that on x86 as well. Nevertheless the fact remains that cache
users were pretty much forced to use a multipage cache unit, which caused
all userspace TLBs to be flushed on x86. Where to draw the line between a
loop of INVLPG and a CR3 flush on x86 is up in the air - i'd say it's at
roughly 8 pages currently, while the x86 TLB flush code only optimizes the
single-page flushes. So you are right that this issue should be separated
from nonlinear mappings.
Ingo
From: Ingo Molnar <[email protected]>
Date: Mon, 14 Oct 2002 15:30:42 +0200 (CEST)
> Where to draw the line between a loop of INVLPG and a CR3 flush on
> x86 is up in the air - i'd say it's at roughly 8 pages currently
I'd say it's highly x86 revision dependent and that it
can be easily calibrated at boot time :-)
On Mon, Oct 14, 2002 at 02:38:43PM +0200, Ingo Molnar wrote:
> the patch has got an initial review from Linus, Andrew Morton and Hugh
> Dickins, and most of their suggestions are part of the patch already.
hmm
On Mon, Oct 14, 2002 at 02:38:43PM +0200, Ingo Molnar wrote:
> +int sys_remap_file_pages(unsigned long start, unsigned long size,
> + unsigned long prot, unsigned long pgoff, unsigned long flags)
ISTR suggesting vectorizing this to reduce syscall traffic.
The semantics of faulting in pages on such a region, while not
incorrect, are at least unusual enough to raise the question of
whether it's appropriate for the kernel to fill the pages as opposed
to returning an error to userspace. The requirement of MAP_LOCKED
or PROT_NONE might as well be in-kernel if the file offset contiguity
assumption is not met, since there's no reliable way for userspace to
use the syscall otherwise. sys_remap_file_pages() also interacts in
an unusual way with the semantics of MAP_POPULATE. MAP_POPULATE seems
to perform a non-blocking make_pages_present() operation not shared
with MAP_LOCKED, and filemap_populate() performs the file offset
contiguous prefaulting which again doesn't mix well with the scatter
gather semantics desired. These are the file offset contiguity
assumptions I mentioned.
Also, a strange phenomenon appears in filemap_populate(), where
nonblock may be true, and so filemap_getpage() will return NULL,
but -ENOMEM is returned if filemap_getpage() returns NULL.
Also, I see a significant portion of filemap_nopage() duplicated in
filemap_getpage(), including long-stale hashtable-related comments.
Bill
On Mon, 14 Oct 2002, William Lee Irwin III wrote:
> > +int sys_remap_file_pages(unsigned long start, unsigned long size,
> > + unsigned long prot, unsigned long pgoff, unsigned long flags)
>
> ISTR suggesting vectorizing this to reduce syscall traffic.
well, i don't agree with vectorizing everything, unless it's some really
lightweight thing and/or the operation is somehow naturally vectorized
(block and network IO, etc.)
> The semantics of faulting in pages on such a region, while not
> incorrect, are at least unusual enough to raise the question of whether
> it's appropriate for the kernel to fill the pages as opposed to
> returning an error to userspace. [...]
the pagefault path simply does not have the information whether a mapping
was remapped nonlinearly, possibly long before. In the initial patch i had
a VM_RANDOM flag for vmas, which was set up at mmap() time, but this
restricted the API needlessly. Tracking whether a mapping was remapped
randomly before does not sound too useful to me either. So right now it's
the responsibility of the user to use the API in a meaningful way - is
that such a big problem?
> [...] The requirement of MAP_LOCKED or PROT_NONE might as well be
> in-kernel if the file offset contiguity assumption is not met, [...]
how would you do this actually, without other restrictions?
> sys_remap_file_pages() also interacts in an unusual way with the
> semantics of MAP_POPULATE. MAP_POPULATE seems to perform a non-blocking
> make_pages_present() operation not shared with MAP_LOCKED, [...]
what it does is a blocking make_pages_present(), for nonblocking you also
need to specify MAP_NONBLOCK. I agree that the mlock path should/could be
merged with the populate path, this was suggested by Linus as well.
> [...] and filemap_populate() performs the file offset contiguous
> prefaulting which again doesn't mix well with the scatter gather
> semantics desired.
huh?
> Also, a stranger phenomenon appears in filemap_populate(), where
> nonblock may be true, and so filemap_getpage() will return NULL, but
> -ENOMEM is returned if filemap_getpage() returns NULL.
true, this is a bug, i fixed it in the -F9 patch at:
http://redhat.com/~mingo/remap-file-pages-patches/
> Also, I see a significant portion of filemap_nopage() duplicated in
> filemap_getpage(), including long-stale hashtable-related comments.
check the announcement email for details about the seemingly duplicated
code of filemap_nopage() and filemap_getpage(). And which hashing comments
do you mean? We still hash pagecache pages.
Ingo
> And which hashing comments do you mean? We still hash pagecache pages.
i see, indeed those comments (and other comments in filemap.c) should
refer to the radix tree, although people do keep thinking of it as a hash
:-)
Ingo
On Mon, 14 Oct 2002, William Lee Irwin III wrote:
>> ISTR suggesting vectorizing this to reduce syscall traffic.
On Mon, Oct 14, 2002 at 05:02:26PM +0200, Ingo Molnar wrote:
> well, i dont agree with vectorizing everything, unless it's some really
> lightweight thing and/or the operation is somehow naturally vectorized.
> (block and network IO, etc.)
I would say it makes sense for its intended purpose, as large volumes
of pages are expected to be remapped this way.
On Mon, 14 Oct 2002, William Lee Irwin III wrote:
>> The semantics of faulting in pages on such a region, while not
>> incorrect, are at least unusual enough to raise the question of whether
>> it's appropriate for the kernel to fill the pages as opposed to
>> returning an error to userspace. [...]
On Mon, Oct 14, 2002 at 05:02:26PM +0200, Ingo Molnar wrote:
> the pagefault path simply does not have the information whether a mapping
> was nonlinear, possibly long before. In the initial patch i had a
> VM_RANDOM flag for vmas, which was set up at mmap() time, but this
> restricted the API needlessly. Tracking whether a mapping was remapped
> randomly before does not sound too useful to me either. So right now it's
> the responsibility of the user to use the API in a meaningful way - is it
> such a big problem?
The VM_RANDOM flag was a safety net I believed to be beneficial, in
part because userspace would be limited in how it can utilize the
remapping facility without guarantees that mappings are not implicitly
established in ways it does not expect.
On Mon, 14 Oct 2002, William Lee Irwin III wrote:
>> [...] The requirement of MAP_LOCKED or PROT_NONE might as well be
>> in-kernel if the file offset contiguity assumption is not met, [...]
On Mon, Oct 14, 2002 at 05:02:26PM +0200, Ingo Molnar wrote:
> how would you do this actually, without other restrictions?
It would be easy to keep the VM_RANDOM flag for truly random vma's,
and check for file offsets matching in the prefault path for others.
On Mon, 14 Oct 2002, William Lee Irwin III wrote:
>> sys_remap_file_pages() also interacts in an unusual way with the
>> semantics of MAP_POPULATE. MAP_POPULATE seems to perform a non-blocking
>> make_pages_present() operation not shared with MAP_LOCKED, [...]
On Mon, Oct 14, 2002 at 05:02:26PM +0200, Ingo Molnar wrote:
> what it does is a blocking make_pages_present(), for nonblocking you also
> need to specify MAP_NONBLOCK. I agree that the mlock path should/could be
> merged with the populate path, this was suggested by Linus as well.
It's optionally nonblocking, a small communication failure of mine.
On Mon, 14 Oct 2002, William Lee Irwin III wrote:
>> [...] and filemap_populate() performs the file offset contiguous
>> prefaulting which again doesn't mix well with the scatter gather
>> semantics desired.
On Mon, Oct 14, 2002 at 05:02:26PM +0200, Ingo Molnar wrote:
> huh?
A nonblocking filemap_populate() may partially populate a virtual
address range and the user may later fault in pages not file offset
contiguous in the prefaulted region as opposed to discovering them as
not present. For the MAP_LOCKED | PROT_NONE scenario this would apply
only with a nonblocking prefault on a MAP_LOCKED vma, where the caller
would find stale mappings where the prefault operation failed, and even
if the nonblocking path invalidated the pte's one would still return to
faulting in of the wrong pages. This is a safety question limiting the
usefulness of nonblocking prefaults on scatter gather vma's to userspace.
On Mon, 14 Oct 2002, William Lee Irwin III wrote:
>> Also, I see a significant portion of filemap_nopage() duplicated in
>> filemap_getpage(), including long-stale hashtable-related comments.
On Mon, Oct 14, 2002 at 05:02:26PM +0200, Ingo Molnar wrote:
> check the announcement email for details about the seemingly duplicated
> code of filemap_nopage() and filemap_getpage(). And which hashing comments
> do you mean? We still hash pagecache pages.
Since the inclusion of the radix tree pagecache I've regarded these
comments about hashing as stale.
Bill
On Mon, 14 Oct 2002, William Lee Irwin III wrote:
> > well, i dont agree with vectorizing everything, unless it's some really
> > lightweight thing and/or the operation is somehow naturally vectorized.
> > (block and network IO, etc.)
>
> I would say it makes sense for its intended purpose, as large volumes of
> pages are expected to be remapped this way.
volume alone does not necessitate a vectored API - everything depends on
whether the volume comes in chunks, ie. is predictable. For things like an
LRU DB-cache the order of new cache entries is largely unpredictable.
Even when predictable, they are mostly contiguous in the file space, ie.
properly vectorized by the API already. There might be cases where there
are bulk mappings, but i'm quite sure the common case isn't. Just like we
don't have bulk mmap() support either. But if it really turns out to be a
problem then a bulk API can be added later on, which would just build upon
the existing syscall.
> The VM_RANDOM flag was a safety net I believed to be beneficial, in part
> because userspace would be limited in how it can utilize the remapping
> facility without guarantees that mappings are not implicitly established
> in ways it does not expect.
if this is really an issue then we could force vma->vm_page_prot to
PROT_NONE within remap_file_pages(), so at least all subsequent faults
will be PROT_NONE and the user would have to explicitly re-mprotect() the
vma again to change this.
> > how would you do this actually, without other restrictions?
>
> It would be easy to keep the VM_RANDOM flag for truly random vma's, and
> check for file offsets matching in the prefault path for others.
there is no 'VM_RANDOM' flag right now - because *any* shared mapping can
be used with remap_file_pages(). I think forcing the default protection to
PROT_NONE should solve the problem - this is the most logical thing to do
in the MAP_LOCKED case as well.
> A nonblocking filemap_populate() may partially populate a virtual
> address range and the user may later fault in pages not file offset
> contiguous in the prefaulted region as opposed to discovering them as
> not present. For the MAP_LOCKED | PROT_NONE scenario this would apply
> only with a nonblocking prefault on a MAP_LOCKED vma, where the caller
> would find stale mappings where the prefault operation failed, and even
> if the nonblocking path invalidated the pte's one would still return to
> faulting in of the wrong pages. This is a safety question limiting the
> usefulness of nonblocking prefaults on scatter gather vma's to
> userspace.
this too is handled by changing the default protection to PROT_NONE. A
user would have to deliberately change the protection to get rid of this
safety net - at which point no harm will be done, the user will get what
he asked for.
Ingo
> if this is really an issue then we could force vma->vm_page_prot to
> PROT_NONE within remap_file_pages(), so at least all subsequent faults
> will be PROT_NONE and the user would have to explicitly re-mprotect()
> the vma again to change this.
i've added this to the -G1 patch at:
http://redhat.com/~mingo/remap-file-pages-patches/
Ingo
Ingo Molnar wrote:
>
> ...
>
> - readahead: currently filemap_populate() does not initiate further
> readahead - this is mainly in recognition of the mostly random nature
> of remap_file_pages() remappings. Or should we trust the readahead engine
> and use one filemap_getpage() function for filemap_nopage() and
> filemap_populate()? The readahead should not go beyond the window
> specified by the populate() function, which is distinct from the mostly
> guessing work filemap_nopage() has to do. This is the reason why i
> separated the two codepaths, although they did similar things.
We will need some way of scheduling large chunks of reads concurrently.
I'd agree that we can do better than just relying on the current
readahead/readaround code though.
It looks to be pretty simple - work out the file offset and number of
pages which are covered by the remap_file_pages() call and then pass
those to do_page_cache_readahead(). It will start async reads of any
not-present pages in the range. Then we can drop into your current code
and start waiting on the IO.
do_page_cache_readahead() is independent of the file->f_ra state. It's
just a "get this chunk of the file into pagecache" function.
A few lines on entry to filemap_populate() will do this. The readahead
code does preallocate and pin its pages before starting IO, so we'd
need some checks that the user isn't asking for a ridiculous amount
of memory.
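(Concretely, something like the following at the top of
filemap_populate() - a sketch only, assuming the current
do_page_cache_readahead(file, offset, nr_to_read) form:)

	/* start async reads for all not-present pages in the range,
	   before the install loop starts waiting on them one by one: */
	if (!nonblock)
		do_page_cache_readahead(file, pgoff,
			(len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT);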
hm. Actually this may simplify things. Just run do_page_cache_readahead()
against the affected file area and then we *know* that all pages are
present, unless there's eviction happening. And handle the latter
via SIGBUS. If that's a reasonable approach then filemap_getpage()
just becomes:
page = find_get_page(mapping, index);
if (page)
wait_on_page_locked(page);
return page; /* Caller checks PageUptodate */
All the above gives optimal IO scheduling for a single call to
remap_file_pages(). But there are additional IO scheduling
benefits available from launching IO against the separate chunks
of the file which will be subject to remap_file_pages() in the
future. Only the application knows that. Seems that sys_readahead()
is perfectly suited to this.
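(From userspace that would be something like the fragment below; fd,
elem_pgoff and SLOT_SIZE are illustrative, and readahead() is the glibc
wrapper for sys_readahead() - where the wrapper doesn't exist yet,
syscall(__NR_readahead, ...) does the same:)

#define _GNU_SOURCE
#include <fcntl.h>

#define SLOT_SIZE (4 * 4096UL)

/* hint the kernel about the next cache element to be remapped, so the
 * IO is already in flight by the time remap_file_pages() runs: */
static void prime_slot(int fd, unsigned long elem_pgoff)
{
	readahead(fd, (off_t)elem_pgoff * 4096, SLOT_SIZE);
}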
> i've also tested the patch by unconditionally enabling prefaulting in
> mmap(), which produced a fully working system,
That's a good thing. If you can suggest any additional temp changes
along these lines, that would help to get this code settled down quickly.
I'd suggest you just include those things in the diff for now, if poss.
> all in one, nonlinear mappings (not present in any other OS i'm aware of)
> would IMO be a nice twist to the already excellent and incredibly flexible
> Linux VM. Comments, suggestions are welcome!
Yup. What additional work do you believe is needed before this
is ready to go? (Apart from making it compile on sparc ;))
Are you missing a page_cache_release() on the error path in install_page()?
On Mon, Oct 14, 2002 at 06:02:30PM +0200, Ingo Molnar wrote:
>> if this is really an issue then we could force vma->vm_page_prot to
>> PROT_NONE within remap_file_pages(), so at least all subsequent faults
>> will be PROT_NONE and the user would have to explicitly re-mprotect()
>> the vma again to change this.
> i've added this to the -G1 patch at:
> http://redhat.com/~mingo/remap-file-pages-patches/
> Ingo
Okay, if PROT_NONE is in there, my last remaining concern would be
solved by invalidating the pte's of failed nonblocking prefaults.
A nonblocking remapping done "over the top" of a preexisting mapping
(such as would occur with MAP_LOCKED) may fail at unpredictable times,
which is fine from the kernel point of view but leaves userspace with
no way of telling which pages that it requested to be prefaulted
actually made it in, and which pages are residue from the prior mapping.
Basically something like this would solve this API semantics issue
(totally untested). Against mpopulate-2.5.42-G1
diff -urpN mpop-2.5.42/include/linux/mm.h wlipop-2.5.42/include/linux/mm.h
--- mpop-2.5.42/include/linux/mm.h 2002-10-14 11:43:03.000000000 -0700
+++ wlipop-2.5.42/include/linux/mm.h 2002-10-14 12:26:27.000000000 -0700
@@ -424,6 +424,7 @@ static inline pmd_t *pmd_alloc(struct mm
return pmd_offset(pgd, address);
}
+extern void zap_pte(struct mm_struct *, pte_t *);
extern void free_area_init(unsigned long * zones_size);
extern void free_area_init_node(int nid, pg_data_t *pgdat, struct page *pmap,
unsigned long * zones_size, unsigned long zone_start_pfn,
diff -urpN mpop-2.5.42/mm/filemap.c wlipop-2.5.42/mm/filemap.c
--- mpop-2.5.42/mm/filemap.c 2002-10-14 11:43:03.000000000 -0700
+++ wlipop-2.5.42/mm/filemap.c 2002-10-14 12:12:08.000000000 -0700
@@ -1286,9 +1286,27 @@ repeat:
return -EINVAL;
page = filemap_getpage(file, pgoff, nonblock);
- if (!page && !nonblock)
- return -ENOMEM;
- if (page) {
+
+ if (!page) {
+ pgd_t *pgd;
+
+ if (!nonblock)
+ return -ENOMEM;
+
+ spin_lock(&mm->page_table_lock);
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_none(*pgd) && !pgd_bad(*pgd)) {
+ pmd_t *pmd = pmd_offset(pgd, addr);
+ if (!pmd_none(*pmd) && !pmd_bad(*pmd)) {
+ pte_t *pte = pte_offset_map(pmd, addr);
+ if (pte) {
+ zap_pte(mm, pte);
+ pte_unmap(pte);
+ }
+ }
+ }
+ spin_unlock(&mm->page_table_lock);
+ } else {
err = install_page(mm, vma, addr, page, prot);
if (err)
return err;
diff -urpN mpop-2.5.42/mm/fremap.c wlipop-2.5.42/mm/fremap.c
--- mpop-2.5.42/mm/fremap.c 2002-10-14 11:43:03.000000000 -0700
+++ wlipop-2.5.42/mm/fremap.c 2002-10-14 12:26:37.000000000 -0700
@@ -13,7 +13,7 @@
#include <linux/swapops.h>
#include <asm/mmu_context.h>
-static inline void zap_pte(struct mm_struct *mm, pte_t *ptep)
+void zap_pte(struct mm_struct *mm, pte_t *ptep)
{
pte_t pte = *ptep;
On Mon, Oct 14, 2002 at 06:02:30PM +0200, Ingo Molnar wrote:
>> if this is really an issue then we could force vma->vm_page_prot to
>> PROT_NONE within remap_file_pages(), so at least all subsequent faults
>> will be PROT_NONE and the user would have to explicitly re-mprotect()
>> the vma again to change this.
> i've added this to the -G1 patch at:
> http://redhat.com/~mingo/remap-file-pages-patches/
> Ingo
Also, this may be relaxed when the file offsets match.
Against unpatched -G1:
Bill
--- mpop-2.5.42/mm/fremap.c 2002-10-14 11:43:03.000000000 -0700
+++ wlipop-2.5.42/mm/fremap.c 2002-10-14 14:17:11.000000000 -0700
@@ -129,10 +129,16 @@
end > start && start >= vma->vm_start &&
end <= vma->vm_end) {
/*
- * Change the default protection to PROT_NONE:
+ * Change the default protection to PROT_NONE if
+ * the file offset doesn't coincide with the vma's:
*/
- if (pgprot_val(vma->vm_page_prot) != pgprot_val(__S000))
- vma->vm_page_prot = __S000;
+ if (pgprot_val(vma->vm_page_prot) != pgprot_val(__S000)) {
+ unsigned long offset;
+ offset = (start - vma->vm_start) >> PAGE_CACHE_SHIFT
+ + vma->vm_pgoff;
+ if (offset != pgoff)
+ vma->vm_page_prot = __S000;
+ }
err = vma->vm_ops->populate(vma, start, size, prot,
pgoff, flags & MAP_NONBLOCK);
}
On Mon, Oct 14, 2002 at 02:20:45PM -0700, William Lee Irwin III wrote:
+ offset = (start - vma->vm_start) >> PAGE_CACHE_SHIFT
+ + vma->vm_pgoff;
I'm not so old I should be forgetting C already.
--- mpop-2.5.42/mm/fremap.c 2002-10-14 11:43:03.000000000 -0700
+++ wlipop-2.5.42/mm/fremap.c 2002-10-14 14:17:11.000000000 -0700
@@ -129,10 +129,16 @@
end > start && start >= vma->vm_start &&
end <= vma->vm_end) {
/*
- * Change the default protection to PROT_NONE:
+ * Change the default protection to PROT_NONE if
+ * the file offset doesn't coincide with the vma's:
*/
- if (pgprot_val(vma->vm_page_prot) != pgprot_val(__S000))
- vma->vm_page_prot = __S000;
+ if (pgprot_val(vma->vm_page_prot) != pgprot_val(__S000)) {
+ unsigned long offset;
+ offset = ((start - vma->vm_start) >> PAGE_CACHE_SHIFT)
+ + vma->vm_pgoff;
+ if (offset != pgoff)
+ vma->vm_page_prot = __S000;
+ }
err = vma->vm_ops->populate(vma, start, size, prot,
pgoff, flags & MAP_NONBLOCK);
}
I have a couple of suggestions/questions that may make this syscall
more generally useful.
Ingo Molnar wrote:
> * @prot: new protection bits of the range
So, you can change the protection per page without thousands of vmas?
Some garbage collectors could take advantage of that.
> Since all the mapping information of nonlinear vmas lives in the
> pagetables, they either have to be MAP_LOCKED (for databases or
> virtualization software) or need a special SIGBUS handler to
> reinstall mappings across swapping (for more complex uses such as
> memory debuggers).
I like the SIGBUS. Am I correct to assume that, when there is memory
pressure, some pages are evicted from memory and all accesses to those
pages from userspace will raise SIGBUS until the mapping is
reestablished? Am I correct to assume this works fine on /dev/shm files?
This has uses in programs that cache data that is faster to
recalculate than to swap. For example, a vector image displayer might
prefer to re-render parts of an image than to wait for parts of a
large cached image to page in. A JIT run-time compiler might prefer
to regenerate code fragments on demand than to wait for paged out,
cached code fragments to page back in.
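A minimal sketch of such a handler, assuming the application can rebuild
a page's contents from its own state - pgoff_for() and regenerate() are
hypothetical helpers, and remap_file_pages() is the syscall()-based
wrapper from earlier in the thread:

#include <signal.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL

extern unsigned long pgoff_for(unsigned long addr);	/* hypothetical */
extern void regenerate(unsigned long pgoff);		/* hypothetical */
extern long remap_file_pages(unsigned long start, unsigned long size,
		unsigned long prot, unsigned long pgoff, unsigned long flags);

static void on_sigbus(int sig, siginfo_t *si, void *uctx)
{
	unsigned long addr = (unsigned long)si->si_addr & ~(PAGE_SIZE - 1);
	unsigned long pgoff = pgoff_for(addr);

	regenerate(pgoff);	/* re-render / re-JIT into the backing file */
	remap_file_pages(addr, PAGE_SIZE, PROT_READ|PROT_WRITE, pgoff, 0);
	/* returning retries the faulting instruction */
}

static void install_sigbus_handler(void)
{
	struct sigaction sa;

	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);
}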
I think that the above two features are already supported by your
patch, by simply using a /dev/shm file as the backing store. Ingo,
can you confirm this?
Thanks,
-- Jamie
On Tue, 15 Oct 2002, Jamie Lokier wrote:
>
> I like the SIGBUS. Am I correct to assume that, when there is memory
> pressure, some pages are evicted from memory and all accesses to those
> pages from userspace will raise SIGBUS until the mapping is
> reestablished? Am I correct to assume this works fine on /dev/shm files?
It should work fine, assuming the shm interface has a "populate" method.
The only real problem I think the interface has is that the nonlinear
mappings cannot have private pages in them - because while the MM would be
happy to swap them out, there is no sane way to get them back.
The private page information actually does exist in the page tables, but
the _protection_ does not. That's in the VMA, and since one of the whole
points with the nonlinear mapping is that the VMA is "anonymous", we're
kind of screwed.
As it is, you cannot really even add private pages to the linear mapping:
all the page add interfaces are for shared pages only. But I could imagine
that it could be useful to have a "add private page here".
(The "add private page here" interface might also have a "use previous
page if one existed" method, in which case the SIGBUS handler could just
use that one, and thus generate the protection information on the fly -
while generating the actual swap-in data from the page table entry itself)
Ingo - what do you think? I suspect an anonymous ("swap-backed" as opposed
to "backed by this file") interface might be quite useful.
Linus