2006-10-07 13:05:44

by Nick Piggin

[permalink] [raw]
Subject: [rfc] 2.6.19-rc1: vm stuff

The first 3 patches are some minor fixes and rearrangements for the
page allocator and are probably fit to go into -mm.

The next set of 3 patches is another attempt at solving the invalidate
vs pagefault race (this got reintroduced after invalidate_complete_page2
was added, and has always been present for nonlinear mappings). These
are booted and have had some stress testing, but are still at the RFC
stage. Comments?

Nick


2006-10-07 13:05:51

by Nick Piggin

[permalink] [raw]
Subject: [patch 1/3] mm: arch_free_page fix

After the PG_reserved check was added, arch_free_page was being called in the
wrong place (it could be called for a page we don't actually want to free).
Fix that.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2006-08-05 18:38:50.000000000 +1000
+++ linux-2.6/mm/page_alloc.c 2006-09-17 17:19:32.000000000 +1000
@@ -443,7 +443,6 @@ static void __free_pages_ok(struct page
int i;
int reserved = 0;

- arch_free_page(page, order);
if (!PageHighMem(page))
debug_check_no_locks_freed(page_address(page),
PAGE_SIZE<<order);
@@ -453,7 +452,9 @@ static void __free_pages_ok(struct page
if (reserved)
return;

+ arch_free_page(page, order);
kernel_map_pages(page, 1 << order, 0);
+
local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order);
free_one_page(page_zone(page), page, order);
@@ -717,13 +718,12 @@ static void fastcall free_hot_cold_page(
struct per_cpu_pages *pcp;
unsigned long flags;

- arch_free_page(page, 0);
-
if (PageAnon(page))
page->mapping = NULL;
if (free_pages_check(page))
return;

+ arch_free_page(page, 0);
kernel_map_pages(page, 1, 0);

pcp = &zone_pcp(zone, get_cpu())->pcp[cold];

2006-10-07 13:06:02

by Nick Piggin

[permalink] [raw]
Subject: [patch 2/3] mm: locks_freed fix

Move the lock debug checks below the page reserved checks.
Also, having debug_check_no_locks_freed in kernel_map_pages is wrong.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2006-10-06 11:59:47.000000000 +1000
+++ linux-2.6/include/linux/mm.h 2006-10-06 12:00:02.000000000 +1000
@@ -1038,12 +1038,7 @@ static inline void vm_stat_account(struc

#ifndef CONFIG_DEBUG_PAGEALLOC
static inline void
-kernel_map_pages(struct page *page, int numpages, int enable)
-{
- if (!PageHighMem(page) && !enable)
- debug_check_no_locks_freed(page_address(page),
- numpages * PAGE_SIZE);
-}
+kernel_map_pages(struct page *page, int numpages, int enable) {}
#endif

extern struct vm_area_struct *get_gate_vma(struct task_struct *tsk);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2006-10-06 11:59:47.000000000 +1000
+++ linux-2.6/mm/page_alloc.c 2006-10-06 12:01:52.000000000 +1000
@@ -443,15 +443,13 @@ static void __free_pages_ok(struct page
int i;
int reserved = 0;

- if (!PageHighMem(page))
- debug_check_no_locks_freed(page_address(page),
- PAGE_SIZE<<order);
-
for (i = 0 ; i < (1 << order) ; ++i)
reserved += free_pages_check(page + i);
if (reserved)
return;

+ if (!PageHighMem(page))
+ debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
arch_free_page(page, order);
kernel_map_pages(page, 1 << order, 0);

@@ -723,6 +721,8 @@ static void fastcall free_hot_cold_page(
if (free_pages_check(page))
return;

+ if (!PageHighMem(page))
+ debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
arch_free_page(page, 0);
kernel_map_pages(page, 1, 0);

2006-10-07 13:06:39

by Nick Piggin

[permalink] [raw]
Subject: [patch 1/3] mm: fault vs invalidate/truncate check

Add a bugcheck for Andrea's pagefault vs invalidate race. This is triggerable
for both linear and nonlinear pages with a userspace test harness (using
direct IO and truncate, respectively).

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -120,6 +120,8 @@ void __remove_from_page_cache(struct pag
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
+
+ BUG_ON(page_mapped(page));
}

void remove_from_page_cache(struct page *page)

2006-10-07 13:06:25

by Nick Piggin

[permalink] [raw]
Subject: [patch 3/3] mm: add arch_alloc_page

Add an arch_alloc_page to match arch_free_page.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h 2006-10-06 11:59:20.000000000 +1000
+++ linux-2.6/include/linux/gfp.h 2006-10-06 12:02:35.000000000 +1000
@@ -101,6 +101,9 @@ static inline int gfp_zone(gfp_t gfp)
#ifndef HAVE_ARCH_FREE_PAGE
static inline void arch_free_page(struct page *page, int order) { }
#endif
+#ifndef HAVE_ARCH_ALLOC_PAGE
+static inline void arch_alloc_page(struct page *page, int order) { }
+#endif

extern struct page *
FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2006-10-06 12:01:52.000000000 +1000
+++ linux-2.6/mm/page_alloc.c 2006-10-06 12:02:35.000000000 +1000
@@ -552,6 +552,8 @@ static int prep_new_page(struct page *pa
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
set_page_refcounted(page);
+
+ arch_alloc_page(page, order);
kernel_map_pages(page, 1 << order, 1);

if (gfp_flags & __GFP_ZERO)

2006-10-07 13:07:01

by Nick Piggin

[permalink] [raw]
Subject: [patch 2/3] mm: fault vs invalidate/truncate race fix

Fix the race between invalidate_inode_pages and do_no_page.

Andrea Arcangeli identified a subtle race between invalidation of
pages from pagecache with userspace mappings, and do_no_page.

The issue is that invalidation has to shoot down all mappings to the
page, before it can be discarded from the pagecache. Between shooting
down ptes to a particular page, and actually dropping the struct page
from the pagecache, do_no_page from any process might fault on that
page and establish a new mapping to the page just before it gets
discarded from the pagecache.

The most common case where such invalidation is used is in file
truncation. This case was catered for by doing a sort of open-coded
seqlock between the file's i_size, and its truncate_count.

Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then
find the page if it is within i_size, and then check truncate_count
under the page table lock and back out and retry if it had
subsequently been changed (ptl will serialise against unmapping, and
ensure a potentially updated truncate_count is actually visible).

Complexity and documentation issues aside, the locking protocol fails
in the case where we would like to invalidate pagecache inside i_size.
do_no_page can come in anytime and filemap_nopage is not aware of the
invalidation in progress (as it is when it is outside i_size). The
end result is that dangling (->mapping == NULL) pages that appear to
be from a particular file may be mapped into userspace with nonsense
data. Valid mappings to the same place will see a different page.

Andrea implemented two working fixes, one using a real seqlock,
another using a page->flags bit. He also proposed using the page lock
in do_no_page, but that was initially considered too heavyweight.
However, it is not a global or per-file lock, and the page cacheline
is modified in do_no_page to increment _count and _mapcount anyway, so
a further modification should not be a large performance hit.
Scalability is not an issue.

This patch implements this latter approach. ->nopage implementations
return with the page locked if it is possible for their underlying
file to be invalidated (in that case, they must set a special vm_flags
bit to indicate so). do_no_page only unlocks the page after setting
up the mapping completely. invalidation is excluded because it holds
the page lock during invalidation of each page (and ensures that the
page is not mapped while holding the lock).

This allows significant simplifications in do_no_page.

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -166,6 +166,11 @@ extern unsigned int kobjsize(const void
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
#define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */
#define VM_INSERTPAGE 0x02000000 /* The vma has had "vm_insert_page()" done on it */
+#define VM_CAN_INVALIDATE 0x04000000 /* The mapping may be invalidated,
+ * eg. truncate or invalidate_inode_*.
+ * In this case, do_no_page must
+ * return with the page locked.
+ */

#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1363,9 +1363,10 @@ struct page *filemap_nopage(struct vm_ar
unsigned long size, pgoff;
int did_readaround = 0, majmin = VM_FAULT_MINOR;

+ BUG_ON(!(area->vm_flags & VM_CAN_INVALIDATE));
+
pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;

-retry_all:
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (pgoff >= size)
goto outside_data_content;
@@ -1387,7 +1388,7 @@ retry_all:
* Do we have something in the page cache already?
*/
retry_find:
- page = find_get_page(mapping, pgoff);
+ page = find_lock_page(mapping, pgoff);
if (!page) {
unsigned long ra_pages;

@@ -1421,7 +1422,7 @@ retry_find:
start = pgoff - ra_pages / 2;
do_page_cache_readahead(mapping, file, start, ra_pages);
}
- page = find_get_page(mapping, pgoff);
+ page = find_lock_page(mapping, pgoff);
if (!page)
goto no_cached_page;
}
@@ -1430,13 +1431,19 @@ retry_find:
ra->mmap_hit++;

/*
- * Ok, found a page in the page cache, now we need to check
- * that it's up-to-date.
+ * We have a locked page in the page cache, now we need to check
+ * that it's up-to-date. If not, it is going to be due to an error.
*/
- if (!PageUptodate(page))
+ if (unlikely(!PageUptodate(page)))
goto page_not_uptodate;

-success:
+ /* Must recheck i_size under page lock */
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ if (unlikely(pgoff >= size)) {
+ unlock_page(page);
+ goto outside_data_content;
+ }
+
/*
* Found the page and have a reference on it.
*/
@@ -1479,34 +1486,11 @@ no_cached_page:
return NOPAGE_SIGBUS;

page_not_uptodate:
+ /* IO error path */
if (!did_readaround) {
majmin = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
}
- lock_page(page);
-
- /* Did it get unhashed while we waited for it? */
- if (!page->mapping) {
- unlock_page(page);
- page_cache_release(page);
- goto retry_all;
- }
-
- /* Did somebody else get it up-to-date? */
- if (PageUptodate(page)) {
- unlock_page(page);
- goto success;
- }
-
- error = mapping->a_ops->readpage(file, page);
- if (!error) {
- wait_on_page_locked(page);
- if (PageUptodate(page))
- goto success;
- } else if (error == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- goto retry_find;
- }

/*
* Umm, take care of errors if the page isn't up-to-date.
@@ -1514,37 +1498,15 @@ page_not_uptodate:
* because there really aren't any performance issues here
* and we need to check for errors.
*/
- lock_page(page);
-
- /* Somebody truncated the page on us? */
- if (!page->mapping) {
- unlock_page(page);
- page_cache_release(page);
- goto retry_all;
- }
-
- /* Somebody else successfully read it in? */
- if (PageUptodate(page)) {
- unlock_page(page);
- goto success;
- }
ClearPageError(page);
error = mapping->a_ops->readpage(file, page);
- if (!error) {
- wait_on_page_locked(page);
- if (PageUptodate(page))
- goto success;
- } else if (error == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
+ page_cache_release(page);
+
+ if (!error || error == AOP_TRUNCATED_PAGE)
goto retry_find;
- }

- /*
- * Things didn't work out. Return zero to tell the
- * mm layer so, possibly freeing the page cache page first.
- */
+ /* Things didn't work out. Return zero to tell the mm layer so. */
shrink_readahead_size_eio(file, ra);
- page_cache_release(page);
return NOPAGE_SIGBUS;
}
EXPORT_SYMBOL(filemap_nopage);
@@ -1737,6 +1699,7 @@ int generic_file_mmap(struct file * file
return -ENOEXEC;
file_accessed(file);
vma->vm_ops = &generic_file_vm_ops;
+ vma->vm_flags |= VM_CAN_INVALIDATE;
return 0;
}

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1674,6 +1674,13 @@ static int unmap_mapping_range_vma(struc
unsigned long restart_addr;
int need_break;

+ /*
+ * files that support invalidating or truncating portions of the
+ * file from under mmaped areas must set the VM_CAN_INVALIDATE flag, and
+ * have their .nopage function return the page locked.
+ */
+ BUG_ON(!(vma->vm_flags & VM_CAN_INVALIDATE));
+
again:
restart_addr = vma->vm_truncate_count;
if (is_restart_addr(restart_addr) && start_addr < restart_addr) {
@@ -1804,17 +1811,8 @@ void unmap_mapping_range(struct address_

spin_lock(&mapping->i_mmap_lock);

- /* serialize i_size write against truncate_count write */
- smp_wmb();
- /* Protect against page faults, and endless unmapping loops */
+ /* Protect against endless unmapping loops */
mapping->truncate_count++;
- /*
- * For archs where spin_lock has inclusive semantics like ia64
- * this smp_mb() will prevent to read pagetable contents
- * before the truncate_count increment is visible to
- * other cpus.
- */
- smp_mb();
if (unlikely(is_restart_addr(mapping->truncate_count))) {
if (mapping->truncate_count == 0)
reset_vma_truncate_counts(mapping);
@@ -1853,7 +1851,6 @@ int vmtruncate(struct inode * inode, lof
if (IS_SWAPFILE(inode))
goto out_busy;
i_size_write(inode, offset);
- unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
truncate_inode_pages(mapping, offset);
goto out_truncate;

@@ -1892,7 +1889,6 @@ int vmtruncate_range(struct inode *inode

mutex_lock(&inode->i_mutex);
down_write(&inode->i_alloc_sem);
- unmap_mapping_range(mapping, offset, (end - offset), 1);
truncate_inode_pages_range(mapping, offset, end);
inode->i_op->truncate_range(inode, offset, end);
up_write(&inode->i_alloc_sem);
@@ -2143,10 +2139,8 @@ static int do_no_page(struct mm_struct *
int write_access)
{
spinlock_t *ptl;
- struct page *new_page;
- struct address_space *mapping = NULL;
+ struct page *page, *nopage_page;
pte_t entry;
- unsigned int sequence = 0;
int ret = VM_FAULT_MINOR;
int anon = 0;
struct page *dirty_page = NULL;
@@ -2154,71 +2148,52 @@ static int do_no_page(struct mm_struct *
pte_unmap(page_table);
BUG_ON(vma->vm_flags & VM_PFNMAP);

- if (vma->vm_file) {
- mapping = vma->vm_file->f_mapping;
- sequence = mapping->truncate_count;
- smp_rmb(); /* serializes i_size against truncate_count */
- }
-retry:
- new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
- /*
- * No smp_rmb is needed here as long as there's a full
- * spin_lock/unlock sequence inside the ->nopage callback
- * (for the pagecache lookup) that acts as an implicit
- * smp_mb() and prevents the i_size read to happen
- * after the next truncate_count read.
- */
-
+ nopage_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
/* no page was available -- either SIGBUS or OOM */
- if (new_page == NOPAGE_SIGBUS)
+ if (nopage_page == NOPAGE_SIGBUS)
return VM_FAULT_SIGBUS;
- if (new_page == NOPAGE_OOM)
+ if (nopage_page == NOPAGE_OOM)
return VM_FAULT_OOM;

+ BUG_ON(vma->vm_flags & VM_CAN_INVALIDATE && !PageLocked(nopage_page));
+ /*
+ * For consistency in subsequent calls, make the nopage_page always
+ * locked. These should be in the minority but if they turn out to be
+ * critical then this can always be revisited
+ */
+ if (unlikely(!(vma->vm_flags & VM_CAN_INVALIDATE)))
+ lock_page(nopage_page);
+
/*
* Should we do an early C-O-W break?
*/
+ page = nopage_page;
if (write_access) {
if (!(vma->vm_flags & VM_SHARED)) {
- struct page *page;
-
- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
+ if (unlikely(anon_vma_prepare(vma))) {
+ ret = VM_FAULT_OOM;
+ goto out_error;
+ }
page = alloc_page_vma(GFP_HIGHUSER, vma, address);
- if (!page)
- goto oom;
- copy_user_highpage(page, new_page, address);
- page_cache_release(new_page);
- new_page = page;
+ if (!page) {
+ ret = VM_FAULT_OOM;
+ goto out_error;
+ }
+ copy_user_highpage(page, nopage_page, address);
anon = 1;
-
} else {
/* if the page will be shareable, see if the backing
* address space wants to know that the page is about
* to become writable */
if (vma->vm_ops->page_mkwrite &&
- vma->vm_ops->page_mkwrite(vma, new_page) < 0
- ) {
- page_cache_release(new_page);
- return VM_FAULT_SIGBUS;
+ vma->vm_ops->page_mkwrite(vma, page) < 0) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_error;
}
}
}

page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- /*
- * For a file-backed vma, someone could have truncated or otherwise
- * invalidated this page. If unmap_mapping_range got called,
- * retry getting the page.
- */
- if (mapping && unlikely(sequence != mapping->truncate_count)) {
- pte_unmap_unlock(page_table, ptl);
- page_cache_release(new_page);
- cond_resched();
- sequence = mapping->truncate_count;
- smp_rmb();
- goto retry;
- }

/*
* This silly early PAGE_DIRTY setting removes a race
@@ -2231,43 +2206,51 @@ retry:
* handle that later.
*/
/* Only go through if we didn't race with anybody else... */
- if (pte_none(*page_table)) {
- flush_icache_page(vma, new_page);
- entry = mk_pte(new_page, vma->vm_page_prot);
+ if (likely(pte_none(*page_table))) {
+ flush_icache_page(vma, page);
+ entry = mk_pte(page, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
- lru_cache_add_active(new_page);
- page_add_new_anon_rmap(new_page, vma, address);
+ lru_cache_add_active(page);
+ page_add_new_anon_rmap(page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
- page_add_file_rmap(new_page);
+ page_add_file_rmap(page);
if (write_access) {
- dirty_page = new_page;
+ dirty_page = page;
get_page(dirty_page);
}
}
+
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, address, entry);
+ lazy_mmu_prot_update(entry);
} else {
- /* One of our sibling threads was faster, back out. */
- page_cache_release(new_page);
- goto unlock;
+ if (anon)
+ page_cache_release(page);
+ else
+ anon = 1; /* not anon, but release nopage_page */
}

- /* no need to invalidate: a not-present page shouldn't be cached */
- update_mmu_cache(vma, address, entry);
- lazy_mmu_prot_update(entry);
-unlock:
pte_unmap_unlock(page_table, ptl);
- if (dirty_page) {
+
+out:
+ unlock_page(nopage_page);
+ if (anon)
+ page_cache_release(nopage_page);
+ else if (dirty_page) {
set_page_dirty_balance(dirty_page);
put_page(dirty_page);
}
+
return ret;
-oom:
- page_cache_release(new_page);
- return VM_FAULT_OOM;
+
+out_error:
+ anon = 1; /* relase nopage_page */
+ goto out;
}

/*
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -81,6 +81,7 @@ enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
SGP_WRITE, /* may exceed i_size, may allocate page */
+ SGP_NOPAGE, /* same as SGP_CACHE, return with page locked */
};

static int shmem_getpage(struct inode *inode, unsigned long idx,
@@ -1209,8 +1210,10 @@ repeat:
}
done:
if (*pagep != filepage) {
- unlock_page(filepage);
*pagep = filepage;
+ if (sgp != SGP_NOPAGE)
+ unlock_page(filepage);
+
}
return 0;

@@ -1229,13 +1232,15 @@ struct page *shmem_nopage(struct vm_area
unsigned long idx;
int error;

+ BUG_ON(!(vma->vm_flags & VM_CAN_INVALIDATE));
+
idx = (address - vma->vm_start) >> PAGE_SHIFT;
idx += vma->vm_pgoff;
idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
if (((loff_t) idx << PAGE_CACHE_SHIFT) >= i_size_read(inode))
return NOPAGE_SIGBUS;

- error = shmem_getpage(inode, idx, &page, SGP_CACHE, type);
+ error = shmem_getpage(inode, idx, &page, SGP_NOPAGE, type);
if (error)
return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS;

@@ -1333,6 +1338,7 @@ int shmem_mmap(struct file *file, struct
{
file_accessed(file);
vma->vm_ops = &shmem_vm_ops;
+ vma->vm_flags |= VM_CAN_INVALIDATE;
return 0;
}

@@ -2445,5 +2451,6 @@ int shmem_zero_setup(struct vm_area_stru
fput(vma->vm_file);
vma->vm_file = file;
vma->vm_ops = &shmem_vm_ops;
+ vma->vm_flags |= VM_CAN_INVALIDATE;
return 0;
}
Index: linux-2.6/fs/ncpfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/mmap.c
+++ linux-2.6/fs/ncpfs/mmap.c
@@ -123,6 +123,7 @@ int ncp_mmap(struct file *file, struct v
return -EFBIG;

vma->vm_ops = &ncp_file_mmap;
+ vma->vm_flags |= VM_CAN_INVALIDATE;
file_accessed(file);
return 0;
}
Index: linux-2.6/fs/ocfs2/mmap.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/mmap.c
+++ linux-2.6/fs/ocfs2/mmap.c
@@ -93,6 +93,7 @@ int ocfs2_mmap(struct file *file, struct

file_accessed(file);
vma->vm_ops = &ocfs2_file_vm_ops;
+ vma->vm_flags |= VM_CAN_INVALIDATE;
return 0;
}

Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c
@@ -343,6 +343,7 @@ xfs_file_mmap(
struct vm_area_struct *vma)
{
vma->vm_ops = &xfs_file_vm_ops;
+ vma->vm_flags |= VM_CAN_INVALIDATE;

#ifdef CONFIG_XFS_DMAPI
if (vn_from_inode(filp->f_dentry->d_inode)->v_vfsp->vfs_flag & VFS_DMI)
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c
+++ linux-2.6/ipc/shm.c
@@ -230,6 +230,7 @@ static int shm_mmap(struct file * file,
ret = shmem_mmap(file, vma);
if (ret == 0) {
vma->vm_ops = &shm_vm_ops;
+ vma->vm_flags |= VM_CAN_INVALIDATE;
if (!(vma->vm_flags & VM_WRITE))
vma->vm_flags &= ~VM_MAYWRITE;
shm_inc(shm_file_ns(file), file->f_dentry->d_inode->i_ino);
Index: linux-2.6/fs/ocfs2/dlmglue.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlmglue.c
+++ linux-2.6/fs/ocfs2/dlmglue.c
@@ -2656,7 +2656,6 @@ static int ocfs2_data_convert_worker(str
sync_mapping_buffers(mapping);
if (blocking == LKM_EXMODE) {
truncate_inode_pages(mapping, 0);
- unmap_mapping_range(mapping, 0, 0, 0);
} else {
/* We only need to wait on the I/O if we're not also
* truncating pages because truncate_inode_pages waits
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -163,6 +163,11 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
continue;
}
+ while (page_mapped(page)) {
+ unmap_mapping_range(mapping,
+ (loff_t)page_index<<PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ }
truncate_complete_page(mapping, page);
unlock_page(page);
}
@@ -200,6 +205,11 @@ void truncate_inode_pages_range(struct a
break;
lock_page(page);
wait_on_page_writeback(page);
+ while (page_mapped(page)) {
+ unmap_mapping_range(mapping,
+ (loff_t)page_index<<PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ }
if (page->index > next)
next = page->index;
next++;

2006-10-07 13:07:13

by Nick Piggin

[permalink] [raw]
Subject: [patch 3/3] mm: fault handler to replace nopage and populate

Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
encodes the virtual address -> file offset differently from linear
mappings.

I can't see why the filesystem/pagecache code should need to know anything
about it, except for the fact that the ->nopage handler didn't quite pass
down enough information (ie. pgoff). But it is more logical to pass pgoff
rather than have the ->nopage function calculate it itself anyway. And
having the nopage handler install the pte itself is sort of nasty.

This patch introduces a new fault handler that replaces ->nopage and ->populate
and (hopefully) ->page_mkwrite. Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.

The rationale for doing this in the first place is that nonlinear mappings
are subject to the pagefault vs invalidate/truncate race too, and it seemed
stupid to duplicate the synchronisation logic rather than just consolidate
the two.

Comments?

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -166,11 +166,12 @@ extern unsigned int kobjsize(const void
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
#define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */
#define VM_INSERTPAGE 0x02000000 /* The vma has had "vm_insert_page()" done on it */
-#define VM_CAN_INVALIDATE 0x04000000 /* The mapping may be invalidated,
+#define VM_CAN_INVALIDATE 0x04000000 /* The mapping may be invalidated,
* eg. truncate or invalidate_inode_*.
* In this case, do_no_page must
* return with the page locked.
*/
+#define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */

#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -194,6 +195,23 @@ extern unsigned int kobjsize(const void
*/
extern pgprot_t protection_map[16];

+#define FAULT_FLAG_WRITE 0x01
+#define FAULT_FLAG_NONLINEAR 0x02
+
+/*
+ * fault_data is filled in the the pagefault handler and passed to the
+ * vma's ->fault function. That function is responsible for filling in
+ * 'type', which is the type of fault if a page is returned, or the type
+ * of error if NULL is returned.
+ */
+struct fault_data {
+ struct vm_area_struct *vma;
+ unsigned long address;
+ pgoff_t pgoff;
+ unsigned int flags;
+
+ int type;
+};

/*
* These are the virtual MM functions - opening of an area, closing and
@@ -203,6 +221,7 @@ extern pgprot_t protection_map[16];
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
+ struct page * (*fault)(struct fault_data * data);
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
unsigned long (*nopfn)(struct vm_area_struct * area, unsigned long address);
int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
@@ -1028,9 +1047,7 @@ extern void truncate_inode_pages_range(s
loff_t lstart, loff_t lend);

/* generic vm_area_ops exported for stackable file systems */
-extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
-extern int filemap_populate(struct vm_area_struct *, unsigned long,
- unsigned long, pgprot_t, unsigned long, int);
+extern struct page *filemap_fault(struct fault_data *data);

/* mm/page-writeback.c */
int write_one_page(struct page *page, int wait);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2121,6 +2121,137 @@ oom:
return VM_FAULT_OOM;
}

+static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+{
+ spinlock_t *ptl;
+ struct page *page, *faulted_page;
+ pte_t entry;
+ int anon = 0;
+ struct page *dirty_page = NULL;
+
+ struct fault_data fdata = {
+ .vma = vma,
+ .address = address & PAGE_MASK,
+ .pgoff = pgoff,
+ .flags = flags,
+ };
+
+ pte_unmap(page_table);
+ BUG_ON(vma->vm_flags & VM_PFNMAP);
+
+ faulted_page = vma->vm_ops->fault(&fdata);
+ if (unlikely(!faulted_page))
+ return fdata.type;
+
+ /*
+ * For consistency in subsequent calls, make the faulted_page always
+ * locked. These should be in the minority but if they turn out to be
+ * critical then this can always be revisited
+ */
+ if (unlikely(!(vma->vm_flags & VM_CAN_INVALIDATE)))
+ lock_page(faulted_page);
+ else
+ BUG_ON(!PageLocked(faulted_page));
+
+ /*
+ * Should we do an early C-O-W break?
+ */
+ page = faulted_page;
+ if (flags & FAULT_FLAG_WRITE) {
+ if (!(vma->vm_flags & VM_SHARED)) {
+ anon = 1;
+ if (unlikely(anon_vma_prepare(vma))) {
+ fdata.type = VM_FAULT_OOM;
+ goto out;
+ }
+ page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ if (!page) {
+ fdata.type = VM_FAULT_OOM;
+ goto out;
+ }
+ copy_user_highpage(page, faulted_page, address);
+ }
+ }
+
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+ /*
+ * This silly early PAGE_DIRTY setting removes a race
+ * due to the bad i386 page protection. But it's valid
+ * for other architectures too.
+ *
+ * Note that if write_access is true, we either now have
+ * an exclusive copy of the page, or this is a shared mapping,
+ * so we can make it writable and dirty to avoid having to
+ * handle that later.
+ */
+ /* Only go through if we didn't race with anybody else... */
+ if (likely(pte_same(*page_table, orig_pte))) {
+ flush_icache_page(vma, page);
+ entry = mk_pte(page, vma->vm_page_prot);
+ if (flags & FAULT_FLAG_WRITE)
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ set_pte_at(mm, address, page_table, entry);
+ if (anon) {
+ inc_mm_counter(mm, anon_rss);
+ lru_cache_add_active(page);
+ page_add_new_anon_rmap(page, vma, address);
+ } else {
+ inc_mm_counter(mm, file_rss);
+ page_add_file_rmap(page);
+ if (flags & FAULT_FLAG_WRITE) {
+ dirty_page = page;
+ get_page(dirty_page);
+ }
+ }
+
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, address, entry);
+ lazy_mmu_prot_update(entry);
+ } else {
+ if (anon)
+ page_cache_release(page);
+ else
+ anon = 1; /* not anon, but release faulted_page */
+ }
+
+ pte_unmap_unlock(page_table, ptl);
+
+out:
+ unlock_page(faulted_page);
+ if (anon)
+ page_cache_release(faulted_page);
+ else if (dirty_page) {
+ set_page_dirty_balance(dirty_page);
+ put_page(dirty_page);
+ }
+
+ return fdata.type;
+}
+
+static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ int write_access, pte_t orig_pte)
+{
+ pgoff_t pgoff = (((address & PAGE_MASK)
+ - vma->vm_start) >> PAGE_CACHE_SHIFT) + vma->vm_pgoff;
+ unsigned int flags = (write_access ? FAULT_FLAG_WRITE : 0);
+
+ return __do_fault(mm, vma, address, page_table, pmd, pgoff, flags, orig_pte);
+}
+
+static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ int write_access, pgoff_t pgoff, pte_t orig_pte)
+{
+ unsigned int flags = FAULT_FLAG_NONLINEAR |
+ (write_access ? FAULT_FLAG_WRITE : 0);
+
+ return __do_fault(mm, vma, address, page_table, pmd, pgoff, flags, orig_pte);
+}
+
/*
* do_no_page() tries to create a new page mapping. It aggressively
* tries to share with existing pages, but makes a separate copy if
@@ -2327,9 +2458,14 @@ static int do_file_page(struct mm_struct
print_bad_pte(vma, orig_pte, address);
return VM_FAULT_OOM;
}
- /* We can then assume vm->vm_ops && vma->vm_ops->populate */

pgoff = pte_to_pgoff(orig_pte);
+
+ if (vma->vm_ops && vma->vm_ops->fault)
+ return do_nonlinear_fault(mm, vma, address, page_table, pmd,
+ write_access, pgoff, orig_pte);
+
+ /* We can then assume vm->vm_ops && vma->vm_ops->populate */
err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE,
vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -2364,10 +2500,12 @@ static inline int handle_pte_fault(struc
if (!pte_present(entry)) {
if (pte_none(entry)) {
if (vma->vm_ops) {
+ if (vma->vm_ops->fault)
+ return do_linear_fault(mm, vma, address,
+ pte, pmd, write_access, entry);
if (vma->vm_ops->nopage)
return do_no_page(mm, vma, address,
- pte, pmd,
- write_access);
+ pte, pmd, write_access);
if (unlikely(vma->vm_ops->nopfn))
return do_no_pfn(mm, vma, address, pte,
pmd, write_access);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1339,40 +1339,37 @@ static int fastcall page_cache_read(stru
#define MMAP_LOTSAMISS (100)

/**
- * filemap_nopage - read in file data for page fault handling
- * @area: the applicable vm_area
- * @address: target address to read in
- * @type: returned with VM_FAULT_{MINOR,MAJOR} if not %NULL
+ * filemap_fault - read in file data for page fault handling
+ * @data: the applicable fault_data
*
- * filemap_nopage() is invoked via the vma operations vector for a
+ * filemap_fault() is invoked via the vma operations vector for a
* mapped memory region to read in file data during a page fault.
*
* The goto's are kind of ugly, but this streamlines the normal case of having
* it in the page cache, and handles the special cases reasonably without
* having a lot of duplicated code.
*/
-struct page *filemap_nopage(struct vm_area_struct *area,
- unsigned long address, int *type)
+struct page *filemap_fault(struct fault_data *data)
{
int error;
- struct file *file = area->vm_file;
+ struct file *file = data->vma->vm_file;
struct address_space *mapping = file->f_mapping;
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
struct page *page;
- unsigned long size, pgoff;
- int did_readaround = 0, majmin = VM_FAULT_MINOR;
+ unsigned long size;
+ int did_readaround = 0;

- BUG_ON(!(area->vm_flags & VM_CAN_INVALIDATE));
+ data->type = VM_FAULT_MINOR;

- pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
+ BUG_ON(!(data->vma->vm_flags & VM_CAN_INVALIDATE));

size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (pgoff >= size)
+ if (data->pgoff >= size)
goto outside_data_content;

/* If we don't want any read-ahead, don't bother */
- if (VM_RandomReadHint(area))
+ if (VM_RandomReadHint(data->vma))
goto no_cached_page;

/*
@@ -1381,19 +1378,19 @@ struct page *filemap_nopage(struct vm_ar
*
* For sequential accesses, we use the generic readahead logic.
*/
- if (VM_SequentialReadHint(area))
- page_cache_readahead(mapping, ra, file, pgoff, 1);
+ if (VM_SequentialReadHint(data->vma))
+ page_cache_readahead(mapping, ra, file, data->pgoff, 1);

/*
* Do we have something in the page cache already?
*/
retry_find:
- page = find_lock_page(mapping, pgoff);
+ page = find_lock_page(mapping, data->pgoff);
if (!page) {
unsigned long ra_pages;

- if (VM_SequentialReadHint(area)) {
- handle_ra_miss(mapping, ra, pgoff);
+ if (VM_SequentialReadHint(data->vma)) {
+ handle_ra_miss(mapping, ra, data->pgoff);
goto no_cached_page;
}
ra->mmap_miss++;
@@ -1410,7 +1407,7 @@ retry_find:
* check did_readaround, as this is an inner loop.
*/
if (!did_readaround) {
- majmin = VM_FAULT_MAJOR;
+ data->type = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
}
did_readaround = 1;
@@ -1418,11 +1415,11 @@ retry_find:
if (ra_pages) {
pgoff_t start = 0;

- if (pgoff > ra_pages / 2)
- start = pgoff - ra_pages / 2;
+ if (data->pgoff > ra_pages / 2)
+ start = data->pgoff - ra_pages / 2;
do_page_cache_readahead(mapping, file, start, ra_pages);
}
- page = find_lock_page(mapping, pgoff);
+ page = find_lock_page(mapping, data->pgoff);
if (!page)
goto no_cached_page;
}
@@ -1439,7 +1436,7 @@ retry_find:

/* Must recheck i_size under page lock */
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (unlikely(pgoff >= size)) {
+ if (unlikely(data->pgoff >= size)) {
unlock_page(page);
goto outside_data_content;
}
@@ -1448,8 +1445,6 @@ retry_find:
* Found the page and have a reference on it.
*/
mark_page_accessed(page);
- if (type)
- *type = majmin;
return page;

outside_data_content:
@@ -1457,15 +1452,17 @@ outside_data_content:
* An external ptracer can access pages that normally aren't
* accessible..
*/
- if (area->vm_mm == current->mm)
- return NOPAGE_SIGBUS;
+ if (data->vma->vm_mm == current->mm) {
+ data->type = VM_FAULT_SIGBUS;
+ return NULL;
+ }
/* Fall through to the non-read-ahead case */
no_cached_page:
/*
* We're only likely to ever get here if MADV_RANDOM is in
* effect.
*/
- error = page_cache_read(file, pgoff);
+ error = page_cache_read(file, data->pgoff);
grab_swap_token();

/*
@@ -1482,13 +1479,15 @@ no_cached_page:
* to schedule I/O.
*/
if (error == -ENOMEM)
- return NOPAGE_OOM;
- return NOPAGE_SIGBUS;
+ data->type = VM_FAULT_OOM;
+ else
+ data->type = VM_FAULT_SIGBUS;
+ return NULL;

page_not_uptodate:
/* IO error path */
if (!did_readaround) {
- majmin = VM_FAULT_MAJOR;
+ data->type = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
}

@@ -1507,186 +1506,13 @@ page_not_uptodate:

/* Things didn't work out. Return zero to tell the mm layer so. */
shrink_readahead_size_eio(file, ra);
- return NOPAGE_SIGBUS;
-}
-EXPORT_SYMBOL(filemap_nopage);
-
-static struct page * filemap_getpage(struct file *file, unsigned long pgoff,
- int nonblock)
-{
- struct address_space *mapping = file->f_mapping;
- struct page *page;
- int error;
-
- /*
- * Do we have something in the page cache already?
- */
-retry_find:
- page = find_get_page(mapping, pgoff);
- if (!page) {
- if (nonblock)
- return NULL;
- goto no_cached_page;
- }
-
- /*
- * Ok, found a page in the page cache, now we need to check
- * that it's up-to-date.
- */
- if (!PageUptodate(page)) {
- if (nonblock) {
- page_cache_release(page);
- return NULL;
- }
- goto page_not_uptodate;
- }
-
-success:
- /*
- * Found the page and have a reference on it.
- */
- mark_page_accessed(page);
- return page;
-
-no_cached_page:
- error = page_cache_read(file, pgoff);
-
- /*
- * The page we want has now been added to the page cache.
- * In the unlikely event that someone removed it in the
- * meantime, we'll just come back here and read it again.
- */
- if (error >= 0)
- goto retry_find;
-
- /*
- * An error return from page_cache_read can result if the
- * system is low on memory, or a problem occurs while trying
- * to schedule I/O.
- */
- return NULL;
-
-page_not_uptodate:
- lock_page(page);
-
- /* Did it get truncated while we waited for it? */
- if (!page->mapping) {
- unlock_page(page);
- goto err;
- }
-
- /* Did somebody else get it up-to-date? */
- if (PageUptodate(page)) {
- unlock_page(page);
- goto success;
- }
-
- error = mapping->a_ops->readpage(file, page);
- if (!error) {
- wait_on_page_locked(page);
- if (PageUptodate(page))
- goto success;
- } else if (error == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- goto retry_find;
- }
-
- /*
- * Umm, take care of errors if the page isn't up-to-date.
- * Try to re-read it _once_. We do this synchronously,
- * because there really aren't any performance issues here
- * and we need to check for errors.
- */
- lock_page(page);
-
- /* Somebody truncated the page on us? */
- if (!page->mapping) {
- unlock_page(page);
- goto err;
- }
- /* Somebody else successfully read it in? */
- if (PageUptodate(page)) {
- unlock_page(page);
- goto success;
- }
-
- ClearPageError(page);
- error = mapping->a_ops->readpage(file, page);
- if (!error) {
- wait_on_page_locked(page);
- if (PageUptodate(page))
- goto success;
- } else if (error == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- goto retry_find;
- }
-
- /*
- * Things didn't work out. Return zero to tell the
- * mm layer so, possibly freeing the page cache page first.
- */
-err:
- page_cache_release(page);
-
+ data->type = VM_FAULT_SIGBUS;
return NULL;
}
-
-int filemap_populate(struct vm_area_struct *vma, unsigned long addr,
- unsigned long len, pgprot_t prot, unsigned long pgoff,
- int nonblock)
-{
- struct file *file = vma->vm_file;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- unsigned long size;
- struct mm_struct *mm = vma->vm_mm;
- struct page *page;
- int err;
-
- if (!nonblock)
- force_page_cache_readahead(mapping, vma->vm_file,
- pgoff, len >> PAGE_CACHE_SHIFT);
-
-repeat:
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (pgoff + (len >> PAGE_CACHE_SHIFT) > size)
- return -EINVAL;
-
- page = filemap_getpage(file, pgoff, nonblock);
-
- /* XXX: This is wrong, a filesystem I/O error may have happened. Fix that as
- * done in shmem_populate calling shmem_getpage */
- if (!page && !nonblock)
- return -ENOMEM;
-
- if (page) {
- err = install_page(mm, vma, addr, page, prot);
- if (err) {
- page_cache_release(page);
- return err;
- }
- } else if (vma->vm_flags & VM_NONLINEAR) {
- /* No page was found just because we can't read it in now (being
- * here implies nonblock != 0), but the page may exist, so set
- * the PTE to fault it in later. */
- err = install_file_pte(mm, vma, addr, pgoff, prot);
- if (err)
- return err;
- }
-
- len -= PAGE_SIZE;
- addr += PAGE_SIZE;
- pgoff++;
- if (len)
- goto repeat;
-
- return 0;
-}
-EXPORT_SYMBOL(filemap_populate);
+EXPORT_SYMBOL(filemap_fault);

struct vm_operations_struct generic_file_vm_ops = {
- .nopage = filemap_nopage,
- .populate = filemap_populate,
+ .fault = filemap_fault,
};

/* This is used for a general mmap of a disk file */
@@ -1699,7 +1525,7 @@ int generic_file_mmap(struct file * file
return -ENOEXEC;
file_accessed(file);
vma->vm_ops = &generic_file_vm_ops;
- vma->vm_flags |= VM_CAN_INVALIDATE;
+ vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
return 0;
}

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -115,6 +115,7 @@ int install_file_pte(struct mm_struct *m

set_pte_at(mm, addr, pte, pgoff_to_pte(pgoff));
pte_val = *pte;
+
/*
* We don't need to run update_mmu_cache() here because the "file pte"
* being installed by install_file_pte() is not a real pte - it's a
@@ -128,6 +129,25 @@ out:
return err;
}

+static int populate_range(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long size, pgoff_t pgoff)
+{
+ int err;
+
+ do {
+ err = install_file_pte(mm, vma, addr, pgoff, vma->vm_page_prot);
+ if (err)
+ return err;
+
+ size -= PAGE_SIZE;
+ addr += PAGE_SIZE;
+ pgoff++;
+ } while (size);
+
+ return 0;
+
+}
+
/***
* sys_remap_file_pages - remap arbitrary pages of a shared backing store
* file within an existing vma.
@@ -185,41 +205,56 @@ asmlinkage long sys_remap_file_pages(uns
* the single existing vma. vm_private_data is used as a
* swapout cursor in a VM_NONLINEAR vma.
*/
- if (vma && (vma->vm_flags & VM_SHARED) &&
- (!vma->vm_private_data || (vma->vm_flags & VM_NONLINEAR)) &&
- vma->vm_ops && vma->vm_ops->populate &&
- end > start && start >= vma->vm_start &&
- end <= vma->vm_end) {
-
- /* Must set VM_NONLINEAR before any pages are populated. */
- if (pgoff != linear_page_index(vma, start) &&
- !(vma->vm_flags & VM_NONLINEAR)) {
- if (!has_write_lock) {
- up_read(&mm->mmap_sem);
- down_write(&mm->mmap_sem);
- has_write_lock = 1;
- goto retry;
- }
- mapping = vma->vm_file->f_mapping;
- spin_lock(&mapping->i_mmap_lock);
- flush_dcache_mmap_lock(mapping);
- vma->vm_flags |= VM_NONLINEAR;
- vma_prio_tree_remove(vma, &mapping->i_mmap);
- vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
- flush_dcache_mmap_unlock(mapping);
- spin_unlock(&mapping->i_mmap_lock);
+ if (!vma || !(vma->vm_flags & VM_SHARED))
+ goto out;
+
+ if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR))
+ goto out;
+
+ if ((!vma->vm_ops || !vma->vm_ops->populate) &&
+ !(vma->vm_flags & VM_CAN_NONLINEAR))
+ goto out;
+
+ if (end <= start || start < vma->vm_start || end > vma->vm_end)
+ goto out;
+
+ /* Must set VM_NONLINEAR before any pages are populated. */
+ if (!(vma->vm_flags & VM_NONLINEAR)) {
+ /* Don't need a nonlinear mapping, exit success */
+ if (pgoff == linear_page_index(vma, start)) {
+ err = 0;
+ goto out;
}

- err = vma->vm_ops->populate(vma, start, size,
- vma->vm_page_prot,
- pgoff, flags & MAP_NONBLOCK);
-
- /*
- * We can't clear VM_NONLINEAR because we'd have to do
- * it after ->populate completes, and that would prevent
- * downgrading the lock. (Locks can't be upgraded).
- */
+ if (!has_write_lock) {
+ up_read(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ has_write_lock = 1;
+ goto retry;
+ }
+ mapping = vma->vm_file->f_mapping;
+ spin_lock(&mapping->i_mmap_lock);
+ flush_dcache_mmap_lock(mapping);
+ vma->vm_flags |= VM_NONLINEAR;
+ vma_prio_tree_remove(vma, &mapping->i_mmap);
+ vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
+ flush_dcache_mmap_unlock(mapping);
+ spin_unlock(&mapping->i_mmap_lock);
}
+
+ if (vma->vm_flags & VM_CAN_NONLINEAR)
+ err = populate_range(mm, vma, start, size, pgoff);
+ else
+ err = vma->vm_ops->populate(vma, start, size, vma->vm_page_prot,
+ pgoff, flags & MAP_NONBLOCK);
+
+ /*
+ * We can't clear VM_NONLINEAR because we'd have to do
+ * it after ->populate completes, and that would prevent
+ * downgrading the lock. (Locks can't be upgraded).
+ */
+
+out:
if (likely(!has_write_lock))
up_read(&mm->mmap_sem);
else

2006-10-07 15:15:12

by Jeff Garzik

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

Nick Piggin wrote:
> Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
> encodes the virtual address -> file offset differently from linear
> mappings.
>
> I can't see why the filesystem/pagecache code should need to know anything
> about it, except for the fact that the ->nopage handler didn't quite pass
> down enough information (ie. pgoff). But it is more logical to pass pgoff
> rather than have the ->nopage function calculate it itself anyway. And
> having the nopage handler install the pte itself is sort of nasty.
>
> This patch introduces a new fault handler that replaces ->nopage and ->populate
> and (hopefully) ->page_mkwrite. Most of the old mechanism is still in place
> so there is a lot of duplication and nice cleanups that can be removed if
> everyone switches over.
>
> The rationale for doing this in the first place is that nonlinear mappings
> are subject to the pagefault vs invalidate/truncate race too, and it seemed
> stupid to duplicate the synchronisation logic rather than just consolidate
> the two.
>
> Comments?

That's pretty nice.

Back when I was writing [the now slated for death]
sound/oss/via82xxx_audio.c driver, Linus suggested that I implement
->nopage() for accessing the mmap'able DMA'd audio buffers, rather than
using remap_pfn_range(). It worked out very nicely, because it allowed
the sound driver to retrieve $N pages for the mmap'able buffer (passed
as an s/g list to the hardware) rather than requiring a single humongous
buffer returned by pci_alloc_consistent().

And although probably not your primary motivation, your change does IMO
improve this area of the kernel.

Jeff



2006-10-07 20:43:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/3] mm: add arch_alloc_page

On Sat, 7 Oct 2006 15:06:04 +0200 (CEST)
Nick Piggin <[email protected]> wrote:

> Add an arch_alloc_page to match arch_free_page.

umm.. why?

2006-10-07 20:44:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 2/3] mm: fault vs invalidate/truncate race fix

On Sat, 7 Oct 2006 15:06:21 +0200 (CEST)
Nick Piggin <[email protected]> wrote:

> Fix the race between invalidate_inode_pages and do_no_page.

Changes have occurred in ocfs2. The fixup is pretty obvious, but this is
one which Mark will need to have a think about please.

2006-10-07 20:44:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 2/3] mm: fault vs invalidate/truncate race fix

On Sat, 7 Oct 2006 15:06:21 +0200 (CEST)
Nick Piggin <[email protected]> wrote:

> Fix the race between invalidate_inode_pages and do_no_page.

- In do_no_page() there's a `goto retry' where we appear to have
forgotten to (conditionally) unlock the page.

- In do_no_page() the COW-break code seem to have forgotten to
(conditionally) unlock the page which it just COWed?

- In do_no_page(), the unlock_page() which _is_ there doesn't test
VM_CAN_INVALIDATE before deciding to unlock the page.

I have a bad feeling that I'm not getting the point here...

<looks>

Ah, it appears you've fixed at least some of these things in the next
patch. Tricky.

2006-10-07 20:44:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Sat, 7 Oct 2006 15:06:32 +0200 (CEST)
Nick Piggin <[email protected]> wrote:

> Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
> encodes the virtual address -> file offset differently from linear
> mappings.
>
> I can't see why the filesystem/pagecache code should need to know anything
> about it, except for the fact that the ->nopage handler didn't quite pass
> down enough information (ie. pgoff). But it is more logical to pass pgoff
> rather than have the ->nopage function calculate it itself anyway. And
> having the nopage handler install the pte itself is sort of nasty.
>
> This patch introduces a new fault handler that replaces ->nopage and ->populate
> and (hopefully) ->page_mkwrite. Most of the old mechanism is still in place
> so there is a lot of duplication and nice cleanups that can be removed if
> everyone switches over.
>
> The rationale for doing this in the first place is that nonlinear mappings
> are subject to the pagefault vs invalidate/truncate race too, and it seemed
> stupid to duplicate the synchronisation logic rather than just consolidate
> the two.

- You may find that gcc generates crap code for the initialisation of the
`struct fault_data'. If so, filling the fields in by hand one-at-a-time
will improve things.

- So is the plan here to migrate all code over to using
vm_operations.fault() and to finally remove vm_operations.nopage and
.nopfn? If so, that'd be nice.

- As you know, there is a case for constructing that `struct fault_data'
all the way up in do_no_page(): so we can pass data back, asking
do_no_page() to rerun the fault if we dropped mmap_sem.

- No useful opinion on the substance of this patch, sorry. It's Saturday ;)

2006-10-08 01:39:56

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: add arch_alloc_page

Andrew Morton wrote:

>On Sat, 7 Oct 2006 15:06:04 +0200 (CEST)
>Nick Piggin <[email protected]> wrote:
>
>
>>Add an arch_alloc_page to match arch_free_page.
>>
>
>umm.. why?
>

I had a future patch to more kernel_map_pages into it, but couldn't
decide if that's a generic kernel feature that is only implemented in
2 architectures, or an architecture speicifc feature. So I left it out.

But at least Martin wanted a hook here for his volatile pages patches,
so I thought I'd submit this patch anyway.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-08 02:05:35

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 2/3] mm: fault vs invalidate/truncate race fix

Andrew Morton wrote:

>On Sat, 7 Oct 2006 15:06:21 +0200 (CEST)
>Nick Piggin <[email protected]> wrote:
>
>
>>Fix the race between invalidate_inode_pages and do_no_page.
>>
>
>- In do_no_page() there's a `goto retry' where we appear to have
> forgotten to (conditionally) unlock the page.
>

Hmm, the retry should be gone - it was only there for the
seqlock-ish truncate race code.

>- In do_no_page() the COW-break code seem to have forgotten to
> (conditionally) unlock the page which it just COWed?
>

It keeps the 'nopage_page' around and unlocks it at the end.
Last time I looked, this is required because truncate wants to
unmap 'even_cows', so we must hold the pagecache page locked
while instantiating the mapping on the cow page.

>- In do_no_page(), the unlock_page() which _is_ there doesn't test
> VM_CAN_INVALIDATE before deciding to unlock the page.
>

It does a conditional lock if !VM_CAN_INVALIDATE based on a
suggestion from Hugh. I don't disagree with that, but it can
go away in the next patch as we won't be calling into
->page_mkwrite (if that callback can be implemented with ->fault).

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-08 02:12:48

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

Andrew Morton wrote:

>- You may find that gcc generates crap code for the initialisation of the
> `struct fault_data'. If so, filling the fields in by hand one-at-a-time
> will improve things.
>

OK.

>- So is the plan here to migrate all code over to using
> vm_operations.fault() and to finally remove vm_operations.nopage and
> .nopfn? If so, that'd be nice.
>

Definitely remove .nopage, .populate, and hopefully .page_mkwrite.

.nopfn is a little harder because it doesn't quite follow the same pattern
as the others (eg. has no struct page).

>- As you know, there is a case for constructing that `struct fault_data'
> all the way up in do_no_page(): so we can pass data back, asking
> do_no_page() to rerun the fault if we dropped mmap_sem.
>

That is what it is doing - do_no_page should go away (it is basically
duplicated in __do_fault -- I left it there because I don't know if people
are happy to have a flag day or slowly migrate over).

But I have converted regular pagecache (mm/filemap.c) to use .fault rather
than .nopage and .populate, so you should be able to do the mmap_sem thing
right now. That's something maybe you could look at if you get time? Ie.
whether this .fault handler thing will be sufficient for you.

>- No useful opinion on the substance of this patch, sorry. It's Saturday ;)
>

No hurry. Thanks for the quick initial comments.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-08 02:17:14

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

Jeff Garzik wrote:

> That's pretty nice.
>
> Back when I was writing [the now slated for death]
> sound/oss/via82xxx_audio.c driver, Linus suggested that I implement
> ->nopage() for accessing the mmap'able DMA'd audio buffers, rather
> than using remap_pfn_range(). It worked out very nicely, because it
> allowed the sound driver to retrieve $N pages for the mmap'able buffer
> (passed as an s/g list to the hardware) rather than requiring a single
> humongous buffer returned by pci_alloc_consistent().
>
> And although probably not your primary motivation, your change does
> IMO improve this area of the kernel.


Thanks. Yeah hopefully this provides a little more flexibility (I think
it can
already replace 3 individual vm_ops callbacks!). And I'd like to see
what other
things it can be used for... :)

However, what we don't want is a bloating of struct fault_data IMO. So
I'd like
to try to nail down the fields that it needs quite quickly then really
keep a
lid on it.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-08 23:46:38

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate


> - So is the plan here to migrate all code over to using
> vm_operations.fault() and to finally remove vm_operations.nopage and
> .nopfn? If so, that'd be nice.

Agreed. That would also allow to pass down knowledge of wether we can be
interruptible or not (coming from userland or not). Useful in a few case
when dealing with strange hw mappings.

Now, fault() still returns a struct page and thus doesn't quite fix the
problem I'm exposing in my "User switchable HW mappings & cie" mail I
posted today in which case we need to either duplicate the truncate
logic in no_pfn() or get rid of no_pfn() and set the PTE from the fault
handler . I tend to prefer the later provided that it's strictly limited
for mappings that do not have a struct page though.

> - As you know, there is a case for constructing that `struct fault_data'
> all the way up in do_no_page(): so we can pass data back, asking
> do_no_page() to rerun the fault if we dropped mmap_sem.
>
> - No useful opinion on the substance of this patch, sorry. It's Saturday ;)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2006-10-09 10:26:39

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, Oct 09, 2006 at 09:46:13AM +1000, Benjamin Herrenschmidt wrote:
>
> > - So is the plan here to migrate all code over to using
> > vm_operations.fault() and to finally remove vm_operations.nopage and
> > .nopfn? If so, that'd be nice.
>
> Agreed. That would also allow to pass down knowledge of wether we can be
> interruptible or not (coming from userland or not). Useful in a few case
> when dealing with strange hw mappings.
>
> Now, fault() still returns a struct page and thus doesn't quite fix the
> problem I'm exposing in my "User switchable HW mappings & cie" mail I
> posted today in which case we need to either duplicate the truncate
> logic in no_pfn() or get rid of no_pfn() and set the PTE from the fault
> handler . I tend to prefer the later provided that it's strictly limited
> for mappings that do not have a struct page though.

The truncate logic can't be duplicated because it works on struct pages.

What sounds best, if you use nopfn, is to do your own internal
synchronisation against your unmap call. Obviously you can't because you
have no ->nopfn_done call with which to drop locks ;)

So, hmm yes I have a good idea for how fault() could take over ->nopfn as
well: just return NULL, set the fault type to VM_FAULT_MINOR, and have
the ->fault handler install the pte. It will require a new helper along
the lines of vm_insert_page.

I'll code that up in my next patchset.

2006-10-09 10:50:36

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate


> The truncate logic can't be duplicated because it works on struct pages.
>
> What sounds best, if you use nopfn, is to do your own internal
> synchronisation against your unmap call. Obviously you can't because you
> have no ->nopfn_done call with which to drop locks ;)
>
> So, hmm yes I have a good idea for how fault() could take over ->nopfn as
> well: just return NULL, set the fault type to VM_FAULT_MINOR, and have
> the ->fault handler install the pte. It will require a new helper along
> the lines of vm_insert_page.
>
> I'll code that up in my next patchset.

Which is exactly what I was proposing in my other mail :)

Read it, you'll understnad my point about the truncate logic... I
sometimes want to return struct page (when the mapping is pointing to
backup memory) or map it directly to hardware.

In the later case, with an appropriate helper, I can definitely do my
own locking. In the former case, I return struct page's and need the
truncate logic.

Ben.


2006-10-09 11:00:11

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, Oct 09, 2006 at 08:50:14PM +1000, Benjamin Herrenschmidt wrote:
>
> > The truncate logic can't be duplicated because it works on struct pages.
> >
> > What sounds best, if you use nopfn, is to do your own internal
> > synchronisation against your unmap call. Obviously you can't because you
> > have no ->nopfn_done call with which to drop locks ;)
> >
> > So, hmm yes I have a good idea for how fault() could take over ->nopfn as
> > well: just return NULL, set the fault type to VM_FAULT_MINOR, and have
> > the ->fault handler install the pte. It will require a new helper along
> > the lines of vm_insert_page.
> >
> > I'll code that up in my next patchset.
>
> Which is exactly what I was proposing in my other mail :)
>
> Read it, you'll understnad my point about the truncate logic... I
> sometimes want to return struct page (when the mapping is pointing to
> backup memory) or map it directly to hardware.
>
> In the later case, with an appropriate helper, I can definitely do my
> own locking. In the former case, I return struct page's and need the
> truncate logic.

Yep, I see. You just need to be careful about the PFNMAP logic, so
the VM knows whether the pte is backed by a struct page or not.
And going the pageless route means that you must disallow MAP_PRIVATE
PROT_WRITE mappings, I trust that isn't a problem for you?

I guess the helper looks something like the following...

--

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1105,6 +1105,7 @@ unsigned long vmalloc_to_pfn(void *addr)
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
+int vm_insert_pfn(struct vm_area_struct *, unsigned long addr, unsigned long pfn);

struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1267,6 +1267,44 @@ int vm_insert_page(struct vm_area_struct
}
EXPORT_SYMBOL(vm_insert_page);

+/**
+ * vm_insert_pfn - insert single pfn into user vma
+ * @vma: user vma to map to
+ * @addr: target user address of this page
+ * @pfn: source kernel pfn
+ *
+ * Similar to vm_inert_page, this allows drivers to insert individual pages
+ * they've allocated into a user vma. Same comments apply
+ */
+int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ int retval;
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ BUG_ON(is_cow_mapping(vma->vm_flags));
+
+ retval = -ENOMEM;
+ pte = get_locked_pte(mm, addr, &ptl);
+ if (!pte)
+ goto out;
+ retval = -EBUSY;
+ if (!pte_none(*pte))
+ goto out_unlock;
+
+ /* Ok, finally just insert the thing.. */
+ set_pte_at(mm, addr, pte, pfn_pte(pfn, vma->vm_page_prot));
+
+ vma->vm_flags |= VM_PFNMAP;
+ retval = 0;
+out_unlock:
+ pte_unmap_unlock(pte, ptl);
+out:
+ return retval;
+}
+EXPORT_SYMBOL(vm_insert_pfn);
+
/*
* maps a range of physical memory into the requested pages. the old
* mappings are removed. any references to nonexistent pages results

2006-10-09 11:11:00

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate


> Yep, I see. You just need to be careful about the PFNMAP logic, so
> the VM knows whether the pte is backed by a struct page or not.

I still need to properly get my head around that one. I can't easily
change the VMA during the "switch" but I can tweak the flags on the
first nopage after one...

> And going the pageless route means that you must disallow MAP_PRIVATE
> PROT_WRITE mappings, I trust that isn't a problem for you?

Should not but I need to look more closely.

> I guess the helper looks something like the following...
>
> --
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -1105,6 +1105,7 @@ unsigned long vmalloc_to_pfn(void *addr)
> int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
> unsigned long pfn, unsigned long size, pgprot_t);
> int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
> +int vm_insert_pfn(struct vm_area_struct *, unsigned long addr, unsigned long pfn);
>
> struct page *follow_page(struct vm_area_struct *, unsigned long address,
> unsigned int foll_flags);
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -1267,6 +1267,44 @@ int vm_insert_page(struct vm_area_struct
> }
> EXPORT_SYMBOL(vm_insert_page);
>
> +/**
> + * vm_insert_pfn - insert single pfn into user vma
> + * @vma: user vma to map to
> + * @addr: target user address of this page
> + * @pfn: source kernel pfn
> + *
> + * Similar to vm_inert_page, this allows drivers to insert individual pages
> + * they've allocated into a user vma. Same comments apply
> + */
> +int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + int retval;
> + pte_t *pte;
> + spinlock_t *ptl;
> +
> + BUG_ON(is_cow_mapping(vma->vm_flags));
> +
> + retval = -ENOMEM;
> + pte = get_locked_pte(mm, addr, &ptl);
> + if (!pte)
> + goto out;
> + retval = -EBUSY;
> + if (!pte_none(*pte))
> + goto out_unlock;
> +
> + /* Ok, finally just insert the thing.. */
> + set_pte_at(mm, addr, pte, pfn_pte(pfn, vma->vm_page_prot));
> +
> + vma->vm_flags |= VM_PFNMAP;
> + retval = 0;
> +out_unlock:
> + pte_unmap_unlock(pte, ptl);
> +out:
> + return retval;
> +}
> +EXPORT_SYMBOL(vm_insert_pfn);

It also needs update_mmu_cache() I suppose.

> /*
> * maps a range of physical memory into the requested pages. the old
> * mappings are removed. any references to nonexistent pages results

2006-10-09 11:19:09

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, Oct 09, 2006 at 09:10:13PM +1000, Benjamin Herrenschmidt wrote:
>
> > Yep, I see. You just need to be careful about the PFNMAP logic, so
> > the VM knows whether the pte is backed by a struct page or not.
>
> I still need to properly get my head around that one. I can't easily
> change the VMA during the "switch" but I can tweak the flags on the
> first nopage after one...

You'll want to clear VM_PFNMAP after unmapping all pages from it, before
switching to struct page backing.

>
> > And going the pageless route means that you must disallow MAP_PRIVATE
> > PROT_WRITE mappings, I trust that isn't a problem for you?
>
> Should not but I need to look more closely.

If you do need to, then if your pfns are contiguous in virtual memory,
and you can spare vm_pgoff, then you can use remap_pfn_range's method
of setting vm_pgoff to the first pfn.

I can add a bit of sanity checking for that as well.

> > + /* Ok, finally just insert the thing.. */
> > + set_pte_at(mm, addr, pte, pfn_pte(pfn, vma->vm_page_prot));
> > +
> > + vma->vm_flags |= VM_PFNMAP;
> > + retval = 0;
> > +out_unlock:
> > + pte_unmap_unlock(pte, ptl);
> > +out:
> > + return retval;
> > +}
> > +EXPORT_SYMBOL(vm_insert_pfn);
>
> It also needs update_mmu_cache() I suppose.

Hmm, but it might not be called from a pagefault. Can we get away
with not calling it? Or is it required by some architectures?

2006-10-09 11:33:26

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, 2006-10-09 at 13:19 +0200, Nick Piggin wrote:
> On Mon, Oct 09, 2006 at 09:10:13PM +1000, Benjamin Herrenschmidt wrote:
> >
> > > Yep, I see. You just need to be careful about the PFNMAP logic, so
> > > the VM knows whether the pte is backed by a struct page or not.
> >
> > I still need to properly get my head around that one. I can't easily
> > change the VMA during the "switch" but I can tweak the flags on the
> > first nopage after one...
>
> You'll want to clear VM_PFNMAP after unmapping all pages from it, before
> switching to struct page backing.

Which means having a list of all vma's ... I suppose I can look at the
truncate code to do that race free but I was hoping I could avoid it
(that's the whole point of using unmap_mapping_range() in fact).

> > > And going the pageless route means that you must disallow MAP_PRIVATE
> > > PROT_WRITE mappings, I trust that isn't a problem for you?
> >
> > Should not but I need to look more closely.
>
> If you do need to, then if your pfns are contiguous in virtual memory,
> and you can spare vm_pgoff, then you can use remap_pfn_range's method
> of setting vm_pgoff to the first pfn.

Yup. I got that bit.

> I can add a bit of sanity checking for that as well.
>
> > > + /* Ok, finally just insert the thing.. */
> > > + set_pte_at(mm, addr, pte, pfn_pte(pfn, vma->vm_page_prot));
> > > +
> > > + vma->vm_flags |= VM_PFNMAP;
> > > + retval = 0;
> > > +out_unlock:
> > > + pte_unmap_unlock(pte, ptl);
> > > +out:
> > > + return retval;
> > > +}
> > > +EXPORT_SYMBOL(vm_insert_pfn);
> >
> > It also needs update_mmu_cache() I suppose.
>
> Hmm, but it might not be called from a pagefault. Can we get away
> with not calling it? Or is it required by some architectures?

I think some architectures might be upset if it's not called...

Ben.


2006-10-09 11:45:32

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, Oct 09, 2006 at 09:32:59PM +1000, Benjamin Herrenschmidt wrote:
> >
> > You'll want to clear VM_PFNMAP after unmapping all pages from it, before
> > switching to struct page backing.
>
> Which means having a list of all vma's ... I suppose I can look at the
> truncate code to do that race free but I was hoping I could avoid it
> (that's the whole point of using unmap_mapping_range() in fact).

Yeah I don't think there is any other way to do it.

> > > It also needs update_mmu_cache() I suppose.
> >
> > Hmm, but it might not be called from a pagefault. Can we get away
> > with not calling it? Or is it required by some architectures?
>
> I think some architectures might be upset if it's not called...

But would any get upset if it is called from !pagefault path?

2006-10-09 11:49:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, 2006-10-09 at 13:45 +0200, Nick Piggin wrote:
> On Mon, Oct 09, 2006 at 09:32:59PM +1000, Benjamin Herrenschmidt wrote:
> > >
> > > You'll want to clear VM_PFNMAP after unmapping all pages from it, before
> > > switching to struct page backing.
> >
> > Which means having a list of all vma's ... I suppose I can look at the
> > truncate code to do that race free but I was hoping I could avoid it
> > (that's the whole point of using unmap_mapping_range() in fact).
>
> Yeah I don't think there is any other way to do it.

What is the problem if I always keep VM_PFNMAP set ?

> > > > It also needs update_mmu_cache() I suppose.
> > >
> > > Hmm, but it might not be called from a pagefault. Can we get away
> > > with not calling it? Or is it required by some architectures?
> >
> > I think some architectures might be upset if it's not called...
>
> But would any get upset if it is called from !pagefault path?

Probably not... the PTE has been filled so it should be safe, but then,
I don't know the details of what non-ppc archs do with that callback.

Ben.


2006-10-09 11:58:38

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, Oct 09, 2006 at 09:49:31PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2006-10-09 at 13:45 +0200, Nick Piggin wrote:
> > On Mon, Oct 09, 2006 at 09:32:59PM +1000, Benjamin Herrenschmidt wrote:
> > > >
> > > > You'll want to clear VM_PFNMAP after unmapping all pages from it, before
> > > > switching to struct page backing.
> > >
> > > Which means having a list of all vma's ... I suppose I can look at the
> > > truncate code to do that race free but I was hoping I could avoid it
> > > (that's the whole point of using unmap_mapping_range() in fact).
> >
> > Yeah I don't think there is any other way to do it.
>
> What is the problem if I always keep VM_PFNMAP set ?

The VM won't see that you have struct pages backing the ptes, and won't
do the right refcounting or rmap stuff... But for file backed mappings,
all the critical rmap stuff should be set up at mmap time, so you might
have another option to simply always do the nopfn thing, as far as the
VM is concerned (ie. even when you do have a struct page)

> > > > > It also needs update_mmu_cache() I suppose.
> > > >
> > > > Hmm, but it might not be called from a pagefault. Can we get away
> > > > with not calling it? Or is it required by some architectures?
> > >
> > > I think some architectures might be upset if it's not called...
> >
> > But would any get upset if it is called from !pagefault path?
>
> Probably not... the PTE has been filled so it should be safe, but then,
> I don't know the details of what non-ppc archs do with that callback.

2006-10-09 12:08:15

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, 2006-10-09 at 13:58 +0200, Nick Piggin wrote:
> On Mon, Oct 09, 2006 at 09:49:31PM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2006-10-09 at 13:45 +0200, Nick Piggin wrote:
> > > On Mon, Oct 09, 2006 at 09:32:59PM +1000, Benjamin Herrenschmidt wrote:
> > > > >
> > > > > You'll want to clear VM_PFNMAP after unmapping all pages from it, before
> > > > > switching to struct page backing.
> > > >
> > > > Which means having a list of all vma's ... I suppose I can look at the
> > > > truncate code to do that race free but I was hoping I could avoid it
> > > > (that's the whole point of using unmap_mapping_range() in fact).
> > >
> > > Yeah I don't think there is any other way to do it.
> >
> > What is the problem if I always keep VM_PFNMAP set ?
>
> The VM won't see that you have struct pages backing the ptes, and won't
> do the right refcounting or rmap stuff... But for file backed mappings,
> all the critical rmap stuff should be set up at mmap time, so you might
> have another option to simply always do the nopfn thing, as far as the
> VM is concerned (ie. even when you do have a struct page)

Any reason why it wouldn't work to flip that bit on the first no_page()
after a migration ? A migration always involves destroying all PTEs and
is done with a per-object mutex held that no_page() takes too, so we can
be pretty sure that the first nopage can set that bit before any PTE is
actually inserted in the mapping after all the previous ones have been
invalidated... That would avoid having to walk the vma's.

Ben.


2006-10-09 12:09:15

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, Oct 09, 2006 at 01:58:36PM +0200, Nick Piggin wrote:
> > > > > > It also needs update_mmu_cache() I suppose.
> > > > >
> > > > > Hmm, but it might not be called from a pagefault. Can we get away
> > > > > with not calling it? Or is it required by some architectures?
> > > >
> > > > I think some architectures might be upset if it's not called...
> > >
> > > But would any get upset if it is called from !pagefault path?
> >
> > Probably not... the PTE has been filled so it should be safe, but then,
> > I don't know the details of what non-ppc archs do with that callback.

I guess we can make the rule that it only be called from the ->fault
handler anyway. If they want that functionality from ->mmap, then they
can use remap_pfn_range, because it isn't so performance critical.

2006-10-09 12:11:41

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, 2006-10-09 at 14:09 +0200, Nick Piggin wrote:
> On Mon, Oct 09, 2006 at 01:58:36PM +0200, Nick Piggin wrote:
> > > > > > > It also needs update_mmu_cache() I suppose.
> > > > > >
> > > > > > Hmm, but it might not be called from a pagefault. Can we get away
> > > > > > with not calling it? Or is it required by some architectures?
> > > > >
> > > > > I think some architectures might be upset if it's not called...
> > > >
> > > > But would any get upset if it is called from !pagefault path?
> > >
> > > Probably not... the PTE has been filled so it should be safe, but then,
> > > I don't know the details of what non-ppc archs do with that callback.
>
> I guess we can make the rule that it only be called from the ->fault
> handler anyway. If they want that functionality from ->mmap, then they
> can use remap_pfn_range, because it isn't so performance critical.

Yup. The whole point of the discussion is for cases where we can't just
setup a static mapping at mmap time.

Ben.


2006-10-09 12:14:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, Oct 09, 2006 at 10:07:50PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2006-10-09 at 13:58 +0200, Nick Piggin wrote:
> >
> > The VM won't see that you have struct pages backing the ptes, and won't
> > do the right refcounting or rmap stuff... But for file backed mappings,
> > all the critical rmap stuff should be set up at mmap time, so you might
> > have another option to simply always do the nopfn thing, as far as the
> > VM is concerned (ie. even when you do have a struct page)
>
> Any reason why it wouldn't work to flip that bit on the first no_page()
> after a migration ? A migration always involves destroying all PTEs and
> is done with a per-object mutex held that no_page() takes too, so we can
> be pretty sure that the first nopage can set that bit before any PTE is
> actually inserted in the mapping after all the previous ones have been
> invalidated... That would avoid having to walk the vma's.

Ok I guess that would work. I was kind of thinking that one needs to
hold the mmap_sem for writing when changing the flags, but so long
as everyone *else* does, then I guess you can get exclusion from just
the read lock. And your per-object mutex would prevent concurrent
nopages from modifying it.

2006-10-09 13:38:43

by Thomas Hellström

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

Nick Piggin wrote:
> On Mon, Oct 09, 2006 at 10:07:50PM +1000, Benjamin Herrenschmidt wrote:
>
>>On Mon, 2006-10-09 at 13:58 +0200, Nick Piggin wrote:
>>
>>>The VM won't see that you have struct pages backing the ptes, and won't
>>>do the right refcounting or rmap stuff... But for file backed mappings,
>>>all the critical rmap stuff should be set up at mmap time, so you might
>>>have another option to simply always do the nopfn thing, as far as the
>>>VM is concerned (ie. even when you do have a struct page)
>>
>>Any reason why it wouldn't work to flip that bit on the first no_page()
>>after a migration ? A migration always involves destroying all PTEs and
>>is done with a per-object mutex held that no_page() takes too, so we can
>>be pretty sure that the first nopage can set that bit before any PTE is
>>actually inserted in the mapping after all the previous ones have been
>>invalidated... That would avoid having to walk the vma's.
>
>
> Ok I guess that would work. I was kind of thinking that one needs to
> hold the mmap_sem for writing when changing the flags, but so long
> as everyone *else* does, then I guess you can get exclusion from just
> the read lock. And your per-object mutex would prevent concurrent
> nopages from modifying it.

Wouldn't that confuse concurrent readers?

Could it be an option to make it safe for the fault handler to
temporarily drop the mmap_sem read lock given that some conditions TBD
are met?
In that case it can retake the mmap_sem write lock, do the VMA flags
modifications, downgrade and do the pte modifications using a helper, or
even use remap_pfn_range() during the time the write lock is held?

/Thomas






2006-10-09 13:53:01

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, Oct 09, 2006 at 03:38:10PM +0200, Thomas Hellstrom wrote:
> Nick Piggin wrote:
> >On Mon, Oct 09, 2006 at 10:07:50PM +1000, Benjamin Herrenschmidt wrote:
> >
> >Ok I guess that would work. I was kind of thinking that one needs to
> >hold the mmap_sem for writing when changing the flags, but so long
> >as everyone *else* does, then I guess you can get exclusion from just
> >the read lock. And your per-object mutex would prevent concurrent
> >nopages from modifying it.
>
> Wouldn't that confuse concurrent readers?

I think it should be safe so long as the entire mapping has been
unmapped. After that, there is no read path that should care about
that flag bit. So long as it is well commented (and maybe done via
a helper in mm/memory.c), I can't yet see a problem with it.

> Could it be an option to make it safe for the fault handler to
> temporarily drop the mmap_sem read lock given that some conditions TBD
> are met?
> In that case it can retake the mmap_sem write lock, do the VMA flags
> modifications, downgrade and do the pte modifications using a helper, or
> even use remap_pfn_range() during the time the write lock is held?

When you drop and retake the mmap_sem, you need to start again from
find_vma. At which point you technically probably want to start again
from the architecture specfic fault code. It sounds difficult but I
won't say it can't be done.

2006-10-09 20:46:00

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate


> Wouldn't that confuse concurrent readers?
>
> Could it be an option to make it safe for the fault handler to
> temporarily drop the mmap_sem read lock given that some conditions TBD
> are met?
> In that case it can retake the mmap_sem write lock, do the VMA flags
> modifications, downgrade and do the pte modifications using a helper, or
> even use remap_pfn_range() during the time the write lock is held?

If we return NOPAGE_REFAULT, then yes, we can drop the mmap sem, though
I 'm not sure we need that...

Ben.


2006-10-09 20:50:55

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Mon, 2006-10-09 at 15:52 +0200, Nick Piggin wrote:
> On Mon, Oct 09, 2006 at 03:38:10PM +0200, Thomas Hellstrom wrote:
> > Nick Piggin wrote:
> > >On Mon, Oct 09, 2006 at 10:07:50PM +1000, Benjamin Herrenschmidt wrote:
> > >
> > >Ok I guess that would work. I was kind of thinking that one needs to
> > >hold the mmap_sem for writing when changing the flags, but so long
> > >as everyone *else* does, then I guess you can get exclusion from just
> > >the read lock. And your per-object mutex would prevent concurrent
> > >nopages from modifying it.
> >
> > Wouldn't that confuse concurrent readers?
>
> I think it should be safe so long as the entire mapping has been
> unmapped. After that, there is no read path that should care about
> that flag bit. So long as it is well commented (and maybe done via
> a helper in mm/memory.c), I can't yet see a problem with it.

Should be fine then. Migration does

- take object mutex
- unmap_mapping_range() -> remove all PTEs for all mappings to
that object
- do whatever is needed for actual migration (copy data etc...)
- release object mutex

And nopage() does

- take object mutex
- check object flags consistency, possibly update VMA
(also possibly updaet VMA pgprot too while at it for cacheable
vs. non cacheable, though it's not strictly necessary if we
use the helper)
- if object is in ram
- get struct page
- drop mutex
- return struct page
- else
- get pfn
- use helper to install PTE
- drop mutex
- return NOPAGE_REFAULT

We don't strictly have to return struct page when the object is in ram
but I feel like it's better for accounting.

Now there is still the question of where that RAM comes from, how it
gets accounted, and wether there is any way we can make it swappable
(which complicates things but would be nice as objects can be fairly big
and we may end up using a significant amount of physical memory with the
graphic objects).

> > Could it be an option to make it safe for the fault handler to
> > temporarily drop the mmap_sem read lock given that some conditions TBD
> > are met?
> > In that case it can retake the mmap_sem write lock, do the VMA flags
> > modifications, downgrade and do the pte modifications using a helper, or
> > even use remap_pfn_range() during the time the write lock is held?
>
> When you drop and retake the mmap_sem, you need to start again from
> find_vma. At which point you technically probably want to start again
> from the architecture specfic fault code. It sounds difficult but I
> won't say it can't be done.

I can be done with returning NOPAGE_REFAULT but as you said, I don't
think it's necessary.

Cheers,
Ben.


2006-10-10 06:11:27

by Thomas Hellström

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

Benjamin Herrenschmidt wrote:

>
>>>Could it be an option to make it safe for the fault handler to
>>>temporarily drop the mmap_sem read lock given that some conditions TBD
>>>are met?
>>>In that case it can retake the mmap_sem write lock, do the VMA flags
>>>modifications, downgrade and do the pte modifications using a helper, or
>>>even use remap_pfn_range() during the time the write lock is held?
>>>
>>>
>>When you drop and retake the mmap_sem, you need to start again from
>>find_vma. At which point you technically probably want to start again
>>from the architecture specfic fault code. It sounds difficult but I
>>won't say it can't be done.
>>
>>
>
>I can be done with returning NOPAGE_REFAULT but as you said, I don't
>think it's necessary.
>
>
Still, even with NOPAGE_REFAULT or the equivalent with the new fault() code,
in the case we need to take this route, (and it looks like we won't have
to),
I guess we still need to restart from find_vma() in the fault()/nopage()
handler to make sure the VMA is still present. The object mutex need to
be dropped as well to avoid deadlocks. Sounds complicated.

>Cheers,
>Ben.
>
>
>
>
/Thomas

2006-10-10 08:26:14

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate


> Still, even with NOPAGE_REFAULT or the equivalent with the new fault() code,
> in the case we need to take this route, (and it looks like we won't have
> to),
> I guess we still need to restart from find_vma() in the fault()/nopage()
> handler to make sure the VMA is still present. The object mutex need to
> be dropped as well to avoid deadlocks. Sounds complicated.

But as we said, it should be enough to do the flag change with the
object mutex held as long as it's after unmap_mapped_ranges()

Ben.


2006-10-10 08:40:08

by Thomas Hellström

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

Benjamin Herrenschmidt wrote:
>>Still, even with NOPAGE_REFAULT or the equivalent with the new fault() code,
>>in the case we need to take this route, (and it looks like we won't have
>>to),
>>I guess we still need to restart from find_vma() in the fault()/nopage()
>>handler to make sure the VMA is still present. The object mutex need to
>>be dropped as well to avoid deadlocks. Sounds complicated.
>
>
> But as we said, it should be enough to do the flag change with the
> object mutex held as long as it's after unmap_mapped_ranges()
>
> Ben.
>
>
Agreed.
/Thomas



2006-10-10 12:10:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Sat, Oct 07, 2006 at 03:06:32PM +0200, Nick Piggin wrote:
> +/*
> + * fault_data is filled in the the pagefault handler and passed to the
> + * vma's ->fault function. That function is responsible for filling in
> + * 'type', which is the type of fault if a page is returned, or the type
> + * of error if NULL is returned.
> + */
> +struct fault_data {
> + struct vm_area_struct *vma;
> + unsigned long address;
> + pgoff_t pgoff;
> + unsigned int flags;
> +
> + int type;
> +};
>
> /*
> * These are the virtual MM functions - opening of an area, closing and
> @@ -203,6 +221,7 @@ extern pgprot_t protection_map[16];
> struct vm_operations_struct {
> void (*open)(struct vm_area_struct * area);
> void (*close)(struct vm_area_struct * area);
> + struct page * (*fault)(struct fault_data * data);

Please pass the vma as an explicit first argument so that all vm_operations
operate on a vma. It's also much cleaner to have the separate between the
the object operated on (the vma) and all the fault details (struct fault_data).

2006-10-10 12:13:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Tue, Oct 10, 2006 at 01:10:03PM +0100, Christoph Hellwig wrote:
> On Sat, Oct 07, 2006 at 03:06:32PM +0200, Nick Piggin wrote:
> > +/*
> > + * fault_data is filled in the the pagefault handler and passed to the
> > + * vma's ->fault function. That function is responsible for filling in
> > + * 'type', which is the type of fault if a page is returned, or the type
> > + * of error if NULL is returned.
> > + */
> > +struct fault_data {
> > + struct vm_area_struct *vma;
> > + unsigned long address;
> > + pgoff_t pgoff;
> > + unsigned int flags;
> > +
> > + int type;
> > +};
> >
> > /*
> > * These are the virtual MM functions - opening of an area, closing and
> > @@ -203,6 +221,7 @@ extern pgprot_t protection_map[16];
> > struct vm_operations_struct {
> > void (*open)(struct vm_area_struct * area);
> > void (*close)(struct vm_area_struct * area);
> > + struct page * (*fault)(struct fault_data * data);
>
> Please pass the vma as an explicit first argument so that all vm_operations
> operate on a vma. It's also much cleaner to have the separate between the
> the object operated on (the vma) and all the fault details (struct fault_data).

Hmm... I agree it is more consistent, but OTOH if we're passing a
structure I thought it may as well just go in there. But I will
change unless anyone comes up with an objection.

2006-10-10 17:53:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Tue, 10 Oct 2006 14:13:27 +0200
Nick Piggin <[email protected]> wrote:

> On Tue, Oct 10, 2006 at 01:10:03PM +0100, Christoph Hellwig wrote:
> > On Sat, Oct 07, 2006 at 03:06:32PM +0200, Nick Piggin wrote:
> > > +/*
> > > + * fault_data is filled in the the pagefault handler and passed to the
> > > + * vma's ->fault function. That function is responsible for filling in
> > > + * 'type', which is the type of fault if a page is returned, or the type
> > > + * of error if NULL is returned.
> > > + */
> > > +struct fault_data {
> > > + struct vm_area_struct *vma;
> > > + unsigned long address;
> > > + pgoff_t pgoff;
> > > + unsigned int flags;
> > > +
> > > + int type;
> > > +};
> > >
> > > /*
> > > * These are the virtual MM functions - opening of an area, closing and
> > > @@ -203,6 +221,7 @@ extern pgprot_t protection_map[16];
> > > struct vm_operations_struct {
> > > void (*open)(struct vm_area_struct * area);
> > > void (*close)(struct vm_area_struct * area);
> > > + struct page * (*fault)(struct fault_data * data);
> >
> > Please pass the vma as an explicit first argument so that all vm_operations
> > operate on a vma. It's also much cleaner to have the separate between the
> > the object operated on (the vma) and all the fault details (struct fault_data).
>
> Hmm... I agree it is more consistent, but OTOH if we're passing a
> structure I thought it may as well just go in there. But I will
> change unless anyone comes up with an objection.

I'd agree that it's more attractive to have the vma* in the argument list,
but it presumably adds runtime cost: cycles and stack depth. I don't how
much though.

2006-10-11 00:43:48

by Nick Piggin

[permalink] [raw]
Subject: Re: SPAM: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Tue, Oct 10, 2006 at 10:52:36AM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 14:13:27 +0200
> Nick Piggin <[email protected]> wrote:
> >
> > Hmm... I agree it is more consistent, but OTOH if we're passing a
> > structure I thought it may as well just go in there. But I will
> > change unless anyone comes up with an objection.
>
> I'd agree that it's more attractive to have the vma* in the argument list,
> but it presumably adds runtime cost: cycles and stack depth. I don't how
> much though.

Possibly, though I considered it might end up in a register, and
considering that the vma is used both before and after the call, and
in filemap_nopage, it's quite possible that it saves a load and does
not harm stack depth. Maybe?

I don't know, I guess we can tweak it while it is in -mm?

2006-10-11 14:48:25

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [patch 3/3] mm: add arch_alloc_page

On Sun, 2006-10-08 at 11:39 +1000, Nick Piggin wrote:
> >On Sat, 7 Oct 2006 15:06:04 +0200 (CEST)
> >Nick Piggin <[email protected]> wrote:
> >
> >
> >>Add an arch_alloc_page to match arch_free_page.
> >>
> >
> >umm.. why?
> >
>
> I had a future patch to more kernel_map_pages into it, but couldn't
> decide if that's a generic kernel feature that is only implemented in
> 2 architectures, or an architecture speicifc feature. So I left it out.
>
> But at least Martin wanted a hook here for his volatile pages patches,
> so I thought I'd submit this patch anyway.

With Nicks patch I can use arch_alloc_page instead of page_set_stable,
but I can still not use arch_free_page instead of page_set_unused
because it is done before the check for reserved pages. If reserved
pages go away or the arch_free_page call would get moved after the check
I could replace page_set_unused as well. So with Nicks patch we are only
halfway there..

--
blue skies,
Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.


2006-10-11 14:56:50

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: add arch_alloc_page

On Wed, Oct 11, 2006 at 04:48:24PM +0200, Martin Schwidefsky wrote:
> On Sun, 2006-10-08 at 11:39 +1000, Nick Piggin wrote:
> > >On Sat, 7 Oct 2006 15:06:04 +0200 (CEST)
> > >Nick Piggin <[email protected]> wrote:
> > >
> > >
> > >>Add an arch_alloc_page to match arch_free_page.
> > >>
> > >
> > >umm.. why?
> > >
> >
> > I had a future patch to more kernel_map_pages into it, but couldn't
> > decide if that's a generic kernel feature that is only implemented in
> > 2 architectures, or an architecture speicifc feature. So I left it out.
> >
> > But at least Martin wanted a hook here for his volatile pages patches,
> > so I thought I'd submit this patch anyway.
>
> With Nicks patch I can use arch_alloc_page instead of page_set_stable,
> but I can still not use arch_free_page instead of page_set_unused
> because it is done before the check for reserved pages. If reserved
> pages go away or the arch_free_page call would get moved after the check
> I could replace page_set_unused as well. So with Nicks patch we are only
> halfway there..

Ahh, but with my patchSET I think we are all the way there ;)

Nick

2006-10-11 15:07:18

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [patch 3/3] mm: add arch_alloc_page

On Wed, 2006-10-11 at 16:56 +0200, Nick Piggin wrote:
> > With Nicks patch I can use arch_alloc_page instead of page_set_stable,
> > but I can still not use arch_free_page instead of page_set_unused
> > because it is done before the check for reserved pages. If reserved
> > pages go away or the arch_free_page call would get moved after the check
> > I could replace page_set_unused as well. So with Nicks patch we are only
> > halfway there..
>
> Ahh, but with my patchSET I think we are all the way there ;)

Oh, good. Then I only have to add two more state change functions,
namely page_make_stable and page_make_volatile.

--
blue skies,
Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.


2006-10-12 12:07:37

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On Thu, Oct 12, 2006 at 01:56:38PM +0200, Carsten Otte wrote:
> As for the filemap_xip changes, looks nice as far as I can tell. I will test
> that change for xip.

Actually, filemap_xip needs some attention I think... if xip files
can be truncated or invalidated (I assume they can), then we need to
lock the page, validate that it is the correct one and not truncated,
and return with it locked.

That should be as simple as just locking the page and rechecking i_size,
but maybe the zero page can be handled better... I don't know?


2006-10-14 13:29:04

by Ingo Oeser

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

Hi Nick,

On Thursday, 12. October 2006 14:07, Nick Piggin wrote:
> Actually, filemap_xip needs some attention I think... if xip files
> can be truncated or invalidated (I assume they can), then we need to
> lock the page, validate that it is the correct one and not truncated,
> and return with it locked.

???

Isn't XIP for "eXecuting In Place" from ROM or FLASH?
How to truncate these? I thought the whole idea of
XIP was a pure RO mapping?

They should be valid from mount to umount.

Regards

Ingo Oeser, a bit puzzled about that...

2006-10-15 07:54:53

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

Hi Ingo,

Ingo Oeser wrote:
> Hi Nick,
>
> On Thursday, 12. October 2006 14:07, Nick Piggin wrote:
>
>>Actually, filemap_xip needs some attention I think... if xip files
>>can be truncated or invalidated (I assume they can), then we need to
>>lock the page, validate that it is the correct one and not truncated,
>>and return with it locked.
>
>
> ???
>
> Isn't XIP for "eXecuting In Place" from ROM or FLASH?

Yes, I assume so. It seems that it isn't restricted to executing, but
is basically a terminology to mean that it bypasses the pagecache.

> How to truncate these? I thought the whole idea of
> XIP was a pure RO mapping?

Well, not filemap_xip.

>
> They should be valid from mount to umount.
>
> Regards
>
> Ingo Oeser, a bit puzzled about that...

See mm/filemap_xip.c:xip_file_write, xip_truncate_page.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-24 21:32:13

by Dave Airlie

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

On 10/7/06, Nick Piggin <[email protected]> wrote:
> Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
> encodes the virtual address -> file offset differently from linear
> mappings.
>

Hi Nick,

what is the status of this patch? I'm just trying to line up a kernel
tree for the new DRM memory management code, which really would like
this...

Dave.

2006-10-26 11:09:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 3/3] mm: fault handler to replace nopage and populate

Dave Airlie wrote:
> On 10/7/06, Nick Piggin <[email protected]> wrote:
>
>> Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
>> encodes the virtual address -> file offset differently from linear
>> mappings.
>>
>
> Hi Nick,
>
> what is the status of this patch? I'm just trying to line up a kernel
> tree for the new DRM memory management code, which really would like
> this...
>
> Dave.

Hi Dave,

Blocked by another kernel bug at the moment. I hope both fixes can
make it into 2.6.20, but if that doesn't look like it will happen,
then I might try reworking the patchset to break the ->fault change
out by itself because there are several others who would like to use
it as well.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com