2003-05-31 00:30:11

by Paul E. McKenney

Subject: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race

Rediffed to 2.5.70-mm2.

This patch allows a distributed filesystem to avoid the
pagefault/cross-node-invalidate race described in:

http://marc.theaimsgroup.com/?l=linux-kernel&m=105286345316249&w=2

This patch converts the bulk of do_no_page() into a hook,
install_new_page(), that may be called from the ->nopage
vm_operations_struct callout. An inlined do_no_page() wrapper
remains because do_anonymous_page() requires that the
mm->page_table_lock be held on entry, while the ->nopage
callouts require that this lock be dropped.

This patch is untested.

One alternative to this patch is the nopagedone() patch posted
moments ago. hch has also suggested that do_anonymous_page() be
converted to a ->nopage callout, but this would require that all
of the other ->nopage callouts drop mm->page_table_lock as their
first action. If people believe that this is the right thing to
do, I will happily produce such a patch.
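
For illustration, a distributed filesystem's ->nopage under this
interface might look something like the following untested sketch.
Here dfs_find_or_fetch_page() is an invented placeholder for the
filesystem's own page lookup, which may block on network I/O:

```c
static int
dfs_nopage(struct mm_struct *mm, struct vm_area_struct *vma,
	   unsigned long address, int write_access, pmd_t *pmd)
{
	struct page *page;

	/* page_table_lock is not held here, so we may sleep. */
	page = dfs_find_or_fetch_page(vma, address & PAGE_MASK);
	if (!page)
		return VM_FAULT_SIGBUS;
	get_page(page);

	/*
	 * The filesystem can hold whatever lock it uses to serialize
	 * against cross-node invalidates across this call, closing the
	 * race before the new mapping becomes visible.
	 */
	return install_new_page(mm, vma, address, write_access, pmd, page);
}
```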

Thoughts?

Thanx, Paul


diff -urN -X dontdiff linux-2.5.70-mm2/arch/ia64/ia32/binfmt_elf32.c linux-2.5.70-mm2.install_new_page/arch/ia64/ia32/binfmt_elf32.c
--- linux-2.5.70-mm2/arch/ia64/ia32/binfmt_elf32.c Mon May 26 18:00:58 2003
+++ linux-2.5.70-mm2.install_new_page/arch/ia64/ia32/binfmt_elf32.c Fri May 30 15:12:36 2003
@@ -56,13 +56,13 @@
extern struct page *ia32_shared_page[];
extern unsigned long *ia32_gdt;

-struct page *
-ia32_install_shared_page (struct vm_area_struct *vma, unsigned long address, int no_share)
+int
+ia32_install_shared_page (struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t *pmd)
{
struct page *pg = ia32_shared_page[(address - vma->vm_start)/PAGE_SIZE];

get_page(pg);
- return pg;
+ return install_new_page(mm, vma, address, write_access, pmd, pg);
}

static struct vm_operations_struct ia32_shared_page_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm2/arch/sparc64/mm/hugetlbpage.c linux-2.5.70-mm2.install_new_page/arch/sparc64/mm/hugetlbpage.c
--- linux-2.5.70-mm2/arch/sparc64/mm/hugetlbpage.c Mon May 26 18:00:42 2003
+++ linux-2.5.70-mm2.install_new_page/arch/sparc64/mm/hugetlbpage.c Fri May 30 15:12:36 2003
@@ -633,11 +633,11 @@
return (int) htlbzone_pages;
}

-static struct page *
-hugetlb_nopage(struct vm_area_struct *vma, unsigned long address, int unused)
+static int
+hugetlb_nopage(struct mm_struct * mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t * pmd)
{
BUG();
- return NULL;
+ return VM_FAULT_SIGBUS;
}

static struct vm_operations_struct hugetlb_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm2/drivers/char/agp/alpha-agp.c linux-2.5.70-mm2.install_new_page/drivers/char/agp/alpha-agp.c
--- linux-2.5.70-mm2/drivers/char/agp/alpha-agp.c Mon May 26 18:00:42 2003
+++ linux-2.5.70-mm2.install_new_page/drivers/char/agp/alpha-agp.c Fri May 30 15:12:36 2003
@@ -11,9 +11,9 @@

#include "agp.h"

-static struct page *alpha_core_agp_vm_nopage(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+static int alpha_core_agp_vm_nopage(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pmd_t *pmd)
{
alpha_agp_info *agp = agp_bridge->dev_private_data;
dma_addr_t dma_addr;
@@ -23,14 +23,14 @@
dma_addr = address - vma->vm_start + agp->aperture.bus_base;
pa = agp->ops->translate(agp, dma_addr);

- if (pa == (unsigned long)-EINVAL) return NULL; /* no translation */
+ if (pa == (unsigned long)-EINVAL) return VM_FAULT_SIGBUS; /* no translation */

/*
* Get the page, inc the use count, and return it
*/
page = virt_to_page(__va(pa));
get_page(page);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

static struct aper_size_info_fixed alpha_core_agp_sizes[] =
diff -urN -X dontdiff linux-2.5.70-mm2/drivers/char/drm/drmP.h linux-2.5.70-mm2.install_new_page/drivers/char/drm/drmP.h
--- linux-2.5.70-mm2/drivers/char/drm/drmP.h Mon May 26 18:00:45 2003
+++ linux-2.5.70-mm2.install_new_page/drivers/char/drm/drmP.h Fri May 30 15:12:36 2003
@@ -620,18 +620,17 @@
extern int DRM(fasync)(int fd, struct file *filp, int on);

/* Mapping support (drm_vm.h) */
-extern struct page *DRM(vm_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access);
-extern struct page *DRM(vm_shm_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access);
-extern struct page *DRM(vm_dma_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access);
-extern struct page *DRM(vm_sg_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access);
+extern int DRM(vm_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write_access, pmd_t *pmd);
+extern int DRM(vm_shm_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pmd_t *pmd);
+extern int DRM(vm_dma_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pmd_t *pmd);
+extern int DRM(vm_sg_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pmd_t *pmd);
extern void DRM(vm_open)(struct vm_area_struct *vma);
extern void DRM(vm_close)(struct vm_area_struct *vma);
extern void DRM(vm_shm_close)(struct vm_area_struct *vma);
diff -urN -X dontdiff linux-2.5.70-mm2/drivers/char/drm/drm_vm.h linux-2.5.70-mm2.install_new_page/drivers/char/drm/drm_vm.h
--- linux-2.5.70-mm2/drivers/char/drm/drm_vm.h Mon May 26 18:01:02 2003
+++ linux-2.5.70-mm2.install_new_page/drivers/char/drm/drm_vm.h Fri May 30 15:12:36 2003
@@ -55,9 +55,9 @@
.close = DRM(vm_close),
};

-struct page *DRM(vm_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+int DRM(vm_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pmd_t *pmd)
{
#if __REALLY_HAVE_AGP
drm_file_t *priv = vma->vm_file->private_data;
@@ -114,35 +114,35 @@
baddr, __va(agpmem->memory->memory[offset]), offset,
atomic_read(&page->count));

- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}
vm_nopage_error:
#endif /* __REALLY_HAVE_AGP */

- return NOPAGE_SIGBUS; /* Disallow mremap */
+ return VM_FAULT_SIGBUS; /* Disallow mremap */
}

-struct page *DRM(vm_shm_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+int DRM(vm_shm_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pmd_t *pmd)
{
drm_map_t *map = (drm_map_t *)vma->vm_private_data;
unsigned long offset;
unsigned long i;
struct page *page;

- if (address > vma->vm_end) return NOPAGE_SIGBUS; /* Disallow mremap */
- if (!map) return NOPAGE_OOM; /* Nothing allocated */
+ if (address > vma->vm_end) return VM_FAULT_SIGBUS; /* Disallow mremap */
+ if (!map) return VM_FAULT_OOM; /* Nothing allocated */

offset = address - vma->vm_start;
i = (unsigned long)map->handle + offset;
page = vmalloc_to_page((void *)i);
if (!page)
- return NOPAGE_OOM;
+ return VM_FAULT_OOM;
get_page(page);

DRM_DEBUG("shm_nopage 0x%lx\n", address);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

/* Special close routine which deletes map information if we are the last
@@ -221,9 +221,9 @@
up(&dev->struct_sem);
}

-struct page *DRM(vm_dma_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+int DRM(vm_dma_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pmd_t *pmd)
{
drm_file_t *priv = vma->vm_file->private_data;
drm_device_t *dev = priv->dev;
@@ -232,9 +232,9 @@
unsigned long page_nr;
struct page *page;

- if (!dma) return NOPAGE_SIGBUS; /* Error */
- if (address > vma->vm_end) return NOPAGE_SIGBUS; /* Disallow mremap */
- if (!dma->pagelist) return NOPAGE_OOM ; /* Nothing allocated */
+ if (!dma) return VM_FAULT_SIGBUS; /* Error */
+ if (address > vma->vm_end) return VM_FAULT_SIGBUS; /* Disallow mremap */
+ if (!dma->pagelist) return VM_FAULT_OOM ; /* Nothing allocated */

offset = address - vma->vm_start; /* vm_[pg]off[set] should be 0 */
page_nr = offset >> PAGE_SHIFT;
@@ -244,12 +244,12 @@
get_page(page);

DRM_DEBUG("dma_nopage 0x%lx (page %lu)\n", address, page_nr);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

-struct page *DRM(vm_sg_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+int DRM(vm_sg_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pmd_t *pmd)
{
drm_map_t *map = (drm_map_t *)vma->vm_private_data;
drm_file_t *priv = vma->vm_file->private_data;
@@ -260,9 +260,9 @@
unsigned long page_offset;
struct page *page;

- if (!entry) return NOPAGE_SIGBUS; /* Error */
- if (address > vma->vm_end) return NOPAGE_SIGBUS; /* Disallow mremap */
- if (!entry->pagelist) return NOPAGE_OOM ; /* Nothing allocated */
+ if (!entry) return VM_FAULT_SIGBUS; /* Error */
+ if (address > vma->vm_end) return VM_FAULT_SIGBUS; /* Disallow mremap */
+ if (!entry->pagelist) return VM_FAULT_OOM ; /* Nothing allocated */


offset = address - vma->vm_start;
@@ -271,7 +271,7 @@
page = entry->pagelist[page_offset];
get_page(page);

- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

void DRM(vm_open)(struct vm_area_struct *vma)
diff -urN -X dontdiff linux-2.5.70-mm2/drivers/ieee1394/dma.c linux-2.5.70-mm2.install_new_page/drivers/ieee1394/dma.c
--- linux-2.5.70-mm2/drivers/ieee1394/dma.c Mon May 26 18:00:40 2003
+++ linux-2.5.70-mm2.install_new_page/drivers/ieee1394/dma.c Fri May 30 15:12:36 2003
@@ -184,28 +184,27 @@

/* nopage() handler for mmap access */

-static struct page*
-dma_region_pagefault(struct vm_area_struct *area, unsigned long address, int write_access)
+static int
+dma_region_pagefault(struct mm_struct *mm, struct vm_area_struct *area, unsigned long address, int write_access, pmd_t *pmd)
{
unsigned long offset;
unsigned long kernel_virt_addr;
- struct page *ret = NOPAGE_SIGBUS;
+ struct page *page;

struct dma_region *dma = (struct dma_region*) area->vm_private_data;

if(!dma->kvirt)
- goto out;
+ return VM_FAULT_SIGBUS;

if( (address < (unsigned long) area->vm_start) ||
(address > (unsigned long) area->vm_start + (PAGE_SIZE * dma->n_pages)) )
- goto out;
+ return VM_FAULT_SIGBUS;

offset = address - area->vm_start;
kernel_virt_addr = (unsigned long) dma->kvirt + offset;
- ret = vmalloc_to_page((void*) kernel_virt_addr);
- get_page(ret);
-out:
- return ret;
+ page = vmalloc_to_page((void*) kernel_virt_addr);
+ get_page(page);
+ return install_new_page(mm, area, address, write_access, pmd, page);
}

static struct vm_operations_struct dma_region_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm2/drivers/media/video/video-buf.c linux-2.5.70-mm2.install_new_page/drivers/media/video/video-buf.c
--- linux-2.5.70-mm2/drivers/media/video/video-buf.c Mon May 26 18:00:40 2003
+++ linux-2.5.70-mm2.install_new_page/drivers/media/video/video-buf.c Fri May 30 15:12:36 2003
@@ -979,21 +979,21 @@
* now ...). Bounce buffers don't work very well for the data rates
* video capture has.
*/
-static struct page*
-videobuf_vm_nopage(struct vm_area_struct *vma, unsigned long vaddr,
- int write_access)
+static int
+videobuf_vm_nopage(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long vaddr, int write_access, pmd_t *pmd)
{
struct page *page;

dprintk(3,"nopage: fault @ %08lx [vma %08lx-%08lx]\n",
vaddr,vma->vm_start,vma->vm_end);
if (vaddr > vma->vm_end)
- return NOPAGE_SIGBUS;
+ return VM_FAULT_SIGBUS;
page = alloc_page(GFP_USER);
if (!page)
- return NOPAGE_OOM;
+ return VM_FAULT_OOM;
clear_user_page(page_address(page), vaddr, page);
- return page;
+ return install_new_page(mm, vma, vaddr, write_access, pmd, page);
}

static struct vm_operations_struct videobuf_vm_ops =
diff -urN -X dontdiff linux-2.5.70-mm2/drivers/scsi/sg.c linux-2.5.70-mm2.install_new_page/drivers/scsi/sg.c
--- linux-2.5.70-mm2/drivers/scsi/sg.c Fri May 30 14:51:03 2003
+++ linux-2.5.70-mm2.install_new_page/drivers/scsi/sg.c Fri May 30 15:12:36 2003
@@ -1121,21 +1121,21 @@
}
}

-static struct page *
-sg_vma_nopage(struct vm_area_struct *vma, unsigned long addr, int unused)
+static int
+sg_vma_nopage(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, int write_access, pmd_t *pmd)
{
Sg_fd *sfp;
- struct page *page = NOPAGE_SIGBUS;
+ struct page *page = NULL;
void *page_ptr = NULL;
unsigned long offset;
Sg_scatter_hold *rsv_schp;

if ((NULL == vma) || (!(sfp = (Sg_fd *) vma->vm_private_data)))
- return page;
+ return VM_FAULT_SIGBUS;
rsv_schp = &sfp->reserve;
offset = addr - vma->vm_start;
if (offset >= rsv_schp->bufflen)
- return page;
+ return VM_FAULT_SIGBUS;
SCSI_LOG_TIMEOUT(3, printk("sg_vma_nopage: offset=%lu, scatg=%d\n",
offset, rsv_schp->k_use_sg));
if (rsv_schp->k_use_sg) { /* reserve buffer is a scatter gather list */
@@ -1162,7 +1162,7 @@
page = virt_to_page(page_ptr);
get_page(page); /* increment page count */
}
- return page;
+ return page ? install_new_page(mm, vma, addr, write_access, pmd, page) : VM_FAULT_SIGBUS;
}

static struct vm_operations_struct sg_mmap_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm2/drivers/sgi/char/graphics.c linux-2.5.70-mm2.install_new_page/drivers/sgi/char/graphics.c
--- linux-2.5.70-mm2/drivers/sgi/char/graphics.c Mon May 26 18:00:40 2003
+++ linux-2.5.70-mm2.install_new_page/drivers/sgi/char/graphics.c Fri May 30 15:12:36 2003
@@ -211,9 +211,9 @@
/*
* This is the core of the direct rendering engine.
*/
-struct page *
-sgi_graphics_nopage (struct vm_area_struct *vma, unsigned long address, int
- no_share)
+int
+sgi_graphics_nopage (struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write_access, pmd_t *pmdpf)
{
pgd_t *pgd; pmd_t *pmd; pte_t *pte;
int board = GRAPHICS_CARD (vma->vm_dentry->d_inode->i_rdev);
@@ -249,7 +249,7 @@
pte = pte_kmap_offset(pmd, address);
page = pte_page(*pte);
pte_kunmap(pte);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmdpf, page);
}

/*
diff -urN -X dontdiff linux-2.5.70-mm2/fs/ncpfs/mmap.c linux-2.5.70-mm2.install_new_page/fs/ncpfs/mmap.c
--- linux-2.5.70-mm2/fs/ncpfs/mmap.c Mon May 26 18:00:43 2003
+++ linux-2.5.70-mm2.install_new_page/fs/ncpfs/mmap.c Fri May 30 15:12:36 2003
@@ -25,8 +25,8 @@
/*
* Fill in the supplied page for mmap
*/
-static struct page* ncp_file_mmap_nopage(struct vm_area_struct *area,
- unsigned long address, int write_access)
+static int ncp_file_mmap_nopage(struct mm_struct *mm, struct vm_area_struct *area,
+ unsigned long address, int write_access, pmd_t *pmd)
{
struct file *file = area->vm_file;
struct dentry *dentry = file->f_dentry;
@@ -85,7 +85,7 @@
memset(pg_addr + already_read, 0, PAGE_SIZE - already_read);
flush_dcache_page(page);
kunmap(page);
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);
}

static struct vm_operations_struct ncp_file_mmap =
diff -urN -X dontdiff linux-2.5.70-mm2/include/linux/mm.h linux-2.5.70-mm2.install_new_page/include/linux/mm.h
--- linux-2.5.70-mm2/include/linux/mm.h Fri May 30 14:51:05 2003
+++ linux-2.5.70-mm2.install_new_page/include/linux/mm.h Fri May 30 15:12:36 2003
@@ -142,7 +142,7 @@
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
- struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
+ int (*nopage)(struct mm_struct * mm, struct vm_area_struct * area, unsigned long address, int write_access, pmd_t *pmd);
int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
};

@@ -380,12 +380,6 @@
}

/*
- * Error return values for the *_nopage functions
- */
-#define NOPAGE_SIGBUS (NULL)
-#define NOPAGE_OOM ((struct page *) (-1))
-
-/*
* Different kinds of faults, as returned by handle_mm_fault().
* Used to decide whether a process gets delivered SIGBUS or
* just gets major/minor fault counters bumped up.
@@ -402,8 +396,8 @@

extern void show_free_areas(void);

-struct page *shmem_nopage(struct vm_area_struct * vma,
- unsigned long address, int unused);
+int shmem_nopage(struct mm_struct * mm, struct vm_area_struct * vma,
+ unsigned long address, int write_access, pmd_t * pmd);
struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
void shmem_lock(struct file * file, int lock);
int shmem_zero_setup(struct vm_area_struct *);
@@ -421,6 +415,7 @@
int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
unsigned long size, pgprot_t prot);

+extern int install_new_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t *pmd, struct page * new_page);
extern void invalidate_mmap_range(struct address_space *mapping,
loff_t const holebegin,
loff_t const holelen);
@@ -559,7 +554,7 @@
extern void truncate_inode_pages(struct address_space *, loff_t);

/* generic vm_area_ops exported for stackable file systems */
-extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int);
+int filemap_nopage(struct mm_struct *, struct vm_area_struct *, unsigned long, int, pmd_t *);

/* mm/page-writeback.c */
int write_one_page(struct page *page, int wait);
diff -urN -X dontdiff linux-2.5.70-mm2/kernel/ksyms.c linux-2.5.70-mm2.install_new_page/kernel/ksyms.c
--- linux-2.5.70-mm2/kernel/ksyms.c Fri May 30 14:51:06 2003
+++ linux-2.5.70-mm2.install_new_page/kernel/ksyms.c Fri May 30 15:12:36 2003
@@ -116,6 +116,7 @@
EXPORT_SYMBOL(max_mapnr);
#endif
EXPORT_SYMBOL(high_memory);
+EXPORT_SYMBOL(install_new_page);
EXPORT_SYMBOL(invalidate_mmap_range);
EXPORT_SYMBOL(vmtruncate);
EXPORT_SYMBOL(find_vma);
diff -urN -X dontdiff linux-2.5.70-mm2/mm/filemap.c linux-2.5.70-mm2.install_new_page/mm/filemap.c
--- linux-2.5.70-mm2/mm/filemap.c Fri May 30 14:51:06 2003
+++ linux-2.5.70-mm2.install_new_page/mm/filemap.c Fri May 30 15:12:36 2003
@@ -1013,7 +1013,7 @@
* it in the page cache, and handles the special cases reasonably without
* having a lot of duplicated code.
*/
-struct page * filemap_nopage(struct vm_area_struct * area, unsigned long address, int unused)
+int filemap_nopage(struct mm_struct * mm, struct vm_area_struct * area, unsigned long address, int write_access, pmd_t * pmd)
{
int error;
struct file *file = area->vm_file;
@@ -1034,7 +1034,7 @@
*/
size = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if ((pgoff >= size) && (area->vm_mm == current->mm))
- return NULL;
+ return VM_FAULT_SIGBUS;

/*
* The "size" of the file, as far as mmap is concerned, isn't bigger
@@ -1088,7 +1088,7 @@
* Found the page and have a reference on it.
*/
mark_page_accessed(page);
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);

no_cached_page:
/*
@@ -1111,8 +1111,8 @@
* to schedule I/O.
*/
if (error == -ENOMEM)
- return NOPAGE_OOM;
- return NULL;
+ return VM_FAULT_OOM;
+ return VM_FAULT_SIGBUS;

page_not_uptodate:
inc_page_state(pgmajfault);
@@ -1169,7 +1169,7 @@
* mm layer so, possibly freeing the page cache page first.
*/
page_cache_release(page);
- return NULL;
+ return VM_FAULT_SIGBUS;
}

static struct page * filemap_getpage(struct file *file, unsigned long pgoff,
diff -urN -X dontdiff linux-2.5.70-mm2/mm/memory.c linux-2.5.70-mm2.install_new_page/mm/memory.c
--- linux-2.5.70-mm2/mm/memory.c Fri May 30 14:51:06 2003
+++ linux-2.5.70-mm2.install_new_page/mm/memory.c Fri May 30 15:12:36 2003
@@ -1374,39 +1374,33 @@
}

/*
- * do_no_page() tries to create a new page mapping. It aggressively
- * tries to share with existing pages, but makes a separate copy if
- * the "write_access" parameter is true in order to avoid the next
- * page fault.
- *
- * As this is called only for pages that do not currently exist, we
- * do not need to flush old virtual caches or the TLB.
- *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * do_no_page() invokes do_anonymous_page() or ->nopage, as appropriate.
+ * Called w/ MM sema and page_table_lock held, the latter released before exit.
*/
-static int
+static inline int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
{
- struct page * new_page;
- pte_t entry;
- struct pte_chain *pte_chain;
- int ret;
-
if (!vma->vm_ops || !vma->vm_ops->nopage)
- return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ return do_anonymous_page(mm, vma, page_table, pmd, write_access, address);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
+ return vma->vm_ops->nopage(mm, vma, address & PAGE_MASK, write_access, pmd);
+}

- new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);
-
- /* no page was available -- either SIGBUS or OOM */
- if (new_page == NOPAGE_SIGBUS)
- return VM_FAULT_SIGBUS;
- if (new_page == NOPAGE_OOM)
- return VM_FAULT_OOM;
+/*
+ * install_new_page - tries to create a new page mapping.
+ * install_new_page() tries to share w/existing pages, but makes separate
+ * copy if "write_access" is true in order to avoid the next page fault.
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ */
+int
+install_new_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t *pmd, struct page * new_page)
+{
+ pte_t entry, *page_table;
+ struct pte_chain *pte_chain;
+ int ret;

pte_chain = pte_chain_alloc(GFP_KERNEL);
if (!pte_chain)
diff -urN -X dontdiff linux-2.5.70-mm2/mm/shmem.c linux-2.5.70-mm2.install_new_page/mm/shmem.c
--- linux-2.5.70-mm2/mm/shmem.c Mon May 26 18:00:39 2003
+++ linux-2.5.70-mm2.install_new_page/mm/shmem.c Fri May 30 15:12:36 2003
@@ -936,7 +936,7 @@
return error;
}

-struct page *shmem_nopage(struct vm_area_struct *vma, unsigned long address, int unused)
+int shmem_nopage(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t *pmd)
{
struct inode *inode = vma->vm_file->f_dentry->d_inode;
struct page *page = NULL;
@@ -949,10 +949,10 @@

error = shmem_getpage(inode, idx, &page, SGP_CACHE);
if (error)
- return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS;
+ return (error == -ENOMEM)? VM_FAULT_OOM: VM_FAULT_SIGBUS;

mark_page_accessed(page);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

static int shmem_populate(struct vm_area_struct *vma,
diff -urN -X dontdiff linux-2.5.70-mm2/sound/core/pcm_native.c linux-2.5.70-mm2.install_new_page/sound/core/pcm_native.c
--- linux-2.5.70-mm2/sound/core/pcm_native.c Mon May 26 18:00:37 2003
+++ linux-2.5.70-mm2.install_new_page/sound/core/pcm_native.c Fri May 30 15:12:36 2003
@@ -60,6 +60,11 @@
static int snd_pcm_hw_refine_old_user(snd_pcm_substream_t * substream, struct sndrv_pcm_hw_params_old * _oparams);
static int snd_pcm_hw_params_old_user(snd_pcm_substream_t * substream, struct sndrv_pcm_hw_params_old * _oparams);

+#ifndef LINUX_2_2
+#define NOPAGE_OOM VM_FAULT_OOM
+#define NOPAGE_SIGBUS VM_FAULT_SIGBUS
+#endif
+
/*
*
*/
@@ -2693,7 +2698,7 @@
#endif

#ifndef LINUX_2_2
-static struct page * snd_pcm_mmap_status_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
+static int snd_pcm_mmap_status_nopage(struct mm_struct *mm, struct vm_area_struct *area, unsigned long address, int write_access, pmd_t *pmd)
#else
static unsigned long snd_pcm_mmap_status_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
#endif
@@ -2708,7 +2713,7 @@
page = virt_to_page(runtime->status);
get_page(page);
#ifndef LINUX_2_2
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);
#else
return page_address(page);
#endif
@@ -2747,7 +2752,7 @@
}

#ifndef LINUX_2_2
-static struct page * snd_pcm_mmap_control_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
+static int snd_pcm_mmap_control_nopage(struct mm_struct *mm, struct vm_area_struct *area, unsigned long address, int write_access, pmd_t *pmd)
#else
static unsigned long snd_pcm_mmap_control_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
#endif
@@ -2762,7 +2767,7 @@
page = virt_to_page(runtime->control);
get_page(page);
#ifndef LINUX_2_2
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);
#else
return page_address(page);
#endif
@@ -2813,7 +2818,7 @@
}

#ifndef LINUX_2_2
-static struct page * snd_pcm_mmap_data_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
+static int snd_pcm_mmap_data_nopage(struct mm_struct *mm, struct vm_area_struct *area, unsigned long address, int write_access, pmd_t *pmd)
#else
static unsigned long snd_pcm_mmap_data_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
#endif
@@ -2848,7 +2853,7 @@
}
get_page(page);
#ifndef LINUX_2_2
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);
#else
return page_address(page);
#endif
diff -urN -X dontdiff linux-2.5.70-mm2/sound/oss/emu10k1/audio.c linux-2.5.70-mm2.install_new_page/sound/oss/emu10k1/audio.c
--- linux-2.5.70-mm2/sound/oss/emu10k1/audio.c Mon May 26 18:00:23 2003
+++ linux-2.5.70-mm2.install_new_page/sound/oss/emu10k1/audio.c Fri May 30 15:12:36 2003
@@ -970,7 +970,7 @@
return 0;
}

-static struct page *emu10k1_mm_nopage (struct vm_area_struct * vma, unsigned long address, int write_access)
+static int emu10k1_mm_nopage (struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, int write_access, pmd_t * pmd)
{
struct emu10k1_wavedevice *wave_dev = vma->vm_private_data;
struct woinst *woinst = wave_dev->woinst;
@@ -983,8 +983,8 @@
DPD(3, "addr: %#lx\n", address);

if (address > vma->vm_end) {
- DPF(1, "EXIT, returning NOPAGE_SIGBUS\n");
- return NOPAGE_SIGBUS; /* Disallow mremap */
+ DPF(1, "EXIT, returning VM_FAULT_SIGBUS\n");
+ return VM_FAULT_SIGBUS; /* Disallow mremap */
}

pgoff = vma->vm_pgoff + ((address - vma->vm_start) >> PAGE_SHIFT);
@@ -1013,7 +1013,7 @@
get_page (dmapage);

DPD(3, "page: %#lx\n", (unsigned long) dmapage);
- return dmapage;
+ return install_new_page(mm, vma, address, write_access, pmd, dmapage);
}

struct vm_operations_struct emu10k1_mm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm2/sound/oss/via82cxxx_audio.c linux-2.5.70-mm2.install_new_page/sound/oss/via82cxxx_audio.c
--- linux-2.5.70-mm2/sound/oss/via82cxxx_audio.c Mon May 26 18:00:27 2003
+++ linux-2.5.70-mm2.install_new_page/sound/oss/via82cxxx_audio.c Fri May 30 15:12:36 2003
@@ -1846,8 +1846,8 @@
}


-static struct page * via_mm_nopage (struct vm_area_struct * vma,
- unsigned long address, int write_access)
+static int via_mm_nopage (struct mm_struct *mm, struct vm_area_struct * vma,
+ unsigned long address, int write_access, pmd_t *pmd)
{
struct via_info *card = vma->vm_private_data;
struct via_channel *chan = &card->ch_out;
@@ -1863,12 +1863,12 @@
write_access);

if (address > vma->vm_end) {
- DPRINTK ("EXIT, returning NOPAGE_SIGBUS\n");
- return NOPAGE_SIGBUS; /* Disallow mremap */
+ DPRINTK ("EXIT, returning VM_FAULT_SIGBUS\n");
+ return VM_FAULT_SIGBUS; /* Disallow mremap */
}
if (!card) {
- DPRINTK ("EXIT, returning NOPAGE_OOM\n");
- return NOPAGE_OOM; /* Nothing allocated */
+ DPRINTK ("EXIT, returning VM_FAULT_OOM\n");
+ return VM_FAULT_OOM; /* Nothing allocated */
}

pgoff = vma->vm_pgoff + ((address - vma->vm_start) >> PAGE_SHIFT);
@@ -1895,10 +1895,10 @@
assert ((((unsigned long)chan->pgtbl[pgoff].cpuaddr) % PAGE_SIZE) == 0);

dmapage = virt_to_page (chan->pgtbl[pgoff].cpuaddr);
- DPRINTK ("EXIT, returning page %p for cpuaddr %lXh\n",
+ DPRINTK ("EXIT, installing page %p for cpuaddr %lXh\n",
dmapage, (unsigned long) chan->pgtbl[pgoff].cpuaddr);
get_page (dmapage);
- return dmapage;
+ return install_new_page(mm, vma, address, write_access, pmd, dmapage);
}



2003-05-31 00:47:05

by Andrew Morton

Subject: Re: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race

"Paul E. McKenney" <[email protected]> wrote:
>
> Rediffed to 2.5.70-mm2.
>
> This patch allows a distributed filesystem to avoid the
> pagefault/cross-node-invalidate race described in:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=105286345316249&w=2
>
> This patch converts the bulk of do_no_page() into a hook that may
> be called from the ->nopage vm_operations_struct callout.

Seems reasonable.

> There
> is still an inlined do_no_page() wrapper due to the fact that
> do_anonymous_page() requires that the mm->page_table_lock be
> held on entry, while the ->nopage callouts require that this
> lock be dropped.

I suggest you change the ->nopage definition so that page_table_lock is held
on entry to ->nopage, and ->nopage must drop it at some point. This gives
the nopage implementations some more flexibility and may perhaps eliminate
that special case?
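
Something like this (untested sketch; example_lookup_page() stands in
for whatever lookup the driver or filesystem actually does):

```c
/* ->nopage entered with mm->page_table_lock held; callee must drop it. */
static int
example_nopage(struct mm_struct *mm, struct vm_area_struct *vma,
	       unsigned long address, int write_access, pmd_t *pmd)
{
	struct page *page;

	spin_unlock(&mm->page_table_lock);	/* first action for most callouts */

	page = example_lookup_page(vma, address);	/* may now sleep */
	if (!page)
		return VM_FAULT_SIGBUS;
	get_page(page);
	return install_new_page(mm, vma, address, write_access, pmd, page);
}
```

do_anonymous_page() would then simply keep holding the lock, so the
special-cased wrapper could go away.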

> This patch is untested.

I don't think there's a lot of point in making changes until the code which
requires those changes is accepted into the tree. Otherwise it may be
pointless churn, and there's nothing in-tree to exercise the new features.

> An alternative to this patch includes the nopagedone() patch posted
> moments ago. hch has also suggested that do_anonymous_page() be
> converted to a ->nopage callout, but this would require that all
> of the other ->nopage callouts drop mm->page_table_lock as their
> first action. If people believe that this is the right thing to
> do, I will happily produce such a patch.

That sounds better to me.

2003-05-31 16:11:27

by Ingo Oeser

Subject: Always passing mm and vma down (was: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race)

Hi there,

On Fri, May 30, 2003 at 04:41:50PM -0700, Paul E. McKenney wrote:
> -struct page *
> -ia32_install_shared_page (struct vm_area_struct *vma, unsigned long address, int no_share)
> +int
> +ia32_install_shared_page (struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t *pmd)
> {
> struct page *pg = ia32_shared_page[(address - vma->vm_start)/PAGE_SIZE];
>
> get_page(pg);
> - return pg;
> + return install_new_page(mm, vma, address, write_access, pmd, pg);
> }

Why do we always pass both mm and vma down, even though vma->vm_mm
already contains the mm that the vma belongs to? Is the connection
between a vma and its mm also protected by the mmap_sem?

Is this really necessary, or is it an oversight that wastes a lot of
stack in a lot of places?

If we just need it for accounting: We need current->mm, if we
need it to locate the next vma relatively to this vma, vma->vm_mm
is the one.

Puzzled

Ingo Oeser

2003-05-31 23:35:28

by Paul E. McKenney

Subject: Re: Always passing mm and vma down (was: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race)

On Sat, May 31, 2003 at 10:46:18AM +0200, Ingo Oeser wrote:
> On Fri, May 30, 2003 at 04:41:50PM -0700, Paul E. McKenney wrote:
> > -struct page *
> > -ia32_install_shared_page (struct vm_area_struct *vma, unsigned long address, int no_share)
> > +int
> > +ia32_install_shared_page (struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t *pmd)
>
> Why do we always pass both mm and vma down, even though vma->vm_mm
> already contains the mm that the vma belongs to? Is the connection
> between a vma and its mm also protected by the mmap_sem?
>
> Is this really necessary, or is it an oversight that wastes a lot of
> stack in a lot of places?
>
> If we just need it for accounting, current->mm will do; if we
> need it to locate the next vma relative to this one, vma->vm_mm
> is the one.

Interesting point. The original do_no_page() API does this
as well:

static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)

As does do_anonymous_page(). I assumed that there were corner
cases where this one-to-one correspondence did not exist, but
must confess that I did not go looking for them.

Or is this a performance issue, avoiding a dereference and
possible cache miss?

Thanx, Paul

2003-05-31 23:38:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race

On Fri, May 30, 2003 at 06:00:27PM -0700, Andrew Morton wrote:
> "Paul E. McKenney" <[email protected]> wrote:
> > There
> > is still an inlined do_no_page() wrapper due to the fact that
> > do_anonymous_page() requires that the mm->page_table_lock be
> > held on entry, while the ->nopage callouts require that this
> > lock be dropped.
>
> I suggest you change the ->nopage definition so that page_table_lock is held
> on entry to ->nopage, and ->nopage must drop it at some point. This gives
> the nopage implementations some more flexibility and may perhaps eliminate
> that special case?

Will do!

> > This patch is untested.
>
> I don't think there's a lot of point in making changes until the code which
> requires those changes is accepted into the tree. Otherwise it may be
> pointless churn, and there's nothing in-tree to exercise the new features.

A GPLed use of these DFS features is expected Real Soon Now...

Thanx, Paul

2003-06-01 12:16:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Always passing mm and vma down (was: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race)

Hi Paul,

On Sat, May 31, 2003 at 04:48:16PM -0700, Paul E. McKenney wrote:
> On Sat, May 31, 2003 at 10:46:18AM +0200, Ingo Oeser wrote:
> > On Fri, May 30, 2003 at 04:41:50PM -0700, Paul E. McKenney wrote:
> > > -struct page *
> > > -ia32_install_shared_page (struct vm_area_struct *vma, unsigned long address, int no_share)
> > > +int
> > > +ia32_install_shared_page (struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t *pmd)
> >
> > Why do we always pass both mm and vma down, even though vma->vm_mm
> > already contains the mm that the vma belongs to? Is the connection
> > between a vma and its mm also protected by mmap_sem?
> >
> > Is this really necessary, or is it an oversight that wastes a lot of
> > stack in a lot of places?
> >
> > If we just need it for accounting, current->mm is the right one; if we
> > need it to locate the next vma relative to this one, vma->vm_mm is.
>
> Interesting point. The original do_no_page() API does this
> as well:
>
> static int
> do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
>
> As does do_anonymous_page(). I assumed that there were corner
> cases where this one-to-one correspondence did not exist, but
> must confess that I did not go looking for them.
>
> Or is this a performance issue, avoiding a dereference and
> possible cache miss?

It's a performance micro-optimization issue (the stack memory overhead is
not significant): each new parameter to those calls will slow down the
kernel, and with the x86 calling conventions it hurts more than on other
archs. Jeff did something similar for UML (I don't know whether he could
also use vma->vm_mm; in any case I #ifdef'd it out entirely for non-UML
compiles, for exactly this reason that I can't slow down the whole
production host kernel just for the UML compile), and then there's also
the additional extern call. So basically the patch would hurt until
there are active users.

But the real question here I guess is: why should a distributed
filesystem need to install the pte by itself?

Memory coherency with distributed shared memory (i.e. MAP_SHARED
against truncate, with the MAP_SHARED mapping and the truncate running
on different boxes on a DFS) is a generic problem (most of the time
solved with the same algorithm, too), so I believe the infrastructure
and logic to keep it coherent might make sense as common-code
functionality.

And without proper high-level locking, the non-distributed filesystems
will corrupt the VM too, with truncate racing against nopage. I already
fixed this in my tree (see the attachment), so I wonder if the fixes
could be shared. I mean, you definitely need my fixes even when using
the DFS on an isolated box, and if you don't need them while using the
fs locally, it means we're duplicating effort somehow.

Since I don't see the users of the new hook, it's a bit hard to judge
whether the duplication is legitimate. So overall I'd agree with Andrew
that to judge the patch it'd make sense to see (or know more about) the
users of the hook too.

As for the anon memory, yes, it's probably more efficient and cleaner to
have it be a nopage callback too; the double branch is probably more
costly than the duplicated unlock anyway. However, it has nothing to do
with these issues, so I recommend keeping it separate from the DFS
patches (for readability). (It can't have anything to do with DFS, since
that memory is anon, local to the box [if not using openMosix, but
that's a different issue ;) ].)

thanks,
Andrea


Attachments:
(No filename) (3.49 kB)
9999_truncate-nopage-race-1 (3.96 kB)

2003-06-01 19:21:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race

On Fri, May 30, 2003 at 06:00:27PM -0700, Andrew Morton wrote:
> "Paul E. McKenney" <[email protected]> wrote:
> > An alternative to this patch includes the nopagedone() patch posted
> > moments ago. hch has also suggested that do_anonymous_page() be
> > converted to a ->nopage callout, but this would require that all
> > of the other ->nopage callouts drop mm->page_table_lock as their
> > first action. If people believe that this is the right thing to
> > do, I will happily produce such a patch.
>
> That sounds better to me.

Here is a patch, compiled on i386, untested. I had to put the
pte_unmap() as well as the spin_unlock() into each ->nopage
function. I would guess that this might be more attractive
if combined with a fix for the pagefault-truncate() race. ;-)

Thanx, Paul

diff -urN -X dontdiff linux-2.5.70-mm3/arch/i386/mm/hugetlbpage.c linux-2.5.70-mm3.install_new_page_hch/arch/i386/mm/hugetlbpage.c
--- linux-2.5.70-mm3/arch/i386/mm/hugetlbpage.c 2003-05-26 18:00:58.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/arch/i386/mm/hugetlbpage.c 2003-06-01 10:43:06.000000000 -0700
@@ -487,11 +487,13 @@
* hugegpage VMA. do_page_fault() is supposed to trap this, so BUG is we get
* this far.
*/
-static struct page *hugetlb_nopage(struct vm_area_struct *vma,
- unsigned long address, int unused)
+static int hugetlb_nopage(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
{
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
BUG();
- return NULL;
+ return VM_FAULT_SIGBUS;
}

struct vm_operations_struct hugetlb_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm3/arch/ia64/ia32/binfmt_elf32.c linux-2.5.70-mm3.install_new_page_hch/arch/ia64/ia32/binfmt_elf32.c
--- linux-2.5.70-mm3/arch/ia64/ia32/binfmt_elf32.c 2003-05-26 18:00:58.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/arch/ia64/ia32/binfmt_elf32.c 2003-06-01 10:43:27.000000000 -0700
@@ -56,13 +56,15 @@
extern struct page *ia32_shared_page[];
extern unsigned long *ia32_gdt;

-struct page *
-ia32_install_shared_page (struct vm_area_struct *vma, unsigned long address, int no_share)
+int
+ia32_install_shared_page (struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
{
struct page *pg = ia32_shared_page[(address - vma->vm_start)/PAGE_SIZE];

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
get_page(pg);
- return pg;
+ return install_new_page(mm, vma, address, write_access, pmd, pg);
}

static struct vm_operations_struct ia32_shared_page_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm3/arch/ia64/kernel/perfmon.c linux-2.5.70-mm3.install_new_page_hch/arch/ia64/kernel/perfmon.c
--- linux-2.5.70-mm3/arch/ia64/kernel/perfmon.c 2003-05-26 18:00:58.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/arch/ia64/kernel/perfmon.c 2003-06-01 10:12:48.000000000 -0700
@@ -424,7 +424,8 @@
static void pfm_vm_close(struct vm_area_struct * area);

static struct vm_operations_struct pfm_vm_ops={
- .close = pfm_vm_close
+ .close = pfm_vm_close,
+ .nopage = do_anonymous_page
};

/*
diff -urN -X dontdiff linux-2.5.70-mm3/arch/ia64/mm/hugetlbpage.c linux-2.5.70-mm3.install_new_page_hch/arch/ia64/mm/hugetlbpage.c
--- linux-2.5.70-mm3/arch/ia64/mm/hugetlbpage.c 2003-05-26 18:00:40.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/arch/ia64/mm/hugetlbpage.c 2003-06-01 10:44:02.000000000 -0700
@@ -479,10 +479,12 @@
return 1;
}

-static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused)
+static int hugetlb_nopage(struct mm_struct * mm, struct vm_area_struct * area, unsigned long address, int write_access, pte_t * page_table, pmd_t * pmd)
{
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
BUG();
- return NULL;
+ return VM_FAULT_SIGBUS;
}

struct vm_operations_struct hugetlb_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm3/arch/sparc64/mm/hugetlbpage.c linux-2.5.70-mm3.install_new_page_hch/arch/sparc64/mm/hugetlbpage.c
--- linux-2.5.70-mm3/arch/sparc64/mm/hugetlbpage.c 2003-05-26 18:00:42.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/arch/sparc64/mm/hugetlbpage.c 2003-06-01 10:44:21.000000000 -0700
@@ -633,11 +633,13 @@
return (int) htlbzone_pages;
}

-static struct page *
-hugetlb_nopage(struct vm_area_struct *vma, unsigned long address, int unused)
+static int
+hugetlb_nopage(struct mm_struct * mm, struct vm_area_struct *vma, unsigned long address, int write_access, pte_t *page_table, pmd_t * pmd)
{
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
BUG();
- return NULL;
+ return VM_FAULT_SIGBUS;
}

static struct vm_operations_struct hugetlb_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm3/drivers/char/agp/alpha-agp.c linux-2.5.70-mm3.install_new_page_hch/drivers/char/agp/alpha-agp.c
--- linux-2.5.70-mm3/drivers/char/agp/alpha-agp.c 2003-05-26 18:00:42.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/drivers/char/agp/alpha-agp.c 2003-06-01 10:44:53.000000000 -0700
@@ -11,26 +11,28 @@

#include "agp.h"

-static struct page *alpha_core_agp_vm_nopage(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+static int alpha_core_agp_vm_nopage(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pte_t *page_table, pmd_t *pmd)
{
alpha_agp_info *agp = agp_bridge->dev_private_data;
dma_addr_t dma_addr;
unsigned long pa;
struct page *page;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
dma_addr = address - vma->vm_start + agp->aperture.bus_base;
pa = agp->ops->translate(agp, dma_addr);

- if (pa == (unsigned long)-EINVAL) return NULL; /* no translation */
+ if (pa == (unsigned long)-EINVAL) return VM_FAULT_SIGBUS; /* no translation */

/*
* Get the page, inc the use count, and return it
*/
page = virt_to_page(__va(pa));
get_page(page);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

static struct aper_size_info_fixed alpha_core_agp_sizes[] =
diff -urN -X dontdiff linux-2.5.70-mm3/drivers/char/drm/drmP.h linux-2.5.70-mm3.install_new_page_hch/drivers/char/drm/drmP.h
--- linux-2.5.70-mm3/drivers/char/drm/drmP.h 2003-05-26 18:00:45.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/drivers/char/drm/drmP.h 2003-06-01 11:16:09.000000000 -0700
@@ -620,18 +620,17 @@
extern int DRM(fasync)(int fd, struct file *filp, int on);

/* Mapping support (drm_vm.h) */
-extern struct page *DRM(vm_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access);
-extern struct page *DRM(vm_shm_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access);
-extern struct page *DRM(vm_dma_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access);
-extern struct page *DRM(vm_sg_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access);
+extern int DRM(vm_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd);
+extern int DRM(vm_shm_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pte_t *page_table, pmd_t *pmd);
+extern int DRM(vm_dma_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pte_t *page_table, pmd_t *pmd);
+extern int DRM(vm_sg_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pte_t *page_table, pmd_t *pmd);
extern void DRM(vm_open)(struct vm_area_struct *vma);
extern void DRM(vm_close)(struct vm_area_struct *vma);
extern void DRM(vm_shm_close)(struct vm_area_struct *vma);
diff -urN -X dontdiff linux-2.5.70-mm3/drivers/char/drm/drm_vm.h linux-2.5.70-mm3.install_new_page_hch/drivers/char/drm/drm_vm.h
--- linux-2.5.70-mm3/drivers/char/drm/drm_vm.h 2003-05-26 18:01:02.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/drivers/char/drm/drm_vm.h 2003-06-01 10:28:54.000000000 -0700
@@ -55,10 +55,12 @@
.close = DRM(vm_close),
};

-struct page *DRM(vm_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+int DRM(vm_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pte_t *page_table, pmd_t *pmd)
{
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
#if __REALLY_HAVE_AGP
drm_file_t *priv = vma->vm_file->private_data;
drm_device_t *dev = priv->dev;
@@ -114,35 +116,37 @@
baddr, __va(agpmem->memory->memory[offset]), offset,
atomic_read(&page->count));

- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}
vm_nopage_error:
#endif /* __REALLY_HAVE_AGP */

- return NOPAGE_SIGBUS; /* Disallow mremap */
+ return VM_FAULT_SIGBUS; /* Disallow mremap */
}

-struct page *DRM(vm_shm_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+int DRM(vm_shm_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pte_t *page_table, pmd_t *pmd)
{
drm_map_t *map = (drm_map_t *)vma->vm_private_data;
unsigned long offset;
unsigned long i;
struct page *page;

- if (address > vma->vm_end) return NOPAGE_SIGBUS; /* Disallow mremap */
- if (!map) return NOPAGE_OOM; /* Nothing allocated */
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+ if (address > vma->vm_end) return VM_FAULT_SIGBUS; /* Disallow mremap */
+ if (!map) return VM_FAULT_OOM; /* Nothing allocated */

offset = address - vma->vm_start;
i = (unsigned long)map->handle + offset;
page = vmalloc_to_page((void *)i);
if (!page)
- return NOPAGE_OOM;
+ return VM_FAULT_OOM;
get_page(page);

DRM_DEBUG("shm_nopage 0x%lx\n", address);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

/* Special close routine which deletes map information if we are the last
@@ -221,9 +225,9 @@
up(&dev->struct_sem);
}

-struct page *DRM(vm_dma_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+int DRM(vm_dma_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pte_t *page_table, pmd_t *pmd)
{
drm_file_t *priv = vma->vm_file->private_data;
drm_device_t *dev = priv->dev;
@@ -232,9 +236,11 @@
unsigned long page_nr;
struct page *page;

- if (!dma) return NOPAGE_SIGBUS; /* Error */
- if (address > vma->vm_end) return NOPAGE_SIGBUS; /* Disallow mremap */
- if (!dma->pagelist) return NOPAGE_OOM ; /* Nothing allocated */
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+ if (!dma) return VM_FAULT_SIGBUS; /* Error */
+ if (address > vma->vm_end) return VM_FAULT_SIGBUS; /* Disallow mremap */
+ if (!dma->pagelist) return VM_FAULT_OOM ; /* Nothing allocated */

offset = address - vma->vm_start; /* vm_[pg]off[set] should be 0 */
page_nr = offset >> PAGE_SHIFT;
@@ -244,12 +250,12 @@
get_page(page);

DRM_DEBUG("dma_nopage 0x%lx (page %lu)\n", address, page_nr);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

-struct page *DRM(vm_sg_nopage)(struct vm_area_struct *vma,
- unsigned long address,
- int write_access)
+int DRM(vm_sg_nopage)(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address,
+ int write_access, pte_t *page_table, pmd_t *pmd)
{
drm_map_t *map = (drm_map_t *)vma->vm_private_data;
drm_file_t *priv = vma->vm_file->private_data;
@@ -260,9 +266,11 @@
unsigned long page_offset;
struct page *page;

- if (!entry) return NOPAGE_SIGBUS; /* Error */
- if (address > vma->vm_end) return NOPAGE_SIGBUS; /* Disallow mremap */
- if (!entry->pagelist) return NOPAGE_OOM ; /* Nothing allocated */
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+ if (!entry) return VM_FAULT_SIGBUS; /* Error */
+ if (address > vma->vm_end) return VM_FAULT_SIGBUS; /* Disallow mremap */
+ if (!entry->pagelist) return VM_FAULT_OOM ; /* Nothing allocated */


offset = address - vma->vm_start;
@@ -271,7 +279,7 @@
page = entry->pagelist[page_offset];
get_page(page);

- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

void DRM(vm_open)(struct vm_area_struct *vma)
diff -urN -X dontdiff linux-2.5.70-mm3/drivers/char/ftape/zftape/zftape-init.c linux-2.5.70-mm3.install_new_page_hch/drivers/char/ftape/zftape/zftape-init.c
--- linux-2.5.70-mm3/drivers/char/ftape/zftape/zftape-init.c 2003-05-26 18:00:38.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/drivers/char/ftape/zftape/zftape-init.c 2003-06-01 10:15:32.000000000 -0700
@@ -198,7 +198,7 @@
sigfillset(&current->blocked);
if ((result = ftape_mmap(vma)) >= 0) {
#ifndef MSYNC_BUG_WAS_FIXED
- static struct vm_operations_struct dummy = { NULL, };
+ static struct vm_operations_struct dummy = { .nopage = do_anonymous_page, };
vma->vm_ops = &dummy;
#endif
}
diff -urN -X dontdiff linux-2.5.70-mm3/drivers/ieee1394/dma.c linux-2.5.70-mm3.install_new_page_hch/drivers/ieee1394/dma.c
--- linux-2.5.70-mm3/drivers/ieee1394/dma.c 2003-05-26 18:00:40.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/drivers/ieee1394/dma.c 2003-06-01 10:29:36.000000000 -0700
@@ -184,28 +184,29 @@

/* nopage() handler for mmap access */

-static struct page*
-dma_region_pagefault(struct vm_area_struct *area, unsigned long address, int write_access)
+static int
+dma_region_pagefault(struct mm_struct *mm, struct vm_area_struct *area, unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
{
unsigned long offset;
unsigned long kernel_virt_addr;
- struct page *ret = NOPAGE_SIGBUS;
+ struct page *page;

struct dma_region *dma = (struct dma_region*) area->vm_private_data;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
if(!dma->kvirt)
- goto out;
+ return VM_FAULT_SIGBUS;

if( (address < (unsigned long) area->vm_start) ||
(address > (unsigned long) area->vm_start + (PAGE_SIZE * dma->n_pages)) )
- goto out;
+ return VM_FAULT_SIGBUS;

offset = address - area->vm_start;
kernel_virt_addr = (unsigned long) dma->kvirt + offset;
- ret = vmalloc_to_page((void*) kernel_virt_addr);
- get_page(ret);
-out:
- return ret;
+ page = vmalloc_to_page((void*) kernel_virt_addr);
+ get_page(page);
+ return install_new_page(mm, area, address, write_access, pmd, page);
}

static struct vm_operations_struct dma_region_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm3/drivers/media/video/video-buf.c linux-2.5.70-mm3.install_new_page_hch/drivers/media/video/video-buf.c
--- linux-2.5.70-mm3/drivers/media/video/video-buf.c 2003-05-26 18:00:40.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/drivers/media/video/video-buf.c 2003-06-01 10:30:39.000000000 -0700
@@ -979,21 +979,23 @@
* now ...). Bounce buffers don't work very well for the data rates
* video capture has.
*/
-static struct page*
-videobuf_vm_nopage(struct vm_area_struct *vma, unsigned long vaddr,
- int write_access)
+static int
+videobuf_vm_nopage(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long vaddr, int write_access, pte_t *page_table, pmd_t *pmd)
{
struct page *page;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
dprintk(3,"nopage: fault @ %08lx [vma %08lx-%08lx]\n",
vaddr,vma->vm_start,vma->vm_end);
if (vaddr > vma->vm_end)
- return NOPAGE_SIGBUS;
+ return VM_FAULT_SIGBUS;
page = alloc_page(GFP_USER);
if (!page)
- return NOPAGE_OOM;
+ return VM_FAULT_OOM;
clear_user_page(page_address(page), vaddr, page);
- return page;
+ return install_new_page(mm, vma, vaddr, write_access, pmd, page);
}

static struct vm_operations_struct videobuf_vm_ops =
diff -urN -X dontdiff linux-2.5.70-mm3/drivers/scsi/sg.c linux-2.5.70-mm3.install_new_page_hch/drivers/scsi/sg.c
--- linux-2.5.70-mm3/drivers/scsi/sg.c 2003-05-31 16:31:06.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/drivers/scsi/sg.c 2003-06-01 10:31:24.000000000 -0700
@@ -1121,21 +1121,23 @@
}
}

-static struct page *
-sg_vma_nopage(struct vm_area_struct *vma, unsigned long addr, int unused)
+static int
+sg_vma_nopage(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, int write_access, pte_t *page_table, pmd_t *pmd)
{
Sg_fd *sfp;
- struct page *page = NOPAGE_SIGBUS;
+ struct page *page = NULL;
void *page_ptr = NULL;
unsigned long offset;
Sg_scatter_hold *rsv_schp;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
if ((NULL == vma) || (!(sfp = (Sg_fd *) vma->vm_private_data)))
- return page;
+ return VM_FAULT_SIGBUS;
rsv_schp = &sfp->reserve;
offset = addr - vma->vm_start;
if (offset >= rsv_schp->bufflen)
- return page;
+ return VM_FAULT_SIGBUS;
SCSI_LOG_TIMEOUT(3, printk("sg_vma_nopage: offset=%lu, scatg=%d\n",
offset, rsv_schp->k_use_sg));
if (rsv_schp->k_use_sg) { /* reserve buffer is a scatter gather list */
@@ -1162,7 +1164,7 @@
page = virt_to_page(page_ptr);
get_page(page); /* increment page count */
}
- return page;
+ return page ? install_new_page(mm, vma, addr, write_access, pmd, page) : VM_FAULT_SIGBUS;
}

static struct vm_operations_struct sg_mmap_vm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm3/drivers/sgi/char/graphics.c linux-2.5.70-mm3.install_new_page_hch/drivers/sgi/char/graphics.c
--- linux-2.5.70-mm3/drivers/sgi/char/graphics.c 2003-05-26 18:00:40.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/drivers/sgi/char/graphics.c 2003-06-01 10:32:03.000000000 -0700
@@ -211,9 +211,9 @@
/*
* This is the core of the direct rendering engine.
*/
-struct page *
-sgi_graphics_nopage (struct vm_area_struct *vma, unsigned long address, int
- no_share)
+int
+sgi_graphics_nopage (struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write_access, pte_t *page_table, pmd_t *pmdpf)
{
pgd_t *pgd; pmd_t *pmd; pte_t *pte;
int board = GRAPHICS_CARD (vma->vm_dentry->d_inode->i_rdev);
@@ -221,6 +221,8 @@
unsigned long virt_add, phys_add;
struct page * page;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
#ifdef DEBUG
printk ("Got a page fault for board %d address=%lx guser=%lx\n", board,
address, (unsigned long) cards[board].g_user);
@@ -249,7 +251,7 @@
pte = pte_kmap_offset(pmd, address);
page = pte_page(*pte);
pte_kunmap(pte);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmdpf, page);
}

/*
diff -urN -X dontdiff linux-2.5.70-mm3/drivers/sgi/char/shmiq.c linux-2.5.70-mm3.install_new_page_hch/drivers/sgi/char/shmiq.c
--- linux-2.5.70-mm3/drivers/sgi/char/shmiq.c 2003-05-26 18:00:45.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/drivers/sgi/char/shmiq.c 2003-06-01 10:33:13.000000000 -0700
@@ -303,11 +303,13 @@
}

-struct page *
-shmiq_nopage (struct vm_area_struct *vma, unsigned long address,
- int write_access)
+int
+shmiq_nopage (struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address,
+ int write_access, pte_t *page_table, pmd_t *pmd)
{
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
/* Do not allow for mremap to expand us */
- return NULL;
+ return VM_FAULT_SIGBUS;
}

static struct vm_operations_struct qcntl_mmap = {
diff -urN -X dontdiff linux-2.5.70-mm3/fs/ncpfs/mmap.c linux-2.5.70-mm3.install_new_page_hch/fs/ncpfs/mmap.c
--- linux-2.5.70-mm3/fs/ncpfs/mmap.c 2003-05-26 18:00:43.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/fs/ncpfs/mmap.c 2003-06-01 10:33:51.000000000 -0700
@@ -25,8 +25,8 @@
/*
* Fill in the supplied page for mmap
*/
-static struct page* ncp_file_mmap_nopage(struct vm_area_struct *area,
- unsigned long address, int write_access)
+static int ncp_file_mmap_nopage(struct mm_struct *mm, struct vm_area_struct *area,
+ unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
{
struct file *file = area->vm_file;
struct dentry *dentry = file->f_dentry;
@@ -38,6 +38,8 @@
int bufsize;
int pos;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
page = alloc_page(GFP_HIGHUSER); /* ncpfs has nothing against high pages
as long as recvmsg and memset works on it */
if (!page)
@@ -85,7 +87,7 @@
memset(pg_addr + already_read, 0, PAGE_SIZE - already_read);
flush_dcache_page(page);
kunmap(page);
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);
}

static struct vm_operations_struct ncp_file_mmap =
diff -urN -X dontdiff linux-2.5.70-mm3/include/linux/mm.h linux-2.5.70-mm3.install_new_page_hch/include/linux/mm.h
--- linux-2.5.70-mm3/include/linux/mm.h 2003-05-31 16:31:20.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/include/linux/mm.h 2003-06-01 11:02:08.000000000 -0700
@@ -142,7 +142,7 @@
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
- struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
+ int (*nopage)(struct mm_struct * mm, struct vm_area_struct * area, unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd);
int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
};

@@ -380,12 +380,6 @@
}

/*
- * Error return values for the *_nopage functions
- */
-#define NOPAGE_SIGBUS (NULL)
-#define NOPAGE_OOM ((struct page *) (-1))
-
-/*
* Different kinds of faults, as returned by handle_mm_fault().
* Used to decide whether a process gets delivered SIGBUS or
* just gets major/minor fault counters bumped up.
@@ -402,8 +396,8 @@

extern void show_free_areas(void);

-struct page *shmem_nopage(struct vm_area_struct * vma,
- unsigned long address, int unused);
+int shmem_nopage(struct mm_struct * mm, struct vm_area_struct * vma,
+ unsigned long address, int write_access, pte_t * page_table, pmd_t * pmd);
struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
void shmem_lock(struct file * file, int lock);
int shmem_zero_setup(struct vm_area_struct *);
@@ -421,6 +415,7 @@
int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
unsigned long size, pgprot_t prot);

+extern int install_new_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t *pmd, struct page * new_page);
extern void invalidate_mmap_range(struct address_space *mapping,
loff_t const holebegin,
loff_t const holelen);
@@ -559,7 +554,8 @@
extern void truncate_inode_pages(struct address_space *, loff_t);

/* generic vm_area_ops exported for stackable file systems */
-extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int);
+int filemap_nopage(struct mm_struct *, struct vm_area_struct *, unsigned long, int, pte_t *, pmd_t *);
+int do_anonymous_page(struct mm_struct *, struct vm_area_struct *, unsigned long, int, pte_t *, pmd_t *);

/* mm/page-writeback.c */
int write_one_page(struct page *page, int wait);
diff -urN -X dontdiff linux-2.5.70-mm3/kernel/ksyms.c linux-2.5.70-mm3.install_new_page_hch/kernel/ksyms.c
--- linux-2.5.70-mm3/kernel/ksyms.c 2003-05-31 16:31:20.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/kernel/ksyms.c 2003-05-31 16:39:23.000000000 -0700
@@ -116,6 +116,7 @@
EXPORT_SYMBOL(max_mapnr);
#endif
EXPORT_SYMBOL(high_memory);
+EXPORT_SYMBOL(install_new_page);
EXPORT_SYMBOL(invalidate_mmap_range);
EXPORT_SYMBOL(vmtruncate);
EXPORT_SYMBOL(find_vma);
diff -urN -X dontdiff linux-2.5.70-mm3/mm/filemap.c linux-2.5.70-mm3.install_new_page_hch/mm/filemap.c
--- linux-2.5.70-mm3/mm/filemap.c 2003-05-31 16:31:21.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/mm/filemap.c 2003-06-01 10:34:52.000000000 -0700
@@ -1013,7 +1013,7 @@
* it in the page cache, and handles the special cases reasonably without
* having a lot of duplicated code.
*/
-struct page * filemap_nopage(struct vm_area_struct * area, unsigned long address, int unused)
+int filemap_nopage(struct mm_struct * mm, struct vm_area_struct * area, unsigned long address, int write_access, pte_t *page_table, pmd_t * pmd)
{
int error;
struct file *file = area->vm_file;
@@ -1024,6 +1024,8 @@
unsigned long size, pgoff, endoff;
int did_readahead;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;

@@ -1034,7 +1036,7 @@
*/
size = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if ((pgoff >= size) && (area->vm_mm == current->mm))
- return NULL;
+ return VM_FAULT_SIGBUS;

/*
* The "size" of the file, as far as mmap is concerned, isn't bigger
@@ -1088,7 +1090,7 @@
* Found the page and have a reference on it.
*/
mark_page_accessed(page);
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);

no_cached_page:
/*
@@ -1111,8 +1113,8 @@
* to schedule I/O.
*/
if (error == -ENOMEM)
- return NOPAGE_OOM;
- return NULL;
+ return VM_FAULT_OOM;
+ return VM_FAULT_SIGBUS;

page_not_uptodate:
inc_page_state(pgmajfault);
@@ -1169,7 +1171,7 @@
* mm layer so, possibly freeing the page cache page first.
*/
page_cache_release(page);
- return NULL;
+ return VM_FAULT_SIGBUS;
}

static struct page * filemap_getpage(struct file *file, unsigned long pgoff,
diff -urN -X dontdiff linux-2.5.70-mm3/mm/memory.c linux-2.5.70-mm3.install_new_page_hch/mm/memory.c
--- linux-2.5.70-mm3/mm/memory.c 2003-05-31 16:31:21.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/mm/memory.c 2003-06-01 11:00:33.000000000 -0700
@@ -1304,10 +1304,10 @@
* spinlock held to protect against concurrent faults in
* multithreaded programs.
*/
-static int
+int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, int write_access,
+ pte_t *page_table, pmd_t *pmd)
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
@@ -1374,40 +1374,19 @@
}

/*
- * do_no_page() tries to create a new page mapping. It aggressively
- * tries to share with existing pages, but makes a separate copy if
- * the "write_access" parameter is true in order to avoid the next
- * page fault.
- *
+ * install_new_page - tries to create a new page mapping.
+ * install_new_page() tries to share w/existing pages, but makes separate
+ * copy if "write_access" is true in order to avoid the next page fault.
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
- *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
*/
-static int
-do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+int
+install_new_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pmd_t *pmd, struct page * new_page)
{
- struct page * new_page;
- pte_t entry;
+ pte_t entry, *page_table;
struct pte_chain *pte_chain;
int ret;

- if (!vma->vm_ops || !vma->vm_ops->nopage)
- return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
-
- new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);
-
- /* no page was available -- either SIGBUS or OOM */
- if (new_page == NOPAGE_SIGBUS)
- return VM_FAULT_SIGBUS;
- if (new_page == NOPAGE_OOM)
- return VM_FAULT_OOM;
-
pte_chain = pte_chain_alloc(GFP_KERNEL);
if (!pte_chain)
goto oom;
@@ -1490,7 +1469,7 @@
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return vma->vm_ops->nopage(mm, vma, address & PAGE_MASK, write_access, pte, pmd);
}

pgoff = pte_to_pgoff(*pte);
@@ -1541,7 +1520,7 @@
* drop the lock.
*/
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return vma->vm_ops->nopage(mm, vma, address & PAGE_MASK, write_access, pte, pmd);
if (pte_file(entry))
return do_file_page(mm, vma, address, write_access, pte, pmd);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
diff -urN -X dontdiff linux-2.5.70-mm3/mm/shmem.c linux-2.5.70-mm3.install_new_page_hch/mm/shmem.c
--- linux-2.5.70-mm3/mm/shmem.c 2003-05-26 18:00:39.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/mm/shmem.c 2003-06-01 10:36:09.000000000 -0700
@@ -936,23 +936,25 @@
return error;
}

-struct page *shmem_nopage(struct vm_area_struct *vma, unsigned long address, int unused)
+int shmem_nopage(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
{
struct inode *inode = vma->vm_file->f_dentry->d_inode;
struct page *page = NULL;
unsigned long idx;
int error;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
idx = (address - vma->vm_start) >> PAGE_SHIFT;
idx += vma->vm_pgoff;
idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;

error = shmem_getpage(inode, idx, &page, SGP_CACHE);
if (error)
- return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS;
+ return (error == -ENOMEM)? VM_FAULT_OOM: VM_FAULT_SIGBUS;

mark_page_accessed(page);
- return page;
+ return install_new_page(mm, vma, address, write_access, pmd, page);
}

static int shmem_populate(struct vm_area_struct *vma,
diff -urN -X dontdiff linux-2.5.70-mm3/net/packet/af_packet.c linux-2.5.70-mm3.install_new_page_hch/net/packet/af_packet.c
--- linux-2.5.70-mm3/net/packet/af_packet.c 2003-05-31 16:31:23.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/net/packet/af_packet.c 2003-06-01 10:14:58.000000000 -0700
@@ -1523,6 +1523,7 @@
static struct vm_operations_struct packet_mmap_ops = {
.open = packet_mm_open,
.close =packet_mm_close,
+ .nopage = do_anonymous_page,
};

static void free_pg_vec(unsigned long *pg_vec, unsigned order, unsigned len)
diff -urN -X dontdiff linux-2.5.70-mm3/sound/core/pcm_native.c linux-2.5.70-mm3.install_new_page_hch/sound/core/pcm_native.c
--- linux-2.5.70-mm3/sound/core/pcm_native.c 2003-05-26 18:00:37.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/sound/core/pcm_native.c 2003-06-01 10:40:02.000000000 -0700
@@ -60,6 +60,11 @@
static int snd_pcm_hw_refine_old_user(snd_pcm_substream_t * substream, struct sndrv_pcm_hw_params_old * _oparams);
static int snd_pcm_hw_params_old_user(snd_pcm_substream_t * substream, struct sndrv_pcm_hw_params_old * _oparams);

+#ifndef LINUX_2_2
+#define NOPAGE_OOM VM_FAULT_OOM
+#define NOPAGE_SIGBUS VM_FAULT_SIGBUS
+#endif
+
/*
*
*/
@@ -2693,7 +2698,7 @@
#endif

#ifndef LINUX_2_2
-static struct page * snd_pcm_mmap_status_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
+static int snd_pcm_mmap_status_nopage(struct mm_struct *mm, struct vm_area_struct *area, unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
#else
static unsigned long snd_pcm_mmap_status_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
#endif
@@ -2702,13 +2707,17 @@
snd_pcm_runtime_t *runtime;
struct page * page;

+#ifndef LINUX_2_2
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+#endif
if (substream == NULL)
return NOPAGE_OOM;
runtime = substream->runtime;
page = virt_to_page(runtime->status);
get_page(page);
#ifndef LINUX_2_2
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);
#else
return page_address(page);
#endif
@@ -2747,7 +2756,7 @@
}

#ifndef LINUX_2_2
-static struct page * snd_pcm_mmap_control_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
+static int snd_pcm_mmap_control_nopage(struct mm_struct *mm, struct vm_area_struct *area, unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
#else
static unsigned long snd_pcm_mmap_control_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
#endif
@@ -2755,14 +2764,18 @@
snd_pcm_substream_t *substream = (snd_pcm_substream_t *)area->vm_private_data;
snd_pcm_runtime_t *runtime;
struct page * page;
-
+
+#ifndef LINUX_2_2
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+#endif
if (substream == NULL)
return NOPAGE_OOM;
runtime = substream->runtime;
page = virt_to_page(runtime->control);
get_page(page);
#ifndef LINUX_2_2
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);
#else
return page_address(page);
#endif
@@ -2813,7 +2826,7 @@
}

#ifndef LINUX_2_2
-static struct page * snd_pcm_mmap_data_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
+static int snd_pcm_mmap_data_nopage(struct mm_struct *mm, struct vm_area_struct *area, unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
#else
static unsigned long snd_pcm_mmap_data_nopage(struct vm_area_struct *area, unsigned long address, int no_share)
#endif
@@ -2824,7 +2837,11 @@
struct page * page;
void *vaddr;
size_t dma_bytes;
-
+
+#ifndef LINUX_2_2
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+#endif
if (substream == NULL)
return NOPAGE_OOM;
runtime = substream->runtime;
@@ -2848,7 +2865,7 @@
}
get_page(page);
#ifndef LINUX_2_2
- return page;
+ return install_new_page(mm, area, address, write_access, pmd, page);
#else
return page_address(page);
#endif
diff -urN -X dontdiff linux-2.5.70-mm3/sound/oss/emu10k1/audio.c linux-2.5.70-mm3.install_new_page_hch/sound/oss/emu10k1/audio.c
--- linux-2.5.70-mm3/sound/oss/emu10k1/audio.c 2003-05-26 18:00:23.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/sound/oss/emu10k1/audio.c 2003-06-01 10:42:22.000000000 -0700
@@ -970,7 +970,7 @@
return 0;
}

-static struct page *emu10k1_mm_nopage (struct vm_area_struct * vma, unsigned long address, int write_access)
+static int emu10k1_mm_nopage (struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, int write_access, pte_t * page_table, pmd_t * pmd)
{
struct emu10k1_wavedevice *wave_dev = vma->vm_private_data;
struct woinst *woinst = wave_dev->woinst;
@@ -979,12 +979,14 @@
unsigned long pgoff;
int rd, wr;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
DPF(3, "emu10k1_mm_nopage()\n");
DPD(3, "addr: %#lx\n", address);

if (address > vma->vm_end) {
- DPF(1, "EXIT, returning NOPAGE_SIGBUS\n");
- return NOPAGE_SIGBUS; /* Disallow mremap */
+ DPF(1, "EXIT, returning VM_FAULT_SIGBUS\n");
+ return VM_FAULT_SIGBUS; /* Disallow mremap */
}

pgoff = vma->vm_pgoff + ((address - vma->vm_start) >> PAGE_SHIFT);
@@ -1013,7 +1015,7 @@
get_page (dmapage);

DPD(3, "page: %#lx\n", (unsigned long) dmapage);
- return dmapage;
+ return install_new_page(mm, vma, address, write_access, pmd, dmapage);
}

struct vm_operations_struct emu10k1_mm_ops = {
diff -urN -X dontdiff linux-2.5.70-mm3/sound/oss/via82cxxx_audio.c linux-2.5.70-mm3.install_new_page_hch/sound/oss/via82cxxx_audio.c
--- linux-2.5.70-mm3/sound/oss/via82cxxx_audio.c 2003-05-26 18:00:27.000000000 -0700
+++ linux-2.5.70-mm3.install_new_page_hch/sound/oss/via82cxxx_audio.c 2003-06-01 10:41:07.000000000 -0700
@@ -1846,8 +1846,8 @@
}


-static struct page * via_mm_nopage (struct vm_area_struct * vma,
- unsigned long address, int write_access)
+static int via_mm_nopage (struct mm_struct *mm, struct vm_area_struct * vma,
+ unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
{
struct via_info *card = vma->vm_private_data;
struct via_channel *chan = &card->ch_out;
@@ -1855,6 +1855,8 @@
unsigned long pgoff;
int rd, wr;

+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
DPRINTK ("ENTER, start %lXh, ofs %lXh, pgoff %ld, addr %lXh, wr %d\n",
vma->vm_start,
address - vma->vm_start,
@@ -1863,12 +1865,12 @@
write_access);

if (address > vma->vm_end) {
- DPRINTK ("EXIT, returning NOPAGE_SIGBUS\n");
- return NOPAGE_SIGBUS; /* Disallow mremap */
+ DPRINTK ("EXIT, returning VM_FAULT_SIGBUS\n");
+ return VM_FAULT_SIGBUS; /* Disallow mremap */
}
if (!card) {
- DPRINTK ("EXIT, returning NOPAGE_OOM\n");
- return NOPAGE_OOM; /* Nothing allocated */
+ DPRINTK ("EXIT, returning VM_FAULT_OOM\n");
+ return VM_FAULT_OOM; /* Nothing allocated */
}

pgoff = vma->vm_pgoff + ((address - vma->vm_start) >> PAGE_SHIFT);
@@ -1895,10 +1897,10 @@
assert ((((unsigned long)chan->pgtbl[pgoff].cpuaddr) % PAGE_SIZE) == 0);

dmapage = virt_to_page (chan->pgtbl[pgoff].cpuaddr);
- DPRINTK ("EXIT, returning page %p for cpuaddr %lXh\n",
+ DPRINTK ("EXIT, installing page %p for cpuaddr %lXh\n",
dmapage, (unsigned long) chan->pgtbl[pgoff].cpuaddr);
get_page (dmapage);
- return dmapage;
+ return install_new_page(mm, vma, address, write_access, pmd, dmapage);
}


2003-06-01 19:47:39

by Paul E. McKenney

Subject: Re: Always passing mm and vma down (was: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race)

On Sun, Jun 01, 2003 at 02:22:00PM +0200, Andrea Arcangeli wrote:
> Hi Paul,

Hello, Andrea!

> On Sat, May 31, 2003 at 04:48:16PM -0700, Paul E. McKenney wrote:
[ . . . ]
> > Interesting point. The original do_no_page() API does this
> > as well:
> >
> > static int
> > do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
> >
> > As does do_anonymous_page(). I assumed that there were corner
> > cases where this one-to-one correspondence did not exist, but
> > must confess that I did not go looking for them.
> >
> > Or is this a performance issue, avoiding a dereference and
> > possible cache miss?
>
> it's a performance micro-optimization issue (the stack memory overhead is
> not significant): each new parameter to those calls will slow down the
> kernel, especially with the x86 calling conventions it will hurt more
> than with other archs. Jeff did something similar for UML (dunno if he
> can also use vma->vm_mm; anyway I #ifdeffed it out entirely for non-UML
> compiles, for exactly this reason that I can't slow down the whole
> production host kernel just for the uml compile), then there's also the
> additional extern call. So basically the patch would hurt until there
> are active users.
>
> But the real question here I guess is: why should a distributed
> filesystem need to install the pte by itself?

The immediate motivation is to avoid the race with zap_page_range()
when another node writes to the corresponding portion of the file,
similar to the situation with vmtruncate(). The thought was to
leverage locking within the distributed filesystem, but if the
race is solved locally, then, as you say, perhaps this is not
necessary.

> The memory coherency with distributed shared memory (i.e. MAP_SHARED
> against truncate with MAP_SHARED and truncate running on different boxes
> on a DFS) is a generic problem (most of the time using the same
> algorithm too), as such I believe the infrastructure and logics to keep
> it coherent could make sense to be a common code functionalty.

This sounds good to me, though I am checking with some DFS people.

> And w/o proper high level locking, the non distributed filesystems will
> corrupt the VM too with truncate against nopage. I already fixed this in
> my tree. (see the attachment) So I wonder if the fixes could be shared.
> I mean, you definitely need my fixes even when using the DFS on a
> isolated box, and if you don't need them while using the fs locally, it
> means we're duplicating effort somehow.

True -- my patches simply provided hooks to allow DFSs and local
filesystems to fix the problem.

So, the idea is for the DFS to hold a fr_write_lock on the
truncate_lock across the invalidate_mmap_range() call, thus
preventing the PTEs from any racing pagefaults from being
installed? This seems plausible at first glance, but need
to stare at it some more. This might permit the current
do_no_page(), do_anonymous_page(), and ->nopage APIs to
be used, but again, need to stare at it some more.

(If I am not too confused, fr_write_lock() became
write_seqlock() in the 2.5 tree...)

> Since I don't see the users of the new hook, it's a bit hard to judge if
> the duplication is legitimate or not. So overall I'd agree with Andrew
> that to judge the patch it'd make sense to see (or know more about) the
> users of the hook too.

A simple change that takes care of all the cases certainly does
seem better than a more complex change that only takes care of
distributed filesystems!

> as for the anon memory, yes, it's probably more efficient and cleaner to
> have it be a nopage callback too, the double branch is probably more
> costly than the duplicated unlock anyways. However it has nothing to do
> with these issues, so I recommend to keep it separated from the DFS
> patches (for readibility). (it can't have anything to do with DFS since
> that memory is anon local to the box [if not using openmosix but that's
> a different issue ;) ])

Well, we do seem to have a number of ways of attacking the problem,
so I have hope that one of them will do the trick. ;-)

Thanx, Paul

2003-06-02 08:19:41

by Arjan van de Ven

Subject: Re: Always passing mm and vma down (was: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race)

On Sun, 2003-06-01 at 22:00, Paul E. McKenney wrote:
> The immediate motivation is to avoid the race with zap_page_range()
> when another node writes to the corresponding portion of the file,
> similar to the situation with vmtruncate(). The thought was to
> leverage locking within the distributed filesystem, but if the
> race is solved locally, then, as you say, perhaps this is not
> necessary.

is said distributed filesystem open source by chance ?



2003-06-02 12:59:17

by Paul E. McKenney

Subject: Re: Always passing mm and vma down (was: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race)

On Mon, Jun 02, 2003 at 10:32:50AM +0200, Arjan van de Ven wrote:
> On Sun, 2003-06-01 at 22:00, Paul E. McKenney wrote:
> > The immediate motivation is to avoid the race with zap_page_range()
> > when another node writes to the corresponding portion of the file,
> > similar to the situation with vmtruncate(). The thought was to
> > leverage locking within the distributed filesystem, but if the
> > race is solved locally, then, as you say, perhaps this is not
> > necessary.
>
> is said distributed filesystem open source by chance ?

At least one soon will be.

Thanx, Paul

2003-06-04 10:24:41

by Andrea Arcangeli

Subject: Re: Always passing mm and vma down (was: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race)

On Sun, Jun 01, 2003 at 01:00:56PM -0700, Paul E. McKenney wrote:
> The immediate motivation is to avoid the race with zap_page_range()
> when another node writes to the corresponding portion of the file,
> similar to the situation with vmtruncate(). The thought was to
> leverage locking within the distributed filesystem, but if the
> race is solved locally, then, as you say, perhaps this is not
> necessary.

Exactly, this was my idea. Since we have the same race locally even on
ext2, maybe it's worth sharing the fix across all the filesystems somehow;
the problem sounds the same. You may still need callbacks to get the
distributed fs right, though. Still, I was just wondering whether the
conceptual fix could live at the high level rather than being replicated
at the low level.

> This sounds good to me, though am checking with some DFS people.

cool thanks!

> > And w/o proper high level locking, the non distributed filesystems will
> > corrupt the VM too with truncate against nopage. I already fixed this in
> > my tree. (see the attachment) So I wonder if the fixes could be shared.
> > I mean, you definitely need my fixes even when using the DFS on a
> > isolated box, and if you don't need them while using the fs locally, it
> > means we're duplicating effort somehow.
>
> True -- my patches simply provided hooks to allow DFSs and local
> filesystems to fix the problem.
>
> So, the idea is for the DFS to hold a fr_write_lock on the
> truncate_lock across the invalidate_mmap_range() call, thus
> preventing the PTEs from any racing pagefaults from being
> installed? This seems plausible at first glance, but need
> to stare at it some more. This might permit the current
> do_no_page(), do_anonymous_page(), and ->nopage APIs to
> be used, but again, need to stare at it some more.

btw, we can discuss this some more next month at OLS too, if we didn't
clear all the issues first.

> (If I am not too confused, fr_write_lock() became
> write_seqlock() in the 2.5 tree...)

exactly, I didn't rename it yet in 2.4 since it would provide no runtime
benefit, but it is exactly the same thing ;).

> > Since I don't see the users of the new hook, it's a bit hard to judge if
> > the duplication is legitimate or not. So overall I'd agree with Andrew
> > that to judge the patch it'd make sense to see (or know more about) the
> > users of the hook too.
>
> A simple change that takes care of all the cases certainly does
> seem better than a more complex change that only takes care of
> distributed filesystems!

agreed.

thanks,

Andrea

2003-06-07 16:16:12

by Paul E. McKenney

Subject: Re: Always passing mm and vma down (was: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race)

On Wed, Jun 04, 2003 at 12:38:08PM +0200, Andrea Arcangeli wrote:
> On Sun, Jun 01, 2003 at 01:00:56PM -0700, Paul E. McKenney wrote:
> > The immediate motivation is to avoid the race with zap_page_range()
> > when another node writes to the corresponding portion of the file,
> > similar to the situation with vmtruncate(). The thought was to
> > leverage locking within the distributed filesystem, but if the
> > race is solved locally, then, as you say, perhaps this is not
> > necessary.
>
> Exactly, this was my idea. Since we have the same race locally even on
> ext2, maybe it's worth sharing the fix across all the filesystems somehow;
> the problem sounds the same. You may still need callbacks to get the
> distributed fs right, though. Still, I was just wondering whether the
> conceptual fix could live at the high level rather than being replicated
> at the low level.

I believe that we would need callbacks for DFSes because many of them
have different locking granularities. For example, a number of them allow
writes from different clients to different pages of the same file
fully in parallel, with no communication between the clients required.
Communication is required only when one client attempts to write to
a page most recently written to by another client.

The locking required to support this is more complex than would
be justified in most local-filesystem cases, I suspect. ;-)

> > This sounds good to me, though am checking with some DFS people.
>
> cool thanks!

Good news, they are OK with your approach for handling the truncation
race! They still believe they need the callback for handling fine-grained
locking required for invalidations.

> > > And w/o proper high level locking, the non distributed filesystems will
> > > corrupt the VM too with truncate against nopage. I already fixed this in
> > > my tree. (see the attachment) So I wonder if the fixes could be shared.
> > > I mean, you definitely need my fixes even when using the DFS on a
> > > isolated box, and if you don't need them while using the fs locally, it
> > > means we're duplicating effort somehow.
> >
> > True -- my patches simply provided hooks to allow DFSs and local
> > filesystems to fix the problem.
> >
> > So, the idea is for the DFS to hold a fr_write_lock on the
> > truncate_lock across the invalidate_mmap_range() call, thus
> > preventing the PTEs from any racing pagefaults from being
> > installed? This seems plausible at first glance, but need
> > to stare at it some more. This might permit the current
> > do_no_page(), do_anonymous_page(), and ->nopage APIs to
> > be used, but again, need to stare at it some more.
>
> btw, we can discuss this some more next month at OLS too, if we didn't
> clear all the issues first.

Sounds great!!! Of course, if we get it all agreed to before then,
we will just have to find other things to talk about at OLS. ;-)

Hopefully by that time, one of the DFSes will be released under GPL.
Don't ask. :-/

> > (If I am not too confused, fr_write_lock() became
> > write_seqlock() in the 2.5 tree...)
>
> exactly, I didn't rename it yet in 2.4 since it would provide no runtime
> benefit, but it is exactly the same thing ;).

Cool! Then there is no need to worry about fr_lock getting accepted
into 2.5. ;-)

> > > Since I don't see the users of the new hook, it's a bit hard to judge if
> > > the duplication is legitimate or not. So overall I'd agree with Andrew
> > > that to judge the patch it'd make sense to see (or know more about) the
> > > users of the hook too.
> >
> > A simple change that takes care of all the cases certainly does
> > seem better than a more complex change that only takes care of
> > distributed filesystems!
>
> agreed.

Unless you have other thoughts, I will take a stab at making
the patches safe for each other.

Thanx, Paul

> thanks,
>
> Andrea

2003-08-09 18:50:18

by Christoph Hellwig

Subject: Re: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race

On Sat, May 31, 2003 at 04:51:23PM -0700, Paul E. McKenney wrote:
> > I don't think there's a lot of point in making changes until the code which
> > requires those changes is accepted into the tree. Otherwise it may be
> > pointless churn, and there's nothing in-tree to exercise the new features.
>
> A GPLed use of these DFS features is expected Real Soon Now...

So we get to see all the kernel C++ code from GPFS? [1] Better not; IBM
might badly scare customers away if it is the same quality as the C glue
code layer...

[1] http://oss.software.ibm.com/linux/patches/?patch_id=923

2003-08-10 19:09:16

by Paul E. McKenney

Subject: Re: [RFC][PATCH] Convert do_no_page() to a hook to avoid DFS race

On Sat, Aug 09, 2003 at 07:50:11PM +0100, Christoph Hellwig wrote:
> On Sat, May 31, 2003 at 04:51:23PM -0700, Paul E. McKenney wrote:
> > > I don't think there's a lot of point in making changes until the code which
> > > requires those changes is accepted into the tree. Otherwise it may be
> > > pointless churn, and there's nothing in-tree to exercise the new features.
> >
> > A GPLed use of these DFS features is expected Real Soon Now...
>
> So we get to see all the kernel C++ code from GPRS? [1] Better not, IBM
> might badly scare customers away if it the same quality as the C glue
> code layer..
>
> [1] http://oss.software.ibm.com/linux/patches/?patch_id=923

I will let the GPFS guys worry about that. ;-)

Thanx, Paul