2009-09-07 21:27:38

by Hugh Dickins

Subject: [PATCH 0/8] mm: around get_user_pages flags

Here's a series of mm mods against current mmotm: mostly cleanup
of get_user_pages flags, but fixing munlock's OOM, sorting out the
"FOLL_ANON optimization", and reinstating ZERO_PAGE along the way.

fs/binfmt_elf.c | 42 ++------
fs/binfmt_elf_fdpic.c | 56 ++++-------
include/linux/hugetlb.h | 4
include/linux/mm.h | 4
mm/hugetlb.c | 62 +++++++------
mm/internal.h | 7 -
mm/memory.c | 180 +++++++++++++++++++++++---------------
mm/mlock.c | 99 ++++++++------------
mm/nommu.c | 22 ++--
9 files changed, 235 insertions(+), 241 deletions(-)

Hugh


2009-09-07 21:30:43

by Hugh Dickins

Subject: [PATCH 1/8] mm: munlock use follow_page

Hiroaki Wakabayashi points out that when mlock() has been interrupted
by SIGKILL, the subsequent munlock() takes unnecessarily long because
its use of __get_user_pages() insists on faulting in all the pages
which mlock() never reached.

It's worse than slowness if mlock() is terminated by Out Of Memory kill:
the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the
pages which mlock() could not find memory for; so innocent bystanders are
killed too, and perhaps the system hangs.

__get_user_pages() does a lot that's silly for munlock(): so remove the
munlock option from __mlock_vma_pages_range(), and use a simple loop of
follow_page()s in munlock_vma_pages_range() instead; ignoring absent
pages, and not marking present pages as accessed or dirty.

(Change munlock() to only go so far as mlock() reached? That does not
work out, given the convention that mlock() claims complete success even
when it has to give up early - in part so that an underlying file can be
extended later, and those pages locked which earlier would give SIGBUS.)
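
A rough userspace sketch, for illustration only (not part of the patch;
the 1GB size is an arbitrary example), of the scenario being fixed:

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t size = 1UL << 30;	/* arbitrary 1GB example */
		char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/*
		 * If this mlock() is cut short by SIGKILL or the OOM killer,
		 * munlock_vma_pages_all() in exit_mmap() then walks the whole
		 * range: previously via __get_user_pages(), faulting in the
		 * pages mlock() never reached; now via follow_page(), simply
		 * skipping them.
		 */
		if (mlock(p, size))
			perror("mlock");
		munlock(p, size);
		return 0;
	}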

Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
---

mm/mlock.c | 99 ++++++++++++++++++++-------------------------------
1 file changed, 40 insertions(+), 59 deletions(-)

--- mm0/mm/mlock.c 2009-06-25 05:18:10.000000000 +0100
+++ mm1/mm/mlock.c 2009-09-07 13:16:15.000000000 +0100
@@ -139,49 +139,36 @@ static void munlock_vma_page(struct page
}

/**
- * __mlock_vma_pages_range() - mlock/munlock a range of pages in the vma.
+ * __mlock_vma_pages_range() - mlock a range of pages in the vma.
* @vma: target vma
* @start: start address
* @end: end address
- * @mlock: 0 indicate munlock, otherwise mlock.
*
- * If @mlock == 0, unlock an mlocked range;
- * else mlock the range of pages. This takes care of making the pages present ,
- * too.
+ * This takes care of making the pages present too.
*
* return 0 on success, negative error code on error.
*
* vma->vm_mm->mmap_sem must be held for at least read.
*/
static long __mlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- int mlock)
+ unsigned long start, unsigned long end)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long addr = start;
struct page *pages[16]; /* 16 gives a reasonable batch */
int nr_pages = (end - start) / PAGE_SIZE;
int ret = 0;
- int gup_flags = 0;
+ int gup_flags;

VM_BUG_ON(start & ~PAGE_MASK);
VM_BUG_ON(end & ~PAGE_MASK);
VM_BUG_ON(start < vma->vm_start);
VM_BUG_ON(end > vma->vm_end);
- VM_BUG_ON((!rwsem_is_locked(&mm->mmap_sem)) &&
- (atomic_read(&mm->mm_users) != 0));
-
- /*
- * mlock: don't page populate if vma has PROT_NONE permission.
- * munlock: always do munlock although the vma has PROT_NONE
- * permission, or SIGKILL is pending.
- */
- if (!mlock)
- gup_flags |= GUP_FLAGS_IGNORE_VMA_PERMISSIONS |
- GUP_FLAGS_IGNORE_SIGKILL;
+ VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));

+ gup_flags = 0;
if (vma->vm_flags & VM_WRITE)
- gup_flags |= GUP_FLAGS_WRITE;
+ gup_flags = GUP_FLAGS_WRITE;

while (nr_pages > 0) {
int i;
@@ -201,19 +188,10 @@ static long __mlock_vma_pages_range(stru
* This can happen for, e.g., VM_NONLINEAR regions before
* a page has been allocated and mapped at a given offset,
* or for addresses that map beyond end of a file.
- * We'll mlock the the pages if/when they get faulted in.
+ * We'll mlock the pages if/when they get faulted in.
*/
if (ret < 0)
break;
- if (ret == 0) {
- /*
- * We know the vma is there, so the only time
- * we cannot get a single page should be an
- * error (ret < 0) case.
- */
- WARN_ON(1);
- break;
- }

lru_add_drain(); /* push cached pages to LRU */

@@ -224,28 +202,22 @@ static long __mlock_vma_pages_range(stru
/*
* Because we lock page here and migration is blocked
* by the elevated reference, we need only check for
- * page truncation (file-cache only).
+ * file-cache page truncation. This page->mapping
+ * check also neatly skips over the ZERO_PAGE(),
+ * though if that's common we'd prefer not to lock it.
*/
- if (page->mapping) {
- if (mlock)
- mlock_vma_page(page);
- else
- munlock_vma_page(page);
- }
+ if (page->mapping)
+ mlock_vma_page(page);
unlock_page(page);
- put_page(page); /* ref from get_user_pages() */
-
- /*
- * here we assume that get_user_pages() has given us
- * a list of virtually contiguous pages.
- */
- addr += PAGE_SIZE; /* for next get_user_pages() */
- nr_pages--;
+ put_page(page); /* ref from get_user_pages() */
}
+
+ addr += ret * PAGE_SIZE;
+ nr_pages -= ret;
ret = 0;
}

- return ret; /* count entire vma as locked_vm */
+ return ret; /* 0 or negative error code */
}

/*
@@ -289,7 +261,7 @@ long mlock_vma_pages_range(struct vm_are
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current))) {

- __mlock_vma_pages_range(vma, start, end, 1);
+ __mlock_vma_pages_range(vma, start, end);

/* Hide errors from mmap() and other callers */
return 0;
@@ -310,7 +282,6 @@ no_mlock:
return nr_pages; /* error or pages NOT mlocked */
}

-
/*
* munlock_vma_pages_range() - munlock all pages in the vma range.'
* @vma - vma containing range to be munlock()ed.
@@ -330,10 +301,24 @@ no_mlock:
* free them. This will result in freeing mlocked pages.
*/
void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end)
{
+ unsigned long addr;
+
+ lru_add_drain();
vma->vm_flags &= ~VM_LOCKED;
- __mlock_vma_pages_range(vma, start, end, 0);
+
+ for (addr = start; addr < end; addr += PAGE_SIZE) {
+ struct page *page = follow_page(vma, addr, FOLL_GET);
+ if (page) {
+ lock_page(page);
+ if (page->mapping)
+ munlock_vma_page(page);
+ unlock_page(page);
+ put_page(page);
+ }
+ cond_resched();
+ }
}

/*
@@ -400,18 +385,14 @@ success:
* It's okay if try_to_unmap_one unmaps a page just after we
* set VM_LOCKED, __mlock_vma_pages_range will bring it back.
*/
- vma->vm_flags = newflags;

if (lock) {
- ret = __mlock_vma_pages_range(vma, start, end, 1);
-
- if (ret > 0) {
- mm->locked_vm -= ret;
- ret = 0;
- } else
- ret = __mlock_posix_error_return(ret); /* translate if needed */
+ vma->vm_flags = newflags;
+ ret = __mlock_vma_pages_range(vma, start, end);
+ if (ret < 0)
+ ret = __mlock_posix_error_return(ret);
} else {
- __mlock_vma_pages_range(vma, start, end, 0);
+ munlock_vma_pages_range(vma, start, end);
}

out:

2009-09-07 21:31:53

by Hugh Dickins

Subject: [PATCH 2/8] mm: remove unused GUP flags

GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
flags added solely to prevent __get_user_pages() from doing some of
what it usually does, in the munlock case: we can now remove them.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/internal.h | 6 ++----
mm/memory.c | 14 ++++----------
mm/nommu.c | 6 ++----
3 files changed, 8 insertions(+), 18 deletions(-)

--- mm1/mm/internal.h 2009-06-25 05:18:10.000000000 +0100
+++ mm2/mm/internal.h 2009-09-07 13:16:22.000000000 +0100
@@ -250,10 +250,8 @@ static inline void mminit_validate_memmo
}
#endif /* CONFIG_SPARSEMEM */

-#define GUP_FLAGS_WRITE 0x1
-#define GUP_FLAGS_FORCE 0x2
-#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
-#define GUP_FLAGS_IGNORE_SIGKILL 0x8
+#define GUP_FLAGS_WRITE 0x01
+#define GUP_FLAGS_FORCE 0x02

int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
--- mm1/mm/memory.c 2009-09-05 14:40:16.000000000 +0100
+++ mm2/mm/memory.c 2009-09-07 13:16:22.000000000 +0100
@@ -1217,8 +1217,6 @@ int __get_user_pages(struct task_struct
unsigned int vm_flags = 0;
int write = !!(flags & GUP_FLAGS_WRITE);
int force = !!(flags & GUP_FLAGS_FORCE);
- int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
- int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);

if (nr_pages <= 0)
return 0;
@@ -1244,7 +1242,7 @@ int __get_user_pages(struct task_struct
pte_t *pte;

/* user gate pages are read-only */
- if (!ignore && write)
+ if (write)
return i ? : -EFAULT;
if (pg > TASK_SIZE)
pgd = pgd_offset_k(pg);
@@ -1278,7 +1276,7 @@ int __get_user_pages(struct task_struct

if (!vma ||
(vma->vm_flags & (VM_IO | VM_PFNMAP)) ||
- (!ignore && !(vm_flags & vma->vm_flags)))
+ !(vm_flags & vma->vm_flags))
return i ? : -EFAULT;

if (is_vm_hugetlb_page(vma)) {
@@ -1298,13 +1296,9 @@ int __get_user_pages(struct task_struct

/*
* If we have a pending SIGKILL, don't keep faulting
- * pages and potentially allocating memory, unless
- * current is handling munlock--e.g., on exit. In
- * that case, we are not allocating memory. Rather,
- * we're only unlocking already resident/mapped pages.
+ * pages and potentially allocating memory.
*/
- if (unlikely(!ignore_sigkill &&
- fatal_signal_pending(current)))
+ if (unlikely(fatal_signal_pending(current)))
return i ? i : -ERESTARTSYS;

if (write)
--- mm1/mm/nommu.c 2009-09-05 14:40:16.000000000 +0100
+++ mm2/mm/nommu.c 2009-09-07 13:16:22.000000000 +0100
@@ -136,7 +136,6 @@ int __get_user_pages(struct task_struct
int i;
int write = !!(flags & GUP_FLAGS_WRITE);
int force = !!(flags & GUP_FLAGS_FORCE);
- int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);

/* calculate required read or write permissions.
* - if 'force' is set, we only require the "MAY" flags.
@@ -150,8 +149,8 @@ int __get_user_pages(struct task_struct
goto finish_or_fault;

/* protect what we can, including chardevs */
- if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
- (!ignore && !(vm_flags & vma->vm_flags)))
+ if ((vma->vm_flags & (VM_IO | VM_PFNMAP)) ||
+ !(vm_flags & vma->vm_flags))
goto finish_or_fault;

if (pages) {
@@ -170,7 +169,6 @@ finish_or_fault:
return i ? : -EFAULT;
}

-
/*
* get a list of pages in an address range belonging to the specified process
* and indicate the VMA that covers each page

2009-09-07 21:34:07

by Hugh Dickins

Subject: [PATCH 3/8] mm: add get_dump_page

In preparation for the next patch, add a simple get_dump_page(addr)
interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
get_user_pages() directly. They're not interested in errors: they
just want to use holes as much as possible, to save space and make
sure that the data is aligned where the headers said it would be.

Oh, and don't use that horrid DUMP_SEEK(off) macro!

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/binfmt_elf.c | 42 +++++++++---------------------
fs/binfmt_elf_fdpic.c | 56 +++++++++++++---------------------------
include/linux/mm.h | 1
mm/memory.c | 33 ++++++++++++++++++++++-
4 files changed, 65 insertions(+), 67 deletions(-)

--- mm2/fs/binfmt_elf.c 2009-09-05 14:40:15.000000000 +0100
+++ mm3/fs/binfmt_elf.c 2009-09-07 13:16:32.000000000 +0100
@@ -1280,9 +1280,6 @@ static int writenote(struct memelfnote *
#define DUMP_WRITE(addr, nr) \
if ((size += (nr)) > limit || !dump_write(file, (addr), (nr))) \
goto end_coredump;
-#define DUMP_SEEK(off) \
- if (!dump_seek(file, (off))) \
- goto end_coredump;

static void fill_elf_header(struct elfhdr *elf, int segs,
u16 machine, u32 flags, u8 osabi)
@@ -2024,7 +2021,8 @@ static int elf_core_dump(long signr, str
goto end_coredump;

/* Align to page */
- DUMP_SEEK(dataoff - foffset);
+ if (!dump_seek(file, dataoff - foffset))
+ goto end_coredump;

for (vma = first_vma(current, gate_vma); vma != NULL;
vma = next_vma(vma, gate_vma)) {
@@ -2035,33 +2033,19 @@ static int elf_core_dump(long signr, str

for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) {
struct page *page;
- struct vm_area_struct *tmp_vma;
+ int stop;

- if (get_user_pages(current, current->mm, addr, 1, 0, 1,
- &page, &tmp_vma) <= 0) {
- DUMP_SEEK(PAGE_SIZE);
- } else {
- if (page == ZERO_PAGE(0)) {
- if (!dump_seek(file, PAGE_SIZE)) {
- page_cache_release(page);
- goto end_coredump;
- }
- } else {
- void *kaddr;
- flush_cache_page(tmp_vma, addr,
- page_to_pfn(page));
- kaddr = kmap(page);
- if ((size += PAGE_SIZE) > limit ||
- !dump_write(file, kaddr,
- PAGE_SIZE)) {
- kunmap(page);
- page_cache_release(page);
- goto end_coredump;
- }
- kunmap(page);
- }
+ page = get_dump_page(addr);
+ if (page) {
+ void *kaddr = kmap(page);
+ stop = ((size += PAGE_SIZE) > limit) ||
+ !dump_write(file, kaddr, PAGE_SIZE);
+ kunmap(page);
page_cache_release(page);
- }
+ } else
+ stop = !dump_seek(file, PAGE_SIZE);
+ if (stop)
+ goto end_coredump;
}
}

--- mm2/fs/binfmt_elf_fdpic.c 2009-09-05 14:40:15.000000000 +0100
+++ mm3/fs/binfmt_elf_fdpic.c 2009-09-07 13:16:32.000000000 +0100
@@ -1328,9 +1328,6 @@ static int writenote(struct memelfnote *
#define DUMP_WRITE(addr, nr) \
if ((size += (nr)) > limit || !dump_write(file, (addr), (nr))) \
goto end_coredump;
-#define DUMP_SEEK(off) \
- if (!dump_seek(file, (off))) \
- goto end_coredump;

static inline void fill_elf_fdpic_header(struct elfhdr *elf, int segs)
{
@@ -1521,6 +1518,7 @@ static int elf_fdpic_dump_segments(struc
unsigned long *limit, unsigned long mm_flags)
{
struct vm_area_struct *vma;
+ int err = 0;

for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
unsigned long addr;
@@ -1528,43 +1526,26 @@ static int elf_fdpic_dump_segments(struc
if (!maydump(vma, mm_flags))
continue;

- for (addr = vma->vm_start;
- addr < vma->vm_end;
- addr += PAGE_SIZE
- ) {
- struct vm_area_struct *vma;
- struct page *page;
-
- if (get_user_pages(current, current->mm, addr, 1, 0, 1,
- &page, &vma) <= 0) {
- DUMP_SEEK(file->f_pos + PAGE_SIZE);
- }
- else if (page == ZERO_PAGE(0)) {
- page_cache_release(page);
- DUMP_SEEK(file->f_pos + PAGE_SIZE);
- }
- else {
- void *kaddr;
-
- flush_cache_page(vma, addr, page_to_pfn(page));
- kaddr = kmap(page);
- if ((*size += PAGE_SIZE) > *limit ||
- !dump_write(file, kaddr, PAGE_SIZE)
- ) {
- kunmap(page);
- page_cache_release(page);
- return -EIO;
- }
+ for (addr = vma->vm_start; addr < vma->vm_end;
+ addr += PAGE_SIZE) {
+ struct page *page = get_dump_page(addr);
+ if (page) {
+ void *kaddr = kmap(page);
+ *size += PAGE_SIZE;
+ if (*size > *limit)
+ err = -EFBIG;
+ else if (!dump_write(file, kaddr, PAGE_SIZE))
+ err = -EIO;
kunmap(page);
page_cache_release(page);
- }
+ } else if (!dump_seek(file, file->f_pos + PAGE_SIZE))
+ err = -EFBIG;
+ if (err)
+ goto out;
}
}
-
- return 0;
-
-end_coredump:
- return -EFBIG;
+out:
+ return err;
}
#endif

@@ -1805,7 +1786,8 @@ static int elf_fdpic_core_dump(long sign
goto end_coredump;
}

- DUMP_SEEK(dataoff);
+ if (!dump_seek(file, dataoff))
+ goto end_coredump;

if (elf_fdpic_dump_segments(file, &size, &limit, mm_flags) < 0)
goto end_coredump;
--- mm2/include/linux/mm.h 2009-09-05 14:40:16.000000000 +0100
+++ mm3/include/linux/mm.h 2009-09-07 13:16:32.000000000 +0100
@@ -832,6 +832,7 @@ int get_user_pages(struct task_struct *t
struct page **pages, struct vm_area_struct **vmas);
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
+struct page *get_dump_page(unsigned long addr);

extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
extern void do_invalidatepage(struct page *page, unsigned long offset);
--- mm2/mm/memory.c 2009-09-07 13:16:22.000000000 +0100
+++ mm3/mm/memory.c 2009-09-07 13:16:32.000000000 +0100
@@ -1424,9 +1424,40 @@ int get_user_pages(struct task_struct *t

return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
}
-
EXPORT_SYMBOL(get_user_pages);

+/**
+ * get_dump_page() - pin user page in memory while writing it to core dump
+ * @addr: user address
+ *
+ * Returns struct page pointer of user page pinned for dump,
+ * to be freed afterwards by page_cache_release() or put_page().
+ *
+ * Returns NULL on any kind of failure - a hole must then be inserted into
+ * the corefile, to preserve alignment with its headers; and also returns
+ * NULL wherever the ZERO_PAGE, or an anonymous pte_none, has been found -
+ * allowing a hole to be left in the corefile to save diskspace.
+ *
+ * Called without mmap_sem, but after all other threads have been killed.
+ */
+#ifdef CONFIG_ELF_CORE
+struct page *get_dump_page(unsigned long addr)
+{
+ struct vm_area_struct *vma;
+ struct page *page;
+
+ if (__get_user_pages(current, current->mm, addr, 1,
+ GUP_FLAGS_FORCE, &page, &vma) < 1)
+ return NULL;
+ if (page == ZERO_PAGE(0)) {
+ page_cache_release(page);
+ return NULL;
+ }
+ flush_cache_page(vma, addr, page_to_pfn(page));
+ return page;
+}
+#endif /* CONFIG_ELF_CORE */
+
pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
spinlock_t **ptl)
{

2009-09-07 21:36:18

by Hugh Dickins

Subject: [PATCH 4/8] mm: FOLL_DUMP replace FOLL_ANON

The "FOLL_ANON optimization" and its use_zero_page() test have caused
confusion and bugs: why does it test VM_SHARED? for the very good but
unsatisfying reason that VMware crashed without it. As we look to maybe
reinstating anonymous use of the ZERO_PAGE, we need to sort this out.

Easily done: it's silly for __get_user_pages() and follow_page() to
be guessing whether it's safe to assume that they're being used for
a coredump (which can take a shortcut snapshot where other uses must
handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.

get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/mm.h | 2 +-
mm/internal.h | 1 +
mm/memory.c | 43 ++++++++++++-------------------------------
3 files changed, 14 insertions(+), 32 deletions(-)

--- mm3/include/linux/mm.h 2009-09-07 13:16:32.000000000 +0100
+++ mm4/include/linux/mm.h 2009-09-07 13:16:39.000000000 +0100
@@ -1247,7 +1247,7 @@ struct page *follow_page(struct vm_area_
#define FOLL_WRITE 0x01 /* check pte is writable */
#define FOLL_TOUCH 0x02 /* mark page accessed */
#define FOLL_GET 0x04 /* do get_page on page */
-#define FOLL_ANON 0x08 /* give ZERO_PAGE if no pgtable */
+#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
--- mm3/mm/internal.h 2009-09-07 13:16:22.000000000 +0100
+++ mm4/mm/internal.h 2009-09-07 13:16:39.000000000 +0100
@@ -252,6 +252,7 @@ static inline void mminit_validate_memmo

#define GUP_FLAGS_WRITE 0x01
#define GUP_FLAGS_FORCE 0x02
+#define GUP_FLAGS_DUMP 0x04

int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
--- mm3/mm/memory.c 2009-09-07 13:16:32.000000000 +0100
+++ mm4/mm/memory.c 2009-09-07 13:16:39.000000000 +0100
@@ -1174,41 +1174,22 @@ no_page:
pte_unmap_unlock(ptep, ptl);
if (!pte_none(pte))
return page;
- /* Fall through to ZERO_PAGE handling */
+
no_page_table:
/*
* When core dumping an enormous anonymous area that nobody
- * has touched so far, we don't want to allocate page tables.
+ * has touched so far, we don't want to allocate unnecessary pages or
+ * page tables. Return error instead of NULL to skip handle_mm_fault,
+ * then get_dump_page() will return NULL to leave a hole in the dump.
+ * But we can only make this optimization where a hole would surely
+ * be zero-filled if handle_mm_fault() actually did handle it.
*/
- if (flags & FOLL_ANON) {
- page = ZERO_PAGE(0);
- if (flags & FOLL_GET)
- get_page(page);
- BUG_ON(flags & FOLL_WRITE);
- }
+ if ((flags & FOLL_DUMP) &&
+ (!vma->vm_ops || !vma->vm_ops->fault))
+ return ERR_PTR(-EFAULT);
return page;
}

-/* Can we do the FOLL_ANON optimization? */
-static inline int use_zero_page(struct vm_area_struct *vma)
-{
- /*
- * We don't want to optimize FOLL_ANON for make_pages_present()
- * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
- * we want to get the page from the page tables to make sure
- * that we serialize and update with any other user of that
- * mapping.
- */
- if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
- return 0;
- /*
- * And if we have a fault routine, it's not an anonymous region.
- */
- return !vma->vm_ops || !vma->vm_ops->fault;
-}
-
-
-
int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, int flags,
struct page **pages, struct vm_area_struct **vmas)
@@ -1288,8 +1269,8 @@ int __get_user_pages(struct task_struct
foll_flags = FOLL_TOUCH;
if (pages)
foll_flags |= FOLL_GET;
- if (!write && use_zero_page(vma))
- foll_flags |= FOLL_ANON;
+ if (flags & GUP_FLAGS_DUMP)
+ foll_flags |= FOLL_DUMP;

do {
struct page *page;
@@ -1447,7 +1428,7 @@ struct page *get_dump_page(unsigned long
struct page *page;

if (__get_user_pages(current, current->mm, addr, 1,
- GUP_FLAGS_FORCE, &page, &vma) < 1)
+ GUP_FLAGS_FORCE | GUP_FLAGS_DUMP, &page, &vma) < 1)
return NULL;
if (page == ZERO_PAGE(0)) {
page_cache_release(page);

2009-09-07 21:38:05

by Hugh Dickins

Subject: [PATCH 5/8] mm: follow_hugetlb_page flags

follow_hugetlb_page() shouldn't be guessing about the coredump case
either: pass the foll_flags down to it, instead of just the write bit.

Remove that obscure huge_zeropage_ok() test. The decision is easy,
though unlike the non-huge case - here vm_ops->fault is always set.
But we know that a fault would serve up zeroes, unless there's
already a hugetlbfs pagecache page to back the range.

(Alternatively, since hugetlb pages aren't swapped out under pressure,
you could save more dump space by arguing that a page not yet faulted
into this process cannot be relevant to the dump; but that would be
more surprising.)

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/hugetlb.h | 4 +-
mm/hugetlb.c | 62 ++++++++++++++++++++++----------------
mm/memory.c | 14 ++++----
3 files changed, 48 insertions(+), 32 deletions(-)

--- mm4/include/linux/hugetlb.h 2009-09-05 14:40:16.000000000 +0100
+++ mm5/include/linux/hugetlb.h 2009-09-07 13:16:46.000000000 +0100
@@ -24,7 +24,9 @@ int hugetlb_sysctl_handler(struct ctl_ta
int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
-int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
+int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
+ struct page **, struct vm_area_struct **,
+ unsigned long *, int *, int, unsigned int flags);
void unmap_hugepage_range(struct vm_area_struct *,
unsigned long, unsigned long, struct page *);
void __unmap_hugepage_range(struct vm_area_struct *,
--- mm4/mm/hugetlb.c 2009-09-05 14:40:16.000000000 +0100
+++ mm5/mm/hugetlb.c 2009-09-07 13:16:46.000000000 +0100
@@ -2016,6 +2016,23 @@ static struct page *hugetlbfs_pagecache_
return find_lock_page(mapping, idx);
}

+/* Return whether there is a pagecache page to back given address within VMA */
+static bool hugetlbfs_backed(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long address)
+{
+ struct address_space *mapping;
+ pgoff_t idx;
+ struct page *page;
+
+ mapping = vma->vm_file->f_mapping;
+ idx = vma_hugecache_offset(h, vma, address);
+
+ page = find_get_page(mapping, idx);
+ if (page)
+ put_page(page);
+ return page != NULL;
+}
+
static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, unsigned int flags)
{
@@ -2211,54 +2228,52 @@ follow_huge_pud(struct mm_struct *mm, un
return NULL;
}

-static int huge_zeropage_ok(pte_t *ptep, int write, int shared)
-{
- if (!ptep || write || shared)
- return 0;
- else
- return huge_pte_none(huge_ptep_get(ptep));
-}
-
int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, int *length, int i,
- int write)
+ unsigned int flags)
{
unsigned long pfn_offset;
unsigned long vaddr = *position;
int remainder = *length;
struct hstate *h = hstate_vma(vma);
- int zeropage_ok = 0;
- int shared = vma->vm_flags & VM_SHARED;

spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
+ int absent;
struct page *page;

/*
* Some archs (sparc64, sh*) have multiple pte_ts to
- * each hugepage. We have to make * sure we get the
+ * each hugepage. We have to make sure we get the
* first, for the page indexing below to work.
*/
pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
- if (huge_zeropage_ok(pte, write, shared))
- zeropage_ok = 1;
+ absent = !pte || huge_pte_none(huge_ptep_get(pte));
+
+ /*
+ * When coredumping, it suits get_dump_page if we just return
+ * an error if there's a hole and no huge pagecache to back it.
+ */
+ if (absent &&
+ ((flags & FOLL_DUMP) && !hugetlbfs_backed(h, vma, vaddr))) {
+ remainder = 0;
+ break;
+ }

- if (!pte ||
- (huge_pte_none(huge_ptep_get(pte)) && !zeropage_ok) ||
- (write && !pte_write(huge_ptep_get(pte)))) {
+ if (absent ||
+ ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
int ret;

spin_unlock(&mm->page_table_lock);
- ret = hugetlb_fault(mm, vma, vaddr, write);
+ ret = hugetlb_fault(mm, vma, vaddr,
+ (flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
spin_lock(&mm->page_table_lock);
if (!(ret & VM_FAULT_ERROR))
continue;

remainder = 0;
- if (!i)
- i = -EFAULT;
break;
}

@@ -2266,10 +2281,7 @@ int follow_hugetlb_page(struct mm_struct
page = pte_page(huge_ptep_get(pte));
same_page:
if (pages) {
- if (zeropage_ok)
- pages[i] = ZERO_PAGE(0);
- else
- pages[i] = mem_map_offset(page, pfn_offset);
+ pages[i] = mem_map_offset(page, pfn_offset);
get_page(pages[i]);
}

@@ -2293,7 +2305,7 @@ same_page:
*length = remainder;
*position = vaddr;

- return i;
+ return i ? i : -EFAULT;
}

void hugetlb_change_protection(struct vm_area_struct *vma,
--- mm4/mm/memory.c 2009-09-07 13:16:39.000000000 +0100
+++ mm5/mm/memory.c 2009-09-07 13:16:46.000000000 +0100
@@ -1260,17 +1260,19 @@ int __get_user_pages(struct task_struct
!(vm_flags & vma->vm_flags))
return i ? : -EFAULT;

- if (is_vm_hugetlb_page(vma)) {
- i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &nr_pages, i, write);
- continue;
- }
-
foll_flags = FOLL_TOUCH;
if (pages)
foll_flags |= FOLL_GET;
if (flags & GUP_FLAGS_DUMP)
foll_flags |= FOLL_DUMP;
+ if (write)
+ foll_flags |= FOLL_WRITE;
+
+ if (is_vm_hugetlb_page(vma)) {
+ i = follow_hugetlb_page(mm, vma, pages, vmas,
+ &start, &nr_pages, i, foll_flags);
+ continue;
+ }

do {
struct page *page;

2009-09-07 21:39:14

by Hugh Dickins

Subject: [PATCH 6/8] mm: fix anonymous dirtying

do_anonymous_page() has been wrong to dirty the pte regardless.
If it's not going to mark the pte writable, then it won't help
to mark it dirty here, and clogs up memory with pages which will
need swap instead of being thrown away. Especially wrong if no
overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
we could exceed the limit and OOM despite no overcommit.

Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
---

mm/memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- mm5/mm/memory.c 2009-09-07 13:16:46.000000000 +0100
+++ mm6/mm/memory.c 2009-09-07 13:16:53.000000000 +0100
@@ -2608,7 +2608,8 @@ static int do_anonymous_page(struct mm_s
goto oom_free_page;

entry = mk_pte(page, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));

page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte_none(*page_table))

2009-09-07 21:40:19

by Hugh Dickins

Subject: [PATCH 7/8] mm: reinstate ZERO_PAGE

KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
using in 2.6.24. And there were a couple of regression reports on LKML.

Following suggestions from Linus, reinstate do_anonymous_page() use of
the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
with (map)count updates - let vm_normal_page() regard it as abnormal.

Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
most powerpc): that's not essential, but minimizes additional branches
(keeping them in the unlikely pte_special case); and incidentally
excludes mips (some models of which needed eight colours of ZERO_PAGE
to avoid costly exceptions).

Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
callers won't want to make exceptions for it, so increment its count
there. Changes to mlock and migration? happily seems not needed.

In most places it's quicker to check pfn than struct page address:
prepare a __read_mostly zero_pfn for that. Does get_dump_page()
still need its ZERO_PAGE check? probably not, but keep it anyway.
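
A rough userspace sketch, for illustration only (not part of the patch;
the 256MB size and 4K page-size assumption are arbitrary), of the
behaviour this reinstates on __HAVE_ARCH_PTE_SPECIAL arches: read faults
on untouched anonymous memory map the shared ZERO_PAGE instead of each
allocating a private zeroed page, and only a later write allocates a
real page (see the pte_pfn(orig_pte) == zero_pfn check below).

	#include <sys/mman.h>

	int main(void)
	{
		size_t size = 256UL << 20;	/* arbitrary 256MB example */
		char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		long sum = 0;
		size_t i;

		if (p == MAP_FAILED)
			return 1;
		for (i = 0; i < size; i += 4096)	/* read-only touch: ZERO_PAGE mapped */
			sum += p[i];
		p[0] = 1;	/* first write: a private zeroed page is allocated */
		return sum != 0;
	}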

Signed-off-by: Hugh Dickins <[email protected]>
---
I have not studied the performance of this at all: I'd rather it go
into mmotm where others may decide whether it's a good thing or not.

mm/memory.c | 53 +++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 44 insertions(+), 9 deletions(-)

--- mm6/mm/memory.c 2009-09-07 13:16:53.000000000 +0100
+++ mm7/mm/memory.c 2009-09-07 13:17:01.000000000 +0100
@@ -107,6 +107,17 @@ static int __init disable_randmaps(char
}
__setup("norandmaps", disable_randmaps);

+static unsigned long zero_pfn __read_mostly;
+
+/*
+ * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init()
+ */
+static int __init init_zero_pfn(void)
+{
+ zero_pfn = page_to_pfn(ZERO_PAGE(0));
+ return 0;
+}
+core_initcall(init_zero_pfn);

/*
* If a p?d_bad entry is found while walking page tables, report
@@ -499,7 +510,9 @@ struct page *vm_normal_page(struct vm_ar
if (HAVE_PTE_SPECIAL) {
if (likely(!pte_special(pte)))
goto check_pfn;
- if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ return NULL;
+ if (pfn != zero_pfn)
print_bad_pte(vma, addr, pte, NULL);
return NULL;
}
@@ -1144,9 +1157,14 @@ struct page *follow_page(struct vm_area_
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
+
page = vm_normal_page(vma, address, pte);
- if (unlikely(!page))
- goto bad_page;
+ if (unlikely(!page)) {
+ if ((flags & FOLL_DUMP) ||
+ pte_pfn(pte) != zero_pfn)
+ goto bad_page;
+ page = pte_page(pte);
+ }

if (flags & FOLL_GET)
get_page(page);
@@ -2085,10 +2103,19 @@ gotten:

if (unlikely(anon_vma_prepare(vma)))
goto oom;
- VM_BUG_ON(old_page == ZERO_PAGE(0));
- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
- if (!new_page)
- goto oom;
+
+ if (pte_pfn(orig_pte) == zero_pfn) {
+ new_page = alloc_zeroed_user_highpage_movable(vma, address);
+ if (!new_page)
+ goto oom;
+ } else {
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!new_page)
+ goto oom;
+ cow_user_page(new_page, old_page, address, vma);
+ }
+ __SetPageUptodate(new_page);
+
/*
* Don't let another task, with possibly unlocked vma,
* keep the mlocked page.
@@ -2098,8 +2125,6 @@ gotten:
clear_page_mlock(old_page);
unlock_page(old_page);
}
- cow_user_page(new_page, old_page, address, vma);
- __SetPageUptodate(new_page);

if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
goto oom_free_new;
@@ -2594,6 +2619,15 @@ static int do_anonymous_page(struct mm_s
spinlock_t *ptl;
pte_t entry;

+ if (HAVE_PTE_SPECIAL && !(flags & FAULT_FLAG_WRITE)) {
+ entry = pte_mkspecial(pfn_pte(zero_pfn, vma->vm_page_prot));
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (!pte_none(*page_table))
+ goto unlock;
+ goto setpte;
+ }
+
/* Allocate our own private page. */
pte_unmap(page_table);

@@ -2617,6 +2651,7 @@ static int do_anonymous_page(struct mm_s

inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+setpte:
set_pte_at(mm, address, page_table, entry);

/* No need to invalidate - it was non-present before */

2009-09-07 21:41:35

by Hugh Dickins

Subject: [PATCH 8/8] mm: FOLL flags for GUP flags

__get_user_pages() has been taking its own GUP flags, then processing
them into FOLL flags for follow_page(). Though oddly named, the FOLL
flags are more widely used, so pass them to __get_user_pages() now.
Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.

(The patch to __get_user_pages() looks peculiar, with both gup_flags
and foll_flags: the gup_flags remain constant; but as before there's
an exceptional case, out of scope of the patch, in which foll_flags
per page have FOLL_WRITE masked off.)

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/mm.h | 1
mm/internal.h | 6 -----
mm/memory.c | 44 ++++++++++++++++++-------------------------
mm/mlock.c | 4 +--
mm/nommu.c | 16 +++++++--------
5 files changed, 31 insertions(+), 40 deletions(-)

--- mm7/include/linux/mm.h 2009-09-07 13:16:39.000000000 +0100
+++ mm8/include/linux/mm.h 2009-09-07 13:17:07.000000000 +0100
@@ -1248,6 +1248,7 @@ struct page *follow_page(struct vm_area_
#define FOLL_TOUCH 0x02 /* mark page accessed */
#define FOLL_GET 0x04 /* do get_page on page */
#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
+#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
--- mm7/mm/internal.h 2009-09-07 13:16:39.000000000 +0100
+++ mm8/mm/internal.h 2009-09-07 13:17:07.000000000 +0100
@@ -250,12 +250,8 @@ static inline void mminit_validate_memmo
}
#endif /* CONFIG_SPARSEMEM */

-#define GUP_FLAGS_WRITE 0x01
-#define GUP_FLAGS_FORCE 0x02
-#define GUP_FLAGS_DUMP 0x04
-
int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int len, int flags,
+ unsigned long start, int len, unsigned int foll_flags,
struct page **pages, struct vm_area_struct **vmas);

#define ZONE_RECLAIM_NOSCAN -2
--- mm7/mm/memory.c 2009-09-07 13:17:01.000000000 +0100
+++ mm8/mm/memory.c 2009-09-07 13:17:07.000000000 +0100
@@ -1209,27 +1209,29 @@ no_page_table:
}

int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int nr_pages, int flags,
+ unsigned long start, int nr_pages, unsigned int gup_flags,
struct page **pages, struct vm_area_struct **vmas)
{
int i;
- unsigned int vm_flags = 0;
- int write = !!(flags & GUP_FLAGS_WRITE);
- int force = !!(flags & GUP_FLAGS_FORCE);
+ unsigned long vm_flags;

if (nr_pages <= 0)
return 0;
+
+ VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
+
/*
* Require read or write permissions.
- * If 'force' is set, we only require the "MAY" flags.
+ * If FOLL_FORCE is set, we only require the "MAY" flags.
*/
- vm_flags = write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
- vm_flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+ vm_flags = (gup_flags & FOLL_WRITE) ?
+ (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
+ vm_flags &= (gup_flags & FOLL_FORCE) ?
+ (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
i = 0;

do {
struct vm_area_struct *vma;
- unsigned int foll_flags;

vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(tsk, start)) {
@@ -1241,7 +1243,7 @@ int __get_user_pages(struct task_struct
pte_t *pte;

/* user gate pages are read-only */
- if (write)
+ if (gup_flags & FOLL_WRITE)
return i ? : -EFAULT;
if (pg > TASK_SIZE)
pgd = pgd_offset_k(pg);
@@ -1278,22 +1280,15 @@ int __get_user_pages(struct task_struct
!(vm_flags & vma->vm_flags))
return i ? : -EFAULT;

- foll_flags = FOLL_TOUCH;
- if (pages)
- foll_flags |= FOLL_GET;
- if (flags & GUP_FLAGS_DUMP)
- foll_flags |= FOLL_DUMP;
- if (write)
- foll_flags |= FOLL_WRITE;
-
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &nr_pages, i, foll_flags);
+ &start, &nr_pages, i, gup_flags);
continue;
}

do {
struct page *page;
+ unsigned int foll_flags = gup_flags;

/*
* If we have a pending SIGKILL, don't keep faulting
@@ -1302,9 +1297,6 @@ int __get_user_pages(struct task_struct
if (unlikely(fatal_signal_pending(current)))
return i ? i : -ERESTARTSYS;

- if (write)
- foll_flags |= FOLL_WRITE;
-
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
@@ -1416,12 +1408,14 @@ int get_user_pages(struct task_struct *t
unsigned long start, int nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
{
- int flags = 0;
+ int flags = FOLL_TOUCH;

+ if (pages)
+ flags |= FOLL_GET;
if (write)
- flags |= GUP_FLAGS_WRITE;
+ flags |= FOLL_WRITE;
if (force)
- flags |= GUP_FLAGS_FORCE;
+ flags |= FOLL_FORCE;

return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
}
@@ -1448,7 +1442,7 @@ struct page *get_dump_page(unsigned long
struct page *page;

if (__get_user_pages(current, current->mm, addr, 1,
- GUP_FLAGS_FORCE | GUP_FLAGS_DUMP, &page, &vma) < 1)
+ FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma) < 1)
return NULL;
if (page == ZERO_PAGE(0)) {
page_cache_release(page);
--- mm7/mm/mlock.c 2009-09-07 13:16:15.000000000 +0100
+++ mm8/mm/mlock.c 2009-09-07 13:17:07.000000000 +0100
@@ -166,9 +166,9 @@ static long __mlock_vma_pages_range(stru
VM_BUG_ON(end > vma->vm_end);
VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));

- gup_flags = 0;
+ gup_flags = FOLL_TOUCH | FOLL_GET;
if (vma->vm_flags & VM_WRITE)
- gup_flags = GUP_FLAGS_WRITE;
+ gup_flags |= FOLL_WRITE;

while (nr_pages > 0) {
int i;
--- mm7/mm/nommu.c 2009-09-07 13:16:22.000000000 +0100
+++ mm8/mm/nommu.c 2009-09-07 13:17:07.000000000 +0100
@@ -128,20 +128,20 @@ unsigned int kobjsize(const void *objp)
}

int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int nr_pages, int flags,
+ unsigned long start, int nr_pages, int foll_flags,
struct page **pages, struct vm_area_struct **vmas)
{
struct vm_area_struct *vma;
unsigned long vm_flags;
int i;
- int write = !!(flags & GUP_FLAGS_WRITE);
- int force = !!(flags & GUP_FLAGS_FORCE);

/* calculate required read or write permissions.
- * - if 'force' is set, we only require the "MAY" flags.
+ * If FOLL_FORCE is set, we only require the "MAY" flags.
*/
- vm_flags = write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
- vm_flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+ vm_flags = (foll_flags & FOLL_WRITE) ?
+ (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
+ vm_flags &= (foll_flags & FOLL_FORCE) ?
+ (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);

for (i = 0; i < nr_pages; i++) {
vma = find_vma(mm, start);
@@ -183,9 +183,9 @@ int get_user_pages(struct task_struct *t
int flags = 0;

if (write)
- flags |= GUP_FLAGS_WRITE;
+ flags |= FOLL_WRITE;
if (force)
- flags |= GUP_FLAGS_FORCE;
+ flags |= FOLL_FORCE;

return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
}

2009-09-07 23:51:49

by Linus Torvalds

Subject: Re: [PATCH 0/8] mm: around get_user_pages flags



On Mon, 7 Sep 2009, Hugh Dickins wrote:
>
> Here's a series of mm mods against current mmotm: mostly cleanup
> of get_user_pages flags, but fixing munlock's OOM, sorting out the
> "FOLL_ANON optimization", and reinstating ZERO_PAGE along the way.

Ack on the whole series as far as I'm concerned. All looks very
straightforward and sane.

Linus

2009-09-07 23:54:24

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/8] mm: around get_user_pages flags

On Mon, 7 Sep 2009 22:26:51 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> Here's a series of mm mods against current mmotm: mostly cleanup
> of get_user_pages flags, but fixing munlock's OOM, sorting out the
> "FOLL_ANON optimization", and reinstating ZERO_PAGE along the way.
>
> fs/binfmt_elf.c | 42 ++------
> fs/binfmt_elf_fdpic.c | 56 ++++-------
> include/linux/hugetlb.h | 4
> include/linux/mm.h | 4
> mm/hugetlb.c | 62 +++++++------
> mm/internal.h | 7 -
> mm/memory.c | 180 +++++++++++++++++++++++---------------
> mm/mlock.c | 99 ++++++++------------
> mm/nommu.c | 22 ++--
> 9 files changed, 235 insertions(+), 241 deletions(-)
>
> Hugh
>
Seems much clearer than mine. I'll test.
Thank you very much.

Regards,
-Kame

2009-09-08 00:00:41

by KOSAKI Motohiro

Subject: Re: [PATCH 0/8] mm: around get_user_pages flags

> Here's a series of mm mods against current mmotm: mostly cleanup
> of get_user_pages flags, but fixing munlock's OOM, sorting out the
> "FOLL_ANON optimization", and reinstating ZERO_PAGE along the way.
>
> fs/binfmt_elf.c | 42 ++------
> fs/binfmt_elf_fdpic.c | 56 ++++-------
> include/linux/hugetlb.h | 4
> include/linux/mm.h | 4
> mm/hugetlb.c | 62 +++++++------
> mm/internal.h | 7 -
> mm/memory.c | 180 +++++++++++++++++++++++---------------
> mm/mlock.c | 99 ++++++++------------
> mm/nommu.c | 22 ++--
> 9 files changed, 235 insertions(+), 241 deletions(-)

Great!
I'll start to test this patch series. Thanks Hugh!!


2009-09-08 02:39:37

by Kamezawa Hiroyuki

Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE

On Mon, 7 Sep 2009 22:39:34 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
> advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
> using in 2.6.24. And there were a couple of regression reports on LKML.
>
> Following suggestions from Linus, reinstate do_anonymous_page() use of
> the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
> with (map)count updates - let vm_normal_page() regard it as abnormal.
>
> Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
> most powerpc): that's not essential, but minimizes additional branches
> (keeping them in the unlikely pte_special case); and incidentally
> excludes mips (some models of which needed eight colours of ZERO_PAGE
> to avoid costly exceptions).
>
> Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
> callers won't want to make exceptions for it, so increment its count
> there. Changes to mlock and migration? happily seems not needed.
>
> In most places it's quicker to check pfn than struct page address:
> prepare a __read_mostly zero_pfn for that. Does get_dump_page()
> still need its ZERO_PAGE check? probably not, but keep it anyway.
>
> Signed-off-by: Hugh Dickins <[email protected]>

A nitpick, but this was a concern you had shown, IIUC.

== __get_user_pages()..

if (pages) {
pages[i] = page;

flush_anon_page(vma, page, start);
flush_dcache_page(page);
}
==

This part will call flush_dcache_page() even when ZERO_PAGE is found.

Don't we need to mask this?

Thanks,
-Kame




> ---
> I have not studied the performance of this at all: I'd rather it go
> into mmotm where others may decide whether it's a good thing or not.
>
> mm/memory.c | 53 +++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 44 insertions(+), 9 deletions(-)
>
> --- mm6/mm/memory.c 2009-09-07 13:16:53.000000000 +0100
> +++ mm7/mm/memory.c 2009-09-07 13:17:01.000000000 +0100
> @@ -107,6 +107,17 @@ static int __init disable_randmaps(char
> }
> __setup("norandmaps", disable_randmaps);
>
> +static unsigned long zero_pfn __read_mostly;
> +
> +/*
> + * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init()
> + */
> +static int __init init_zero_pfn(void)
> +{
> + zero_pfn = page_to_pfn(ZERO_PAGE(0));
> + return 0;
> +}
> +core_initcall(init_zero_pfn);
>
> /*
> * If a p?d_bad entry is found while walking page tables, report
> @@ -499,7 +510,9 @@ struct page *vm_normal_page(struct vm_ar
> if (HAVE_PTE_SPECIAL) {
> if (likely(!pte_special(pte)))
> goto check_pfn;
> - if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
> + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> + return NULL;
> + if (pfn != zero_pfn)
> print_bad_pte(vma, addr, pte, NULL);
> return NULL;
> }
> @@ -1144,9 +1157,14 @@ struct page *follow_page(struct vm_area_
> goto no_page;
> if ((flags & FOLL_WRITE) && !pte_write(pte))
> goto unlock;
> +
> page = vm_normal_page(vma, address, pte);
> - if (unlikely(!page))
> - goto bad_page;
> + if (unlikely(!page)) {
> + if ((flags & FOLL_DUMP) ||
> + pte_pfn(pte) != zero_pfn)
> + goto bad_page;
> + page = pte_page(pte);
> + }
>
> if (flags & FOLL_GET)
> get_page(page);
> @@ -2085,10 +2103,19 @@ gotten:
>
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> - VM_BUG_ON(old_page == ZERO_PAGE(0));
> - new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> - if (!new_page)
> - goto oom;
> +
> + if (pte_pfn(orig_pte) == zero_pfn) {
> + new_page = alloc_zeroed_user_highpage_movable(vma, address);
> + if (!new_page)
> + goto oom;
> + } else {
> + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> + if (!new_page)
> + goto oom;
> + cow_user_page(new_page, old_page, address, vma);
> + }
> + __SetPageUptodate(new_page);
> +
> /*
> * Don't let another task, with possibly unlocked vma,
> * keep the mlocked page.
> @@ -2098,8 +2125,6 @@ gotten:
> clear_page_mlock(old_page);
> unlock_page(old_page);
> }
> - cow_user_page(new_page, old_page, address, vma);
> - __SetPageUptodate(new_page);
>
> if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
> goto oom_free_new;
> @@ -2594,6 +2619,15 @@ static int do_anonymous_page(struct mm_s
> spinlock_t *ptl;
> pte_t entry;
>
> + if (HAVE_PTE_SPECIAL && !(flags & FAULT_FLAG_WRITE)) {
> + entry = pte_mkspecial(pfn_pte(zero_pfn, vma->vm_page_prot));
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
> + if (!pte_none(*page_table))
> + goto unlock;
> + goto setpte;
> + }
> +
> /* Allocate our own private page. */
> pte_unmap(page_table);
>
> @@ -2617,6 +2651,7 @@ static int do_anonymous_page(struct mm_s
>
> inc_mm_counter(mm, anon_rss);
> page_add_new_anon_rmap(page, vma, address);
> +setpte:
> set_pte_at(mm, address, page_table, entry);
>
> /* No need to invalidate - it was non-present before */

2009-09-08 03:00:29

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/8] mm: munlock use follow_page

On Mon, 7 Sep 2009 22:29:55 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> Hiroaki Wakabayashi points out that when mlock() has been interrupted
> by SIGKILL, the subsequent munlock() takes unnecessarily long because
> its use of __get_user_pages() insists on faulting in all the pages
> which mlock() never reached.
>
> It's worse than slowness if mlock() is terminated by Out Of Memory kill:
> the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the
> pages which mlock() could not find memory for; so innocent bystanders are
> killed too, and perhaps the system hangs.
>
> __get_user_pages() does a lot that's silly for munlock(): so remove the
> munlock option from __mlock_vma_pages_range(), and use a simple loop of
> follow_page()s in munlock_vma_pages_range() instead; ignoring absent
> pages, and not marking present pages as accessed or dirty.
>
> (Change munlock() to only go so far as mlock() reached? That does not
> work out, given the convention that mlock() claims complete success even
> when it has to give up early - in part so that an underlying file can be
> extended later, and those pages locked which earlier would give SIGBUS.)
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected]
> ---
>
> mm/mlock.c | 99 ++++++++++++++++++++-------------------------------
> 1 file changed, 40 insertions(+), 59 deletions(-)
>
> --- mm0/mm/mlock.c 2009-06-25 05:18:10.000000000 +0100
> +++ mm1/mm/mlock.c 2009-09-07 13:16:15.000000000 +0100
> @@ -139,49 +139,36 @@ static void munlock_vma_page(struct page
> }
>
> /**
> - * __mlock_vma_pages_range() - mlock/munlock a range of pages in the vma.
> + * __mlock_vma_pages_range() - mlock a range of pages in the vma.
> * @vma: target vma
> * @start: start address
> * @end: end address
> - * @mlock: 0 indicate munlock, otherwise mlock.
> *
> - * If @mlock == 0, unlock an mlocked range;
> - * else mlock the range of pages. This takes care of making the pages present ,
> - * too.
> + * This takes care of making the pages present too.
> *
> * return 0 on success, negative error code on error.
> *
> * vma->vm_mm->mmap_sem must be held for at least read.
> */
> static long __mlock_vma_pages_range(struct vm_area_struct *vma,
> - unsigned long start, unsigned long end,
> - int mlock)
> + unsigned long start, unsigned long end)
> {
> struct mm_struct *mm = vma->vm_mm;
> unsigned long addr = start;
> struct page *pages[16]; /* 16 gives a reasonable batch */
> int nr_pages = (end - start) / PAGE_SIZE;
> int ret = 0;
> - int gup_flags = 0;
> + int gup_flags;
>
> VM_BUG_ON(start & ~PAGE_MASK);
> VM_BUG_ON(end & ~PAGE_MASK);
> VM_BUG_ON(start < vma->vm_start);
> VM_BUG_ON(end > vma->vm_end);
> - VM_BUG_ON((!rwsem_is_locked(&mm->mmap_sem)) &&
> - (atomic_read(&mm->mm_users) != 0));
> -
> - /*
> - * mlock: don't page populate if vma has PROT_NONE permission.
> - * munlock: always do munlock although the vma has PROT_NONE
> - * permission, or SIGKILL is pending.
> - */
> - if (!mlock)
> - gup_flags |= GUP_FLAGS_IGNORE_VMA_PERMISSIONS |
> - GUP_FLAGS_IGNORE_SIGKILL;
> + VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
>
> + gup_flags = 0;
> if (vma->vm_flags & VM_WRITE)
> - gup_flags |= GUP_FLAGS_WRITE;
> + gup_flags = GUP_FLAGS_WRITE;
>
> while (nr_pages > 0) {
> int i;
> @@ -201,19 +188,10 @@ static long __mlock_vma_pages_range(stru
> * This can happen for, e.g., VM_NONLINEAR regions before
> * a page has been allocated and mapped at a given offset,
> * or for addresses that map beyond end of a file.
> - * We'll mlock the the pages if/when they get faulted in.
> + * We'll mlock the pages if/when they get faulted in.
> */
> if (ret < 0)
> break;
> - if (ret == 0) {
> - /*
> - * We know the vma is there, so the only time
> - * we cannot get a single page should be an
> - * error (ret < 0) case.
> - */
> - WARN_ON(1);
> - break;
> - }
>
> lru_add_drain(); /* push cached pages to LRU */
>
> @@ -224,28 +202,22 @@ static long __mlock_vma_pages_range(stru
> /*
> * Because we lock page here and migration is blocked
> * by the elevated reference, we need only check for
> - * page truncation (file-cache only).
> + * file-cache page truncation. This page->mapping
> + * check also neatly skips over the ZERO_PAGE(),
> + * though if that's common we'd prefer not to lock it.
> */
> - if (page->mapping) {
> - if (mlock)
> - mlock_vma_page(page);
> - else
> - munlock_vma_page(page);
> - }
> + if (page->mapping)
> + mlock_vma_page(page);
> unlock_page(page);
> - put_page(page); /* ref from get_user_pages() */
> -
> - /*
> - * here we assume that get_user_pages() has given us
> - * a list of virtually contiguous pages.
> - */
> - addr += PAGE_SIZE; /* for next get_user_pages() */
> - nr_pages--;
> + put_page(page); /* ref from get_user_pages() */
> }
> +
> + addr += ret * PAGE_SIZE;
> + nr_pages -= ret;
> ret = 0;
> }
>
> - return ret; /* count entire vma as locked_vm */
> + return ret; /* 0 or negative error code */
> }
>
> /*
> @@ -289,7 +261,7 @@ long mlock_vma_pages_range(struct vm_are
> is_vm_hugetlb_page(vma) ||
> vma == get_gate_vma(current))) {
>
> - __mlock_vma_pages_range(vma, start, end, 1);
> + __mlock_vma_pages_range(vma, start, end);
>
> /* Hide errors from mmap() and other callers */
> return 0;
> @@ -310,7 +282,6 @@ no_mlock:
> return nr_pages; /* error or pages NOT mlocked */
> }
>
> -
> /*
> * munlock_vma_pages_range() - munlock all pages in the vma range.'
> * @vma - vma containing range to be munlock()ed.
> @@ -330,10 +301,24 @@ no_mlock:
> * free them. This will result in freeing mlocked pages.
> */
> void munlock_vma_pages_range(struct vm_area_struct *vma,
> - unsigned long start, unsigned long end)
> + unsigned long start, unsigned long end)
> {
> + unsigned long addr;
> +
> + lru_add_drain();
> vma->vm_flags &= ~VM_LOCKED;
> - __mlock_vma_pages_range(vma, start, end, 0);
> +
> + for (addr = start; addr < end; addr += PAGE_SIZE) {
> + struct page *page = follow_page(vma, addr, FOLL_GET);
> + if (page) {
> + lock_page(page);
> + if (page->mapping)
> + munlock_vma_page(page);

Could you add "please see __mlock_vma_pages_range() to see why" or some such comment here?

Thanks,
-Kame

> + unlock_page(page);
> + put_page(page);
> + }
> + cond_resched();
> + }
> }
>
> /*
> @@ -400,18 +385,14 @@ success:
> * It's okay if try_to_unmap_one unmaps a page just after we
> * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
> */
> - vma->vm_flags = newflags;
>
> if (lock) {
> - ret = __mlock_vma_pages_range(vma, start, end, 1);
> -
> - if (ret > 0) {
> - mm->locked_vm -= ret;
> - ret = 0;
> - } else
> - ret = __mlock_posix_error_return(ret); /* translate if needed */
> + vma->vm_flags = newflags;
> + ret = __mlock_vma_pages_range(vma, start, end);
> + if (ret < 0)
> + ret = __mlock_posix_error_return(ret);
> } else {
> - __mlock_vma_pages_range(vma, start, end, 0);
> + munlock_vma_pages_range(vma, start, end);
> }
>
> out:

2009-09-08 07:31:20

by Nick Piggin

Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE

On Mon, Sep 07, 2009 at 10:39:34PM +0100, Hugh Dickins wrote:
> KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
> advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
> using in 2.6.24. And there were a couple of regression reports on LKML.
>
> Following suggestions from Linus, reinstate do_anonymous_page() use of
> the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
> with (map)count updates - let vm_normal_page() regard it as abnormal.
>
> Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
> most powerpc): that's not essential, but minimizes additional branches
> (keeping them in the unlikely pte_special case); and incidentally
> excludes mips (some models of which needed eight colours of ZERO_PAGE
> to avoid costly exceptions).

Without looking closely, why is it a big problem to have a
!HAVE_PTE_SPECIAL case? Couldn't it just be a check for
pfn == zero_pfn that is conditionally compiled away for pte
special architectures anyway?

If zero page is such a good idea, I don't see the logic of
limiting it like this. Your patch looks pretty clean though.

At any rate, I think it might be an idea to cc linux-arch.

>
> Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
> callers won't want to make exceptions for it, so increment its count
> there. Changes to mlock and migration? happily seems not needed.
>
> In most places it's quicker to check pfn than struct page address:
> prepare a __read_mostly zero_pfn for that. Does get_dump_page()
> still need its ZERO_PAGE check? probably not, but keep it anyway.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> I have not studied the performance of this at all: I'd rather it go
> into mmotm where others may decide whether it's a good thing or not.
>
> mm/memory.c | 53 +++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 44 insertions(+), 9 deletions(-)
>
> --- mm6/mm/memory.c 2009-09-07 13:16:53.000000000 +0100
> +++ mm7/mm/memory.c 2009-09-07 13:17:01.000000000 +0100
> @@ -107,6 +107,17 @@ static int __init disable_randmaps(char
> }
> __setup("norandmaps", disable_randmaps);
>
> +static unsigned long zero_pfn __read_mostly;
> +
> +/*
> + * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init()
> + */
> +static int __init init_zero_pfn(void)
> +{
> + zero_pfn = page_to_pfn(ZERO_PAGE(0));
> + return 0;
> +}
> +core_initcall(init_zero_pfn);
>
> /*
> * If a p?d_bad entry is found while walking page tables, report
> @@ -499,7 +510,9 @@ struct page *vm_normal_page(struct vm_ar
> if (HAVE_PTE_SPECIAL) {
> if (likely(!pte_special(pte)))
> goto check_pfn;
> - if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
> + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> + return NULL;
> + if (pfn != zero_pfn)
> print_bad_pte(vma, addr, pte, NULL);
> return NULL;
> }
> @@ -1144,9 +1157,14 @@ struct page *follow_page(struct vm_area_
> goto no_page;
> if ((flags & FOLL_WRITE) && !pte_write(pte))
> goto unlock;
> +
> page = vm_normal_page(vma, address, pte);
> - if (unlikely(!page))
> - goto bad_page;
> + if (unlikely(!page)) {
> + if ((flags & FOLL_DUMP) ||
> + pte_pfn(pte) != zero_pfn)
> + goto bad_page;
> + page = pte_page(pte);
> + }
>
> if (flags & FOLL_GET)
> get_page(page);
> @@ -2085,10 +2103,19 @@ gotten:
>
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> - VM_BUG_ON(old_page == ZERO_PAGE(0));
> - new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> - if (!new_page)
> - goto oom;
> +
> + if (pte_pfn(orig_pte) == zero_pfn) {
> + new_page = alloc_zeroed_user_highpage_movable(vma, address);
> + if (!new_page)
> + goto oom;
> + } else {
> + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> + if (!new_page)
> + goto oom;
> + cow_user_page(new_page, old_page, address, vma);
> + }
> + __SetPageUptodate(new_page);
> +
> /*
> * Don't let another task, with possibly unlocked vma,
> * keep the mlocked page.
> @@ -2098,8 +2125,6 @@ gotten:
> clear_page_mlock(old_page);
> unlock_page(old_page);
> }
> - cow_user_page(new_page, old_page, address, vma);
> - __SetPageUptodate(new_page);
>
> if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
> goto oom_free_new;
> @@ -2594,6 +2619,15 @@ static int do_anonymous_page(struct mm_s
> spinlock_t *ptl;
> pte_t entry;
>
> + if (HAVE_PTE_SPECIAL && !(flags & FAULT_FLAG_WRITE)) {
> + entry = pte_mkspecial(pfn_pte(zero_pfn, vma->vm_page_prot));
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
> + if (!pte_none(*page_table))
> + goto unlock;
> + goto setpte;
> + }
> +
> /* Allocate our own private page. */
> pte_unmap(page_table);
>
> @@ -2617,6 +2651,7 @@ static int do_anonymous_page(struct mm_s
>
> inc_mm_counter(mm, anon_rss);
> page_add_new_anon_rmap(page, vma, address);
> +setpte:
> set_pte_at(mm, address, page_table, entry);
>
> /* No need to invalidate - it was non-present before */
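
Aside: a minimal userspace sketch, not from the patch series, of the workload the reinstated ZERO_PAGE serves - a large anonymous mapping that is read before it is ever written. The 256MB size and page-sized stride are arbitrary here; with the zero page in place the read faults all map the one shared page, while any later write still gets a private page via COW.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 256UL << 20;		/* arbitrary: 256MB, never written */
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long sum = 0;
	size_t i;

	if (p == MAP_FAILED)
		return 1;
	for (i = 0; i < len; i += 4096)		/* read-only touch of every page */
		sum += p[i];
	printf("sum=%lu (expect 0)\n", sum);
	munmap(p, len);
	return 0;
}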

2009-09-08 11:31:27

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/8] mm: munlock use follow_page

On Tue, 8 Sep 2009, KAMEZAWA Hiroyuki wrote:
> On Mon, 7 Sep 2009 22:29:55 +0100 (BST)
> Hugh Dickins <[email protected]> wrote:
> > void munlock_vma_pages_range(struct vm_area_struct *vma,
> > - unsigned long start, unsigned long end)
> > + unsigned long start, unsigned long end)
> > {
> > + unsigned long addr;
> > +
> > + lru_add_drain();
> > vma->vm_flags &= ~VM_LOCKED;
> > - __mlock_vma_pages_range(vma, start, end, 0);
> > +
> > + for (addr = start; addr < end; addr += PAGE_SIZE) {
> > + struct page *page = follow_page(vma, addr, FOLL_GET);
> > + if (page) {
> > + lock_page(page);
> > + if (page->mapping)
> > + munlock_vma_page(page);
>
> Could you add "please see __mlock_vma_pages_range() to see why" or some here ?

Why the test on page->mapping?
Right, I'll add some such comment on that.
Waiting a day or two to see what else comes up.

Thanks,
Hugh

2009-09-08 11:57:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE

On Tue, 8 Sep 2009, KAMEZAWA Hiroyuki wrote:
>
> A nitpick, but this was a concern you had shown, IIUC.
>
> == __get_user_pages()..
>
> if (pages) {
> pages[i] = page;
>
> flush_anon_page(vma, page, start);
> flush_dcache_page(page);
> }
> ==
>
> This part will call flush_dcache_page() even when ZERO_PAGE is found.
>
> Don't we need to mask this ?

No, it's okay to flush_dcache_page() on ZERO_PAGE: we always used to
do that there, and the arches I remember offhand won't do anything
with it anyway, once they see page->mapping NULL.

What you're remembering, that I did object to, was the way your
FOLL_NOZERO ended up doing
pages[i] = NULL;
flush_anon_page(vma, NULL, start);
flush_dcache_page(NULL);

which would cause an oops when those arches look at page->mapping.

I should take another look at your FOLL_NOZERO: I may have dismissed
it too quickly, after seeing that bug, and oopsing on x86 when
mlocking a readonly anonymous area.

Though I like that we don't _need_ to change mlock.c for reinstated
ZERO_PAGE, this morning I'm having trouble persuading myself that
mlocking a readonly anonymous area is too silly to optimize for.

Maybe the very people who persuaded you to bring back the anonymous
use of ZERO_PAGE, are also doing a huge mlock of the area first?
So if two or more are starting up at the same time on the same box,
more bouncing than is healthy (and more than they would have seen
in the old days of ZERO_PAGE but no lock_page on it there).

I'd like to persuade myself not to bother,
but may want to add a further patch for that later.

Hugh
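
The case being weighed above, as a minimal userspace sketch (illustrative only, not from the thread): mlocking a read-only anonymous area means __get_user_pages() takes and drops a reference on the shared ZERO_PAGE for every page in the range, which is where the cacheline bouncing would come from.

#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;		/* arbitrary size */
	void *p = mmap(NULL, len, PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	if (mlock(p, len))			/* every page here maps the ZERO_PAGE */
		return 2;
	munlock(p, len);
	munmap(p, len);
	return 0;
}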

2009-09-08 12:17:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE

On Tue, 8 Sep 2009, Nick Piggin wrote:
> On Mon, Sep 07, 2009 at 10:39:34PM +0100, Hugh Dickins wrote:
> > KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
> > advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
> > using in 2.6.24. And there were a couple of regression reports on LKML.
> >
> > Following suggestions from Linus, reinstate do_anonymous_page() use of
> > the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
> > with (map)count updates - let vm_normal_page() regard it as abnormal.
> >
> > Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
> > most powerpc): that's not essential, but minimizes additional branches
> > (keeping them in the unlikely pte_special case); and incidentally
> > excludes mips (some models of which needed eight colours of ZERO_PAGE
> > to avoid costly exceptions).
>
> Without looking closely, why is it a big problem to have a
> !HAVE PTE SPECIAL case? Couldn't it just be a check for
> pfn == zero_pfn that is conditionally compiled away for pte
> special architectures anyway?

Yes, I'm uncomfortable with that restriction too: it makes for
neater looking code in a couple of places, but it's not so good
for the architectures to diverge gratuitously there.

I'll give it a try without that restriction, see how it looks:
it was Linus who proposed the "special" approach, I'm sure he'll
speak up if he doesn't like how the alternative comes out.

Tucking the test away in an asm-generic macro, we can leave
the pain of a rangetest to the one mips case.

By the way, in compiling that list of "special" architectures,
I was surprised not to find ia64 amongst them. Not that it
matters to me, but I thought the Fujitsu guys were usually
keen on Itanium - do they realize that the special test is
excluding it, or do they have their own special patch for it?

>
> If zero page is such a good idea, I don't see the logic of
> limiting it like thisa. Your patch looks pretty clean though.
>
> At any rate, I think it might be an idea to cc linux-arch.

Yes, thanks.

Hugh
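
One shape the "asm-generic macro plus mips rangetest" idea could take - a sketch only, assuming a helper named is_zero_pfn() which is not in the patches as posted:

/* include/asm-generic/pgtable.h (default: a single zero pfn) */
#ifndef is_zero_pfn
static inline int is_zero_pfn(unsigned long pfn)
{
	extern unsigned long zero_pfn;
	return pfn == zero_pfn;
}
#endif

/*
 * arch/mips/include/asm/pgtable.h: mips keeps several coloured zero
 * pages behind its existing zero_page_mask, so it would range-test.
 */
#define is_zero_pfn is_zero_pfn
static inline int is_zero_pfn(unsigned long pfn)
{
	extern unsigned long zero_pfn;
	unsigned long offset_from_zero_pfn = pfn - zero_pfn;
	return offset_from_zero_pfn <= (zero_page_mask >> PAGE_SHIFT);
}

Callers in mm/memory.c would then test is_zero_pfn(pte_pfn(pte)) instead of comparing against zero_pfn directly.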

2009-09-08 14:14:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE



On Tue, 8 Sep 2009, Nick Piggin wrote:
>
> Without looking closely, why is it a big problem to have a
> !HAVE PTE SPECIAL case? Couldn't it just be a check for
> pfn == zero_pfn that is conditionally compiled away for pte
> special architectures anyway?

At least traditionally, there wasn't a single zero_pfn, but multiple (for
VIPT caches that have performance issues with aliases). But yeah, we could
check just the pfn number, and allow any architecture to do it.

Linus

2009-09-08 15:34:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE

On Tue, Sep 08, 2009 at 01:17:01PM +0100, Hugh Dickins wrote:
> On Tue, 8 Sep 2009, Nick Piggin wrote:
> > On Mon, Sep 07, 2009 at 10:39:34PM +0100, Hugh Dickins wrote:
> > > KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
> > > advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
> > > using in 2.6.24. And there were a couple of regression reports on LKML.
> > >
> > > Following suggestions from Linus, reinstate do_anonymous_page() use of
> > > the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
> > > with (map)count updates - let vm_normal_page() regard it as abnormal.
> > >
> > > Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
> > > most powerpc): that's not essential, but minimizes additional branches
> > > (keeping them in the unlikely pte_special case); and incidentally
> > > excludes mips (some models of which needed eight colours of ZERO_PAGE
> > > to avoid costly exceptions).
> >
> > Without looking closely, why is it a big problem to have a
> > !HAVE PTE SPECIAL case? Couldn't it just be a check for
> > pfn == zero_pfn that is conditionally compiled away for pte
> > special architectures anyway?
>
> Yes, I'm uncomfortable with that restriction too: it makes for
> neater looking code in a couple of places, but it's not so good
> for the architectures to diverge gratuitously there.
>
> I'll give it a try without that restriction, see how it looks:
> it was Linus who proposed the "special" approach, I'm sure he'll
> speak up if he doesn't like how the alternative comes out.

I guess using special is pretty neat and doesn't require an
additional branch in vm_normal_page paths. But I think it is
important to allow other architectures at least the _option_
to have equivalent behaviour to x86 here. So it would be
great if you would look into it.


> Tucking the test away in an asm-generic macro, we can leave
> the pain of a rangetest to the one mips case.
>
> By the way, in compiling that list of "special" architectures,
> I was surprised not to find ia64 amongst them. Not that it
> matters to me, but I thought the Fujitsu guys were usually
> keen on Itanium - do they realize that the special test is
> excluding it, or do they have their own special patch for it?

I don't understand your question. Are you asking whether they
know your patch will not enable zero pages on ia64?

I guess pte special was primarily driven by gup_fast, which in
turn was driven primarily by DB2 9.5, which I think might be
only available on x86 and ibm's architectures.

But I admit to being curious as to when I'll see a gup_fast
patch come out of SGI or HP or Fujitsu :)

2009-09-08 16:42:11

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE

On Tue, 8 Sep 2009, Nick Piggin wrote:
> On Tue, Sep 08, 2009 at 01:17:01PM +0100, Hugh Dickins wrote:
> > By the way, in compiling that list of "special" architectures,
> > I was surprised not to find ia64 amongst them. Not that it
> > matters to me, but I thought the Fujitsu guys were usually
> > keen on Itanium - do they realize that the special test is
> > excluding it, or do they have their own special patch for it?
>
> I don't understand your question. Are you asking whether they
> know your patch will not enable zero pages on ia64?

That's what I was meaning to ask, yes; but wondering whether
perhaps they've already got their own patch to enable pte_special
on ia64, and just haven't got around to sending it in yet.

>
> I guess pte special was primarily driven by gup_fast, which in
> turn was driven primarily by DB2 9.5, which I think might be
> only available on x86 and ibm's architectures.
>
> But I admit to being curious as to when I'll see a gup_fast
> patch come out of SGI or HP or Fujitsu :)

Yes, me too!

Hugh

2009-09-08 17:10:26

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 1/8] mm: munlock use follow_page

Hugh Dickins wrote:
> Hiroaki Wakabayashi points out that when mlock() has been interrupted
> by SIGKILL, the subsequent munlock() takes unnecessarily long because
> its use of __get_user_pages() insists on faulting in all the pages
> which mlock() never reached.

> Signed-off-by: Hugh Dickins <[email protected]>

Nice cleanup.

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-09-08 17:27:47

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2/8] mm: remove unused GUP flags

Hugh Dickins wrote:
> GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
> flags added solely to prevent __get_user_pages() from doing some of
> what it usually does, in the munlock case: we can now remove them.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-09-08 18:57:14

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 3/8] mm: add get_dump_page

Hugh Dickins wrote:
> In preparation for the next patch, add a simple get_dump_page(addr)
> interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
> get_user_pages() directly. They're not interested in errors: they
> just want to use holes as much as possible, to save space and make
> sure that the data is aligned where the headers said it would be.
>
> Oh, and don't use that horrid DUMP_SEEK(off) macro!
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-09-08 19:00:16

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 4/8] mm: FOLL_DUMP replace FOLL_ANON

Hugh Dickins wrote:
> The "FOLL_ANON optimization" and its use_zero_page() test have caused
> confusion and bugs: why does it test VM_SHARED? for the very good but
> unsatisfying reason that VMware crashed without. As we look to maybe
> reinstating anonymous use of the ZERO_PAGE, we need to sort this out.
>
> Easily done: it's silly for __get_user_pages() and follow_page() to
> be guessing whether it's safe to assume that they're being used for
> a coredump (which can take a shortcut snapshot where other uses must
> handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.
>
> get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-09-08 22:22:17

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 5/8] mm: follow_hugetlb_page flags

Hugh Dickins wrote:
> follow_hugetlb_page() shouldn't be guessing about the coredump case
> either: pass the foll_flags down to it, instead of just the write bit.
>
> Remove that obscure huge_zeropage_ok() test. The decision is easy,
> though unlike the non-huge case - here vm_ops->fault is always set.
> But we know that a fault would serve up zeroes, unless there's
> already a hugetlbfs pagecache page to back the range.
>
> (Alternatively, since hugetlb pages aren't swapped out under pressure,
> you could save more dump space by arguing that a page not yet faulted
> into this process cannot be relevant to the dump; but that would be
> more surprising.)
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-09-08 22:23:48

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 6/8] mm: fix anonymous dirtying

Hugh Dickins wrote:
> do_anonymous_page() has been wrong to dirty the pte regardless.
> If it's not going to mark the pte writable, then it won't help
> to mark it dirty here, and clogs up memory with pages which will
> need swap instead of being thrown away. Especially wrong if no
> overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
> we could exceed the limit and OOM despite no overcommit.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected]

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-09-08 23:35:20

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE

Hugh Dickins wrote:
> KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
> advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
> using in 2.6.24. And there were a couple of regression reports on LKML.
>
> Following suggestions from Linus, reinstate do_anonymous_page() use of
> the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
> with (map)count updates - let vm_normal_page() regard it as abnormal.
>
> Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
> most powerpc): that's not essential, but minimizes additional branches
> (keeping them in the unlikely pte_special case); and incidentally
> excludes mips (some models of which needed eight colours of ZERO_PAGE
> to avoid costly exceptions).
>
> Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
> callers won't want to make exceptions for it, so increment its count
> there. Changes to mlock and migration? happily seems not needed.
>
> In most places it's quicker to check pfn than struct page address:
> prepare a __read_mostly zero_pfn for that. Does get_dump_page()
> still need its ZERO_PAGE check? probably not, but keep it anyway.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-09-09 01:46:24

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE

On Tue, 8 Sep 2009 12:56:57 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> On Tue, 8 Sep 2009, KAMEZAWA Hiroyuki wrote:
> >
> > A nitpick, but this was a concern you had shown, IIUC.
> >
> > == __get_user_pages()..
> >
> > if (pages) {
> > pages[i] = page;
> >
> > flush_anon_page(vma, page, start);
> > flush_dcache_page(page);
> > }
> > ==
> >
> > This part will call flush_dcache_page() even when ZERO_PAGE is found.
> >
> > Don't we need to mask this ?
>
> No, it's okay to flush_dcache_page() on ZERO_PAGE: we always used to
> do that there, and the arches I remember offhand won't do anything
> with it anyway, once they see page->mapping NULL.
>
> What you're remembering, that I did object to, was the way your
> FOLL_NOZERO ended up doing
> pages[i] = NULL;
> flush_anon_page(vma, NULL, start);
> flush_dcache_page(NULL);
>
> which would cause an oops when those arches look at page->mapping.
>
> I should take another look at your FOLL_NOZERO: I may have dismissed
> it too quickly, after seeing that bug, and oopsing on x86 when
> mlocking a readonly anonymous area.
>
> Though I like that we don't _need_ to change mlock.c for reinstated
> ZERO_PAGE, this morning I'm having trouble persuading myself that
> mlocking a readonly anonymous area is too silly to optimize for.
>
> Maybe the very people who persuaded you to bring back the anonymous
> use of ZERO_PAGE, are also doing a huge mlock of the area first?
No, as far as I know, they'll not do a huge mlock.


Thanks,
-Kame

> So if two or more are starting up at the same time on the same box,
> more bouncing than is healthy (and more than they would have seen
> in the old days of ZERO_PAGE but no lock_page on it there).
>
> I'd like to persuade myself not to bother,
> but may want to add a further patch for that later.
>
> Hugh
>

2009-09-09 11:14:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/8] mm: FOLL_DUMP replace FOLL_ANON

On Mon, Sep 07, 2009 at 10:35:32PM +0100, Hugh Dickins wrote:
> The "FOLL_ANON optimization" and its use_zero_page() test have caused
> confusion and bugs: why does it test VM_SHARED? for the very good but
> unsatisfying reason that VMware crashed without. As we look to maybe
> reinstating anonymous use of the ZERO_PAGE, we need to sort this out.
>
> Easily done: it's silly for __get_user_pages() and follow_page() to
> be guessing whether it's safe to assume that they're being used for
> a coredump (which can take a shortcut snapshot where other uses must
> handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.
>
> get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Mel Gorman <[email protected]>

> ---
>
> include/linux/mm.h | 2 +-
> mm/internal.h | 1 +
> mm/memory.c | 43 ++++++++++++-------------------------------
> 3 files changed, 14 insertions(+), 32 deletions(-)
>
> --- mm3/include/linux/mm.h 2009-09-07 13:16:32.000000000 +0100
> +++ mm4/include/linux/mm.h 2009-09-07 13:16:39.000000000 +0100
> @@ -1247,7 +1247,7 @@ struct page *follow_page(struct vm_area_
> #define FOLL_WRITE 0x01 /* check pte is writable */
> #define FOLL_TOUCH 0x02 /* mark page accessed */
> #define FOLL_GET 0x04 /* do get_page on page */
> -#define FOLL_ANON 0x08 /* give ZERO_PAGE if no pgtable */
> +#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
>
> typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> void *data);
> --- mm3/mm/internal.h 2009-09-07 13:16:22.000000000 +0100
> +++ mm4/mm/internal.h 2009-09-07 13:16:39.000000000 +0100
> @@ -252,6 +252,7 @@ static inline void mminit_validate_memmo
>
> #define GUP_FLAGS_WRITE 0x01
> #define GUP_FLAGS_FORCE 0x02
> +#define GUP_FLAGS_DUMP 0x04
>
> int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int len, int flags,
> --- mm3/mm/memory.c 2009-09-07 13:16:32.000000000 +0100
> +++ mm4/mm/memory.c 2009-09-07 13:16:39.000000000 +0100
> @@ -1174,41 +1174,22 @@ no_page:
> pte_unmap_unlock(ptep, ptl);
> if (!pte_none(pte))
> return page;
> - /* Fall through to ZERO_PAGE handling */
> +
> no_page_table:
> /*
> * When core dumping an enormous anonymous area that nobody
> - * has touched so far, we don't want to allocate page tables.
> + * has touched so far, we don't want to allocate unnecessary pages or
> + * page tables. Return error instead of NULL to skip handle_mm_fault,
> + * then get_dump_page() will return NULL to leave a hole in the dump.
> + * But we can only make this optimization where a hole would surely
> + * be zero-filled if handle_mm_fault() actually did handle it.
> */
> - if (flags & FOLL_ANON) {
> - page = ZERO_PAGE(0);
> - if (flags & FOLL_GET)
> - get_page(page);
> - BUG_ON(flags & FOLL_WRITE);
> - }
> + if ((flags & FOLL_DUMP) &&
> + (!vma->vm_ops || !vma->vm_ops->fault))
> + return ERR_PTR(-EFAULT);
> return page;
> }
>
> -/* Can we do the FOLL_ANON optimization? */
> -static inline int use_zero_page(struct vm_area_struct *vma)
> -{
> - /*
> - * We don't want to optimize FOLL_ANON for make_pages_present()
> - * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
> - * we want to get the page from the page tables to make sure
> - * that we serialize and update with any other user of that
> - * mapping.
> - */
> - if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
> - return 0;
> - /*
> - * And if we have a fault routine, it's not an anonymous region.
> - */
> - return !vma->vm_ops || !vma->vm_ops->fault;
> -}
> -
> -
> -
> int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int nr_pages, int flags,
> struct page **pages, struct vm_area_struct **vmas)
> @@ -1288,8 +1269,8 @@ int __get_user_pages(struct task_struct
> foll_flags = FOLL_TOUCH;
> if (pages)
> foll_flags |= FOLL_GET;
> - if (!write && use_zero_page(vma))
> - foll_flags |= FOLL_ANON;
> + if (flags & GUP_FLAGS_DUMP)
> + foll_flags |= FOLL_DUMP;
>
> do {
> struct page *page;
> @@ -1447,7 +1428,7 @@ struct page *get_dump_page(unsigned long
> struct page *page;
>
> if (__get_user_pages(current, current->mm, addr, 1,
> - GUP_FLAGS_FORCE, &page, &vma) < 1)
> + GUP_FLAGS_FORCE | GUP_FLAGS_DUMP, &page, &vma) < 1)
> return NULL;
> if (page == ZERO_PAGE(0)) {
> page_cache_release(page);
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
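
To make the FOLL_DUMP / get_dump_page() division of labour concrete, a rough sketch of the consumer side - not the actual fs/binfmt_elf.c code; dump_write() and dump_seek() stand in for the dumpers' own helpers:

static int dump_vma_pages(struct file *file, struct vm_area_struct *vma)
{
	unsigned long addr;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		struct page *page = get_dump_page(addr);

		if (page) {			/* present, or worth faulting in */
			void *kaddr = kmap(page);
			int ok = dump_write(file, kaddr, PAGE_SIZE);

			kunmap(page);
			page_cache_release(page);
			if (!ok)
				return 0;
		} else if (!dump_seek(file, PAGE_SIZE)) {
			return 0;		/* hole: seeking keeps the dump sparse and aligned */
		}
	}
	return 1;
}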

2009-09-09 11:31:40

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 5/8] mm: follow_hugetlb_page flags

On Mon, Sep 07, 2009 at 10:37:14PM +0100, Hugh Dickins wrote:
> follow_hugetlb_page() shouldn't be guessing about the coredump case
> either: pass the foll_flags down to it, instead of just the write bit.
>
> Remove that obscure huge_zeropage_ok() test. The decision is easy,
> though unlike the non-huge case - here vm_ops->fault is always set.
> But we know that a fault would serve up zeroes, unless there's
> already a hugetlbfs pagecache page to back the range.
>
> (Alternatively, since hugetlb pages aren't swapped out under pressure,
> you could save more dump space by arguing that a page not yet faulted
> into this process cannot be relevant to the dump; but that would be
> more surprising.)
>

It would be more surprising. It's an implementation detail that hugetlb
pages cannot be swapped out and someone reading the dump shouldn't have
to be aware of it. It's better to treat non-faulted pages as if they
were zero-filled.

> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> include/linux/hugetlb.h | 4 +-
> mm/hugetlb.c | 62 ++++++++++++++++++++++----------------
> mm/memory.c | 14 ++++----
> 3 files changed, 48 insertions(+), 32 deletions(-)
>
> --- mm4/include/linux/hugetlb.h 2009-09-05 14:40:16.000000000 +0100
> +++ mm5/include/linux/hugetlb.h 2009-09-07 13:16:46.000000000 +0100
> @@ -24,7 +24,9 @@ int hugetlb_sysctl_handler(struct ctl_ta
> int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> -int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
> +int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> + struct page **, struct vm_area_struct **,
> + unsigned long *, int *, int, unsigned int flags);
> void unmap_hugepage_range(struct vm_area_struct *,
> unsigned long, unsigned long, struct page *);
> void __unmap_hugepage_range(struct vm_area_struct *,
> --- mm4/mm/hugetlb.c 2009-09-05 14:40:16.000000000 +0100
> +++ mm5/mm/hugetlb.c 2009-09-07 13:16:46.000000000 +0100
> @@ -2016,6 +2016,23 @@ static struct page *hugetlbfs_pagecache_
> return find_lock_page(mapping, idx);
> }
>
> +/* Return whether there is a pagecache page to back given address within VMA */
> +static bool hugetlbfs_backed(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long address)
> +{
> + struct address_space *mapping;
> + pgoff_t idx;
> + struct page *page;
> +
> + mapping = vma->vm_file->f_mapping;
> + idx = vma_hugecache_offset(h, vma, address);
> +
> + page = find_get_page(mapping, idx);
> + if (page)
> + put_page(page);
> + return page != NULL;
> +}
> +

It's a total nit-pick, but this is very similar to
hugetlbfs_pagecache_page(). It would have been nice to have them nearby
and called something like hugetlbfs_pagecache_present() or else reuse
the function and have the caller unlock_page but it's probably not worth
addressing.

> static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, pte_t *ptep, unsigned int flags)
> {
> @@ -2211,54 +2228,52 @@ follow_huge_pud(struct mm_struct *mm, un
> return NULL;
> }
>
> -static int huge_zeropage_ok(pte_t *ptep, int write, int shared)
> -{
> - if (!ptep || write || shared)
> - return 0;
> - else
> - return huge_pte_none(huge_ptep_get(ptep));
> -}
> -
> int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> struct page **pages, struct vm_area_struct **vmas,
> unsigned long *position, int *length, int i,
> - int write)
> + unsigned int flags)

Total aside, but in line with gfp_t flags, is there a case for having
foll_t type for FOLL_* ?

> {
> unsigned long pfn_offset;
> unsigned long vaddr = *position;
> int remainder = *length;
> struct hstate *h = hstate_vma(vma);
> - int zeropage_ok = 0;
> - int shared = vma->vm_flags & VM_SHARED;
>
> spin_lock(&mm->page_table_lock);
> while (vaddr < vma->vm_end && remainder) {
> pte_t *pte;
> + int absent;
> struct page *page;
>
> /*
> * Some archs (sparc64, sh*) have multiple pte_ts to
> - * each hugepage. We have to make * sure we get the
> + * each hugepage. We have to make sure we get the
> * first, for the page indexing below to work.
> */
> pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
> - if (huge_zeropage_ok(pte, write, shared))
> - zeropage_ok = 1;
> + absent = !pte || huge_pte_none(huge_ptep_get(pte));
> +
> + /*
> + * When coredumping, it suits get_dump_page if we just return
> + * an error if there's a hole and no huge pagecache to back it.
> + */
> + if (absent &&
> + ((flags & FOLL_DUMP) && !hugetlbfs_backed(h, vma, vaddr))) {
> + remainder = 0;
> + break;
> + }

Does this break an assumption of get_user_pages() whereby when there are
holes, the corresponding pages are NULL but the following pages are still
checked? I guess the caller is informed ultimately that the read was only
partial but offhand I don't know if that's generally expected or not.

Or is your comment saying that because the only caller using FOLL_DUMP is
get_dump_page() using an array of one page, it doesn't care and the case is
just not worth dealing with?

>
> - if (!pte ||
> - (huge_pte_none(huge_ptep_get(pte)) && !zeropage_ok) ||
> - (write && !pte_write(huge_ptep_get(pte)))) {
> + if (absent ||
> + ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
> int ret;
>
> spin_unlock(&mm->page_table_lock);
> - ret = hugetlb_fault(mm, vma, vaddr, write);
> + ret = hugetlb_fault(mm, vma, vaddr,
> + (flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> spin_lock(&mm->page_table_lock);
> if (!(ret & VM_FAULT_ERROR))
> continue;
>
> remainder = 0;
> - if (!i)
> - i = -EFAULT;
> break;
> }
>
> @@ -2266,10 +2281,7 @@ int follow_hugetlb_page(struct mm_struct
> page = pte_page(huge_ptep_get(pte));
> same_page:
> if (pages) {
> - if (zeropage_ok)
> - pages[i] = ZERO_PAGE(0);
> - else
> - pages[i] = mem_map_offset(page, pfn_offset);
> + pages[i] = mem_map_offset(page, pfn_offset);
> get_page(pages[i]);
> }
>
> @@ -2293,7 +2305,7 @@ same_page:
> *length = remainder;
> *position = vaddr;
>
> - return i;
> + return i ? i : -EFAULT;
> }
>
> void hugetlb_change_protection(struct vm_area_struct *vma,
> --- mm4/mm/memory.c 2009-09-07 13:16:39.000000000 +0100
> +++ mm5/mm/memory.c 2009-09-07 13:16:46.000000000 +0100
> @@ -1260,17 +1260,19 @@ int __get_user_pages(struct task_struct
> !(vm_flags & vma->vm_flags))
> return i ? : -EFAULT;
>
> - if (is_vm_hugetlb_page(vma)) {
> - i = follow_hugetlb_page(mm, vma, pages, vmas,
> - &start, &nr_pages, i, write);
> - continue;
> - }
> -
> foll_flags = FOLL_TOUCH;
> if (pages)
> foll_flags |= FOLL_GET;
> if (flags & GUP_FLAGS_DUMP)
> foll_flags |= FOLL_DUMP;
> + if (write)
> + foll_flags |= FOLL_WRITE;
> +
> + if (is_vm_hugetlb_page(vma)) {
> + i = follow_hugetlb_page(mm, vma, pages, vmas,
> + &start, &nr_pages, i, foll_flags);
> + continue;
> + }
>
> do {
> struct page *page;
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
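
On Mel's foll_t aside: a sketch of what a sparse-checked flag type could look like, modelled on gfp_t - purely hypothetical, nothing like it is in this series:

typedef unsigned int __bitwise foll_t;

#define FOLL_WRITE	((__force foll_t)0x01)	/* check pte is writable */
#define FOLL_TOUCH	((__force foll_t)0x02)	/* mark page accessed */
#define FOLL_GET	((__force foll_t)0x04)	/* do get_page on page */
#define FOLL_DUMP	((__force foll_t)0x08)	/* give error on hole if it would be zero */

struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
			 foll_t foll_flags);

With that, sparse would warn if a FAULT_FLAG_* or VM_FAULT_* value were passed where a foll_t is expected - the kind of mixup mentioned later in the thread.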

2009-09-09 15:59:55

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 1/8] mm: munlock use follow_page

On Tue, Sep 8, 2009 at 6:29 AM, Hugh Dickins <[email protected]> wrote:
> Hiroaki Wakabayashi points out that when mlock() has been interrupted
> by SIGKILL, the subsequent munlock() takes unnecessarily long because
> its use of __get_user_pages() insists on faulting in all the pages
> which mlock() never reached.
>
> It's worse than slowness if mlock() is terminated by Out Of Memory kill:
> the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the
> pages which mlock() could not find memory for; so innocent bystanders are
> killed too, and perhaps the system hangs.
>
> __get_user_pages() does a lot that's silly for munlock(): so remove the
> munlock option from __mlock_vma_pages_range(), and use a simple loop of
> follow_page()s in munlock_vma_pages_range() instead; ignoring absent
> pages, and not marking present pages as accessed or dirty.
>
> (Change munlock() to only go so far as mlock() reached? That does not
> work out, given the convention that mlock() claims complete success even
> when it has to give up early - in part so that an underlying file can be
> extended later, and those pages locked which earlier would give SIGBUS.)
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected]
Reviewed-by: Minchan Kim <[email protected]>
It looks better than Hiroaki's earlier version.

--
Kind regards,
Minchan Kim

2009-09-09 16:16:58

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 4/8] mm: FOLL_DUMP replace FOLL_ANON

On Tue, Sep 8, 2009 at 6:35 AM, Hugh Dickins <[email protected]> wrote:
> The "FOLL_ANON optimization" and its use_zero_page() test have caused
> confusion and bugs: why does it test VM_SHARED? for the very good but
> unsatisfying reason that VMware crashed without. As we look to maybe
> reinstating anonymous use of the ZERO_PAGE, we need to sort this out.
>
> Easily done: it's silly for __get_user_pages() and follow_page() to
> be guessing whether it's safe to assume that they're being used for
> a coredump (which can take a shortcut snapshot where other uses must
> handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.
>
> get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
>
> Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Just a nitpick below. :)

> ---
>
> include/linux/mm.h | 2 +-
> mm/internal.h | 1 +
> mm/memory.c | 43 ++++++++++++-------------------------------
> 3 files changed, 14 insertions(+), 32 deletions(-)
>
> --- mm3/include/linux/mm.h 2009-09-07 13:16:32.000000000 +0100
> +++ mm4/include/linux/mm.h 2009-09-07 13:16:39.000000000 +0100
> @@ -1247,7 +1247,7 @@ struct page *follow_page(struct vm_area_
> #define FOLL_WRITE 0x01 /* check pte is writable */
> #define FOLL_TOUCH 0x02 /* mark page accessed */
> #define FOLL_GET 0x04 /* do get_page on page */
> -#define FOLL_ANON 0x08 /* give ZERO_PAGE if no pgtable */
> +#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
>
> typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> void *data);
> --- mm3/mm/internal.h 2009-09-07 13:16:22.000000000 +0100
> +++ mm4/mm/internal.h 2009-09-07 13:16:39.000000000 +0100
> @@ -252,6 +252,7 @@ static inline void mminit_validate_memmo
>
> #define GUP_FLAGS_WRITE 0x01
> #define GUP_FLAGS_FORCE 0x02
> +#define GUP_FLAGS_DUMP 0x04
>
> int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int len, int flags,
> --- mm3/mm/memory.c 2009-09-07 13:16:32.000000000 +0100
> +++ mm4/mm/memory.c 2009-09-07 13:16:39.000000000 +0100
> @@ -1174,41 +1174,22 @@ no_page:
> pte_unmap_unlock(ptep, ptl);
> if (!pte_none(pte))
> return page;
> - /* Fall through to ZERO_PAGE handling */
> +
> no_page_table:
> /*
> * When core dumping an enormous anonymous area that nobody
> - * has touched so far, we don't want to allocate page tables.
> + * has touched so far, we don't want to allocate unnecessary pages or
> + * page tables. Return error instead of NULL to skip handle_mm_fault,
> + * then get_dump_page() will return NULL to leave a hole in the dump.
> + * But we can only make this optimization where a hole would surely
> + * be zero-filled if handle_mm_fault() actually did handle it.
> */
> - if (flags & FOLL_ANON) {
> - page = ZERO_PAGE(0);
> - if (flags & FOLL_GET)
> - get_page(page);
> - BUG_ON(flags & FOLL_WRITE);
> - }
> + if ((flags & FOLL_DUMP) &&
> + (!vma->vm_ops || !vma->vm_ops->fault))

How about adding comment about zero page use?

> + return ERR_PTR(-EFAULT);
> return page;
> }


--
Kind regards,
Minchan Kim

2009-09-10 00:33:05

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/8] mm: around get_user_pages flags

> > Here's a series of mm mods against current mmotm: mostly cleanup
> > of get_user_pages flags, but fixing munlock's OOM, sorting out the
> > "FOLL_ANON optimization", and reinstating ZERO_PAGE along the way.
> >
> > fs/binfmt_elf.c | 42 ++------
> > fs/binfmt_elf_fdpic.c | 56 ++++-------
> > include/linux/hugetlb.h | 4
> > include/linux/mm.h | 4
> > mm/hugetlb.c | 62 +++++++------
> > mm/internal.h | 7 -
> > mm/memory.c | 180 +++++++++++++++++++++++---------------
> > mm/mlock.c | 99 ++++++++------------
> > mm/nommu.c | 22 ++--
> > 9 files changed, 235 insertions(+), 241 deletions(-)
>
> Great!
> I'll start to test this patch series. Thanks Hugh!!

At least, my 24-hour stress workload test didn't find any problem.
I'll continue testing.

2009-09-11 11:12:50

by Hiroaki Wakabayashi

[permalink] [raw]
Subject: Re: [PATCH 1/8] mm: munlock use follow_page

2009/9/8 Hugh Dickins <[email protected]>:
> Hiroaki Wakabayashi points out that when mlock() has been interrupted
> by SIGKILL, the subsequent munlock() takes unnecessarily long because
> its use of __get_user_pages() insists on faulting in all the pages
> which mlock() never reached.
>
> It's worse than slowness if mlock() is terminated by Out Of Memory kill:
> the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the
> pages which mlock() could not find memory for; so innocent bystanders are
> killed too, and perhaps the system hangs.
>
> __get_user_pages() does a lot that's silly for munlock(): so remove the
> munlock option from __mlock_vma_pages_range(), and use a simple loop of
> follow_page()s in munlock_vma_pages_range() instead; ignoring absent
> pages, and not marking present pages as accessed or dirty.
>
> (Change munlock() to only go so far as mlock() reached? That does not
> work out, given the convention that mlock() claims complete success even
> when it has to give up early - in part so that an underlying file can be
> extended later, and those pages locked which earlier would give SIGBUS.)
>
> Signed-off-by: Hugh Dickins <[email protected]>

Reviewed-by: Hiroaki Wakabayashi <[email protected]>

It's very simple and so cool! I have learned something.

--
Thanks,
Hiroaki Wakabayashi

2009-09-13 15:36:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 5/8] mm: follow_hugetlb_page flags

On Wed, 9 Sep 2009, Mel Gorman wrote:
> On Mon, Sep 07, 2009 at 10:37:14PM +0100, Hugh Dickins wrote:
> >
> > (Alternatively, since hugetlb pages aren't swapped out under pressure,
> > you could save more dump space by arguing that a page not yet faulted
> > into this process cannot be relevant to the dump; but that would be
> > more surprising.)
>
> It would be more surprising. It's an implementation detail that hugetlb
> pages cannot be swapped out and someone reading the dump shouldn't have
> to be aware of it. It's better to treat non-faulted pages as if they
> were zero-filled.

Oh sure, I did mean that the non-faulted pages should be zero-filled,
just stored (on those filesystems which support them) by holes in the
file instead of zero-filled blocks (just as the dump tries to do with
other zero pages). It would mess up the alignment with ELF headers
to leave them out completely.

But it would still be a change in convention which might surprise
someone (pages of hugetlb file in the dump appearing as zeroed where
the underlying hugetlb file is known to contain non-zero data), and
there's already hugetlb dump filters for saving space on those areas.
So I'm not anxious to pursue that parenthetical alternative, just
admitting that we've got a choice of what to do here.

> > @@ -2016,6 +2016,23 @@ static struct page *hugetlbfs_pagecache_
> > return find_lock_page(mapping, idx);
> > }
> >
> > +/* Return whether there is a pagecache page to back given address within VMA */
> > +static bool hugetlbfs_backed(struct hstate *h,
> > + struct vm_area_struct *vma, unsigned long address)
> > +{
> > + struct address_space *mapping;
> > + pgoff_t idx;
> > + struct page *page;
> > +
> > + mapping = vma->vm_file->f_mapping;
> > + idx = vma_hugecache_offset(h, vma, address);
> > +
> > + page = find_get_page(mapping, idx);
> > + if (page)
> > + put_page(page);
> > + return page != NULL;
> > +}
> > +
>
> It's a total nit-pick, but this is very similar to
> hugetlbfs_pagecache_page(). It would have been nice to have them nearby

Indeed! That's why I placed it just after hugetlbfs_pagecache_page ;)

> and called something like hugetlbfs_pagecache_present()

Can call it that if you prefer, either name suits me.

> or else reuse
> the function and have the caller unlock_page but it's probably not worth
> addressing.

I did originally want to do it that way, but the caller is holding
page_table_lock, so cannot lock_page there.

> > int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > struct page **pages, struct vm_area_struct **vmas,
> > unsigned long *position, int *length, int i,
> > - int write)
> > + unsigned int flags)
>
> Total aside, but in line with gfp_t flags, is there a case for having
> foll_t type for FOLL_* ?

Perhaps some case, but it's the wrong side of my boredom threshold!
Even get_user_pages is much less widely used than the functions where
gfp flags and page order were getting muddled up. (foll_t itself
would not have helped, but maybe such a change would have saved me time
debugging the hang in an earlier version of this patch: eventually I saw
I was passing VM_FAULT_WRITE instead of FAULT_FLAG_WRITE to hugetlb_fault.)

> > + /*
> > + * When coredumping, it suits get_dump_page if we just return
> > + * an error if there's a hole and no huge pagecache to back it.
> > + */
> > + if (absent &&
> > + ((flags & FOLL_DUMP) && !hugetlbfs_backed(h, vma, vaddr))) {
> > + remainder = 0;
> > + break;
> > + }
>
> Does this break an assumption of get_user_pages() whereby when there are
> holes, the corresponding pages are NULL but the following pages are still
> checked? I guess the caller is informed ultimately that the read was only
> partial but offhand I don't know if that's generally expected or not.

Sorry, I don't understand. get_user_pages() doesn't return any NULL
pages within the count it says was successful - Kamezawa-san had a patch
and flag which did so, and we might go that way, but it's not the case
at present is it? And follow_hugetlb_page() seems to be setting every
pages[i] within the count to something non-NULL.

>
> Or is your comment saying that because the only caller using FOLL_DUMP is
> get_dump_page() using an array of one page, it doesn't care and the case is
> just not worth dealing with?

Yes, that's more like it, but what case? Oh, the case where first pages
are okay, then we hit a hole. Right, that case doesn't actually arise
with FOLL_DUMP because of its sole user.

Perhaps my comment confuses because at first I had BUG_ON(remainder != 1)
in there, and wrote that comment, and returned -EFAULT; then later moved
the "i? i: -EFAULT" business down to the bottom and couldn't see any need
to assume remainder 1 any more. But the comment on "error" rather than
"error or short count" remains. But if I do change that to "error or
short count" it'll be a bit odd, because in practice it is always error.

But it does seem that we've confused each other: what to say instead?

Hugh
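
On the holes-versus-zero-filled-blocks point above, a small standalone illustration (filename arbitrary, not from the thread): seeking past unwritten pages leaves holes that cost no disk blocks on filesystems which support them, yet read back as zeroes - which is what the dumper relies on.

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("sparse-demo", O_CREAT | O_TRUNC | O_WRONLY, 0600);
	char byte = 1;

	if (fd < 0)
		return 1;
	lseek(fd, 1 << 20, SEEK_SET);	/* leave a 1MB hole */
	write(fd, &byte, 1);		/* only this block gets allocated */
	close(fd);
	return 0;			/* compare du and ls -l afterwards */
}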

2009-09-13 15:46:58

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 4/8] mm: FOLL_DUMP replace FOLL_ANON

On Thu, 10 Sep 2009, Minchan Kim wrote:
> >        /*
> >         * When core dumping an enormous anonymous area that nobody
> > -        * has touched so far, we don't want to allocate page tables.
> > +        * has touched so far, we don't want to allocate unnecessary pages or
> > +        * page tables.  Return error instead of NULL to skip handle_mm_fault,
> > +        * then get_dump_page() will return NULL to leave a hole in the dump.
> > +        * But we can only make this optimization where a hole would surely
> > +        * be zero-filled if handle_mm_fault() actually did handle it.
> >         */
> > -       if (flags & FOLL_ANON) {
> > -               page = ZERO_PAGE(0);
> > -               if (flags & FOLL_GET)
> > -                       get_page(page);
> > -               BUG_ON(flags & FOLL_WRITE);
> > -       }
> > +       if ((flags & FOLL_DUMP) &&
> > +           (!vma->vm_ops || !vma->vm_ops->fault))
>
> How about adding comment about zero page use?

What kind of comment did you have in mind?
We used to use ZERO_PAGE there, but with this patch we're not using it.
I thought the comment above describes what we're doing well enough.

I may have kept too quiet about ZERO_PAGEs, knowing that a later patch
was going to change the story; but I don't see what needs saying here.

Hugh

2009-09-13 23:05:11

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 4/8] mm: FOLL_DUMP replace FOLL_ANON

Hi, Hugh.

On Sun, 13 Sep 2009 16:46:12 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> On Thu, 10 Sep 2009, Minchan Kim wrote:
> > >        /*
> > >         * When core dumping an enormous anonymous area that nobody
> > > -        * has touched so far, we don't want to allocate page tables.
> > > +        * has touched so far, we don't want to allocate unnecessary pages or
> > > +        * page tables.  Return error instead of NULL to skip handle_mm_fault,
> > > +        * then get_dump_page() will return NULL to leave a hole in the dump.
> > > +        * But we can only make this optimization where a hole would surely
> > > +        * be zero-filled if handle_mm_fault() actually did handle it.
> > >         */
> > > -       if (flags & FOLL_ANON) {
> > > -               page = ZERO_PAGE(0);
> > > -               if (flags & FOLL_GET)
> > > -                       get_page(page);
> > > -               BUG_ON(flags & FOLL_WRITE);
> > > -       }
> > > +       if ((flags & FOLL_DUMP) &&
> > > +           (!vma->vm_ops || !vma->vm_ops->fault))
> >
> > How about adding comment about zero page use?
>
> What kind of comment did you have in mind?
> We used to use ZERO_PAGE there, but with this patch we're not using it.
> I thought the comment above describes what we're doing well enough.
>
> I may have kept too quiet about ZERO_PAGEs, knowing that a later patch
> was going to change the story; but I don't see what needs saying here.

I meant the following line.

> > > +       if ((flags & FOLL_DUMP) &&
> > > +           (!vma->vm_ops || !vma->vm_ops->fault))

Why do we care about anonymous vmas and FOLL_DUMP here?
Yeah, the comment above mentions page tables.
But I think someone looking at this for the first time can't easily see why.

If you think the comment is enough, I don't mind it.

> Hugh


--
Kind regards,
Minchan Kim
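
If it helps, one way the extra sentence Minchan is asking for could read - wording is only a suggestion, not what was merged:

	/*
	 * ...existing comment...
	 * A VMA without vm_ops->fault is (in practice) anonymous: any
	 * hole that handle_mm_fault() filled in would read back as
	 * zeroes, so the dump can safely show a hole instead of
	 * faulting pages in.
	 */
	if ((flags & FOLL_DUMP) &&
	    (!vma->vm_ops || !vma->vm_ops->fault))
		return ERR_PTR(-EFAULT);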

2009-09-14 13:27:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 5/8] mm: follow_hugetlb_page flags

On Sun, Sep 13, 2009 at 04:35:44PM +0100, Hugh Dickins wrote:
> On Wed, 9 Sep 2009, Mel Gorman wrote:
> > On Mon, Sep 07, 2009 at 10:37:14PM +0100, Hugh Dickins wrote:
> > >
> > > (Alternatively, since hugetlb pages aren't swapped out under pressure,
> > > you could save more dump space by arguing that a page not yet faulted
> > > into this process cannot be relevant to the dump; but that would be
> > > more surprising.)
> >
> > It would be more surprising. It's an implementation detail that hugetlb
> > pages cannot be swapped out and someone reading the dump shouldn't have
> > to be aware of it. It's better to treat non-faulted pages as if they
> > were zero-filled.
>
> Oh sure, I did mean that the non-faulted pages should be zero-filled,
> just stored (on those filesystems which support them) by holes in the
> file instead of zero-filled blocks (just as the dump tries to do with
> other zero pages). It would mess up the alignment with ELF headers
> to leave them out completely.
>

Oh right, now I get you.

> But it would still be a change in convention which might surprise
> someone (pages of hugetlb file in the dump appearing as zeroed where
> the underlying hugetlb file is known to contain non-zero data), and
> there's already hugetlb dump filters for saving space on those areas.
> So I'm not anxious to pursue that parenthetical alternative, just
> admitting that we've got a choice of what to do here.
>

Grand.

> > > @@ -2016,6 +2016,23 @@ static struct page *hugetlbfs_pagecache_
> > > return find_lock_page(mapping, idx);
> > > }
> > >
> > > +/* Return whether there is a pagecache page to back given address within VMA */
> > > +static bool hugetlbfs_backed(struct hstate *h,
> > > + struct vm_area_struct *vma, unsigned long address)
> > > +{
> > > + struct address_space *mapping;
> > > + pgoff_t idx;
> > > + struct page *page;
> > > +
> > > + mapping = vma->vm_file->f_mapping;
> > > + idx = vma_hugecache_offset(h, vma, address);
> > > +
> > > + page = find_get_page(mapping, idx);
> > > + if (page)
> > > + put_page(page);
> > > + return page != NULL;
> > > +}
> > > +
> >
> > It's a total nit-pick, but this is very similar to
> > hugetlbfs_pagecache_page(). It would have been nice to have them nearby
>
> Indeed! That's why I placed it just after hugetlbfs_pagecache_page ;)
>

Oops, sorry.

> > and called something like hugetlbfs_pagecache_present()
>
> Can call it that if you prefer, either name suits me.
>

I don't feel strongly enough to ask for a new version. If this is not
the final version that is merged, then a name-change would be nice.
Otherwise, it's not worth the hassle.

> > or else reuse
> > the function and have the caller unlock_page but it's probably not worth
> > addressing.
>
> I did originally want to do it that way, but the caller is holding
> page_table_lock, so cannot lock_page there.
>

Gack, fair point. If there is another version, a comment to that effect
wouldn't hurt.

> > > int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > > struct page **pages, struct vm_area_struct **vmas,
> > > unsigned long *position, int *length, int i,
> > > - int write)
> > > + unsigned int flags)
> >
> > Total aside, but in line with gfp_t flags, is there a case for having
> > foll_t type for FOLL_* ?
>
> Perhaps some case, but it's the wrong side of my boredom threshold!
> Even get_user_pages is much less widely used than the functions where
> gfp flags and page order were getting muddled up. (foll_t itself
> would not have helped, but maybe such a change would have saved me time
> debugging the hang in an earlier version of this patch: eventually I saw
> I was passing VM_FAULT_WRITE instead of FAULT_FLAG_WRITE to hugetlb_fault.)
>

I guess it's something to have on the back-boiler. If bugs of that
nature happen a few times, then the effort would be justified. As you
say, the gfp flags are much wider used.

> > > + /*
> > > + * When coredumping, it suits get_dump_page if we just return
> > > + * an error if there's a hole and no huge pagecache to back it.
> > > + */
> > > + if (absent &&
> > > + ((flags & FOLL_DUMP) && !hugetlbfs_backed(h, vma, vaddr))) {
> > > + remainder = 0;
> > > + break;
> > > + }
> >
> > Does this break an assumption of get_user_pages() whereby when there are
> > holes, the corresponding pages are NULL but the following pages are still
> > checked? I guess the caller is informed ultimately that the read was only
> > partial but offhand I don't know if that's generally expected or not.
>
> Sorry, I don't understand. get_user_pages() doesn't return any NULL
> pages within the count it says was successful - Kamezawa-san had a patch
> and flag which did so, and we might go that way, but it's not the case
> at present is it?

No, it's not but for some reason, I thought it was. On re-examination,
what you are doing makes sense for the current implementation.

> And follow_hugetlb_page() seems to be setting every
> pages[i] within the count to something non-NULL.
>
> >
> > Or is your comment saying that because the only caller using FOLL_DUMP is
> > get_dump_page() using an array of one page, it doesn't care and the case is
> > just not worth dealing with?
>
> Yes, that's more like it, but what case? Oh, the case where first pages
> are okay, then we hit a hole. Right, that case doesn't actually arise
> with FOLL_DUMP because of its sole user.
>

And nothing else other than core dumping will be using FOLL_DUMP so
there should be no assumptions broken.

> Perhaps my comment confuses because at first I had BUG_ON(remainder != 1)
> in there, and wrote that comment, and returned -EFAULT; then later moved
> the "i? i: -EFAULT" business down to the bottom and couldn't see any need
> to assume remainder 1 any more. But the comment on "error" rather than
> "error or short count" remains. But if I do change that to "error or
> short count" it'll be a bit odd, because in practice it is always error.
>
> But it does seem that we've confused each other: what to say instead?
>

/*
* When core-dumping, it suits get_dump_page() if an error is
* returned if there is a hole and no huge pagecache to back it.
* get_dump_page() is concerned with individual pages and by
* returning the page as an error, the core dump file still gets
* zeros but a hugepage allocation is avoided.
*/

?

Sorry for the noise, my review wasn't as careful as it should have been.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-15 20:15:59

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 7/8] mm: reinstate ZERO_PAGE

On Wed, 9 Sep 2009, KAMEZAWA Hiroyuki wrote:
> On Tue, 8 Sep 2009 12:56:57 +0100 (BST)
> Hugh Dickins <[email protected]> wrote:
> >
> > Though I like that we don't _need_ to change mlock.c for reinstated
> > ZERO_PAGE, this morning I'm having trouble persuading myself that
> > mlocking a readonly anonymous area is too silly to optimize for.
> >
> > Maybe the very people who persuaded you to bring back the anonymous
> > use of ZERO_PAGE, are also doing a huge mlock of the area first?
> No, as far as I know, they'll not do huge mlock.

Thanks for that (not entirely conclusive!) info.

I've decided for the moment to make a couple of further optimizations
in mlock.c (along with the comment you asked for), and leave adding
another __get_user_pages flag until a case for it is demonstrated.

Small patch series following shortly.

Hugh

2009-09-15 20:17:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 0/8] mm: around get_user_pages flags

On Thu, 10 Sep 2009, KOSAKI Motohiro wrote:
> >
> > Great!
> > I'll start to test this patch series. Thanks Hugh!!
>
> At least, my 24-hour stress workload test didn't find any problem.
> I'll continue testing.

Thanks a lot for that: four more patches coming shortly,
but they shouldn't affect what you've found so far.

Hugh

2009-09-15 20:26:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 5/8] mm: follow_hugetlb_page flags

On Mon, 14 Sep 2009, Mel Gorman wrote:
> On Sun, Sep 13, 2009 at 04:35:44PM +0100, Hugh Dickins wrote:
> > On Wed, 9 Sep 2009, Mel Gorman wrote:
>
> > > and called something like hugetlbfs_pagecache_present()
> >
> > Can call it that if you prefer, either name suits me.
>
> I don't feel strongly enough to ask for a new version. If this is not
> the final version that is merged, then a name-change would be nice.
> Otherwise, it's not worth the hassle.

You've raised several points, so worth a patch on top to keep you sweet!

> > > or else reuse
> > > the function and have the caller unlock_page but it's probably not worth
> > > addressing.
> >
> > I did originally want to do it that way, but the caller is holding
> > page_table_lock, so cannot lock_page there.
>
> Gack, fair point. If there is another version, a comment to that effect
> wouldn't hurt.

Righto, done.

> And nothing else other than core dumping will be using FOLL_DUMP so
> there should be no assumptions broken.

You have no idea of the depths of depravity to which I might sink:
see patch 1/4 in the coming group, you might be inclined to protest.

> > But it does seem that we've confused each other: what to say instead?
>
> /*
> * When core-dumping, it suits get_dump_page() if an error is
> * returned if there is a hole and no huge pagecache to back it.
> * get_dump_page() is concerned with individual pages and by
> * returning the page as an error, the core dump file still gets
> * zeros but a hugepage allocation is avoided.
> */

I've added a sentence to that comment, not quite what you've
suggested above, but something I hope makes it clearer.

Hugh

2009-09-15 20:30:57

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 0/4] mm: mlock, hugetlb, zero followups

Here's a gang of four patches against current mmotm, following
on from the eight around get_user_pages flags, addressing
concerns raised on those. Best slotted in as a group after
mm-foll-flags-for-gup-flags.patch

arch/mips/include/asm/pgtable.h | 14 ++++++++
mm/hugetlb.c | 16 ++++++---
mm/internal.h | 3 +
mm/memory.c | 37 +++++++++++++++-------
mm/mlock.c | 49 ++++++++++++++++++++++--------
mm/page_alloc.c | 1
6 files changed, 89 insertions(+), 31 deletions(-)

Thanks,
Hugh

2009-09-15 20:32:36

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 1/4] mm: m(un)lock avoid ZERO_PAGE

I'm still reluctant to clutter __get_user_pages() with another flag,
just to avoid touching ZERO_PAGE count in mlock(); though we can add
that later if it shows up as an issue in practice.

But when mlocking, we can test page->mapping slightly earlier, to avoid
the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock
didn't lock_page in olden ZERO_PAGE days, so we might have regressed.

And when munlocking, it turns out that FOLL_DUMP coincidentally does
what's needed to avoid all updates to ZERO_PAGE, so use that here also.
Plus add comment suggested by KAMEZAWA Hiroyuki.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/mlock.c | 49 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 36 insertions(+), 13 deletions(-)

--- mm0/mm/mlock.c 2009-09-14 16:34:37.000000000 +0100
+++ mm1/mm/mlock.c 2009-09-15 17:32:03.000000000 +0100
@@ -198,17 +198,26 @@ static long __mlock_vma_pages_range(stru
for (i = 0; i < ret; i++) {
struct page *page = pages[i];

- lock_page(page);
- /*
- * Because we lock page here and migration is blocked
- * by the elevated reference, we need only check for
- * file-cache page truncation. This page->mapping
- * check also neatly skips over the ZERO_PAGE(),
- * though if that's common we'd prefer not to lock it.
- */
- if (page->mapping)
- mlock_vma_page(page);
- unlock_page(page);
+ if (page->mapping) {
+ /*
+ * That preliminary check is mainly to avoid
+ * the pointless overhead of lock_page on the
+ * ZERO_PAGE: which might bounce very badly if
+ * there is contention. However, we're still
+ * dirtying its cacheline with get/put_page:
+ * we'll add another __get_user_pages flag to
+ * avoid it if that case turns out to matter.
+ */
+ lock_page(page);
+ /*
+ * Because we lock page here and migration is
+ * blocked by the elevated reference, we need
+ * only check for file-cache page truncation.
+ */
+ if (page->mapping)
+ mlock_vma_page(page);
+ unlock_page(page);
+ }
put_page(page); /* ref from get_user_pages() */
}

@@ -309,9 +318,23 @@ void munlock_vma_pages_range(struct vm_a
vma->vm_flags &= ~VM_LOCKED;

for (addr = start; addr < end; addr += PAGE_SIZE) {
- struct page *page = follow_page(vma, addr, FOLL_GET);
- if (page) {
+ struct page *page;
+ /*
+ * Although FOLL_DUMP is intended for get_dump_page(),
+ * it just so happens that its special treatment of the
+ * ZERO_PAGE (returning an error instead of doing get_page)
+ * suits munlock very well (and if somehow an abnormal page
+ * has sneaked into the range, we won't oops here: great).
+ */
+ page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+ if (page && !IS_ERR(page)) {
lock_page(page);
+ /*
+ * Like in __mlock_vma_pages_range(),
+ * because we lock page here and migration is
+ * blocked by the elevated reference, we need
+ * only check for file-cache page truncation.
+ */
if (page->mapping)
munlock_vma_page(page);
unlock_page(page);
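
For reference, a minimal sketch (not part of the patch) of the calling
convention the munlock loop above relies on: with FOLL_GET | FOLL_DUMP,
follow_page() hands back either a real page with a reference held, or
NULL / an ERR_PTR (ZERO_PAGE, absent page, abnormal page) with no
reference taken. The helper name munlock_one_page is invented for
illustration and assumes it sits in mm/mlock.c, where munlock_vma_page()
and the pagemap helpers are already visible:

static void munlock_one_page(struct vm_area_struct *vma, unsigned long addr)
{
	struct page *page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);

	if (!page || IS_ERR(page))
		return;			/* no reference taken, nothing to munlock */

	lock_page(page);
	if (page->mapping)		/* skip truncated file-cache pages */
		munlock_vma_page(page);
	unlock_page(page);
	put_page(page);			/* drop the reference follow_page() took */
}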

2009-09-15 20:33:51

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 2/4] mm: hugetlbfs_pagecache_present

Rename hugetlbfs_backed() to hugetlbfs_pagecache_present()
and add more comments, as suggested by Mel Gorman.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/hugetlb.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

--- mm1/mm/hugetlb.c 2009-09-14 16:34:37.000000000 +0100
+++ mm2/mm/hugetlb.c 2009-09-15 17:32:12.000000000 +0100
@@ -2016,8 +2016,11 @@ static struct page *hugetlbfs_pagecache_
return find_lock_page(mapping, idx);
}

-/* Return whether there is a pagecache page to back given address within VMA */
-static bool hugetlbfs_backed(struct hstate *h,
+/*
+ * Return whether there is a pagecache page to back given address within VMA.
+ * Caller follow_hugetlb_page() holds page_table_lock so we cannot lock_page.
+ */
+static bool hugetlbfs_pagecache_present(struct hstate *h,
struct vm_area_struct *vma, unsigned long address)
{
struct address_space *mapping;
@@ -2254,10 +2257,13 @@ int follow_hugetlb_page(struct mm_struct

/*
* When coredumping, it suits get_dump_page if we just return
- * an error if there's a hole and no huge pagecache to back it.
+ * an error where there's an empty slot with no huge pagecache
+ * to back it. This way, we avoid allocating a hugepage, and
+ * the sparse dumpfile avoids allocating disk blocks, but its
+ * huge holes still show up with zeroes where they need to be.
*/
- if (absent &&
- ((flags & FOLL_DUMP) && !hugetlbfs_backed(h, vma, vaddr))) {
+ if (absent && (flags & FOLL_DUMP) &&
+ !hugetlbfs_pagecache_present(h, vma, vaddr)) {
remainder = 0;
break;
}
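
The hunk above shows only the rename and the new comment, not the body
of the function. To make the comment concrete, here is a sketch of what
such a presence check looks like (reconstructed, not quoted from the
patch; vma_hugecache_offset() is assumed to be the usual hugetlb index
helper): it uses find_get_page(), which takes a reference without
sleeping, whereas hugetlbfs_pagecache_page() just above uses
find_lock_page(), which may sleep and so cannot be called under
page_table_lock:

static bool hugetlbfs_pagecache_present(struct hstate *h,
			struct vm_area_struct *vma, unsigned long address)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t idx = vma_hugecache_offset(h, vma, address);
	struct page *page;

	page = find_get_page(mapping, idx);	/* no lock_page(), so no sleep */
	if (page)
		put_page(page);			/* only the existence matters */
	return page != NULL;
}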

2009-09-15 20:38:10

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 3/4] mm: ZERO_PAGE without PTE_SPECIAL

Reinstate anonymous use of ZERO_PAGE to all architectures, not just
to those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.

Contrary to how I'd imagined it, there's nothing ugly about this, just
a zero_pfn test built into one or another block of vm_normal_page().

But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts
for that: it would have to take vm_flags to weed out some cases.

Signed-off-by: Hugh Dickins <[email protected]>
---
I've not built and tested the actual MIPS case, just hacked up x86
definitions to simulate it; had to drop the static from zero_pfn.

arch/mips/include/asm/pgtable.h | 14 +++++++++++
mm/memory.c | 36 ++++++++++++++++++++----------
2 files changed, 39 insertions(+), 11 deletions(-)

--- mm2/arch/mips/include/asm/pgtable.h 2009-09-09 23:13:59.000000000 +0100
+++ mm3/arch/mips/include/asm/pgtable.h 2009-09-15 17:32:19.000000000 +0100
@@ -76,6 +76,20 @@ extern unsigned long zero_page_mask;
#define ZERO_PAGE(vaddr) \
(virt_to_page((void *)(empty_zero_page + (((unsigned long)(vaddr)) & zero_page_mask))))

+#define is_zero_pfn is_zero_pfn
+static inline int is_zero_pfn(unsigned long pfn)
+{
+ extern unsigned long zero_pfn;
+ unsigned long offset_from_zero_pfn = pfn - zero_pfn;
+ return offset_from_zero_pfn <= (zero_page_mask >> PAGE_SHIFT);
+}
+
+#define my_zero_pfn my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+ return page_to_pfn(ZERO_PAGE(addr));
+}
+
extern void paging_init(void);

/*
--- mm2/mm/memory.c 2009-09-14 16:34:37.000000000 +0100
+++ mm3/mm/memory.c 2009-09-15 17:32:19.000000000 +0100
@@ -107,7 +107,7 @@ static int __init disable_randmaps(char
}
__setup("norandmaps", disable_randmaps);

-static unsigned long zero_pfn __read_mostly;
+unsigned long zero_pfn __read_mostly;

/*
* CONFIG_MMU architectures set up ZERO_PAGE in their paging_init()
@@ -455,6 +455,20 @@ static inline int is_cow_mapping(unsigne
return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

+#ifndef is_zero_pfn
+static inline int is_zero_pfn(unsigned long pfn)
+{
+ return pfn == zero_pfn;
+}
+#endif
+
+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+ return zero_pfn;
+}
+#endif
+
/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
@@ -512,7 +526,7 @@ struct page *vm_normal_page(struct vm_ar
goto check_pfn;
if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
- if (pfn != zero_pfn)
+ if (!is_zero_pfn(pfn))
print_bad_pte(vma, addr, pte, NULL);
return NULL;
}
@@ -534,6 +548,8 @@ struct page *vm_normal_page(struct vm_ar
}
}

+ if (is_zero_pfn(pfn))
+ return NULL;
check_pfn:
if (unlikely(pfn > highest_memmap_pfn)) {
print_bad_pte(vma, addr, pte, NULL);
@@ -1161,7 +1177,7 @@ struct page *follow_page(struct vm_area_
page = vm_normal_page(vma, address, pte);
if (unlikely(!page)) {
if ((flags & FOLL_DUMP) ||
- pte_pfn(pte) != zero_pfn)
+ !is_zero_pfn(pte_pfn(pte)))
goto bad_page;
page = pte_page(pte);
}
@@ -1444,10 +1460,6 @@ struct page *get_dump_page(unsigned long
if (__get_user_pages(current, current->mm, addr, 1,
FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma) < 1)
return NULL;
- if (page == ZERO_PAGE(0)) {
- page_cache_release(page);
- return NULL;
- }
flush_cache_page(vma, addr, page_to_pfn(page));
return page;
}
@@ -1630,7 +1642,8 @@ int vm_insert_mixed(struct vm_area_struc
* If we don't have pte special, then we have to use the pfn_valid()
* based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
* refcount the page if pfn_valid is true (hence insert_page rather
- * than insert_pfn).
+ * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
+ * without pte special, it would there be refcounted as a normal page.
*/
if (!HAVE_PTE_SPECIAL && pfn_valid(pfn)) {
struct page *page;
@@ -2098,7 +2111,7 @@ gotten:
if (unlikely(anon_vma_prepare(vma)))
goto oom;

- if (pte_pfn(orig_pte) == zero_pfn) {
+ if (is_zero_pfn(pte_pfn(orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma, address);
if (!new_page)
goto oom;
@@ -2613,8 +2626,9 @@ static int do_anonymous_page(struct mm_s
spinlock_t *ptl;
pte_t entry;

- if (HAVE_PTE_SPECIAL && !(flags & FAULT_FLAG_WRITE)) {
- entry = pte_mkspecial(pfn_pte(zero_pfn, vma->vm_page_prot));
+ if (!(flags & FAULT_FLAG_WRITE)) {
+ entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
+ vma->vm_page_prot));
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (!pte_none(*page_table))
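
The MIPS is_zero_pfn() above looks odd at first sight: MIPS keeps
several coloured zero pages at consecutive pfns starting at zero_pfn,
with zero_page_mask >> PAGE_SHIFT as the index of the last one, and the
unsigned subtraction turns the two-sided range check into a single
comparison, since any pfn below zero_pfn wraps around to a huge value.
A small standalone demonstration of the trick, with made-up numbers
standing in for zero_pfn and the shifted mask:

#include <assert.h>

int main(void)
{
	unsigned long zero_pfn = 1000;	/* pretend first zero-page pfn */
	unsigned long last = 7;		/* pretend zero_page_mask >> PAGE_SHIFT */

	assert(1000 - zero_pfn <= last);	/* first colour: offset 0 */
	assert(1007 - zero_pfn <= last);	/* last colour: offset 7 */
	assert(!(1008 - zero_pfn <= last));	/* just above the range */
	assert(!(999 - zero_pfn <= last));	/* below: wraps to ULONG_MAX */
	return 0;
}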

2009-09-15 20:39:15

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 4/4] mm: move highest_memmap_pfn

Move highest_memmap_pfn __read_mostly from page_alloc.c next to
zero_pfn __read_mostly in memory.c: to help them share a cacheline,
since they're very often tested together in vm_normal_page().

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/internal.h | 3 ++-
mm/memory.c | 1 +
mm/page_alloc.c | 1 -
3 files changed, 3 insertions(+), 2 deletions(-)

--- mm3/mm/internal.h 2009-09-14 16:34:37.000000000 +0100
+++ mm4/mm/internal.h 2009-09-15 17:32:27.000000000 +0100
@@ -37,6 +37,8 @@ static inline void __put_page(struct pag
atomic_dec(&page->_count);
}

+extern unsigned long highest_memmap_pfn;
+
/*
* in mm/vmscan.c:
*/
@@ -46,7 +48,6 @@ extern void putback_lru_page(struct page
/*
* in mm/page_alloc.c
*/
-extern unsigned long highest_memmap_pfn;
extern void __free_pages_bootmem(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);

--- mm3/mm/memory.c 2009-09-15 17:32:19.000000000 +0100
+++ mm4/mm/memory.c 2009-09-15 17:32:27.000000000 +0100
@@ -108,6 +108,7 @@ static int __init disable_randmaps(char
__setup("norandmaps", disable_randmaps);

unsigned long zero_pfn __read_mostly;
+unsigned long highest_memmap_pfn __read_mostly;

/*
* CONFIG_MMU architectures set up ZERO_PAGE in their paging_init()
--- mm3/mm/page_alloc.c 2009-09-14 16:34:37.000000000 +0100
+++ mm4/mm/page_alloc.c 2009-09-15 17:32:27.000000000 +0100
@@ -72,7 +72,6 @@ EXPORT_SYMBOL(node_states);

unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
-unsigned long highest_memmap_pfn __read_mostly;
int percpu_pagelist_fraction;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
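
To make the cacheline argument concrete: two unsigned longs defined next
to each other will usually, though the toolchain does not guarantee it,
land within the same 64-byte cacheline, so vm_normal_page()'s
back-to-back is_zero_pfn() and highest_memmap_pfn tests touch one hot
line instead of two. A tiny userspace check of that assumption, with the
kernel globals replaced by local stand-ins:

#include <stdio.h>
#include <stdint.h>

/* Stand-ins for the two kernel globals, defined adjacently as the patch
 * arranges in mm/memory.c; final placement is up to the toolchain. */
unsigned long zero_pfn;
unsigned long highest_memmap_pfn;

int main(void)
{
	uintptr_t a = (uintptr_t)&zero_pfn;
	uintptr_t b = (uintptr_t)&highest_memmap_pfn;

	printf("zero_pfn at %#lx, highest_memmap_pfn at %#lx\n",
	       (unsigned long)a, (unsigned long)b);
	printf("same 64-byte cacheline: %s\n",
	       (a / 64) == (b / 64) ? "yes" : "no");
	return 0;
}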

2009-09-16 00:08:23

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: m(un)lock avoid ZERO_PAGE

> I'm still reluctant to clutter __get_user_pages() with another flag,
> just to avoid touching ZERO_PAGE count in mlock(); though we can add
> that later if it shows up as an issue in practice.
>
> But when mlocking, we can test page->mapping slightly earlier, to avoid
> the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock
> didn't lock_page in olden ZERO_PAGE days, so we might have regressed.
>
> And when munlocking, it turns out that FOLL_DUMP coincidentally does
> what's needed to avoid all updates to ZERO_PAGE, so use that here also.
> Plus add comment suggested by KAMEZAWA Hiroyuki.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/mlock.c | 49 ++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 36 insertions(+), 13 deletions(-)
>
> --- mm0/mm/mlock.c 2009-09-14 16:34:37.000000000 +0100
> +++ mm1/mm/mlock.c 2009-09-15 17:32:03.000000000 +0100
> @@ -198,17 +198,26 @@ static long __mlock_vma_pages_range(stru
> for (i = 0; i < ret; i++) {
> struct page *page = pages[i];
>
> - lock_page(page);
> - /*
> - * Because we lock page here and migration is blocked
> - * by the elevated reference, we need only check for
> - * file-cache page truncation. This page->mapping
> - * check also neatly skips over the ZERO_PAGE(),
> - * though if that's common we'd prefer not to lock it.
> - */
> - if (page->mapping)
> - mlock_vma_page(page);
> - unlock_page(page);
> + if (page->mapping) {
> + /*
> + * That preliminary check is mainly to avoid
> + * the pointless overhead of lock_page on the
> + * ZERO_PAGE: which might bounce very badly if
> + * there is contention. However, we're still
> + * dirtying its cacheline with get/put_page:
> + * we'll add another __get_user_pages flag to
> + * avoid it if that case turns out to matter.
> + */
> + lock_page(page);
> + /*
> + * Because we lock page here and migration is
> + * blocked by the elevated reference, we need
> + * only check for file-cache page truncation.
> + */
> + if (page->mapping)
> + mlock_vma_page(page);
> + unlock_page(page);
> + }
> put_page(page); /* ref from get_user_pages() */
> }

Yes, I have similar patch :-)
Reviewed-by: KOSAKI Motohiro <[email protected]>



2009-09-16 06:22:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/4] mm: ZERO_PAGE without PTE_SPECIAL

On Tue, 15 Sep 2009 21:37:20 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> Reinstate anonymous use of ZERO_PAGE to all architectures, not just
> to those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
>
> Contrary to how I'd imagined it, there's nothing ugly about this, just
> a zero_pfn test built into one or another block of vm_normal_page().
>
> But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
> my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
> ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts
> for that: it would have to take vm_flags to weed out some cases.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Thank you. I like this way.
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

> ---
> I've not built and tested the actual MIPS case, just hacked up x86
> definitions to simulate it; had to drop the static from zero_pfn.
>
> arch/mips/include/asm/pgtable.h | 14 +++++++++++
> mm/memory.c | 36 ++++++++++++++++++++----------
> 2 files changed, 39 insertions(+), 11 deletions(-)
>
> --- mm2/arch/mips/include/asm/pgtable.h 2009-09-09 23:13:59.000000000 +0100
> +++ mm3/arch/mips/include/asm/pgtable.h 2009-09-15 17:32:19.000000000 +0100
> @@ -76,6 +76,20 @@ extern unsigned long zero_page_mask;
> #define ZERO_PAGE(vaddr) \
> (virt_to_page((void *)(empty_zero_page + (((unsigned long)(vaddr)) & zero_page_mask))))
>
> +#define is_zero_pfn is_zero_pfn
> +static inline int is_zero_pfn(unsigned long pfn)
> +{
> + extern unsigned long zero_pfn;
> + unsigned long offset_from_zero_pfn = pfn - zero_pfn;
> + return offset_from_zero_pfn <= (zero_page_mask >> PAGE_SHIFT);
> +}
> +
> +#define my_zero_pfn my_zero_pfn
> +static inline unsigned long my_zero_pfn(unsigned long addr)
> +{
> + return page_to_pfn(ZERO_PAGE(addr));
> +}
> +
> extern void paging_init(void);
>
> /*
> --- mm2/mm/memory.c 2009-09-14 16:34:37.000000000 +0100
> +++ mm3/mm/memory.c 2009-09-15 17:32:19.000000000 +0100
> @@ -107,7 +107,7 @@ static int __init disable_randmaps(char
> }
> __setup("norandmaps", disable_randmaps);
>
> -static unsigned long zero_pfn __read_mostly;
> +unsigned long zero_pfn __read_mostly;
>
> /*
> * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init()
> @@ -455,6 +455,20 @@ static inline int is_cow_mapping(unsigne
> return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> }
>
> +#ifndef is_zero_pfn
> +static inline int is_zero_pfn(unsigned long pfn)
> +{
> + return pfn == zero_pfn;
> +}
> +#endif
> +
> +#ifndef my_zero_pfn
> +static inline unsigned long my_zero_pfn(unsigned long addr)
> +{
> + return zero_pfn;
> +}
> +#endif
> +
> /*
> * vm_normal_page -- This function gets the "struct page" associated with a pte.
> *
> @@ -512,7 +526,7 @@ struct page *vm_normal_page(struct vm_ar
> goto check_pfn;
> if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> return NULL;
> - if (pfn != zero_pfn)
> + if (!is_zero_pfn(pfn))
> print_bad_pte(vma, addr, pte, NULL);
> return NULL;
> }
> @@ -534,6 +548,8 @@ struct page *vm_normal_page(struct vm_ar
> }
> }
>
> + if (is_zero_pfn(pfn))
> + return NULL;
> check_pfn:
> if (unlikely(pfn > highest_memmap_pfn)) {
> print_bad_pte(vma, addr, pte, NULL);
> @@ -1161,7 +1177,7 @@ struct page *follow_page(struct vm_area_
> page = vm_normal_page(vma, address, pte);
> if (unlikely(!page)) {
> if ((flags & FOLL_DUMP) ||
> - pte_pfn(pte) != zero_pfn)
> + !is_zero_pfn(pte_pfn(pte)))
> goto bad_page;
> page = pte_page(pte);
> }
> @@ -1444,10 +1460,6 @@ struct page *get_dump_page(unsigned long
> if (__get_user_pages(current, current->mm, addr, 1,
> FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma) < 1)
> return NULL;
> - if (page == ZERO_PAGE(0)) {
> - page_cache_release(page);
> - return NULL;
> - }
> flush_cache_page(vma, addr, page_to_pfn(page));
> return page;
> }
> @@ -1630,7 +1642,8 @@ int vm_insert_mixed(struct vm_area_struc
> * If we don't have pte special, then we have to use the pfn_valid()
> * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
> * refcount the page if pfn_valid is true (hence insert_page rather
> - * than insert_pfn).
> + * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
> + * without pte special, it would there be refcounted as a normal page.
> */
> if (!HAVE_PTE_SPECIAL && pfn_valid(pfn)) {
> struct page *page;
> @@ -2098,7 +2111,7 @@ gotten:
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
>
> - if (pte_pfn(orig_pte) == zero_pfn) {
> + if (is_zero_pfn(pte_pfn(orig_pte))) {
> new_page = alloc_zeroed_user_highpage_movable(vma, address);
> if (!new_page)
> goto oom;
> @@ -2613,8 +2626,9 @@ static int do_anonymous_page(struct mm_s
> spinlock_t *ptl;
> pte_t entry;
>
> - if (HAVE_PTE_SPECIAL && !(flags & FAULT_FLAG_WRITE)) {
> - entry = pte_mkspecial(pfn_pte(zero_pfn, vma->vm_page_prot));
> + if (!(flags & FAULT_FLAG_WRITE)) {
> + entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
> + vma->vm_page_prot));
> ptl = pte_lockptr(mm, pmd);
> spin_lock(ptl);
> if (!pte_none(*page_table))
>

2009-09-16 09:35:04

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: m(un)lock avoid ZERO_PAGE

On Tue, Sep 15, 2009 at 09:31:49PM +0100, Hugh Dickins wrote:
> I'm still reluctant to clutter __get_user_pages() with another flag,
> just to avoid touching ZERO_PAGE count in mlock(); though we can add
> that later if it shows up as an issue in practice.
>
> But when mlocking, we can test page->mapping slightly earlier, to avoid
> the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock
> didn't lock_page in olden ZERO_PAGE days, so we might have regressed.
>
> And when munlocking, it turns out that FOLL_DUMP coincidentally does
> what's needed to avoid all updates to ZERO_PAGE, so use that here also.
> Plus add comment suggested by KAMEZAWA Hiroyuki.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/mlock.c | 49 ++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 36 insertions(+), 13 deletions(-)
>
> --- mm0/mm/mlock.c 2009-09-14 16:34:37.000000000 +0100
> +++ mm1/mm/mlock.c 2009-09-15 17:32:03.000000000 +0100
> @@ -198,17 +198,26 @@ static long __mlock_vma_pages_range(stru
> for (i = 0; i < ret; i++) {
> struct page *page = pages[i];
>
> - lock_page(page);
> - /*
> - * Because we lock page here and migration is blocked
> - * by the elevated reference, we need only check for
> - * file-cache page truncation. This page->mapping
> - * check also neatly skips over the ZERO_PAGE(),
> - * though if that's common we'd prefer not to lock it.
> - */
> - if (page->mapping)
> - mlock_vma_page(page);
> - unlock_page(page);
> + if (page->mapping) {
> + /*
> + * That preliminary check is mainly to avoid
> + * the pointless overhead of lock_page on the
> + * ZERO_PAGE: which might bounce very badly if
> + * there is contention. However, we're still
> + * dirtying its cacheline with get/put_page:
> + * we'll add another __get_user_pages flag to
> + * avoid it if that case turns out to matter.
> + */
> + lock_page(page);
> + /*
> + * Because we lock page here and migration is
> + * blocked by the elevated reference, we need
> + * only check for file-cache page truncation.
> + */
> + if (page->mapping)
> + mlock_vma_page(page);
> + unlock_page(page);
> + }
> put_page(page); /* ref from get_user_pages() */
> }
>
> @@ -309,9 +318,23 @@ void munlock_vma_pages_range(struct vm_a
> vma->vm_flags &= ~VM_LOCKED;
>
> for (addr = start; addr < end; addr += PAGE_SIZE) {
> - struct page *page = follow_page(vma, addr, FOLL_GET);
> - if (page) {
> + struct page *page;
> + /*
> + * Although FOLL_DUMP is intended for get_dump_page(),
> + * it just so happens that its special treatment of the
> + * ZERO_PAGE (returning an error instead of doing get_page)
> + * suits munlock very well (and if somehow an abnormal page
> + * has sneaked into the range, we won't oops here: great).
> + */
> + page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);

Ouch, now I get your depraved comment :). This will be a tricky rule to
remember in a year's time, won't it?

> + if (page && !IS_ERR(page)) {
> lock_page(page);
> + /*
> + * Like in __mlock_vma_pages_range(),
> + * because we lock page here and migration is
> + * blocked by the elevated reference, we need
> + * only check for file-cache page truncation.
> + */
> if (page->mapping)
> munlock_vma_page(page);
> unlock_page(page);
>

Functionally, the patch seems fine and the avoidance of lock_page() is
nice so.

Reviewed-by: Mel Gorman <[email protected]>

But, as FOLL_DUMP applies to more than core dumping, can it be renamed
in another follow-on patch? The fundamental underlying "thing" it does
is to error instead of faulting the zero page so FOLL_NO_FAULT_ZEROPAGE,
FOLL_ERRORZERO, FOLL_NOZERO etc? A name like that would simplify the comments
as FOLL_DUMP would no longer just be a desirable side-effect.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-16 11:41:08

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: m(un)lock avoid ZERO_PAGE

On Wed, 16 Sep 2009, Mel Gorman wrote:
> On Tue, Sep 15, 2009 at 09:31:49PM +0100, Hugh Dickins wrote:
> >
> > And when munlocking, it turns out that FOLL_DUMP coincidentally does
> > what's needed to avoid all updates to ZERO_PAGE, so use that here also.
...
> > for (addr = start; addr < end; addr += PAGE_SIZE) {
> > - struct page *page = follow_page(vma, addr, FOLL_GET);
> > - if (page) {
> > + struct page *page;
> > + /*
> > + * Although FOLL_DUMP is intended for get_dump_page(),
> > + * it just so happens that its special treatment of the
> > + * ZERO_PAGE (returning an error instead of doing get_page)
> > + * suits munlock very well (and if somehow an abnormal page
> > + * has sneaked into the range, we won't oops here: great).
> > + */
> > + page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>
> Ouch, now I get your depraved comment :). This will be a tricky rule to
> remember in a year's time, won't it?

I rely more upon git and grep than memory; I hope others do too.
(And that's partly why I put "get_dump_page" into the comment line.)

> Functionally, the patch seems fine and the avoidance of lock_page() is
> nice so.
>
> Reviewed-by: Mel Gorman <[email protected]>

Thanks.

>
> But, as FOLL_DUMP applies to more than core dumping, can it be renamed
> in another follow-on patch? The fundamental underlying "thing" it does
> is to error instead of faulting the zero page so FOLL_NO_FAULT_ZEROPAGE,
> FOLL_ERRORZERO, FOLL_NOZERO etc? A name like that would simplify the comments
> as FOLL_DUMP would no longer just be a desirable side-effect.

At this moment, particularly after the years of FOLL_ANON confusion,
I feel pretty strongly that this flag is there for coredumping; and
it's just a happy accident that it happens also to be useful for munlock.
And if their needs diverge later, FOLL_DUMP will do whatever dumping
wants, and FOLL_MUNLOCK or something with a longer name will do
whatever munlocking wants.

I suspect that if I could think of a really snappy name for the flag,
that didn't just send us away to study the source to see what it
really does, I'd be glad to change to that. But at the moment,
I'm happier sticking with FOLL_DUMP myself.

Hugh

2009-09-16 12:47:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: m(un)lock avoid ZERO_PAGE

On Wed, Sep 16, 2009 at 12:40:23PM +0100, Hugh Dickins wrote:
> On Wed, 16 Sep 2009, Mel Gorman wrote:
> > On Tue, Sep 15, 2009 at 09:31:49PM +0100, Hugh Dickins wrote:
> > >
> > > And when munlocking, it turns out that FOLL_DUMP coincidentally does
> > > what's needed to avoid all updates to ZERO_PAGE, so use that here also.
> ...
> > > for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > - struct page *page = follow_page(vma, addr, FOLL_GET);
> > > - if (page) {
> > > + struct page *page;
> > > + /*
> > > + * Although FOLL_DUMP is intended for get_dump_page(),
> > > + * it just so happens that its special treatment of the
> > > + * ZERO_PAGE (returning an error instead of doing get_page)
> > > + * suits munlock very well (and if somehow an abnormal page
> > > + * has sneaked into the range, we won't oops here: great).
> > > + */
> > > + page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> >
> > Ouch, now I get your depraved comment :). This will be a tricky rule to
> > remember in a year's time, won't it?
>
> I rely more upon git and grep than memory; I hope others do too.
> (And that's partly why I put "get_dump_page" into the comment line.)
>

True, and the comment is pretty explicit.

> > Functionally, the patch seems fine and the avoidance of lock_page() is
> > nice so.
> >
> > Reviewed-by: Mel Gorman <[email protected]>
>
> Thanks.
>
> >
> > But, as FOLL_DUMP applies to more than core dumping, can it be renamed
> > in another follow-on patch? The fundamental underlying "thing" it does
> > is to error instead of faulting the zero page so FOLL_NO_FAULT_ZEROPAGE,
> > FOLL_ERRORZERO, FOLL_NOZERO etc? A name like that would simplify the comments
> > as FOLL_DUMP would no longer just be a desirable side-effect.
>
> At this moment, particularly after the years of FOLL_ANON confusion,
> I feel pretty strongly that this flag is there for coredumping; and
> it's just a happy accident that it happens also to be useful for munlock.
> And if their needs diverge later, FOLL_DUMP will do whatever dumping
> wants, and FOLL_MUNLOCK or something with a longer name will do
> whatever munlocking wants.
>

Ok, that's reasonable. You want to avoid any temptation to abuse the flag,
and a reviewer will spot abuse of something called FOLL_DUMP more easily
than something like FOLL_NOZERO.

> I suspect that if I could think of a really snappy name for the flag,
> that didn't just send us away to study the source to see what it
> really does, I'd be glad to change to that. But at the moment,
> I'm happier sticking with FOLL_DUMP myself.
>

Grand. Thanks for clarifying and explaining.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-17 00:33:25

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/4] mm: mlock, hugetlb, zero followups

> Here's a gang of four patches against current mmotm, following
> on from the eight around get_user_pages flags, addressing
> concerns raised on those. Best slotted in as a group after
> mm-foll-flags-for-gup-flags.patch
>
> arch/mips/include/asm/pgtable.h | 14 ++++++++
> mm/hugetlb.c | 16 ++++++---
> mm/internal.h | 3 +
> mm/memory.c | 37 +++++++++++++++-------
> mm/mlock.c | 49 ++++++++++++++++++++++--------
> mm/page_alloc.c | 1
> 6 files changed, 89 insertions(+), 31 deletions(-)

My stress workload found no problems either.

Thanks.