2009-07-09 03:26:35

by Kamezawa Hiroyuki

Subject: [PATCH 0/2] ZERO PAGE again v3.

After the v2 discussion, I felt that a "Go" sign could be given if the implementation
is neat and tiny and the overhead is very small. Here is v3.

In this version,

- use pte_special() in vm_normal_page()
All ZERO_PAGE checks go down to vm_normal_page(), and the check is done there
(a condensed sketch of the mechanism follows below).
Some new flags are added to follow_page() and get_user_pages().

- a per-arch use-zero-page config is added.
IIUC, the only archs which have _PAGE_SPECIAL are x86, powerpc and s390.
Because this patch makes use of the pte_special() check, a config option for
using the zero page is added, and you can turn it off if necessary.
In this patch, only x86, which I can test, is turned on.
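To summarize, the mechanism condenses to something like this (a simplified
sketch, not a literal excerpt; use_zero_page() and the ignore_zero handling
are defined in patch 2/2):

	/* fault side: a read fault on a private anonymous vma installs a
	 * pte_special() mapping of the global zero page instead of
	 * allocating a new page (see do_anon_zeromap() in patch 2/2). */
	entry = pte_mkspecial(mk_pte(ZERO_PAGE(0), vma->vm_page_prot));
	set_pte_at(mm, address, page_table, entry);

	/* lookup side: vm_normal_page() recognizes the special pte and
	 * hands back either ZERO_PAGE(0) or NULL, as the caller asked. */
	if (pte_special(pte) && use_zero_page(vma))
		return ignore_zero ? NULL : ZERO_PAGE(0);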

Any comments are welcome.

Thanks,
-Kame


2009-07-09 03:29:04

by Kamezawa Hiroyuki

Subject: [PATCH 1/2] ZERO PAGE config

Only the x86 config is included because I can test only x86...
==
From: KAMEZAWA Hiroyuki <[email protected]>

Kconfig for using ZERO_PAGE. Whether ZERO_PAGE is used depends on
- whether the arch has pte_special().
- whether the arch allows use of ZERO_PAGE.

In this patch, a generic config for mm/ and an arch-specific config for x86
are added. Other archs? AFAIK, powerpc and s390 also have pte_special().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
arch/x86/Kconfig | 3 +++
mm/Kconfig | 15 +++++++++++++++
2 files changed, 18 insertions(+)

Index: zeropage-trialv3/mm/Kconfig
===================================================================
--- zeropage-trialv3.orig/mm/Kconfig
+++ zeropage-trialv3/mm/Kconfig
@@ -214,6 +214,21 @@ config HAVE_MLOCKED_PAGE_BIT
config MMU_NOTIFIER
bool

+config SUPPORT_ANON_ZERO_PAGE
+ bool "use anon zero page"
+ default y if ARCH_SUPPORT_ANON_ZERO_PAGE
+ help
+ In anonymous private mappings (MAP_ANONYMOUS and /dev/zero), a read
+ page fault will allocate a new zero-cleared page. If the first page
+ fault is a write, allocating a new page is necessary. But if it is a
+ read, we can use ZERO_PAGE until a write comes. If you set this to y,
+ the kernel uses ZERO_PAGE and delays allocating new memory in private
+ anon mappings until the first write. If applications use large mmaps
+ and most accesses are reads, this reduces memory usage and cache
+ usage to some extent. To support this, your architecture should have
+ the _PAGE_SPECIAL bit in its ptes. And this is no help to the cpu
+ cache if the arch's cache is virtually tagged.
+
config DEFAULT_MMAP_MIN_ADDR
int "Low address space to protect from user allocation"
default 4096
Index: zeropage-trialv3/arch/x86/Kconfig
===================================================================
--- zeropage-trialv3.orig/arch/x86/Kconfig
+++ zeropage-trialv3/arch/x86/Kconfig
@@ -161,6 +161,9 @@ config ARCH_HIBERNATION_POSSIBLE
config ARCH_SUSPEND_POSSIBLE
def_bool y

+config ARCH_SUPPORT_ANON_ZERO_PAGE
+ def_bool y
+
config ZONE_DMA32
bool
default X86_64

2009-07-09 03:30:06

by Kamezawa Hiroyuki

Subject: [PATCH 2/2] ZERO PAGE by pte_special

From: KAMEZAWA Hiroyuki <[email protected]>

ZERO_PAGE for anonymous private mappings is useful when an application
requires a large contiguous memory area but writes to it only sparsely, and
in some other cases. It was removed in 2.6.24, but this patch tries to re-add
it. (Because there are some use cases..)

In the past, ZERO_PAGE was removed because of heavy cache line contention on
ZERO_PAGE's refcount; this version of ZERO_PAGE avoids refcounting it.
The implementation is changed as follows.

- Use of ZERO_PAGE is limited to PRIVATE mappings, so VM_MAYSHARE is
checked in the same way as VM_SHARED.

- pte_special(), i.e. the _PAGE_SPECIAL bit in the pte, is used to indicate ZERO_PAGE.

- vm_normal_page() takes one more argument, "ignore_zero". If ignore_zero != 0,
NULL is returned even if ZERO_PAGE is found.

- __get_user_pages() takes one more flag, GUP_FLAGS_IGNORE_ZERO. If set,
__get_user_pages() returns NULL even if ZERO_PAGE is found.

- follow_page() takes one more flag, FOLL_NOZERO. If set, follow_page()
returns NULL even if ZERO_PAGE is found.

The usual overheads of this patch are...
- one if() in do_anonymous_page().
- three if()s in __get_user_pages().
- one extra argument to vm_normal_page().

Note:
- no changes to get_user_pages(). ZERO_PAGE can be returned when the
vma is ANONYMOUS && PRIVATE and the access is READ.
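For a caller of get_user_pages()/__get_user_pages() which must not get
ZERO_PAGE back (mlock is the one converted in this patch), the new flag is
passed down; roughly like this (a sketch only, see mm/mlock.c and
mm/internal.h in the patch for the real code):

	int gup_flags = GUP_FLAGS_IGNORE_ZERO;
	...
	ret = __get_user_pages(current, mm, addr, nr_pages,
				gup_flags, pages, NULL);

With the flag set, follow_page() is called with FOLL_NOZERO, and when it
returns NULL for a zero-page pte, __get_user_pages() stops instead of
faulting a real page in.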

Changelog v2->v3
- totally renewed.
- use pte_special()
- added new argument to vm_normal_page().
- MAYSHARE is checked.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 8 +--
include/linux/mm.h | 3 -
mm/fremap.c | 3 -
mm/internal.h | 1
mm/memory.c | 136 +++++++++++++++++++++++++++++++++++++++++------------
mm/mempolicy.c | 8 +--
mm/migrate.c | 6 +-
mm/mlock.c | 2
mm/rmap.c | 6 +-
9 files changed, 129 insertions(+), 44 deletions(-)

Index: zeropage-trialv3/mm/memory.c
===================================================================
--- zeropage-trialv3.orig/mm/memory.c
+++ zeropage-trialv3/mm/memory.c
@@ -442,6 +442,27 @@ static inline int is_cow_mapping(unsigne
}

/*
+ * Can we use ZERO_PAGE at fault ? or Can we do the FOLL_ANON optimization ?
+ */
+static inline int use_zero_page(struct vm_area_struct *vma)
+{
+ /*
+ * We don't want to optimize FOLL_ANON for make_pages_present()
+ * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
+ * we want to get the page from the page tables to make sure
+ * that we serialize and update with any other user of that
+ * mapping. At doing page fault, VM_MAYSHARE should be also check for
+ * avoiding possible changes to VM_SHARED.
+ */
+ if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
+ return 0;
+ /*
+ * And if we have a fault routine, it's not an anonymous region.
+ */
+ return !vma->vm_ops || !vma->vm_ops->fault;
+}
+
+/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
* "Special" mappings do not wish to be associated with a "struct page" (either
@@ -488,16 +509,33 @@ static inline int is_cow_mapping(unsigne
#else
# define HAVE_PTE_SPECIAL 0
#endif
+
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+# define HAVE_ANON_ZERO 1
+#else
+# define HAVE_ANON_ZERO 0
+#endif
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte)
+ pte_t pte, int ignore_zero)
{
unsigned long pfn = pte_pfn(pte);

if (HAVE_PTE_SPECIAL) {
if (likely(!pte_special(pte)))
goto check_pfn;
- if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
- print_bad_pte(vma, addr, pte, NULL);
+
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ return NULL;
+ /*
+ * ZERO PAGE ? If vma is shared or has page fault handler,
+ * Using ZERO PAGE is bug.
+ */
+ if (HAVE_ANON_ZERO && use_zero_page(vma)) {
+ if (ignore_zero)
+ return NULL;
+ return ZERO_PAGE(0);
+ }
+ print_bad_pte(vma, addr, pte, NULL);
return NULL;
}

@@ -591,8 +629,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
if (vm_flags & VM_SHARED)
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
-
- page = vm_normal_page(vma, addr, pte);
+ /* we can ignore zero page */
+ page = vm_normal_page(vma, addr, pte, 1);
if (page) {
get_page(page);
page_dup_rmap(page, vma, addr);
@@ -783,7 +821,7 @@ static unsigned long zap_pte_range(struc
if (pte_present(ptent)) {
struct page *page;

- page = vm_normal_page(vma, addr, ptent);
+ page = vm_normal_page(vma, addr, ptent, 1);
if (unlikely(details) && page) {
/*
* unmap_shared_mapping_pages() wants to
@@ -1141,7 +1179,7 @@ struct page *follow_page(struct vm_area_
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
- page = vm_normal_page(vma, address, pte);
+ page = vm_normal_page(vma, address, pte, (flags & FOLL_NOZERO));
if (unlikely(!page))
goto bad_page;

@@ -1155,6 +1193,7 @@ struct page *follow_page(struct vm_area_
* pte_mkyoung() would be more correct here, but atomic care
* is needed to avoid losing the dirty bit: it is easier to use
* mark_page_accessed().
+ * ZERO page may be marked as accessed but no bad side effects.
*/
mark_page_accessed(page);
}
@@ -1186,23 +1225,6 @@ no_page_table:
return page;
}

-/* Can we do the FOLL_ANON optimization? */
-static inline int use_zero_page(struct vm_area_struct *vma)
-{
- /*
- * We don't want to optimize FOLL_ANON for make_pages_present()
- * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
- * we want to get the page from the page tables to make sure
- * that we serialize and update with any other user of that
- * mapping.
- */
- if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
- return 0;
- /*
- * And if we have a fault routine, it's not an anonymous region.
- */
- return !vma->vm_ops || !vma->vm_ops->fault;
-}



@@ -1216,6 +1238,7 @@ int __get_user_pages(struct task_struct
int force = !!(flags & GUP_FLAGS_FORCE);
int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+ int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);

if (nr_pages <= 0)
return 0;
@@ -1259,7 +1282,9 @@ int __get_user_pages(struct task_struct
return i ? : -EFAULT;
}
if (pages) {
- struct page *page = vm_normal_page(gate_vma, start, *pte);
+ struct page *page;
+ page = vm_normal_page(gate_vma, start,
+ *pte, ignore_zero);
pages[i] = page;
if (page)
get_page(page);
@@ -1287,8 +1312,13 @@ int __get_user_pages(struct task_struct
foll_flags = FOLL_TOUCH;
if (pages)
foll_flags |= FOLL_GET;
- if (!write && use_zero_page(vma))
- foll_flags |= FOLL_ANON;
+ if (!write) {
+ if (use_zero_page(vma))
+ foll_flags |= FOLL_ANON;
+ else
+ ignore_zero = 0;
+ } else
+ ignore_zero = 0;

do {
struct page *page;
@@ -1307,9 +1337,17 @@ int __get_user_pages(struct task_struct
if (write)
foll_flags |= FOLL_WRITE;

+ if (ignore_zero)
+ foll_flags |= FOLL_NOZERO;
+
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
+ /*
+ * When we ignore zero pages, no more ops to do.
+ */
+ if (ignore_zero)
+ break;

ret = handle_mm_fault(mm, vma, start,
(foll_flags & FOLL_WRITE) ?
@@ -1953,10 +1991,11 @@ static int do_wp_page(struct mm_struct *
int page_mkwrite = 0;
struct page *dirty_page = NULL;

- old_page = vm_normal_page(vma, address, orig_pte);
+ /* This returns NULL when we find ZERO page */
+ old_page = vm_normal_page(vma, address, orig_pte, 1);
if (!old_page) {
/*
- * VM_MIXEDMAP !pfn_valid() case
+ * VM_MIXEDMAP !pfn_valid() case or ZERO_PAGE cases.
*
* We should not cow pages in a shared writeable mapping.
* Just mark the pages writable as we can't do any dirty
@@ -2610,6 +2649,41 @@ out_page:
return ret;
}

+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ pte_t entry;
+ pte_t *page_table;
+ bool ret = false;
+
+ if (!use_zero_page(vma))
+ return ret;
+ /*
+ * We use _PAGE_SPECIAL bit in pte to indicate this page is ZERO PAGE.
+ */
+ entry = pte_mkspecial(mk_pte(ZERO_PAGE(0), vma->vm_page_prot));
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_none(*page_table))
+ goto out_unlock;
+ set_pte_at(mm, address, page_table, entry);
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, entry);
+ ret = true;
+out_unlock:
+ pte_unmap_unlock(page_table, ptl);
+ return ret;
+}
+#else
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ /* We don't use ZERO PAGE */
+ return false;
+}
+#endif /* CONFIG_SUPPORT_ANON_ZERO_PAGE */
+
/*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -2626,6 +2700,10 @@ static int do_anonymous_page(struct mm_s
/* Allocate our own private page. */
pte_unmap(page_table);

+ if (unlikely(!(flags & FAULT_FLAG_WRITE)))
+ if (do_anon_zeromap(mm, vma, address, pmd))
+ return 0;
+
if (unlikely(anon_vma_prepare(vma)))
goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
Index: zeropage-trialv3/mm/fremap.c
===================================================================
--- zeropage-trialv3.orig/mm/fremap.c
+++ zeropage-trialv3/mm/fremap.c
@@ -33,7 +33,8 @@ static void zap_pte(struct mm_struct *mm

flush_cache_page(vma, addr, pte_pfn(pte));
pte = ptep_clear_flush(vma, addr, ptep);
- page = vm_normal_page(vma, addr, pte);
+ /* we can ignore zero page */
+ page = vm_normal_page(vma, addr, pte, 1);
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
Index: zeropage-trialv3/mm/mempolicy.c
===================================================================
--- zeropage-trialv3.orig/mm/mempolicy.c
+++ zeropage-trialv3/mm/mempolicy.c
@@ -404,13 +404,13 @@ static int check_pte_range(struct vm_are

if (!pte_present(*pte))
continue;
- page = vm_normal_page(vma, addr, *pte);
+ /* we avoid zero page here */
+ page = vm_normal_page(vma, addr, *pte, 1);
if (!page)
continue;
/*
- * The check for PageReserved here is important to avoid
- * handling zero pages and other pages that may have been
- * marked special by the system.
+ * The check for PageReserved here is important to avoid pages
+ * that may have been marked special by the system.
*
* If the PageReserved would not be checked here then f.e.
* the location of the zero page could have an influence
Index: zeropage-trialv3/mm/rmap.c
===================================================================
--- zeropage-trialv3.orig/mm/rmap.c
+++ zeropage-trialv3/mm/rmap.c
@@ -943,7 +943,11 @@ static int try_to_unmap_cluster(unsigned
for (; address < end; pte++, address += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
- page = vm_normal_page(vma, address, *pte);
+ /*
+ * Because we comes from try_to_unmap_file(), we'll never see
+ * ZERO_PAGE or ANON.
+ */
+ page = vm_normal_page(vma, address, *pte, 1);
BUG_ON(!page || PageAnon(page));

if (locked_vma) {
Index: zeropage-trialv3/include/linux/mm.h
===================================================================
--- zeropage-trialv3.orig/include/linux/mm.h
+++ zeropage-trialv3/include/linux/mm.h
@@ -753,7 +753,7 @@ struct zap_details {
};

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte);
+ pte_t pte, int ignore_zero);

int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
@@ -1242,6 +1242,7 @@ struct page *follow_page(struct vm_area_
#define FOLL_TOUCH 0x02 /* mark page accessed */
#define FOLL_GET 0x04 /* do get_page on page */
#define FOLL_ANON 0x08 /* give ZERO_PAGE if no pgtable */
+#define FOLL_NOZERO 0x10 /* returns NULL if ZERO_PAGE is found */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
Index: zeropage-trialv3/mm/internal.h
===================================================================
--- zeropage-trialv3.orig/mm/internal.h
+++ zeropage-trialv3/mm/internal.h
@@ -254,6 +254,7 @@ static inline void mminit_validate_memmo
#define GUP_FLAGS_FORCE 0x2
#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
#define GUP_FLAGS_IGNORE_SIGKILL 0x8
+#define GUP_FLAGS_IGNORE_ZERO 0x10

int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
Index: zeropage-trialv3/mm/migrate.c
===================================================================
--- zeropage-trialv3.orig/mm/migrate.c
+++ zeropage-trialv3/mm/migrate.c
@@ -834,7 +834,7 @@ static int do_move_page_to_node_array(st
if (!vma || !vma_migratable(vma))
goto set_status;

- page = follow_page(vma, pp->addr, FOLL_GET);
+ page = follow_page(vma, pp->addr, FOLL_GET | FOLL_NOZERO);

err = PTR_ERR(page);
if (IS_ERR(page))
@@ -991,14 +991,14 @@ static void do_pages_stat_array(struct m
if (!vma)
goto set_status;

- page = follow_page(vma, addr, 0);
+ page = follow_page(vma, addr, FOLL_NOZERO);

err = PTR_ERR(page);
if (IS_ERR(page))
goto set_status;

err = -ENOENT;
- /* Use PageReserved to check for zero page */
+ /* if zero page, page is NULL. */
if (!page || PageReserved(page))
goto set_status;

Index: zeropage-trialv3/mm/mlock.c
===================================================================
--- zeropage-trialv3.orig/mm/mlock.c
+++ zeropage-trialv3/mm/mlock.c
@@ -162,7 +162,7 @@ static long __mlock_vma_pages_range(stru
struct page *pages[16]; /* 16 gives a reasonable batch */
int nr_pages = (end - start) / PAGE_SIZE;
int ret = 0;
- int gup_flags = 0;
+ int gup_flags = GUP_FLAGS_IGNORE_ZERO;

VM_BUG_ON(start & ~PAGE_MASK);
VM_BUG_ON(end & ~PAGE_MASK);
Index: zeropage-trialv3/fs/proc/task_mmu.c
===================================================================
--- zeropage-trialv3.orig/fs/proc/task_mmu.c
+++ zeropage-trialv3/fs/proc/task_mmu.c
@@ -342,8 +342,8 @@ static int smaps_pte_range(pmd_t *pmd, u
continue;

mss->resident += PAGE_SIZE;
-
- page = vm_normal_page(vma, addr, ptent);
+ /* we ignore zero page */
+ page = vm_normal_page(vma, addr, ptent, 1);
if (!page)
continue;

@@ -450,8 +450,8 @@ static int clear_refs_pte_range(pmd_t *p
ptent = *pte;
if (!pte_present(ptent))
continue;
-
- page = vm_normal_page(vma, addr, ptent);
+ /* we ignore zero page */
+ page = vm_normal_page(vma, addr, ptent, 1);
if (!page)
continue;

2009-07-09 03:59:24

by Linus Torvalds

Subject: Re: [PATCH 2/2] ZERO PAGE by pte_special



On Thu, 9 Jul 2009, KAMEZAWA Hiroyuki wrote:
>
> + /* we can ignore zero page */
> + page = vm_normal_page(vma, addr, pte, 1);

> - page = vm_normal_page(vma, addr, ptent);
> + page = vm_normal_page(vma, addr, ptent, 1);

> - page = vm_normal_page(vma, address, pte);
> + page = vm_normal_page(vma, address, pte, (flags & FOLL_NOZERO));

> + int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
> ...
> + page = vm_normal_page(gate_vma, start,
> + *pte, ignore_zero);

> + if (ignore_zero)
> + foll_flags |= FOLL_NOZERO;

> + /* This returns NULL when we find ZERO page */
> + old_page = vm_normal_page(vma, address, orig_pte, 1);

> + /* we can ignore zero page */
> + page = vm_normal_page(vma, addr, pte, 1);

> + /* we avoid zero page here */
> + page = vm_normal_page(vma, addr, *pte, 1);

> + /*
> + * Because we comes from try_to_unmap_file(), we'll never see
> + * ZERO_PAGE or ANON.
> + */
> + page = vm_normal_page(vma, address, *pte, 1);

> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> - pte_t pte);
> + pte_t pte, int ignore_zero);

So I'm quoting these different uses, because they show the pattern that
exists all over this patch: confusion about "no zero" vs "ignore zero" vs
just plain no explanation at all.

Quite frankly, I hate the "ignore zero page" naming/comments. I can kind
of see why you named them that way - we'll not consider it a normal page.
But that's not "ignoring" it. That's very much noticing it, just saying we
don't want to get the "struct page" for it.

I equally hate the anonymous "1" use, with or without comments. Does "1"
mean that you want the zero page, does it mean you _don't_ want it, what
does it mean? Yes, I know that it means FOLL_NOZERO, and that when set, we
don't want the zero page, but regardless, it's just not very readable.

So I would suggest:

- never pass in "1".

- never talk about "ignoring" it.

- always pass in a _flag_, in this case FOLL_NOZERO.

If you follow those rules, you almost don't need commentary. Assuming
somebody is knowledgeable about the Linux VM, and knows we have a zero
page, you can just see a line like

page = vm_normal_page(vma, address, *pte, FOLL_NOZERO);

and you can understand that you don't want to see ZERO_PAGE. There's never
any question like "what does that '1' mean here?"

In fact, I'd pass in all of "flags", and then inside vm_normal_page() just
do

if (flags & FOLL_NOZERO) {
...

rather than ever have any boolean arguments.

(Again, I think that we should unify all of FOLL_xyz and FAULT_FLAG_xyz
and GUP_xyz into _one_ namespace - probably all under FAULT_FLAG_xyz - but
that's still a separate issue from this particular patchset).

Anyway, that said, I think the patch looks pretty simple and fairly
straightforward. Looks very much like 2.6.32 material, assuming people
will test it heavily and clean it up as per above before the next merge
window.

Linus

2009-07-09 04:54:59

by Kamezawa Hiroyuki

Subject: Re: [PATCH 2/2] ZERO PAGE by pte_special

Linus Torvalds wrote:
>
>
> On Thu, 9 Jul 2009, KAMEZAWA Hiroyuki wrote:
>>
>> + /* we can ignore zero page */
>> + page = vm_normal_page(vma, addr, pte, 1);
>
>> - page = vm_normal_page(vma, addr, ptent);
>> + page = vm_normal_page(vma, addr, ptent, 1);
>
>> - page = vm_normal_page(vma, address, pte);
>> + page = vm_normal_page(vma, address, pte, (flags & FOLL_NOZERO));
>
>> + int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
>> ...
>> + page = vm_normal_page(gate_vma, start,
>> + *pte, ignore_zero);
>
>> + if (ignore_zero)
>> + foll_flags |= FOLL_NOZERO;
>
>> + /* This returns NULL when we find ZERO page */
>> + old_page = vm_normal_page(vma, address, orig_pte, 1);
>
>> + /* we can ignore zero page */
>> + page = vm_normal_page(vma, addr, pte, 1);
>
>> + /* we avoid zero page here */
>> + page = vm_normal_page(vma, addr, *pte, 1);
>
>> + /*
>> + * Because we comes from try_to_unmap_file(), we'll never see
>> + * ZERO_PAGE or ANON.
>> + */
>> + page = vm_normal_page(vma, address, *pte, 1);
>
>> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long
>> addr,
>> - pte_t pte);
>> + pte_t pte, int ignore_zero);
>
> So I'm quoting these different uses, because they show the pattern that
> exists all over this patch: confusion about "no zero" vs "ignore zero" vs
> just plain no explanation at all.
>
> Quite frankly, I hate the "ignore zero page" naming/comments. I can kind
> of see why you named them that way - we'll not consider it a normal page.
> But that's not "ignoring" it. That's very much noticing it, just saying we
> don't want to get the "struct page" for it.
>
> I equally hate the anonymous "1" use, with or without comments. Does "1"
> mean that you want the zero page, does it means you _don't_ want it, what
> does it mean? Yes, I know that it means FOLL_NOZERO, and that when set, we
> don't want the zero page, but regardless, it's just not very readable.
>
> So I would suggest:
>
> - never pass in "1".
>
> - never talk about "ignoring" it.
>
> - always pass in a _flag_, in this case FOLL_NOZERO.
>
> If you follow those rules, you almost don't need commentary. Assuming
> somebody is knowledgeable about the Linux VM, and knows we have a zero
> page, you can just see a line like
>
> page = vm_normal_page(vma, address, *pte, FOLL_NOZERO);
>
Ahh, yes. This looks much better. I'll do it this way in v4.



> and you can understand that you don't want to see ZERO_PAGE. There's never
> any question like "what does that '1' mean here?"
>
> In fact, I'd pass in all of "flags", and then inside vm_normal_page() just
> do
>
> if (flags & FOLL_NOZERO) {
> ...
>
> rather than ever have any boolean arguments.
>
> (Again, I think that we should unify all of FOLL_xyz and FAULT_FLAG_xyz
> and GUP_xyz into _one_ namespace - probably all under FAULT_FLAG_xyz - but
> that's still a separate issue from this particular patchset).
>
sure...it's confusing...I'll start some work to clean it up when I have
a chance.


> Anyway, that said, I think the patch looks pretty simple and fairly
> straightforward. Looks very much like 2.6.32 material, assuming people
> will test it heavily and clean it up as per above before the next merge
> window.
>

Thanks,
-Kame


2009-07-13 05:47:47

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/2] ZERO PAGE again v3.

Do you think this kind of document is necessary for v4?
Any comments are welcome.
Maybe a fair number of people are busy in Montreal, so I'm not in a hurry ;)

==
From: KAMEZAWA Hiroyuki <[email protected]>

Add documentation about the zero page when re-introducing it.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/vm/zeropage.txt | 77 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 77 insertions(+)

Index: zeropage-trialv4/Documentation/vm/zeropage.txt
===================================================================
--- /dev/null
+++ zeropage-trialv4/Documentation/vm/zeropage.txt
@@ -0,0 +1,77 @@
+Zero Page.
+
+ZERO_PAGE is a page filled with zeroes and never modified (write-protected).
+Each arch has its own ZERO_PAGE in the kernel, and the macro ZERO_PAGE(addr) is
+provided. Currently, usage of ZERO_PAGE() is limited.
+
+This documentation explains ZERO_PAGE() for private anonymous mappings.
+
+If CONFIG_SUPPORT_ANON_ZERO_PAGE=y, ZERO_PAGE is used for private anonymous
+mappings. If a read fault to an anonymous private mapping occurs, ZERO_PAGE is
+mapped at the faulted address instead of a usual anonymous page. This mapped
+ZERO_PAGE is write-protected, and the user process does copy-on-write when
+it writes there. ZERO_PAGE is used only when the vma is a PRIVATE mapping and
+has no vm_ops.
+
+Implementation Details
+ - ZERO_PAGE is implemented with pte_special(). An arch therefore has to
+ support pte_special() to support ZERO_PAGE for anon.
+ - ZERO_PAGE for anon needs no reference count manipulation at map/unmap.
+ - When get_user_pages() finds ZERO_PAGE, page->count is still got/put.
+ - By passing the special flag FOLL_NOZERO, a caller gets NULL instead of ZERO_PAGE.
+ - Because ZERO_PAGE is used only at read faults on MAP_PRIVATE anonymous
+ mappings, MAP_POPULATE may map ZERO_PAGE when it handles a read-only
+ PRIVATE anonymous mapping; otherwise usual anonymous pages are used.
+ - At coredump, ZERO_PAGE will be used for memory which was never faulted in.
+
+For User Applications.
+
+ZERO_PAGE is not the best solution for applications in many cases. It tends
+to be the second best if you have enough time to improve your applications.
+
+Pros of ZERO_PAGE
+ - consumes no extra memory
+ - cpu cache overhead is small (if your cache is physically tagged)
+ - the page's reference count overhead is hidden. This is good for processes
+ which fork()/exec().
+
+Cons of ZERO_PAGE
+ - only available for read-faulted anonymous private mappings.
+ - if an application depends on ZERO_PAGE, it consumes extra TLB entries.
+ - you can only reduce the memory usage of read-faulted pages.
+
+ZERO_PAGE is helpful in some cases, but you can also use the following
+techniques. These are typical ways to avoid relying on ZERO_PAGE. But please
+note that there are always trade-offs among designs.
+
+ => Avoid a large continuous mapping and use small mmaps.
+ If the number of mmaps doesn't increase very much, this is good because your
+ application can avoid TLB pollution by ZERO_PAGE and never does unnecessary
+ accesses.
+
+ => Use a large continuous mapping and see /proc/<pid>/pagemap
+ You can check which ptes are valid by reading /proc/<pid>/pagemap
+ and avoid unnecessary faults when scanning a memory range. But reading
+ /proc/<pid>/pagemap is not very cheap, so the benefit of this technique
+ depends on usage.
+
+ => Use KSM (to be implemented..)
+ KSM (kernel shared memory) can merge your anonymously mapped pages with
+ pages of the same contents. Then ZERO_PAGE, and more pages, will be
+ merged. But in a bad case pages are heavily shared and it may affect
+ the performance of fork/exit/exec. Behavior depends on the latest KSM
+ implementation, please check.
+
+For kernel developers.
+ Your arch has to support pte_special() and add ARCH_SUPPORT_ANON_ZERO_PAGE=y
+ to use ZERO_PAGE. If your arch's cpu cache is virtually tagged, it's
+ recommended to turn this feature off. To test this, the following cases
+ should be checked.
+ - mmap/munmap/fork/exit/exec and touch anonymous private pages by READ.
+ - MAP_POPULATE in the above test.
+ - mlock()
+ - coredump
+ - /dev/zero PRIVATE mapping
+
+
+
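As a quick userspace illustration of the behaviour described above (this
assumes a kernel with this patchset and CONFIG_SUPPORT_ANON_ZERO_PAGE=y; on
a kernel without it, RSS already grows during the read pass):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	/* print the VmRSS line of /proc/self/status */
	static void show_rss(const char *tag)
	{
		char line[128];
		FILE *f = fopen("/proc/self/status", "r");

		while (f && fgets(line, sizeof(line), f))
			if (!strncmp(line, "VmRSS:", 6))
				printf("%-12s %s", tag, line);
		if (f)
			fclose(f);
	}

	int main(void)
	{
		size_t len = 256UL << 20;	/* 256MB private anonymous mapping */
		volatile char *p;
		size_t i;

		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		show_rss("after mmap");

		for (i = 0; i < len; i += 4096)	/* read faults -> ZERO_PAGE */
			(void)p[i];
		show_rss("after read");

		for (i = 0; i < len; i += 4096)	/* write faults -> CoW pages */
			p[i] = 1;
		show_rss("after write");
		return 0;
	}

With ZERO_PAGE enabled, VmRSS should stay almost unchanged after the read
pass and only grow after the write pass.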

2009-07-16 09:03:39

by Kamezawa Hiroyuki

Subject: [PATCH 0/2] ZERO PAGE again v4.


Rebased onto the mm-of-the-moment snapshot 2009-07-15-20-57,
and modified to make vm_normal_page() take FOLL_NOZERO directly.

Any comments ?

Thanks,
-Kame

2009-07-16 09:04:53

by Kamezawa Hiroyuki

Subject: [PATCH 1/2] ZERO PAGE again v4.

no changes since v3


From: KAMEZAWA Hiroyuki <[email protected]>

Kconfig for using ZERO_PAGE or not. Whether ZERO_PAGE is used depends on
- whether the arch has pte_special().
- whether the arch allows use of ZERO_PAGE.

In this patch, a generic config for mm/ and an arch-specific config for x86
are added.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Index: mmotm-2.6.31-Jul15/mm/Kconfig
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/Kconfig
+++ mmotm-2.6.31-Jul15/mm/Kconfig
@@ -214,6 +214,21 @@ config HAVE_MLOCKED_PAGE_BIT
config MMU_NOTIFIER
bool

+config SUPPORT_ANON_ZERO_PAGE
+ bool "Use anon zero page"
+ default y if ARCH_SUPPORT_ANON_ZERO_PAGE
+ help
+ In anonymous private mappings (MAP_ANONYMOUS and /dev/zero), a read
+ page fault will allocate a new zero-cleared page. If the first page
+ fault is a write, allocating a new page is necessary. But if it is a
+ read, we can use ZERO_PAGE until a write comes. If you set this to y,
+ the kernel uses ZERO_PAGE and delays allocating new memory in private
+ anon mappings until the first write. If applications use large mmaps
+ and most accesses are reads, this reduces memory usage and cache
+ usage to some extent. To support this, your architecture should have
+ the _PAGE_SPECIAL bit in its ptes. And this is no help to the cpu
+ cache if the arch's cache is virtually tagged.
+
config DEFAULT_MMAP_MIN_ADDR
int "Low address space to protect from user allocation"
default 4096
Index: mmotm-2.6.31-Jul15/arch/x86/Kconfig
===================================================================
--- mmotm-2.6.31-Jul15.orig/arch/x86/Kconfig
+++ mmotm-2.6.31-Jul15/arch/x86/Kconfig
@@ -158,6 +158,9 @@ config ARCH_HIBERNATION_POSSIBLE
config ARCH_SUSPEND_POSSIBLE
def_bool y

+config ARCH_SUPPORT_ANON_ZERO_PAGE
+ def_bool y
+
config ZONE_DMA32
bool
default X86_64

2009-07-16 09:06:16

by Kamezawa Hiroyuki

Subject: [PATCH 2/2] ZERO PAGE based on pte_special

From: KAMEZAWA Hiroyuki <[email protected]>

ZERO_PAGE for anonymous private mappings is useful when an application
requires a large contiguous memory area but writes to it only sparsely, and
in some other cases. It was removed in 2.6.24, but this patch tries to re-add
it. (Because there are some use cases..)

In the past, ZERO_PAGE was removed because of heavy cache line contention on
ZERO_PAGE's refcount; this version of ZERO_PAGE avoids refcounting it.
The implementation is changed as follows.

- Use of ZERO_PAGE is limited to PRIVATE mappings, so VM_MAYSHARE is
checked in the same way as VM_SHARED.

- pte_special(), i.e. the _PAGE_SPECIAL bit in the pte, is used to indicate ZERO_PAGE.

- vm_normal_page() takes FOLL_XXX flags. If FOLL_NOZERO is set,
NULL is returned even if ZERO_PAGE is found.

- __get_user_pages() takes one more flag, GUP_FLAGS_IGNORE_ZERO. If set,
__get_user_pages() returns NULL even if ZERO_PAGE is found.

Note:
- no changes to get_user_pages(). ZERO_PAGE can be returned when the
vma is ANONYMOUS && PRIVATE and the access is READ.

Changelog v3->v4
- FOLL_NOZERO is directly passed to vm_normal_page()

Changelog v2->v3
- totally renewed.
- use pte_special()
- added new argument to vm_normal_page().
- MAYSHARE is checked.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 8 +--
include/linux/mm.h | 3 -
mm/fremap.c | 2
mm/internal.h | 1
mm/memory.c | 136 +++++++++++++++++++++++++++++++++++++++++------------
mm/mempolicy.c | 8 +--
mm/migrate.c | 6 +-
mm/mlock.c | 2
mm/rmap.c | 6 +-
9 files changed, 128 insertions(+), 44 deletions(-)

Index: mmotm-2.6.31-Jul15/mm/memory.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/memory.c
+++ mmotm-2.6.31-Jul15/mm/memory.c
@@ -442,6 +442,27 @@ static inline int is_cow_mapping(unsigne
}

/*
+ * Can we use ZERO_PAGE at fault ? or Can we do the FOLL_ANON optimization ?
+ */
+static inline int use_zero_page(struct vm_area_struct *vma)
+{
+ /*
+ * We don't want to optimize FOLL_ANON for make_pages_present()
+ * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
+ * we want to get the page from the page tables to make sure
+ * that we serialize and update with any other user of that
+ * mapping. At doing page fault, VM_MAYSHARE should be also check for
+ * avoiding possible changes to VM_SHARED.
+ */
+ if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
+ return 0;
+ /*
+ * And if we have a fault routine, it's not an anonymous region.
+ */
+ return !vma->vm_ops || !vma->vm_ops->fault;
+}
+
+/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
* "Special" mappings do not wish to be associated with a "struct page" (either
@@ -488,16 +509,33 @@ static inline int is_cow_mapping(unsigne
#else
# define HAVE_PTE_SPECIAL 0
#endif
+
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+# define HAVE_ANON_ZERO 1
+#else
+# define HAVE_ANON_ZERO 0
+#endif
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte)
+ pte_t pte, unsigned int flags)
{
unsigned long pfn = pte_pfn(pte);

if (HAVE_PTE_SPECIAL) {
if (likely(!pte_special(pte)))
goto check_pfn;
- if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
- print_bad_pte(vma, addr, pte, NULL);
+
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ return NULL;
+ /*
+ * ZERO PAGE ? If vma is shared or has page fault handler,
+ * Using ZERO PAGE is bug.
+ */
+ if (HAVE_ANON_ZERO && use_zero_page(vma)) {
+ if (flags & FOLL_NOZERO)
+ return NULL;
+ return ZERO_PAGE(0);
+ }
+ print_bad_pte(vma, addr, pte, NULL);
return NULL;
}

@@ -591,8 +629,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
if (vm_flags & VM_SHARED)
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
-
- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
if (page) {
get_page(page);
page_dup_rmap(page, vma, addr);
@@ -783,7 +820,7 @@ static unsigned long zap_pte_range(struc
if (pte_present(ptent)) {
struct page *page;

- page = vm_normal_page(vma, addr, ptent);
+ page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
if (unlikely(details) && page) {
/*
* unmap_shared_mapping_pages() wants to
@@ -1141,7 +1178,7 @@ struct page *follow_page(struct vm_area_
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
- page = vm_normal_page(vma, address, pte);
+ page = vm_normal_page(vma, address, pte, flags);
if (unlikely(!page))
goto bad_page;

@@ -1186,23 +1223,6 @@ no_page_table:
return page;
}

-/* Can we do the FOLL_ANON optimization? */
-static inline int use_zero_page(struct vm_area_struct *vma)
-{
- /*
- * We don't want to optimize FOLL_ANON for make_pages_present()
- * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
- * we want to get the page from the page tables to make sure
- * that we serialize and update with any other user of that
- * mapping.
- */
- if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
- return 0;
- /*
- * And if we have a fault routine, it's not an anonymous region.
- */
- return !vma->vm_ops || !vma->vm_ops->fault;
-}



@@ -1216,6 +1236,7 @@ int __get_user_pages(struct task_struct
int force = !!(flags & GUP_FLAGS_FORCE);
int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+ int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);

if (nr_pages <= 0)
return 0;
@@ -1259,7 +1280,12 @@ int __get_user_pages(struct task_struct
return i ? : -EFAULT;
}
if (pages) {
- struct page *page = vm_normal_page(gate_vma, start, *pte);
+ struct page *page;
+ /*
+ * this is not an anon vma...don't handle zero page
+ * related flags.
+ */
+ page = vm_normal_page(gate_vma, start, *pte, 0);
pages[i] = page;
if (page)
get_page(page);
@@ -1287,8 +1313,13 @@ int __get_user_pages(struct task_struct
foll_flags = FOLL_TOUCH;
if (pages)
foll_flags |= FOLL_GET;
- if (!write && use_zero_page(vma))
- foll_flags |= FOLL_ANON;
+ if (!write) {
+ if (use_zero_page(vma))
+ foll_flags |= FOLL_ANON;
+ else
+ ignore_zero = 0;
+ } else
+ ignore_zero = 0;

do {
struct page *page;
@@ -1307,9 +1338,17 @@ int __get_user_pages(struct task_struct
if (write)
foll_flags |= FOLL_WRITE;

+ if (ignore_zero)
+ foll_flags |= FOLL_NOZERO;
+
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
+ /*
+ * When we ignore zero pages, no more ops to do.
+ */
+ if (ignore_zero)
+ break;

ret = handle_mm_fault(mm, vma, start,
(foll_flags & FOLL_WRITE) ?
@@ -1953,10 +1992,10 @@ static int do_wp_page(struct mm_struct *
int page_mkwrite = 0;
struct page *dirty_page = NULL;

- old_page = vm_normal_page(vma, address, orig_pte);
+ old_page = vm_normal_page(vma, address, orig_pte, FOLL_NOZERO);
if (!old_page) {
/*
- * VM_MIXEDMAP !pfn_valid() case
+ * VM_MIXEDMAP !pfn_valid() case or ZERO_PAGE cases.
*
* We should not cow pages in a shared writeable mapping.
* Just mark the pages writable as we can't do any dirty
@@ -2610,6 +2649,41 @@ out_page:
return ret;
}

+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ pte_t entry;
+ pte_t *page_table;
+ bool ret = false;
+
+ if (!use_zero_page(vma))
+ return ret;
+ /*
+ * We use _PAGE_SPECIAL bit in pte to indicate this page is ZERO PAGE.
+ */
+ entry = pte_mkspecial(mk_pte(ZERO_PAGE(0), vma->vm_page_prot));
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_none(*page_table))
+ goto out_unlock;
+ set_pte_at(mm, address, page_table, entry);
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, entry);
+ ret = true;
+out_unlock:
+ pte_unmap_unlock(page_table, ptl);
+ return ret;
+}
+#else
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ /* We don't use ZERO PAGE */
+ return false;
+}
+#endif /* CONFIG_SUPPORT_ANON_ZERO_PAGE */
+
/*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -2626,6 +2700,10 @@ static int do_anonymous_page(struct mm_s
/* Allocate our own private page. */
pte_unmap(page_table);

+ if (unlikely(!(flags & FAULT_FLAG_WRITE)))
+ if (do_anon_zeromap(mm, vma, address, pmd))
+ return 0;
+
if (unlikely(anon_vma_prepare(vma)))
goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
Index: mmotm-2.6.31-Jul15/mm/fremap.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/fremap.c
+++ mmotm-2.6.31-Jul15/mm/fremap.c
@@ -33,7 +33,7 @@ static void zap_pte(struct mm_struct *mm

flush_cache_page(vma, addr, pte_pfn(pte));
pte = ptep_clear_flush(vma, addr, ptep);
- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
Index: mmotm-2.6.31-Jul15/mm/mempolicy.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/mempolicy.c
+++ mmotm-2.6.31-Jul15/mm/mempolicy.c
@@ -404,13 +404,13 @@ static int check_pte_range(struct vm_are

if (!pte_present(*pte))
continue;
- page = vm_normal_page(vma, addr, *pte);
+ /* we avoid zero page here */
+ page = vm_normal_page(vma, addr, *pte, FOLL_NOZERO);
if (!page)
continue;
/*
- * The check for PageReserved here is important to avoid
- * handling zero pages and other pages that may have been
- * marked special by the system.
+ * The check for PageReserved here is important to avoid pages
+ * that may have been marked special by the system.
*
* If the PageReserved would not be checked here then f.e.
* the location of the zero page could have an influence
Index: mmotm-2.6.31-Jul15/mm/rmap.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/rmap.c
+++ mmotm-2.6.31-Jul15/mm/rmap.c
@@ -946,7 +946,11 @@ static int try_to_unmap_cluster(unsigned
for (; address < end; pte++, address += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
- page = vm_normal_page(vma, address, *pte);
+ /*
+ * Because we comes from try_to_unmap_file(), we'll never see
+ * ZERO_PAGE or ANON.
+ */
+ page = vm_normal_page(vma, address, *pte, FOLL_NOZERO);
BUG_ON(!page || PageAnon(page));

if (locked_vma) {
Index: mmotm-2.6.31-Jul15/include/linux/mm.h
===================================================================
--- mmotm-2.6.31-Jul15.orig/include/linux/mm.h
+++ mmotm-2.6.31-Jul15/include/linux/mm.h
@@ -753,7 +753,7 @@ struct zap_details {
};

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte);
+ pte_t pte, unsigned int flags);

int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
@@ -1245,6 +1245,7 @@ struct page *follow_page(struct vm_area_
#define FOLL_TOUCH 0x02 /* mark page accessed */
#define FOLL_GET 0x04 /* do get_page on page */
#define FOLL_ANON 0x08 /* give ZERO_PAGE if no pgtable */
+#define FOLL_NOZERO 0x10 /* returns NULL if ZERO_PAGE is found */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
Index: mmotm-2.6.31-Jul15/mm/internal.h
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/internal.h
+++ mmotm-2.6.31-Jul15/mm/internal.h
@@ -254,6 +254,7 @@ static inline void mminit_validate_memmo
#define GUP_FLAGS_FORCE 0x2
#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
#define GUP_FLAGS_IGNORE_SIGKILL 0x8
+#define GUP_FLAGS_IGNORE_ZERO 0x10

int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
Index: mmotm-2.6.31-Jul15/mm/migrate.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/migrate.c
+++ mmotm-2.6.31-Jul15/mm/migrate.c
@@ -850,7 +850,7 @@ static int do_move_page_to_node_array(st
if (!vma || !vma_migratable(vma))
goto set_status;

- page = follow_page(vma, pp->addr, FOLL_GET);
+ page = follow_page(vma, pp->addr, FOLL_GET | FOLL_NOZERO);

err = PTR_ERR(page);
if (IS_ERR(page))
@@ -1007,14 +1007,14 @@ static void do_pages_stat_array(struct m
if (!vma)
goto set_status;

- page = follow_page(vma, addr, 0);
+ page = follow_page(vma, addr, FOLL_NOZERO);

err = PTR_ERR(page);
if (IS_ERR(page))
goto set_status;

err = -ENOENT;
- /* Use PageReserved to check for zero page */
+ /* if zero page, page is NULL. */
if (!page || PageReserved(page))
goto set_status;

Index: mmotm-2.6.31-Jul15/mm/mlock.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/mlock.c
+++ mmotm-2.6.31-Jul15/mm/mlock.c
@@ -162,7 +162,7 @@ static long __mlock_vma_pages_range(stru
struct page *pages[16]; /* 16 gives a reasonable batch */
int nr_pages = (end - start) / PAGE_SIZE;
int ret = 0;
- int gup_flags = 0;
+ int gup_flags = GUP_FLAGS_IGNORE_ZERO;

VM_BUG_ON(start & ~PAGE_MASK);
VM_BUG_ON(end & ~PAGE_MASK);
Index: mmotm-2.6.31-Jul15/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.31-Jul15/fs/proc/task_mmu.c
@@ -361,8 +361,8 @@ static int smaps_pte_range(pmd_t *pmd, u
continue;

mss->resident += PAGE_SIZE;
-
- page = vm_normal_page(vma, addr, ptent);
+ /* we ignore zero page */
+ page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
if (!page)
continue;

@@ -469,8 +469,8 @@ static int clear_refs_pte_range(pmd_t *p
ptent = *pte;
if (!pte_present(ptent))
continue;
-
- page = vm_normal_page(vma, addr, ptent);
+ /* we ignore zero page */
+ page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
if (!page)
continue;

2009-07-16 12:00:42

by Minchan Kim

Subject: Re: [PATCH 2/2] ZERO PAGE based on pte_special


Hi, Kame.

On Thu, 16 Jul 2009 18:04:24 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> ZERO_PAGE for anonymous private mapping is useful when an application
> requires large continuous memory but write sparsely or some other usages.
> It was removed in 2.6.24 but this patch tries to re-add it.
> (Because there are some use cases..)
>
> In past, ZERO_PAGE was removed because heavy cache line contention in
> ZERO_PAGE's refcounting, this version of ZERO_PAGE avoid to refcnt it.
> Then, implementation is changed as following.
>
> - Use of ZERO_PAGE is limited to PRIVATE mapping. Then, VM_MAYSHARE is
> checked as VM_SHARED.
>
> - pte_special(), _PAGE_SPECIAL bit in pte is used for indicating ZERO_PAGE.
>
> - vm_normal_page() eats FOLL_XXX flag. If FOLL_NOZERO is set,
> NULL is returned even if ZERO_PAGE is found.
>
> - __get_user_pages() eats one more flag as GUP_FLAGS_IGNORE_ZERO. If set,
> __get_user_page() returns NULL even if ZERO_PAGE is found.
>
> Note:
> - no changes to get_user_pages(). ZERO_PAGE can be returned when
> vma is ANONYMOUS && PRIVATE and the access is READ.
>
> Changelog v3->v4
> - FOLL_NOZERO is directly passed to vm_normal_page()
>
> Changelog v2->v3
> - totally renewed.
> - use pte_special()
> - added new argument to vm_normal_page().
> - MAYSHARE is checked.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> fs/proc/task_mmu.c | 8 +--
> include/linux/mm.h | 3 -
> mm/fremap.c | 2
> mm/internal.h | 1
> mm/memory.c | 136 +++++++++++++++++++++++++++++++++++++++++------------
> mm/mempolicy.c | 8 +--
> mm/migrate.c | 6 +-
> mm/mlock.c | 2
> mm/rmap.c | 6 +-
> 9 files changed, 128 insertions(+), 44 deletions(-)
>
> Index: mmotm-2.6.31-Jul15/mm/memory.c
> ===================================================================
> --- mmotm-2.6.31-Jul15.orig/mm/memory.c
> +++ mmotm-2.6.31-Jul15/mm/memory.c
> @@ -442,6 +442,27 @@ static inline int is_cow_mapping(unsigne
> }
>
> /*
> + * Can we use ZERO_PAGE at fault ? or Can we do the FOLL_ANON optimization ?
> + */
> +static inline int use_zero_page(struct vm_area_struct *vma)
> +{
> + /*
> + * We don't want to optimize FOLL_ANON for make_pages_present()
> + * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
> + * we want to get the page from the page tables to make sure
> + * that we serialize and update with any other user of that
> + * mapping. At doing page fault, VM_MAYSHARE should be also check for
> + * avoiding possible changes to VM_SHARED.
> + */
> + if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
> + return 0;
> + /*
> + * And if we have a fault routine, it's not an anonymous region.
> + */
> + return !vma->vm_ops || !vma->vm_ops->fault;
> +}
> +
> +/*
> * vm_normal_page -- This function gets the "struct page" associated with a pte.
> *
> * "Special" mappings do not wish to be associated with a "struct page" (either
> @@ -488,16 +509,33 @@ static inline int is_cow_mapping(unsigne
> #else
> # define HAVE_PTE_SPECIAL 0
> #endif
> +
> +#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
> +# define HAVE_ANON_ZERO 1
> +#else
> +# define HAVE_ANON_ZERO 0
> +#endif
> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> - pte_t pte)
> + pte_t pte, unsigned int flags)
> {
> unsigned long pfn = pte_pfn(pte);
>
> if (HAVE_PTE_SPECIAL) {
> if (likely(!pte_special(pte)))
> goto check_pfn;
> - if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
> - print_bad_pte(vma, addr, pte, NULL);
> +
> + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> + return NULL;
> + /*
> + * ZERO PAGE ? If vma is shared or has page fault handler,
> + * Using ZERO PAGE is bug.
> + */
> + if (HAVE_ANON_ZERO && use_zero_page(vma)) {
> + if (flags & FOLL_NOZERO)
> + return NULL;
> + return ZERO_PAGE(0);
> + }
> + print_bad_pte(vma, addr, pte, NULL);
> return NULL;
> }
>
> @@ -591,8 +629,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
> if (vm_flags & VM_SHARED)
> pte = pte_mkclean(pte);
> pte = pte_mkold(pte);
> -
> - page = vm_normal_page(vma, addr, pte);
> + page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
> if (page) {
> get_page(page);
> page_dup_rmap(page, vma, addr);
> @@ -783,7 +820,7 @@ static unsigned long zap_pte_range(struc
> if (pte_present(ptent)) {
> struct page *page;
>
> - page = vm_normal_page(vma, addr, ptent);
> + page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
> if (unlikely(details) && page) {
> /*
> * unmap_shared_mapping_pages() wants to
> @@ -1141,7 +1178,7 @@ struct page *follow_page(struct vm_area_
> goto no_page;
> if ((flags & FOLL_WRITE) && !pte_write(pte))
> goto unlock;
> - page = vm_normal_page(vma, address, pte);
> + page = vm_normal_page(vma, address, pte, flags);
> if (unlikely(!page))
> goto bad_page;
>
> @@ -1186,23 +1223,6 @@ no_page_table:
> return page;
> }
>
> -/* Can we do the FOLL_ANON optimization? */
> -static inline int use_zero_page(struct vm_area_struct *vma)
> -{
> - /*
> - * We don't want to optimize FOLL_ANON for make_pages_present()
> - * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
> - * we want to get the page from the page tables to make sure
> - * that we serialize and update with any other user of that
> - * mapping.
> - */
> - if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
> - return 0;
> - /*
> - * And if we have a fault routine, it's not an anonymous region.
> - */
> - return !vma->vm_ops || !vma->vm_ops->fault;
> -}
>
>
>
> @@ -1216,6 +1236,7 @@ int __get_user_pages(struct task_struct
> int force = !!(flags & GUP_FLAGS_FORCE);
> int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
> int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
> + int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
>
> if (nr_pages <= 0)
> return 0;
> @@ -1259,7 +1280,12 @@ int __get_user_pages(struct task_struct
> return i ? : -EFAULT;
> }
> if (pages) {
> - struct page *page = vm_normal_page(gate_vma, start, *pte);
> + struct page *page;
> + /*
> + * this is not anon vma...don't haddle zero page
> + * related flags.
> + */
> + page = vm_normal_page(gate_vma, start, *pte, 0);
> pages[i] = page;
> if (page)
> get_page(page);
> @@ -1287,8 +1313,13 @@ int __get_user_pages(struct task_struct
> foll_flags = FOLL_TOUCH;
> if (pages)
> foll_flags |= FOLL_GET;
> - if (!write && use_zero_page(vma))
> - foll_flags |= FOLL_ANON;
> + if (!write) {
> + if (use_zero_page(vma))
> + foll_flags |= FOLL_ANON;
> + else
> + ignore_zero = 0;
> + } else
> + ignore_zero = 0;

Hmm. A nested condition is not good for readability.

How about this ?
if (!write && use_zero_page(vma))
foll_flags |= FOLL_ANON;
else
ignore_zero = 0;



--
Kind regards,
Minchan Kim

2009-07-16 13:02:33

by Kamezawa Hiroyuki

Subject: Re: [PATCH 2/2] ZERO PAGE based on pte_special

Minchan Kim wrote:
>
> Hi, Kame.
>
> On Thu, 16 Jul 2009 18:04:24 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> From: KAMEZAWA Hiroyuki <[email protected]>
>>
>> ZERO_PAGE for anonymous private mapping is useful when an application
>> requires large continuous memory but write sparsely or some other
>> usages.
>> It was removed in 2.6.24 but this patch tries to re-add it.
>> (Because there are some use cases..)
>>
>> In past, ZERO_PAGE was removed because heavy cache line contention in
>> ZERO_PAGE's refcounting, this version of ZERO_PAGE avoid to refcnt it.
>> Then, implementation is changed as following.
>>
>> - Use of ZERO_PAGE is limited to PRIVATE mapping. Then, VM_MAYSHARE is
>> checked as VM_SHARED.
>>
>> - pte_special(), _PAGE_SPECIAL bit in pte is used for indicating
>> ZERO_PAGE.
>>
>> - vm_normal_page() eats FOLL_XXX flag. If FOLL_NOZERO is set,
>> NULL is returned even if ZERO_PAGE is found.
>>
>> - __get_user_pages() eats one more flag as GUP_FLAGS_IGNORE_ZERO. If
>> set,
>> __get_user_page() returns NULL even if ZERO_PAGE is found.
>>
>> Note:
>> - no changes to get_user_pages(). ZERO_PAGE can be returned when
>> vma is ANONYMOUS && PRIVATE and the access is READ.
>>
>> Changelog v3->v4
>> - FOLL_NOZERO is directly passed to vm_normal_page()
>>
>> Changelog v2->v3
>> - totally renewed.
>> - use pte_special()
>> - added new argument to vm_normal_page().
>> - MAYSHARE is checked.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>> ---
>> fs/proc/task_mmu.c | 8 +--
>> include/linux/mm.h | 3 -
>> mm/fremap.c | 2
>> mm/internal.h | 1
>> mm/memory.c | 136
>> +++++++++++++++++++++++++++++++++++++++++------------
>> mm/mempolicy.c | 8 +--
>> mm/migrate.c | 6 +-
>> mm/mlock.c | 2
>> mm/rmap.c | 6 +-
>> 9 files changed, 128 insertions(+), 44 deletions(-)
>>
>> Index: mmotm-2.6.31-Jul15/mm/memory.c
>> ===================================================================
>> --- mmotm-2.6.31-Jul15.orig/mm/memory.c
>> +++ mmotm-2.6.31-Jul15/mm/memory.c
>> @@ -442,6 +442,27 @@ static inline int is_cow_mapping(unsigne
>> }
>>
>> /*
>> + * Can we use ZERO_PAGE at fault ? or Can we do the FOLL_ANON
>> optimization ?
>> + */
>> +static inline int use_zero_page(struct vm_area_struct *vma)
>> +{
>> + /*
>> + * We don't want to optimize FOLL_ANON for make_pages_present()
>> + * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
>> + * we want to get the page from the page tables to make sure
>> + * that we serialize and update with any other user of that
>> + * mapping. At doing page fault, VM_MAYSHARE should be also check for
>> + * avoiding possible changes to VM_SHARED.
>> + */
>> + if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
>> + return 0;
>> + /*
>> + * And if we have a fault routine, it's not an anonymous region.
>> + */
>> + return !vma->vm_ops || !vma->vm_ops->fault;
>> +}
>> +
>> +/*
>> * vm_normal_page -- This function gets the "struct page" associated
>> with a pte.
>> *
>> * "Special" mappings do not wish to be associated with a "struct page"
>> (either
>> @@ -488,16 +509,33 @@ static inline int is_cow_mapping(unsigne
>> #else
>> # define HAVE_PTE_SPECIAL 0
>> #endif
>> +
>> +#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
>> +# define HAVE_ANON_ZERO 1
>> +#else
>> +# define HAVE_ANON_ZERO 0
>> +#endif
>> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long
>> addr,
>> - pte_t pte)
>> + pte_t pte, unsigned int flags)
>> {
>> unsigned long pfn = pte_pfn(pte);
>>
>> if (HAVE_PTE_SPECIAL) {
>> if (likely(!pte_special(pte)))
>> goto check_pfn;
>> - if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
>> - print_bad_pte(vma, addr, pte, NULL);
>> +
>> + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>> + return NULL;
>> + /*
>> + * ZERO PAGE ? If vma is shared or has page fault handler,
>> + * Using ZERO PAGE is bug.
>> + */
>> + if (HAVE_ANON_ZERO && use_zero_page(vma)) {
>> + if (flags & FOLL_NOZERO)
>> + return NULL;
>> + return ZERO_PAGE(0);
>> + }
>> + print_bad_pte(vma, addr, pte, NULL);
>> return NULL;
>> }
>>
>> @@ -591,8 +629,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
>> if (vm_flags & VM_SHARED)
>> pte = pte_mkclean(pte);
>> pte = pte_mkold(pte);
>> -
>> - page = vm_normal_page(vma, addr, pte);
>> + page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
>> if (page) {
>> get_page(page);
>> page_dup_rmap(page, vma, addr);
>> @@ -783,7 +820,7 @@ static unsigned long zap_pte_range(struc
>> if (pte_present(ptent)) {
>> struct page *page;
>>
>> - page = vm_normal_page(vma, addr, ptent);
>> + page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
>> if (unlikely(details) && page) {
>> /*
>> * unmap_shared_mapping_pages() wants to
>> @@ -1141,7 +1178,7 @@ struct page *follow_page(struct vm_area_
>> goto no_page;
>> if ((flags & FOLL_WRITE) && !pte_write(pte))
>> goto unlock;
>> - page = vm_normal_page(vma, address, pte);
>> + page = vm_normal_page(vma, address, pte, flags);
>> if (unlikely(!page))
>> goto bad_page;
>>
>> @@ -1186,23 +1223,6 @@ no_page_table:
>> return page;
>> }
>>
>> -/* Can we do the FOLL_ANON optimization? */
>> -static inline int use_zero_page(struct vm_area_struct *vma)
>> -{
>> - /*
>> - * We don't want to optimize FOLL_ANON for make_pages_present()
>> - * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
>> - * we want to get the page from the page tables to make sure
>> - * that we serialize and update with any other user of that
>> - * mapping.
>> - */
>> - if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
>> - return 0;
>> - /*
>> - * And if we have a fault routine, it's not an anonymous region.
>> - */
>> - return !vma->vm_ops || !vma->vm_ops->fault;
>> -}
>>
>>
>>
>> @@ -1216,6 +1236,7 @@ int __get_user_pages(struct task_struct
>> int force = !!(flags & GUP_FLAGS_FORCE);
>> int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
>> int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
>> + int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
>>
>> if (nr_pages <= 0)
>> return 0;
>> @@ -1259,7 +1280,12 @@ int __get_user_pages(struct task_struct
>> return i ? : -EFAULT;
>> }
>> if (pages) {
>> - struct page *page = vm_normal_page(gate_vma, start, *pte);
>> + struct page *page;
>> + /*
>> + * this is not anon vma...don't haddle zero page
>> + * related flags.
>> + */
>> + page = vm_normal_page(gate_vma, start, *pte, 0);
>> pages[i] = page;
>> if (page)
>> get_page(page);
>> @@ -1287,8 +1313,13 @@ int __get_user_pages(struct task_struct
>> foll_flags = FOLL_TOUCH;
>> if (pages)
>> foll_flags |= FOLL_GET;
>> - if (!write && use_zero_page(vma))
>> - foll_flags |= FOLL_ANON;
>> + if (!write) {
>> + if (use_zero_page(vma))
>> + foll_flags |= FOLL_ANON;
>> + else
>> + ignore_zero = 0;
>> + } else
>> + ignore_zero = 0;
>
> Hmm. The nested condition is not good for readability.
>
> How about this ?
> if (!write && use_zero_page(vma))
> foll_flags |= FOLL_ANON;
> else
> ignore_zero = 0;
>
Ah, yes, yours seems better.
I'll post an updated one later.

Thank you
-Kame

2009-07-17 00:40:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/2] ZERO PAGE based on pte_special

On Thu, 16 Jul 2009 21:00:28 +0900
Minchan Kim <[email protected]> wrote:
> Hmm. The nested condition is not good for readability.
>
> How about this ?
> if (!write && use_zero_page(vma))
> foll_flags |= FOLL_ANON;
> else
> ignore_zero = 0;
>
Ok, here is v4.1
Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <[email protected]>

ZERO_PAGE for anonymous private mappings is useful when an application
requires a large contiguous mapping but writes to it only sparsely, and for
some other usages. It was removed in 2.6.24, but this patch tries to re-add
it because such use cases still exist.

In the past, ZERO_PAGE was removed because of heavy cache-line contention
on ZERO_PAGE's refcount; this version avoids taking a refcount on it.
The implementation is changed as follows.

- Use of ZERO_PAGE is limited to PRIVATE mappings, so VM_MAYSHARE is
checked in the same way as VM_SHARED.

- pte_special(), i.e. the _PAGE_SPECIAL bit in the pte, is used to indicate
ZERO_PAGE.

- vm_normal_page() takes FOLL_XXX flags. If FOLL_NOZERO is set,
NULL is returned even if ZERO_PAGE is found.

- __get_user_pages() takes one more flag, GUP_FLAGS_IGNORE_ZERO. If set,
no page is returned even where ZERO_PAGE is mapped.

Note:
- no changes to get_user_pages(). ZERO_PAGE can be returned when
vma is ANONYMOUS && PRIVATE and the access is READ.
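
For illustration only (this is not part of the patch; the sizes and names
below are just an example), this is roughly the userspace access pattern the
note above is about: a big MAP_PRIVATE | MAP_ANONYMOUS area that is read long
before it is written. With this patch, the read faults map ZERO_PAGE, so real
pages are allocated only where the program actually writes.

/*
 * Illustration only -- not part of the patch. A large private anonymous
 * mapping is touched read-only first; with ZERO_PAGE those read faults
 * do not allocate memory, and pages are allocated only at the writes.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;			/* 1GB of address space */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t off;
	long sum = 0;

	if (buf == MAP_FAILED)
		return 1;

	for (off = 0; off < len; off += 4096)	/* read faults -> ZERO_PAGE */
		sum += buf[off];

	memset(buf, 1, 16 * 4096);		/* write faults -> real pages */

	printf("%ld\n", sum);
	munmap(buf, len);
	return 0;
}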

Changelog v4 -> v4.1
- removed nested "if" in __get_user_pages() for readability

Changelog v3->v4
- FOLL_NOZERO is directly passed to vm_normal_page()

Changelog v2->v3
- totally renewed.
- use pte_special()
- added new argument to vm_normal_page().
- MAYSHARE is checked.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 8 +--
include/linux/mm.h | 3 -
mm/fremap.c | 2
mm/internal.h | 1
mm/memory.c | 129 +++++++++++++++++++++++++++++++++++++++++------------
mm/mempolicy.c | 8 +--
mm/migrate.c | 6 +-
mm/mlock.c | 2
mm/rmap.c | 6 ++
9 files changed, 123 insertions(+), 42 deletions(-)

Index: mmotm-2.6.31-Jul15/mm/memory.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/memory.c
+++ mmotm-2.6.31-Jul15/mm/memory.c
@@ -442,6 +442,27 @@ static inline int is_cow_mapping(unsigne
}

/*
+ * Can we use ZERO_PAGE at fault time? Or can we do the FOLL_ANON optimization?
+ */
+static inline int use_zero_page(struct vm_area_struct *vma)
+{
+ /*
+ * We don't want to optimize FOLL_ANON for make_pages_present()
+ * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
+ * we want to get the page from the page tables to make sure
+ * that we serialize and update with any other user of that
+ * mapping. At page fault time, VM_MAYSHARE should also be checked to
+ * guard against possible later changes to VM_SHARED.
+ */
+ if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
+ return 0;
+ /*
+ * And if we have a fault routine, it's not an anonymous region.
+ */
+ return !vma->vm_ops || !vma->vm_ops->fault;
+}
+
+/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
* "Special" mappings do not wish to be associated with a "struct page" (either
@@ -488,16 +509,33 @@ static inline int is_cow_mapping(unsigne
#else
# define HAVE_PTE_SPECIAL 0
#endif
+
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+# define HAVE_ANON_ZERO 1
+#else
+# define HAVE_ANON_ZERO 0
+#endif
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte)
+ pte_t pte, unsigned int flags)
{
unsigned long pfn = pte_pfn(pte);

if (HAVE_PTE_SPECIAL) {
if (likely(!pte_special(pte)))
goto check_pfn;
- if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
- print_bad_pte(vma, addr, pte, NULL);
+
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ return NULL;
+ /*
+ * ZERO PAGE? If the vma is shared or has a page fault handler,
+ * using ZERO PAGE is a bug.
+ */
+ if (HAVE_ANON_ZERO && use_zero_page(vma)) {
+ if (flags & FOLL_NOZERO)
+ return NULL;
+ return ZERO_PAGE(0);
+ }
+ print_bad_pte(vma, addr, pte, NULL);
return NULL;
}

@@ -591,8 +629,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
if (vm_flags & VM_SHARED)
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
-
- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
if (page) {
get_page(page);
page_dup_rmap(page, vma, addr);
@@ -783,7 +820,7 @@ static unsigned long zap_pte_range(struc
if (pte_present(ptent)) {
struct page *page;

- page = vm_normal_page(vma, addr, ptent);
+ page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
if (unlikely(details) && page) {
/*
* unmap_shared_mapping_pages() wants to
@@ -1141,7 +1178,7 @@ struct page *follow_page(struct vm_area_
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
- page = vm_normal_page(vma, address, pte);
+ page = vm_normal_page(vma, address, pte, flags);
if (unlikely(!page))
goto bad_page;

@@ -1186,23 +1223,6 @@ no_page_table:
return page;
}

-/* Can we do the FOLL_ANON optimization? */
-static inline int use_zero_page(struct vm_area_struct *vma)
-{
- /*
- * We don't want to optimize FOLL_ANON for make_pages_present()
- * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
- * we want to get the page from the page tables to make sure
- * that we serialize and update with any other user of that
- * mapping.
- */
- if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
- return 0;
- /*
- * And if we have a fault routine, it's not an anonymous region.
- */
- return !vma->vm_ops || !vma->vm_ops->fault;
-}



@@ -1216,6 +1236,7 @@ int __get_user_pages(struct task_struct
int force = !!(flags & GUP_FLAGS_FORCE);
int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+ int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);

if (nr_pages <= 0)
return 0;
@@ -1259,7 +1280,12 @@ int __get_user_pages(struct task_struct
return i ? : -EFAULT;
}
if (pages) {
- struct page *page = vm_normal_page(gate_vma, start, *pte);
+ struct page *page;
+ /*
+ * this is not an anon vma...don't handle zero page
+ * related flags.
+ */
+ page = vm_normal_page(gate_vma, start, *pte, 0);
pages[i] = page;
if (page)
get_page(page);
@@ -1289,6 +1315,8 @@ int __get_user_pages(struct task_struct
foll_flags |= FOLL_GET;
if (!write && use_zero_page(vma))
foll_flags |= FOLL_ANON;
+ else
+ ignore_zero = 0;

do {
struct page *page;
@@ -1307,9 +1335,17 @@ int __get_user_pages(struct task_struct
if (write)
foll_flags |= FOLL_WRITE;

+ if (ignore_zero)
+ foll_flags |= FOLL_NOZERO;
+
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
+ /*
+ * When we ignore zero pages, there is nothing more to do.
+ */
+ if (ignore_zero)
+ break;

ret = handle_mm_fault(mm, vma, start,
(foll_flags & FOLL_WRITE) ?
@@ -1953,10 +1989,10 @@ static int do_wp_page(struct mm_struct *
int page_mkwrite = 0;
struct page *dirty_page = NULL;

- old_page = vm_normal_page(vma, address, orig_pte);
+ old_page = vm_normal_page(vma, address, orig_pte, FOLL_NOZERO);
if (!old_page) {
/*
- * VM_MIXEDMAP !pfn_valid() case
+ * VM_MIXEDMAP !pfn_valid() case or the ZERO_PAGE case.
*
* We should not cow pages in a shared writeable mapping.
* Just mark the pages writable as we can't do any dirty
@@ -2610,6 +2646,41 @@ out_page:
return ret;
}

+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ pte_t entry;
+ pte_t *page_table;
+ bool ret = false;
+
+ if (!use_zero_page(vma))
+ return ret;
+ /*
+ * We use the _PAGE_SPECIAL bit in the pte to indicate that this is the ZERO PAGE.
+ */
+ entry = pte_mkspecial(mk_pte(ZERO_PAGE(0), vma->vm_page_prot));
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_none(*page_table))
+ goto out_unlock;
+ set_pte_at(mm, address, page_table, entry);
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, entry);
+ ret = true;
+out_unlock:
+ pte_unmap_unlock(page_table, ptl);
+ return ret;
+}
+#else
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ /* We don't use ZERO PAGE */
+ return false;
+}
+#endif /* CONFIG_SUPPORT_ANON_ZERO_PAGE */
+
/*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -2626,6 +2697,10 @@ static int do_anonymous_page(struct mm_s
/* Allocate our own private page. */
pte_unmap(page_table);

+ if (unlikely(!(flags & FAULT_FLAG_WRITE)))
+ if (do_anon_zeromap(mm, vma, address, pmd))
+ return 0;
+
if (unlikely(anon_vma_prepare(vma)))
goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
Index: mmotm-2.6.31-Jul15/mm/fremap.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/fremap.c
+++ mmotm-2.6.31-Jul15/mm/fremap.c
@@ -33,7 +33,7 @@ static void zap_pte(struct mm_struct *mm

flush_cache_page(vma, addr, pte_pfn(pte));
pte = ptep_clear_flush(vma, addr, ptep);
- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
Index: mmotm-2.6.31-Jul15/mm/mempolicy.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/mempolicy.c
+++ mmotm-2.6.31-Jul15/mm/mempolicy.c
@@ -404,13 +404,13 @@ static int check_pte_range(struct vm_are

if (!pte_present(*pte))
continue;
- page = vm_normal_page(vma, addr, *pte);
+ /* we avoid zero page here */
+ page = vm_normal_page(vma, addr, *pte, FOLL_NOZERO);
if (!page)
continue;
/*
- * The check for PageReserved here is important to avoid
- * handling zero pages and other pages that may have been
- * marked special by the system.
+ * The check for PageReserved here is important to avoid pages
+ * that may have been marked special by the system.
*
* If the PageReserved would not be checked here then f.e.
* the location of the zero page could have an influence
Index: mmotm-2.6.31-Jul15/mm/rmap.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/rmap.c
+++ mmotm-2.6.31-Jul15/mm/rmap.c
@@ -946,7 +946,11 @@ static int try_to_unmap_cluster(unsigned
for (; address < end; pte++, address += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
- page = vm_normal_page(vma, address, *pte);
+ /*
+ * Because we come from try_to_unmap_file(), we'll never see
+ * ZERO_PAGE or an anon page.
+ */
+ page = vm_normal_page(vma, address, *pte, FOLL_NOZERO);
BUG_ON(!page || PageAnon(page));

if (locked_vma) {
Index: mmotm-2.6.31-Jul15/include/linux/mm.h
===================================================================
--- mmotm-2.6.31-Jul15.orig/include/linux/mm.h
+++ mmotm-2.6.31-Jul15/include/linux/mm.h
@@ -753,7 +753,7 @@ struct zap_details {
};

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte);
+ pte_t pte, unsigned int flags);

int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
@@ -1245,6 +1245,7 @@ struct page *follow_page(struct vm_area_
#define FOLL_TOUCH 0x02 /* mark page accessed */
#define FOLL_GET 0x04 /* do get_page on page */
#define FOLL_ANON 0x08 /* give ZERO_PAGE if no pgtable */
+#define FOLL_NOZERO 0x10 /* returns NULL if ZERO_PAGE is found */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
Index: mmotm-2.6.31-Jul15/mm/internal.h
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/internal.h
+++ mmotm-2.6.31-Jul15/mm/internal.h
@@ -254,6 +254,7 @@ static inline void mminit_validate_memmo
#define GUP_FLAGS_FORCE 0x2
#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
#define GUP_FLAGS_IGNORE_SIGKILL 0x8
+#define GUP_FLAGS_IGNORE_ZERO 0x10

int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
Index: mmotm-2.6.31-Jul15/mm/migrate.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/migrate.c
+++ mmotm-2.6.31-Jul15/mm/migrate.c
@@ -850,7 +850,7 @@ static int do_move_page_to_node_array(st
if (!vma || !vma_migratable(vma))
goto set_status;

- page = follow_page(vma, pp->addr, FOLL_GET);
+ page = follow_page(vma, pp->addr, FOLL_GET | FOLL_NOZERO);

err = PTR_ERR(page);
if (IS_ERR(page))
@@ -1007,14 +1007,14 @@ static void do_pages_stat_array(struct m
if (!vma)
goto set_status;

- page = follow_page(vma, addr, 0);
+ page = follow_page(vma, addr, FOLL_NOZERO);

err = PTR_ERR(page);
if (IS_ERR(page))
goto set_status;

err = -ENOENT;
- /* Use PageReserved to check for zero page */
+ /* if this is the zero page, page is NULL. */
if (!page || PageReserved(page))
goto set_status;

Index: mmotm-2.6.31-Jul15/mm/mlock.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/mlock.c
+++ mmotm-2.6.31-Jul15/mm/mlock.c
@@ -162,7 +162,7 @@ static long __mlock_vma_pages_range(stru
struct page *pages[16]; /* 16 gives a reasonable batch */
int nr_pages = (end - start) / PAGE_SIZE;
int ret = 0;
- int gup_flags = 0;
+ int gup_flags = GUP_FLAGS_IGNORE_ZERO;

VM_BUG_ON(start & ~PAGE_MASK);
VM_BUG_ON(end & ~PAGE_MASK);
Index: mmotm-2.6.31-Jul15/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.31-Jul15/fs/proc/task_mmu.c
@@ -361,8 +361,8 @@ static int smaps_pte_range(pmd_t *pmd, u
continue;

mss->resident += PAGE_SIZE;
-
- page = vm_normal_page(vma, addr, ptent);
+ /* we ignore zero page */
+ page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
if (!page)
continue;

@@ -469,8 +469,8 @@ static int clear_refs_pte_range(pmd_t *p
ptent = *pte;
if (!pte_present(ptent))
continue;
-
- page = vm_normal_page(vma, addr, ptent);
+ /* we ignore zero page */
+ page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
if (!page)
continue;

2009-07-22 23:53:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 0/2] ZERO PAGE again v4.

On Thu, 16 Jul 2009 18:01:34 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

>
> Rebased onto mm-of-the-moment snapshot 2009-07-15-20-57.
> And modified to make vm_normal_page() eat FOLL_NOZERO directly.
>
> Any comments ?
>

A week passed since I posted this. It's no problem to keep updating this
and post again. But if anyone has concerns, please notify me.
I'll reduce CC: list in the next post.

Thanks,
-Kame

2009-07-23 00:13:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/2] ZERO PAGE again v4.

On Thu, 23 Jul 2009 08:51:37 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Thu, 16 Jul 2009 18:01:34 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> >
> > Rebased onto mm-of-the-moment snapshot 2009-07-15-20-57.
> > > And modified to make vm_normal_page() eat FOLL_NOZERO directly.
> >
> > Any comments ?
> >
>
> A week passed since I posted this.

I'm catching up at a rate of 2.5 days per day. Am presently up to July
16. I never know whether to work through it forwards or in reverse.

Geeze you guys send a lot of stuff. Stop writing new code and go fix
some bugs!

> It's no problem to keep updating this
> and post again. But if anyone has concerns, please notify me.
> I'll reduce CC: list in the next post.

ok...

2009-07-23 00:35:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 0/2] ZERO PAGE again v4.

On Wed, 22 Jul 2009 17:12:45 -0700
Andrew Morton <[email protected]> wrote:

> On Thu, 23 Jul 2009 08:51:37 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Thu, 16 Jul 2009 18:01:34 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > >
> > > Rebased onto mm-of-the-moment snapshot 2009-07-15-20-57.
> > > And modified to make vm_normal_page() eat FOLL_NOZERO directly.
> > >
> > > Any comments ?
> > >
> >
> > A week passed since I posted this.
>
> I'm catching up at a rate of 2.5 days per day. Am presently up to July
> 16. I never know whether to work through it forwards or in reverse.
>
> Geeze you guys send a lot of stuff. Stop writing new code and go fix
> some bugs!
>
In these months, I myself haven't been writing any new features.
I'm now in the stabilization stage.
(ZERO_PAGE is not a new feature from my point of view.)

The remaining big one is the Direct-I/O vs. fork() fix.

> > It's no problem to keep updating this
> > and post again. But if anyone has concerns, please notify me.
> > I'll reduce CC: list in the next post.
>
> ok...
>
I'll postpone v5 until the next week.

Thank you for your efforts.

BTW, when I post a new version, should I send a reply to the old version to say
"this version is obsolete"? Would that make your work easier? Like the following.

Re: [PATCH][Obsolete] new version will come (Was.....)

I tend to update patches to v5 or more before they get merged.

Regards,
-Kame

2009-07-23 00:48:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/2] ZERO PAGE again v4.

On Thu, 23 Jul 2009 09:33:34 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> BTW, when I post a new version, should I send a reply to the old version to say
> "this version is obsolete"? Would that make your work easier? Like the following.
>
> Re: [PATCH][Obsolete] new version will come (Was.....)
>
> I tend to update patches to v5 or more before they get merged.

Usually it's pretty clear when a new patch or patch series is going to
be sent. I think that simply resending it all is OK.

I don't pay much attention to the "version N" info either - it can be
unreliable and not everyone does it and chronological ordering works OK
for this.

Very occasionally I'll merge a patch and then discover a later version
further down through the backlog. But that's OK - I'll just update the
patch. Plus I'm not usually stuck this far in the past.

(I'm still trying to find half a day to read "Per-bdi writeback flusher
threads v12")

2009-07-26 16:01:20

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 0/2] ZERO PAGE again v4.

On Thu, 23 Jul 2009, KAMEZAWA Hiroyuki wrote:
> On Wed, 22 Jul 2009 17:12:45 -0700
> Andrew Morton <[email protected]> wrote:
> > On Thu, 23 Jul 2009 08:51:37 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > On Thu, 16 Jul 2009 18:01:34 +0900
> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > >
> > > > Rebased onto mm-of-the-moment snapshot 2009-07-15-20-57.
> > > > And modified to make vm_normal_page() eat FOLL_NOZERO directly.
> > > >
> > > > Any comments ?

Sorry, I've been waiting to have something positive to suggest,
but today still busy with my own issues (handling OOM in KSM).

I do dislike that additional argument to vm_normal_page, and
feel that's a problem to be solved in follow_page, rather
than spread to every other vm_normal_page user.

Does follow_page even need to be using vm_normal_page?
Hmm, VM_MIXEDMAP, __get_user_pages doesn't exclude that.

I also feel a strong (but not yet fulfilled) urge to check
all the use_zero_page ignore_zero stuff: which is far from
self-evident.

Hugh

2009-07-26 22:56:19

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 0/2] ZERO PAGE again v4.

Hugh Dickins wrote:
> On Thu, 23 Jul 2009, KAMEZAWA Hiroyuki wrote:
>> On Wed, 22 Jul 2009 17:12:45 -0700
>> Andrew Morton <[email protected]> wrote:
>> > On Thu, 23 Jul 2009 08:51:37 +0900
>> > KAMEZAWA Hiroyuki <[email protected]> wrote:
>> > > On Thu, 16 Jul 2009 18:01:34 +0900
>> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
>> > > >
>> > > > Rebased onto mm-of-the-moment snapshot 2009-07-15-20-57.
>> > > > And modified to make vm_normal_page() eat FOLL_NOZERO directly.
>> > > >
>> > > > Any comments ?
>
> Sorry, I've been waiting to have something positive to suggest,
> but today still busy with my own issues (handling OOM in KSM).
>
No problem, thank you.

> I do dislike that additional argument to vm_normal_page, and
> feel that's a problem to be solved in follow_page, rather
> than spread to every other vm_normal_page user.
>
Hmm, I'll check whether it's necessary or not again before v5.

> Does follow_page even need to be using vm_normal_page?
Avoiding it means follow_page() has to handle pte_special() itself.
But yes, all vm_normal_page() users other than get_user_pages() use
FOLL_NOZERO. I feel it's just a "which is cleaner?" problem.
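
Just to make that "which is cleaner" question concrete, here is a rough,
untested sketch (an assumption, not a patch) of what the follow_page() side
could look like if the special zero pte were handled there, keeping
vm_normal_page()'s old two-argument form and assuming FOLL_NOZERO stays a
follow_page()-level flag; how a plain vm_normal_page() should then treat the
zero pte for its other callers is left open:

	/* rough sketch: decide about the special zero pte in follow_page()
	 * and keep calling vm_normal_page() without a flags argument */
	if (HAVE_ANON_ZERO && pte_special(pte) && use_zero_page(vma)) {
		if (flags & FOLL_NOZERO)
			goto bad_page;	/* caller does not want ZERO_PAGE */
		page = ZERO_PAGE(0);
	} else {
		page = vm_normal_page(vma, address, pte);
		if (unlikely(!page))
			goto bad_page;
	}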


> Hmm, VM_MIXEDMAP, __get_user_pages doesn't exclude that.
> I also feel a strong (but not yet fulfilled) urge to check
> all the use_zero_page ignore_zero stuff: which is far from
> self-evident.
>
I'll add more comments in v5 (if vm_normal_page() still has "flags").

I myself feel I'll have to update this to v6 or v7.

Thank you
-Kame