2005-09-25 15:46:52

by Hugh Dickins

Subject: [PATCH 00/21] mm: page fault scalability prep

Here comes the preparatory batch for my page fault scalability patches.
This batch makes a few fixes - I suggest 01 and 02 should go in 2.6.14 -
and a lot of tidyups, clearing some undergrowth for the real patches.

Just occasionally there's a hint of where we shall be heading: as in the
prototype posted to linux-mm a month ago, narrowing the scope of the mm
page_table_lock, so we can descend pgd,pud,pmd without it; and then, on
machines with more cpus, using a lock per page-table page for the ptes.
Thanks to Christoph Lameter for his generous endorsement of that approach.

This first batch is between one third and one half of the work. The next batch
should follow in a few days' time. The long delay, mainly an odyssey
through the architectures to satisfy myself of safety, should be over -
though I still need to clarify some issues with maintainers, perhaps
by posting a few dubious arch patches.

This batch is against 2.6.14-rc2-mm1 plus Hirofumi's msync patch;
or against 2.6.14-rc2-git5 plus that and Nick's move_pte patch.

21/21 unfairly weights the deletions in the diffstat; without it we're at
36 files changed, 404 insertions(+), 702 deletions(-)

Hugh

Documentation/kernel-parameters.txt | 2
Documentation/m68k/kernel-options.txt | 24
arch/ia64/kernel/perfmon.c | 3
arch/ia64/mm/fault.c | 2
arch/m68k/Kconfig | 24
arch/m68k/atari/stram.c | 918 ----------------------------------
arch/mips/kernel/irixelf.c | 1
arch/sh/mm/hugetlbpage.c | 2
arch/sh64/mm/hugetlbpage.c | 188 ------
arch/sparc64/kernel/binfmt_aout32.c | 1
arch/sparc64/mm/tlb.c | 4
arch/x86_64/ia32/ia32_aout.c | 1
fs/binfmt_aout.c | 1
fs/binfmt_elf.c | 1
fs/binfmt_elf_fdpic.c | 7
fs/binfmt_flat.c | 1
fs/binfmt_som.c | 1
fs/exec.c | 2
fs/proc/array.c | 2
fs/proc/task_mmu.c | 8
include/asm-arm/tlb.h | 23
include/asm-arm26/tlb.h | 47 -
include/asm-generic/tlb.h | 23
include/asm-ia64/tlb.h | 19
include/asm-ppc64/pgtable.h | 4
include/asm-sparc64/tlb.h | 29 -
include/linux/mm.h | 17
include/linux/sched.h | 4
kernel/acct.c | 2
kernel/fork.c | 29 -
mm/fremap.c | 4
mm/hugetlb.c | 37 -
mm/memory.c | 325 ++++++------
mm/mmap.c | 87 +--
mm/mprotect.c | 4
mm/mremap.c | 170 ++----
mm/msync.c | 38 -
mm/nommu.c | 2
mm/rmap.c | 8
mm/swapfile.c | 9
40 files changed, 421 insertions(+), 1653 deletions(-)


2005-09-25 15:48:04

by Hugh Dickins

Subject: [PATCH 01/21] mm: hugetlb truncation fixes

hugetlbfs allows truncation of its files (should it?), but hugetlb.c
often forgets that: crashes and misaccounting ensue.

copy_hugetlb_page_range had better grab the src page_table_lock, since we
don't want to guess what happens if the file is concurrently truncated.
unmap_hugepage_range rss accounting must not assume the full range was
mapped. follow_hugetlb_page must guard with page_table_lock and be
prepared to exit early.

Restyle copy_hugetlb_page_range with a for loop like the others there.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/hugetlb.c | 35 +++++++++++++++++++++--------------
1 files changed, 21 insertions(+), 14 deletions(-)

--- 2.6.14-rc2/mm/hugetlb.c 2005-09-22 12:32:03.000000000 +0100
+++ mm01/mm/hugetlb.c 2005-09-24 19:26:24.000000000 +0100
@@ -273,21 +273,22 @@ int copy_hugetlb_page_range(struct mm_st
{
pte_t *src_pte, *dst_pte, entry;
struct page *ptepage;
- unsigned long addr = vma->vm_start;
- unsigned long end = vma->vm_end;
+ unsigned long addr;

- while (addr < end) {
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
+ spin_lock(&src->page_table_lock);
src_pte = huge_pte_offset(src, addr);
- BUG_ON(!src_pte || pte_none(*src_pte)); /* prefaulted */
- entry = *src_pte;
- ptepage = pte_page(entry);
- get_page(ptepage);
- add_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
- set_huge_pte_at(dst, addr, dst_pte, entry);
- addr += HPAGE_SIZE;
+ if (src_pte && !pte_none(*src_pte)) {
+ entry = *src_pte;
+ ptepage = pte_page(entry);
+ get_page(ptepage);
+ add_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
+ set_huge_pte_at(dst, addr, dst_pte, entry);
+ }
+ spin_unlock(&src->page_table_lock);
}
return 0;

@@ -322,8 +323,8 @@ void unmap_hugepage_range(struct vm_area

page = pte_page(pte);
put_page(page);
+ add_mm_counter(mm, rss, - (HPAGE_SIZE / PAGE_SIZE));
}
- add_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

@@ -402,6 +403,7 @@ int follow_hugetlb_page(struct mm_struct
BUG_ON(!is_vm_hugetlb_page(vma));

vpfn = vaddr/PAGE_SIZE;
+ spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {

if (pages) {
@@ -414,8 +416,13 @@ int follow_hugetlb_page(struct mm_struct
* indexing below to work. */
pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);

- /* hugetlb should be locked, and hence, prefaulted */
- WARN_ON(!pte || pte_none(*pte));
+ /* the hugetlb file might have been truncated */
+ if (!pte || pte_none(*pte)) {
+ remainder = 0;
+ if (!i)
+ i = -EFAULT;
+ break;
+ }

page = &pte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)];

@@ -433,7 +440,7 @@ int follow_hugetlb_page(struct mm_struct
--remainder;
++i;
}
-
+ spin_unlock(&mm->page_table_lock);
*length = remainder;
*position = vaddr;

2005-09-25 15:48:47

by Hugh Dickins

Subject: [PATCH 02/21] mm: copy_pte_range progress fix

My latency breaking in copy_pte_range didn't work as intended: instead
of checking at regular intervals, once past the first interval it checked
every time around the loop - far too eager to break out and be preempted.
Fix that.
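
A minimal sketch of the corrected pattern, as plain C outside the kernel
(should_yield() here is a hypothetical stand-in for the need_resched()
and need_lockbreak() tests, and item[i] = 0 for the per-pte work):

	#include <stdbool.h>

	extern bool should_yield(void);	/* stand-in, not kernel API */

	/*
	 * Process up to n items, checking for contention only at
	 * 32-item intervals: resetting progress when it reaches the
	 * threshold is what keeps the check periodic, instead of
	 * firing on every iteration once the first 32 are done.
	 * Returns how far we got; the caller drops its locks,
	 * reschedules, and calls again from there.
	 */
	static int process_some(int *item, int n)
	{
		int progress = 0;
		int i;

		for (i = 0; i < n; i++) {
			if (progress >= 32) {
				progress = 0;
				if (should_yield())
					break;
			}
			progress++;
			item[i] = 0;
		}
		return i;
	}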

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 14 ++++++++------
1 files changed, 8 insertions(+), 6 deletions(-)

--- mm01/mm/memory.c 2005-09-21 12:16:59.000000000 +0100
+++ mm02/mm/memory.c 2005-09-24 19:26:38.000000000 +0100
@@ -410,7 +410,7 @@ static int copy_pte_range(struct mm_stru
{
pte_t *src_pte, *dst_pte;
unsigned long vm_flags = vma->vm_flags;
- int progress;
+ int progress = 0;

again:
dst_pte = pte_alloc_map(dst_mm, dst_pmd, addr);
@@ -418,17 +418,19 @@ again:
return -ENOMEM;
src_pte = pte_offset_map_nested(src_pmd, addr);

- progress = 0;
spin_lock(&src_mm->page_table_lock);
do {
/*
* We are holding two locks at this point - either of them
* could generate latencies in another task on another CPU.
*/
- if (progress >= 32 && (need_resched() ||
- need_lockbreak(&src_mm->page_table_lock) ||
- need_lockbreak(&dst_mm->page_table_lock)))
- break;
+ if (progress >= 32) {
+ progress = 0;
+ if (need_resched() ||
+ need_lockbreak(&src_mm->page_table_lock) ||
+ need_lockbreak(&dst_mm->page_table_lock))
+ break;
+ }
if (pte_none(*src_pte)) {
progress++;
continue;

2005-09-25 15:49:31

by Hugh Dickins

Subject: [PATCH 03/21] mm: msync_pte_range progress

Use latency breaking in msync_pte_range like that in copy_pte_range,
instead of the ugly CONFIG_PREEMPT filemap_msync alternatives.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/msync.c | 38 ++++++++++++++------------------------
1 files changed, 14 insertions(+), 24 deletions(-)

--- mm02/mm/msync.c 2005-09-24 16:59:50.000000000 +0100
+++ mm03/mm/msync.c 2005-09-24 19:26:52.000000000 +0100
@@ -26,12 +26,21 @@ static void msync_pte_range(struct vm_ar
unsigned long addr, unsigned long end)
{
pte_t *pte;
+ int progress = 0;

+again:
pte = pte_offset_map(pmd, addr);
do {
unsigned long pfn;
struct page *page;

+ if (progress >= 64) {
+ progress = 0;
+ if (need_resched() ||
+ need_lockbreak(&vma->vm_mm->page_table_lock))
+ break;
+ }
+ progress++;
if (!pte_present(*pte))
continue;
if (!pte_maybe_dirty(*pte))
@@ -46,8 +55,12 @@ static void msync_pte_range(struct vm_ar
if (ptep_clear_flush_dirty(vma, addr, pte) ||
page_test_and_clear_dirty(page))
set_page_dirty(page);
+ progress += 3;
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap(pte - 1);
+ cond_resched_lock(&vma->vm_mm->page_table_lock);
+ if (addr != end)
+ goto again;
}

static inline void msync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
@@ -106,29 +119,6 @@ static void msync_page_range(struct vm_a
spin_unlock(&mm->page_table_lock);
}

-#ifdef CONFIG_PREEMPT
-static inline void filemap_msync(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
-{
- const size_t chunk = 64 * 1024; /* bytes */
- unsigned long next;
-
- do {
- next = addr + chunk;
- if (next > end || next < addr)
- next = end;
- msync_page_range(vma, addr, next);
- cond_resched();
- } while (addr = next, addr != end);
-}
-#else
-static inline void filemap_msync(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
-{
- msync_page_range(vma, addr, end);
-}
-#endif
-
/*
* MS_SYNC syncs the entire file - including mappings.
*
@@ -150,7 +140,7 @@ static int msync_interval(struct vm_area
return -EBUSY;

if (file && (vma->vm_flags & VM_SHARED)) {
- filemap_msync(vma, addr, end);
+ msync_page_range(vma, addr, end);

if (flags & MS_SYNC) {
struct address_space *mapping = file->f_mapping;

2005-09-25 15:50:18

by Hugh Dickins

Subject: [PATCH 04/21] mm: zap_pte_range dont dirty anon

zap_pte_range already avoids wasting time on mark_page_accessed for anon
pages: it can also skip the anon set_page_dirty - the page only needs to be
marked dirty if shared with another mm, but then that mm's pte will say
pte_dirty too.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

--- mm03/mm/memory.c 2005-09-24 19:26:38.000000000 +0100
+++ mm04/mm/memory.c 2005-09-24 19:27:05.000000000 +0100
@@ -574,12 +574,14 @@ static void zap_pte_range(struct mmu_gat
addr) != page->index)
set_pte_at(tlb->mm, addr, pte,
pgoff_to_pte(page->index));
- if (pte_dirty(ptent))
- set_page_dirty(page);
if (PageAnon(page))
dec_mm_counter(tlb->mm, anon_rss);
- else if (pte_young(ptent))
- mark_page_accessed(page);
+ else {
+ if (pte_dirty(ptent))
+ set_page_dirty(page);
+ if (pte_young(ptent))
+ mark_page_accessed(page);
+ }
tlb->freed++;
page_remove_rmap(page);
tlb_remove_page(tlb, page);

2005-09-25 15:51:30

by Hugh Dickins

Subject: [PATCH 05/21] mm: anon is already wrprotected

do_anonymous_page's pte_wrprotect causes some confusion: in such a case,
vm_page_prot must already be forcing COW, so must omit write permission,
and so the pte_wrprotect is redundant. Replace it by a comment to that
effect, and reword the comment on unuse_pte which also caused confusion.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 7 ++++---
mm/swapfile.c | 7 +++----
2 files changed, 7 insertions(+), 7 deletions(-)

--- mm04/mm/memory.c 2005-09-24 19:27:05.000000000 +0100
+++ mm05/mm/memory.c 2005-09-24 19:27:19.000000000 +0100
@@ -1768,13 +1768,14 @@ do_anonymous_page(struct mm_struct *mm,
unsigned long addr)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);

- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ /* Mapping of ZERO_PAGE - vm_page_prot is readonly */
+ entry = mk_pte(ZERO_PAGE(addr), vma->vm_page_prot);

/* ..except if it's a write access */
if (write_access) {
+ struct page *page;
+
/* Allocate our own private page. */
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
--- mm04/mm/swapfile.c 2005-09-22 12:32:04.000000000 +0100
+++ mm05/mm/swapfile.c 2005-09-24 19:27:19.000000000 +0100
@@ -396,10 +396,9 @@ void free_swap_and_cache(swp_entry_t ent
}

/*
- * Always set the resulting pte to be nowrite (the same as COW pages
- * after one process has exited). We don't know just how many PTEs will
- * share this swap entry, so be cautious and let do_wp_page work out
- * what to do if a write is requested later.
+ * No need to decide whether this PTE shares the swap entry with others,
+ * just let do_wp_page work it out if a write is requested later - to
+ * force COW, vm_page_prot omits write permission from any private vma.
*
* vma->vm_mm->page_table_lock is held.
*/

2005-09-25 15:52:47

by Hugh Dickins

Subject: [PATCH 06/21] mm: vm_stat_account unshackled

The original vm_stat_account has fallen into disuse, with only one user,
and only one user of vm_stat_unaccount. It's easier to keep track if we
convert them all to __vm_stat_account, then free it from its __shackles.

Signed-off-by: Hugh Dickins <[email protected]>
---

arch/ia64/kernel/perfmon.c | 3 ++-
arch/ia64/mm/fault.c | 2 +-
include/linux/mm.h | 16 ++--------------
kernel/fork.c | 2 +-
mm/mmap.c | 20 ++++++++++----------
mm/mprotect.c | 4 ++--
mm/mremap.c | 4 ++--
7 files changed, 20 insertions(+), 31 deletions(-)

--- mm05/arch/ia64/kernel/perfmon.c 2005-09-21 12:16:14.000000000 +0100
+++ mm06/arch/ia64/kernel/perfmon.c 2005-09-24 19:27:33.000000000 +0100
@@ -2352,7 +2352,8 @@ pfm_smpl_buffer_alloc(struct task_struct
insert_vm_struct(mm, vma);

mm->total_vm += size >> PAGE_SHIFT;
- vm_stat_account(vma);
+ vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file,
+ vma_pages(vma));
up_write(&task->mm->mmap_sem);

/*
--- mm05/arch/ia64/mm/fault.c 2005-09-21 12:16:14.000000000 +0100
+++ mm06/arch/ia64/mm/fault.c 2005-09-24 19:27:33.000000000 +0100
@@ -41,7 +41,7 @@ expand_backing_store (struct vm_area_str
vma->vm_mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
- __vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file, grow);
+ vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file, grow);
return 0;
}

--- mm05/include/linux/mm.h 2005-09-22 12:32:02.000000000 +0100
+++ mm06/include/linux/mm.h 2005-09-24 19:27:33.000000000 +0100
@@ -936,26 +936,14 @@ int remap_pfn_range(struct vm_area_struc
unsigned long, unsigned long, pgprot_t);

#ifdef CONFIG_PROC_FS
-void __vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
+void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
#else
-static inline void __vm_stat_account(struct mm_struct *mm,
+static inline void vm_stat_account(struct mm_struct *mm,
unsigned long flags, struct file *file, long pages)
{
}
#endif /* CONFIG_PROC_FS */

-static inline void vm_stat_account(struct vm_area_struct *vma)
-{
- __vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file,
- vma_pages(vma));
-}
-
-static inline void vm_stat_unaccount(struct vm_area_struct *vma)
-{
- __vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file,
- -vma_pages(vma));
-}
-
/* update per process rss and vm hiwater data */
extern void update_mem_hiwater(struct task_struct *tsk);

--- mm05/kernel/fork.c 2005-09-22 12:32:03.000000000 +0100
+++ mm06/kernel/fork.c 2005-09-24 19:27:33.000000000 +0100
@@ -212,7 +212,7 @@ static inline int dup_mmap(struct mm_str
if (mpnt->vm_flags & VM_DONTCOPY) {
long pages = vma_pages(mpnt);
mm->total_vm -= pages;
- __vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
+ vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
-pages);
continue;
}
--- mm05/mm/mmap.c 2005-09-22 12:32:03.000000000 +0100
+++ mm06/mm/mmap.c 2005-09-24 19:27:33.000000000 +0100
@@ -828,7 +828,7 @@ none:
}

#ifdef CONFIG_PROC_FS
-void __vm_stat_account(struct mm_struct *mm, unsigned long flags,
+void vm_stat_account(struct mm_struct *mm, unsigned long flags,
struct file *file, long pages)
{
const unsigned long stack_flags
@@ -1106,7 +1106,7 @@ munmap_back:
}
out:
mm->total_vm += len >> PAGE_SHIFT;
- __vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
+ vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
mm->locked_vm += len >> PAGE_SHIFT;
make_pages_present(addr, addr + len);
@@ -1471,7 +1471,7 @@ static int acct_stack_growth(struct vm_a
mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
mm->locked_vm += grow;
- __vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
+ vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
return 0;
}

@@ -1606,15 +1606,15 @@ find_extend_vma(struct mm_struct * mm, u
* By the time this function is called, the area struct has been
* removed from the process mapping list.
*/
-static void unmap_vma(struct mm_struct *mm, struct vm_area_struct *area)
+static void unmap_vma(struct mm_struct *mm, struct vm_area_struct *vma)
{
- size_t len = area->vm_end - area->vm_start;
+ long nrpages = vma_pages(vma);

- area->vm_mm->total_vm -= len >> PAGE_SHIFT;
- if (area->vm_flags & VM_LOCKED)
- area->vm_mm->locked_vm -= len >> PAGE_SHIFT;
- vm_stat_unaccount(area);
- remove_vm_struct(area);
+ mm->total_vm -= nrpages;
+ if (vma->vm_flags & VM_LOCKED)
+ mm->locked_vm -= nrpages;
+ vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
+ remove_vm_struct(vma);
}

/*
--- mm05/mm/mprotect.c 2005-09-22 12:32:03.000000000 +0100
+++ mm06/mm/mprotect.c 2005-09-24 19:27:33.000000000 +0100
@@ -168,8 +168,8 @@ success:
vma->vm_flags = newflags;
vma->vm_page_prot = newprot;
change_protection(vma, start, end, newprot);
- __vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
- __vm_stat_account(mm, newflags, vma->vm_file, nrpages);
+ vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
+ vm_stat_account(mm, newflags, vma->vm_file, nrpages);
return 0;

fail:
--- mm05/mm/mremap.c 2005-09-22 12:32:03.000000000 +0100
+++ mm06/mm/mremap.c 2005-09-24 19:27:33.000000000 +0100
@@ -233,7 +233,7 @@ static unsigned long move_vma(struct vm_
* since do_munmap() will decrement it by old_len == new_len
*/
mm->total_vm += new_len >> PAGE_SHIFT;
- __vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);
+ vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);

if (do_munmap(mm, old_addr, old_len) < 0) {
/* OOM: unable to split vma, just get accounts right */
@@ -384,7 +384,7 @@ unsigned long do_mremap(unsigned long ad
addr + new_len, vma->vm_pgoff, NULL);

current->mm->total_vm += pages;
- __vm_stat_account(vma->vm_mm, vma->vm_flags,
+ vm_stat_account(vma->vm_mm, vma->vm_flags,
vma->vm_file, pages);
if (vma->vm_flags & VM_LOCKED) {
current->mm->locked_vm += pages;

2005-09-25 15:53:38

by Hugh Dickins

Subject: [PATCH 07/21] mm: remove_vma_list consolidation

unmap_vma doesn't amount to much, so let's put it inside unmap_vma_list.
Except it doesn't unmap anything - unmap_region just did the unmapping:
so rename it to remove_vma_list.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/mmap.c | 36 ++++++++++++------------------------
1 files changed, 12 insertions(+), 24 deletions(-)

--- mm06/mm/mmap.c 2005-09-24 19:27:33.000000000 +0100
+++ mm07/mm/mmap.c 2005-09-24 19:27:47.000000000 +0100
@@ -1599,35 +1599,23 @@ find_extend_vma(struct mm_struct * mm, u
}
#endif

-/* Normal function to fix up a mapping
- * This function is the default for when an area has no specific
- * function. This may be used as part of a more specific routine.
- *
- * By the time this function is called, the area struct has been
- * removed from the process mapping list.
- */
-static void unmap_vma(struct mm_struct *mm, struct vm_area_struct *vma)
-{
- long nrpages = vma_pages(vma);
-
- mm->total_vm -= nrpages;
- if (vma->vm_flags & VM_LOCKED)
- mm->locked_vm -= nrpages;
- vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
- remove_vm_struct(vma);
-}
-
/*
- * Update the VMA and inode share lists.
- *
- * Ok - we have the memory areas we should free on the 'free' list,
+ * Ok - we have the memory areas we should free on the vma list,
* so release them, and do the vma updates.
+ *
+ * Called with the mm semaphore held.
*/
-static void unmap_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
+static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
{
do {
struct vm_area_struct *next = vma->vm_next;
- unmap_vma(mm, vma);
+ long nrpages = vma_pages(vma);
+
+ mm->total_vm -= nrpages;
+ if (vma->vm_flags & VM_LOCKED)
+ mm->locked_vm -= nrpages;
+ vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
+ remove_vm_struct(vma);
vma = next;
} while (vma);
validate_mm(mm);
@@ -1795,7 +1783,7 @@ int do_munmap(struct mm_struct *mm, unsi
unmap_region(mm, vma, prev, start, end);

/* Fix up all other VM information */
- unmap_vma_list(mm, vma);
+ remove_vma_list(mm, vma);

return 0;
}

2005-09-25 15:54:25

by Hugh Dickins

Subject: [PATCH 08/21] mm: unlink_file_vma, remove_vma

Divide remove_vm_struct into two parts: first anon_vma_unlink plus
unlink_file_vma, to unlink the vma from the list and tree by which rmap
or vmtruncate might find it; then remove_vma to close, fput and free.

The intention here is to do the anon_vma_unlink and unlink_file_vma
earlier, in free_pgtables before freeing any page tables: so we can be
sure that any page tables traversed by rmap and vmtruncate are stable
(and other, ordinary cases are stabilized by holding mmap_sem).

This will be crucial to traversing pgd,pud,pmd without page_table_lock.
But testing the split-out patch showed that lifting the page_table_lock
is symbiotically necessary to make this change - the lock ordering is
wrong to move those unlinks into free_pgtables while it's under ptlock.

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/mm.h | 1 +
mm/mmap.c | 41 +++++++++++++++++++++++++++--------------
2 files changed, 28 insertions(+), 14 deletions(-)

--- mm07/include/linux/mm.h 2005-09-24 19:27:33.000000000 +0100
+++ mm08/include/linux/mm.h 2005-09-24 19:28:01.000000000 +0100
@@ -840,6 +840,7 @@ extern int split_vma(struct mm_struct *,
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
struct rb_node **, struct rb_node *);
+extern void unlink_file_vma(struct vm_area_struct *);
extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
unsigned long addr, unsigned long len, pgoff_t pgoff);
extern void exit_mmap(struct mm_struct *);
--- mm07/mm/mmap.c 2005-09-24 19:27:47.000000000 +0100
+++ mm08/mm/mmap.c 2005-09-24 19:28:01.000000000 +0100
@@ -177,26 +177,44 @@ static void __remove_shared_vm_struct(st
}

/*
- * Remove one vm structure and free it.
+ * Unlink a file-based vm structure from its prio_tree, to hide
+ * vma from rmap and vmtruncate before freeing its page tables.
*/
-static void remove_vm_struct(struct vm_area_struct *vma)
+void unlink_file_vma(struct vm_area_struct *vma)
{
struct file *file = vma->vm_file;

- might_sleep();
if (file) {
struct address_space *mapping = file->f_mapping;
spin_lock(&mapping->i_mmap_lock);
__remove_shared_vm_struct(vma, file, mapping);
spin_unlock(&mapping->i_mmap_lock);
}
+}
+
+/*
+ * Close a vm structure and free it, returning the next.
+ */
+static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+{
+ struct vm_area_struct *next = vma->vm_next;
+
+ /*
+ * Hide vma from rmap and vmtruncate before freeing page tables:
+ * to be moved into free_pgtables once page_table_lock is lifted
+ * from it, but until then lock ordering forbids that move.
+ */
+ anon_vma_unlink(vma);
+ unlink_file_vma(vma);
+
+ might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
- if (file)
- fput(file);
- anon_vma_unlink(vma);
+ if (vma->vm_file)
+ fput(vma->vm_file);
mpol_free(vma_policy(vma));
kmem_cache_free(vm_area_cachep, vma);
+ return next;
}

asmlinkage unsigned long sys_brk(unsigned long brk)
@@ -1608,15 +1626,13 @@ find_extend_vma(struct mm_struct * mm, u
static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
{
do {
- struct vm_area_struct *next = vma->vm_next;
long nrpages = vma_pages(vma);

mm->total_vm -= nrpages;
if (vma->vm_flags & VM_LOCKED)
mm->locked_vm -= nrpages;
vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
- remove_vm_struct(vma);
- vma = next;
+ vma = remove_vma(vma);
} while (vma);
validate_mm(mm);
}
@@ -1940,11 +1956,8 @@ void exit_mmap(struct mm_struct *mm)
* Walk the list again, actually closing and freeing it
* without holding any MM locks.
*/
- while (vma) {
- struct vm_area_struct *next = vma->vm_next;
- remove_vm_struct(vma);
- vma = next;
- }
+ while (vma)
+ vma = remove_vma(vma);

BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
}

2005-09-25 15:55:09

by Hugh Dickins

Subject: [PATCH 09/21] mm: exit_mmap need not reset

exit_mmap resets various mm_struct fields, but the mm is well on its way
out, and none of those fields matter by this point.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/mmap.c | 6 ------
1 files changed, 6 deletions(-)

--- mm08/mm/mmap.c 2005-09-24 19:28:01.000000000 +0100
+++ mm09/mm/mmap.c 2005-09-24 19:28:15.000000000 +0100
@@ -1944,12 +1944,6 @@ void exit_mmap(struct mm_struct *mm)
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);

- mm->mmap = mm->mmap_cache = NULL;
- mm->mm_rb = RB_ROOT;
- set_mm_counter(mm, rss, 0);
- mm->total_vm = 0;
- mm->locked_vm = 0;
-
spin_unlock(&mm->page_table_lock);

/*

2005-09-25 15:56:52

by Hugh Dickins

Subject: [PATCH 10/21] mm: page fault handlers tidyup

Impose a little more consistency on the page fault handlers do_wp_page,
do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass
their arguments in the same order, under the same names?

break_cow is all very well, but what it did was open-coded elsewhere:
it's easier to compare if its body is brought back into do_wp_page.

do_file_page's fallback to do_no_page dates from a time when we were
testing pte_file by using it wherever possible: currently it's peculiar
to nonlinear vmas, so just check that. BUG_ON if not? Better not: it's
probably page table corruption, so just show the pte - there's a
pte_ERROR macro for that, and we can use it for do_wp_page's invalid pfn too.

Hah! Someone in the ppc64 world noticed pte_ERROR was unused and removed
it: restore it (and make its pud_ERROR say "pud", not "pmd").

Signed-off-by: Hugh Dickins <[email protected]>
---

include/asm-ppc64/pgtable.h | 4
mm/memory.c | 220 +++++++++++++++++++-------------------------
2 files changed, 100 insertions(+), 124 deletions(-)

--- mm09/include/asm-ppc64/pgtable.h 2005-09-21 12:16:55.000000000 +0100
+++ mm10/include/asm-ppc64/pgtable.h 2005-09-24 19:28:28.000000000 +0100
@@ -478,10 +478,12 @@ extern pgprot_t phys_mem_access_prot(str
#define __HAVE_ARCH_PTE_SAME
#define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0)

+#define pte_ERROR(e) \
+ printk("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, pte_val(e))
#define pmd_ERROR(e) \
printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pmd_val(e))
#define pud_ERROR(e) \
- printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pud_val(e))
+ printk("%s:%d: bad pud %08lx.\n", __FILE__, __LINE__, pud_val(e))
#define pgd_ERROR(e) \
printk("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e))

--- mm09/mm/memory.c 2005-09-24 19:27:19.000000000 +0100
+++ mm10/mm/memory.c 2005-09-24 19:28:28.000000000 +0100
@@ -1213,28 +1213,10 @@ static inline pte_t maybe_mkwrite(pte_t
}

/*
- * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock
- */
-static inline void break_cow(struct vm_area_struct * vma, struct page * new_page, unsigned long address,
- pte_t *page_table)
-{
- pte_t entry;
-
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
- vma);
- ptep_establish(vma, address, page_table, entry);
- update_mmu_cache(vma, address, entry);
- lazy_mmu_prot_update(entry);
-}
-
-/*
* This routine handles present pages, when users try to write
* to a shared page. It is done by copying the page to a new address
* and decrementing the shared-page counter for the old page.
*
- * Goto-purists beware: the only reason for goto's here is that it results
- * in better assembly code.. The "default" path will see no jumps at all.
- *
* Note that this routine assumes that the protection checks have been
* done by the caller (the low-level page fault routine in most cases).
* Thus we can safely just mark it writable once we've done any necessary
@@ -1247,25 +1229,22 @@ static inline void break_cow(struct vm_a
* We hold the mm semaphore and the page_table_lock on entry and exit
* with the page_table_lock released.
*/
-static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
+static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ pte_t orig_pte)
{
struct page *old_page, *new_page;
- unsigned long pfn = pte_pfn(pte);
+ unsigned long pfn = pte_pfn(orig_pte);
pte_t entry;
- int ret;
+ int ret = VM_FAULT_MINOR;

if (unlikely(!pfn_valid(pfn))) {
/*
- * This should really halt the system so it can be debugged or
- * at least the kernel stops what it's doing before it corrupts
- * data, but for the moment just pretend this is OOM.
+ * Page table corrupted: show pte and kill process.
*/
- pte_unmap(page_table);
- printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
- address);
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ pte_ERROR(orig_pte);
+ ret = VM_FAULT_OOM;
+ goto unlock;
}
old_page = pfn_to_page(pfn);

@@ -1274,52 +1253,57 @@ static int do_wp_page(struct mm_struct *
unlock_page(old_page);
if (reuse) {
flush_cache_page(vma, address, pfn);
- entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
- vma);
+ entry = pte_mkyoung(orig_pte);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
ptep_set_access_flags(vma, address, page_table, entry, 1);
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_MINOR|VM_FAULT_WRITE;
+ ret |= VM_FAULT_WRITE;
+ goto unlock;
}
}
- pte_unmap(page_table);

/*
* Ok, we need to copy. Oh, well..
*/
if (!PageReserved(old_page))
page_cache_get(old_page);
+ pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
- goto no_new_page;
+ goto oom;
if (old_page == ZERO_PAGE(address)) {
new_page = alloc_zeroed_user_highpage(vma, address);
if (!new_page)
- goto no_new_page;
+ goto oom;
} else {
new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!new_page)
- goto no_new_page;
+ goto oom;
copy_user_highpage(new_page, old_page, address);
}
+
/*
* Re-check the pte - we dropped the lock
*/
- ret = VM_FAULT_MINOR;
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
- if (likely(pte_same(*page_table, pte))) {
+ if (likely(pte_same(*page_table, orig_pte))) {
if (PageAnon(old_page))
dec_mm_counter(mm, anon_rss);
if (PageReserved(old_page))
inc_mm_counter(mm, rss);
else
page_remove_rmap(old_page);
+
flush_cache_page(vma, address, pfn);
- break_cow(vma, new_page, address, page_table);
+ entry = mk_pte(new_page, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ ptep_establish(vma, address, page_table, entry);
+ update_mmu_cache(vma, address, entry);
+ lazy_mmu_prot_update(entry);
+
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);

@@ -1327,13 +1311,13 @@ static int do_wp_page(struct mm_struct *
new_page = old_page;
ret |= VM_FAULT_WRITE;
}
- pte_unmap(page_table);
page_cache_release(new_page);
page_cache_release(old_page);
+unlock:
+ pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
return ret;
-
-no_new_page:
+oom:
page_cache_release(old_page);
return VM_FAULT_OOM;
}
@@ -1661,17 +1645,19 @@ void swapin_readahead(swp_entry_t entry,
* We hold the mm semaphore and the page_table_lock on entry and
* should release the pagetable lock on exit..
*/
-static int do_swap_page(struct mm_struct * mm,
- struct vm_area_struct * vma, unsigned long address,
- pte_t *page_table, pmd_t *pmd, pte_t orig_pte, int write_access)
+static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ int write_access, pte_t orig_pte)
{
struct page *page;
- swp_entry_t entry = pte_to_swp_entry(orig_pte);
+ swp_entry_t entry;
pte_t pte;
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
+
+ entry = pte_to_swp_entry(orig_pte);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
@@ -1685,11 +1671,7 @@ static int do_swap_page(struct mm_struct
page_table = pte_offset_map(pmd, address);
if (likely(pte_same(*page_table, orig_pte)))
ret = VM_FAULT_OOM;
- else
- ret = VM_FAULT_MINOR;
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
- goto out;
+ goto unlock;
}

/* Had to read the page from swap area: Major fault */
@@ -1745,6 +1727,7 @@ static int do_swap_page(struct mm_struct
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, address, pte);
lazy_mmu_prot_update(pte);
+unlock:
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
out:
@@ -1754,7 +1737,7 @@ out_nomap:
spin_unlock(&mm->page_table_lock);
unlock_page(page);
page_cache_release(page);
- goto out;
+ return ret;
}

/*
@@ -1762,17 +1745,15 @@ out_nomap:
* spinlock held to protect against concurrent faults in
* multithreaded programs.
*/
-static int
-do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ int write_access)
{
pte_t entry;

/* Mapping of ZERO_PAGE - vm_page_prot is readonly */
entry = mk_pte(ZERO_PAGE(addr), vma->vm_page_prot);

- /* ..except if it's a write access */
if (write_access) {
struct page *page;

@@ -1781,39 +1762,36 @@ do_anonymous_page(struct mm_struct *mm,
spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_zeroed_user_highpage(vma, addr);
+ goto oom;
+ page = alloc_zeroed_user_highpage(vma, address);
if (!page)
- goto no_mem;
+ goto oom;

spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
+ page_table = pte_offset_map(pmd, address);

if (!pte_none(*page_table)) {
- pte_unmap(page_table);
page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
+ goto unlock;
}
inc_mm_counter(mm, rss);
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
lru_cache_add_active(page);
SetPageReferenced(page);
- page_add_anon_rmap(page, vma, addr);
+ page_add_anon_rmap(page, vma, address);
}

- set_pte_at(mm, addr, page_table, entry);
- pte_unmap(page_table);
+ set_pte_at(mm, address, page_table, entry);

/* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
+ update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
+unlock:
+ pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
-out:
return VM_FAULT_MINOR;
-no_mem:
+oom:
return VM_FAULT_OOM;
}

@@ -1829,20 +1807,17 @@ no_mem:
* This is called with the MM semaphore held and the page table
* spinlock held. Exit with the spinlock released.
*/
-static int
-do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ int write_access)
{
- struct page * new_page;
+ struct page *new_page;
struct address_space *mapping = NULL;
pte_t entry;
unsigned int sequence = 0;
int ret = VM_FAULT_MINOR;
int anon = 0;

- if (!vma->vm_ops || !vma->vm_ops->nopage)
- return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);

@@ -1852,7 +1827,6 @@ do_no_page(struct mm_struct *mm, struct
smp_rmb(); /* serializes i_size against truncate_count */
}
retry:
- cond_resched();
new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
/*
* No smp_rmb is needed here as long as there's a full
@@ -1892,9 +1866,11 @@ retry:
* retry getting the page.
*/
if (mapping && unlikely(sequence != mapping->truncate_count)) {
- sequence = mapping->truncate_count;
spin_unlock(&mm->page_table_lock);
page_cache_release(new_page);
+ cond_resched();
+ sequence = mapping->truncate_count;
+ smp_rmb();
goto retry;
}
page_table = pte_offset_map(pmd, address);
@@ -1924,25 +1900,22 @@ retry:
page_add_anon_rmap(new_page, vma, address);
} else
page_add_file_rmap(new_page);
- pte_unmap(page_table);
} else {
/* One of our sibling threads was faster, back out. */
- pte_unmap(page_table);
page_cache_release(new_page);
- spin_unlock(&mm->page_table_lock);
- goto out;
+ goto unlock;
}

/* no need to invalidate: a not-present page shouldn't be cached */
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
+unlock:
+ pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
-out:
return ret;
oom:
page_cache_release(new_page);
- ret = VM_FAULT_OOM;
- goto out;
+ return VM_FAULT_OOM;
}

/*
@@ -1950,29 +1923,28 @@ oom:
* from the encoded file_pte if possible. This enables swappable
* nonlinear vmas.
*/
-static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+static int do_file_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ int write_access, pte_t orig_pte)
{
- unsigned long pgoff;
+ pgoff_t pgoff;
int err;

- BUG_ON(!vma->vm_ops || !vma->vm_ops->nopage);
- /*
- * Fall back to the linear mapping if the fs does not support
- * ->populate:
- */
- if (!vma->vm_ops->populate ||
- (write_access && !(vma->vm_flags & VM_SHARED))) {
- pte_clear(mm, address, pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
- }
-
- pgoff = pte_to_pgoff(*pte);
-
- pte_unmap(pte);
+ pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);

- err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
+ if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
+ /*
+ * Page table corrupted: show pte and kill process.
+ */
+ pte_ERROR(orig_pte);
+ return VM_FAULT_OOM;
+ }
+ /* We can then assume vm->vm_ops && vma->vm_ops->populate */
+
+ pgoff = pte_to_pgoff(orig_pte);
+ err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE,
+ vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
return VM_FAULT_OOM;
if (err)
@@ -2002,23 +1974,25 @@ static int do_file_page(struct mm_struct
* release it when done.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct * vma, unsigned long address,
- int write_access, pte_t *pte, pmd_t *pmd)
+ struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pmd_t *pmd, int write_access)
{
pte_t entry;

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
- if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ if (pte_none(entry)) {
+ if (!vma->vm_ops || !vma->vm_ops->nopage)
+ return do_anonymous_page(mm, vma, address,
+ pte, pmd, write_access);
+ return do_no_page(mm, vma, address,
+ pte, pmd, write_access);
+ }
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
- return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
+ return do_file_page(mm, vma, address,
+ pte, pmd, write_access, entry);
+ return do_swap_page(mm, vma, address,
+ pte, pmd, write_access, entry);
}

if (write_access) {
@@ -2038,7 +2012,7 @@ static inline int handle_pte_fault(struc
/*
* By the time we get here, we already hold the mm semaphore
*/
-int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access)
{
pgd_t *pgd;
@@ -2072,7 +2046,7 @@ int __handle_mm_fault(struct mm_struct *
if (!pte)
goto oom;

- return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+ return handle_pte_fault(mm, vma, address, pte, pmd, write_access);

oom:
spin_unlock(&mm->page_table_lock);

2005-09-25 15:57:37

by Hugh Dickins

Subject: [PATCH 11/21] mm: move_page_tables by extents

Speeding up mremap's moving of ptes has never been a priority, but the
locking will get more complicated shortly, and is already too baroque.

Scrap the current one-by-one moving and do an extent at a time: curtailed
by the ends of the src and dst pmds (we have to use PMD_SIZE: the way
pmd_addr_end gets elided doesn't match this usage), and by latency
considerations.

One nice property of the old method is lost: it never allocated a page
table unless absolutely necessary, so you could free empty page tables
by mremapping to and fro. Whereas this way, it allocates a dst table
wherever there was a src table. I keep diving in to reinstate the old
behaviour, then come out preferring not to clutter how it now is.
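
The clamping arithmetic, pulled out as a hedged sketch (the constants are
illustrative, and move_extent is not a function in the patch - the patch
does this inline in move_page_tables):

	/* Illustrative values: 2MB pmd extent, 4k pages */
	#define PMD_SIZE	0x200000UL
	#define PMD_MASK	(~(PMD_SIZE - 1))
	#define LATENCY_LIMIT	(64 * 4096UL)

	/*
	 * How many bytes to move in one go: stop at the end of the
	 * source pmd, clamp to the end of the destination pmd, then
	 * clamp again for latency.  The "- 1" guards against next
	 * wrapping to 0 when old_addr sits in the topmost pmd.
	 */
	static unsigned long move_extent(unsigned long old_addr,
		unsigned long old_end, unsigned long new_addr)
	{
		unsigned long extent, next;

		next = (old_addr + PMD_SIZE) & PMD_MASK;
		if (next - 1 > old_end)
			next = old_end;
		extent = next - old_addr;

		next = (new_addr + PMD_SIZE) & PMD_MASK;
		if (extent > next - new_addr)
			extent = next - new_addr;

		if (extent > LATENCY_LIMIT)
			extent = LATENCY_LIMIT;
		return extent;
	}

So with these values, moving from old_addr 0x1ff000 to new_addr 0x3fe000
gives an extent of just 0x1000: one page, curtailed by the source pmd
boundary at 0x200000.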

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/mremap.c | 166 +++++++++++++++++++++++++-----------------------------------
1 files changed, 71 insertions(+), 95 deletions(-)

--- mm10/mm/mremap.c 2005-09-24 19:27:33.000000000 +0100
+++ mm11/mm/mremap.c 2005-09-24 19:28:42.000000000 +0100
@@ -22,40 +22,15 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>

-static pte_t *get_one_pte_map_nested(struct mm_struct *mm, unsigned long addr)
-{
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte = NULL;
-
- pgd = pgd_offset(mm, addr);
- if (pgd_none_or_clear_bad(pgd))
- goto end;
-
- pud = pud_offset(pgd, addr);
- if (pud_none_or_clear_bad(pud))
- goto end;
-
- pmd = pmd_offset(pud, addr);
- if (pmd_none_or_clear_bad(pmd))
- goto end;
-
- pte = pte_offset_map_nested(pmd, addr);
- if (pte_none(*pte)) {
- pte_unmap_nested(pte);
- pte = NULL;
- }
-end:
- return pte;
-}
-
-static pte_t *get_one_pte_map(struct mm_struct *mm, unsigned long addr)
+static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;

+ /*
+ * We don't need page_table_lock: we have mmap_sem exclusively.
+ */
pgd = pgd_offset(mm, addr);
if (pgd_none_or_clear_bad(pgd))
return NULL;
@@ -68,35 +43,48 @@ static pte_t *get_one_pte_map(struct mm_
if (pmd_none_or_clear_bad(pmd))
return NULL;

- return pte_offset_map(pmd, addr);
+ return pmd;
}

-static inline pte_t *alloc_one_pte_map(struct mm_struct *mm, unsigned long addr)
+static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pud_t *pud;
- pmd_t *pmd;
- pte_t *pte = NULL;
+ pmd_t *pmd = NULL;
+ pte_t *pte;

+ /*
+ * We do need page_table_lock: because allocators expect that.
+ */
+ spin_lock(&mm->page_table_lock);
pgd = pgd_offset(mm, addr);
-
pud = pud_alloc(mm, pgd, addr);
if (!pud)
- return NULL;
+ goto out;
+
pmd = pmd_alloc(mm, pud, addr);
- if (pmd)
- pte = pte_alloc_map(mm, pmd, addr);
- return pte;
+ if (!pmd)
+ goto out;
+
+ pte = pte_alloc_map(mm, pmd, addr);
+ if (!pte) {
+ pmd = NULL;
+ goto out;
+ }
+ pte_unmap(pte);
+out:
+ spin_unlock(&mm->page_table_lock);
+ return pmd;
}

-static int
-move_one_page(struct vm_area_struct *vma, unsigned long old_addr,
- struct vm_area_struct *new_vma, unsigned long new_addr)
+static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
+ unsigned long old_addr, unsigned long old_end,
+ struct vm_area_struct *new_vma, pmd_t *new_pmd,
+ unsigned long new_addr)
{
struct address_space *mapping = NULL;
struct mm_struct *mm = vma->vm_mm;
- int error = 0;
- pte_t *src, *dst;
+ pte_t *old_pte, *new_pte, pte;

if (vma->vm_file) {
/*
@@ -111,74 +99,62 @@ move_one_page(struct vm_area_struct *vma
new_vma->vm_truncate_count != vma->vm_truncate_count)
new_vma->vm_truncate_count = 0;
}
+
spin_lock(&mm->page_table_lock);
+ old_pte = pte_offset_map(old_pmd, old_addr);
+ new_pte = pte_offset_map_nested(new_pmd, new_addr);

- src = get_one_pte_map_nested(mm, old_addr);
- if (src) {
- /*
- * Look to see whether alloc_one_pte_map needs to perform a
- * memory allocation. If it does then we need to drop the
- * atomic kmap
- */
- dst = get_one_pte_map(mm, new_addr);
- if (unlikely(!dst)) {
- pte_unmap_nested(src);
- if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
- dst = alloc_one_pte_map(mm, new_addr);
- if (mapping && !spin_trylock(&mapping->i_mmap_lock)) {
- spin_unlock(&mm->page_table_lock);
- spin_lock(&mapping->i_mmap_lock);
- spin_lock(&mm->page_table_lock);
- }
- src = get_one_pte_map_nested(mm, old_addr);
- }
- /*
- * Since alloc_one_pte_map can drop and re-acquire
- * page_table_lock, we should re-check the src entry...
- */
- if (src) {
- if (dst) {
- pte_t pte;
- pte = ptep_clear_flush(vma, old_addr, src);
-
- /* ZERO_PAGE can be dependant on virtual addr */
- pte = move_pte(pte, new_vma->vm_page_prot,
- old_addr, new_addr);
- set_pte_at(mm, new_addr, dst, pte);
- } else
- error = -ENOMEM;
- pte_unmap_nested(src);
- }
- if (dst)
- pte_unmap(dst);
+ for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
+ new_pte++, new_addr += PAGE_SIZE) {
+ if (pte_none(*old_pte))
+ continue;
+ pte = ptep_clear_flush(vma, old_addr, old_pte);
+ /* ZERO_PAGE can be dependant on virtual addr */
+ pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
+ set_pte_at(mm, new_addr, new_pte, pte);
}
+
+ pte_unmap_nested(new_pte - 1);
+ pte_unmap(old_pte - 1);
spin_unlock(&mm->page_table_lock);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
- return error;
}

+#define LATENCY_LIMIT (64 * PAGE_SIZE)
+
static unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len)
{
- unsigned long offset;
+ unsigned long extent, next, old_end;
+ pmd_t *old_pmd, *new_pmd;

- flush_cache_range(vma, old_addr, old_addr + len);
+ old_end = old_addr + len;
+ flush_cache_range(vma, old_addr, old_end);

- /*
- * This is not the clever way to do this, but we're taking the
- * easy way out on the assumption that most remappings will be
- * only a few pages.. This also makes error recovery easier.
- */
- for (offset = 0; offset < len; offset += PAGE_SIZE) {
- if (move_one_page(vma, old_addr + offset,
- new_vma, new_addr + offset) < 0)
- break;
+ for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
+ next = (old_addr + PMD_SIZE) & PMD_MASK;
+ if (next - 1 > old_end)
+ next = old_end;
+ extent = next - old_addr;
+ old_pmd = get_old_pmd(vma->vm_mm, old_addr);
+ if (!old_pmd)
+ continue;
+ new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+ if (!new_pmd)
+ break;
+ next = (new_addr + PMD_SIZE) & PMD_MASK;
+ if (extent > next - new_addr)
+ extent = next - new_addr;
+ if (extent > LATENCY_LIMIT)
+ extent = LATENCY_LIMIT;
+ move_ptes(vma, old_pmd, old_addr, old_addr + extent,
+ new_vma, new_pmd, new_addr);
}
- return offset;
+
+ return len + old_addr - old_end; /* how much done */
}

static unsigned long move_vma(struct vm_area_struct *vma,

2005-09-25 16:00:24

by Hugh Dickins

Subject: [PATCH 12/21] mm: tlb_gather_mmu get_cpu_var

tlb_gather_mmu dates from before kernel preemption was allowed, and uses
smp_processor_id or __get_cpu_var to find its per-cpu mmu_gather. That
works because it's currently only called after getting page_table_lock,
which is not dropped until after the matching tlb_finish_mmu. But don't
rely on that - it will soon change: so now disable preemption internally,
with a proper get_cpu_var in tlb_gather_mmu and a matching put_cpu_var in
tlb_finish_mmu.
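
For reference, those accessors expand to approximately this (simplified
from the include/linux/percpu.h of the time):

	#define get_cpu_var(var)	(*({ preempt_disable(); &__get_cpu_var(var); }))
	#define put_cpu_var(var)	preempt_enable()

so the gather/finish pair now brackets its own preempt-disabled region,
and the per-cpu mmu_gather cannot migrate out from under us even once
page_table_lock is no longer held across it.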

Signed-off-by: Hugh Dickins <[email protected]>
---

include/asm-arm/tlb.h | 5 +++--
include/asm-arm26/tlb.h | 7 ++++---
include/asm-generic/tlb.h | 10 +++++-----
include/asm-ia64/tlb.h | 6 ++++--
include/asm-sparc64/tlb.h | 4 +++-
5 files changed, 19 insertions(+), 13 deletions(-)

--- mm11/include/asm-arm/tlb.h 2005-06-17 20:48:29.000000000 +0100
+++ mm12/include/asm-arm/tlb.h 2005-09-24 19:28:56.000000000 +0100
@@ -39,8 +39,7 @@ DECLARE_PER_CPU(struct mmu_gather, mmu_g
static inline struct mmu_gather *
tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
{
- int cpu = smp_processor_id();
- struct mmu_gather *tlb = &per_cpu(mmu_gathers, cpu);
+ struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

tlb->mm = mm;
tlb->freed = 0;
@@ -65,6 +64,8 @@ tlb_finish_mmu(struct mmu_gather *tlb, u

/* keep the page table cache within bounds */
check_pgt_cache();
+
+ put_cpu_var(mmu_gathers);
}

static inline unsigned int tlb_is_full_mm(struct mmu_gather *tlb)
--- mm11/include/asm-arm26/tlb.h 2005-06-17 20:48:29.000000000 +0100
+++ mm12/include/asm-arm26/tlb.h 2005-09-24 19:28:56.000000000 +0100
@@ -17,13 +17,12 @@ struct mmu_gather {
unsigned int avoided_flushes;
};

-extern struct mmu_gather mmu_gathers[NR_CPUS];
+DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);

static inline struct mmu_gather *
tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
{
- int cpu = smp_processor_id();
- struct mmu_gather *tlb = &mmu_gathers[cpu];
+ struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

tlb->mm = mm;
tlb->freed = 0;
@@ -52,6 +51,8 @@ tlb_finish_mmu(struct mmu_gather *tlb, u

/* keep the page table cache within bounds */
check_pgt_cache();
+
+ put_cpu_var(mmu_gathers);
}


--- mm11/include/asm-generic/tlb.h 2005-09-21 12:16:48.000000000 +0100
+++ mm12/include/asm-generic/tlb.h 2005-09-24 19:28:56.000000000 +0100
@@ -35,9 +35,7 @@
#endif

/* struct mmu_gather is an opaque type used by the mm code for passing around
- * any data needed by arch specific code for tlb_remove_page. This structure
- * can be per-CPU or per-MM as the page table lock is held for the duration of
- * TLB shootdown.
+ * any data needed by arch specific code for tlb_remove_page.
*/
struct mmu_gather {
struct mm_struct *mm;
@@ -57,7 +55,7 @@ DECLARE_PER_CPU(struct mmu_gather, mmu_g
static inline struct mmu_gather *
tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
{
- struct mmu_gather *tlb = &per_cpu(mmu_gathers, smp_processor_id());
+ struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

tlb->mm = mm;

@@ -85,7 +83,7 @@ tlb_flush_mmu(struct mmu_gather *tlb, un

/* tlb_finish_mmu
* Called at the end of the shootdown operation to free up any resources
- * that were required. The page table lock is still held at this point.
+ * that were required.
*/
static inline void
tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
@@ -101,6 +99,8 @@ tlb_finish_mmu(struct mmu_gather *tlb, u

/* keep the page table cache within bounds */
check_pgt_cache();
+
+ put_cpu_var(mmu_gathers);
}

static inline unsigned int
--- mm11/include/asm-ia64/tlb.h 2005-06-17 20:48:29.000000000 +0100
+++ mm12/include/asm-ia64/tlb.h 2005-09-24 19:28:56.000000000 +0100
@@ -129,7 +129,7 @@ ia64_tlb_flush_mmu (struct mmu_gather *t
static inline struct mmu_gather *
tlb_gather_mmu (struct mm_struct *mm, unsigned int full_mm_flush)
{
- struct mmu_gather *tlb = &__get_cpu_var(mmu_gathers);
+ struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

tlb->mm = mm;
/*
@@ -154,7 +154,7 @@ tlb_gather_mmu (struct mm_struct *mm, un

/*
* Called at the end of the shootdown operation to free up any resources that were
- * collected. The page table lock is still held at this point.
+ * collected.
*/
static inline void
tlb_finish_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long end)
@@ -174,6 +174,8 @@ tlb_finish_mmu (struct mmu_gather *tlb,

/* keep the page table cache within bounds */
check_pgt_cache();
+
+ put_cpu_var(mmu_gathers);
}

static inline unsigned int
--- mm11/include/asm-sparc64/tlb.h 2005-06-17 20:48:29.000000000 +0100
+++ mm12/include/asm-sparc64/tlb.h 2005-09-24 19:28:56.000000000 +0100
@@ -44,7 +44,7 @@ extern void flush_tlb_pending(void);

static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
{
- struct mmu_gather *mp = &__get_cpu_var(mmu_gathers);
+ struct mmu_gather *mp = &get_cpu_var(mmu_gathers);

BUG_ON(mp->tlb_nr);

@@ -97,6 +97,8 @@ static inline void tlb_finish_mmu(struct

/* keep the page table cache within bounds */
check_pgt_cache();
+
+ put_cpu_var(mmu_gathers);
}

static inline unsigned int tlb_is_full_mm(struct mmu_gather *mp)

2005-09-25 16:02:23

by Hugh Dickins

Subject: [PATCH 13/21] mm: tlb_is_full_mm was obscure

tlb_is_full_mm? What does that mean? The TLB is full? No, it means
that the mm's last user has gone and the whole mm is being torn down.
And it's an inline function because sparc64 uses a different (slightly
better) "tlb_frozen" name for the flag others call "fullmm".

And now the ptep_get_and_clear_full macro used in zap_pte_range refers
directly to tlb->fullmm, which would be wrong for sparc64. Rather than
correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
sparc64 to just use the same poor name as everyone else - is that okay?

Signed-off-by: Hugh Dickins <[email protected]>
---

arch/sparc64/mm/tlb.c | 4 ++--
include/asm-arm/tlb.h | 5 -----
include/asm-arm26/tlb.h | 7 -------
include/asm-generic/tlb.h | 6 ------
include/asm-ia64/tlb.h | 6 ------
include/asm-sparc64/tlb.h | 13 ++++---------
mm/memory.c | 4 ++--
7 files changed, 8 insertions(+), 37 deletions(-)

--- mm12/arch/sparc64/mm/tlb.c 2005-06-17 20:48:29.000000000 +0100
+++ mm13/arch/sparc64/mm/tlb.c 2005-09-24 19:29:10.000000000 +0100
@@ -72,7 +72,7 @@ void tlb_batch_add(struct mm_struct *mm,

no_cache_flush:

- if (mp->tlb_frozen)
+ if (mp->fullmm)
return;

nr = mp->tlb_nr;
@@ -97,7 +97,7 @@ void flush_tlb_pgtables(struct mm_struct
unsigned long nr = mp->tlb_nr;
long s = start, e = end, vpte_base;

- if (mp->tlb_frozen)
+ if (mp->fullmm)
return;

/* If start is greater than end, that is a real problem. */
--- mm12/include/asm-arm/tlb.h 2005-09-24 19:28:56.000000000 +0100
+++ mm13/include/asm-arm/tlb.h 2005-09-24 19:29:10.000000000 +0100
@@ -68,11 +68,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
put_cpu_var(mmu_gathers);
}

-static inline unsigned int tlb_is_full_mm(struct mmu_gather *tlb)
-{
- return tlb->fullmm;
-}
-
#define tlb_remove_tlb_entry(tlb,ptep,address) do { } while (0)

/*
--- mm12/include/asm-arm26/tlb.h 2005-09-24 19:28:56.000000000 +0100
+++ mm13/include/asm-arm26/tlb.h 2005-09-24 19:29:10.000000000 +0100
@@ -55,13 +55,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
put_cpu_var(mmu_gathers);
}

-
-static inline unsigned int
-tlb_is_full_mm(struct mmu_gather *tlb)
-{
- return tlb->fullmm;
-}
-
#define tlb_remove_tlb_entry(tlb,ptep,address) do { } while (0)
//#define tlb_start_vma(tlb,vma) do { } while (0)
//FIXME - ARM32 uses this now that things changed in the kernel. seems like it may be pointless on arm26, however to get things compiling...
--- mm12/include/asm-generic/tlb.h 2005-09-24 19:28:56.000000000 +0100
+++ mm13/include/asm-generic/tlb.h 2005-09-24 19:29:10.000000000 +0100
@@ -103,12 +103,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
put_cpu_var(mmu_gathers);
}

-static inline unsigned int
-tlb_is_full_mm(struct mmu_gather *tlb)
-{
- return tlb->fullmm;
-}
-
/* tlb_remove_page
* Must perform the equivalent to __free_pte(pte_get_and_clear(ptep)), while
* handling the additional races in SMP caused by other CPUs caching valid
--- mm12/include/asm-ia64/tlb.h 2005-09-24 19:28:56.000000000 +0100
+++ mm13/include/asm-ia64/tlb.h 2005-09-24 19:29:10.000000000 +0100
@@ -178,12 +178,6 @@ tlb_finish_mmu (struct mmu_gather *tlb,
put_cpu_var(mmu_gathers);
}

-static inline unsigned int
-tlb_is_full_mm(struct mmu_gather *tlb)
-{
- return tlb->fullmm;
-}
-
/*
* Logically, this routine frees PAGE. On MP machines, the actual freeing of the page
* must be delayed until after the TLB has been flushed (see comments at the beginning of
--- mm12/include/asm-sparc64/tlb.h 2005-09-24 19:28:56.000000000 +0100
+++ mm13/include/asm-sparc64/tlb.h 2005-09-24 19:29:10.000000000 +0100
@@ -25,7 +25,7 @@ struct mmu_gather {
struct mm_struct *mm;
unsigned int pages_nr;
unsigned int need_flush;
- unsigned int tlb_frozen;
+ unsigned int fullmm;
unsigned int tlb_nr;
unsigned long freed;
unsigned long vaddrs[TLB_BATCH_NR];
@@ -50,7 +50,7 @@ static inline struct mmu_gather *tlb_gat

mp->mm = mm;
mp->pages_nr = num_online_cpus() > 1 ? 0U : ~0U;
- mp->tlb_frozen = full_mm_flush;
+ mp->fullmm = full_mm_flush;
mp->freed = 0;

return mp;
@@ -88,10 +88,10 @@ static inline void tlb_finish_mmu(struct

tlb_flush_mmu(mp);

- if (mp->tlb_frozen) {
+ if (mp->fullmm) {
if (CTX_VALID(mm->context))
do_flush_tlb_mm(mm);
- mp->tlb_frozen = 0;
+ mp->fullmm = 0;
} else
flush_tlb_pending();

@@ -101,11 +101,6 @@ static inline void tlb_finish_mmu(struct
put_cpu_var(mmu_gathers);
}

-static inline unsigned int tlb_is_full_mm(struct mmu_gather *mp)
-{
- return mp->tlb_frozen;
-}
-
static inline void tlb_remove_page(struct mmu_gather *mp, struct page *page)
{
mp->need_flush = 1;
--- mm12/mm/memory.c 2005-09-24 19:28:28.000000000 +0100
+++ mm13/mm/memory.c 2005-09-24 19:29:10.000000000 +0100
@@ -249,7 +249,7 @@ void free_pgd_range(struct mmu_gather **
free_pud_range(*tlb, pgd, addr, next, floor, ceiling);
} while (pgd++, addr = next, addr != end);

- if (!tlb_is_full_mm(*tlb))
+ if (!(*tlb)->fullmm)
flush_tlb_pgtables((*tlb)->mm, start, end);
}

@@ -698,7 +698,7 @@ unsigned long unmap_vmas(struct mmu_gath
int tlb_start_valid = 0;
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
- int fullmm = tlb_is_full_mm(*tlbp);
+ int fullmm = (*tlbp)->fullmm;

for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;

2005-09-25 16:04:26

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 14/21] mm: tlb_finish_mmu forget rss

zap_pte_range has been counting the pages it frees in tlb->freed, then
tlb_finish_mmu has used that to update the mm's rss. That got stranger
when I added anon_rss, yet updated it by a different route; and stranger
when rss and anon_rss became mm_counters with special access macros.
And it would no longer be viable if we're relying on page_table_lock to
stabilize the mm_counter, but calling tlb_finish_mmu outside that lock.

Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own
business, just decrement the rss mm_counter in zap_pte_range (yes, there
was some point to batching the update, and a subsequent patch restores
that). And forget the anal paranoia of first reading the counter to
avoid going negative - if rss does go negative, just fix that bug.

Remove the mmu_gather's flushes and avoided_flushes from arm and arm26:
no use was being made of them. But arm26 alone was actually using the
freed field, in the way some others use need_flush: give it a need_flush.
arm26 seems to prefer spaces to tabs here: respect that.
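
To see the shape of the change at a glance - a sketch, not extra code
to apply, using only helpers already in this tree:

	/* before: batched in the mmu_gather, clamped in tlb_finish_mmu */
	int rss = get_mm_counter(mm, rss);
	if (rss < freed)
		freed = rss;		/* don't let rss go negative */
	add_mm_counter(mm, rss, -freed);

	/* after: decrement right where the page is unmapped */
	dec_mm_counter(tlb->mm, rss);	/* in zap_pte_range */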

Signed-off-by: Hugh Dickins <[email protected]>
---

include/asm-arm/tlb.h | 15 +--------------
include/asm-arm26/tlb.h | 35 +++++++++++++----------------------
include/asm-generic/tlb.h | 9 ---------
include/asm-ia64/tlb.h | 9 ---------
include/asm-sparc64/tlb.h | 14 ++------------
mm/memory.c | 2 +-
6 files changed, 17 insertions(+), 67 deletions(-)

--- mm13/include/asm-arm/tlb.h 2005-09-24 19:29:10.000000000 +0100
+++ mm14/include/asm-arm/tlb.h 2005-09-24 19:29:25.000000000 +0100
@@ -27,11 +27,7 @@
*/
struct mmu_gather {
struct mm_struct *mm;
- unsigned int freed;
unsigned int fullmm;
-
- unsigned int flushes;
- unsigned int avoided_flushes;
};

DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
@@ -42,7 +38,6 @@ tlb_gather_mmu(struct mm_struct *mm, uns
struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

tlb->mm = mm;
- tlb->freed = 0;
tlb->fullmm = full_mm_flush;

return tlb;
@@ -51,16 +46,8 @@ tlb_gather_mmu(struct mm_struct *mm, uns
static inline void
tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
- struct mm_struct *mm = tlb->mm;
- unsigned long freed = tlb->freed;
- int rss = get_mm_counter(mm, rss);
-
- if (rss < freed)
- freed = rss;
- add_mm_counter(mm, rss, -freed);
-
if (tlb->fullmm)
- flush_tlb_mm(mm);
+ flush_tlb_mm(tlb->mm);

/* keep the page table cache within bounds */
check_pgt_cache();
--- mm13/include/asm-arm26/tlb.h 2005-09-24 19:29:10.000000000 +0100
+++ mm14/include/asm-arm26/tlb.h 2005-09-24 19:29:25.000000000 +0100
@@ -10,11 +10,8 @@
*/
struct mmu_gather {
struct mm_struct *mm;
- unsigned int freed;
- unsigned int fullmm;
-
- unsigned int flushes;
- unsigned int avoided_flushes;
+ unsigned int need_flush;
+ unsigned int fullmm;
};

DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
@@ -25,8 +22,8 @@ tlb_gather_mmu(struct mm_struct *mm, uns
struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

tlb->mm = mm;
- tlb->freed = 0;
- tlb->fullmm = full_mm_flush;
+ tlb->need_flush = 0;
+ tlb->fullmm = full_mm_flush;

return tlb;
}
@@ -34,20 +31,8 @@ tlb_gather_mmu(struct mm_struct *mm, uns
static inline void
tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
- struct mm_struct *mm = tlb->mm;
- unsigned long freed = tlb->freed;
- int rss = get_mm_counter(mm, rss);
-
- if (rss < freed)
- freed = rss;
- add_mm_counter(mm, rss, -freed);
-
- if (freed) {
- flush_tlb_mm(mm);
- tlb->flushes++;
- } else {
- tlb->avoided_flushes++;
- }
+ if (tlb->need_flush)
+ flush_tlb_mm(tlb->mm);

/* keep the page table cache within bounds */
check_pgt_cache();
@@ -65,7 +50,13 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
} while (0)
#define tlb_end_vma(tlb,vma) do { } while (0)

-#define tlb_remove_page(tlb,page) free_page_and_swap_cache(page)
+static inline void
+tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+ tlb->need_flush = 1;
+ free_page_and_swap_cache(page);
+}
+
#define pte_free_tlb(tlb,ptep) pte_free(ptep)
#define pmd_free_tlb(tlb,pmdp) pmd_free(pmdp)

--- mm13/include/asm-generic/tlb.h 2005-09-24 19:29:10.000000000 +0100
+++ mm14/include/asm-generic/tlb.h 2005-09-24 19:29:25.000000000 +0100
@@ -42,7 +42,6 @@ struct mmu_gather {
unsigned int nr; /* set to ~0U means fast mode */
unsigned int need_flush;/* Really unmapped some ptes? */
unsigned int fullmm; /* non-zero means full mm flush */
- unsigned long freed;
struct page * pages[FREE_PTE_NR];
};

@@ -63,7 +62,6 @@ tlb_gather_mmu(struct mm_struct *mm, uns
tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;

tlb->fullmm = full_mm_flush;
- tlb->freed = 0;

return tlb;
}
@@ -88,13 +86,6 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
static inline void
tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
- int freed = tlb->freed;
- struct mm_struct *mm = tlb->mm;
- int rss = get_mm_counter(mm, rss);
-
- if (rss < freed)
- freed = rss;
- add_mm_counter(mm, rss, -freed);
tlb_flush_mmu(tlb, start, end);

/* keep the page table cache within bounds */
--- mm13/include/asm-ia64/tlb.h 2005-09-24 19:29:10.000000000 +0100
+++ mm14/include/asm-ia64/tlb.h 2005-09-24 19:29:25.000000000 +0100
@@ -60,7 +60,6 @@ struct mmu_gather {
unsigned int nr; /* == ~0U => fast mode */
unsigned char fullmm; /* non-zero means full mm flush */
unsigned char need_flush; /* really unmapped some PTEs? */
- unsigned long freed; /* number of pages freed */
unsigned long start_addr;
unsigned long end_addr;
struct page *pages[FREE_PTE_NR];
@@ -147,7 +146,6 @@ tlb_gather_mmu (struct mm_struct *mm, un
*/
tlb->nr = (num_online_cpus() == 1) ? ~0U : 0;
tlb->fullmm = full_mm_flush;
- tlb->freed = 0;
tlb->start_addr = ~0UL;
return tlb;
}
@@ -159,13 +157,6 @@ tlb_gather_mmu (struct mm_struct *mm, un
static inline void
tlb_finish_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
- unsigned long freed = tlb->freed;
- struct mm_struct *mm = tlb->mm;
- unsigned long rss = get_mm_counter(mm, rss);
-
- if (rss < freed)
- freed = rss;
- add_mm_counter(mm, rss, -freed);
/*
* Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
* tlb->end_addr.
--- mm13/include/asm-sparc64/tlb.h 2005-09-24 19:29:10.000000000 +0100
+++ mm14/include/asm-sparc64/tlb.h 2005-09-24 19:29:25.000000000 +0100
@@ -27,7 +27,6 @@ struct mmu_gather {
unsigned int need_flush;
unsigned int fullmm;
unsigned int tlb_nr;
- unsigned long freed;
unsigned long vaddrs[TLB_BATCH_NR];
struct page *pages[FREE_PTE_NR];
};
@@ -51,7 +50,6 @@ static inline struct mmu_gather *tlb_gat
mp->mm = mm;
mp->pages_nr = num_online_cpus() > 1 ? 0U : ~0U;
mp->fullmm = full_mm_flush;
- mp->freed = 0;

return mp;
}
@@ -78,19 +76,11 @@ extern void smp_flush_tlb_mm(struct mm_s

static inline void tlb_finish_mmu(struct mmu_gather *mp, unsigned long start, unsigned long end)
{
- unsigned long freed = mp->freed;
- struct mm_struct *mm = mp->mm;
- unsigned long rss = get_mm_counter(mm, rss);
-
- if (rss < freed)
- freed = rss;
- add_mm_counter(mm, rss, -freed);
-
tlb_flush_mmu(mp);

if (mp->fullmm) {
- if (CTX_VALID(mm->context))
- do_flush_tlb_mm(mm);
+ if (CTX_VALID(mp->mm->context))
+ do_flush_tlb_mm(mp->mm);
mp->fullmm = 0;
} else
flush_tlb_pending();
--- mm13/mm/memory.c 2005-09-24 19:29:10.000000000 +0100
+++ mm14/mm/memory.c 2005-09-24 19:29:25.000000000 +0100
@@ -582,7 +582,7 @@ static void zap_pte_range(struct mmu_gat
if (pte_young(ptent))
mark_page_accessed(page);
}
- tlb->freed++;
+ dec_mm_counter(tlb->mm, rss);
page_remove_rmap(page);
tlb_remove_page(tlb, page);
continue;

2005-09-25 16:06:34

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 15/21] mm: mm_init set_mm_counters

How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but
that's not so good if an mm_counter_t is a special type. And how is rss
initialized? By set_mm_counter, all over the place. Come on, we just
need to initialize them both at once by set_mm_counter in mm_init (which
follows the memcpy when forking).
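
To make the "special type" worry concrete - a hypothetical sketch, not
in this patch: if the counters were ever made atomic, a memset or
struct copy would no longer be a valid way to initialize them, but
set_mm_counter in mm_init still would be:

	/* hypothetical atomic configuration of the mm counters */
	typedef atomic_t mm_counter_t;
	#define set_mm_counter(mm, member, value)	\
		atomic_set(&(mm)->_##member, value)
	#define get_mm_counter(mm, member)		\
		((unsigned long)atomic_read(&(mm)->_##member))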

Signed-off-by: Hugh Dickins <[email protected]>
---

arch/mips/kernel/irixelf.c | 1 -
arch/sparc64/kernel/binfmt_aout32.c | 1 -
arch/x86_64/ia32/ia32_aout.c | 1 -
fs/binfmt_aout.c | 1 -
fs/binfmt_elf.c | 1 -
fs/binfmt_elf_fdpic.c | 7 -------
fs/binfmt_flat.c | 1 -
fs/binfmt_som.c | 1 -
kernel/fork.c | 4 ++--
9 files changed, 2 insertions(+), 16 deletions(-)

--- mm14/arch/mips/kernel/irixelf.c 2005-06-17 20:48:29.000000000 +0100
+++ mm15/arch/mips/kernel/irixelf.c 2005-09-24 19:29:40.000000000 +0100
@@ -692,7 +692,6 @@ static int load_irix_binary(struct linux
/* Do this so that we can load the interpreter, if need be. We will
* change some of these later.
*/
- set_mm_counter(current->mm, rss, 0);
setup_arg_pages(bprm, STACK_TOP, EXSTACK_DEFAULT);
current->mm->start_stack = bprm->p;

--- mm14/arch/sparc64/kernel/binfmt_aout32.c 2005-06-17 20:48:29.000000000 +0100
+++ mm15/arch/sparc64/kernel/binfmt_aout32.c 2005-09-24 19:29:40.000000000 +0100
@@ -241,7 +241,6 @@ static int load_aout32_binary(struct lin
current->mm->brk = ex.a_bss +
(current->mm->start_brk = N_BSSADDR(ex));

- set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
--- mm14/arch/x86_64/ia32/ia32_aout.c 2005-08-29 00:41:01.000000000 +0100
+++ mm15/arch/x86_64/ia32/ia32_aout.c 2005-09-24 19:29:40.000000000 +0100
@@ -314,7 +314,6 @@ static int load_aout_binary(struct linux
current->mm->free_area_cache = TASK_UNMAPPED_BASE;
current->mm->cached_hole_size = 0;

- set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
--- mm14/fs/binfmt_aout.c 2005-08-29 00:41:01.000000000 +0100
+++ mm15/fs/binfmt_aout.c 2005-09-24 19:29:40.000000000 +0100
@@ -318,7 +318,6 @@ static int load_aout_binary(struct linux
current->mm->free_area_cache = current->mm->mmap_base;
current->mm->cached_hole_size = 0;

- set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
--- mm14/fs/binfmt_elf.c 2005-09-22 12:31:59.000000000 +0100
+++ mm15/fs/binfmt_elf.c 2005-09-24 19:29:40.000000000 +0100
@@ -773,7 +773,6 @@ static int load_elf_binary(struct linux_

/* Do this so that we can load the interpreter, if need be. We will
change some of these later */
- set_mm_counter(current->mm, rss, 0);
current->mm->free_area_cache = current->mm->mmap_base;
current->mm->cached_hole_size = 0;
retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
--- mm14/fs/binfmt_elf_fdpic.c 2005-06-17 20:48:29.000000000 +0100
+++ mm15/fs/binfmt_elf_fdpic.c 2005-09-24 19:29:40.000000000 +0100
@@ -294,14 +294,7 @@ static int load_elf_fdpic_binary(struct
&interp_params,
&current->mm->start_stack,
&current->mm->start_brk);
-#endif
-
- /* do this so that we can load the interpreter, if need be
- * - we will change some of these later
- */
- set_mm_counter(current->mm, rss, 0);

-#ifdef CONFIG_MMU
retval = setup_arg_pages(bprm, current->mm->start_stack, executable_stack);
if (retval < 0) {
send_sig(SIGKILL, current, 0);
--- mm14/fs/binfmt_flat.c 2005-09-21 12:16:40.000000000 +0100
+++ mm15/fs/binfmt_flat.c 2005-09-24 19:29:40.000000000 +0100
@@ -650,7 +650,6 @@ static int load_flat_file(struct linux_b
current->mm->start_brk = datapos + data_len + bss_len;
current->mm->brk = (current->mm->start_brk + 3) & ~3;
current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
- set_mm_counter(current->mm, rss, 0);
}

if (flags & FLAT_FLAG_KTRACE)
--- mm14/fs/binfmt_som.c 2005-06-17 20:48:29.000000000 +0100
+++ mm15/fs/binfmt_som.c 2005-09-24 19:29:40.000000000 +0100
@@ -259,7 +259,6 @@ load_som_binary(struct linux_binprm * bp
create_som_tables(bprm);

current->mm->start_stack = bprm->p;
- set_mm_counter(current->mm, rss, 0);

#if 0
printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
--- mm14/kernel/fork.c 2005-09-24 19:27:33.000000000 +0100
+++ mm15/kernel/fork.c 2005-09-24 19:29:40.000000000 +0100
@@ -198,8 +198,6 @@ static inline int dup_mmap(struct mm_str
mm->free_area_cache = oldmm->mmap_base;
mm->cached_hole_size = ~0UL;
mm->map_count = 0;
- set_mm_counter(mm, rss, 0);
- set_mm_counter(mm, anon_rss, 0);
cpus_clear(mm->cpu_vm_mask);
mm->mm_rb = RB_ROOT;
rb_link = &mm->mm_rb.rb_node;
@@ -323,6 +321,8 @@ static struct mm_struct * mm_init(struct
INIT_LIST_HEAD(&mm->mmlist);
mm->core_waiters = 0;
mm->nr_ptes = 0;
+ set_mm_counter(mm, rss, 0);
+ set_mm_counter(mm, anon_rss, 0);
spin_lock_init(&mm->page_table_lock);
rwlock_init(&mm->ioctx_list_lock);
mm->ioctx_list = NULL;

2005-09-25 16:07:39

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 16/21] mm: rss = file_rss + anon_rss

I was lazy when we added anon_rss, and chose to change as few places as
possible. So currently each anonymous page has to be counted twice, in
rss and in anon_rss. Which won't be so good if those are atomic counts
in some configurations.

Change that around: keep file_rss and anon_rss separately, and add them
together (with get_mm_rss macro) when the total is needed - reading two
atomics is much cheaper than updating two atomics. And update anon_rss
upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
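
The arithmetic of that claim, as a sketch under a hypothetical atomic
configuration of the counters: totalling is two plain loads, while a
combined rss would have cost a locked read-modify-write on every
anonymous page in addition to the anon_rss update:

	/* reading two atomics: two loads, no locked operations */
	#define get_mm_rss(mm)	(atomic_read(&(mm)->_file_rss) + \
				 atomic_read(&(mm)->_anon_rss))

	/* versus updating two atomics per anonymous page:
	 *	atomic_inc(&mm->_rss);
	 *	atomic_inc(&mm->_anon_rss);
	 */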

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/exec.c | 2 +-
fs/proc/array.c | 2 +-
fs/proc/task_mmu.c | 8 +++-----
include/linux/sched.h | 4 +++-
kernel/acct.c | 2 +-
kernel/fork.c | 4 ++--
mm/fremap.c | 4 ++--
mm/hugetlb.c | 6 +++---
mm/memory.c | 31 +++++++++++++++++--------------
mm/nommu.c | 2 +-
mm/rmap.c | 8 +++-----
mm/swapfile.c | 2 +-
12 files changed, 38 insertions(+), 37 deletions(-)

--- mm15/fs/exec.c 2005-09-22 12:31:59.000000000 +0100
+++ mm16/fs/exec.c 2005-09-24 19:29:53.000000000 +0100
@@ -330,7 +330,7 @@ void install_arg_page(struct vm_area_str
pte_unmap(pte);
goto out;
}
- inc_mm_counter(mm, rss);
+ inc_mm_counter(mm, anon_rss);
lru_cache_add_active(page);
set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
page, vma->vm_page_prot))));
--- mm15/fs/proc/array.c 2005-09-21 12:16:44.000000000 +0100
+++ mm16/fs/proc/array.c 2005-09-24 19:29:53.000000000 +0100
@@ -438,7 +438,7 @@ static int do_task_stat(struct task_stru
jiffies_to_clock_t(it_real_value),
start_time,
vsize,
- mm ? get_mm_counter(mm, rss) : 0, /* you might want to shift this left 3 */
+ mm ? get_mm_rss(mm) : 0,
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
--- mm15/fs/proc/task_mmu.c 2005-09-22 12:32:00.000000000 +0100
+++ mm16/fs/proc/task_mmu.c 2005-09-24 19:29:53.000000000 +0100
@@ -29,7 +29,7 @@ char *task_mem(struct mm_struct *mm, cha
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- get_mm_counter(mm, rss) << (PAGE_SHIFT-10),
+ get_mm_rss(mm) << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -44,13 +44,11 @@ unsigned long task_vsize(struct mm_struc
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- int rss = get_mm_counter(mm, rss);
-
- *shared = rss - get_mm_counter(mm, anon_rss);
+ *shared = get_mm_counter(mm, file_rss);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = rss;
+ *resident = *shared + get_mm_counter(mm, anon_rss);
return mm->total_vm;
}

--- mm15/include/linux/sched.h 2005-09-22 12:32:03.000000000 +0100
+++ mm16/include/linux/sched.h 2005-09-24 19:29:53.000000000 +0100
@@ -243,6 +243,8 @@ extern void arch_unmap_area_topdown(stru
#define add_mm_counter(mm, member, value) (mm)->_##member += (value)
#define inc_mm_counter(mm, member) (mm)->_##member++
#define dec_mm_counter(mm, member) (mm)->_##member--
+#define get_mm_rss(mm) ((mm)->_file_rss + (mm)->_anon_rss)
+
typedef unsigned long mm_counter_t;

struct mm_struct {
@@ -275,7 +277,7 @@ struct mm_struct {
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

/* Special counters protected by the page_table_lock */
- mm_counter_t _rss;
+ mm_counter_t _file_rss;
mm_counter_t _anon_rss;

unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
--- mm15/kernel/acct.c 2005-09-21 12:16:59.000000000 +0100
+++ mm16/kernel/acct.c 2005-09-24 19:29:53.000000000 +0100
@@ -553,7 +553,7 @@ void acct_update_integrals(struct task_s
if (delta == 0)
return;
tsk->acct_stimexpd = tsk->stime;
- tsk->acct_rss_mem1 += delta * get_mm_counter(tsk->mm, rss);
+ tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
}
}
--- mm15/kernel/fork.c 2005-09-24 19:29:40.000000000 +0100
+++ mm16/kernel/fork.c 2005-09-24 19:29:53.000000000 +0100
@@ -321,7 +321,7 @@ static struct mm_struct * mm_init(struct
INIT_LIST_HEAD(&mm->mmlist);
mm->core_waiters = 0;
mm->nr_ptes = 0;
- set_mm_counter(mm, rss, 0);
+ set_mm_counter(mm, file_rss, 0);
set_mm_counter(mm, anon_rss, 0);
spin_lock_init(&mm->page_table_lock);
rwlock_init(&mm->ioctx_list_lock);
@@ -499,7 +499,7 @@ static int copy_mm(unsigned long clone_f
if (retval)
goto free_pt;

- mm->hiwater_rss = get_mm_counter(mm,rss);
+ mm->hiwater_rss = get_mm_rss(mm);
mm->hiwater_vm = mm->total_vm;

good_mm:
--- mm15/mm/fremap.c 2005-06-17 20:48:29.000000000 +0100
+++ mm16/mm/fremap.c 2005-09-24 19:29:53.000000000 +0100
@@ -39,7 +39,7 @@ static inline void zap_pte(struct mm_str
set_page_dirty(page);
page_remove_rmap(page);
page_cache_release(page);
- dec_mm_counter(mm, rss);
+ dec_mm_counter(mm, file_rss);
}
}
} else {
@@ -92,7 +92,7 @@ int install_page(struct mm_struct *mm, s

zap_pte(mm, vma, addr, pte);

- inc_mm_counter(mm,rss);
+ inc_mm_counter(mm, file_rss);
flush_icache_page(vma, page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
page_add_file_rmap(page);
--- mm15/mm/hugetlb.c 2005-09-24 19:26:24.000000000 +0100
+++ mm16/mm/hugetlb.c 2005-09-24 19:29:53.000000000 +0100
@@ -285,7 +285,7 @@ int copy_hugetlb_page_range(struct mm_st
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
- add_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
+ add_mm_counter(dst, file_rss, HPAGE_SIZE / PAGE_SIZE);
set_huge_pte_at(dst, addr, dst_pte, entry);
}
spin_unlock(&src->page_table_lock);
@@ -323,7 +323,7 @@ void unmap_hugepage_range(struct vm_area

page = pte_page(pte);
put_page(page);
- add_mm_counter(mm, rss, - (HPAGE_SIZE / PAGE_SIZE));
+ add_mm_counter(mm, file_rss, - (HPAGE_SIZE / PAGE_SIZE));
}
flush_tlb_range(vma, start, end);
}
@@ -385,7 +385,7 @@ int hugetlb_prefault(struct address_spac
goto out;
}
}
- add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
+ add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
set_huge_pte_at(mm, addr, pte, make_huge_pte(vma, page));
}
out:
--- mm15/mm/memory.c 2005-09-24 19:29:25.000000000 +0100
+++ mm16/mm/memory.c 2005-09-24 19:29:53.000000000 +0100
@@ -397,9 +397,10 @@ copy_one_pte(struct mm_struct *dst_mm, s
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
get_page(page);
- inc_mm_counter(dst_mm, rss);
if (PageAnon(page))
inc_mm_counter(dst_mm, anon_rss);
+ else
+ inc_mm_counter(dst_mm, file_rss);
set_pte_at(dst_mm, addr, dst_pte, pte);
page_dup_rmap(page);
}
@@ -581,8 +582,8 @@ static void zap_pte_range(struct mmu_gat
set_page_dirty(page);
if (pte_young(ptent))
mark_page_accessed(page);
+ dec_mm_counter(tlb->mm, file_rss);
}
- dec_mm_counter(tlb->mm, rss);
page_remove_rmap(page);
tlb_remove_page(tlb, page);
continue;
@@ -1290,13 +1291,15 @@ static int do_wp_page(struct mm_struct *
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
if (likely(pte_same(*page_table, orig_pte))) {
- if (PageAnon(old_page))
- dec_mm_counter(mm, anon_rss);
if (PageReserved(old_page))
- inc_mm_counter(mm, rss);
- else
+ inc_mm_counter(mm, anon_rss);
+ else {
page_remove_rmap(old_page);
-
+ if (!PageAnon(old_page)) {
+ inc_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, file_rss);
+ }
+ }
flush_cache_page(vma, address, pfn);
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -1701,7 +1704,7 @@ static int do_swap_page(struct mm_struct

/* The page isn't present yet, go ahead with the fault. */

- inc_mm_counter(mm, rss);
+ inc_mm_counter(mm, anon_rss);
pte = mk_pte(page, vma->vm_page_prot);
if (write_access && can_share_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1774,7 +1777,7 @@ static int do_anonymous_page(struct mm_s
page_cache_release(page);
goto unlock;
}
- inc_mm_counter(mm, rss);
+ inc_mm_counter(mm, anon_rss);
entry = mk_pte(page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
lru_cache_add_active(page);
@@ -1887,19 +1890,19 @@ retry:
*/
/* Only go through if we didn't race with anybody else... */
if (pte_none(*page_table)) {
- if (!PageReserved(new_page))
- inc_mm_counter(mm, rss);
-
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
set_pte_at(mm, address, page_table, entry);
if (anon) {
+ inc_mm_counter(mm, anon_rss);
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);
- } else
+ } else if (!PageReserved(new_page)) {
+ inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
+ }
} else {
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);
@@ -2192,7 +2195,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
void update_mem_hiwater(struct task_struct *tsk)
{
if (tsk->mm) {
- unsigned long rss = get_mm_counter(tsk->mm, rss);
+ unsigned long rss = get_mm_rss(tsk->mm);

if (tsk->mm->hiwater_rss < rss)
tsk->mm->hiwater_rss = rss;
--- mm15/mm/nommu.c 2005-09-22 12:32:03.000000000 +0100
+++ mm16/mm/nommu.c 2005-09-24 19:29:53.000000000 +0100
@@ -1080,7 +1080,7 @@ void update_mem_hiwater(struct task_stru
unsigned long rss;

if (likely(tsk->mm)) {
- rss = get_mm_counter(tsk->mm, rss);
+ rss = get_mm_rss(tsk->mm);
if (tsk->mm->hiwater_rss < rss)
tsk->mm->hiwater_rss = rss;
if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
--- mm15/mm/rmap.c 2005-09-22 12:32:04.000000000 +0100
+++ mm16/mm/rmap.c 2005-09-24 19:29:53.000000000 +0100
@@ -445,8 +445,6 @@ void page_add_anon_rmap(struct page *pag
{
BUG_ON(PageReserved(page));

- inc_mm_counter(vma->vm_mm, anon_rss);
-
if (atomic_inc_and_test(&page->_mapcount)) {
struct anon_vma *anon_vma = vma->anon_vma;

@@ -561,9 +559,9 @@ static int try_to_unmap_one(struct page
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
dec_mm_counter(mm, anon_rss);
- }
+ } else
+ dec_mm_counter(mm, file_rss);

- dec_mm_counter(mm, rss);
page_remove_rmap(page);
page_cache_release(page);

@@ -667,7 +665,7 @@ static void try_to_unmap_cluster(unsigne

page_remove_rmap(page);
page_cache_release(page);
- dec_mm_counter(mm, rss);
+ dec_mm_counter(mm, file_rss);
(*mapcount)--;
}

--- mm15/mm/swapfile.c 2005-09-24 19:27:19.000000000 +0100
+++ mm16/mm/swapfile.c 2005-09-24 19:29:53.000000000 +0100
@@ -405,7 +405,7 @@ void free_swap_and_cache(swp_entry_t ent
static void unuse_pte(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, swp_entry_t entry, struct page *page)
{
- inc_mm_counter(vma->vm_mm, rss);
+ inc_mm_counter(vma->vm_mm, anon_rss);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));

2005-09-25 16:08:51

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 17/21] mm: batch updating mm_counters

tlb_finish_mmu used to batch zap_pte_range's update of mm rss, which may
be worthwhile if the mm is contended, and would reduce atomic operations
if the counts were atomic. Let zap_pte_range now batch its updates to
file_rss and anon_rss, per page-table page in case we drop the lock
outside it; and let copy_pte_range batch them too.
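
The copy_pte_range batching leans on a small indexing trick which may
not be obvious from the diff alone: copy_one_pte returns 0 for a file
page, 1 for an anon page, or NO_RSS (2) when neither counter should
move; rss[NO_RSS] is a scratch slot which absorbs the don't-care
increments and is never read back. In outline:

	int rss[NO_RSS+1];	/* [0] file, [1] anon, [2] discarded */

	rss[1] = rss[0] = 0;	/* rss[NO_RSS] needs no init: never read */

	anon = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
						vm_flags, addr);
	rss[anon]++;		/* no branch in the copy loop */

	add_mm_rss(dst_mm, rss[0], rss[1]);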

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 47 ++++++++++++++++++++++++++++++++---------------
1 files changed, 32 insertions(+), 15 deletions(-)

--- mm16/mm/memory.c 2005-09-24 19:29:53.000000000 +0100
+++ mm17/mm/memory.c 2005-09-24 19:30:07.000000000 +0100
@@ -332,6 +332,16 @@ out:
return pte_offset_kernel(pmd, address);
}

+static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
+{
+ if (file_rss)
+ add_mm_counter(mm, file_rss, file_rss);
+ if (anon_rss)
+ add_mm_counter(mm, anon_rss, anon_rss);
+}
+
+#define NO_RSS 2 /* Increment neither file_rss nor anon_rss */
+
/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
@@ -341,7 +351,7 @@ out:
* but may be dropped within p[mg]d_alloc() and pte_alloc_map().
*/

-static inline void
+static inline int
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, unsigned long vm_flags,
unsigned long addr)
@@ -349,6 +359,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
pte_t pte = *src_pte;
struct page *page;
unsigned long pfn;
+ int anon = NO_RSS;

/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
@@ -361,8 +372,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
spin_unlock(&mmlist_lock);
}
}
- set_pte_at(dst_mm, addr, dst_pte, pte);
- return;
+ goto out_set_pte;
}

pfn = pte_pfn(pte);
@@ -375,10 +385,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
if (pfn_valid(pfn))
page = pfn_to_page(pfn);

- if (!page || PageReserved(page)) {
- set_pte_at(dst_mm, addr, dst_pte, pte);
- return;
- }
+ if (!page || PageReserved(page))
+ goto out_set_pte;

/*
* If it's a COW mapping, write protect it both
@@ -397,12 +405,12 @@ copy_one_pte(struct mm_struct *dst_mm, s
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
get_page(page);
- if (PageAnon(page))
- inc_mm_counter(dst_mm, anon_rss);
- else
- inc_mm_counter(dst_mm, file_rss);
- set_pte_at(dst_mm, addr, dst_pte, pte);
page_dup_rmap(page);
+ anon = !!PageAnon(page);
+
+out_set_pte:
+ set_pte_at(dst_mm, addr, dst_pte, pte);
+ return anon;
}

static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -412,8 +420,10 @@ static int copy_pte_range(struct mm_stru
pte_t *src_pte, *dst_pte;
unsigned long vm_flags = vma->vm_flags;
int progress = 0;
+ int rss[NO_RSS+1], anon;

again:
+ rss[1] = rss[0] = 0;
dst_pte = pte_alloc_map(dst_mm, dst_pmd, addr);
if (!dst_pte)
return -ENOMEM;
@@ -436,13 +446,16 @@ again:
progress++;
continue;
}
- copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vm_flags, addr);
+ anon = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
+ vm_flags, addr);
+ rss[anon]++;
progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
spin_unlock(&src_mm->page_table_lock);

pte_unmap_nested(src_pte - 1);
pte_unmap(dst_pte - 1);
+ add_mm_rss(dst_mm, rss[0], rss[1]);
cond_resched_lock(&dst_mm->page_table_lock);
if (addr != end)
goto again;
@@ -533,6 +546,8 @@ static void zap_pte_range(struct mmu_gat
struct zap_details *details)
{
pte_t *pte;
+ int file_rss = 0;
+ int anon_rss = 0;

pte = pte_offset_map(pmd, addr);
do {
@@ -576,13 +591,13 @@ static void zap_pte_range(struct mmu_gat
set_pte_at(tlb->mm, addr, pte,
pgoff_to_pte(page->index));
if (PageAnon(page))
- dec_mm_counter(tlb->mm, anon_rss);
+ anon_rss++;
else {
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent))
mark_page_accessed(page);
- dec_mm_counter(tlb->mm, file_rss);
+ file_rss++;
}
page_remove_rmap(page);
tlb_remove_page(tlb, page);
@@ -598,6 +613,8 @@ static void zap_pte_range(struct mmu_gat
free_swap_and_cache(pte_to_swp_entry(ptent));
pte_clear_full(tlb->mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, addr != end);
+
+ add_mm_rss(tlb->mm, -file_rss, -anon_rss);
pte_unmap(pte - 1);
}

2005-09-25 16:09:35

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 18/21] mm: dup_mmap use oldmm more

Use the parent's oldmm throughout dup_mmap, instead of perversely going
back to current->mm. (Can you hear the sigh of relief from those mpnts?
Usually I squash them, but not today.)

Signed-off-by: Hugh Dickins <[email protected]>
---

kernel/fork.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

--- mm17/kernel/fork.c 2005-09-24 19:29:53.000000000 +0100
+++ mm18/kernel/fork.c 2005-09-24 19:30:21.000000000 +0100
@@ -182,16 +182,16 @@ static struct task_struct *dup_task_stru
}

#ifdef CONFIG_MMU
-static inline int dup_mmap(struct mm_struct * mm, struct mm_struct * oldmm)
+static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
{
- struct vm_area_struct * mpnt, *tmp, **pprev;
+ struct vm_area_struct *mpnt, *tmp, **pprev;
struct rb_node **rb_link, *rb_parent;
int retval;
unsigned long charge;
struct mempolicy *pol;

down_write(&oldmm->mmap_sem);
- flush_cache_mm(current->mm);
+ flush_cache_mm(oldmm);
mm->locked_vm = 0;
mm->mmap = NULL;
mm->mmap_cache = NULL;
@@ -204,7 +204,7 @@ static inline int dup_mmap(struct mm_str
rb_parent = NULL;
pprev = &mm->mmap;

- for (mpnt = current->mm->mmap ; mpnt ; mpnt = mpnt->vm_next) {
+ for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
struct file *file;

if (mpnt->vm_flags & VM_DONTCOPY) {
@@ -265,7 +265,7 @@ static inline int dup_mmap(struct mm_str
rb_parent = &tmp->vm_rb;

mm->map_count++;
- retval = copy_page_range(mm, current->mm, tmp);
+ retval = copy_page_range(mm, oldmm, tmp);
spin_unlock(&mm->page_table_lock);

if (tmp->vm_ops && tmp->vm_ops->open)
@@ -277,7 +277,7 @@ static inline int dup_mmap(struct mm_str
retval = 0;

out:
- flush_tlb_mm(current->mm);
+ flush_tlb_mm(oldmm);
up_write(&oldmm->mmap_sem);
return retval;
fail_nomem_policy:

2005-09-25 16:10:42

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 19/21] mm: dup_mmap down new mmap_sem

One anomaly remains from when Andrea rationalized the responsibilities
of mmap_sem and page_table_lock: in dup_mmap we add vmas to the child
holding its page_table_lock, but not the mmap_sem which normally guards
the vma list and rbtree. Which could be an issue for unuse_mm: though
since it just walks down the list (today with page_table_lock, tomorrow
not), it's probably okay. Would it need a memory barrier? Oh, keep it
simple, Nick and I agreed: no harm in taking the child's mmap_sem here.
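
What the diff amounts to, in outline: the child's mmap_sem simply nests
inside the parent's, and since the new mm is not yet visible to any
other task, the nested down_write cannot deadlock:

	down_write(&oldmm->mmap_sem);	/* guards the list being copied */
	flush_cache_mm(oldmm);
	down_write(&mm->mmap_sem);	/* child's: nests inside, unshared */
	/* ... link in vmas, copy_page_range ... */
	up_write(&mm->mmap_sem);
	flush_tlb_mm(oldmm);
	up_write(&oldmm->mmap_sem);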

Signed-off-by: Hugh Dickins <[email protected]>
---

kernel/fork.c | 9 ++++-----
1 files changed, 4 insertions(+), 5 deletions(-)

--- mm18/kernel/fork.c 2005-09-24 19:30:21.000000000 +0100
+++ mm19/kernel/fork.c 2005-09-24 19:30:35.000000000 +0100
@@ -192,6 +192,8 @@ static inline int dup_mmap(struct mm_str

down_write(&oldmm->mmap_sem);
flush_cache_mm(oldmm);
+ down_write(&mm->mmap_sem);
+
mm->locked_vm = 0;
mm->mmap = NULL;
mm->mmap_cache = NULL;
@@ -251,10 +253,7 @@ static inline int dup_mmap(struct mm_str
}

/*
- * Link in the new vma and copy the page table entries:
- * link in first so that swapoff can see swap entries.
- * Note that, exceptionally, here the vma is inserted
- * without holding mm->mmap_sem.
+ * Link in the new vma and copy the page table entries.
*/
spin_lock(&mm->page_table_lock);
*pprev = tmp;
@@ -275,8 +274,8 @@ static inline int dup_mmap(struct mm_str
goto out;
}
retval = 0;
-
out:
+ up_write(&mm->mmap_sem);
flush_tlb_mm(oldmm);
up_write(&oldmm->mmap_sem);
return retval;

2005-09-25 16:12:25

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 20/21] mm: sh64 hugetlbpage.c

The sh64 hugetlbpage.c seems to be erroneous, left over from a bygone
age, clashing with the common hugetlb.c. Replace it by a copy of the
sh hugetlbpage.c; except delete the mk_pte_huge macro, which neither file uses.

Signed-off-by: Hugh Dickins <[email protected]>
---

arch/sh/mm/hugetlbpage.c | 2
arch/sh64/mm/hugetlbpage.c | 188 ++-------------------------------------------
2 files changed, 12 insertions(+), 178 deletions(-)

--- mm19/arch/sh/mm/hugetlbpage.c 2005-08-29 00:41:01.000000000 +0100
+++ mm20/arch/sh/mm/hugetlbpage.c 2005-09-24 19:30:49.000000000 +0100
@@ -54,8 +54,6 @@ pte_t *huge_pte_offset(struct mm_struct
return pte;
}

-#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)
-
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t entry)
{
--- mm19/arch/sh64/mm/hugetlbpage.c 2005-08-29 00:41:01.000000000 +0100
+++ mm20/arch/sh64/mm/hugetlbpage.c 2005-09-24 19:30:49.000000000 +0100
@@ -54,41 +54,31 @@ pte_t *huge_pte_offset(struct mm_struct
return pte;
}

-#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)
-
-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
- struct page *page, pte_t * page_table, int write_access)
+void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t entry)
{
- unsigned long i;
- pte_t entry;
-
- add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
-
- if (write_access)
- entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)));
- else
- entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
- entry = pte_mkyoung(entry);
- mk_pte_huge(entry);
+ int i;

for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
- set_pte(page_table, entry);
- page_table++;
-
+ set_pte_at(mm, addr, ptep, entry);
+ ptep++;
+ addr += PAGE_SIZE;
pte_val(entry) += PAGE_SIZE;
}
}

-pte_t huge_ptep_get_and_clear(pte_t *ptep)
+pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep)
{
pte_t entry;
+ int i;

entry = *ptep;

for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
- pte_clear(pte);
- pte++;
+ pte_clear(mm, addr, ptep);
+ addr += PAGE_SIZE;
+ ptep++;
}

return entry;
@@ -106,79 +96,6 @@ int is_aligned_hugepage_range(unsigned l
return 0;
}

-int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
- struct vm_area_struct *vma)
-{
- pte_t *src_pte, *dst_pte, entry;
- struct page *ptepage;
- unsigned long addr = vma->vm_start;
- unsigned long end = vma->vm_end;
- int i;
-
- while (addr < end) {
- dst_pte = huge_pte_alloc(dst, addr);
- if (!dst_pte)
- goto nomem;
- src_pte = huge_pte_offset(src, addr);
- BUG_ON(!src_pte || pte_none(*src_pte));
- entry = *src_pte;
- ptepage = pte_page(entry);
- get_page(ptepage);
- for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
- set_pte(dst_pte, entry);
- pte_val(entry) += PAGE_SIZE;
- dst_pte++;
- }
- add_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
- }
- return 0;
-
-nomem:
- return -ENOMEM;
-}
-
-int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
- struct page **pages, struct vm_area_struct **vmas,
- unsigned long *position, int *length, int i)
-{
- unsigned long vaddr = *position;
- int remainder = *length;
-
- WARN_ON(!is_vm_hugetlb_page(vma));
-
- while (vaddr < vma->vm_end && remainder) {
- if (pages) {
- pte_t *pte;
- struct page *page;
-
- pte = huge_pte_offset(mm, vaddr);
-
- /* hugetlb should be locked, and hence, prefaulted */
- BUG_ON(!pte || pte_none(*pte));
-
- page = pte_page(*pte);
-
- WARN_ON(!PageCompound(page));
-
- get_page(page);
- pages[i] = page;
- }
-
- if (vmas)
- vmas[i] = vma;
-
- vaddr += PAGE_SIZE;
- --remainder;
- ++i;
- }
-
- *length = remainder;
- *position = vaddr;
-
- return i;
-}
-
struct page *follow_huge_addr(struct mm_struct *mm,
unsigned long address, int write)
{
@@ -195,84 +112,3 @@ struct page *follow_huge_pmd(struct mm_s
{
return NULL;
}
-
-void unmap_hugepage_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
-{
- struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
- pte_t *pte;
- struct page *page;
- int i;
-
- BUG_ON(start & (HPAGE_SIZE - 1));
- BUG_ON(end & (HPAGE_SIZE - 1));
-
- for (address = start; address < end; address += HPAGE_SIZE) {
- pte = huge_pte_offset(mm, address);
- BUG_ON(!pte);
- if (pte_none(*pte))
- continue;
- page = pte_page(*pte);
- put_page(page);
- for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
- pte_clear(mm, address+(i*PAGE_SIZE), pte);
- pte++;
- }
- }
- add_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
- flush_tlb_range(vma, start, end);
-}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
- if (!pte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}

2005-09-25 16:15:50

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 21/21] mm: m68k kill stram swap

Please, please now delete the Atari CONFIG_STRAM_SWAP code. It may be
excellent and ingenious code, but its reference to swap_vfsmnt betrays
that it hasn't been built since 2.5.1 (four years old come December),
it's delving deep into matters which are the preserve of core mm code,
its only purpose is to give the more conscientious mm guys an anxiety
attack from time to time; yet we keep on breaking it more and more.

If you want to use RAM for swap, then if the MTD driver does not already
provide just what you need, I'm sure David could be persuaded to add the
extra. But you'd also like to be able to allocate extents of that swap
for other use: we can give you a core interface for that if you need.
But unbuilt for four years suggests to me that there's no need at all.

I cannot swear the patch below won't break your build, but I believe it won't.

Signed-off-by: Hugh Dickins <[email protected]>
---

Documentation/kernel-parameters.txt | 2
Documentation/m68k/kernel-options.txt | 24
arch/m68k/Kconfig | 24
arch/m68k/atari/stram.c | 918 ----------------------------------
4 files changed, 17 insertions(+), 951 deletions(-)

--- mm20/Documentation/kernel-parameters.txt 2005-09-21 12:16:10.000000000 +0100
+++ mm21/Documentation/kernel-parameters.txt 2005-09-24 19:31:03.000000000 +0100
@@ -1443,8 +1443,6 @@ running once the system is up.
stifb= [HW]
Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]]

- stram_swap= [HW,M68k]
-
swiotlb= [IA-64] Number of I/O TLB slabs

switches= [HW,M68k]
--- mm20/Documentation/m68k/kernel-options.txt 2000-07-28 20:50:51.000000000 +0100
+++ mm21/Documentation/m68k/kernel-options.txt 2005-09-24 19:31:03.000000000 +0100
@@ -626,7 +626,7 @@ ignored (others aren't affected).
can be performed in optimal order. Not all SCSI devices support
tagged queuing (:-().

-4.6 switches=
+4.5 switches=
-------------

Syntax: switches=<list of switches>
@@ -661,28 +661,6 @@ correctly.
earlier initialization ("ov_"-less) takes precedence. But the
switching-off on reset still happens in this case.

-4.5) stram_swap=
-----------------
-
-Syntax: stram_swap=<do_swap>[,<max_swap>]
-
- This option is available only if the kernel has been compiled with
-CONFIG_STRAM_SWAP enabled. Normally, the kernel then determines
-dynamically whether to actually use ST-RAM as swap space. (Currently,
-the fraction of ST-RAM must be less or equal 1/3 of total memory to
-enable this swapping.) You can override the kernel's decision by
-specifying this option. 1 for <do_swap> means always enable the swap,
-even if you have less alternate RAM. 0 stands for never swap to
-ST-RAM, even if it's small enough compared to the rest of memory.
-
- If ST-RAM swapping is enabled, the kernel usually uses all free
-ST-RAM as swap "device". If the kernel resides in ST-RAM, the region
-allocated by it is obviously never used for swapping :-) You can also
-limit this amount by specifying the second parameter, <max_swap>, if
-you want to use parts of ST-RAM as normal system memory. <max_swap> is
-in kBytes and the number should be a multiple of 4 (otherwise: rounded
-down).
-
5) Options for Amiga Only:
==========================

--- mm20/arch/m68k/Kconfig 2005-09-21 12:16:14.000000000 +0100
+++ mm21/arch/m68k/Kconfig 2005-09-24 19:31:03.000000000 +0100
@@ -388,33 +388,11 @@ config AMIGA_PCMCIA
Include support in the kernel for pcmcia on Amiga 1200 and Amiga
600. If you intend to use pcmcia cards say Y; otherwise say N.

-config STRAM_SWAP
- bool "Support for ST-RAM as swap space"
- depends on ATARI && BROKEN
- ---help---
- Some Atari 68k machines (including the 520STF and 1020STE) divide
- their addressable memory into ST and TT sections. The TT section
- (up to 512MB) is the main memory; the ST section (up to 4MB) is
- accessible to the built-in graphics board, runs slower, and is
- present mainly for backward compatibility with older machines.
-
- This enables support for using (parts of) ST-RAM as swap space,
- instead of as normal system memory. This can first enhance system
- performance if you have lots of alternate RAM (compared to the size
- of ST-RAM), because executable code always will reside in faster
- memory. ST-RAM will remain as ultra-fast swap space. On the other
- hand, it allows much improved dynamic allocations of ST-RAM buffers
- for device driver modules (e.g. floppy, ACSI, SLM printer, DMA
- sound). The probability that such allocations at module load time
- fail is drastically reduced.
-
config STRAM_PROC
bool "ST-RAM statistics in /proc"
depends on ATARI
help
- Say Y here to report ST-RAM usage statistics in /proc/stram. See
- the help for CONFIG_STRAM_SWAP for discussion of ST-RAM and its
- uses.
+ Say Y here to report ST-RAM usage statistics in /proc/stram.

config HEARTBEAT
bool "Use power LED as a heartbeat" if AMIGA || APOLLO || ATARI || MAC ||Q40
--- mm20/arch/m68k/atari/stram.c 2005-06-17 20:48:29.000000000 +0100
+++ mm21/arch/m68k/atari/stram.c 2005-09-24 19:31:03.000000000 +0100
@@ -15,11 +15,9 @@
#include <linux/kdev_t.h>
#include <linux/major.h>
#include <linux/init.h>
-#include <linux/swap.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/pagemap.h>
-#include <linux/shm.h>
#include <linux/bootmem.h>
#include <linux/mount.h>
#include <linux/blkdev.h>
@@ -33,8 +31,6 @@
#include <asm/io.h>
#include <asm/semaphore.h>

-#include <linux/swapops.h>
-
#undef DEBUG

#ifdef DEBUG
@@ -49,8 +45,7 @@
#include <linux/proc_fs.h>
#endif

-/* Pre-swapping comments:
- *
+/*
* ++roman:
*
* New version of ST-Ram buffer allocation. Instead of using the
@@ -75,76 +70,6 @@
*
*/

-/*
- * New Nov 1997: Use ST-RAM as swap space!
- *
- * In the past, there were often problems with modules that require ST-RAM
- * buffers. Such drivers have to use __get_dma_pages(), which unfortunately
- * often isn't very successful in allocating more than 1 page :-( [1] The net
- * result was that most of the time you couldn't insmod such modules (ataflop,
- * ACSI, SCSI on Falcon, Atari internal framebuffer, not to speak of acsi_slm,
- * which needs a 1 MB buffer... :-).
- *
- * To overcome this limitation, ST-RAM can now be turned into a very
- * high-speed swap space. If a request for an ST-RAM buffer comes, the kernel
- * now tries to unswap some pages on that swap device to make some free (and
- * contiguous) space. This works much better in comparison to
- * __get_dma_pages(), since used swap pages can be selectively freed by either
- * moving them to somewhere else in swap space, or by reading them back into
- * system memory. Ok, there operation of unswapping isn't really cheap (for
- * each page, one has to go through the page tables of all processes), but it
- * doesn't happen that often (only when allocation ST-RAM, i.e. when loading a
- * module that needs ST-RAM). But it at least makes it possible to load such
- * modules!
- *
- * It could also be that overall system performance increases a bit due to
- * ST-RAM swapping, since slow ST-RAM isn't used anymore for holding data or
- * executing code in. It's then just a (very fast, compared to disk) back
- * storage for not-so-often needed data. (But this effect must be compared
- * with the loss of total memory...) Don't know if the effect is already
- * visible on a TT, where the speed difference between ST- and TT-RAM isn't
- * that dramatic, but it should on machines where TT-RAM is really much faster
- * (e.g. Afterburner).
- *
- * [1]: __get_free_pages() does a fine job if you only want one page, but if
- * you want more (contiguous) pages, it can give you such a block only if
- * there's already a free one. The algorithm can't try to free buffers or swap
- * out something in order to make more free space, since all that page-freeing
- * mechanisms work "target-less", i.e. they just free something, but not in a
- * specific place. I.e., __get_free_pages() can't do anything to free
- * *adjacent* pages :-( This situation becomes even worse for DMA memory,
- * since the freeing algorithms are also blind to DMA capability of pages.
- */
-
-/* 1998-10-20: ++andreas
- unswap_by_move disabled because it does not handle swapped shm pages.
-*/
-
-/* 2000-05-01: ++andreas
- Integrated with bootmem. Remove all traces of unswap_by_move.
-*/
-
-#ifdef CONFIG_STRAM_SWAP
-#define ALIGN_IF_SWAP(x) PAGE_ALIGN(x)
-#else
-#define ALIGN_IF_SWAP(x) (x)
-#endif
-
-/* get index of swap page at address 'addr' */
-#define SWAP_NR(addr) (((addr) - swap_start) >> PAGE_SHIFT)
-
-/* get address of swap page #'nr' */
-#define SWAP_ADDR(nr) (swap_start + ((nr) << PAGE_SHIFT))
-
-/* get number of pages for 'n' bytes (already page-aligned) */
-#define N_PAGES(n) ((n) >> PAGE_SHIFT)
-
-/* The following two numbers define the maximum fraction of ST-RAM in total
- * memory, below that the kernel would automatically use ST-RAM as swap
- * space. This decision can be overridden with stram_swap= */
-#define MAX_STRAM_FRACTION_NOM 1
-#define MAX_STRAM_FRACTION_DENOM 3
-
/* Start and end (virtual) of ST-RAM */
static void *stram_start, *stram_end;

@@ -164,10 +89,9 @@ typedef struct stram_block {
} BLOCK;

/* values for flags field */
-#define BLOCK_FREE 0x01 /* free structure in the BLOCKs pool */
+#define BLOCK_FREE 0x01 /* free structure in the BLOCKs pool */
#define BLOCK_KMALLOCED 0x02 /* structure allocated by kmalloc() */
-#define BLOCK_GFP 0x08 /* block allocated with __get_dma_pages() */
-#define BLOCK_INSWAP 0x10 /* block allocated in swap space */
+#define BLOCK_GFP 0x08 /* block allocated with __get_dma_pages() */

/* list of allocated blocks */
static BLOCK *alloc_list;
@@ -179,60 +103,8 @@ static BLOCK *alloc_list;
#define N_STATIC_BLOCKS 20
static BLOCK static_blocks[N_STATIC_BLOCKS];

-#ifdef CONFIG_STRAM_SWAP
-/* max. number of bytes to use for swapping
- * 0 = no ST-RAM swapping
- * -1 = do swapping (to whole ST-RAM) if it's less than MAX_STRAM_FRACTION of
- * total memory
- */
-static int max_swap_size = -1;
-
-/* start and end of swapping area */
-static void *swap_start, *swap_end;
-
-/* The ST-RAM's swap info structure */
-static struct swap_info_struct *stram_swap_info;
-
-/* The ST-RAM's swap type */
-static int stram_swap_type;
-
-/* Semaphore for get_stram_region. */
-static DECLARE_MUTEX(stram_swap_sem);
-
-/* major and minor device number of the ST-RAM device; for the major, we use
- * the same as Amiga z2ram, which is really similar and impossible on Atari,
- * and for the minor a relatively odd number to avoid the user creating and
- * using that device. */
-#define STRAM_MAJOR Z2RAM_MAJOR
-#define STRAM_MINOR 13
-
-/* Some impossible pointer value */
-#define MAGIC_FILE_P (struct file *)0xffffdead
-
-#ifdef DO_PROC
-static unsigned stat_swap_read;
-static unsigned stat_swap_write;
-static unsigned stat_swap_force;
-#endif /* DO_PROC */
-
-#endif /* CONFIG_STRAM_SWAP */
-
/***************************** Prototypes *****************************/

-#ifdef CONFIG_STRAM_SWAP
-static int swap_init(void *start_mem, void *swap_data);
-static void *get_stram_region( unsigned long n_pages );
-static void free_stram_region( unsigned long offset, unsigned long n_pages
- );
-static int in_some_region(void *addr);
-static unsigned long find_free_region( unsigned long n_pages, unsigned long
- *total_free, unsigned long
- *region_free );
-static void do_stram_request(request_queue_t *);
-static int stram_open( struct inode *inode, struct file *filp );
-static int stram_release( struct inode *inode, struct file *filp );
-static void reserve_region(void *start, void *end);
-#endif
static BLOCK *add_region( void *addr, unsigned long size );
static BLOCK *find_region( void *addr );
static int remove_region( BLOCK *block );
@@ -279,84 +151,11 @@ void __init atari_stram_init(void)
*/
void __init atari_stram_reserve_pages(void *start_mem)
{
-#ifdef CONFIG_STRAM_SWAP
- /* if max_swap_size is negative (i.e. no stram_swap= option given),
- * determine at run time whether to use ST-RAM swapping */
- if (max_swap_size < 0)
- /* Use swapping if ST-RAM doesn't make up more than MAX_STRAM_FRACTION
- * of total memory. In that case, the max. size is set to 16 MB,
- * because ST-RAM can never be bigger than that.
- * Also, never use swapping on a Hades, there's no separate ST-RAM in
- * that machine. */
- max_swap_size =
- (!MACH_IS_HADES &&
- (N_PAGES(stram_end-stram_start)*MAX_STRAM_FRACTION_DENOM <=
- ((unsigned long)high_memory>>PAGE_SHIFT)*MAX_STRAM_FRACTION_NOM)) ? 16*1024*1024 : 0;
- DPRINTK( "atari_stram_reserve_pages: max_swap_size = %d\n", max_swap_size );
-#endif
-
/* always reserve first page of ST-RAM, the first 2 kB are
* supervisor-only! */
if (!kernel_in_stram)
reserve_bootmem (0, PAGE_SIZE);

-#ifdef CONFIG_STRAM_SWAP
- {
- void *swap_data;
-
- start_mem = (void *) PAGE_ALIGN ((unsigned long) start_mem);
- /* determine first page to use as swap: if the kernel is
- in TT-RAM, this is the first page of (usable) ST-RAM;
- otherwise just use the end of kernel data (= start_mem) */
- swap_start = !kernel_in_stram ? stram_start + PAGE_SIZE : start_mem;
- /* decrement by one page, rest of kernel assumes that first swap page
- * is always reserved and maybe doesn't handle swp_entry == 0
- * correctly */
- swap_start -= PAGE_SIZE;
- swap_end = stram_end;
- if (swap_end-swap_start > max_swap_size)
- swap_end = swap_start + max_swap_size;
- DPRINTK( "atari_stram_reserve_pages: swapping enabled; "
- "swap=%p-%p\n", swap_start, swap_end);
-
- /* reserve some amount of memory for maintainance of
- * swapping itself: one page for each 2048 (PAGE_SIZE/2)
- * swap pages. (2 bytes for each page) */
- swap_data = start_mem;
- start_mem += ((SWAP_NR(swap_end) + PAGE_SIZE/2 - 1)
- >> (PAGE_SHIFT-1)) << PAGE_SHIFT;
- /* correct swap_start if necessary */
- if (swap_start + PAGE_SIZE == swap_data)
- swap_start = start_mem - PAGE_SIZE;
-
- if (!swap_init( start_mem, swap_data )) {
- printk( KERN_ERR "ST-RAM swap space initialization failed\n" );
- max_swap_size = 0;
- return;
- }
- /* reserve region for swapping meta-data */
- reserve_region(swap_data, start_mem);
- /* reserve swapping area itself */
- reserve_region(swap_start + PAGE_SIZE, swap_end);
-
- /*
- * If the whole ST-RAM is used for swapping, there are no allocatable
- * dma pages left. But unfortunately, some shared parts of the kernel
- * (particularly the SCSI mid-level) call __get_dma_pages()
- * unconditionally :-( These calls then fail, and scsi.c even doesn't
- * check for NULL return values and just crashes. The quick fix for
- * this (instead of doing much clean up work in the SCSI code) is to
- * pretend all pages are DMA-able by setting mach_max_dma_address to
- * ULONG_MAX. This doesn't change any functionality so far, since
- * get_dma_pages() shouldn't be used on Atari anyway anymore (better
- * use atari_stram_alloc()), and the Atari SCSI drivers don't need DMA
- * memory. But unfortunately there's now no kind of warning (even not
- * a NULL return value) if you use get_dma_pages() nevertheless :-(
- * You just will get non-DMA-able memory...
- */
- mach_max_dma_address = 0xffffffff;
- }
-#endif
}

void atari_stram_mem_init_hook (void)
@@ -367,7 +166,6 @@ void atari_stram_mem_init_hook (void)

/*
* This is main public interface: somehow allocate a ST-RAM block
- * There are three strategies:
*
* - If we're before mem_init(), we have to make a static allocation. The
* region is taken in the kernel data area (if the kernel is in ST-RAM) or
@@ -375,14 +173,9 @@ void atari_stram_mem_init_hook (void)
* rsvd_stram_* region. The ST-RAM is somewhere in the middle of kernel
* address space in the latter case.
*
- * - If mem_init() already has been called and ST-RAM swapping is enabled,
- * try to get the memory from the (pseudo) swap-space, either free already
- * or by moving some other pages out of the swap.
- *
- * - If mem_init() already has been called, and ST-RAM swapping is not
- * enabled, the only possibility is to try with __get_dma_pages(). This has
- * the disadvantage that it's very hard to get more than 1 page, and it is
- * likely to fail :-(
+ * - If mem_init() already has been called, try with __get_dma_pages().
+ * This has the disadvantage that it's very hard to get more than 1 page,
+ * and it is likely to fail :-(
*
*/
void *atari_stram_alloc(long size, const char *owner)
@@ -393,27 +186,13 @@ void *atari_stram_alloc(long size, const

DPRINTK("atari_stram_alloc(size=%08lx,owner=%s)\n", size, owner);

- size = ALIGN_IF_SWAP(size);
- DPRINTK( "atari_stram_alloc: rounded size = %08lx\n", size );
-#ifdef CONFIG_STRAM_SWAP
- if (max_swap_size) {
- /* If swapping is active: make some free space in the swap
- "device". */
- DPRINTK( "atari_stram_alloc: after mem_init, swapping ok, "
- "calling get_region\n" );
- addr = get_stram_region( N_PAGES(size) );
- flags = BLOCK_INSWAP;
- }
- else
-#endif
if (!mem_init_done)
return alloc_bootmem_low(size);
else {
- /* After mem_init() and no swapping: can only resort to
- * __get_dma_pages() */
+ /* After mem_init(): can only resort to __get_dma_pages() */
addr = (void *)__get_dma_pages(GFP_KERNEL, get_order(size));
flags = BLOCK_GFP;
- DPRINTK( "atari_stram_alloc: after mem_init, swapping off, "
+ DPRINTK( "atari_stram_alloc: after mem_init, "
"get_pages=%p\n", addr );
}

@@ -422,12 +201,7 @@ void *atari_stram_alloc(long size, const
/* out of memory for BLOCK structure :-( */
DPRINTK( "atari_stram_alloc: out of mem for BLOCK -- "
"freeing again\n" );
-#ifdef CONFIG_STRAM_SWAP
- if (flags == BLOCK_INSWAP)
- free_stram_region( SWAP_NR(addr), N_PAGES(size) );
- else
-#endif
- free_pages((unsigned long)addr, get_order(size));
+ free_pages((unsigned long)addr, get_order(size));
return( NULL );
}
block->owner = owner;
@@ -451,25 +225,12 @@ void atari_stram_free( void *addr )
DPRINTK( "atari_stram_free: found block (%p): size=%08lx, owner=%s, "
"flags=%02x\n", block, block->size, block->owner, block->flags );

-#ifdef CONFIG_STRAM_SWAP
- if (!max_swap_size) {
-#endif
- if (block->flags & BLOCK_GFP) {
- DPRINTK("atari_stram_free: is kmalloced, order_size=%d\n",
- get_order(block->size));
- free_pages((unsigned long)addr, get_order(block->size));
- }
- else
- goto fail;
-#ifdef CONFIG_STRAM_SWAP
- }
- else if (block->flags & BLOCK_INSWAP) {
- DPRINTK( "atari_stram_free: is swap-alloced\n" );
- free_stram_region( SWAP_NR(block->start), N_PAGES(block->size) );
- }
- else
+ if (!(block->flags & BLOCK_GFP))
goto fail;
-#endif
+
+ DPRINTK("atari_stram_free: is kmalloced, order_size=%d\n",
+ get_order(block->size));
+ free_pages((unsigned long)addr, get_order(block->size));
remove_region( block );
return;

@@ -478,612 +239,6 @@ void atari_stram_free( void *addr )
"(called from %p)\n", addr, __builtin_return_address(0) );
}

-
-#ifdef CONFIG_STRAM_SWAP
-
-
-/* ------------------------------------------------------------------------ */
-/* Main Swapping Functions */
-/* ------------------------------------------------------------------------ */
-
-
-/*
- * Initialize ST-RAM swap device
- * (lots copied and modified from sys_swapon() in mm/swapfile.c)
- */
-static int __init swap_init(void *start_mem, void *swap_data)
-{
- static struct dentry fake_dentry;
- static struct vfsmount fake_vfsmnt;
- struct swap_info_struct *p;
- struct inode swap_inode;
- unsigned int type;
- void *addr;
- int i, j, k, prev;
-
- DPRINTK("swap_init(start_mem=%p, swap_data=%p)\n",
- start_mem, swap_data);
-
- /* need at least one page for swapping to (and this also isn't very
- * much... :-) */
- if (swap_end - swap_start < 2*PAGE_SIZE) {
- printk( KERN_WARNING "stram_swap_init: swap space too small\n" );
- return( 0 );
- }
-
- /* find free slot in swap_info */
- for( p = swap_info, type = 0; type < nr_swapfiles; type++, p++ )
- if (!(p->flags & SWP_USED))
- break;
- if (type >= MAX_SWAPFILES) {
- printk( KERN_WARNING "stram_swap_init: max. number of "
- "swap devices exhausted\n" );
- return( 0 );
- }
- if (type >= nr_swapfiles)
- nr_swapfiles = type+1;
-
- stram_swap_info = p;
- stram_swap_type = type;
-
- /* fake some dir cache entries to give us some name in /dev/swaps */
- fake_dentry.d_parent = &fake_dentry;
- fake_dentry.d_name.name = "stram (internal)";
- fake_dentry.d_name.len = 16;
- fake_vfsmnt.mnt_parent = &fake_vfsmnt;
-
- p->flags = SWP_USED;
- p->swap_file = &fake_dentry;
- p->swap_vfsmnt = &fake_vfsmnt;
- p->swap_map = swap_data;
- p->cluster_nr = 0;
- p->next = -1;
- p->prio = 0x7ff0; /* a rather high priority, but not the higest
- * to give the user a chance to override */
-
- /* call stram_open() directly, avoids at least the overhead in
- * constructing a dummy file structure... */
- swap_inode.i_rdev = MKDEV( STRAM_MAJOR, STRAM_MINOR );
- stram_open( &swap_inode, MAGIC_FILE_P );
- p->max = SWAP_NR(swap_end);
-
- /* initialize swap_map: set regions that are already allocated or belong
- * to kernel data space to SWAP_MAP_BAD, otherwise to free */
- j = 0; /* # of free pages */
- k = 0; /* # of already allocated pages (from pre-mem_init stram_alloc()) */
- p->lowest_bit = 0;
- p->highest_bit = 0;
- for( i = 1, addr = SWAP_ADDR(1); i < p->max;
- i++, addr += PAGE_SIZE ) {
- if (in_some_region( addr )) {
- p->swap_map[i] = SWAP_MAP_BAD;
- ++k;
- }
- else if (kernel_in_stram && addr < start_mem ) {
- p->swap_map[i] = SWAP_MAP_BAD;
- }
- else {
- p->swap_map[i] = 0;
- ++j;
- if (!p->lowest_bit) p->lowest_bit = i;
- p->highest_bit = i;
- }
- }
- /* first page always reserved (and doesn't really belong to swap space) */
- p->swap_map[0] = SWAP_MAP_BAD;
-
- /* now swapping to this device ok */
- p->pages = j + k;
- swap_list_lock();
- nr_swap_pages += j;
- p->flags = SWP_WRITEOK;
-
- /* insert swap space into swap_list */
- prev = -1;
- for (i = swap_list.head; i >= 0; i = swap_info[i].next) {
- if (p->prio >= swap_info[i].prio) {
- break;
- }
- prev = i;
- }
- p->next = i;
- if (prev < 0) {
- swap_list.head = swap_list.next = p - swap_info;
- } else {
- swap_info[prev].next = p - swap_info;
- }
- swap_list_unlock();
-
- printk( KERN_INFO "Using %dk (%d pages) of ST-RAM as swap space.\n",
- p->pages << 2, p->pages );
- return( 1 );
-}
-
-
-/*
- * The swap entry has been read in advance, and we return 1 to indicate
- * that the page has been used or is no longer needed.
- *
- * Always set the resulting pte to be nowrite (the same as COW pages
- * after one process has exited). We don't know just how many PTEs will
- * share this swap entry, so be cautious and let do_wp_page work out
- * what to do if a write is requested later.
- */
-static inline void unswap_pte(struct vm_area_struct * vma, unsigned long
- address, pte_t *dir, swp_entry_t entry,
- struct page *page)
-{
- pte_t pte = *dir;
-
- if (pte_none(pte))
- return;
- if (pte_present(pte)) {
- /* If this entry is swap-cached, then page must already
- hold the right address for any copies in physical
- memory */
- if (pte_page(pte) != page)
- return;
- /* We will be removing the swap cache in a moment, so... */
- set_pte(dir, pte_mkdirty(pte));
- return;
- }
- if (pte_val(pte) != entry.val)
- return;
-
- DPRINTK("unswap_pte: replacing entry %08lx by new page %p",
- entry.val, page);
- set_pte(dir, pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
- swap_free(entry);
- get_page(page);
- inc_mm_counter(vma->vm_mm, rss);
-}
-
-static inline void unswap_pmd(struct vm_area_struct * vma, pmd_t *dir,
- unsigned long address, unsigned long size,
- unsigned long offset, swp_entry_t entry,
- struct page *page)
-{
- pte_t * pte;
- unsigned long end;
-
- if (pmd_none(*dir))
- return;
- if (pmd_bad(*dir)) {
- pmd_ERROR(*dir);
- pmd_clear(dir);
- return;
- }
- pte = pte_offset_kernel(dir, address);
- offset += address & PMD_MASK;
- address &= ~PMD_MASK;
- end = address + size;
- if (end > PMD_SIZE)
- end = PMD_SIZE;
- do {
- unswap_pte(vma, offset+address-vma->vm_start, pte, entry, page);
- address += PAGE_SIZE;
- pte++;
- } while (address < end);
-}
-
-static inline void unswap_pgd(struct vm_area_struct * vma, pgd_t *dir,
- unsigned long address, unsigned long size,
- swp_entry_t entry, struct page *page)
-{
- pmd_t * pmd;
- unsigned long offset, end;
-
- if (pgd_none(*dir))
- return;
- if (pgd_bad(*dir)) {
- pgd_ERROR(*dir);
- pgd_clear(dir);
- return;
- }
- pmd = pmd_offset(dir, address);
- offset = address & PGDIR_MASK;
- address &= ~PGDIR_MASK;
- end = address + size;
- if (end > PGDIR_SIZE)
- end = PGDIR_SIZE;
- do {
- unswap_pmd(vma, pmd, address, end - address, offset, entry,
- page);
- address = (address + PMD_SIZE) & PMD_MASK;
- pmd++;
- } while (address < end);
-}
-
-static void unswap_vma(struct vm_area_struct * vma, pgd_t *pgdir,
- swp_entry_t entry, struct page *page)
-{
- unsigned long start = vma->vm_start, end = vma->vm_end;
-
- do {
- unswap_pgd(vma, pgdir, start, end - start, entry, page);
- start = (start + PGDIR_SIZE) & PGDIR_MASK;
- pgdir++;
- } while (start < end);
-}
-
-static void unswap_process(struct mm_struct * mm, swp_entry_t entry,
- struct page *page)
-{
- struct vm_area_struct* vma;
-
- /*
- * Go through process' page directory.
- */
- if (!mm)
- return;
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- pgd_t * pgd = pgd_offset(mm, vma->vm_start);
- unswap_vma(vma, pgd, entry, page);
- }
-}
-
-
-static int unswap_by_read(unsigned short *map, unsigned long max,
- unsigned long start, unsigned long n_pages)
-{
- struct task_struct *p;
- struct page *page;
- swp_entry_t entry;
- unsigned long i;
-
- DPRINTK( "unswapping %lu..%lu by reading in\n",
- start, start+n_pages-1 );
-
- for( i = start; i < start+n_pages; ++i ) {
- if (map[i] == SWAP_MAP_BAD) {
- printk( KERN_ERR "get_stram_region: page %lu already "
- "reserved??\n", i );
- continue;
- }
-
- if (map[i]) {
- entry = swp_entry(stram_swap_type, i);
- DPRINTK("unswap: map[i=%lu]=%u nr_swap=%ld\n",
- i, map[i], nr_swap_pages);
-
- swap_device_lock(stram_swap_info);
- map[i]++;
- swap_device_unlock(stram_swap_info);
- /* Get a page for the entry, using the existing
- swap cache page if there is one. Otherwise,
- get a clean page and read the swap into it. */
- page = read_swap_cache_async(entry, NULL, 0);
- if (!page) {
- swap_free(entry);
- return -ENOMEM;
- }
- read_lock(&tasklist_lock);
- for_each_process(p)
- unswap_process(p->mm, entry, page);
- read_unlock(&tasklist_lock);
- shmem_unuse(entry, page);
- /* Now get rid of the extra reference to the
- temporary page we've been using. */
- if (PageSwapCache(page))
- delete_from_swap_cache(page);
- __free_page(page);
- #ifdef DO_PROC
- stat_swap_force++;
- #endif
- }
-
- DPRINTK( "unswap: map[i=%lu]=%u nr_swap=%ld\n",
- i, map[i], nr_swap_pages );
- swap_list_lock();
- swap_device_lock(stram_swap_info);
- map[i] = SWAP_MAP_BAD;
- if (stram_swap_info->lowest_bit == i)
- stram_swap_info->lowest_bit++;
- if (stram_swap_info->highest_bit == i)
- stram_swap_info->highest_bit--;
- --nr_swap_pages;
- swap_device_unlock(stram_swap_info);
- swap_list_unlock();
- }
-
- return 0;
-}
-
-/*
- * reserve a region in ST-RAM swap space for an allocation
- */
-static void *get_stram_region( unsigned long n_pages )
-{
- unsigned short *map = stram_swap_info->swap_map;
- unsigned long max = stram_swap_info->max;
- unsigned long start, total_free, region_free;
- int err;
- void *ret = NULL;
-
- DPRINTK( "get_stram_region(n_pages=%lu)\n", n_pages );
-
- down(&stram_swap_sem);
-
- /* disallow writing to the swap device now */
- stram_swap_info->flags = SWP_USED;
-
- /* find a region of n_pages pages in the swap space including as much free
- * pages as possible (and excluding any already-reserved pages). */
- if (!(start = find_free_region( n_pages, &total_free, &region_free )))
- goto end;
- DPRINTK( "get_stram_region: region starts at %lu, has %lu free pages\n",
- start, region_free );
-
- err = unswap_by_read(map, max, start, n_pages);
- if (err)
- goto end;
-
- ret = SWAP_ADDR(start);
- end:
- /* allow using swap device again */
- stram_swap_info->flags = SWP_WRITEOK;
- up(&stram_swap_sem);
- DPRINTK( "get_stram_region: returning %p\n", ret );
- return( ret );
-}
-
-
-/*
- * free a reserved region in ST-RAM swap space
- */
-static void free_stram_region( unsigned long offset, unsigned long n_pages )
-{
- unsigned short *map = stram_swap_info->swap_map;
-
- DPRINTK( "free_stram_region(offset=%lu,n_pages=%lu)\n", offset, n_pages );
-
- if (offset < 1 || offset + n_pages > stram_swap_info->max) {
- printk( KERN_ERR "free_stram_region: Trying to free non-ST-RAM\n" );
- return;
- }
-
- swap_list_lock();
- swap_device_lock(stram_swap_info);
- /* un-reserve the freed pages */
- for( ; n_pages > 0; ++offset, --n_pages ) {
- if (map[offset] != SWAP_MAP_BAD)
- printk( KERN_ERR "free_stram_region: Swap page %lu was not "
- "reserved\n", offset );
- map[offset] = 0;
- }
-
- /* update swapping meta-data */
- if (offset < stram_swap_info->lowest_bit)
- stram_swap_info->lowest_bit = offset;
- if (offset+n_pages-1 > stram_swap_info->highest_bit)
- stram_swap_info->highest_bit = offset+n_pages-1;
- if (stram_swap_info->prio > swap_info[swap_list.next].prio)
- swap_list.next = swap_list.head;
- nr_swap_pages += n_pages;
- swap_device_unlock(stram_swap_info);
- swap_list_unlock();
-}
-
-
-/* ------------------------------------------------------------------------ */
-/* Utility Functions for Swapping */
-/* ------------------------------------------------------------------------ */
-
-
-/* is addr in some of the allocated regions? */
-static int in_some_region(void *addr)
-{
- BLOCK *p;
-
- for( p = alloc_list; p; p = p->next ) {
- if (p->start <= addr && addr < p->start + p->size)
- return( 1 );
- }
- return( 0 );
-}
-
-
-static unsigned long find_free_region(unsigned long n_pages,
- unsigned long *total_free,
- unsigned long *region_free)
-{
- unsigned short *map = stram_swap_info->swap_map;
- unsigned long max = stram_swap_info->max;
- unsigned long head, tail, max_start;
- long nfree, max_free;
-
- /* first scan the swap space for a suitable place for the allocation */
- head = 1;
- max_start = 0;
- max_free = -1;
- *total_free = 0;
-
- start_over:
- /* increment tail until final window size reached, and count free pages */
- nfree = 0;
- for( tail = head; tail-head < n_pages && tail < max; ++tail ) {
- if (map[tail] == SWAP_MAP_BAD) {
- head = tail+1;
- goto start_over;
- }
- if (!map[tail]) {
- ++nfree;
- ++*total_free;
- }
- }
- if (tail-head < n_pages)
- goto out;
- if (nfree > max_free) {
- max_start = head;
- max_free = nfree;
- if (max_free >= n_pages)
- /* don't need more free pages... :-) */
- goto out;
- }
-
- /* now shift the window and look for the area where as much pages as
- * possible are free */
- while( tail < max ) {
- nfree -= (map[head++] == 0);
- if (map[tail] == SWAP_MAP_BAD) {
- head = tail+1;
- goto start_over;
- }
- if (!map[tail]) {
- ++nfree;
- ++*total_free;
- }
- ++tail;
- if (nfree > max_free) {
- max_start = head;
- max_free = nfree;
- if (max_free >= n_pages)
- /* don't need more free pages... :-) */
- goto out;
- }
- }
-
- out:
- if (max_free < 0) {
- printk( KERN_NOTICE "get_stram_region: ST-RAM too full or fragmented "
- "-- can't allocate %lu pages\n", n_pages );
- return( 0 );
- }
-
- *region_free = max_free;
- return( max_start );
-}
-
-
-/* setup parameters from command line */
-void __init stram_swap_setup(char *str, int *ints)
-{
- if (ints[0] >= 1)
- max_swap_size = ((ints[1] < 0 ? 0 : ints[1]) * 1024) & PAGE_MASK;
-}
-
-
-/* ------------------------------------------------------------------------ */
-/* ST-RAM device */
-/* ------------------------------------------------------------------------ */
-
-static int refcnt;
-
-static void do_stram_request(request_queue_t *q)
-{
- struct request *req;
-
- while ((req = elv_next_request(q)) != NULL) {
- void *start = swap_start + (req->sector << 9);
- unsigned long len = req->current_nr_sectors << 9;
- if ((start + len) > swap_end) {
- printk( KERN_ERR "stram: bad access beyond end of device: "
- "block=%ld, count=%d\n",
- req->sector,
- req->current_nr_sectors );
- end_request(req, 0);
- continue;
- }
-
- if (req->cmd == READ) {
- memcpy(req->buffer, start, len);
-#ifdef DO_PROC
- stat_swap_read += N_PAGES(len);
-#endif
- }
- else {
- memcpy(start, req->buffer, len);
-#ifdef DO_PROC
- stat_swap_write += N_PAGES(len);
-#endif
- }
- end_request(req, 1);
- }
-}
-
-
-static int stram_open( struct inode *inode, struct file *filp )
-{
- if (filp != MAGIC_FILE_P) {
- printk( KERN_NOTICE "Only kernel can open ST-RAM device\n" );
- return( -EPERM );
- }
- if (refcnt)
- return( -EBUSY );
- ++refcnt;
- return( 0 );
-}
-
-static int stram_release( struct inode *inode, struct file *filp )
-{
- if (filp != MAGIC_FILE_P) {
- printk( KERN_NOTICE "Only kernel can close ST-RAM device\n" );
- return( -EPERM );
- }
- if (refcnt > 0)
- --refcnt;
- return( 0 );
-}
-
-
-static struct block_device_operations stram_fops = {
- .open = stram_open,
- .release = stram_release,
-};
-
-static struct gendisk *stram_disk;
-static struct request_queue *stram_queue;
-static DEFINE_SPINLOCK(stram_lock);
-
-int __init stram_device_init(void)
-{
- if (!MACH_IS_ATARI)
- /* no point in initializing this, I hope */
- return -ENXIO;
-
- if (!max_swap_size)
- /* swapping not enabled */
- return -ENXIO;
- stram_disk = alloc_disk(1);
- if (!stram_disk)
- return -ENOMEM;
-
- if (register_blkdev(STRAM_MAJOR, "stram")) {
- put_disk(stram_disk);
- return -ENXIO;
- }
-
- stram_queue = blk_init_queue(do_stram_request, &stram_lock);
- if (!stram_queue) {
- unregister_blkdev(STRAM_MAJOR, "stram");
- put_disk(stram_disk);
- return -ENOMEM;
- }
-
- stram_disk->major = STRAM_MAJOR;
- stram_disk->first_minor = STRAM_MINOR;
- stram_disk->fops = &stram_fops;
- stram_disk->queue = stram_queue;
- sprintf(stram_disk->disk_name, "stram");
- set_capacity(stram_disk, (swap_end - swap_start)/512);
- add_disk(stram_disk);
- return 0;
-}
-
-
-
-/* ------------------------------------------------------------------------ */
-/* Misc Utility Functions */
-/* ------------------------------------------------------------------------ */
-
-/* reserve a range of pages */
-static void reserve_region(void *start, void *end)
-{
- reserve_bootmem (virt_to_phys(start), end - start);
-}
-
-#endif /* CONFIG_STRAM_SWAP */
-

/* ------------------------------------------------------------------------ */
/* Region Management */
@@ -1173,50 +328,9 @@ int get_stram_list( char *buf )
{
int len = 0;
BLOCK *p;
-#ifdef CONFIG_STRAM_SWAP
- int i;
- unsigned short *map = stram_swap_info->swap_map;
- unsigned long max = stram_swap_info->max;
- unsigned free = 0, used = 0, rsvd = 0;
-#endif

-#ifdef CONFIG_STRAM_SWAP
- if (max_swap_size) {
- for( i = 1; i < max; ++i ) {
- if (!map[i])
- ++free;
- else if (map[i] == SWAP_MAP_BAD)
- ++rsvd;
- else
- ++used;
- }
- PRINT_PROC(
- "Total ST-RAM: %8u kB\n"
- "Total ST-RAM swap: %8lu kB\n"
- "Free swap: %8u kB\n"
- "Used swap: %8u kB\n"
- "Allocated swap: %8u kB\n"
- "Swap Reads: %8u\n"
- "Swap Writes: %8u\n"
- "Swap Forced Reads: %8u\n",
- (stram_end - stram_start) >> 10,
- (max-1) << (PAGE_SHIFT-10),
- free << (PAGE_SHIFT-10),
- used << (PAGE_SHIFT-10),
- rsvd << (PAGE_SHIFT-10),
- stat_swap_read,
- stat_swap_write,
- stat_swap_force );
- }
- else {
-#endif
- PRINT_PROC( "ST-RAM swapping disabled\n" );
- PRINT_PROC("Total ST-RAM: %8u kB\n",
+ PRINT_PROC("Total ST-RAM: %8u kB\n",
(stram_end - stram_start) >> 10);
-#ifdef CONFIG_STRAM_SWAP
- }
-#endif
-
PRINT_PROC( "Allocated regions:\n" );
for( p = alloc_list; p; p = p->next ) {
if (len + 50 >= PAGE_SIZE)
@@ -1227,8 +341,6 @@ int get_stram_list( char *buf )
p->owner);
if (p->flags & BLOCK_GFP)
PRINT_PROC( "page-alloced)\n" );
- else if (p->flags & BLOCK_INSWAP)
- PRINT_PROC( "in swap)\n" );
else
PRINT_PROC( "??)\n" );
}

2005-09-25 22:27:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 04/21] mm: zap_pte_range dont dirty anon

Hugh Dickins <[email protected]> wrote:
>
> zap_pte_range already avoids wasting time to mark_page_accessed on anon
> pages: it can also skip anon set_page_dirty - the page only needs to be
> marked dirty if shared with another mm, but that will say pte_dirty too.
>

Are you sure about this?

> --- mm03/mm/memory.c 2005-09-24 19:26:38.000000000 +0100
> +++ mm04/mm/memory.c 2005-09-24 19:27:05.000000000 +0100
> @@ -574,12 +574,14 @@ static void zap_pte_range(struct mmu_gat
> addr) != page->index)
> set_pte_at(tlb->mm, addr, pte,
> pgoff_to_pte(page->index));
> - if (pte_dirty(ptent))
> - set_page_dirty(page);
> if (PageAnon(page))
> dec_mm_counter(tlb->mm, anon_rss);
> - else if (pte_young(ptent))
> - mark_page_accessed(page);
> + else {
> + if (pte_dirty(ptent))
> + set_page_dirty(page);
> + if (pte_young(ptent))
> + mark_page_accessed(page);
> + }
> tlb->freed++;
> page_remove_rmap(page);
> tlb_remove_page(tlb, page);

What if the page is (for example) clean swapcache, having been recently
faulted in? If this pte indicates that this process has modified the page
and we don't run set_page_dirty(), the page could be reclaimed and the
change is lost.

Or what if the page was an anon page resulting from (say) a swapoff, and
it's shared by two mm's and one has modified it and we drop that dirty pte?

Or <other scenarios>.

Need more convincing.

2005-09-26 06:03:15

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 04/21] mm: zap_pte_range dont dirty anon

On Sun, 25 Sep 2005, Andrew Morton wrote:
> Hugh Dickins <[email protected]> wrote:
> >
> > zap_pte_range already avoids wasting time to mark_page_accessed on anon
> > pages: it can also skip anon set_page_dirty - the page only needs to be
> > marked dirty if shared with another mm, but that will say pte_dirty too.
>
> Are you sure about this?

Yeerrrssss, well, I'm never _sure_ about anything,
especially when faced by the question.

> What if the page is (for example) clean swapcache, having been recently
> faulted in? If this pte indicates that this process has modified the page
> and we don't run set_page_dirty(), the page could be reclaimed and the
> change is lost.

Absolutely. But either the page is unique to this mm, shared only with
swapcache: in which case we're about to do a free_swap_cache on it (that
may be delayed in actually freeing the swap because of not getting page
lock, presumably because vmscan just got to it, but no matter), and we
don't care at all that the page no longer represents what's on swap disk.

Or, the page is shared with another mm. But it's an anonymous page
(a private page), so it's shared via fork, and COW applies to it.
copy_one_pte did ptep_set_wrprotect on it, and did not pte_mkclean [*].

So if it's dirty from before the fork, the sharing mm will also have
it marked pte_dirty, which will get propagated through to the page
via that mm, and everything's fine even though we're ignoring the
pte_dirty in this unmapping mm. Or if it's dirty from after the fork,
well, something's gone very wrong if the page is still shared - unless
it's because there's been a further fork since it was dirtied, in which
case that sibling carries the pte_dirty which guarantees its integrity.
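
[Illustrative aside: to make the wrprotect-but-never-clean rule concrete,
here is a toy model - invented PTE_* flags and a bare-bones helper, not
the real copy_one_pte() - showing how a pre-fork dirty bit survives in
every sharer's pte:]

	#include <stdio.h>

	/* Toy pte: a bitmask, with flag names invented for this sketch. */
	#define PTE_WRITE	0x1
	#define PTE_DIRTY	0x2

	typedef unsigned int toy_pte_t;

	/* The spirit of copy_one_pte() for a private (COW) mapping:
	 * write-protect the source, copy it as-is - no pte_mkclean. */
	static toy_pte_t copy_cow_pte(toy_pte_t *src)
	{
		*src &= ~PTE_WRITE;	/* ptep_set_wrprotect, in effect */
		return *src;		/* dirty bit duplicated, not cleaned */
	}

	int main(void)
	{
		toy_pte_t parent = PTE_WRITE | PTE_DIRTY; /* dirtied pre-fork */
		toy_pte_t child = copy_cow_pte(&parent);

		/* Zap either mm while ignoring its pte_dirty: the other
		 * sharer's pte still records that the page is dirty. */
		printf("parent dirty=%d, child dirty=%d\n",
		       !!(parent & PTE_DIRTY), !!(child & PTE_DIRTY));
		return 0;
	}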

The change would be very wrong for the !PageAnon pages:
but it's right for the PageAnon ones, don't you agree?

> Or what if the page was an anon page resulting from (say) a swapoff, and
> it's shared by two mm's and one has modified it and we drop that dirty pte?

Again, the one which tried to modify it will have got a Copy-On-Write
fault, been given a copy of the page, and the original won't be shared.

> Or <other scenarios>.
>
> Need more convincing.

Convinced? I hope it's just that you forgot something and now remember.
But do demand more from me if not (and don't feel obliged to devise more
cases to ask about, though they do help - thanks). Or just drop the
patch, it was merely a passing observation - I'm not building any
great edifice upon it in later patches!

Hugh

[*] This ignores the issue of the "Linus" pages, those anonymous
pages which have got into a shared vma by ptrace writing while the
vma was unwritable - copy_one_pte's tests are on VM_SHARED. They're
in a limbo between private and shared, and may indeed behave oddly.
For the moment we'll continue to ignore the oddities of that case.

2005-09-26 06:15:11

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 04/21] mm: zap_pte_range dont dirty anon

Hugh Dickins <[email protected]> wrote:
>
> > What if the page is (for example) clean swapcache, having been recently
> > faulted in? If this pte indicates that this process has modified the page
> > and we don't run set_page_dirty(), the page could be reclaimed and the
> > change is lost.
>
> Absolutely. But either the page is unique to this mm, shared only with
> swapcache: in which case we're about to do a free_swap_cache on it (that
> may be delayed in actually freeing the swap because of not getting page
> lock, presumably because vmscan just got to it, but no matter), and we
> don't care at all that the page no longer represents what's on swap disk.
>
> Or, the page is shared with another mm. But it's an anonymous page
> (a private page), so it's shared via fork, and COW applies to it.

mmap(MAP_ANONYMOUS|MAP_SHARED)
fork()
swapout
swapin
swapoff

Now we have two mm's sharing a clean, non-cowable, non-swapcache anonymous
page, no?

2005-09-26 07:21:06

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 04/21] mm: zap_pte_range dont dirty anon

On Sun, 25 Sep 2005, Andrew Morton wrote:
>
> mmap(MAP_ANONYMOUS|MAP_SHARED)
> fork()
> swapout
> swapin
> swapoff
>
> Now we have two mm's sharing a clean, non-cowable, non-swapcache anonymous
> page, no?

No, MAP_ANONYMOUS|MAP_SHARED gives you a tmpfs object via shmem_zero_setup:
all those pages are shared file pages, not PageAnon at all.
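
[Illustrative aside: the sharing semantics Hugh describes are easy to see
from userspace. This small program is illustrative only, not part of the
patchset: a child's write to a MAP_ANONYMOUS|MAP_SHARED mapping is visible
to the parent, because both map the same shmem object rather than
PageAnon memory.]

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		/* backed by a shared shmem object, via shmem_zero_setup() */
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		if (fork() == 0) {			/* child writes... */
			strcpy(p, "hello from the child");
			_exit(0);
		}
		wait(NULL);
		printf("parent reads: %s\n", p);	/* ...parent sees it */
		return 0;
	}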

Hugh

2005-09-26 07:25:35

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 17/21] mm: batch updating mm_counters

On Sun, 2005-09-25 at 17:08 +0100, Hugh Dickins wrote:
> tlb_finish_mmu used to batch zap_pte_range's update of mm rss, which may
> be worthwhile if the mm is contended, and would reduce atomic operations
> if the counts were atomic. Let zap_pte_range now batch its updates to
> file_rss and anon_rss, per page-table in case we drop the lock outside;
> and copy_pte_range batch them too.

Good idea.

> progress++;
> continue;
> }
> - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vm_flags, addr);
> + anon = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
> + vm_flags, addr);
> + rss[anon]++;

How about passing rss[2] to copy_one_pte, and having that
increment the correct rss value accordingly? Though you may
not consider that any nicer than what you have here.
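
[Illustrative aside: schematically, the two shapes under discussion look
like this - a standalone toy, with the six kernel arguments of the real
copy_one_pte() dropped. Variant A is what the patch posted does; variant
B is the suggestion above. Either way the batched totals are folded into
the mm once per page table, not once per pte.]

	#include <stdio.h>

	/* Variant A, as posted: the helper reports which count to bump. */
	static int copy_one_pte_a(int anon)
	{
		/* ...copy the pte... */
		return anon;			/* 0 = file, 1 = anon */
	}

	/* Variant B, as suggested: pass rss[2] in, count in the helper. */
	static void copy_one_pte_b(int anon, int rss[2])
	{
		/* ...copy the pte... */
		rss[anon]++;
	}

	int main(void)
	{
		int rss[2] = { 0, 0 };		/* [0]=file_rss, [1]=anon_rss */

		rss[copy_one_pte_a(1)]++;	/* caller-side counting */
		copy_one_pte_b(0, rss);		/* callee-side counting */

		printf("file_rss=%d anon_rss=%d\n", rss[0], rss[1]);
		return 0;
	}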

Nick

--
SUSE Labs, Novell Inc.




2005-09-26 08:43:10

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 17/21] mm: batch updating mm_counters

On Mon, 26 Sep 2005, Nick Piggin wrote:
> On Sun, 2005-09-25 at 17:08 +0100, Hugh Dickins wrote:
> > - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vm_flags, addr);
> > + anon = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
> > + vm_flags, addr);
> > + rss[anon]++;
>
> How about passing rss[2] to copy_one_pte, and having that
> increment the correct rss value accordingly? Though you may
> not consider that any nicer than what you have here.

That does seem a more _normal_ way of doing it.

Though adding a seventh argument doesn't appeal
(perhaps irrelevant since copy_one_pte is inlined).

I don't mind much either way: anyone have strong feelings?

Hugh

2005-09-28 00:05:34

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: page fault scalability prep

On Sun, 25 Sep 2005, Hugh Dickins wrote:

> Here comes the preparatory batch for my page fault scalability patches.
> This batch makes a few fixes - I suggest 01 and 02 should go in 2.6.14 -
> and a lot of tidyups, clearing some undergrowth for the real patches.

Well. A mind-boggling patchset. Cannot say that I understand all of it,
but it does a lot of mine-sweeping through the trouble spots that I
have also seen. Great work!

2005-09-29 07:00:17

by Paul Mundt

[permalink] [raw]
Subject: Re: [PATCH 20/21] mm: sh64 hugetlbpage.c

On Sun, Sep 25, 2005 at 05:11:58PM +0100, Hugh Dickins wrote:
> The sh64 hugetlbpage.c seems to be erroneous, left over from a bygone
> age, clashing with the common hugetlb.c. Replace it by a copy of the
> sh hugetlbpage.c. Except, delete that mk_pte_huge macro neither uses.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> arch/sh/mm/hugetlbpage.c | 2
> arch/sh64/mm/hugetlbpage.c | 188 ++-------------------------------------------
> 2 files changed, 12 insertions(+), 178 deletions(-)
>
Looks good, thanks Hugh.

Acked-by: Paul Mundt <[email protected]>

