2005-10-13 00:45:28

by Hugh Dickins

Subject: [PATCH 00/21] mm: page fault scalability nearer

Here comes the second batch of my page fault scalability patches.
It starts off with a few small adjustments, takes a detour around
update_mem_hiwater in 06/21, then really gets under way with 09/21.

It ends with almost all of the changes complete in the common core, but
still using page_table_lock unscalably. Scattered changes to arch files,
nothing major, are needed before finally splitting the lock in the third batch.

This batch is against 2.6.14-rc2-mm2 plus Nick's core remove PageReserved
patch. 18/21 is a small fix to that patch; you may want to move it down there.

Hugh

Documentation/cachetlb.txt | 9
arch/alpha/mm/remap.c | 6
arch/arm/mm/consistent.c | 6
arch/arm/mm/ioremap.c | 4
arch/arm/mm/mm-armv.c | 14
arch/arm/oprofile/backtrace.c | 46 --
arch/arm26/mm/memc.c | 18 -
arch/cris/mm/ioremap.c | 4
arch/frv/mm/dma-alloc.c | 5
arch/i386/mm/ioremap.c | 4
arch/i386/oprofile/backtrace.c | 38 --
arch/ia64/mm/fault.c | 34 --
arch/ia64/mm/init.c | 13
arch/ia64/mm/tlb.c | 2
arch/m32r/mm/ioremap.c | 4
arch/m68k/mm/kmap.c | 2
arch/m68k/sun3x/dvma.c | 2
arch/mips/mm/ioremap.c | 4
arch/parisc/kernel/pci-dma.c | 2
arch/parisc/mm/ioremap.c | 6
arch/ppc/kernel/dma-mapping.c | 6
arch/ppc/mm/4xx_mmu.c | 4
arch/ppc/mm/pgtable.c | 4
arch/ppc64/mm/imalloc.c | 5
arch/ppc64/mm/init.c | 4
arch/s390/mm/ioremap.c | 4
arch/sh/mm/ioremap.c | 4
arch/sh64/mm/ioremap.c | 4
arch/sparc/mm/generic.c | 4
arch/sparc64/mm/generic.c | 6
arch/um/kernel/skas/mmu.c | 3
arch/x86_64/mm/ioremap.c | 4
fs/compat.c | 1
fs/exec.c | 15
fs/hugetlbfs/inode.c | 4
fs/proc/task_mmu.c | 43 +-
include/asm-generic/4level-fixup.h | 11
include/asm-i386/pgtable.h | 3
include/asm-parisc/tlbflush.h | 3
include/asm-um/pgtable.h | 2
include/linux/hugetlb.h | 2
include/linux/mm.h | 85 +++--
include/linux/rmap.h | 4
include/linux/sched.h | 29 +
kernel/exit.c | 5
kernel/fork.c | 2
kernel/futex.c | 6
kernel/sched.c | 2
mm/filemap_xip.c | 15
mm/fremap.c | 67 +---
mm/hugetlb.c | 27 -
mm/memory.c | 609 +++++++++++++++----------------------
mm/mempolicy.c | 7
mm/mmap.c | 44 +-
mm/mprotect.c | 7
mm/mremap.c | 64 +--
mm/msync.c | 21 -
mm/nommu.c | 18 -
mm/rmap.c | 113 +++---
mm/swapfile.c | 20 -
mm/vmalloc.c | 4
61 files changed, 623 insertions(+), 885 deletions(-)


2005-10-13 00:46:26

by Hugh Dickins

Subject: [PATCH 01/21] mm: copy_one_pte inc rss

Small adjustment, following Nick's suggestion: it's more straightforward
for copy_pte_range to let copy_one_pte do the rss incrementing itself
than to use an index passed back. Saves a #define, and 16 bytes of .text.
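
To spell out the convention (an illustrative fragment, not part of the
patch): rss[0] counts file pages and rss[1] anon pages, indexed by
!!PageAnon(page), and copy_pte_range feeds the pair to add_mm_rss once
it leaves its loop, roughly

        int rss[2];     /* rss[0]: file pages, rss[1]: anon pages */

        rss[1] = rss[0] = 0;
        ...
        copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
        ...
        add_mm_rss(dst_mm, rss[0], rss[1]);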

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 15 +++++----------
1 files changed, 5 insertions(+), 10 deletions(-)

--- mm00/mm/memory.c 2005-10-11 12:16:50.000000000 +0100
+++ mm01/mm/memory.c 2005-10-11 23:53:00.000000000 +0100
@@ -340,8 +340,6 @@ static inline void add_mm_rss(struct mm_
add_mm_counter(mm, anon_rss, anon_rss);
}

-#define NO_RSS 2 /* Increment neither file_rss nor anon_rss */
-
/*
* This function is called to print an error when a pte in a
* !VM_RESERVED region is found pointing to an invalid pfn (which
@@ -368,16 +366,15 @@ void print_bad_pte(struct vm_area_struct
* but may be dropped within p[mg]d_alloc() and pte_alloc_map().
*/

-static inline int
+static inline void
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
- unsigned long addr)
+ unsigned long addr, int *rss)
{
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
unsigned long pfn;
- int anon = NO_RSS;

/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
@@ -428,11 +425,10 @@ copy_one_pte(struct mm_struct *dst_mm, s
pte = pte_mkold(pte);
get_page(page);
page_dup_rmap(page);
- anon = !!PageAnon(page);
+ rss[!!PageAnon(page)]++;

out_set_pte:
set_pte_at(dst_mm, addr, dst_pte, pte);
- return anon;
}

static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -441,7 +437,7 @@ static int copy_pte_range(struct mm_stru
{
pte_t *src_pte, *dst_pte;
int progress = 0;
- int rss[NO_RSS+1], anon;
+ int rss[2];

again:
rss[1] = rss[0] = 0;
@@ -467,8 +463,7 @@ again:
progress++;
continue;
}
- anon = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma,addr);
- rss[anon]++;
+ copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
spin_unlock(&src_mm->page_table_lock);

2005-10-13 00:47:05

by Hugh Dickins

Subject: [PATCH 02/21] mm: zap_pte_range dec rss

Small adjustment: zap_pte_range now decrements its rss counts from 0, then
finally adds them, avoiding negations - we don't have or need a sub_mm_rss.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

--- mm01/mm/memory.c 2005-10-11 23:53:00.000000000 +0100
+++ mm02/mm/memory.c 2005-10-11 23:53:17.000000000 +0100
@@ -609,13 +609,13 @@ static void zap_pte_range(struct mmu_gat
set_pte_at(mm, addr, pte,
pgoff_to_pte(page->index));
if (PageAnon(page))
- anon_rss++;
+ anon_rss--;
else {
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent))
mark_page_accessed(page);
- file_rss++;
+ file_rss--;
}
page_remove_rmap(page);
tlb_remove_page(tlb, page);
@@ -632,7 +632,7 @@ static void zap_pte_range(struct mmu_gat
pte_clear_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, addr != end);

- add_mm_rss(mm, -file_rss, -anon_rss);
+ add_mm_rss(mm, file_rss, anon_rss);
pte_unmap(pte - 1);
}

2005-10-13 00:47:46

by Hugh Dickins

Subject: [PATCH 03/21] mm: do_swap_page race major

Small adjustment: do_swap_page should report its !pte_same race as a
major fault if it had to read into swap cache, because whatever raced
with it will have found the page already in cache and reported a minor
fault.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 4 +---
1 files changed, 1 insertion(+), 3 deletions(-)

--- mm02/mm/memory.c 2005-10-11 23:53:17.000000000 +0100
+++ mm03/mm/memory.c 2005-10-11 23:53:35.000000000 +0100
@@ -1728,10 +1728,8 @@ static int do_swap_page(struct mm_struct
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
- if (unlikely(!pte_same(*page_table, orig_pte))) {
- ret = VM_FAULT_MINOR;
+ if (unlikely(!pte_same(*page_table, orig_pte)))
goto out_nomap;
- }

if (unlikely(!PageUptodate(page))) {
ret = VM_FAULT_SIGBUS;

2005-10-13 00:48:35

by Hugh Dickins

Subject: [PATCH 04/21] mm: do_mremap current mm

Cleanup: relieve do_mremap from its surfeit of current->mms.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/mremap.c | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)

--- mm03/mm/mremap.c 2005-09-30 11:59:12.000000000 +0100
+++ mm04/mm/mremap.c 2005-10-11 23:53:55.000000000 +0100
@@ -245,6 +245,7 @@ unsigned long do_mremap(unsigned long ad
unsigned long old_len, unsigned long new_len,
unsigned long flags, unsigned long new_addr)
{
+ struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long ret = -EINVAL;
unsigned long charged = 0;
@@ -285,7 +286,7 @@ unsigned long do_mremap(unsigned long ad
if ((addr <= new_addr) && (addr+old_len) > new_addr)
goto out;

- ret = do_munmap(current->mm, new_addr, new_len);
+ ret = do_munmap(mm, new_addr, new_len);
if (ret)
goto out;
}
@@ -296,7 +297,7 @@ unsigned long do_mremap(unsigned long ad
* do_munmap does all the needed commit accounting
*/
if (old_len >= new_len) {
- ret = do_munmap(current->mm, addr+new_len, old_len - new_len);
+ ret = do_munmap(mm, addr+new_len, old_len - new_len);
if (ret && old_len != new_len)
goto out;
ret = addr;
@@ -309,7 +310,7 @@ unsigned long do_mremap(unsigned long ad
* Ok, we need to grow.. or relocate.
*/
ret = -EFAULT;
- vma = find_vma(current->mm, addr);
+ vma = find_vma(mm, addr);
if (!vma || vma->vm_start > addr)
goto out;
if (is_vm_hugetlb_page(vma)) {
@@ -325,14 +326,14 @@ unsigned long do_mremap(unsigned long ad
}
if (vma->vm_flags & VM_LOCKED) {
unsigned long locked, lock_limit;
- locked = current->mm->locked_vm << PAGE_SHIFT;
+ locked = mm->locked_vm << PAGE_SHIFT;
lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
locked += new_len - old_len;
ret = -EAGAIN;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
goto out;
}
- if (!may_expand_vm(current->mm, (new_len - old_len) >> PAGE_SHIFT)) {
+ if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT)) {
ret = -ENOMEM;
goto out;
}
@@ -359,11 +360,10 @@ unsigned long do_mremap(unsigned long ad
vma_adjust(vma, vma->vm_start,
addr + new_len, vma->vm_pgoff, NULL);

- current->mm->total_vm += pages;
- vm_stat_account(vma->vm_mm, vma->vm_flags,
- vma->vm_file, pages);
+ mm->total_vm += pages;
+ vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
if (vma->vm_flags & VM_LOCKED) {
- current->mm->locked_vm += pages;
+ mm->locked_vm += pages;
make_pages_present(addr + old_len,
addr + new_len);
}

2005-10-13 00:49:15

by Hugh Dickins

Subject: [PATCH 05/21] mm: zap_pte out of line

There used to be just one call to zap_pte, but it shouldn't be inline
now that there are two. Check for the common case pte_none before calling,
and move its rss accounting up into install_page or install_file_pte -
which helps the next patch.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/fremap.c | 19 +++++++++----------
1 files changed, 9 insertions(+), 10 deletions(-)

--- mm04/mm/fremap.c 2005-10-11 12:16:50.000000000 +0100
+++ mm05/mm/fremap.c 2005-10-11 23:54:15.000000000 +0100
@@ -20,34 +20,32 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>

-static inline void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+static int zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
pte_t pte = *ptep;
+ struct page *page = NULL;

- if (pte_none(pte))
- return;
if (pte_present(pte)) {
unsigned long pfn = pte_pfn(pte);
- struct page *page;
-
flush_cache_page(vma, addr, pfn);
pte = ptep_clear_flush(vma, addr, ptep);
if (unlikely(!pfn_valid(pfn))) {
print_bad_pte(vma, pte, addr);
- return;
+ goto out;
}
page = pfn_to_page(pfn);
if (pte_dirty(pte))
set_page_dirty(page);
page_remove_rmap(page);
page_cache_release(page);
- dec_mm_counter(mm, file_rss);
} else {
if (!pte_file(pte))
free_swap_and_cache(pte_to_swp_entry(pte));
pte_clear(mm, addr, ptep);
}
+out:
+ return !!page;
}

/*
@@ -93,9 +91,9 @@ int install_page(struct mm_struct *mm, s
if (!page->mapping || page->index >= size)
goto err_unlock;

- zap_pte(mm, vma, addr, pte);
+ if (pte_none(*pte) || !zap_pte(mm, vma, addr, pte))
+ inc_mm_counter(mm, file_rss);

- inc_mm_counter(mm, file_rss);
flush_icache_page(vma, page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
page_add_file_rmap(page);
@@ -142,7 +140,8 @@ int install_file_pte(struct mm_struct *m
if (!pte)
goto err_unlock;

- zap_pte(mm, vma, addr, pte);
+ if (!pte_none(*pte) && zap_pte(mm, vma, addr, pte))
+ dec_mm_counter(mm, file_rss);

set_pte_at(mm, addr, pte, pgoff_to_pte(pgoff));
pte_val = *pte;

2005-10-13 00:51:37

by Hugh Dickins

Subject: [PATCH 06/21] mm: update_hiwaters just in time

update_mem_hiwater has attracted various criticisms, in particular from
those concerned with mm scalability. Originally it was called whenever
rss or total_vm got raised. Then many of those callsites were replaced
by a timer tick call from account_system_time. Now Frank van Maarseveen
reports that to be inadequate. How about this? Works for Frank.

Replace update_mem_hiwater, a poor combination of two unrelated ops, by
macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss
(usually by 1): those are hot paths. Do the opposite: update only when
about to lower rss (usually by many), or just before final accounting in
do_exit. Handle mm->hiwater_vm in the same way, though it's much less
of an issue. Demand that whoever collects these hiwater statistics do
the work of taking the maximum with rss or total_vm.

And there has been no collector of these hiwater statistics in the tree.
The new convention needs an example, so match Frank's usage by adding a
VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line
above VmRSS (High-Water-Mark or High-Water-Memory).
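
For a collector, the convention amounts to taking the maximum at read
time, roughly as the fs/proc/task_mmu.c hunk below does:

        hiwater_vm = total_vm = mm->total_vm;
        if (hiwater_vm < mm->hiwater_vm)
                hiwater_vm = mm->hiwater_vm;
        hiwater_rss = total_rss = get_mm_rss(mm);
        if (hiwater_rss < mm->hiwater_rss)
                hiwater_rss = mm->hiwater_rss;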

There was a particular anomaly during mremap move, in that hiwater_vm
might be captured too high. Such an anomaly can still occur fleetingly,
but it's quickly corrected now, whereas before it would stick.

What locking? None: if the app is racy then these statistics will be
racy; it's not worth any overhead to make them exact. But whenever it
suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss
under page_table_lock (for now) or with preemption disabled (later on):
without going to any trouble, minimize the time between reading current
values and updating, to minimize those occasions when a racing thread
bumps a count up and back down in between.

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/compat.c | 1 -
fs/exec.c | 1 -
fs/proc/task_mmu.c | 23 +++++++++++++++++++++--
include/linux/mm.h | 3 ---
include/linux/sched.h | 10 ++++++++++
kernel/exit.c | 5 ++++-
kernel/sched.c | 2 --
mm/fremap.c | 4 +++-
mm/hugetlb.c | 3 +++
mm/memory.c | 17 +----------------
mm/mmap.c | 4 ++++
mm/mremap.c | 12 ++++++++++--
mm/nommu.c | 15 ++-------------
mm/rmap.c | 6 ++++++
14 files changed, 64 insertions(+), 42 deletions(-)

--- mm05/fs/compat.c 2005-09-21 12:16:40.000000000 +0100
+++ mm06/fs/compat.c 2005-10-11 23:54:32.000000000 +0100
@@ -1490,7 +1490,6 @@ int compat_do_execve(char * filename,
/* execve success */
security_bprm_free(bprm);
acct_update_integrals(current);
- update_mem_hiwater(current);
kfree(bprm);
return retval;
}
--- mm05/fs/exec.c 2005-09-30 11:59:08.000000000 +0100
+++ mm06/fs/exec.c 2005-10-11 23:54:33.000000000 +0100
@@ -1207,7 +1207,6 @@ int do_execve(char * filename,
/* execve success */
security_bprm_free(bprm);
acct_update_integrals(current);
- update_mem_hiwater(current);
kfree(bprm);
return retval;
}
--- mm05/fs/proc/task_mmu.c 2005-09-30 11:59:09.000000000 +0100
+++ mm06/fs/proc/task_mmu.c 2005-10-11 23:54:33.000000000 +0100
@@ -14,22 +14,41 @@
char *task_mem(struct mm_struct *mm, char *buffer)
{
unsigned long data, text, lib;
+ unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;

+ /*
+ * Note: to minimize their overhead, mm maintains hiwater_vm and
+ * hiwater_rss only when about to *lower* total_vm or rss. Any
+ * collector of these hiwater stats must therefore get total_vm
+ * and rss too, which will usually be the higher. Barriers? not
+ * worth the effort, such snapshots can always be inconsistent.
+ */
+ hiwater_vm = total_vm = mm->total_vm;
+ if (hiwater_vm < mm->hiwater_vm)
+ hiwater_vm = mm->hiwater_vm;
+ hiwater_rss = total_rss = get_mm_rss(mm);
+ if (hiwater_rss < mm->hiwater_rss)
+ hiwater_rss = mm->hiwater_rss;
+
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
buffer += sprintf(buffer,
+ "VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
"VmLck:\t%8lu kB\n"
+ "VmHWM:\t%8lu kB\n"
"VmRSS:\t%8lu kB\n"
"VmData:\t%8lu kB\n"
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
"VmLib:\t%8lu kB\n"
"VmPTE:\t%8lu kB\n",
- (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
+ hiwater_vm << (PAGE_SHIFT-10),
+ (total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- get_mm_rss(mm) << (PAGE_SHIFT-10),
+ hiwater_rss << (PAGE_SHIFT-10),
+ total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
--- mm05/include/linux/mm.h 2005-10-11 12:16:50.000000000 +0100
+++ mm06/include/linux/mm.h 2005-10-11 23:54:33.000000000 +0100
@@ -946,9 +946,6 @@ static inline void vm_stat_account(struc
}
#endif /* CONFIG_PROC_FS */

-/* update per process rss and vm hiwater data */
-extern void update_mem_hiwater(struct task_struct *tsk);
-
#ifndef CONFIG_DEBUG_PAGEALLOC
static inline void
kernel_map_pages(struct page *page, int numpages, int enable)
--- mm05/include/linux/sched.h 2005-09-30 11:59:11.000000000 +0100
+++ mm06/include/linux/sched.h 2005-10-11 23:54:33.000000000 +0100
@@ -245,6 +245,16 @@ extern void arch_unmap_area_topdown(stru
#define dec_mm_counter(mm, member) (mm)->_##member--
#define get_mm_rss(mm) ((mm)->_file_rss + (mm)->_anon_rss)

+#define update_hiwater_rss(mm) do { \
+ unsigned long _rss = get_mm_rss(mm); \
+ if ((mm)->hiwater_rss < _rss) \
+ (mm)->hiwater_rss = _rss; \
+} while (0)
+#define update_hiwater_vm(mm) do { \
+ if ((mm)->hiwater_vm < (mm)->total_vm) \
+ (mm)->hiwater_vm = (mm)->total_vm; \
+} while (0)
+
typedef unsigned long mm_counter_t;

struct mm_struct {
--- mm05/kernel/exit.c 2005-09-30 11:59:12.000000000 +0100
+++ mm06/kernel/exit.c 2005-10-11 23:54:33.000000000 +0100
@@ -839,7 +839,10 @@ fastcall NORET_TYPE void do_exit(long co
preempt_count());

acct_update_integrals(tsk);
- update_mem_hiwater(tsk);
+ if (tsk->mm) {
+ update_hiwater_rss(tsk->mm);
+ update_hiwater_vm(tsk->mm);
+ }
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
del_timer_sync(&tsk->signal->real_timer);
--- mm05/kernel/sched.c 2005-09-30 11:59:12.000000000 +0100
+++ mm06/kernel/sched.c 2005-10-11 23:54:33.000000000 +0100
@@ -2603,8 +2603,6 @@ void account_system_time(struct task_str
cpustat->idle = cputime64_add(cpustat->idle, tmp);
/* Account for system time used */
acct_update_integrals(p);
- /* Update rss highwater mark */
- update_mem_hiwater(p);
}

/*
--- mm05/mm/fremap.c 2005-10-11 23:54:15.000000000 +0100
+++ mm06/mm/fremap.c 2005-10-11 23:54:33.000000000 +0100
@@ -140,8 +140,10 @@ int install_file_pte(struct mm_struct *m
if (!pte)
goto err_unlock;

- if (!pte_none(*pte) && zap_pte(mm, vma, addr, pte))
+ if (!pte_none(*pte) && zap_pte(mm, vma, addr, pte)) {
+ update_hiwater_rss(mm);
dec_mm_counter(mm, file_rss);
+ }

set_pte_at(mm, addr, pte, pgoff_to_pte(pgoff));
pte_val = *pte;
--- mm05/mm/hugetlb.c 2005-09-30 11:59:12.000000000 +0100
+++ mm06/mm/hugetlb.c 2005-10-11 23:54:33.000000000 +0100
@@ -309,6 +309,9 @@ void unmap_hugepage_range(struct vm_area
BUG_ON(start & ~HPAGE_MASK);
BUG_ON(end & ~HPAGE_MASK);

+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
for (address = start; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
if (! ptep)
--- mm05/mm/memory.c 2005-10-11 23:53:35.000000000 +0100
+++ mm06/mm/memory.c 2005-10-11 23:54:33.000000000 +0100
@@ -820,6 +820,7 @@ unsigned long zap_page_range(struct vm_a
lru_add_drain();
spin_lock(&mm->page_table_lock);
tlb = tlb_gather_mmu(mm, 0);
+ update_hiwater_rss(mm);
end = unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, details);
tlb_finish_mmu(tlb, address, end);
spin_unlock(&mm->page_table_lock);
@@ -2225,22 +2226,6 @@ unsigned long vmalloc_to_pfn(void * vmal

EXPORT_SYMBOL(vmalloc_to_pfn);

-/*
- * update_mem_hiwater
- * - update per process rss and vm high water data
- */
-void update_mem_hiwater(struct task_struct *tsk)
-{
- if (tsk->mm) {
- unsigned long rss = get_mm_rss(tsk->mm);
-
- if (tsk->mm->hiwater_rss < rss)
- tsk->mm->hiwater_rss = rss;
- if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
- tsk->mm->hiwater_vm = tsk->mm->total_vm;
- }
-}
-
#if !defined(__HAVE_ARCH_GATE_AREA)

#if defined(AT_SYSINFO_EHDR)
--- mm05/mm/mmap.c 2005-10-11 12:16:50.000000000 +0100
+++ mm06/mm/mmap.c 2005-10-11 23:54:33.000000000 +0100
@@ -1636,6 +1636,8 @@ find_extend_vma(struct mm_struct * mm, u
*/
static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
{
+ /* Update high watermark before we lower total_vm */
+ update_hiwater_vm(mm);
do {
long nrpages = vma_pages(vma);

@@ -1664,6 +1666,7 @@ static void unmap_region(struct mm_struc
lru_add_drain();
spin_lock(&mm->page_table_lock);
tlb = tlb_gather_mmu(mm, 0);
+ update_hiwater_rss(mm);
unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
@@ -1949,6 +1952,7 @@ void exit_mmap(struct mm_struct *mm)

flush_cache_mm(mm);
tlb = tlb_gather_mmu(mm, 1);
+ /* Don't update_hiwater_rss(mm) here, do_exit already did */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, mm, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
--- mm05/mm/mremap.c 2005-10-11 23:53:55.000000000 +0100
+++ mm06/mm/mremap.c 2005-10-11 23:54:33.000000000 +0100
@@ -167,6 +167,7 @@ static unsigned long move_vma(struct vm_
unsigned long new_pgoff;
unsigned long moved_len;
unsigned long excess = 0;
+ unsigned long hiwater_vm;
int split = 0;

/*
@@ -205,9 +206,15 @@ static unsigned long move_vma(struct vm_
}

/*
- * if we failed to move page tables we still do total_vm increment
- * since do_munmap() will decrement it by old_len == new_len
+ * If we failed to move page tables we still do total_vm increment
+ * since do_munmap() will decrement it by old_len == new_len.
+ *
+ * Since total_vm is about to be raised artificially high for a
+ * moment, we need to restore high watermark afterwards: if stats
+ * are taken meanwhile, total_vm and hiwater_vm appear too high.
+ * If this were a serious issue, we'd add a flag to do_munmap().
*/
+ hiwater_vm = mm->hiwater_vm;
mm->total_vm += new_len >> PAGE_SHIFT;
vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);

@@ -216,6 +223,7 @@ static unsigned long move_vma(struct vm_
vm_unacct_memory(excess >> PAGE_SHIFT);
excess = 0;
}
+ mm->hiwater_vm = hiwater_vm;

/* Restore VM_ACCOUNT if one or two pieces of vma left */
if (excess) {
--- mm05/mm/nommu.c 2005-09-30 11:59:12.000000000 +0100
+++ mm06/mm/nommu.c 2005-10-11 23:54:33.000000000 +0100
@@ -928,6 +928,8 @@ int do_munmap(struct mm_struct *mm, unsi
realalloc -= kobjsize(vml);
askedalloc -= sizeof(*vml);
kfree(vml);
+
+ update_hiwater_vm(mm);
mm->total_vm -= len >> PAGE_SHIFT;

#ifdef DEBUG
@@ -1075,19 +1077,6 @@ void arch_unmap_area(struct mm_struct *m
{
}

-void update_mem_hiwater(struct task_struct *tsk)
-{
- unsigned long rss;
-
- if (likely(tsk->mm)) {
- rss = get_mm_rss(tsk->mm);
- if (tsk->mm->hiwater_rss < rss)
- tsk->mm->hiwater_rss = rss;
- if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
- tsk->mm->hiwater_vm = tsk->mm->total_vm;
- }
-}
-
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen,
int even_cows)
--- mm05/mm/rmap.c 2005-10-11 12:16:50.000000000 +0100
+++ mm06/mm/rmap.c 2005-10-11 23:54:33.000000000 +0100
@@ -538,6 +538,9 @@ static int try_to_unmap_one(struct page
if (pte_dirty(pteval))
set_page_dirty(page);

+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
/*
@@ -628,6 +631,9 @@ static void try_to_unmap_cluster(unsigne
if (!pmd_present(*pmd))
goto out_unlock;

+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
for (original_pte = pte = pte_offset_map(pmd, address);
address < end; pte++, address += PAGE_SIZE) {

2005-10-13 00:53:05

by Hugh Dickins

Subject: [PATCH 07/21] mm: mm_struct hiwaters moved

Slight and timid rearrangement of mm_struct: hiwater_rss and hiwater_vm
were tacked on the end, but it seems better to keep them near _file_rss,
_anon_rss and total_vm, in the same cacheline on those arches verified.

There are likely to be more profitable rearrangements, but they are less
obvious (is it good or bad that saved_auxv[AT_VECTOR_SIZE] isolates
cpu_vm_mask and context from many others?), and would need serious
instrumentation.

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/sched.h | 19 +++++++++----------
1 files changed, 9 insertions(+), 10 deletions(-)

--- mm06/include/linux/sched.h 2005-10-11 23:54:33.000000000 +0100
+++ mm07/include/linux/sched.h 2005-10-11 23:55:22.000000000 +0100
@@ -280,16 +280,19 @@ struct mm_struct {
* by mmlist_lock
*/

- unsigned long start_code, end_code, start_data, end_data;
- unsigned long start_brk, brk, start_stack;
- unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long total_vm, locked_vm, shared_vm;
- unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;
-
/* Special counters protected by the page_table_lock */
mm_counter_t _file_rss;
mm_counter_t _anon_rss;

+ unsigned long hiwater_rss; /* High-watermark of RSS usage */
+ unsigned long hiwater_vm; /* High-water virtual memory usage */
+
+ unsigned long total_vm, locked_vm, shared_vm, exec_vm;
+ unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
+ unsigned long start_code, end_code, start_data, end_data;
+ unsigned long start_brk, brk, start_stack;
+ unsigned long arg_start, arg_end, env_start, env_end;
+
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

unsigned dumpable:2;
@@ -309,11 +312,7 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
-
struct kioctx default_kioctx;
-
- unsigned long hiwater_rss; /* High-water RSS usage */
- unsigned long hiwater_vm; /* High-water virtual memory usage */
};

struct sighand_struct {

2005-10-13 00:54:24

by Hugh Dickins

Subject: [PATCH 08/21] mm: ia64 use expand_upwards

ia64 has an expand_backing_store function for growing its Register Backing
Store vma upwards. But more complete code for this purpose is found in
the CONFIG_STACK_GROWSUP part of mm/mmap.c. Uglify its #ifdefs further
to provide expand_upwards for ia64 as well as expand_stack for parisc.

The Register Backing Store vma should be marked VM_ACCOUNT. Implement
the intention of growing it only a page at a time, instead of passing an
address outside of the vma to handle_mm_fault, with unknown consequences.

Signed-off-by: Hugh Dickins <[email protected]>
---

arch/ia64/mm/fault.c | 34 +++++++---------------------------
arch/ia64/mm/init.c | 2 +-
include/linux/mm.h | 3 ++-
mm/mmap.c | 17 ++++++++++++++---
4 files changed, 24 insertions(+), 32 deletions(-)

--- mm07/arch/ia64/mm/fault.c 2005-09-30 11:58:53.000000000 +0100
+++ mm08/arch/ia64/mm/fault.c 2005-10-11 23:55:38.000000000 +0100
@@ -20,32 +20,6 @@
extern void die (char *, struct pt_regs *, long);

/*
- * This routine is analogous to expand_stack() but instead grows the
- * register backing store (which grows towards higher addresses).
- * Since the register backing store is access sequentially, we
- * disallow growing the RBS by more than a page at a time. Note that
- * the VM_GROWSUP flag can be set on any VM area but that's fine
- * because the total process size is still limited by RLIMIT_STACK and
- * RLIMIT_AS.
- */
-static inline long
-expand_backing_store (struct vm_area_struct *vma, unsigned long address)
-{
- unsigned long grow;
-
- grow = PAGE_SIZE >> PAGE_SHIFT;
- if (address - vma->vm_start > current->signal->rlim[RLIMIT_STACK].rlim_cur
- || (((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > current->signal->rlim[RLIMIT_AS].rlim_cur))
- return -ENOMEM;
- vma->vm_end += PAGE_SIZE;
- vma->vm_mm->total_vm += grow;
- if (vma->vm_flags & VM_LOCKED)
- vma->vm_mm->locked_vm += grow;
- vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file, grow);
- return 0;
-}
-
-/*
* Return TRUE if ADDRESS points at a page in the kernel's mapped segment
* (inside region 5, on ia64) and that page is present.
*/
@@ -185,7 +159,13 @@ ia64_do_page_fault (unsigned long addres
if (REGION_NUMBER(address) != REGION_NUMBER(vma->vm_start)
|| REGION_OFFSET(address) >= RGN_MAP_LIMIT)
goto bad_area;
- if (expand_backing_store(vma, address))
+ /*
+ * Since the register backing store is accessed sequentially,
+ * we disallow growing it by more than a page at a time.
+ */
+ if (address > vma->vm_end + PAGE_SIZE - sizeof(long))
+ goto bad_area;
+ if (expand_upwards(vma, address))
goto bad_area;
}
goto good_area;
--- mm07/arch/ia64/mm/init.c 2005-09-21 12:16:14.000000000 +0100
+++ mm08/arch/ia64/mm/init.c 2005-10-11 23:55:38.000000000 +0100
@@ -158,7 +158,7 @@ ia64_init_addr_space (void)
vma->vm_start = current->thread.rbs_bot & PAGE_MASK;
vma->vm_end = vma->vm_start + PAGE_SIZE;
vma->vm_page_prot = protection_map[VM_DATA_DEFAULT_FLAGS & 0x7];
- vma->vm_flags = VM_DATA_DEFAULT_FLAGS | VM_GROWSUP;
+ vma->vm_flags = VM_DATA_DEFAULT_FLAGS|VM_GROWSUP|VM_ACCOUNT;
down_write(&current->mm->mmap_sem);
if (insert_vm_struct(current->mm, vma)) {
up_write(&current->mm->mmap_sem);
--- mm07/include/linux/mm.h 2005-10-11 23:54:33.000000000 +0100
+++ mm08/include/linux/mm.h 2005-10-11 23:55:38.000000000 +0100
@@ -904,7 +904,8 @@ void handle_ra_miss(struct address_space
unsigned long max_sane_readahead(unsigned long nr);

/* Do stack extension */
-extern int expand_stack(struct vm_area_struct * vma, unsigned long address);
+extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
+extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);

/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
--- mm07/mm/mmap.c 2005-10-11 23:54:33.000000000 +0100
+++ mm08/mm/mmap.c 2005-10-11 23:55:38.000000000 +0100
@@ -1504,11 +1504,15 @@ static int acct_stack_growth(struct vm_a
return 0;
}

-#ifdef CONFIG_STACK_GROWSUP
+#if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
/*
- * vma is the first one with address > vma->vm_end. Have to extend vma.
+ * PA-RISC uses this for its stack; IA64 for its Register Backing Store.
+ * vma is the last one with address > vma->vm_end. Have to extend vma.
*/
-int expand_stack(struct vm_area_struct * vma, unsigned long address)
+#ifdef CONFIG_STACK_GROWSUP
+static inline
+#endif
+int expand_upwards(struct vm_area_struct *vma, unsigned long address)
{
int error;

@@ -1546,6 +1550,13 @@ int expand_stack(struct vm_area_struct *
anon_vma_unlock(vma);
return error;
}
+#endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
+
+#ifdef CONFIG_STACK_GROWSUP
+int expand_stack(struct vm_area_struct *vma, unsigned long address)
+{
+ return expand_upwards(vma, address);
+}

struct vm_area_struct *
find_extend_vma(struct mm_struct *mm, unsigned long addr)

2005-10-13 00:55:53

by Hugh Dickins

Subject: [PATCH 09/21] mm: init_mm without ptlock

First step in pushing down the page_table_lock. init_mm.page_table_lock
has been used throughout the architectures (usually for ioremap): not to
serialize kernel address space allocation (that's usually vmlist_lock),
but because pud_alloc, pmd_alloc and pte_alloc_kernel expect the caller
to hold it.

Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
architectures; instead rely on pud_alloc, pmd_alloc and pte_alloc_kernel
to take and drop it when allocating a new one, and to check whether a
racing task already did. Similarly, no page_table_lock in vmalloc's
map_vm_area.
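
Schematically (error handling omitted; this fragment just contrasts the
two conventions, it is not taken from any one arch file), an ioremap-style
caller changes from

        spin_lock(&init_mm.page_table_lock);
        pmd = pmd_alloc(&init_mm, dir, address);
        pte = pte_alloc_kernel(&init_mm, pmd, address);
        spin_unlock(&init_mm.page_table_lock);

to simply

        pmd = pmd_alloc(&init_mm, dir, address);
        pte = pte_alloc_kernel(pmd, address);

with pte_alloc_kernel itself taking init_mm.page_table_lock only when it
has to populate an empty pmd.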

Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also
handle user mms, which are converted only by a later patch, for now they
have to lock differently according to whether or not it's init_mm.

If sources get muddled, there's a danger that an arch source taking
init_mm.page_table_lock will be mixed with common source also taking it
(or with neither taking it). So break the rules and make another change, which
should break the build for such a mismatch: remove the redundant mm arg
from pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).

Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map;
ia64 used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad
args to pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code;
ppc64 map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and
ppc64 im_free took page_table_lock for no good reason.

Signed-off-by: Hugh Dickins <[email protected]>
---

arch/alpha/mm/remap.c | 6 ----
arch/arm/mm/consistent.c | 6 ----
arch/arm/mm/ioremap.c | 4 --
arch/arm26/mm/memc.c | 3 +-
arch/cris/mm/ioremap.c | 4 --
arch/frv/mm/dma-alloc.c | 5 ---
arch/i386/mm/ioremap.c | 4 --
arch/ia64/mm/init.c | 11 ++-----
arch/m32r/mm/ioremap.c | 4 --
arch/m68k/mm/kmap.c | 2 -
arch/m68k/sun3x/dvma.c | 2 -
arch/mips/mm/ioremap.c | 4 --
arch/parisc/kernel/pci-dma.c | 2 -
arch/parisc/mm/ioremap.c | 6 +---
arch/ppc/kernel/dma-mapping.c | 6 ----
arch/ppc/mm/4xx_mmu.c | 4 --
arch/ppc/mm/pgtable.c | 4 --
arch/ppc64/mm/imalloc.c | 5 ---
arch/ppc64/mm/init.c | 4 --
arch/s390/mm/ioremap.c | 4 --
arch/sh/mm/ioremap.c | 4 --
arch/sh64/mm/ioremap.c | 4 --
arch/x86_64/mm/ioremap.c | 4 --
include/linux/mm.h | 2 -
mm/memory.c | 60 ++++++++++++++++++------------------------
mm/vmalloc.c | 4 --
26 files changed, 54 insertions(+), 114 deletions(-)

--- mm08/arch/alpha/mm/remap.c 2003-01-17 01:23:07.000000000 +0000
+++ mm09/arch/alpha/mm/remap.c 2005-10-11 23:55:53.000000000 +0100
@@ -2,7 +2,6 @@
#include <asm/pgalloc.h>
#include <asm/cacheflush.h>

-/* called with the page_table_lock held */
static inline void
remap_area_pte(pte_t * pte, unsigned long address, unsigned long size,
unsigned long phys_addr, unsigned long flags)
@@ -31,7 +30,6 @@ remap_area_pte(pte_t * pte, unsigned lon
} while (address && (address < end));
}

-/* called with the page_table_lock held */
static inline int
remap_area_pmd(pmd_t * pmd, unsigned long address, unsigned long size,
unsigned long phys_addr, unsigned long flags)
@@ -46,7 +44,7 @@ remap_area_pmd(pmd_t * pmd, unsigned lon
if (address >= end)
BUG();
do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address,
@@ -70,7 +68,6 @@ __alpha_remap_area_pages(unsigned long a
flush_cache_all();
if (address >= end)
BUG();
- spin_lock(&init_mm.page_table_lock);
do {
pmd_t *pmd;
pmd = pmd_alloc(&init_mm, dir, address);
@@ -84,7 +81,6 @@ __alpha_remap_area_pages(unsigned long a
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
- spin_unlock(&init_mm.page_table_lock);
return error;
}

--- mm08/arch/arm/mm/consistent.c 2005-06-17 20:48:29.000000000 +0100
+++ mm09/arch/arm/mm/consistent.c 2005-10-11 23:55:53.000000000 +0100
@@ -397,8 +397,6 @@ static int __init consistent_init(void)
pte_t *pte;
int ret = 0;

- spin_lock(&init_mm.page_table_lock);
-
do {
pgd = pgd_offset(&init_mm, CONSISTENT_BASE);
pmd = pmd_alloc(&init_mm, pgd, CONSISTENT_BASE);
@@ -409,7 +407,7 @@ static int __init consistent_init(void)
}
WARN_ON(!pmd_none(*pmd));

- pte = pte_alloc_kernel(&init_mm, pmd, CONSISTENT_BASE);
+ pte = pte_alloc_kernel(pmd, CONSISTENT_BASE);
if (!pte) {
printk(KERN_ERR "%s: no pte tables\n", __func__);
ret = -ENOMEM;
@@ -419,8 +417,6 @@ static int __init consistent_init(void)
consistent_pte = pte;
} while (0);

- spin_unlock(&init_mm.page_table_lock);
-
return ret;
}

--- mm08/arch/arm/mm/ioremap.c 2005-08-29 00:41:01.000000000 +0100
+++ mm09/arch/arm/mm/ioremap.c 2005-10-11 23:55:53.000000000 +0100
@@ -74,7 +74,7 @@ remap_area_pmd(pmd_t * pmd, unsigned lon

pgprot = __pgprot(L_PTE_PRESENT | L_PTE_YOUNG | L_PTE_DIRTY | L_PTE_WRITE | flags);
do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address, address + phys_addr, pgprot);
@@ -96,7 +96,6 @@ remap_area_pages(unsigned long start, un
phys_addr -= address;
dir = pgd_offset(&init_mm, address);
BUG_ON(address >= end);
- spin_lock(&init_mm.page_table_lock);
do {
pmd_t *pmd = pmd_alloc(&init_mm, dir, address);
if (!pmd) {
@@ -113,7 +112,6 @@ remap_area_pages(unsigned long start, un
dir++;
} while (address && (address < end));

- spin_unlock(&init_mm.page_table_lock);
flush_cache_vmap(start, end);
return err;
}
--- mm08/arch/arm26/mm/memc.c 2005-03-02 07:39:17.000000000 +0000
+++ mm09/arch/arm26/mm/memc.c 2005-10-11 23:55:53.000000000 +0100
@@ -92,7 +92,7 @@ pgd_t *get_pgd_slow(struct mm_struct *mm
if (!new_pmd)
goto no_pmd;

- new_pte = pte_alloc_kernel(mm, new_pmd, 0);
+ new_pte = pte_alloc_map(mm, new_pmd, 0);
if (!new_pte)
goto no_pte;

@@ -101,6 +101,7 @@ pgd_t *get_pgd_slow(struct mm_struct *mm
init_pte = pte_offset(init_pmd, 0);

set_pte(new_pte, *init_pte);
+ pte_unmap(new_pte);

/*
* the page table entries are zeroed
--- mm08/arch/cris/mm/ioremap.c 2005-08-29 00:41:01.000000000 +0100
+++ mm09/arch/cris/mm/ioremap.c 2005-10-11 23:55:53.000000000 +0100
@@ -52,7 +52,7 @@ static inline int remap_area_pmd(pmd_t *
if (address >= end)
BUG();
do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address, address + phys_addr, prot);
@@ -74,7 +74,6 @@ static int remap_area_pages(unsigned lon
flush_cache_all();
if (address >= end)
BUG();
- spin_lock(&init_mm.page_table_lock);
do {
pud_t *pud;
pmd_t *pmd;
@@ -94,7 +93,6 @@ static int remap_area_pages(unsigned lon
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
- spin_unlock(&init_mm.page_table_lock);
flush_tlb_all();
return error;
}
--- mm08/arch/frv/mm/dma-alloc.c 2005-03-02 07:39:19.000000000 +0000
+++ mm09/arch/frv/mm/dma-alloc.c 2005-10-11 23:55:53.000000000 +0100
@@ -55,21 +55,18 @@ static int map_page(unsigned long va, un
pte_t *pte;
int err = -ENOMEM;

- spin_lock(&init_mm.page_table_lock);
-
/* Use upper 10 bits of VA to index the first level map */
pge = pgd_offset_k(va);
pue = pud_offset(pge, va);
pme = pmd_offset(pue, va);

/* Use middle 10 bits of VA to index the second-level map */
- pte = pte_alloc_kernel(&init_mm, pme, va);
+ pte = pte_alloc_kernel(pme, va);
if (pte != 0) {
err = 0;
set_pte(pte, mk_pte_phys(pa & PAGE_MASK, prot));
}

- spin_unlock(&init_mm.page_table_lock);
return err;
}

--- mm08/arch/i386/mm/ioremap.c 2005-09-30 11:58:53.000000000 +0100
+++ mm09/arch/i386/mm/ioremap.c 2005-10-11 23:55:53.000000000 +0100
@@ -28,7 +28,7 @@ static int ioremap_pte_range(pmd_t *pmd,
unsigned long pfn;

pfn = phys_addr >> PAGE_SHIFT;
- pte = pte_alloc_kernel(&init_mm, pmd, addr);
+ pte = pte_alloc_kernel(pmd, addr);
if (!pte)
return -ENOMEM;
do {
@@ -87,14 +87,12 @@ static int ioremap_page_range(unsigned l
flush_cache_all();
phys_addr -= addr;
pgd = pgd_offset_k(addr);
- spin_lock(&init_mm.page_table_lock);
do {
next = pgd_addr_end(addr, end);
err = ioremap_pud_range(pgd, addr, next, phys_addr+addr, flags);
if (err)
break;
} while (pgd++, addr = next, addr != end);
- spin_unlock(&init_mm.page_table_lock);
flush_tlb_all();
return err;
}
--- mm08/arch/ia64/mm/init.c 2005-10-11 23:55:38.000000000 +0100
+++ mm09/arch/ia64/mm/init.c 2005-10-11 23:55:53.000000000 +0100
@@ -275,26 +275,21 @@ put_kernel_page (struct page *page, unsi

pgd = pgd_offset_k(address); /* note: this is NOT pgd_offset()! */

- spin_lock(&init_mm.page_table_lock);
{
pud = pud_alloc(&init_mm, pgd, address);
if (!pud)
goto out;
-
pmd = pmd_alloc(&init_mm, pud, address);
if (!pmd)
goto out;
- pte = pte_alloc_map(&init_mm, pmd, address);
+ pte = pte_alloc_kernel(pmd, address);
if (!pte)
goto out;
- if (!pte_none(*pte)) {
- pte_unmap(pte);
+ if (!pte_none(*pte))
goto out;
- }
set_pte(pte, mk_pte(page, pgprot));
- pte_unmap(pte);
}
- out: spin_unlock(&init_mm.page_table_lock);
+ out:
/* no need for flush_tlb */
return page;
}
--- mm08/arch/m32r/mm/ioremap.c 2004-10-18 22:57:23.000000000 +0100
+++ mm09/arch/m32r/mm/ioremap.c 2005-10-11 23:55:53.000000000 +0100
@@ -67,7 +67,7 @@ remap_area_pmd(pmd_t * pmd, unsigned lon
if (address >= end)
BUG();
do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address, address + phys_addr, flags);
@@ -90,7 +90,6 @@ remap_area_pages(unsigned long address,
flush_cache_all();
if (address >= end)
BUG();
- spin_lock(&init_mm.page_table_lock);
do {
pmd_t *pmd;
pmd = pmd_alloc(&init_mm, dir, address);
@@ -104,7 +103,6 @@ remap_area_pages(unsigned long address,
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
- spin_unlock(&init_mm.page_table_lock);
flush_tlb_all();
return error;
}
--- mm08/arch/m68k/mm/kmap.c 2004-08-14 06:39:28.000000000 +0100
+++ mm09/arch/m68k/mm/kmap.c 2005-10-11 23:55:53.000000000 +0100
@@ -201,7 +201,7 @@ void *__ioremap(unsigned long physaddr,
virtaddr += PTRTREESIZE;
size -= PTRTREESIZE;
} else {
- pte_dir = pte_alloc_kernel(&init_mm, pmd_dir, virtaddr);
+ pte_dir = pte_alloc_kernel(pmd_dir, virtaddr);
if (!pte_dir) {
printk("ioremap: no mem for pte_dir\n");
return NULL;
--- mm08/arch/m68k/sun3x/dvma.c 2004-06-16 06:20:37.000000000 +0100
+++ mm09/arch/m68k/sun3x/dvma.c 2005-10-11 23:55:53.000000000 +0100
@@ -116,7 +116,7 @@ inline int dvma_map_cpu(unsigned long ka
pte_t *pte;
unsigned long end3;

- if((pte = pte_alloc_kernel(&init_mm, pmd, vaddr)) == NULL) {
+ if((pte = pte_alloc_kernel(pmd, vaddr)) == NULL) {
ret = -ENOMEM;
goto out;
}
--- mm08/arch/mips/mm/ioremap.c 2004-12-24 21:36:49.000000000 +0000
+++ mm09/arch/mips/mm/ioremap.c 2005-10-11 23:55:53.000000000 +0100
@@ -55,7 +55,7 @@ static inline int remap_area_pmd(pmd_t *
if (address >= end)
BUG();
do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address, address + phys_addr, flags);
@@ -77,7 +77,6 @@ static int remap_area_pages(unsigned lon
flush_cache_all();
if (address >= end)
BUG();
- spin_lock(&init_mm.page_table_lock);
do {
pmd_t *pmd;
pmd = pmd_alloc(&init_mm, dir, address);
@@ -91,7 +90,6 @@ static int remap_area_pages(unsigned lon
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
- spin_unlock(&init_mm.page_table_lock);
flush_tlb_all();
return error;
}
--- mm08/arch/parisc/kernel/pci-dma.c 2005-06-17 20:48:29.000000000 +0100
+++ mm09/arch/parisc/kernel/pci-dma.c 2005-10-11 23:55:53.000000000 +0100
@@ -114,7 +114,7 @@ static inline int map_pmd_uncached(pmd_t
if (end > PGDIR_SIZE)
end = PGDIR_SIZE;
do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, vaddr);
+ pte_t * pte = pte_alloc_kernel(pmd, vaddr);
if (!pte)
return -ENOMEM;
if (map_pte_uncached(pte, orig_vaddr, end - vaddr, paddr_ptr))
--- mm08/arch/parisc/mm/ioremap.c 2005-03-02 07:38:55.000000000 +0000
+++ mm09/arch/parisc/mm/ioremap.c 2005-10-11 23:55:53.000000000 +0100
@@ -52,7 +52,7 @@ static inline int remap_area_pmd(pmd_t *
if (address >= end)
BUG();
do {
- pte_t * pte = pte_alloc_kernel(NULL, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address, address + phys_addr, flags);
@@ -75,10 +75,9 @@ static int remap_area_pages(unsigned lon
flush_cache_all();
if (address >= end)
BUG();
- spin_lock(&init_mm.page_table_lock);
do {
pmd_t *pmd;
- pmd = pmd_alloc(dir, address);
+ pmd = pmd_alloc(&init_mm, dir, address);
error = -ENOMEM;
if (!pmd)
break;
@@ -89,7 +88,6 @@ static int remap_area_pages(unsigned lon
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
- spin_unlock(&init_mm.page_table_lock);
flush_tlb_all();
return error;
}
--- mm08/arch/ppc/kernel/dma-mapping.c 2005-09-21 12:16:17.000000000 +0100
+++ mm09/arch/ppc/kernel/dma-mapping.c 2005-10-11 23:55:53.000000000 +0100
@@ -335,8 +335,6 @@ static int __init dma_alloc_init(void)
pte_t *pte;
int ret = 0;

- spin_lock(&init_mm.page_table_lock);
-
do {
pgd = pgd_offset(&init_mm, CONSISTENT_BASE);
pmd = pmd_alloc(&init_mm, pgd, CONSISTENT_BASE);
@@ -347,7 +345,7 @@ static int __init dma_alloc_init(void)
}
WARN_ON(!pmd_none(*pmd));

- pte = pte_alloc_kernel(&init_mm, pmd, CONSISTENT_BASE);
+ pte = pte_alloc_kernel(pmd, CONSISTENT_BASE);
if (!pte) {
printk(KERN_ERR "%s: no pte tables\n", __func__);
ret = -ENOMEM;
@@ -357,8 +355,6 @@ static int __init dma_alloc_init(void)
consistent_pte = pte;
} while (0);

- spin_unlock(&init_mm.page_table_lock);
-
return ret;
}

--- mm08/arch/ppc/mm/4xx_mmu.c 2005-08-29 00:41:01.000000000 +0100
+++ mm09/arch/ppc/mm/4xx_mmu.c 2005-10-11 23:55:54.000000000 +0100
@@ -110,13 +110,11 @@ unsigned long __init mmu_mapin_ram(void)
pmd_t *pmdp;
unsigned long val = p | _PMD_SIZE_16M | _PAGE_HWEXEC | _PAGE_HWWRITE;

- spin_lock(&init_mm.page_table_lock);
pmdp = pmd_offset(pgd_offset_k(v), v);
pmd_val(*pmdp++) = val;
pmd_val(*pmdp++) = val;
pmd_val(*pmdp++) = val;
pmd_val(*pmdp++) = val;
- spin_unlock(&init_mm.page_table_lock);

v += LARGE_PAGE_SIZE_16M;
p += LARGE_PAGE_SIZE_16M;
@@ -127,10 +125,8 @@ unsigned long __init mmu_mapin_ram(void)
pmd_t *pmdp;
unsigned long val = p | _PMD_SIZE_4M | _PAGE_HWEXEC | _PAGE_HWWRITE;

- spin_lock(&init_mm.page_table_lock);
pmdp = pmd_offset(pgd_offset_k(v), v);
pmd_val(*pmdp) = val;
- spin_unlock(&init_mm.page_table_lock);

v += LARGE_PAGE_SIZE_4M;
p += LARGE_PAGE_SIZE_4M;
--- mm08/arch/ppc/mm/pgtable.c 2005-08-29 00:41:01.000000000 +0100
+++ mm09/arch/ppc/mm/pgtable.c 2005-10-11 23:55:54.000000000 +0100
@@ -280,18 +280,16 @@ map_page(unsigned long va, phys_addr_t p
pte_t *pg;
int err = -ENOMEM;

- spin_lock(&init_mm.page_table_lock);
/* Use upper 10 bits of VA to index the first level map */
pd = pmd_offset(pgd_offset_k(va), va);
/* Use middle 10 bits of VA to index the second-level map */
- pg = pte_alloc_kernel(&init_mm, pd, va);
+ pg = pte_alloc_kernel(pd, va);
if (pg != 0) {
err = 0;
set_pte_at(&init_mm, va, pg, pfn_pte(pa >> PAGE_SHIFT, __pgprot(flags)));
if (mem_init_done)
flush_HPTE(0, va, pmd_val(*pd));
}
- spin_unlock(&init_mm.page_table_lock);
return err;
}

--- mm08/arch/ppc64/mm/imalloc.c 2005-09-21 12:16:19.000000000 +0100
+++ mm09/arch/ppc64/mm/imalloc.c 2005-10-11 23:55:54.000000000 +0100
@@ -300,12 +300,7 @@ void im_free(void * addr)
for (p = &imlist ; (tmp = *p) ; p = &tmp->next) {
if (tmp->addr == addr) {
*p = tmp->next;
-
- /* XXX: do we need the lock? */
- spin_lock(&init_mm.page_table_lock);
unmap_vm_area(tmp);
- spin_unlock(&init_mm.page_table_lock);
-
kfree(tmp);
up(&imlist_sem);
return;
--- mm08/arch/ppc64/mm/init.c 2005-09-30 11:58:54.000000000 +0100
+++ mm09/arch/ppc64/mm/init.c 2005-10-11 23:55:54.000000000 +0100
@@ -158,7 +158,6 @@ static int map_io_page(unsigned long ea,
unsigned long vsid;

if (mem_init_done) {
- spin_lock(&init_mm.page_table_lock);
pgdp = pgd_offset_k(ea);
pudp = pud_alloc(&init_mm, pgdp, ea);
if (!pudp)
@@ -166,12 +165,11 @@ static int map_io_page(unsigned long ea,
pmdp = pmd_alloc(&init_mm, pudp, ea);
if (!pmdp)
return -ENOMEM;
- ptep = pte_alloc_kernel(&init_mm, pmdp, ea);
+ ptep = pte_alloc_kernel(pmdp, ea);
if (!ptep)
return -ENOMEM;
set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT,
__pgprot(flags)));
- spin_unlock(&init_mm.page_table_lock);
} else {
unsigned long va, vpn, hash, hpteg;

--- mm08/arch/s390/mm/ioremap.c 2004-06-16 06:20:37.000000000 +0100
+++ mm09/arch/s390/mm/ioremap.c 2005-10-11 23:55:54.000000000 +0100
@@ -58,7 +58,7 @@ static inline int remap_area_pmd(pmd_t *
if (address >= end)
BUG();
do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address, address + phys_addr, flags);
@@ -80,7 +80,6 @@ static int remap_area_pages(unsigned lon
flush_cache_all();
if (address >= end)
BUG();
- spin_lock(&init_mm.page_table_lock);
do {
pmd_t *pmd;
pmd = pmd_alloc(&init_mm, dir, address);
@@ -94,7 +93,6 @@ static int remap_area_pages(unsigned lon
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
- spin_unlock(&init_mm.page_table_lock);
flush_tlb_all();
return 0;
}
--- mm08/arch/sh/mm/ioremap.c 2004-12-24 21:37:00.000000000 +0000
+++ mm09/arch/sh/mm/ioremap.c 2005-10-11 23:55:54.000000000 +0100
@@ -57,7 +57,7 @@ static inline int remap_area_pmd(pmd_t *
if (address >= end)
BUG();
do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address, address + phys_addr, flags);
@@ -79,7 +79,6 @@ int remap_area_pages(unsigned long addre
flush_cache_all();
if (address >= end)
BUG();
- spin_lock(&init_mm.page_table_lock);
do {
pmd_t *pmd;
pmd = pmd_alloc(&init_mm, dir, address);
@@ -93,7 +92,6 @@ int remap_area_pages(unsigned long addre
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
- spin_unlock(&init_mm.page_table_lock);
flush_tlb_all();
return error;
}
--- mm08/arch/sh64/mm/ioremap.c 2005-06-17 20:48:29.000000000 +0100
+++ mm09/arch/sh64/mm/ioremap.c 2005-10-11 23:55:54.000000000 +0100
@@ -79,7 +79,7 @@ static inline int remap_area_pmd(pmd_t *
BUG();

do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address, address + phys_addr, flags);
@@ -101,7 +101,6 @@ static int remap_area_pages(unsigned lon
flush_cache_all();
if (address >= end)
BUG();
- spin_lock(&init_mm.page_table_lock);
do {
pmd_t *pmd = pmd_alloc(&init_mm, dir, address);
error = -ENOMEM;
@@ -115,7 +114,6 @@ static int remap_area_pages(unsigned lon
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
- spin_unlock(&init_mm.page_table_lock);
flush_tlb_all();
return 0;
}
--- mm08/arch/x86_64/mm/ioremap.c 2005-09-30 11:58:56.000000000 +0100
+++ mm09/arch/x86_64/mm/ioremap.c 2005-10-11 23:55:54.000000000 +0100
@@ -60,7 +60,7 @@ static inline int remap_area_pmd(pmd_t *
if (address >= end)
BUG();
do {
- pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
+ pte_t * pte = pte_alloc_kernel(pmd, address);
if (!pte)
return -ENOMEM;
remap_area_pte(pte, address, end - address, address + phys_addr, flags);
@@ -105,7 +105,6 @@ static int remap_area_pages(unsigned lon
flush_cache_all();
if (address >= end)
BUG();
- spin_lock(&init_mm.page_table_lock);
do {
pud_t *pud;
pud = pud_alloc(&init_mm, pgd, address);
@@ -119,7 +118,6 @@ static int remap_area_pages(unsigned lon
address = (address + PGDIR_SIZE) & PGDIR_MASK;
pgd++;
} while (address && (address < end));
- spin_unlock(&init_mm.page_table_lock);
flush_tlb_all();
return error;
}
--- mm08/include/linux/mm.h 2005-10-11 23:55:38.000000000 +0100
+++ mm09/include/linux/mm.h 2005-10-11 23:55:54.000000000 +0100
@@ -711,7 +711,7 @@ static inline void unmap_shared_mapping_
extern int vmtruncate(struct inode * inode, loff_t offset);
extern pud_t *FASTCALL(__pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address));
-extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
+extern pte_t *FASTCALL(pte_alloc_kernel(pmd_t *pmd, unsigned long address));
extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot);
extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot);
--- mm08/mm/memory.c 2005-10-11 23:54:33.000000000 +0100
+++ mm09/mm/memory.c 2005-10-11 23:55:54.000000000 +0100
@@ -307,28 +307,22 @@ out:
return pte_offset_map(pmd, address);
}

-pte_t fastcall * pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+pte_t fastcall * pte_alloc_kernel(pmd_t *pmd, unsigned long address)
{
if (!pmd_present(*pmd)) {
pte_t *new;

- spin_unlock(&mm->page_table_lock);
- new = pte_alloc_one_kernel(mm, address);
- spin_lock(&mm->page_table_lock);
+ new = pte_alloc_one_kernel(&init_mm, address);
if (!new)
return NULL;

- /*
- * Because we dropped the lock, we should re-check the
- * entry, as somebody else could have populated it..
- */
- if (pmd_present(*pmd)) {
+ spin_lock(&init_mm.page_table_lock);
+ if (pmd_present(*pmd))
pte_free_kernel(new);
- goto out;
- }
- pmd_populate_kernel(mm, pmd, new);
+ else
+ pmd_populate_kernel(&init_mm, pmd, new);
+ spin_unlock(&init_mm.page_table_lock);
}
-out:
return pte_offset_kernel(pmd, address);
}

@@ -2097,30 +2091,30 @@ int __handle_mm_fault(struct mm_struct *
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
- *
- * We've already handled the fast-path in-line, and we own the
- * page table lock.
+ * We've already handled the fast-path in-line.
*/
pud_t fastcall *__pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
{
pud_t *new;

- spin_unlock(&mm->page_table_lock);
+ if (mm != &init_mm) /* Temporary bridging hack */
+ spin_unlock(&mm->page_table_lock);
new = pud_alloc_one(mm, address);
- spin_lock(&mm->page_table_lock);
- if (!new)
+ if (!new) {
+ if (mm != &init_mm) /* Temporary bridging hack */
+ spin_lock(&mm->page_table_lock);
return NULL;
+ }

- /*
- * Because we dropped the lock, we should re-check the
- * entry, as somebody else could have populated it..
- */
+ spin_lock(&mm->page_table_lock);
if (pgd_present(*pgd)) {
pud_free(new);
goto out;
}
pgd_populate(mm, pgd, new);
out:
+ if (mm == &init_mm) /* Temporary bridging hack */
+ spin_unlock(&mm->page_table_lock);
return pud_offset(pgd, address);
}
#endif /* __PAGETABLE_PUD_FOLDED */
@@ -2128,24 +2122,22 @@ pud_t fastcall *__pud_alloc(struct mm_st
#ifndef __PAGETABLE_PMD_FOLDED
/*
* Allocate page middle directory.
- *
- * We've already handled the fast-path in-line, and we own the
- * page table lock.
+ * We've already handled the fast-path in-line.
*/
pmd_t fastcall *__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
pmd_t *new;

- spin_unlock(&mm->page_table_lock);
+ if (mm != &init_mm) /* Temporary bridging hack */
+ spin_unlock(&mm->page_table_lock);
new = pmd_alloc_one(mm, address);
- spin_lock(&mm->page_table_lock);
- if (!new)
+ if (!new) {
+ if (mm != &init_mm) /* Temporary bridging hack */
+ spin_lock(&mm->page_table_lock);
return NULL;
+ }

- /*
- * Because we dropped the lock, we should re-check the
- * entry, as somebody else could have populated it..
- */
+ spin_lock(&mm->page_table_lock);
#ifndef __ARCH_HAS_4LEVEL_HACK
if (pud_present(*pud)) {
pmd_free(new);
@@ -2161,6 +2153,8 @@ pmd_t fastcall *__pmd_alloc(struct mm_st
#endif /* __ARCH_HAS_4LEVEL_HACK */

out:
+ if (mm == &init_mm) /* Temporary bridging hack */
+ spin_unlock(&mm->page_table_lock);
return pmd_offset(pud, address);
}
#endif /* __PAGETABLE_PMD_FOLDED */
--- mm08/mm/vmalloc.c 2005-09-21 12:16:59.000000000 +0100
+++ mm09/mm/vmalloc.c 2005-10-11 23:55:54.000000000 +0100
@@ -88,7 +88,7 @@ static int vmap_pte_range(pmd_t *pmd, un
{
pte_t *pte;

- pte = pte_alloc_kernel(&init_mm, pmd, addr);
+ pte = pte_alloc_kernel(pmd, addr);
if (!pte)
return -ENOMEM;
do {
@@ -146,14 +146,12 @@ int map_vm_area(struct vm_struct *area,

BUG_ON(addr >= end);
pgd = pgd_offset_k(addr);
- spin_lock(&init_mm.page_table_lock);
do {
next = pgd_addr_end(addr, end);
err = vmap_pud_range(pgd, addr, next, prot, pages);
if (err)
break;
} while (pgd++, addr = next, addr != end);
- spin_unlock(&init_mm.page_table_lock);
flush_cache_vmap((unsigned long) area->addr, end);
return err;
}

2005-10-13 00:56:32

by Hugh Dickins

Subject: [PATCH 10/21] mm: ptd_alloc inline and out

It seems odd to me that, whereas pud_alloc and pmd_alloc test inline,
only calling the out-of-line __pud_alloc or __pmd_alloc if allocation is
needed, pte_alloc_map and pte_alloc_kernel are entirely out-of-line.
Though it does add a little to kernel size, change them to macros testing
inline, calling __pte_alloc or __pte_alloc_kernel to allocate out-of-line.
Mark none of them as fastcall; leave that to CONFIG_REGPARM or not.

It also seems more natural for the out-of-line functions to leave the
offset calculation and map to the inline caller, which has to do them
anyway for the common case. At least mremap move wants __pte_alloc
without the _map.

Macros rather than inline functions, certainly to avoid the header file
issues which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in
case any architectures I haven't built would have other such problems.
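
Callers look just as before; only the slow path goes out of line. As a
schematic fragment (not from the patch) of how a fault path uses it:

        pte = pte_alloc_map(mm, pmd, address);  /* __pte_alloc only if *pmd empty */
        if (!pte)
                return VM_FAULT_OOM;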

Signed-off-by: Hugh Dickins <[email protected]>
---

include/asm-generic/4level-fixup.h | 11 +---
include/linux/mm.h | 38 +++++++--------
mm/memory.c | 93 +++++++++++++++----------------------
mm/mremap.c | 7 --
4 files changed, 61 insertions(+), 88 deletions(-)

--- mm09/include/asm-generic/4level-fixup.h 2005-06-17 20:48:29.000000000 +0100
+++ mm10/include/asm-generic/4level-fixup.h 2005-10-11 23:56:11.000000000 +0100
@@ -10,14 +10,9 @@

#define pud_t pgd_t

-#define pmd_alloc(mm, pud, address) \
-({ pmd_t *ret; \
- if (pgd_none(*pud)) \
- ret = __pmd_alloc(mm, pud, address); \
- else \
- ret = pmd_offset(pud, address); \
- ret; \
-})
+#define pmd_alloc(mm, pud, address) \
+ ((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \
+ NULL: pmd_offset(pud, address))

#define pud_alloc(mm, pgd, address) (pgd)
#define pud_offset(pgd, start) (pgd)
--- mm09/include/linux/mm.h 2005-10-11 23:55:54.000000000 +0100
+++ mm10/include/linux/mm.h 2005-10-11 23:56:11.000000000 +0100
@@ -709,10 +709,6 @@ static inline void unmap_shared_mapping_
}

extern int vmtruncate(struct inode * inode, loff_t offset);
-extern pud_t *FASTCALL(__pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
-extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address));
-extern pte_t *FASTCALL(pte_alloc_kernel(pmd_t *pmd, unsigned long address));
-extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot);
extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot);
extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access);
@@ -765,32 +761,36 @@ struct shrinker;
extern struct shrinker *set_shrinker(int, shrinker_t);
extern void remove_shrinker(struct shrinker *shrinker);

-/*
- * On a two-level or three-level page table, this ends up being trivial. Thus
- * the inlining and the symmetry break with pte_alloc_map() that does all
- * of this out-of-line.
- */
+int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
+int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
+
/*
* The following ifdef needed to get the 4level-fixup.h header to work.
* Remove it when 4level-fixup.h has been removed.
*/
-#ifdef CONFIG_MMU
-#ifndef __ARCH_HAS_4LEVEL_HACK
+#if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)
static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
{
- if (pgd_none(*pgd))
- return __pud_alloc(mm, pgd, address);
- return pud_offset(pgd, address);
+ return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))?
+ NULL: pud_offset(pgd, address);
}

static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
- if (pud_none(*pud))
- return __pmd_alloc(mm, pud, address);
- return pmd_offset(pud, address);
+ return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))?
+ NULL: pmd_offset(pud, address);
}
-#endif
-#endif /* CONFIG_MMU */
+#endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
+
+#define pte_alloc_map(mm, pmd, address) \
+ ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
+ NULL: pte_offset_map(pmd, address))
+
+#define pte_alloc_kernel(pmd, address) \
+ ((unlikely(!pmd_present(*(pmd))) && __pte_alloc_kernel(pmd, address))? \
+ NULL: pte_offset_kernel(pmd, address))

extern void free_area_init(unsigned long * zones_size);
extern void free_area_init_node(int nid, pg_data_t *pgdat,
--- mm09/mm/memory.c 2005-10-11 23:55:54.000000000 +0100
+++ mm10/mm/memory.c 2005-10-11 23:56:11.000000000 +0100
@@ -280,50 +280,39 @@ void free_pgtables(struct mmu_gather **t
}
}

-pte_t fastcall *pte_alloc_map(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address)
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
{
- if (!pmd_present(*pmd)) {
- struct page *new;
+ struct page *new;

- spin_unlock(&mm->page_table_lock);
- new = pte_alloc_one(mm, address);
- spin_lock(&mm->page_table_lock);
- if (!new)
- return NULL;
- /*
- * Because we dropped the lock, we should re-check the
- * entry, as somebody else could have populated it..
- */
- if (pmd_present(*pmd)) {
- pte_free(new);
- goto out;
- }
+ spin_unlock(&mm->page_table_lock);
+ new = pte_alloc_one(mm, address);
+ spin_lock(&mm->page_table_lock);
+ if (!new)
+ return -ENOMEM;
+
+ if (pmd_present(*pmd)) /* Another has populated it */
+ pte_free(new);
+ else {
mm->nr_ptes++;
inc_page_state(nr_page_table_pages);
pmd_populate(mm, pmd, new);
}
-out:
- return pte_offset_map(pmd, address);
+ return 0;
}

-pte_t fastcall * pte_alloc_kernel(pmd_t *pmd, unsigned long address)
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
{
- if (!pmd_present(*pmd)) {
- pte_t *new;
-
- new = pte_alloc_one_kernel(&init_mm, address);
- if (!new)
- return NULL;
+ pte_t *new = pte_alloc_one_kernel(&init_mm, address);
+ if (!new)
+ return -ENOMEM;

- spin_lock(&init_mm.page_table_lock);
- if (pmd_present(*pmd))
- pte_free_kernel(new);
- else
- pmd_populate_kernel(&init_mm, pmd, new);
- spin_unlock(&init_mm.page_table_lock);
- }
- return pte_offset_kernel(pmd, address);
+ spin_lock(&init_mm.page_table_lock);
+ if (pmd_present(*pmd)) /* Another has populated it */
+ pte_free_kernel(new);
+ else
+ pmd_populate_kernel(&init_mm, pmd, new);
+ spin_unlock(&init_mm.page_table_lock);
+ return 0;
}

static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
@@ -2093,7 +2082,7 @@ int __handle_mm_fault(struct mm_struct *
* Allocate page upper directory.
* We've already handled the fast-path in-line.
*/
-pud_t fastcall *__pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
{
pud_t *new;

@@ -2103,19 +2092,17 @@ pud_t fastcall *__pud_alloc(struct mm_st
if (!new) {
if (mm != &init_mm) /* Temporary bridging hack */
spin_lock(&mm->page_table_lock);
- return NULL;
+ return -ENOMEM;
}

spin_lock(&mm->page_table_lock);
- if (pgd_present(*pgd)) {
+ if (pgd_present(*pgd)) /* Another has populated it */
pud_free(new);
- goto out;
- }
- pgd_populate(mm, pgd, new);
- out:
+ else
+ pgd_populate(mm, pgd, new);
if (mm == &init_mm) /* Temporary bridging hack */
spin_unlock(&mm->page_table_lock);
- return pud_offset(pgd, address);
+ return 0;
}
#endif /* __PAGETABLE_PUD_FOLDED */

@@ -2124,7 +2111,7 @@ pud_t fastcall *__pud_alloc(struct mm_st
* Allocate page middle directory.
* We've already handled the fast-path in-line.
*/
-pmd_t fastcall *__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
+int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
pmd_t *new;

@@ -2134,28 +2121,24 @@ pmd_t fastcall *__pmd_alloc(struct mm_st
if (!new) {
if (mm != &init_mm) /* Temporary bridging hack */
spin_lock(&mm->page_table_lock);
- return NULL;
+ return -ENOMEM;
}

spin_lock(&mm->page_table_lock);
#ifndef __ARCH_HAS_4LEVEL_HACK
- if (pud_present(*pud)) {
+ if (pud_present(*pud)) /* Another has populated it */
pmd_free(new);
- goto out;
- }
- pud_populate(mm, pud, new);
+ else
+ pud_populate(mm, pud, new);
#else
- if (pgd_present(*pud)) {
+ if (pgd_present(*pud)) /* Another has populated it */
pmd_free(new);
- goto out;
- }
- pgd_populate(mm, pud, new);
+ else
+ pgd_populate(mm, pud, new);
#endif /* __ARCH_HAS_4LEVEL_HACK */
-
- out:
if (mm == &init_mm) /* Temporary bridging hack */
spin_unlock(&mm->page_table_lock);
- return pmd_offset(pud, address);
+ return 0;
}
#endif /* __PAGETABLE_PMD_FOLDED */

--- mm09/mm/mremap.c 2005-10-11 23:54:33.000000000 +0100
+++ mm10/mm/mremap.c 2005-10-11 23:56:11.000000000 +0100
@@ -51,7 +51,6 @@ static pmd_t *alloc_new_pmd(struct mm_st
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd = NULL;
- pte_t *pte;

/*
* We do need page_table_lock: because allocators expect that.
@@ -66,12 +65,8 @@ static pmd_t *alloc_new_pmd(struct mm_st
if (!pmd)
goto out;

- pte = pte_alloc_map(mm, pmd, addr);
- if (!pte) {
+ if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
pmd = NULL;
- goto out;
- }
- pte_unmap(pte);
out:
spin_unlock(&mm->page_table_lock);
return pmd;

2005-10-13 00:57:10

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 11/21] mm: ptd_alloc take ptlock

Second step in pushing down the page_table_lock. Remove the temporary
bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers
not to hold page_table_lock, whether it's on init_mm or a user mm; take
page_table_lock internally to check if a racing task already allocated.

Convert their callers in the common code. But avoid coming back to change
them again later: instead of moving the spin_lock(&mm->page_table_lock)
down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock,
which encapsulate the mapping+locking and unlocking+unmapping together,
and in the end may use alternatives to the mm page_table_lock itself.

These callers all hold mmap_sem (some exclusively, some not), so at no
level can a page table be whipped away from beneath them; and pte_alloc
uses the "atomic" pmd_present to test whether it needs to allocate. It
appears that on all arches we can safely descend without page_table_lock.
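
The conversion in each caller follows one pattern; a minimal sketch (the
actual hunks are below, this adds nothing new):

        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -ENOMEM;         /* or VM_FAULT_OOM, as the caller requires */
        /* ... examine or modify *pte, mapped and locked ... */
        pte_unmap_unlock(pte, ptl);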

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/exec.c | 14 ++-----
include/linux/mm.h | 18 +++++++++
kernel/fork.c | 2 -
mm/fremap.c | 46 ++++++++---------------
mm/hugetlb.c | 12 ++++--
mm/memory.c | 104 ++++++++++++++++-------------------------------------
mm/mremap.c | 27 ++++---------
7 files changed, 89 insertions(+), 134 deletions(-)

--- mm10/fs/exec.c 2005-10-11 23:54:33.000000000 +0100
+++ mm11/fs/exec.c 2005-10-11 23:56:25.000000000 +0100
@@ -309,25 +309,24 @@ void install_arg_page(struct vm_area_str
pud_t * pud;
pmd_t * pmd;
pte_t * pte;
+ spinlock_t *ptl;

if (unlikely(anon_vma_prepare(vma)))
- goto out_sig;
+ goto out;

flush_dcache_page(page);
pgd = pgd_offset(mm, address);
-
- spin_lock(&mm->page_table_lock);
pud = pud_alloc(mm, pgd, address);
if (!pud)
goto out;
pmd = pmd_alloc(mm, pud, address);
if (!pmd)
goto out;
- pte = pte_alloc_map(mm, pmd, address);
+ pte = pte_alloc_map_lock(mm, pmd, address, &ptl);
if (!pte)
goto out;
if (!pte_none(*pte)) {
- pte_unmap(pte);
+ pte_unmap_unlock(pte, ptl);
goto out;
}
inc_mm_counter(mm, anon_rss);
@@ -335,14 +334,11 @@ void install_arg_page(struct vm_area_str
set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
page, vma->vm_page_prot))));
page_add_anon_rmap(page, vma, address);
- pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(pte, ptl);

/* no need for flush_tlb */
return;
out:
- spin_unlock(&mm->page_table_lock);
-out_sig:
__free_page(page);
force_sig(SIGKILL, current);
}
--- mm10/include/linux/mm.h 2005-10-11 23:56:11.000000000 +0100
+++ mm11/include/linux/mm.h 2005-10-11 23:56:25.000000000 +0100
@@ -784,10 +784,28 @@ static inline pmd_t *pmd_alloc(struct mm
}
#endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */

+#define pte_offset_map_lock(mm, pmd, address, ptlp) \
+({ \
+ spinlock_t *__ptl = &(mm)->page_table_lock; \
+ pte_t *__pte = pte_offset_map(pmd, address); \
+ *(ptlp) = __ptl; \
+ spin_lock(__ptl); \
+ __pte; \
+})
+
+#define pte_unmap_unlock(pte, ptl) do { \
+ spin_unlock(ptl); \
+ pte_unmap(pte); \
+} while (0)
+
#define pte_alloc_map(mm, pmd, address) \
((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
NULL: pte_offset_map(pmd, address))

+#define pte_alloc_map_lock(mm, pmd, address, ptlp) \
+ ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
+ NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
+
#define pte_alloc_kernel(pmd, address) \
((unlikely(!pmd_present(*(pmd))) && __pte_alloc_kernel(pmd, address))? \
NULL: pte_offset_kernel(pmd, address))
--- mm10/kernel/fork.c 2005-09-30 11:59:12.000000000 +0100
+++ mm11/kernel/fork.c 2005-10-11 23:56:25.000000000 +0100
@@ -255,7 +255,6 @@ static inline int dup_mmap(struct mm_str
/*
* Link in the new vma and copy the page table entries.
*/
- spin_lock(&mm->page_table_lock);
*pprev = tmp;
pprev = &tmp->vm_next;

@@ -265,7 +264,6 @@ static inline int dup_mmap(struct mm_str

mm->map_count++;
retval = copy_page_range(mm, oldmm, tmp);
- spin_unlock(&mm->page_table_lock);

if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);
--- mm10/mm/fremap.c 2005-10-11 23:54:33.000000000 +0100
+++ mm11/mm/fremap.c 2005-10-11 23:56:25.000000000 +0100
@@ -63,23 +63,20 @@ int install_page(struct mm_struct *mm, s
pud_t *pud;
pgd_t *pgd;
pte_t pte_val;
+ spinlock_t *ptl;

BUG_ON(vma->vm_flags & VM_RESERVED);

pgd = pgd_offset(mm, addr);
- spin_lock(&mm->page_table_lock);
-
pud = pud_alloc(mm, pgd, addr);
if (!pud)
- goto err_unlock;
-
+ goto out;
pmd = pmd_alloc(mm, pud, addr);
if (!pmd)
- goto err_unlock;
-
- pte = pte_alloc_map(mm, pmd, addr);
+ goto out;
+ pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
- goto err_unlock;
+ goto out;

/*
* This page may have been truncated. Tell the
@@ -89,7 +86,7 @@ int install_page(struct mm_struct *mm, s
inode = vma->vm_file->f_mapping->host;
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (!page->mapping || page->index >= size)
- goto err_unlock;
+ goto unlock;

if (pte_none(*pte) || !zap_pte(mm, vma, addr, pte))
inc_mm_counter(mm, file_rss);
@@ -98,17 +95,15 @@ int install_page(struct mm_struct *mm, s
set_pte_at(mm, addr, pte, mk_pte(page, prot));
page_add_file_rmap(page);
pte_val = *pte;
- pte_unmap(pte);
update_mmu_cache(vma, addr, pte_val);
-
err = 0;
-err_unlock:
- spin_unlock(&mm->page_table_lock);
+unlock:
+ pte_unmap_unlock(pte, ptl);
+out:
return err;
}
EXPORT_SYMBOL(install_page);

-
/*
* Install a file pte to a given virtual memory address, release any
* previously existing mapping.
@@ -122,23 +117,20 @@ int install_file_pte(struct mm_struct *m
pud_t *pud;
pgd_t *pgd;
pte_t pte_val;
+ spinlock_t *ptl;

BUG_ON(vma->vm_flags & VM_RESERVED);

pgd = pgd_offset(mm, addr);
- spin_lock(&mm->page_table_lock);
-
pud = pud_alloc(mm, pgd, addr);
if (!pud)
- goto err_unlock;
-
+ goto out;
pmd = pmd_alloc(mm, pud, addr);
if (!pmd)
- goto err_unlock;
-
- pte = pte_alloc_map(mm, pmd, addr);
+ goto out;
+ pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
- goto err_unlock;
+ goto out;

if (!pte_none(*pte) && zap_pte(mm, vma, addr, pte)) {
update_hiwater_rss(mm);
@@ -147,17 +139,13 @@ int install_file_pte(struct mm_struct *m

set_pte_at(mm, addr, pte, pgoff_to_pte(pgoff));
pte_val = *pte;
- pte_unmap(pte);
update_mmu_cache(vma, addr, pte_val);
- spin_unlock(&mm->page_table_lock);
- return 0;
-
-err_unlock:
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(pte, ptl);
+ err = 0;
+out:
return err;
}

-
/***
* sys_remap_file_pages - remap arbitrary pages of a shared backing store
* file within an existing vma.
--- mm10/mm/hugetlb.c 2005-10-11 23:54:33.000000000 +0100
+++ mm11/mm/hugetlb.c 2005-10-11 23:56:25.000000000 +0100
@@ -276,12 +276,15 @@ int copy_hugetlb_page_range(struct mm_st
unsigned long addr;

for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte)
+ continue;
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
+ spin_lock(&dst->page_table_lock);
spin_lock(&src->page_table_lock);
- src_pte = huge_pte_offset(src, addr);
- if (src_pte && !pte_none(*src_pte)) {
+ if (!pte_none(*src_pte)) {
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
@@ -289,6 +292,7 @@ int copy_hugetlb_page_range(struct mm_st
set_huge_pte_at(dst, addr, dst_pte, entry);
}
spin_unlock(&src->page_table_lock);
+ spin_unlock(&dst->page_table_lock);
}
return 0;

@@ -353,7 +357,6 @@ int hugetlb_prefault(struct address_spac

hugetlb_prefault_arch_hook(mm);

- spin_lock(&mm->page_table_lock);
for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
unsigned long idx;
pte_t *pte = huge_pte_alloc(mm, addr);
@@ -388,11 +391,12 @@ int hugetlb_prefault(struct address_spac
goto out;
}
}
+ spin_lock(&mm->page_table_lock);
add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
set_huge_pte_at(mm, addr, pte, make_huge_pte(vma, page));
+ spin_unlock(&mm->page_table_lock);
}
out:
- spin_unlock(&mm->page_table_lock);
return ret;
}

--- mm10/mm/memory.c 2005-10-11 23:56:11.000000000 +0100
+++ mm11/mm/memory.c 2005-10-11 23:56:25.000000000 +0100
@@ -282,14 +282,11 @@ void free_pgtables(struct mmu_gather **t

int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
{
- struct page *new;
-
- spin_unlock(&mm->page_table_lock);
- new = pte_alloc_one(mm, address);
- spin_lock(&mm->page_table_lock);
+ struct page *new = pte_alloc_one(mm, address);
if (!new)
return -ENOMEM;

+ spin_lock(&mm->page_table_lock);
if (pmd_present(*pmd)) /* Another has populated it */
pte_free(new);
else {
@@ -297,6 +294,7 @@ int __pte_alloc(struct mm_struct *mm, pm
inc_page_state(nr_page_table_pages);
pmd_populate(mm, pmd, new);
}
+ spin_unlock(&mm->page_table_lock);
return 0;
}

@@ -344,9 +342,6 @@ void print_bad_pte(struct vm_area_struct
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
* covered by this vma.
- *
- * dst->page_table_lock is held on entry and exit,
- * but may be dropped within p[mg]d_alloc() and pte_alloc_map().
*/

static inline void
@@ -419,17 +414,19 @@ static int copy_pte_range(struct mm_stru
unsigned long addr, unsigned long end)
{
pte_t *src_pte, *dst_pte;
+ spinlock_t *src_ptl, *dst_ptl;
int progress = 0;
int rss[2];

again:
rss[1] = rss[0] = 0;
- dst_pte = pte_alloc_map(dst_mm, dst_pmd, addr);
+ dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte)
return -ENOMEM;
src_pte = pte_offset_map_nested(src_pmd, addr);
+ src_ptl = &src_mm->page_table_lock;
+ spin_lock(src_ptl);

- spin_lock(&src_mm->page_table_lock);
do {
/*
* We are holding two locks at this point - either of them
@@ -438,8 +435,8 @@ again:
if (progress >= 32) {
progress = 0;
if (need_resched() ||
- need_lockbreak(&src_mm->page_table_lock) ||
- need_lockbreak(&dst_mm->page_table_lock))
+ need_lockbreak(src_ptl) ||
+ need_lockbreak(dst_ptl))
break;
}
if (pte_none(*src_pte)) {
@@ -449,12 +446,12 @@ again:
copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
- spin_unlock(&src_mm->page_table_lock);

+ spin_unlock(src_ptl);
pte_unmap_nested(src_pte - 1);
- pte_unmap(dst_pte - 1);
add_mm_rss(dst_mm, rss[0], rss[1]);
- cond_resched_lock(&dst_mm->page_table_lock);
+ pte_unmap_unlock(dst_pte - 1, dst_ptl);
+ cond_resched();
if (addr != end)
goto again;
return 0;
@@ -1049,8 +1046,9 @@ static int zeromap_pte_range(struct mm_s
unsigned long addr, unsigned long end, pgprot_t prot)
{
pte_t *pte;
+ spinlock_t *ptl;

- pte = pte_alloc_map(mm, pmd, addr);
+ pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return -ENOMEM;
do {
@@ -1062,7 +1060,7 @@ static int zeromap_pte_range(struct mm_s
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, zero_pte);
} while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(pte - 1);
+ pte_unmap_unlock(pte - 1, ptl);
return 0;
}

@@ -1112,14 +1110,12 @@ int zeromap_page_range(struct vm_area_st
BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
- spin_lock(&mm->page_table_lock);
do {
next = pgd_addr_end(addr, end);
err = zeromap_pud_range(mm, pgd, addr, next, prot);
if (err)
break;
} while (pgd++, addr = next, addr != end);
- spin_unlock(&mm->page_table_lock);
return err;
}

@@ -1133,8 +1129,9 @@ static int remap_pte_range(struct mm_str
unsigned long pfn, pgprot_t prot)
{
pte_t *pte;
+ spinlock_t *ptl;

- pte = pte_alloc_map(mm, pmd, addr);
+ pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return -ENOMEM;
do {
@@ -1142,7 +1139,7 @@ static int remap_pte_range(struct mm_str
set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(pte - 1);
+ pte_unmap_unlock(pte - 1, ptl);
return 0;
}

@@ -1210,7 +1207,6 @@ int remap_pfn_range(struct vm_area_struc
pfn -= addr >> PAGE_SHIFT;
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
- spin_lock(&mm->page_table_lock);
do {
next = pgd_addr_end(addr, end);
err = remap_pud_range(mm, pgd, addr, next,
@@ -1218,7 +1214,6 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
- spin_unlock(&mm->page_table_lock);
return err;
}
EXPORT_SYMBOL(remap_pfn_range);
@@ -1985,17 +1980,9 @@ static int do_file_page(struct mm_struct
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
@@ -2003,6 +1990,7 @@ static inline int handle_pte_fault(struc
{
pte_t entry;

+ spin_lock(&mm->page_table_lock);
entry = *pte;
if (!pte_present(entry)) {
if (pte_none(entry)) {
@@ -2051,30 +2039,18 @@ int __handle_mm_fault(struct mm_struct *
if (is_vm_hugetlb_page(vma))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

- /*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
- */
pgd = pgd_offset(mm, address);
- spin_lock(&mm->page_table_lock);
-
pud = pud_alloc(mm, pgd, address);
if (!pud)
- goto oom;
-
+ return VM_FAULT_OOM;
pmd = pmd_alloc(mm, pud, address);
if (!pmd)
- goto oom;
-
+ return VM_FAULT_OOM;
pte = pte_alloc_map(mm, pmd, address);
if (!pte)
- goto oom;
-
- return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+ return VM_FAULT_OOM;

- oom:
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
}

#ifndef __PAGETABLE_PUD_FOLDED
@@ -2084,24 +2060,16 @@ int __handle_mm_fault(struct mm_struct *
*/
int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
{
- pud_t *new;
-
- if (mm != &init_mm) /* Temporary bridging hack */
- spin_unlock(&mm->page_table_lock);
- new = pud_alloc_one(mm, address);
- if (!new) {
- if (mm != &init_mm) /* Temporary bridging hack */
- spin_lock(&mm->page_table_lock);
+ pud_t *new = pud_alloc_one(mm, address);
+ if (!new)
return -ENOMEM;
- }

spin_lock(&mm->page_table_lock);
if (pgd_present(*pgd)) /* Another has populated it */
pud_free(new);
else
pgd_populate(mm, pgd, new);
- if (mm == &init_mm) /* Temporary bridging hack */
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(&mm->page_table_lock);
return 0;
}
#endif /* __PAGETABLE_PUD_FOLDED */
@@ -2113,16 +2081,9 @@ int __pud_alloc(struct mm_struct *mm, pg
*/
int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
- pmd_t *new;
-
- if (mm != &init_mm) /* Temporary bridging hack */
- spin_unlock(&mm->page_table_lock);
- new = pmd_alloc_one(mm, address);
- if (!new) {
- if (mm != &init_mm) /* Temporary bridging hack */
- spin_lock(&mm->page_table_lock);
+ pmd_t *new = pmd_alloc_one(mm, address);
+ if (!new)
return -ENOMEM;
- }

spin_lock(&mm->page_table_lock);
#ifndef __ARCH_HAS_4LEVEL_HACK
@@ -2136,8 +2097,7 @@ int __pmd_alloc(struct mm_struct *mm, pu
else
pgd_populate(mm, pud, new);
#endif /* __ARCH_HAS_4LEVEL_HACK */
- if (mm == &init_mm) /* Temporary bridging hack */
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(&mm->page_table_lock);
return 0;
}
#endif /* __PAGETABLE_PMD_FOLDED */
--- mm10/mm/mremap.c 2005-10-11 23:56:11.000000000 +0100
+++ mm11/mm/mremap.c 2005-10-11 23:56:25.000000000 +0100
@@ -28,9 +28,6 @@ static pmd_t *get_old_pmd(struct mm_stru
pud_t *pud;
pmd_t *pmd;

- /*
- * We don't need page_table_lock: we have mmap_sem exclusively.
- */
pgd = pgd_offset(mm, addr);
if (pgd_none_or_clear_bad(pgd))
return NULL;
@@ -50,25 +47,20 @@ static pmd_t *alloc_new_pmd(struct mm_st
{
pgd_t *pgd;
pud_t *pud;
- pmd_t *pmd = NULL;
+ pmd_t *pmd;

- /*
- * We do need page_table_lock: because allocators expect that.
- */
- spin_lock(&mm->page_table_lock);
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
- goto out;
+ return NULL;

pmd = pmd_alloc(mm, pud, addr);
if (!pmd)
- goto out;
+ return NULL;

if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
- pmd = NULL;
-out:
- spin_unlock(&mm->page_table_lock);
+ return NULL;
+
return pmd;
}

@@ -80,6 +72,7 @@ static void move_ptes(struct vm_area_str
struct address_space *mapping = NULL;
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
+ spinlock_t *old_ptl;

if (vma->vm_file) {
/*
@@ -95,9 +88,8 @@ static void move_ptes(struct vm_area_str
new_vma->vm_truncate_count = 0;
}

- spin_lock(&mm->page_table_lock);
- old_pte = pte_offset_map(old_pmd, old_addr);
- new_pte = pte_offset_map_nested(new_pmd, new_addr);
+ old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
+ new_pte = pte_offset_map_nested(new_pmd, new_addr);

for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
new_pte++, new_addr += PAGE_SIZE) {
@@ -110,8 +102,7 @@ static void move_ptes(struct vm_area_str
}

pte_unmap_nested(new_pte - 1);
- pte_unmap(old_pte - 1);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
}

2005-10-13 01:00:13

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 12/21] mm: arches skip ptlock

Convert those few architectures which are calling pud_alloc, pmd_alloc,
pte_alloc_map on a user mm, not to take the page_table_lock first, nor
drop it after. Each of these can continue to use pte_alloc_map, no need
to change over to pte_alloc_map_lock, they're neither racy nor swappable.

In the sparc64 io_remap_pfn_range, flush_tlb_range then falls outside of
the page_table_lock: that's okay, on sparc64 it's like flush_tlb_mm, and
that has always been called from outside of page_table_lock in dup_mmap.
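
Each conversion has the same shape; sketched here from the sparc hunk below,
nothing beyond it:

        dir = pgd_offset(mm, from);
        flush_cache_range(vma, beg, end);
        while (from < end) {
                pmd_t *pmd = pmd_alloc(mm, dir, from);  /* no page_table_lock around this now */
                if (!pmd)
                        break;
                /* ... fill in the ptes under this pmd ... */
                from = (from + PGDIR_SIZE) & PGDIR_MASK;
                dir++;
        }
        flush_tlb_range(vma, beg, end);                 /* now outside any page_table_lock */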

Signed-off-by: Hugh Dickins <[email protected]>
---

arch/arm/mm/mm-armv.c | 14 --------------
arch/arm26/mm/memc.c | 15 ---------------
arch/sparc/mm/generic.c | 4 +---
arch/sparc64/mm/generic.c | 6 ++----
arch/um/kernel/skas/mmu.c | 3 ---
5 files changed, 3 insertions(+), 39 deletions(-)

--- mm11/arch/arm/mm/mm-armv.c 2005-09-21 12:16:12.000000000 +0100
+++ mm12/arch/arm/mm/mm-armv.c 2005-10-11 23:56:41.000000000 +0100
@@ -180,11 +180,6 @@ pgd_t *get_pgd_slow(struct mm_struct *mm

if (!vectors_high()) {
/*
- * This lock is here just to satisfy pmd_alloc and pte_lock
- */
- spin_lock(&mm->page_table_lock);
-
- /*
* On ARM, first page must always be allocated since it
* contains the machine vectors.
*/
@@ -201,23 +196,14 @@ pgd_t *get_pgd_slow(struct mm_struct *mm
set_pte(new_pte, *init_pte);
pte_unmap_nested(init_pte);
pte_unmap(new_pte);
-
- spin_unlock(&mm->page_table_lock);
}

return new_pgd;

no_pte:
- spin_unlock(&mm->page_table_lock);
pmd_free(new_pmd);
- free_pages((unsigned long)new_pgd, 2);
- return NULL;
-
no_pmd:
- spin_unlock(&mm->page_table_lock);
free_pages((unsigned long)new_pgd, 2);
- return NULL;
-
no_pgd:
return NULL;
}
--- mm11/arch/arm26/mm/memc.c 2005-10-11 23:55:53.000000000 +0100
+++ mm12/arch/arm26/mm/memc.c 2005-10-11 23:56:41.000000000 +0100
@@ -79,12 +79,6 @@ pgd_t *get_pgd_slow(struct mm_struct *mm
goto no_pgd;

/*
- * This lock is here just to satisfy pmd_alloc and pte_lock
- * FIXME: I bet we could avoid taking it pretty much altogether
- */
- spin_lock(&mm->page_table_lock);
-
- /*
* On ARM, first page must always be allocated since it contains
* the machine vectors.
*/
@@ -113,23 +107,14 @@ pgd_t *get_pgd_slow(struct mm_struct *mm
memcpy(new_pgd + FIRST_KERNEL_PGD_NR, init_pgd + FIRST_KERNEL_PGD_NR,
(PTRS_PER_PGD - FIRST_KERNEL_PGD_NR) * sizeof(pgd_t));

- spin_unlock(&mm->page_table_lock);
-
/* update MEMC tables */
cpu_memc_update_all(new_pgd);
return new_pgd;

no_pte:
- spin_unlock(&mm->page_table_lock);
pmd_free(new_pmd);
- free_pgd_slow(new_pgd);
- return NULL;
-
no_pmd:
- spin_unlock(&mm->page_table_lock);
free_pgd_slow(new_pgd);
- return NULL;
-
no_pgd:
return NULL;
}
--- mm11/arch/sparc/mm/generic.c 2005-09-21 12:16:20.000000000 +0100
+++ mm12/arch/sparc/mm/generic.c 2005-10-11 23:56:41.000000000 +0100
@@ -78,9 +78,8 @@ int io_remap_pfn_range(struct vm_area_st
dir = pgd_offset(mm, from);
flush_cache_range(vma, beg, end);

- spin_lock(&mm->page_table_lock);
while (from < end) {
- pmd_t *pmd = pmd_alloc(current->mm, dir, from);
+ pmd_t *pmd = pmd_alloc(mm, dir, from);
error = -ENOMEM;
if (!pmd)
break;
@@ -90,7 +89,6 @@ int io_remap_pfn_range(struct vm_area_st
from = (from + PGDIR_SIZE) & PGDIR_MASK;
dir++;
}
- spin_unlock(&mm->page_table_lock);

flush_tlb_range(vma, beg, end);
return error;
--- mm11/arch/sparc64/mm/generic.c 2005-09-21 12:16:20.000000000 +0100
+++ mm12/arch/sparc64/mm/generic.c 2005-10-11 23:56:41.000000000 +0100
@@ -132,9 +132,8 @@ int io_remap_pfn_range(struct vm_area_st
dir = pgd_offset(mm, from);
flush_cache_range(vma, beg, end);

- spin_lock(&mm->page_table_lock);
while (from < end) {
- pud_t *pud = pud_alloc(current->mm, dir, from);
+ pud_t *pud = pud_alloc(mm, dir, from);
error = -ENOMEM;
if (!pud)
break;
@@ -144,8 +143,7 @@ int io_remap_pfn_range(struct vm_area_st
from = (from + PGDIR_SIZE) & PGDIR_MASK;
dir++;
}
- flush_tlb_range(vma, beg, end);
- spin_unlock(&mm->page_table_lock);

+ flush_tlb_range(vma, beg, end);
return error;
}
--- mm11/arch/um/kernel/skas/mmu.c 2005-09-21 12:16:21.000000000 +0100
+++ mm12/arch/um/kernel/skas/mmu.c 2005-10-11 23:56:41.000000000 +0100
@@ -28,7 +28,6 @@ static int init_stub_pte(struct mm_struc
pmd_t *pmd;
pte_t *pte;

- spin_lock(&mm->page_table_lock);
pgd = pgd_offset(mm, proc);
pud = pud_alloc(mm, pgd, proc);
if (!pud)
@@ -63,7 +62,6 @@ static int init_stub_pte(struct mm_struc
*pte = mk_pte(virt_to_page(kernel), __pgprot(_PAGE_PRESENT));
*pte = pte_mkexec(*pte);
*pte = pte_wrprotect(*pte);
- spin_unlock(&mm->page_table_lock);
return(0);

out_pmd:
@@ -71,7 +69,6 @@ static int init_stub_pte(struct mm_struc
out_pte:
pmd_free(pmd);
out:
- spin_unlock(&mm->page_table_lock);
return(-ENOMEM);
}

2005-10-13 01:17:26

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 15/21] mm: flush_tlb_range outside ptlock

There was one small but very significant change in the previous patch:
mprotect's flush_tlb_range fell outside the page_table_lock. That is how
it is in 2.4, but that doesn't prove it safe in 2.6.

On some architectures flush_tlb_range comes to the same as flush_tlb_mm,
which has always been called from outside page_table_lock in dup_mmap,
and is so proved safe. Others required a deeper audit: I could find no
reliance on page_table_lock in any; but in ia64 and parisc I found some
code which looks a bit as if it might want preemption disabled. That
won't do any actual harm, so pending a decision from the maintainers,
disable preemption there.
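
The ia64 and parisc changes are both of this shape (taken from the ia64
hunk below):

        preempt_disable();
        do {
                ia64_ptcl(start, (nbits<<2));   /* purges the local cpu's TLB */
                start += (1UL << nbits);
        } while (start < end);
        preempt_enable();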

Remove comments on page_table_lock from flush_tlb_mm, flush_tlb_range
and flush_tlb_page entries in cachetlb.txt: they were rather misleading
(what generic code does is different from what usually happens), the
rules are now changing, and it's not yet clear where we'll end up (will
the generic tlb_flush_mmu happen always under lock? never under lock?
or sometimes under and sometimes not?).

Signed-off-by: Hugh Dickins <[email protected]>
---

Documentation/cachetlb.txt | 9 ---------
arch/ia64/mm/tlb.c | 2 ++
include/asm-parisc/tlbflush.h | 3 ++-
3 files changed, 4 insertions(+), 10 deletions(-)

--- mm14/Documentation/cachetlb.txt 2005-06-17 20:48:29.000000000 +0100
+++ mm15/Documentation/cachetlb.txt 2005-10-11 23:57:25.000000000 +0100
@@ -49,9 +49,6 @@ changes occur:
page table operations such as what happens during
fork, and exec.

- Platform developers note that generic code will always
- invoke this interface without mm->page_table_lock held.
-
3) void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)

@@ -72,9 +69,6 @@ changes occur:
call flush_tlb_page (see below) for each entry which may be
modified.

- Platform developers note that generic code will always
- invoke this interface with mm->page_table_lock held.
-
4) void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)

This time we need to remove the PAGE_SIZE sized translation
@@ -93,9 +87,6 @@ changes occur:

This is used primarily during fault processing.

- Platform developers note that generic code will always
- invoke this interface with mm->page_table_lock held.
-
5) void flush_tlb_pgtables(struct mm_struct *mm,
unsigned long start, unsigned long end)

--- mm14/arch/ia64/mm/tlb.c 2005-06-17 20:48:29.000000000 +0100
+++ mm15/arch/ia64/mm/tlb.c 2005-10-11 23:57:25.000000000 +0100
@@ -155,10 +155,12 @@ flush_tlb_range (struct vm_area_struct *
# ifdef CONFIG_SMP
platform_global_tlb_purge(start, end, nbits);
# else
+ preempt_disable();
do {
ia64_ptcl(start, (nbits<<2));
start += (1UL << nbits);
} while (start < end);
+ preempt_enable();
# endif

ia64_srlz_i(); /* srlz.i implies srlz.d */
--- mm14/include/asm-parisc/tlbflush.h 2004-12-24 21:37:30.000000000 +0000
+++ mm15/include/asm-parisc/tlbflush.h 2005-10-11 23:57:25.000000000 +0100
@@ -69,7 +69,7 @@ static inline void flush_tlb_range(struc
if (npages >= 512) /* XXX arbitrary, should be tuned */
flush_tlb_all();
else {
-
+ preempt_disable();
mtsp(vma->vm_mm->context,1);
if (split_tlb) {
purge_tlb_start();
@@ -87,6 +87,7 @@ static inline void flush_tlb_range(struc
}
purge_tlb_end();
}
+ preempt_enable();
}
}

2005-10-13 01:15:34

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 14/21] mm: pte_offset_map_lock loops

Convert those common loops using page_table_lock on the outside and
pte_offset_map within to use just pte_offset_map_lock within instead.
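
The converted loops all take this shape (a sketch of the hunks below,
nothing more):

        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        do {
                /* ... inspect or update *pte under the pte lock ... */
        } while (pte++, addr += PAGE_SIZE, addr != end);
        pte_unmap_unlock(pte - 1, ptl);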

These all hold mmap_sem (some exclusively, some not), so at no level can
a page table be whipped away from beneath them. But whereas pte_alloc
loops tested with the "atomic" pmd_present, these loops are testing with
pmd_none, which on i386 PAE tests both lower and upper halves.

That's now unsafe, so add a cast into pmd_none to test only the vital
lower half: we lose a little sensitivity to a corrupt middle directory,
but not enough to worry about. It appears that i386 and UML were the only
architectures vulnerable in this way, and that pgd and pud are no problem.

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/proc/task_mmu.c | 17 ++++++-----------
include/asm-i386/pgtable.h | 3 ++-
include/asm-um/pgtable.h | 2 +-
mm/mempolicy.c | 7 +++----
mm/mprotect.c | 7 +++----
mm/msync.c | 21 ++++++---------------
mm/swapfile.c | 20 +++++++++-----------
7 files changed, 30 insertions(+), 47 deletions(-)

--- mm13/fs/proc/task_mmu.c 2005-10-11 23:54:33.000000000 +0100
+++ mm14/fs/proc/task_mmu.c 2005-10-11 23:57:10.000000000 +0100
@@ -203,13 +203,14 @@ static void smaps_pte_range(struct vm_ar
struct mem_size_stats *mss)
{
pte_t *pte, ptent;
+ spinlock_t *ptl;
unsigned long pfn;
struct page *page;

- pte = pte_offset_map(pmd, addr);
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
ptent = *pte;
- if (pte_none(ptent) || !pte_present(ptent))
+ if (!pte_present(ptent))
continue;

mss->resident += PAGE_SIZE;
@@ -230,8 +231,8 @@ static void smaps_pte_range(struct vm_ar
mss->private_clean += PAGE_SIZE;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(pte - 1);
- cond_resched_lock(&vma->vm_mm->page_table_lock);
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
}

static inline void smaps_pmd_range(struct vm_area_struct *vma, pud_t *pud,
@@ -285,17 +286,11 @@ static inline void smaps_pgd_range(struc
static int show_smap(struct seq_file *m, void *v)
{
struct vm_area_struct *vma = v;
- struct mm_struct *mm = vma->vm_mm;
struct mem_size_stats mss;

memset(&mss, 0, sizeof mss);
-
- if (mm) {
- spin_lock(&mm->page_table_lock);
+ if (vma->vm_mm)
smaps_pgd_range(vma, vma->vm_start, vma->vm_end, &mss);
- spin_unlock(&mm->page_table_lock);
- }
-
return show_map_internal(m, v, &mss);
}

--- mm13/include/asm-i386/pgtable.h 2005-09-30 11:59:09.000000000 +0100
+++ mm14/include/asm-i386/pgtable.h 2005-10-11 23:57:10.000000000 +0100
@@ -213,7 +213,8 @@ extern unsigned long pg0[];
#define pte_present(x) ((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE))
#define pte_clear(mm,addr,xp) do { set_pte_at(mm, addr, xp, __pte(0)); } while (0)

-#define pmd_none(x) (!pmd_val(x))
+/* To avoid harmful races, pmd_none(x) should check only the lower when PAE */
+#define pmd_none(x) (!(unsigned long)pmd_val(x))
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
#define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
--- mm13/include/asm-um/pgtable.h 2005-09-30 11:59:10.000000000 +0100
+++ mm14/include/asm-um/pgtable.h 2005-10-11 23:57:10.000000000 +0100
@@ -138,7 +138,7 @@ extern unsigned long pg0[1024];

#define pte_clear(mm,addr,xp) pte_set_val(*(xp), (phys_t) 0, __pgprot(_PAGE_NEWPAGE))

-#define pmd_none(x) (!(pmd_val(x) & ~_PAGE_NEWPAGE))
+#define pmd_none(x) (!((unsigned long)pmd_val(x) & ~_PAGE_NEWPAGE))
#define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
#define pmd_clear(xp) do { pmd_val(*(xp)) = _PAGE_NEWPAGE; } while (0)
--- mm13/mm/mempolicy.c 2005-10-11 12:16:50.000000000 +0100
+++ mm14/mm/mempolicy.c 2005-10-11 23:57:10.000000000 +0100
@@ -228,9 +228,9 @@ static int check_pte_range(struct vm_are
{
pte_t *orig_pte;
pte_t *pte;
+ spinlock_t *ptl;

- spin_lock(&vma->vm_mm->page_table_lock);
- orig_pte = pte = pte_offset_map(pmd, addr);
+ orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
unsigned long pfn;
unsigned int nid;
@@ -246,8 +246,7 @@ static int check_pte_range(struct vm_are
if (!node_isset(nid, *nodes))
break;
} while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(orig_pte);
- spin_unlock(&vma->vm_mm->page_table_lock);
+ pte_unmap_unlock(orig_pte, ptl);
return addr != end;
}

--- mm13/mm/mprotect.c 2005-10-11 12:16:50.000000000 +0100
+++ mm14/mm/mprotect.c 2005-10-11 23:57:10.000000000 +0100
@@ -29,8 +29,9 @@ static void change_pte_range(struct mm_s
unsigned long addr, unsigned long end, pgprot_t newprot)
{
pte_t *pte;
+ spinlock_t *ptl;

- pte = pte_offset_map(pmd, addr);
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
do {
if (pte_present(*pte)) {
pte_t ptent;
@@ -44,7 +45,7 @@ static void change_pte_range(struct mm_s
lazy_mmu_prot_update(ptent);
}
} while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(pte - 1);
+ pte_unmap_unlock(pte - 1, ptl);
}

static inline void change_pmd_range(struct mm_struct *mm, pud_t *pud,
@@ -88,7 +89,6 @@ static void change_protection(struct vm_
BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
- spin_lock(&mm->page_table_lock);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
@@ -96,7 +96,6 @@ static void change_protection(struct vm_
change_pud_range(mm, pgd, addr, next, newprot);
} while (pgd++, addr = next, addr != end);
flush_tlb_range(vma, start, end);
- spin_unlock(&mm->page_table_lock);
}

static int
--- mm13/mm/msync.c 2005-10-11 12:16:50.000000000 +0100
+++ mm14/mm/msync.c 2005-10-11 23:57:10.000000000 +0100
@@ -17,28 +17,22 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

-/*
- * Called with mm->page_table_lock held to protect against other
- * threads/the swapper from ripping pte's out from under us.
- */
-
static void msync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end)
{
- struct mm_struct *mm = vma->vm_mm;
pte_t *pte;
+ spinlock_t *ptl;
int progress = 0;

again:
- pte = pte_offset_map(pmd, addr);
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
unsigned long pfn;
struct page *page;

if (progress >= 64) {
progress = 0;
- if (need_resched() ||
- need_lockbreak(&mm->page_table_lock))
+ if (need_resched() || need_lockbreak(ptl))
break;
}
progress++;
@@ -58,8 +52,8 @@ again:
set_page_dirty(page);
progress += 3;
} while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(pte - 1);
- cond_resched_lock(&mm->page_table_lock);
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
if (addr != end)
goto again;
}
@@ -97,7 +91,6 @@ static inline void msync_pud_range(struc
static void msync_page_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
{
- struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
unsigned long next;

@@ -110,16 +103,14 @@ static void msync_page_range(struct vm_a
return;

BUG_ON(addr >= end);
- pgd = pgd_offset(mm, addr);
+ pgd = pgd_offset(vma->vm_mm, addr);
flush_cache_range(vma, addr, end);
- spin_lock(&mm->page_table_lock);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
msync_pud_range(vma, pgd, addr, next);
} while (pgd++, addr = next, addr != end);
- spin_unlock(&mm->page_table_lock);
}

/*
--- mm13/mm/swapfile.c 2005-09-30 11:59:12.000000000 +0100
+++ mm14/mm/swapfile.c 2005-10-11 23:57:10.000000000 +0100
@@ -399,8 +399,6 @@ void free_swap_and_cache(swp_entry_t ent
* No need to decide whether this PTE shares the swap entry with others,
* just let do_wp_page work it out if a write is requested later - to
* force COW, vm_page_prot omits write permission from any private vma.
- *
- * vma->vm_mm->page_table_lock is held.
*/
static void unuse_pte(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, swp_entry_t entry, struct page *page)
@@ -422,23 +420,25 @@ static int unuse_pte_range(struct vm_are
unsigned long addr, unsigned long end,
swp_entry_t entry, struct page *page)
{
- pte_t *pte;
pte_t swp_pte = swp_entry_to_pte(entry);
+ pte_t *pte;
+ spinlock_t *ptl;
+ int found = 0;

- pte = pte_offset_map(pmd, addr);
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
/*
* swapoff spends a _lot_ of time in this loop!
* Test inline before going to call unuse_pte.
*/
if (unlikely(pte_same(*pte, swp_pte))) {
- unuse_pte(vma, pte, addr, entry, page);
- pte_unmap(pte);
- return 1;
+ unuse_pte(vma, pte++, addr, entry, page);
+ found = 1;
+ break;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(pte - 1);
- return 0;
+ pte_unmap_unlock(pte - 1, ptl);
+ return found;
}

static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
@@ -520,12 +520,10 @@ static int unuse_mm(struct mm_struct *mm
down_read(&mm->mmap_sem);
lock_page(page);
}
- spin_lock(&mm->page_table_lock);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (vma->anon_vma && unuse_vma(vma, entry, page))
break;
}
- spin_unlock(&mm->page_table_lock);
up_read(&mm->mmap_sem);
/*
* Currently unuse_mm cannot fail, but leave error handling

2005-10-13 01:14:05

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 13/21] mm: page fault handler locking

On the page fault path, the patch before last pushed acquiring the
page_table_lock down to the head of handle_pte_fault (though it's also
taken and dropped earlier when a new page table has to be allocated).

Now delete that line, read "entry = *pte" without it, and go off to this
or that page fault handler on the basis of this unlocked peek. Usually
the handler can proceed without the lock, relying on the subsequent
locked pte_same or pte_none test to back out when necessary; though
do_wp_page needs the lock immediately, and do_file_page doesn't check
(if there's a race, install_page just zaps the entry and reinstalls it).

But on those architectures (notably i386 with PAE) whose pte is too big
to be read atomically, if SMP or preemption is enabled, do_swap_page and
do_file_page might cause irretrievable damage if passed a Frankenstein
entry stitched together from unrelated parts. In those configs,
"pte_unmap_same" has to take page_table_lock, validate that orig_pte is
still the same, and drop page_table_lock again before unmapping and proceeding.

Use pte_offset_map_lock and pte_unmap_unlock throughout the handlers;
but lock avoidance leaves more lone maps and unmaps than elsewhere.
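
So each handler roughly takes this shape (sketched from the hunks below):

        entry = *pte;                           /* unlocked, possibly non-atomic peek */
        /* ... choose a handler on the basis of entry, do any allocation or I/O ... */
        page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
        if (unlikely(!pte_same(*page_table, orig_pte)))
                goto unlock;                    /* someone else faulted it in: back out */
        /* ... commit the new pte ... */
unlock:
        pte_unmap_unlock(page_table, ptl);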

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 150 ++++++++++++++++++++++++++++++++++++------------------------
1 files changed, 90 insertions(+), 60 deletions(-)

--- mm12/mm/memory.c 2005-10-11 23:56:25.000000000 +0100
+++ mm13/mm/memory.c 2005-10-11 23:56:56.000000000 +0100
@@ -1219,6 +1219,30 @@ int remap_pfn_range(struct vm_area_struc
EXPORT_SYMBOL(remap_pfn_range);

/*
+ * handle_pte_fault chooses page fault handler according to an entry
+ * which was read non-atomically. Before making any commitment, on
+ * those architectures or configurations (e.g. i386 with PAE) which
+ * might give a mix of unmatched parts, do_swap_page and do_file_page
+ * must check under lock before unmapping the pte and proceeding
+ * (but do_wp_page is only called after already making such a check;
+ * and do_anonymous_page and do_no_page can safely check later on).
+ */
+static inline int pte_unmap_same(struct mm_struct *mm,
+ pte_t *page_table, pte_t orig_pte)
+{
+ int same = 1;
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
+ if (sizeof(pte_t) > sizeof(unsigned long)) {
+ spin_lock(&mm->page_table_lock);
+ same = pte_same(*page_table, orig_pte);
+ spin_unlock(&mm->page_table_lock);
+ }
+#endif
+ pte_unmap(page_table);
+ return same;
+}
+
+/*
* Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
* servicing faults for write access. In the normal case, do always want
* pte_mkwrite. But get_user_pages can cause write faults for mappings
@@ -1245,12 +1269,13 @@ static inline pte_t maybe_mkwrite(pte_t
* change only once the write actually happens. This avoids a few races,
* and potentially makes it more efficient.
*
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), with pte both mapped and locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
- pte_t orig_pte)
+ spinlock_t *ptl, pte_t orig_pte)
{
struct page *old_page, *new_page;
unsigned long pfn = pte_pfn(orig_pte);
@@ -1288,8 +1313,7 @@ static int do_wp_page(struct mm_struct *
* Ok, we need to copy. Oh, well..
*/
page_cache_get(old_page);
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(page_table, ptl);

if (unlikely(anon_vma_prepare(vma)))
goto oom;
@@ -1307,8 +1331,7 @@ static int do_wp_page(struct mm_struct *
/*
* Re-check the pte - we dropped the lock
*/
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, address);
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
page_remove_rmap(old_page);
if (!PageAnon(old_page)) {
@@ -1321,7 +1344,6 @@ static int do_wp_page(struct mm_struct *
ptep_establish(vma, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
-
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);

@@ -1332,8 +1354,7 @@ static int do_wp_page(struct mm_struct *
page_cache_release(new_page);
page_cache_release(old_page);
unlock:
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(page_table, ptl);
return ret;
oom:
page_cache_release(old_page);
@@ -1660,20 +1681,22 @@ void swapin_readahead(swp_entry_t entry,
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
int write_access, pte_t orig_pte)
{
+ spinlock_t *ptl;
struct page *page;
swp_entry_t entry;
pte_t pte;
int ret = VM_FAULT_MINOR;

- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ if (!pte_unmap_same(mm, page_table, orig_pte))
+ goto out;

entry = pte_to_swp_entry(orig_pte);
page = lookup_swap_cache(entry);
@@ -1682,11 +1705,10 @@ static int do_swap_page(struct mm_struct
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
+ * while we released the pte lock.
*/
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, address);
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte)))
ret = VM_FAULT_OOM;
goto unlock;
@@ -1702,11 +1724,9 @@ static int do_swap_page(struct mm_struct
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else already faulted in this pte.
*/
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, address);
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (unlikely(!pte_same(*page_table, orig_pte)))
goto out_nomap;

@@ -1735,7 +1755,7 @@ static int do_swap_page(struct mm_struct

if (write_access) {
if (do_wp_page(mm, vma, address,
- page_table, pmd, pte) == VM_FAULT_OOM)
+ page_table, pmd, ptl, pte) == VM_FAULT_OOM)
ret = VM_FAULT_OOM;
goto out;
}
@@ -1744,37 +1764,32 @@ static int do_swap_page(struct mm_struct
update_mmu_cache(vma, address, pte);
lazy_mmu_prot_update(pte);
unlock:
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(page_table, ptl);
out:
return ret;
out_nomap:
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(page_table, ptl);
unlock_page(page);
page_cache_release(page);
return ret;
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
int write_access)
{
- struct page *page = ZERO_PAGE(addr);
+ struct page *page;
+ spinlock_t *ptl;
pte_t entry;

- /* Mapping of ZERO_PAGE - vm_page_prot is readonly */
- entry = mk_pte(page, vma->vm_page_prot);
-
if (write_access) {
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
goto oom;
@@ -1782,23 +1797,28 @@ static int do_anonymous_page(struct mm_s
if (!page)
goto oom;

- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, address);
-
- if (!pte_none(*page_table)) {
- page_cache_release(page);
- goto unlock;
- }
- inc_mm_counter(mm, anon_rss);
entry = mk_pte(page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_none(*page_table))
+ goto release;
+ inc_mm_counter(mm, anon_rss);
lru_cache_add_active(page);
SetPageReferenced(page);
page_add_anon_rmap(page, vma, address);
} else {
+ /* Map the ZERO_PAGE - vm_page_prot is readonly */
+ page = ZERO_PAGE(address);
+ page_cache_get(page);
+ entry = mk_pte(page, vma->vm_page_prot);
+
+ ptl = &mm->page_table_lock;
+ spin_lock(ptl);
+ if (!pte_none(*page_table))
+ goto release;
inc_mm_counter(mm, file_rss);
page_add_file_rmap(page);
- page_cache_get(page);
}

set_pte_at(mm, address, page_table, entry);
@@ -1807,9 +1827,11 @@ static int do_anonymous_page(struct mm_s
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
unlock:
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(page_table, ptl);
return VM_FAULT_MINOR;
+release:
+ page_cache_release(page);
+ goto unlock;
oom:
return VM_FAULT_OOM;
}
@@ -1823,13 +1845,15 @@ oom:
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
int write_access)
{
+ spinlock_t *ptl;
struct page *new_page;
struct address_space *mapping = NULL;
pte_t entry;
@@ -1838,7 +1862,6 @@ static int do_no_page(struct mm_struct *
int anon = 0;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1878,21 +1901,20 @@ retry:
anon = 1;
}

- spin_lock(&mm->page_table_lock);
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
/*
* For a file-backed vma, someone could have truncated or otherwise
* invalidated this page. If unmap_mapping_range got called,
* retry getting the page.
*/
if (mapping && unlikely(sequence != mapping->truncate_count)) {
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(page_table, ptl);
page_cache_release(new_page);
cond_resched();
sequence = mapping->truncate_count;
smp_rmb();
goto retry;
}
- page_table = pte_offset_map(pmd, address);

/*
* This silly early PAGE_DIRTY setting removes a race
@@ -1929,8 +1951,7 @@ retry:
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
unlock:
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(page_table, ptl);
return ret;
oom:
page_cache_release(new_page);
@@ -1941,6 +1962,10 @@ oom:
* Fault of a previously existing named mapping. Repopulate the pte
* from the encoded file_pte if possible. This enables swappable
* nonlinear vmas.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_file_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
@@ -1949,8 +1974,8 @@ static int do_file_page(struct mm_struct
pgoff_t pgoff;
int err;

- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ if (!pte_unmap_same(mm, page_table, orig_pte))
+ return VM_FAULT_MINOR;

if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
/*
@@ -1989,8 +2014,8 @@ static inline int handle_pte_fault(struc
pte_t *pte, pmd_t *pmd, int write_access)
{
pte_t entry;
+ spinlock_t *ptl;

- spin_lock(&mm->page_table_lock);
entry = *pte;
if (!pte_present(entry)) {
if (pte_none(entry)) {
@@ -2007,17 +2032,22 @@ static inline int handle_pte_fault(struc
pte, pmd, write_access, entry);
}

+ ptl = &mm->page_table_lock;
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*pte, entry)))
+ goto unlock;
if (write_access) {
if (!pte_write(entry))
- return do_wp_page(mm, vma, address, pte, pmd, entry);
+ return do_wp_page(mm, vma, address,
+ pte, pmd, ptl, entry);
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
ptep_set_access_flags(vma, address, pte, entry, write_access);
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
- pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+unlock:
+ pte_unmap_unlock(pte, ptl);
return VM_FAULT_MINOR;
}

2005-10-13 01:17:47

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 06/21] mm: update_hiwaters just in time

Looks fine to me. Great innovative idea on how to reduce the resources
needed for the counters.

Jay, what do you say?

2005-10-13 01:18:20

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 16/21] mm: unlink vma before pagetables

In most places the descent from pgd to pud to pmd to pte holds mmap_sem
(exclusively or not), which ensures that free_pgtables cannot be freeing
page tables from any level at the same time. But truncation and reverse
mapping descend without mmap_sem.

No problem: just make sure that a vma is unlinked from its prio_tree (or
nonlinear list) and from its anon_vma list, after zapping the vma, but
before freeing its page tables. Then neither vmtruncate nor rmap can
reach that vma whose page tables are now volatile (nor do they need to
reach it, since all its page entries have been zapped by this stage).

The i_mmap_lock and anon_vma->lock already serialize this correctly;
but the locking hierarchy is such that we cannot take them while holding
page_table_lock. Well, we're trying to push that down anyway. So in
this patch, move anon_vma_unlink and unlink_file_vma into free_pgtables,
at the same time as moving page_table_lock around calls to unmap_vmas.

tlb_gather_mmu and tlb_finish_mmu then fall outside the page_table_lock,
but we made them preempt_disable and preempt_enable earlier; and a long
source audit of all the architectures has shown no problem with removing
page_table_lock from them. free_pgtables doesn't need page_table_lock
for itself, nor for what it calls; tlb->mm->nr_ptes is usually protected
by page_table_lock, but partly by non-exclusive mmap_sem - here it's
decremented with exclusive mmap_sem, or mm_users 0. update_hiwater_rss
and vm_unacct_memory don't need page_table_lock either.
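
In outline, the ordering free_pgtables now enforces looks like this (a
simplified sketch of the hunk below, with the hugepage handling and the
coalescing of adjacent vmas left out):

	while (vma) {
		struct vm_area_struct *next = vma->vm_next;

		/* hide vma from rmap and vmtruncate first ... */
		anon_vma_unlink(vma);
		unlink_file_vma(vma);

		/* ... only then free its page tables */
		free_pgd_range(tlb, vma->vm_start, vma->vm_end,
				floor, next? next->vm_start: ceiling);
		vma = next;
	}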

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 12 ++++++++++--
mm/mmap.c | 23 ++++++-----------------
2 files changed, 16 insertions(+), 19 deletions(-)

--- mm15/mm/memory.c 2005-10-11 23:56:56.000000000 +0100
+++ mm16/mm/memory.c 2005-10-11 23:57:45.000000000 +0100
@@ -260,6 +260,12 @@ void free_pgtables(struct mmu_gather **t
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;

+ /*
+ * Hide vma from rmap and vmtruncate before freeing pgtables
+ */
+ anon_vma_unlink(vma);
+ unlink_file_vma(vma);
+
if (is_hugepage_only_range(vma->vm_mm, addr, HPAGE_SIZE)) {
hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
@@ -272,6 +278,8 @@ void free_pgtables(struct mmu_gather **t
HPAGE_SIZE)) {
vma = next;
next = vma->vm_next;
+ anon_vma_unlink(vma);
+ unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
@@ -798,12 +806,12 @@ unsigned long zap_page_range(struct vm_a
}

lru_add_drain();
- spin_lock(&mm->page_table_lock);
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
+ spin_lock(&mm->page_table_lock);
end = unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, details);
- tlb_finish_mmu(tlb, address, end);
spin_unlock(&mm->page_table_lock);
+ tlb_finish_mmu(tlb, address, end);
return end;
}

--- mm15/mm/mmap.c 2005-10-11 23:55:38.000000000 +0100
+++ mm16/mm/mmap.c 2005-10-11 23:57:46.000000000 +0100
@@ -199,14 +199,6 @@ static struct vm_area_struct *remove_vma
{
struct vm_area_struct *next = vma->vm_next;

- /*
- * Hide vma from rmap and vmtruncate before freeing page tables:
- * to be moved into free_pgtables once page_table_lock is lifted
- * from it, but until then lock ordering forbids that move.
- */
- anon_vma_unlink(vma);
- unlink_file_vma(vma);
-
might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
@@ -1675,15 +1667,15 @@ static void unmap_region(struct mm_struc
unsigned long nr_accounted = 0;

lru_add_drain();
- spin_lock(&mm->page_table_lock);
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
+ spin_lock(&mm->page_table_lock);
unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL);
+ spin_unlock(&mm->page_table_lock);
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
- spin_unlock(&mm->page_table_lock);
}

/*
@@ -1958,23 +1950,20 @@ void exit_mmap(struct mm_struct *mm)
unsigned long end;

lru_add_drain();
-
- spin_lock(&mm->page_table_lock);
-
flush_cache_mm(mm);
tlb = tlb_gather_mmu(mm, 1);
/* Don't update_hiwater_rss(mm) here, do_exit already did */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
+ spin_lock(&mm->page_table_lock);
end = unmap_vmas(&tlb, mm, vma, 0, -1, &nr_accounted, NULL);
+ spin_unlock(&mm->page_table_lock);
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);

- spin_unlock(&mm->page_table_lock);
-
/*
- * Walk the list again, actually closing and freeing it
- * without holding any MM locks.
+ * Walk the list again, actually closing and freeing it,
+ * with preemption enabled, without holding any MM locks.
*/
while (vma)
vma = remove_vma(vma);

2005-10-13 01:19:38

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 17/21] mm: unmap_vmas with inner ptlock

Remove the page_table_lock from around the calls to unmap_vmas, and
replace the pte_offset_map in zap_pte_range by pte_offset_map_lock:
all callers are now safe to descend without page_table_lock.
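
The pte_offset_map_lock idiom that zap_pte_range switches to (and which
the rest of the series uses throughout) is, in rough outline:

	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);	/* map + lock */
	do {
		pte_t ptent = *pte;
		if (pte_none(ptent))
			continue;
		/* ... clear the entry, free the page, adjust rss ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);			/* unlock + unmap */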

Don't attempt fancy locking for hugepages: just take page_table_lock in
unmap_hugepage_range. That makes zap_hugepage_range, and the hugetlb
test in zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range
anyway. Nor does unmap_vmas have much use for its mm arg now.

The tlb_start_vma and tlb_end_vma in unmap_page_range are now called
without page_table_lock: if they're implemented at all, they typically
come down to flush_cache_range (usually done outside page_table_lock)
and flush_tlb_range (which we already audited for the mprotect case).

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/hugetlbfs/inode.c | 4 ++--
include/linux/hugetlb.h | 2 --
include/linux/mm.h | 2 +-
mm/hugetlb.c | 12 +++---------
mm/memory.c | 41 ++++++++++++-----------------------------
mm/mmap.c | 8 ++------
6 files changed, 20 insertions(+), 49 deletions(-)

--- mm16/fs/hugetlbfs/inode.c 2005-09-30 11:59:08.000000000 +0100
+++ mm17/fs/hugetlbfs/inode.c 2005-10-11 23:57:59.000000000 +0100
@@ -92,7 +92,7 @@ out:
}

/*
- * Called under down_write(mmap_sem), page_table_lock is not held
+ * Called under down_write(mmap_sem).
*/

#ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
@@ -297,7 +297,7 @@ hugetlb_vmtruncate_list(struct prio_tree

v_length = vma->vm_end - vma->vm_start;

- zap_hugepage_range(vma,
+ unmap_hugepage_range(vma,
vma->vm_start + v_offset,
v_length - v_offset);
}
--- mm16/include/linux/hugetlb.h 2005-09-21 12:16:56.000000000 +0100
+++ mm17/include/linux/hugetlb.h 2005-10-11 23:57:59.000000000 +0100
@@ -16,7 +16,6 @@ static inline int is_vm_hugetlb_page(str
int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
-void zap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
int hugetlb_report_meminfo(char *);
@@ -85,7 +84,6 @@ static inline unsigned long hugetlb_tota
#define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
-#define zap_hugepage_range(vma, start, len) BUG()
#define unmap_hugepage_range(vma, start, end) BUG()
#define is_hugepage_mem_enough(size) 0
#define hugetlb_report_meminfo(buf) 0
--- mm16/include/linux/mm.h 2005-10-11 23:56:25.000000000 +0100
+++ mm17/include/linux/mm.h 2005-10-11 23:57:59.000000000 +0100
@@ -687,7 +687,7 @@ struct zap_details {

unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb, struct mm_struct *mm,
+unsigned long unmap_vmas(struct mmu_gather **tlb,
struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
--- mm16/mm/hugetlb.c 2005-10-11 23:56:25.000000000 +0100
+++ mm17/mm/hugetlb.c 2005-10-11 23:57:59.000000000 +0100
@@ -313,6 +313,8 @@ void unmap_hugepage_range(struct vm_area
BUG_ON(start & ~HPAGE_MASK);
BUG_ON(end & ~HPAGE_MASK);

+ spin_lock(&mm->page_table_lock);
+
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);

@@ -332,17 +334,9 @@ void unmap_hugepage_range(struct vm_area
put_page(page);
add_mm_counter(mm, file_rss, - (HPAGE_SIZE / PAGE_SIZE));
}
- flush_tlb_range(vma, start, end);
-}

-void zap_hugepage_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long length)
-{
- struct mm_struct *mm = vma->vm_mm;
-
- spin_lock(&mm->page_table_lock);
- unmap_hugepage_range(vma, start, start + length);
spin_unlock(&mm->page_table_lock);
+ flush_tlb_range(vma, start, end);
}

int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
--- mm16/mm/memory.c 2005-10-11 23:57:45.000000000 +0100
+++ mm17/mm/memory.c 2005-10-11 23:57:59.000000000 +0100
@@ -551,10 +551,11 @@ static void zap_pte_range(struct mmu_gat
{
struct mm_struct *mm = tlb->mm;
pte_t *pte;
+ spinlock_t *ptl;
int file_rss = 0;
int anon_rss = 0;

- pte = pte_offset_map(pmd, addr);
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
do {
pte_t ptent = *pte;
if (pte_none(ptent))
@@ -621,7 +622,7 @@ static void zap_pte_range(struct mmu_gat
} while (pte++, addr += PAGE_SIZE, addr != end);

add_mm_rss(mm, file_rss, anon_rss);
- pte_unmap(pte - 1);
+ pte_unmap_unlock(pte - 1, ptl);
}

static inline void zap_pmd_range(struct mmu_gather *tlb,
@@ -690,7 +691,6 @@ static void unmap_page_range(struct mmu_
/**
* unmap_vmas - unmap a range of memory covered by a list of vma's
* @tlbp: address of the caller's struct mmu_gather
- * @mm: the controlling mm_struct
* @vma: the starting vma
* @start_addr: virtual address at which to start unmapping
* @end_addr: virtual address at which to end unmapping
@@ -699,10 +699,10 @@ static void unmap_page_range(struct mmu_
*
* Returns the end address of the unmapping (restart addr if interrupted).
*
- * Unmap all pages in the vma list. Called under page_table_lock.
+ * Unmap all pages in the vma list.
*
- * We aim to not hold page_table_lock for too long (for scheduling latency
- * reasons). So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to
+ * We aim to not hold locks for too long (for scheduling latency reasons).
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to
* return the ending mmu_gather to the caller.
*
* Only addresses between `start' and `end' will be unmapped.
@@ -714,7 +714,7 @@ static void unmap_page_range(struct mmu_
* ensure that any thus-far unmapped pages are flushed before unmap_vmas()
* drops the lock and schedules.
*/
-unsigned long unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
+unsigned long unmap_vmas(struct mmu_gather **tlbp,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
@@ -764,19 +764,15 @@ unsigned long unmap_vmas(struct mmu_gath
tlb_finish_mmu(*tlbp, tlb_start, start);

if (need_resched() ||
- need_lockbreak(&mm->page_table_lock) ||
(i_mmap_lock && need_lockbreak(i_mmap_lock))) {
if (i_mmap_lock) {
- /* must reset count of rss freed */
- *tlbp = tlb_gather_mmu(mm, fullmm);
+ *tlbp = NULL;
goto out;
}
- spin_unlock(&mm->page_table_lock);
cond_resched();
- spin_lock(&mm->page_table_lock);
}

- *tlbp = tlb_gather_mmu(mm, fullmm);
+ *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
zap_bytes = ZAP_BLOCK_SIZE;
}
@@ -800,18 +796,12 @@ unsigned long zap_page_range(struct vm_a
unsigned long end = address + size;
unsigned long nr_accounted = 0;

- if (is_vm_hugetlb_page(vma)) {
- zap_hugepage_range(vma, address, size);
- return end;
- }
-
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
- spin_lock(&mm->page_table_lock);
- end = unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, details);
- spin_unlock(&mm->page_table_lock);
- tlb_finish_mmu(tlb, address, end);
+ end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
+ if (tlb)
+ tlb_finish_mmu(tlb, address, end);
return end;
}

@@ -1434,13 +1424,6 @@ again:

restart_addr = zap_page_range(vma, start_addr,
end_addr - start_addr, details);
-
- /*
- * We cannot rely on the break test in unmap_vmas:
- * on the one hand, we don't want to restart our loop
- * just because that broke out for the page_table_lock;
- * on the other hand, it does no test when vma is small.
- */
need_break = need_resched() ||
need_lockbreak(details->i_mmap_lock);

--- mm16/mm/mmap.c 2005-10-11 23:57:46.000000000 +0100
+++ mm17/mm/mmap.c 2005-10-11 23:57:59.000000000 +0100
@@ -1669,9 +1669,7 @@ static void unmap_region(struct mm_struc
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
- spin_lock(&mm->page_table_lock);
- unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL);
- spin_unlock(&mm->page_table_lock);
+ unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
@@ -1954,9 +1952,7 @@ void exit_mmap(struct mm_struct *mm)
tlb = tlb_gather_mmu(mm, 1);
/* Don't update_hiwater_rss(mm) here, do_exit already did */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- spin_lock(&mm->page_table_lock);
- end = unmap_vmas(&tlb, mm, vma, 0, -1, &nr_accounted, NULL);
- spin_unlock(&mm->page_table_lock);
+ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);

2005-10-13 01:21:07

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 18/21] mm: xip_unmap ZERO_PAGE fix

Small fix to the PageReserved patch: the mips ZERO_PAGE(address) depends
on address, so __xip_unmap is wrong to initialize page with that before
address is initialized; and in fact must re-evaluate it each iteration.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/filemap_xip.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletion(-)

--- mm17/mm/filemap_xip.c 2005-10-11 12:16:50.000000000 +0100
+++ mm18/mm/filemap_xip.c 2005-10-11 23:58:14.000000000 +0100
@@ -174,7 +174,7 @@ __xip_unmap (struct address_space * mapp
unsigned long address;
pte_t *pte;
pte_t pteval;
- struct page *page = ZERO_PAGE(address);
+ struct page *page;

spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
@@ -182,6 +182,7 @@ __xip_unmap (struct address_space * mapp
address = vma->vm_start +
((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+ page = ZERO_PAGE(address);
/*
* We need the page_table_lock to protect us from page faults,
* munmap, fork, etc...

2005-10-13 01:21:53

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 19/21] mm: rmap with inner ptlock

rmap's page_check_address now descends without page_table_lock: first
just pte_offset_map, in case there's no pte present worth locking for;
then take page_table_lock for the full check, and pass ptl back to the
caller in the same style as pte_offset_map_lock. __xip_unmap,
page_referenced_one and try_to_unmap_one use pte_unmap_unlock; so does
try_to_unmap_cluster.

page_check_address is reformatted to avoid progressive indentation.
No use was made of its one error code, so it now returns NULL when it
fails.
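
From the caller's side, page_referenced_one, try_to_unmap_one and
__xip_unmap now all follow roughly the same shape:

	spinlock_t *ptl;
	pte_t *pte;

	pte = page_check_address(page, mm, address, &ptl);
	if (!pte)
		goto out;	/* no pte mapping this page here */

	/* ... examine or clear *pte under the returned lock ... */

	pte_unmap_unlock(pte, ptl);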

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/rmap.h | 4 -
mm/filemap_xip.c | 12 +----
mm/rmap.c | 109 +++++++++++++++++++++++++--------------------------
3 files changed, 60 insertions(+), 65 deletions(-)

--- mm18/include/linux/rmap.h 2005-08-29 00:41:01.000000000 +0100
+++ mm19/include/linux/rmap.h 2005-10-11 23:58:30.000000000 +0100
@@ -95,8 +95,8 @@ int try_to_unmap(struct page *);
/*
* Called from mm/filemap_xip.c to unmap empty zero page
*/
-pte_t *page_check_address(struct page *, struct mm_struct *, unsigned long);
-
+pte_t *page_check_address(struct page *, struct mm_struct *,
+ unsigned long, spinlock_t **);

/*
* Used by swapoff to help locate where page is expected in vma.
--- mm18/mm/filemap_xip.c 2005-10-11 23:58:14.000000000 +0100
+++ mm19/mm/filemap_xip.c 2005-10-11 23:58:30.000000000 +0100
@@ -174,6 +174,7 @@ __xip_unmap (struct address_space * mapp
unsigned long address;
pte_t *pte;
pte_t pteval;
+ spinlock_t *ptl;
struct page *page;

spin_lock(&mapping->i_mmap_lock);
@@ -183,20 +184,15 @@ __xip_unmap (struct address_space * mapp
((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
page = ZERO_PAGE(address);
- /*
- * We need the page_table_lock to protect us from page faults,
- * munmap, fork, etc...
- */
- pte = page_check_address(page, mm, address);
- if (!IS_ERR(pte)) {
+ pte = page_check_address(page, mm, address, &ptl);
+ if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
page_remove_rmap(page);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
- pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(pte, ptl);
page_cache_release(page);
}
}
--- mm18/mm/rmap.c 2005-10-11 23:54:33.000000000 +0100
+++ mm19/mm/rmap.c 2005-10-11 23:58:30.000000000 +0100
@@ -247,34 +247,41 @@ unsigned long page_address_in_vma(struct
* On success returns with mapped pte and locked mm->page_table_lock.
*/
pte_t *page_check_address(struct page *page, struct mm_struct *mm,
- unsigned long address)
+ unsigned long address, spinlock_t **ptlp)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ spinlock_t *ptl;

- /*
- * We need the page_table_lock to protect us from page faults,
- * munmap, fork, etc...
- */
- spin_lock(&mm->page_table_lock);
pgd = pgd_offset(mm, address);
- if (likely(pgd_present(*pgd))) {
- pud = pud_offset(pgd, address);
- if (likely(pud_present(*pud))) {
- pmd = pmd_offset(pud, address);
- if (likely(pmd_present(*pmd))) {
- pte = pte_offset_map(pmd, address);
- if (likely(pte_present(*pte) &&
- page_to_pfn(page) == pte_pfn(*pte)))
- return pte;
- pte_unmap(pte);
- }
- }
+ if (!pgd_present(*pgd))
+ return NULL;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return NULL;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return NULL;
+
+ pte = pte_offset_map(pmd, address);
+ /* Make a quick check before getting the lock */
+ if (!pte_present(*pte)) {
+ pte_unmap(pte);
+ return NULL;
+ }
+
+ ptl = &mm->page_table_lock;
+ spin_lock(ptl);
+ if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
+ *ptlp = ptl;
+ return pte;
}
- spin_unlock(&mm->page_table_lock);
- return ERR_PTR(-ENOENT);
+ pte_unmap_unlock(pte, ptl);
+ return NULL;
}

/*
@@ -287,28 +294,28 @@ static int page_referenced_one(struct pa
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
pte_t *pte;
+ spinlock_t *ptl;
int referenced = 0;

address = vma_address(page, vma);
if (address == -EFAULT)
goto out;

- pte = page_check_address(page, mm, address);
- if (!IS_ERR(pte)) {
- if (ptep_clear_flush_young(vma, address, pte))
- referenced++;
+ pte = page_check_address(page, mm, address, &ptl);
+ if (!pte)
+ goto out;

- /* Pretend the page is referenced if the task has the
- swap token and is in the middle of a page fault. */
- if (mm != current->mm && !ignore_token &&
- has_swap_token(mm) &&
- rwsem_is_locked(&mm->mmap_sem))
- referenced++;
+ if (ptep_clear_flush_young(vma, address, pte))
+ referenced++;

- (*mapcount)--;
- pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
- }
+ /* Pretend the page is referenced if the task has the
+ swap token and is in the middle of a page fault. */
+ if (mm != current->mm && !ignore_token && has_swap_token(mm) &&
+ rwsem_is_locked(&mm->mmap_sem))
+ referenced++;
+
+ (*mapcount)--;
+ pte_unmap_unlock(pte, ptl);
out:
return referenced;
}
@@ -507,14 +514,15 @@ static int try_to_unmap_one(struct page
unsigned long address;
pte_t *pte;
pte_t pteval;
+ spinlock_t *ptl;
int ret = SWAP_AGAIN;

address = vma_address(page, vma);
if (address == -EFAULT)
goto out;

- pte = page_check_address(page, mm, address);
- if (IS_ERR(pte))
+ pte = page_check_address(page, mm, address, &ptl);
+ if (!pte)
goto out;

/*
@@ -564,8 +572,7 @@ static int try_to_unmap_one(struct page
page_cache_release(page);

out_unmap:
- pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(pte, ptl);
out:
return ret;
}
@@ -599,19 +606,14 @@ static void try_to_unmap_cluster(unsigne
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
- pte_t *pte, *original_pte;
+ pte_t *pte;
pte_t pteval;
+ spinlock_t *ptl;
struct page *page;
unsigned long address;
unsigned long end;
unsigned long pfn;

- /*
- * We need the page_table_lock to protect us from page faults,
- * munmap, fork, etc...
- */
- spin_lock(&mm->page_table_lock);
-
address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
if (address < vma->vm_start)
@@ -621,22 +623,22 @@ static void try_to_unmap_cluster(unsigne

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
- goto out_unlock;
+ return;

pud = pud_offset(pgd, address);
if (!pud_present(*pud))
- goto out_unlock;
+ return;

pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
- goto out_unlock;
+ return;
+
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);

/* Update high watermark before we lower rss */
update_hiwater_rss(mm);

- for (original_pte = pte = pte_offset_map(pmd, address);
- address < end; pte++, address += PAGE_SIZE) {
-
+ for (; address < end; pte++, address += PAGE_SIZE) {
if (!pte_present(*pte))
continue;

@@ -669,10 +671,7 @@ static void try_to_unmap_cluster(unsigne
dec_mm_counter(mm, file_rss);
(*mapcount)--;
}
-
- pte_unmap(original_pte);
-out_unlock:
- spin_unlock(&mm->page_table_lock);
+ pte_unmap_unlock(pte - 1, ptl);
}

static int try_to_unmap_anon(struct page *page)

2005-10-13 01:23:42

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 20/21] mm: kill check_user_page_readable

check_user_page_readable is a problematic variant of follow_page. It's
used only by oprofile's i386 and arm backtrace code, at interrupt time,
to establish whether a userspace stackframe is currently readable.

This is problematic, because we want to push the page_table_lock down
inside follow_page, and later split it; whereas oprofile is doing a
spin_trylock on it (in the i386 case, forgotten in the arm case), and
needs that to pin perhaps two pages spanned by the stackframe (which
might be covered by different locks when we split).

I think oprofile is going about this in the wrong way: it doesn't need
to know the area is readable (neither i386 nor arm uses read protection
of user pages), it doesn't need to pin the memory, it should simply
__copy_from_user_inatomic, and see if that succeeds or not. Sorry, but
I've not got around to devising the sparse __user annotations for this.

Then we can eliminate check_user_page_readable, and return to a single
follow_page without the __follow_page variants.
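
The replacement pattern, sketched from the i386 hunk below (arm is the
same shape, with frame_tail in place of frame_head):

	struct frame_head bufhead[2];

	/* also check accessibility of one frame_head beyond */
	if (!access_ok(VERIFY_READ, head, sizeof(bufhead)))
		return NULL;
	if (__copy_from_user_inatomic(bufhead, head, sizeof(bufhead)))
		return NULL;	/* faulted: frame not readable */

	oprofile_add_trace(bufhead[0].ret);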

Signed-off-by: Hugh Dickins <[email protected]>
---

arch/arm/oprofile/backtrace.c | 46 ++++++++---------------------------------
arch/i386/oprofile/backtrace.c | 38 +++++++++++----------------------
include/linux/mm.h | 1
mm/memory.c | 29 +++----------------------
4 files changed, 26 insertions(+), 88 deletions(-)

--- mm19/arch/arm/oprofile/backtrace.c 2005-08-29 00:41:01.000000000 +0100
+++ mm20/arch/arm/oprofile/backtrace.c 2005-10-11 23:58:43.000000000 +0100
@@ -49,42 +49,22 @@ static struct frame_tail* kernel_backtra

static struct frame_tail* user_backtrace(struct frame_tail *tail)
{
- struct frame_tail buftail;
+ struct frame_tail buftail[2];

- /* hardware pte might not be valid due to dirty/accessed bit emulation
- * so we use copy_from_user and benefit from exception fixups */
- if (copy_from_user(&buftail, tail, sizeof(struct frame_tail)))
+ /* Also check accessibility of one struct frame_tail beyond */
+ if (!access_ok(VERIFY_READ, tail, sizeof(buftail)))
+ return NULL;
+ if (__copy_from_user_inatomic(buftail, tail, sizeof(buftail)))
return NULL;

- oprofile_add_trace(buftail.lr);
+ oprofile_add_trace(buftail[0].lr);

/* frame pointers should strictly progress back up the stack
* (towards higher addresses) */
- if (tail >= buftail.fp)
+ if (tail >= buftail[0].fp)
return NULL;

- return buftail.fp-1;
-}
-
-/* Compare two addresses and see if they're on the same page */
-#define CMP_ADDR_EQUAL(x,y,offset) ((((unsigned long) x) >> PAGE_SHIFT) \
- == ((((unsigned long) y) + offset) >> PAGE_SHIFT))
-
-/* check that the page(s) containing the frame tail are present */
-static int pages_present(struct frame_tail *tail)
-{
- struct mm_struct * mm = current->mm;
-
- if (!check_user_page_readable(mm, (unsigned long)tail))
- return 0;
-
- if (CMP_ADDR_EQUAL(tail, tail, 8))
- return 1;
-
- if (!check_user_page_readable(mm, ((unsigned long)tail) + 8))
- return 0;
-
- return 1;
+ return buftail[0].fp-1;
}

/*
@@ -118,7 +98,6 @@ static int valid_kernel_stack(struct fra
void arm_backtrace(struct pt_regs * const regs, unsigned int depth)
{
struct frame_tail *tail;
- unsigned long last_address = 0;

tail = ((struct frame_tail *) regs->ARM_fp) - 1;

@@ -132,13 +111,6 @@ void arm_backtrace(struct pt_regs * cons
return;
}

- while (depth-- && tail && !((unsigned long) tail & 3)) {
- if ((!CMP_ADDR_EQUAL(last_address, tail, 0)
- || !CMP_ADDR_EQUAL(last_address, tail, 8))
- && !pages_present(tail))
- return;
- last_address = (unsigned long) tail;
+ while (depth-- && tail && !((unsigned long) tail & 3))
tail = user_backtrace(tail);
- }
}
-
--- mm19/arch/i386/oprofile/backtrace.c 2005-08-29 00:41:01.000000000 +0100
+++ mm20/arch/i386/oprofile/backtrace.c 2005-10-11 23:58:43.000000000 +0100
@@ -12,6 +12,7 @@
#include <linux/sched.h>
#include <linux/mm.h>
#include <asm/ptrace.h>
+#include <asm/uaccess.h>

struct frame_head {
struct frame_head * ebp;
@@ -21,26 +22,22 @@ struct frame_head {
static struct frame_head *
dump_backtrace(struct frame_head * head)
{
- oprofile_add_trace(head->ret);
+ struct frame_head bufhead[2];

- /* frame pointers should strictly progress back up the stack
- * (towards higher addresses) */
- if (head >= head->ebp)
+ /* Also check accessibility of one struct frame_head beyond */
+ if (!access_ok(VERIFY_READ, head, sizeof(bufhead)))
+ return NULL;
+ if (__copy_from_user_inatomic(bufhead, head, sizeof(bufhead)))
return NULL;

- return head->ebp;
-}
-
-/* check that the page(s) containing the frame head are present */
-static int pages_present(struct frame_head * head)
-{
- struct mm_struct * mm = current->mm;
+ oprofile_add_trace(bufhead[0].ret);

- /* FIXME: only necessary once per page */
- if (!check_user_page_readable(mm, (unsigned long)head))
- return 0;
+ /* frame pointers should strictly progress back up the stack
+ * (towards higher addresses) */
+ if (head >= bufhead[0].ebp)
+ return NULL;

- return check_user_page_readable(mm, (unsigned long)(head + 1));
+ return bufhead[0].ebp;
}

/*
@@ -97,15 +94,6 @@ x86_backtrace(struct pt_regs * const reg
return;
}

-#ifdef CONFIG_SMP
- if (!spin_trylock(&current->mm->page_table_lock))
- return;
-#endif
-
- while (depth-- && head && pages_present(head))
+ while (depth-- && head)
head = dump_backtrace(head);
-
-#ifdef CONFIG_SMP
- spin_unlock(&current->mm->page_table_lock);
-#endif
}
--- mm19/include/linux/mm.h 2005-10-11 23:57:59.000000000 +0100
+++ mm20/include/linux/mm.h 2005-10-11 23:58:43.000000000 +0100
@@ -952,7 +952,6 @@ extern struct page * vmalloc_to_page(voi
extern unsigned long vmalloc_to_pfn(void *addr);
extern struct page * follow_page(struct mm_struct *mm, unsigned long address,
int write);
-extern int check_user_page_readable(struct mm_struct *mm, unsigned long address);
int remap_pfn_range(struct vm_area_struct *, unsigned long,
unsigned long, unsigned long, pgprot_t);

--- mm19/mm/memory.c 2005-10-11 23:57:59.000000000 +0100
+++ mm20/mm/memory.c 2005-10-11 23:58:43.000000000 +0100
@@ -809,8 +809,7 @@ unsigned long zap_page_range(struct vm_a
* Do a quick page-table lookup for a single page.
* mm->page_table_lock must be held.
*/
-static struct page *__follow_page(struct mm_struct *mm, unsigned long address,
- int read, int write, int accessed)
+struct page *follow_page(struct mm_struct *mm, unsigned long address, int write)
{
pgd_t *pgd;
pud_t *pud;
@@ -846,16 +845,12 @@ static struct page *__follow_page(struct
if (pte_present(pte)) {
if (write && !pte_write(pte))
goto out;
- if (read && !pte_read(pte))
- goto out;
pfn = pte_pfn(pte);
if (pfn_valid(pfn)) {
page = pfn_to_page(pfn);
- if (accessed) {
- if (write && !pte_dirty(pte) &&!PageDirty(page))
- set_page_dirty(page);
- mark_page_accessed(page);
- }
+ if (write && !pte_dirty(pte) &&!PageDirty(page))
+ set_page_dirty(page);
+ mark_page_accessed(page);
return page;
}
}
@@ -864,22 +859,6 @@ out:
return NULL;
}

-inline struct page *
-follow_page(struct mm_struct *mm, unsigned long address, int write)
-{
- return __follow_page(mm, address, 0, write, 1);
-}
-
-/*
- * check_user_page_readable() can be called frm niterrupt context by oprofile,
- * so we need to avoid taking any non-irq-safe locks
- */
-int check_user_page_readable(struct mm_struct *mm, unsigned long address)
-{
- return __follow_page(mm, address, 1, 0, 0) != NULL;
-}
-EXPORT_SYMBOL(check_user_page_readable);
-
static inline int
untouched_anonymous_page(struct mm_struct* mm, struct vm_area_struct *vma,
unsigned long address)

2005-10-13 01:24:44

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 21/21] mm: follow_page with inner ptlock

Final step in pushing down common core's page_table_lock. follow_page
no longer wants caller to hold page_table_lock, uses pte_offset_map_lock
itself; and so no page_table_lock is taken in get_user_pages itself.

But get_user_pages (and get_futex_key) do then need follow_page to pin
the page for them: take Daniel's suggestion of passing bitflags to
follow_page.

Need one for WRITE, another for TOUCH (that was the accessed flag before:
it vanished along with check_user_page_readable, but surely get_numa_maps
is wrong to mark every page it finds as accessed), and another for GET.

And another, ANON, to dispose of untouched_anonymous_page: it seems silly
for that to descend a second time; let follow_page observe whether there
was no page table and return ZERO_PAGE if so. This also fixes a minor bug:
check VM_LOCKED, since make_pages_present ought to make readonly anonymous
pages present.

Give get_numa_maps a cond_resched while we're there.
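
With the bitflags, a typical caller looks like get_futex_key's fastpath
in the kernel/futex.c hunk below:

	page = follow_page(mm, uaddr, FOLL_TOUCH | FOLL_GET);
	if (page) {
		key->shared.pgoff =
			page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
		put_page(page);		/* drop the FOLL_GET reference */
		return 0;
	}
	/* otherwise fall back to the slow path */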

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/proc/task_mmu.c | 3 -
include/linux/mm.h | 20 ++++--
kernel/futex.c | 6 --
mm/memory.c | 152 ++++++++++++++++++++++++-----------------------------
mm/nommu.c | 3 -
5 files changed, 88 insertions(+), 96 deletions(-)

--- mm20/fs/proc/task_mmu.c 2005-10-11 23:57:10.000000000 +0100
+++ mm21/fs/proc/task_mmu.c 2005-10-11 23:58:59.000000000 +0100
@@ -419,7 +419,6 @@ static struct numa_maps *get_numa_maps(c
for_each_node(i)
md->node[i] =0;

- spin_lock(&mm->page_table_lock);
for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
page = follow_page(mm, vaddr, 0);
if (page) {
@@ -434,8 +433,8 @@ static struct numa_maps *get_numa_maps(c
md->anon++;
md->node[page_to_nid(page)]++;
}
+ cond_resched();
}
- spin_unlock(&mm->page_table_lock);
return md;
}

--- mm20/include/linux/mm.h 2005-10-11 23:58:43.000000000 +0100
+++ mm21/include/linux/mm.h 2005-10-11 23:58:59.000000000 +0100
@@ -946,14 +946,18 @@ static inline unsigned long vma_pages(st
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
}

-extern struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr);
-
-extern struct page * vmalloc_to_page(void *addr);
-extern unsigned long vmalloc_to_pfn(void *addr);
-extern struct page * follow_page(struct mm_struct *mm, unsigned long address,
- int write);
-int remap_pfn_range(struct vm_area_struct *, unsigned long,
- unsigned long, unsigned long, pgprot_t);
+struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
+struct page *vmalloc_to_page(void *addr);
+unsigned long vmalloc_to_pfn(void *addr);
+int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t);
+
+struct page *follow_page(struct mm_struct *, unsigned long address,
+ unsigned int foll_flags);
+#define FOLL_WRITE 0x01 /* check pte is writable */
+#define FOLL_TOUCH 0x02 /* mark page accessed */
+#define FOLL_GET 0x04 /* do get_page on page */
+#define FOLL_ANON 0x08 /* give ZERO_PAGE if no pgtable */

#ifdef CONFIG_PROC_FS
void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
--- mm20/kernel/futex.c 2005-09-21 12:16:59.000000000 +0100
+++ mm21/kernel/futex.c 2005-10-11 23:58:59.000000000 +0100
@@ -205,15 +205,13 @@ static int get_futex_key(unsigned long u
/*
* Do a quick atomic lookup first - this is the fastpath.
*/
- spin_lock(&current->mm->page_table_lock);
- page = follow_page(mm, uaddr, 0);
+ page = follow_page(mm, uaddr, FOLL_TOUCH|FOLL_GET);
if (likely(page != NULL)) {
key->shared.pgoff =
page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
- spin_unlock(&current->mm->page_table_lock);
+ put_page(page);
return 0;
}
- spin_unlock(&current->mm->page_table_lock);

/*
* Do it the general way.
--- mm20/mm/memory.c 2005-10-11 23:58:43.000000000 +0100
+++ mm21/mm/memory.c 2005-10-11 23:58:59.000000000 +0100
@@ -807,86 +807,82 @@ unsigned long zap_page_range(struct vm_a

/*
* Do a quick page-table lookup for a single page.
- * mm->page_table_lock must be held.
*/
-struct page *follow_page(struct mm_struct *mm, unsigned long address, int write)
+struct page *follow_page(struct mm_struct *mm, unsigned long address,
+ unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep, pte;
+ spinlock_t *ptl;
unsigned long pfn;
struct page *page;

- page = follow_huge_addr(mm, address, write);
- if (! IS_ERR(page))
- return page;
+ page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
+ if (!IS_ERR(page)) {
+ BUG_ON(flags & FOLL_GET);
+ goto out;
+ }

+ page = NULL;
pgd = pgd_offset(mm, address);
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
- goto out;
+ goto no_page_table;

pud = pud_offset(pgd, address);
if (pud_none(*pud) || unlikely(pud_bad(*pud)))
- goto out;
+ goto no_page_table;

pmd = pmd_offset(pud, address);
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+ goto no_page_table;
+
+ if (pmd_huge(*pmd)) {
+ BUG_ON(flags & FOLL_GET);
+ page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
- if (pmd_huge(*pmd))
- return follow_huge_pmd(mm, address, pmd, write);
+ }

- ptep = pte_offset_map(pmd, address);
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!ptep)
goto out;

pte = *ptep;
- pte_unmap(ptep);
- if (pte_present(pte)) {
- if (write && !pte_write(pte))
- goto out;
- pfn = pte_pfn(pte);
- if (pfn_valid(pfn)) {
- page = pfn_to_page(pfn);
- if (write && !pte_dirty(pte) &&!PageDirty(page))
- set_page_dirty(page);
- mark_page_accessed(page);
- return page;
- }
- }
+ if (!pte_present(pte))
+ goto unlock;
+ if ((flags & FOLL_WRITE) && !pte_write(pte))
+ goto unlock;
+ pfn = pte_pfn(pte);
+ if (!pfn_valid(pfn))
+ goto unlock;

+ page = pfn_to_page(pfn);
+ if (flags & FOLL_GET)
+ get_page(page);
+ if (flags & FOLL_TOUCH) {
+ if ((flags & FOLL_WRITE) &&
+ !pte_dirty(pte) && !PageDirty(page))
+ set_page_dirty(page);
+ mark_page_accessed(page);
+ }
+unlock:
+ pte_unmap_unlock(ptep, ptl);
out:
- return NULL;
-}
-
-static inline int
-untouched_anonymous_page(struct mm_struct* mm, struct vm_area_struct *vma,
- unsigned long address)
-{
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
-
- /* Check if the vma is for an anonymous mapping. */
- if (vma->vm_ops && vma->vm_ops->nopage)
- return 0;
-
- /* Check if page directory entry exists. */
- pgd = pgd_offset(mm, address);
- if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
- return 1;
-
- pud = pud_offset(pgd, address);
- if (pud_none(*pud) || unlikely(pud_bad(*pud)))
- return 1;
-
- /* Check if page middle directory entry exists. */
- pmd = pmd_offset(pud, address);
- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- return 1;
+ return page;

- /* There is a pte slot for 'address' in 'mm'. */
- return 0;
+no_page_table:
+ /*
+ * When core dumping an enormous anonymous area that nobody
+ * has touched so far, we don't want to allocate page tables.
+ */
+ if (flags & FOLL_ANON) {
+ page = ZERO_PAGE(address);
+ if (flags & FOLL_GET)
+ get_page(page);
+ BUG_ON(flags & FOLL_WRITE);
+ }
+ return page;
}

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
@@ -894,18 +890,19 @@ int get_user_pages(struct task_struct *t
struct page **pages, struct vm_area_struct **vmas)
{
int i;
- unsigned int flags;
+ unsigned int vm_flags;

/*
* Require read or write permissions.
* If 'force' is set, we only require the "MAY" flags.
*/
- flags = write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
- flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+ vm_flags = write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
+ vm_flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
i = 0;

do {
- struct vm_area_struct * vma;
+ struct vm_area_struct *vma;
+ unsigned int foll_flags;

vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(tsk, start)) {
@@ -946,7 +943,7 @@ int get_user_pages(struct task_struct *t
}

if (!vma || (vma->vm_flags & (VM_IO | VM_RESERVED))
- || !(flags & vma->vm_flags))
+ || !(vm_flags & vma->vm_flags))
return i ? : -EFAULT;

if (is_vm_hugetlb_page(vma)) {
@@ -954,29 +951,25 @@ int get_user_pages(struct task_struct *t
&start, &len, i);
continue;
}
- spin_lock(&mm->page_table_lock);
+
+ foll_flags = FOLL_TOUCH;
+ if (pages)
+ foll_flags |= FOLL_GET;
+ if (!write && !(vma->vm_flags & VM_LOCKED) &&
+ (!vma->vm_ops || !vma->vm_ops->nopage))
+ foll_flags |= FOLL_ANON;
+
do {
- int write_access = write;
struct page *page;

- cond_resched_lock(&mm->page_table_lock);
- while (!(page = follow_page(mm, start, write_access))) {
- int ret;
-
- /*
- * Shortcut for anonymous pages. We don't want
- * to force the creation of pages tables for
- * insanely big anonymously mapped areas that
- * nobody touched so far. This is important
- * for doing a core dump for these mappings.
- */
- if (!write && untouched_anonymous_page(mm,vma,start)) {
- page = ZERO_PAGE(start);
- break;
- }
- spin_unlock(&mm->page_table_lock);
- ret = __handle_mm_fault(mm, vma, start, write_access);
+ if (write)
+ foll_flags |= FOLL_WRITE;

+ cond_resched();
+ while (!(page = follow_page(mm, start, foll_flags))) {
+ int ret;
+ ret = __handle_mm_fault(mm, vma, start,
+ foll_flags & FOLL_WRITE);
/*
* The VM_FAULT_WRITE bit tells us that do_wp_page has
* broken COW when necessary, even if maybe_mkwrite
@@ -984,7 +977,7 @@ int get_user_pages(struct task_struct *t
* subsequent page lookups as if they were reads.
*/
if (ret & VM_FAULT_WRITE)
- write_access = 0;
+ foll_flags &= ~FOLL_WRITE;

switch (ret & ~VM_FAULT_WRITE) {
case VM_FAULT_MINOR:
@@ -1000,12 +993,10 @@ int get_user_pages(struct task_struct *t
default:
BUG();
}
- spin_lock(&mm->page_table_lock);
}
if (pages) {
pages[i] = page;
flush_dcache_page(page);
- page_cache_get(page);
}
if (vmas)
vmas[i] = vma;
@@ -1013,7 +1004,6 @@ int get_user_pages(struct task_struct *t
start += PAGE_SIZE;
len--;
} while (len && start < vma->vm_end);
- spin_unlock(&mm->page_table_lock);
} while (len);
return i;
}
--- mm20/mm/nommu.c 2005-10-11 23:54:33.000000000 +0100
+++ mm21/mm/nommu.c 2005-10-11 23:58:59.000000000 +0100
@@ -1046,7 +1046,8 @@ struct vm_area_struct *find_vma(struct m

EXPORT_SYMBOL(find_vma);

-struct page * follow_page(struct mm_struct *mm, unsigned long addr, int write)
+struct page *follow_page(struct mm_struct *mm, unsigned long address,
+ unsigned int foll_flags)
{
return NULL;
}

2005-10-13 12:39:09

by Carsten Otte

[permalink] [raw]
Subject: Re: [PATCH 18/21] mm: xip_unmap ZERO_PAGE fix

Hugh Dickins wrote:
> Small fix to the PageReserved patch: the mips ZERO_PAGE(address) depends
> on address, so __xip_unmap is wrong to initialize page with that before
> address is initialized; and in fact must re-evaluate it each iteration.
Looks fine to me. I never realized they have multiple zero pages on mips.
--

Carsten Otte
IBM Linux technology center
ARCH=s390

2005-10-13 19:18:14

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 06/21] mm: update_hiwaters just in time

Christoph Lameter wrote:
> Looks fine to me. Great innovative idea on how to reduce the resources
> needed for the counters.
>
> Jay, what do you say?

Yes, looks good to me.
My testing results were good also. :)

Thanks, Hugh!
- jay