2005-03-02 03:49:25

by Christoph Lameter

Subject: Page fault scalability patch V18: Overview

Is there any chance that this patchset could go into mm now? This has been
discussed since last August....

Changelog:

V17->V18 Rediff against 2.6.11-rc5-bk4
V16->V17 Do not increment page_count in do_wp_page. Performance data
posted.
V15->V16 of this patch: Redesign to allow a full fallback
for architectures that do not support atomic operations.

An introduction to what this patch does and a patch archive can be found on
http://oss.sgi.com/projects/page_fault_performance. The archive also has the
results of various performance tests (LMBench, microbenchmarks and
kernel compiles).

The basic approach in this patchset is the same as used in SGI's 2.4.X
based kernels which have been in production use in ProPack 3 for a long time.

The patchset is composed of 4 patches (and was tested against 2.6.11-rc5-bk4):

1/4: ptep_cmpxchg and ptep_xchg to avoid intermittent zeroing of ptes

The current way of synchronizing with updates of page table
entries by the CPU or by arch-specific interrupt handlers is
to first set a pte to zero before writing a new value. This
patch uses ptep_xchg and ptep_cmpxchg to avoid writing the
intermediate zero for certain configurations.

The patch introduces CONFIG_ATOMIC_TABLE_OPS, which may be
enabled as an experimental feature during kernel configuration
if the hardware is able to support atomic operations and if
an SMP kernel is being configured. A Kconfig update for i386,
x86_64 and ia64 has been provided. On i386 this option is
restricted to CPUs better than a 486 and non-PAE mode (that
way all the cmpxchg issues on old i386 CPUs and the problems
with 64-bit atomic operations on recent i386 CPUs are avoided).

If CONFIG_ATOMIC_TABLE_OPS is not set then ptep_xchg and
ptep_cmpxchg are realized by falling back to clearing a pte
before updating it.

The patch does not change the use of mm->page_table_lock and
the only performance improvement is the replacement of
xchg-with-zero-and-then-write-new-pte-value with an xchg with
the new value for SMP on some architectures if
CONFIG_ATOMIC_TABLE_OPS is configured. It should not do anything
major to VM operations.

2/4: Macros for mm counter manipulation

There are various approaches to handling mm counters if the
page_table_lock is no longer acquired. This patch defines
macros in include/linux/sched.h to handle these counters and
makes sure that these macros are used throughout the kernel
to access and manipulate rss and anon_rss. There should be
no change to the generated code as a result of this patch.

3/4: Drop the first use of the page_table_lock in handle_mm_fault

The patch introduces two new functions:

page_table_atomic_start(mm), page_table_atomic_stop(mm)

that fall back to the use of the page_table_lock if
CONFIG_ATOMIC_TABLE_OPS is not defined.

If CONFIG_ATOMIC_TABLE_OPS is defined those functions may
be used to prepare the CPU for atomic table ops (i386 in PAE mode
may, for example, get the MMX registers ready for 64-bit atomic ops)
but are simply empty by default.
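
A minimal sketch of what these definitions could look like, based on
the description above (illustrative only, not quoted from the patch):

#ifdef CONFIG_ATOMIC_TABLE_OPS
/* Atomic pte operations are available: nothing to prepare by default. */
#define page_table_atomic_start(mm)	do { } while (0)
#define page_table_atomic_stop(mm)	do { } while (0)
#else
/* No atomic pte operations: fall back to the page_table_lock. */
#define page_table_atomic_start(mm)	spin_lock(&(mm)->page_table_lock)
#define page_table_atomic_stop(mm)	spin_unlock(&(mm)->page_table_lock)
#endif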

Two operations may then be performed on the page table without
acquiring the page table lock:

a) updating access bits in a pte
b) installing a mapping to the zero page for anonymous read faults.

All counters are still protected with the page_table_lock thus
avoiding any issues there.

Some additional counters are added to /proc/meminfo to
provide statistics. Spurious faults that have no effect are
also counted. There is a surprisingly high number of those on
ia64 (perhaps used to populate the CPU caches with the pte?).

4/4: Drop the use of the page_table_lock in do_anonymous_page

The second acquisition of the page_table_lock is removed
from do_anonymous_page, allowing the anonymous
write fault to complete without the page_table_lock.

The macros for manipulating rss and anon_rss in include/linux/sched.h
are changed if CONFIG_ATOMIC_TABLE_OPS is set to use atomic
operations for rss and anon_rss (safest solution for now, other
solutions may easily be implemented by changing those macros).
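
A minimal sketch of that change, assuming atomic_t counters
(illustrative only; the actual 4/4 sched.h hunk is not quoted here):

#ifdef CONFIG_ATOMIC_TABLE_OPS
#define MM_COUNTER_T atomic_t
#define set_mm_counter(mm, member, value)	atomic_set(&(mm)->member, value)
#define get_mm_counter(mm, member)		((unsigned long)atomic_read(&(mm)->member))
#define update_mm_counter(mm, member, value)	atomic_add(value, &(mm)->member)
#else
#define MM_COUNTER_T unsigned long
#define set_mm_counter(mm, member, value)	(mm)->member = (value)
#define get_mm_counter(mm, member)		((mm)->member)
#define update_mm_counter(mm, member, value)	(mm)->member += (value)
#endif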

This patch typically yields significant increases in page fault
performance for threaded applications on SMP systems.



2005-03-02 03:52:38

by Christoph Lameter

Subject: Re: Page fault scalability patch V18: atomic pte ops, pte_cmpxchg and pte_xchg

The current way of updating ptes in the Linux vm includes first clearing
a pte before setting it to another value. The clearing is performed while
holding the page_table_lock to ensure that the entry will not be modified
by the CPU directly (clearing the pte clears the present bit),
by an arch specific interrupt handler or another page fault handler
running on another CPU. This approach is necessary for some
architectures that cannot perform atomic updates of page table entries.

If a page table entry is cleared then a second CPU may generate a page fault
for that entry. The fault handler on the second CPU will then attempt to
acquire the page_table_lock and wait until the first CPU has completed
updating the page table entry. The fault handler on the second CPU will then
discover that everything is ok and simply do nothing (apart from incrementing
the counters for a minor fault and marking the page again as accessed).

However, most architectures actually support atomic operations on page
table entries. The use of atomic operations on page table entries would
allow the update of a page table entry in a single atomic operation instead
of writing to the page table entry twice. There would also be no danger of
generating a spurious page fault on other CPUs.

The following patch introduces two new atomic operations ptep_xchg and
ptep_cmpxchg that may be provided by an architecture. The fallback in
include/asm-generic/pgtable.h is to simulate both operations through the
existing ptep_get_and_clear function. So there is essentially no change if
atomic operations on ptes have not been defined. Architectures that do
not support atomic operations on ptes may continue to use the clearing of
a pte for locking type purposes.
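
As a rough illustration of the difference (not patch text; pteval,
new_pteval and ptep are placeholder names):

/* Without atomic pte operations: two writes, with a window in which
 * the pte is cleared and therefore not present.
 */
pteval = ptep_get_and_clear(ptep);
set_pte(ptep, new_pteval);

/* With atomic pte operations: a single atomic exchange, so a cleared
 * pte is never visible to other CPUs.
 */
pteval = ptep_xchg(ptep, new_pteval);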

Atomic operations may be enabled in the kernel configuration on
i386, ia64 and x86_64 if a suitable CPU is configured in SMP mode.
Generic atomic definitions for ptep_xchg and ptep_cmpxchg
have been provided based on the existing xchg() and cmpxchg() functions
that already work atomically on many platforms. It is very
easy to implement this for any architecture by adding the appropriate
definitions to arch/xx/Kconfig.

The provided generic atomic functions may be overridden as usual by defining
the appropriate __HAVE_ARCH_xxx constant and providing an implementation.

My aim of reducing the use of the page_table_lock in the page fault handler
relies on a pte never being cleared while it is in use, even when the
page_table_lock is not held. Clearing a pte before setting it to another
value could result in a situation in which a fault generated by
another CPU installs a pte that is then immediately overwritten by
the first CPU setting the pte to a valid value again. This patch is
important for future work on reducing the use of spinlocks in the vm.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-02-24 19:42:12.000000000 -0800
@@ -575,11 +575,6 @@ static int try_to_unmap_one(struct page

/* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -594,11 +589,15 @@ static int try_to_unmap_one(struct page
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte(pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
mm->anon_rss--;
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);

+ /* Move the dirty bit to the physical page now that the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
mm->rss--;
acct_update_integrals();
page_remove_rmap(page);
@@ -691,15 +690,15 @@ static void try_to_unmap_cluster(unsigne
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte(pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_clear_flush(vma, address, pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-02-24 19:42:12.000000000 -0800
@@ -502,14 +502,18 @@ static void zap_pte_range(struct mmu_gat
page->index > details->last_index))
continue;
}
- pte = ptep_get_and_clear(ptep);
- tlb_remove_tlb_entry(tlb, ptep, address+offset);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ pte = ptep_get_and_clear(ptep);
+ tlb_remove_tlb_entry(tlb, ptep, address+offset);
continue;
+ }
if (unlikely(details) && details->nonlinear_vma
&& linear_page_index(details->nonlinear_vma,
address+offset) != page->index)
- set_pte(ptep, pgoff_to_pte(page->index));
+ pte = ptep_xchg(ptep, pgoff_to_pte(page->index));
+ else
+ pte = ptep_get_and_clear(ptep);
+ tlb_remove_tlb_entry(tlb, ptep, address+offset);
if (pte_dirty(pte))
set_page_dirty(page);
if (PageAnon(page))
Index: linux-2.6.10/mm/mprotect.c
===================================================================
--- linux-2.6.10.orig/mm/mprotect.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/mprotect.c 2005-02-24 19:42:12.000000000 -0800
@@ -48,12 +48,16 @@ change_pte_range(pmd_t *pmd, unsigned lo
if (pte_present(*pte)) {
pte_t entry;

- /* Avoid an SMP race with hardware updated dirty/clean
- * bits by wiping the pte and then setting the new pte
- * into place.
- */
- entry = ptep_get_and_clear(pte);
- set_pte(pte, pte_modify(entry, newprot));
+ /* Deal with a potential SMP race with hardware/arch
+ * interrupt updating dirty/clean bits through the use
+ * of ptep_cmpxchg.
+ */
+ do {
+ entry = *pte;
+ } while (!ptep_cmpxchg(pte,
+ entry,
+ pte_modify(entry, newprot)
+ ));
}
address += PAGE_SIZE;
pte++;
Index: linux-2.6.10/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable.h 2005-02-24 19:42:12.000000000 -0800
@@ -102,6 +102,92 @@ static inline pte_t ptep_get_and_clear(p
})
#endif

+#ifdef CONFIG_ATOMIC_TABLE_OPS
+
+/*
+ * The architecture does support atomic table operations.
+ * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using
+ * cmpxchg and xchg.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__ptep, __pteval) \
+ __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)))
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__ptep,__oldval,__newval) \
+ (cmpxchg(&pte_val(*(__ptep)), \
+ pte_val(__oldval), \
+ pte_val(__newval) \
+ ) == pte_val(__oldval) \
+ )
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg(__ptep, __pteval); \
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+
+#else
+
+/*
+ * No support for atomic operations on the page table.
+ * Exchanging of pte values is done by first swapping zeros into
+ * a pte and then putting new content into the pte entry.
+ * However, these functions will generate an empty pte for a
+ * short time frame. This means that the page_table_lock must be held
+ * to avoid a page fault that would install a new entry.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_get_and_clear(__ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+#else
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg(__ptep, __pteval); \
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+#endif
+
+/*
+ * The fallback function for ptep_cmpxchg avoids any real use of cmpxchg
+ * since cmpxchg may not be available on certain architectures. Instead
+ * the clearing of a pte is used as a form of locking mechanism.
+ * This approach will only work if the page_table_lock is held to insure
+ * that the pte is not populated by a page fault generated on another
+ * CPU.
+ */
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__ptep, __old, __new) \
+({ \
+ pte_t prev = ptep_get_and_clear(__ptep); \
+ int r = pte_val(prev) == pte_val(__old); \
+ set_pte(__ptep, r ? (__new) : prev); \
+ r; \
+})
+#endif
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(pte_t *ptep)
{
Index: linux-2.6.10/arch/ia64/Kconfig
===================================================================
--- linux-2.6.10.orig/arch/ia64/Kconfig 2005-02-24 19:41:28.000000000 -0800
+++ linux-2.6.10/arch/ia64/Kconfig 2005-02-24 19:42:12.000000000 -0800
@@ -272,6 +272,17 @@ config PREEMPT
Say Y here if you are building a kernel for a desktop, embedded
or real-time system. Say N if you are unsure.

+config ATOMIC_TABLE_OPS
+ bool "Atomic Page Table Operations (EXPERIMENTAL)"
+ depends on SMP && EXPERIMENTAL
+ help
+ Atomic Page table operations allow page faults
+ without the use (or with reduce use of) spinlocks
+ and allow greater concurrency for a task with multiple
+ threads in the page fault handler. This is in particular
+ useful for high CPU counts and processes that use
+ large amounts of memory.
+
config HAVE_DEC_LOCK
bool
depends on (SMP || PREEMPT)
Index: linux-2.6.10/arch/i386/Kconfig
===================================================================
--- linux-2.6.10.orig/arch/i386/Kconfig 2005-02-24 19:41:28.000000000 -0800
+++ linux-2.6.10/arch/i386/Kconfig 2005-02-24 19:42:12.000000000 -0800
@@ -868,6 +868,17 @@ config HAVE_DEC_LOCK
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y

+config ATOMIC_TABLE_OPS
+ bool "Atomic Page Table Operations (EXPERIMENTAL)"
+ depends on SMP && X86_CMPXCHG && EXPERIMENTAL && !X86_PAE
+ help
+ Atomic Page table operations allow page faults
+ without the use (or with reduce use of) spinlocks
+ and allow greater concurrency for a task with multiple
+ threads in the page fault handler. This is in particular
+ useful for high CPU counts and processes that use
+ large amounts of memory.
+
# turning this on wastes a bunch of space.
# Summit needs it only when NUMA is on
config BOOT_IOREMAP
Index: linux-2.6.10/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.10.orig/arch/x86_64/Kconfig 2005-02-24 19:41:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/Kconfig 2005-02-24 19:42:12.000000000 -0800
@@ -240,6 +240,17 @@ config PREEMPT
Say Y here if you are feeling brave and building a kernel for a
desktop, embedded or real-time system. Say N if you are unsure.

+config ATOMIC_TABLE_OPS
+ bool "Atomic Page Table Operations (EXPERIMENTAL)"
+ depends on SMP && EXPERIMENTAL
+ help
+ Atomic Page table operations allow page faults
+ without the use (or with reduce use of) spinlocks
+ and allow greater concurrency for a task with multiple
+ threads in the page fault handler. This is in particular
+ useful for high CPU counts and processes that use
+ large amounts of memory.
+
config PREEMPT_BKL
bool "Preempt The Big Kernel Lock"
depends on PREEMPT

2005-03-02 03:56:19

by Christoph Lameter

Subject: Re: Page fault scalability patch V18: abstract rss counter ops

This patch extracts all the operations on rss into definitions in
include/linux/sched.h. All rss operations are performed through
the following three macros:

get_mm_counter(mm, member) -> Obtain the value of a counter
set_mm_counter(mm, member, value) -> Set the value of a counter
update_mm_counter(mm, member, value) -> Add a value to a counter

The simple definitions provided in this patch result in no change
to the generated code.

With this patch it becomes easier to add new counters and it is possible
to redefine the method of counter handling (e.g. the page fault scalability
patches may want to use atomic operations or a split rss).
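
A representative before/after conversion, mirroring the hunks below:

/* before */
mm->rss++;
mm->anon_rss--;

/* after */
update_mm_counter(mm, rss, 1);
update_mm_counter(mm, anon_rss, -1);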

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2005-02-24 19:41:49.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-02-24 19:42:17.000000000 -0800
@@ -203,6 +203,10 @@ arch_get_unmapped_area_topdown(struct fi
extern void arch_unmap_area(struct vm_area_struct *area);
extern void arch_unmap_area_topdown(struct vm_area_struct *area);

+#define set_mm_counter(mm, member, value) (mm)->member = (value)
+#define get_mm_counter(mm, member) ((mm)->member)
+#define update_mm_counter(mm, member, value) (mm)->member += (value)
+#define MM_COUNTER_T unsigned long

struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
@@ -219,7 +223,7 @@ struct mm_struct {
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ spinlock_t page_table_lock; /* Protects page tables and some counters */

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -229,9 +233,13 @@ struct mm_struct {
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
+ unsigned long total_vm, locked_vm, shared_vm;
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

+ /* Special counters protected by the page_table_lock */
+ MM_COUNTER_T rss;
+ MM_COUNTER_T anon_rss;
+
unsigned long saved_auxv[42]; /* for /proc/PID/auxv */

unsigned dumpable:1;
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-02-24 19:42:12.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-02-24 19:42:17.000000000 -0800
@@ -313,9 +313,9 @@ copy_one_pte(struct mm_struct *dst_mm,
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
get_page(page);
- dst_mm->rss++;
+ update_mm_counter(dst_mm, rss, 1);
if (PageAnon(page))
- dst_mm->anon_rss++;
+ update_mm_counter(dst_mm, anon_rss, 1);
set_pte(dst_pte, pte);
page_dup_rmap(page);
}
@@ -517,7 +517,7 @@ static void zap_pte_range(struct mmu_gat
if (pte_dirty(pte))
set_page_dirty(page);
if (PageAnon(page))
- tlb->mm->anon_rss--;
+ update_mm_counter(tlb->mm, anon_rss, -1);
else if (pte_young(pte))
mark_page_accessed(page);
tlb->freed++;
@@ -1340,13 +1340,14 @@ static int do_wp_page(struct mm_struct *
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
if (likely(pte_same(*page_table, pte))) {
- if (PageAnon(old_page))
- mm->anon_rss--;
+ if (PageAnon(old_page))
+ update_mm_counter(mm, anon_rss, -1);
if (PageReserved(old_page)) {
- ++mm->rss;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();
} else
+
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1750,7 +1751,7 @@ static int do_swap_page(struct mm_struct
if (vm_swap_full())
remove_exclusive_swap_page(page);

- mm->rss++;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();

@@ -1817,7 +1818,7 @@ do_anonymous_page(struct mm_struct *mm,
spin_unlock(&mm->page_table_lock);
goto out;
}
- mm->rss++;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
@@ -1935,7 +1936,7 @@ retry:
/* Only go through if we didn't race with anybody else... */
if (pte_none(*page_table)) {
if (!PageReserved(new_page))
- ++mm->rss;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();

@@ -2262,8 +2263,10 @@ void update_mem_hiwater(void)
struct task_struct *tsk = current;

if (tsk->mm) {
- if (tsk->mm->hiwater_rss < tsk->mm->rss)
- tsk->mm->hiwater_rss = tsk->mm->rss;
+ unsigned long rss = get_mm_counter(tsk->mm, rss);
+
+ if (tsk->mm->hiwater_rss < rss)
+ tsk->mm->hiwater_rss = rss;
if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
tsk->mm->hiwater_vm = tsk->mm->total_vm;
}
Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-02-24 19:42:12.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-02-24 19:42:17.000000000 -0800
@@ -258,7 +258,7 @@ static int page_referenced_one(struct pa
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
+ if (!get_mm_counter(mm, rss))
goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -437,7 +437,7 @@ void page_add_anon_rmap(struct page *pag
BUG_ON(PageReserved(page));
BUG_ON(!anon_vma);

- vma->vm_mm->anon_rss++;
+ update_mm_counter(vma->vm_mm, anon_rss, 1);

anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
index = (address - vma->vm_start) >> PAGE_SHIFT;
@@ -510,7 +510,7 @@ static int try_to_unmap_one(struct page
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
+ if (!get_mm_counter(mm, rss))
goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -591,14 +591,14 @@ static int try_to_unmap_one(struct page
}
pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
- mm->anon_rss--;
+ update_mm_counter(mm, anon_rss, -1);
} else
pteval = ptep_clear_flush(vma, address, pte);

/* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);
- mm->rss--;
+ update_mm_counter(mm, rss, -1);
acct_update_integrals();
page_remove_rmap(page);
page_cache_release(page);
@@ -705,7 +705,7 @@ static void try_to_unmap_cluster(unsigne
page_remove_rmap(page);
page_cache_release(page);
acct_update_integrals();
- mm->rss--;
+ update_mm_counter(mm, rss, -1);
(*mapcount)--;
}

@@ -804,7 +804,7 @@ static int try_to_unmap_file(struct page
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
+ while (get_mm_counter(vma->vm_mm, rss) &&
cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
Index: linux-2.6.10/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.10.orig/fs/proc/task_mmu.c 2005-02-24 19:41:44.000000000 -0800
+++ linux-2.6.10/fs/proc/task_mmu.c 2005-02-24 19:42:17.000000000 -0800
@@ -24,7 +24,7 @@ char *task_mem(struct mm_struct *mm, cha
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ get_mm_counter(mm, rss) << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -39,11 +39,13 @@ unsigned long task_vsize(struct mm_struc
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ int rss = get_mm_counter(mm, rss);
+
+ *shared = rss - get_mm_counter(mm, anon_rss);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = mm->rss;
+ *resident = rss;
return mm->total_vm;
}

Index: linux-2.6.10/mm/mmap.c
===================================================================
--- linux-2.6.10.orig/mm/mmap.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/mmap.c 2005-02-24 19:42:17.000000000 -0800
@@ -2000,7 +2000,7 @@ void exit_mmap(struct mm_struct *mm)
vma = mm->mmap;
mm->mmap = mm->mmap_cache = NULL;
mm->mm_rb = RB_ROOT;
- mm->rss = 0;
+ set_mm_counter(mm, rss, 0);
mm->total_vm = 0;
mm->locked_vm = 0;

Index: linux-2.6.10/kernel/fork.c
===================================================================
--- linux-2.6.10.orig/kernel/fork.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/kernel/fork.c 2005-02-24 19:42:17.000000000 -0800
@@ -174,8 +174,8 @@ static inline int dup_mmap(struct mm_str
mm->mmap_cache = NULL;
mm->free_area_cache = oldmm->mmap_base;
mm->map_count = 0;
- mm->rss = 0;
- mm->anon_rss = 0;
+ set_mm_counter(mm, rss, 0);
+ set_mm_counter(mm, anon_rss, 0);
cpus_clear(mm->cpu_vm_mask);
mm->mm_rb = RB_ROOT;
rb_link = &mm->mm_rb.rb_node;
@@ -471,7 +471,7 @@ static int copy_mm(unsigned long clone_f
if (retval)
goto free_pt;

- mm->hiwater_rss = mm->rss;
+ mm->hiwater_rss = get_mm_counter(mm,rss);
mm->hiwater_vm = mm->total_vm;

good_mm:
Index: linux-2.6.10/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/tlb.h 2005-02-24 19:41:46.000000000 -0800
+++ linux-2.6.10/include/asm-generic/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -88,11 +88,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
{
int freed = tlb->freed;
struct mm_struct *mm = tlb->mm;
- int rss = mm->rss;
+ int rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);
tlb_flush_mmu(tlb, start, end);

/* keep the page table cache within bounds */
Index: linux-2.6.10/fs/binfmt_flat.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_flat.c 2004-12-24 13:33:47.000000000 -0800
+++ linux-2.6.10/fs/binfmt_flat.c 2005-02-24 19:42:17.000000000 -0800
@@ -650,7 +650,7 @@ static int load_flat_file(struct linux_b
current->mm->start_brk = datapos + data_len + bss_len;
current->mm->brk = (current->mm->start_brk + 3) & ~3;
current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
}

if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.10/fs/exec.c
===================================================================
--- linux-2.6.10.orig/fs/exec.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/exec.c 2005-02-24 19:42:17.000000000 -0800
@@ -326,7 +326,7 @@ void install_arg_page(struct vm_area_str
pte_unmap(pte);
goto out;
}
- mm->rss++;
+ update_mm_counter(mm, rss, 1);
lru_cache_add_active(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
page, vma->vm_page_prot))));
Index: linux-2.6.10/fs/binfmt_som.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_som.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/binfmt_som.c 2005-02-24 19:42:17.000000000 -0800
@@ -259,7 +259,7 @@ load_som_binary(struct linux_binprm * bp
create_som_tables(bprm);

current->mm->start_stack = bprm->p;
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);

#if 0
printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.10/mm/fremap.c
===================================================================
--- linux-2.6.10.orig/mm/fremap.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/fremap.c 2005-02-24 19:42:17.000000000 -0800
@@ -39,7 +39,7 @@ static inline void zap_pte(struct mm_str
set_page_dirty(page);
page_remove_rmap(page);
page_cache_release(page);
- mm->rss--;
+ update_mm_counter(mm, rss, -1);
}
}
} else {
@@ -92,7 +92,7 @@ int install_page(struct mm_struct *mm, s

zap_pte(mm, vma, addr, pte);

- mm->rss++;
+ update_mm_counter(mm,rss, 1);
flush_icache_page(vma, page);
set_pte(pte, mk_pte(page, prot));
page_add_file_rmap(page);
Index: linux-2.6.10/mm/swapfile.c
===================================================================
--- linux-2.6.10.orig/mm/swapfile.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/swapfile.c 2005-02-24 19:42:17.000000000 -0800
@@ -432,7 +432,7 @@ static void
unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
swp_entry_t entry, struct page *page)
{
- vma->vm_mm->rss++;
+ update_mm_counter(vma->vm_mm, rss, 1);
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_anon_rmap(page, vma, address);
Index: linux-2.6.10/fs/binfmt_aout.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_aout.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/binfmt_aout.c 2005-02-24 19:42:17.000000000 -0800
@@ -317,7 +317,7 @@ static int load_aout_binary(struct linux
(current->mm->start_brk = N_BSSADDR(ex));
current->mm->free_area_cache = current->mm->mmap_base;

- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.10/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/mm/hugetlbpage.c 2005-02-24 19:41:29.000000000 -0800
+++ linux-2.6.10/arch/ia64/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -73,7 +73,7 @@ set_huge_pte (struct mm_struct *mm, stru
{
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access) {
entry =
pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -116,7 +116,7 @@ int copy_hugetlb_page_range(struct mm_st
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -246,7 +246,7 @@ void unmap_hugepage_range(struct vm_area
put_page(page);
pte_clear(pte);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, - ((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/fs/binfmt_elf.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_elf.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/binfmt_elf.c 2005-02-24 19:42:17.000000000 -0800
@@ -764,7 +764,7 @@ static int load_elf_binary(struct linux_

/* Do this so that we can load the interpreter, if need be. We will
change some of these later */
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->free_area_cache = current->mm->mmap_base;
retval = setup_arg_pages(bprm, STACK_TOP, executable_stack);
if (retval < 0) {
Index: linux-2.6.10/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/tlb.h 2005-02-24 19:41:47.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -161,11 +161,11 @@ tlb_finish_mmu (struct mmu_gather *tlb,
{
unsigned long freed = tlb->freed;
struct mm_struct *mm = tlb->mm;
- unsigned long rss = mm->rss;
+ unsigned long rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);
/*
* Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
* tlb->end_addr.
Index: linux-2.6.10/include/asm-arm/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/tlb.h 2005-02-24 19:41:45.000000000 -0800
+++ linux-2.6.10/include/asm-arm/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -54,11 +54,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
{
struct mm_struct *mm = tlb->mm;
unsigned long freed = tlb->freed;
- int rss = mm->rss;
+ int rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);

if (freed) {
flush_tlb_mm(mm);
Index: linux-2.6.10/include/asm-arm26/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/tlb.h 2005-02-24 19:41:45.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -37,11 +37,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
{
struct mm_struct *mm = tlb->mm;
unsigned long freed = tlb->freed;
- int rss = mm->rss;
+ int rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);

if (freed) {
flush_tlb_mm(mm);
Index: linux-2.6.10/include/asm-sparc64/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/tlb.h 2005-02-24 19:41:48.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -80,11 +80,11 @@ static inline void tlb_finish_mmu(struct
{
unsigned long freed = mp->freed;
struct mm_struct *mm = mp->mm;
- unsigned long rss = mm->rss;
+ unsigned long rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);

tlb_flush_mmu(mp);

Index: linux-2.6.10/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/hugetlbpage.c 2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -62,7 +62,7 @@ static void set_huge_pte(struct mm_struc
unsigned long i;
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);

if (write_access)
entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@ int copy_hugetlb_page_range(struct mm_st
pte_val(entry) += PAGE_SIZE;
dst_pte++;
}
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -206,7 +206,7 @@ void unmap_hugepage_range(struct vm_area
pte++;
}
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/arch/x86_64/ia32/ia32_aout.c
===================================================================
--- linux-2.6.10.orig/arch/x86_64/ia32/ia32_aout.c 2005-02-24 19:41:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/ia32/ia32_aout.c 2005-02-24 19:42:17.000000000 -0800
@@ -313,7 +313,7 @@ static int load_aout_binary(struct linux
(current->mm->start_brk = N_BSSADDR(ex));
current->mm->free_area_cache = TASK_UNMAPPED_BASE;

- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.10/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/ppc64/mm/hugetlbpage.c 2005-02-24 19:41:32.000000000 -0800
+++ linux-2.6.10/arch/ppc64/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -153,7 +153,7 @@ static void set_huge_pte(struct mm_struc
{
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access) {
entry =
pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -315,7 +315,7 @@ int copy_hugetlb_page_range(struct mm_st

ptepage = pte_page(entry);
get_page(ptepage);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
set_pte(dst_pte, entry);

addr += HPAGE_SIZE;
@@ -425,7 +425,7 @@ void unmap_hugepage_range(struct vm_area

put_page(page);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_pending();
}

Index: linux-2.6.10/arch/sh64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/sh64/mm/hugetlbpage.c 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/arch/sh64/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -62,7 +62,7 @@ static void set_huge_pte(struct mm_struc
unsigned long i;
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);

if (write_access)
entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@ int copy_hugetlb_page_range(struct mm_st
pte_val(entry) += PAGE_SIZE;
dst_pte++;
}
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -206,7 +206,7 @@ void unmap_hugepage_range(struct vm_area
pte++;
}
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/mm/hugetlbpage.c 2005-02-24 19:41:32.000000000 -0800
+++ linux-2.6.10/arch/sparc64/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -67,7 +67,7 @@ static void set_huge_pte(struct mm_struc
unsigned long i;
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);

if (write_access)
entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -120,7 +120,7 @@ int copy_hugetlb_page_range(struct mm_st
pte_val(entry) += PAGE_SIZE;
dst_pte++;
}
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -211,7 +211,7 @@ void unmap_hugepage_range(struct vm_area
pte++;
}
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/arch/mips/kernel/irixelf.c
===================================================================
--- linux-2.6.10.orig/arch/mips/kernel/irixelf.c 2005-02-24 19:41:29.000000000 -0800
+++ linux-2.6.10/arch/mips/kernel/irixelf.c 2005-02-24 19:42:17.000000000 -0800
@@ -692,7 +692,7 @@ static int load_irix_binary(struct linux
/* Do this so that we can load the interpreter, if need be. We will
* change some of these later.
*/
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
setup_arg_pages(bprm, STACK_TOP, EXSTACK_DEFAULT);
current->mm->start_stack = bprm->p;

Index: linux-2.6.10/arch/m68k/atari/stram.c
===================================================================
--- linux-2.6.10.orig/arch/m68k/atari/stram.c 2005-02-24 19:41:29.000000000 -0800
+++ linux-2.6.10/arch/m68k/atari/stram.c 2005-02-24 19:42:17.000000000 -0800
@@ -635,7 +635,7 @@ static inline void unswap_pte(struct vm_
set_pte(dir, pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
swap_free(entry);
get_page(page);
- ++vma->vm_mm->rss;
+ update_mm_counter(vma->vm_mm, rss, 1);
}

static inline void unswap_pmd(struct vm_area_struct * vma, pmd_t *dir,
Index: linux-2.6.10/arch/i386/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/i386/mm/hugetlbpage.c 2005-02-24 19:41:28.000000000 -0800
+++ linux-2.6.10/arch/i386/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -46,7 +46,7 @@ static void set_huge_pte(struct mm_struc
{
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access) {
entry =
pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -86,7 +86,7 @@ int copy_hugetlb_page_range(struct mm_st
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -222,7 +222,7 @@ void unmap_hugepage_range(struct vm_area
page = pte_page(pte);
put_page(page);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm ,rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/arch/sparc64/kernel/binfmt_aout32.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/kernel/binfmt_aout32.c 2005-02-24 19:41:32.000000000 -0800
+++ linux-2.6.10/arch/sparc64/kernel/binfmt_aout32.c 2005-02-24 19:42:17.000000000 -0800
@@ -241,7 +241,7 @@ static int load_aout32_binary(struct lin
current->mm->brk = ex.a_bss +
(current->mm->start_brk = N_BSSADDR(ex));

- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.10/fs/proc/array.c
===================================================================
--- linux-2.6.10.orig/fs/proc/array.c 2005-02-24 19:41:44.000000000 -0800
+++ linux-2.6.10/fs/proc/array.c 2005-02-24 19:42:17.000000000 -0800
@@ -423,7 +423,7 @@ static int do_task_stat(struct task_stru
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? get_mm_counter(mm, rss) : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.10/fs/binfmt_elf_fdpic.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_elf_fdpic.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/binfmt_elf_fdpic.c 2005-02-24 19:42:17.000000000 -0800
@@ -299,7 +299,7 @@ static int load_elf_fdpic_binary(struct
/* do this so that we can load the interpreter, if need be
* - we will change some of these later
*/
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);

#ifdef CONFIG_MMU
retval = setup_arg_pages(bprm, current->mm->start_stack, executable_stack);
Index: linux-2.6.10/mm/nommu.c
===================================================================
--- linux-2.6.10.orig/mm/nommu.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/nommu.c 2005-02-24 19:42:17.000000000 -0800
@@ -962,10 +962,11 @@ void arch_unmap_area(struct vm_area_stru
void update_mem_hiwater(void)
{
struct task_struct *tsk = current;
+ unsigned long rss = get_mm_counter(tsk->mm, rss);

if (likely(tsk->mm)) {
- if (tsk->mm->hiwater_rss < tsk->mm->rss)
- tsk->mm->hiwater_rss = tsk->mm->rss;
+ if (tsk->mm->hiwater_rss < rss)
+ tsk->mm->hiwater_rss = rss;
if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
tsk->mm->hiwater_vm = tsk->mm->total_vm;
}
Index: linux-2.6.10/kernel/acct.c
===================================================================
--- linux-2.6.10.orig/kernel/acct.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/kernel/acct.c 2005-02-24 19:42:17.000000000 -0800
@@ -544,7 +544,7 @@ void acct_update_integrals(void)
if (delta == 0)
return;
tsk->acct_stimexpd = tsk->stime;
- tsk->acct_rss_mem1 += delta * tsk->mm->rss;
+ tsk->acct_rss_mem1 += delta * get_mm_counter(tsk->mm, rss);
tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
}
}

2005-03-02 04:01:03

by Christoph Lameter

Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

The page fault handler attempts to use the page_table_lock only for short
time periods. It repeatedly drops and reacquires the lock. When the lock
is reacquired, checks are made if the underlying pte has changed before
replacing the pte value. These locations are a good fit for the use of
ptep_cmpxchg.

The following patch removes the first acquisition of the page_table_lock
and uses atomic operations on the page table instead. A section
using atomic pte operations is begun with

page_table_atomic_start(struct mm_struct *)

and ends with

page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).

The atomic operations with ptep_xchg and ptep_cmpxchg only work for the lowest
layer of the page table. Higher layers may also be populated in an atomic
way by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock (populating higher level page
table entries is rare and therefore this is not likely to be performance
critical). For ia64 the definition of higher level atomic operations is
included.
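
As a sketch of such a generic fallback (illustrative only; the macro
name pud_test_and_populate and the __HAVE_ARCH_PUD_TEST_AND_POPULATE
override follow the convention described above and are assumptions,
not quoted from the patch):

#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
/* Populate the pud with the new pmd only if the pud is still empty.
 * Falls back to the page_table_lock since no atomic operation on
 * higher level page table entries is assumed to be available.
 */
#define pud_test_and_populate(__mm, __pud, __pmd)		\
({								\
	int __rc;						\
	spin_lock(&(__mm)->page_table_lock);			\
	__rc = pud_none(*(__pud));				\
	if (__rc)						\
		pud_populate(__mm, __pud, __pmd);		\
	spin_unlock(&(__mm)->page_table_lock);			\
	__rc;							\
})
#endif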

This patch depends on the ptep_cmpxchg patch being applied first and will
only remove the first use of the page_table_lock in the page fault handler.
This will allow the following page table operations without acquiring
the page_table_lock:

1. Updating of access bits (handle_mm_faults)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired for creating a new pte for an anonymous
write fault and therefore the problems with rss that were addressed by splitting
rss into the task structure do not yet occur.

The patch also adds some diagnostic features by counting the number of cmpxchg
failures (useful for verifying that this patch works correctly) and the number
of page faults received that led to no change in the page table. The statistics
may be viewed via /proc/meminfo.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-02-24 19:42:17.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-02-24 19:42:21.000000000 -0800
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005 Scalability improvement by reducing the use and the length of time
+ * the page table lock is held (Christoph Lameter)
*/

#include <linux/kernel_stat.h>
@@ -1275,8 +1277,8 @@ static inline void break_cow(struct vm_a
* change only once the write actually happens. This avoids a few races,
* and potentially makes it more efficient.
*
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We hold the mm semaphore and have started atomic pte operations,
+ * exit with pte ops completed.
*/
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
@@ -1294,7 +1296,7 @@ static int do_wp_page(struct mm_struct *
pte_unmap(page_table);
printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
address);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
return VM_FAULT_OOM;
}
old_page = pfn_to_page(pfn);
@@ -1306,22 +1308,25 @@ static int do_wp_page(struct mm_struct *
flush_cache_page(vma, address);
entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
vma);
- ptep_set_access_flags(vma, address, page_table, entry, 1);
- update_mmu_cache(vma, address, entry);
+ /*
+ * If the bits are not updated then another fault
+ * will be generated with another chance of updating.
+ */
+ if (ptep_cmpxchg(page_table, pte, entry))
+ update_mmu_cache(vma, address, entry);
+ else
+ inc_page_state(cmpxchg_fail_flag_reuse);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
return VM_FAULT_MINOR;
}
}
pte_unmap(page_table);
+ page_table_atomic_stop(mm);

/*
* Ok, we need to copy. Oh, well..
*/
- if (!PageReserved(old_page))
- page_cache_get(old_page);
- spin_unlock(&mm->page_table_lock);
-
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
if (old_page == ZERO_PAGE(address)) {
@@ -1332,10 +1337,15 @@ static int do_wp_page(struct mm_struct *
new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!new_page)
goto no_new_page;
+ /*
+ * No page_cache_get so we may copy some crap
+ * that is later discarded if the pte has changed
+ */
copy_user_highpage(new_page, old_page, address);
}
/*
- * Re-check the pte - we dropped the lock
+ * Re-check the pte - so far we may not have acquired the
+ * page_table_lock
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1347,7 +1357,6 @@ static int do_wp_page(struct mm_struct *
acct_update_integrals();
update_mem_hiwater();
} else
-
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1358,7 +1367,6 @@ static int do_wp_page(struct mm_struct *
}
pte_unmap(page_table);
page_cache_release(new_page);
- page_cache_release(old_page);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;

@@ -1687,8 +1695,7 @@ void swapin_readahead(swp_entry_t entry,
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore and have started atomic pte operations
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1700,15 +1707,14 @@ static int do_swap_page(struct mm_struct
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1727,12 +1733,11 @@ static int do_swap_page(struct mm_struct
grab_swap_token();
}

- mark_page_accessed(page);
+ SetPageReferenced(page);
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1766,79 +1771,93 @@ static int do_swap_page(struct mm_struct
set_pte(page_table, pte);
page_add_anon_rmap(page, vma, address);

+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, pte);
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+
if (write_access) {
+ page_table_atomic_start(mm);
if (do_wp_page(mm, vma, address,
page_table, pmd, pte) == VM_FAULT_OOM)
ret = VM_FAULT_OOM;
- goto out;
}

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, pte);
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
out:
return ret;
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held and atomic pte operations started.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
+ struct page * page;

- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ if (unlikely(!write_access)) {

- /* ..except if it's a write access */
- if (write_access) {
- /* Allocate our own private page. */
+ /* Read-only mapping of ZERO_PAGE. */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+
+ /*
+ * If the cmpxchg fails then another fault may be
+ * generated that may then be successful
+ */
+
+ if (ptep_cmpxchg(page_table, orig_entry, entry))
+ update_mmu_cache(vma, addr, entry);
+ else
+ inc_page_state(cmpxchg_fail_anon_read);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

- if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_zeroed_user_highpage(vma, addr);
- if (!page)
- goto no_mem;
+ return VM_FAULT_MINOR;
+ }

- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
+ page_table_atomic_stop(mm);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- update_mm_counter(mm, rss, 1);
- acct_update_integrals();
- update_mem_hiwater();
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- SetPageReferenced(page);
- page_add_anon_rmap(page, vma, addr);
+ /* Allocate our own private page. */
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ page = alloc_zeroed_user_highpage(vma, addr);
+ if (!page)
+ return VM_FAULT_OOM;
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
+ vma->vm_page_prot)),
+ vma);
+
+ spin_lock(&mm->page_table_lock);
+
+ if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ spin_unlock(&mm->page_table_lock);
+ inc_page_state(cmpxchg_fail_anon_write);
+ return VM_FAULT_MINOR;
}

- set_pte(page_table, entry);
- pte_unmap(page_table);
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ page_add_anon_rmap(page, vma, addr);
+ lru_cache_add_active(page);
+ update_mm_counter(mm, rss, 1);
+ acct_update_integrals();
+ update_mem_hiwater();

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
+ update_mmu_cache(vma, addr, entry);
+ pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
-out:
+
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}

/*
@@ -1850,12 +1869,12 @@ no_mem:
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held and atomic pte operations started.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1866,9 +1885,9 @@ do_no_page(struct mm_struct *mm, struct

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1976,7 +1995,7 @@ oom:
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1989,13 +2008,13 @@ static int do_file_page(struct mm_struct
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

- pgoff = pte_to_pgoff(*pte);
+ pgoff = pte_to_pgoff(entry);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -2014,49 +2033,45 @@ static int do_file_page(struct mm_struct
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to insure to handle that case properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ new_entry = pte_mkyoung(entry);
if (write_access) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+
+ /*
+ * If the cmpxchg fails then we will get another fault which
+ * has another chance of successfully updating the page table entry.
+ */
+ if (ptep_cmpxchg(pte, entry, new_entry)) {
+ flush_tlb_page(vma, address);
+ update_mmu_cache(vma, address, entry);
+ } else
+ inc_page_state(cmpxchg_fail_flag_update);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
+ if (pte_val(new_entry) == pte_val(entry))
+ inc_page_state(spurious_page_faults);
return VM_FAULT_MINOR;
}

@@ -2075,33 +2090,73 @@ int handle_mm_fault(struct mm_struct *mm

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We try to rely on the mmap_sem and the SMP-safe atomic PTE updates
+ * to synchronize with kswapd. However, the arch may fall back to the
+ * page table lock in page_table_atomic_start.
+ *
+ * We may be able to avoid taking and releasing the page_table_lock
+ * for the p??_alloc functions through atomic operations so we
+ * duplicate the functionality of pmd_alloc, pud_alloc and
+ * pte_alloc_map here.
*/
+ page_table_atomic_start(mm);
pgd = pgd_offset(mm, address);
- spin_lock(&mm->page_table_lock);
+ if (unlikely(pgd_none(*pgd))) {
+ pud_t *new;
+
+ page_table_atomic_stop(mm);
+ new = pud_alloc_one(mm, address);

- pud = pud_alloc(mm, pgd, address);
- if (!pud)
- goto oom;
-
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- goto oom;
-
- pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- goto oom;
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pud_free(new);
+ }
+
+ pud = pud_offset(pgd, address);
+ if (unlikely(pud_none(*pud))) {
+ pmd_t *new;
+
+ page_table_atomic_stop(mm);
+ new = pmd_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;

- return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+ page_table_atomic_start(mm);
+
+ if (!pud_test_and_populate(mm, pud, new))
+ pmd_free(new);
+ }

- oom:
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ pmd = pmd_offset(pud, address);
+ if (unlikely(!pmd_present(*pmd))) {
+ struct page *new;
+
+ page_table_atomic_stop(mm);
+ new = pte_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else {
+ inc_page_state(nr_page_table_pages);
+ mm->nr_ptes++;
+ }
+ }
+
+ pte = pte_offset_map(pmd, address);
+ return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}

#ifndef __ARCH_HAS_4LEVEL_HACK
Index: linux-2.6.10/include/asm-generic/pgtable-nopud.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable-nopud.h 2005-02-24 19:41:46.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable-nopud.h 2005-02-24 19:42:21.000000000 -0800
@@ -25,8 +25,14 @@ static inline int pgd_bad(pgd_t pgd) {
static inline int pgd_present(pgd_t pgd) { return 1; }
static inline void pgd_clear(pgd_t *pgd) { }
#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
-
#define pgd_populate(mm, pgd, pud) do { } while (0)
+
+#define __HAVE_ARCH_PGD_TEST_AND_POPULATE
+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+{
+ return 1;
+}
+
/*
* (puds are folded into pgds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
Index: linux-2.6.10/include/asm-generic/pgtable-nopmd.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable-nopmd.h 2005-02-24 19:41:46.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable-nopmd.h 2005-02-24 19:42:21.000000000 -0800
@@ -29,6 +29,11 @@ static inline void pud_clear(pud_t *pud)
#define pmd_ERROR(pmd) (pud_ERROR((pmd).pud))

#define pud_populate(mm, pmd, pte) do { } while (0)
+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+{
+ return 1;
+}

/*
* (pmds are folded into puds so this doesn't get actually called,
Index: linux-2.6.10/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable.h 2005-02-24 19:42:12.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable.h 2005-02-24 19:42:21.000000000 -0800
@@ -105,8 +105,14 @@ static inline pte_t ptep_get_and_clear(p
#ifdef CONFIG_ATOMIC_TABLE_OPS

/*
- * The architecture does support atomic table operations.
- * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using
+ * The architecture does support atomic table operations and
+ * all operations on page table entries must always be atomic.
+ *
+ * This means that the kernel will never encounter a partially updated
+ * page table entry.
+ *
+ * Since the architecture does support atomic table operations, we
+ * may provide generic atomic ptep_xchg and ptep_cmpxchg using
* cmpxchg and xchg.
*/
#ifndef __HAVE_ARCH_PTEP_XCHG
@@ -132,6 +138,65 @@ static inline pte_t ptep_get_and_clear(p
})
#endif

+/*
+ * page_table_atomic_start and page_table_atomic_stop may be used to
+ * define special measures that an arch needs to guarantee atomic
+ * operations outside of a spinlock. In the case that an arch does
+ * not support atomic page table operations we will fall back to the
+ * page table lock.
+ */
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_start(mm) do { } while (0)
+#endif
+
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_STOP
+#define page_table_atomic_stop(mm) do { } while (0)
+#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These simply acquire the page_table_lock for
+ * synchronization. An architecture may override these generic
+ * functions to provide atomic populate functions to make these
+ * more effective.
+ */
+
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ spin_lock(&(__mm)->page_table_lock); \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ spin_unlock(&(__mm)->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&(__mm)->page_table_lock); \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ spin_unlock(&(__mm)->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&(__mm)->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&(__mm)->page_table_lock); \
+ __rc; \
+})
+#endif
+
#else

/*
@@ -142,6 +207,11 @@ static inline pte_t ptep_get_and_clear(p
* short time frame. This means that the page_table_lock must be held
* to avoid a page fault that would install a new entry.
*/
+
+/* Fall back to the page table lock to synchronize page table access */
+#define page_table_atomic_start(mm) spin_lock(&(mm)->page_table_lock)
+#define page_table_atomic_stop(mm) spin_unlock(&(mm)->page_table_lock)
+
#ifndef __HAVE_ARCH_PTEP_XCHG
#define ptep_xchg(__ptep, __pteval) \
({ \
@@ -186,6 +256,41 @@ static inline pte_t ptep_get_and_clear(p
r; \
})
#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These rely on the page_table_lock being held.
+ */
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ __rc; \
+})
+#endif
+
#endif

#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
Index: linux-2.6.10/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgtable.h 2005-02-24 19:41:46.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgtable.h 2005-02-24 19:42:21.000000000 -0800
@@ -554,6 +554,8 @@ do { \
#define FIXADDR_USER_START GATE_ADDR
#define FIXADDR_USER_END (GATE_ADDR + 2*PERCPU_PAGE_SIZE)

+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define __HAVE_ARCH_PMD_TEST_AND_POPULATE
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
@@ -561,7 +563,7 @@ do { \
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
-#include <asm-generic/pgtable.h>
#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.10/include/linux/page-flags.h
===================================================================
--- linux-2.6.10.orig/include/linux/page-flags.h 2005-02-24 19:41:49.000000000 -0800
+++ linux-2.6.10/include/linux/page-flags.h 2005-02-24 19:42:21.000000000 -0800
@@ -131,6 +131,17 @@ struct page_state {
unsigned long allocstall; /* direct reclaim calls */

unsigned long pgrotated; /* pages rotated to tail of the LRU */
+
+ /* Low level counters */
+ unsigned long spurious_page_faults; /* Faults with no ops */
+ unsigned long cmpxchg_fail_flag_update; /* cmpxchg failures for pte flag update */
+ unsigned long cmpxchg_fail_flag_reuse; /* cmpxchg failures when cow reuse of pte */
+ unsigned long cmpxchg_fail_anon_read; /* cmpxchg failures on anonymous read */
+ unsigned long cmpxchg_fail_anon_write; /* cmpxchg failures on anonymous write */
+
+ /* rss deltas for the current executing thread */
+ long rss;
+ long anon_rss;
};

extern void get_page_state(struct page_state *ret);
Index: linux-2.6.10/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.10.orig/fs/proc/proc_misc.c 2005-02-24 19:41:44.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c 2005-02-24 19:42:21.000000000 -0800
@@ -127,7 +127,7 @@ static int meminfo_read_proc(char *page,
unsigned long allowed;
struct vmalloc_info vmi;

- get_page_state(&ps);
+ get_full_page_state(&ps);
get_zone_counts(&active, &inactive, &free);

/*
@@ -168,7 +168,12 @@ static int meminfo_read_proc(char *page,
"PageTables: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "VmallocChunk: %8lu kB\n"
+ "Spurious page faults : %8lu\n"
+ "cmpxchg fail flag update: %8lu\n"
+ "cmpxchg fail COW reuse : %8lu\n"
+ "cmpxchg fail anon read : %8lu\n"
+ "cmpxchg fail anon write : %8lu\n",
K(i.totalram),
K(i.freeram),
K(i.bufferram),
@@ -191,7 +196,12 @@ static int meminfo_read_proc(char *page,
K(ps.nr_page_table_pages),
VMALLOC_TOTAL >> 10,
vmi.used >> 10,
- vmi.largest_chunk >> 10
+ vmi.largest_chunk >> 10,
+ ps.spurious_page_faults,
+ ps.cmpxchg_fail_flag_update,
+ ps.cmpxchg_fail_flag_reuse,
+ ps.cmpxchg_fail_anon_read,
+ ps.cmpxchg_fail_anon_write
);

len += hugetlb_report_meminfo(page + len);
Index: linux-2.6.10/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h 2005-02-24 19:41:46.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h 2005-02-24 19:42:21.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE 0
+#define PUD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -82,6 +86,13 @@ pud_populate (struct mm_struct *mm, pud_
pud_val(*pud_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE;
+}
+
static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
{
@@ -127,6 +138,14 @@ pmd_populate (struct mm_struct *mm, pmd_
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{

2005-03-02 04:01:35

by Christoph Lameter

[permalink] [raw]
Subject: Page fault scalability patch V18: No page table lock in do_anonymous_page

Do not use the page_table_lock in do_anonymous_page. This will significantly
increase the parallelism in the page fault handler in SMP systems. The patch
also modifies the definitions of _mm_counter functions so that rss and anon_rss
become atomic.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-02-24 19:42:21.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-02-24 19:42:25.000000000 -0800
@@ -1832,12 +1832,12 @@ do_anonymous_page(struct mm_struct *mm,
vma->vm_page_prot)),
vma);

- spin_lock(&mm->page_table_lock);
+ page_table_atomic_start(mm);

if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
pte_unmap(page_table);
page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
inc_page_state(cmpxchg_fail_anon_write);
return VM_FAULT_MINOR;
}
@@ -1855,7 +1855,7 @@ do_anonymous_page(struct mm_struct *mm,

update_mmu_cache(vma, addr, entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

return VM_FAULT_MINOR;
}
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2005-02-24 19:42:17.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-02-24 19:42:25.000000000 -0800
@@ -203,10 +203,26 @@ arch_get_unmapped_area_topdown(struct fi
extern void arch_unmap_area(struct vm_area_struct *area);
extern void arch_unmap_area_topdown(struct vm_area_struct *area);

+#ifdef CONFIG_ATOMIC_TABLE_OPS
+/*
+ * Atomic page table operations require that the counters are also
+ * incremented atomically
+ */
+#define set_mm_counter(mm, member, value) atomic_set(&(mm)->member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->member))
+#define update_mm_counter(mm, member, value) atomic_add(value, &(mm)->member)
+#define MM_COUNTER_T atomic_t
+
+#else
+/*
+ * No atomic page table operations. Counters are protected by
+ * the page table lock
+ */
#define set_mm_counter(mm, member, value) (mm)->member = (value)
#define get_mm_counter(mm, member) ((mm)->member)
#define update_mm_counter(mm, member, value) (mm)->member += (value)
#define MM_COUNTER_T unsigned long
+#endif

struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */

2005-03-03 02:00:31

by Andrew Morton

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Christoph Lameter <[email protected]> wrote:
>
> ...
> static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
> unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
> @@ -1306,22 +1308,25 @@ static int do_wp_page(struct mm_struct *
> flush_cache_page(vma, address);
> entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
> vma);
> - ptep_set_access_flags(vma, address, page_table, entry, 1);
> - update_mmu_cache(vma, address, entry);
> + /*
> + * If the bits are not updated then another fault
> + * will be generated with another chance of updating.
> + */
> + if (ptep_cmpxchg(page_table, pte, entry))
> + update_mmu_cache(vma, address, entry);
> + else
> + inc_page_state(cmpxchg_fail_flag_reuse);
> pte_unmap(page_table);
> - spin_unlock(&mm->page_table_lock);
> + page_table_atomic_stop(mm);
> return VM_FAULT_MINOR;
> }
> }
> pte_unmap(page_table);
> + page_table_atomic_stop(mm);
>
> /*
> * Ok, we need to copy. Oh, well..
> */
> - if (!PageReserved(old_page))
> - page_cache_get(old_page);

hm, this seems to be an unrelated change. You're saying that this page is
protected from munmap() by munmap()'s down_write(mmap_sem), yes? What
stops memory reclaim from freeing old_page?

> static int do_swap_page(struct mm_struct * mm,
> struct vm_area_struct * vma, unsigned long address,
> @@ -1727,12 +1733,11 @@ static int do_swap_page(struct mm_struct
> grab_swap_token();
> }
>
> - mark_page_accessed(page);
> + SetPageReferenced(page);

Another unrelated change. IIRC, this is indeed equivalent, but I forget
why. Care to remind me?


Overall, do we know which architectures are capable of using this feature?
Would ppc64 (and sparc64?) still have a problem with page_table_lock no
longer protecting their internals?

I'd really like to see other architecture maintainers stand up and say
"yes, we need this".

Did you consider doing the locking at the pte page level? That could be
neater than all those games with atomic pte operations.

We need to do the big page-table-walker code consolidation/cleanup. That
might have some overlap.

2005-03-03 02:20:51

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Wed, 2 Mar 2005, Andrew Morton wrote:

> > - if (!PageReserved(old_page))
> > - page_cache_get(old_page);
>
> hm, this seems to be an unrelated change. You're saying that this page is
> protected from munmap() by munmap()'s down_write(mmap_sem), yes? What
> stops memory reclaim from freeing old_page?

This is a related change discussed during V16 with Nick.

The page is protected from munmap because of the down_read(mmap_sem) in
the arch specific code before calling handle_mm_fault.

> > - mark_page_accessed(page);
> > + SetPageReferenced(page);
>
> Another unrelated change. IIRC, this is indeed equivalent, but I forget
> why. Care to remind me?

Seems that mark_page_accessed was discouraged in favor of SetPageReferenced.
We agreed that we wanted this change earlier (I believe that was in
November?).

> Overall, do we know which architectures are capable of using this feature?
> Would ppc64 (and sparc64?) still have a problem with page_table_lock no
> longer protecting their internals?

That is up to the arch maintainers. Add something to arch/xx/Kconfig to
allow atomic operations for an arch. Out of the box it only works for
x86_64, ia64 and ia32.

> I'd really like to see other architecture maintainers stand up and say
> "yes, we need this".

You definitely need this for machines with high SMP counts.

> Did you consider doing the locking at the pte page level? That could be
> neater than all those games with atomic pte operations.

Earlier releases back in September 2004 had some pte locking code (and
AFAIK Nick also played around with pte locking) but that
was less efficient than atomic operations.

2005-03-03 03:00:50

by Andrew Morton

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Christoph Lameter <[email protected]> wrote:
>
> On Wed, 2 Mar 2005, Andrew Morton wrote:
>
> > > - if (!PageReserved(old_page))
> > > - page_cache_get(old_page);
> >
> > hm, this seems to be an unrelated change. You're saying that this page is
> > protected from munmap() by munmap()'s down_write(mmap_sem), yes? What
> > stops memory reclaim from freeing old_page?
>
> This is a related change discussed during V16 with Nick.

It's worth retaining a paragraph for the changelog.

> The page is protected from munmap because of the down_read(mmap_sem) in
> the arch specific code before calling handle_mm_fault.

We don't take mmap_sem during page reclaim. What prevents the page from
being freed by, say, kswapd?

> > > - mark_page_accessed(page);
> > > + SetPageReferenced(page);
> >
> > Another unrelated change. IIRC, this is indeed equivalent, but I forget
> > why. Care to remind me?
>
> Seems that mark_page_accessed was discouraged in favor of SetPageReferenced.
> We agreed that we wanted this change earlier (I believe that was in
> November?).

I forget. I do recall that we decided that the change was OK, but briefly
looking at it now, it seems that we'll fail to move a
PageReferenced,!PageActive onto the active list?

> > Overall, do we know which architectures are capable of using this feature?
> > Would ppc64 (and sparc64?) still have a problem with page_table_lock no
> > longer protecting their internals?
>
> That is up to the arch maintainers. Add something to arch/xx/Kconfig to
> allow atomic operations for an arch. Out of the box it only works for
> x86_64, ia64 and ia32.

Feedback from s390, sparc64 and ppc64 people would help in making a merge
decision.

> > I'd really like to see other architecture maintainers stand up and say
> > "yes, we need this".
>
> You definitely need this for machines with high SMP counts.

Well. We need some solution to the page_table_lock problem on high SMP
counts.

> > Did you consider doing the locking at the pte page level? That could be
> > neater than all those games with atomic pte operations.
>
> Earlier releases back in September 2004 had some pte locking code (and
> AFAIK Nick also played around with pte locking) but that
> was less efficient than atomic operations.

How much less efficient?

Does anyone else have that code around?

2005-03-03 03:28:52

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Wed, 2 Mar 2005, Andrew Morton wrote:

> > This is a related change discussed during V16 with Nick.
>
> It's worth retaining a paragraph for the changelog.

There have been extensive discussions on all aspects of this patch.
This issue was discussed in
http://marc.theaimsgroup.com/?t=110694497200004&r=1&w=2

>
> > The page is protected from munmap because of the down_read(mmap_sem) in
> > the arch specific code before calling handle_mm_fault.
>
> We don't take mmap_sem during page reclaim. What prevents the page from
> being freed by, say, kswapd?

The cmpxchg will fail if that happens.

> I forget. I do recall that we decided that the change was OK, but briefly
> looking at it now, it seems that we'll fail to move a
> PageReferenced,!PageActive onto the active list?

See http://marc.theaimsgroup.com/?l=bk-commits-head&m=110481975332117&w=2

and

http://marc.theaimsgroup.com/?l=linux-kernel&m=110272296503539&w=2

> > That is up to the arch maintainers. Add something to arch/xx/Kconfig to
> > allow atomic operations for an arch. Out of the box it only works for
> > x86_64, ia64 and ia32.
>
> Feedback from s390, sparc64 and ppc64 people would help in making a merge
> decision.

These architectures do not have atomic ptes enabled. It would require
them to submit a patch to activate atomic ptes for these architectures.

> > You definitely need this for machines with high SMP counts.
>
> Well. We need some solution to the page_table_lock problem on high SMP
> counts.

Great!

> > Earlier releases back in September 2004 had some pte locking code (and
> > AFAIK Nick also played around with pte locking) but that
> > was less efficient than atomic operations.
>
> How much less efficient?
> Does anyone else have that code around?

Nick may have some data. It got far too complicated too fast when I tried
to introduce locking for individual ptes. It required bit
spinlocks for the pte meaning multiple atomic operations. One
would have to check for the lock being active leading to significant code
changes. This would include the arch specific low level fault handlers to
update bits, walk the page table, etc.
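As a rough sketch of what that scheme looks like (illustration only: the
_PAGE_LOCK_BIT software bit below is hypothetical, none of the posted patches
reserve one), a pte bit spinlock costs at least two atomic operations per
update where ptep_cmpxchg costs one:

/* assumes <linux/bitops.h>; _PAGE_LOCK_BIT is an invented software pte bit */
static inline void pte_lock(pte_t *ptep)
{
        while (test_and_set_bit(_PAGE_LOCK_BIT, (unsigned long *)ptep))
                cpu_relax();            /* atomic op #1, repeated under contention */
}

static inline void pte_unlock(pte_t *ptep)
{
        smp_mb__before_clear_bit();
        clear_bit(_PAGE_LOCK_BIT, (unsigned long *)ptep);       /* atomic op #2 */
}

Every pte update would have to be bracketed by pte_lock()/pte_unlock(), and
anything that inspects ptes concurrently (including the low level fault
handlers mentioned above) would have to honor the lock bit.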

2005-03-03 04:22:06

by Andrew Morton

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Christoph Lameter <[email protected]> wrote:
>
> On Wed, 2 Mar 2005, Andrew Morton wrote:
>
> > > This is a related change discussed during V16 with Nick.
> >
> > It's worth retaining a paragraph for the changelog.
>
> There have been extensive discussions on all aspects of this patch.
> This issue was discussed in
> http://marc.theaimsgroup.com/?t=110694497200004&r=1&w=2

This is a difficult, intrusive and controversial patch. Things like the
above should be done in a separate patch. Not only does this aid
maintainability, it also allows the change to be performance tested in
isolation.

If the change gets folded into other changes then it would be best to draw
attention to, and fully explain/justify the change within the changelog.

> >
> > > The page is protected from munmap because of the down_read(mmap_sem) in
> > > the arch specific code before calling handle_mm_fault.
> >
> > We don't take mmap_sem during page reclaim. What prevents the page from
> > being freed by, say, kswapd?
>
> The cmpxchg will fail if that happens.

How about if someone does remap_file_pages() against that virtual address
and that syscall happens to pick the same physical page? We have the same
physical page at the same pte slot with different contents, and the cmpxchg
will succeed.

Maybe mmap_sem will save us, maybe not. Either way, this change needs a
ton of analysis, justification and documentation, please.

Plus if the page gets freed under our feet, CONFIG_DEBUG_PAGEALLOC will
oops during the copy.

> > I forget. I do recall that we decided that the change was OK, but briefly
> > looking at it now, it seems that we'll fail to move a
> > PageReferenced,!PageActive onto the active list?
>
> See http://marc.theaimsgroup.com/?l=bk-commits-head&m=110481975332117&w=2
>
> and
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=110272296503539&w=2

Those are different cases. I still don't see why the change is justified in
do_swap_page().

> > > That is up to the arch maintainers. Add something to arch/xx/Kconfig to
> > > allow atomic operations for an arch. Out of the box it only works for
> > > x86_64, ia64 and ia32.
> > > > Feedback from s390, sparc64 and ppc64 people would help in making a merge
> > decision.
>
> These architectures do not have atomic ptes enabled. It would require
> them to submit a patch to activate atomic ptes for these architectures.


But if the approach which these patches take is not suitable for these
architectures then they have no solution to the scalability problem. The
machines will perform suboptimally and more (perhaps conflicting)
development will be needed.

> > > Earlier releases back in September 2004 had some pte locking code (and
> > > AFAIK Nick also played around with pte locking) but that
> > > was less efficient than atomic operations.
> >
> > How much less efficient?
> > Does anyone else have that code around?
>
> Nick may have some data. It got far too complicated too fast when I tried
> to introduce locking for individual ptes. It required bit
> spinlocks for the pte meaning multiple atomic operations.

One could add a spinlock to the pageframe, or use hashed spinlocking.

> One
> would have to check for the lock being active leading to significant code
> changes.

Why?

> This would include the arch specific low level fault handlers to
> update bits, walk the page table, etc.


2005-03-03 05:03:51

by Andrew Morton

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Christoph Lameter <[email protected]> wrote:
>
> > > The cmpxchg will fail if that happens.
> >
> > How about if someone does remap_file_pages() against that virtual address
> > and that syscalls happens to pick the same physical page? We have the same
> > physical page at the same pte slot with different contents, and the cmpxchg
> > will succeed.
>
> Any mmap changes requires the mmapsem.

sys_remap_file_pages() will call install_page() under down_read(mmap_sem).
It relies upon page_table_lock for pte atomicity.

> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=110272296503539&w=2
> >
> > Those are different cases. I still don't see why the change is justified in
> > do_swap_page().
>
> Lets undo that then.

OK.

> > > These architectures do not have atomic ptes enabled. It would require
> > > them to submit a patch to activate atomic ptes for these architectures.
> >
> >
> > But if the approach which these patches take is not suitable for these
> > architectures then they have no solution to the scalability problem. The
> > machines will perform suboptimally and more (perhaps conflicting)
> > development will be needed.
>
> They can implement their own approach with the provided hooks. You could
> for example use SSE / MMX for atomic 64 bit ops on i386 with PAE mode by
> using the start/stop macros to deal with the floating point issues.

Have the ppc64 and sparc64 people reviewed and acked the change? (Not a
facetious question - I just haven't been following the saga sufficiently
closely to remember).

> > > One
> > > would have to check for the lock being active leading to significant code
> > > changes.
> >
> > Why?
>
> Because if a pte is locked it should not be used.

Confused. Why not just spin on the lock in the normal manner?

> Look this is an endless discussion with new things brought up at every
> corner and I have reworked the patches numerous times. Could you tell me
> some step by step way that we can finally deal with this? Specify a
> sequence of patches and I will submit them to you step by step.

No, I couldn't do that - that's what the collective brain is for.

Look, I'm sorry, but this patch is highly atypical. Few have this much
trouble. I have a queasy feeling about it (maybe too low-level locking,
maybe inappropriate to other architectures, only addresses a subset of
workloads on a tiny subset of machines, doesn't seem to address all uses of
the lock, etc) and I know that others have had, and continue to have
similar feelings. But if we could think of anything better, we'd have said
so :( It's a difficult problem.

If the other relevant architecture people say "we can use this" then perhaps
we should grin and bear it. But one does wonder whether some more sweeping
design change is needed.

2005-03-03 05:04:33

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Wed, 2 Mar 2005, Andrew Morton wrote:

> > There have been extensive discussions on all aspects of this patch.
> > This issue was discussed in
> > http://marc.theaimsgroup.com/?t=110694497200004&r=1&w=2
>
> This is a difficult, intrusive and controversial patch. Things like the
> above should be done in a separate patch. Not only does this aid
> maintainability, it also allows the change to be performance tested in
> isolation.

Well it would have been great if this was mentioned in the actual
discussion at the time.

> > The cmpxchg will fail if that happens.
>
> How about if someone does remap_file_pages() against that virtual address
> and that syscalls happens to pick the same physical page? We have the same
> physical page at the same pte slot with different contents, and the cmpxchg
> will succeed.

Any mmap changes requires the mmapsem.

> > http://marc.theaimsgroup.com/?l=linux-kernel&m=110272296503539&w=2
>
> Those are different cases. I still don't see why the change is justified in
> do_swap_page().

Lets undo that then.

> > These architectures do not have atomic ptes enabled. It would require
> > them to submit a patch to activate atomic ptes for these architectures.
>
>
> But if the approach which these patches take is not suitable for these
> architectures then they have no solution to the scalability problem. The
> machines will perform suboptimally and more (perhaps conflicting)
> development will be needed.

They can implement their own approach with the provided hooks. You could
for example use SSE / MMX for atomic 64 bit ops on i386 with PAE mode by
using the start/stop macros to deal with the floating point issues.
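A minimal sketch of that idea (hypothetical, not part of the posted patches;
a real version would also have to provide cmpxchg8b/movq based ptep_cmpxchg
and ptep_xchg for the 64bit PAE ptes):

#define __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
#define page_table_atomic_start(mm)     kernel_fpu_begin()

#define __HAVE_ARCH_PAGE_TABLE_ATOMIC_STOP
#define page_table_atomic_stop(mm)      kernel_fpu_end()

kernel_fpu_begin()/kernel_fpu_end() disable preemption and save/restore the
FPU/MMX state, which is what makes an MMX movq of a complete 64bit pte safe
outside the page_table_lock.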

> > One
> > would have to check for the lock being active leading to significant code
> > changes.
>
> Why?

Because if a pte is locked it should not be used.

Look this is an endless discussion with new things brought up at every
corner and I have reworked the patches numerous times. Could you tell me
some step by step way that we can finally deal with this? Specify a
sequence of patches and I will submit them to you step by step.

2005-03-03 05:08:25

by Paul Mackerras

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Andrew Morton writes:

> But if the approach which these patches take is not suitable for these
> architectures then they have no solution to the scalability problem. The
> machines will perform suboptimally and more (perhaps conflicting)
> development will be needed.

We can do a pte_cmpxchg on ppc64. We already have a busy bit in the
PTE and do most operations atomically, in order to ensure that we
don't get races or inconsistencies due to accesses to the PTE by the
low-level hash_page() routine (which instantiates a hardware PTE in
the hardware hash table based on a Linux PTE), because it already
accesses the linux page tables without taking the mm->page_table_lock.

However, there are other developments we are considering in this area:
notably Ben wants to change things so that when we invalidate a Linux
PTE we leave it busy until we actually remove the hardware PTE from
the hash table. Also we are looking forward to DaveM's patch which
will change the generic MM code to give us the mm and address on all
PTE operations, which will simplify some things for us. I don't
really want to have to think about pte_cmpxchg until those other
things are sorted out.

More generally, I would be interested to know what sorts of
applications or benchmarks show scalability problems on large machines
due to contention on mm->page_table_lock.

Paul.

2005-03-03 05:26:42

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Wed, 2 Mar 2005, Andrew Morton wrote:

> Have the ppc64 and sparc64 people reviewed and acked the change? (Not a
> facetious question - I just haven't been following the saga sufficiently
> closely to remember).

There should be no change to these arches

> > Because if a pte is locked it should not be used.
>
> Confused. Why not just spin on the lock in the normal manner?

I thought you wanted to lock the pte? This is realized through a lock bit
in the pte. If that lock bit is set one should not use the pte. Otherwise
the lock is bypassed. Or are you proposing a write lock only?

> If the other relevant architecture people say "we can use this" then perhaps
> we should grin and bear it. But one does wonder whether some more sweeping
> design change is needed.

Could we at least get the first two patches in? I can then gradually
address the other issues piece by piece.

The necessary more sweeping design change can be found at

http://marc.theaimsgroup.com/?l=linux-kernel&m=110922543030922&w=2

but these may be a long way off. These patches address an urgent issue
that we have had with higher CPU counts for a long time, and the method used
here has been used for years in our ProPack line.

2005-03-03 05:27:30

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Thu, 3 Mar 2005, Paul Mackerras wrote:

> More generally, I would be interested to know what sorts of
> applications or benchmarks show scalability problems on large machines
> due to contention on mm->page_table_lock.

Number crunching apps that use vast amounts of memory through MPI or
large databases etc. They stall for a long time during their
initialization phase without these patches.

2005-03-03 05:27:29

by Nick Piggin

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Andrew Morton wrote:

>Christoph Lameter <[email protected]> wrote:
>
>>On Wed, 2 Mar 2005, Andrew Morton wrote:
>>
>>
>>>>Earlier releases back in September 2004 had some pte locking code (and
>>>>AFAIK Nick also played around with pte locking) but that
>>>>was less efficient than atomic operations.
>>>>
>>>How much less efficient?
>>>Does anyone else have that code around?
>>>
>>Nick may have some data. It got far too complicated too fast when I tried
>>to introduce locking for individual ptes. It required bit
>>spinlocks for the pte meaning multiple atomic operations.
>>
>
>One could add a spinlock to the pageframe, or use hashed spinlocking.
>
>

I did have a version using bit spin locks in the pte on ia64. That
only works efficiently on architectures whose MMU hardware won't
concurrently update the pte (so you can do non-atomic pte operations
and non-atomic unlocks on a locked pte).

I pretty much solved all the efficiency problems IIRC. Of course
this doesn't work on i386 or x86_64.

Having a spinlock for example per pte page might be another good
option that we haven't looked at.

>>One
>>would have to check for the lock being active leading to significant code
>>changes.
>>
>
>Why?
>
>

When using per-pte locks on ia64 for example, the low level code that
walks page tables and sets dirty, accessed, etc bits needs to be made
aware of the pte lock. But Keith made me up a little patch to do this,
and it is pretty simple.


2005-03-03 05:46:26

by David Miller

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Thu, 3 Mar 2005 16:00:10 +1100
Paul Mackerras <[email protected]> wrote:

> Andrew Morton writes:
>
> > But if the approach which these patches take is not suitable for these
> > architectures then they have no solution to the scalability problem. The
> > machines will perform suboptimally and more (perhaps conflicting)
> > development will be needed.
>
> We can do a pte_cmpxchg on ppc64.

We really can't make use of this on sparc64. Unlike ppc64 I don't
have the hash table issue (although sparc64 TLB's have a hash lookup
helping mechanism in hardware, which we ignore since virtually mapped
linear page tables are faster than Sun's bogus TSB table scheme).

I make all real faults go through the handle_mm_fault() path so all
page table modifications are serialized by the page table lock.
The TLB miss handlers never modify PTEs, not even to update access
and dirty bits.

Actually, I guess I could do the pte_cmpxchg() stuff, but only if it's
used to "add" access. If the TLB miss handler races, we just go into
the handle_mm_fault() path unnecessarily in order to synchronize.

However, if this pte_cmpxchg() thing is used for removing access, then
sparc64 can't use it. In such a case a race in the TLB handler would
result in using an invalid PTE. I could "spin" on some lock bit, but
there is no way I'm adding instructions to the carefully constructed
TLB miss handler assembler on sparc64 just for that :-)

2005-03-03 05:45:08

by Andrew Morton

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Christoph Lameter <[email protected]> wrote:
>
> On Wed, 2 Mar 2005, Andrew Morton wrote:
>
> > Have the ppc64 and sparc64 people reviewed and acked the change? (Not a
> > facetious question - I just haven't been following the saga sufficiently
> > closely to remember).
>
> There should be no change to these arches

But we must at least confirm that these architectures can make these
changes in the future. If they make no changes then they haven't
benefitted from the patch. And the patch must be suitable for all
architectures which might hit this scalability problem.

> > > Because if a pte is locked it should not be used.
> >
> > Confused. Why not just spin on the lock in the normal manner?
>
> I thought you wanted to lock the pte? This is realized through a lock bit
> in the pte. If that lock bit is set one should not use the pte. Otherwise
> the lock is bypassed. Or are you proposing a write lock only?

I was suggesting a lock in (or associated with) the pageframe of the page
which holds the pte. That's just a convenient way of hashing the locking.
Probably that's not much different from the atomic pte ops.
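
Roughly like this (illustration only; struct page has no such "ptl" field
today, it would have to be added or unioned with an existing member):

static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
        return &pmd_page(*pmd)->ptl;            /* hypothetical field */
}

static void set_pte_locked(struct mm_struct *mm, pmd_t *pmd,
                pte_t *ptep, pte_t newval)
{
        spinlock_t *ptl = pte_lockptr(mm, pmd);

        spin_lock(ptl);                 /* contention is per pte page, not per mm */
        set_pte(ptep, newval);
        spin_unlock(ptl);
}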

> > If the other relevant architecture people say "we can use this" then perhaps
> > we should grin and bear it. But one does wonder whether some more sweeping
> > design change is needed.
>
> Could we at least get the first two patches in? I can then gradually
> address the other issues piece by piece.

The atomic ops patch should be coupled with the main patch series. The mm
counter one we could sneak in beforehand, I guess.

> The necessary more sweeping design change can be found at
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=110922543030922&w=2
>
> but these may be a long way off.

Yes, that seemed sensible, although it may not work out to be as clean as
it appears.

But how would that work allow us to address page_table_lock scalability
problems?

2005-03-03 05:50:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Wed, 2 Mar 2005, Andrew Morton wrote:

> > There should be no change to these arches
>
> But we must at least confirm that these architectures can make these
> changes in the future. If they make no changes then they haven't
> benefitted from the patch. And the patch must be suitable for all
> architectures which might hit this scalability problem.
>
> > Could we at least get the first two patches in? I can then gradually
> > address the other issues piece by piece.
>
> The atomic ops patch should be coupled with the main patch series. The mm
> counter one we could sneak in beforehand, I guess.

The atomic ops patch basically just avoids doing a pte_clear and then
setting the pte for archs that define CONFIG_ATOMIC_TABLE_OPS. This is
unnecessary on ia64 and ia32 AFAIK.

>
> > The necessary more sweeping design change can be found at
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=110922543030922&w=2
> >
> > but these may be a long way off.
>
> Yes, that seemed sensible, although it may not work out to be as clean as
> it appears.

Of course. But at least we would like to start as clean as possible.

> But how would that work allow us to address page_table_lock scalability
> problems?

Because the actual locking method is abstracted in a transaction
(idea by Nick Piggin, I just tried to make it cleaner). The arch may use
pte locking, pmd locking, atomic ops or whatever to provide
synchronization for page table operations.


2005-03-03 05:55:48

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Wed, 2 Mar 2005, David S. Miller wrote:

> Actually, I guess I could do the pte_cmpxchg() stuff, but only if it's
> used to "add" access. If the TLB miss handler races, we just go into
> the handle_mm_fault() path unnecessarily in order to synchronize.
>
> However, if this pte_cmpxchg() thing is used for removing access, then
> sparc64 can't use it. In such a case a race in the TLB handler would
> result in using an invalid PTE. I could "spin" on some lock bit, but
> there is no way I'm adding instructions to the carefully constructed
> TLB miss handler assembler on sparc64 just for that :-)

There is no need to provide pte_cmpxchg. If the arch does not support
cmpxchg on ptes (CONFIG_ATOMIC_TABLE_OPS not defined)
then it will fall back to using pte_get_and_clear while holding the
page_table_lock to ensure that the entry is not touched while performing
the comparison.
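
Something with the shape below (a sketch only; the actual patch spells this
as a ptep_cmpxchg macro in asm-generic/pgtable.h and relies on the caller
having taken the page_table_lock via page_table_atomic_start):

static inline int ptep_cmpxchg_fallback(pte_t *ptep, pte_t oldval, pte_t newval)
{
        pte_t cur = ptep_get_and_clear(ptep);   /* pte is briefly cleared */
        int match = pte_same(cur, oldval);

        set_pte(ptep, match ? newval : cur);    /* install new value or restore */
        return match;
}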

2005-03-03 06:00:29

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl


> However, if this pte_cmpxchg() thing is used for removing access, then
> sparc64 can't use it. In such a case a race in the TLB handler would
> result in using an invalid PTE. I could "spin" on some lock bit, but
> there is no way I'm adding instructions to the carefully constructed
> TLB miss handler assembler on sparc64 just for that :-)

Can't you add a lock bit in the PTE itself like we do on ppc64 hash
refill ?

Ok, ok, you don't want to add instructions, fair enough :) On ppc64, I
had to do that to close some nasty race we had in the hash refill, but
it came almost for free as we already had an atomic loop in there.

Ben.


2005-03-03 06:31:07

by Andrew Morton

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Christoph Lameter <[email protected]> wrote:
>
> On Wed, 2 Mar 2005, Andrew Morton wrote:
>
> > > Any mmap changes requires the mmapsem.
> >
> > sys_remap_file_pages() will call install_page() under down_read(mmap_sem).
> > It relies upon page_table_lock for pte atomicity.
>
> This is not relevant since it only deals with file pages.

OK. And CONFIG_DEBUG_PAGEALLOC?

> ptes are only
> installed atomically for anonymous memory (if CONFIG_ATOMIC_OPS
> is defined).

It's a shame. A *nice* solution to this problem would address all pte ops
and wouldn't have such special cases...

2005-03-03 06:31:09

by Nick Piggin

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Benjamin Herrenschmidt wrote:

>>However, if this pte_cmpxchg() thing is used for removing access, then
>>sparc64 can't use it. In such a case a race in the TLB handler would
>>result in using an invalid PTE. I could "spin" on some lock bit, but
>>there is no way I'm adding instructions to the carefully constructed
>>TLB miss handler assembler on sparc64 just for that :-)
>>
>
>Can't you add a lock bit in the PTE itself like we do on ppc64 hash
>refill ?
>
>

You don't want to do that for all architectures, as I said earlier.
eg. i386 can concurrently set the dirty bit with the MMU (which won't
honour the lock).

So you then need an atomic lock, atomic pte operations, and atomic
unlock where previously you had only the atomic pte operation. This is
disastrous for performance.


2005-03-03 06:36:01

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Fri, 2005-03-04 at 04:19 +1100, Nick Piggin wrote:

> You don't want to do that for all architectures, as I said earlier.
> eg. i386 can concurrently set the dirty bit with the MMU (which won't
> honour the lock).
>
> So you then need an atomic lock, atomic pte operations, and atomic
> unlock where previously you had only the atomic pte operation. This is
> disastrous for performance.

Of course, but I was answering to David about sparc64 which uses
software TLB load :)

Ben.


2005-03-03 06:55:40

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Wed, 2005-03-02 at 21:51 -0800, Christoph Lameter wrote:
> On Wed, 2 Mar 2005, David S. Miller wrote:
>
> > Actually, I guess I could do the pte_cmpxchg() stuff, but only if it's
> > used to "add" access. If the TLB miss handler races, we just go into
> > the handle_mm_fault() path unnecessarily in order to synchronize.
> >
> > However, if this pte_cmpxchg() thing is used for removing access, then
> > sparc64 can't use it. In such a case a race in the TLB handler would
> > result in using an invalid PTE. I could "spin" on some lock bit, but
> > there is no way I'm adding instructions to the carefully constructed
> > TLB miss handler assembler on sparc64 just for that :-)
>
> There is no need to provide pte_cmpxchg. If the arch does not support
> cmpxchg on ptes (CONFIG_ATOMIC_TABLE_OPS not defined)
> then it will fall back to using pte_get_and_clear while holding the
> page_table_lock to ensure that the entry is not touched while performing
> the comparison.

Nah, this is wrong :)

We actually _want_ pte_cmpxchg on ppc64, because we can do the stuff,
but it requires some careful manipulation of some bits in the PTE that
are beyond linux common layer understanding :) Like the BUSY bit which
is a lock bit for arbitrating with the hash fault handler for example.

Also, if it's ever used to cmpxchg from anything but a !present PTE, it
will need additional massaging (like the COW case where we just
"replace" a PTE with set_pte). We also need to preserve some bits in
there that indicate if the PTE was in the hash table and where in the
hash so we can flush it afterward.

Ben.


2005-03-03 06:55:41

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Wed, 2 Mar 2005, Andrew Morton wrote:

> > Any mmap changes requires the mmapsem.
>
> sys_remap_file_pages() will call install_page() under down_read(mmap_sem).
> It relies upon page_table_lock for pte atomicity.

This is not relevant since it only deals with file pages. ptes are only
installed atomically for anonymous memory (if CONFIG_ATOMIC_OPS
is defined).

do_file_page() does call the populate function which does the right thing
in acquiring the page_table_lock before a pte update. My patch does not
touch that.

/*
 * Install a file pte to a given virtual memory address, release any
 * previously existing mapping.
 */
int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma,
                unsigned long addr, unsigned long pgoff, pgprot_t prot)
{
        int err = -ENOMEM;
        pte_t *pte;
        pmd_t *pmd;
        pud_t *pud;
        pgd_t *pgd;
        pte_t pte_val;

        pgd = pgd_offset(mm, addr);
        spin_lock(&mm->page_table_lock);

        pud = pud_alloc(mm, pgd, addr);
        if (!pud)
                goto err_unlock;

        pmd = pmd_alloc(mm, pud, addr);
        if (!pmd)
                goto err_unlock;

        pte = pte_alloc_map(mm, pmd, addr);
        if (!pte)
                goto err_unlock;

        zap_pte(mm, vma, addr, pte);

        set_pte(pte, pgoff_to_pte(pgoff));
        pte_val = *pte;
        pte_unmap(pte);
        update_mmu_cache(vma, addr, pte_val);
        spin_unlock(&mm->page_table_lock);
        return 0;

err_unlock:
        spin_unlock(&mm->page_table_lock);
        return err;
}



2005-03-03 07:44:38

by Nick Piggin

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Benjamin Herrenschmidt wrote:
> On Fri, 2005-03-04 at 04:19 +1100, Nick Piggin wrote:
>
>
>>You don't want to do that for all architectures, as I said earlier.
>>eg. i386 can concurrently set the dirty bit with the MMU (which won't
>>honour the lock).
>>
>>So you then need an atomic lock, atomic pte operations, and atomic
>>unlock where previously you had only the atomic pte operation. This is
>>disastrous for performance.
>
>
> Of course, but I was answering to David about sparc64 which uses
> software TLB load :)
>

Oh sorry, I completely misunderstood what you said... and the
context in which you said it :P

2005-03-03 16:52:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Thu, 3 Mar 2005, Benjamin Herrenschmidt wrote:

> > There is no need to provide pte_cmpxchg. If the arch does not support
> > cmpxchg on ptes (CONFIG_ATOMIC_TABLE_OPS not defined)
> > then it will fall back to using pte_get_and_clear while holding the
> > page_table_lock to ensure that the entry is not touched while performing
> > the comparison.
>
> Nah, this is wrong :)
>
> We actually _want_ pte_cmpxchg on ppc64, because we can do the stuff,
> but it requires some careful manipulation of some bits in the PTE that
> are beyond linux common layer understanding :) Like the BUSY bit which
> is a lock bit for arbitrating with the hash fault handler for example.
>
> Also, if it's ever used to cmpxchg from anything but a !present PTE, it
> will need additional massaging (like the COW case where we just
> "replace" a PTE with set_pte). We also need to preserve some bits in
> there that indicate if the PTE was in the hash table and where in the
> hash so we can flush it afterward.

You can define your own pte_cmpxchg without a problem ...
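
For example, following the __HAVE_ARCH_* override pattern of the generic
header (shape only; the ppc64 BUSY bit arbitration and hash-slot bookkeeping
Ben mentions are exactly the parts only the arch code can fill in):

#define __HAVE_ARCH_PTEP_CMPXCHG
static inline int ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
{
        /* a real ppc64 version would take the BUSY bit into account here */
        return cmpxchg((unsigned long *)ptep, pte_val(oldval),
                        pte_val(newval)) == pte_val(oldval);
}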

2005-03-03 16:55:45

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Wed, 2 Mar 2005, Andrew Morton wrote:

> > This is not relevant since it only deals with file pages.
>
> OK. And CONFIG_DEBUG_PAGEALLOC?

It's a debug feature that can be fixed if it's broken.

> > ptes are only
> > installed atomically for anonymous memory (if CONFIG_ATOMIC_OPS
> > is defined).
>
> It's a shame. A *nice* solution to this problem would address all pte ops
> and wouldn't have such special cases...

Yeah. See my mmu abstraction proposal. This is a solution until we can
address the bigger issues. After that has been worked out the
start/stop become begin/end transaction and the pte_cmpxchg calls are converted
into mmu_add/mmu_change/mmu_commit.

2005-03-03 18:39:40

by David Miller

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Thu, 03 Mar 2005 17:30:28 +1100
Benjamin Herrenschmidt <[email protected]> wrote:

> On Fri, 2005-03-04 at 04:19 +1100, Nick Piggin wrote:
>
> > You don't want to do that for all architectures, as I said earlier.
> > eg. i386 can concurrently set the dirty bit with the MMU (which won't
> > honour the lock).
> >
> > So you then need an atomic lock, atomic pte operations, and atomic
> > unlock where previously you had only the atomic pte operation. This is
> > disastrous for performance.
>
> Of course, but I was answering to David about sparc64 which uses
> software TLB load :)

Right.

The current situation on sparc64 is that the tlb miss handler is
~10 cycles.

Like I said, I can use this thing if it just increases access, without
modifying the TLB miss handler at all.

Hmmm... let me think about this some more.

2005-03-03 22:19:52

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Thu, 3 Mar 2005, Andrew Morton wrote:

> Christoph Lameter <[email protected]> wrote:
> >
> > On Wed, 2 Mar 2005, Andrew Morton wrote:
> >
> > > > This is not relevant since it only deals with file pages.
> > >
> > > OK. And CONFIG_DEBUG_PAGEALLOC?
> >
> > It's a debug feature that can be fixed if it's broken.
>
> It's broken.
>
> A fix would be to restore the get_page() if CONFIG_DEBUG_PAGEALLOC. Not
> particularly glorious..

Another fix would be to have a global variable "dontunmap" and have
the map kernel function not change the pte. But this is also not the
cleanest way.

The problem with atomic operations is the difficulty of keeping state. The
state must essentially all be bound to the atomic value being replaced; otherwise
more extensive locking schemes are needed.


2005-03-04 02:31:44

by Darren Williams

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Overview

Hi Christoph

On Tue, 01 Mar 2005, Christoph Lameter wrote:

> Is there any chance that this patchset could go into mm now? This has been
> discussed since last August....
>
> Changelog:
>
> V17->V18 Rediff against 2.6.11-rc5-bk4

Just applied this patch against 2.6.11, however with the patch applied
and all the additional config options not set, the kernel hangs at
Freeing unused kernel memory: 240kB freed
FYI:

boot  atomic  prezero
OK    on      on
fail  off     on
fail  off     off
OK    on      off

> V16->V17 Do not increment page_count in do_wp_page. Performance data
> posted.
> V15->V16 of this patch: Redesign to allow full backback
> for architectures that do not supporting atomic operations.
>
> An introduction to what this patch does and a patch archive can be found on
> http://oss.sgi.com/projects/page_fault_performance. The archive also has the
> result of various performance tests (LMBench, Microbenchmark and
> kernel compiles).
>
> The basic approach in this patchset is the same as used in SGI's 2.4.X
> based kernels which have been in production use in ProPack 3 for a long time.
>
> The patchset is composed of 4 patches (and was tested against 2.6.11-rc5-bk4):
>
[SNIP]

--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <http://www.gelato.unsw.edu.au>
--------------------------------------------------

2005-03-04 03:04:16

by Darren Williams

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Overview

Hi Darren

On Fri, 04 Mar 2005, Darren Williams wrote:

> Hi Christoph
>
> On Tue, 01 Mar 2005, Christoph Lameter wrote:
>
> > Is there any chance that this patchset could go into mm now? This has been
> > discussed since last August....
> >
> > Changelog:
> >
> > V17->V18 Rediff against 2.6.11-rc5-bk4
>
> Just applied this patch against 2.6.11, however with the patch applied
> and all the aditional config options not set, the kernel hangs at
> Freeing unused kernel memory: 240kB freed
> FYI:
>
> boot atomic prezero
> OK on on
> fail off on
> fail off off
> OK on off

A bit of extra info on the system:
HP rx8620 Itanium(R) 2 16way

>
> [SNIP]
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <http://www.gelato.unsw.edu.au>
--------------------------------------------------

2005-03-03 21:28:58

by Andrew Morton

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Christoph Lameter <[email protected]> wrote:
>
> On Wed, 2 Mar 2005, Andrew Morton wrote:
>
> > > This is not relevant since it only deals with file pages.
> >
> > OK. And CONFIG_DEBUG_PAGEALLOC?
>
> Its a debug feature that can be fixed if its broken.

It's broken.

A fix would be to restore the get_page() if CONFIG_DEBUG_PAGEALLOC. Not
particularly glorious..

2005-03-04 16:15:39

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Overview

Make sure that scrubd_stop in mm/scrubd.c is set to 2, and not zero, on
startup. The current patch on oss.sgi.com may have it set to zero.

On Fri, 4 Mar 2005, Darren Williams wrote:

> Hi Darren
>
> On Fri, 04 Mar 2005, Darren Williams wrote:
>
> > Hi Christoph
> >
> > On Tue, 01 Mar 2005, Christoph Lameter wrote:
> >
> > > Is there any chance that this patchset could go into mm now? This has been
> > > discussed since last August....
> > >
> > > Changelog:
> > >
> > > V17->V18 Rediff against 2.6.11-rc5-bk4
> >
> > Just applied this patch against 2.6.11, however with the patch applied
> > and all the aditional config options not set, the kernel hangs at
> > Freeing unused kernel memory: 240kB freed
> > FYI:
> >
> > boot atomic prezero
> > OK on on
> > fail off on
> > fail off off
> > OK on off
>
> A bit extra info on the system:
> HP rx8620 Itanium(R) 2 16way
>
> > [SNIP]

2005-03-04 16:48:49

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Thu, 3 Mar 2005, Andrew Morton wrote:

> A fix would be to restore the get_page() if CONFIG_DEBUG_PAGEALLOC. Not
> particularly glorious..

Here is the unglorious solution. It also requires that
CONFIG_ATOMIC_TABLE_OPS not be used together with CONFIG_DEBUG_PAGEALLOC
(handled in the following patch):

---------------------------------------------------------------------------

We do a page_cache_get in do_wp_page but we check the pte for changes later.

So why do a page_cache_get at all? Do the copy anyway, possibly copying
garbage, and if the pte has changed simply forget about it. This avoids
having to keep state on the page copied from.

However, this does not work with CONFIG_DEBUG_PAGEALLOC, so the
page_cache_get()/page_cache_release() calls are left in if that debug option
is enabled. This also means that the following patch will not allow
CONFIG_ATOMIC_TABLE_OPS to be set if CONFIG_DEBUG_PAGEALLOC is selected.
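
In outline, the resulting control flow looks roughly like this (a sketch
distilled from do_wp_page as modified below; the helper name is made up and
the rmap/LRU/counter bookkeeping is omitted):

/* Copy without pinning old_page; if the pte changed meanwhile, the
 * possibly-garbage copy is simply discarded. */
static void copy_cow_page_and_validate(struct mm_struct *mm,
                struct vm_area_struct *vma, unsigned long address,
                pmd_t *pmd, struct page *old_page, struct page *new_page,
                pte_t orig_pte)
{
        pte_t *page_table;

        copy_user_highpage(new_page, old_page, address); /* may copy stale data */

        spin_lock(&mm->page_table_lock);
        page_table = pte_offset_map(pmd, address);
        if (pte_same(*page_table, orig_pte))
                break_cow(vma, new_page, address, page_table); /* install copy */
        else
                page_cache_release(new_page);   /* raced: throw the copy away */
        pte_unmap(page_table);
        spin_unlock(&mm->page_table_lock);
}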

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c 2005-03-04 08:25:22.000000000 -0800
+++ linux-2.6.11/mm/memory.c 2005-03-04 08:26:30.000000000 -0800
@@ -1318,8 +1318,14 @@ static int do_wp_page(struct mm_struct *
/*
* Ok, we need to copy. Oh, well..
*/
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ /* For debugging we need to get the page otherwise
+ * the pte for this kernel page may vanish while
+ * we copy the page.
+ */
if (!PageReserved(old_page))
page_cache_get(old_page);
+#endif
spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
@@ -1358,12 +1364,16 @@ static int do_wp_page(struct mm_struct *
}
pte_unmap(page_table);
page_cache_release(new_page);
+#ifdef CONFIG_DEBUG_PAGEALLOC
page_cache_release(old_page);
+#endif
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;

no_new_page:
+#ifdef CONFIG_DEBUG_PAGEALLOC
page_cache_release(old_page);
+#endif
return VM_FAULT_OOM;
}

2005-03-04 16:55:55

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Changelog:
- Do not allow setting of CONFIG_DEBUG_PAGEALLOC together with
CONFIG_ATOMIC_TABLE_OPS
- Keep mark_page_accessed in do_swap_page

The page fault handler attempts to use the page_table_lock only for short
periods of time. It repeatedly drops and reacquires the lock. When the lock
is reacquired, a check is made whether the underlying pte has changed before
the pte value is replaced. These locations are a good fit for the use of
ptep_cmpxchg.

The following patch removes the first acquisition of the page_table_lock
and uses atomic operations on the page table instead. A section
using atomic pte operations is begun with

page_table_atomic_start(struct mm_struct *)

and ends with

page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).

The atomic operations with pte_xchg and pte_cmpxchg only work for the lowest
layer of the page table. Higher layers may also be populated in an atomic
way by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock (populating higher level page
table entries is rare and therefore this is not likely to be performance
critical). For ia64 the definition of higher level atomic operations is
included.
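
As a usage illustration of the above, a sketch that mirrors the
handle_mm_fault changes in the diff below (the wrapper function itself is not
part of the patch):

/*
 * Sketch: populate a missing pmd without holding the page_table_lock
 * across the allocation. Assumes the caller did page_table_atomic_start(mm);
 * on success the atomic section is active again on return, and on allocation
 * failure the caller simply aborts the fault (VM_FAULT_OOM).
 */
static int ensure_pmd_populated(struct mm_struct *mm, pud_t *pud,
                                unsigned long address)
{
        pmd_t *new;

        if (!pud_none(*pud))
                return 0;                       /* already populated */

        page_table_atomic_stop(mm);             /* drop any lock while allocating */
        new = pmd_alloc_one(mm, address);
        if (!new)
                return -ENOMEM;

        page_table_atomic_start(mm);
        /* Either we install our pmd atomically, or somebody else won the
         * race while we were allocating and we free our copy. */
        if (!pud_test_and_populate(mm, pud, new))
                pmd_free(new);
        return 0;
}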

This patch depends on the pte_cmpxchg patch being applied first and only
removes the first use of the page_table_lock in the page fault handler.
This will allow the following page table operations without acquiring
the page_table_lock:

1. Updating of access bits (handle_mm_faults)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired for creating a new pte for an anonymous
write fault and therefore the problems with rss that were addressed by splitting
rss into the task structure do not yet occur.

The patch also adds some diagnostic features: it counts the number of cmpxchg
failures (useful for verifying that this patch works right) and the number of
page faults that resulted in no change to the page table. The statistics may
be viewed via /proc/meminfo.
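
To check the counters from userspace, something as simple as the following
works (an illustrative standalone program, not part of the patch; the strings
match the lines added to meminfo_read_proc below):

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
                return 1;
        /* Print only the lines this patchset adds to /proc/meminfo. */
        while (fgets(line, sizeof(line), f))
                if (strstr(line, "cmpxchg fail") ||
                    strstr(line, "Spurious page faults"))
                        fputs(line, stdout);
        fclose(f);
        return 0;
}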

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c 2005-03-04 08:26:30.000000000 -0800
+++ linux-2.6.11/mm/memory.c 2005-03-04 08:30:25.000000000 -0800
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005 Scalability improvement by reducing the use and the length of time
+ * the page table lock is held (Christoph Lameter)
*/

#include <linux/kernel_stat.h>
@@ -1275,8 +1277,8 @@ static inline void break_cow(struct vm_a
* change only once the write actually happens. This avoids a few races,
* and potentially makes it more efficient.
*
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We hold the mm semaphore and have started atomic pte operations,
+ * exit with pte ops completed.
*/
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
@@ -1294,7 +1296,7 @@ static int do_wp_page(struct mm_struct *
pte_unmap(page_table);
printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
address);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
return VM_FAULT_OOM;
}
old_page = pfn_to_page(pfn);
@@ -1306,10 +1308,17 @@ static int do_wp_page(struct mm_struct *
flush_cache_page(vma, address);
entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
vma);
- ptep_set_access_flags(vma, address, page_table, entry, 1);
- update_mmu_cache(vma, address, entry);
+ /*
+ * If the bits are not updated then another fault
+ * will be generated with another chance of updating.
+ */
+ if (ptep_cmpxchg(page_table, pte, entry))
+ update_mmu_cache(vma, address, entry);
+ else
+ inc_page_state(cmpxchg_fail_flag_reuse);
+
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
return VM_FAULT_MINOR;
}
}
@@ -1321,12 +1330,13 @@ static int do_wp_page(struct mm_struct *
#ifdef CONFIG_DEBUG_PAGEALLOC
/* For debugging we need to get the page otherwise
* the pte for this kernel page may vanish while
- * we copy the page.
+ * we copy the page. Also page_atomic_start/stop
+ * must map to a spinlock using the page_table_lock!
*/
if (!PageReserved(old_page))
page_cache_get(old_page);
#endif
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
@@ -1338,10 +1348,15 @@ static int do_wp_page(struct mm_struct *
new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!new_page)
goto no_new_page;
+ /*
+ * No page_cache_get so we may copy some crap
+ * that is later discarded if the pte has changed
+ */
copy_user_highpage(new_page, old_page, address);
}
/*
- * Re-check the pte - we dropped the lock
+ * Re-check the pte - so far we may not have acquired the
+ * page_table_lock
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1353,7 +1368,6 @@ static int do_wp_page(struct mm_struct *
acct_update_integrals();
update_mem_hiwater();
} else
-
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1697,8 +1711,7 @@ void swapin_readahead(swp_entry_t entry,
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore and have started atomic pte operations
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1710,15 +1723,14 @@ static int do_swap_page(struct mm_struct
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1741,8 +1753,7 @@ static int do_swap_page(struct mm_struct
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1776,79 +1787,93 @@ static int do_swap_page(struct mm_struct
set_pte(page_table, pte);
page_add_anon_rmap(page, vma, address);

+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, pte);
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+
if (write_access) {
+ page_table_atomic_start(mm);
if (do_wp_page(mm, vma, address,
page_table, pmd, pte) == VM_FAULT_OOM)
ret = VM_FAULT_OOM;
- goto out;
}

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, pte);
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
out:
return ret;
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held and atomic pte operations started.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
+ struct page * page;

- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ if (unlikely(!write_access)) {

- /* ..except if it's a write access */
- if (write_access) {
- /* Allocate our own private page. */
+ /* Read-only mapping of ZERO_PAGE. */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+
+ /*
+ * If the cmpxchg fails then another fault may be
+ * generated that may then be successful
+ */
+
+ if (ptep_cmpxchg(page_table, orig_entry, entry))
+ update_mmu_cache(vma, addr, entry);
+ else
+ inc_page_state(cmpxchg_fail_anon_read);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

- if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_zeroed_user_highpage(vma, addr);
- if (!page)
- goto no_mem;
+ return VM_FAULT_MINOR;
+ }

- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
+ page_table_atomic_stop(mm);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- update_mm_counter(mm, rss, 1);
- acct_update_integrals();
- update_mem_hiwater();
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- SetPageReferenced(page);
- page_add_anon_rmap(page, vma, addr);
+ /* Allocate our own private page. */
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ page = alloc_zeroed_user_highpage(vma, addr);
+ if (!page)
+ return VM_FAULT_OOM;
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
+ vma->vm_page_prot)),
+ vma);
+
+ spin_lock(&mm->page_table_lock);
+
+ if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ spin_unlock(&mm->page_table_lock);
+ inc_page_state(cmpxchg_fail_anon_write);
+ return VM_FAULT_MINOR;
}

- set_pte(page_table, entry);
- pte_unmap(page_table);
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ page_add_anon_rmap(page, vma, addr);
+ lru_cache_add_active(page);
+ update_mm_counter(mm, rss, 1);
+ acct_update_integrals();
+ update_mem_hiwater();

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
+ update_mmu_cache(vma, addr, entry);
+ pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
-out:
+
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}

/*
@@ -1860,12 +1885,12 @@ no_mem:
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held and atomic pte operations started.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1876,9 +1901,9 @@ do_no_page(struct mm_struct *mm, struct

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1986,7 +2011,7 @@ oom:
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1999,13 +2024,13 @@ static int do_file_page(struct mm_struct
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

- pgoff = pte_to_pgoff(*pte);
+ pgoff = pte_to_pgoff(entry);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -2024,49 +2049,45 @@ static int do_file_page(struct mm_struct
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that we handle that case properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ new_entry = pte_mkyoung(entry);
if (write_access) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+
+ /*
+ * If the cmpxchg fails then we will get another fault which
+ * has another chance of successfully updating the page table entry.
+ */
+ if (ptep_cmpxchg(pte, entry, new_entry)) {
+ flush_tlb_page(vma, address);
+ update_mmu_cache(vma, address, entry);
+ } else
+ inc_page_state(cmpxchg_fail_flag_update);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
+ if (pte_val(new_entry) == pte_val(entry))
+ inc_page_state(spurious_page_faults);
return VM_FAULT_MINOR;
}

@@ -2085,33 +2106,73 @@ int handle_mm_fault(struct mm_struct *mm

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We try to rely on the mmap_sem and the SMP-safe atomic PTE updates.
+ * to synchronize with kswapd. However, the arch may fall back
+ * in page_table_atomic_start to the page table lock.
+ *
+ * We may be able to avoid taking and releasing the page_table_lock
+ * for the p??_alloc functions through atomic operations so we
+ * duplicate the functionality of pmd_alloc, pud_alloc and
+ * pte_alloc_map here.
*/
+ page_table_atomic_start(mm);
pgd = pgd_offset(mm, address);
- spin_lock(&mm->page_table_lock);
+ if (unlikely(pgd_none(*pgd))) {
+ pud_t *new;
+
+ page_table_atomic_stop(mm);
+ new = pud_alloc_one(mm, address);

- pud = pud_alloc(mm, pgd, address);
- if (!pud)
- goto oom;
-
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- goto oom;
-
- pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- goto oom;
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pud_free(new);
+ }
+
+ pud = pud_offset(pgd, address);
+ if (unlikely(pud_none(*pud))) {
+ pmd_t *new;
+
+ page_table_atomic_stop(mm);
+ new = pmd_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;

- return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+ page_table_atomic_start(mm);
+
+ if (!pud_test_and_populate(mm, pud, new))
+ pmd_free(new);
+ }

- oom:
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ pmd = pmd_offset(pud, address);
+ if (unlikely(!pmd_present(*pmd))) {
+ struct page *new;
+
+ page_table_atomic_stop(mm);
+ new = pte_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else {
+ inc_page_state(nr_page_table_pages);
+ mm->nr_ptes++;
+ }
+ }
+
+ pte = pte_offset_map(pmd, address);
+ return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}

#ifndef __ARCH_HAS_4LEVEL_HACK
Index: linux-2.6.11/include/asm-generic/pgtable-nopud.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable-nopud.h 2005-03-01 23:38:12.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable-nopud.h 2005-03-04 08:26:38.000000000 -0800
@@ -25,8 +25,14 @@ static inline int pgd_bad(pgd_t pgd) {
static inline int pgd_present(pgd_t pgd) { return 1; }
static inline void pgd_clear(pgd_t *pgd) { }
#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
-
#define pgd_populate(mm, pgd, pud) do { } while (0)
+
+#define __HAVE_ARCH_PGD_TEST_AND_POPULATE
+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+{
+ return 1;
+}
+
/*
* (puds are folded into pgds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
Index: linux-2.6.11/include/asm-generic/pgtable-nopmd.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable-nopmd.h 2005-03-01 23:37:49.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable-nopmd.h 2005-03-04 08:26:38.000000000 -0800
@@ -29,6 +29,11 @@ static inline void pud_clear(pud_t *pud)
#define pmd_ERROR(pmd) (pud_ERROR((pmd).pud))

#define pud_populate(mm, pmd, pte) do { } while (0)
+#define __ARCH_HAVE_PUD_TEST_AND_POPULATE
+static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+{
+ return 1;
+}

/*
* (pmds are folded into puds so this doesn't get actually called,
Index: linux-2.6.11/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable.h 2005-03-03 10:20:56.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable.h 2005-03-04 08:26:38.000000000 -0800
@@ -105,8 +105,14 @@ static inline pte_t ptep_get_and_clear(p
#ifdef CONFIG_ATOMIC_TABLE_OPS

/*
- * The architecture does support atomic table operations.
- * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using
+ * The architecture does support atomic table operations and
+ * all operations on page table entries must always be atomic.
+ *
+ * This means that the kernel will never encounter a partially updated
+ * page table entry.
+ *
+ * Since the architecture does support atomic table operations, we
+ * may provide generic atomic ptep_xchg and ptep_cmpxchg using
* cmpxchg and xchg.
*/
#ifndef __HAVE_ARCH_PTEP_XCHG
@@ -132,6 +138,65 @@ static inline pte_t ptep_get_and_clear(p
})
#endif

+/*
+ * page_table_atomic_start and page_table_atomic_stop may be used to
+ * define special measures that an arch needs to guarantee atomic
+ * operations outside of a spinlock. In the case that an arch does
+ * not support atomic page table operations we will fall back to the
+ * page table lock.
+ */
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_start(mm) do { } while (0)
+#endif
+
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_stop(mm) do { } while (0)
+#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These simply acquire the page_table_lock for
+ * synchronization. An architecture may override these generic
+ * functions to provide atomic populate functions to make these
+ * more effective.
+ */
+
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ spin_lock(&mm->page_table_lock); \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ spin_unlock(&mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&mm->page_table_lock); \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ spin_unlock(&mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&mm->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
#else

/*
@@ -142,6 +207,11 @@ static inline pte_t ptep_get_and_clear(p
* short time frame. This means that the page_table_lock must be held
* to avoid a page fault that would install a new entry.
*/
+
+/* Fall back to the page table lock to synchronize page table access */
+#define page_table_atomic_start(mm) spin_lock(&(mm)->page_table_lock)
+#define page_table_atomic_stop(mm) spin_unlock(&(mm)->page_table_lock)
+
#ifndef __HAVE_ARCH_PTEP_XCHG
#define ptep_xchg(__ptep, __pteval) \
({ \
@@ -186,6 +256,41 @@ static inline pte_t ptep_get_and_clear(p
r; \
})
#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These rely on the page_table_lock being held.
+ */
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ __rc; \
+})
+#endif
+
#endif

#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
Index: linux-2.6.11/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-ia64/pgtable.h 2005-03-01 23:37:53.000000000 -0800
+++ linux-2.6.11/include/asm-ia64/pgtable.h 2005-03-04 08:26:38.000000000 -0800
@@ -554,6 +554,8 @@ do { \
#define FIXADDR_USER_START GATE_ADDR
#define FIXADDR_USER_END (GATE_ADDR + 2*PERCPU_PAGE_SIZE)

+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define __HAVE_ARCH_PMD_TEST_AND_POPULATE
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
@@ -561,7 +563,7 @@ do { \
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
-#include <asm-generic/pgtable.h>
#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.11/include/linux/page-flags.h
===================================================================
--- linux-2.6.11.orig/include/linux/page-flags.h 2005-03-01 23:38:13.000000000 -0800
+++ linux-2.6.11/include/linux/page-flags.h 2005-03-04 08:26:38.000000000 -0800
@@ -131,6 +131,17 @@ struct page_state {
unsigned long allocstall; /* direct reclaim calls */

unsigned long pgrotated; /* pages rotated to tail of the LRU */
+
+ /* Low level counters */
+ unsigned long spurious_page_faults; /* Faults with no ops */
+ unsigned long cmpxchg_fail_flag_update; /* cmpxchg failures for pte flag update */
+ unsigned long cmpxchg_fail_flag_reuse; /* cmpxchg failures when cow reuse of pte */
+ unsigned long cmpxchg_fail_anon_read; /* cmpxchg failures on anonymous read */
+ unsigned long cmpxchg_fail_anon_write; /* cmpxchg failures on anonymous write */
+
+ /* rss deltas for the current executing thread */
+ long rss;
+ long anon_rss;
};

extern void get_page_state(struct page_state *ret);
Index: linux-2.6.11/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.11.orig/fs/proc/proc_misc.c 2005-03-01 23:37:49.000000000 -0800
+++ linux-2.6.11/fs/proc/proc_misc.c 2005-03-04 08:26:38.000000000 -0800
@@ -127,7 +127,7 @@ static int meminfo_read_proc(char *page,
unsigned long allowed;
struct vmalloc_info vmi;

- get_page_state(&ps);
+ get_full_page_state(&ps);
get_zone_counts(&active, &inactive, &free);

/*
@@ -168,7 +168,12 @@ static int meminfo_read_proc(char *page,
"PageTables: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "VmallocChunk: %8lu kB\n"
+ "Spurious page faults : %8lu\n"
+ "cmpxchg fail flag update: %8lu\n"
+ "cmpxchg fail COW reuse : %8lu\n"
+ "cmpxchg fail anon read : %8lu\n"
+ "cmpxchg fail anon write : %8lu\n",
K(i.totalram),
K(i.freeram),
K(i.bufferram),
@@ -191,7 +196,12 @@ static int meminfo_read_proc(char *page,
K(ps.nr_page_table_pages),
VMALLOC_TOTAL >> 10,
vmi.used >> 10,
- vmi.largest_chunk >> 10
+ vmi.largest_chunk >> 10,
+ ps.spurious_page_faults,
+ ps.cmpxchg_fail_flag_update,
+ ps.cmpxchg_fail_flag_reuse,
+ ps.cmpxchg_fail_anon_read,
+ ps.cmpxchg_fail_anon_write
);

len += hugetlb_report_meminfo(page + len);
Index: linux-2.6.11/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.11.orig/include/asm-ia64/pgalloc.h 2005-03-01 23:37:31.000000000 -0800
+++ linux-2.6.11/include/asm-ia64/pgalloc.h 2005-03-04 08:26:38.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE 0
+#define PUD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -82,6 +86,13 @@ pud_populate (struct mm_struct *mm, pud_
pud_val(*pud_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE;
+}
+
static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
{
@@ -127,6 +138,14 @@ pmd_populate (struct mm_struct *mm, pmd_
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{
Index: linux-2.6.11/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/x86_64/Kconfig 2005-03-03 10:20:56.000000000 -0800
+++ linux-2.6.11/arch/x86_64/Kconfig 2005-03-04 08:32:17.000000000 -0800
@@ -242,7 +242,7 @@ config PREEMPT

config ATOMIC_TABLE_OPS
bool "Atomic Page Table Operations (EXPERIMENTAL)"
- depends on SMP && EXPERIMENTAL
+ depends on SMP && EXPERIMENTAL && !DEBUG_PAGEALLOC
help
Atomic Page table operations allow page faults
without the use (or with reduce use of) spinlocks
Index: linux-2.6.11/arch/i386/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/i386/Kconfig 2005-03-03 10:20:56.000000000 -0800
+++ linux-2.6.11/arch/i386/Kconfig 2005-03-04 08:31:52.000000000 -0800
@@ -870,7 +870,7 @@ config HAVE_DEC_LOCK

config ATOMIC_TABLE_OPS
bool "Atomic Page Table Operations (EXPERIMENTAL)"
- depends on SMP && X86_CMPXCHG && EXPERIMENTAL && !X86_PAE
+ depends on SMP && X86_CMPXCHG && EXPERIMENTAL && !X86_PAE && !DEBUG_PAGEALLOC
help
Atomic Page table operations allow page faults
without the use (or with reduce use of) spinlocks
Index: linux-2.6.11/arch/ia64/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/ia64/Kconfig 2005-03-03 10:20:56.000000000 -0800
+++ linux-2.6.11/arch/ia64/Kconfig 2005-03-04 08:31:12.000000000 -0800
@@ -274,7 +274,7 @@ config PREEMPT

config ATOMIC_TABLE_OPS
bool "Atomic Page Table Operations (EXPERIMENTAL)"
- depends on SMP && EXPERIMENTAL
+ depends on SMP && EXPERIMENTAL && !DEBUG_PAGEALLOC
help
Atomic Page table operations allow page faults
without the use (or with reduce use of) spinlocks

2005-03-04 17:12:43

by Hugh Dickins

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

Another! Sorry, you're way ahead of me...

On Fri, 4 Mar 2005, Christoph Lameter wrote:
> On Thu, 3 Mar 2005, Andrew Morton wrote:
>
> > A fix would be to restore the get_page() if CONFIG_DEBUG_PAGEALLOC. Not
> > particularly glorious..
>
> Here is the unglorious solution. It also requires that
> CONFIG_ATOMIC_TABLE_OPS not be used together with CONFIG_DEBUG_PAGEALLOC
> (handled in the following patch):

Nacked for the same reason as just given to the earlier version. Ugly too.

Hugh

2005-03-04 18:34:40

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Fri, 4 Mar 2005, Hugh Dickins wrote:

> > Here is the unglorious solution. It also requires that
> > CONFIG_ATOMIC_TABLE_OPS not be used together with CONFIG_DEBUG_PAGEALLOC
> > (handled in the following patch):
>
> Nacked for the same reason as just given to earlier version. Ugly too.

Ok. Then we could still go back to the (also ugly) solution from the earlier
patchsets that acquires the spinlock separately before getting to
do_wp_page (there is also no need for the separate patch anymore). The patch
is then shorter too.
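
Roughly, the idea is the following (a sketch of the approach only; the real
change is in the handle_pte_fault hunk below and the helper name here is made
up):

/* With CONFIG_ATOMIC_TABLE_OPS the fault path runs without the
 * page_table_lock, but do_wp_page still expects to be entered with the
 * lock held, so take it and revalidate the pte first. */
static int wp_fault_take_lock(struct mm_struct *mm, struct vm_area_struct *vma,
                unsigned long address, pte_t *pte, pmd_t *pmd, pte_t entry)
{
        spin_lock(&mm->page_table_lock);
        if (pte_same(entry, *pte))
                return do_wp_page(mm, vma, address, pte, pmd, entry);

        /* The pte changed under us: back out and let the fault be retried. */
        pte_unmap(pte);
        spin_unlock(&mm->page_table_lock);
        return VM_FAULT_MINOR;
}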

Changelog:
- Acquire spinlock before invoking do_wp_page if CONFIG_ATOMIC_TABLE_OPS
- Leave do_wp_page unchanged
- Keep mark_page_accessed in do_swap_page

The page fault handler attempts to use the page_table_lock only for short
periods of time. It repeatedly drops and reacquires the lock. When the lock
is reacquired, a check is made whether the underlying pte has changed before
the pte value is replaced. These locations are a good fit for the use of
ptep_cmpxchg.

The following patch removes the first acquisition of the page_table_lock
and uses atomic operations on the page table instead. A section
using atomic pte operations is begun with

page_table_atomic_start(struct mm_struct *)

and ends with

page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).

The atomic operations with pte_xchg and pte_cmpxchg only work for the lowest
layer of the page table. Higher layers may also be populated in an atomic
way by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock (populating higher level page
table entries is rare and therefore this is not likely to be performance
critical). For ia64 the definition of higher level atomic operations is
included.

This patch depends on the pte_cmpxchg patch being applied first and only
removes the first use of the page_table_lock in the page fault handler.
This will allow the following page table operations without acquiring
the page_table_lock:

1. Updating of access bits (handle_mm_faults)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired for creating a new pte for an anonymous
write fault and therefore the problems with rss that were addressed by splitting
rss into the task structure do not yet occur.

The patch also adds some diagnostic features: it counts the number of cmpxchg
failures (useful for verifying that this patch works right) and the number of
page faults that resulted in no change to the page table. The statistics may
be viewed via /proc/meminfo.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c 2005-03-04 08:25:22.000000000 -0800
+++ linux-2.6.11/mm/memory.c 2005-03-04 10:15:38.000000000 -0800
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005 Scalability improvement by reducing the use and the length of time
+ * the page table lock is held (Christoph Lameter)
*/

#include <linux/kernel_stat.h>
@@ -1687,8 +1689,7 @@ void swapin_readahead(swp_entry_t entry,
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore and have started atomic pte operations
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1700,15 +1701,14 @@ static int do_swap_page(struct mm_struct
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1731,8 +1731,7 @@ static int do_swap_page(struct mm_struct
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1782,63 +1781,76 @@ out:
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held and atomic pte operations started.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
+ struct page * page;

- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ if (unlikely(!write_access)) {

- /* ..except if it's a write access */
- if (write_access) {
- /* Allocate our own private page. */
+ /* Read-only mapping of ZERO_PAGE. */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+
+ /*
+ * If the cmpxchg fails then another fault may be
+ * generated that may then be successful
+ */
+
+ if (ptep_cmpxchg(page_table, orig_entry, entry))
+ update_mmu_cache(vma, addr, entry);
+ else
+ inc_page_state(cmpxchg_fail_anon_read);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

- if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_zeroed_user_highpage(vma, addr);
- if (!page)
- goto no_mem;
+ return VM_FAULT_MINOR;
+ }

- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
+ page_table_atomic_stop(mm);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- update_mm_counter(mm, rss, 1);
- acct_update_integrals();
- update_mem_hiwater();
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- SetPageReferenced(page);
- page_add_anon_rmap(page, vma, addr);
+ /* Allocate our own private page. */
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ page = alloc_zeroed_user_highpage(vma, addr);
+ if (!page)
+ return VM_FAULT_OOM;
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
+ vma->vm_page_prot)),
+ vma);
+
+ spin_lock(&mm->page_table_lock);
+
+ if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ spin_unlock(&mm->page_table_lock);
+ inc_page_state(cmpxchg_fail_anon_write);
+ return VM_FAULT_MINOR;
}

- set_pte(page_table, entry);
- pte_unmap(page_table);
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ page_add_anon_rmap(page, vma, addr);
+ lru_cache_add_active(page);
+ update_mm_counter(mm, rss, 1);
+ acct_update_integrals();
+ update_mem_hiwater();

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
+ update_mmu_cache(vma, addr, entry);
+ pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
-out:
+
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}

/*
@@ -1850,12 +1862,12 @@ no_mem:
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held and atomic pte operations started.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1866,9 +1878,9 @@ do_no_page(struct mm_struct *mm, struct

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1976,7 +1988,7 @@ oom:
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1989,13 +2001,13 @@ static int do_file_page(struct mm_struct
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

- pgoff = pte_to_pgoff(*pte);
+ pgoff = pte_to_pgoff(entry);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -2014,49 +2026,56 @@ static int do_file_page(struct mm_struct
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that we handle that case properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ new_entry = pte_mkyoung(entry);
if (write_access) {
- if (!pte_write(entry))
- return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ if (!pte_write(entry)) {
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+ /* do_wp_page really needs the page table lock badly */
+ spin_lock(&mm->page_table_lock);
+ if (pte_same(entry, *pte))
+#endif
+ return do_wp_page(mm, vma, address, pte, pmd, entry);
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+ pte_unmap(pte);
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_MINOR;
+#endif
+ }
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+
+ /*
+ * If the cmpxchg fails then we will get another fault which
+ * has another chance of successfully updating the page table entry.
+ */
+ if (ptep_cmpxchg(pte, entry, new_entry)) {
+ flush_tlb_page(vma, address);
+ update_mmu_cache(vma, address, entry);
+ } else
+ inc_page_state(cmpxchg_fail_flag_update);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
+ if (pte_val(new_entry) == pte_val(entry))
+ inc_page_state(spurious_page_faults);
return VM_FAULT_MINOR;
}

@@ -2075,33 +2094,73 @@ int handle_mm_fault(struct mm_struct *mm

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We try to rely on the mmap_sem and the SMP-safe atomic PTE updates.
+ * to synchronize with kswapd. However, the arch may fall back
+ * in page_table_atomic_start to the page table lock.
+ *
+ * We may be able to avoid taking and releasing the page_table_lock
+ * for the p??_alloc functions through atomic operations so we
+ * duplicate the functionality of pmd_alloc, pud_alloc and
+ * pte_alloc_map here.
*/
+ page_table_atomic_start(mm);
pgd = pgd_offset(mm, address);
- spin_lock(&mm->page_table_lock);
+ if (unlikely(pgd_none(*pgd))) {
+ pud_t *new;
+
+ page_table_atomic_stop(mm);
+ new = pud_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pud_free(new);
+ }
+
+ pud = pud_offset(pgd, address);
+ if (unlikely(pud_none(*pud))) {
+ pmd_t *new;

- pud = pud_alloc(mm, pgd, address);
- if (!pud)
- goto oom;
-
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- goto oom;
-
- pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- goto oom;
+ page_table_atomic_stop(mm);
+ new = pmd_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;

- return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+ page_table_atomic_start(mm);
+
+ if (!pud_test_and_populate(mm, pud, new))
+ pmd_free(new);
+ }

- oom:
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ pmd = pmd_offset(pud, address);
+ if (unlikely(!pmd_present(*pmd))) {
+ struct page *new;
+
+ page_table_atomic_stop(mm);
+ new = pte_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else {
+ inc_page_state(nr_page_table_pages);
+ mm->nr_ptes++;
+ }
+ }
+
+ pte = pte_offset_map(pmd, address);
+ return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}

#ifndef __ARCH_HAS_4LEVEL_HACK
Index: linux-2.6.11/include/asm-generic/pgtable-nopud.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable-nopud.h 2005-03-01 23:38:12.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable-nopud.h 2005-03-04 10:03:36.000000000 -0800
@@ -25,8 +25,14 @@ static inline int pgd_bad(pgd_t pgd) {
static inline int pgd_present(pgd_t pgd) { return 1; }
static inline void pgd_clear(pgd_t *pgd) { }
#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
-
#define pgd_populate(mm, pgd, pud) do { } while (0)
+
+#define __HAVE_ARCH_PGD_TEST_AND_POPULATE
+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+{
+ return 1;
+}
+
/*
* (puds are folded into pgds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
Index: linux-2.6.11/include/asm-generic/pgtable-nopmd.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable-nopmd.h 2005-03-01 23:37:49.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable-nopmd.h 2005-03-04 10:03:36.000000000 -0800
@@ -29,6 +29,11 @@ static inline void pud_clear(pud_t *pud)
#define pmd_ERROR(pmd) (pud_ERROR((pmd).pud))

#define pud_populate(mm, pmd, pte) do { } while (0)
+#define __ARCH_HAVE_PUD_TEST_AND_POPULATE
+static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+{
+ return 1;
+}

/*
* (pmds are folded into puds so this doesn't get actually called,
Index: linux-2.6.11/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable.h 2005-03-03 10:20:56.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable.h 2005-03-04 10:03:36.000000000 -0800
@@ -105,8 +105,14 @@ static inline pte_t ptep_get_and_clear(p
#ifdef CONFIG_ATOMIC_TABLE_OPS

/*
- * The architecture does support atomic table operations.
- * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using
+ * The architecture does support atomic table operations and
+ * all operations on page table entries must always be atomic.
+ *
+ * This means that the kernel will never encounter a partially updated
+ * page table entry.
+ *
+ * Since the architecture does support atomic table operations, we
+ * may provide generic atomic ptep_xchg and ptep_cmpxchg using
* cmpxchg and xchg.
*/
#ifndef __HAVE_ARCH_PTEP_XCHG
@@ -132,6 +138,65 @@ static inline pte_t ptep_get_and_clear(p
})
#endif

+/*
+ * page_table_atomic_start and page_table_atomic_stop may be used to
+ * define special measures that an arch needs to guarantee atomic
+ * operations outside of a spinlock. In the case that an arch does
+ * not support atomic page table operations we will fall back to the
+ * page table lock.
+ */
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_start(mm) do { } while (0)
+#endif
+
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_stop(mm) do { } while (0)
+#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These simply acquire the page_table_lock for
+ * synchronization. An architecture may override these generic
+ * functions to provide atomic populate functions to make these
+ * more effective.
+ */
+
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ spin_lock(&mm->page_table_lock); \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ spin_unlock(&mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&mm->page_table_lock); \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ spin_unlock(&mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&mm->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
#else

/*
@@ -142,6 +207,11 @@ static inline pte_t ptep_get_and_clear(p
* short time frame. This means that the page_table_lock must be held
* to avoid a page fault that would install a new entry.
*/
+
+/* Fall back to the page table lock to synchronize page table access */
+#define page_table_atomic_start(mm) spin_lock(&(mm)->page_table_lock)
+#define page_table_atomic_stop(mm) spin_unlock(&(mm)->page_table_lock)
+
#ifndef __HAVE_ARCH_PTEP_XCHG
#define ptep_xchg(__ptep, __pteval) \
({ \
@@ -186,6 +256,41 @@ static inline pte_t ptep_get_and_clear(p
r; \
})
#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These rely on the page_table_lock being held.
+ */
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ __rc; \
+})
+#endif
+
#endif

#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
Index: linux-2.6.11/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-ia64/pgtable.h 2005-03-01 23:37:53.000000000 -0800
+++ linux-2.6.11/include/asm-ia64/pgtable.h 2005-03-04 10:03:36.000000000 -0800
@@ -554,6 +554,8 @@ do { \
#define FIXADDR_USER_START GATE_ADDR
#define FIXADDR_USER_END (GATE_ADDR + 2*PERCPU_PAGE_SIZE)

+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define __HAVE_ARCH_PMD_TEST_AND_POPULATE
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
@@ -561,7 +563,7 @@ do { \
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
-#include <asm-generic/pgtable.h>
#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.11/include/linux/page-flags.h
===================================================================
--- linux-2.6.11.orig/include/linux/page-flags.h 2005-03-01 23:38:13.000000000 -0800
+++ linux-2.6.11/include/linux/page-flags.h 2005-03-04 10:03:36.000000000 -0800
@@ -131,6 +131,17 @@ struct page_state {
unsigned long allocstall; /* direct reclaim calls */

unsigned long pgrotated; /* pages rotated to tail of the LRU */
+
+ /* Low level counters */
+ unsigned long spurious_page_faults; /* Faults with no ops */
+ unsigned long cmpxchg_fail_flag_update; /* cmpxchg failures for pte flag update */
+ unsigned long cmpxchg_fail_flag_reuse; /* cmpxchg failures when cow reuse of pte */
+ unsigned long cmpxchg_fail_anon_read; /* cmpxchg failures on anonymous read */
+ unsigned long cmpxchg_fail_anon_write; /* cmpxchg failures on anonymous write */
+
+ /* rss deltas for the current executing thread */
+ long rss;
+ long anon_rss;
};

extern void get_page_state(struct page_state *ret);
Index: linux-2.6.11/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.11.orig/fs/proc/proc_misc.c 2005-03-01 23:37:49.000000000 -0800
+++ linux-2.6.11/fs/proc/proc_misc.c 2005-03-04 10:03:36.000000000 -0800
@@ -127,7 +127,7 @@ static int meminfo_read_proc(char *page,
unsigned long allowed;
struct vmalloc_info vmi;

- get_page_state(&ps);
+ get_full_page_state(&ps);
get_zone_counts(&active, &inactive, &free);

/*
@@ -168,7 +168,12 @@ static int meminfo_read_proc(char *page,
"PageTables: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "VmallocChunk: %8lu kB\n"
+ "Spurious page faults : %8lu\n"
+ "cmpxchg fail flag update: %8lu\n"
+ "cmpxchg fail COW reuse : %8lu\n"
+ "cmpxchg fail anon read : %8lu\n"
+ "cmpxchg fail anon write : %8lu\n",
K(i.totalram),
K(i.freeram),
K(i.bufferram),
@@ -191,7 +196,12 @@ static int meminfo_read_proc(char *page,
K(ps.nr_page_table_pages),
VMALLOC_TOTAL >> 10,
vmi.used >> 10,
- vmi.largest_chunk >> 10
+ vmi.largest_chunk >> 10,
+ ps.spurious_page_faults,
+ ps.cmpxchg_fail_flag_update,
+ ps.cmpxchg_fail_flag_reuse,
+ ps.cmpxchg_fail_anon_read,
+ ps.cmpxchg_fail_anon_write
);

len += hugetlb_report_meminfo(page + len);
Index: linux-2.6.11/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.11.orig/include/asm-ia64/pgalloc.h 2005-03-01 23:37:31.000000000 -0800
+++ linux-2.6.11/include/asm-ia64/pgalloc.h 2005-03-04 10:03:36.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PUD */
+#define PMD_NONE 0
+#define PUD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -82,6 +86,13 @@ pud_populate (struct mm_struct *mm, pud_
pud_val(*pud_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE;
+}
+
static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
{
@@ -127,6 +138,14 @@ pmd_populate (struct mm_struct *mm, pmd_
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{

2005-03-04 19:21:54

by Hugh Dickins

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Drop first acquisition of ptl

On Fri, 4 Mar 2005, Christoph Lameter wrote:
> On Fri, 4 Mar 2005, Hugh Dickins wrote:
> > Nacked for the same reason as just given to earlier version. Ugly too.
>
> Ok. Then we could still go back to the also-ugly solution in the earlier
> patchsets that acquired the spinlock separately before getting to
> do_wp_page (there would also be no need for the separate patch anymore).
> The patch would then be shorter too.

Maybe. I should make it clear that I simply haven't examined the
recent incarnations of your patch; I was just commenting on an issue
I could comment on quickly without needing to find time to think.

So, I just want to make clear that this absence of a Nack doesn't mean
Ack: I remain uneasy with it all, waiting to see some architecture
maintainers come along with a clear "Yes, this is how it should be".

Hugh

2005-03-06 21:50:18

by Darren Williams

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Overview

Hi Christoph

On Fri, 04 Mar 2005, Christoph Lameter wrote:

> Make sure that scrubd_stop on startup is set to 2 and not zero in
> mm/scrubd.c. The current patch on oss.sgi.com may have that set to zero.
>
unsigned int sysctl_scrub_stop = 2; /* Minimum order of page to zero */

This is the assignment in the kernel where page zeroing fails.

Darren

> On Fri, 4 Mar 2005, Darren Williams wrote:
>
> > Hi Darren
> >
> > On Fri, 04 Mar 2005, Darren Williams wrote:
> >
> > > Hi Christoph
> > >
> > > On Tue, 01 Mar 2005, Christoph Lameter wrote:
> > >
> > > > Is there any chance that this patchset could go into mm now? This has been
> > > > discussed since last August....
> > > >
> > > > Changelog:
> > > >
> > > > V17->V18 Rediff against 2.6.11-rc5-bk4
> > >
> > > Just applied this patch against 2.6.11; however, with the patch applied
> > > and all the additional config options not set, the kernel hangs at
> > > Freeing unused kernel memory: 240kB freed
> > > FYI:
> > >
> > > boot atomic prezero
> > > OK on on
> > > fail off on
> > > fail off off
> > > OK on off
> >
> > A bit extra info on the system:
> > HP rx8620 Itanium(R) 2 16way
> >
> > >
> > > [SNIP]
> > >
--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <http://www.gelato.unsw.edu.au>
--------------------------------------------------

2005-03-07 00:18:56

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Overview

On Mon, 7 Mar 2005, Darren Williams wrote:

> Hi Christoph
>
> On Fri, 04 Mar 2005, Christoph Lameter wrote:
>
> > Make sure that scrubd_stop on startup is set to 2 and not zero in
> > mm/scrubd.c. The current patch on oss.sgi.com may have that set to zero.
> >
> unsigned int sysctl_scrub_stop = 2; /* Minimum order of page to zero */
>
> This is the assignment in the kernel where page zeroing fails.

Could you just test this with the prezeroing patches alone?
Include a dmesg from a successful boot and then tell me what
you changed and where the boot failed. Which version of the patches did
you apply?

2005-03-07 03:33:37

by Darren Williams

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Overview

Hi Christoph

On Sun, 06 Mar 2005, Christoph Lameter wrote:

> On Mon, 7 Mar 2005, Darren Williams wrote:
>
> > Hi Christoph
> >
> > On Fri, 04 Mar 2005, Christoph Lameter wrote:
> >
> > > Make sure that scrubd_stop on startup is set to 2 and not zero in
> > > mm/scrubd.c. The current patch on oss.sgi.com may have that set to zero.
> > >
> > unsigned int sysctl_scrub_stop = 2; /* Minimum order of page to zero */
> >
> > This is the assignment in the kernel where page zeroing fails.
>
> Could you just test this with the prezeroing patches alone?
> Include a dmesg from a successful boot and then tell me what
> you changed and where the boot failed. Which version of the patches did
> you apply?
>

PATCHES:
ftp://oss.sgi.com/projects/page_fault_performance/download/prezero/patch/patchset-2.6.11/

There is no scrubd_stop; /proc/sys/vm/scrub_stop reads:
cat /proc/sys/vm/scrub_stop
2

/etc/sysctl.conf:
vm.scrub_stop=2

CONFIG_SCRUBD    =N     =Y
Boots            OK     FAILED

Oops of failed prezero boot:
/dev/sdb4 has been mounted 31 times without being checked, check forced.
Unable to handle kernel NULL pointer dereference (address 0000000000000000)
kscrubd0[362]: Oops 11012296146944 [1]

Pid: 362, CPU 0, comm: kscrubd0
psr : 0000121008022038 ifs : 8000000000000308 ip : [<a0000001000cf821>] Not tainted
ip is at scrubd_rmpage+0x61/0x140
unat: 0000000000000000 pfs : 0000000000000308 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000009641
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001000cf7f0 b6 : a000000100002d70 b7 : a000000100009ff0
f6 : 1003e6db6db6db6db6db7 f7 : 0fff4847277b800000000
f8 : 1003e000000001c07c8b4 f9 : 1003e00000000c4367cec
f10 : 10017a77079199649f035 f11 : 1003e00000000014ee0f2
r1 : a000000100a53d80 r2 : 0000000000100100 r3 : e0000721fe04fdd0
r8 : 0000001008026038 r9 : e000070020041070 r10 : 0000000000000008
r11 : 0000000000200200 r12 : e0000721fe04fdc0 r13 : e0000721fe048000
r14 : 0000000000000000 r15 : fffffffffffffffd r16 : ffffffffffffffff
r17 : a0007fffabf48000 r18 : ffffffffffffc000 r19 : 0000000000000000
r20 : ffffffffffffefff r21 : 0000000000000001 r22 : 0000000000000002
r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000000
r26 : 0000000000000000 r27 : 0000001008026038 r28 : e0000721fe04fdd0
r29 : a0007fffabf4a7e0 r30 : 0000000000000000 r31 : e000070020041080

Call Trace:
[<a00000010000f3a0>] show_stack+0x80/0xa0
sp=e0000721fe04f980 bsp=e0000721fe049010
[<a00000010000fc00>] show_regs+0x7e0/0x800
sp=e0000721fe04fb50 bsp=e0000721fe048fa8
[<a0000001000335f0>] die+0x150/0x1c0
sp=e0000721fe04fb60 bsp=e0000721fe048f68
[<a000000100052430>] ia64_do_page_fault+0x370/0x980
sp=e0000721fe04fb60 bsp=e0000721fe048f00
[<a00000010000a780>] ia64_leave_kernel+0x0/0x260
sp=e0000721fe04fbf0 bsp=e0000721fe048f00
[<a0000001000cf820>] scrubd_rmpage+0x60/0x140
sp=e0000721fe04fdc0 bsp=e0000721fe048ec0
[<a00000010010e080>] zero_highest_order_page+0x120/0x2c0
sp=e0000721fe04fdc0 bsp=e0000721fe048e68
[<a00000010010e330>] scrub_pgdat+0x110/0x1c0
sp=e0000721fe04fdd0 bsp=e0000721fe048e30
[<a00000010010e600>] kscrubd+0x220/0x240
sp=e0000721fe04fdd0 bsp=e0000721fe048e00
[<a000000100011410>] kernel_thread_helper+0xd0/0x100
sp=e0000721fe04fe30 bsp=e0000721fe048dd0
[<a0000001000090e0>] start_kernel_thread+0x20/0x40
sp=e0000721fe04fe30 bsp=e0000721fe048dd0
<1>Unable to handle kernel NULL pointer dereference (address 0000000000000000)
kscrubd1[363]: Oops 11012296146944 [2]

Pid: 363, CPU 6, comm: kscrubd1
psr : 0000121008022018 ifs : 8000000000000308 ip : [<a0000001000cf821>] Not tainted
ip is at scrubd_rmpage+0x61/0x140
unat: 0000000000000000 pfs : 0000000000000308 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000009641
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001000cf7f0 b6 : a000000100002d70 b7 : a000000100009ff0
f6 : 1003e6db6db6db6db6db7 f7 : 0fff68a47601000000000
f8 : 1003e000000001c87f85a f9 : 1003e00000000c7b7ca76
f10 : 100188a47600ff75b89ff f11 : 1003e0000000002291d80
r1 : a000000100a53d80 r2 : 0000000000100100 r3 : e0000741fe427dd0
r8 : 0000001008026018 r9 : e0000720000510f0 r10 : 0000000000000008
r11 : 0000000000200200 r12 : e0000741fe427dc0 r13 : e0000741fe420000
r14 : 0000000000000000 r15 : fffffffffffffffd r16 : ffffffffffffffff
r17 : a0007fffc7ff0000 r18 : ffffffffffffc000 r19 : 0000000000000000
r20 : ffffffffffffefff r21 : 0000000000000001 r22 : 0000000000000001
r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000000
r26 : 0000000000000000 r27 : 0000001008026018 r28 : e0000741fe427dd0
r29 : a0007fffc7ff13f8 r30 : 0000000000000000 r31 : e000072000051100

Call Trace:
[<a00000010000f3a0>] show_stack+0x80/0xa0
sp=e0000741fe427980 bsp=e0000741fe421010
[<a00000010000fc00>] show_regs+0x7e0/0x800
sp=e0000741fe427b50 bsp=e0000741fe420fa8
[<a0000001000335f0>] die+0x150/0x1c0
sp=e0000741fe427b60 bsp=e0000741fe420f68
[<a000000100052430>] ia64_do_page_fault+0x370/0x980
sp=e0000741fe427b60 bsp=e0000741fe420f00
[<a00000010000a780>] ia64_leave_kernel+0x0/0x260
sp=e0000741fe427bf0 bsp=e0000741fe420f00
[<a0000001000cf820>] scrubd_rmpage+0x60/0x140
sp=e0000741fe427dc0 bsp=e0000741fe420ec0
[<a00000010010e080>] zero_highest_order_page+0x120/0x2c0
sp=e0000741fe427dc0 bsp=e0000741fe420e68
[<a00000010010e330>] scrub_pgdat+0x110/0x1c0
sp=e0000741fe427dd0 bsp=e0000741fe420e30
[<a00000010010e600>] kscrubd+0x220/0x240
sp=e0000741fe427dd0 bsp=e0000741fe420e00
[<a000000100011410>] kernel_thread_helper+0xd0/0x100
sp=e0000741fe427e30 bsp=e0000741fe420dd0
[<a0000001000090e0>] start_kernel_thread+0x20/0x40
sp=e0000741fe427e30 bsp=e0000741fe420dd0



With atomic and page-zero patches applied

ftp://oss.sgi.com/projects/page_fault_performance/download/hpc-2.6.11.gz

the dmesg is attached.

These are the values in scrub_[start|stop]

debian-dsw@trixie:~$ cat /proc/sys/vm/scrub_start
5
debian-dsw@trixie:~$ cat /proc/sys/vm/scrub_stop
2

--------------------------------------------------
Darren Williams <dsw AT gelato.unsw.edu.au>
Gelato@UNSW <http://www.gelato.unsw.edu.au>
--------------------------------------------------


Attachments:
HPC-dmesg.txt (15.00 kB)

2005-03-08 04:03:56

by Christoph Lameter

[permalink] [raw]
Subject: Re: Page fault scalability patch V18: Overview

On Mon, 7 Mar 2005, Darren Williams wrote:

> Pid: 362, CPU 0, comm: kscrubd0
> psr : 0000121008022038 ifs : 8000000000000308 ip : [<a0000001000cf821>] Not tainted
> ip is at scrubd_rmpage+0x61/0x140

Would you try the new version on oss.sgi.com, please?

2005-03-31 06:57:19

by Christoph Lameter

[permalink] [raw]
Subject: Avoid spurious page faults by avoiding pte_clear -> set pte

The current way of updating ptes in the Linux vm is to first clear a pte
before setting it to another value. The clearing is performed while holding
the page_table_lock to ensure that the entry will not be modified in the
meantime, either by the CPU directly (clearing the pte clears the present
bit, which notifies the MMU that this entry is invalid), by an arch-specific
interrupt handler, or by another page fault handler running on another CPU.
This approach is necessary for some architectures that cannot perform atomic
updates of page table entries.
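
For illustration, here is a minimal sketch of this clear-then-set idiom,
modeled on the mm/mprotect.c code that the patch below replaces (the wrapper
function and its name are made up for the example, not part of the patch):

/*
 * Sketch only: update one pte by clearing it first. The pte is briefly
 * pte_none() between the two writes, so mm->page_table_lock must be held
 * to keep other fault handlers from installing an entry in the gap.
 */
static void pte_update_clear_then_set(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pgprot_t newprot)
{
	pte_t ptent;

	spin_lock(&mm->page_table_lock);
	ptent = ptep_get_and_clear(mm, addr, ptep);	/* first write: pte = 0 */
	set_pte_at(mm, addr, ptep, pte_modify(ptent, newprot)); /* second write */
	spin_unlock(&mm->page_table_lock);
}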

If a page table entry is cleared then a second CPU may generate a page fault
for that entry. The fault handler on the second CPU will then attempt to
acquire the page_table_lock and wait until the first CPU has completed
updating the page table entry. The fault handler on the second CPU will then
discover that everything is ok and simply do nothing (apart from incrementing
the counters for a minor fault and marking the page again as accessed).

However, most architectures actually support atomic operations on page
table entries. Using them would allow a page table entry to be updated in a
single atomic operation instead of being written twice, and there would be
no danger of generating spurious page faults on other CPUs.

The following patch introduces two new atomic operations ptep_xchg and
ptep_cmpxchg that may be provided by an architecture. The fallback in
include/asm-generic/pgtable.h is to simulate both operations through the
existing ptep_get_and_clear function. So there is essentially no change if
atomic operations on ptes have not been defined. Architectures that do
not support atomic operations on ptes may continue to use the clearing of
a pte as a form of locking.
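
A minimal usage sketch of ptep_cmpxchg (it mirrors the mm/mprotect.c change
further down in this patch; the wrapper function is made up for illustration):
re-read the pte and retry if it changed under us, e.g. because the hardware
set the dirty or accessed bit between the read and the update.

static void pte_change_prot(struct mm_struct *mm, unsigned long addr,
			    pte_t *ptep, pgprot_t newprot)
{
	pte_t ptent;

	do {
		ptent = *ptep;
		if (!pte_present(ptent))
			return;
	/* retry if the pte was modified between the read and the cmpxchg */
	} while (!ptep_cmpxchg(mm, addr, ptep, ptent,
			       pte_modify(ptent, newprot)));
}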

Atomic operations are enabled for i386, ia64 and x86_64 if a suitable
CPU is configured in SMP mode. Generic atomic definitions for ptep_xchg
and ptep_cmpxchg have been provided based on the existing xchg() and
cmpxchg() functions that already work atomically on many platforms.

The provided generic atomic functions may be overridden as usual by defining
the appropriate __HAVE_ARCH_xxx constant and providing a different
implementation.
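
For example, an architecture could announce its own version in its
asm/pgtable.h roughly as follows (an illustrative sketch only, not part of
this patch; ia64_cmpxchg8_acq is used here purely as an example of an
arch-level compare-and-exchange primitive with acquire semantics):

/* Hypothetical arch override, e.g. in include/asm-ia64/pgtable.h */
#define __HAVE_ARCH_PTEP_CMPXCHG
#define ptep_cmpxchg(__mm, __addr, __ptep, __old, __new)		\
	(ia64_cmpxchg8_acq(&pte_val(*(__ptep)),				\
			   pte_val(__new),				\
			   pte_val(__old)) == pte_val(__old))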

The attempt to reduce the use of the page_table_lock in the page fault handler
through atomic operations relies on a pte never being clear while it is in
use, even when the page_table_lock is not held. Clearing a pte before setting
it to another value could result in a situation in which a fault generated by
another CPU installs a pte which is then immediately overwritten by the first
CPU setting the pte to a valid value again. This patch is important for future
work on reducing the use of spinlocks in the vm through atomic operations on
ptes.
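
To spell out the race that this invariant guards against, a hypothetical
interleaving (illustration only, not taken from the patch):

/*
 *   CPU 0 (clear-then-set, no lock)      CPU 1 (lockless fault path)
 *   -------------------------------      ---------------------------
 *   old = ptep_get_and_clear(...);
 *                                        sees pte_none(*pte)
 *                                        installs its own pte
 *                                        (e.g. maps the zero page)
 *   set_pte_at(mm, addr, pte, new);      CPU 1's pte is silently lost
 *
 * With ptep_xchg/ptep_cmpxchg the pte never becomes clear, so CPU 1
 * either sees the old value or the new one, never an empty entry.
 */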

AIM7 Benchmark on an 8 processor system:

w/o patch
Tasks jobs/min jti jobs/min/task real cpu
1 471.79 100 471.7899 12.34 2.30 Wed Mar 30 21:29:54 2005
100 18068.92 89 180.6892 32.21 158.37 Wed Mar 30 21:30:27 2005
200 21427.39 84 107.1369 54.32 315.84 Wed Mar 30 21:31:22 2005
300 21500.87 82 71.6696 81.21 473.74 Wed Mar 30 21:32:43 2005
400 24886.42 83 62.2160 93.55 633.73 Wed Mar 30 21:34:23 2005
500 25658.89 81 51.3178 113.41 789.44 Wed Mar 30 21:36:17 2005
600 25693.47 81 42.8225 135.91 949.00 Wed Mar 30 21:38:33 2005
700 26098.32 80 37.2833 156.10 1108.17 Wed Mar 30 21:41:10 2005
800 26334.25 80 32.9178 176.80 1266.73 Wed Mar 30 21:44:07 2005
900 26913.85 80 29.9043 194.62 1422.11 Wed Mar 30 21:47:22 2005
1000 26749.89 80 26.7499 217.57 1583.95 Wed Mar 30 21:51:01 2005

w/patch:
Tasks jobs/min jti jobs/min/task real cpu
1 470.30 100 470.3030 12.38 2.33 Wed Mar 30 21:57:27 2005
100 18465.05 89 184.6505 31.52 158.62 Wed Mar 30 21:57:58 2005
200 22399.26 86 111.9963 51.97 315.95 Wed Mar 30 21:58:51 2005
300 24274.61 84 80.9154 71.93 475.04 Wed Mar 30 22:00:03 2005
400 25120.86 82 62.8021 92.67 634.10 Wed Mar 30 22:01:36 2005
500 25742.87 81 51.4857 113.04 791.13 Wed Mar 30 22:03:30 2005
600 26322.73 82 43.8712 132.66 948.31 Wed Mar 30 22:05:43 2005
700 25718.40 80 36.7406 158.41 1112.30 Wed Mar 30 22:08:22 2005
800 26361.08 80 32.9514 176.62 1269.94 Wed Mar 30 22:11:19 2005
900 26975.67 81 29.9730 194.17 1424.56 Wed Mar 30 22:14:33 2005
1000 26765.51 80 26.7655 217.44 1585.27 Wed Mar 30 22:18:12 2005

There are some minor performance improvements at some task counts and minimal
losses at others. The improvement may be due to the avoidance of one store
per pte update and the avoidance of useless page faults.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.11/mm/rmap.c
===================================================================
--- linux-2.6.11.orig/mm/rmap.c 2005-03-30 20:32:21.000000000 -0800
+++ linux-2.6.11/mm/rmap.c 2005-03-30 22:40:47.000000000 -0800
@@ -574,11 +574,6 @@ static int try_to_unmap_one(struct page

/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -593,10 +588,15 @@ static int try_to_unmap_one(struct page
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
dec_mm_counter(mm, anon_rss);
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ /* Move the dirty bit to the physical page now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);

inc_mm_counter(mm, rss);
page_remove_rmap(page);
@@ -689,15 +689,15 @@ static void try_to_unmap_cluster(unsigne
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address, pfn);
- pteval = ptep_clear_flush(vma, address, pte);

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_clear_flush(vma, address, pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);

Index: linux-2.6.11/mm/mprotect.c
===================================================================
--- linux-2.6.11.orig/mm/mprotect.c 2005-03-30 20:32:21.000000000 -0800
+++ linux-2.6.11/mm/mprotect.c 2005-03-30 22:42:51.000000000 -0800
@@ -32,17 +32,19 @@ static void change_pte_range(struct mm_s

pte = pte_offset_map(pmd, addr);
do {
- if (pte_present(*pte)) {
- pte_t ptent;
+ pte_t ptent;
+redo:
+ ptent = *pte;
+ if (!pte_present(ptent))
+ continue;

- /* Avoid an SMP race with hardware updated dirty/clean
- * bits by wiping the pte and then setting the new pte
- * into place.
- */
- ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
- set_pte_at(mm, addr, pte, ptent);
- lazy_mmu_prot_update(ptent);
- }
+ /* Deal with a potential SMP race with hardware/arch
+ * interrupt updating dirty/clean bits through the use
+ * of ptep_cmpxchg.
+ */
+ if (!ptep_cmpxchg(mm, addr, pte, ptent, pte_modify(ptent, newprot)))
+ goto redo;
+ lazy_mmu_prot_update(ptent);
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap(pte - 1);
}
Index: linux-2.6.11/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable.h 2005-03-30 20:32:08.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable.h 2005-03-30 22:40:47.000000000 -0800
@@ -111,6 +111,92 @@ do { \
})
#endif

+#ifdef CONFIG_ATOMIC_TABLE_OPS
+
+/*
+ * The architecture does support atomic table operations.
+ * We may be able to provide atomic ptep_xchg and ptep_cmpxchg using
+ * cmpxchg and xchg.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__mm, __address, __ptep, __pteval) \
+ __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)))
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__mm, __address, __ptep,__oldval,__newval) \
+ (cmpxchg(&pte_val(*(__ptep)), \
+ pte_val(__oldval), \
+ pte_val(__newval) \
+ ) == pte_val(__oldval) \
+ )
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg((__vma)->vm_mm, __address, __ptep, __pteval); \
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+
+#else
+
+/*
+ * No support for atomic operations on the page table.
+ * Exchanging of pte values is done by first swapping zeros into
+ * a pte and then putting new content into the pte entry.
+ * However, these functions will generate an empty pte for a
+ * short time frame. This means that the page_table_lock must be held
+ * to avoid a page fault that would install a new entry.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__mm, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_get_and_clear(__mm, __address, __ptep); \
+ set_pte_at(__mm, __address, __ptep, __pteval); \
+ __pte; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte_at((__vma)->vm_mm, __address, __ptep, __pteval); \
+ __pte; \
+})
+#else
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg((__vma)->vm_mm, __address, __ptep, __pteval);\
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+#endif
+
+/*
+ * The fallback function for ptep_cmpxchg avoids any real use of cmpxchg
+ * since cmpxchg may not be available on certain architectures. Instead
+ * the clearing of a pte is used as a form of locking mechanism.
+ * This approach will only work if the page_table_lock is held to insure
+ * that the pte is not populated by a page fault generated on another
+ * CPU.
+ */
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__mm, __address, __ptep, __old, __new) \
+({ \
+ pte_t prev = ptep_get_and_clear(__mm, __address, __ptep); \
+ int r = pte_val(prev) == pte_val(__old); \
+ set_pte_at(__mm, __address, __ptep, r ? (__new) : prev); \
+ r; \
+})
+#endif
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
{
Index: linux-2.6.11/arch/ia64/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/ia64/Kconfig 2005-03-01 23:38:26.000000000 -0800
+++ linux-2.6.11/arch/ia64/Kconfig 2005-03-30 22:40:47.000000000 -0800
@@ -272,6 +272,11 @@ config PREEMPT
Say Y here if you are building a kernel for a desktop, embedded
or real-time system. Say N if you are unsure.

+config ATOMIC_TABLE_OPS
+ bool
+ depends on SMP
+ default y
+
config HAVE_DEC_LOCK
bool
depends on (SMP || PREEMPT)
Index: linux-2.6.11/arch/i386/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/i386/Kconfig 2005-03-30 20:31:27.000000000 -0800
+++ linux-2.6.11/arch/i386/Kconfig 2005-03-30 22:40:47.000000000 -0800
@@ -886,6 +886,11 @@ config HAVE_DEC_LOCK
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y

+config ATOMIC_TABLE_OPS
+ bool
+ depends on SMP && X86_CMPXCHG && !X86_PAE
+ default y
+
# turning this on wastes a bunch of space.
# Summit needs it only when NUMA is on
config BOOT_IOREMAP
Index: linux-2.6.11/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/x86_64/Kconfig 2005-03-30 20:31:40.000000000 -0800
+++ linux-2.6.11/arch/x86_64/Kconfig 2005-03-30 22:40:47.000000000 -0800
@@ -223,6 +223,11 @@ config PREEMPT
Say Y here if you are feeling brave and building a kernel for a
desktop, embedded or real-time system. Say N if you are unsure.

+config ATOMIC_TABLE_OPS
+ bool
+ depends on SMP
+ default y
+
config PREEMPT_BKL
bool "Preempt The Big Kernel Lock"
depends on PREEMPT
Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c 2005-03-30 20:32:21.000000000 -0800
+++ linux-2.6.11/mm/memory.c 2005-03-30 22:40:47.000000000 -0800
@@ -463,15 +463,19 @@ static void zap_pte_range(struct mmu_gat
page->index > details->last_index))
continue;
}
- ptent = ptep_get_and_clear(tlb->mm, addr, pte);
- tlb_remove_tlb_entry(tlb, pte, addr);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ ptent = ptep_get_and_clear(tlb->mm, addr, pte);
+ tlb_remove_tlb_entry(tlb, pte, addr);
continue;
+ }
if (unlikely(details) && details->nonlinear_vma
&& linear_page_index(details->nonlinear_vma,
addr) != page->index)
- set_pte_at(tlb->mm, addr, pte,
- pgoff_to_pte(page->index));
+ ptent = ptep_xchg(tlb->mm, addr, pte,
+ pgoff_to_pte(page->index));
+ else
+ ptent = ptep_get_and_clear(tlb->mm, addr, pte);
+ tlb_remove_tlb_entry(tlb, pte, addr);
if (pte_dirty(ptent))
set_page_dirty(page);
if (PageAnon(page))