2005-01-04 19:44:13

by Christoph Lameter

Subject: page fault scalability patch V14 [0/7]: Overview

Changes from V13->V14 of this patch:
- 4-level page table support
- Tested on ia64, and on i386 in both regular and PAE mode

This is a series of patches that increases the scalability of
the page fault handler for SMP. The performance increase is
achieved by avoiding the page_table_lock spinlock (but not
mm->mmap_sem) through new atomic operations on ptes (ptep_xchg,
ptep_cmpxchg) and on pmds and pgds (pgd_test_and_populate,
pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them, and the
swapper code has been changed to never clear a pte until the
page has been evicted. Populating an empty pte is frequent
when a process touches newly allocated memory.

2. Modifications of flags in a pte entry (write/accessed).

These modifications are already done by the CPU, or by low-level
handlers on various platforms, while bypassing the page_table_lock,
so this appears to be safe as well. (Both cases are sketched below.)
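
In both cases the lock-free update boils down to a compare-and-exchange
on the pte. The following is a simplified sketch of the logic that
patch 1/7 adds to do_anonymous_page and handle_pte_fault; ptep_cmpxchg
is the primitive introduced by this series (or its generic emulation),
and the pte_unmap/accounting details are left out:

	/*
	 * Case 1: populate a pte that was empty when the fault was taken.
	 * orig_entry was read without the page_table_lock; the cmpxchg
	 * succeeds only if nobody else has filled the slot meanwhile.
	 */
	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
		page_cache_release(page);	/* lost the race, back out */
		return VM_FAULT_MINOR;
	}

	/*
	 * Case 2: update only the accessed/dirty bits of a present pte.
	 * If the cmpxchg fails the pte changed under us, which is harmless
	 * here since the fault will simply be retried.
	 */
	new_entry = pte_mkyoung(entry);
	if (write_access)
		new_entry = pte_mkdirty(new_entry);
	ptep_cmpxchg(vma, address, pte, entry, new_entry);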

One essential change in the VM is the use of ptep_cmpxchg (or its
generic emulation) on page table entries before doing an
update_mmu_cache without holding the page table lock. However, we
already do similar things with other atomic pte operations such as
ptep_get_and_clear and ptep_test_and_clear_dirty. Those operations
clear a pte *after* doing an operation on it. The ptep_cmpxchg as used
in this patch operates on a *cleared* pte and replaces it with a pte
pointing to valid memory. The effect of this change on the various
architectures has to be thought through. Local definitions of
ptep_cmpxchg and ptep_xchg may be necessary.

For IA64 an icache coherency issue may arise that potentially requires
the flushing of the icache (as done via update_mmu_cache on IA64) prior
to the use of ptep_cmpxchg. Similar issues may arise on other platforms.

The patch introduces a split counter for rss handling to avoid the
atomic operations and locks currently necessary for rss modifications.
In addition to mm->rss, tsk->rss is introduced. tsk->rss is placed
in the same cache line as tsk->mm (which is already used by the fault
handler), so tsk->rss can be incremented quickly and without locks,
and the counter's cache line does not have to be shared between
processors by the page fault handler.

An RCU-based task list is maintained for each mm. The values in that
list are summed up to calculate the rss and anon_rss values.
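
The fast path in the fault handler then becomes a plain current->rss++
with no locked instruction and no shared cache line. The reader side,
condensed from get_rss() in patch 7/7 below (field names as introduced
by that patch), sums the per-task deltas under RCU:

	struct list_head *p;
	struct task_struct *t;
	long rss = mm->rss;	/* remainders folded back when tasks exit */

	rcu_read_lock();
	list_for_each_rcu(p, &mm->task_list) {
		t = list_entry(p, struct task_struct, mm_tasks);
		rss += t->rss;
	}
	rcu_read_unlock();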

The patchset is composed of 7 patches (and was tested against 2.6.10-bk6):

1/7: Avoid page_table_lock in handle_mm_fault

This patch defers the acquisition of the page_table_lock as much as
possible and uses atomic operations for allocating anonymous memory.
These atomic operations are simulated by acquiring the page_table_lock
for very small time frames if an architecture does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes kswapd so that a
pte will not be set to empty if a page is in transition to swap.
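
For architectures that do not define __HAVE_ARCH_ATOMIC_TABLE_OPS, the
fallback added to include/asm-generic/pgtable.h (see the patch below)
is essentially the following, shown here in condensed form:

	int rc;

	/* ptep_cmpxchg() emulated with the page_table_lock */
	spin_lock(&vma->vm_mm->page_table_lock);
	rc = pte_same(*ptep, oldval);
	if (rc) {
		set_pte(ptep, newval);
		update_mmu_cache(vma, addr, newval);
	}
	spin_unlock(&vma->vm_mm->page_table_lock);
	/* rc != 0 means the exchange took place */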

If only the first two patches are applied then the time that the
page_table_lock is held is simply reduced. The lock may then be
acquired multiple times during a page fault.

2/7: Atomic pte operations for ia64

3/7: Make cmpxchg generally available on i386

The atomic operations on the page table rely heavily on cmpxchg
instructions. This patch adds emulations of cmpxchg and cmpxchg8b
for old 80386 and 80486 CPUs. The emulations are only included if the
kernel is built for these old CPUs, and they are bypassed in favor of
the real cmpxchg instructions if a kernel built for a 386 or 486 is
then run on a more recent CPU.

This patch may be used independently of the other patches.
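
With the emulation in place, cmpxchg() can be used unconditionally in
generic i386 code. A minimal usage example (the counter and the helper
are hypothetical); the call only falls back to the emulation function
when the kernel was built for a 386 and actually runs on one:

	static unsigned long counter;

	static void counter_inc(void)
	{
		unsigned long old;

		/* classic compare-and-swap retry loop */
		do {
			old = counter;
		} while (cmpxchg(&counter, old, old + 1) != old);
	}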

4/7: Atomic pte operations for i386

A generally available cmpxchg (provided by the previous patch) is
required for this patch in order to preserve the ability to build
kernels for 386 and 486.

5/7: Atomic pte operations for x86_64

6/7: Atomic pte operations for s390

7/7: Split counter implementation for rss
Add tsk->rss and tsk->anon_rss. Add tasklist. Add logic
to calculate rss from tasklist.

Signed-off-by: Christoph Lameter <[email protected]>


2005-01-04 19:43:44

by Christoph Lameter

Subject: page fault scalability patch V14 [3/7]: i386 universal cmpxchg

Changelog
* Make cmpxchg and cmpxchg8b generally available on the i386
platform.
* Provide emulation of cmpxchg suitable for uniprocessor systems
if built and run on a 386.
* Provide emulation of cmpxchg8b suitable for uniprocessor
systems if built and run on a 386 or 486.
* Provide an inline function to atomically get a 64 bit value
via cmpxchg8b in an SMP system (courtesy of Nick Piggin)
(important for i386 PAE mode and other places where atomic
64 bit operations are useful)

Signed-off-by: Christoph Lameter <[email protected]>
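
The 64-bit read helper matters for PAE, where a pte consists of two
32-bit words and an unlocked read could tear while another CPU updates
the entry. An illustrative use (the variable below is hypothetical; in
the series the pte read path is meant to go through the get_pte_atomic
hook added by the generic patch):

	unsigned long long val;	/* e.g. a PAE pte or a 64-bit counter */
	unsigned long long snap;

	/* the LOCK cmpxchg8b loop inside get_64bit() returns a consistent copy */
	snap = get_64bit(&val);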

Index: linux-2.6.9/arch/i386/Kconfig
===================================================================
--- linux-2.6.9.orig/arch/i386/Kconfig 2004-12-10 09:58:03.000000000 -0800
+++ linux-2.6.9/arch/i386/Kconfig 2004-12-10 09:59:27.000000000 -0800
@@ -351,6 +351,11 @@
depends on !M386
default y

+config X86_CMPXCHG8B
+ bool
+ depends on !M386 && !M486
+ default y
+
config X86_XADD
bool
depends on !M386
Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-12-06 17:23:49.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-12-10 09:59:27.000000000 -0800
@@ -6,6 +6,7 @@
#include <linux/bitops.h>
#include <linux/smp.h>
#include <linux/thread_info.h>
+#include <linux/module.h>

#include <asm/processor.h>
#include <asm/msr.h>
@@ -287,5 +288,103 @@
return 0;
}

+#ifndef CONFIG_X86_CMPXCHG
+unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new)
+{
+ u8 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all.
+ * All CPUs except the 386 support CMPXCHG.
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u8));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u8 *)ptr;
+ if (prev == old)
+ *(u8 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u8);
+
+unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new)
+{
+ u16 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all.
+ * All CPUs except the 386 support CMPXCHG.
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u16));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u16 *)ptr;
+ if (prev == old)
+ *(u16 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u16);
+
+unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new)
+{
+ u32 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all.
+ * All CPUs except the 386 support CMPXCHG.
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u32));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u32 *)ptr;
+ if (prev == old)
+ *(u32 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u32);
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ unsigned long flags;
+
+ /*
+ * Check if the kernel was compiled for an old cpu but
+ * we are really running on a cpu capable of cmpxchg8b
+ */
+
+ if (cpu_has(cpu_data, X86_FEATURE_CX8))
+ return __cmpxchg8b(ptr, old, newv);
+
+ /* Poor man's cmpxchg8b for 386 and 486. Not suitable for SMP */
+ local_irq_save(flags);
+ prev = *ptr;
+ if (prev == old)
+ *ptr = newv;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg8b_486);
+#endif
+
// arch_initcall(intel_cpu_init);

Index: linux-2.6.9/include/asm-i386/system.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/system.h 2004-12-06 17:23:55.000000000 -0800
+++ linux-2.6.9/include/asm-i386/system.h 2004-12-10 10:00:49.000000000 -0800
@@ -149,6 +149,9 @@
#define __xg(x) ((struct __xchg_dummy *)(x))


+#define ll_low(x) *(((unsigned int*)&(x))+0)
+#define ll_high(x) *(((unsigned int*)&(x))+1)
+
/*
* The semantics of XCHGCMP8B are a bit strange, this is why
* there is a loop and the loading of %%eax and %%edx has to
@@ -184,8 +187,6 @@
{
__set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL));
}
-#define ll_low(x) *(((unsigned int*)&(x))+0)
-#define ll_high(x) *(((unsigned int*)&(x))+1)

static inline void __set_64bit_var (unsigned long long *ptr,
unsigned long long value)
@@ -203,6 +204,26 @@
__set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
__set_64bit(ptr, ll_low(value), ll_high(value)) )

+static inline unsigned long long __get_64bit(unsigned long long * ptr)
+{
+ unsigned long long ret;
+ __asm__ __volatile__ (
+ "\n1:\t"
+ "movl (%1), %%eax\n\t"
+ "movl 4(%1), %%edx\n\t"
+ "movl %%eax, %%ebx\n\t"
+ "movl %%edx, %%ecx\n\t"
+ LOCK_PREFIX "cmpxchg8b (%1)\n\t"
+ "jnz 1b"
+ : "=A"(ret)
+ : "D"(ptr)
+ : "ebx", "ecx", "memory");
+ return ret;
+}
+
+#define get_64bit(ptr) __get_64bit(ptr)
+
+
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway
* Note 2: xchg has side effect, so that attribute volatile is necessary,
@@ -240,7 +261,41 @@
*/

#ifdef CONFIG_X86_CMPXCHG
+
#define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable of running on an 80386. It may be necessary to
+ * simulate the cmpxchg on the 80386 CPU. For that purpose we define
+ * a function for each of the sizes we support.
+ */
+
+extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8);
+extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16);
+extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32);
+
+static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+ unsigned long new, int size)
+{
+ switch (size) {
+ case 1:
+ return cmpxchg_386_u8(ptr, old, new);
+ case 2:
+ return cmpxchg_386_u16(ptr, old, new);
+ case 4:
+ return cmpxchg_386_u32(ptr, old, new);
+ }
+ return old;
+}
+
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
#endif

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,12 +325,34 @@
return old;
}

-#define cmpxchg(ptr,o,n)\
- ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
- (unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ __asm__ __volatile__(
+ LOCK_PREFIX "cmpxchg8b (%4)"
+ : "=A" (prev)
+ : "0" (old), "c" ((unsigned long)(newv >> 32)),
+ "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr)
+ : "memory");
+ return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on 80486 and 80386. Neither
+ * supports cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(volatile unsigned long long *,
+ unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
#ifdef __KERNEL__
-struct alt_instr {
+struct alt_instr {
__u8 *instr; /* original instruction */
__u8 *replacement;
__u8 cpuid; /* cpuid bit set for replacement */

2005-01-04 19:48:02

by Christoph Lameter

Subject: page fault scalability patch V14 [2/7]: ia64 atomic pte operations

Changelog
* Provide atomic pte operations for ia64
* Enhanced parallelism in page fault handler if applied together
with the generic patch

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h 2005-01-03 12:37:44.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE 0
+#define PUD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -84,6 +88,13 @@
pud_val(*pud_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE;
+}
+
static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
{
@@ -131,6 +142,13 @@
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{
Index: linux-2.6.10/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgtable.h 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgtable.h 2005-01-03 12:37:44.000000000 -0800
@@ -30,6 +30,8 @@
#define _PAGE_P_BIT 0
#define _PAGE_A_BIT 5
#define _PAGE_D_BIT 6
+#define _PAGE_IG_BITS 53
+#define _PAGE_LOCK_BIT (_PAGE_IG_BITS+3) /* bit 56. Aligned to 8 bits */

#define _PAGE_P (1 << _PAGE_P_BIT) /* page present bit */
#define _PAGE_MA_WB (0x0 << 2) /* write back memory attribute */
@@ -58,6 +60,7 @@
#define _PAGE_PPN_MASK (((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL)
#define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */
#define _PAGE_PROTNONE (__IA64_UL(1) << 63)
+#define _PAGE_LOCK (__IA64_UL(1) << _PAGE_LOCK_BIT)

/* Valid only for a PTE with the present bit cleared: */
#define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */
@@ -271,6 +274,8 @@
#define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0)
#define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0)
#define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0)
+#define pte_locked(pte) ((pte_val(pte) & _PAGE_LOCK)!=0)
+
/*
* Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the
* access rights:
@@ -282,8 +287,15 @@
#define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A))
#define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D))
#define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D))
+#define pte_mkunlocked(pte) (__pte(pte_val(pte) & ~_PAGE_LOCK))

/*
+ * Lock functions for pte's
+ */
+#define ptep_lock(ptep) test_and_set_bit(_PAGE_LOCK_BIT, ptep)
+#define ptep_unlock(ptep) { clear_bit(_PAGE_LOCK_BIT,ptep); smp_mb__after_clear_bit(); }
+#define ptep_unlock_set(ptep, val) set_pte(ptep, pte_mkunlocked(val))
+/*
* Macro to a page protection value as "uncacheable". Note that "protection" is really a
* misnomer here as the protection value contains the memory attribute bits, dirty bits,
* and various other bits as well.
@@ -343,7 +355,6 @@
#define pte_unmap_nested(pte) do { } while (0)

/* atomic versions of the some PTE manipulations: */
-
static inline int
ptep_test_and_clear_young (pte_t *ptep)
{
@@ -415,6 +426,26 @@
#endif
}

+/*
+ * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
+ * information. However, we use this routine to take care of any (delayed) i-cache
+ * flushing that may be necessary.
+ */
+extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ /*
+ * IA64 defers icache flushes. If the new pte is executable we may
+ * have to flush the icache to ensure cache coherency immediately
+ * after the cmpxchg.
+ */
+ if (pte_exec(newval))
+ update_mmu_cache(vma, addr, newval);
+ return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
static inline int
pte_same (pte_t a, pte_t b)
{
@@ -477,13 +508,6 @@
struct vm_area_struct * prev, unsigned long start, unsigned long end);
#endif

-/*
- * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
- * information. However, we use this routine to take care of any (delayed) i-cache
- * flushing that may be necessary.
- */
-extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
-
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Update PTEP with ENTRY, which is guaranteed to be a less
@@ -561,6 +585,8 @@
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+#define __HAVE_ARCH_LOCK_TABLE_OPS
#include <asm-generic/pgtable.h>
#include <asm-generic/pgtable-nopud.h>


2005-01-04 19:48:03

by Christoph Lameter

Subject: page fault scalability patch V14 [1/7]: Avoid taking page_table_lock

Changelog
* Increase parallelism in SMP configurations by deferring
the acquisition of page_table_lock in handle_mm_fault
* Anonymous memory page faults bypass the page_table_lock
through the use of atomic page table operations
* Swapper does not set pte to empty in transition to swap
* Simulate atomic page table operations using the
page_table_lock if an arch does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
a performance benefit since the page_table_lock
is held for shorter periods of time.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-03 15:48:34.000000000 -0800
@@ -1537,8 +1537,7 @@
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1550,15 +1549,13 @@
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1581,8 +1578,7 @@
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1629,14 +1625,12 @@
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
@@ -1648,7 +1642,6 @@
if (write_access) {
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
@@ -1657,30 +1650,34 @@
goto no_mem;
clear_user_highpage(page, addr);

- spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
vma->vm_page_prot)),
vma);
- lru_cache_add_active(page);
- mark_page_accessed(page);
- page_add_anon_rmap(page, vma, addr);
}

- set_pte(page_table, entry);
+ /* update the entry */
+ if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
+ if (write_access) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ }
+ goto out;
+ }
+ if (write_access) {
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ page_add_anon_rmap(page, vma, addr);
+ lru_cache_add_active(page);
+ mm->rss++;
+
+ }
pte_unmap(page_table);

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
- spin_unlock(&mm->page_table_lock);
out:
return VM_FAULT_MINOR;
no_mem:
@@ -1696,12 +1693,12 @@
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1712,9 +1709,8 @@

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1812,7 +1808,7 @@
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1825,13 +1821,12 @@
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

- pgoff = pte_to_pgoff(*pte);
+ pgoff = pte_to_pgoff(entry);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -1850,49 +1845,46 @@
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that we handle that case properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

- entry = *pte;
+ /*
+ * This must be an atomic operation since the page_table_lock is
+ * not held. If a pte_t larger than the word size is used an
+ * incorrect value could be read because another processor is
+ * concurrently updating the multi-word pte. The i386 PAE mode
+ * is raising its ugly head here.
+ */
+ entry = get_pte_atomic(pte);
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ /*
+ * This is the case in which we only update some bits in the pte.
+ */
+ new_entry = pte_mkyoung(entry);
if (write_access) {
- if (!pte_write(entry))
+ if (!pte_write(entry)) {
+ /* do_wp_page expects us to hold the page_table_lock */
+ spin_lock(&mm->page_table_lock);
return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ }
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+ ptep_cmpxchg(vma, address, pte, entry, new_entry);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
}

@@ -1911,33 +1903,54 @@

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We rely on the mmap_sem and the SMP-safe atomic PTE updates
+ * to synchronize with kswapd. We can avoid the overhead
+ * of the p??_alloc functions through atomic operations so
+ * we duplicate the functionality of pmd_alloc, pud_alloc and
+ * pte_alloc_map here.
*/
pgd = pgd_offset(mm, address);
- spin_lock(&mm->page_table_lock);
+ if (unlikely(pgd_none(*pgd))) {
+ pud_t *new = pud_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pud_free(new);
+ }

- pud = pud_alloc(mm, pgd, address);
- if (!pud)
- goto oom;
-
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- goto oom;
-
- pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- goto oom;
+ pud = pud_offset(pgd, address);
+ if (unlikely(pud_none(*pud))) {
+ pmd_t *new = pmd_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ if (!pud_test_and_populate(mm, pud, new))
+ pmd_free(new);
+ }
+
+ pmd = pmd_offset(pud, address);
+ if (unlikely(!pmd_present(*pmd))) {
+ struct page *new = pte_alloc_one(mm, address);

- return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+ if (!new)
+ return VM_FAULT_OOM;

- oom:
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else {
+ inc_page_state(nr_page_table_pages);
+ mm->nr_ptes++;
+ }
+ }
+
+ pte = pte_offset_map(pmd, address);
+ return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}

#ifndef __ARCH_HAS_4LEVEL_HACK
Index: linux-2.6.10/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable.h 2005-01-03 15:48:34.000000000 -0800
@@ -28,6 +28,11 @@
#endif /* __HAVE_ARCH_SET_PTE_ATOMIC */
#endif

+/* Get a pte entry without the page table lock */
+#ifndef __HAVE_ARCH_GET_PTE_ATOMIC
+#define get_pte_atomic(__x) *(__x)
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Largely same as above, but only sets the access flags (dirty,
@@ -134,4 +139,61 @@
#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr)
#endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to ensure some form of locking.
+ * Note though that low level operations as well as the
+ * page table handling of the cpu may bypass all locking.
+ */
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \
+({ \
+ int __rc; \
+ spin_lock(&__vma->vm_mm->page_table_lock); \
+ __rc = pte_same(*(__ptep), __oldval); \
+ if (__rc) { set_pte(__ptep, __newval); \
+ update_mmu_cache(__vma, __addr, __newval); } \
+ spin_unlock(&__vma->vm_mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pgd_present(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pmd); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\
+ flush_tlb_page(__vma, __address); \
+ __p; \
+})
+
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-01-03 15:48:34.000000000 -0800
@@ -432,7 +432,10 @@
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
*
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
*/
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
@@ -581,11 +584,6 @@

/* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -600,11 +598,15 @@
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte(pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
mm->anon_rss--;
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);

+ /* Move the dirty bit to the physical page now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
mm->rss--;
page_remove_rmap(page);
page_cache_release(page);
@@ -696,15 +698,21 @@
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
+ /*
+ * There would be a race here with handle_mm_fault and do_anonymous_page,
+ * which bypass the page_table_lock, if we zapped the pte before
+ * putting something into it. We also need to capture the
+ * dirty flag at the time we replace the value.
+ */

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte(pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_get_and_clear(pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);

Index: linux-2.6.10/include/asm-generic/pgtable-nopud.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable-nopud.h 2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable-nopud.h 2005-01-03 15:48:34.000000000 -0800
@@ -25,8 +25,9 @@
static inline int pgd_present(pgd_t pgd) { return 1; }
static inline void pgd_clear(pgd_t *pgd) { }
#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
-
#define pgd_populate(mm, pgd, pud) do { } while (0)
+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) { return 1; }
+
/*
* (puds are folded into pgds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
Index: linux-2.6.10/include/asm-generic/pgtable-nopmd.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable-nopmd.h 2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable-nopmd.h 2005-01-03 15:49:12.000000000 -0800
@@ -29,6 +29,7 @@
#define pmd_ERROR(pmd) (pud_ERROR((pmd).pud))

#define pud_populate(mm, pmd, pte) do { } while (0)
+static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) { return 1; }

/*
* (pmds are folded into puds so this doesn't get actually called,

2005-01-04 19:52:57

by Andi Kleen

Subject: Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations

Christoph Lameter <[email protected]> writes:

I bet this has never been tested.

> #define pmd_populate_kernel(mm, pmd, pte) \
> set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
> #define pud_populate(mm, pud, pmd) \
> @@ -14,11 +18,24 @@
> #define pgd_populate(mm, pgd, pud) \
> set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)))
>
> +#define pmd_test_and_populate(mm, pmd, pte) \
> + (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE)
> +#define pud_test_and_populate(mm, pud, pmd) \
> + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
> +#define pgd_test_and_populate(mm, pgd, pud) \
> + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE)
> +

Shouldn't this all be (long *)pmd? Page table entries on x86-64 are 64-bit.
Also, why do you cast at all? I think the macro should handle an arbitrary
pointer.

> +
> static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
> {
> set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
> }
>
> +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
> +{
> + return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;

Same.

-Andi

2005-01-04 19:52:57

by Christoph Lameter

Subject: page fault scalability patch V14 [7/7]: Split RSS counters

Changelog
* Split the rss counter into the task structure
* Remove three checks of rss in mm/rmap.c
* Increment current->rss instead of mm->rss in the page fault handler
* Move the incrementing of anon_rss out of page_add_anon_rmap to group
the increments more tightly and allow better cache utilization

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2004-12-24 13:33:59.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-01-03 12:21:32.000000000 -0800
@@ -30,6 +30,7 @@
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
+#include <linux/rcupdate.h>

struct exec_domain;

@@ -217,6 +218,7 @@
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ long rss, anon_rss;

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -226,7 +228,7 @@
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
+ unsigned long total_vm, locked_vm, shared_vm;
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
@@ -236,6 +238,7 @@

/* Architecture-specific MM context */
mm_context_t context;
+ struct list_head task_list; /* Tasks using this mm */

/* Token based thrashing protection. */
unsigned long swap_token_time;
@@ -545,6 +548,9 @@
struct list_head ptrace_list;

struct mm_struct *mm, *active_mm;
+ /* Split counters from mm */
+ long rss;
+ long anon_rss;

/* task state */
struct linux_binfmt *binfmt;
@@ -578,6 +584,10 @@
struct completion *vfork_done; /* for vfork() */
int __user *set_child_tid; /* CLONE_CHILD_SETTID */
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
+
+ /* List of other tasks using the same mm */
+ struct list_head mm_tasks;
+ struct rcu_head rcu_head; /* For freeing the task via rcu */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -1124,6 +1134,12 @@

#endif

+void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss);
+
+void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk);
+void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk);
+
#endif /* __KERNEL__ */

#endif
+
Index: linux-2.6.10/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.10.orig/fs/proc/task_mmu.c 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/fs/proc/task_mmu.c 2005-01-03 12:21:32.000000000 -0800
@@ -6,8 +6,9 @@

char *task_mem(struct mm_struct *mm, char *buffer)
{
- unsigned long data, text, lib;
+ unsigned long data, text, lib, rss, anon_rss;

+ get_rss(mm, &rss, &anon_rss);
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
@@ -22,7 +23,7 @@
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -37,11 +38,14 @@
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ unsigned long rss, anon_rss;
+
+ get_rss(mm, &rss, &anon_rss);
+ *shared = rss - anon_rss;
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = mm->rss;
+ *resident = rss;
return mm->total_vm;
}

Index: linux-2.6.10/fs/proc/array.c
===================================================================
--- linux-2.6.10.orig/fs/proc/array.c 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/fs/proc/array.c 2005-01-03 12:21:32.000000000 -0800
@@ -302,7 +302,7 @@

static int do_task_stat(struct task_struct *task, char * buffer, int whole)
{
- unsigned long vsize, eip, esp, wchan = ~0UL;
+ unsigned long rss, anon_rss, vsize, eip, esp, wchan = ~0UL;
long priority, nice;
int tty_pgrp = -1, tty_nr = 0;
sigset_t sigign, sigcatch;
@@ -325,6 +325,7 @@
vsize = task_vsize(mm);
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
+ get_rss(mm, &rss, &anon_rss);
}

get_task_comm(tcomm, task);
@@ -420,7 +421,7 @@
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? rss : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-01-03 10:31:41.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-01-03 12:21:32.000000000 -0800
@@ -264,8 +264,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -446,8 +444,6 @@
BUG_ON(PageReserved(page));
BUG_ON(!anon_vma);

- vma->vm_mm->anon_rss++;
-
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
index = (address - vma->vm_start) >> PAGE_SHIFT;
index += vma->vm_pgoff;
@@ -519,8 +515,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -817,8 +811,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
Index: linux-2.6.10/kernel/fork.c
===================================================================
--- linux-2.6.10.orig/kernel/fork.c 2004-12-24 13:33:59.000000000 -0800
+++ linux-2.6.10/kernel/fork.c 2005-01-03 12:21:32.000000000 -0800
@@ -78,10 +78,16 @@
static kmem_cache_t *task_struct_cachep;
#endif

+static void rcu_free_task(struct rcu_head *head)
+{
+ struct task_struct *tsk = container_of(head, struct task_struct, rcu_head);
+ free_task_struct(tsk);
+}
+
void free_task(struct task_struct *tsk)
{
free_thread_info(tsk->thread_info);
- free_task_struct(tsk);
+ call_rcu(&tsk->rcu_head, rcu_free_task);
}
EXPORT_SYMBOL(free_task);

@@ -98,7 +104,7 @@
put_group_info(tsk->group_info);

if (!profile_handoff_task(tsk))
- free_task(tsk);
+ call_rcu(&tsk->rcu_head, rcu_free_task);
}

void __init fork_init(unsigned long mempages)
@@ -151,6 +157,7 @@
*tsk = *orig;
tsk->thread_info = ti;
ti->task = tsk;
+ tsk->rss = 0;

/* One for us, one for whoever does the "release_task()" (usually parent) */
atomic_set(&tsk->usage,2);
@@ -292,6 +299,7 @@
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
+ INIT_LIST_HEAD(&mm->task_list);
mm->core_waiters = 0;
mm->nr_ptes = 0;
spin_lock_init(&mm->page_table_lock);
@@ -400,6 +408,8 @@

/* Get rid of any cached register state */
deactivate_mm(tsk, mm);
+ if (mm)
+ mm_remove_thread(mm, tsk);

/* notify parent sleeping on vfork() */
if (vfork_done) {
@@ -447,8 +457,8 @@
* new threads start up in user mode using an mm, which
* allows optimizing out ipis; the tlb_gather_mmu code
* is an example.
+ * (mm_add_thread does use the ptl .... )
*/
- spin_unlock_wait(&oldmm->page_table_lock);
goto good_mm;
}

@@ -470,6 +480,7 @@
goto free_pt;

good_mm:
+ mm_add_thread(mm, tsk);
tsk->mm = mm;
tsk->active_mm = mm;
return 0;
@@ -1063,7 +1074,7 @@
atomic_dec(&p->user->processes);
free_uid(p->user);
bad_fork_free:
- free_task(p);
+ call_rcu(&p->rcu_head, rcu_free_task);
goto fork_out;
}

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-03 11:15:55.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-03 12:21:32.000000000 -0800
@@ -909,6 +909,7 @@
struct page *map;
int lookup_write = write;
while (!(map = follow_page(mm, start, lookup_write))) {
+ unsigned long rss, anon_rss;
/*
* Shortcut for anonymous pages. We don't want
* to force the creation of pages tables for
@@ -921,6 +922,17 @@
map = ZERO_PAGE(start);
break;
}
+ if (mm != current->mm) {
+ /*
+ * handle_mm_fault uses the current pointer
+ * for a split rss counter. The current pointer
+ * is not correct if we are using a different mm
+ */
+ rss = current->rss;
+ anon_rss = current->anon_rss;
+ current->rss = 0;
+ current->anon_rss = 0;
+ }
spin_unlock(&mm->page_table_lock);
switch (handle_mm_fault(mm,vma,start,write)) {
case VM_FAULT_MINOR:
@@ -945,6 +957,12 @@
*/
lookup_write = write && !force;
spin_lock(&mm->page_table_lock);
+ if (mm != current->mm) {
+ mm->rss += current->rss;
+ mm->anon_rss += current->anon_rss;
+ current->rss = rss;
+ current->anon_rss = anon_rss;
+ }
}
if (pages) {
pages[i] = get_page_map(map);
@@ -1325,6 +1343,7 @@
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);
+ mm->anon_rss++;

/* Free the old page.. */
new_page = old_page;
@@ -1608,6 +1627,7 @@
flush_icache_page(vma, page);
set_pte(page_table, pte);
page_add_anon_rmap(page, vma, address);
+ mm->anon_rss++;

if (write_access) {
if (do_wp_page(mm, vma, address,
@@ -1674,6 +1694,7 @@
page_add_anon_rmap(page, vma, addr);
lru_cache_add_active(page);
mm->rss++;
+ mm->anon_rss++;

}
pte_unmap(page_table);
@@ -1780,6 +1801,7 @@
if (anon) {
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);
+ mm->anon_rss++;
} else
page_add_file_rmap(new_page);
pte_unmap(page_table);
@@ -2143,3 +2165,49 @@
}

#endif
+
+void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss)
+{
+ struct list_head *y;
+ struct task_struct *t;
+ long rss_sum, anon_rss_sum;
+
+ rcu_read_lock();
+ rss_sum = mm->rss;
+ anon_rss_sum = mm->anon_rss;
+ list_for_each_rcu(y, &mm->task_list) {
+ t = list_entry(y, struct task_struct, mm_tasks);
+ rss_sum += t->rss;
+ anon_rss_sum += t->anon_rss;
+ }
+ if (rss_sum < 0)
+ rss_sum = 0;
+ if (anon_rss_sum < 0)
+ anon_rss_sum = 0;
+ rcu_read_unlock();
+ *rss = rss_sum;
+ *anon_rss = anon_rss_sum;
+}
+
+void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk)
+{
+ if (!mm)
+ return;
+
+ spin_lock(&mm->page_table_lock);
+ mm->rss += tsk->rss;
+ mm->anon_rss += tsk->anon_rss;
+ list_del_rcu(&tsk->mm_tasks);
+ spin_unlock(&mm->page_table_lock);
+}
+
+void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk)
+{
+ spin_lock(&mm->page_table_lock);
+ tsk->rss = 0;
+ tsk->anon_rss = 0;
+ list_add_rcu(&tsk->mm_tasks, &mm->task_list);
+ spin_unlock(&mm->page_table_lock);
+}
+
+
Index: linux-2.6.10/include/linux/init_task.h
===================================================================
--- linux-2.6.10.orig/include/linux/init_task.h 2004-12-24 13:33:52.000000000 -0800
+++ linux-2.6.10/include/linux/init_task.h 2005-01-03 12:21:32.000000000 -0800
@@ -42,6 +42,7 @@
.mmlist = LIST_HEAD_INIT(name.mmlist), \
.cpu_vm_mask = CPU_MASK_ALL, \
.default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \
+ .task_list = LIST_HEAD_INIT(name.task_list), \
}

#define INIT_SIGNALS(sig) { \
@@ -112,6 +113,7 @@
.proc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .mm_tasks = LIST_HEAD_INIT(tsk.mm_tasks), \
}


Index: linux-2.6.10/fs/exec.c
===================================================================
--- linux-2.6.10.orig/fs/exec.c 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/fs/exec.c 2005-01-03 12:21:32.000000000 -0800
@@ -549,6 +549,7 @@
tsk->active_mm = mm;
activate_mm(active_mm, mm);
task_unlock(tsk);
+ mm_add_thread(mm, current);
arch_pick_mmap_layout(mm);
if (old_mm) {
if (active_mm != old_mm) BUG();
Index: linux-2.6.10/fs/aio.c
===================================================================
--- linux-2.6.10.orig/fs/aio.c 2004-12-24 13:34:44.000000000 -0800
+++ linux-2.6.10/fs/aio.c 2005-01-03 12:21:32.000000000 -0800
@@ -577,6 +577,7 @@
tsk->active_mm = mm;
activate_mm(active_mm, mm);
task_unlock(tsk);
+ mm_add_thread(mm, tsk);

mmdrop(active_mm);
}
@@ -596,6 +597,7 @@
{
struct task_struct *tsk = current;

+ mm_remove_thread(mm,tsk);
task_lock(tsk);
tsk->flags &= ~PF_BORROWED_MM;
tsk->mm = NULL;
Index: linux-2.6.10/mm/swapfile.c
===================================================================
--- linux-2.6.10.orig/mm/swapfile.c 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/mm/swapfile.c 2005-01-03 12:21:32.000000000 -0800
@@ -432,6 +432,7 @@
swp_entry_t entry, struct page *page)
{
vma->vm_mm->rss++;
+ vma->vm_mm->anon_rss++;
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_anon_rmap(page, vma, address);

2005-01-04 19:52:56

by Christoph Lameter

Subject: page fault scalability patch V14 [6/7]: s390 atomic pte operations

Changelog
* Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/pgtable.h 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-s390/pgtable.h 2005-01-03 12:12:03.000000000 -0800
@@ -569,6 +569,15 @@
return pte;
}

+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ struct mm_struct *__mm = __vma->vm_mm; \
+ pte_t __pte; \
+ __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+
static inline void ptep_set_wrprotect(pte_t *ptep)
{
pte_t old_pte = *ptep;
@@ -780,6 +789,14 @@

#define kern_addr_valid(addr) (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
/*
* No page table caches to initialise
*/
@@ -793,6 +810,7 @@
#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_CLEAR_FLUSH
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
Index: linux-2.6.10/include/asm-s390/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/pgalloc.h 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-s390/pgalloc.h 2005-01-03 12:12:03.000000000 -0800
@@ -97,6 +97,10 @@
pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
}

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+ return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
#endif /* __s390x__ */

static inline void
@@ -119,6 +123,18 @@
pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
}

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+ int rc;
+ spin_lock(&mm->page_table_lock);
+
+ rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
+ if (rc) pmd_populate(mm, pmd, page);
+ spin_unlock(&mm->page_table_lock);
+ return rc;
+}
+
/*
* page table entry allocation/free routines.
*/

2005-01-04 20:15:22

by Christoph Lameter

Subject: Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations

On Tue, 4 Jan 2005, Andi Kleen wrote:

> Christoph Lameter <[email protected]> writes:
>
> I bet this has never been tested.

I tested this back in October and it worked fine. Would you be able to
test your proposed modifications and send me a patch?

> > +#define pmd_test_and_populate(mm, pmd, pte) \
> > + (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE)
> > +#define pud_test_and_populate(mm, pud, pmd) \
> > + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
> > +#define pgd_test_and_populate(mm, pgd, pud) \
> > + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE)
> > +
>
> Shouldn't this all be (long *)pmd ? page table entries on x86-64 are 64bit.
> Also why do you cast at all? i think the macro should handle an arbitary
> pointer.

The macro checks for the size of the pointer and then generates the
appropriate cmpxchg instruction. pgd_t is a struct which may be
problematic for the cmpxchg macros.
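
For reference, the sizeof-based dispatch under discussion is the one
from the i386 patch earlier in this series, condensed below; the x86-64
__cmpxchg is analogous but also covers the 8-byte case:

	#define cmpxchg(ptr,o,n)\
		((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \
			(unsigned long)(n), sizeof(*(ptr))))

	/*
	 * __cmpxchg() switches on the size argument and emits the
	 * correspondingly sized cmpxchg instruction. A cast to (int *)
	 * therefore forces a 4-byte exchange even though an x86-64 page
	 * table entry is 8 bytes wide.
	 */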

2005-01-04 20:26:50

by Andi Kleen

Subject: Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations

On Tue, Jan 04, 2005 at 11:58:13AM -0800, Christoph Lameter wrote:
> On Tue, 4 Jan 2005, Andi Kleen wrote:
>
> > Christoph Lameter <[email protected]> writes:
> >
> > I bet this has never been tested.
>
> I tested this back in October and it worked fine. Would you be able to
> test your proposed modifications and send me a patch?

Hmm, I don't think it could have worked this way, except
if you only tested page faults < 4GB.

>
> > > +#define pmd_test_and_populate(mm, pmd, pte) \
> > > + (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE)
> > > +#define pud_test_and_populate(mm, pud, pmd) \
> > > + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
> > > +#define pgd_test_and_populate(mm, pgd, pud) \
> > > + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE)
> > > +
> >
> > Shouldn't this all be (long *)pmd? Page table entries on x86-64 are 64-bit.
> > Also, why do you cast at all? I think the macro should handle an arbitrary
> > pointer.
>
> The macro checks for the size of the pointer and then generates the
> appropriate cmpxchg instruction. pgd_t is a struct which may be
> problematic for the cmpxchg macros.

It just checks sizeof, that should be fine.

-Andi

2005-01-04 20:39:32

by Christoph Lameter

Subject: Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations

On Tue, 4 Jan 2005, Andi Kleen wrote:

> > The macro checks for the size of the pointer and then generates the
> > appropriate cmpxchg instruction. pgd_t is a struct which may be
> > problematic for the cmpxchg macros.
>
> It just checks sizeof, that should be fine.

Index: linux-2.6.10/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h 2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgalloc.h 2005-01-04 12:31:14.000000000 -0800
@@ -7,6 +7,10 @@
#include <linux/threads.h>
#include <linux/mm.h>

+#define PMD_NONE 0
+#define PUD_NONE 0
+#define PGD_NONE 0
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
#define pud_populate(mm, pud, pmd) \
@@ -14,11 +18,24 @@
#define pgd_populate(mm, pgd, pud) \
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)))

+#define pmd_test_and_populate(mm, pmd, pte) \
+ (cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE)
+#define pud_test_and_populate(mm, pud, pmd) \
+ (cmpxchg(pud, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
+#define pgd_test_and_populate(mm, pgd, pud) \
+ (cmpxchg(pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE)
+
+
static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
}

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+ return cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
extern __inline__ pmd_t *get_pmd(void)
{
return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.10/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgtable.h 2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgtable.h 2005-01-04 12:29:25.000000000 -0800
@@ -413,6 +413,10 @@
#define kc_offset_to_vaddr(o) \
(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR

2005-01-04 19:52:53

by Christoph Lameter

Subject: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations

Changelog
* Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgalloc.h 2005-01-03 12:21:28.000000000 -0800
@@ -7,6 +7,10 @@
#include <linux/threads.h>
#include <linux/mm.h>

+#define PMD_NONE 0
+#define PUD_NONE 0
+#define PGD_NONE 0
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
#define pud_populate(mm, pud, pmd) \
@@ -14,11 +18,24 @@
#define pgd_populate(mm, pgd, pud) \
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)))

+#define pmd_test_and_populate(mm, pmd, pte) \
+ (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE)
+#define pud_test_and_populate(mm, pud, pmd) \
+ (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
+#define pgd_test_and_populate(mm, pgd, pud) \
+ (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE)
+
+
static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
}

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+ return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
extern __inline__ pmd_t *get_pmd(void)
{
return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.10/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgtable.h 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgtable.h 2005-01-03 12:13:17.000000000 -0800
@@ -413,6 +413,10 @@
#define kc_offset_to_vaddr(o) \
(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR

2005-01-04 19:48:01

by Christoph Lameter

[permalink] [raw]
Subject: page fault scalability patch V14 [4/7]: i386 atomic pte operations

Changelog
* Atomic pte operations for i386 in regular and PAE modes

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/pgtable.h 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-i386/pgtable.h 2005-01-03 12:08:35.000000000 -0800
@@ -407,6 +407,7 @@
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _I386_PGTABLE_H */
Index: linux-2.6.10/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/pgtable-3level.h 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-i386/pgtable-3level.h 2005-01-03 12:11:59.000000000 -0800
@@ -8,7 +8,8 @@
* tables on PPro+ CPUs.
*
* Copyright (C) 1999 Ingo Molnar <[email protected]>
- */
+ * August 26, 2004 added ptep_cmpxchg <[email protected]>
+*/

#define pte_ERROR(e) \
printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -44,21 +45,11 @@
return pte_x(pte);
}

-/* Rules for using set_pte: the pte being assigned *must* be
- * either not present or in a state where the hardware will
- * not attempt to update the pte. In places where this is
- * not possible, use pte_get_and_clear to obtain the old pte
- * value and then use set_pte to update it. -ben
- */
-static inline void set_pte(pte_t *ptep, pte_t pte)
-{
- ptep->pte_high = pte.pte_high;
- smp_wmb();
- ptep->pte_low = pte.pte_low;
-}
#define __HAVE_ARCH_SET_PTE_ATOMIC
#define set_pte_atomic(pteptr,pteval) \
set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
+#define set_pte(pteptr,pteval) \
+ *(unsigned long long *)(pteptr) = pte_val(pteval)
#define set_pmd(pmdptr,pmdval) \
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
#define set_pud(pudptr,pudval) \
@@ -155,4 +146,25 @@

#define __pmd_free_tlb(tlb, x) do { } while (0)

+/* Atomic PTE operations */
+#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \
+({ pte_t __r; \
+ /* xchg acts as a barrier before the setting of the high bits. */\
+ __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \
+ __r.pte_high = (__ptep)->pte_high; \
+ (__ptep)->pte_high = (__newval).pte_high; \
+ flush_tlb_page(__vma, __addr); \
+ (__r); \
+})
+
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
+
+static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+#define __HAVE_ARCH_GET_PTE_ATOMIC
+#define get_pte_atomic(__ptep) __pte(get_64bit((unsigned long long *)(__ptep)))
+
#endif /* _I386_PGTABLE_3LEVEL_H */
Index: linux-2.6.10/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/pgtable-2level.h 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-i386/pgtable-2level.h 2005-01-03 12:08:35.000000000 -0800
@@ -65,4 +65,7 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low)
+
#endif /* _I386_PGTABLE_2LEVEL_H */
Index: linux-2.6.10/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/pgalloc.h 2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-i386/pgalloc.h 2005-01-03 12:11:23.000000000 -0800
@@ -4,9 +4,12 @@
#include <linux/config.h>
#include <asm/processor.h>
#include <asm/fixmap.h>
+#include <asm/system.h>
#include <linux/threads.h>
#include <linux/mm.h> /* for struct page */

+#define PMD_NONE 0L
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -14,6 +17,18 @@
set_pmd(pmd, __pmd(_PAGE_TABLE + \
((unsigned long long)page_to_pfn(pte) << \
(unsigned long long) PAGE_SHIFT)))
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+ return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+ ((unsigned long long)page_to_pfn(pte) <<
+ (unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+ return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
/*
* Allocate and free page tables.
*/
@@ -44,6 +59,7 @@
#define pmd_free(x) do { } while (0)
#define __pmd_free_tlb(tlb,x) do { } while (0)
#define pud_populate(mm, pmd, pte) BUG()
+#define pud_test_and_populate(mm, pmd, pte) ({ BUG(); 1; })
#endif

#define check_pgt_cache() do { } while (0)

2005-01-04 21:33:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations

On Tue, 4 Jan 2005, Brian Gerst wrote:

> > +#define pud_test_and_populate(mm, pud, pmd) \
> > + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
> ^^^
> Shouldn't this be pud?

Correct. Sigh. Could someone test this on x86_64?

Index: linux-2.6.10/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h 2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgalloc.h 2005-01-04 12:31:14.000000000 -0800
@@ -7,6 +7,10 @@
#include <linux/threads.h>
#include <linux/mm.h>

+#define PMD_NONE 0
+#define PUD_NONE 0
+#define PGD_NONE 0
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
#define pud_populate(mm, pud, pmd) \
@@ -14,11 +18,24 @@
#define pgd_populate(mm, pgd, pud) \
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)))

+#define pmd_test_and_populate(mm, pmd, pte) \
+ (cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE)
+#define pud_test_and_populate(mm, pud, pmd) \
+ (cmpxchg(pud, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
+#define pgd_test_and_populate(mm, pgd, pud) \
+ (cmpxchg(pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE)
+
+
static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
}

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+ return cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
extern __inline__ pmd_t *get_pmd(void)
{
return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.10/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgtable.h 2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgtable.h 2005-01-04 12:29:25.000000000 -0800
@@ -413,6 +413,10 @@
#define kc_offset_to_vaddr(o) \
(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR

2005-01-04 21:38:18

by Brian Gerst

[permalink] [raw]
Subject: Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations

Christoph Lameter wrote:
> Changelog
> * Provide atomic pte operations for x86_64
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux-2.6.10/include/asm-x86_64/pgalloc.h
> ===================================================================
> --- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h 2005-01-03 10:31:31.000000000 -0800
> +++ linux-2.6.10/include/asm-x86_64/pgalloc.h 2005-01-03 12:21:28.000000000 -0800
> @@ -7,6 +7,10 @@
> #include <linux/threads.h>
> #include <linux/mm.h>
>
> +#define PMD_NONE 0
> +#define PUD_NONE 0
> +#define PGD_NONE 0
> +
> #define pmd_populate_kernel(mm, pmd, pte) \
> set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
> #define pud_populate(mm, pud, pmd) \
> @@ -14,11 +18,24 @@
> #define pgd_populate(mm, pgd, pud) \
> set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)))
>
> +#define pmd_test_and_populate(mm, pmd, pte) \
> + (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE)
> +#define pud_test_and_populate(mm, pud, pmd) \
> + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
^^^
Shouldn't this be pud?

> +#define pgd_test_and_populate(mm, pgd, pud) \
> + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE)
> +
> +

--
Brian Gerst

2005-01-05 17:16:46

by Roman Zippel

[permalink] [raw]
Subject: Re: page fault scalability patch V14 [3/7]: i386 universal cmpxchg

Hi,

On Tuesday 04 January 2005 20:37, Christoph Lameter wrote:

> * Provide emulation of cmpxchg suitable for uniprocessor if
> build and run on 386.
> * Provide emulation of cmpxchg8b suitable for uniprocessor
> systems if build and run on 386 or 486.

I'm not sure that's such a good idea. This emulation is more expensive as it
has to disable interrupts and you already have emulation functions using
spinlocks anyway, so why not use them? This way your patch would not just
scale up, but also still scale down.

bye, Roman
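
For reference, the spinlock-based fallback Roman is pointing at is the generic emulation that patch 1/7 uses whenever an architecture does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. In sketch form (the real asm-generic code may differ in detail):

#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
/* Emulate ptep_cmpxchg by taking the page_table_lock for a short time */
static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address,
			       pte_t *ptep, pte_t oldval, pte_t newval)
{
	int same;

	spin_lock(&vma->vm_mm->page_table_lock);
	same = pte_same(*ptep, oldval);
	if (same)
		set_pte(ptep, newval);
	spin_unlock(&vma->vm_mm->page_table_lock);
	return same;
}
#endif

On a uniprocessor 386 build the spin_lock pair largely compiles away, which is the "scale down" property Roman is asking for, whereas the cmpxchg emulation in patch 3/7 always pays for local_irq_save/restore.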

2005-01-11 17:50:04

by Christoph Lameter

[permalink] [raw]
Subject: page table lock patch V15 [0/7]: overview

Changes from V14->V15 of this patch:
- Remove misplaced semicolon in handle_mm_fault (caused x86_64 troubles)
- Fixed up and tested x86_64 arch specific patch
- Redone against 2.6.10-bk14

This is a series of patches that increases the scalability of
the page fault handler for SMP. The performance increase is
accomplished by avoiding the use of the page_table_lock spinlock
(but not mm->mmap_sem) through new atomic operations on pte's
(ptep_xchg, ptep_cmpxchg) and on pmd, pud and
pgd's (pgd_test_and_populate, pud_test_and_populate,
pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them and the
swapper code has been changed to never set a pte to be empty until the
page has been evicted. The population of an empty pte is frequent
if a process touches newly allocated memory.

2. Modifications of flags in a pte entry (write/accessed).

These modifications are done by the CPU or by low level handlers
on various platforms also bypassing the page_table_lock. So this
seems to be safe too.

One essential change in the VM is the use of pte_cmpxchg (or its
generic emulation) on page table entries before doing an
update_mmu_cache without holding the page table lock. However, we do
similar things now with other atomic pte operations such as
ptep_get_and_clear and ptep_test_and_clear_dirty. These operations
clear a pte *after* doing an operation on it. The ptep_cmpxchg as used
in this patch operates on a *cleared* pte and replaces it with a pte
pointing to valid memory. The effect of this change on various
architectures has to be thought through. Local definitions of
ptep_cmpxchg and ptep_xchg may be necessary.

For ia64 an icache coherency issue may arise that potentially requires
the flushing of the icache (as done via update_mmu_cache on ia64) prior
to the use of ptep_cmpxchg. Similar issues may arise on other platforms.

The patch introduces a split counter for rss handling to avoid atomic
operations and locks currently necessary for rss modifications. In
addition to mm->rss, tsk->rss is introduced. tsk->rss is defined to be
in the same cache line as tsk->mm (which is already used by the fault
handler) and thus tsk->rss can be incremented without locks
in a fast way. The cache line does not need to be shared between
processors for the page table handler.

A tasklist is generated for each mm (rcu based). Values in that list
are added up to calculate rss or anon_rss values.

The patchset is composed of 7 patches (and was tested against 2.6.10-bk14):

1/7: Avoid page_table_lock in handle_mm_fault

This patch defers the acquisition of the page_table_lock as much as
possible and uses atomic operations for allocating anonymous memory.
These atomic operations are simulated by acquiring the page_table_lock
for very small time frames if an architecture does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes kswapd so that a
pte will not be set to empty if a page is in transition to swap.

If only the first two patches are applied then the time that the
page_table_lock is held is simply reduced. The lock may then be
acquired multiple times during a page fault.

2/7: Atomic pte operations for ia64

3/7: Make cmpxchg generally available on i386

The atomic operations on the page table rely heavily on cmpxchg
instructions. This patch adds emulations for cmpxchg and cmpxchg8b
for old 80386 and 80486 cpus. The emulations are only included if a
kernel is build for these old cpus and are skipped for the real
cmpxchg instructions if the kernel that is build for 386 or 486 is
then run on a more recent cpu.

This patch may be used independently of the other patches.

4/7: Atomic pte operations for i386

The generally available cmpxchg (previous patch) is required for
this patch to preserve the ability to build kernels for the 386 and 486.

5/7: Atomic pte operations for x86_64

6/7: Atomic pte operations for s390

7/7: Split counter implementation for rss
Add tsk->rss and tsk->anon_rss. Add tasklist. Add logic
to calculate rss from tasklist.

Signed-off-by: Christoph Lameter <[email protected]>
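
Concretely, with these primitives the anonymous write fault of patch 1/7 (included below) boils down to roughly the following; error handling and the accounting changes of patch 7/7 are trimmed:

	/* Trimmed sketch of do_anonymous_page() without the page_table_lock */
	entry = get_pte_atomic(page_table);	/* atomic snapshot, even for PAE */
	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
	clear_user_highpage(page, addr);
	new = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)), vma);

	if (!ptep_cmpxchg(vma, addr, page_table, entry, new)) {
		/* Another CPU installed a pte first: drop our page and let
		   the access be retried */
		page_cache_release(page);
		return VM_FAULT_MINOR;
	}
	/* The page becomes visible only now; rmap and LRU insertion must
	   follow the cmpxchg so try_to_unmap cannot race with the insertion */
	page_add_anon_rmap(page, vma, addr);
	lru_cache_add_active(page);
	mm->rss++;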

2005-01-11 17:55:11

by Christoph Lameter

[permalink] [raw]
Subject: page table lock patch V15 [2/7]: ia64 atomic pte operations

Changelog
* Provide atomic pte operations for ia64
* Enhanced parallelism in page fault handler if applied together
with the generic patch

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h 2005-01-10 16:31:56.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h 2005-01-10 16:41:00.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE 0
+#define PUD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -82,6 +86,13 @@
pud_val(*pud_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE;
+}
+
static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
{
@@ -127,6 +138,13 @@
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{
Index: linux-2.6.10/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgtable.h 2005-01-10 16:32:35.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgtable.h 2005-01-10 16:41:00.000000000 -0800
@@ -30,6 +30,8 @@
#define _PAGE_P_BIT 0
#define _PAGE_A_BIT 5
#define _PAGE_D_BIT 6
+#define _PAGE_IG_BITS 53
+#define _PAGE_LOCK_BIT (_PAGE_IG_BITS+3) /* bit 56. Aligned to 8 bits */

#define _PAGE_P (1 << _PAGE_P_BIT) /* page present bit */
#define _PAGE_MA_WB (0x0 << 2) /* write back memory attribute */
@@ -58,6 +60,7 @@
#define _PAGE_PPN_MASK (((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL)
#define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */
#define _PAGE_PROTNONE (__IA64_UL(1) << 63)
+#define _PAGE_LOCK (__IA64_UL(1) << _PAGE_LOCK_BIT)

/* Valid only for a PTE with the present bit cleared: */
#define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */
@@ -271,6 +274,8 @@
#define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0)
#define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0)
#define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0)
+#define pte_locked(pte) ((pte_val(pte) & _PAGE_LOCK)!=0)
+
/*
* Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the
* access rights:
@@ -282,8 +287,15 @@
#define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A))
#define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D))
#define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D))
+#define pte_mkunlocked(pte) (__pte(pte_val(pte) & ~_PAGE_LOCK))

/*
+ * Lock functions for pte's
+ */
+#define ptep_lock(ptep) test_and_set_bit(_PAGE_LOCK_BIT, ptep)
+#define ptep_unlock(ptep) { clear_bit(_PAGE_LOCK_BIT,ptep); smp_mb__after_clear_bit(); }
+#define ptep_unlock_set(ptep, val) set_pte(ptep, pte_mkunlocked(val))
+/*
* Macro to a page protection value as "uncacheable". Note that "protection" is really a
* misnomer here as the protection value contains the memory attribute bits, dirty bits,
* and various other bits as well.
@@ -343,7 +355,6 @@
#define pte_unmap_nested(pte) do { } while (0)

/* atomic versions of the some PTE manipulations: */
-
static inline int
ptep_test_and_clear_young (pte_t *ptep)
{
@@ -415,6 +426,26 @@
#endif
}

+/*
+ * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
+ * information. However, we use this routine to take care of any (delayed) i-cache
+ * flushing that may be necessary.
+ */
+extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ /*
+ * IA64 defers icache flushes. If the new pte is executable we may
+ * have to flush the icache to ensure cache coherency immediately
+ * after the cmpxchg.
+ */
+ if (pte_exec(newval))
+ update_mmu_cache(vma, addr, newval);
+ return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
static inline int
pte_same (pte_t a, pte_t b)
{
@@ -477,13 +508,6 @@
struct vm_area_struct * prev, unsigned long start, unsigned long end);
#endif

-/*
- * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
- * information. However, we use this routine to take care of any (delayed) i-cache
- * flushing that may be necessary.
- */
-extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
-
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Update PTEP with ENTRY, which is guaranteed to be a less
@@ -561,6 +585,8 @@
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+#define __HAVE_ARCH_LOCK_TABLE_OPS
#include <asm-generic/pgtable-nopud.h>
#include <asm-generic/pgtable.h>
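
The new _PAGE_LOCK bit sits at bit 56, in what appear to be the software-ignored bits, so the hardware walker never interprets it. The callers guarded by __HAVE_ARCH_LOCK_TABLE_OPS are not part of this excerpt; a hypothetical helper built on these operations would look like:

/* Hypothetical: set the dirty bit under the per-pte lock bit */
static inline void pte_mkdirty_locked(pte_t *ptep)
{
	while (ptep_lock(ptep))		/* test_and_set_bit: non-zero means already locked */
		cpu_relax();
	/* ptep_unlock_set() stores the new value with _PAGE_LOCK cleared */
	ptep_unlock_set(ptep, pte_mkdirty(*ptep));
}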


2005-01-11 17:59:15

by Christoph Lameter

[permalink] [raw]
Subject: page table lock patch V15 [5/7]: x86_64 atomic pte operations

Changelog
* Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h 2005-01-10 16:31:56.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgalloc.h 2005-01-10 16:41:24.000000000 -0800
@@ -7,6 +7,10 @@
#include <linux/threads.h>
#include <linux/mm.h>

+#define PMD_NONE 0
+#define PUD_NONE 0
+#define PGD_NONE 0
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
#define pud_populate(mm, pud, pmd) \
@@ -14,9 +18,20 @@
#define pgd_populate(mm, pgd, pud) \
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)))

+#define pud_test_and_populate(mm, pud, pmd) \
+ (cmpxchg((unsigned long *)pud, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
+#define pgd_test_and_populate(mm, pgd, pud) \
+ (cmpxchg((unsigned long *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE)
+
+
static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
- set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
+ set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
+}
+
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+ return cmpxchg((unsigned long *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
}

extern __inline__ pmd_t *get_pmd(void)
Index: linux-2.6.10/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgtable.h 2005-01-10 16:31:56.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgtable.h 2005-01-10 16:41:24.000000000 -0800
@@ -414,6 +414,10 @@
#define kc_offset_to_vaddr(o) \
(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
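
The generic fault path in patch 1/7 consumes pgd/pud/pmd_test_and_populate when it finds a missing page table level: allocate first, then try to install the new table atomically, and throw the allocation away if another CPU won the race. Roughly as follows; the exact generic code is in patch 1/7, and the lost-race cleanup shown here is an assumption about its error path:

	/* Populate a missing pud without taking the page_table_lock */
	if (unlikely(pgd_none(*pgd))) {
		pud_t *new = pud_alloc_one(mm, address);

		if (!new)
			return VM_FAULT_OOM;
		if (!pgd_test_and_populate(mm, pgd, new))
			pud_free(new);	/* another CPU populated the pgd first */
	}
	pud = pud_offset(pgd, address);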

2005-01-11 18:07:28

by Christoph Lameter

[permalink] [raw]
Subject: page table lock patch V15 [7/7]: Split RSS counter

Changelog
* Split rss counter into the task structure
* remove 3 checks of rss in mm/rmap.c
* increment current->rss instead of mm->rss in the page fault handler
* move incrementing of anon_rss out of page_add_anon_rmap to group
the increments more tightly and allow a better cache utilization

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2005-01-11 08:46:16.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-01-11 08:56:45.000000000 -0800
@@ -31,6 +31,7 @@
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
+#include <linux/rcupdate.h>

struct exec_domain;

@@ -216,6 +217,7 @@
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ long rss, anon_rss;

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -225,7 +227,7 @@
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
+ unsigned long total_vm, locked_vm, shared_vm;
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
@@ -235,6 +237,7 @@

/* Architecture-specific MM context */
mm_context_t context;
+ struct list_head task_list; /* Tasks using this mm */

/* Token based thrashing protection. */
unsigned long swap_token_time;
@@ -555,6 +558,9 @@
struct list_head ptrace_list;

struct mm_struct *mm, *active_mm;
+ /* Split counters from mm */
+ long rss;
+ long anon_rss;

/* task state */
struct linux_binfmt *binfmt;
@@ -587,6 +593,10 @@
struct completion *vfork_done; /* for vfork() */
int __user *set_child_tid; /* CLONE_CHILD_SETTID */
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
+
+ /* List of other tasks using the same mm */
+ struct list_head mm_tasks;
+ struct rcu_head rcu_head; /* For freeing the task via rcu */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -1184,6 +1194,11 @@
return 0;
}
#endif /* CONFIG_PM */
+
+void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss);
+void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk);
+void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk);
+
#endif /* __KERNEL__ */

#endif
Index: linux-2.6.10/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.10.orig/fs/proc/task_mmu.c 2005-01-11 08:46:15.000000000 -0800
+++ linux-2.6.10/fs/proc/task_mmu.c 2005-01-11 08:56:45.000000000 -0800
@@ -8,8 +8,9 @@

char *task_mem(struct mm_struct *mm, char *buffer)
{
- unsigned long data, text, lib;
+ unsigned long data, text, lib, rss, anon_rss;

+ get_rss(mm, &rss, &anon_rss);
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
@@ -24,7 +25,7 @@
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -39,11 +40,14 @@
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ unsigned long rss, anon_rss;
+
+ get_rss(mm, &rss, &anon_rss);
+ *shared = rss - anon_rss;
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = mm->rss;
+ *resident = rss;
return mm->total_vm;
}

Index: linux-2.6.10/fs/proc/array.c
===================================================================
--- linux-2.6.10.orig/fs/proc/array.c 2005-01-11 08:46:15.000000000 -0800
+++ linux-2.6.10/fs/proc/array.c 2005-01-11 08:56:45.000000000 -0800
@@ -303,7 +303,7 @@

static int do_task_stat(struct task_struct *task, char * buffer, int whole)
{
- unsigned long vsize, eip, esp, wchan = ~0UL;
+ unsigned long rss, anon_rss, vsize, eip, esp, wchan = ~0UL;
long priority, nice;
int tty_pgrp = -1, tty_nr = 0;
sigset_t sigign, sigcatch;
@@ -326,6 +326,7 @@
vsize = task_vsize(mm);
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
+ get_rss(mm, &rss, &anon_rss);
}

get_task_comm(tcomm, task);
@@ -421,7 +422,7 @@
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? rss : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-01-11 08:48:30.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-01-11 08:56:45.000000000 -0800
@@ -258,8 +258,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -440,8 +438,6 @@
BUG_ON(PageReserved(page));
BUG_ON(!anon_vma);

- vma->vm_mm->anon_rss++;
-
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
index = (address - vma->vm_start) >> PAGE_SHIFT;
index += vma->vm_pgoff;
@@ -513,8 +509,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -813,8 +807,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
Index: linux-2.6.10/kernel/fork.c
===================================================================
--- linux-2.6.10.orig/kernel/fork.c 2005-01-11 08:46:16.000000000 -0800
+++ linux-2.6.10/kernel/fork.c 2005-01-11 08:56:45.000000000 -0800
@@ -79,10 +79,16 @@
static kmem_cache_t *task_struct_cachep;
#endif

+static void rcu_free_task(struct rcu_head *head)
+{
+ struct task_struct *tsk = container_of(head ,struct task_struct, rcu_head);
+ free_task_struct(tsk);
+}
+
void free_task(struct task_struct *tsk)
{
free_thread_info(tsk->thread_info);
- free_task_struct(tsk);
+ call_rcu(&tsk->rcu_head, rcu_free_task);
}
EXPORT_SYMBOL(free_task);

@@ -99,7 +105,7 @@
put_group_info(tsk->group_info);

if (!profile_handoff_task(tsk))
- free_task(tsk);
+ call_rcu(&tsk->rcu_head, rcu_free_task);
}

void __init fork_init(unsigned long mempages)
@@ -152,6 +158,7 @@
*tsk = *orig;
tsk->thread_info = ti;
ti->task = tsk;
+ tsk->rss = 0;

/* One for us, one for whoever does the "release_task()" (usually parent) */
atomic_set(&tsk->usage,2);
@@ -294,6 +301,7 @@
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
+ INIT_LIST_HEAD(&mm->task_list);
mm->core_waiters = 0;
mm->nr_ptes = 0;
spin_lock_init(&mm->page_table_lock);
@@ -402,6 +410,8 @@

/* Get rid of any cached register state */
deactivate_mm(tsk, mm);
+ if (mm)
+ mm_remove_thread(mm, tsk);

/* notify parent sleeping on vfork() */
if (vfork_done) {
@@ -449,8 +459,8 @@
* new threads start up in user mode using an mm, which
* allows optimizing out ipis; the tlb_gather_mmu code
* is an example.
+ * (mm_add_thread does use the ptl .... )
*/
- spin_unlock_wait(&oldmm->page_table_lock);
goto good_mm;
}

@@ -475,6 +485,7 @@
mm->hiwater_vm = mm->total_vm;

good_mm:
+ mm_add_thread(mm, tsk);
tsk->mm = mm;
tsk->active_mm = mm;
return 0;
@@ -1079,7 +1090,7 @@
atomic_dec(&p->user->processes);
free_uid(p->user);
bad_fork_free:
- free_task(p);
+ call_rcu(&p->rcu_head, rcu_free_task);
goto fork_out;
}

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-11 08:56:37.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-11 08:56:45.000000000 -0800
@@ -935,6 +935,7 @@

cond_resched_lock(&mm->page_table_lock);
while (!(map = follow_page(mm, start, lookup_write))) {
+ unsigned long rss, anon_rss;
/*
* Shortcut for anonymous pages. We don't want
* to force the creation of pages tables for
@@ -947,6 +948,17 @@
map = ZERO_PAGE(start);
break;
}
+ if (mm != current->mm) {
+ /*
+ * handle_mm_fault uses the current pointer
+ * for a split rss counter. The current pointer
+ * is not correct if we are using a different mm
+ */
+ rss = current->rss;
+ anon_rss = current->anon_rss;
+ current->rss = 0;
+ current->anon_rss = 0;
+ }
spin_unlock(&mm->page_table_lock);
switch (handle_mm_fault(mm,vma,start,write)) {
case VM_FAULT_MINOR:
@@ -971,6 +983,12 @@
*/
lookup_write = write && !force;
spin_lock(&mm->page_table_lock);
+ if (mm != current->mm) {
+ mm->rss += current->rss;
+ mm->anon_rss += current->anon_rss;
+ current->rss = rss;
+ current->anon_rss = anon_rss;
+ }
}
if (pages) {
pages[i] = get_page_map(map);
@@ -1353,6 +1371,7 @@
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);
+ mm->anon_rss++;

/* Free the old page.. */
new_page = old_page;
@@ -1753,6 +1772,7 @@
flush_icache_page(vma, page);
set_pte(page_table, pte);
page_add_anon_rmap(page, vma, address);
+ mm->anon_rss++;

if (write_access) {
if (do_wp_page(mm, vma, address,
@@ -1815,6 +1835,7 @@
page_add_anon_rmap(page, vma, addr);
lru_cache_add_active(page);
mm->rss++;
+ mm->anon_rss++;
acct_update_integrals();
update_mem_hiwater();

@@ -1922,6 +1943,7 @@
if (anon) {
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);
+ mm->anon_rss++;
} else
page_add_file_rmap(new_page);
pte_unmap(page_table);
@@ -2250,6 +2272,49 @@

EXPORT_SYMBOL(vmalloc_to_pfn);

+void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss)
+{
+ struct list_head *y;
+ struct task_struct *t;
+ long rss_sum, anon_rss_sum;
+
+ rcu_read_lock();
+ rss_sum = mm->rss;
+ anon_rss_sum = mm->anon_rss;
+ list_for_each_rcu(y, &mm->task_list) {
+ t = list_entry(y, struct task_struct, mm_tasks);
+ rss_sum += t->rss;
+ anon_rss_sum += t->anon_rss;
+ }
+ if (rss_sum < 0)
+ rss_sum = 0;
+ if (anon_rss_sum < 0)
+ anon_rss_sum = 0;
+ rcu_read_unlock();
+ *rss = rss_sum;
+ *anon_rss = anon_rss_sum;
+}
+
+void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk)
+{
+ if (!mm)
+ return;
+
+ spin_lock(&mm->page_table_lock);
+ mm->rss += tsk->rss;
+ mm->anon_rss += tsk->anon_rss;
+ list_del_rcu(&tsk->mm_tasks);
+ spin_unlock(&mm->page_table_lock);
+}
+
+void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk)
+{
+ spin_lock(&mm->page_table_lock);
+ tsk->rss = 0;
+ tsk->anon_rss = 0;
+ list_add_rcu(&tsk->mm_tasks, &mm->task_list);
+ spin_unlock(&mm->page_table_lock);
+}
/*
* update_mem_hiwater
* - update per process rss and vm high water data
Index: linux-2.6.10/include/linux/init_task.h
===================================================================
--- linux-2.6.10.orig/include/linux/init_task.h 2005-01-11 08:46:16.000000000 -0800
+++ linux-2.6.10/include/linux/init_task.h 2005-01-11 08:56:45.000000000 -0800
@@ -42,6 +42,7 @@
.mmlist = LIST_HEAD_INIT(name.mmlist), \
.cpu_vm_mask = CPU_MASK_ALL, \
.default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \
+ .task_list = LIST_HEAD_INIT(name.task_list), \
}

#define INIT_SIGNALS(sig) { \
@@ -112,6 +113,7 @@
.proc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .mm_tasks = LIST_HEAD_INIT(tsk.mm_tasks), \
}


Index: linux-2.6.10/fs/exec.c
===================================================================
--- linux-2.6.10.orig/fs/exec.c 2005-01-11 08:46:15.000000000 -0800
+++ linux-2.6.10/fs/exec.c 2005-01-11 08:56:45.000000000 -0800
@@ -556,6 +556,7 @@
tsk->active_mm = mm;
activate_mm(active_mm, mm);
task_unlock(tsk);
+ mm_add_thread(mm, current);
arch_pick_mmap_layout(mm);
if (old_mm) {
if (active_mm != old_mm) BUG();
Index: linux-2.6.10/fs/aio.c
===================================================================
--- linux-2.6.10.orig/fs/aio.c 2004-12-24 13:34:44.000000000 -0800
+++ linux-2.6.10/fs/aio.c 2005-01-11 08:56:45.000000000 -0800
@@ -577,6 +577,7 @@
tsk->active_mm = mm;
activate_mm(active_mm, mm);
task_unlock(tsk);
+ mm_add_thread(mm, tsk);

mmdrop(active_mm);
}
@@ -596,6 +597,7 @@
{
struct task_struct *tsk = current;

+ mm_remove_thread(mm,tsk);
task_lock(tsk);
tsk->flags &= ~PF_BORROWED_MM;
tsk->mm = NULL;
Index: linux-2.6.10/mm/swapfile.c
===================================================================
--- linux-2.6.10.orig/mm/swapfile.c 2005-01-11 08:46:16.000000000 -0800
+++ linux-2.6.10/mm/swapfile.c 2005-01-11 08:56:45.000000000 -0800
@@ -433,6 +433,7 @@
swp_entry_t entry, struct page *page)
{
vma->vm_mm->rss++;
+ vma->vm_mm->anon_rss++;
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_anon_rmap(page, vma, address);
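
On the read side the split counters are folded back together, as the /proc changes above show; a reader simply does:

	unsigned long rss, anon_rss, shared;

	get_rss(mm, &rss, &anon_rss);	/* mm->rss plus every task on mm->task_list, under RCU */
	shared = rss - anon_rss;	/* as in task_statm() above */

The per-task counters are signed longs and can legitimately go negative, since one thread may unmap pages that another thread faulted in, which is presumably why get_rss() clamps the sums at zero.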

2005-01-11 17:59:14

by Christoph Lameter

[permalink] [raw]
Subject: page table lock patch V15 [6/7]: s390 atomic pte operations

Changelog
* Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/pgtable.h 2005-01-10 16:31:56.000000000 -0800
+++ linux-2.6.10/include/asm-s390/pgtable.h 2005-01-10 16:41:07.000000000 -0800
@@ -577,6 +577,15 @@
return pte;
}

+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ struct mm_struct *__mm = __vma->vm_mm; \
+ pte_t __pte; \
+ __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+
static inline void ptep_set_wrprotect(pte_t *ptep)
{
pte_t old_pte = *ptep;
@@ -788,6 +797,14 @@

#define kern_addr_valid(addr) (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
/*
* No page table caches to initialise
*/
@@ -801,6 +818,7 @@
#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_CLEAR_FLUSH
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
Index: linux-2.6.10/include/asm-s390/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/pgalloc.h 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-s390/pgalloc.h 2005-01-10 16:41:07.000000000 -0800
@@ -97,6 +97,10 @@
pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
}

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+ return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
#endif /* __s390x__ */

static inline void
@@ -119,6 +123,18 @@
pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
}

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+ int rc;
+ spin_lock(&mm->page_table_lock);
+
+ rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
+ if (rc) pmd_populate(mm, pmd, page);
+ spin_unlock(&mm->page_table_lock);
+ return rc;
+}
+
/*
* page table entry allocation/free routines.
*/

2005-01-11 18:11:58

by Christoph Lameter

[permalink] [raw]
Subject: page table lock patch V15 [4/7]: i386 atomic pte operations

Changelog
* Atomic pte operations for i386 in regular and PAE modes

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/pgtable.h 2005-01-07 09:48:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/pgtable.h 2005-01-07 09:51:09.000000000 -0800
@@ -409,6 +409,7 @@
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _I386_PGTABLE_H */
Index: linux-2.6.10/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/pgtable-3level.h 2005-01-07 09:48:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/pgtable-3level.h 2005-01-07 09:51:09.000000000 -0800
@@ -8,7 +8,8 @@
* tables on PPro+ CPUs.
*
* Copyright (C) 1999 Ingo Molnar <[email protected]>
- */
+ * August 26, 2004 added ptep_cmpxchg <[email protected]>
+*/

#define pte_ERROR(e) \
printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -44,21 +45,11 @@
return pte_x(pte);
}

-/* Rules for using set_pte: the pte being assigned *must* be
- * either not present or in a state where the hardware will
- * not attempt to update the pte. In places where this is
- * not possible, use pte_get_and_clear to obtain the old pte
- * value and then use set_pte to update it. -ben
- */
-static inline void set_pte(pte_t *ptep, pte_t pte)
-{
- ptep->pte_high = pte.pte_high;
- smp_wmb();
- ptep->pte_low = pte.pte_low;
-}
#define __HAVE_ARCH_SET_PTE_ATOMIC
#define set_pte_atomic(pteptr,pteval) \
set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
+#define set_pte(pteptr,pteval) \
+ *(unsigned long long *)(pteptr) = pte_val(pteval)
#define set_pmd(pmdptr,pmdval) \
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
#define set_pud(pudptr,pudval) \
@@ -155,4 +146,25 @@

#define __pmd_free_tlb(tlb, x) do { } while (0)

+/* Atomic PTE operations */
+#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \
+({ pte_t __r; \
+ /* xchg acts as a barrier before the setting of the high bits. */\
+ __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \
+ __r.pte_high = (__ptep)->pte_high; \
+ (__ptep)->pte_high = (__newval).pte_high; \
+ flush_tlb_page(__vma, __addr); \
+ (__r); \
+})
+
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
+
+static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+#define __HAVE_ARCH_GET_PTE_ATOMIC
+#define get_pte_atomic(__ptep) __pte(get_64bit((unsigned long long *)(__ptep)))
+
#endif /* _I386_PGTABLE_3LEVEL_H */
Index: linux-2.6.10/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/pgtable-2level.h 2005-01-07 09:48:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/pgtable-2level.h 2005-01-07 09:51:09.000000000 -0800
@@ -65,4 +65,7 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low)
+
#endif /* _I386_PGTABLE_2LEVEL_H */
Index: linux-2.6.10/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/pgalloc.h 2005-01-07 09:48:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/pgalloc.h 2005-01-07 11:15:55.000000000 -0800
@@ -4,9 +4,12 @@
#include <linux/config.h>
#include <asm/processor.h>
#include <asm/fixmap.h>
+#include <asm/system.h>
#include <linux/threads.h>
#include <linux/mm.h> /* for struct page */

+#define PMD_NONE 0L
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -14,6 +17,18 @@
set_pmd(pmd, __pmd(_PAGE_TABLE + \
((unsigned long long)page_to_pfn(pte) << \
(unsigned long long) PAGE_SHIFT)))
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+ return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+ ((unsigned long long)page_to_pfn(pte) <<
+ (unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+ return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
/*
* Allocate and free page tables.
*/
@@ -44,6 +59,7 @@
#define pmd_free(x) do { } while (0)
#define __pmd_free_tlb(tlb,x) do { } while (0)
#define pud_populate(mm, pmd, pte) BUG()
+#define pud_test_and_populate(mm, pmd, pte) ({ BUG(); 1; })
#endif

#define check_pgt_cache() do { } while (0)
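
These PAE definitions pair with the generic fault path of patch 1/7: the pte is snapshotted atomically and flag updates go through ptep_cmpxchg, roughly:

	/* Simplified flag update as done in handle_pte_fault() */
	pte_t entry = get_pte_atomic(pte);	/* 64-bit read via cmpxchg8b on PAE */
	pte_t new_entry = pte_mkyoung(entry);

	if (write_access)
		new_entry = pte_mkdirty(new_entry);
	/* If another CPU changed the pte in the meantime the cmpxchg fails
	   and the access simply faults again */
	ptep_cmpxchg(vma, address, pte, entry, new_entry);

The write-protect case is left out here; handle_pte_fault still takes the page_table_lock before calling do_wp_page.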

2005-01-11 18:11:59

by Christoph Lameter

[permalink] [raw]
Subject: page table lock patch V15 [3/7]: i386 universal cmpxchg

Changelog
* Make cmpxchg and cmpxchg8b generally available on the i386
platform.
* Provide emulation of cmpxchg suitable for uniprocessor if
build and run on 386.
* Provide emulation of cmpxchg8b suitable for uniprocessor
systems if build and run on 386 or 486.
* Provide an inline function to atomically get a 64 bit value
via cmpxchg8b in an SMP system (courtesy of Nick Piggin)
(important for i386 PAE mode and other places where atomic
64 bit operations are useful)

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/arch/i386/Kconfig
===================================================================
--- linux-2.6.9.orig/arch/i386/Kconfig 2004-12-10 09:58:03.000000000 -0800
+++ linux-2.6.9/arch/i386/Kconfig 2004-12-10 09:59:27.000000000 -0800
@@ -351,6 +351,11 @@
depends on !M386
default y

+config X86_CMPXCHG8B
+ bool
+ depends on !M386 && !M486
+ default y
+
config X86_XADD
bool
depends on !M386
Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-12-06 17:23:49.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-12-10 09:59:27.000000000 -0800
@@ -6,6 +6,7 @@
#include <linux/bitops.h>
#include <linux/smp.h>
#include <linux/thread_info.h>
+#include <linux/module.h>

#include <asm/processor.h>
#include <asm/msr.h>
@@ -287,5 +288,103 @@
return 0;
}

+#ifndef CONFIG_X86_CMPXCHG
+unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new)
+{
+ u8 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u8));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u8 *)ptr;
+ if (prev == old)
+ *(u8 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u8);
+
+unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new)
+{
+ u16 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u16));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u16 *)ptr;
+ if (prev == old)
+ *(u16 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u16);
+
+unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new)
+{
+ u32 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u32));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u32 *)ptr;
+ if (prev == old)
+ *(u32 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u32);
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ unsigned long flags;
+
+ /*
+ * Check if the kernel was compiled for an old cpu but
+ * we are running really on a cpu capable of cmpxchg8b
+ */
+
+ if (cpu_has(cpu_data, X86_FEATURE_CX8))
+ return __cmpxchg8b(ptr, old, newv);
+
+ /* Poor man's cmpxchg8b for 386 and 486. Not suitable for SMP */
+ local_irq_save(flags);
+ prev = *ptr;
+ if (prev == old)
+ *ptr = newv;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg8b_486);
+#endif
+
// arch_initcall(intel_cpu_init);

Index: linux-2.6.9/include/asm-i386/system.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/system.h 2004-12-06 17:23:55.000000000 -0800
+++ linux-2.6.9/include/asm-i386/system.h 2004-12-10 10:00:49.000000000 -0800
@@ -149,6 +149,9 @@
#define __xg(x) ((struct __xchg_dummy *)(x))


+#define ll_low(x) *(((unsigned int*)&(x))+0)
+#define ll_high(x) *(((unsigned int*)&(x))+1)
+
/*
* The semantics of XCHGCMP8B are a bit strange, this is why
* there is a loop and the loading of %%eax and %%edx has to
@@ -184,8 +187,6 @@
{
__set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL));
}
-#define ll_low(x) *(((unsigned int*)&(x))+0)
-#define ll_high(x) *(((unsigned int*)&(x))+1)

static inline void __set_64bit_var (unsigned long long *ptr,
unsigned long long value)
@@ -203,6 +204,26 @@
__set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
__set_64bit(ptr, ll_low(value), ll_high(value)) )

+static inline unsigned long long __get_64bit(unsigned long long * ptr)
+{
+ unsigned long long ret;
+ __asm__ __volatile__ (
+ "\n1:\t"
+ "movl (%1), %%eax\n\t"
+ "movl 4(%1), %%edx\n\t"
+ "movl %%eax, %%ebx\n\t"
+ "movl %%edx, %%ecx\n\t"
+ LOCK_PREFIX "cmpxchg8b (%1)\n\t"
+ "jnz 1b"
+ : "=A"(ret)
+ : "D"(ptr)
+ : "ebx", "ecx", "memory");
+ return ret;
+}
+
+#define get_64bit(ptr) __get_64bit(ptr)
+
+
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway
* Note 2: xchg has side effect, so that attribute volatile is necessary,
@@ -240,7 +261,41 @@
*/

#ifdef CONFIG_X86_CMPXCHG
+
#define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable of running on 80386. It may be necessary to
+ * simulate the cmpxchg on the 80386 CPU. For that purpose we define
+ * a function for each of the sizes we support.
+ */
+
+extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8);
+extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16);
+extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32);
+
+static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+ unsigned long new, int size)
+{
+ switch (size) {
+ case 1:
+ return cmpxchg_386_u8(ptr, old, new);
+ case 2:
+ return cmpxchg_386_u16(ptr, old, new);
+ case 4:
+ return cmpxchg_386_u32(ptr, old, new);
+ }
+ return old;
+}
+
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
#endif

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,12 +325,34 @@
return old;
}

-#define cmpxchg(ptr,o,n)\
- ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
- (unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ __asm__ __volatile__(
+ LOCK_PREFIX "cmpxchg8b (%4)"
+ : "=A" (prev)
+ : "0" (old), "c" ((unsigned long)(newv >> 32)),
+ "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr)
+ : "memory");
+ return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on 80486 and 80386. Both
+ * do not support cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(volatile unsigned long long *,
+ unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
#ifdef __KERNEL__
-struct alt_instr {
+struct alt_instr {
__u8 *instr; /* original instruction */
__u8 *replacement;
__u8 cpuid; /* cpuid bit set for replacement */
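
With cmpxchg and cmpxchg8b now defined for every i386 target, callers outside the page table code can use them unconditionally. An illustrative helper (not part of the patch): a lock-free counter increment is just

static inline void counter_inc(unsigned long *counter)
{
	unsigned long old;

	do {
		old = *counter;
	} while (cmpxchg(counter, old, old + 1) != old);
}

On a plain 386 this ends up in cmpxchg_386_u32(), which falls back to disabling interrupts; on anything newer the real instruction is used even though the kernel was built for M386.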

2005-01-11 17:55:12

by Christoph Lameter

[permalink] [raw]
Subject: page table lock patch V15 [1/7]: Reduce use of page table lock

Changelog
* Increase parallelism in SMP configurations by deferring
the acquisition of page_table_lock in handle_mm_fault
* Anonymous memory page faults bypass the page_table_lock
through the use of atomic page table operations
* Swapper does not set pte to empty in transition to swap
* Simulate atomic page table operations using the
page_table_lock if an arch does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
a performance benefit since the page_table_lock
is held for shorter periods of time.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-11 08:46:16.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-11 09:16:30.000000000 -0800
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005 Scalability improvement by reducing the use and the length of time
+ * the page table lock is held (Christoph Lameter)
*/

#include <linux/kernel_stat.h>
@@ -1677,8 +1679,7 @@
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1690,15 +1691,13 @@
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1721,8 +1720,7 @@
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1772,14 +1770,12 @@
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
@@ -1789,47 +1785,44 @@

/* ..except if it's a write access */
if (write_access) {
- /* Allocate our own private page. */
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

+ /* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
+ return VM_FAULT_OOM;
+
page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
if (!page)
- goto no_mem;
+ return VM_FAULT_OOM;
clear_user_highpage(page, addr);

- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
+ vma->vm_page_prot)),
+ vma);
+ }

- if (!pte_none(*page_table)) {
+ if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
+ if (write_access) {
pte_unmap(page_table);
page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
}
+ return VM_FAULT_MINOR;
+ }
+ if (write_access) {
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ page_add_anon_rmap(page, vma, addr);
+ lru_cache_add_active(page);
mm->rss++;
acct_update_integrals();
update_mem_hiwater();
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- SetPageReferenced(page);
- page_add_anon_rmap(page, vma, addr);
- }

- set_pte(page_table, entry);
+ }
pte_unmap(page_table);

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
- spin_unlock(&mm->page_table_lock);
-out:
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}

/*
@@ -1841,12 +1834,12 @@
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1857,9 +1850,8 @@

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1959,7 +1951,7 @@
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1972,13 +1964,12 @@
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

- pgoff = pte_to_pgoff(*pte);
+ pgoff = pte_to_pgoff(entry);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -1997,49 +1988,46 @@
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that we handle that case properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

- entry = *pte;
+ /*
+ * This must be an atomic operation since the page_table_lock is
+ * not held. If a pte_t larger than the word size is used then an
+ * incorrect value could be read because another processor is
+ * concurrently updating the multi-word pte. The i386 PAE mode
+ * is raising its ugly head here.
+ */
+ entry = get_pte_atomic(pte);
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ /*
+ * This is the case in which we only update some bits in the pte.
+ */
+ new_entry = pte_mkyoung(entry);
if (write_access) {
- if (!pte_write(entry))
+ if (!pte_write(entry)) {
+ /* do_wp_page expects us to hold the page_table_lock */
+ spin_lock(&mm->page_table_lock);
return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ }
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+ ptep_cmpxchg(vma, address, pte, entry, new_entry);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
}

@@ -2058,33 +2046,55 @@

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We rely on the mmap_sem and the SMP-safe atomic PTE updates.
+ * to synchronize with kswapd. We can avoid the overhead
+ * of the p??_alloc functions through atomic operations so
+ * we duplicate the functionality of pmd_alloc, pud_alloc and
+ * pte_alloc_map here.
*/
pgd = pgd_offset(mm, address);
- spin_lock(&mm->page_table_lock);
+ if (unlikely(pgd_none(*pgd))) {
+ pud_t *new = pud_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;

- pud = pud_alloc(mm, pgd, address);
- if (!pud)
- goto oom;
-
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- goto oom;
-
- pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- goto oom;
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pud_free(new);
+ }
+
+ pud = pud_offset(pgd, address);
+ if (unlikely(pud_none(*pud))) {
+ pmd_t *new = pmd_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ if (!pud_test_and_populate(mm, pud, new))
+ pmd_free(new);
+ }
+
+ pmd = pmd_offset(pud, address);
+ if (unlikely(!pmd_present(*pmd))) {
+ struct page *new = pte_alloc_one(mm, address);

- return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+ if (!new)
+ return VM_FAULT_OOM;

- oom:
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else {
+ inc_page_state(nr_page_table_pages);
+ mm->nr_ptes++;
+ }
+ }
+
+ pte = pte_offset_map(pmd, address);
+ return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}

#ifndef __ARCH_HAS_4LEVEL_HACK
Index: linux-2.6.10/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable.h 2005-01-11 08:48:30.000000000 -0800
@@ -28,6 +28,11 @@
#endif /* __HAVE_ARCH_SET_PTE_ATOMIC */
#endif

+/* Get a pte entry without the page table lock */
+#ifndef __HAVE_ARCH_GET_PTE_ATOMIC
+#define get_pte_atomic(__x) *(__x)
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Largely same as above, but only sets the access flags (dirty,
@@ -134,4 +139,73 @@
#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr)
#endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to ensure some form of locking.
+ * Note though that low level operations as well as the
+ * page table handling of the cpu may bypass all locking.
+ */
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \
+({ \
+ int __rc; \
+ spin_lock(&__vma->vm_mm->page_table_lock); \
+ __rc = pte_same(*(__ptep), __oldval); \
+ if (__rc) { set_pte(__ptep, __newval); \
+ update_mmu_cache(__vma, __addr, __newval); } \
+ spin_unlock(&__vma->vm_mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pgd_present(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pud_present(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\
+ flush_tlb_page(__vma, __address); \
+ __p; \
+})
+
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-01-11 08:46:16.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-01-11 08:48:30.000000000 -0800
@@ -426,7 +426,10 @@
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
*
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
*/
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
@@ -575,11 +578,6 @@

/* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -594,11 +592,15 @@
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte(pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
mm->anon_rss--;
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);

+ /* Move the dirty bit to the physical page now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
mm->rss--;
acct_update_integrals();
page_remove_rmap(page);
@@ -691,15 +693,21 @@
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
+ /*
+ * There would be a race here with handle_mm_fault and do_anonymous_page
+ * which bypasses the page_table_lock if we would zap the pte before
+ * putting something into it. On the other hand we need to
+ * have the dirty flag setting at the time we replaced the value.
+ */

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte(pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_get_and_clear(pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);

Index: linux-2.6.10/include/asm-generic/pgtable-nopud.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable-nopud.h 2005-01-11 08:46:15.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable-nopud.h 2005-01-11 08:48:30.000000000 -0800
@@ -25,8 +25,13 @@
static inline int pgd_present(pgd_t pgd) { return 1; }
static inline void pgd_clear(pgd_t *pgd) { }
#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
-
#define pgd_populate(mm, pgd, pud) do { } while (0)
+
+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+{
+ return 1;
+}
+
/*
* (puds are folded into pgds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
Index: linux-2.6.10/include/asm-generic/pgtable-nopmd.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable-nopmd.h 2005-01-11 08:46:15.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable-nopmd.h 2005-01-11 08:48:30.000000000 -0800
@@ -29,6 +29,7 @@
#define pmd_ERROR(pmd) (pud_ERROR((pmd).pud))

#define pud_populate(mm, pmd, pte) do { } while (0)
+static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) { return 1; }

/*
* (pmds are folded into puds so this doesn't get actually called,
Index: linux-2.6.10/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgtable.h 2005-01-11 08:46:15.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgtable.h 2005-01-11 08:48:30.000000000 -0800
@@ -561,7 +561,7 @@
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
-#include <asm-generic/pgtable.h>
#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */

2005-01-12 05:59:40

by Nick Piggin

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Christoph Lameter wrote:
> Changes from V14->V15 of this patch:

Hi,

I wonder what everyone thinks about moving forward with these patches?
Has it been decided that they'll be merged soon? Christoph has been
working fairly hard on them, but there hasn't been a lot of feedback.


And for those few people who have looked at my patches for page table
lock removal, is there is any preference to one implementation or the
other?

It is probably fair to say that my patches are more comprehensive
(in terms of ptl removal, ie. the complete removal**), and can allow
architectures to be more flexible in their page table synchronisation
methods.

However, Christoph's are simpler and probably more widely tested and
reviewed at this stage, and more polished. Christoph's implementation
probably also covers the most pressing performance cases.

On the other hand, my patches *do* allow for the use of a spin-locked
synchronisation implementation, which is probably closer to the
current code than Christoph's spin-locked pte_cmpxchg fallback in
terms of changes to locking semantics.


[** Aside, I didn't see a very significant improvement in mm/rmap.c
functions from ptl removal. Mostly I think due to contention on
mapping->i_mmap_lock (I didn't test anonymous pages, they may have
a better yield)]

2005-01-12 09:44:06

by Andrew Morton

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Nick Piggin <[email protected]> wrote:
>
> Christoph Lameter wrote:
> > Changes from V14->V15 of this patch:
>
> Hi,
>
> I wonder what everyone thinks about moving forward with these patches?

I was waiting for them to settle down before paying more attention.

My general take is that these patches address a single workload on
exceedingly rare and expensive machines. If they adversely affect common
and cheap machines via code complexity, memory footprint or via runtime
impact then it would be pretty hard to justify their inclusion.

Do we have measurements of the negative and/or positive impact on smaller
machines?

2005-01-12 12:44:10

by Hugh Dickins

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, 12 Jan 2005, Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
> >
> > Christoph Lameter wrote:
> > > Changes from V14->V15 of this patch:
> >
> > I wonder what everyone thinks about moving forward with these patches?
>
> I was waiting for them to settle down before paying more attention.

They seem to have settled down, without advancing to anything satisfactory.
7/7 is particularly amusing at the moment (added complexity with no payoff).

> My general take is that these patches address a single workload on
> exceedingly rare and expensive machines.

Well put. Christoph's patches stubbornly remain a _good_ hack for one
very specific initial workload (multi-parallel faulting of anon memory)
on one architecture (ia64, perhaps a few more) important to SGI.
I don't see why the mainline kernel should want them.

> If they adversely affect common
> and cheap machines via code complexity, memory footprint or via runtime
> impact then it would be pretty hard to justify their inclusion.

Aside from 7/7 (and some good asm primitives within headers),
the code itself is not complex; but it is more complex to think about,
and so less obviously correct.

> Do we have measurements of the negative and/or positive impact on smaller
> machines?

I don't think so. But my main worry remains the detriment to other
architectures, which is still unaddressed.

Nick's patches (I've not seen for some while) are a different case:
on the minus side, considerably more complex; on the plus side,
more general and more aware of the range of architectures.

I'll write at greater length to support these accusations later on.

Hugh

2005-01-12 15:45:14

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, Jan 12, 2005 at 01:42:35AM -0800, Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
> >
> > Christoph Lameter wrote:
> > > Changes from V14->V15 of this patch:
> >
> > Hi,
> >
> > I wonder what everyone thinks about moving forward with these patches?
>
> I was waiting for them to settle down before paying more attention.
>
> My general take is that these patches address a single workload on
> exceedingly rare and expensive machines. If they adversely affect common
> and cheap machines via code complexity, memory footprint or via runtime
> impact then it would be pretty hard to justify their inclusion.
>
> Do we have measurements of the negative and/or positive impact on smaller
> machines?

I haven't seen wide performance numbers of this patch yet.

Hint: STP is really easy.

2005-01-12 16:40:17

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, 12 Jan 2005, Andrew Morton wrote:

> My general take is that these patches address a single workload on
> exceedingly rare and expensive machines. If they adversely affect common
> and cheap machines via code complexity, memory footprint or via runtime
> impact then it would be pretty hard to justify their inclusion.

The future is in higher and higher SMP counts since the chase for the
higher clock frequency has ended. We will increasingly see multi-core
cpus etc. Machines with higher CPU counts are becoming common in business.
Of course SGI uses much higher CPU counts and our supercomputer
applications would benefit most from this patch.

I thought this patch was already approved by Linus?

> Do we have measurements of the negative and/or positive impact on smaller
> machines?

Here is a measurement of 256M allocation on a 2 way SMP machine 2x
PIII-500Mhz:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 10 1 0.005s 0.016s 0.002s 54357.280 52261.895
0 10 2 0.008s 0.019s 0.002s 43112.368 42463.566

With patch:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 10 1 0.005s 0.016s 0.002s 54357.280 53439.357
0 10 2 0.008s 0.018s 0.002s 44650.831 44202.412

So only very minor improvements for old machines (this one is from ~'98).

2005-01-12 16:49:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, Jan 12, 2005 at 08:39:21AM -0800, Christoph Lameter wrote:
> The future is in higher and higher SMP counts since the chase for the
> higher clock frequency has ended. We will increasingly see multi-core
> cpus etc. Machines with higher CPU counts are becoming common in business.

And they still are absolutely in the minority. In fact, with multicore
cpus it becomes more and more important to be fast for SMP systems with
a _small_ number of CPUs, while really large CPU counts will remain a small
niche for the foreseeable future.

2005-01-12 17:38:00

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, 12 Jan 2005, Christoph Hellwig wrote:

> On Wed, Jan 12, 2005 at 08:39:21AM -0800, Christoph Lameter wrote:
> > The future is in higher and higher SMP counts since the chase for the
> > higher clock frequency has ended. We will increasingly see multi-core
> > cpus etc. Machines with higher CPU counts are becoming common in business.
>
> An they still are absolutely in the minority. In fact with multicore
> cpus it becomes more and more important to be fast for SMP systtems with
> a _small_ number of CPUs, while really larget CPUs will remain a small
> nische for the forseeable future.

The benefits start to be significant pretty fast with even a few cpus
on modern architectures:

Altix no patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 10 1 0.107s 6.444s 6.055s 100028.084 100006.622
1 10 2 0.121s 9.048s 4.082s 71468.414 135904.412
1 10 4 0.129s 10.185s 3.011s 63531.985 210146.600

w/patch
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 10 1 0.094s 6.116s 6.021s 105517.039 105517.574
1 10 2 0.134s 6.998s 3.087s 91879.573 169079.712
1 10 4 0.095s 7.658s 2.043s 84519.939 268955.165

There is even a small benefit to the single thread case.

It's not the case that this patch only benefits systems with a large number
of CPUs. Of course, that is when the benefits turn into performance gains
of orders of magnitude.

2005-01-12 17:41:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, Jan 12, 2005 at 09:37:27AM -0800, Christoph Lameter wrote:
>
> The benefits start to be significant pretty fast with even a few cpus
> on modern architectures:
>
> Altix no patch:
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 1 10 1 0.107s 6.444s 6.055s100028.084 100006.622
> 1 10 2 0.121s 9.048s 4.082s 71468.414 135904.412
> 1 10 4 0.129s 10.185s 3.011s 63531.985 210146.600
>
> w/patch
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 1 10 1 0.094s 6.116s 6.021s105517.039 105517.574
> 1 10 2 0.134s 6.998s 3.087s 91879.573 169079.712
> 1 10 4 0.095s 7.658s 2.043s 84519.939 268955.165

These smaller systems are more likely x86/x86_64 machines ;-)

2005-01-12 17:53:12

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, 12 Jan 2005, Christoph Hellwig wrote:

> These smaller systems are more likely x86/x86_64 machines ;-)

But they will not have been built in 1998 either, like the machine I used
for the i386 tests. Could you do some tests on contemporary x86/x86_64
SMP systems with large memory?

2005-01-12 18:04:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, Jan 12, 2005 at 09:52:53AM -0800, Christoph Lameter wrote:
> On Wed, 12 Jan 2005, Christoph Hellwig wrote:
>
> > These smaller systems are more likely x86/x86_64 machines ;-)
>
> But they will not have been build in 1998 either like the machine I used
> for the i386 tests. Could you do some tests on contemporary x86/x86_64
> SMP systems with large memory?

I don't have such systems.

2005-01-12 18:21:51

by Andrew Walrond

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wednesday 12 January 2005 17:52, Christoph Lameter wrote:
> On Wed, 12 Jan 2005, Christoph Hellwig wrote:
> > These smaller systems are more likely x86/x86_64 machines ;-)
>
> But they will not have been build in 1998 either like the machine I used
> for the i386 tests. Could you do some tests on contemporary x86/x86_64
> SMP systems with large memory?

I have various dual x86_64 systems with 1-4Gb ram. What tests do you want run?

Andrew Walrond

2005-01-12 18:44:32

by Andrew Morton

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Christoph Lameter <[email protected]> wrote:
>
> > Do we have measurements of the negative and/or positive impact on smaller
> > machines?
>
> Here is a measurement of 256M allocation on a 2 way SMP machine 2x
> PIII-500Mhz:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 0 10 1 0.005s 0.016s 0.002s 54357.280 52261.895
> 0 10 2 0.008s 0.019s 0.002s 43112.368 42463.566
>
> With patch:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 0 10 1 0.005s 0.016s 0.002s 54357.280 53439.357
> 0 10 2 0.008s 0.018s 0.002s 44650.831 44202.412
>
> So only a very minor improvements for old machines (this one from ~ 98).

OK. But have you written a test to demonstrate any performance
regressions? From, say, the use of atomic ops on ptes?

2005-01-12 19:12:58

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, 12 Jan 2005, Andrew Morton wrote:

> > So only a very minor improvements for old machines (this one from ~ 98).
>
> OK. But have you written a test to demonstrate any performance
> regressions? From, say, the use of atomic ops on ptes?

If I knew of any regressions, I would certainly try to deal with them. The
test is written to check concurrent page fault performance and it has
repeatedly helped me to find problems with page faults. I have used it for
a couple of other patchsets too. If the patch were available in -mm
then it would certainly get more exposure, and it might become clear
whether there are any regressions.

The introduction of cmpxchg means one atomic operation replaces the two
spinlock ops typically necessary in an unpatched kernel. Obtaining the
spinlock requires an atomic operation and the release involves a barrier.
So there is a net win for all SMP cases as far
as I can see.
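
As a rough sketch of the comparison (illustration only; the helper names
update_pte_locked/update_pte_cmpxchg are made up and not part of the patches):

/* Unpatched kernel, roughly: lock acquisition (an atomic op) plus a
 * release barrier around every pte update. */
static void update_pte_locked(struct mm_struct *mm, pte_t *ptep, pte_t entry)
{
        spin_lock(&mm->page_table_lock);        /* atomic operation */
        set_pte(ptep, entry);
        spin_unlock(&mm->page_table_lock);      /* release barrier */
}

/* With the patch, roughly: one compare-and-exchange, which fails if
 * another CPU installed a pte in the meantime. */
static int update_pte_cmpxchg(struct vm_area_struct *vma, unsigned long addr,
                              pte_t *ptep, pte_t orig, pte_t entry)
{
        return ptep_cmpxchg(vma, addr, ptep, orig, entry);
}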

2005-01-12 21:29:02

by Hugh Dickins

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, 12 Jan 2005, Hugh Dickins wrote:
> On Wed, 12 Jan 2005, Andrew Morton wrote:
> > Nick Piggin <[email protected]> wrote:
> > > Christoph Lameter wrote:
> > > > Changes from V14->V15 of this patch:
> > > I wonder what everyone thinks about moving forward with these patches?
> > I was waiting for them to settle down before paying more attention.
> They seem to have settled down, without advancing to anything satisfactory.

Well, I studied the patches a bit more, and wrote
"That remark looks a bit unfair to me now I've looked closer."
Sorry. But I do still think it remains unsatisfactory.

Then I studied it a bit more, and I think my hostility melted away
once I thought about the other-arch-defaults: I'd been supposing that
taking and dropping the page_table_lock within each primitive was
adding up to an unhealthy flurry of takes and drops on the non-target
architectures. But that doesn't look like the case to me now (except
in those rarer paths where a page table has to be allocated: of course,
not a problem).

I owe Christoph an apology. It's not quite satisfactory yet,
but it does look a lot better than an ia64 hack for one special case.

Might I save face by suggesting that it would be a lot clearer and
better if 1/1 got split into two? The first entirely concerned with
removing the spin_lock(&mm->page_table_lock) from handle_mm_fault,
and dealing with the consequences of that - moving the locking into
the allocating blocks, atomic getting of pud and pmd and pte,
passing the atomically-gotten orig_pte down to subfunctions
(which no longer expect page_table_lock held on entry) etc.

If there's a slight increase in the number of atomic operations
in each i386 PAE page fault, well, I think the superiority of
x86_64 makes that now an acceptable tradeoff.

That would be quite a decent patch, wouldn't it? that could go into
-mm for a few days and be measured, before any more. Then one using
something called ptep_cmpxchg to encapsulate the page_table_lock'ed
checking of pte_same and set_pte in do_anonymous page. Then ones to
implement ptep_cmpxchg per selected arches without page_table_lock.

Dismiss those suggestions if they'd just waste everyone's time.

Christoph has made some strides in correcting for other architectures
e.g. update_mmu_cache within default ptep_cmpxchg's page_table_lock
(probably correct but I can't be sure myself), and get_pte_atomic to
get even i386 PAE pte correctly without page_table_lock; and reverted
the pessimization of set_pte being always atomic on i386 PAE (but now
I've forgotten and can't find the case where it needed to be atomic).

Unless it's just been fixed in this latest version, the well-intentioned
get_pte_atomic doesn't actually work on i386 PAE: once you get swapping,
the swap entries look like pte_nones and all collapses. Presumably just
#define get_pte_atomic(__ptep) __pte(get_64bit((unsigned long long *)(__ptep)))
doesn't quite do what it's trying to do, and needs a slight adjustment.

But no sign of get_pmd(atomic) or get_pud(atomic) to get the higher level
entries - I thought we'd agreed they were also necessary on some arches?

> 7/7 is particularly amusing at the moment (added complexity with no payoff).

I still dislike 7/7, despite seeing the sense of keeping stats in the
task struct. It's at the very end anyway, and I'd be glad for it to
be delayed (in the hope that time somehow magically makes it nicer).

In its present state it is absurd: partly because Christoph seems to
have forgotten the point of it, so after all the per-thread infrastructure,
has ended up with do_anonymous_page saying mm->rss++, mm->anon_rss++.

And partly because others at SGI have been working in the opposite
direction, adding mysterious and tasteless acct_update_integrals
and update_mem_hiwater calls. I say mysterious because there's
nothing in the tree which actually uses the accumulated statistics,
or shows how they might be used (when many threads share the mm),
- so Adrian/Arjan/HCH might remove them any day. But looking at
December mails suggests there's lse-tech agreement that all kinds
of addons would find them useful. I say tasteless because they
don't even take "mm" arguments (what happens when ptrace or AIO
daemon faults something? perhaps it's okay but there's no use of
the stats to judge by), and the places where you'd want to update
hiwater_rss are almost entirely disjoint from the places where
you'd want to update hiwater_vm (expand_stack the exception).

If those new stats stay, and the per-task-rss idea stays,
then I suppose those new stats need to be split per task too.

> I'll write at greater length to support these accusations later on.

I rather failed to do so! And perhaps tomorrow I'll have to be
apologizing to Jay for my uncomprehending attack on hiwater etc.

Hugh

2005-01-12 23:20:47

by Nick Piggin

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Andrew Morton wrote:
> Christoph Lameter <[email protected]> wrote:
>
>>>Do we have measurements of the negative and/or positive impact on smaller
>>
>> > machines?
>>
>> Here is a measurement of 256M allocation on a 2 way SMP machine 2x
>> PIII-500Mhz:
>>
>> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
>> 0 10 1 0.005s 0.016s 0.002s 54357.280 52261.895
>> 0 10 2 0.008s 0.019s 0.002s 43112.368 42463.566
>>
>> With patch:
>>
>> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
>> 0 10 1 0.005s 0.016s 0.002s 54357.280 53439.357
>> 0 10 2 0.008s 0.018s 0.002s 44650.831 44202.412
>>
>> So only a very minor improvements for old machines (this one from ~ 98).
>
>
> OK. But have you written a test to demonstrate any performance
> regressions? From, say, the use of atomic ops on ptes?
>

Performance wise, Christoph's never had as much of a problem as my
patches because it isn't doing extra atomic operations in copy_page_range.

However, it looks like it should be. For the same reason there needs to
be an atomic read in handle_mm_fault. And it probably needs atomic ops
in other places too, I think.

So my patches cost about 7% in lmbench fork benchmark... however, I've
been thinking we could take the mmap_sem for writing before doing the
copy_page_range which could reduce the need for atomic ops.


2005-01-12 23:29:19

by Andrew Morton

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Nick Piggin <[email protected]> wrote:
>
> So my patches cost about 7% in lmbench fork benchmark.

OK, well that's the sort of thing we need to understand fully. What sort
of CPU was that on?

Look, -7% on a 2-way versus +700% on a many-way might well be a tradeoff we
agree to take. But we need to fully understand all the costs and benefits.

2005-01-12 23:51:02

by Nick Piggin

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>So my patches cost about 7% in lmbench fork benchmark.
>
>
> OK, well that's the sort of thing we need to understand fully. What sort
> of CPU was that on?
>

That was on a P4, although I've seen pretty similar results on ia64 and
other x86 CPUs.

Note that this was with my ptl removal patches. I can't see why Christoph's
would have _any_ extra overhead as they are, but it looks to me like they're
lacking in atomic ops. So I'd expect something similar for Christoph's when
they're properly atomic.

> Look, -7% on a 2-way versus +700% on a many-way might well be a tradeoff we
> agree to take. But we need to fully understand all the costs and benefits.
>

I think copy_page_range is the one to keep an eye on.

2005-01-12 23:58:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, 13 Jan 2005, Nick Piggin wrote:

> Note that this was with my ptl removal patches. I can't see why Christoph's
> would have _any_ extra overhead as they are, but it looks to me like they're
> lacking in atomic ops. So I'd expect something similar for Christoph's when
> they're properly atomic.

Pointer operations and word size operations are atomic. So this is mostly
okay.

The issue arises on architectures that have a larger pte size than the
word size. This only affects i386 PAE mode and S/390. S/390 falls back to
the page table lock for these operations. PAE mode should do the same and
not use atomic ops if they cannot be made to work in a reasonable manner.
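
A minimal sketch of that fallback, assuming an i386 header that simply does
not advertise atomic table ops under CONFIG_X86_PAE, so the generic
page_table_lock based emulation is used instead (the placement and the exact
macro are illustrative, not from the posted patches):

/* Illustration only: claim atomic table ops except in PAE mode, where
 * ptes are 64 bits wide; PAE then uses the lock-based generic fallback. */
#ifndef CONFIG_X86_PAE
#define __HAVE_ARCH_ATOMIC_TABLE_OPS
#define __HAVE_ARCH_PTEP_CMPXCHG
#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval)         \
        (cmpxchg(&pte_val(*(__ptep)),                                   \
                 pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
#endif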


2005-01-13 00:06:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, 12 Jan 2005, Hugh Dickins wrote:

> Well, I studied the patches a bit more, and wrote
> "That remark looks a bit unfair to me now I've looked closer."
> Sorry. But I do still think it remains unsatisfactory."

Well then, thanks for not ccing me on the initial rant but a whole bunch of
other people instead, to whom you then did not send the following email.
Is this standard behavior on linux-mm?

> Might I save face by suggesting that it would be a lot clearer and
> better if 1/1 got split into two? The first entirely concerned with
> removing the spin_lock(&mm->page_table_lock) from handle_mm_fault,
> and dealing with the consequences of that - moving the locking into
> the allocating blocks, atomic getting of pud and pmd and pte,
> passing the atomically-gotten orig_pte down to subfunctions
> (which no longer expect page_table_lock held on entry) etc.

That won't do any good since the ptes are not always updated in an atomic
way. One would have to change set_pte to always be atomic. The reason
that I added get_pte_atomic was that you told me that this would fix the
PAE mode. I did not think too much about this but simply added it
according to your wish and it seemed to run fine. If you have any
complaints, complain to yourself.

> If there's a slight increase in the number of atomic operations
> in each i386 PAE page fault, well, I think the superiority of
> x86_64 makes that now an acceptable tradeoff.

Could we have PAE mode drop back to using the page_table_lock?

> Dismiss those suggestions if they'd just waste everyone's time.

They don't fix the PAE mode issue.

> Christoph has made some strides in correcting for other architectures
> e.g. update_mmu_cache within default ptep_cmpxchg's page_table_lock
> (probably correct but I can't be sure myself), and get_pte_atomic to
> get even i386 PAE pte correctly without page_table_lock; and reverted
> the pessimization of set_pte being always atomic on i386 PAE (but now
> I've forgotten and can't find the case where it needed to be atomic).

Well, this was another suggestion of yours that I followed. Turns out that
set_pte must be atomic for this to work! Look, I am no expert on
i386 PAE mode and I rely on others to check up on it. And you were
the expert.

> But no sign of get_pmd(atomic) or get_pud(atomic) to get the higher level
> entries - I thought we'd agreed they were also necessary on some arches?

I did not hear about that. Maybe you also sent that email to other people
instead?

> In its present state it is absurd: partly because Christoph seems to
> have forgotten the point of it, so after all the per-thread infrastructure,
> has ended up with do_anonymous_page saying mm->rss++, mm->anon_rss++.

Sorry that seems to have dropped out of the patch somehow. Here is the
fix:

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-11 09:16:34.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-12 15:49:45.000000000 -0800
@@ -1835,8 +1835,8 @@ do_anonymous_page(struct mm_struct *mm,
*/
page_add_anon_rmap(page, vma, addr);
lru_cache_add_active(page);
- mm->rss++;
- mm->anon_rss++;
+ current->rss++;
+ current->anon_rss++;
acct_update_integrals();
update_mem_hiwater();



> And partly because others at SGI have been working in the opposite
> direction, adding mysterious and tasteless acct_update_integrals
> and update_mem_hiwater calls. I say mysterious because there's

Yeah. I posted a patch to move that stuff out of the vm. No
good deed goes unpunished.

2005-01-13 00:11:28

by Nick Piggin

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Christoph Lameter wrote:
> On Thu, 13 Jan 2005, Nick Piggin wrote:
>
>
>>Note that this was with my ptl removal patches. I can't see why Christoph's
>>would have _any_ extra overhead as they are, but it looks to me like they're
>>lacking in atomic ops. So I'd expect something similar for Christoph's when
>>they're properly atomic.
>
>
> Pointer operations and word size operations are atomic. So this is mostly
> okay.
>
> The issue arises on architectures that have a large pte size than the
> wordsize. This is only on i386 PAE mode and S/390. S/390 falls back to
> the page table lock for these operations. PAE mode should do the same and
> not use atomic ops if they cannot be made to work in a reasonable manner.
>

Yep well you should be OK then. Your implementation has the advantage
that it only instantiates previously clear ptes... hmm, no I'm wrong,
your ptep_set_access_flags path modifies an existing pte. I think this
can cause subtle races in copy_page_range, and maybe other places,
can't it?

2005-01-13 00:19:52

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, 13 Jan 2005, Nick Piggin wrote:

> > Pointer operations and word size operations are atomic. So this is mostly
> > okay.
> >
> > The issue arises on architectures that have a large pte size than the
> > wordsize. This is only on i386 PAE mode and S/390. S/390 falls back to
> > the page table lock for these operations. PAE mode should do the same and
> > not use atomic ops if they cannot be made to work in a reasonable manner.
> >
>
> Yep well you should be OK then. Your implementation has the advantage
> that it only instantiates previously clear ptes... hmm, no I'm wrong,
> your ptep_set_access_flags path modifies an existing pte. I think this
> can cause subtle races in copy_page_range, and maybe other places,
> can't it?

ptep_set_access_flags is only used after acquiring the page_table_lock and
does not clear a pte. That is safe. The only critical thing is if a pte
would be cleared while holding the page_table_lock. That used to occur in
the swapper code but we modified that.

There is still an issue as Hugh rightly observed. One cannot rely on a
read of a pte/pud/pmd being atomic if the pte is > word size. This occurs
for all higher levels in handle_mm_fault. Thus we would need to either
acquire the page_table_lock for some architectures or provide primitives
get_pgd, get_pud etc that take the page_table_lock on PAE mode. ARGH.
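
A minimal sketch of the kind of primitive being discussed here; the name
get_pmd_atomic and the lock-based fallback are assumptions for illustration,
not part of the posted patches:

/* Illustration only: read a potentially multi-word pmd entry under the
 * page_table_lock, so a concurrent populate cannot produce a torn read.
 * Architectures with word-sized entries could define this as a plain load. */
#ifndef __HAVE_ARCH_GET_PMD_ATOMIC
#define get_pmd_atomic(__mm, __pmdp)                            \
({                                                              \
        pmd_t __val;                                            \
        spin_lock(&(__mm)->page_table_lock);                    \
        __val = *(__pmdp);                                      \
        spin_unlock(&(__mm)->page_table_lock);                  \
        __val;                                                  \
})
#endif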

2005-01-13 00:54:17

by Nick Piggin

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Christoph Lameter wrote:
> On Thu, 13 Jan 2005, Nick Piggin wrote:
>
>
>>>Pointer operations and word size operations are atomic. So this is mostly
>>>okay.
>>>
>>>The issue arises on architectures that have a large pte size than the
>>>wordsize. This is only on i386 PAE mode and S/390. S/390 falls back to
>>>the page table lock for these operations. PAE mode should do the same and
>>>not use atomic ops if they cannot be made to work in a reasonable manner.
>>>
>>
>>Yep well you should be OK then. Your implementation has the advantage
>>that it only instantiates previously clear ptes... hmm, no I'm wrong,
>>your ptep_set_access_flags path modifies an existing pte. I think this
>>can cause subtle races in copy_page_range, and maybe other places,
>>can't it?
>
>
> ptep_set_access_flags is only used after acquiring the page_table_lock and
> does not clear a pte. That is safe. The only critical thing is if a pte
> would be cleared while holding the page_table_lock. That used to occur in
> the swapper code but we modified that.
>

I mean what used to be the ptep_set_access_flags path, where you are
now modifying a pte without the ptl. However, after a second look, it
seems like that won't be a problem.

> There is still an issue as Hugh rightly observed. One cannot rely on a
> read of a pte/pud/pmd being atomic if the pte is > word size. This occurs
> for all higher levels in handle_mm_fault. Thus we would need to either
> acuire the page_table_lock for some architectures or provide primitives
> get_pgd, get_pud etc that take the page_table_lock on PAE mode. ARGH.
>

Yes I know. I would say that having arch-definable accessors for the
page tables wouldn't be a bad idea anyway, and the flexibility may
come in handy for other things.

It would be a big, annoying patch though :(

2005-01-13 02:54:28

by Hugh Dickins

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, 12 Jan 2005, Christoph Lameter wrote:
> On Wed, 12 Jan 2005, Hugh Dickins wrote:
>
> > Well, I studied the patches a bit more, and wrote
> > "That remark looks a bit unfair to me now I've looked closer."
> > Sorry. But I do still think it remains unsatisfactory."
>
> Well then thanks for not ccing me on the initial rant but a whole bunch of
> other people instead that you then did not send the following email too.
> Is this standard behavior on linux-mm?

I did cc you. What whole bunch of other people? The list of recipients
was the same, except (for obvious reasons) I added Jay the second time
(and having more time, spelt out most names in full).

Perhaps we've a misunderstanding: when I say "and wrote..." above,
I'm not quoting from some mail I sent others not you, I'm referring
to an earlier draft of the mail I'm then sending.

Or perhaps SGI has a spam filter which chose to gobble it up.
I'll try forwarding it to you again.

> > Might I save face by suggesting that it would be a lot clearer and
> > better if 1/1 got split into two? The first entirely concerned with
> > removing the spin_lock(&mm->page_table_lock) from handle_mm_fault,
> > and dealing with the consequences of that - moving the locking into
> > the allocating blocks, atomic getting of pud and pmd and pte,
> > passing the atomically-gotten orig_pte down to subfunctions
> > (which no longer expect page_table_lock held on entry) etc.
>
> That wont do any good since the pte's are not always updated in an atomic
> way. One would have to change set_pte to always be atomic.

You did have set_pte always atomic at one point, to the detriment of
(PAE) set_page_range. You rightly reverted that, but you've reminded
me of what I confessed to forgetting, where you do need set_pte_atomic
in various places, mainly (only?) the fault handlers in mm/memory.c.
And yes, I think you're right, that needs to be in this first patch.

> The reason
> that I added get_pte_atomic was that you told me that this would fix the
> PAE mode. I did not think too much about this but simply added it
> according to your wish and it seemed to run fine.

Please don't leave the thinking to me or anyone else.

> If you have any complaints, complain to yourself.

I'd better omit my response to that.

> > If there's a slight increase in the number of atomic operations
> > in each i386 PAE page fault, well, I think the superiority of
> > x86_64 makes that now an acceptable tradeoff.
>
> Could we have PAE mode drop back to using the page_table_lock?

That sounds like a simple and sensible alternative (to more atomics):
I haven't really thought it through, but if the default arch code is
right, and not an overhead, then why not use it for the PAE case instead
of cluttering things up with cleverness. Yes, I think that's a good idea:
anyone see why not?

> > Dismiss those suggestions if they'd just waste everyone's time.
>
> They dont fix the PAE mode issue.
>
> > Christoph has made some strides in correcting for other architectures
> > e.g. update_mmu_cache within default ptep_cmpxchg's page_table_lock
> > (probably correct but I can't be sure myself), and get_pte_atomic to
> > get even i386 PAE pte correctly without page_table_lock; and reverted
> > the pessimization of set_pte being always atomic on i386 PAE (but now
> > I've forgotten and can't find the case where it needed to be atomic).
>
> Well this was another suggestion of yours that I followed. Turns out that
> the set_pte must be atomic for this to work!

I didn't say you never needed an atomic set_pte, I said that making
set_pte always atomic (in the PAE case) unnecessarily slowed down
copy_page_range and zap_pte_range. Probably a misunderstanding.

> Look I am no expert on the
> i386 PAE mode and I rely on other for this to check up on it. And you were
> the expert.

Expert? I was trying to help, but you seem to resent that.

> > But no sign of get_pmd(atomic) or get_pud(atomic) to get the higher level
> > entries - I thought we'd agreed they were also necessary on some arches?
>
> I did not hear about that. Maybe you also sent that email to other people
> instead?

No, you were cc'ed on that one too (Sun, 12 Dec to Nick Piggin).
The spam filter again. Not that I have total recall of every
exchange about these patches either.

Hugh

2005-01-13 03:10:45

by Hugh Dickins

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, 13 Jan 2005, Nick Piggin wrote:
> Andrew Morton wrote:
>
> Note that this was with my ptl removal patches. I can't see why Christoph's
> would have _any_ extra overhead as they are, but it looks to me like they're
> lacking in atomic ops. So I'd expect something similar for Christoph's when
> they're properly atomic.
>
> > Look, -7% on a 2-way versus +700% on a many-way might well be a tradeoff we
> > agree to take. But we need to fully understand all the costs and benefits.
>
> I think copy_page_range is the one to keep an eye on.

Christoph's patches currently lack set_pte_atomics in the fault handlers, yes.
But I don't see why they should need set_pte_atomics in copy_page_range
(which is why I persuaded him to drop forcing set_pte to atomic).

dup_mmap has down_write of the src mmap_sem, keeping out any faults on
that. copy_pte_range has spin_lock of the dst page_table_lock and the
src page_table_lock, keeping swapout away from those. Why would atomic
set_ptes be needed there? Probably in yours, but not in Christoph's.

Hugh

2005-01-13 03:18:14

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

> There is still an issue as Hugh rightly observed. One cannot rely on a
> read of a pte/pud/pmd being atomic if the pte is > word size. This occurs
> for all higher levels in handle_mm_fault. Thus we would need to either
> acuire the page_table_lock for some architectures or provide primitives
> get_pgd, get_pud etc that take the page_table_lock on PAE mode. ARGH.
>

Alternatively you can use a lazy load, checking for changes.
(untested)

pte_t read_pte(volatile pte_t *pte)
{
pte_t n;
do {
n.pte_low = pte->pte_low;
rmb();
n.pte_high = pte->pte_high;
rmb();
} while (n.pte_low != pte->pte_low);
return pte;
}

No atomic operations, I bet it's actually faster than the cmpxchg8.
There is a small risk of livelock, but not much worse than with an
ordinary spinlock.

Not that I get what you want it for exactly - the content
of the pte could change at any time when you don't hold the page_table_lock, right?

-Andi

2005-01-13 03:52:12

by Nick Piggin

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Hugh Dickins wrote:
> On Thu, 13 Jan 2005, Nick Piggin wrote:
>
>>Andrew Morton wrote:
>>
>>Note that this was with my ptl removal patches. I can't see why Christoph's
>>would have _any_ extra overhead as they are, but it looks to me like they're
>>lacking in atomic ops. So I'd expect something similar for Christoph's when
>>they're properly atomic.
>>
>>
>>>Look, -7% on a 2-way versus +700% on a many-way might well be a tradeoff we
>>>agree to take. But we need to fully understand all the costs and benefits.
>>
>>I think copy_page_range is the one to keep an eye on.
>
>
> Christoph's currently lack set_pte_atomics in the fault handlers, yes.
> But I don't see why they should need set_pte_atomics in copy_page_range
> (which is why I persuaded him to drop forcing set_pte to atomic).
>
> dup_mmap has down_write of the src mmap_sem, keeping out any faults on
> that. copy_pte_range has spin_lock of the dst page_table_lock and the
> src page_table_lock, keeping swapout away from those. Why would atomic
> set_ptes be needed there? Probably in yours, but not in Christoph's.
>

I was more thinking of atomic pte reads there. I had for some reason
thought that dup_mmap only had a down_read of the mmap_sem. But even if
it did only down_read, a further look showed this wouldn't have been a
problem for Christoph anyway. That dim light-bulb probably changes things
for my patches too; I may be able to do copy_page_range with fewer atomics.

I'm still not too sure that all places read the pte atomically where needed.
But presently this is not a really big concern because it only would
really slow down i386 PAE if anything.

2005-01-13 17:10:57

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, 13 Jan 2005, Hugh Dickins wrote:

> I did cc you. What whole bunch of other people? The list of recipients
> was the same, except (for obvious reasons) I added Jay the second time
> (and having more time, spelt out most names in full).
>
> Or perhaps SGI has a spam filter which chose to gobble it up.
> I'll try forwarding it to you again.

Yes sorry it was the spam filter. I got my copy from linux-mm just fine on
another account.

2005-01-13 17:18:28

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Wed, 13 Jan 2005, Andi Kleen wrote:

> Alternatively you can use a lazy load, checking for changes.
> (untested)
>
> pte_t read_pte(volatile pte_t *pte)
> {
> pte_t n;
> do {
> n.pte_low = pte->pte_low;
> rmb();
> n.pte_high = pte->pte_high;
> rmb();
> } while (n.pte_low != pte->pte_low);
> return pte;
> }
>
> No atomic operations, I bet it's actually faster than the cmpxchg8.
> There is a small risk for livelock, but not much worse than with an
> ordinary spinlock.

Hmm.... This may replace the read of a 64 bit value. But there could still
be another process that is setting the pte in a non-atomic way.

> Not that I get it what you want it for exactly - the content
> of the pte could change any time when you don't hold page_table_lock, right?

The content of the pte can change anytime the page_table_lock is held and
it may change from cleared to a value through a cmpxchg while the lock is
not held.

2005-01-13 17:18:58

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, 13 Jan 2005, Nick Piggin wrote:

> I'm still not too sure that all places read the pte atomically where needed.
> But presently this is not a really big concern because it only would
> really slow down i386 PAE if anything.

S/390 is also affected. And I vaguely recall special issues with sparc
too. That is why I dropped the arch support for that from the patchset a
long time ago. Then there was some talk a couple of months back about using
another addressing mode on IA64 that may also require 128-bit ptes. There
are significantly different ways of doing optimal SMP locking for these
scenarios.

2005-01-13 17:27:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview



On Thu, 13 Jan 2005, Christoph Lameter wrote:
>
> On Wed, 13 Jan 2005, Andi Kleen wrote:
>
> > Alternatively you can use a lazy load, checking for changes.
> > (untested)
> >
> > pte_t read_pte(volatile pte_t *pte)
> > {
> > pte_t n;
> > do {
> > n.pte_low = pte->pte_low;
> > rmb();
> > n.pte_high = pte->pte_high;
> > rmb();
> > } while (n.pte_low != pte->pte_low);
> > return pte;
> > }
> >
> > No atomic operations, I bet it's actually faster than the cmpxchg8.
> > There is a small risk for livelock, but not much worse than with an
> > ordinary spinlock.
>
> Hmm.... This may replace the get of a 64 bit value. But here could still
> be another process that is setting the pte in a non-atomic way.

There's a nice standard way of doing that, namely sequence numbers.

However, most of the time it isn't actually faster than just getting the
lock. There are two real costs in getting a lock: serialization and cache
bouncing. The ordering often requires _more_ serialization than a
lock/unlock sequence, so sequences like the above are often slower than
the trivial lock is, at least in the absence of lock contention.

So sequence numbers (or multiple reads) only tend to make sense where there
are a _lot_ more reads than writes, and where you get lots of lock
contention. If there are lots of writes, my gut feel (but hey, all locking
optimization should be backed up by real numbers) is that it's better to
have a lock close to the data, since you'll get the cacheline bounces
_anyway_, and locking often has lower serialization costs.
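
A minimal illustration of the sequence-number pattern, written with the
kernel's seqlock primitives; the per-page-table seqlock used here is
hypothetical and not something the patches introduce:

#include <linux/seqlock.h>

static seqlock_t pte_seq = SEQLOCK_UNLOCKED;    /* hypothetical lock */

/* Reader: retry if a writer updated the (possibly multi-word) pte meanwhile. */
static pte_t read_pte_seq(pte_t *ptep)
{
        pte_t pte;
        unsigned seq;

        do {
                seq = read_seqbegin(&pte_seq);
                pte = *ptep;            /* may be a two-word read on PAE */
        } while (read_seqretry(&pte_seq, seq));
        return pte;
}

/* Writers bracket their updates with write_seqlock()/write_sequnlock(). */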

Linus

2005-01-13 18:04:23

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, Jan 13, 2005 at 09:11:29AM -0800, Christoph Lameter wrote:
> On Wed, 13 Jan 2005, Andi Kleen wrote:
>
> > Alternatively you can use a lazy load, checking for changes.
> > (untested)
> >
> > pte_t read_pte(volatile pte_t *pte)
> > {
> > pte_t n;
> > do {
> > n.pte_low = pte->pte_low;
> > rmb();
> > n.pte_high = pte->pte_high;
> > rmb();
> > } while (n.pte_low != pte->pte_low);
> > return pte;

It should be return n; here of course.

> > }
> >
> > No atomic operations, I bet it's actually faster than the cmpxchg8.
> > There is a small risk for livelock, but not much worse than with an
> > ordinary spinlock.
>
> Hmm.... This may replace the get of a 64 bit value. But here could still
> be another process that is setting the pte in a non-atomic way.

The rule in i386/x86-64 is that you cannot set the PTE in a non-atomic way
when its present bit is set (because the hardware could asynchronously
change bits in the PTE that would get lost). Atomic here means clearing
first and then replacing in an atomic operation.

This helps you because you shouldn't be looking at the pte anyway
when pte_present is false. When it is not false it is always updated
atomically.
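
A hedged sketch of that sequence for i386 PAE (the helper name is made up;
note that it deliberately leaves a window in which the entry is not present,
which is exactly the window the fault-handler race discussed below is about):

/* Illustration only: replace a live PAE pte by first making it not
 * present with a single 32-bit store, then installing the new value. */
static inline void pae_replace_pte(pte_t *ptep, pte_t new)
{
        ptep->pte_low = 0;              /* present bit clears atomically */
        smp_wmb();
        ptep->pte_high = new.pte_high;  /* entry is not present here */
        smp_wmb();
        ptep->pte_low = new.pte_low;    /* entry becomes visible last */
}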

-Andi

2005-01-13 18:25:04

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, 13 Jan 2005, Andi Kleen wrote:

> The rule in i386/x86-64 is that you cannot set the PTE in a non atomic way
> when its present bit is set (because the hardware could asynchronously
> change bits in the PTE that would get lost). Atomic way means clearing
> first and then replacing in an atomic operation.

Hmm. I replaced that portion in the swapper with an xchg operation
and inspect the result later. Clearing a pte and then setting it to
something would open a window for the page fault handler to set up a new
pte there since it does not take the page_table_lock. That xchg must be
atomic for PAE mode to work then.

> This helps you because you shouldn't be looking at the pte anyways
> when pte_present is false. When it is not false it is always updated
> atomically.

so pmd_present, pud_none and pgd_none could be considered atomic even if
the pm/u/gd_t is a multi-word entity? In that case the current approach
would work for higher level entities and in particular S/390 would be in
the clear.

But then the issues of replacing multi-word ptes on i386 PAE remain. If no
write lock is held on mmap_sem then all writes to pte's must be atomic in
order for the get_pte_atomic operation to work reliably.

2005-01-13 20:23:09

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, Jan 13, 2005 at 10:16:58AM -0800, Christoph Lameter wrote:
> On Thu, 13 Jan 2005, Andi Kleen wrote:
>
> > The rule in i386/x86-64 is that you cannot set the PTE in a non atomic way
> > when its present bit is set (because the hardware could asynchronously
> > change bits in the PTE that would get lost). Atomic way means clearing
> > first and then replacing in an atomic operation.
>
> Hmm. I replaced that portion in the swapper with an xchg operation
> and inspect the result later. Clearing a pte and then setting it to
> something would open a window for the page fault handler to set up a new

Yes, it usually assumes the page table lock is held.

> pte there since it does not take the page_table_lock. That xchg must be
> atomic for PAE mode to work then.

You can always use cmpxchg8 for that if you want. Just to make
it really atomic you may need a LOCK prefix, and with that the
cost is not much lower than a real spinlock.
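
A hedged sketch of such a locked 64-bit exchange built on cmpxchg8b
(illustration only; xchg64 is a made-up name, not a helper from the patches):

/* Illustration only: 64-bit xchg for i386 PAE via lock; cmpxchg8b.
 * The initial read may be torn, but the cmpxchg8b loop corrects that. */
static inline unsigned long long xchg64(volatile unsigned long long *p,
                                        unsigned long long new)
{
        unsigned long long old = *p, prev;

        for (;;) {
                asm volatile("lock; cmpxchg8b %1"
                             : "=A" (prev), "+m" (*p)
                             : "b" ((unsigned long)new),
                               "c" ((unsigned long)(new >> 32)),
                               "0" (old)
                             : "memory");
                if (prev == old)
                        return prev;
                old = prev;     /* lost the race; retry with updated value */
        }
}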


>
> > This helps you because you shouldn't be looking at the pte anyways
> > when pte_present is false. When it is not false it is always updated
> > atomically.
>
> so pmd_present, pud_none and pgd_none could be considered atomic even if
> the pm/u/gd_t is a multi-word entity? In that case the current approach

The optimistic read function I posted would do this.

But you have to read multiple entries anyways, which could get
non atomic no? (e.g. to do something on a PTE you always need
to read PGD/PUD/PMD)

In theory you could do this lazily with retries too, but it would probably be
somewhat costly and complicated.

> would work for higher level entities and in particular S/390 would be in
> the clear.
>
> But then the issues of replacing multi-word ptes on i386 PAE remain. If no
> write lock is held on mmap_sem then all writes to pte's must be atomic in

mmap_sem is only for VMAs. The page tables themselves are protected by the
page table lock.

> order for the get_pte_atomic operation to work reliably.

-Andi

2005-01-13 22:22:29

by Peter Chubb

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview


Nick Piggin wrote:

Nick> I would say that having arch-definable accessors for
Nick> the page tables wouldn't be a bad idea anyway, and the
Nick> flexibility may come in handy for other things.

Nick> It would be a big, annoying patch though :(

We're currently working in a slightly different direction, to try to
hide page-table implementation details from anything outside the page table
implementation. Our goal is to be able to try out other page tables
(e.g., Liedtke's guarded page table) instead of the 2/3/4 level fixed
hierarchy.

We're currently working on a 2.6.10 snapshot; obviously we'll have to
roll up to 2.6.11 before releasing (and there are lots of changes
there because of the recent 4-layer page table implementation).

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*

2005-01-14 01:13:12

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, 13 Jan 2005, Andi Kleen wrote:

> On Thu, Jan 13, 2005 at 09:11:29AM -0800, Christoph Lameter wrote:
> > On Wed, 13 Jan 2005, Andi Kleen wrote:
> >
> > > Alternatively you can use a lazy load, checking for changes.
> > > (untested)
> > >
> > > pte_t read_pte(volatile pte_t *pte)
> > > {
> > > pte_t n;
> > > do {
> > > n.pte_low = pte->pte_low;
> > > rmb();
> > > n.pte_high = pte->pte_high;
> > > rmb();
> > > } while (n.pte_low != pte->pte_low);
> > > return pte;

I think this is not necessary. Most IA32 processors do 64
bit operations in an atomic way in the same way as IA64. We can cut out
all the stuff we put in to simulate 64 bit atomicity for i386 PAE mode if
we just convince the compiler to use 64 bit fetches and stores. 486
cpus and earlier are the only ones unable to do 64 bit atomic ops but
those won't be able to use PAE mode anyhow.

Page 231 of Volume 3 of the Intel IA32 manual states regarding atomicity
of operations:

7.1.1. Guaranteed Atomic Operations

The Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors
guarantee that the following basic memory operations will always be
carried out atomically:

o reading or writing a byte
o reading or writing a word aligned on a 16-bit boundary
o reading or writing a doubleword aligned on a 32-bit boundary

The Pentium 4, Intel Xeon, P6 family, and Pentium processors guarantee
that the following additional memory operations will always be carried out
atomically:

o reading or writing a quadword aligned on a 64-bit boundary
o 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors guarantee that the following additional memory
operation will always be carried out atomically:

o unaligned 16-, 32-, and 64-bit accesses to cached memory that fit
within a 32-byte cache line

.... off to look for 64bit store and load instructions in the intel
manuals. I feel much better about keeping the existing approach.

2005-01-14 03:41:55

by Roman Zippel

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Hi,

Christoph Lameter wrote:

> Introduction of the cmpxchg is one atomic operations that replaces the two
> spinlock ops typically necessary in an unpatched kernel. Obtaining the
> spinlock requires an spinlock (which is an atomic operation) and then the
> release involves a barrier. So there is a net win for all SMP cases as far
> as I can see.

But there might be a loss in the UP case. Spinlocks are optimized away,
but your cmpxchg emulation enables/disables interrupts with every access.

bye, Roman

2005-01-14 04:14:34

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Fri, Jan 14, 2005 at 04:39:16AM +0100, Roman Zippel wrote:
> Hi,
>
> Christoph Lameter wrote:
>
> >Introduction of the cmpxchg is one atomic operations that replaces the two
> >spinlock ops typically necessary in an unpatched kernel. Obtaining the
> >spinlock requires an spinlock (which is an atomic operation) and then the
> >release involves a barrier. So there is a net win for all SMP cases as far
> >as I can see.
>
> But there might be a loss in the UP case. Spinlocks are optimized away,
> but your cmpxchg emulation enables/disables interrupts with every access.

Only for 386s, and STI/CLI is quite cheap there.

-Andi
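
For reference, such an emulation typically boils down to something like
the sketch below (an illustration of the technique, not the patch's
exact code); on a UP 386 disabling interrupts is what makes the
compare-and-store appear atomic, and that is the STI/CLI cost being
discussed:

	static inline unsigned long emulated_cmpxchg(volatile unsigned long *p,
						     unsigned long old,
						     unsigned long new)
	{
		unsigned long flags, prev;

		local_irq_save(flags);
		prev = *p;
		if (prev == old)
			*p = new;
		local_irq_restore(flags);
		return prev;
	}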

2005-01-14 04:39:51

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, Jan 13, 2005 at 05:09:04PM -0800, Christoph Lameter wrote:
> On Thu, 13 Jan 2005, Andi Kleen wrote:
>
> > On Thu, Jan 13, 2005 at 09:11:29AM -0800, Christoph Lameter wrote:
> > > On Wed, 13 Jan 2005, Andi Kleen wrote:
> > >
> > > > Alternatively you can use a lazy load, checking for changes.
> > > > (untested)
> > > >
> > > > pte_t read_pte(volatile pte_t *pte)
> > > > {
> > > > pte_t n;
> > > > do {
> > > > n.pte_low = pte->pte_low;
> > > > rmb();
> > > > n.pte_high = pte->pte_high;
> > > > rmb();
> > > > } while (n.pte_low != pte->pte_low);
> > > > return pte;
>
> I think this is not necessary. Most IA32 processors do 64
> bit operations in an atomic way in the same way as IA64. We can cut out
> all the stuff we put in to simulate 64 bit atomicity for i386 PAE mode if
> we just use convince the compiler to use 64 bit fetches and stores. 486

That would mean either cmpxchg8 (slow) or using MMX/SSE (even slower
because you would need to save FPU state and disable
exceptions).

I think FPU is far too slow and complicated. I benchmarked lazy read
and cmpxchg8:

Athlon64:
readpte hot 42
readpte cold 426
readpte_cmp hot 33
readpte_cmp cold 2693

Nocona:
readpte hot 140
readpte cold 960
readpte_cmp hot 48
readpte_cmp cold 2668

As you can see cmpxchg is slightly faster for the cache hot case,
but incredibly slow for cache cold (probably because it does something
nasty on the bus). This is pretty consistent to Intel and AMD CPUs.
Given that page tables are likely more often cache cold than hot
I would use the lazy variant.

-Andi


2005-01-14 04:52:30

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview II

Andi Kleen <[email protected]> writes:
> As you can see cmpxchg is slightly faster for the cache hot case,
> but incredibly slow for cache cold (probably because it does something
> nasty on the bus). This is pretty consistent to Intel and AMD CPUs.
> Given that page tables are likely more often cache cold than hot
> I would use the lazy variant.

Sorry, my benchmark program actually had a bug (first loop included
page faults). Here are updated numbers. They are somewhat different:

Athlon 64:
readpte hot 25
readpte cold 171
readpte_cmp hot 18
readpte_cmp cold 162

Nocona:
readpte hot 118
readpte cold 443
readpte_cmp hot 22
readpte_cmp cold 224

The difference is much smaller here. Assuming cache cold accesses, cmpxchg8b is
better, at least on the Intel CPUs, which have a slow rmb().

-Andi
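
The cmpxchg8b based read being benchmarked presumably looks something
like the sketch below (an assumption about the benchmark, not Andi's
actual code). The lock prefix is what makes it safe against a concurrent
set_pte - without it the instruction could store a stale zero over a
just-installed entry - but it is also what hurts the cache cold case:

	static inline unsigned long long readpte_cmp(unsigned long long *p)
	{
		unsigned long long old = 0;

		/*
		 * cmpxchg8b compares edx:eax with the memory operand.
		 * On a mismatch it loads the current value into edx:eax;
		 * on a match it stores ecx:ebx (also zero here), leaving
		 * a zero entry unchanged. Either way "old" ends up
		 * holding the current 64 bit value.
		 */
		asm volatile("lock; cmpxchg8b %1"
			     : "+A" (old), "+m" (*p)
			     : "b" (0), "c" (0));
		return old;
	}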

2005-01-14 04:55:22

by Nick Piggin

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Fri, 2005-01-14 at 05:39 +0100, Andi Kleen wrote:

> As you can see cmpxchg is slightly faster for the cache hot case,
> but incredibly slow for cache cold (probably because it does something
> nasty on the bus). This is pretty consistent to Intel and AMD CPUs.
> Given that page tables are likely more often cache cold than hot
> I would use the lazy variant.
>

I have a question about your trickery with the read_pte function ;)

pte_t read_pte(volatile pte_t *pte)
{
pte_t n;
do {
n.pte_low = pte->pte_low;
rmb();
n.pte_high = pte->pte_high;
rmb();
} while (n.pte_low != pte->pte_low);
return pte;
}

Versus the existing set_pte function. Presumably the order here
can't be changed otherwise you could set the present bit before
the high bit, and race with the hardware MMU?

static inline void set_pte(pte_t *ptep, pte_t pte)
{
ptep->pte_high = pte.pte_high;
smp_wmb();
ptep->pte_low = pte.pte_low;
}

Now take the following interleaving:

CPU0 read_pte                               CPU1 set_pte
n.pte_low = pte->pte_low;
rmb();
                                            ptep->pte_high = pte.pte_high;
                                            smp_wmb();
n.pte_high = pte->pte_high;
rmb();
while (n.pte_low != pte->pte_low);
return pte;
                                            ptep->pte_low = pte.pte_low;

So I think you can get a non atomic result. Are you relying on
assumptions about the value of pte_low not causing any problems
in the page fault handler?

Or am I missing something?



2005-01-14 04:59:29

by Nick Piggin

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview II

On Fri, 2005-01-14 at 05:52 +0100, Andi Kleen wrote:
> Andi Kleen <[email protected]> writes:
> > As you can see cmpxchg is slightly faster for the cache hot case,
> > but incredibly slow for cache cold (probably because it does something
> > nasty on the bus). This is pretty consistent to Intel and AMD CPUs.
> > Given that page tables are likely more often cache cold than hot
> > I would use the lazy variant.
>
> Sorry, my benchmark program actually had a bug (first loop included
> page faults). Here are updated numbers. They are somewhat different:
>
> Athlon 64:
> readpte hot 25
> readpte cold 171
> readpte_cmp hot 18
> readpte_cmp cold 162
>
> Nocona:
> readpte hot 118
> readpte cold 443
> readpte_cmp hot 22
> readpte_cmp cold 224
>
> The difference is much smaller here. Assuming cache cold cmpxchg8b is
> better, at least on the Intel CPUs which have a slow rmb().
>

I have a question for the x86 gurus. We're currently using the lock
prefix for set_64bit. This will lock the bus for the RMW cycle, but
is it a prerequisite for the atomic 64-bit store? Even on UP?




2005-01-14 10:46:29

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Fri, Jan 14, 2005 at 03:54:59PM +1100, Nick Piggin wrote:
> On Fri, 2005-01-14 at 05:39 +0100, Andi Kleen wrote:
>
> > As you can see cmpxchg is slightly faster for the cache hot case,
> > but incredibly slow for cache cold (probably because it does something
> > nasty on the bus). This is pretty consistent to Intel and AMD CPUs.
> > Given that page tables are likely more often cache cold than hot
> > I would use the lazy variant.
> >
>
> I have a question about your trickery with the read_pte function ;)
>
> pte_t read_pte(volatile pte_t *pte)
> {
> pte_t n;
> do {
> n.pte_low = pte->pte_low;
> rmb();
> n.pte_high = pte->pte_high;
> rmb();
> } while (n.pte_low != pte->pte_low);
> return pte;
> }
>
> Versus the existing set_pte function. Presumably the order here
> can't be changed otherwise you could set the present bit before
> the high bit, and race with the hardware MMU?

The hardware MMU only ever adds some bits (D etc.). Never changes
the address. It won't clear P bits. The page fault handler also doesn't
clear them, only the swapper does. With that knowledge you could probably
do some optimizations.


> So I think you can get a non atomic result. Are you relying on
> assumptions about the value of pte_low not causing any problems
> in the page fault handler?

I don't know. You have to ask Christopher L. I only commented
on one subthread where he asked about atomic pte reading,
but haven't studied his patches in detail.

-Andi

2005-01-14 10:47:42

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview II

> I have a question for the x86 gurus. We're currently using the lock
> prefix for set_64bit. This will lock the bus for the RMW cycle, but
> is it a prerequisite for the atomic 64-bit store? Even on UP?

An atomic 64bit store doesn't need a lock prefix. A cmpxchg will
need to though. Note that UP kernels define LOCK to nothing.

-Andi
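
The UP case works because the LOCK prefix used by the atomic helpers is
defined away on uniprocessor builds, roughly like this (paraphrased from
the i386 headers of that era; the literal lowercase "lock" in
__set_64bit discussed below is a separate question):

	#ifdef CONFIG_SMP
	#define LOCK "lock ; "
	#else
	#define LOCK ""
	#endif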

2005-01-14 10:57:26

by Nick Piggin

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview II

Andi Kleen wrote:
>>I have a question for the x86 gurus. We're currently using the lock
>>prefix for set_64bit. This will lock the bus for the RMW cycle, but
>>is it a prerequisite for the atomic 64-bit store? Even on UP?
>
>
> An atomic 64bit store doesn't need a lock prefix. A cmpxchg will
> need to though.

Are you sure the cmpxchg8b needs a lock prefix? Sure it does to
get the proper "atomic cmpxchg" semantics, but what about a
simple 64-bit store... If it boils down to 8 byte load, 8 byte
store on the memory bus, and that store is atomic, then maybe
a lock isn't needed at all?

I think when emulating a *load*, then the lock is needed, because
otherwise the subsequent store may overwrite some value that has
just been stored by another processor.... but for a store I'm not
so sure.

> Note that UP kernels define LOCK to nothing.
>

Yes. In this case (include/asm-i386/system.h:__set_64bit), it
is using lowercase lock, which I think is not defined away,
right?


2005-01-14 11:11:27

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview II

On Fri, Jan 14, 2005 at 09:57:16PM +1100, Nick Piggin wrote:
> Andi Kleen wrote:
> >>I have a question for the x86 gurus. We're currently using the lock
> >>prefix for set_64bit. This will lock the bus for the RMW cycle, but
> >>is it a prerequisite for the atomic 64-bit store? Even on UP?
> >
> >
> >An atomic 64bit store doesn't need a lock prefix. A cmpxchg will
> >need to though.
>
> Are you sure the cmpxchg8b need a lock prefix? Sure it does to

If you want it to be atomic on SMP then yes.

> get the proper "atomic cmpxchg" semantics, but what about a
> simple 64-bit store... If it boils down to 8 byte load, 8 byte

A 64bit store with a 64bit store instruction is atomic. But
to do that on 32bit x86 you need SSE/MMX (not an option in the kernel)
or cmpxchg8.

> store on the memory bus, and that store is atomic, then maybe
> a lock isn't needed at all?

More complex operations than store or load are not atomic without
LOCK (and not all operations can have a lock prefix). There are a few
instructions with implicit lock. If you want the gory details read
chapter 7 in the IA32 Software Developer's Manual Volume 3.

-Andi

2005-01-14 12:03:14

by Roman Zippel

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

Hi,

On Fri, 14 Jan 2005, Andi Kleen wrote:

> > But there might be a loss in the UP case. Spinlocks are optimized away,
> > but your cmpxchg emulation enables/disables interrupts with every access.
>
> Only for 386s and STI/CLI is quite cheap there.

But it's still not free and what about other archs? Why not just check
__HAVE_ARCH_CMPXCHG and provide a replacement, which is guaranteed cheaper
if no interrupt synchronisation is needed.

bye, Roman

2005-01-14 16:55:08

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Thu, 14 Jan 2005, Andi Kleen wrote:

> > I think this is not necessary. Most IA32 processors do 64
> > bit operations in an atomic way in the same way as IA64. We can cut out
> > all the stuff we put in to simulate 64 bit atomicity for i386 PAE mode if
> > we just use convince the compiler to use 64 bit fetches and stores. 486
>
> That would mean either cmpxchg8 (slow) or using MMX/SSE (even slower
> because you would need to save FPU stable and disable
> exceptions).

It is strange that the instruction set does not contain some simple 64bit
store or load, and the FPU state seems to be complex to manage... sigh.

Looked at arch/i386/lib/mmx.c. It avoids the mmx ops in an interrupt
context, but the rest of the prep for mmx only saves the fpu state if it's
in use. So that code would only be used rarely. The mmx 64 bit
instructions seem to be quite fast according to the manual: double the
cycles of the 32 bit instructions on Pentium M (somewhat higher on Pentium 4).

One could simply do a movq.
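
Conceptually such a movq based load would look like the sketch below
(illustrative only - the function name is made up, and the CR0.TS,
preemption and FPU state handling that makes this expensive in the
kernel, as the following messages explain, is deliberately left out):

	static inline unsigned long long movq_read(const unsigned long long *p)
	{
		unsigned long long v;

		/*
		 * 64 bit load and store through an MMX register; emms
		 * resets the FPU tag word so later x87 code is not
		 * confused. Real kernel use would need
		 * kernel_fpu_begin()-style preparation around it.
		 */
		asm volatile("movq %1, %%mm0\n\t"
			     "movq %%mm0, %0\n\t"
			     "emms"
			     : "=m" (v)
			     : "m" (*p));
		return v;
	}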

2005-01-14 17:01:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview II

On Fri, 14 Jan 2005, Andi Kleen wrote:

> > Are you sure the cmpxchg8b need a lock prefix? Sure it does to
>
> If you want it to be atomic on SMP then yes.
>
> > get the proper "atomic cmpxchg" semantics, but what about a
> > simple 64-bit store... If it boils down to 8 byte load, 8 byte
>
> A 64bit store with a 64bit store instruction is atomic. But
> to do that on 32bit x86 you need SSE/MMX (not an option in the kernel)
> or cmpxchg8
>
> > store on the memory bus, and that store is atomic, then maybe
> > a lock isn't needed at all?
>
> More complex operations than store or load are not atomic without
> LOCK (and not all operations can have a lock prefix). There are a few
> instructions with implicit lock. If you want the gory details read
> chapter 7 in the IA32 Software Developer's Manual Volume 3.

It needs a lock prefix. Volume 2 of the IA32 manual states on page 150
regarding cmpxchg (Note that the atomicity mentioned here seems to apply
to the complete instruction, not the 64 bit fetches and stores):


This instruction can be used with a LOCK prefix to allow the instruction
to be executed atomically. To simplify the interface to the processor's
bus, the destination operand receives a write cycle without regard to the
result of the comparison. The destination operand is written back if the
comparison fails; otherwise, the source operand is written into the
destination. (The processor never produces a locked read without also
producing a locked write.)

2005-01-14 17:03:26

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

> Looked at arch/i386/lib/mmx.c. It avoids the mmx ops in an interrupt
> context but the rest of the prep for mmx only saves the fpu state if its
> in use. So that code would only be used rarely. The mmx 64 bit
> instructions seem to be quite fast according to the manual. Double the
> cycles than the 32 bit instructions on Pentium M (somewhat higher on Pentium 4).

With all the other overhead (disabling exceptions, saving registers etc.)
it will likely be slower. Also you would need fallback paths for CPUs
without MMX but with PAE (like Ppro). You can benchmark
it if you want, but I wouldn't be very optimistic.

-Andi

2005-01-14 17:09:28

by Christoph Lameter

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Fri, 14 Jan 2005, Andi Kleen wrote:

> > Looked at arch/i386/lib/mmx.c. It avoids the mmx ops in an interrupt
> > context but the rest of the prep for mmx only saves the fpu state if its
> > in use. So that code would only be used rarely. The mmx 64 bit
> > instructions seem to be quite fast according to the manual. Double the
> > cycles than the 32 bit instructions on Pentium M (somewhat higher on Pentium 4).
>
> With all the other overhead (disabling exceptions, saving register etc.)
> will be likely slower. Also you would need fallback paths for CPUs
> without MMX but with PAE (like Ppro). You can benchmark
> it if you want, but I wouldn't be very optimistic.

So the PentiumPro is a cpu with atomic 64 bit operations in a cmpxchg but
no instruction to do an atomic 64 bit store or load although the
architecture conceptually supports 64bit atomic stores and loads? Wild.

2005-01-14 17:13:53

by Andi Kleen

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview

On Fri, Jan 14, 2005 at 09:08:54AM -0800, Christoph Lameter wrote:
> On Fri, 14 Jan 2005, Andi Kleen wrote:
>
> > > Looked at arch/i386/lib/mmx.c. It avoids the mmx ops in an interrupt
> > > context but the rest of the prep for mmx only saves the fpu state if its
> > > in use. So that code would only be used rarely. The mmx 64 bit
> > > instructions seem to be quite fast according to the manual. Double the
> > > cycles than the 32 bit instructions on Pentium M (somewhat higher on Pentium 4).
> >
> > With all the other overhead (disabling exceptions, saving register etc.)
> > will be likely slower. Also you would need fallback paths for CPUs
> > without MMX but with PAE (like Ppro). You can benchmark
> > it if you want, but I wouldn't be very optimistic.
>
> So the PentiumPro is a cpu with atomic 64 bit operations in a cmpxchg but
> no instruction to do an atomic 64 bit store or load although the
> architecture conceptually supports 64bit atomic stores and loads? Wild.

It can do 64bit x87 FP loads/stores. But I doubt that is what you're
looking for.

-Andi

2005-01-14 17:43:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: page table lock patch V15 [0/7]: overview



On Fri, 14 Jan 2005, Andi Kleen wrote:
>
> With all the other overhead (disabling exceptions, saving register etc.)
> will be likely slower. Also you would need fallback paths for CPUs
> without MMX but with PAE (like Ppro). You can benchmark
> it if you want, but I wouldn't be very optimistic.

We could just say that PAE requires MMX. Quite frankly, if you have a
PPro, you probably don't need PAE anyway - I don't see a whole lot of
people that spent huge amounts of money on memory and CPU (a PPro that had
more than 4GB in it was _quite_ expensive at the time) who haven't
upgraded to a PII by now..

IOW, the overlap of "really needs PAE" and "doesn't have MMX" is probably
effectively zero.

That said, you're probably right in that it probably _is_ expensive enough
that it doesn't help. Even if the process doesn't use FP/MMX (so that you
can avoid the overhead of state save/restore), you need to

- disable preemption
- clear "TS" (pretty expensive in itself, since it touches CR0)
- .. do any operations ..
- set "TS" (again, CR0)
- enable preemption

so it's likely a thousand cycles minimum on a P4 (I'm just assuming that
the P4 will serialize on CR0 accesses, which implies that it's damn
expensive), and possibly a hundred on other x86 implementations.

That's in the noise for something that does a full page table copy, but it
likely makes using MMX for single page table entries a total loss.

Linus
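
In kernel terms the sequence Linus lists corresponds roughly to what a
kernel_fpu_begin()/kernel_fpu_end() pair does when the task has no live
FPU state to save (a sketch under that assumption, not a drop-in
implementation; the function name is made up):

	static inline void mmx_pte_window(void)
	{
		preempt_disable();
		clts();		/* clear CR0.TS so an MMX movq does not trap */

		/* ... the movq based page table access would go here ... */

		stts();		/* set CR0.TS again */
		preempt_enable();
	}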

2005-01-28 20:39:44

by Christoph Lameter

[permalink] [raw]
Subject: page fault scalability patch V16 [1/4]: avoid intermittent clearing of ptes

The current way of updating ptes in the Linux vm includes first clearing
a pte before setting it to another value. The clearing is performed while
holding the page_table_lock to ensure that the entry has not been modified
by the CPU directly, by an arch specific interrupt handler or by another page
fault handler running on another CPU. This approach is necessary for
architectures that cannot atomically update a page table entry; the entry is
first set to not present so that the MMU logic never sees a partial update.

If a page table entry is cleared then a second CPU may generate a page fault
for that entry. The fault handler on the second CPU will then attempt to
acquire the page_table_lock and wait until the first CPU has completed
updating the page table entry. The fault handler on the second CPU will then
discover that everything is ok and simply do nothing (apart from incrementing
the counters for a minor fault and marking the page again as accessed).

However, most architectures actually support atomic operations on page
table entries. The use of atomic operations on page table entries would
allow the update of a page table entry in a single atomic operation instead
of writing to the page table entry twice. There would also be no danger of
generating a spurious page fault on other CPUs.

The following patch introduces two new atomic operations ptep_xchg and
ptep_cmpxchg that may be provided by an architecture. The fallback in
include/asm-generic/pgtable.h is to simulate both operations through the
existing ptep_get_and_clear function. So there is essentially no change if
atomic operations on ptes have not been defined. Architectures that do
not support atomic operations on ptes may continue to use the clearing of
a pte for locking type purposes.

Atomic operations may be enabled in the kernel configuration on
i386, ia64 and x86_64 if a suitable CPU is configured in SMP mode.
Generic atomic definitions for ptep_xchg and ptep_cmpxchg
have been provided based on the existing xchg() and cmpxchg() functions
that already work atomically on many platforms. It is very
easy to implement this for any architecture by adding the appropriate
definitions to arch/xx/Kconfig.

The provided generic atomic functions may be overridden as usual by defining
the appropriate __HAVE_ARCH_xxx constant and providing an implementation.

My aim of reducing the use of the page_table_lock in the page fault handler
relies on a pte never being cleared while it is in use, even when the
page_table_lock is not held. Clearing a pte before setting it to another
value could result in a situation in which a fault generated by
another cpu installs a pte which is then immediately overwritten by
the first CPU setting the pte to a valid value again. This patch is
important for future work on reducing the use of spinlocks in the vm.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-01-27 14:47:20.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-01-27 16:27:40.000000000 -0800
@@ -575,11 +575,6 @@ static int try_to_unmap_one(struct page

/* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -594,11 +589,15 @@ static int try_to_unmap_one(struct page
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte(pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
mm->anon_rss--;
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);

+ /* Move the dirty bit to the physical page now that the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
mm->rss--;
acct_update_integrals();
page_remove_rmap(page);
@@ -691,15 +690,15 @@ static void try_to_unmap_cluster(unsigne
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte(pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_clear_flush(vma, address, pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-27 14:52:11.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-27 16:27:40.000000000 -0800
@@ -513,14 +513,18 @@ static void zap_pte_range(struct mmu_gat
page->index > details->last_index))
continue;
}
- pte = ptep_get_and_clear(ptep);
- tlb_remove_tlb_entry(tlb, ptep, address+offset);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ pte = ptep_get_and_clear(ptep);
+ tlb_remove_tlb_entry(tlb, ptep, address+offset);
continue;
+ }
if (unlikely(details) && details->nonlinear_vma
&& linear_page_index(details->nonlinear_vma,
address+offset) != page->index)
- set_pte(ptep, pgoff_to_pte(page->index));
+ pte = ptep_xchg(ptep, pgoff_to_pte(page->index));
+ else
+ pte = ptep_get_and_clear(ptep);
+ tlb_remove_tlb_entry(tlb, ptep, address+offset);
if (pte_dirty(pte))
set_page_dirty(page);
if (PageAnon(page))
Index: linux-2.6.10/mm/mprotect.c
===================================================================
--- linux-2.6.10.orig/mm/mprotect.c 2005-01-27 14:47:20.000000000 -0800
+++ linux-2.6.10/mm/mprotect.c 2005-01-27 16:27:40.000000000 -0800
@@ -48,12 +48,16 @@ change_pte_range(pmd_t *pmd, unsigned lo
if (pte_present(*pte)) {
pte_t entry;

- /* Avoid an SMP race with hardware updated dirty/clean
- * bits by wiping the pte and then setting the new pte
- * into place.
- */
- entry = ptep_get_and_clear(pte);
- set_pte(pte, pte_modify(entry, newprot));
+ /* Deal with a potential SMP race with hardware/arch
+ * interrupt updating dirty/clean bits through the use
+ * of ptep_cmpxchg.
+ */
+ do {
+ entry = *pte;
+ } while (!ptep_cmpxchg(pte,
+ entry,
+ pte_modify(entry, newprot)
+ ));
}
address += PAGE_SIZE;
pte++;
Index: linux-2.6.10/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable.h 2005-01-27 16:27:40.000000000 -0800
@@ -102,6 +102,92 @@ static inline pte_t ptep_get_and_clear(p
})
#endif

+#ifdef CONFIG_ATOMIC_TABLE_OPS
+
+/*
+ * The architecture does support atomic table operations.
+ * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using
+ * cmpxchg and xchg.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__ptep, __pteval) \
+ __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)))
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__ptep,__oldval,__newval) \
+ (cmpxchg(&pte_val(*(__ptep)), \
+ pte_val(__oldval), \
+ pte_val(__newval) \
+ ) == pte_val(__oldval) \
+ )
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg(__ptep, __pteval); \
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+
+#else
+
+/*
+ * No support for atomic operations on the page table.
+ * Exchanging of pte values is done by first swapping zeros into
+ * a pte and then putting new content into the pte entry.
+ * However, these functions will generate an empty pte for a
+ * short time frame. This means that the page_table_lock must be held
+ * to avoid a page fault that would install a new entry.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_get_and_clear(__ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+#else
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg(__ptep, __pteval); \
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+#endif
+
+/*
+ * The fallback function for ptep_cmpxchg avoids any real use of cmpxchg
+ * since cmpxchg may not be available on certain architectures. Instead
+ * the clearing of a pte is used as a form of locking mechanism.
+ * This approach will only work if the page_table_lock is held to insure
+ * that the pte is not populated by a page fault generated on another
+ * CPU.
+ */
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__ptep, __old, __new) \
+({ \
+ pte_t prev = ptep_get_and_clear(__ptep); \
+ int r = pte_val(prev) == pte_val(__old); \
+ set_pte(__ptep, r ? (__new) : prev); \
+ r; \
+})
+#endif
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(pte_t *ptep)
{
Index: linux-2.6.10/arch/ia64/Kconfig
===================================================================
--- linux-2.6.10.orig/arch/ia64/Kconfig 2005-01-27 14:47:14.000000000 -0800
+++ linux-2.6.10/arch/ia64/Kconfig 2005-01-27 16:36:56.000000000 -0800
@@ -280,6 +280,17 @@ config PREEMPT
Say Y here if you are building a kernel for a desktop, embedded
or real-time system. Say N if you are unsure.

+config ATOMIC_TABLE_OPS
+ bool "Atomic Page Table Operations (EXPERIMENTAL)"
+ depends on SMP && EXPERIMENTAL
+ help
+ Atomic Page table operations allow page faults
+ without the use (or with reduce use of) spinlocks
+ and allow greater concurrency for a task with multiple
+ threads in the page fault handler. This is in particular
+ useful for high CPU counts and processes that use
+ large amounts of memory.
+
config HAVE_DEC_LOCK
bool
depends on (SMP || PREEMPT)
Index: linux-2.6.10/arch/i386/Kconfig
===================================================================
--- linux-2.6.10.orig/arch/i386/Kconfig 2005-01-27 14:47:14.000000000 -0800
+++ linux-2.6.10/arch/i386/Kconfig 2005-01-27 16:37:05.000000000 -0800
@@ -868,6 +868,17 @@ config HAVE_DEC_LOCK
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y

+config ATOMIC_TABLE_OPS
+ bool "Atomic Page Table Operations (EXPERIMENTAL)"
+ depends on SMP && X86_CMPXCHG && EXPERIMENTAL && !X86_PAE
+ help
+ Atomic Page table operations allow page faults
+ without the use (or with reduce use of) spinlocks
+ and allow greater concurrency for a task with multiple
+ threads in the page fault handler. This is in particular
+ useful for high CPU counts and processes that use
+ large amounts of memory.
+
# turning this on wastes a bunch of space.
# Summit needs it only when NUMA is on
config BOOT_IOREMAP
Index: linux-2.6.10/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.10.orig/arch/x86_64/Kconfig 2005-01-27 14:52:10.000000000 -0800
+++ linux-2.6.10/arch/x86_64/Kconfig 2005-01-27 16:37:15.000000000 -0800
@@ -240,6 +240,17 @@ config PREEMPT
Say Y here if you are feeling brave and building a kernel for a
desktop, embedded or real-time system. Say N if you are unsure.

+config ATOMIC_TABLE_OPS
+ bool "Atomic Page Table Operations (EXPERIMENTAL)"
+ depends on SMP && EXPERIMENTAL
+ help
+ Atomic Page table operations allow page faults
+ without the use (or with reduce use of) spinlocks
+ and allow greater concurrency for a task with multiple
+ threads in the page fault handler. This is in particular
+ useful for high CPU counts and processes that use
+ large amounts of memory.
+
config PREEMPT_BKL
bool "Preempt The Big Kernel Lock"
depends on PREEMPT

2005-01-28 20:41:33

by Christoph Lameter

[permalink] [raw]
Subject: page fault scalability patch V16 [0/4]: redesign overview

Changes from V15->V16 of this patch: Complete Redesign.

An introduction to what this patch does and a patch archive can be found on
http://oss.sgi.com/projects/page_fault_performance. The archive also has a
combined patch.

The basic approach in this patchset is the same as used in SGI's 2.4.X
based kernels which have been in production use in ProPack 3 for a long time.

The patchset is composed of 4 patches (and was tested against 2.6.11-rc2-bk6
on ia64, i386 and x86_64):

1/4: ptep_cmpxchg and ptep_xchg to avoid intermittent zeroing of ptes

The current way of synchronizing with the CPU or arch specific
interrupts updating page table entries is to first set a pte
to zero before writing a new value. This patch uses ptep_xchg
and ptep_cmpxchg to avoid writing the zero for certain
configurations.

The patch introduces CONFIG_ATOMIC_TABLE_OPS that may be
enabled as an experimental feature during kernel configuration
if the hardware is able to support atomic operations and if
an SMP kernel is being configured. A Kconfig update for i386,
x86_64 and ia64 has been provided. On i386 this option is
restricted to CPUs better than a 486 and to non-PAE mode (that
way all the cmpxchg issues on old i386 CPUs and the problems
with 64bit atomic operations on recent i386 CPUs are avoided).

If CONFIG_ATOMIC_TABLE_OPS is not set then ptep_xchg and
ptep_cmpxchg are realized by falling back to clearing a pte
before updating it.

The patch does not change the use of mm->page_table_lock and
the only performance improvement is the replacement of
xchg-with-zero-and-then-write-new-pte-value with an xchg with
the new value for SMP on some architectures if
CONFIG_ATOMIC_TABLE_OPS is configured. It should not do anything
major to VM operations.

2/4: Macros for mm counter manipulation

There are various approaches to handling mm counters if the
page_table_lock is no longer acquired. This patch defines
macros in include/linux/sched.h to handle these counters and
makes sure that these macros are used throughout the kernel
to access and manipulate rss and anon_rss. There should be
no change to the generated code as a result of this patch.

3/4: Drop the first use of the page_table_lock in handle_mm_fault

The patch introduces two new functions:

page_table_atomic_start(mm), page_table_atomic_stop(mm)

that fall back to the use of the page_table_lock if
CONFIG_ATOMIC_TABLE_OPS is not defined.

If CONFIG_ATOMIC_TABLE_OPS is defined those functions may
be used to prep the CPU for atomic table ops (i386 in PAE mode
may e.g. get the MMX register ready for 64bit atomic ops) but
are simply empty by default.

Two operations may then be performed on the page table without
acquiring the page table lock:

a) updating access bits in a pte
b) anonymous read faults installing a mapping to the zero page.

All counters are still protected with the page_table_lock thus
avoiding any issues there.

Some additional statistics are added to /proc/meminfo. The patch
also counts spurious faults that have no effect. There is a
surprisingly high number of those on ia64 (used to populate the
cpu caches with the pte??).

4/4: Drop the use of the page_table_lock in do_anonymous_page

The second acquisition of the page_table_lock is removed
from do_anonymous_page, which allows the anonymous
write fault to proceed without the page_table_lock.

The macros for manipulating rss and anon_rss in include/linux/sched.h
are changed if CONFIG_ATOMIC_TABLE_OPS is set to use atomic
operations for rss and anon_rss (safest solution for now, other
solutions may easily be implemented by changing those macros).

This patch typically yields significant increases in page fault
performance for threaded applications on SMP systems.

I have an additional patch that drops the page_table_lock for COW but that
raises a lot of other issues. I will post that patch separately and only
to linux-mm.

2005-01-28 20:45:37

by Christoph Lameter

[permalink] [raw]
Subject: page fault scalability patch V16 [4/4]: Drop page_table_lock in do_anonymous_page

Do not use the page_table_lock in do_anonymous_page. This will significantly
increase the parallelism in the page fault handler in SMP systems. The patch
also modifies the definitions of _mm_counter functions so that rss and anon_rss
become atomic.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-27 16:39:24.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-27 16:39:24.000000000 -0800
@@ -1839,12 +1839,12 @@ do_anonymous_page(struct mm_struct *mm,
vma->vm_page_prot)),
vma);

- spin_lock(&mm->page_table_lock);
+ page_table_atomic_start(mm);

if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
pte_unmap(page_table);
page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
inc_page_state(cmpxchg_fail_anon_write);
return VM_FAULT_MINOR;
}
@@ -1862,7 +1862,7 @@ do_anonymous_page(struct mm_struct *mm,

update_mmu_cache(vma, addr, entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

return VM_FAULT_MINOR;
}
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2005-01-27 16:39:24.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-01-27 16:40:24.000000000 -0800
@@ -203,10 +203,26 @@ arch_get_unmapped_area_topdown(struct fi
extern void arch_unmap_area(struct vm_area_struct *area);
extern void arch_unmap_area_topdown(struct vm_area_struct *area);

+#ifdef CONFIG_ATOMIC_TABLE_OPS
+/*
+ * Atomic page table operations require that the counters are also
+ * incremented atomically
+*/
+#define set_mm_counter(mm, member, value) atomic_set(&(mm)->member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->member))
+#define update_mm_counter(mm, member, value) atomic_add(value, &(mm)->member)
+#define MM_COUNTER_T atomic_t
+
+#else
+/*
+ * No atomic page table operations. Counters are protected by
+ * the page table lock
+ */
#define set_mm_counter(mm, member, value) (mm)->member = (value)
#define get_mm_counter(mm, member) ((mm)->member)
#define update_mm_counter(mm, member, value) (mm)->member += (value)
#define MM_COUNTER_T unsigned long
+#endif

struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
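
Call sites do not change with either configuration; the fault path's rss
increment, for example, stays

	update_mm_counter(mm, rss, 1);

and expands to atomic_add(1, &mm->rss) when CONFIG_ATOMIC_TABLE_OPS is
set, or to a plain mm->rss += 1 (protected by the page_table_lock)
otherwise.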

2005-01-28 20:50:04

by Christoph Lameter

[permalink] [raw]
Subject: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

The page fault handler attempts to use the page_table_lock only for short
time periods. It repeatedly drops and reacquires the lock. When the lock
is reacquired, checks are made if the underlying pte has changed before
replacing the pte value. These locations are a good fit for the use of
ptep_cmpxchg.

The following patch removes the first acquisition of the page_table_lock
and uses atomic operations on the page table instead. A section using
atomic pte operations is begun with

page_table_atomic_start(struct mm_struct *)

and ends with

page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).

Atomic operations with ptep_xchg and ptep_cmpxchg only work for the lowest
layer of the page table. Higher layers may also be populated in an atomic
way by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock (populating higher level page
table entries is rare and therefore this is not likely to be performance
critical). For ia64 the definitions for higher level atomic operations are
included; they may easily be added for other architectures.

This patch depends on the pte_cmpxchg patch to be applied first and will
only remove the first use of the page_table_lock in the page fault handler.
This will allow the following page table operations without acquiring
the page_table_lock:

1. Updating of access bits (handle_mm_fault)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired for creating a new pte for an anonymous
write fault and therefore the problems with rss that were addressed by splitting
rss into the task structure do not yet occur.

The patch also adds some diagnostic features by counting the number of cmpxchg
failures (useful for verifying that this patch works correctly) and the number
of faults received that led to no change in the page table. These statistics
may be viewed via /proc/meminfo.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-27 16:27:59.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-27 16:28:54.000000000 -0800
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005 Scalability improvement by reducing the use and the length of time
+ * the page table lock is held (Christoph Lameter)
*/

#include <linux/kernel_stat.h>
@@ -1285,8 +1287,8 @@ static inline void break_cow(struct vm_a
* change only once the write actually happens. This avoids a few races,
* and potentially makes it more efficient.
*
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We hold the mm semaphore and have started atomic pte operations,
+ * exit with pte ops completed.
*/
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
@@ -1304,7 +1306,7 @@ static int do_wp_page(struct mm_struct *
pte_unmap(page_table);
printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
address);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
return VM_FAULT_OOM;
}
old_page = pfn_to_page(pfn);
@@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct *
flush_cache_page(vma, address);
entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
vma);
- ptep_set_access_flags(vma, address, page_table, entry, 1);
- update_mmu_cache(vma, address, entry);
+ /*
+ * If the bits are not updated then another fault
+ * will be generated with another chance of updating.
+ */
+ if (ptep_cmpxchg(page_table, pte, entry))
+ update_mmu_cache(vma, address, entry);
+ else
+ inc_page_state(cmpxchg_fail_flag_reuse);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
return VM_FAULT_MINOR;
}
}
pte_unmap(page_table);
+ page_table_atomic_stop(mm);

/*
* Ok, we need to copy. Oh, well..
*/
if (!PageReserved(old_page))
page_cache_get(old_page);
- spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
@@ -1340,7 +1348,8 @@ static int do_wp_page(struct mm_struct *
copy_cow_page(old_page,new_page,address);

/*
- * Re-check the pte - we dropped the lock
+ * Re-check the pte - so far we may not have acquired the
+ * page_table_lock
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1692,8 +1701,7 @@ void swapin_readahead(swp_entry_t entry,
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore and have started atomic pte operations
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1705,15 +1713,14 @@ static int do_swap_page(struct mm_struct
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1732,12 +1739,11 @@ static int do_swap_page(struct mm_struct
grab_swap_token();
}

- mark_page_accessed(page);
+ SetPageReferenced(page);
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1771,80 +1777,94 @@ static int do_swap_page(struct mm_struct
set_pte(page_table, pte);
page_add_anon_rmap(page, vma, address);

+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, pte);
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+
if (write_access) {
+ page_table_atomic_start(mm);
if (do_wp_page(mm, vma, address,
page_table, pmd, pte) == VM_FAULT_OOM)
ret = VM_FAULT_OOM;
- goto out;
}

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, pte);
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
out:
return ret;
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held and atomic pte operations started.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
+ struct page * page;

- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ if (unlikely(!write_access)) {

- /* ..except if it's a write access */
- if (write_access) {
- /* Allocate our own private page. */
+ /* Read-only mapping of ZERO_PAGE. */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+
+ /*
+ * If the cmpxchg fails then another fault may be
+ * generated that may then be successful
+ */
+
+ if (ptep_cmpxchg(page_table, orig_entry, entry))
+ update_mmu_cache(vma, addr, entry);
+ else
+ inc_page_state(cmpxchg_fail_anon_read);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

- if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
- if (!page)
- goto no_mem;
- clear_user_highpage(page, addr);
+ return VM_FAULT_MINOR;
+ }

- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
+ page_table_atomic_stop(mm);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- update_mm_counter(mm, rss, 1);
- acct_update_integrals();
- update_mem_hiwater();
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- SetPageReferenced(page);
- page_add_anon_rmap(page, vma, addr);
+ /* Allocate our own private page. */
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ if (!page)
+ return VM_FAULT_OOM;
+ clear_user_highpage(page, addr);
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
+ vma->vm_page_prot)),
+ vma);
+
+ spin_lock(&mm->page_table_lock);
+
+ if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ spin_unlock(&mm->page_table_lock);
+ inc_page_state(cmpxchg_fail_anon_write);
+ return VM_FAULT_MINOR;
}

- set_pte(page_table, entry);
- pte_unmap(page_table);
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ page_add_anon_rmap(page, vma, addr);
+ lru_cache_add_active(page);
+ update_mm_counter(mm, rss, 1);
+ acct_update_integrals();
+ update_mem_hiwater();

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
+ update_mmu_cache(vma, addr, entry);
+ pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
-out:
+
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}

/*
@@ -1856,12 +1876,12 @@ no_mem:
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held and atomic pte operations started.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1872,9 +1892,9 @@ do_no_page(struct mm_struct *mm, struct

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1982,7 +2002,7 @@ oom:
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1995,13 +2015,13 @@ static int do_file_page(struct mm_struct
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

- pgoff = pte_to_pgoff(*pte);
+ pgoff = pte_to_pgoff(entry);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -2020,49 +2040,45 @@ static int do_file_page(struct mm_struct
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to insure to handle that case properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ new_entry = pte_mkyoung(entry);
if (write_access) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+
+ /*
+ * If the cmpxchg fails then we will get another fault which
+ * has another chance of successfully updating the page table entry.
+ */
+ if (ptep_cmpxchg(pte, entry, new_entry)) {
+ flush_tlb_page(vma, address);
+ update_mmu_cache(vma, address, entry);
+ } else
+ inc_page_state(cmpxchg_fail_flag_update);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
+ if (pte_val(new_entry) == pte_val(entry))
+ inc_page_state(spurious_page_faults);
return VM_FAULT_MINOR;
}

@@ -2081,33 +2097,73 @@ int handle_mm_fault(struct mm_struct *mm

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We try to rely on the mmap_sem and the SMP-safe atomic PTE updates.
+ * to synchronize with kswapd. However, the arch may fall back
+ * in page_table_atomic_start to the page table lock.
+ *
+ * We may be able to avoid taking and releasing the page_table_lock
+ * for the p??_alloc functions through atomic operations so we
+ * duplicate the functionality of pmd_alloc, pud_alloc and
+ * pte_alloc_map here.
*/
+ page_table_atomic_start(mm);
pgd = pgd_offset(mm, address);
- spin_lock(&mm->page_table_lock);
+ if (unlikely(pgd_none(*pgd))) {
+ pud_t *new;
+
+ page_table_atomic_stop(mm);
+ new = pud_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pud_free(new);
+ }
+
+ pud = pud_offset(pgd, address);
+ if (unlikely(pud_none(*pud))) {
+ pmd_t *new;
+
+ page_table_atomic_stop(mm);
+ new = pmd_alloc_one(mm, address);

- pud = pud_alloc(mm, pgd, address);
- if (!pud)
- goto oom;
-
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- goto oom;
-
- pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- goto oom;
+ if (!new)
+ return VM_FAULT_OOM;

- return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+ page_table_atomic_start(mm);
+
+ if (!pud_test_and_populate(mm, pud, new))
+ pmd_free(new);
+ }

- oom:
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ pmd = pmd_offset(pud, address);
+ if (unlikely(!pmd_present(*pmd))) {
+ struct page *new;
+
+ page_table_atomic_stop(mm);
+ new = pte_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else {
+ inc_page_state(nr_page_table_pages);
+ mm->nr_ptes++;
+ }
+ }
+
+ pte = pte_offset_map(pmd, address);
+ return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}

#ifndef __ARCH_HAS_4LEVEL_HACK
Index: linux-2.6.10/include/asm-generic/pgtable-nopud.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable-nopud.h 2005-01-27 14:47:20.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable-nopud.h 2005-01-27 16:28:54.000000000 -0800
@@ -25,8 +25,14 @@ static inline int pgd_bad(pgd_t pgd) {
static inline int pgd_present(pgd_t pgd) { return 1; }
static inline void pgd_clear(pgd_t *pgd) { }
#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
-
#define pgd_populate(mm, pgd, pud) do { } while (0)
+
+#define __HAVE_ARCH_PGD_TEST_AND_POPULATE
+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+{
+ return 1;
+}
+
/*
* (puds are folded into pgds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
Index: linux-2.6.10/include/asm-generic/pgtable-nopmd.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable-nopmd.h 2005-01-27 14:47:20.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable-nopmd.h 2005-01-27 16:28:54.000000000 -0800
@@ -29,6 +29,11 @@ static inline void pud_clear(pud_t *pud)
#define pmd_ERROR(pmd) (pud_ERROR((pmd).pud))

#define pud_populate(mm, pmd, pte) do { } while (0)
+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+{
+ return 1;
+}

/*
* (pmds are folded into puds so this doesn't get actually called,
Index: linux-2.6.10/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable.h 2005-01-27 16:27:40.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable.h 2005-01-27 16:30:35.000000000 -0800
@@ -105,8 +105,14 @@ static inline pte_t ptep_get_and_clear(p
#ifdef CONFIG_ATOMIC_TABLE_OPS

/*
- * The architecture does support atomic table operations.
- * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using
+ * The architecture does support atomic table operations and
+ * all operations on page table entries must always be atomic.
+ *
+ * This means that the kernel will never encounter a partially updated
+ * page table entry.
+ *
+ * Since the architecture does support atomic table operations, we
+ * may provide generic atomic ptep_xchg and ptep_cmpxchg using
* cmpxchg and xchg.
*/
#ifndef __HAVE_ARCH_PTEP_XCHG
@@ -132,6 +138,65 @@ static inline pte_t ptep_get_and_clear(p
})
#endif

+/*
+ * page_table_atomic_start and page_table_atomic_stop may be used to
+ * define special measures that an arch needs to guarantee atomic
+ * operations outside of a spinlock. In the case that an arch does
+ * not support atomic page table operations we will fall back to the
+ * page table lock.
+ */
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_start(mm) do { } while (0)
+#endif
+
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_STOP
+#define page_table_atomic_stop(mm) do { } while (0)
+#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These simply acquire the page_table_lock for
+ * synchronization. An architecture may override these generic
+ * functions to provide atomic populate functions to make these
+ * more effective.
+ */
+
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ spin_lock(&(__mm)->page_table_lock); \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ spin_unlock(&(__mm)->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&(__mm)->page_table_lock); \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ spin_unlock(&(__mm)->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&(__mm)->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&(__mm)->page_table_lock); \
+ __rc; \
+})
+#endif
+
#else

/*
@@ -142,6 +207,11 @@ static inline pte_t ptep_get_and_clear(p
* short time frame. This means that the page_table_lock must be held
* to avoid a page fault that would install a new entry.
*/
+
+/* Fall back to the page table lock to synchronize page table access */
+#define page_table_atomic_start(mm) spin_lock(&(mm)->page_table_lock)
+#define page_table_atomic_stop(mm) spin_unlock(&(mm)->page_table_lock)
+
#ifndef __HAVE_ARCH_PTEP_XCHG
#define ptep_xchg(__ptep, __pteval) \
({ \
@@ -186,6 +256,41 @@ static inline pte_t ptep_get_and_clear(p
r; \
})
#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These rely on the page_table_lock being held.
+ */
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ __rc; \
+})
+#endif
+
#endif

#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
Index: linux-2.6.10/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgtable.h 2005-01-27 14:47:20.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgtable.h 2005-01-27 16:33:24.000000000 -0800
@@ -554,6 +554,8 @@ do { \
#define FIXADDR_USER_START GATE_ADDR
#define FIXADDR_USER_END (GATE_ADDR + 2*PERCPU_PAGE_SIZE)

+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define __HAVE_ARCH_PMD_TEST_AND_POPULATE
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
@@ -561,7 +563,7 @@ do { \
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
-#include <asm-generic/pgtable.h>
#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.10/include/linux/page-flags.h
===================================================================
--- linux-2.6.10.orig/include/linux/page-flags.h 2005-01-27 14:47:20.000000000 -0800
+++ linux-2.6.10/include/linux/page-flags.h 2005-01-27 16:28:54.000000000 -0800
@@ -131,6 +131,17 @@ struct page_state {
unsigned long allocstall; /* direct reclaim calls */

unsigned long pgrotated; /* pages rotated to tail of the LRU */
+
+ /* Low level counters */
+ unsigned long spurious_page_faults; /* Faults with no ops */
+ unsigned long cmpxchg_fail_flag_update; /* cmpxchg failures for pte flag update */
+ unsigned long cmpxchg_fail_flag_reuse; /* cmpxchg failures when cow reuse of pte */
+ unsigned long cmpxchg_fail_anon_read; /* cmpxchg failures on anonymous read */
+ unsigned long cmpxchg_fail_anon_write; /* cmpxchg failures on anonymous write */
+
+ /* rss deltas for the current executing thread */
+ long rss;
+ long anon_rss;
};

extern void get_page_state(struct page_state *ret);
Index: linux-2.6.10/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.10.orig/fs/proc/proc_misc.c 2005-01-27 14:47:19.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c 2005-01-27 16:28:54.000000000 -0800
@@ -127,7 +127,7 @@ static int meminfo_read_proc(char *page,
unsigned long allowed;
struct vmalloc_info vmi;

- get_page_state(&ps);
+ get_full_page_state(&ps);
get_zone_counts(&active, &inactive, &free);

/*
@@ -168,7 +168,12 @@ static int meminfo_read_proc(char *page,
"PageTables: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "VmallocChunk: %8lu kB\n"
+ "Spurious page faults : %8lu\n"
+ "cmpxchg fail flag update: %8lu\n"
+ "cmpxchg fail COW reuse : %8lu\n"
+ "cmpxchg fail anon read : %8lu\n"
+ "cmpxchg fail anon write : %8lu\n",
K(i.totalram),
K(i.freeram),
K(i.bufferram),
@@ -191,7 +196,12 @@ static int meminfo_read_proc(char *page,
K(ps.nr_page_table_pages),
VMALLOC_TOTAL >> 10,
vmi.used >> 10,
- vmi.largest_chunk >> 10
+ vmi.largest_chunk >> 10,
+ ps.spurious_page_faults,
+ ps.cmpxchg_fail_flag_update,
+ ps.cmpxchg_fail_flag_reuse,
+ ps.cmpxchg_fail_anon_read,
+ ps.cmpxchg_fail_anon_write
);

len += hugetlb_report_meminfo(page + len);
Index: linux-2.6.10/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h 2005-01-27 14:47:20.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h 2005-01-27 16:33:10.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PUD */
+#define PMD_NONE 0
+#define PUD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -82,6 +86,13 @@ pud_populate (struct mm_struct *mm, pud_
pud_val(*pud_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pud_entry, __pa(pmd), PUD_NONE) == PUD_NONE;
+}
+
static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
{
@@ -127,6 +138,14 @@ pmd_populate (struct mm_struct *mm, pmd_
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{

2005-01-28 20:55:37

by Christoph Lameter

[permalink] [raw]
Subject: page fault scalability patch V16 [2/4]: mm counter macros

This patch extracts all the interesting pieces for handling rss and
anon_rss into definitions in include/linux/sched.h. All rss operations
are performed through the following three macros:

get_mm_counter(mm, member) -> Obtain the value of a counter
set_mm_counter(mm, member, value) -> Set the value of a counter
update_mm_counter(mm, member, value) -> Add a value to a counter

The simple definitions provided in this patch should result in no change
to the generated code.

With this patch it becomes easier to add new counters and it is possible
to redefine the method of counter handling (f.e. the page fault scalability
patches may want to use atomic operations or split rss).
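
As an illustration only (not part of this patch), if an atomic counter
type were acceptable for rss, the whole conversion could be confined to
these four definitions, for example:

  /* hypothetical alternative, assuming atomic_t is wide enough for rss */
  #define set_mm_counter(mm, member, value) atomic_set(&(mm)->member, value)
  #define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->member))
  #define update_mm_counter(mm, member, value) atomic_add(value, &(mm)->member)
  #define MM_COUNTER_T atomic_t

A call like update_mm_counter(mm, rss, 1) would then become an atomic
add without any change outside of sched.h.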

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-01-28 11:02:00.000000000 -0800
@@ -203,6 +203,10 @@ arch_get_unmapped_area_topdown(struct fi
extern void arch_unmap_area(struct vm_area_struct *area);
extern void arch_unmap_area_topdown(struct vm_area_struct *area);

+#define set_mm_counter(mm, member, value) (mm)->member = (value)
+#define get_mm_counter(mm, member) ((mm)->member)
+#define update_mm_counter(mm, member, value) (mm)->member += (value)
+#define MM_COUNTER_T unsigned long

struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
@@ -219,7 +223,7 @@ struct mm_struct {
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ spinlock_t page_table_lock; /* Protects page tables and some counters */

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -229,9 +233,13 @@ struct mm_struct {
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
+ unsigned long total_vm, locked_vm, shared_vm;
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

+ /* Special counters protected by the page_table_lock */
+ MM_COUNTER_T rss;
+ MM_COUNTER_T anon_rss;
+
unsigned long saved_auxv[42]; /* for /proc/PID/auxv */

unsigned dumpable:1;
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-28 11:01:58.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-28 11:02:00.000000000 -0800
@@ -324,9 +324,9 @@ copy_one_pte(struct mm_struct *dst_mm,
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
get_page(page);
- dst_mm->rss++;
+ update_mm_counter(dst_mm, rss, 1);
if (PageAnon(page))
- dst_mm->anon_rss++;
+ update_mm_counter(dst_mm, anon_rss, 1);
set_pte(dst_pte, pte);
page_dup_rmap(page);
}
@@ -528,7 +528,7 @@ static void zap_pte_range(struct mmu_gat
if (pte_dirty(pte))
set_page_dirty(page);
if (PageAnon(page))
- tlb->mm->anon_rss--;
+ update_mm_counter(tlb->mm, anon_rss, -1);
else if (pte_young(pte))
mark_page_accessed(page);
tlb->freed++;
@@ -1345,13 +1345,14 @@ static int do_wp_page(struct mm_struct *
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
if (likely(pte_same(*page_table, pte))) {
- if (PageAnon(old_page))
- mm->anon_rss--;
+ if (PageAnon(old_page))
+ update_mm_counter(mm, anon_rss, -1);
if (PageReserved(old_page)) {
- ++mm->rss;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();
} else
+
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1755,7 +1756,7 @@ static int do_swap_page(struct mm_struct
if (vm_swap_full())
remove_exclusive_swap_page(page);

- mm->rss++;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();

@@ -1823,7 +1824,7 @@ do_anonymous_page(struct mm_struct *mm,
spin_unlock(&mm->page_table_lock);
goto out;
}
- mm->rss++;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
@@ -1941,7 +1942,7 @@ retry:
/* Only go through if we didn't race with anybody else... */
if (pte_none(*page_table)) {
if (!PageReserved(new_page))
- ++mm->rss;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();

@@ -2272,8 +2273,10 @@ void update_mem_hiwater(void)
struct task_struct *tsk = current;

if (tsk->mm) {
- if (tsk->mm->hiwater_rss < tsk->mm->rss)
- tsk->mm->hiwater_rss = tsk->mm->rss;
+ unsigned long rss = get_mm_counter(tsk->mm, rss);
+
+ if (tsk->mm->hiwater_rss < rss)
+ tsk->mm->hiwater_rss = rss;
if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
tsk->mm->hiwater_vm = tsk->mm->total_vm;
}
Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-01-28 11:01:58.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-01-28 11:02:00.000000000 -0800
@@ -258,7 +258,7 @@ static int page_referenced_one(struct pa
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
+ if (!get_mm_counter(mm, rss))
goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -437,7 +437,7 @@ void page_add_anon_rmap(struct page *pag
BUG_ON(PageReserved(page));
BUG_ON(!anon_vma);

- vma->vm_mm->anon_rss++;
+ update_mm_counter(vma->vm_mm, anon_rss, 1);

anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
index = (address - vma->vm_start) >> PAGE_SHIFT;
@@ -510,7 +510,7 @@ static int try_to_unmap_one(struct page
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
+ if (!get_mm_counter(mm, rss))
goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -591,14 +591,14 @@ static int try_to_unmap_one(struct page
}
pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
- mm->anon_rss--;
+ update_mm_counter(mm, anon_rss, -1);
} else
pteval = ptep_clear_flush(vma, address, pte);

/* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);
- mm->rss--;
+ update_mm_counter(mm, rss, -1);
acct_update_integrals();
page_remove_rmap(page);
page_cache_release(page);
@@ -705,7 +705,7 @@ static void try_to_unmap_cluster(unsigne
page_remove_rmap(page);
page_cache_release(page);
acct_update_integrals();
- mm->rss--;
+ update_mm_counter(mm, rss, -1);
(*mapcount)--;
}

@@ -804,7 +804,7 @@ static int try_to_unmap_file(struct page
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
+ while (get_mm_counter(vma->vm_mm, rss) &&
cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
Index: linux-2.6.10/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.10.orig/fs/proc/task_mmu.c 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/fs/proc/task_mmu.c 2005-01-28 11:02:00.000000000 -0800
@@ -24,7 +24,7 @@ char *task_mem(struct mm_struct *mm, cha
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ get_mm_counter(mm, rss) << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -39,11 +39,13 @@ unsigned long task_vsize(struct mm_struc
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ int rss = get_mm_counter(mm, rss);
+
+ *shared = rss - get_mm_counter(mm, anon_rss);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = mm->rss;
+ *resident = rss;
return mm->total_vm;
}

Index: linux-2.6.10/mm/mmap.c
===================================================================
--- linux-2.6.10.orig/mm/mmap.c 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/mm/mmap.c 2005-01-28 11:02:00.000000000 -0800
@@ -2003,7 +2003,7 @@ void exit_mmap(struct mm_struct *mm)
vma = mm->mmap;
mm->mmap = mm->mmap_cache = NULL;
mm->mm_rb = RB_ROOT;
- mm->rss = 0;
+ set_mm_counter(mm, rss, 0);
mm->total_vm = 0;
mm->locked_vm = 0;

Index: linux-2.6.10/kernel/fork.c
===================================================================
--- linux-2.6.10.orig/kernel/fork.c 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/kernel/fork.c 2005-01-28 11:02:00.000000000 -0800
@@ -174,8 +174,8 @@ static inline int dup_mmap(struct mm_str
mm->mmap_cache = NULL;
mm->free_area_cache = oldmm->mmap_base;
mm->map_count = 0;
- mm->rss = 0;
- mm->anon_rss = 0;
+ set_mm_counter(mm, rss, 0);
+ set_mm_counter(mm, anon_rss, 0);
cpus_clear(mm->cpu_vm_mask);
mm->mm_rb = RB_ROOT;
rb_link = &mm->mm_rb.rb_node;
@@ -471,7 +471,7 @@ static int copy_mm(unsigned long clone_f
if (retval)
goto free_pt;

- mm->hiwater_rss = mm->rss;
+ mm->hiwater_rss = get_mm_counter(mm, rss);
mm->hiwater_vm = mm->total_vm;

good_mm:
Index: linux-2.6.10/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/tlb.h 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/include/asm-generic/tlb.h 2005-01-28 11:02:00.000000000 -0800
@@ -88,11 +88,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
{
int freed = tlb->freed;
struct mm_struct *mm = tlb->mm;
- int rss = mm->rss;
+ int rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);
tlb_flush_mmu(tlb, start, end);

/* keep the page table cache within bounds */
Index: linux-2.6.10/fs/binfmt_flat.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_flat.c 2004-12-24 13:33:47.000000000 -0800
+++ linux-2.6.10/fs/binfmt_flat.c 2005-01-28 11:02:00.000000000 -0800
@@ -650,7 +650,7 @@ static int load_flat_file(struct linux_b
current->mm->start_brk = datapos + data_len + bss_len;
current->mm->brk = (current->mm->start_brk + 3) & ~3;
current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
}

if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.10/fs/exec.c
===================================================================
--- linux-2.6.10.orig/fs/exec.c 2005-01-28 11:01:50.000000000 -0800
+++ linux-2.6.10/fs/exec.c 2005-01-28 11:02:00.000000000 -0800
@@ -326,7 +326,7 @@ void install_arg_page(struct vm_area_str
pte_unmap(pte);
goto out;
}
- mm->rss++;
+ update_mm_counter(mm, rss, 1);
lru_cache_add_active(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
page, vma->vm_page_prot))));
Index: linux-2.6.10/fs/binfmt_som.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_som.c 2005-01-28 11:01:50.000000000 -0800
+++ linux-2.6.10/fs/binfmt_som.c 2005-01-28 11:02:00.000000000 -0800
@@ -259,7 +259,7 @@ load_som_binary(struct linux_binprm * bp
create_som_tables(bprm);

current->mm->start_stack = bprm->p;
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);

#if 0
printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.10/mm/fremap.c
===================================================================
--- linux-2.6.10.orig/mm/fremap.c 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/mm/fremap.c 2005-01-28 11:02:00.000000000 -0800
@@ -39,7 +39,7 @@ static inline void zap_pte(struct mm_str
set_page_dirty(page);
page_remove_rmap(page);
page_cache_release(page);
- mm->rss--;
+ update_mm_counter(mm, rss, -1);
}
}
} else {
@@ -92,7 +92,7 @@ int install_page(struct mm_struct *mm, s

zap_pte(mm, vma, addr, pte);

- mm->rss++;
+ update_mm_counter(mm, rss, 1);
flush_icache_page(vma, page);
set_pte(pte, mk_pte(page, prot));
page_add_file_rmap(page);
Index: linux-2.6.10/mm/swapfile.c
===================================================================
--- linux-2.6.10.orig/mm/swapfile.c 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/mm/swapfile.c 2005-01-28 11:02:00.000000000 -0800
@@ -432,7 +432,7 @@ static void
unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
swp_entry_t entry, struct page *page)
{
- vma->vm_mm->rss++;
+ update_mm_counter(vma->vm_mm, rss, 1);
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_anon_rmap(page, vma, address);
Index: linux-2.6.10/fs/binfmt_aout.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_aout.c 2005-01-28 11:01:50.000000000 -0800
+++ linux-2.6.10/fs/binfmt_aout.c 2005-01-28 11:02:00.000000000 -0800
@@ -317,7 +317,7 @@ static int load_aout_binary(struct linux
(current->mm->start_brk = N_BSSADDR(ex));
current->mm->free_area_cache = current->mm->mmap_base;

- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.10/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/mm/hugetlbpage.c 2005-01-28 11:01:48.000000000 -0800
+++ linux-2.6.10/arch/ia64/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800
@@ -73,7 +73,7 @@ set_huge_pte (struct mm_struct *mm, stru
{
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access) {
entry =
pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -116,7 +116,7 @@ int copy_hugetlb_page_range(struct mm_st
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -246,7 +246,7 @@ void unmap_hugepage_range(struct vm_area
put_page(page);
pte_clear(pte);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, - ((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/fs/binfmt_elf.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_elf.c 2005-01-28 11:01:55.000000000 -0800
+++ linux-2.6.10/fs/binfmt_elf.c 2005-01-28 11:02:00.000000000 -0800
@@ -764,7 +764,7 @@ static int load_elf_binary(struct linux_

/* Do this so that we can load the interpreter, if need be. We will
change some of these later */
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->free_area_cache = current->mm->mmap_base;
retval = setup_arg_pages(bprm, STACK_TOP, executable_stack);
if (retval < 0) {
Index: linux-2.6.10/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/tlb.h 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/tlb.h 2005-01-28 11:02:00.000000000 -0800
@@ -161,11 +161,11 @@ tlb_finish_mmu (struct mmu_gather *tlb,
{
unsigned long freed = tlb->freed;
struct mm_struct *mm = tlb->mm;
- unsigned long rss = mm->rss;
+ unsigned long rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);
/*
* Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
* tlb->end_addr.
Index: linux-2.6.10/include/asm-arm/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/tlb.h 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/include/asm-arm/tlb.h 2005-01-28 11:02:00.000000000 -0800
@@ -54,11 +54,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
{
struct mm_struct *mm = tlb->mm;
unsigned long freed = tlb->freed;
- int rss = mm->rss;
+ int rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);

if (freed) {
flush_tlb_mm(mm);
Index: linux-2.6.10/include/asm-arm26/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/tlb.h 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/tlb.h 2005-01-28 11:02:00.000000000 -0800
@@ -37,11 +37,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
{
struct mm_struct *mm = tlb->mm;
unsigned long freed = tlb->freed;
- int rss = mm->rss;
+ int rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);

if (freed) {
flush_tlb_mm(mm);
Index: linux-2.6.10/include/asm-sparc64/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/tlb.h 2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/tlb.h 2005-01-28 11:02:00.000000000 -0800
@@ -80,11 +80,11 @@ static inline void tlb_finish_mmu(struct
{
unsigned long freed = mp->freed;
struct mm_struct *mm = mp->mm;
- unsigned long rss = mm->rss;
+ unsigned long rss = get_mm_counter(mm, rss);

if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);

tlb_flush_mmu(mp);

Index: linux-2.6.10/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/hugetlbpage.c 2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800
@@ -62,7 +62,7 @@ static void set_huge_pte(struct mm_struc
unsigned long i;
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);

if (write_access)
entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@ int copy_hugetlb_page_range(struct mm_st
pte_val(entry) += PAGE_SIZE;
dst_pte++;
}
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -206,7 +206,7 @@ void unmap_hugepage_range(struct vm_area
pte++;
}
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/arch/x86_64/ia32/ia32_aout.c
===================================================================
--- linux-2.6.10.orig/arch/x86_64/ia32/ia32_aout.c 2005-01-28 11:01:48.000000000 -0800
+++ linux-2.6.10/arch/x86_64/ia32/ia32_aout.c 2005-01-28 11:02:00.000000000 -0800
@@ -313,7 +313,7 @@ static int load_aout_binary(struct linux
(current->mm->start_brk = N_BSSADDR(ex));
current->mm->free_area_cache = TASK_UNMAPPED_BASE;

- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.10/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/ppc64/mm/hugetlbpage.c 2005-01-28 11:01:48.000000000 -0800
+++ linux-2.6.10/arch/ppc64/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800
@@ -153,7 +153,7 @@ static void set_huge_pte(struct mm_struc
{
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access) {
entry =
pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -311,7 +311,7 @@ int copy_hugetlb_page_range(struct mm_st

ptepage = pte_page(entry);
get_page(ptepage);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
set_pte(dst_pte, entry);

addr += HPAGE_SIZE;
@@ -421,7 +421,7 @@ void unmap_hugepage_range(struct vm_area

put_page(page);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_pending();
}

Index: linux-2.6.10/arch/sh64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/sh64/mm/hugetlbpage.c 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/arch/sh64/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800
@@ -62,7 +62,7 @@ static void set_huge_pte(struct mm_struc
unsigned long i;
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);

if (write_access)
entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@ int copy_hugetlb_page_range(struct mm_st
pte_val(entry) += PAGE_SIZE;
dst_pte++;
}
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -206,7 +206,7 @@ void unmap_hugepage_range(struct vm_area
pte++;
}
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/mm/hugetlbpage.c 2004-12-24 13:35:01.000000000 -0800
+++ linux-2.6.10/arch/sparc64/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800
@@ -59,7 +59,7 @@ static void set_huge_pte(struct mm_struc
unsigned long i;
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);

if (write_access)
entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -112,7 +112,7 @@ int copy_hugetlb_page_range(struct mm_st
pte_val(entry) += PAGE_SIZE;
dst_pte++;
}
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -203,7 +203,7 @@ void unmap_hugepage_range(struct vm_area
pte++;
}
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/arch/mips/kernel/irixelf.c
===================================================================
--- linux-2.6.10.orig/arch/mips/kernel/irixelf.c 2005-01-28 11:01:48.000000000 -0800
+++ linux-2.6.10/arch/mips/kernel/irixelf.c 2005-01-28 11:02:00.000000000 -0800
@@ -690,7 +690,7 @@ static int load_irix_binary(struct linux
/* Do this so that we can load the interpreter, if need be. We will
* change some of these later.
*/
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
setup_arg_pages(bprm, STACK_TOP, EXSTACK_DEFAULT);
current->mm->start_stack = bprm->p;

Index: linux-2.6.10/arch/m68k/atari/stram.c
===================================================================
--- linux-2.6.10.orig/arch/m68k/atari/stram.c 2005-01-28 11:01:48.000000000 -0800
+++ linux-2.6.10/arch/m68k/atari/stram.c 2005-01-28 11:02:00.000000000 -0800
@@ -635,7 +635,7 @@ static inline void unswap_pte(struct vm_
set_pte(dir, pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
swap_free(entry);
get_page(page);
- ++vma->vm_mm->rss;
+ update_mm_counter(vma->vm_mm, rss, 1);
}

static inline void unswap_pmd(struct vm_area_struct * vma, pmd_t *dir,
Index: linux-2.6.10/arch/i386/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/i386/mm/hugetlbpage.c 2005-01-28 11:01:47.000000000 -0800
+++ linux-2.6.10/arch/i386/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800
@@ -46,7 +46,7 @@ static void set_huge_pte(struct mm_struc
{
pte_t entry;

- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access) {
entry =
pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -86,7 +86,7 @@ int copy_hugetlb_page_range(struct mm_st
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -222,7 +222,7 @@ void unmap_hugepage_range(struct vm_area
page = pte_page(pte);
put_page(page);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}

Index: linux-2.6.10/arch/sparc64/kernel/binfmt_aout32.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/kernel/binfmt_aout32.c 2005-01-28 11:01:48.000000000 -0800
+++ linux-2.6.10/arch/sparc64/kernel/binfmt_aout32.c 2005-01-28 11:02:00.000000000 -0800
@@ -241,7 +241,7 @@ static int load_aout32_binary(struct lin
current->mm->brk = ex.a_bss +
(current->mm->start_brk = N_BSSADDR(ex));

- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.10/fs/proc/array.c
===================================================================
--- linux-2.6.10.orig/fs/proc/array.c 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/fs/proc/array.c 2005-01-28 11:02:00.000000000 -0800
@@ -423,7 +423,7 @@ static int do_task_stat(struct task_stru
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? get_mm_counter(mm, rss) : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.10/fs/binfmt_elf_fdpic.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_elf_fdpic.c 2005-01-28 11:01:50.000000000 -0800
+++ linux-2.6.10/fs/binfmt_elf_fdpic.c 2005-01-28 10:53:07.000000000 -0800
@@ -299,7 +299,7 @@ static int load_elf_fdpic_binary(struct
/* do this so that we can load the interpreter, if need be
* - we will change some of these later
*/
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);

#ifdef CONFIG_MMU
retval = setup_arg_pages(bprm, current->mm->start_stack, executable_stack);
Index: linux-2.6.10/mm/nommu.c
===================================================================
--- linux-2.6.10.orig/mm/nommu.c 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/mm/nommu.c 2005-01-28 11:04:33.000000000 -0800
@@ -958,10 +958,11 @@ void arch_unmap_area(struct vm_area_stru
void update_mem_hiwater(void)
{
struct task_struct *tsk = current;

if (likely(tsk->mm)) {
+ unsigned long rss = get_mm_counter(tsk->mm, rss);
- if (tsk->mm->hiwater_rss < tsk->mm->rss)
- tsk->mm->hiwater_rss = tsk->mm->rss;
+ if (tsk->mm->hiwater_rss < rss)
+ tsk->mm->hiwater_rss = rss;
if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
tsk->mm->hiwater_vm = tsk->mm->total_vm;
}
Index: linux-2.6.10/kernel/acct.c
===================================================================
--- linux-2.6.10.orig/kernel/acct.c 2005-01-28 11:01:51.000000000 -0800
+++ linux-2.6.10/kernel/acct.c 2005-01-28 11:03:13.000000000 -0800
@@ -544,7 +544,7 @@ void acct_update_integrals(void)
if (delta == 0)
return;
tsk->acct_stimexpd = tsk->stime;
- tsk->acct_rss_mem1 += delta * tsk->mm->rss;
+ tsk->acct_rss_mem1 += delta * get_mm_counter(tsk->mm, rss);
tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
}
}

2005-02-01 04:08:56

by Nick Piggin

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

Christoph Lameter wrote:

Slightly OT: are you still planning to move the update_mem_hiwater and
friends crud out of these fastpaths? It looks like at least that function
is unsafe to be lockless.

> @@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct *
> flush_cache_page(vma, address);
> entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
> vma);
> - ptep_set_access_flags(vma, address, page_table, entry, 1);
> - update_mmu_cache(vma, address, entry);
> + /*
> + * If the bits are not updated then another fault
> + * will be generated with another chance of updating.
> + */
> + if (ptep_cmpxchg(page_table, pte, entry))
> + update_mmu_cache(vma, address, entry);
> + else
> + inc_page_state(cmpxchg_fail_flag_reuse);
> pte_unmap(page_table);
> - spin_unlock(&mm->page_table_lock);
> + page_table_atomic_stop(mm);
> return VM_FAULT_MINOR;
> }
> }
> pte_unmap(page_table);
> + page_table_atomic_stop(mm);
>
> /*
> * Ok, we need to copy. Oh, well..
> */
> if (!PageReserved(old_page))
> page_cache_get(old_page);
> - spin_unlock(&mm->page_table_lock);
>

I don't think you can do this unless you have done something funky that I
missed. And that kind of shoots down your lockless COW too, although it
looks like you can safely have the second part of do_wp_page without the
lock. Basically - your lockless COW patch itself seems like it should be
OK, but this hunk does not.

I would be very interested if you are seeing performance gains with your
lockless COW patches, BTW.

Basically, getting a reference on a struct page was the only thing I found
I wasn't able to do lockless with pte cmpxchg. Because it can race with
unmapping in rmap.c and reclaim and reuse, which probably isn't too good.
That means: the only operations you are able to do lockless is when there
is no backing page (ie. the anonymous unpopulated->populated case).

A per-pte lock is sufficient for this case, of course, which is why the
pte-locked system is completely free of the page table lock.
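
Roughly, that safe case reduces to the following sketch (helpers as in
the patches above; rmap/lru/mmu-cache bookkeeping and error handling
left out):

  entry = *pte;
  if (pte_none(entry)) {
          page = alloc_page_vma(GFP_HIGHUSER, vma, address);
          clear_user_highpage(page, address);
          new = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)), vma);
          if (!ptep_cmpxchg(pte, entry, new))
                  /* someone else populated the pte first: back out */
                  page_cache_release(page);
  }

Anything that starts from an already populated pte needs more than that.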

Although I may have some fact fundamentally wrong?

2005-02-01 04:16:06

by Nick Piggin

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

Christoph Lameter wrote:
> The page fault handler attempts to use the page_table_lock only for short
> time periods. It repeatedly drops and reacquires the lock. When the lock
> is reacquired, checks are made if the underlying pte has changed before
> replacing the pte value. These locations are a good fit for the use of
> ptep_cmpxchg.
>
> The following patch allows to remove the first time the page_table_lock is
> acquired and uses atomic operations on the page table instead. A section
> using atomic pte operations is begun with
>
> page_table_atomic_start(struct mm_struct *)
>
> and ends with
>
> page_table_atomic_stop(struct mm_struct *)
>

Hmm, this is moving toward the direction my patches take.

I think it may be the right way to go if you're lifting the ptl
from some core things, because some architectures won't want to
audit and stuff, and some may need the lock.

Naturally I prefer the complete replacement that is made with
my patch - however this of course means one has to move
*everything* over to be pte_cmpxchg safe, which runs against
your goal of getting the low hanging fruit with as little fuss
as possible for the moment.


2005-02-01 08:20:09

by baswaraj kasture

[permalink] [raw]
Subject: Kernel 2.4.21 hangs up

Hi,

I compiled kernel 2.4.21 with the Intel compiler.
While booting it hangs up. Further, I found that it
hangs up due to the call to the "calibrate_delay" routine in
"init/main.c". I also found that the loop in the
"calibrate_delay" routine goes infinite. When I comment
out the call to "calibrate_delay", it works
fine. Even compiling "init/main.c" with "-O0" works
fine. I am using IA-64 (Intel Itanium 2) with EL3.0.

Any pointers will be great help.


Thanks,
-Baswaraj




2005-02-01 08:35:37

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Kernel 2.4.21 hangs up

On Tue, 2005-02-01 at 00:20 -0800, baswaraj kasture wrote:
> Hi,
>
> I compiled kernel 2.4.21 with intel compiler .

2.4.21 isn't supposed to be compilable with the intel compiler...

> fine. I am using IA-64 (Intel Itanium 2 ) with EL3.0.

... and the RHEL3 kernel most certainly isn't.

I strongly suggest that you stick to gcc for compiling the RHEL3 kernel.


Also sticking half the world on the CC is considered rude if those
people have nothing to do with the subject at hand, as is the case here.



2005-02-01 09:14:14

by Christian Hildner

[permalink] [raw]
Subject: Re: Kernel 2.4.21 hangs up

baswaraj kasture wrote:

>Hi,
>
>I compiled kernel 2.4.21 with intel compiler .
>While booting it hangs-up . further i found that it
>hangsup due to call to "calibrate_delay" routine in
>"init/main.c". Also found that loop in the
>callibrate_delay" routine goes infinite.When i comment
>out the call to "callibrate_delay" routine, it works
>fine.Even compiling "init/main.c" with "-O0" works
>fine. I am using IA-64 (Intel Itanium 2 ) with EL3.0.
>
>Any pointers will be great help.
>
- Download ski from http://www.hpl.hp.com/research/linux/ski/download.php
- Compile your kernel for the simulator
- set simulator breakpoint at calibrate_delay
- look at ar.itc and cr.itm (cr.itm must be greater than ar.itc)

Or for debugging on hardware:
-run into loop, press the TOC button, reboot and analyze the dump with
efi shell + errdump init

Christian

2005-02-01 17:46:52

by David Mosberger

[permalink] [raw]
Subject: Re: Kernel 2.4.21 hangs up

[I trimmed the cc-list...]

>>>>> On Tue, 1 Feb 2005 00:20:01 -0800 (PST), baswaraj kasture <[email protected]> said:

Baswaraj> Hi, I compiled kernel 2.4.21 with intel compiler .

That's curious. Last time I checked, the changes needed to use the
Intel-compiler have not been backported to 2.4. What kernel sources
are you working off of?

Also, even with 2.6 you need a script from Intel which does some
"magic" GCC->ICC option translations to build the kernel with the
Intel compiler. AFAIK, this script has not been released by Intel
(hint, hint...).

Baswaraj> While booting it hangs-up . further i found that it
Baswaraj> hangsup due to call to "calibrate_delay" routine in
Baswaraj> "init/main.c". Also found that loop in the
Baswaraj> callibrate_delay" routine goes infinite.

I suspect your kernel was just miscompiled. We have used the
Intel-compiler internally on a 2.6 kernel and it worked fine at the
time, though I haven't tried recently.

--david

2005-02-01 17:57:53

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Kernel 2.4.21 hangs up

On Tue, 2005-02-01 at 09:46 -0800, David Mosberger wrote:

> Also, even with 2.6 you need a script from Intel which does some
> "magic" GCC->ICC option translations to build the kernel with the
> Intel compiler. AFAIK, this script has not been released by Intel
> (hint, hint...).
>
They posted it to the LKML some time ago (2004-03-12). (message):
http://marc.theaimsgroup.com/?l=linux-kernel&m=107913092300497
(script):
http://marc.theaimsgroup.com/?l=linux-kernel&m=107913092300497&q=p3

__
Markus

2005-02-01 18:08:44

by David Mosberger

[permalink] [raw]
Subject: Re: Kernel 2.4.21 hangs up

>>>>> On Tue, 01 Feb 2005 18:54:59 +0100, Markus Trippelsdorf <[email protected]> said:

Markus> They posted it to the LKML so time ago
Markus> (2004-03-12). (message):
Markus> http://marc.theaimsgroup.com/?l=linux-kernel&m=107913092300497
Markus> (script):
Markus> http://marc.theaimsgroup.com/?l=linux-kernel&m=107913092300497&q=p3

That script is for the x86-version of icc only.
It doesn't work for ia64, which is the context of this discussion.

--david

2005-02-01 18:44:56

by Christoph Lameter

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

On Tue, 1 Feb 2005, Nick Piggin wrote:

> Hmm, this is moving toward the direction my patches take.

You are right. But I am still wary of the transactional idea in your
patchset (which is really not comparable with a database transaction
after all...).

I think moving to cmpxchg and xchg operations will give this more
transparency and make it easier to understand and handle.

> Naturally I prefer the complete replacement that is made with
> my patch - however this of course means one has to move
> *everything* over to be pte_cmpxchg safe, which runs against
> your goal of getting the low hanging fruit with as little fuss
> as possible for the moment.

I would also prefer a replacement but there are certain cost-benefit
tradeoffs with atomic operations vs. spinlock that may better be tackled
on a case by case basis. Also this is pretty much at the core of the Linux
VM and thus highly sensitive. Given its history and the danger of breaking
things it may be best to preserve it intact as much as possible and move
in small steps.

2005-02-01 18:49:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

On Tue, 1 Feb 2005, Nick Piggin wrote:

> Slightly OT: are you still planning to move the update_mem_hiwater and
> friends crud out of these fastpaths? It looks like at least that function
> is unsafe to be lockless.

Yes. I have a patch pending and the author of the CSA patches is a
coworker of mine. The patch will be resubmitted once certain aspects
of the timer subsystem are stabilized and/or when he gets back from his
vacation. The statistics are not critical to system operation.

2005-02-01 19:04:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

On Tue, 1 Feb 2005, Nick Piggin wrote:

> > pte_unmap(page_table);
> > + page_table_atomic_stop(mm);
> >
> > /*
> > * Ok, we need to copy. Oh, well..
> > */
> > if (!PageReserved(old_page))
> > page_cache_get(old_page);
> > - spin_unlock(&mm->page_table_lock);
> >
>
> I don't think you can do this unless you have done something funky that I
> missed. And that kind of shoots down your lockless COW too, although it
> looks like you can safely have the second part of do_wp_page without the
> lock. Basically - your lockless COW patch itself seems like it should be
> OK, but this hunk does not.

See my comment at the end of this message.

> I would be very interested if you are seeing performance gains with your
> lockless COW patches, BTW.

So far I have not had time to focus on benchmarking that.

> Basically, getting a reference on a struct page was the only thing I found
> I wasn't able to do lockless with pte cmpxchg. Because it can race with
> unmapping in rmap.c and reclaim and reuse, which probably isn't too good.
> That means: the only operations you are able to do lockless is when there
> is no backing page (ie. the anonymous unpopulated->populated case).
>
> A per-pte lock is sufficient for this case, of course, which is why the
> pte-locked system is completely free of the page table lock.

Introducing pte locking would allow us to go further with parallelizing
this but its another invasive procedure. I think parallelizing COW is only
possible to do reliable with some pte locking scheme. But then the
question is if the pte locking is really faster than obtaining a spinlock.
I suspect this may not be the case.

> Although I may have some fact fundamentally wrong?

The unmapping in rmap.c would change the pte. This would be discovered
after acquiring the spinlock later in do_wp_page. Which would then lead to
the operation being abandoned.
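
For reference, the recheck that catches this is roughly the following,
a simplified sketch of the shape of the existing logic in do_wp_page:

  spin_lock(&mm->page_table_lock);
  page_table = pte_offset_map(pmd, address);
  if (!pte_same(*page_table, pte)) {
          /* pte changed under us (e.g. rmap unmapped the page): abandon */
          pte_unmap(page_table);
          spin_unlock(&mm->page_table_lock);
          page_cache_release(new_page);
          return VM_FAULT_MINOR;
  }
  /* otherwise install new_page via break_cow() as usual */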

2005-02-02 00:36:09

by Nick Piggin

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

On Tue, 2005-02-01 at 11:01 -0800, Christoph Lameter wrote:
> On Tue, 1 Feb 2005, Nick Piggin wrote:

> > A per-pte lock is sufficient for this case, of course, which is why the
> > pte-locked system is completely free of the page table lock.
>
> Introducing pte locking would allow us to go further with parallelizing
> this but its another invasive procedure. I think parallelizing COW is only
> possible to do reliable with some pte locking scheme. But then the
> question is if the pte locking is really faster than obtaining a spinlock.
> I suspect this may not be the case.
>

Well most likely not although I haven't been able to detect much
difference. But in your case you would probably be happy to live
with that if it meant better parallelising of an important
function... but we'll leave future discussion to another thread ;)

> > Although I may have some fact fundamentally wrong?
>
> The unmapping in rmap.c would change the pte. This would be discovered
> after acquiring the spinlock later in do_wp_page. Which would then lead to
> the operation being abandoned.
>

Oh yes, but suppose your page_cache_get is happening at the same time
as free_pages_check, after the page gets freed by the scanner? I can't
actually think of anything that would cause a real problem (ie. not a
debug check), off the top of my head. But can you say there _isn't_
anything?

Regardless, it seems pretty dirty to me. But could possibly be made
workable, of course.



2005-02-02 01:21:17

by Christoph Lameter

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

On Wed, 2 Feb 2005, Nick Piggin wrote:

> > The unmapping in rmap.c would change the pte. This would be discovered
> > after acquiring the spinlock later in do_wp_page. Which would then lead to
> > the operation being abandoned.
> Oh yes, but suppose your page_cache_get is happening at the same time
> as free_pages_check, after the page gets freed by the scanner? I can't
> actually think of anything that would cause a real problem (ie. not a
> debug check), off the top of my head. But can you say there _isn't_
> anything?
>
> Regardless, it seems pretty dirty to me. But could possibly be made
> workable, of course.

Would it make you feel better if we would move the spin_unlock back to the
prior position? This would ensure that the fallback case is exactly the
same.

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-31 08:59:07.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-02-01 10:55:30.000000000 -0800
@@ -1318,7 +1318,6 @@ static int do_wp_page(struct mm_struct *
}
}
pte_unmap(page_table);
- page_table_atomic_stop(mm);

/*
* Ok, we need to copy. Oh, well..
@@ -1326,6 +1325,7 @@ static int do_wp_page(struct mm_struct *
if (!PageReserved(old_page))
page_cache_get(old_page);

+ page_table_atomic_stop(mm);
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
if (old_page == ZERO_PAGE(address)) {

2005-02-02 01:41:56

by Nick Piggin

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

On Tue, 2005-02-01 at 17:20 -0800, Christoph Lameter wrote:
> On Wed, 2 Feb 2005, Nick Piggin wrote:
>
> > > The unmapping in rmap.c would change the pte. This would be discovered
> > > after acquiring the spinlock later in do_wp_page. Which would then lead to
> > > the operation being abandoned.
> > Oh yes, but suppose your page_cache_get is happening at the same time
> > as free_pages_check, after the page gets freed by the scanner? I can't
> > actually think of anything that would cause a real problem (ie. not a
> > debug check), off the top of my head. But can you say there _isn't_
> > anything?
> >
> > Regardless, it seems pretty dirty to me. But could possibly be made
> > workable, of course.
>
> Would it make you feel better if we would move the spin_unlock back to the
> prior position? This would ensure that the fallback case is exactly the
> same.
>

Well yeah, but the interesting case is when that isn't a lock ;)

I'm not saying what you've got is no good. I'm sure it would be fine
for testing. And if it happens that we can do the "page_count doesn't
mean anything after it has reached zero and been freed. Nor will it
necessarily be zero when a new page is allocated" thing without many
problems, then this may be a fine way to do it.

I was just pointing out this could be a problem without putting a
lot of thought into it...



2005-02-02 03:16:34

by Nick Piggin

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

On Tue, 2005-02-01 at 18:49 -0800, Christoph Lameter wrote:
> On Wed, 2 Feb 2005, Nick Piggin wrote:
>
> > Well yeah, but the interesting case is when that isn't a lock ;)
> >
> > I'm not saying what you've got is no good. I'm sure it would be fine
> > for testing. And if it happens that we can do the "page_count doesn't
> > mean anything after it has reached zero and been freed. Nor will it
> > necessarily be zero when a new page is allocated" thing without many
> > problems, then this may be a fine way to do it.
> >
> > I was just pointing out this could be a problem without putting a
> > lot of thought into it...
>
> Surely we need to do this the right way. Do we really need to
> use page_cache_get()? Is anything relying on page_count == 2 of
> the old_page?
>
> I mean we could just speculatively copy, risk copying crap and
> discard that later when we find that the pte has changed. This would
> simplify the function:
>

I think this may be the better approach. Anyone else?



2005-02-02 04:08:48

by Christoph Lameter

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

On Wed, 2 Feb 2005, Nick Piggin wrote:

> Well yeah, but the interesting case is when that isn't a lock ;)
>
> I'm not saying what you've got is no good. I'm sure it would be fine
> for testing. And if it happens that we can do the "page_count doesn't
> mean anything after it has reached zero and been freed. Nor will it
> necessarily be zero when a new page is allocated" thing without many
> problems, then this may be a fine way to do it.
>
> I was just pointing out this could be a problem without putting a
> lot of thought into it...

Surely we need to do this the right way. Do we really need to
use page_cache_get()? Is anything relying on page_count == 2 of
the old_page?

I mean we could just speculatively copy, risk copying crap and
discard that later when we find that the pte has changed. This would
simplify the function:

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-02-01 18:10:46.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-02-01 18:43:08.000000000 -0800
@@ -1323,9 +1323,6 @@ static int do_wp_page(struct mm_struct *
/*
* Ok, we need to copy. Oh, well..
*/
- if (!PageReserved(old_page))
- page_cache_get(old_page);
-
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
if (old_page == ZERO_PAGE(address)) {
@@ -1336,6 +1333,10 @@ static int do_wp_page(struct mm_struct *
new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!new_page)
goto no_new_page;
+ /*
+ * No page_cache_get so we may copy some crap
+ * that is later discarded if the pte has changed
+ */
copy_user_highpage(new_page, old_page, address);
}
/*
@@ -1352,7 +1353,6 @@ static int do_wp_page(struct mm_struct *
acct_update_integrals();
update_mem_hiwater();
} else
-
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1363,7 +1363,6 @@ static int do_wp_page(struct mm_struct *
}
pte_unmap(page_table);
page_cache_release(new_page);
- page_cache_release(old_page);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
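
To illustrate the discard path the hunks above rely on, here is a minimal
sketch (simplified: the rmap and rss accounting is omitted and the helper
name is made up) of rechecking the pte under the lock after the
speculative copy:

/*
 * Hypothetical sketch, not part of the patch above: validate the
 * speculative copy after re-taking the page_table_lock.  If the pte
 * changed while we copied without holding a reference, the copy may
 * contain stale data and is simply dropped.
 */
static int install_speculative_copy(struct mm_struct *mm,
	struct vm_area_struct *vma, unsigned long address, pmd_t *pmd,
	pte_t orig_pte, struct page *new_page)
{
	pte_t *page_table;

	spin_lock(&mm->page_table_lock);
	page_table = pte_offset_map(pmd, address);
	if (pte_same(*page_table, orig_pte)) {
		/* pte unchanged: the copied data is valid, install it */
		break_cow(vma, new_page, address, page_table);
		lru_cache_add_active(new_page);
	} else {
		/* pte changed under us: discard the possibly stale copy */
		page_cache_release(new_page);
	}
	pte_unmap(page_table);
	spin_unlock(&mm->page_table_lock);
	return VM_FAULT_MINOR;
}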

2005-02-04 06:27:29

by Nick Piggin

[permalink] [raw]
Subject: Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

On Wed, 2005-02-02 at 14:09 +1100, Nick Piggin wrote:
> On Tue, 2005-02-01 at 18:49 -0800, Christoph Lameter wrote:
> > On Wed, 2 Feb 2005, Nick Piggin wrote:

> > I mean we could just speculatively copy, risk copying crap and
> > discard that later when we find that the pte has changed. This would
> > simplify the function:
> >
>
> I think this may be the better approach. Anyone else?
>

Not to say it is perfect either. Normal semantics say not to touch
a page if it is not somehow pinned. So this may cause problems in
corner cases (DEBUG_PAGEALLOC comes to mind... hopefully nothing else).

But I think a plain read of the page when it isn't pinned is less
yucky than writing into the non-pinned struct page.



2005-02-07 06:15:01

by baswaraj kasture

[permalink] [raw]
Subject: Kernel 2.4.21 gives kernel panic at boot time

Hi,

I have compiled kernel 2.4.21. Compilation went
well, but I got the following message at boot time.
=============================================
.
.
/lib/mptscsih.o : unresolved symbol
mpt_deregister_Rsmp_6fb5ab71
/lib/mptscsih.o : Unresolved symbol
mpt_event_register_Rsmp_34ace96b

ERROR : /bin/insmod exited abnormally
Mounting /proc filesystem
Creating block devices
VFS : cannot open root device "LABEL=/" or 00:00
Please append correct "root=" boot option
Kernel panic : VFS : Unable to mount root fs on 00:00


===========================================


I have following lines in my elilo.conf ,
------------------------------
#original kernel
image=vmlinuz-2.4.21-9.EL
label=linux
initrd=initrd-2.4.21-9.EL.img
read-only
append="root=LABEL=/"

#icc-O2
image=iccvmlinux
label=icc_O2
initrd=iccinitrd-preBasicc.img
read-only
append="root=LABEL=/"
---------------------------
The first one works fine.

Any clues why I am getting this error?

Is it related to the SCSI driver?

Further, "/sbin/mkinitrd -f -v" gave the following
message:
======================================
.
.
.
Looking for deps of module scsi_mod
Looking for deps of module sd_mod
Looking for deps of module unknown
Looking for deps of module mptbase
Looking for deps of module mptscsih mptbase
Looking for deps of module mptbase
Looking for deps of module ide-disk
Looking for deps of module ext3
Using modules:
./kernel/drivers/message/fusion/mptbase.o
./kernel/drivers/message/fusion/mptscsih.o
Using loopback device /dev/loop0
/sbin/nash -> /tmp/initrd.EsIvQ9/bin/nash
/sbin/insmod.static -> /tmp/initrd.EsIvQ9/bin/insmod
`/lib/modules/2.4.21preBasicc/./kernel/drivers/message/fusion/mptbase.o'
-> `/tmp/initrd.EsIvQ9/lib/mptbase.o'
`/lib/modules/2.4.21preBasicc/./kernel/drivers/message/fusion/mptscsih.o'
-> `/tmp/initrd.EsIvQ9/lib/mptscsih.o'
Loading module mptbase
Loading module mptscsih

=======================================



Any clues would be a great help.


Thanx,
Baswaraj




2005-02-17 00:58:10

by Christoph Lameter

[permalink] [raw]
Subject: page fault scalability patchsets update: prezeroing, prefaulting and atomic operations

I thought I would save myself the constant crossposting of large amounts of
patches. Patches, documentation, test results etc. are available at

http://oss.sgi.com/projects/page_fault_performance/

Changes:
- Performance tests for all patchsets (i386 single processor,
Altix 8 processors and Altix 128 processors)
- Archives of patches so far
- Some docs (still needs work)
- Patches against 2.6.11-rc4-bk4

Patch specific:

atomic operations for page faults (V17)

- Avoid incrementing page count for page in do_wp_page (see discussion
with Nick Piggin on last patchset)

prezeroing (V7)

- Set /proc/sys/vm/scrub_load to 1 by default to avoid a slight performance
loss during kernel compiles on i386
- scrubd needs to be enabled in the kernel configuration as an experimental
feature.
- The patch still follows kswapd's method of binding node-specific scrubd daemons
to each NUMA node. I cannot find any new infrastructure to assign tasks to
certain nodes: kthread_bind() binds to a single cpu and not to a NUMA
node. Other API work would have to be done first to realize
Andrew's proposed approach.

prefaulting (V5)

- Set default for /proc/sys/vm/max_prealloc_order to 1 to avoid
overallocating pages which led to a performance loss in some
situations.

This is a pretty complex thing to manage, so please tell me if I missed
anything ...

2005-02-24 06:06:46

by Christoph Lameter

[permalink] [raw]
Subject: A Proposal for an MMU abstraction layer

1. Rationale
============

Currently the Linux kernel implements a hierarchical page table utilizing 4
layers. Architectures that have fewer layers may cause the kernel to not
generate code for certain layers. However, there are other means for an MMU
to describe page tables to the system. For example, the Itanium (and other
CPUs) supports hashed page table structures or linear page tables. IA64 has
to simulate the hierarchical layers through its linear page tables and
implements the higher layers in software.

Moreover, different architectures have different means of implementing
huge page table entries. On IA32 this is realized by omitting the lower
layer entries and providing a single PMD entry replacing 512/1024 PTE
entries. On IA64 a PTE entry is used for that purpose. Other architectures
realize huge page table entries through groups of PTE entries. There are
hooks for each of these methods in the kernel. Moreover, huge pages are
not handled like other pages but are managed through a file system. Only
one size of huge page is supported. It would be much better if huge pages
were handled more like regular pages, with support for multiple page
sizes (which may then lead to support for variable page sizes in the VM).

It would be best to hide these implementation differences in an MMU
abstraction layer. Various architectures could then implement their own
way of representing page table entries. We would provide legacy 4-layer,
3-layer and 2-layer implementations that would take care of the existing
implementations. These generic implementations could then be taken by an
architecture and amended to provide huge page table entries in a way
fitting for that architecture. For IA64 and other platforms that allow
alternate ways of maintaining translations, we could avoid maintaining a
hierarchical table.

There are a couple of additional features for page tables that then could
also be worked into that abstraction layer:

A. Global translation entries.
B. Variable page size.
C. Use a transactional scheme to allow a variety of synchronization
schemes.

Early idea for an mmu abstraction layer API
===========================================

Three new opaque types:

mmu_entry_t
mmu_translation_set_t
mmu_transaction_t

*mmu_entry_t* replaces the existing pte_t and has roughly the same features.
However, mmu_entry_t describes a translation of a logical address to a
physical address in general. This means that an mmu_entry_t must be able
to represent all possible mappings, including mappings for huge pages and
pages of various sizes, if these features are supported by the method of
handling page tables. If statistics need to be kept about entries, then the
entry will also contain a number indicating which counter to update when
inserting or deleting this type of entry [spare bits may be used for this
purpose].

*mmu_translation_set_t* represents a virtual address space for a process and is essentially
a set of mmu_entry_t's plus additional management information that may be necessary to
manage an address space.

*mmu_transaction_t* allows transactions to be performed on translation entries and maintains
the state of a transaction. The state information allows changes to be undone or committed
in a way that must appear atomic to any other access in the system.

Operations on mmu_translation_set_t
-----------------------------------

void mmu_new_translation_set(struct mmu_translation_set_t *t);
Generates an empty translation set

void mmu_dup_translation_set(struct mmu_translation_set_t *dest, struct mmu_translation_set_t *src);
Generates a duplicate of a translation set

void mmu_remove_translation_set(struct mmu_translation_set_t *t);
Removes a translation set

void mmu_clear_range(struct mmu_translation_set_t *t, unsigned long start, unsigned long end);
Wipe out a range of addresses in the translation set

void mmu_copy_range(struct mmu_translation_set_t *dest, struct
mmu_translation_set_t *src, unsigned long dest_start, unsigned long src_start, unsigned long
length);

These functions are not implemented during the period in which the old and new
schemes coexist, since this would require a major change to mm_struct.

Transactional operations
------------------------

void mmu_transaction(struct mmu_transaction_t *ta, struct mmu_translation_set_t *tr);
Begin a transaction

For the coexistence period this is implemented as

void mmu_transaction(struct mmu_transaction_t *ta, struct mm_struct *mm,
struct vm_area_struct *vma);

void mmu_commit(struct mmu_transaction_t *ta);
Commit the changes made

void mmu_forget(struct mmu_transaction_t *ta);
Undo the changes made

mmu_entry_t mmu_find(struct mmu_transaction_t *ta, unsigned long address);
Find mmu entry and make this the current entry

void mmu_update(struct mmu_transaction_t *ta, mmu_entry_t entry);
Update the current entry

void mmu_add(struct mmu_transaction_t *ta, mmu_entry_t entry, unsigned long address);
Add a new translation entry

void mmu_remove(struct mmu_transaction_t *ta);
Remove current translation entry
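
As a rough illustration of how these operations might compose, here is a
purely hypothetical sketch against the proposed API (nothing here exists
yet; mk_mmu_entry_writable() is an assumed helper, and the
coexistence-period form of mmu_transaction is used):

/*
 * Hypothetical sketch of a write-fault path using the proposed
 * transactional API.  All mmu_* calls are from the proposal above;
 * mk_mmu_entry_writable() is an assumption for illustration only.
 */
static int wp_fault_sketch(struct mm_struct *mm, struct vm_area_struct *vma,
	unsigned long address)
{
	struct mmu_transaction_t ta;
	mmu_entry_t entry;

	mmu_transaction(&ta, mm, vma);	/* begin (coexistence-period form) */
	entry = mmu_find(&ta, address);	/* becomes the current entry */

	/* ... allocate and populate the new page here; on failure,
	 * mmu_forget(&ta) would undo any changes made so far ... */

	mmu_update(&ta, mk_mmu_entry_writable(entry));
	mmu_commit(&ta);		/* must appear atomic to other accesses */
	return VM_FAULT_MINOR;
}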

Operations on mmu_entry_t
-------------------------
The same as for pte_t now. Additionally:

mmu_entry_t mkglobal(mmu_entry_t entry)
Define an entry to be global (valid for all translation sets)

mmu_entry_t mksize(mmu_entry_t entry, unsigned order)
Set the page size in an entry to the given order.

mmu_entry_t mkcount(mmu_entry_t entry, unsigned long counter)
Adding and removing this entry must lead to an update of the specified
counter.

Not for coexistence period.
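
A rough sketch of composing an entry with these helpers follows
(hypothetical: mk_mmu_entry() is an assumed constructor, analogous to
mk_pte(); only mkglobal/mksize/mkcount and mmu_add come from the proposal):

/*
 * Hypothetical sketch: build and insert a global huge-page entry.
 * mk_mmu_entry() is an assumed constructor, not part of the proposal.
 */
static void add_global_huge_entry(struct mmu_transaction_t *ta,
	struct page *page, struct vm_area_struct *vma, unsigned long address)
{
	mmu_entry_t entry = mk_mmu_entry(page, vma->vm_page_prot);

	entry = mkglobal(entry);	/* valid in all translation sets */
	entry = mksize(entry, 9);	/* order-9 page, 2MB with a 4KB base page */
	entry = mkcount(entry, 1);	/* insert/remove updates counter 1 */

	mmu_add(ta, entry, address);	/* add within an open transaction */
}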

Statistics
----------

void mmu_stats(struct mmu_translation_set_t *t, unsigned long *entries,
unsigned long *size_in_pages, unsigned long *counters[]);

Not for coexistence period.

Scanning through mmu entries
----------------------------

void mmu_scan(struct mmu_translation_set_t *t, unsigned long start,
unsigned long end,
mmu_entry_t (*func)(mmu_entry_t entry, void *private),
void *private);
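
As an example of how the scan callback might be used, a sketch follows
(hypothetical: it assumes the callback's return value replaces the scanned
entry and that an mmu_wrprotect() helper exists by analogy with
pte_wrprotect()):

/*
 * Hypothetical sketch: write-protect every entry in [start, end).
 * Assumes the returned entry replaces the scanned one and that
 * mmu_wrprotect() exists by analogy with pte_wrprotect().
 */
static mmu_entry_t wrprotect_one(mmu_entry_t entry, void *private)
{
	unsigned long *count = private;

	(*count)++;			/* number of entries touched */
	return mmu_wrprotect(entry);	/* modified entry is written back */
}

static void wrprotect_range(struct mmu_translation_set_t *t,
	unsigned long start, unsigned long end)
{
	unsigned long count = 0;

	mmu_scan(t, start, end, wrprotect_one, &count);
}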