2004-11-19 19:43:53

by Christoph Lameter

Subject: page fault scalability patch V11 [0/7]: overview

Signed-off-by: Christoph Lameter <[email protected]>

Changes from V10->V11 of this patch:
- cmpxchg_i386: Optimize code generated after feedback from Linus. Various
fixes.
- drop make_rss_atomic in favor of rss_sloppy
- generic: adapt to new changes in Linus tree, some fixes to fallback
functions. Add generic ptep_xchg_flush based on xchg.
- S390: remove use of page_table_lock from ptep_xchg_flush (deadlock)
- x86_64: remove ptep_xchg
- i386: integrated Nick Piggin's changes for PAE mode. Create ptep_xchg_flush and
various fixes.
- ia64: if necessary flush icache before ptep_cmpxchg. Remove ptep_xchg

This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 32 processors allocating 32 GB with an increasing
number of cpus.

Without the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 3.966s 366.840s 370.082s 56556.456 56553.456
32 10 2 3.604s 319.004s 172.058s 65006.086 121511.453
32 10 4 3.705s 341.550s 106.007s 60741.936 197704.486
32 10 8 3.597s 809.711s 119.021s 25785.427 175917.674
32 10 16 5.886s 2238.122s 163.084s 9345.560 127998.973
32 10 32 21.748s 5458.983s 201.062s 3826.409 104011.521

With the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 3.772s 330.629s 334.042s 62713.587 62708.706
32 10 2 3.767s 352.252s 185.077s 58905.502 112886.222
32 10 4 3.549s 255.683s 77.000s 80898.177 272326.496
32 10 8 3.522s 263.879s 52.030s 78427.083 400965.857
32 10 16 5.193s 384.813s 42.076s 53772.158 490378.852
32 10 32 15.806s 996.890s 54.077s 20708.587 382879.208

With a high number of CPUs the page fault rate improves more than
twofold and may reach 500000 faults/sec between 16 and 512 cpus. The
fault rate drops when a process is running on all processors, as seen
here in the 32 cpu case.

Note that the measurements were done on a NUMA system and this
test uses off node memory. Variations may exist because memory is
allocated at varying distances from the local cpu. The slight drop
for 2 cpus is probably due to that effect.

The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on ptes (ptep_xchg, ptep_cmpxchg) and on pmds and pgds
(pgd_test_and_populate, pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them, and the
swapper code has been changed to never set a pte to empty until the
page has been evicted. The population of an empty pte is frequent
if a process touches newly allocated memory.
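This case can be sketched in userspace C (hypothetical names; GCC's
__sync_val_compare_and_swap stands in for the kernel's cmpxchg): a fault
handler only installs its new entry if the slot is still empty, so two
racing faults cannot both populate the same pte.

```c
#include <stdint.h>

typedef uint64_t pte_t;             /* stand-in for a hardware pte      */
#define PTE_NONE ((pte_t)0)         /* an empty (not present) entry     */

/* Model of ptep_cmpxchg: install newval only if the entry still holds
 * oldval.  Returns 1 on success, 0 if another fault got there first. */
static int ptep_cmpxchg_model(pte_t *ptep, pte_t oldval, pte_t newval)
{
    return __sync_val_compare_and_swap(ptep, oldval, newval) == oldval;
}

/* A faulting thread that loses the race must back out and release the
 * page it allocated, as do_anonymous_page does in the patch. */
static int populate_empty_pte(pte_t *ptep, pte_t new_entry)
{
    return ptep_cmpxchg_model(ptep, PTE_NONE, new_entry);
}
```

The loser of the race sees the cmpxchg fail and simply frees its page;
no lock is ever taken on the fast path.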

2. Modifications of flags in a pte entry (write/accessed).

These modifications are done by the CPU, or by low level handlers
on various platforms that also bypass the page_table_lock, so this
appears to be safe as well.
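The flag-update case follows the same pattern (a sketch with a
hypothetical bit layout, not the real pte encoding): compute the new
accessed/dirty bits and try to install them with a cmpxchg; if the entry
changed underneath us, the update can simply be dropped, since the
hardware or a retried fault will set the bits again.

```c
#include <stdint.h>

typedef uint64_t pte_t;
#define PTE_ACCESSED ((pte_t)1 << 5)   /* hypothetical bit positions */
#define PTE_DIRTY    ((pte_t)1 << 6)

/* Set the accessed bit (and the dirty bit for a write) without taking
 * the page_table_lock.  Returns 1 if our value was installed, 0 if the
 * entry changed concurrently -- losing this update is harmless. */
static int pte_mark_young(pte_t *ptep, int write_access)
{
    pte_t old = *ptep;
    pte_t new = old | PTE_ACCESSED;

    if (write_access)
        new |= PTE_DIRTY;
    return __sync_val_compare_and_swap(ptep, old, new) == old;
}
```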

One essential change in the VM is the use of ptep_cmpxchg (or its generic
emulation) on page table entries before doing an update_mmu_cache without
holding the page table lock. However, we already do similar things with other
atomic pte operations such as ptep_get_and_clear and ptep_test_and_clear_dirty.
Those operations clear a pte *after* doing an operation on it. The ptep_cmpxchg
as used in this patch operates on a *cleared* pte and replaces it with a pte
pointing to valid memory. The effect of this change on the various architectures
has to be thought through; local definitions of ptep_cmpxchg and ptep_xchg may
be necessary.

For IA64 an icache coherency issue may arise that potentially requires the
flushing of the icache (as done via update_mmu_cache on IA64) prior
to the use of ptep_cmpxchg. Similar issues may arise on other platforms.

The patch uses sloppy rss handling: mm->rss is incremented without
proper locking because locking would introduce too much overhead. Rss
is not essential for vm operations (3 uses of rss in rmap.c were unnecessary
and were removed). The difference in rss values has been found to be less than
1% in our tests (see also the separate email to linux-mm and linux-ia64 on the
subject of "sloppy rss"). Moving away from the atomic rss operations of earlier
versions of this patch also increases the performance of the page fault handler
in the single thread case over an unpatched kernel.
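The effect of sloppy rss can be demonstrated with a small userspace
model (invented names; pthreads stand in for concurrent page faults):
plain unlocked increments may lose updates, so the counter can
undercount, but it can only ever be too small, never too large, and no
locking or atomic-instruction cost is paid.

```c
#include <pthread.h>

#define NTHREADS 4
#define ITERS    100000L

static volatile long sloppy_rss;        /* shared, deliberately unlocked */

static void *fault_model(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        sloppy_rss++;                   /* racy load-modify-store */
    return NULL;
}

/* Run NTHREADS concurrent "fault handlers" and return the final count.
 * Lost updates can only make the result smaller, never larger. */
static long run_sloppy_rss(void)
{
    pthread_t t[NTHREADS];

    sloppy_rss = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, fault_model, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return sloppy_rss;
}
```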

Note that I have posted two other approaches for dealing with the rss problem:

A. make_rss_atomic. Earlier releases contained that patch, but then another
variable (such as anon_rss) was introduced that would have required additional
atomic operations. Atomic rss operations also cause slowdowns on
machines with a high number of cpus due to memory contention.

B. remove_rss. Replace rss with a periodic scan over the vm to determine
rss and additional numbers. This was also discussed on linux-mm and linux-ia64.
The scans required while displaying /proc data were undesirable.

The patchset is composed of 7 patches:

1/7: Sloppy rss

Removes mm->rss usage from mm/rmap.c and ensures that negative rss values
are not displayed.

2/7: Avoid page_table_lock in handle_mm_fault

This patch defers the acquisition of the page_table_lock as much as
possible and uses atomic operations for allocating anonymous memory.
These atomic operations are simulated by acquiring the page_table_lock
for very small time frames if an architecture does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
pte will not be set to empty if a page is in transition to swap.

If only the first two patches are applied then the time that the page_table_lock
is held is simply reduced. The lock may then be acquired multiple
times during a page fault.

The remaining patches introduce the necessary atomic pte operations to avoid
the page_table_lock.

3/7: Atomic pte operations for ia64

4/7: Make cmpxchg generally available on i386

The atomic operations on the page table rely heavily on cmpxchg instructions.
This patch adds emulations for cmpxchg and cmpxchg8b for old 80386 and 80486
cpus. The emulations are only included if the kernel is built for these old
cpus; they are skipped in favor of the real cmpxchg instructions if a kernel
built for the 386 or 486 is run on a more recent cpu.

This patch may be used independently of the other patches.
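The emulation approach can be sketched as follows (a hypothetical
userspace model, not the patch's actual i386 code: a pthread mutex
stands in for the brief locking the real fallback uses to make the
compare-and-set appear atomic):

```c
#include <pthread.h>

/* Model of the 386/486 fallback: no cmpxchg instruction exists, so the
 * compare-and-set is performed under a lock instead. */
static pthread_mutex_t emu_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned long cmpxchg_emulated(unsigned long *p,
                                      unsigned long old, unsigned long new)
{
    unsigned long prev;

    pthread_mutex_lock(&emu_lock);
    prev = *p;
    if (prev == old)
        *p = new;                  /* store only if unchanged */
    pthread_mutex_unlock(&emu_lock);
    return prev;                   /* caller checks prev == old */
}
```

Like the real cmpxchg, the function returns the previous value, so the
caller detects success by comparing it with the expected old value.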

5/7: Atomic pte operations for i386

A generally available cmpxchg (previous patch) is required for this patch to
preserve the ability to build kernels for the 386 and 486.

6/7: Atomic pte operation for x86_64

7/7: Atomic pte operations for s390


2004-11-19 19:50:06

by Christoph Lameter

Subject: page fault scalability patch V11 [2/7]: page fault handler optimizations

Changelog
* Increase parallelism in SMP configurations by deferring
the acquisition of page_table_lock in handle_mm_fault
* Anonymous memory page faults bypass the page_table_lock
through the use of atomic page table operations
* Swapper does not set pte to empty in transition to swap
* Simulate atomic page table operations using the
page_table_lock if an arch does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
a performance benefit since the page_table_lock
is held for shorter periods of time.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-11-18 12:25:49.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-11-19 06:38:53.000000000 -0800
@@ -1330,8 +1330,7 @@
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1343,15 +1342,13 @@
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1374,8 +1371,7 @@
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1422,14 +1418,12 @@
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
@@ -1441,7 +1435,6 @@
if (write_access) {
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
@@ -1450,30 +1443,37 @@
goto no_mem;
clear_user_highpage(page, addr);

- spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
vma->vm_page_prot)),
vma);
- lru_cache_add_active(page);
mark_page_accessed(page);
- page_add_anon_rmap(page, vma, addr);
}

- set_pte(page_table, entry);
+ /* update the entry */
+ if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
+ if (write_access) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ }
+ goto out;
+ }
+ if (write_access) {
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ lru_cache_add_active(page);
+ page_add_anon_rmap(page, vma, addr);
+ mm->rss++;
+
+ }
pte_unmap(page_table);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
- spin_unlock(&mm->page_table_lock);
out:
return VM_FAULT_MINOR;
no_mem:
@@ -1489,12 +1489,12 @@
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1505,9 +1505,8 @@

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1605,7 +1604,7 @@
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1618,13 +1617,12 @@
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

pgoff = pte_to_pgoff(*pte);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -1643,49 +1641,40 @@
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that case is handled properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ /*
+ * This is the case in which we only update some bits in the pte.
+ */
+ new_entry = pte_mkyoung(entry);
if (write_access) {
- if (!pte_write(entry))
+ if (!pte_write(entry)) {
+ /* do_wp_page expects us to hold the page_table_lock */
+ spin_lock(&mm->page_table_lock);
return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ }
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+ if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
+ update_mmu_cache(vma, address, new_entry);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
}

@@ -1703,22 +1692,45 @@

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We rely on the mmap_sem and the SMP-safe atomic PTE updates.
+ * to synchronize with kswapd
*/
- spin_lock(&mm->page_table_lock);
- pmd = pmd_alloc(mm, pgd, address);
+ if (unlikely(pgd_none(*pgd))) {
+ pmd_t *new = pmd_alloc_one(mm, address);
+ if (!new)
+ return VM_FAULT_OOM;
+
+ /* Ensure that the update is done atomically */
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pmd_free(new);
+ }
+
+ pmd = pmd_offset(pgd, address);
+
+ if (likely(pmd)) {
+ pte_t *pte;
+
+ if (!pmd_present(*pmd)) {
+ struct page *new;

- if (pmd) {
- pte_t * pte = pte_alloc_map(mm, pmd, address);
- if (pte)
+ new = pte_alloc_one(mm, address);
+ if (!new)
+ return VM_FAULT_OOM;
+
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else
+ inc_page_state(nr_page_table_pages);
+ }
+
+ pte = pte_offset_map(pmd, address);
+ if (likely(pte))
return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_OOM;
}

Index: linux-2.6.9/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-generic/pgtable.h 2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/include/asm-generic/pgtable.h 2004-11-19 07:54:05.000000000 -0800
@@ -134,4 +134,60 @@
#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr)
#endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to insure some form of locking.
+ * Note though that low level operations as well as the
+ * page table handling of the cpu may bypass all locking.
+ */
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \
+({ \
+ int __rc; \
+ spin_lock(&__vma->vm_mm->page_table_lock); \
+ __rc = pte_same(*(__ptep), __oldval); \
+ if (__rc) set_pte(__ptep, __newval); \
+ spin_unlock(&__vma->vm_mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pgd_present(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pmd); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\
+ flush_tlb_page(__vma, __address); \
+ __p; \
+})
+
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-19 06:38:51.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-19 06:38:53.000000000 -0800
@@ -419,7 +419,10 @@
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
*
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
*/
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
@@ -561,11 +564,6 @@

/* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -580,11 +578,15 @@
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte(pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
mm->anon_rss--;
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);

+ /* Move the dirty bit to the physical page now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
mm->rss--;
page_remove_rmap(page);
page_cache_release(page);
@@ -671,15 +673,21 @@
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
+ /*
+ * There would be a race here with handle_mm_fault and do_anonymous_page
+ * which bypasses the page_table_lock if we would zap the pte before
+ * putting something into it. On the other hand we need to
+ * have the dirty flag setting at the time we replaced the value.
+ */

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte(pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_get_and_clear(pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);


2004-11-19 19:50:06

by Christoph Lameter

Subject: page fault scalability patch V11 [3/7]: ia64 atomic pte operations

Changelog
* Provide atomic pte operations for ia64
* Enhanced parallelism in page fault handler if applied together
with the generic patch

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-11-19 07:54:19.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE 0
+#define PGD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -78,12 +82,19 @@
preempt_enable();
}

+
static inline void
pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
{
pgd_val(*pgd_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE;
+}

static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +143,13 @@
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{
Index: linux-2.6.9/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgtable.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/pgtable.h 2004-11-19 07:55:35.000000000 -0800
@@ -414,6 +425,26 @@
#endif
}

+/*
+ * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
+ * information. However, we use this routine to take care of any (delayed) i-cache
+ * flushing that may be necessary.
+ */
+extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ /*
+ * IA64 defers icache flushes. If the new pte is executable we may
+ * have to flush the icache to insure cache coherency immediately
+ * after the cmpxchg.
+ */
+ if (pte_exec(newval))
+ update_mmu_cache(vma, addr, newval);
+ return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
static inline int
pte_same (pte_t a, pte_t b)
{
@@ -476,13 +507,6 @@
struct vm_area_struct * prev, unsigned long start, unsigned long end);
#endif

-/*
- * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
- * information. However, we use this routine to take care of any (delayed) i-cache
- * flushing that may be necessary.
- */
-extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
-
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Update PTEP with ENTRY, which is guaranteed to be a less
@@ -560,6 +584,8 @@
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+#define __HAVE_ARCH_LOCK_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */

2004-11-19 19:56:08

by Christoph Lameter

Subject: page fault scalability patch V11 [5/7]: i386 atomic pte operations

Changelog
* Atomic pte operations for i386 in regular and PAE modes

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-i386/pgtable.h 2004-11-19 10:05:27.000000000 -0800
@@ -413,6 +413,7 @@
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _I386_PGTABLE_H */
Index: linux-2.6.9/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable-3level.h 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable-3level.h 2004-11-19 10:10:06.000000000 -0800
@@ -6,7 +6,8 @@
* tables on PPro+ CPUs.
*
* Copyright (C) 1999 Ingo Molnar <[email protected]>
- */
+ * August 26, 2004 added ptep_cmpxchg <[email protected]>
+*/

#define pte_ERROR(e) \
printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -42,26 +43,15 @@
return pte_x(pte);
}

-/* Rules for using set_pte: the pte being assigned *must* be
- * either not present or in a state where the hardware will
- * not attempt to update the pte. In places where this is
- * not possible, use pte_get_and_clear to obtain the old pte
- * value and then use set_pte to update it. -ben
- */
-static inline void set_pte(pte_t *ptep, pte_t pte)
-{
- ptep->pte_high = pte.pte_high;
- smp_wmb();
- ptep->pte_low = pte.pte_low;
-}
-#define __HAVE_ARCH_SET_PTE_ATOMIC
-#define set_pte_atomic(pteptr,pteval) \
+#define set_pte(pteptr,pteval) \
set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
#define set_pmd(pmdptr,pmdval) \
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
#define set_pgd(pgdptr,pgdval) \
set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval))

+#define set_pte_atomic set_pte
+
/*
* Pentium-II erratum A13: in PAE mode we explicitly have to flush
* the TLB via cr3 if the top-level pgd is changed...
@@ -142,4 +132,23 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
#define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val })

+/* Atomic PTE operations */
+#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \
+({ pte_t __r; \
+ /* xchg acts as a barrier before the setting of the high bits. */\
+ __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \
+ __r.pte_high = (__ptep)->pte_high; \
+ (__ptep)->pte_high = (__newval).pte_high; \
+ flush_tlb_page(__vma, __addr); \
+ (__r); \
+})
+
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
+
+static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg((unsigned int *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+
#endif /* _I386_PGTABLE_3LEVEL_H */
Index: linux-2.6.9/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable-2level.h 2004-10-18 14:54:31.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable-2level.h 2004-11-19 10:05:27.000000000 -0800
@@ -82,4 +82,7 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low)
+
#endif /* _I386_PGTABLE_2LEVEL_H */
Index: linux-2.6.9/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgalloc.h 2004-10-18 14:53:10.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgalloc.h 2004-11-19 10:10:40.000000000 -0800
@@ -4,9 +4,12 @@
#include <linux/config.h>
#include <asm/processor.h>
#include <asm/fixmap.h>
+#include <asm/system.h>
#include <linux/threads.h>
#include <linux/mm.h> /* for struct page */

+#define PMD_NONE 0L
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +19,19 @@
((unsigned long long)page_to_pfn(pte) <<
(unsigned long long) PAGE_SHIFT)));
}
+
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+ return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+ ((unsigned long long)page_to_pfn(pte) <<
+ (unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+ return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
/*
* Allocate and free page tables.
*/
@@ -49,6 +65,7 @@
#define pmd_free(x) do { } while (0)
#define __pmd_free_tlb(tlb,x) do { } while (0)
#define pgd_populate(mm, pmd, pte) BUG()
+#define pgd_test_and_populate(mm, pmd, pte) ({ BUG(); 1; })

#define check_pgt_cache() do { } while (0)


2004-11-19 20:02:55

by Linus Torvalds

Subject: Re: page fault scalability patch V11 [0/7]: overview



On Fri, 19 Nov 2004, Christoph Lameter wrote:
>
> Note that I have posted two other approaches of dealing with the rss problem:

You could also make "rss" be a _signed_ integer per-thread.

When unmapping a page, you decrement one of the threads that shares the mm
(doesn't matter which - which is why the per-thread rss may go negative),
and when mapping a page you increment it.

Then, anybody who actually wants a global rss can just iterate over
threads and add it all up. If you do it under the mmap_sem, it's stable,
and if you do it outside the mmap_sem it's imprecise but stable in the
long term (ie errors never _accumulate_, like the non-atomic case will
do).
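The suggestion can be sketched like this (hypothetical structures, not
code from the patch): each thread keeps a signed counter that may go
negative, and the global rss is the sum over all threads.

```c
/* Sketch of per-thread signed rss. */
#define MAX_THREADS 8

struct task_model { long rss; };                  /* signed, per thread */
struct mm_model   { struct task_model thread[MAX_THREADS]; int nthreads; };

static void map_page(struct task_model *t)   { t->rss++; }
static void unmap_page(struct task_model *t) { t->rss--; } /* any thread */

/* Stable if computed under mmap_sem; merely imprecise, with errors that
 * never accumulate, if computed without it. */
static long mm_rss(const struct mm_model *mm)
{
    long sum = 0;

    for (int i = 0; i < mm->nthreads; i++)
        sum += mm->thread[i].rss;
    return sum;
}
```

Because a page may be unmapped by a thread other than the one that
mapped it, an individual thread's counter can go negative; only the sum
is meaningful.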

Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
a lot better than the periodic scan.

Linus

2004-11-19 19:56:08

by Christoph Lameter

Subject: page fault scalability patch V11 [7/7]: s390 atomic pte operations

Changelog
* Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/pgtable.h 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-s390/pgtable.h 2004-11-19 11:35:08.000000000 -0800
@@ -567,6 +567,15 @@
return pte;
}

+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ struct mm_struct *__mm = __vma->vm_mm; \
+ pte_t __pte; \
+ __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+
static inline void ptep_set_wrprotect(pte_t *ptep)
{
pte_t old_pte = *ptep;
@@ -778,6 +787,14 @@

#define kern_addr_valid(addr) (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
/*
* No page table caches to initialise
*/
@@ -791,6 +808,7 @@
#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_CLEAR_FLUSH
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
Index: linux-2.6.9/include/asm-s390/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/pgalloc.h 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/include/asm-s390/pgalloc.h 2004-11-19 11:33:25.000000000 -0800
@@ -97,6 +97,10 @@
pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
}

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+ return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
#endif /* __s390x__ */

static inline void
@@ -119,6 +123,18 @@
pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
}

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+ int rc;
+ spin_lock(&mm->page_table_lock);
+
+ rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
+ if (rc) pmd_populate(mm, pmd, page);
+ spin_unlock(&mm->page_table_lock);
+ return rc;
+}
+
/*
* page table entry allocation/free routines.
*/

2004-11-19 19:56:07

by Christoph Lameter

Subject: page fault scalability patch V11 [6/7]: x86_64 atomic pte operations

Changelog
* Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/pgalloc.h 2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/pgalloc.h 2004-11-19 08:17:55.000000000 -0800
@@ -7,16 +7,26 @@
#include <linux/threads.h>
#include <linux/mm.h>

+#define PMD_NONE 0
+#define PGD_NONE 0
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
#define pgd_populate(mm, pgd, pmd) \
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+ (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE)

static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
}

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+ return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
extern __inline__ pmd_t *get_pmd(void)
{
return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.9/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/pgtable.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/pgtable.h 2004-11-19 08:18:52.000000000 -0800
@@ -437,6 +437,10 @@
#define kc_offset_to_vaddr(o) \
(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR

2004-11-19 19:50:05

by Christoph Lameter

Subject: page fault scalability patch V11 [4/7]: universal cmpxchg for i386

Changelog
* Make cmpxchg and cmpxchg8b generally available on the i386
platform.
* Provide emulation of cmpxchg suitable for uniprocessor systems
if built and run on a 386.
* Provide emulation of cmpxchg8b suitable for uniprocessor systems
if built and run on a 386 or 486.
* Provide an inline function to atomically get a 64 bit value via
cmpxchg8b in an SMP system (courtesy of Nick Piggin)
(important for i386 PAE mode and other places where atomic 64 bit
operations are useful)

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/arch/i386/Kconfig
===================================================================
--- linux-2.6.9.orig/arch/i386/Kconfig 2004-11-15 11:13:34.000000000 -0800
+++ linux-2.6.9/arch/i386/Kconfig 2004-11-19 10:02:54.000000000 -0800
@@ -351,6 +351,11 @@
depends on !M386
default y

+config X86_CMPXCHG8B
+ bool
+ depends on !M386 && !M486
+ default y
+
config X86_XADD
bool
depends on !M386
Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-11-15 11:13:34.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-11-19 10:38:26.000000000 -0800
@@ -6,6 +6,7 @@
#include <linux/bitops.h>
#include <linux/smp.h>
#include <linux/thread_info.h>
+#include <linux/module.h>

#include <asm/processor.h>
#include <asm/msr.h>
@@ -287,5 +288,103 @@
return 0;
}

+#ifndef CONFIG_X86_CMPXCHG
+unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new)
+{
+ u8 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u8));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u8 *)ptr;
+ if (prev == old)
+ *(u8 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u8);
+
+unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new)
+{
+ u16 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u16));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u16 *)ptr;
+ if (prev == old)
+ *(u16 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u16);
+
+unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new)
+{
+ u32 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u32));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u32 *)ptr;
+ if (prev == old)
+ *(u32 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u32);
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ unsigned long flags;
+
+ /*
+ * Check if the kernel was compiled for an old cpu but
+ * we are running really on a cpu capable of cmpxchg8b
+ */
+
+ if (cpu_has(cpu_data, X86_FEATURE_CX8))
+ return __cmpxchg8b(ptr, old, newv);
+
+ /* Poor man's cmpxchg8b for 386 and 486. Not suitable for SMP */
+ local_irq_save(flags);
+ prev = *ptr;
+ if (prev == old)
+ *ptr = newv;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg8b_486);
+#endif
+
// arch_initcall(intel_cpu_init);

Index: linux-2.6.9/include/asm-i386/system.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/system.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-i386/system.h 2004-11-19 10:49:46.000000000 -0800
@@ -149,6 +149,9 @@
#define __xg(x) ((struct __xchg_dummy *)(x))


+#define ll_low(x) *(((unsigned int*)&(x))+0)
+#define ll_high(x) *(((unsigned int*)&(x))+1)
+
/*
* The semantics of XCHGCMP8B are a bit strange, this is why
* there is a loop and the loading of %%eax and %%edx has to
@@ -184,8 +187,6 @@
{
__set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL));
}
-#define ll_low(x) *(((unsigned int*)&(x))+0)
-#define ll_high(x) *(((unsigned int*)&(x))+1)

static inline void __set_64bit_var (unsigned long long *ptr,
unsigned long long value)
@@ -203,6 +204,26 @@
__set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
__set_64bit(ptr, ll_low(value), ll_high(value)) )

+static inline unsigned long long __get_64bit(unsigned long long * ptr)
+{
+ unsigned long long ret;
+ __asm__ __volatile__ (
+ "\n1:\t"
+ "movl (%1), %%eax\n\t"
+ "movl 4(%1), %%edx\n\t"
+ "movl %%eax, %%ebx\n\t"
+ "movl %%edx, %%ecx\n\t"
+ LOCK_PREFIX "cmpxchg8b (%1)\n\t"
+ "jnz 1b"
+ : "=A"(ret)
+ : "D"(ptr)
+ : "ebx", "ecx", "memory");
+ return ret;
+}
+
+#define get_64bit(ptr) __get_64bit(ptr)
+
+
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway
* Note 2: xchg has side effect, so that attribute volatile is necessary,
@@ -240,7 +261,41 @@
*/

#ifdef CONFIG_X86_CMPXCHG
+
#define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable of running on an 80386. It may be necessary to
+ * simulate the cmpxchg on the 80386 CPU. For that purpose we define
+ * a function for each of the sizes we support.
+ */
+
+extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8);
+extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16);
+extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32);
+
+static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+ unsigned long new, int size)
+{
+ switch (size) {
+ case 1:
+ return cmpxchg_386_u8(ptr, old, new);
+ case 2:
+ return cmpxchg_386_u16(ptr, old, new);
+ case 4:
+ return cmpxchg_386_u32(ptr, old, new);
+ }
+ return old;
+}
+
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
#endif

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,10 +325,32 @@
return old;
}

-#define cmpxchg(ptr,o,n)\
- ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
- (unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ __asm__ __volatile__(
+ LOCK_PREFIX "cmpxchg8b (%4)"
+ : "=A" (prev)
+ : "0" (old), "c" ((unsigned long)(newv >> 32)),
+ "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr)
+ : "memory");
+ return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on 80486 and 80386. Both
+ * do not support cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(volatile unsigned long long *,
+ unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
#ifdef __KERNEL__
struct alt_instr {
__u8 *instr; /* original instruction */

2004-11-19 19:50:02

by Christoph Lameter

Subject: page fault scalability patch V11 [1/7]: sloppy rss

Changelog
* Enable sloppy use of mm->rss and mm->anon_rss without locking
* Ensure that negative rss values are not given out by the /proc filesystem
* Remove three checks of rss in mm/rmap.c
* Prerequisite for page table scalability patch

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-11-18 13:04:30.000000000 -0800
@@ -216,7 +216,7 @@
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ spinlock_t page_table_lock; /* Protects page tables */

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -252,6 +252,21 @@
struct kioctx default_kioctx;
};

+/*
+ * rss and anon_rss are incremented and decremented in some locations without
+ * proper locking. This function ensures that these values do not become negative.
+ */
+static inline long get_rss(struct mm_struct *mm)
+{
+ long rss = mm->rss;
+
+ if (rss < 0)
+ mm->rss = rss = 0;
+ if ((long)mm->anon_rss < 0)
+ mm->anon_rss = 0;
+ return rss;
+}
+
struct sighand_struct {
atomic_t count;
struct k_sigaction action[_NSIG];
Index: linux-2.6.9/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/task_mmu.c 2004-11-18 12:56:26.000000000 -0800
@@ -22,7 +22,7 @@
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ get_rss(mm) << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -37,7 +37,9 @@
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ *shared = get_rss(mm) - mm->anon_rss;
+ if (*shared < 0)
+ *shared = 0;
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
Index: linux-2.6.9/fs/proc/array.c
===================================================================
--- linux-2.6.9.orig/fs/proc/array.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/array.c 2004-11-18 12:53:16.000000000 -0800
@@ -420,7 +420,7 @@
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? get_rss(mm) : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-15 11:13:40.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-18 12:26:45.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -504,8 +502,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -788,8 +784,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;

2004-11-19 20:51:51

by Hugh Dickins

Subject: Re: page fault scalability patch V11 [1/7]: sloppy rss

Sorry, against what tree do these patches apply?
Apparently not linux-2.6.9, nor latest -bk, nor -mm?

Hugh

2004-11-20 01:15:52

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

Linus Torvalds wrote:
>
> On Fri, 19 Nov 2004, Christoph Lameter wrote:
>
>>Note that I have posted two other approaches of dealing with the rss problem:
>
>
> You could also make "rss" be a _signed_ integer per-thread.
>
> When unmapping a page, you decrement one of the threads that shares the mm
> (doesn't matter which - which is why the per-thread rss may go negative),
> and when mapping a page you increment it.
>
> Then, anybody who actually wants a global rss can just iterate over
> threads and add it all up. If you do it under the mmap_sem, it's stable,
> and if you do it outside the mmap_sem it's imprecise but stable in the
> long term (ie errors never _accumulate_, like the non-atomic case will
> do).
>
> Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
> a lot better than the periodic scan.
>

I think this sounds like it might be a good idea. I prefer it to having
the unbounded error of sloppy rss (as improbable as it may be in practice).

The per thread rss may wrap (maybe not 64-bit counters), but even so,
the summation over all threads should still end up being correct I
think.

2004-11-20 01:32:58

by Christoph Lameter

Subject: Re: page fault scalability patch V11 [1/7]: sloppy rss

2.6.10-rc2-bk3

On Fri, 19 Nov 2004, Hugh Dickins wrote:

> Sorry, against what tree do these patches apply?
> Apparently not linux-2.6.9, nor latest -bk, nor -mm?

2004-11-20 01:34:57

by Christoph Lameter

Subject: Re: page fault scalability patch V11 [0/7]: overview

On Sat, 20 Nov 2004, Nick Piggin wrote:

> I think this sounds like it might be a good idea. I prefer it to having
> the unbounded error of sloppy rss (as improbable as it may be in practice).

It may also be faster since the processors can have exclusive cache lines.

This means we need to move rss into the task struct. But how does one get
from mm struct to task struct? current is likely available most of
the time. Is that always the case?

> The per thread rss may wrap (maybe not 64-bit counters), but even so,
> the summation over all threads should still end up being correct I
> think.

Note though that the mmap_sem is no protection. It is a read lock and may
be held by multiple processes while incrementing and decrementing rss.
This is likely reducing the number of collisions significantly but it wont
be a guarantee like locking or atomic ops.

2004-11-20 01:53:33

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

Christoph Lameter wrote:
> On Sat, 20 Nov 2004, Nick Piggin wrote:
>
>
>>I think this sounds like it might be a good idea. I prefer it to having
>>the unbounded error of sloppy rss (as improbable as it may be in practice).
>
>
> It may also be faster since the processors can have exclusive cache lines.
>

Yep.

> This means we need to move rss into the task struct. But how does one get
> from mm struct to task struct? current is likely available most of
> the time. Is that always the case?
>

It is available everywhere that mm_struct is, I guess. So yes, I
think `current` should be OK.

>
>>The per thread rss may wrap (maybe not 64-bit counters), but even so,
>>the summation over all threads should still end up being correct I
>>think.
>
>
> Note though that the mmap_sem is no protection. It is a read lock and may
> be held by multiple processes while incrementing and decrementing rss.
> This likely reduces the number of collisions significantly, but it won't
> be a guarantee like locking or atomic ops.
>

Yeah the read lock won't do anything to serialise it. I think what Linus
is saying is that we _don't care_ most of the time (because the error will
be bounded). But if it happened that we really do care anywhere, then the
write lock should be sufficient.

2004-11-20 01:58:37

by Linus Torvalds

Subject: Re: page fault scalability patch V11 [0/7]: overview



On Sat, 20 Nov 2004, Nick Piggin wrote:
>
> The per thread rss may wrap (maybe not 64-bit counters), but even so,
> the summation over all threads should still end up being correct I
> think.

Yes. As long as the total rss fits in an int, it doesn't matter if any of
them wrap. Addition is still associative in twos-complement arithmetic
even in the presence of overflows.

If you actually want to make it proper standard C, I guess you'd have to
make the thing unsigned, which gives you the mod-2**n guarantees even if
somebody were to ever make a non-twos-complement machine.

Linus

2004-11-20 01:59:00

by Linus Torvalds

Subject: Re: page fault scalability patch V11 [0/7]: overview



On Fri, 19 Nov 2004, Christoph Lameter wrote:
>
> Note though that the mmap_sem is no protection. It is a read lock and may
> be held by multiple processes while incrementing and decrementing rss.
> This likely reduces the number of collisions significantly, but it won't
> be a guarantee like locking or atomic ops.

It is, though, if you hold it for a write.

The point being that you _can_ get an exact rss value if you want to.

Not that I really see any overwhelming evidence of anybody ever really
caring, but it's nice to know that you have the option.

Linus

2004-11-20 02:14:31

by Linus Torvalds

Subject: Re: page fault scalability patch V11 [0/7]: overview



On Fri, 19 Nov 2004, Linus Torvalds wrote:
>
> Not that I really see any overwhelming evidence of anybody ever really
> caring, but it's nice to know that you have the option.

Btw, if you are going to look at doing this rss thing, you need to make
sure that thread exit ends up adding its rss to _some_ remaining sibling.

I guess that was obvious, but it's worth pointing out. That may actually
be the only case where we do _not_ have a nice SMP-safe access: we do have
a stable sibling (tsk->thread_leader), but we don't have any good
serialization _except_ for taking mmap_sem for writing. Which we currently
don't do: we take it for reading (and then we possibly upgrade it to a
write lock if we notice that there is a core-dump starting).

We can avoid this too by having a per-mm atomic rss "spill" counter. So
exit_mm() would basically do:

...
tsk->mm = NULL;
atomic_add(tsk->rss, &mm->rss_spill);
...

and then the algorithm for getting rss would be:

rss = atomic_read(mm->rss_spill);
for_each_thread(..)
rss += tsk->rss;

Or does anybody see any better approaches?

Linus

2004-11-20 02:17:24

by Robin Holt

Subject: Re: page fault scalability patch V11 [0/7]: overview

On Fri, Nov 19, 2004 at 11:42:39AM -0800, Christoph Lameter wrote:
> Note that I have posted two other approaches of dealing with the rss problem:
>
> A. make_rss_atomic. The earlier releases contained that patch but then another
> variable (such as anon_rss) was introduced that would have required additional
> atomic operations. Atomic rss operations are also causing slowdowns on
> machines with a high number of cpus due to memory contention.
>
> B. remove_rss. Replace rss with a periodic scan over the vm to determine
> rss and additional numbers. This was also discussed on linux-mm and linux-ia64.
> The scans while displaying /proc data were undesirable.

Can you run a comparison benchmark between atomic rss and anon_rss and
the sloppy rss, with rss and anon_rss in separate cachelines? I am not
sure that it is important to separate the two into separate cachelines, just
rss and anon_rss from the lock and sema.

If I have the time over the weekend, I might try this myself. If not, can
you give it a try?

Thanks,
Robin

2004-11-20 02:24:38

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> On Fri, Nov 19, 2004 at 11:42:39AM -0800, Christoph Lameter wrote:
>
>>A. make_rss_atomic. The earlier releases contained that patch but
>>then another variable (such as anon_rss) was introduced that would
>> have required additional atomic operations. Atomic rss operations
>> are also causing slowdowns on machines with a high number of cpus
>> due to memory contention.
>>B. remove_rss. Replace rss with a periodic scan over the vm to
>> determine rss and additional numbers. This was also discussed on
>> linux-mm and linux-ia64. The scans while displaying /proc data
>> were undesirable.
>
>
> Split counters easily resolve the issues with both these approaches
> (and apparently your co-workers are suggesting it too, and have
> performance results backing it).
>

Split counters still require atomic operations though. This is what
Christoph's latest effort is directed at removing. And they'll still
bounce cachelines around. (I assume we've reached the conclusion
that per-cpu split counters per-mm won't fly?).

2004-11-20 02:28:44

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> On Fri, Nov 19, 2004 at 11:59:03AM -0800, Linus Torvalds wrote:
>
>>You could also make "rss" be a _signed_ integer per-thread.
>>When unmapping a page, you decrement one of the threads that shares the mm
>>(doesn't matter which - which is why the per-thread rss may go negative),
>>and when mapping a page you increment it.
>>Then, anybody who actually wants a global rss can just iterate over
>>threads and add it all up. If you do it under the mmap_sem, it's stable,
>>and if you do it outside the mmap_sem it's imprecise but stable in the
>>long term (ie errors never _accumulate_, like the non-atomic case will
>>do).
>>Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
>>a lot better than the periodic scan.
>
>
> Unprivileged triggers for full-tasklist scans are NMI oops material.
>

What about pushing the per-thread rss delta back into the global atomic
rss counter in each schedule()?

Pros:
This would take the task exiting problem into its stride as a matter of
course.

Single atomic read to get rss.

Cons:
would just be moving the atomic op somewhere else if we don't get
many page faults per schedule.

Not really nice dependencies.

Assumes schedule (not context switch) must occur somewhat regularly.
At present this is not true for SCHED_FIFO tasks.


Too nasty?

2004-11-20 02:11:17

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

On Fri, Nov 19, 2004 at 11:42:39AM -0800, Christoph Lameter wrote:
> A. make_rss_atomic. The earlier releases contained that patch but
> then another variable (such as anon_rss) was introduced that would
> have required additional atomic operations. Atomic rss operations
> are also causing slowdowns on machines with a high number of cpus
> due to memory contention.
> B. remove_rss. Replace rss with a periodic scan over the vm to
> determine rss and additional numbers. This was also discussed on
> linux-mm and linux-ia64. The scans while displaying /proc data
> were undesirable.

Split counters easily resolve the issues with both these approaches
(and apparently your co-workers are suggesting it too, and have
performance results backing it).


-- wli

2004-11-20 02:11:16

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

On Fri, Nov 19, 2004 at 11:59:03AM -0800, Linus Torvalds wrote:
> You could also make "rss" be a _signed_ integer per-thread.
> When unmapping a page, you decrement one of the threads that shares the mm
> (doesn't matter which - which is why the per-thread rss may go negative),
> and when mapping a page you increment it.
> Then, anybody who actually wants a global rss can just iterate over
> threads and add it all up. If you do it under the mmap_sem, it's stable,
> and if you do it outside the mmap_sem it's imprecise but stable in the
> long term (ie errors never _accumulate_, like the non-atomic case will
> do).
> Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
> a lot better than the periodic scan.

Unprivileged triggers for full-tasklist scans are NMI oops material.


-- wli

2004-11-20 02:51:20

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>Unprivileged triggers for full-tasklist scans are NMI oops material.
>
>
> On Sat, Nov 20, 2004 at 01:25:37PM +1100, Nick Piggin wrote:
>
>>What about pushing the per-thread rss delta back into the global atomic
>>rss counter in each schedule()?
>>Pros:
>>This would take the task exiting problem into its stride as a matter of
>>course.
>>Single atomic read to get rss.
>>Cons:
>>would just be moving the atomic op somewhere else if we don't get
>>many page faults per schedule.
>>Not really nice dependencies.
>>Assumes schedule (not context switch) must occur somewhat regularly.
>>At present this is not true for SCHED_FIFO tasks.
>>Too nasty?
>
>
> This doesn't sound too hot. There's enough accounting that can't be
> done anywhere but schedule(), and this can be done elsewhere. Plus,
> you're moving an already too-frequent operation to a more frequent
> callsite.
>

No, it won't somehow increase the number of atomic rss operations
just because schedule is called more often. The number of ops will
be at _most_ the number of page faults.

But I agree with your overall evaluation of its 'hotness'. Just
another idea. Give this monkey another thousand years at the keys
and he'll come up with the perfect solution :P

2004-11-20 03:18:10

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> Furthermore, see Robin Holt's results regarding the performance of the
>> atomic operations and their relation to cacheline sharing.

On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote:
> Well yeah, but a. their patch isn't in 2.6 (or 2.4), and b. anon_rss

Irrelevant. Unshare cachelines with hot mm-global ones, and the
"problem" goes away.

This stuff is going on and on about some purist "no atomic operations
anywhere" weirdness even though killing the last atomic operation
creates problems and doesn't improve performance.


On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote:
> means another atomic op. While this doesn't immediately make it a
> showstopper, it is gradually slowing down the single threaded page
> fault path too, which is bad.

William Lee Irwin III wrote:
>> And frankly, the argument that the space overhead of per-cpu counters
>> is problematic is not compelling. Even at 1024 cpus it's smaller than
>> an ia64 pagetable page, of which there are numerous instances attached
>> to each mm.

On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote:
> 1024 CPUs * 64 byte cachelines == 64K, no? Well I'm sure they probably
> don't even care about 64K on their large machines, but...
> On i386 this would be maybe 32 * 128 byte == 4K per task for distro
> kernels. Not so good.

Why the Hell would you bother giving each cpu a separate cacheline?
The odds of bouncing significantly merely amongst the counters are not
particularly high.


-- wli

2004-11-20 03:22:47

by Nick Piggin

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>Furthermore, see Robin Holt's results regarding the performance of the
>>>atomic operations and their relation to cacheline sharing.
>
>
> On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote:
>
>>Well yeah, but a. their patch isn't in 2.6 (or 2.4), and b. anon_rss
>
>
> Irrelevant. Unshare cachelines with hot mm-global ones, and the
> "problem" goes away.
>

That's the idea.

> This stuff is going on and on about some purist "no atomic operations
> anywhere" weirdness even though killing the last atomic operation
> creates problems and doesn't improve performance.
>

Huh? How is not wanting to impact single threaded performance being
"purist weirdness"? Practical, I'd call it.

>
> On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote:
>
>>means another atomic op. While this doesn't immediately make it a
>>showstopper, it is gradually slowing down the single threaded page
>>fault path too, which is bad.
>
>
> William Lee Irwin III wrote:
>
>>>And frankly, the argument that the space overhead of per-cpu counters
>>>is problematic is not compelling. Even at 1024 cpus it's smaller than
>>>an ia64 pagetable page, of which there are numerous instances attached
>>>to each mm.
>
>
> On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote:
>
>>1024 CPUs * 64 byte cachelines == 64K, no? Well I'm sure they probably
>>don't even care about 64K on their large machines, but...
>>On i386 this would be maybe 32 * 128 byte == 4K per task for distro
>>kernels. Not so good.
>
>
> Why the Hell would you bother giving each cpu a separate cacheline?
> The odds of bouncing significantly merely amongst the counters are not
> particularly high.
>

Hmm, yeah, I guess I wouldn't put them all on different cachelines.
As you can see though, Christoph ran into a wall at 8 CPUs, so
having them densely packed still might not be enough.

2004-11-20 03:36:27

by Robin Holt

Subject: Re: page fault scalability patch V11 [0/7]: overview

On Fri, Nov 19, 2004 at 07:04:25PM -0800, William Lee Irwin III wrote:
> Why the Hell would you bother giving each cpu a separate cacheline?
> The odds of bouncing significantly merely amongst the counters are not
> particularly high.

Agree, we are currently using atomic ops on a global rss in our 2.4
kernel on 512-CPU systems and not seeing much cacheline contention.
I don't remember how little it ended up being, but it was very little.
We had gone to dropping the page_table_lock and only reacquiring it if
the pte was non-null when we went to insert our new one. I think that
was how we had it working. I would have to wake up and actually look
at that code as it was many months ago that Ray Bryant did that work.
We did make rss atomic. Most of the contention is sorted out by the
mmap_sem. Processes acquiring themselves off of mmap_sem were found
to have spaced themselves out enough that they were all approximately
equal time from doing their atomic_add and therefore had very little
contention for the cacheline. At least it was not enough that we could
measure it as significant.

2004-11-20 03:41:43

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> On Fri, Nov 19, 2004 at 11:59:03AM -0800, Linus Torvalds wrote:
>
>>You could also make "rss" be a _signed_ integer per-thread.
>>When unmapping a page, you decrement one of the threads that shares the mm
>>(doesn't matter which - which is why the per-thread rss may go negative),
>>and when mapping a page you increment it.
>>Then, anybody who actually wants a global rss can just iterate over
>>threads and add it all up. If you do it under the mmap_sem, it's stable,
>>and if you do it outside the mmap_sem it's imprecise but stable in the
>>long term (ie errors never _accumulate_, like the non-atomic case will
>>do).
>>Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
>>a lot better than the periodic scan.
>
>
> Unprivileged triggers for full-tasklist scans are NMI oops material.
>

Hang on, let's come back to this...

We already have unprivileged do-for-each-thread triggers in the proc
code. It's in do_task_stat, even. Rss reporting would basically just
involve one extra addition within that loop.

So... hmm, I can't see a problem with it.

2004-11-20 03:48:43

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> Irrelevant. Unshare cachelines with hot mm-global ones, and the
>> "problem" goes away.

On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote:
> That's the idea.


William Lee Irwin III wrote:
>> This stuff is going on and on about some purist "no atomic operations
>> anywhere" weirdness even though killing the last atomic operation
>> creates problems and doesn't improve performance.

On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote:
> Huh? How is not wanting to impact single threaded performance being
> "purist weirdness"? Practical, I'd call it.

Empirically demonstrate the impact on single-threaded performance.


On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote:
>> Why the Hell would you bother giving each cpu a separate cacheline?
>> The odds of bouncing significantly merely amongst the counters are not
>> particularly high.

On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote:
> Hmm, yeah, I guess I wouldn't put them all on different cachelines.
> As you can see, though, Christoph ran into a wall at 8 CPUs, so
> having them densely packed still might not be enough.

Please be more specific about the result, and cite the Message-Id.


-- wli

2004-11-20 03:58:04

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> Split counters easily resolve the issues with both these approaches
>> (and apparently your co-workers are suggesting it too, and have
>> performance results backing it).

On Sat, Nov 20, 2004 at 01:18:22PM +1100, Nick Piggin wrote:
> Split counters still require atomic operations though. This is what
> Christoph's latest effort is directed at removing. And they'll still
> bounce cachelines around. (I assume we've reached the conclusion
> that per-cpu split counters per-mm won't fly?).

Split != per-cpu, though it may be. Counterexamples are
as simple as atomic_inc(&mm->rss[smp_processor_id()>>RSS_IDX_SHIFT]);
Furthermore, see Robin Holt's results regarding the performance of the
atomic operations and their relation to cacheline sharing.

And frankly, the argument that the space overhead of per-cpu counters
is problematic is not compelling. Even at 1024 cpus it's smaller than
an ia64 pagetable page, of which there are numerous instances attached
to each mm.


-- wli

2004-11-20 03:59:57

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> Unprivileged triggers for full-tasklist scans are NMI oops material.

On Sat, Nov 20, 2004 at 02:37:04PM +1100, Nick Piggin wrote:
> Hang on, let's come back to this...
> We already have unprivileged do-for-each-thread triggers in the proc
> code. It's in do_task_stat, even. Rss reporting would basically just
> involve one extra addition within that loop.
> So... hmm, I can't see a problem with it.

/proc/ triggering NMI oopses was a persistent problem even before that
code was merged. I've not bothered testing it as it at best aggravates it.

And thread groups can share mm's. do_for_each_thread() won't suffice.


-- wli

2004-11-20 04:02:39

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> Please be more specific about the result, and cite the Message-Id.

On Sat, Nov 20, 2004 at 02:58:36PM +1100, Nick Piggin wrote:
> Start of this thread.

Those do not have testing results of different RSS counter
implementations in isolation.


-- wli

2004-11-20 04:07:04

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

Nick Piggin wrote:
> William Lee Irwin III wrote:

>> And thread groups can share mm's. do_for_each_thread() won't suffice.
>>
>
> I think it will be just fine.
>

Sorry, I misread. I think having per-thread rss counters will be
fine (regardless of whether or not do_for_each_thread itself will
suffice).

2004-11-20 04:09:56

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>Unprivileged triggers for full-tasklist scans are NMI oops material.
>
>
> On Sat, Nov 20, 2004 at 02:37:04PM +1100, Nick Piggin wrote:
>
>>Hang on, let's come back to this...
>>We already have unprivileged do-for-each-thread triggers in the proc
>>code. It's in do_task_stat, even. Rss reporting would basically just
>>involve one extra addition within that loop.
>>So... hmm, I can't see a problem with it.
>
>
> /proc/ triggering NMI oopses was a persistent problem even before that
> code was merged. I've not bothered testing it as it at best aggravates it.
>

It isn't a problem. If it ever became a problem then we can just
touch the nmi oopser in the loop.

> And thread groups can share mm's. do_for_each_thread() won't suffice.
>

I think it will be just fine.

2004-11-20 04:04:59

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>Irrelevant. Unshare cachelines with hot mm-global ones, and the
>>>"problem" goes away.
>
>
> On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote:
>
>>That's the idea.
>
>
>
> William Lee Irwin III wrote:
>
>>>This stuff is going on and on about some purist "no atomic operations
>>>anywhere" weirdness even though killing the last atomic operation
>>>creates problems and doesn't improve performance.
>
>
> On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote:
>
>>Huh? How is not wanting to impact single threaded performance being
>>"purist weirdness"? Practical, I'd call it.
>
>
> Empirically demonstrate the impact on single-threaded performance.
>

I can tell you it's worse. I don't have to demonstrate anything; more
atomic RMW ops in the page fault path are going to have an impact.

I'm not saying we must not compromise *anywhere*, but it would
just be nice to try to avoid making the path heavier, that's all.
I'm not being purist when I say I'd first rather explore all other
options before adding atomics.

But nevermind arguing, it appears Linus' suggested method will
be fine and *does* mean we don't have to compromise.

>
> On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote:
>
>>>Why the Hell would you bother giving each cpu a separate cacheline?
>>>The odds of bouncing significantly merely amongst the counters are not
>>>particularly high.
>
>
> On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote:
>
>>Hmm yeah I guess wouldn't put them all on different cachelines.
>>As you can see though, Christoph ran into a wall at 8 CPUs, so
>>having them densly packed still might not be enough.
>
>
> Please be more specific about the result, and cite the Message-Id.
>

Start of this thread.

2004-11-20 04:27:31

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

On Fri, Nov 19, 2004 at 09:33:12PM -0600, Robin Holt wrote:
> Agree, we are currently using atomic ops on a global rss on our 2.4
> kernel with 512-cpu systems and not seeing much cacheline contention.
> I don't remember how little it ended up being, but it was very little.
> We had gone to dropping the page_table_lock and only reacquiring it if
> the pte was non-null when we went to insert our new one. I think that
> was how we had it working. I would have to wake up and actually look
> at that code as it was many months ago that Ray Bryant did that work.
> We did make rss atomic. Most of the contention is sorted out by the
> mmap_sem. Processes acquiring themselves off of mmap_sem were found
> to have spaced themselves out enough that they were all approximately
> equal time from doing their atomic_add and therefore had very little
> contention for the cacheline. At least it was not enough that we could
> measure it as significant.

Also, the densely-packed split counter can only get 4-16 cpus to a
cacheline with cachelines <= 128B, so there are definite limitations to
the amount of cacheline contention in such schemes.


-- wli

2004-11-20 04:30:25

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> /proc/ triggering NMI oopses was a persistent problem even before that
>> code was merged. I've not bothered testing it as it at best aggravates it.

On Sat, Nov 20, 2004 at 03:03:17PM +1100, Nick Piggin wrote:
> It isn't a problem. If it ever became a problem then we can just
> touch the nmi oopser in the loop.

Very, very wrong. The tasklist scans hold the read side of the lock
and aren't even what's running with interrupts off. The contenders
on the write side are what the NMI oopser oopses.

And supposing the arch reenables interrupts in the write side's
spinloop, you just get a box that silently goes out of service for
extended periods of time, breaking cluster membership and more. The
NMI oopser is just the report of the problem, not the problem itself.
It's not a false report. The box is dead for > 5s at a time.


William Lee Irwin III wrote:
>> And thread groups can share mm's. do_for_each_thread() won't suffice.

On Sat, Nov 20, 2004 at 03:03:17PM +1100, Nick Piggin wrote:
> I think it will be just fine.

And that makes it wrong on both counts. The above fails any time
LD_ASSUME_KERNEL=2.4 is used, as well as when actual Linux features
are used directly.


-- wli

2004-11-20 04:37:06

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>/proc/ triggering NMI oopses was a persistent problem even before that
>>>code was merged. I've not bothered testing it as it at best aggravates it.
>
>
> On Sat, Nov 20, 2004 at 03:03:17PM +1100, Nick Piggin wrote:
>
>>It isn't a problem. If it ever became a problem then we can just
>>touch the nmi oopser in the loop.
>
>
> Very, very wrong. The tasklist scans hold the read side of the lock
> and aren't even what's running with interrupts off. The contenders
> on the write side are what the NMI oopser oopses.
>

*blinks*

So explain how this is "very very wrong", then?

> And supposing the arch reenables interrupts in the write side's
> spinloop, you just get a box that silently goes out of service for
> extended periods of time, breaking cluster membership and more. The
> NMI oopser is just the report of the problem, not the problem itself.
> It's not a false report. The box is dead for > 5s at a time.
>

The point is, adding a for-each-thread loop or two in /proc isn't
going to cause a problem that isn't already there.

If you had zero for-each-thread loops then you might have a valid
complaint. Seeing as you have more than zero, with slim chances of
reducing that number, then there is no valid complaint.

>
> William Lee Irwin III wrote:
>
>>>And thread groups can share mm's. do_for_each_thread() won't suffice.
>
>
> On Sat, Nov 20, 2004 at 03:03:17PM +1100, Nick Piggin wrote:
>
>>I think it will be just fine.
>
>
> And that makes it wrong on both counts. The above fails any time
> LD_ASSUME_KERNEL=2.4 is used, as well as when actual Linux features
> are used directly.
>

See my followup.

2004-11-20 04:41:58

by Robin Holt

Subject: Re: page fault scalability patch V11 [0/7]: overview

On Sat, Nov 20, 2004 at 02:58:36PM +1100, Nick Piggin wrote:
> >Please be more specific about the result, and cite the Message-Id.
> >
>
> Start of this thread.

Part of the impact was having the page table lock, the mmap_sem, and
these two atomic counters in the same cacheline. What about separating
the counters from the locks?

2004-11-20 04:40:09

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>Split counters easily resolve the issues with both these approaches
>>>(and apparently your co-workers are suggesting it too, and have
>>>performance results backing it).
>
>
> On Sat, Nov 20, 2004 at 01:18:22PM +1100, Nick Piggin wrote:
>
>>Split counters still require atomic operations though. This is what
>>Christoph's latest effort is directed at removing. And they'll still
>>bounce cachelines around. (I assume we've reached the conclusion
>>that per-cpu split counters per-mm won't fly?).
>
>
> Split != per-cpu, though it may be. Counterexamples are
> as simple as atomic_inc(&mm->rss[smp_processor_id()>>RSS_IDX_SHIFT]);

Oh yes, I just meant that the only way split counters will relieve
the atomic ops and bouncing is by having them per-cpu. But you knew
that :)

> Furthermore, see Robin Holt's results regarding the performance of the
> atomic operations and their relation to cacheline sharing.
>

Well yeah, but a. their patch isn't in 2.6 (or 2.4), and b. anon_rss
means another atomic op. While this doesn't immediately make it a
showstopper, it is gradually slowing down the single threaded page
fault path too, which is bad.

> And frankly, the argument that the space overhead of per-cpu counters
> is problematic is not compelling. Even at 1024 cpus it's smaller than
> an ia64 pagetable page, of which there are numerous instances attached
> to each mm.
>

1024 CPUs * 64 byte cachelines == 64K, no? Well I'm sure they probably
don't even care about 64K on their large machines, but...

On i386 this would be maybe 32 * 128 byte == 4K per task for distro
kernels. Not so good.

2004-11-20 04:40:08

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> Unprivileged triggers for full-tasklist scans are NMI oops material.

On Sat, Nov 20, 2004 at 01:25:37PM +1100, Nick Piggin wrote:
> What about pushing the per-thread rss delta back into the global atomic
> rss counter in each schedule()?
> Pros:
> This would take the task exiting problem into its stride as a matter of
> course.
> Single atomic read to get rss.
> Cons:
> would just be moving the atomic op somewhere else if we don't get
> many page faults per schedule.
> Not really nice dependencies.
> Assumes schedule (not context switch) must occur somewhat regularly.
> At present this is not true for SCHED_FIFO tasks.
> Too nasty?

This doesn't sound too hot. There's enough accounting that can't be
done anywhere but schedule(), and this can be done elsewhere. Plus,
you're moving an already too-frequent operation to a more frequent
callsite.


-- wli

2004-11-20 05:42:20

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> Very, very wrong. The tasklist scans hold the read side of the lock
>> and aren't even what's running with interrupts off. The contenders
>> on the write side are what the NMI oopser oopses.

On Sat, Nov 20, 2004 at 03:29:29PM +1100, Nick Piggin wrote:
> *blinks*
> So explain how this is "very very wrong", then?

There isn't anything left to explain. So if there's a question, be
specific about it.


William Lee Irwin III wrote:
>> And supposing the arch reenables interrupts in the write side's
>> spinloop, you just get a box that silently goes out of service for
>> extended periods of time, breaking cluster membership and more. The
>> NMI oopser is just the report of the problem, not the problem itself.
>> It's not a false report. The box is dead for > 5s at a time.

On Sat, Nov 20, 2004 at 03:29:29PM +1100, Nick Piggin wrote:
> The point is, adding a for-each-thread loop or two in /proc isn't
> going to cause a problem that isn't already there.
> If you had zero for-each-thread loops then you might have a valid
> complaint. Seeing as you have more than zero, with slim chances of
> reducing that number, then there is no valid complaint.

This entire line of argument is bogus. A preexisting bug of a similar
nature is not grounds for deliberately introducing any bug.


-- wli

2004-11-20 05:51:18

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>Very, very wrong. The tasklist scans hold the read side of the lock
>>>and aren't even what's running with interrupts off. The contenders
>>>on the write side are what the NMI oopser oopses.
>
>
> On Sat, Nov 20, 2004 at 03:29:29PM +1100, Nick Piggin wrote:
>
>>*blinks*
>>So explain how this is "very very wrong", then?
>
>
> There isn't anything left to explain. So if there's a question, be
> specific about it.
>

Why am I very very wrong? Why won't touch_nmi_watchdog work from
the read loop?

And let's just be nice and try not to jump at the chance to point
out when people are very very wrong, and keep count of the times
they have been very very wrong. I'm trying to be constructive.

>
> William Lee Irwin III wrote:
>
>>>And supposing the arch reenables interrupts in the write side's
>>>spinloop, you just get a box that silently goes out of service for
>>>extended periods of time, breaking cluster membership and more. The
>>>NMI oopser is just the report of the problem, not the problem itself.
>>>It's not a false report. The box is dead for > 5s at a time.
>
>
> On Sat, Nov 20, 2004 at 03:29:29PM +1100, Nick Piggin wrote:
>
>>The point is, adding a for-each-thread loop or two in /proc isn't
>>going to cause a problem that isn't already there.
>>If you had zero for-each-thread loops then you might have a valid
>>complaint. Seeing as you have more than zero, with slim chances of
>>reducing that number, then there is no valid complaint.
>
>
> This entire line of argument is bogus. A preexisting bug of a similar
> nature is not grounds for deliberately introducing any bug.
>

Sure, if that is a bug and someone is just about to fix it then
yes you're right, we shouldn't introduce this. I didn't realise
it was a bug. Sounds like it would be causing you lots of problems
though - have you looked at how to fix it?

2004-11-20 06:23:57

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> There isn't anything left to explain. So if there's a question, be
>> specific about it.

On Sat, Nov 20, 2004 at 04:50:25PM +1100, Nick Piggin wrote:
> Why am I very very wrong? Why won't touch_nmi_watchdog work from
> the read loop?
> And let's just be nice and try not to jump at the chance to point
> out when people are very very wrong, and keep count of the times
> they have been very very wrong. I'm trying to be constructive.

touch_nmi_watchdog() is only "protection" against local interrupt
disablement triggering the NMI oopser because alert_counter[]
increments are not atomic. Yet even supposing they were made so, the
net effect of "covering up" this gross deficiency is making the
user-observable problems it causes undiagnosable, as noted before.


William Lee Irwin III wrote:
>> This entire line of argument is bogus. A preexisting bug of a similar
>> nature is not grounds for deliberately introducing any bug.

On Sat, Nov 20, 2004 at 04:50:25PM +1100, Nick Piggin wrote:
> Sure, if that is a bug and someone is just about to fix it then
> yes you're right, we shouldn't introduce this. I didn't realise
> it was a bug. Sounds like it would be causing you lots of problems
> though - have you looked at how to fix it?

Kevin Marin was the first to report this issue to lkml. I had seen
instances of it in internal corporate bugreports and it was one of
the motivators for the work I did on pidhashing (one of the causes
of the timeouts was worst cases in pid allocation). Manfred Spraul
and myself wrote patches attempting to reduce read-side hold time
in /proc/ algorithms, Ingo Molnar wrote patches to hierarchically
subdivide the /proc/ iterations, and Dipankar Sarma and Maneesh
Soni wrote patches to carry out the long iterations in /proc/ locklessly.

The last several of these affecting /proc/ have not gained acceptance,
though the work has not been halted in any sense, as this problem
recurs quite regularly. A considerable amount of sustained effort has
gone toward mitigating and resolving rwlock starvation.

Aggravating the rwlock starvation destabilizes, not pessimizes,
and performance is secondary to stability.


-- wli

2004-11-20 06:50:46

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>There isn't anything left to explain. So if there's a question, be
>>>specific about it.
>
>
> On Sat, Nov 20, 2004 at 04:50:25PM +1100, Nick Piggin wrote:
>
>>Why am I very very wrong? Why won't touch_nmi_watchdog work from
>>the read loop?
>>And let's just be nice and try not to jump at the chance to point
>>out when people are very very wrong, and keep count of the times
>>they have been very very wrong. I'm trying to be constructive.
>
>
> touch_nmi_watchdog() is only "protection" against local interrupt
> disablement triggering the NMI oopser because alert_counter[]
> increments are not atomic. Yet even supposing they were made so, the

That would be a bug in touch_nmi_watchdog then, because you're
racy against your own NMI too.

So I'm actually not very very wrong at all. I'm technically wrong
because touch_nmi_watchdog has a theoretical 'bug'. In practice,
multiple races with the non atomic increments to the same counter,
and in an unbroken sequence would be about as likely as hardware
failure.

Anyway, this touch nmi thing is going off topic, sorry list.

> net effect of "covering up" this gross deficiency is making the
> user-observable problems it causes undiagnosable, as noted before.
>

Well the loops that are in there now aren't covered up, and they
don't seem to be causing problems. Ergo there is no problem (we're
being _practical_ here, right?)

>
> William Lee Irwin III wrote:
>
>>>This entire line of argument is bogus. A preexisting bug of a similar
>>>nature is not grounds for deliberately introducing any bug.
>
>
> On Sat, Nov 20, 2004 at 04:50:25PM +1100, Nick Piggin wrote:
>
>>Sure, if that is a bug and someone is just about to fix it then
>>yes you're right, we shouldn't introduce this. I didn't realise
>>it was a bug. Sounds like it would be causing you lots of problems
>>though - have you looked at how to fix it?
>
>
> Kevin Marin was the first to report this issue to lkml. I had seen
> instances of it in internal corporate bugreports and it was one of
> the motivators for the work I did on pidhashing (one of the causes
> of the timeouts was worst cases in pid allocation). Manfred Spraul
> and myself wrote patches attempting to reduce read-side hold time
> in /proc/ algorithms, Ingo Molnar wrote patches to hierarchically
> subdivide the /proc/ iterations, and Dipankar Sarma and Maneesh
> Soni wrote patches to carry out the long iterations in /proc/ locklessly.
>
> The last several of these affecting /proc/ have not gained acceptance,
> though the work has not been halted in any sense, as this problem
> recurs quite regularly. A considerable amount of sustained effort has
> gone toward mitigating and resolving rwlock starvation.
>

That's very nice. But there is no problem _now_, is there?

> Aggravating the rwlock starvation destabilizes, not pessimizes,
> and performance is secondary to stability.
>

Well, luckily we're not going to be aggravating the rwlock starvation.

If you found a problem with, and fixed do_task_stat: ?time, ???_flt,
et al, then you would apply the same solution to per thread rss to
fix it in the same way.

2004-11-20 06:58:23

by Andrew Morton

Subject: Re: page fault scalability patch V11 [0/7]: overview

Nick Piggin <[email protected]> wrote:
>
> per thread rss

Given that we have contention problems updating a single mm-wide rss and
given that the way to fix that up is to spread things out a bit, it seems
wildly arbitrary to me that the way in which we choose to spread the
counter out is to stick a bit of it into each task_struct.

I'd expect that just shoving a pointer into mm_struct which points at a
dynamically allocated array[NR_CPUS] of longs would suffice. We probably
don't even need to spread them out on cachelines - having four or eight
cpus sharing the same cacheline probably isn't going to hurt much.

At least, that'd be my first attempt. If it's still not good enough, try
something else.

2004-11-20 07:05:59

by Andrew Morton

Subject: Re: page fault scalability patch V11 [0/7]: overview

Andrew Morton <[email protected]> wrote:
>
> I'd expect that just shoving a pointer into mm_struct which points at a
> dynamically allocated array[NR_CPUS] of longs would suffice.

One might even be able to use percpu_counter.h, although that might end up
hurting many-cpu fork times, due to all that work in __alloc_percpu().

2004-11-20 07:13:28

by Nick Piggin

Subject: Re: page fault scalability patch V11 [0/7]: overview

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>per thread rss
>
>
> Given that we have contention problems updating a single mm-wide rss and
> given that the way to fix that up is to spread things out a bit, it seems
> wildly arbitrary to me that the way in which we choose to spread the
> counter out is to stick a bit of it into each task_struct.
>
> I'd expect that just shoving a pointer into mm_struct which points at a
> dynamically allocated array[NR_CPUS] of longs would suffice. We probably
> don't even need to spread them out on cachelines - having four or eight
> cpus sharing the same cacheline probably isn't going to hurt much.
>
> At least, that'd be my first attempt. If it's still not good enough, try
> something else.
>
>

That is what Bill thought too. I guess per-cpu and per-thread rss are
the leading candidates.

Per thread rss has the benefits of cacheline exclusivity, and not
causing task bloat in the common case.

Per CPU array has better worst case /proc properties, but shares
cachelines (or not, if using percpu_counter as you suggested).


I think I'd better leave it to others to finish off the arguments ;)

2004-11-20 07:20:52

by William Lee Irwin III

Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
>> touch_nmi_watchdog() is only "protection" against local interrupt
>> disablement triggering the NMI oopser because alert_counter[]
>> increments are not atomic. Yet even supposing they were made so, the

On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote:
> That would be a bug in touch_nmi_watchdog then, because you're
> racy against your own NMI too.
> So I'm actually not very very wrong at all. I'm technically wrong
> because touch_nmi_watchdog has a theoretical 'bug'. In practice,
> multiple races with the non atomic increments to the same counter,
> and in an unbroken sequence would be about as likely as hardware
> failure.
> Anyway, this touch nmi thing is going off topic, sorry list.

No, it's on-topic.
(1) The issue is not theoretical. e.g. sysrq t does trigger NMI oopses,
merely not every time, and not on every system. It is not
associated with hardware failure. It is, however, tolerable
because sysrq's require privilege to trigger and are primarily
used when the box is dying anyway.
(2) NMI's don't nest. There is no possibility of NMI's racing against
themselves while the data is per-cpu.


William Lee Irwin III wrote:
>> net effect of "covering up" this gross deficiency is making the
>> user-observable problems it causes undiagnosable, as noted before.

On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote:
> Well the loops that are in there now aren't covered up, and they
> don't seem to be causing problems. Ergo there is no problem (we're
> being _practical_ here, right?)

They are causing problems. They never stopped causing problems. None
of the above attempts to reduce rwlock starvation has been successful
in reducing it to untriggerable-in-the-field levels, and empirical
demonstrations of starvation recurring after those available at the
time of testing were put into place did in fact happen. Reduction of
frequency and making starvation more difficult to trigger are all that
they've achieved thus far.


William Lee Irwin III wrote:
>> Kevin Marin was the first to report this issue to lkml. I had seen
>> instances of it in internal corporate bugreports and it was one of
>> the motivators for the work I did on pidhashing (one of the causes
>> of the timeouts was worst cases in pid allocation). Manfred Spraul
>> and myself wrote patches attempting to reduce read-side hold time
>> in /proc/ algorithms, Ingo Molnar wrote patches to hierarchically
>> subdivide the /proc/ iterations, and Dipankar Sarma and Maneesh
>> Soni wrote patches to carry out the long iterations in /proc/ locklessly.
>> The last several of these affecting /proc/ have not gained acceptance,
>> though the work has not been halted in any sense, as this problem
>> recurs quite regularly. A considerable amount of sustained effort has
>> gone toward mitigating and resolving rwlock starvation.

On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote:
> That's very nice. But there is no problem _now_, is there?

There is and has always been. All of the above merely mitigate the
issue, with the possible exception of the tasklist RCU patch, for
which I know of no testing results. Also note that almost none of
the work on /proc/ has been merged.


William Lee Irwin III wrote:
>> Aggravating the rwlock starvation destabilizes, not pessimizes,
>> and performance is secondary to stability.

On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote:
> Well, luckily we're not going to be aggravating the rwlock starvation.
> If you found a problem with, and fixed do_task_stat: ?time, ???_flt,
> et al, then you would apply the same solution to per thread rss to
> fix it in the same way.

You are aggravating the rwlock starvation by introducing gratuitous
full tasklist iterations. There is no solution to do_task_stat()
because it was recently introduced. There will be one as part of a port
of the usual mitigation patches when the perennial problem is reported
against a sufficiently recent kernel version, as usual. The already-
demonstrated problematic iterations have not been removed.


-- wli

2004-11-20 07:29:52

by Nick Piggin

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>touch_nmi_watchdog() is only "protection" against local interrupt
>>>disablement triggering the NMI oopser because alert_counter[]
>>>increments are not atomic. Yet even supposing they were made so, the
>
>
> On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote:
>
>>That would be a bug in touch_nmi_watchdog then, because you're
>>racy against your own NMI too.
>>So I'm actually not very very wrong at all. I'm technically wrong
>>because touch_nmi_watchdog has a theoretical 'bug'. In practice,
>>multiple races with the non atomic increments to the same counter,
>>and in an unbroken sequence would be about as likely as hardware
>>failure.
>>Anyway, this touch nmi thing is going off topic, sorry list.
>
>
> No, it's on-topic.
> (1) The issue is not theoretical. e.g. sysrq t does trigger NMI oopses,
> merely not every time, and not on every system. It is not
> associated with hardware failure. It is, however, tolerable
> because sysrq's require privilege to trigger and are primarily
> used when the box is dying anyway.

OK then put a touch_nmi_watchdog in there if you must.

> (2) NMI's don't nest. There is no possibility of NMI's racing against
> themselves while the data is per-cpu.
>

Your point was that touch_nmi_watchdog() which resets alert_counter,
is racy when resetting the counter of other CPUs. Yes it is racy.
It is also racy against the NMI on the _current_ CPU.

This has nothing whatsoever to do with NMIs racing against themselves,
I don't know how you got that idea when you were the one to bring up
this race anyway.

[ snip back-and-forth that is going nowhere ]

I'll bow out of the argument here. I grant you raise valid concerns
WRT the /proc issues, of course.

2004-11-20 07:46:56

by Nick Piggin

[permalink] [raw]
Subject: touch_nmi_watchdog (was: page fault scalability patch V11 [0/7]: overview)

Nick Piggin wrote:

>> (2) NMI's don't nest. There is no possibility of NMI's racing against
>> themselves while the data is per-cpu.
>>
>
> Your point was that touch_nmi_watchdog() which resets alert_counter,
> is racy when resetting the counter of other CPUs. Yes it is racy.
> It is also racy against the NMI on the _current_ CPU.

Hmm no I think you're right in that it is only a problem WRT the remote
CPUs. However that would still be a problem, as the comment in i386
touch_nmi_watchdog attests.
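
The race class being argued about here (plain, non-atomic read-modify-write
on a shared counter, as with the alert_counter[] increments) is easy to
demonstrate in userspace. The following is a standalone sketch, not kernel
code; the counter names and thread counts are illustrative only:

```c
/* Sketch (userspace, not kernel code) of the lost-update race discussed
 * above: non-atomic counter++ from two threads can lose increments,
 * while an atomic fetch-add cannot.  The NMI watchdog's alert_counter[]
 * increments are in the former, racy category. */
#include <pthread.h>
#include <stdatomic.h>

#define ITERS 1000000

static long racy_counter;		/* plain read-modify-write */
static atomic_long safe_counter;	/* C11 atomic */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		racy_counter++;				/* increments may be lost */
		atomic_fetch_add(&safe_counter, 1);	/* never lost */
	}
	return NULL;
}

/* Run two workers; return the atomic total, which is always exactly
 * 2 * ITERS.  racy_counter will often, but not deterministically,
 * come up short. */
long run_race_demo(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker, NULL);
	pthread_create(&t2, NULL, worker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return atomic_load(&safe_counter);
}
```

Only the atomic total is deterministic; the racy one is not, which is
exactly why it cannot be asserted on.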

2004-11-20 07:57:46

by Nick Piggin

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

Nick Piggin wrote:
> William Lee Irwin III wrote:

>> No, it's on-topic.
>> (1) The issue is not theoretical. e.g. sysrq t does trigger NMI oopses,
>> merely not every time, and not on every system. It is not
>> associated with hardware failure. It is, however, tolerable
>> because sysrq's require privilege to trigger and are primarily
>> used when the box is dying anyway.
>
>
> OK then put a touch_nmi_watchdog in there if you must.
>

Duh, there is one in there :\

Still, that doesn't really say much about a normal tasklist traversal
because this thing will spend ages writing stuff to serial console.

Now I know going over the whole tasklist is crap. Anything O(n) for
things like this is crap. I happen to just get frustrated to see
concessions being made to support more efficient /proc access. I know
you are one of the ones who has to deal with the practical realities
of that though. Sigh. Well try to bear with me... :|

2004-11-20 08:01:13

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

Andrew Morton wrote:
>> Given that we have contention problems updating a single mm-wide rss and
>> given that the way to fix that up is to spread things out a bit, it seems
>> wildly arbitrary to me that the way in which we choose to spread the
>> counter out is to stick a bit of it into each task_struct.
>> I'd expect that just shoving a pointer into mm_struct which points at a
>> dynamically allocated array[NR_CPUS] of longs would suffice. We probably
>> don't even need to spread them out on cachelines - having four or eight
>> cpus sharing the same cacheline probably isn't going to hurt much.
>> At least, that'd be my first attempt. If it's still not good enough, try
>> something else.

On Sat, Nov 20, 2004 at 06:13:03PM +1100, Nick Piggin wrote:
> That is what Bill thought too. I guess per-cpu and per-thread rss are
> the leading candidates.
> Per thread rss has the benefits of cacheline exclusivity, and not
> causing task bloat in the common case.
> Per CPU array has better worst case /proc properties, but shares
> cachelines (or not, if using percpu_counter as you suggested).
> I think I'd better leave it to others to finish off the arguments ;)

(1) The "task bloat" is more than tolerable on the systems capable
of having enough cpus to see significant per-process
memory footprint, where "significant" is smaller than a
pagetable page even for systems twice as large as now shipped.
(2) The cacheline exclusivity is not entirely gone in dense per-cpu
arrays, it's merely "approximated" by sharing amongst small
groups of adjacent cpus. This is fine for e.g. NUMA because
those small groups of adjacent cpus will typically be on nearby
nodes.
(3) The price paid to get "perfect exclusivity" instead of "approximate
exclusivity" is unbounded tasklist_lock hold time, which takes
boxen down outright in every known instance.
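
The dense per-cpu arrangement described in (2) amounts to something like
the following userspace mockup. NR_CPUS, the structure and the function
names are illustrative assumptions, not the actual patch:

```c
/* Minimal userspace mockup of the "array[NR_CPUS] of longs hanging off
 * mm_struct" scheme: writers touch only their own cpu's slot, and a
 * reader sums the small fixed-size array, so the read side never needs
 * an unbounded walk of the tasklist under tasklist_lock. */
#define NR_CPUS 8

struct mm_rss {
	long count[NR_CPUS];	/* one slot per cpu; adjacent cpus may
				 * share a cacheline, as discussed */
};

/* Fault path: touch only the faulting cpu's slot. */
static void rss_add(struct mm_rss *rss, int cpu, long delta)
{
	rss->count[cpu] += delta;
}

/* Read side (e.g. /proc): one bounded pass over the array. */
static long rss_read(const struct mm_rss *rss)
{
	long total = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		total += rss->count[cpu];
	return total;
}
```

The point of contention is whether the cacheline sharing in count[] costs
anything; the bounded read side is what distinguishes it from a tasklist walk.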

The properties are not for /proc/, they are for tasklist_lock. Every
read stops all other writes. When you hold tasklist_lock for an
extended period of time for read or write, (e.g. exhaustive tasklist
search) you stop all fork()'s and exit()'s and execve()'s on a running
system. The "worst case" analysis has nothing to do with speed. It has
everything to do with taking a box down outright, much like unplugging
power cables or dereferencing NULL. Unbounded tasklist_lock hold time
kills running boxen dead.

Read sides of rwlocks are not licenses to spin for aeons with locks held.

And the "question" of sufficiency has in fact already been answered.
SGI's own testing during the 2.4 out-of-tree patching cycle determined
that an mm-global atomic counter was already sufficient so long as the
cacheline was not shared with ->mmap_sem and the like. The "simplest"
optimization of moving the field out of the way of ->mmap_sem already
worked. The grander ones, if and *ONLY* if they don't have showstoppers
like unbounded tasklist_lock hold time or castrating workload monitoring
to unusability, will merely be more robust for future systems.
Reiterating, this is all just fine so long as they don't cause any
showstopping problems, like castrating the ability to monitor
processes, or introducing more tasklist_lock starvation.


-- wli

2004-11-20 08:25:52

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

Nick Piggin wrote:
>> OK then put a touch_nmi_watchdog in there if you must.

On Sat, Nov 20, 2004 at 06:57:38PM +1100, Nick Piggin wrote:
> Duh, there is one in there :\
> Still, that doesn't really say much about a normal tasklist traversal
> because this thing will spend ages writing stuff to serial console.
> Now I know going over the whole tasklist is crap. Anything O(n) for
> things like this is crap. I happen to just get frustrated to see
> concessions being made to support more efficient /proc access. I know
> you are one of the ones who has to deal with the practical realities
> of that though. Sigh. Well try to bear with me... :|

I sure as Hell don't have any interest in /proc/ in and of itself,
but this stuff does really bite people, and hard, too.


-- wli

2004-11-20 17:06:15

by Martin J. Bligh

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

>> Given that we have contention problems updating a single mm-wide rss and
>> given that the way to fix that up is to spread things out a bit, it seems
>> wildly arbitrary to me that the way in which we choose to spread the
>> counter out is to stick a bit of it into each task_struct.
>>
>> I'd expect that just shoving a pointer into mm_struct which points at a
>> dynamically allocated array[NR_CPUS] of longs would suffice. We probably
>> don't even need to spread them out on cachelines - having four or eight
>> cpus sharing the same cacheline probably isn't going to hurt much.
>>
>> At least, that'd be my first attempt. If it's still not good enough, try
>> something else.
>>
>>
>
> That is what Bill thought too. I guess per-cpu and per-thread rss are
> the leading candidates.
>
> Per thread rss has the benefits of cacheline exclusivity, and not
> causing task bloat in the common case.
>
> Per CPU array has better worst case /proc properties, but shares
> cachelines (or not, if using percpu_counter as you suggested).

Per thread seems much nicer to me - mainly because it degrades cleanly to
a single counter for 99% of processes, which are single threaded.

M.

2004-11-20 17:14:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview



On Sat, 20 Nov 2004, Martin J. Bligh wrote:
>
> Per thread seems much nicer to me - mainly because it degrades cleanly to
> a single counter for 99% of processes, which are single threaded.

I will pretty much guarantee that if you put the per-thread patches next
to some abomination with per-cpu allocation for each mm, the choice will
be clear. Especially if the per-cpu/per-mm thing tries to avoid false
cacheline sharing, which sounds really "interesting" in itself.

And without the cacheline sharing avoidance, what's the point of this
again? It sure wasn't to make the code simpler. It was about performance
and scalability.

Linus

2004-11-20 19:08:34

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

On Sat, Nov 20, 2004 at 09:14:11AM -0800, Linus Torvalds wrote:
> I will pretty much guarantee that if you put the per-thread patches next
> to some abomination with per-cpu allocation for each mm, the choice will
> be clear. Especially if the per-cpu/per-mm thing tries to avoid false
> cacheline sharing, which sounds really "interesting" in itself.
> And without the cacheline sharing avoidance, what's the point of this
> again? It sure wasn't to make the code simpler. It was about performance
> and scalability.

"The perfect is the enemy of the good."

The "perfect" cacheline separation achieved that way is at the cost of
destabilizing the kernel. The dense per-cpu business is only really a
concession to the notion that the counter needs to be split up at all,
which has never been demonstrated with performance measurements. In fact,
Robin Holt has performance measurements demonstrating the opposite.

The "good" alternatives are negligibly different wrt. performance, and
don't carry the high cost of rwlock starvation that breaks boxen.


-- wli

2004-11-20 19:17:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview



On Sat, 20 Nov 2004, William Lee Irwin III wrote:
>
> "The perfect is the enemy of the good."

Yes. But in this case, my suggestion _is_ the good. You seem to be pushing
for a really horrid thing which allocates a per-cpu array for each
mm_struct.

What is it that you have against the per-thread rss? We already have
several places that do the thread-looping, so it's not like "you can't do
that" is a valid argument.

Linus

2004-11-20 19:33:53

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

On Sat, 20 Nov 2004, William Lee Irwin III wrote:
>> "The perfect is the enemy of the good."

On Sat, Nov 20, 2004 at 11:16:12AM -0800, Linus Torvalds wrote:
> Yes. But in this case, my suggestion _is_ the good. You seem to be pushing
> for a really horrid thing which allocates a per-cpu array for each
> mm_struct.
> What is it that you have against the per-thread rss? We already have
> several places that do the thread-looping, so it's not like "you can't do
> that" is a valid argument.

Okay, first thread groups can share mm's, so it's worse than iterating
over a thread group. Second, the long loops under tasklist_lock didn't
stop causing rwlock starvation because what patches there were to do
something about them didn't get merged.

I'm not particularly "stuck on" the per-cpu business, it was merely the
most obvious method of splitting the RSS counter without catastrophes
elsewhere. Robin Holt's 2.4 performance studies actually show that
splitting the counter is not even essential.


-- wli

2004-11-20 20:26:02

by Adam Heath

[permalink] [raw]
Subject: [OT] Re: page fault scalability patch V11 [0/7]: overview

On Sat, 20 Nov 2004, William Lee Irwin III wrote:

> "The perfect is the enemy of the good."

"With the proper course of action made so explicit, we had merely to choose
between wisdom and folly. Precisely how we chose folly in this instance is
not entirely clear."

Quote taken from Andrew Suffield on irc, who said he got it from Penny Arcade.

2004-11-22 15:08:07

by Hugh Dickins

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [1/7]: sloppy rss

On Fri, 19 Nov 2004, Christoph Lameter wrote:
> On Fri, 19 Nov 2004, Hugh Dickins wrote:
>
> > Sorry, against what tree do these patches apply?
> > Apparently not linux-2.6.9, nor latest -bk, nor -mm?
>
> 2.6.10-rc2-bk3

Ah, thanks - got it patched now, but your mailer (or something else)
is eating trailing spaces. Better than adding them, but we have to
apply this patch before your set:

--- 2.6.10-rc2-bk3/include/asm-i386/system.h 2004-11-15 16:21:12.000000000 +0000
+++ linux/include/asm-i386/system.h 2004-11-22 14:44:30.761904592 +0000
@@ -273,9 +273,9 @@ static inline unsigned long __cmpxchg(vo
#define cmpxchg(ptr,o,n)\
((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
(unsigned long)(n),sizeof(*(ptr))))
-
+
#ifdef __KERNEL__
-struct alt_instr {
+struct alt_instr {
__u8 *instr; /* original instruction */
__u8 *replacement;
__u8 cpuid; /* cpuid bit set for replacement */
--- 2.6.10-rc2-bk3/include/asm-s390/pgalloc.h 2004-05-10 03:33:39.000000000 +0100
+++ linux/include/asm-s390/pgalloc.h 2004-11-22 14:54:43.704723120 +0000
@@ -99,7 +99,7 @@ static inline void pgd_populate(struct m

#endif /* __s390x__ */

-static inline void
+static inline void
pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
{
#ifndef __s390x__
--- 2.6.10-rc2-bk3/mm/memory.c 2004-11-18 17:56:11.000000000 +0000
+++ linux/mm/memory.c 2004-11-22 14:39:33.924030808 +0000
@@ -1424,7 +1424,7 @@ out:
/*
* We are called with the MM semaphore and page_table_lock
* spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * multithreaded programs.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1615,7 +1615,7 @@ static int do_file_page(struct mm_struct
* Fall back to the linear mapping if the fs does not support
* ->populate:
*/
- if (!vma->vm_ops || !vma->vm_ops->populate ||
+ if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
return do_no_page(mm, vma, address, write_access, pte, pmd);

2004-11-22 17:49:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

On Sat, 20 Nov 2004, William Lee Irwin III wrote:
> I'm not particularly "stuck on" the per-cpu business, it was merely the
> most obvious method of splitting the RSS counter without catastrophes
> elsewhere. Robin Holt's 2.4 performance studies actually show that
> splitting the counter is not even essential.

There is no problem moving back to the atomic approach, that is, if it
is okay to also make anon_rss atomic. But it's a pretty significant
performance hit (this comparison uses some old data from V4 of the
patch, which makes it a bit suspect since the test environment was
likely slightly different. I should really test this again. Note that
the old performance test was only run 3 times instead of 10):

atomic vs. sloppy rss performance 64G allocation:

sloppy rss:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
16 10 1 1.818s 131.556s 133.038s 78618.592 78615.672
16 10 2 1.736s 121.167s 65.026s 85317.098 160656.362
16 10 4 1.835s 120.444s 36.002s 85751.810 291074.998
16 10 8 1.820s 131.068s 25.049s 78906.310 411304.895
16 10 16 3.275s 194.971s 22.019s 52892.356 472497.962
16 10 32 13.006s 496.628s 27.044s 20575.038 381999.865

atomic:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
16 3 1 0.610s 61.557s 62.016s 50600.438 50599.822
16 3 2 0.640s 83.116s 43.016s 37557.847 72869.978
16 3 4 0.621s 73.897s 26.023s 42214.002 119908.246
16 3 8 0.596s 86.587s 14.098s 36081.229 209962.059
16 3 16 0.646s 69.601s 7.000s 44780.269 448823.690
16 3 32 0.903s 185.609s 8.085s 16866.018 355301.694

Let's go for the approach of moving rss into the thread structure but
keeping the rss in the mm structure as is (taking page_table_lock for
updates) to consolidate the values. This allows us to keep most of the
code as is; the rss in the task struct is only used when we are not
holding the page_table_lock.

Maybe we can then find some way to regularly update the rss in the mm
structure to avoid the loop over the tasklist in proc.

2004-11-22 19:39:11

by Bill Davidsen

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

Linus Torvalds wrote:
>
> On Sat, 20 Nov 2004, Nick Piggin wrote:
>
>>The per thread rss may wrap (maybe not 64-bit counters), but even so,
>>the summation over all threads should still end up being correct I
>>think.
>
>
> Yes. As long as the total rss fits in an int, it doesn't matter if any of
> them wrap. Addition is still associative in twos-complement arithmetic
> even in the presence of overflows.
>
> If you actually want to make it proper standard C, I guess you'd have to
> make the thing unsigned, which gives you the mod-2**n guarantees even if
> somebody were to ever make a non-twos-complement machine.

I think other stuff breaks as well; I saw you post some example code
using something like (a & -a) or similar within the last few months.
Fortunately neither 1's complement nor BCD is likely to return in
hardware. Big-end vs. little-end is still an issue, though.
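
The wraparound argument quoted above can be checked directly. This is a
hedged userspace illustration: the array stands in for per-thread counters,
and nothing here comes from the actual patch:

```c
/* Check of the mod-2**n argument: per-thread unsigned counters may
 * individually wrap, but their sum is still correct modulo 2**n, so
 * the consolidated total comes out right as long as the true total
 * fits in the type.  Userspace illustration only. */
#include <limits.h>	/* UINT_MAX, used by callers to simulate wrap */

#define NTHREADS 3

/* Sum per-thread deltas; each slot may have wrapped any number of
 * times.  Unsigned addition is well-defined modulo 2**32 in C. */
unsigned int sum_rss(const unsigned int per_thread[], int n)
{
	unsigned int total = 0;

	for (int i = 0; i < n; i++)
		total += per_thread[i];
	return total;
}
```

A counter holding UINT_MAX behaves as -1 modulo 2**32, so a thread that
decremented once past zero still contributes the right amount to the sum.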

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-22 21:54:54

by Christoph Lameter

[permalink] [raw]
Subject: deferred rss update instead of sloppy rss

One way to solve the rss issues is--as discussed--to put rss into the
task structure and then have the page fault increment that rss.

The problem is then that the proc filesystem must do an extensive scan
over all threads to find users of a certain mm_struct.

The following patch does put the rss into task_struct. The page fault
handler is then incrementing current->rss if the page_table_lock is not
held.

The timer interrupt checks if task->rss is non-zero (when doing
stime/utime updates; rss is defined near those fields, so it's hopefully
on the same cacheline and has a minimal impact).

If rss is non zero and the page_table_lock and the mmap_sem can be taken
then the mm->rss will be updated by the value of the task->rss and
task->rss will be zeroed. This avoids all proc issues. The only
disadvantage is that rss may be inaccurate for a couple of clock ticks.

This also adds some performance (sorry only a 4p system):

sloppy rss:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.593s 29.897s 30.050s 85973.585 85948.565
4 10 2 0.616s 42.184s 22.045s 61247.450 116719.558
4 10 4 0.559s 44.918s 14.076s 57641.255 177553.945

deferred rss:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.565s 29.429s 30.000s 87395.518 87366.580
4 10 2 0.500s 33.514s 18.002s 77067.935 145426.659
4 10 4 0.533s 44.455s 14.085s 58269.368 176413.196

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-11-22 13:18:58.000000000 -0800
@@ -584,6 +584,10 @@
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
unsigned long utime, stime;
+ long rss; /* rss counter when mm->rss is not usable. mm->page_table_lock
+ * not held but mm->mmap_sem must be held for sync with
+ * the timer interrupt which clears rss and updates mm->rss.
+ */
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time;
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-22 11:16:02.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -507,8 +505,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -791,8 +787,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
Index: linux-2.6.9/kernel/fork.c
===================================================================
--- linux-2.6.9.orig/kernel/fork.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/fork.c 2004-11-22 11:16:02.000000000 -0800
@@ -876,6 +876,7 @@
p->io_context = NULL;
p->io_wait = NULL;
p->audit_context = NULL;
+ p->rss = 0;
#ifdef CONFIG_NUMA
p->mempolicy = mpol_copy(p->mempolicy);
if (IS_ERR(p->mempolicy)) {
Index: linux-2.6.9/kernel/exit.c
===================================================================
--- linux-2.6.9.orig/kernel/exit.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/exit.c 2004-11-22 11:16:02.000000000 -0800
@@ -501,6 +501,9 @@
/* more a memory barrier than a real lock */
task_lock(tsk);
tsk->mm = NULL;
+ /* only holding mmap_sem here maybe get page_table_lock too? */
+ mm->rss += tsk->rss;
+ tsk->rss = 0;
up_read(&mm->mmap_sem);
enter_lazy_tlb(mm, current);
task_unlock(tsk);
Index: linux-2.6.9/kernel/timer.c
===================================================================
--- linux-2.6.9.orig/kernel/timer.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/timer.c 2004-11-22 11:42:12.000000000 -0800
@@ -815,6 +815,15 @@
if (psecs / HZ >= p->signal->rlim[RLIMIT_CPU].rlim_max)
send_sig(SIGKILL, p, 1);
}
+ /* Update mm->rss if necessary */
+ if (p->rss && p->mm && down_write_trylock(&p->mm->mmap_sem)) {
+ if (spin_trylock(&p->mm->page_table_lock)) {
+ p->mm->rss += p->rss;
+ p->rss = 0;
+ spin_unlock(&p->mm->page_table_lock);
+ }
+ up_write(&p->mm->mmap_sem);
+ }
}

static inline void do_it_virt(struct task_struct * p, unsigned long ticks)

2004-11-22 22:13:40

by Andrew Morton

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

Christoph Lameter <[email protected]> wrote:
>
> One way to solve the rss issues is--as discussed--to put rss into the
> task structure and then have the page fault increment that rss.
>
> The problem is then that the proc filesystem must do an extensive scan
> over all threads to find users of a certain mm_struct.
>
> The following patch does put the rss into task_struct. The page fault
> handler is then incrementing current->rss if the page_table_lock is not
> held.
>
> The timer interrupt checks if task->rss is non-zero (when doing
> stime/utime updates; rss is defined near those fields, so it's hopefully
> on the same cacheline and has a minimal impact).
>
> If rss is non zero and the page_table_lock and the mmap_sem can be taken
> then the mm->rss will be updated by the value of the task->rss and
> task->rss will be zeroed. This avoids all proc issues. The only
> disadvantage is that rss may be inaccurate for a couple of clock ticks.
>
> This also adds some performance (sorry only a 4p system):
>
> sloppy rss:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 4 10 1 0.593s 29.897s 30.050s 85973.585 85948.565
> 4 10 2 0.616s 42.184s 22.045s 61247.450 116719.558
> 4 10 4 0.559s 44.918s 14.076s 57641.255 177553.945
>
> deferred rss:
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 4 10 1 0.565s 29.429s 30.000s 87395.518 87366.580
> 4 10 2 0.500s 33.514s 18.002s 77067.935 145426.659
> 4 10 4 0.533s 44.455s 14.085s 58269.368 176413.196

hrm. I cannot see anywhere in this patch where you update task_struct.rss.

> Index: linux-2.6.9/kernel/exit.c
> ===================================================================
> --- linux-2.6.9.orig/kernel/exit.c 2004-11-22 09:51:58.000000000 -0800
> +++ linux-2.6.9/kernel/exit.c 2004-11-22 11:16:02.000000000 -0800
> @@ -501,6 +501,9 @@
> /* more a memory barrier than a real lock */
> task_lock(tsk);
> tsk->mm = NULL;
> + /* only holding mmap_sem here maybe get page_table_lock too? */
> + mm->rss += tsk->rss;
> + tsk->rss = 0;
> up_read(&mm->mmap_sem);

mmap_sem needs to be held for writing, surely?

> + /* Update mm->rss if necessary */
> + if (p->rss && p->mm && down_write_trylock(&p->mm->mmap_sem)) {
> + if (spin_trylock(&p->mm->page_table_lock)) {
> + p->mm->rss += p->rss;
> + p->rss = 0;
> + spin_unlock(&p->mm->page_table_lock);
> + }
> + up_write(&p->mm->mmap_sem);
> + }
> }

I'd also suggest that you do:

tsk->rss++;
if (tsk->rss > 16) {
spin_lock(&mm->page_table_lock);
mm->rss += tsk->rss;
spin_unlock(&mm->page_table_lock);
tsk->rss = 0;
}

just to prevent transient gross inaccuracies. For some value of "16".
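
That threshold-spill idea can be sketched as a self-contained userspace
mockup. The structures and the mutex stand in for task_struct, mm_struct
and the page_table_lock, and the threshold value is arbitrary:

```c
/* Userspace mockup of the threshold spill suggested above: the fault
 * path counts into a per-task field and spills into the shared mm
 * counter once the local count exceeds a small threshold, bounding
 * the transient inaccuracy per thread. */
#include <pthread.h>

#define RSS_SPILL_THRESHOLD 16

struct mock_mm {
	pthread_mutex_t page_table_lock;
	long rss;		/* consolidated count */
};

struct mock_task {
	struct mock_mm *mm;
	long rss;		/* private, lock-free count */
};

/* Fault path: lock-free increment, occasional locked spill. */
void task_rss_inc(struct mock_task *tsk)
{
	tsk->rss++;
	if (tsk->rss > RSS_SPILL_THRESHOLD) {
		pthread_mutex_lock(&tsk->mm->page_table_lock);
		tsk->mm->rss += tsk->rss;
		pthread_mutex_unlock(&tsk->mm->page_table_lock);
		tsk->rss = 0;
	}
}
```

A reader of mm->rss then sees a value stale by at most
RSS_SPILL_THRESHOLD per thread, rather than by an unbounded amount.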

2004-11-22 22:23:02

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

On Mon, 2004-11-22 at 14:13 -0800, Christoph Lameter wrote:
> On Mon, 22 Nov 2004, Andrew Morton wrote:
>
> > hrm. I cannot see anywhere in this patch where you update task_struct.rss.
>
> This is just the piece around it dealing with rss. The updating of rss
> happens in the generic code. The change to that is trivial. I can repost
> the whole shebang if you want.
>
> > > + /* only holding mmap_sem here maybe get page_table_lock too? */
> > > + mm->rss += tsk->rss;
> > > + tsk->rss = 0;
> > > up_read(&mm->mmap_sem);
> >
> > mmap_sem needs to be held for writing, surely?
>
> If there are no page faults occurring anymore then we would not need to
> get the lock. Q: Is it safe to assume that no faults occur
> anymore at this point?

Why wouldn't the mm take faults on other CPUs ? (other threads)

> > just to prevent transient gross inaccuracies. For some value of "16".
>
> The page fault code only increments rss. For larger transactions that
> increase / decrease rss significantly the page_table_lock is taken and
> mm->rss is updated directly. So no
> gross inaccuracies can result.
--
Benjamin Herrenschmidt <[email protected]>

2004-11-22 22:23:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss



On Mon, 22 Nov 2004, Christoph Lameter wrote:
>
> The problem is then that the proc filesystem must do an extensive scan
> over all threads to find users of a certain mm_struct.

The alternative is to just add a simple list into the task_struct and the
head of it into mm_struct. Then, at fork, you just finish the fork() with

list_add(p->mm_list, p->mm->thread_list);

and do the proper list_del() in exit_mm() or wherever.

You'll still loop in /proc, but you'll do the minimal loop necessary.

Linus
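
That per-mm thread list can be mocked up in userspace with an intrusive
list. Note that the kernel version would pass addresses, i.e.
list_add(&p->mm_list, &p->mm->thread_list), and take appropriate locking;
the names below only mirror, and are not, the kernel's <linux/list.h>:

```c
/* Userspace mockup of the per-mm thread list: each task embeds a list
 * node, fork-time code links it into its mm, and the /proc reader walks
 * only that mm's threads instead of the whole tasklist. */
#include <stddef.h>	/* offsetof */

struct list_node {
	struct list_node *next, *prev;
};

struct mock_mm {
	struct list_node thread_list;	/* head of this mm's tasks */
	long rss;
};

struct mock_task {
	long rss;
	struct list_node mm_list;	/* linkage into mm->thread_list */
};

static void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

/* fork()-time: list_add(&p->mm_list, &mm->thread_list) */
static void list_add(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* /proc read side: the minimal loop, over this mm's threads only. */
long mm_total_rss(struct mock_mm *mm)
{
	long total = mm->rss;

	for (struct list_node *n = mm->thread_list.next;
	     n != &mm->thread_list; n = n->next) {
		struct mock_task *t = (struct mock_task *)
			((char *)n - offsetof(struct mock_task, mm_list));
		total += t->rss;
	}
	return total;
}
```

The list_del() at exit_mm() time, and whatever lock guards the list, are
the parts this sketch deliberately leaves open.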

2004-11-22 22:31:48

by Christoph Lameter

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

On Mon, 22 Nov 2004, Linus Torvalds wrote:

> The alternative is to just add a simple list into the task_struct and the
> head of it into mm_struct. Then, at fork, you just finish the fork() with
>
> list_add(p->mm_list, p->mm->thread_list);
>
> and do the proper list_del() in exit_mm() or wherever.
>
> You'll still loop in /proc, but you'll do the minimal loop necessary.

I think the approach that I posted is simpler, unless there are other
benefits to be gained from being able to easily figure out which tasks
use an mm.

2004-11-22 22:39:45

by Nick Piggin

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

Linus Torvalds wrote:
>
> On Mon, 22 Nov 2004, Christoph Lameter wrote:
>
>>The problem is then that the proc filesystem must do an extensive scan
>>over all threads to find users of a certain mm_struct.
>
>
> The alternative is to just add a simple list into the task_struct and the
> head of it into mm_struct. Then, at fork, you just finish the fork() with
>
> list_add(p->mm_list, p->mm->thread_list);
>
> and do the proper list_del() in exit_mm() or wherever.
>
> You'll still loop in /proc, but you'll do the minimal loop necessary.
>

Yes, that was what I was thinking we'd have to resort to. Not a bad idea.

It would be nice if you could have it integrated with the locking that
is already there - for example mmap_sem, although that might mean you'd
have to take mmap_sem for writing which may limit scalability of thread
creation / destruction... maybe a separate lock / semaphore for that list
itself would be OK.

Deferred rss might be a practical solution, but I'd prefer this if it can
be made workable.

2004-11-22 22:42:22

by Christoph Lameter

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

On Tue, 23 Nov 2004, Nick Piggin wrote:

> Deferred rss might be a practical solution, but I'd prefer this if it can
> be made workable.

Both result in an additional field in task_struct that is going to be
incremented when the page_table_lock is not held. It would be possible
to switch to looping in procfs later. The main question with this patchset
is:

How and when can we get this get into the kernel?

2004-11-22 22:45:19

by Andrew Morton

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

Christoph Lameter <[email protected]> wrote:
>
> > just to prevent transient gross inaccuracies. For some value of "16".
>
> The page fault code only increments rss. For larger transactions that
> increase / decrease rss significantly the page_table_lock is taken and
> mm->rss is updated directly. So no
> gross inaccuracies can result.

Sure. Take a million successive pagefaults and mm->rss is grossly
inaccurate. Hence my suggestion that it be spilled into mm->rss
periodically.

2004-11-22 22:51:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

On Mon, 22 Nov 2004, Andrew Morton wrote:

> > The page fault code only increments rss. For larger transactions that
> > increase / decrease rss significantly the page_table_lock is taken and
> > mm->rss is updated directly. So no
> > gross inaccuracies can result.
>
> Sure. Take a million successive pagefaults and mm->rss is grossly
> inaccurate. Hence my suggestion that it be spilled into mm->rss
> periodically.

It is spilled into mm->rss periodically. That is the whole point of the
patch.

The timer tick occurs every 1 ms. The maximum pagefault frequency that I
have seen is 500000 faults/second. The max deviation is therefore
less than 500 faults (it could be greater if the page_table_lock /
mmap_sem is always held when the tick occurs).

2004-11-22 22:53:43

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

On Sat, 20 Nov 2004, William Lee Irwin III wrote:
>> I'm not particularly "stuck on" the per-cpu business, it was merely the
>> most obvious method of splitting the RSS counter without catastrophes
>> elsewhere. Robin Holt's 2.4 performance studies actually show that
>> splitting the counter is not even essential.

On Mon, Nov 22, 2004 at 09:44:02AM -0800, Christoph Lameter wrote:
> There is no problem moving back to the atomic approach, that is, if it
> is okay to also make anon_rss atomic. But it is a pretty significant
> performance hit. (The comparison uses some old data from V4 of the
> patch, which makes this data a bit suspect since the test environment
> is likely slightly different; I should really test this again. Note
> that the old performance test was only run 3 times instead of 10.)
> atomic vs. sloppy rss performance 64G allocation:

The specific patches you compared matter a great deal as there are
implementation blunders (e.g. poor placement of counters relative to
->mmap_sem) that can ruin the results. URLs to the specific patches
would rule out that source of error.


-- wli

2004-11-22 23:13:21

by Nick Piggin

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

Christoph Lameter wrote:
> On Mon, 22 Nov 2004, Andrew Morton wrote:
>
>
>>>The page fault code only increments rss. For larger transactions that
>>>increase / decrease rss significantly the page_table_lock is taken and
>>>mm->rss is updated directly. So no
>>>gross inaccuracies can result.
>>
>>Sure. Take a million successive pagefaults and mm->rss is grossly
>>inaccurate. Hence my suggestion that it be spilled into mm->rss
>>periodically.
>
>
> It is spilled into mm->rss periodically. That is the whole point of the
> patch.
>
> The timer tick occurs every 1 ms. The maximum pagefault frequency that I
> have seen is 500000 faults /second. The max deviation is therefore
> less than 500 (could be greater if page table lock / mmap_sem always held
> when the tick occurs).


You could imagine a situation where something pagefaults and sleeps in
lock-step with the timer though. Theoretical problem only?

I think that by the time you get the spilling code in, the mm-list method
will be looking positively elegant!

2004-11-22 23:16:15

by Andrew Morton

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

Christoph Lameter <[email protected]> wrote:
>
> On Mon, 22 Nov 2004, Andrew Morton wrote:
>
> > > The page fault code only increments rss. For larger transactions that
> > > increase / decrease rss significantly the page_table_lock is taken and
> > > mm->rss is updated directly. So no
> > > gross inaccuracies can result.
> >
> > Sure. Take a million successive pagefaults and mm->rss is grossly
> > inaccurate. Hence my suggestion that it be spilled into mm->rss
> > periodically.
>
> It is spilled into mm->rss periodically. That is the whole point of the
> patch.
>
> The timer tick occurs every 1 ms.

That only works if the task happens to have the CPU when the timer tick
occurs. There remains no theoretical upper bound to the error in mm->rss,
and that's very easy to fix.

2004-11-22 23:19:48

by Nick Piggin

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

Christoph Lameter wrote:
> On Tue, 23 Nov 2004, Nick Piggin wrote:
>
>
>>Deferred rss might be a practical solution, but I'd prefer this if it can
>>be made workable.
>
>
> Both result in an additional field in task_struct that is incremented
> while the page_table_lock is not held. It would be possible to switch
> to looping in procfs later. The main question with this patchset is:
>

Sure.

> How and when can we get this into the kernel?
>

Well it is a good starting platform for the various PTL reduction patches
floating around.

I'd say Andrew could be convinced to stick it in -mm after 2.6.10, but we'd
probably need a clear path to one of the PTL patches before anything would
move into 2.6.

2004-11-22 23:19:48

by Christoph Lameter

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

On Tue, 23 Nov 2004, Nick Piggin wrote:

> > The timer tick occurs every 1 ms. The maximum pagefault frequency that I
> > have seen is 500000 faults /second. The max deviation is therefore
> > less than 500 (could be greater if page table lock / mmap_sem always held
> > when the tick occurs).
> I think that by the time you get the spilling code in, the mm-list method
> will be looking positively elegant!

I do not care what gets in as long as something goes in to address the
performance issues. So far everyone seems to have their own pet ideas. By
all means do the mm-list method and post it. But we have already seen
objections by others against loops in proc, so that will also cause
additional controversy.

2004-11-23 00:50:47

by Christoph Lameter

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

On Mon, 22 Nov 2004, Andrew Morton wrote:

> > The timer tick occurs every 1 ms.
>
> That only works if the task happens to have the CPU when the timer tick
> occurs. There remains no theoretical upper bound to the error in mm->rss,
> and that's very easy to fix.

Page-fault-intensive programs typically hog the CPU, but you are right in
principle.

The "easy fix" that you propose is to add additional logic to some very
hot code paths. Then we are probably better off with another approach.

2004-11-23 01:56:51

by Christoph Lameter

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

On Mon, 22 Nov 2004, William Lee Irwin III wrote:

> The specific patches you compared matter a great deal as there are
> implementation blunders (e.g. poor placement of counters relative to
> ->mmap_sem) that can ruin the results. URLs to the specific patches
> would rule out that source of error.

I mentioned V4 of this patch which was posted to lkml. A simple search
should get you there.

2004-11-22 22:45:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss



On Mon, 22 Nov 2004, Christoph Lameter wrote:
>
> I think the approach that I posted is simpler unless there are other
> benefits to be gained if it would be easy to figure out which tasks use an
> mm.

I'm just worried that your timer tick thing won't catch things in a timely
manner. That said, if that isn't an issue, and people don't have problems
with it, then fine. On the other hand, if /proc literally is the only real
user, then I guess it really can't matter.

Linus

2004-11-23 02:28:41

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page fault scalability patch V11 [0/7]: overview

On Mon, 22 Nov 2004, William Lee Irwin III wrote:
>> The specific patches you compared matter a great deal as there are
>> implementation blunders (e.g. poor placement of counters relative to
>> ->mmap_sem) that can ruin the results. URLs to the specific patches
>> would rule out that source of error.

On Mon, Nov 22, 2004 at 02:51:22PM -0800, Christoph Lameter wrote:
> I mentioned V4 of this patch which was posted to lkml. A simple search
> should get you there.

The counter's placement was poor in that version of the patch, so the
results are very suspect and likely invalid. It would have been more
helpful if you had provided some kind of unique identifier when a request
for complete disambiguation was made; for instance, the version tags of
your patches are not visible in the Subject: lines.

There are, of course, other issues, e.g. where the arch sweeps went.
This discussion has degenerated into non-cooperation, making it beyond
my power to help, and I'm in the midst of several rather urgent
bughunts, of which there are apparently more to come.


-- wli

2004-11-22 22:15:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: deferred rss update instead of sloppy rss

On Mon, 22 Nov 2004, Andrew Morton wrote:

> hrm. I cannot see anywhere in this patch where you update task_struct.rss.

This is just the piece around it dealing with rss. The updating of rss
happens in the generic code. The change to that is trivial. I can repost
the whole shebang if you want.

> > + /* only holding mmap_sem here maybe get page_table_lock too? */
> > + mm->rss += tsk->rss;
> > + tsk->rss = 0;
> > up_read(&mm->mmap_sem);
>
> mmap_sem needs to be held for writing, surely?

If there are no page faults occurring anymore then we would not need to
get the lock. Q: Is it safe to assume that no faults occur
anymore at this point?

> just to prevent transient gross inaccuracies. For some value of "16".

The page fault code only increments rss. For larger transactions that
increase or decrease rss significantly, the page_table_lock is taken and
mm->rss is updated directly, so no gross inaccuracies can result.