2005-09-02 06:25:44

by Nick Piggin

Subject: New lockless pagecache

There were a number of problems with the old lockless pagecache:

- page_count of free pages was unstable, which meant an extra atomic
operation when allocating a new page (had to use get_page instead
of set_page_count).

- This meant a new page flag PG_free had to be introduced.

- Also needed a spin loop and memory barriers added to the page
allocator to prevent a free page with a speculative reference on
it from being allocated.

- This introduced the requirement that interrupts be disabled in
page_cache_get_speculative to prevent deadlock, which was very
disheartening.

- Too complex.

- To cap it off, there was an "unsolvable" race whereby a second
page_cache_get_speculative could not detect that it had picked up a
free page, and would end up freeing it again.

The elegant solution was right under my nose the whole time.
I introduced an atomic_cmpxchg and use it to ensure we never
touch ->_count of a free page. This solves all of the above problems.
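
Roughly, the trick looks like this (a sketch only -- the real thing is
page_cache_get_speculative in patch 3/7, and the helper name here is
made up):

	static inline int page_get_unless_free(struct page *page)
	{
		int count;

		do {
			count = atomic_read(&page->_count);
			if (count == -1)	/* free page, don't touch it */
				return 0;
		} while (atomic_cmpxchg(&page->_count, count, count + 1) != count);

		return 1;	/* we now hold a reference */
	}

The cmpxchg only succeeds if ->_count still holds the value we observed,
so a page that is freed underneath us either makes the compare fail or
makes us bail out on the next read; we never store to the ->_count of a
free page.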

I think this is getting pretty stable. No guarantees of course,
but it would be great if anyone gave it a test.

--
SUSE Labs, Novell Inc.



2005-09-02 06:27:55

by Nick Piggin

Subject: [PATCH 2.6.13] lockless pagecache 1/7

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -333,6 +333,21 @@ out:
}

/*
+ * This function is called to print an error when a pte in a
+ * !VM_RESERVED region is found pointing to an invalid pfn (which
+ * should never happen).
+ *
+ * The calling function must still handle the error.
+ */
+void __invalid_pfn(const char *errfunc, pte_t pte,
+ unsigned long vm_flags, unsigned long vaddr)
+{
+ printk(KERN_ERR "%s: pte does not point to valid memory. "
+ "process = %s, pte = %08lx, vm_flags = %lx, vaddr = %lx\n",
+ errfunc, current->comm, (long)pte_val(pte), vm_flags, vaddr);
+}
+
+/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
* covered by this vma.
@@ -361,25 +376,29 @@ copy_one_pte(struct mm_struct *dst_mm, s
spin_unlock(&mmlist_lock);
}
}
- set_pte_at(dst_mm, addr, dst_pte, pte);
- return;
+ goto out_set_pte;
}

+ /* If the region is VM_RESERVED, the mapping is not
+ * mapped via rmap - duplicate the pte as is.
+ */
+ if (vm_flags & VM_RESERVED)
+ goto out_set_pte;
+
+ /* If the pte points outside of valid memory but
+ * the region is not VM_RESERVED, we have a problem.
+ */
pfn = pte_pfn(pte);
- /* the pte points outside of valid memory, the
- * mapping is assumed to be good, meaningful
- * and not mapped via rmap - duplicate the
- * mapping as is.
- */
- page = NULL;
- if (pfn_valid(pfn))
- page = pfn_to_page(pfn);
-
- if (!page || PageReserved(page)) {
- set_pte_at(dst_mm, addr, dst_pte, pte);
- return;
+ if (unlikely(!pfn_valid(pfn))) {
+ invalid_pfn(pte, vm_flags, addr);
+ goto out_set_pte; /* try to do something sane */
}

+ page = pfn_to_page(pfn);
+ /* Mappings to zero pages aren't covered by rmap either. */
+ if (page == ZERO_PAGE(addr))
+ goto out_set_pte;
+
/*
* If it's a COW mapping, write protect it both
* in the parent and the child
@@ -400,8 +419,9 @@ copy_one_pte(struct mm_struct *dst_mm, s
inc_mm_counter(dst_mm, rss);
if (PageAnon(page))
inc_mm_counter(dst_mm, anon_rss);
- set_pte_at(dst_mm, addr, dst_pte, pte);
page_dup_rmap(page);
+out_set_pte:
+ set_pte_at(dst_mm, addr, dst_pte, pte);
}

static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds
return 0;
}

-static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+static void zap_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
@@ -527,11 +548,15 @@ static void zap_pte_range(struct mmu_gat
continue;
if (pte_present(ptent)) {
struct page *page = NULL;
- unsigned long pfn = pte_pfn(ptent);
- if (pfn_valid(pfn)) {
- page = pfn_to_page(pfn);
- if (PageReserved(page))
- page = NULL;
+ if (!(vma->vm_flags & VM_RESERVED)) {
+ unsigned long pfn = pte_pfn(ptent);
+ if (unlikely(!pfn_valid(pfn))) {
+ invalid_pfn(ptent, vma->vm_flags, addr);
+ } else {
+ page = pfn_to_page(pfn);
+ if (page == ZERO_PAGE(addr))
+ page = NULL;
+ }
}
if (unlikely(details) && page) {
/*
@@ -584,7 +609,8 @@ static void zap_pte_range(struct mmu_gat
pte_unmap(pte - 1);
}

-static inline void zap_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+static inline void zap_pmd_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
@@ -596,11 +622,12 @@ static inline void zap_pmd_range(struct
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
- zap_pte_range(tlb, pmd, addr, next, details);
+ zap_pte_range(tlb, vma, pmd, addr, next, details);
} while (pmd++, addr = next, addr != end);
}

-static inline void zap_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+static inline void zap_pud_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pgd_t *pgd,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
@@ -612,7 +639,7 @@ static inline void zap_pud_range(struct
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- zap_pmd_range(tlb, pud, addr, next, details);
+ zap_pmd_range(tlb, vma, pud, addr, next, details);
} while (pud++, addr = next, addr != end);
}

@@ -633,7 +660,7 @@ static void unmap_page_range(struct mmu_
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- zap_pud_range(tlb, pgd, addr, next, details);
+ zap_pud_range(tlb, vma, pgd, addr, next, details);
} while (pgd++, addr = next, addr != end);
tlb_end_vma(tlb, vma);
}
@@ -933,7 +960,7 @@ int get_user_pages(struct task_struct *t
continue;
}

- if (!vma || (vma->vm_flags & VM_IO)
+ if (!vma || (vma->vm_flags & (VM_IO | VM_RESERVED))
|| !(flags & vma->vm_flags))
return i ? : -EFAULT;

@@ -993,8 +1020,7 @@ int get_user_pages(struct task_struct *t
if (pages) {
pages[i] = page;
flush_dcache_page(page);
- if (!PageReserved(page))
- page_cache_get(page);
+ page_cache_get(page);
}
if (vmas)
vmas[i] = vma;
@@ -1098,8 +1124,7 @@ static int remap_pte_range(struct mm_str
return -ENOMEM;
do {
BUG_ON(!pte_none(*pte));
- if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn)))
- set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
+ set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap(pte - 1);
@@ -1239,6 +1264,8 @@ static int do_wp_page(struct mm_struct *
pte_t entry;
int ret;

+ BUG_ON(vma->vm_flags & VM_RESERVED);
+
if (unlikely(!pfn_valid(pfn))) {
/*
* This should really halt the system so it can be debugged or
@@ -1246,9 +1273,8 @@ static int do_wp_page(struct mm_struct *
* data, but for the moment just pretend this is OOM.
*/
pte_unmap(page_table);
- printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
- address);
spin_unlock(&mm->page_table_lock);
+ invalid_pfn(pte, vma->vm_flags, address);
return VM_FAULT_OOM;
}
old_page = pfn_to_page(pfn);
@@ -1273,13 +1299,16 @@ static int do_wp_page(struct mm_struct *
/*
* Ok, we need to copy. Oh, well..
*/
- if (!PageReserved(old_page))
+ if (old_page == ZERO_PAGE(address))
+ old_page = NULL;
+ else
page_cache_get(old_page);
+
spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
- if (old_page == ZERO_PAGE(address)) {
+ if (old_page == NULL) {
new_page = alloc_zeroed_user_highpage(vma, address);
if (!new_page)
goto no_new_page;
@@ -1296,12 +1325,13 @@ static int do_wp_page(struct mm_struct *
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
if (likely(pte_same(*page_table, pte))) {
- if (PageAnon(old_page))
- dec_mm_counter(mm, anon_rss);
- if (PageReserved(old_page))
+ if (old_page == NULL)
inc_mm_counter(mm, rss);
- else
+ else {
page_remove_rmap(old_page);
+ if (PageAnon(old_page))
+ dec_mm_counter(mm, anon_rss);
+ }
flush_cache_page(vma, address, pfn);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1312,13 +1342,16 @@ static int do_wp_page(struct mm_struct *
ret |= VM_FAULT_WRITE;
}
pte_unmap(page_table);
- page_cache_release(new_page);
- page_cache_release(old_page);
+ if (old_page) {
+ page_cache_release(new_page);
+ page_cache_release(old_page);
+ }
spin_unlock(&mm->page_table_lock);
return ret;

no_new_page:
- page_cache_release(old_page);
+ if (old_page)
+ page_cache_release(old_page);
return VM_FAULT_OOM;
}

@@ -1755,7 +1788,7 @@ do_anonymous_page(struct mm_struct *mm,
struct page * page = ZERO_PAGE(addr);

/* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));

/* ..except if it's a write access */
if (write_access) {
@@ -1894,9 +1927,6 @@ retry:
*/
/* Only go through if we didn't race with anybody else... */
if (pte_none(*page_table)) {
- if (!PageReserved(new_page))
- inc_mm_counter(mm, rss);
-
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
@@ -1905,8 +1935,10 @@ retry:
if (anon) {
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);
- } else
+ } else if (!(vma->vm_flags & VM_RESERVED)) {
page_add_file_rmap(new_page);
+ inc_mm_counter(mm, rss);
+ }
pte_unmap(page_table);
} else {
/* One of our sibling threads was faster, back out. */
@@ -2213,7 +2245,7 @@ void update_mem_hiwater(struct task_stru
#if !defined(__HAVE_ARCH_GATE_AREA)

#if defined(AT_SYSINFO_EHDR)
-struct vm_area_struct gate_vma;
+static struct vm_area_struct gate_vma;

static int __init gate_vma_init(void)
{
@@ -2221,7 +2253,7 @@ static int __init gate_vma_init(void)
gate_vma.vm_start = FIXADDR_USER_START;
gate_vma.vm_end = FIXADDR_USER_END;
gate_vma.vm_page_prot = PAGE_READONLY;
- gate_vma.vm_flags = 0;
+ gate_vma.vm_flags = VM_RESERVED;
return 0;
}
__initcall(gate_vma_init);
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -442,22 +442,17 @@ int page_referenced(struct page *page, i
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
- struct anon_vma *anon_vma = vma->anon_vma;
- pgoff_t index;
-
- BUG_ON(PageReserved(page));
- BUG_ON(!anon_vma);
-
inc_mm_counter(vma->vm_mm, anon_rss);

- anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
- index = (address - vma->vm_start) >> PAGE_SHIFT;
- index += vma->vm_pgoff;
- index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
-
if (atomic_inc_and_test(&page->_mapcount)) {
- page->index = index;
+ struct anon_vma *anon_vma = vma->anon_vma;
+
+ BUG_ON(!anon_vma);
+ anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
+
+ page->index = linear_page_index(vma, address);
+
inc_page_state(nr_mapped);
}
/* else checking page index and mapping is racy */
@@ -472,8 +467,7 @@ void page_add_anon_rmap(struct page *pag
void page_add_file_rmap(struct page *page)
{
BUG_ON(PageAnon(page));
- if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
- return;
+ BUG_ON(!pfn_valid(page_to_pfn(page)));

if (atomic_inc_and_test(&page->_mapcount))
inc_page_state(nr_mapped);
@@ -487,8 +481,6 @@ void page_add_file_rmap(struct page *pag
*/
void page_remove_rmap(struct page *page)
{
- BUG_ON(PageReserved(page));
-
if (atomic_add_negative(-1, &page->_mapcount)) {
BUG_ON(page_mapcount(page) < 0);
/*
@@ -532,6 +524,8 @@ static int try_to_unmap_one(struct page
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
+ *
+ * Pages belonging to VM_RESERVED regions should not happen here.
*/
if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
ptep_clear_flush_young(vma, address, pte)) {
@@ -644,13 +638,13 @@ static void try_to_unmap_cluster(unsigne
continue;

pfn = pte_pfn(*pte);
- if (!pfn_valid(pfn))
+ if (unlikely(!pfn_valid(pfn))) {
+ invalid_pfn(*pte, vma->vm_flags, address);
continue;
+ }

page = pfn_to_page(pfn);
BUG_ON(PageAnon(page));
- if (PageReserved(page))
- continue;

if (ptep_clear_flush_young(vma, address, pte))
continue;
@@ -813,7 +807,6 @@ int try_to_unmap(struct page *page)
{
int ret;

- BUG_ON(PageReserved(page));
BUG_ON(!PageLocked(page));

if (PageAnon(page))
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -113,7 +113,8 @@ static void bad_page(const char *functio
1 << PG_reclaim |
1 << PG_slab |
1 << PG_swapcache |
- 1 << PG_writeback);
+ 1 << PG_writeback |
+ 1 << PG_reserved );
set_page_count(page, 0);
reset_page_mapcount(page);
page->mapping = NULL;
@@ -243,7 +244,6 @@ static inline int page_is_buddy(struct p
{
if (PagePrivate(page) &&
(page_order(page) == order) &&
- !PageReserved(page) &&
page_count(page) == 0)
return 1;
return 0;
@@ -326,10 +326,11 @@ static inline void free_pages_check(cons
1 << PG_reclaim |
1 << PG_slab |
1 << PG_swapcache |
- 1 << PG_writeback )))
+ 1 << PG_writeback |
+ 1 << PG_reserved )))
bad_page(function, page);
if (PageDirty(page))
- ClearPageDirty(page);
+ __ClearPageDirty(page);
}

/*
@@ -454,7 +455,8 @@ static void prep_new_page(struct page *p
1 << PG_reclaim |
1 << PG_slab |
1 << PG_swapcache |
- 1 << PG_writeback )))
+ 1 << PG_writeback |
+ 1 << PG_reserved )))
bad_page(__FUNCTION__, page);

page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
@@ -1011,7 +1013,7 @@ void __pagevec_free(struct pagevec *pvec

fastcall void __free_pages(struct page *page, unsigned int order)
{
- if (!PageReserved(page) && put_page_testzero(page)) {
+ if (put_page_testzero(page)) {
if (order == 0)
free_hot_page(page);
else
@@ -1653,7 +1655,7 @@ void __init memmap_init_zone(unsigned lo
continue;
page = pfn_to_page(pfn);
set_page_links(page, zone, nid, pfn);
- set_page_count(page, 0);
+ set_page_count(page, 1);
reset_page_mapcount(page);
SetPageReserved(page);
INIT_LIST_HEAD(&page->lru);
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -194,6 +194,7 @@ extern void __mod_page_state(unsigned lo
#define SetPageDirty(page) set_bit(PG_dirty, &(page)->flags)
#define TestSetPageDirty(page) test_and_set_bit(PG_dirty, &(page)->flags)
#define ClearPageDirty(page) clear_bit(PG_dirty, &(page)->flags)
+#define __ClearPageDirty(page) __clear_bit(PG_dirty, &(page)->flags)
#define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags)

#define SetPageLRU(page) set_bit(PG_lru, &(page)->flags)
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -141,6 +141,12 @@ move_one_page(struct vm_area_struct *vma
if (dst) {
pte_t pte;
pte = ptep_clear_flush(vma, old_addr, src);
+
+ /*
+ * ZERO_PAGE can be dependent on virtual
+ * address on some architectures.
+ */
+ pte = move_zero_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
set_pte_at(mm, new_addr, dst, pte);
} else
error = -ENOMEM;
Index: linux-2.6/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-generic/pgtable.h
+++ linux-2.6/include/asm-generic/pgtable.h
@@ -142,6 +142,18 @@ static inline void ptep_set_wrprotect(st
#define lazy_mmu_prot_update(pte) do { } while (0)
#endif

+#ifndef __HAVE_ARCH_MULTIPLE_ZERO_PAGE
+#define move_zero_pte(pte, prot, old_addr, new_addr) (pte)
+#else
+#define move_zero_pte(pte, prot, old_addr, new_addr) \
+({ \
+ pte_t newpte = (pte); \
+ if (pfn_valid(pte_pfn(pte)) && pte_page(pte) == ZERO_PAGE(old_addr)) \
+ newpte = mk_pte(ZERO_PAGE(new_addr), (prot)); \
+ newpte; \
+})
+#endif
+
/*
* When walking page tables, get the address of the next boundary,
* or the end address of the range if that comes earlier. Although no
Index: linux-2.6/include/asm-mips/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-mips/pgtable.h
+++ linux-2.6/include/asm-mips/pgtable.h
@@ -68,6 +68,8 @@ extern unsigned long zero_page_mask;
#define ZERO_PAGE(vaddr) \
(virt_to_page(empty_zero_page + (((unsigned long)(vaddr)) & zero_page_mask)))

+#define __HAVE_ARCH_MULTIPLE_ZERO_PAGE
+
extern void paging_init(void);

/*
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -156,7 +156,8 @@ extern unsigned int kobjsize(const void

#define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */
#define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
-#define VM_RESERVED 0x00080000 /* Don't unmap it from swap_out */
+#define VM_RESERVED 0x00080000 /* Pages and ptes in region aren't managed with regular pagecache or rmap routines */
+
#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
@@ -337,7 +338,7 @@ static inline void get_page(struct page

static inline void put_page(struct page *page)
{
- if (!PageReserved(page) && put_page_testzero(page))
+ if (put_page_testzero(page))
__page_cache_release(page);
}

@@ -723,6 +724,9 @@ void install_arg_page(struct vm_area_str

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
+#define invalid_pfn(pte, vm_flags, vaddr) \
+ __invalid_pfn(__FUNCTION__, pte, vm_flags, vaddr)
+void __invalid_pfn(const char *, pte_t, unsigned long, unsigned long);

int __set_page_dirty_buffers(struct page *page);
int __set_page_dirty_nobuffers(struct page *page);
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -123,7 +123,7 @@ static long madvise_dontneed(struct vm_a
unsigned long start, unsigned long end)
{
*prev = vma;
- if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma))
+ if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) || is_vm_hugetlb_page(vma))
return -EINVAL;

if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -48,7 +48,7 @@ void put_page(struct page *page)
}
return;
}
- if (!PageReserved(page) && put_page_testzero(page))
+ if (put_page_testzero(page))
__page_cache_release(page);
}
EXPORT_SYMBOL(put_page);
@@ -215,7 +215,7 @@ void release_pages(struct page **pages,
struct page *page = pages[i];
struct zone *pagezone;

- if (PageReserved(page) || !put_page_testzero(page))
+ if (!put_page_testzero(page))
continue;

pagezone = page_zone(page);
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str
return;
if (pte_present(pte)) {
unsigned long pfn = pte_pfn(pte);
+ struct page *page;

flush_cache_page(vma, addr, pfn);
pte = ptep_clear_flush(vma, addr, ptep);
- if (pfn_valid(pfn)) {
- struct page *page = pfn_to_page(pfn);
- if (!PageReserved(page)) {
- if (pte_dirty(pte))
- set_page_dirty(page);
- page_remove_rmap(page);
- page_cache_release(page);
- dec_mm_counter(mm, rss);
- }
+ if (unlikely(!pfn_valid(pfn))) {
+ invalid_pfn(pte, vma->vm_flags, addr);
+ return;
+ }
+ page = pfn_to_page(pfn);
+ if (page != ZERO_PAGE(addr)) {
+ if (pte_dirty(pte))
+ set_page_dirty(page);
+ page_remove_rmap(page);
+ dec_mm_counter(mm, rss);
+ page_cache_release(page);
}
} else {
if (!pte_file(pte))
@@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s
pgd_t *pgd;
pte_t pte_val;

+ BUG_ON(vma->vm_flags & VM_RESERVED);
+
pgd = pgd_offset(mm, addr);
spin_lock(&mm->page_table_lock);

@@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m
pgd_t *pgd;
pte_t pte_val;

+ BUG_ON(vma->vm_flags & VM_RESERVED);
+
pgd = pgd_offset(mm, addr);
spin_lock(&mm->page_table_lock);

Index: linux-2.6/mm/msync.c
===================================================================
--- linux-2.6.orig/mm/msync.c
+++ linux-2.6/mm/msync.c
@@ -37,11 +37,11 @@ static void sync_pte_range(struct vm_are
if (!pte_maybe_dirty(*pte))
continue;
pfn = pte_pfn(*pte);
- if (!pfn_valid(pfn))
+ if (unlikely(!pfn_valid(pfn))) {
+ invalid_pfn(*pte, vma->vm_flags, addr);
continue;
+ }
page = pfn_to_page(pfn);
- if (PageReserved(page))
- continue;

if (ptep_clear_flush_dirty(vma, addr, pte) ||
page_test_and_clear_dirty(page))
@@ -149,6 +149,9 @@ static int msync_interval(struct vm_area
if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
return -EBUSY;

+ if (vma->vm_flags & VM_RESERVED)
+ return -EINVAL;
+
if (file && (vma->vm_flags & VM_SHARED)) {
filemap_sync(vma, addr, end);

Index: linux-2.6/drivers/scsi/sg.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sg.c
+++ linux-2.6/drivers/scsi/sg.c
@@ -1887,13 +1887,17 @@ st_unmap_user_pages(struct scatterlist *
int i;

for (i=0; i < nr_pages; i++) {
- if (dirtied && !PageReserved(sgl[i].page))
- SetPageDirty(sgl[i].page);
- /* unlock_page(sgl[i].page); */
+ struct page *page = sgl[i].page;
+
+ /* XXX: just for debug. Remove when PageReserved is removed */
+ BUG_ON(PageReserved(page));
+ if (dirtied)
+ SetPageDirty(page);
+ /* unlock_page(page); */
/* FIXME: cache flush missing for rw==READ
* FIXME: call the correct reference counting function
*/
- page_cache_release(sgl[i].page);
+ page_cache_release(page);
}

return 0;
Index: linux-2.6/drivers/scsi/st.c
===================================================================
--- linux-2.6.orig/drivers/scsi/st.c
+++ linux-2.6/drivers/scsi/st.c
@@ -4431,12 +4431,16 @@ static int sgl_unmap_user_pages(struct s
int i;

for (i=0; i < nr_pages; i++) {
- if (dirtied && !PageReserved(sgl[i].page))
- SetPageDirty(sgl[i].page);
+ struct page *page = sgl[i].page;
+
+ /* XXX: just for debug. Remove when PageReserved is removed */
+ BUG_ON(PageReserved(page));
+ if (dirtied)
+ SetPageDirty(page);
/* FIXME: cache flush missing for rw==READ
* FIXME: call the correct reference counting function
*/
- page_cache_release(sgl[i].page);
+ page_cache_release(page);
}

return 0;
Index: linux-2.6/sound/core/pcm_native.c
===================================================================
--- linux-2.6.orig/sound/core/pcm_native.c
+++ linux-2.6/sound/core/pcm_native.c
@@ -2944,8 +2944,7 @@ static struct page * snd_pcm_mmap_status
return NOPAGE_OOM;
runtime = substream->runtime;
page = virt_to_page(runtime->status);
- if (!PageReserved(page))
- get_page(page);
+ get_page(page);
if (type)
*type = VM_FAULT_MINOR;
return page;
@@ -2987,8 +2986,7 @@ static struct page * snd_pcm_mmap_contro
return NOPAGE_OOM;
runtime = substream->runtime;
page = virt_to_page(runtime->control);
- if (!PageReserved(page))
- get_page(page);
+ get_page(page);
if (type)
*type = VM_FAULT_MINOR;
return page;
@@ -3061,8 +3059,7 @@ static struct page *snd_pcm_mmap_data_no
vaddr = runtime->dma_area + offset;
page = virt_to_page(vaddr);
}
- if (!PageReserved(page))
- get_page(page);
+ get_page(page);
if (type)
*type = VM_FAULT_MINOR;
return page;
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1077,6 +1077,17 @@ munmap_back:
error = file->f_op->mmap(file, vma);
if (error)
goto unmap_and_free_vma;
+ if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
+ == (VM_WRITE | VM_RESERVED)) {
+ printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
+ "PROT_WRITE mmap of VM_RESERVED memory, which "
+ "is deprecated. Please report this to "
+ "[email protected]\n",current->comm);
+ if (vma->vm_ops && vma->vm_ops->close)
+ vma->vm_ops->close(vma);
+ error = -EACCES;
+ goto unmap_and_free_vma;
+ }
} else if (vm_flags & VM_SHARED) {
error = shmem_zero_setup(vma);
if (error)
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c
+++ linux-2.6/mm/mprotect.c
@@ -131,6 +131,14 @@ mprotect_fixup(struct vm_area_struct *vm
return -ENOMEM;
newflags |= VM_ACCOUNT;
}
+ if (oldflags & VM_RESERVED) {
+ BUG_ON(oldflags & VM_WRITE);
+ printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
+ "PROT_WRITE mprotect of VM_RESERVED memory, "
+ "which is deprecated. Please report this to "
+ "[email protected]\n",current->comm);
+ return -EACCES;
+ }
}

newprot = protection_map[newflags & 0xf];
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -297,6 +297,7 @@ static unsigned long __init free_all_boo
if (j + 16 < BITS_PER_LONG)
prefetchw(page + j + 16);
__ClearPageReserved(page + j);
+ set_page_count(page + j, 0);
}
__free_pages(page, order);
i += BITS_PER_LONG;
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c
+++ linux-2.6/mm/mempolicy.c
@@ -253,8 +253,10 @@ static int check_pte_range(struct mm_str
if (!pte_present(*pte))
continue;
pfn = pte_pfn(*pte);
- if (!pfn_valid(pfn))
+ if (unlikely(!pfn_valid(pfn))) {
+ invalid_pfn(*pte, -1UL, addr);
continue;
+ }
nid = pfn_to_nid(pfn);
if (!test_bit(nid, nodes))
break;
@@ -326,6 +328,8 @@ check_range(struct mm_struct *mm, unsign
first = find_vma(mm, start);
if (!first)
return ERR_PTR(-EFAULT);
+ if (first->vm_flags & VM_RESERVED)
+ return ERR_PTR(-EACCES);
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
if (!vma->vm_next && vma->vm_end < end)
Index: linux-2.6/arch/ppc64/kernel/vdso.c
===================================================================
--- linux-2.6.orig/arch/ppc64/kernel/vdso.c
+++ linux-2.6/arch/ppc64/kernel/vdso.c
@@ -176,13 +176,13 @@ static struct page * vdso_vma_nopage(str
return NOPAGE_SIGBUS;

/*
- * Last page is systemcfg, special handling here, no get_page() a
- * this is a reserved page
+ * Last page is systemcfg.
*/
if ((vma->vm_end - address) <= PAGE_SIZE)
- return virt_to_page(systemcfg);
+ pg = virt_to_page(systemcfg);
+ else
+ pg = virt_to_page(vbase + offset);

- pg = virt_to_page(vbase + offset);
get_page(pg);
DBG(" ->page count: %d\n", page_count(pg));

@@ -260,7 +260,7 @@ int arch_setup_additional_pages(struct l
* gettimeofday will be totally dead. It's fine to use that for setting
* breakpoints in the vDSO code pages though
*/
- vma->vm_flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+ vma->vm_flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | VM_RESERVED;
vma->vm_flags |= mm->def_flags;
vma->vm_page_prot = protection_map[vma->vm_flags & 0x7];
vma->vm_ops = &vdso_vmops;
@@ -600,6 +600,8 @@ void __init vdso_init(void)
ClearPageReserved(pg);
get_page(pg);
}
+
+ get_page(virt_to_page(systemcfg));
}

int in_gate_area_no_task(unsigned long addr)
Index: linux-2.6/kernel/power/swsusp.c
===================================================================
--- linux-2.6.orig/kernel/power/swsusp.c
+++ linux-2.6/kernel/power/swsusp.c
@@ -434,15 +434,23 @@ static int save_highmem_zone(struct zone
continue;
page = pfn_to_page(pfn);
/*
- * This condition results from rvmalloc() sans vmalloc_32()
- * and architectural memory reservations. This should be
- * corrected eventually when the cases giving rise to this
- * are better understood.
+ * PageReserved results from rvmalloc() sans vmalloc_32()
+ * and architectural memory reservations.
+ *
+ * rvmalloc should not cause this, because all implementations
+ * appear to always be using vmalloc_32 on architectures with
+ * highmem. This is a good thing, because we would like to save
+ * rvmalloc pages.
+ *
+ * It appears to be triggered by pages which do not point to
+ * valid memory (see arch/i386/mm/init.c:one_highpage_init(),
+ * which sets PageReserved if the page does not point to valid
+ * RAM).
+ *
+ * XXX: must remove usage of PageReserved!
*/
- if (PageReserved(page)) {
- printk("highmem reserved page?!\n");
+ if (PageReserved(page))
continue;
- }
BUG_ON(PageNosave(page));
if (PageNosaveFree(page))
continue;
@@ -528,10 +536,9 @@ static int saveable(struct zone * zone,
return 0;

page = pfn_to_page(pfn);
- BUG_ON(PageReserved(page) && PageNosave(page));
if (PageNosave(page))
return 0;
- if (PageReserved(page) && pfn_is_nosave(pfn)) {
+ if (pfn_is_nosave(pfn)) {
pr_debug("[nosave pfn 0x%lx]", pfn);
return 0;
}
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1523,7 +1523,8 @@ static void do_shmem_file_read(struct fi
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;

- page_cache_release(page);
+ if (page != ZERO_PAGE(0))
+ page_cache_release(page);
if (ret != nr || !desc->count)
break;


Attachments:
mm-rpr-rollup.patch (28.98 kB)

2005-09-02 06:29:05

by Nick Piggin

Subject: [PATCH 2.6.13] lockless pagecache 2/7

Introduce an atomic_cmpxchg operation. Implement this for i386 and ppc64.

Signed-off-by: Nick Piggin <[email protected]>
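
For reference, the calling convention the rest of the series assumes
(an illustrative snippet, not part of the patch): atomic_cmpxchg(v, old, new)
stores new only if v currently holds old, and returns the value it found.

	atomic_t v = ATOMIC_INIT(1);
	int old = atomic_read(&v);

	if (atomic_cmpxchg(&v, old, old + 1) == old) {
		/* v went from old to old+1, nobody changed it in between */
	}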

Index: linux-2.6/include/asm-i386/atomic.h
===================================================================
--- linux-2.6.orig/include/asm-i386/atomic.h
+++ linux-2.6/include/asm-i386/atomic.h
@@ -215,6 +215,8 @@ static __inline__ int atomic_sub_return(
return atomic_add_return(-i,v);
}

+#define atomic_cmpxchg(v, old, new) ((int)cmpxchg(&((v)->counter), old, new))
+
#define atomic_inc_return(v) (atomic_add_return(1,v))
#define atomic_dec_return(v) (atomic_sub_return(1,v))

Index: linux-2.6/include/asm-ppc64/atomic.h
===================================================================
--- linux-2.6.orig/include/asm-ppc64/atomic.h
+++ linux-2.6/include/asm-ppc64/atomic.h
@@ -162,6 +162,8 @@ static __inline__ int atomic_dec_return(
return t;
}

+#define atomic_cmpxchg(v, o, n) ((int)cmpxchg(&((v)->counter), (o), (n)))
+
#define atomic_sub_and_test(a, v) (atomic_sub_return((a), (v)) == 0)
#define atomic_dec_and_test(v) (atomic_dec_return((v)) == 0)


Attachments:
atomic_cmpxchg.patch (1.08 kB)

2005-09-02 06:30:48

by Nick Piggin

Subject: [PATCH 2.6.13] lockless pagecache 4/7

From: Hans Reiser <[email protected]>

Reiser4 uses radix trees to solve a problem reiser4_readdir has when serving
NFS requests.

Unfortunately, the radix tree API lacks an operation suitable for modifying an
existing entry. This patch adds radix_tree_lookup_slot, which returns a
pointer to the slot holding the found item within the tree. That location can
then be updated.
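
A hedged usage sketch (mapping, index and new_item are illustrative; only
radix_tree_lookup_slot itself comes from this patch):

	void **slot;

	write_lock_irq(&mapping->tree_lock);
	slot = radix_tree_lookup_slot(&mapping->page_tree, index);
	if (slot)
		*slot = new_item;	/* update the existing entry in place */
	write_unlock_irq(&mapping->tree_lock);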

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -46,6 +46,7 @@ do { \

int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
+void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -276,14 +276,8 @@ int radix_tree_insert(struct radix_tree_
}
EXPORT_SYMBOL(radix_tree_insert);

-/**
- * radix_tree_lookup - perform lookup operation on a radix tree
- * @root: radix tree root
- * @index: index key
- *
- * Lookup the item at the position @index in the radix tree @root.
- */
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+static inline void **__lookup_slot(struct radix_tree_root *root,
+ unsigned long index)
{
unsigned int height, shift;
struct radix_tree_node **slot;
@@ -306,7 +300,36 @@ void *radix_tree_lookup(struct radix_tre
height--;
}

- return *slot;
+ return (void **)slot;
+}
+
+/**
+ * radix_tree_lookup_slot - lookup a slot in a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Lookup the slot corresponding to the position @index in the radix tree
+ * @root. This is useful for update-if-exists operations.
+ */
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
+{
+ return __lookup_slot(root, index);
+}
+EXPORT_SYMBOL(radix_tree_lookup_slot);
+
+/**
+ * radix_tree_lookup - perform lookup operation on a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ */
+void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+{
+ void **slot;
+
+ slot = __lookup_slot(root, index);
+ return slot != NULL ? *slot : NULL;
}
EXPORT_SYMBOL(radix_tree_lookup);


Attachments:
radix-tree-lookup_slot.patch (2.57 kB)

2005-09-02 06:29:57

by Nick Piggin

Subject: [PATCH 2.6.13] lockless pagecache 3/7

If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to
the page.

This can be done if get_page/put_page behave in the same manner
throughout the whole tree (i.e. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).

This needs an atomic_cmpxchg operation to ensure we don't try to
grab a free page.

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -76,6 +76,8 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */

+#define PG_freeing 20 /* Pagecache is about to be freed */
+
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
* allowed.
@@ -306,6 +308,11 @@ extern void __mod_page_state(unsigned lo
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)

+#define PageFreeing(page) test_bit(PG_freeing, &(page)->flags)
+#define SetPageFreeing(page) set_bit(PG_freeing, &(page)->flags)
+#define ClearPageFreeing(page) clear_bit(PG_freeing, &(page)->flags)
+#define __ClearPageFreeing(page) __clear_bit(PG_freeing, &(page)->flags)
+
struct page; /* forward declaration */

int test_clear_page_dirty(struct page *page);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -50,6 +50,36 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);

+static inline struct page *page_cache_get_speculative(struct page **pagep)
+{
+ struct page *page;
+ int count;
+
+ page = *pagep;
+ if (!page)
+ return NULL;
+
+ do {
+ count = atomic_read(&page->_count);
+ if (unlikely(count == -1)) /* Picked up a free page. */
+ return NULL;
+ } while (unlikely(atomic_cmpxchg(&page->_count, count, count+1) != count));
+
+ /* Note that atomic_cmpxchg provides a memory barrier */
+
+ if (unlikely(PageFreeing(page) || page != *pagep)) {
+ /*
+ * Picked up a page being freed, or one that is no longer
+ * being pointed to by pagep. Now that we have a reference
+ * to the page, we must do a put_page.
+ */
+ put_page(page);
+ return NULL;
+ }
+
+ return page;
+}
+
static inline struct page *page_cache_alloc(struct address_space *x)
{
return alloc_pages(mapping_gfp_mask(x)|__GFP_NORECLAIM, 0);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -504,6 +504,7 @@ static int shrink_list(struct list_head
if (!mapping)
goto keep_locked; /* truncate got there first */

+ SetPageFreeing(page);
write_lock_irq(&mapping->tree_lock);

/*
@@ -513,6 +514,7 @@ static int shrink_list(struct list_head
*/
if (page_count(page) != 2 || PageDirty(page)) {
write_unlock_irq(&mapping->tree_lock);
+ ClearPageFreeing(page);
goto keep_locked;
}

@@ -533,6 +535,7 @@ static int shrink_list(struct list_head

free_it:
unlock_page(page);
+ ClearPageFreeing(page);
reclaimed++;
if (!pagevec_add(&freed_pvec, page))
__pagevec_release_nonlru(&freed_pvec);
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -289,19 +289,20 @@ static unsigned long __init free_all_boo
int j, order;

page = pfn_to_page(pfn);
+ prefetchw(page);
+
count += BITS_PER_LONG;
- __ClearPageReserved(page);
order = ffs(BITS_PER_LONG) - 1;
- set_page_refs(page, order);
- for (j = 1; j < BITS_PER_LONG; j++) {
- if (j + 16 < BITS_PER_LONG)
- prefetchw(page + j + 16);
+ for (j = 0; j < BITS_PER_LONG; j++) {
+ if (j + 1 < BITS_PER_LONG)
+ prefetchw(page + j + 1);
__ClearPageReserved(page + j);
set_page_count(page + j, 0);
}
+ set_page_refs(page, order);
__free_pages(page, order);
+
i += BITS_PER_LONG;
- page += BITS_PER_LONG;
} else if (v) {
unsigned long m;

@@ -310,6 +311,7 @@ static unsigned long __init free_all_boo
if (v & m) {
count++;
__ClearPageReserved(page);
+ set_page_count(page, 0);
set_page_refs(page, 0);
__free_page(page);
}
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -338,6 +338,7 @@ int remove_exclusive_swap_page(struct pa
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
+ SetPageFreeing(page);
write_lock_irq(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
@@ -345,6 +346,7 @@ int remove_exclusive_swap_page(struct pa
retval = 1;
}
write_unlock_irq(&swapper_space.tree_lock);
+ ClearPageFreeing(page);
}
swap_info_put(p);


Attachments:
mm-speculative-get_page.patch (5.29 kB)

2005-09-02 06:30:57

by Nick Piggin

Subject: [PATCH 2.6.13] lockless pagecache 5/7

Make radix tree lookups safe to be performed without locks.
Readers are protected against nodes being deleted by using RCU
based freeing. Readers are protected against new node insertion
by using memory barriers to ensure the node itself will be
properly written before it is visible in the radix tree.

Also introduce a lockfree gang_lookup_slot which will be used
by a future patch.
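
The ordering this relies on, sketched with illustrative variable names
(the barrier lives in rcu_assign_pointer on the update side):

	/* writer, under the tree lock: initialise the node, then publish it */
	tmp->height = height;
	tmp->count = 1;
	rcu_assign_pointer(*slot, tmp);		/* smp_wmb() before the store */

	/* lockless reader, inside an RCU read-side critical section */
	rcu_read_lock();
	node = root->rnode;
	if (node != NULL)
		height = node->height;		/* sees the fields written above */
	rcu_read_unlock();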

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -29,6 +29,7 @@
#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/bitops.h>
+#include <linux/rcupdate.h>


#ifdef __KERNEL__
@@ -45,7 +46,9 @@
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct radix_tree_node {
+ unsigned int height; /* Height from the bottom */
unsigned int count;
+ struct rcu_head rcu_head;
void *slots[RADIX_TREE_MAP_SIZE];
unsigned long tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
};
@@ -97,10 +100,17 @@ radix_tree_node_alloc(struct radix_tree_
return ret;
}

+static void radix_tree_node_rcu_free(struct rcu_head *head)
+{
+ struct radix_tree_node *node =
+ container_of(head, struct radix_tree_node, rcu_head);
+ kmem_cache_free(radix_tree_node_cachep, node);
+}
+
static inline void
radix_tree_node_free(struct radix_tree_node *node)
{
- kmem_cache_free(radix_tree_node_cachep, node);
+ call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
}

/*
@@ -196,6 +206,7 @@ static int radix_tree_extend(struct radi
}

do {
+ unsigned int newheight;
if (!(node = radix_tree_node_alloc(root)))
return -ENOMEM;

@@ -208,9 +219,11 @@ static int radix_tree_extend(struct radi
tag_set(node, tag, 0);
}

+ newheight = root->height+1;
+ node->height = newheight;
node->count = 1;
- root->rnode = node;
- root->height++;
+ rcu_assign_pointer(root->rnode, node);
+ root->height = newheight;
} while (height > root->height);
out:
return 0;
@@ -250,9 +263,10 @@ int radix_tree_insert(struct radix_tree_
/* Have to add a child node. */
if (!(tmp = radix_tree_node_alloc(root)))
return -ENOMEM;
- *slot = tmp;
+ tmp->height = height;
if (node)
node->count++;
+ rcu_assign_pointer(*slot, tmp);
}

/* Go a level down */
@@ -282,12 +296,14 @@ static inline void **__lookup_slot(struc
unsigned int height, shift;
struct radix_tree_node **slot;

- height = root->height;
+ if (root->rnode == NULL)
+ return NULL;
+ slot = &root->rnode;
+ height = (*slot)->height;
if (index > radix_tree_maxindex(height))
return NULL;

shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = &root->rnode;

while (height > 0) {
if (*slot == NULL)
@@ -491,21 +507,24 @@ EXPORT_SYMBOL(radix_tree_tag_get);
#endif

static unsigned int
-__lookup(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup(struct radix_tree_root *root, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
{
+ unsigned long i;
unsigned int nr_found = 0;
unsigned int shift;
- unsigned int height = root->height;
+ unsigned int height;
struct radix_tree_node *slot;

- shift = (height-1) * RADIX_TREE_MAP_SHIFT;
slot = root->rnode;
+ if (!slot)
+ goto out;
+ height = slot->height;
+ shift = (height-1) * RADIX_TREE_MAP_SHIFT;

- while (height > 0) {
- unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
-
- for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
+ for (;;) {
+ for (i = (index >> shift) & RADIX_TREE_MAP_MASK;
+ i < RADIX_TREE_MAP_SIZE; i++) {
if (slot->slots[i] != NULL)
break;
index &= ~((1UL << shift) - 1);
@@ -516,21 +535,23 @@ __lookup(struct radix_tree_root *root, v
if (i == RADIX_TREE_MAP_SIZE)
goto out;
height--;
- if (height == 0) { /* Bottom level: grab some items */
- unsigned long j = index & RADIX_TREE_MAP_MASK;
-
- for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
- index++;
- if (slot->slots[j]) {
- results[nr_found++] = slot->slots[j];
- if (nr_found == max_items)
- goto out;
- }
- }
+ if (height == 0) {
+ /* Bottom level: grab some items */
+ break;
}
shift -= RADIX_TREE_MAP_SHIFT;
slot = slot->slots[i];
}
+
+ for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
+ index++;
+ if (slot->slots[i]) {
+ results[nr_found++] = &(slot->slots[i]);
+ if (nr_found == max_items)
+ goto out;
+ }
+ }
+
out:
*next_index = index;
return nr_found;
@@ -558,6 +579,43 @@ radix_tree_gang_lookup(struct radix_tree
unsigned int ret = 0;

while (ret < max_items) {
+ unsigned int nr_found, i;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ nr_found = __lookup(root, (void ***)results + ret, cur_index,
+ max_items - ret, &next_index);
+ for (i = 0; i < nr_found; i++)
+ results[ret + i] = *(((void ***)results)[ret + i]);
+ ret += nr_found;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ * radix_tree_gang_lookup_slot - perform multiple lookup on a radix tree
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ *
+ * Same as radix_tree_gang_lookup, but returns an array of pointers
+ * (slots) to the stored items instead of the items themselves.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items)
+{
+ const unsigned long max_index = radix_tree_maxindex(root->height);
+ unsigned long cur_index = first_index;
+ unsigned int ret = 0;
+
+ while (ret < max_items) {
unsigned int nr_found;
unsigned long next_index; /* Index of next search */

@@ -572,7 +630,8 @@ radix_tree_gang_lookup(struct radix_tree
}
return ret;
}
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+

/*
* FIXME: the two tag_get()s here should use find_next_bit() instead of
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -51,6 +51,9 @@ void *radix_tree_delete(struct radix_tre
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items);
int radix_tree_preload(int gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,


Attachments:
radix-tree-lockless-readside.patch (6.66 kB)

2005-09-02 06:31:38

by Nick Piggin

Subject: [PATCH 2.6.13] lockless pagecache 6/7

Use the speculative get_page and the lockless radix tree lookups
to introduce lockless page cache lookups (ie. no mapping->tree_lock).

The only atomicity change this should introduce is the use of a
non-atomic pagevec lookup for truncate; however, whatever atomicity
guarantees there were are probably not very useful anyway.

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -379,18 +379,25 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);

if (error == 0) {
+ page_cache_get(page);
+ __SetPageLocked(page);
+ page->mapping = mapping;
+ page->index = offset;
+
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
- page_cache_get(page);
- SetPageLocked(page);
- page->mapping = mapping;
- page->index = offset;
mapping->nrpages++;
pagecache_acct(1);
}
write_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
+
+ if (error) {
+ page->mapping = NULL;
+ __put_page(page);
+ __ClearPageLocked(page);
+ }
}
return error;
}
@@ -500,13 +507,15 @@ EXPORT_SYMBOL(__lock_page);
*/
struct page * find_get_page(struct address_space *mapping, unsigned long offset)
{
- struct page *page;
+ struct page **pagep;
+ struct page *page = NULL;

- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
- page_cache_get(page);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+ pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree,
+ offset);
+ if (pagep)
+ page = page_cache_get_speculative(pagep);
+ rcu_read_unlock();
return page;
}

@@ -519,12 +528,24 @@ struct page *find_trylock_page(struct ad
{
struct page *page;

- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page && TestSetPageLocked(page))
- page = NULL;
- read_unlock_irq(&mapping->tree_lock);
- return page;
+ page = find_get_page(mapping, offset);
+ if (page) {
+ if (TestSetPageLocked(page))
+ goto out_failed;
+ /* Has the page been truncated before being locked? */
+ if (page->mapping != mapping || page->index != offset) {
+ unlock_page(page);
+ goto out_failed;
+ }
+
+ /* Silly interface requires us to drop the refcount */
+ __put_page(page);
+ return page;
+
+out_failed:
+ page_cache_release(page);
+ }
+ return NULL;
}

EXPORT_SYMBOL(find_trylock_page);
@@ -545,25 +566,17 @@ struct page *find_lock_page(struct addre
{
struct page *page;

- read_lock_irq(&mapping->tree_lock);
repeat:
- page = radix_tree_lookup(&mapping->page_tree, offset);
+ page = find_get_page(mapping, offset);
if (page) {
- page_cache_get(page);
- if (TestSetPageLocked(page)) {
- read_unlock_irq(&mapping->tree_lock);
- lock_page(page);
- read_lock_irq(&mapping->tree_lock);
-
- /* Has the page been truncated while we slept? */
- if (page->mapping != mapping || page->index != offset) {
- unlock_page(page);
- page_cache_release(page);
- goto repeat;
- }
+ lock_page(page);
+ /* Has the page been truncated before being locked? */
+ if (page->mapping != mapping || page->index != offset) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
}
}
- read_unlock_irq(&mapping->tree_lock);
return page;
}

@@ -646,6 +659,32 @@ unsigned find_get_pages(struct address_s
return ret;
}

+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+ unsigned int nr_pages, struct page **pages)
+{
+ unsigned int i;
+ unsigned int nr_found;
+ unsigned int ret;
+
+ /*
+ * We do some unsightly casting to use the array first for storing
+ * pointers to the page pointers, and then for the pointers to
+ * the pages themselves that the caller wants.
+ */
+ rcu_read_lock();
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, start, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+ page = page_cache_get_speculative(((struct page ***)pages)[i]);
+ if (page)
+ pages[ret++] = page;
+ }
+ rcu_read_unlock();
+ return ret;
+}
+
/*
* Like find_get_pages, except we only return pages which are tagged with
* `tag'. We update *index to index the next page for the traversal.
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -272,27 +272,26 @@ __do_page_cache_readahead(struct address
/*
* Preallocate as many pages as we will need.
*/
- read_lock_irq(&mapping->tree_lock);
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
unsigned long page_offset = offset + page_idx;

if (page_offset > end_index)
break;

+ /* Don't need mapping->tree_lock - lookup can be racy */
+ rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
+ rcu_read_unlock();
if (page)
continue;

- read_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
- read_lock_irq(&mapping->tree_lock);
if (!page)
break;
page->index = page_offset;
list_add(&page->lru, &page_pool);
ret++;
}
- read_unlock_irq(&mapping->tree_lock);

/*
* Now start the IO. We ignore I/O errors - if the page is not
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -76,19 +76,26 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ page_cache_get(page);
+ SetPageLocked(page);
+ SetPageSwapCache(page);
+ page->private = entry.val;
+
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
- page_cache_get(page);
- SetPageLocked(page);
- SetPageSwapCache(page);
- page->private = entry.val;
total_swapcache_pages++;
pagecache_acct(1);
}
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
+
+ if (error) {
+ __put_page(page);
+ ClearPageLocked(page);
+ ClearPageSwapCache(page);
+ }
}
return error;
}
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -166,16 +166,13 @@ extern void __mod_page_state(unsigned lo
/*
* Manipulation of page state flags
*/
-#define PageLocked(page) \
- test_bit(PG_locked, &(page)->flags)
-#define SetPageLocked(page) \
- set_bit(PG_locked, &(page)->flags)
-#define TestSetPageLocked(page) \
- test_and_set_bit(PG_locked, &(page)->flags)
-#define ClearPageLocked(page) \
- clear_bit(PG_locked, &(page)->flags)
-#define TestClearPageLocked(page) \
- test_and_clear_bit(PG_locked, &(page)->flags)
+#define PageLocked(page) test_bit(PG_locked, &(page)->flags)
+#define SetPageLocked(page) set_bit(PG_locked, &(page)->flags)
+#define __SetPageLocked(page) __set_bit(PG_locked, &(page)->flags)
+#define TestSetPageLocked(page) test_and_set_bit(PG_locked, &(page)->flags)
+#define ClearPageLocked(page) clear_bit(PG_locked, &(page)->flags)
+#define __ClearPageLocked(page) __clear_bit(PG_locked, &(page)->flags)
+#define TestClearPageLocked(page) test_and_clear_bit(PG_locked, &(page)->flags)

#define PageError(page) test_bit(PG_error, &(page)->flags)
#define SetPageError(page) set_bit(PG_error, &(page)->flags)
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -102,6 +102,8 @@ extern struct page * find_or_create_page
unsigned long index, unsigned int gfp_mask);
unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
unsigned int nr_pages, struct page **pages);
+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+ unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
int tag, unsigned int nr_pages, struct page **pages);

Index: linux-2.6/include/linux/pagevec.h
===================================================================
--- linux-2.6.orig/include/linux/pagevec.h
+++ linux-2.6/include/linux/pagevec.h
@@ -25,6 +25,8 @@ void __pagevec_lru_add_active(struct pag
void pagevec_strip(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+ struct address_space *mapping, pgoff_t start, unsigned nr_pages);
unsigned pagevec_lookup_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, int tag,
unsigned nr_pages);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -380,6 +380,19 @@ unsigned pagevec_lookup(struct pagevec *
return pagevec_count(pvec);
}

+/**
+ * pagevec_lookup_nonatomic - non-atomic pagevec_lookup
+ *
+ * This routine is non-atomic in that the pages returned may no longer be in the mapping.
+ */
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+ struct address_space *mapping, pgoff_t start, unsigned nr_pages)
+{
+ pvec->nr = find_get_pages_nonatomic(mapping, start,
+ nr_pages, pvec->pages);
+ return pagevec_count(pvec);
+}
+
unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
pgoff_t *index, int tag, unsigned nr_pages)
{
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -126,7 +126,7 @@ void truncate_inode_pages(struct address

pagevec_init(&pvec, 0);
next = start;
- while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ while (pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index = page->index;
@@ -160,7 +160,7 @@ void truncate_inode_pages(struct address
next = start;
for ( ; ; ) {
cond_resched();
- if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ if (!pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
if (next == start)
break;
next = start;
@@ -206,7 +206,7 @@ unsigned long invalidate_mapping_pages(s

pagevec_init(&pvec, 0);
while (next <= end &&
- pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -811,6 +811,7 @@ int mapping_tagged(struct address_space
unsigned long flags;
int ret;

+ /* XXX: radix_tree_tagged is safe to run without the lock? */
read_lock_irqsave(&mapping->tree_lock, flags);
ret = radix_tree_tagged(&mapping->page_tree, tag);
read_unlock_irqrestore(&mapping->tree_lock, flags);


Attachments:
mm-lockless-pagecache-lookups.patch (11.15 kB)

2005-09-02 06:32:15

by Nick Piggin

Subject: [PATCH 2.6.13] lockless pagecache 7/7

With practically all the read locks gone from mapping->tree_lock,
convert the lock from an rwlock back to a spinlock.

The remaining locking, including the remaining read locks, mainly
deals with IO submission rather than the lookup fastpaths.

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -859,7 +859,7 @@ int __set_page_dirty_buffers(struct page
spin_unlock(&mapping->private_lock);

if (!TestSetPageDirty(page)) {
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping))
inc_page_state(nr_dirty);
@@ -867,7 +867,7 @@ int __set_page_dirty_buffers(struct page
page_index(page),
PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -195,7 +195,7 @@ void inode_init_once(struct inode *inode
sema_init(&inode->i_sem, 1);
init_rwsem(&inode->i_alloc_sem);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
- rwlock_init(&inode->i_data.tree_lock);
+ spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -339,7 +339,7 @@ struct backing_dev_info;
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
- rwlock_t tree_lock; /* and rwlock protecting it */
+ spinlock_t tree_lock; /* and lock protecting it */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -121,9 +121,9 @@ void remove_from_page_cache(struct page

BUG_ON(!PageLocked(page));

- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}

static int sync_page(void *word)
@@ -384,13 +384,13 @@ int add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = offset;

- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
mapping->nrpages++;
pagecache_acct(1);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();

if (error) {
@@ -650,12 +650,12 @@ unsigned find_get_pages(struct address_s
unsigned int i;
unsigned int ret;

- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup(&mapping->page_tree,
(void **)pages, start, nr_pages);
for (i = 0; i < ret; i++)
page_cache_get(pages[i]);
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return ret;
}

@@ -695,14 +695,14 @@ unsigned find_get_pages_tag(struct addre
unsigned int i;
unsigned int ret;

- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
(void **)pages, *index, nr_pages, tag);
for (i = 0; i < ret; i++)
page_cache_get(pages[i]);
if (ret)
*index = pages[ret - 1]->index + 1;
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return ret;
}

Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -35,7 +35,7 @@ static struct backing_dev_info swap_back

struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
- .tree_lock = RW_LOCK_UNLOCKED,
+ .tree_lock = SPIN_LOCK_UNLOCKED,
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
@@ -81,14 +81,14 @@ static int __add_to_swap_cache(struct pa
SetPageSwapCache(page);
page->private = entry.val;

- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
total_swapcache_pages++;
pagecache_acct(1);
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();

if (error) {
@@ -210,9 +210,9 @@ void delete_from_swap_cache(struct page

entry.val = page->private;

- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);

swap_free(entry);
page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -339,13 +339,13 @@ int remove_exclusive_swap_page(struct pa
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
SetPageFreeing(page);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
if ((page_count(page) == 2) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
ClearPageFreeing(page);
}
swap_info_put(p);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -76,15 +76,15 @@ invalidate_complete_page(struct address_
if (PagePrivate(page) && !try_to_release_page(page, 0))
return 0;

- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (PageDirty(page)) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}

BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -505,7 +505,7 @@ static int shrink_list(struct list_head
goto keep_locked; /* truncate got there first */

SetPageFreeing(page);
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);

/*
* The non-racy check for busy page. It is critical to check
@@ -513,7 +513,7 @@ static int shrink_list(struct list_head
* not in use by anybody. (pagecache + us == 2)
*/
if (page_count(page) != 2 || PageDirty(page)) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageFreeing(page);
goto keep_locked;
}
@@ -522,7 +522,7 @@ static int shrink_list(struct list_head
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page->private };
__delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
swap_free(swap);
__put_page(page); /* The pagecache ref */
goto free_it;
@@ -530,7 +530,7 @@ static int shrink_list(struct list_head
#endif /* CONFIG_SWAP */

__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__put_page(page);

free_it:
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -623,7 +623,7 @@ int __set_page_dirty_nobuffers(struct pa
struct address_space *mapping2;

if (mapping) {
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
@@ -632,7 +632,7 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host,
@@ -707,17 +707,17 @@ int test_clear_page_dirty(struct page *p
unsigned long flags;

if (mapping) {
- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
if (TestClearPageDirty(page)) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
if (mapping_cap_account_dirty(mapping))
dec_page_state(nr_dirty);
return 1;
}
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
}
return TestClearPageDirty(page);
@@ -762,13 +762,13 @@ int test_clear_page_writeback(struct pag
if (mapping) {
unsigned long flags;

- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
if (ret)
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
}
@@ -783,7 +783,7 @@ int test_set_page_writeback(struct page
if (mapping) {
unsigned long flags;

- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret)
radix_tree_tag_set(&mapping->page_tree,
@@ -793,7 +793,7 @@ int test_set_page_writeback(struct page
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestSetPageWriteback(page);
}
@@ -811,10 +811,10 @@ int mapping_tagged(struct address_space
unsigned long flags;
int ret;

- /* XXX: radix_tree_tagged is safe to run without the lock? */
- read_lock_irqsave(&mapping->tree_lock, flags);
+ /* XXX: radix_tree_tagged is safe to run without the lock */
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = radix_tree_tagged(&mapping->page_tree, tag);
- read_unlock_irqrestore(&mapping->tree_lock, flags);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
return ret;
}
EXPORT_SYMBOL(mapping_tagged);
Index: linux-2.6/drivers/mtd/devices/block2mtd.c
===================================================================
--- linux-2.6.orig/drivers/mtd/devices/block2mtd.c
+++ linux-2.6/drivers/mtd/devices/block2mtd.c
@@ -58,7 +58,7 @@ void cache_readahead(struct address_spac

end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);

- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
for (i = 0; i < PAGE_READAHEAD; i++) {
pagei = index + i;
if (pagei > end_index) {
@@ -70,16 +70,16 @@ void cache_readahead(struct address_spac
break;
if (page)
continue;
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (!page)
break;
page->index = pagei;
list_add(&page->lru, &page_pool);
ret++;
}
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
if (ret)
read_cache_pages(mapping, &page_pool, filler, NULL);
}
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h
+++ linux-2.6/include/asm-arm/cacheflush.h
@@ -315,9 +315,9 @@ flush_cache_page(struct vm_area_struct *
extern void flush_dcache_page(struct page *);

#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)

#define flush_icache_user_range(vma,page,addr,len) \
flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h
+++ linux-2.6/include/asm-parisc/cacheflush.h
@@ -57,9 +57,9 @@ flush_user_icache_range(unsigned long st
extern void flush_dcache_page(struct page *page);

#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)

#define flush_icache_page(vma,page) do { flush_kernel_dcache_page(page_address(page)); flush_kernel_icache_page(page_address(page)); } while (0)


Attachments:
mm-spinlock-tree_lock.patch (13.59 kB)

2005-09-02 06:45:07

by Nick Piggin

[permalink] [raw]
Subject: Re: New lockless pagecache

Nick Piggin wrote:

> I think this is getting pretty stable. No guarantees of course,
> but it would be great if anyone gave it a test.
>

Or review, I might add. While I understand such a review is
still quite difficult, this code really is far less complex
than the previous lockless pagecache patches.

(Ignore 1/7 though, which is a rollup - a broken out patchset
can be provided on request)

Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-02 12:44:00

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

On Gwe, 2005-09-02 at 16:29 +1000, Nick Piggin wrote:
> 2/7
> Implement atomic_cmpxchg for i386 and ppc64. Is there any
> architecture that won't be able to implement such an operation?

i386, sun4c, ....

Yeah quite a few. I suspect most MIPS also would have a problem in this
area.

2005-09-02 18:27:06

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

On Fri, 2 Sep 2005, Nick Piggin wrote:

> Implement atomic_cmpxchg for i386 and ppc64. Is there any
> architecture that won't be able to implement such an operation?

Something like that used to be part of the page fault scalability
patchset. You contributed to it last year. Here is the latest version of
that. May need some work though.

Changelog
* Make cmpxchg and cmpxchg8b generally available on the i386
platform.
* Provide emulation of cmpxchg suitable for uniprocessor if
build and run on 386.
* Provide emulation of cmpxchg8b suitable for uniprocessor
systems if build and run on 386 or 486.
* Provide an inline function to atomically get a 64 bit value
via cmpxchg8b in an SMP system (courtesy of Nick Piggin)
(important for i386 PAE mode and other places where atomic
64 bit operations are useful)

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/arch/i386/Kconfig
===================================================================
--- linux-2.6.9.orig/arch/i386/Kconfig 2004-12-10 09:58:03.000000000 -0800
+++ linux-2.6.9/arch/i386/Kconfig 2004-12-10 09:59:27.000000000 -0800
@@ -351,6 +351,11 @@
depends on !M386
default y

+config X86_CMPXCHG8B
+ bool
+ depends on !M386 && !M486
+ default y
+
config X86_XADD
bool
depends on !M386
Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-12-06 17:23:49.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-12-10 09:59:27.000000000 -0800
@@ -6,6 +6,7 @@
#include <linux/bitops.h>
#include <linux/smp.h>
#include <linux/thread_info.h>
+#include <linux/module.h>

#include <asm/processor.h>
#include <asm/msr.h>
@@ -287,5 +288,103 @@
return 0;
}

+#ifndef CONFIG_X86_CMPXCHG
+unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new)
+{
+ u8 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u8));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u8 *)ptr;
+ if (prev == old)
+ *(u8 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u8);
+
+unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new)
+{
+ u16 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u16));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u16 *)ptr;
+ if (prev == old)
+ *(u16 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u16);
+
+unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new)
+{
+ u32 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u32));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u32 *)ptr;
+ if (prev == old)
+ *(u32 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u32);
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ unsigned long flags;
+
+ /*
+ * Check if the kernel was compiled for an old cpu but
+ * we are running really on a cpu capable of cmpxchg8b
+ */
+
+ if (cpu_has(cpu_data, X86_FEATURE_CX8))
+ return __cmpxchg8b(ptr, old, newv);
+
+ /* Poor mans cmpxchg8b for 386 and 486. Not suitable for SMP */
+ local_irq_save(flags);
+ prev = *ptr;
+ if (prev == old)
+ *ptr = newv;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg8b_486);
+#endif
+
// arch_initcall(intel_cpu_init);

Index: linux-2.6.9/include/asm-i386/system.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/system.h 2004-12-06 17:23:55.000000000 -0800
+++ linux-2.6.9/include/asm-i386/system.h 2004-12-10 10:00:49.000000000 -0800
@@ -149,6 +149,9 @@
#define __xg(x) ((struct __xchg_dummy *)(x))


+#define ll_low(x) *(((unsigned int*)&(x))+0)
+#define ll_high(x) *(((unsigned int*)&(x))+1)
+
/*
* The semantics of XCHGCMP8B are a bit strange, this is why
* there is a loop and the loading of %%eax and %%edx has to
@@ -184,8 +187,6 @@
{
__set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL));
}
-#define ll_low(x) *(((unsigned int*)&(x))+0)
-#define ll_high(x) *(((unsigned int*)&(x))+1)

static inline void __set_64bit_var (unsigned long long *ptr,
unsigned long long value)
@@ -203,6 +204,26 @@
__set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
__set_64bit(ptr, ll_low(value), ll_high(value)) )

+static inline unsigned long long __get_64bit(unsigned long long * ptr)
+{
+ unsigned long long ret;
+ __asm__ __volatile__ (
+ "\n1:\t"
+ "movl (%1), %%eax\n\t"
+ "movl 4(%1), %%edx\n\t"
+ "movl %%eax, %%ebx\n\t"
+ "movl %%edx, %%ecx\n\t"
+ LOCK_PREFIX "cmpxchg8b (%1)\n\t"
+ "jnz 1b"
+ : "=A"(ret)
+ : "D"(ptr)
+ : "ebx", "ecx", "memory");
+ return ret;
+}
+
+#define get_64bit(ptr) __get_64bit(ptr)
+
+
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway
* Note 2: xchg has side effect, so that attribute volatile is necessary,
@@ -240,7 +261,41 @@
*/

#ifdef CONFIG_X86_CMPXCHG
+
#define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable of running on 80386. It may be necessary to
+ * simulate the cmpxchg on the 80386 CPU. For that purpose we define
+ * a function for each of the sizes we support.
+ */
+
+extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8);
+extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16);
+extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32);
+
+static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+ unsigned long new, int size)
+{
+ switch (size) {
+ case 1:
+ return cmpxchg_386_u8(ptr, old, new);
+ case 2:
+ return cmpxchg_386_u16(ptr, old, new);
+ case 4:
+ return cmpxchg_386_u32(ptr, old, new);
+ }
+ return old;
+}
+
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
#endif

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,12 +325,34 @@
return old;
}

-#define cmpxchg(ptr,o,n)\
- ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
- (unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ __asm__ __volatile__(
+ LOCK_PREFIX "cmpxchg8b (%4)"
+ : "=A" (prev)
+ : "0" (old), "c" ((unsigned long)(newv >> 32)),
+ "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr)
+ : "memory");
+ return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on 80486 and 80386. Both
+ * do not support cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(volatile unsigned long long *,
+ unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
#ifdef __KERNEL__
-struct alt_instr {
+struct alt_instr {
__u8 *instr; /* original instruction */
__u8 *replacement;
__u8 cpuid; /* cpuid bit set for replacement */

2005-09-02 20:41:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

Alan Cox <[email protected]> writes:

> On Gwe, 2005-09-02 at 16:29 +1000, Nick Piggin wrote:
> > 2/7
> > Implement atomic_cmpxchg for i386 and ppc64. Is there any
> > architecture that won't be able to implement such an operation?
>
> i386, sun4c, ....

Actually we have cmpxchg on i386 these days - we don't support
any SMP i386s so it's just done non atomically.

> Yeah quite a few. I suspect most MIPS also would have a problem in this
> area.

cmpxchg can be done with LL/SC can't it? Any MIPS should have that.

-Andi
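
(The LL/SC construction being referred to is roughly the sketch below;
ll() and sc() stand in for the architecture's load-linked /
store-conditional primitives and are not real kernel functions.)

/*
 * Sketch: cmpxchg built from load-linked/store-conditional.
 */
static inline unsigned long cmpxchg_llsc(volatile unsigned long *p,
				unsigned long old, unsigned long new)
{
	unsigned long prev;

	do {
		prev = ll(p);		/* load-linked: take a reservation */
		if (prev != old)
			return prev;	/* value differs, nothing to store */
	} while (!sc(p, new));		/* store-conditional: retry if lost */

	return prev;
}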

2005-09-02 21:13:15

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

From: Andi Kleen <[email protected]>
Date: 02 Sep 2005 22:41:31 +0200

> > Yeah quite a few. I suspect most MIPS also would have a problem in this
> > area.
>
> cmpxchg can be done with LL/SC can't it? Any MIPS should have that.

Right.

On PARISC, I don't see where they are emulating compare and swap
as indicated. They are doing the funny hashed spinlocks for the
atomic_t operations and bitops, but that is entirely different.

cmpxchg() has to operate in an environment where, unlike the atomic_t
and bitops, you cannot control the accessors to the object at all.

The DRM is the only place in the kernel that requires cmpxchg()
and you can thus make a list of what platform can provide cmpxchg()
by which ones support DRM and thus provide the cmpxchg() macro already
in asm/system.h

We really can't require support for this primitive kernel wide, it's
simply not possible on a couple chips.

2005-09-02 21:22:01

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

Andi Kleen wrote:
> Alan Cox <[email protected]> writes:
>
>
>>On Gwe, 2005-09-02 at 16:29 +1000, Nick Piggin wrote:
>>
>>>2/7
>>>Implement atomic_cmpxchg for i386 and ppc64. Is there any
>>>architecture that won't be able to implement such an operation?
>>
>>i386, sun4c, ....
>
>
> Actually we have cmpxchg on i386 these days - we don't support
> any SMP i386s so it's just done non atomically.
>

Yes, I guess that's what Alan must have meant.

This atomic_cmpxchg, unlike a "regular" cmpxchg, has the advantage
that the memory altered should always be going through the atomic_
accessors, and thus should be implementable with spinlocks.

See for example, arch/sparc/lib/atomic32.c

At least, that's what I'm hoping for.
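
A minimal sketch of that kind of lock-based emulation, modeled loosely
on the hashed-spinlock scheme in arch/sparc/lib/atomic32.c (the hash
size and names below are illustrative, not copied from that file):

/* Sketch only: emulating atomic_cmpxchg() with a hashed spinlock on an
 * architecture with no ll/sc and no native cmpxchg instruction. */
#include <linux/spinlock.h>
#include <asm/atomic.h>

#define ATOMIC_HASH_SIZE	4
#define ATOMIC_HASH(a)	\
	(&__atomic_hash[(((unsigned long)a) >> 8) & (ATOMIC_HASH_SIZE - 1)])

static spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] = {
	[0 ... ATOMIC_HASH_SIZE - 1] = SPIN_LOCK_UNLOCKED
};

int atomic_cmpxchg(atomic_t *v, int old, int new)
{
	unsigned long flags;
	int ret;

	/* This only works because every writer to *v also goes through
	 * the atomic_ accessors, and so takes the same hashed lock. */
	spin_lock_irqsave(ATOMIC_HASH(v), flags);
	ret = v->counter;
	if (ret == old)
		v->counter = new;
	spin_unlock_irqrestore(ATOMIC_HASH(v), flags);

	return ret;
}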

>
>>Yeah quite a few. I suspect most MIPS also would have a problem in this
>>area.
>
>
> cmpxchg can be done with LL/SC can't it? Any MIPS should have that.
>

Yes, and indeed it does. However it also tests for "cpu_has_llsc",
but I suspect that SMP isn't supported on those CPUs without ll/sc,
and thus an atomic_cmpxchg could be emulated by disabling interrupts.
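
On such a UP-only configuration the emulation could be as simple as
this sketch (illustrative, not taken from any existing port):

/* Sketch: UP-only atomic_cmpxchg emulation by disabling interrupts. */
static inline int atomic_cmpxchg(atomic_t *v, int old, int new)
{
	unsigned long flags;
	int prev;

	local_irq_save(flags);
	prev = atomic_read(v);
	if (prev == old)
		atomic_set(v, new);
	local_irq_restore(flags);

	return prev;
}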

Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-02 21:26:15

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

Christoph Lameter wrote:
> On Fri, 2 Sep 2005, Nick Piggin wrote:
>
>
>>Implement atomic_cmpxchg for i386 and ppc64. Is there any
>>architecture that won't be able to implement such an operation?
>
>
> Something like that used to be part of the page fault scalability
> patchset. You contributed to it last year. Here is the latest version of
> that. May need some work though.
>

Thanks Christoph, I think this will be required to support 386.
In the worst case, we could provide a fallback path and take
->tree_lock in pagecache lookups if there is no atomic_cmpxchg,
however I would much prefer all architectures get an atomic_cmpxchg,
and I think it should turn out to be a generally useful primitive.

I may trim this down to only provide what is needed for atomic_cmpxchg
if that is OK?

Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-02 21:32:04

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

From: Nick Piggin <[email protected]>
Date: Sat, 03 Sep 2005 07:22:18 +1000

> This atomic_cmpxchg, unlike a "regular" cmpxchg, has the advantage
> that the memory altered should always be going through the atomic_
> accessors, and thus should be implementable with spinlocks.
>
> See for example, arch/sparc/lib/atomic32.c
>
> At least, that's what I'm hoping for.

Ok, as long as the rule is that all accesses have to go
through accessor macros, it would work. This is not true
for existing uses of cmpxchg() btw, userland accesses shared
locks with the kernel without using any kind of accessors we
can control.

This means that your atomic_cmpxchg() cannot be used for locking
objects shared with userland, as DRM wants, since the hashed spinlock
trick does not work in such a case.

2005-09-02 21:43:38

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

Bear with me Dave, I'll repeat myself a bit, for the benefit of lkml.


David S. Miller wrote:
>>>Yeah quite a few. I suspect most MIPS also would have a problem in this
>>>area.
>>
>>cmpxchg can be done with LL/SC can't it? Any MIPS should have that.
>
>
> Right.
>
> On PARISC, I don't see where they are emulating compare and swap
> as indicated. They are doing the funny hashed spinlocks for the
> atomic_t operations and bitops, but that is entirely different.
>

Yep, same as SPARC (at least, SPARC's 32-bit atomic_t).

> cmpxchg() has to operate in an environment where, unlike the atomic_t
> and bitops, you cannot control the accessors to the object at all.
>
> The DRM is the only place in the kernel that requires cmpxchg()
> and you can thus make a list of what platform can provide cmpxchg()
> by which ones support DRM and thus provide the cmpxchg() macro already
> in asm/system.h
>
> We really can't require support for this primitive kernel wide, it's
> simply not possible on a couple chips.

Not a generic cmpxchg, no. However, I _believe_ that those
architectures that are missing something like ll/sc or real
atomic cmpxchg should still be able to implement an
"atomic_cmpxchg" on their atomic type.

Sorry if I wasn't at all clear initially. What I'd be interested
in is an architecture that doesn't support ll/sc or real cmpxchg
*and* does not implement atomic_t operations with locks.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-02 21:47:22

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

David S. Miller wrote:
> From: Nick Piggin <[email protected]>
> Date: Sat, 03 Sep 2005 07:22:18 +1000
>
>
>>This atomic_cmpxchg, unlike a "regular" cmpxchg, has the advantage
>>that the memory altered should always be going through the atomic_
>>accessors, and thus should be implementable with spinlocks.
>>
>>See for example, arch/sparc/lib/atomic32.c
>>
>>At least, that's what I'm hoping for.
>
>
> Ok, as long as the rule is that all accesses have to go
> through accessor macros, it would work. This is not true
> for existing uses of cmpxchg() btw, userland accesses shared
> locks with the kernel without using any kind of accessors we
> can control.
>
> This means that your atomic_cmpxchg() cannot be used for locking
> objects shared with userland, as DRM wants, since the hashed spinlock
> trick does not work in such a case.
>

So neither could currently supported atomic_t ops be shared with
userland accesses?

Then I think it would not be breaking any interface rule to do an
atomic_t atomic_cmpxchg either. Definitely for my usage it will
not be shared with userland.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-02 21:57:59

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

From: Nick Piggin <[email protected]>
Date: Sat, 03 Sep 2005 07:47:48 +1000

> So neither could currently supported atomic_t ops be shared with
> userland accesses?

Correct.

> Then I think it would not be breaking any interface rule to do an
> atomic_t atomic_cmpxchg either. Definitely for my usage it will
> not be shared with userland.

Ok.

2005-09-02 23:33:50

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

On Sad, 2005-09-03 at 07:22 +1000, Nick Piggin wrote:
> > Actually we have cmpxchg on i386 these days - we don't support
> > any SMP i386s so it's just done non atomically.
>
> Yes, I guess that's what Alan must have meant.

Well I was thinking about things like pre-empt. Also the x86 cmpxchg()
is defined for non i386 processors to allow certain things to use it
(ACPI, DRM etc) which know they won't be on a 386. The implementation in
this case will blow up on a 386 and the __HAVE_ARCH_CMPXCHG remains
false.

> but I suspect that SMP isn't supported on those CPUs without ll/sc,
> and thus an atomic_cmpxchg could be emulated by disabling interrupts.

It's obviously emulatable on any platform - the question is at what
cost. For x86 it probably isn't a big problem as there are very very few
people who need to build for 386 any more and there is already a big
penalty for such chips.

2005-09-03 01:40:09

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

On Sat, 3 Sep 2005, Nick Piggin wrote:

> Thanks Christoph, I think this will be required to support 386.
> In the worst case, we could provide a fallback path and take
> ->tree_lock in pagecache lookups if there is no atomic_cmpxchg,
> however I would much prefer all architectures get an atomic_cmpxchg,
> and I think it should turn out to be a generally useful primitive.
>
> I may trim this down to only provide what is needed for atomic_cmpxchg
> if that is OK?

Do not hesitate to do whatever you need to the patch. I took what I
needed from you for this patch last year too.

2005-09-03 01:40:39

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

Alan Cox wrote:

>>but I suspect that SMP isn't supported on those CPUs without ll/sc,
>>and thus an atomic_cmpxchg could be emulated by disabling interrupts.
>
>
> It's obviously emulatable on any platform - the question is at what
> cost. For x86 it probably isn't a big problem as there are very very few
> people who need to build for 386 any more and there is already a big
> penalty for such chips.
>
>

Thanks Alan, Dave, others.

We'll see how things go. I'm fairly sure that for my usage it will
be a win even if it is costly. It is replacing an atomic_inc_return,
and a read_lock/read_unlock pair.

But if it does one day get merged, and proves to be very costly on
some architectures then we'll need to be careful about where it gets
used.

Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-03 17:08:01

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

On Sad, 2005-09-03 at 11:40 +1000, Nick Piggin wrote:
> We'll see how things go. I'm fairly sure that for my usage it will
> be a win even if it is costly. It is replacing an atomic_inc_return,
> and a read_lock/read_unlock pair.

Make sure you bench both AMD and Intel - I'd expect it to be a big loss
on AMD because the AMD stuff will perform atomic locked operations very
efficiently if they are already exclusive on this CPU or a prefetch_w()
on them was done 200+ clocks before.

2005-09-04 01:00:54

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

Alan Cox wrote:
> On Sad, 2005-09-03 at 11:40 +1000, Nick Piggin wrote:
>
>>We'll see how things go. I'm fairly sure that for my usage it will
>>be a win even if it is costly. It is replacing an atomic_inc_return,
>>and a read_lock/read_unlock pair.
>
>
> Make sure you bench both AMD and Intel - I'd expect it to be a big loss
> on AMD because the AMD stuff will perform atomic locked operations very
> efficiently if they are already exclusive on this CPU or a prefetch_w()
> on them was done 200+ clocks before.
>

I will try to get numbers for both.

I would be surprised if it was a big loss... but I'm assuming
a locked cmpxchg isn't outlandishly expensive. Basically:

read_lock_irqsave(cacheline1);
atomic_inc_return(cacheline2);
read_unlock_irqrestore(cacheline1);

Turns into

atomic_cmpxchg();

I'll do some microbenchmarks and get back to you. I'm quite
interested now ;) What sort of AMDs did you have in mind,
Opterons?
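
Something along these lines, perhaps (a rough userspace sketch;
__sync_val_compare_and_swap is a GCC builtin standing in for
atomic_cmpxchg, and the iteration count and timing are arbitrary):

#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

int main(void)
{
	static volatile int counter;	/* stays hot in this CPU's cache */
	struct timespec t0, t1;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		int old = counter;
		/* measures the uncontended, cache-hot locked cmpxchg */
		__sync_val_compare_and_swap(&counter, old, old + 1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns per cmpxchg\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / ITERS);
	return 0;
}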

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-05 16:32:41

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

On Sul, 2005-09-04 at 11:01 +1000, Nick Piggin wrote:
> I would be surprised if it was a big loss... but I'm assuming
> a locked cmpxchg isn't outlandishly expensive. Basically:
>
> read_lock_irqsave(cacheline1);
> atomic_inc_return(cacheline2);
> read_unlock_irqrestore(cacheline1);
>
> Turns into
>
> atomic_cmpxchg();
>
> I'll do some microbenchmarks and get back to you. I'm quite
> interested now ;) What sort of AMDs did you have in mind,


Athlon or higher give very different atomic numbers to the P4. If you
are losing the read_lock/unlock then I agree the atomic_cmpxchg should
be faster on all of them.

One question however - atomic_foo operations are not store barriers, so
you might need mb() and friends for PPC?

Alan

2005-09-06 01:03:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 2/7

Alan Cox wrote:
> On Sul, 2005-09-04 at 11:01 +1000, Nick Piggin wrote:
>
>>I would be surprised if it was a big loss... but I'm assuming
>>a locked cmpxchg isn't outlandishly expensive. Basically:
>>
>> read_lock_irqsave(cacheline1);
>> atomic_inc_return(cacheline2);
>> read_unlock_irqrestore(cacheline1);
>>
>>Turns into
>>
>> atomic_cmpxchg();
>>
>>I'll do some microbenchmarks and get back to you. I'm quite
>>interested now ;) What sort of AMDs did you have in mind,
>
>
>
> Athlon or higher give very different atomic numbers to P4. If you are
> losing the read_lock/unlock then the atomic_cmpxchg should be faster on
> all I agree.
>

Phew! I'll test them anyway, however.

> One question however - atomic_foo operations are not store barriers so
> you might need mb() and friends for PPC ?
>

Dave's documented that nicely in Documentation/atomic_ops.txt

In general, atomic ops that do not return a value are not barriers,
while operations that do return a value are.

So I think we can define the atomic_cmpxchg as providing a barrier.
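
Roughly, the convention looks like the following sketch
(smp_mb__before/after_atomic_inc() are the existing helpers; the
atomic_cmpxchg here is the proposed operation, and header locations
vary by architecture):

#include <asm/atomic.h>		/* barrier helpers live in asm headers */

static atomic_t v = ATOMIC_INIT(0);

static void barrier_convention(void)
{
	int old;

	/* Atomic ops that return no value imply no memory barrier... */
	atomic_inc(&v);

	/* ...so explicit barriers are added where ordering matters. */
	smp_mb__before_atomic_inc();
	atomic_inc(&v);
	smp_mb__after_atomic_inc();

	/* Value-returning ops (atomic_inc_return, the proposed
	 * atomic_cmpxchg, ...) act as full memory barriers themselves. */
	old = atomic_read(&v);
	old = atomic_cmpxchg(&v, old, old + 1);
}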

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-09 05:36:33

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 5/7

I wonder if it may not be better to use a seqlock for the tree_lock? A
seqlock requires no writes at all on the read side if the tree has not
been changed, whereas RCU still requires incrementing a (local) counter.

Using seqlocks would require reworking the readers so that they can
retry. Seqlocks already provide verification that no update took place
while the operation was in progress, so we would be using an established
framework that ensures the speculation was successful.

The problem, then, is to guarantee that the radix trees are always
traversable, since seqlock readers retry rather than block. This would
require sequencing of inserts and poses a big problem for deletes and
updates.
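
(For reference, a retrying reader would look roughly like the sketch
below; read_seqbegin()/read_seqretry() are the existing seqlock
primitives, the tree_seqlock field is hypothetical, and safely taking
a reference on the returned page is a separate problem this ignores.)

#include <linux/seqlock.h>
#include <linux/radix-tree.h>
#include <linux/fs.h>

static struct page *find_page_seq(struct address_space *mapping, pgoff_t offset)
{
	struct page *page;
	unsigned seq;

	do {
		/* hypothetical seqlock_t protecting the radix tree */
		seq = read_seqbegin(&mapping->tree_seqlock);
		page = radix_tree_lookup(&mapping->page_tree, offset);
	} while (read_seqretry(&mapping->tree_seqlock, seq));

	return page;
}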

2005-09-09 06:21:45

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 5/7

Christoph Lameter wrote:
> I wonder if it may not be better to use a seqlock for the tree_lock? A
> seqlock requires no writes at all if the tree has not been changed. RCU
> still requires the incrementing of a (local) counter.
>

Ah, but the seqlock's write side will cause cacheline bouncing on
the read side. Currently the lockless pagecache patches allow IO to
be busily populating one part of the cache while readers can look
up another part without any cacheline transfers.

Presumably you mean rcu_read_lock()'s preempt_disable(); however, this
is a no-op on non-preempt kernels, and will almost certainly be hot
in cache on preempt kernels. What's more, it should not cause bouncing.

> Using seqlocks would require reworking the readers so that they can
> retry. Seqlocks provide already a verification that no update took place
> while the operation was in process. Thus we would be using an established
> framework that insures that the speculation was successful.
>
> The problem is then though to guarantee that the radix trees are always
> traversable since the seqlock's retry rather than block. This would
> require sequencing of inserts and pose a big problem for deletes and
> updates.
>

Well the lockless radix tree read side patches (which I have actually
updated to fix some bugs and be rebased on top of your nice cleanup)
do provide that the radix trees are always traversable.
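
For anyone following along, the shape of the lockless read side here is
roughly the sketch below (simplified: it assumes
page_cache_get_speculative() returns 0 when it loses the race with a
page being freed, and the recheck after it is approximate, not the
exact code in the patch):

static struct page *find_get_page_lockless(struct address_space *mapping,
						pgoff_t offset)
{
	struct page *page;

	rcu_read_lock();	/* no-op on non-preempt kernels */
repeat:
	page = radix_tree_lookup(&mapping->page_tree, offset);
	if (page) {
		if (!page_cache_get_speculative(page))
			goto repeat;	/* raced with the page being freed */
		if (unlikely(page->mapping != mapping ||
					page->index != offset)) {
			/* the slot was reused under us: drop and retry */
			page_cache_release(page);
			goto repeat;
		}
	}
	rcu_read_unlock();

	return page;
}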

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-09 13:00:22

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 7/7

For Itanium (and I guess also for ppc64 and sparc64) the performance of
write_lock/unlock is the same as spin_lock/unlock. There is at least
one case where concurrent reads would be allowed without this patch.

Maybe keep the rwlock_t there?

2005-09-09 15:23:09

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2.6.13] lockless pagecache 7/7

Christoph Lameter wrote:
> For Itanium (and I guess also for ppc64 and sparch64) the performance of
> write_lock/unlock is the same as spin_lock/unlock. There is at least
> one case where concurrent reads would be allowed without this patch.
>

Yep, I picked up another one that was easy to make lockless (I'll send
out a new patchset soon); however, the tagged lookup that was under a
read lock has been changed to a spinlock.

It shouldn't be too difficult to make the tag lookups (find_get_pages_tag)
lockless; I just haven't gotten around to looking at the write side of
the tagging yet.

When that is done, there should be no more read locks at all.

Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-09-15 19:50:19

by Alok Kataria

[permalink] [raw]
Subject: Re: New lockless pagecache

Hi Nick,

I have collected performance numbers for the lockless pagecache
patch on the AIM IO test.
The numbers were collected for 1-100 tasks, 1-50 tasks and 90-100
tasks, both with and without your patch, on a 2.6.13 kernel.
There's a definite improvement when the number of tasks is small,
i.e. ~50-70, but when the tasks go beyond 80 we see a large
performance dip.
I profiled the 90-100 runs again with spinlocks inlined, but couldn't
work out the reason for the performance difference.

Please find attached the performance numbers as well as the oprofile logs.

Thanks & Regards,
Alok


On 9/2/05, Nick Piggin <[email protected]> wrote:
> Nick Piggin wrote:
>
> > I think this is getting pretty stable. No guarantees of course,
> > but it would be great if anyone gave it a test.
> >
>
> Or review, I might add. While I understand such a review is
> still quite difficult, this code really is far less complex
> than the previous lockless pagecache patches.
>
> (Ignore 1/7 though, which is a rollup - a broken out patchset
> can be provided on request)
>
> Nick
>
> --
> SUSE Labs, Novell Inc.
>
> Send instant messages to your online friends http://au.messenger.yahoo.com
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>


--
A computer scientist is someone who, when told to "Go to Hell," sees
the "go to," rather than the destination, as harmful.

Alok Kataria


Attachments:
lockless-pagecache.tjz (52.41 kB)

2005-09-16 03:12:05

by Nick Piggin

[permalink] [raw]
Subject: Re: New lockless pagecache

Alok Kataria wrote:
> Hi Nick,
>
> I have collected performance numbers for the lockless pagecache
> patch on the AIM IO test.
> The numbers were collected for 1-100 tasks, 1-50 tasks and 90-100
> tasks, both with and without your patch, on a 2.6.13 kernel.
> There's a definite improvement when the number of tasks is small,
> i.e. ~50-70, but when the tasks go beyond 80 we see a large
> performance dip.
> I profiled the 90-100 runs again with spinlocks inlined, but couldn't
> work out the reason for the performance difference.
>
> Please find attached the performance numbers as well as the oprofile logs.
>

Hi Alok,

Thanks very much for doing these numbers. Performance is improved
significantly at smaller numbers of tasks, as you say.

Unfortunately I can't pinpoint the reason why performance drops at
larger numbers. I could assume that the last remaining place that
used read_lock_irq for the tree_lock (wait_on_page_writeback_range)
got hurt when switching to spinlocks, but that would seem very
unlikely.

I'll have to look into it further.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com