2006-03-10 15:18:13

by Nick Piggin

Subject: A lockless pagecache for Linux

Hi,

I was waiting for 2.6.16 before releasing my patchset, but that got
boring.

ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/lockless/2.6.16-rc5/

Now I've used some clever subject lines on the subsequent patches
to make you think this isn't a big deal. Actually there are about
36 other "prep" patches before those, and PageReserved removal before
that (which are luckily now mostly in -mm or -linus, respectively).
What's more, there aren't 3 lockless pagecache patches, there are
5 -- but the last two are optimisations.

I'm writing some stuff about these patches, and I've uploaded a
**draft** chapter on the RCU radix-tree, 'radix-intro.pdf' in the above
directory (note the bibliography didn't make it -- but thanks Paul
McKenney!)

If anyone would like to test or review it, I would be very happy.
Suggestions on the code or document would be very welcome... but
I'm still hoping nobody spots a fundamental flaw until after OLS.

Rollup of prep patches (the 5 posted patches apply on top of this):
2.6.16-rc5-git14-prep.patch.gz

Rollup of prep+lockless patches (includes the 5 posted patches):
2.6.16-rc5-git14-lockless.patch.gz

Note: anyone interested in benchmarking should test the prep+lockless
rollup vs the prep rollup rather than vs mainline if possible, because
there are various other optimisations in prep.

Thanks,
Nick


2006-03-10 15:18:23

by Nick Piggin

Subject: [patch 1/3] radix tree: RCU lockless read-side

Make radix tree lookups safe to be performed without locks. Readers
are protected against nodes being deleted by using RCU based freeing.
Readers are protected against seeing partially initialised new nodes
by using memory barriers to ensure the node itself will be properly
written before it is made visible in the radix tree.

Each radix tree node keeps a record of its height (above the leaf
nodes). This height does not change after insertion -- when the radix
tree is extended, higher nodes are only inserted at the top. So a
lookup can take the pointer to what is *now* the root node and
traverse down it, even if the tree is concurrently extended and this
node becomes a subtree of a new root.

When a reader wants to traverse the next branch, it takes a copy of
the pointer. This pointer will be either NULL (the branch is empty)
or non-NULL (pointing to a valid node).
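
To make that concrete, here is a minimal sketch of the read-side
descent (illustrative only -- simplified from the real __lookup_slot
in the patch below, ignoring tags and gang lookups, and returning the
item rather than a slot pointer):

        struct radix_tree_node *node;
        unsigned int height, shift;

        node = rcu_dereference(root->rnode);    /* snapshot the current root */
        if (node == NULL)
                return NULL;
        height = node->height;                  /* fixed for this node's lifetime */
        if (index > radix_tree_maxindex(height))
                return NULL;
        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
        do {
                /* copy the child pointer once: either NULL or a valid node */
                node = rcu_dereference(node->slots[(index >> shift)
                                                & RADIX_TREE_MAP_MASK]);
                if (node == NULL)
                        return NULL;
                shift -= RADIX_TREE_MAP_SHIFT;
                height--;
        } while (height > 0);
        return node;                            /* the item stored at 'index' */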

Also introduce a lockfree gang_lookup_slot which will be used by a
future patch.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -30,6 +30,7 @@
#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/bitops.h>
+#include <linux/rcupdate.h>


#ifdef __KERNEL__
@@ -46,7 +47,9 @@
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct radix_tree_node {
+ unsigned int height; /* Height from the bottom */
unsigned int count;
+ struct rcu_head rcu_head;
void *slots[RADIX_TREE_MAP_SIZE];
unsigned long tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
};
@@ -98,10 +101,17 @@ radix_tree_node_alloc(struct radix_tree_
return ret;
}

+static void radix_tree_node_rcu_free(struct rcu_head *head)
+{
+ struct radix_tree_node *node =
+ container_of(head, struct radix_tree_node, rcu_head);
+ kmem_cache_free(radix_tree_node_cachep, node);
+}
+
static inline void
radix_tree_node_free(struct radix_tree_node *node)
{
- kmem_cache_free(radix_tree_node_cachep, node);
+ call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
}

/*
@@ -204,6 +214,7 @@ static int radix_tree_extend(struct radi
}

do {
+ unsigned int newheight;
if (!(node = radix_tree_node_alloc(root)))
return -ENOMEM;

@@ -216,9 +227,11 @@ static int radix_tree_extend(struct radi
tag_set(node, tag, 0);
}

+ newheight = root->height+1;
+ node->height = newheight;
node->count = 1;
- root->rnode = node;
- root->height++;
+ rcu_assign_pointer(root->rnode, node);
+ root->height = newheight;
} while (height > root->height);
out:
return 0;
@@ -258,11 +271,12 @@ int radix_tree_insert(struct radix_tree_
/* Have to add a child node. */
if (!(slot = radix_tree_node_alloc(root)))
return -ENOMEM;
+ slot->height = height;
if (node) {
- node->slots[offset] = slot;
+ rcu_assign_pointer(node->slots[offset], slot);
node->count++;
} else
- root->rnode = slot;
+ rcu_assign_pointer(root->rnode, slot);
}

/* Go a level down */
@@ -278,7 +292,7 @@ int radix_tree_insert(struct radix_tree_

BUG_ON(!node);
node->count++;
- node->slots[offset] = item;
+ rcu_assign_pointer(node->slots[offset], item);
BUG_ON(tag_get(node, 0, offset));
BUG_ON(tag_get(node, 1, offset));

@@ -290,25 +304,29 @@ static inline void **__lookup_slot(struc
unsigned long index)
{
unsigned int height, shift;
- struct radix_tree_node **slot;
+ struct radix_tree_node *node, **slot;

- height = root->height;
+ /* Must take a copy now because root->rnode may change */
+ node = rcu_dereference(root->rnode);
+ if (node == NULL)
+ return NULL;
+
+ height = node->height;
if (index > radix_tree_maxindex(height))
return NULL;

shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = &root->rnode;

- while (height > 0) {
- if (*slot == NULL)
+ do {
+ slot = (struct radix_tree_node **)
+ (node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
+ node = rcu_dereference(*slot);
+ if (node == NULL)
return NULL;

- slot = (struct radix_tree_node **)
- ((*slot)->slots +
- ((index >> shift) & RADIX_TREE_MAP_MASK));
shift -= RADIX_TREE_MAP_SHIFT;
height--;
- }
+ } while (height > 0);

return (void **)slot;
}
@@ -339,7 +357,7 @@ void *radix_tree_lookup(struct radix_tre
void **slot;

slot = __lookup_slot(root, index);
- return slot != NULL ? *slot : NULL;
+ return slot != NULL ? rcu_dereference(*slot) : NULL;
}
EXPORT_SYMBOL(radix_tree_lookup);

@@ -501,26 +519,27 @@ EXPORT_SYMBOL(radix_tree_tag_get);
#endif

static unsigned int
-__lookup(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup(struct radix_tree_root *root, void ***results, unsigned long index,
unsigned int max_items, unsigned long *next_index)
{
unsigned int nr_found = 0;
unsigned int shift, height;
- struct radix_tree_node *slot;
+ struct radix_tree_node *slot, *__s;
unsigned long i;

- height = root->height;
- if (height == 0)
+ slot = rcu_dereference(root->rnode);
+ if (!slot || slot->height == 0)
goto out;

+ height = slot->height;
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = root->rnode;

for ( ; height > 1; height--) {

for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
i < RADIX_TREE_MAP_SIZE; i++) {
- if (slot->slots[i] != NULL)
+ __s = rcu_dereference(slot->slots[i]);
+ if (__s != NULL)
break;
index &= ~((1UL << shift) - 1);
index += 1UL << shift;
@@ -531,14 +550,14 @@ __lookup(struct radix_tree_root *root, v
goto out;

shift -= RADIX_TREE_MAP_SHIFT;
- slot = slot->slots[i];
+ slot = __s;
}

/* Bottom level: grab some items */
for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
index++;
if (slot->slots[i]) {
- results[nr_found++] = slot->slots[i];
+ results[nr_found++] = &slot->slots[i];
if (nr_found == max_items)
goto out;
}
@@ -570,6 +589,43 @@ radix_tree_gang_lookup(struct radix_tree
unsigned int ret = 0;

while (ret < max_items) {
+ unsigned int nr_found, i;
+ unsigned long next_index; /* Index of next search */
+
+ if (cur_index > max_index)
+ break;
+ nr_found = __lookup(root, (void ***)results + ret, cur_index,
+ max_items - ret, &next_index);
+ for (i = 0; i < nr_found; i++)
+ results[ret + i] = *(((void ***)results)[ret + i]);
+ ret += nr_found;
+ if (next_index == 0)
+ break;
+ cur_index = next_index;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ * radix_tree_gang_lookup_slot - perform multiple lookup on a radix tree
+ * @root: radix tree root
+ * @results: where the results of the lookup are placed
+ * @first_index: start the lookup from this key
+ * @max_items: place up to this many items at *results
+ *
+ * Same as radix_tree_gang_lookup, but returns an array of pointers
+ * (slots) to the stored items instead of the items themselves.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items)
+{
+ const unsigned long max_index = radix_tree_maxindex(root->height);
+ unsigned long cur_index = first_index;
+ unsigned int ret = 0;
+
+ while (ret < max_items) {
unsigned int nr_found;
unsigned long next_index; /* Index of next search */

@@ -584,7 +640,8 @@ radix_tree_gang_lookup(struct radix_tree
}
return ret;
}
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+

/*
* FIXME: the two tag_get()s here should use find_next_bit() instead of
@@ -689,6 +746,11 @@ static inline void radix_tree_shrink(str
root->rnode->slots[0]) {
struct radix_tree_node *to_free = root->rnode;

+ /*
+ * this doesn't need an rcu_assign_pointer, because
+ * we aren't touching the object that to_free->slots[0]
+ * points to.
+ */
root->rnode = to_free->slots[0];
root->height--;
/* must only free zeroed nodes into the slab */
@@ -804,7 +866,7 @@ EXPORT_SYMBOL(radix_tree_delete);
int radix_tree_tagged(struct radix_tree_root *root, int tag)
{
struct radix_tree_node *rnode;
- rnode = root->rnode;
+ rnode = rcu_dereference(root->rnode);
if (!rnode)
return 0;
return any_tag_set(rnode, tag);
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -52,6 +52,9 @@ void *radix_tree_delete(struct radix_tre
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+ unsigned long first_index, unsigned int max_items);
int radix_tree_preload(gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,

2006-03-10 15:18:58

by Nick Piggin

Subject: [patch 2/3] mm: speculative get_page

If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to the
page.

This can be done if get_page/put_page behaves consistently
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).

Actually, there is a period where the count behaves differently:
when the page is free or if it is a constituent page of a compound
page. We need an atomic_inc_not_zero operation to ensure we don't
try to grab the page in either case.
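
As a rough sketch (illustrative only -- the real
page_cache_get_speculative introduced below additionally handles the
PageNoNewRefs ordering against concurrent removal), the speculative
get is:

again:
        page = rcu_dereference(*pagep);         /* lockless lookup result */
        if (page == NULL)
                return NULL;
        if (!atomic_inc_not_zero(&page->_count))
                goto again;                     /* the page was free; retry */
        if (unlikely(page != *pagep)) {
                put_page(page);                 /* we got a recycled page */
                goto again;
        }
        /* page is pinned, and is still the page at *pagep */
        return page;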

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -76,6 +76,9 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */

+#define PG_nonewrefs 20 /* Block concurrent pagecache lookups
+ * while testing refcount */
+
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
* allowed.
@@ -346,6 +349,11 @@ extern void __mod_page_state_offset(unsi
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)

+#define PageNoNewRefs(page) test_bit(PG_nonewrefs, &(page)->flags)
+#define SetPageNoNewRefs(page) set_bit(PG_nonewrefs, &(page)->flags)
+#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags)
+#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags)
+
struct page; /* forward declaration */

int test_clear_page_dirty(struct page *page);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -11,6 +11,8 @@
#include <linux/compiler.h>
#include <asm/uaccess.h>
#include <linux/gfp.h>
+#include <linux/page-flags.h>
+#include <linux/hardirq.h> /* for in_interrupt() */

/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -51,6 +53,91 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);

+static inline struct page *page_cache_get_speculative(struct page **pagep)
+{
+ struct page *page;
+
+ VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+ page = *pagep;
+ if (unlikely(!page))
+ return NULL;
+
+ VM_BUG_ON(!in_atomic());
+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+ VM_BUG_ON(page != *pagep);
+
+#else
+ again:
+ page = rcu_dereference(*pagep);
+ if (unlikely(!page))
+ return NULL;
+
+ if (unlikely(!get_page_unless_zero(page)))
+ goto again; /* page has been freed */
+
+ /*
+ * Note that get_page_unless_zero provides a memory barrier.
+ * This is needed to ensure PageNoNewRefs is evaluated after the
+ * page refcount has been raised. See below comment.
+ */
+
+ /*
+ * PageNoNewRefs is set in order to prevent new references to the
+ * page (eg. before it gets removed from pagecache). Wait until it
+ * becomes clear (and checks below will ensure we still have the
+ * correct one).
+ */
+ while (unlikely(PageNoNewRefs(page)))
+ cpu_relax();
+
+ /*
+ * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
+ * is performed before the load of *pagep in the below comparison.
+ *
+ * Those places that set PageNoNewRefs have the following pattern:
+ * SetPageNoNewRefs(page)
+ * wmb();
+ * if (page_count(page) == X)
+ * remove page from pagecache
+ * wmb();
+ * ClearPageNoNewRefs(page)
+ *
+ * So PageNoNewRefs() becomes clear _after_ we've elevated page
+ * refcount, then either the page will be safely pinned in pagecache,
+ * or it will have been already removed. In the latter case, *pagep
+ * will be changed in the below test - provided it is loaded after
+ * testing PageNoNewRefs() (which is what the smp_rmb is for).
+ *
+ * If the load was out of order, *pagep might be loaded before the
+ * page is removed from pagecache while PageNoNewRefs evaluated after
+ * the ClearPageNoNewRefs().
+ */
+ smp_rmb();
+
+ if (unlikely(page != *pagep)) {
+ /* page no longer at *pagep */
+ put_page(page);
+ goto again;
+ }
+
+#endif
+ VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+ return page;
+}
+
static inline struct page *page_cache_alloc(struct address_space *x)
{
return alloc_pages(mapping_gfp_mask(x), 0);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -383,6 +383,7 @@ static int remove_mapping(struct address
if (!mapping)
return 0; /* truncate got there first */

+ SetPageNoNewRefs(page);
write_lock_irq(&mapping->tree_lock);

/*
@@ -401,17 +402,20 @@ static int remove_mapping(struct address
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
- __put_page(page); /* The pagecache ref */
- return 1;
+ goto free_it;
}

__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
+
+free_it:
+ __ClearPageNoNewRefs(page);
+ __put_page(page); /* The pagecache ref */
return 1;

cannot_free:
write_unlock_irq(&mapping->tree_lock);
+ ClearPageNoNewRefs(page);
return 0;
}

@@ -731,6 +735,7 @@ int migrate_page_remove_references(struc
if (page_mapcount(page))
return 1;

+ SetPageNoNewRefs(page);
write_lock_irq(&mapping->tree_lock);

radix_pointer = (struct page **)radix_tree_lookup_slot(
@@ -740,6 +745,7 @@ int migrate_page_remove_references(struc
if (!page_mapping(page) || page_count(page) != nr_refs ||
*radix_pointer != page) {
write_unlock_irq(&mapping->tree_lock);
+ ClearPageNoNewRefs(page);
return 1;
}

@@ -758,10 +764,14 @@ int migrate_page_remove_references(struc
SetPageSwapCache(newpage);
set_page_private(newpage, page_private(page));
}
+ SetPageNoNewRefs(newpage);
+
+ rcu_assign_pointer(*radix_pointer, newpage);

- *radix_pointer = newpage;
- __put_page(page);
write_unlock_irq(&mapping->tree_lock);
+ __put_page(page);
+ ClearPageNoNewRefs(page);
+ ClearPageNoNewRefs(newpage);

return 0;
}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -400,6 +400,7 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);

if (error == 0) {
+ SetPageNoNewRefs(page);
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
@@ -411,6 +412,7 @@ int add_to_page_cache(struct page *page,
pagecache_acct(1);
}
write_unlock_irq(&mapping->tree_lock);
+ ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
return error;
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -77,6 +77,7 @@ static int __add_to_swap_cache(struct pa
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ SetPageNoNewRefs(page);
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
@@ -89,6 +90,7 @@ static int __add_to_swap_cache(struct pa
pagecache_acct(1);
}
write_unlock_irq(&swapper_space.tree_lock);
+ ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
return error;

2006-03-10 15:19:01

by Nick Piggin

Subject: [patch 4/3] mm: lockless optimisations

add_to_page_cache only deals with newly allocated pages except in the
swap -> shm case. Take advantage of this to optimise add_to_page_cache,
and introduce a new __add_to_page_cache for use on pages other than
newly allocated ones.
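
For example (hypothetical callers, only to illustrate the split):

        /* Newly allocated page, not yet visible to any lookup, so the
         * cheaper ordering-based path is safe: */
        err = add_to_page_cache(new_page, mapping, index, gfp_mask);

        /* Page already visible via the swapcache (the swap -> shm move),
         * so the full Set/ClearPageNoNewRefs protocol is still needed: */
        err = __add_to_page_cache(page, mapping, index, GFP_ATOMIC);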

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -400,6 +400,45 @@ int add_to_page_cache(struct page *page,
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);

if (error == 0) {
+ /*
+ * Can get away with less atomic ops and without using
+ * Set/ClearPageNoNewRefs if we order operations correctly.
+ */
+ page_cache_get(page);
+ __SetPageLocked(page);
+ page->mapping = mapping;
+ page->index = offset;
+
+ write_lock_irq(&mapping->tree_lock);
+ error = radix_tree_insert(&mapping->page_tree, offset, page);
+ if (!error) {
+ mapping->nrpages++;
+ pagecache_acct(1);
+ }
+ write_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+
+ if (error) {
+ page->mapping = NULL;
+ __put_page(page);
+ __ClearPageLocked(page);
+ }
+ }
+ return error;
+}
+EXPORT_SYMBOL(add_to_page_cache);
+
+/*
+ * Same as add_to_page_cache, but works on pages that are already in
+ * swapcache and possibly visible to external lookups.
+ * (special case for move_from_swap_cache).
+ */
+int __add_to_page_cache(struct page *page, struct address_space *mapping,
+ pgoff_t offset, gfp_t gfp_mask)
+{
+ int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+
+ if (error == 0) {
SetPageNoNewRefs(page);
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
@@ -417,7 +456,6 @@ int add_to_page_cache(struct page *page,
}
return error;
}
-EXPORT_SYMBOL(add_to_page_cache);

int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -234,7 +234,7 @@ int move_to_swap_cache(struct page *page
int move_from_swap_cache(struct page *page, unsigned long index,
struct address_space *mapping)
{
- int err = add_to_page_cache(page, mapping, index, GFP_ATOMIC);
+ int err = __add_to_page_cache(page, mapping, index, GFP_ATOMIC);
if (!err) {
delete_from_swap_cache(page);
/* shift page from clean_pages to dirty_pages list */
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -183,6 +183,8 @@ extern int read_cache_pages(struct addre

int add_to_page_cache(struct page *page, struct address_space *mapping,
unsigned long index, gfp_t gfp_mask);
+int __add_to_page_cache(struct page *page, struct address_space *mapping,
+ unsigned long index, gfp_t gfp_mask);
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
unsigned long index, gfp_t gfp_mask);
extern void remove_from_page_cache(struct page *page);
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -214,16 +214,13 @@ extern void __mod_page_state_offset(unsi
/*
* Manipulation of page state flags
*/
-#define PageLocked(page) \
- test_bit(PG_locked, &(page)->flags)
-#define SetPageLocked(page) \
- set_bit(PG_locked, &(page)->flags)
-#define TestSetPageLocked(page) \
- test_and_set_bit(PG_locked, &(page)->flags)
-#define ClearPageLocked(page) \
- clear_bit(PG_locked, &(page)->flags)
-#define TestClearPageLocked(page) \
- test_and_clear_bit(PG_locked, &(page)->flags)
+#define PageLocked(page) test_bit(PG_locked, &(page)->flags)
+#define SetPageLocked(page) set_bit(PG_locked, &(page)->flags)
+#define __SetPageLocked(page) __set_bit(PG_locked, &(page)->flags)
+#define TestSetPageLocked(page) test_and_set_bit(PG_locked, &(page)->flags)
+#define ClearPageLocked(page) clear_bit(PG_locked, &(page)->flags)
+#define __ClearPageLocked(page) __clear_bit(PG_locked, &(page)->flags)
+#define TestClearPageLocked(page) test_and_clear_bit(PG_locked, &(page)->flags)

#define PageError(page) test_bit(PG_error, &(page)->flags)
#define SetPageError(page) set_bit(PG_error, &(page)->flags)

2006-03-10 15:19:43

by Nick Piggin

Subject: [patch 3/3] mm: lockless pagecache lookups

Use page_cache_get_speculative and lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock).

The only atomicity change this should introduce is the use of a
non-atomic pagevec lookup for truncate; however, whatever atomicity
guarantees there might have been were not relied upon anyway, because
the size of the pagevec is not guaranteed (eg. it might be 1).
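
Putting the first two patches together, the lockless lookup is
essentially the following (a sketch only; see the real find_get_page
below):

        struct page **pagep;
        struct page *page = NULL;

        rcu_read_lock();
        pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree,
                                                        offset);
        if (pagep)
                page = page_cache_get_speculative(pagep);
        rcu_read_unlock();
        return page;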

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -417,7 +417,6 @@ int add_to_page_cache(struct page *page,
}
return error;
}
-
EXPORT_SYMBOL(add_to_page_cache);

int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
@@ -518,21 +517,21 @@ void fastcall __lock_page(struct page *p
EXPORT_SYMBOL(__lock_page);

/*
- * a rather lightweight function, finding and getting a reference to a
- * hashed page atomically.
+ * find_get_page finds and gets a reference to a pagecache page.
*/
-struct page * find_get_page(struct address_space *mapping, unsigned long offset)
+struct page *find_get_page(struct address_space *mapping, unsigned long offset)
{
- struct page *page;
+ struct page **pagep;
+ struct page *page = NULL;

- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
- page_cache_get(page);
- read_unlock_irq(&mapping->tree_lock);
+ rcu_read_lock();
+ pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree,
+ offset);
+ if (likely(pagep))
+ page = page_cache_get_speculative(pagep);
+ rcu_read_unlock();
return page;
}
-
EXPORT_SYMBOL(find_get_page);

/*
@@ -540,17 +539,31 @@ EXPORT_SYMBOL(find_get_page);
*/
struct page *find_trylock_page(struct address_space *mapping, unsigned long offset)
{
- struct page *page;
+ struct page **pagep;
+ struct page *page = NULL;

- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page) {
- if (unlikely(TestSetPageLocked(page)))
+ rcu_read_lock();
+ pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree,
+ offset);
+ if (pagep) {
+repeat:
+ page = page_cache_get_speculative(pagep);
+
+ if (unlikely(TestSetPageLocked(page))) {
+ page_cache_release(page);
page = NULL;
- else
- get_page(page);
+ goto out;
+ }
+
+ /* Has the page been truncated before being locked? */
+ if (unlikely(page != *pagep)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
+ }
}
- read_unlock_irq(&mapping->tree_lock);
+out:
+ rcu_read_unlock();
return page;
}

@@ -573,25 +586,17 @@ struct page *find_lock_page(struct addre
struct page *page;

repeat:
- read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
+ page = find_get_page(mapping, offset);
if (page) {
- page_cache_get(page);
- if (TestSetPageLocked(page)) {
- read_unlock_irq(&mapping->tree_lock);
- __lock_page(page);
+ lock_page(page);

- /* Has the page been truncated while we slept? */
- if (unlikely(page->mapping != mapping)) {
- unlock_page(page);
- page_cache_release(page);
- goto repeat;
- }
- goto out;
+ /* Has the page been truncated while we slept? */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
}
}
- read_unlock_irq(&mapping->tree_lock);
-out:
return page;
}

@@ -674,6 +679,32 @@ unsigned find_get_pages(struct address_s
return ret;
}

+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+ unsigned int nr_pages, struct page **pages)
+{
+ unsigned int i;
+ unsigned int nr_found;
+ unsigned int ret;
+
+ /*
+ * We do some unsightly casting to use the array first for storing
+ * pointers to the page pointers, and then for the pointers to
+ * the pages themselves that the caller wants.
+ */
+ rcu_read_lock();
+ nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+ (void ***)pages, start, nr_pages);
+ ret = 0;
+ for (i = 0; i < nr_found; i++) {
+ struct page *page;
+ page = page_cache_get_speculative(((struct page ***)pages)[i]);
+ if (likely(page))
+ pages[ret++] = page;
+ }
+ rcu_read_unlock();
+ return ret;
+}
+
/*
* Like find_get_pages, except we only return pages which are tagged with
* `tag'. We update *index to index the next page for the traversal.
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -275,27 +275,26 @@ __do_page_cache_readahead(struct address
/*
* Preallocate as many pages as we will need.
*/
- read_lock_irq(&mapping->tree_lock);
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
pgoff_t page_offset = offset + page_idx;

if (page_offset > end_index)
break;

+ /* Don't need mapping->tree_lock - lookup can be racy */
+ rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
+ rcu_read_unlock();
if (page)
continue;

- read_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
- read_lock_irq(&mapping->tree_lock);
if (!page)
break;
page->index = page_offset;
list_add(&page->lru, &page_pool);
ret++;
}
- read_unlock_irq(&mapping->tree_lock);

/*
* Now start the IO. We ignore I/O errors - if the page is not
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -160,6 +160,8 @@ extern struct page * find_or_create_page
unsigned long index, gfp_t gfp_mask);
unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
unsigned int nr_pages, struct page **pages);
+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+ unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
int tag, unsigned int nr_pages, struct page **pages);

Index: linux-2.6/include/linux/pagevec.h
===================================================================
--- linux-2.6.orig/include/linux/pagevec.h
+++ linux-2.6/include/linux/pagevec.h
@@ -28,6 +28,8 @@ void __pagevec_lru_add_active(struct pag
void pagevec_strip(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+ struct address_space *mapping, pgoff_t start, unsigned nr_pages);
unsigned pagevec_lookup_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, int tag,
unsigned nr_pages);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -428,6 +428,20 @@ unsigned pagevec_lookup(struct pagevec *

EXPORT_SYMBOL(pagevec_lookup);

+/**
+ * pagevec_lookup_nonatomic - non atomic pagevec_lookup
+ *
+ * This routine is non-atomic in that the pages it returns are not an atomic snapshot of the range.
+ */
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+ struct address_space *mapping, pgoff_t start, unsigned nr_pages)
+{
+ pvec->nr = find_get_pages_nonatomic(mapping, start,
+ nr_pages, pvec->pages);
+ return pagevec_count(pvec);
+}
+EXPORT_SYMBOL(pagevec_lookup_nonatomic);
+
unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
pgoff_t *index, int tag, unsigned nr_pages)
{
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -124,7 +124,7 @@ void truncate_inode_pages_range(struct a
pagevec_init(&pvec, 0);
next = start;
while (next <= end &&
- pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index = page->index;
@@ -163,7 +163,7 @@ void truncate_inode_pages_range(struct a
next = start;
for ( ; ; ) {
cond_resched();
- if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ if (!pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
if (next == start)
break;
next = start;
@@ -227,7 +227,7 @@ unsigned long invalidate_mapping_pages(s

pagevec_init(&pvec, 0);
while (next <= end &&
- pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

@@ -284,7 +284,7 @@ int invalidate_inode_pages2_range(struct
pagevec_init(&pvec, 0);
next = start;
while (next <= end && !ret && !wrapped &&
- pagevec_lookup(&pvec, mapping, next,
+ pagevec_lookup_nonatomic(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -808,17 +808,15 @@ int test_set_page_writeback(struct page
EXPORT_SYMBOL(test_set_page_writeback);

/*
- * Return true if any of the pages in the mapping are marged with the
+ * Return true if any of the pages in the mapping are marked with the
* passed tag.
*/
int mapping_tagged(struct address_space *mapping, int tag)
{
- unsigned long flags;
int ret;
-
- read_lock_irqsave(&mapping->tree_lock, flags);
+ rcu_read_lock();
ret = radix_tree_tagged(&mapping->page_tree, tag);
- read_unlock_irqrestore(&mapping->tree_lock, flags);
+ rcu_read_unlock();
return ret;
}
EXPORT_SYMBOL(mapping_tagged);

2006-03-10 15:19:43

by Nick Piggin

Subject: [patch 5/3] mm: spinlock tree_lock

With practically all the read locks gone from mapping->tree_lock,
convert the lock from an rwlock back to a spinlock.

The remaining locking, including the read locks, mainly deals with IO
submission and not the lookup fastpaths.

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -855,7 +855,7 @@ int __set_page_dirty_buffers(struct page
spin_unlock(&mapping->private_lock);

if (!TestSetPageDirty(page)) {
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping))
__inc_page_state(nr_dirty);
@@ -863,7 +863,7 @@ int __set_page_dirty_buffers(struct page
page_index(page),
PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -196,7 +196,7 @@ void inode_init_once(struct inode *inode
mutex_init(&inode->i_mutex);
init_rwsem(&inode->i_alloc_sem);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
- rwlock_init(&inode->i_data.tree_lock);
+ spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -372,7 +372,7 @@ struct backing_dev_info;
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
- rwlock_t tree_lock; /* and rwlock protecting it */
+ spinlock_t tree_lock; /* and lock protecting it */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -110,7 +110,7 @@ generic_file_direct_IO(int rw, struct ki
/*
* Remove a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
- * is safe. The caller must hold a write_lock on the mapping's tree_lock.
+ * is safe. The caller must hold the mapping's tree_lock.
*/
void __remove_from_page_cache(struct page *page)
{
@@ -128,9 +128,9 @@ void remove_from_page_cache(struct page

BUG_ON(!PageLocked(page));

- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
}

static int sync_page(void *word)
@@ -409,13 +409,13 @@ int add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = offset;

- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
mapping->nrpages++;
pagecache_acct(1);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();

if (error) {
@@ -440,7 +440,7 @@ int __add_to_page_cache(struct page *pag

if (error == 0) {
SetPageNoNewRefs(page);
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
page_cache_get(page);
@@ -450,7 +450,7 @@ int __add_to_page_cache(struct page *pag
mapping->nrpages++;
pagecache_acct(1);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
@@ -708,12 +708,12 @@ unsigned find_get_pages(struct address_s
unsigned int i;
unsigned int ret;

- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup(&mapping->page_tree,
(void **)pages, start, nr_pages);
for (i = 0; i < ret; i++)
page_cache_get(pages[i]);
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return ret;
}

@@ -753,14 +753,14 @@ unsigned find_get_pages_tag(struct addre
unsigned int i;
unsigned int ret;

- read_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
(void **)pages, *index, nr_pages, tag);
for (i = 0; i < ret; i++)
page_cache_get(pages[i]);
if (ret)
*index = pages[ret - 1]->index + 1;
- read_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return ret;
}

Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -37,7 +37,7 @@ static struct backing_dev_info swap_back

struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
- .tree_lock = RW_LOCK_UNLOCKED,
+ .tree_lock = SPIN_LOCK_UNLOCKED,
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
@@ -78,7 +78,7 @@ static int __add_to_swap_cache(struct pa
error = radix_tree_preload(gfp_mask);
if (!error) {
SetPageNoNewRefs(page);
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
@@ -89,7 +89,7 @@ static int __add_to_swap_cache(struct pa
total_swapcache_pages++;
pagecache_acct(1);
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
ClearPageNoNewRefs(page);
radix_tree_preload_end();
}
@@ -202,9 +202,9 @@ void delete_from_swap_cache(struct page

entry.val = page_private(page);

- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
__delete_from_swap_cache(page);
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);

swap_free(entry);
page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -367,13 +367,13 @@ int remove_exclusive_swap_page(struct pa
/* Is the only swap cache user the cache itself? */
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
- write_lock_irq(&swapper_space.tree_lock);
+ spin_lock_irq(&swapper_space.tree_lock);
if (!PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- write_unlock_irq(&swapper_space.tree_lock);
+ spin_unlock_irq(&swapper_space.tree_lock);
}
spin_unlock(&swap_lock);

Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -67,15 +67,15 @@ invalidate_complete_page(struct address_
if (PagePrivate(page) && !try_to_release_page(page, 0))
return 0;

- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
if (PageDirty(page)) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
return 0;
}

BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -384,7 +384,7 @@ static int remove_mapping(struct address
return 0; /* truncate got there first */

SetPageNoNewRefs(page);
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);

/*
* The non-racy check for busy page. It is critical to check
@@ -400,13 +400,13 @@ static int remove_mapping(struct address
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
swap_free(swap);
goto free_it;
}

__remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);

free_it:
__ClearPageNoNewRefs(page);
@@ -414,7 +414,7 @@ free_it:
return 1;

cannot_free:
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageNoNewRefs(page);
return 0;
}
@@ -736,7 +736,7 @@ int migrate_page_remove_references(struc
return 1;

SetPageNoNewRefs(page);
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);

radix_pointer = (struct page **)radix_tree_lookup_slot(
&mapping->page_tree,
@@ -744,7 +744,7 @@ int migrate_page_remove_references(struc

if (!page_mapping(page) || page_count(page) != nr_refs ||
*radix_pointer != page) {
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
ClearPageNoNewRefs(page);
return 1;
}
@@ -768,7 +768,7 @@ int migrate_page_remove_references(struc

rcu_assign_pointer(*radix_pointer, newpage);

- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
__put_page(page);
ClearPageNoNewRefs(page);
ClearPageNoNewRefs(newpage);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -628,7 +628,7 @@ int __set_page_dirty_nobuffers(struct pa
struct address_space *mapping2;

if (mapping) {
- write_lock_irq(&mapping->tree_lock);
+ spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
@@ -637,7 +637,7 @@ int __set_page_dirty_nobuffers(struct pa
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- write_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host,
@@ -709,21 +709,22 @@ EXPORT_SYMBOL(set_page_dirty_lock);
int test_clear_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);
- unsigned long flags;

if (mapping) {
- write_lock_irqsave(&mapping->tree_lock, flags);
- if (TestClearPageDirty(page)) {
+ unsigned long flags;
+ int ret;
+
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ ret = TestClearPageDirty(page);
+ if (ret) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
if (mapping_cap_account_dirty(mapping))
__dec_page_state(nr_dirty);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- return 1;
}
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- return 0;
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ return ret;
}
return TestClearPageDirty(page);
}
@@ -762,33 +763,32 @@ EXPORT_SYMBOL(clear_page_dirty_for_io);
int test_clear_page_writeback(struct page *page)
{
struct address_space *mapping = page_mapping(page);
- int ret;

if (mapping) {
unsigned long flags;
+ int ret;

- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
if (ret)
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- } else {
- ret = TestClearPageWriteback(page);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ return ret;
}
- return ret;
+ return TestClearPageWriteback(page);
}

int test_set_page_writeback(struct page *page)
{
struct address_space *mapping = page_mapping(page);
- int ret;

if (mapping) {
unsigned long flags;
+ int ret;

- write_lock_irqsave(&mapping->tree_lock, flags);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
if (!ret)
radix_tree_tag_set(&mapping->page_tree,
@@ -798,11 +798,10 @@ int test_set_page_writeback(struct page
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- } else {
- ret = TestSetPageWriteback(page);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ return ret;
}
- return ret;
+ return TestSetPageWriteback(page);

}
EXPORT_SYMBOL(test_set_page_writeback);
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h
+++ linux-2.6/include/asm-arm/cacheflush.h
@@ -319,9 +319,9 @@ extern void flush_cache_page(struct vm_a
extern void flush_dcache_page(struct page *);

#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)

#define flush_icache_user_range(vma,page,addr,len) \
flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h
+++ linux-2.6/include/asm-parisc/cacheflush.h
@@ -58,9 +58,9 @@ flush_user_icache_range(unsigned long st
extern void flush_dcache_page(struct page *page);

#define flush_dcache_mmap_lock(mapping) \
- write_lock_irq(&(mapping)->tree_lock)
+ spin_lock_irq(&(mapping)->tree_lock)
#define flush_dcache_mmap_unlock(mapping) \
- write_unlock_irq(&(mapping)->tree_lock)
+ spin_unlock_irq(&(mapping)->tree_lock)

#define flush_icache_page(vma,page) do { flush_kernel_dcache_page(page_address(page)); flush_kernel_icache_page(page_address(page)); } while (0)

Index: linux-2.6/drivers/mtd/devices/block2mtd.c
===================================================================
--- linux-2.6.orig/drivers/mtd/devices/block2mtd.c
+++ linux-2.6/drivers/mtd/devices/block2mtd.c
@@ -58,28 +58,27 @@ static void cache_readahead(struct addre

end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);

- read_lock_irq(&mapping->tree_lock);
for (i = 0; i < PAGE_READAHEAD; i++) {
pagei = index + i;
if (pagei > end_index) {
INFO("Overrun end of disk in cache readahead\n");
break;
}
+ /* Don't need mapping->tree_lock - lookup can be racy */
+ rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, pagei);
+ rcu_read_unlock();
if (page && (!i))
break;
if (page)
continue;
- read_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
- read_lock_irq(&mapping->tree_lock);
if (!page)
break;
page->index = pagei;
list_add(&page->lru, &page_pool);
ret++;
}
- read_unlock_irq(&mapping->tree_lock);
if (ret)
read_cache_pages(mapping, &page_pool, filler, NULL);
}

2006-03-11 08:22:46

by Balbir Singh

Subject: Re: [patch 1/3] radix tree: RCU lockless read-side

<snip>

> if (slot->slots[i]) {
> - results[nr_found++] = slot->slots[i];
> + results[nr_found++] = &slot->slots[i];
> if (nr_found == max_items)
> goto out;
> }

A quick clarification - Shouldn't accesses to slot->slots[i] above be
protected using rcu_dereference()?

Warm Regards,
Balbir

2006-03-11 08:48:31

by Nick Piggin

Subject: Re: [patch 1/3] radix tree: RCU lockless read-side

Balbir Singh wrote:
> <snip>
>
>> if (slot->slots[i]) {
>>- results[nr_found++] = slot->slots[i];
>>+ results[nr_found++] = &slot->slots[i];
>> if (nr_found == max_items)
>> goto out;
>> }
>
>
> A quick clarification - Shouldn't accesses to slot->slots[i] above be
> protected using rcu_dereference()?
>

I think we're safe here -- this is the _address_ of the pointer.
However, when dereferencing this address in _gang_lookup,
I think we do need rcu_dereference indeed.

Note that _gang_lookup_slot doesn't do this for us, however --
the caller must do that when dereferencing the pointer to the
item (eg. see page_cache_get_speculative in 2/3).

That said, I'm not 100% sure I have the rcu memory barriers in
the right places (well I'm sure I don't, given the _gang_lookup
bug you exposed!).

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2006-03-13 03:04:55

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch 1/3] radix tree: RCU lockless read-side

On 3/11/06, Nick Piggin <[email protected]> wrote:
> Balbir Singh wrote:
> > <snip>
> >
> >> if (slot->slots[i]) {
> >>- results[nr_found++] = slot->slots[i];
> >>+ results[nr_found++] = &slot->slots[i];
> >> if (nr_found == max_items)
> >> goto out;
> >> }
> >
> >
> > A quick clarification - Shouldn't accesses to slot->slots[i] above be
> > protected using rcu_dereference()?
> >
>
> I think we're safe here -- this is the _address_ of the pointer.
> However, when dereferencing this address in _gang_lookup,
> I think we do need rcu_dereference indeed.
>

Yes, I saw the address operator, but we still dereference "slots" to get
the address.

> Note that _gang_lookup_slot doesn't do this for us, however --
> the caller must do that when dereferencing the pointer to the
> item (eg. see page_cache_get_speculative in 2/3).

Oh! I did not get that far. Will look at the rest of the series

>
> That said, I'm not 100% sure I have the rcu memory barriers in
> the right places (well I'm sure I don't, given the _gang_lookup
> bug you exposed!).

Hmm... Let me look at rcu_torture module and see if I can figure it
out or read the documentation again.

>
> Thanks,
> Nick
>

Warm Regards,
Balbir

2006-03-13 03:11:12

by Nick Piggin

Subject: Re: [patch 1/3] radix tree: RCU lockless read-side

Balbir Singh wrote:
> On 3/11/06, Nick Piggin <[email protected]> wrote:
>
>>Balbir Singh wrote:
>>
>>><snip>
>>>
>>>> if (slot->slots[i]) {
>>>>- results[nr_found++] = slot->slots[i];
>>>>+ results[nr_found++] = &slot->slots[i];
>>>> if (nr_found == max_items)
>>>> goto out;
>>>> }
>>>
>>>
>>>A quick clarification - Shouldn't accesses to slot->slots[i] above be
>>>protected using rcu_dereference()?
>>>
>>
>>I think we're safe here -- this is the _address_ of the pointer.
>>However, when dereferencing this address in _gang_lookup,
>>I think we do need rcu_dereference indeed.
>>
>
>
> Yes, I saw the address operator, but we still dereference "slots" to get
> the address.
>

But we should have already rcu_dereference()ed "slot", right
(in the loop above this one)? That means we are now able to
dereference it, and the data at the other end will be valid.

--
SUSE Labs, Novell Inc.

2006-03-13 06:40:52

by Nick Piggin

Subject: Re: [patch 1/3] radix tree: RCU lockless read-side

On Mon, Mar 13, 2006 at 08:34:53AM +0530, Balbir Singh wrote:
> On 3/11/06, Nick Piggin <[email protected]> wrote:
> > Balbir Singh wrote:
> > > <snip>
> > >
> > >> if (slot->slots[i]) {
> > >>- results[nr_found++] = slot->slots[i];
> > >>+ results[nr_found++] = &slot->slots[i];
> > >> if (nr_found == max_items)
> > >> goto out;
> > >> }
> > >
> > >
> > > A quick clarification - Shouldn't accesses to slot->slots[i] above be
> > > protected using rcu_dereference()?
> > >
> >
> > I think we're safe here -- this is the _address_ of the pointer.
> > However, when dereferencing this address in _gang_lookup,
> > I think we do need rcu_dereference indeed.
> >
>
> Yes, I saw the address operator, but we still dereference "slots" to get
> the address.
>

OK, I reread what you wrote and I misunderstood you earlier I guess.
slot->slots[i] does dereference the pointer at the ith entry of slots,
but &slot->slots[i] does not; it will return the same thing as
slot->slots+i, which only dereferences 'slot' (which we've established
to be safe).

Even if &slot->slots[i] did, for some silly compiler, dereference the
pointer, we never actually see it or use it so it should be harmless.
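
(To spell the equivalence out -- illustrative only:

        void **a = &slot->slots[i];     /* address arithmetic, no load of slots[i] */
        void **b = slot->slots + i;     /* identical result */

both expressions only read 'slot' itself.)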

Nick

2006-03-13 15:24:49

by Balbir Singh

Subject: Re: [patch 1/3] radix tree: RCU lockless read-side

<snip>

>
> But we should have already rcu_dereference()ed "slot", right
> (in the loop above this one)? That means we are now able to
> dereference it, and the data at the other end will be valid.
>

Yes, but my confusion is about the following piece of code

<begin code>

for ( ; height > 1; height--) {

for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
i < RADIX_TREE_MAP_SIZE; i++) {
- if (slot->slots[i] != NULL)
+ __s = rcu_dereference(slot->slots[i]);
+ if (__s != NULL)
break;
index &= ~((1UL << shift) - 1);
index += 1UL << shift;
@@ -531,14 +550,14 @@ __lookup(struct radix_tree_root *root, v
goto out;

shift -= RADIX_TREE_MAP_SHIFT;
- slot = slot->slots[i];
+ slot = __s;
}

/* Bottom level: grab some items */
for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
index++;
if (slot->slots[i]) {
- results[nr_found++] = slot->slots[i];
+ results[nr_found++] = &slot->slots[i];
if (nr_found == max_items)
goto out;
}
<end code>

In the for loop, let's say __s is *not* NULL; we break from the loop.
In the loop below,
slot->slots[i] is dereferenced without rcu and __s is not used. Is that not
inconsistent?

2006-03-13 22:37:16

by Nick Piggin

Subject: Re: [patch 1/3] radix tree: RCU lockless read-side

Balbir Singh wrote:

><snip>
>
>>But we should have already rcu_dereference()ed "slot", right
>>(in the loop above this one)? That means we are now able to
>>dereference it, and the data at the other end will be valid.
>>
>>
>
>Yes, but my confusion is about the following piece of code
>
><begin code>
>
> for ( ; height > 1; height--) {
>
> for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
> i < RADIX_TREE_MAP_SIZE; i++) {
>- if (slot->slots[i] != NULL)
>+ __s = rcu_dereference(slot->slots[i]);
>+ if (__s != NULL)
> break;
> index &= ~((1UL << shift) - 1);
> index += 1UL << shift;
>@@ -531,14 +550,14 @@ __lookup(struct radix_tree_root *root, v
> goto out;
>
> shift -= RADIX_TREE_MAP_SHIFT;
>- slot = slot->slots[i];
>+ slot = __s;
> }
>
> /* Bottom level: grab some items */
> for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
> index++;
> if (slot->slots[i]) {
>- results[nr_found++] = slot->slots[i];
>+ results[nr_found++] = &slot->slots[i];
> if (nr_found == max_items)
> goto out;
> }
><end code>
>
>In the for loop, let's say __s is *not* NULL; we break from the loop.
>In the loop below,
>slot->slots[i] is dereferenced without rcu and __s is not used. Is that not
>inconsistent?
>
>

The "slots" member is an array, not an RCU assigned pointer. As such, after
doing rcu_dereference(slot), you can access slot->slots[i] without further
memory barriers I think?

But I agree that code now is a bit inconsistent. I've cleaned things up a
bit in my tree now... but perhaps it is easier if you send a patch to show
what you mean (because sometimes I'm a bit dense, I'm afraid).

Thanks,
Nick


2006-03-13 23:35:58

by Christoph Lameter

Subject: Re: A lockless pagecache for Linux

On Fri, 10 Mar 2006, Nick Piggin wrote:

> I'm writing some stuff about these patches, and I've uploaded a
> **draft** chapter on the RCU radix-tree, 'radix-intro.pdf' in above
> directory (note the bibliography didn't make it -- but thanks Paul
> McKenney!)

Ah thanks. I had a look at it. Note that the problem with the radix tree
tags is that these are inherited from the lower layer. How is the
consistency of these guaranteed? Also you may want to add a more elaborate
intro and conclusion. Typically these summarize other sections of the
paper.

What you are proposing is to allow lockless read operations right? No
lockless write? The concurrency issue that we currently have is multiple
processes faulting in pages in different ranges from the same file. I
think this is a rather typical usage scenario. Faulting in a page from a
file for reading requires a write operation on the radix tree. The
approach with a lockless read path does not help us. This proposed scheme
would only help if pages are already faulted in and another process starts
using the same pages as an earlier process.

Would it not be better to handle the radix tree in the same way as a page
table? Have a lock at the lowest layer so that different sections of the
radix tree can be locked by different processes? That would enable
concurrent writes.

2006-03-14 03:32:27

by Balbir Singh

Subject: Re: [patch 1/3] radix tree: RCU lockless read-side

> The "slots" member is an array, not an RCU assigned pointer. As such, after
> doing rcu_dereference(slot), you can access slot->slots[i] without further
> memory barriers I think?
>
> But I agree that code now is a bit inconsistent. I've cleaned things up a
> bit in my tree now... but perhaps it is easier if you send a patch to show
> what you mean (because sometimes I'm a bit dense, I'm afraid).
>

For starters, I do not think you're dense at all.

Hmm... slot/slots is quite a confusing pair of names. I was referring to slot and
ended up calling it slots. The point I am contending is that
rcu_dereference(slot->slots[i]) should happen.

<snippet>
+ __s = rcu_dereference(slot->slots[i]);
+ if (__s != NULL)
break;
</snippet>

If we break from the loop because __s != NULL, then in the snippet below

<snippet>
/* Bottom level: grab some items */
for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
index++;
if (slot->slots[i]) {
- results[nr_found++] = slot->slots[i];
+ results[nr_found++] = &slot->slots[i];
if (nr_found == max_items)
goto out;
}
</snippet>

We do not use __s above. "slot->slots[i]" is not rcu_dereference()d in
this case, because we broke out of the loop above with __s not being
NULL. Another issue is: is it good enough to rcu_dereference() slot
once? Shouldn't all uses of *slot->* be rcu_dereference()d?

<suggestion (WARNING: patch has spaces and it's not compiled)>
/* Bottom level: grab some items */
for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
index++;
- if (slot->slots[i]) {
- results[nr_found++] = slot->slots[i];
+ __s = rcu_dereference(slot->slots[i]);
+ if (__s) {
+ /* This is tricky, cannot take the address of __s
 or rcu_dereference() */
+ results[nr_found++] = &slot->slots[i];
if (nr_found == max_items)
goto out;
}
</suggestion>

I hope I am making sense.

Warm Regards,
Balbir

2006-03-14 04:14:49

by Nick Piggin

[permalink] [raw]
Subject: Re: A lockless pagecache for Linux

Christoph Lameter wrote:
> On Fri, 10 Mar 2006, Nick Piggin wrote:
>
>
>>I'm writing some stuff about these patches, and I've uploaded a
>>**draft** chapter on the RCU radix-tree, 'radix-intro.pdf' in above
>>directory (note the bibliography didn't make it -- but thanks Paul
>>McKenney!)
>
>
> Ah thanks. I had a look at it. Note that the problem with the radix tree
> tags is that these are inherited from the lower layer. How is the
> consistency of these guaranteed? Also you may want to add a more elaborate
> intro and conclusion. Typically these summarize other sections of the
> paper.
>

Thanks for looking at it. Yeah in the intro I say that I'm considering
a simplified radix-tree (without tags or gang lookups) to start with.
At the end I say how tags are handled... it isn't quite clear enough
for my liking yet though.

Intro and conclusion - yes, they should be better. It _is_ a chapter from
a larger document; however, I want it to still stand alone as a good
document.

What happens is: read-side tag operations (ie. tag lookups etc) are done
under lock. Ie. they are not made lockless.

> What you are proposing is to allow lockless read operations right? No
> lockless write? The concurrency issue that we currently have is multiple
> processes faulting in pages in different ranges from the same file. I
> think this is a rather typical usage scenario. Faulting in a page from a
> file for reading requires a write operation on the radix tree. The
> approach with a lockless read path does not help us. This proposed scheme
> would only help if pages are already faulted in and another process starts
> using the same pages as an earlier process.
>

Yep, lockless reads only to start with. I think you'll see some benefit
because the read(2) and ->nopage paths also take read-side locks, so your
write side will no longer have to contend with them.

It won't be a huge improvement in scalability though, maybe just a constant
factor.

> Would it not be better to handle the radix tree in the same way as a page
> table? Have a lock at the lowest layer so that different sections of the
> radix tree can be locked by different processes? That would enable
> concurrent writes.
>

Yeah this is the next step. Note that it is not the first step because I
actually want to _speed up_ single threaded lookup paths, rather than
slowing them down, otherwise it will never get accepted.

It also might add quite a large amount of complexity to the radix tree, so
it may no longer be suitable for a generic data structure anymore (depends
how it is implemented). But the write side should be easier than the
read-side so I don't think there is too much to worry about. I already have
some bits and pieces to make it fine-grained.

--
SUSE Labs, Novell Inc.

2006-03-14 05:16:10

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/3] radix tree: RCU lockless read-side

Balbir Singh wrote:
>>The "slots" member is an array, not an RCU assigned pointer. As such, after
>>doing rcu_dereference(slot), you can access slot->slots[i] without further
>>memory barriers I think?
>>
>>But I agree that code now is a bit inconsistent. I've cleaned things up a
>>bit in my tree now... but perhaps it is easier if you send a patch to show
>>what you mean (because sometimes I'm a bit dense, I'm afraid).
>>
>
>
> For starters, I do not think you're dense at all.
>

I'm glad you think so ;)

> Hmm... slot/slots is quite a confusing pair of names. I was referring to slot and
> ended up calling it slots. The point I am contending is that
> rcu_dereference(slot->slots[i]) should happen.
>
> <snippet>
> + __s = rcu_dereference(slot->slots[i]);
> + if (__s != NULL)
> break;
> </snippet>
>
> If we break from the loop because __s != NULL, then in the snippet below
>

This is a little confusing, because technically I don't think it needs to
rcu_dereference here either, only when we actually want to dereference
__s and read the data at the other end.

rcu_dereference is perhaps an awkward interface for optimising this case.
#define rcu_make_pointer_safe_to_follow(ptr) smp_read_barrier_depends()
or
#define rcu_follow_pointer(ptr) ({ smp_read_barrier_depends(); *ptr; })
might be better

	__s = slot->slots[i];
	if (__s != NULL) {
		rcu_make_pointer_safe_to_follow(__s);
		...

That would allow the barrier to be skipped if we're not going to follow the
pointer.
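
For illustration only (untested; rcu_make_pointer_safe_to_follow is just
the hypothetical name suggested above), the descent loop from the patch
might then look roughly like this (eliding the index-wraparound check
that sits in the hunk above):

	for ( ; height > 1; height--) {
		for (i = (index >> shift) & RADIX_TREE_MAP_MASK;
				i < RADIX_TREE_MAP_SIZE; i++) {
			__s = slot->slots[i];	/* plain load, no barrier */
			if (__s != NULL)
				break;
			index &= ~((1UL << shift) - 1);
			index += 1UL << shift;
		}
		if (i == RADIX_TREE_MAP_SIZE)
			goto out;

		/* now we really will follow __s down a level, so make
		 * the dependent loads safe first */
		rcu_make_pointer_safe_to_follow(__s);
		shift -= RADIX_TREE_MAP_SHIFT;
		slot = __s;
	}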

> <snippet>
> /* Bottom level: grab some items */
> for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
> index++;
> if (slot->slots[i]) {
> - results[nr_found++] = slot->slots[i];
> + results[nr_found++] = &slot->slots[i];
> if (nr_found == max_items)
> goto out;
> }
> </snippet>
>
> We do not use __s above. "slot->slots[i]" is not rcu_dereference()d in
> this case, because we broke out of the loop above with __s not being
> NULL. Another issue is: is it good enough to rcu_dereference() slot
> once? Shouldn't all uses of *slot->* be rcu_dereference()d?
>

Yes, slot is a local pointer and the address it points to will not change,
so once we have loaded it and issued the read barrier, we can follow it
from then on and reads will not find stale values.

> <suggestion (WARNING: patch has spaces and it's not compiled)>
> /* Bottom level: grab some items */
> for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
> index++;
> - if (slot->slots[i]) {
> - results[nr_found++] = slot->slots[i];
> + __s = rcu_dereference(slot->slots[i]);
> + if (__s) {
> + /* This is tricky, cannot take the address of __s
> or rcu_dereference() */
> + results[nr_found++] = &slot->slots[i];
> if (nr_found == max_items)
> goto out;
> }
> </suggestion>
>
> I hope I am making sense.
>

In this case I think we do not need a barrier because we are only looking
at the pointer (whether it is NULL or not), rather than following it down.
Paul may be able to jump in at this point.
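
Concretely (reusing the names from your snippet; sketch only): in the
bottom level loop we only test the value in the slot and record the
slot's address; nothing is loaded through the pointer, so no dependency
barrier should be needed there:

	/* test the pointer and take the slot's address; nothing is
	 * read through it here, so no read barrier is required */
	if (slot->slots[i] != NULL)
		results[nr_found++] = &slot->slots[i];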

I'll release a new patchset in the next couple of days and try to be more
consistent with the barriers, which will hopefully make things less confusing.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2006-03-14 12:59:37

by Wu Fengguang

[permalink] [raw]
Subject: Re: A lockless pagecache for Linux

On Tue, Mar 14, 2006 at 03:14:38PM +1100, Nick Piggin wrote:
> Christoph Lameter wrote:
> >What you are proposing is to allow lockless read operations right? No
> >lockless write? The concurrency issue that we currently have is multiple
> >processes faulting in pages in different ranges from the same file. I
> >think this is a rather typical usage scenario. Faulting in a page from a
> >file for reading requires a write operation on the radix tree. The
> >approach with a lockless read path does not help us. This proposed scheme
> >would only help if pages are already faulted in and another process starts
> >using the same pages as an earlier process.
> >
>
> Yep, lockless reads only to start with. I think you'll see some benefit
> because the read(2) and ->nopage paths also take read-side locks, so your
> write side will no longer have to contend with them.
>
> It won't be a huge improvement in scalability though, maybe just a constant
> factor.
>
> >Would it not be better to handle the radix tree in the same way as a page
> >table? Have a lock at the lowest layer so that different sections of the
> >radix tree can be locked by different processes? That would enable
> >concurrent writes.
> >
>
> Yeah this is the next step. Note that it is not the first step because I
> actually want to _speed up_ single threaded lookup paths, rather than
> slowing them down, otherwise it will never get accepted.
>
> It also might add quite a large amount of complexity to the radix tree, so
> it may no longer be suitable for a generic data structure anymore (depends
> how it is implemented). But the write side should be easier than the
> read-side so I don't think there is too much to worry about. I already have
> some bits and pieces to make it fine-grained.

Maybe we can try another way to reduce the concurrent radix tree
writers problem: coordinating and serializing writers at a high level.

Since kswapd is the major radix tree deleter, and readahead is the
major radix tree inserter, putting parts of them together in a loop
might reduce the contention noticeably.

The following pseudo-code shows the basic idea
(integrating kprefetchd is also possible; it is left out here just for simplicity):

PER_NODE(ra_queue);

kswapd()
{
	loop {
		loop {
			free enough pages for top(ra_queue)
		}
		submit(pop(ra_queue));
		wait();
	}
}

readahead()
{
	assemble one ra_req

	if (ra_req immediately needed)
		submit(ra_req);
	else {
		push(ra_queue, ra_req);
		wakeup_kswapd();
	}
}

This scheme might reduce
- direct page reclaim pressure
- radix tree write lock contention
- lru lock contention

Regards,
Wu