2008-10-26 12:47:10

by Jörn Engel

[permalink] [raw]
Subject: [RFC] B+Tree library

The idea of using btrees in the kernel is not exactly new. They have a
number of advantages over rbtrees and hashtables, but also a number of
drawbacks. More importantly, logfs already makes use of them and -
since I don't want any incompatible code to get merged and suffer the
trouble it creates - I would like to discuss my implementation and where
it makes sense and where it doesn't.

General advantages of btrees are memory density and efficient use of
cachelines. Hashtables are either too small and degrade into linked
list performance, or they are too large and waste memory. With changing
workloads, both may be true on the same system. Rbtrees have a bad
fanout of less than 2 (they are not actually balanced binary trees),
hence a lookup has to read a fairly large number of cachelines.
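
To put rough numbers on the fanout argument (back-of-envelope, not a
measurement): with a million entries an rbtree can be up to
2*log2(10^6), roughly 40, levels deep, so a lookup may touch around 40
cachelines. A btree with 128-byte nodes holding 8 key/value pairs only
needs about log8(10^6), roughly 7, nodes - some 14 cachelines even if
every node spans two.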

Main disadvantage of btrees is that they are complicated, come in a
gazillion subtly different variants that differ mainly in the balance
between read efficiency and write efficiency. Comparing btrees against
anything is a bit like comparing apples and random fruits.

This implementation is extremely simple. It splits nodes when they
overflow. It does not move elements to neighboring nodes. It does not
try fancy 2:3 splits. It does not even merge nodes when they shrink,
making degenerate cases possible. And it requires callers to do
tree-global locking. In effect, it will be hard to find anything less
sophisticated.

The one aspect where my implementation is actually nice is in allowing
variable key length. Btree keys are interpreted as an array of unsigned
long. So by passing the correct geometry to the core functions, it is
possible to handle 32bit, 64bit or 128bit btrees, which logfs uses. If
so desired, any other weird data format can be used as well (Zach, are
you reading this?).

So would something like this be merged once some users are identified?
Would it be useful for anything but logfs? Or am I plain nuts?

include/linux/btree.h | 272 ++++++++++++++++++++++
lib/Kconfig | 3
lib/Makefile | 1
lib/btree.c | 603 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 879 insertions(+)

Jörn

--
It is the mark of an educated mind to be able to entertain a thought
without accepting it.
-- Aristotle

diff --git a/include/linux/btree.h b/include/linux/btree.h
new file mode 100644
index 0000000..d56a691
--- /dev/null
+++ b/include/linux/btree.h
@@ -0,0 +1,272 @@
+#ifndef BTREE_H
+#define BTREE_H
+
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+
+/*
+ * B+Tree node format:
+ * [key0, key1, ..., keyN] [val0, val1, ..., valN]
+ * Each key is an array of unsigned longs, geo->keylen in total.
+ * Total number of keys and vals (N) is geo->no_pairs.
+ */
+
+struct btree_head {
+ unsigned long *node;
+ mempool_t *mempool;
+ int height;
+ gfp_t gfp_mask;
+};
+
+struct btree_geo {
+ int keylen;
+ int no_pairs;
+};
+extern struct btree_geo btree_geo32;
+extern struct btree_geo btree_geo64;
+extern struct btree_geo btree_geo128;
+
+struct btree_headl { struct btree_head h; };
+struct btree_head32 { struct btree_head h; };
+struct btree_head64 { struct btree_head h; };
+struct btree_head128 { struct btree_head h; };
+
+/*
+ * These couple of functions are all there is to it. The rest of this header
+ * consists only of wrappers that try to add some typesafety, make the code
+ * a little self-documenting and generally be nice to people.
+ */
+void *btree_alloc(gfp_t gfp_mask, void *pool_data);
+void btree_free(void *element, void *pool_data);
+void btree_init(struct btree_head *head, mempool_t *mempool, gfp_t gfp);
+void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key);
+int btree_insert(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val);
+void *btree_remove(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key);
+int btree_merge(struct btree_head *target, struct btree_head *victim,
+ struct btree_geo *geo, unsigned long *duplicate);
+unsigned long *btree_last(struct btree_head *head, struct btree_geo *geo);
+size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
+ long opaque,
+ void (*func)(void *elem, long opaque, unsigned long *key,
+ size_t index, void *func2), void *func2);
+size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
+ long opaque,
+ void (*func)(void *elem, long opaque, unsigned long *key,
+ size_t index, void *func2), void *func2);
+
+/* key is unsigned long */
+static inline void btree_initl(struct btree_headl *head, mempool_t *mempool,
+ gfp_t gfp)
+{
+ btree_init(&head->h, mempool, gfp);
+}
+
+static inline void *btree_lookupl(struct btree_headl *head, unsigned long key)
+{
+ return btree_lookup(&head->h, &btree_geo32, &key);
+}
+
+static inline int btree_insertl(struct btree_headl *head, unsigned long key,
+ void *val)
+{
+ return btree_insert(&head->h, &btree_geo32, &key, val);
+}
+
+static inline void *btree_removel(struct btree_headl *head, unsigned long key)
+{
+ return btree_remove(&head->h, &btree_geo32, &key);
+}
+
+static inline int btree_mergel(struct btree_headl *target,
+ struct btree_headl *victim)
+{
+ unsigned long scratch;
+
+ return btree_merge(&target->h, &victim->h, &btree_geo32, &scratch);
+}
+
+void visitorl(void *elem, long opaque, unsigned long *key, size_t index,
+ void *__func);
+
+typedef void (*visitorl_t)(void *elem, long opaque, unsigned long key,
+ size_t index);
+
+static inline size_t btree_visitorl(struct btree_headl *head, long opaque,
+ visitorl_t func2)
+{
+ return btree_visitor(&head->h, &btree_geo32, opaque, visitorl, func2);
+}
+
+static inline size_t btree_grim_visitorl(struct btree_headl *head, long opaque,
+ visitorl_t func2)
+{
+ return btree_grim_visitor(&head->h, &btree_geo32, opaque, visitorl, func2);
+}
+
+/* key is u32 */
+static inline void btree_init32(struct btree_head32 *head, mempool_t *mempool,
+ gfp_t gfp)
+{
+ btree_init(&head->h, mempool, gfp);
+}
+
+static inline void *btree_lookup32(struct btree_head32 *head, u32 key)
+{
+ return btree_lookup(&head->h, &btree_geo32, (unsigned long *)&key);
+}
+
+static inline int btree_insert32(struct btree_head32 *head, u32 key,
+ void *val)
+{
+ return btree_insert(&head->h, &btree_geo32, (unsigned long *)&key, val);
+}
+
+static inline void *btree_remove32(struct btree_head32 *head, u32 key)
+{
+ return btree_remove(&head->h, &btree_geo32, (unsigned long *)&key);
+}
+
+static inline int btree_merge32(struct btree_head32 *target,
+ struct btree_head32 *victim)
+{
+ unsigned long scratch;
+
+ return btree_merge(&target->h, &victim->h, &btree_geo32, &scratch);
+}
+
+void visitor32(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func);
+
+typedef void (*visitor32_t)(void *elem, long opaque, u32 key, size_t index);
+
+static inline size_t btree_visitor32(struct btree_head32 *head, long opaque,
+ visitor32_t func2)
+{
+ return btree_visitor(&head->h, &btree_geo32, opaque, visitor32, func2);
+}
+
+static inline size_t btree_grim_visitor32(struct btree_head32 *head, long opaque,
+ visitor32_t func2)
+{
+ return btree_grim_visitor(&head->h, &btree_geo32, opaque, visitor32, func2);
+}
+
+/* key is u64 */
+static inline void btree_init64(struct btree_head64 *head, mempool_t *mempool,
+ gfp_t gfp)
+{
+ btree_init(&head->h, mempool, gfp);
+}
+
+static inline void *btree_lookup64(struct btree_head64 *head, u64 key)
+{
+ return btree_lookup(&head->h, &btree_geo64, (unsigned long *)&key);
+}
+
+static inline int btree_insert64(struct btree_head64 *head, u64 key,
+ void *val)
+{
+ return btree_insert(&head->h, &btree_geo64, (unsigned long *)&key, val);
+}
+
+static inline void *btree_remove64(struct btree_head64 *head, u64 key)
+{
+ return btree_remove(&head->h, &btree_geo64, (unsigned long *)&key);
+}
+
+static inline int btree_merge64(struct btree_head64 *target,
+ struct btree_head64 *victim)
+{
+ u64 scratch;
+
+ return btree_merge(&target->h, &victim->h, &btree_geo64,
+ (unsigned long *)&scratch);
+}
+
+void visitor64(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func);
+
+typedef void (*visitor64_t)(void *elem, long opaque, u64 key, size_t index);
+
+static inline size_t btree_visitor64(struct btree_head64 *head, long opaque,
+ visitor64_t func2)
+{
+ return btree_visitor(&head->h, &btree_geo64, opaque, visitor64, func2);
+}
+
+static inline size_t btree_grim_visitor64(struct btree_head64 *head, long opaque,
+ visitor64_t func2)
+{
+ return btree_grim_visitor(&head->h, &btree_geo64, opaque, visitor64, func2);
+}
+
+/* key is 128bit (two u64) */
+static inline void btree_init128(struct btree_head128 *head, mempool_t *mempool,
+ gfp_t gfp)
+{
+ btree_init(&head->h, mempool, gfp);
+}
+
+static inline void *btree_lookup128(struct btree_head128 *head, u64 k1, u64 k2)
+{
+ u64 key[2] = {k1, k2};
+ return btree_lookup(&head->h, &btree_geo128, (unsigned long *)&key);
+}
+
+static inline int btree_insert128(struct btree_head128 *head, u64 k1, u64 k2,
+ void *val)
+{
+ u64 key[2] = {k1, k2};
+ return btree_insert(&head->h, &btree_geo128, (unsigned long *)&key, val);
+}
+
+static inline void *btree_remove128(struct btree_head128 *head, u64 k1, u64 k2)
+{
+ u64 key[2] = {k1, k2};
+ return btree_remove(&head->h, &btree_geo128, (unsigned long *)&key);
+}
+
+static inline void btree_last128(struct btree_head128 *head, u64 *k1, u64 *k2)
+{
+ u64 *key = (u64 *)btree_last(&head->h, &btree_geo128);
+
+ if (key) {
+ *k1 = key[0];
+ *k2 = key[1];
+ } else {
+ *k1 = 0;
+ *k2 = 0;
+ }
+}
+
+static inline int btree_merge128(struct btree_head128 *target,
+ struct btree_head128 *victim)
+{
+ u64 scratch[2];
+
+ return btree_merge(&target->h, &victim->h, &btree_geo128,
+ (unsigned long *)scratch);
+}
+
+void visitor128(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func);
+
+typedef void (*visitor128_t)(void *elem, long opaque, u64 key1, u64 key2,
+ size_t index);
+
+static inline size_t btree_visitor128(struct btree_head128 *head, long opaque,
+ visitor128_t func2)
+{
+ return btree_visitor(&head->h, &btree_geo128, opaque, visitor128, func2);
+}
+
+static inline size_t btree_grim_visitor128(struct btree_head128 *head, long opaque,
+ visitor128_t func2)
+{
+ return btree_grim_visitor(&head->h, &btree_geo128, opaque, visitor128, func2);
+}
+
+#endif
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index cd6d02c..f07a8bd 100644
diff --git a/lib/Kconfig b/lib/Kconfig
index 8cc8e87..7614216 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -129,6 +129,9 @@ config TEXTSEARCH_FSM
config PLIST
boolean

+config BTREE
+ boolean
+
config HAS_IOMEM
boolean
depends on !NO_IOMEM
diff --git a/lib/Makefile b/lib/Makefile
index 2d7001b..b1eed60 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -34,6 +34,7 @@ lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_PLIST) += plist.o
+obj-$(CONFIG_BTREE) += btree.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o

diff --git a/lib/btree.c b/lib/btree.c
new file mode 100644
index 0000000..e0770da
--- /dev/null
+++ b/lib/btree.c
@@ -0,0 +1,603 @@
+/*
+ * lib/btree.c - Simple In-memory B+Tree
+ *
+ * As should be obvious for Linux kernel code, license is GPLv2
+ *
+ * Copyright (c) 2007-2008 Joern Engel <[email protected]>
+ * Bits and pieces stolen from Peter Zijlstra's code, which is
+ * Copyright 2007, Red Hat Inc. Peter Zijlstra <[email protected]>
+ * GPLv2
+ *
+ * see http://programming.kicks-ass.net/kernel-patches/vma_lookup/btree.patch
+ *
+ * A relatively simple B+Tree implementation. I have written it as a learning
+ * exercise to understand how B+Trees work. Turned out to be useful as well.
+ *
+ * B+Trees can be used similar to Linux radix trees (which don't have anything
+ * in common with textbook radix trees, beware). Prerequisite for them working
+ * well is that access to a random tree node is much faster than a large number
+ * of operations within each node.
+ *
+ * Disks have fulfilled the prerequite for a long time. More recently DRAM
+ * has gained similar properties, as memory access times, when measured in cpu
+ * cycles, have increased. Cacheline sizes have increased as well, which also
+ * helps B+Trees.
+ *
+ * Compared to radix trees, B+Trees are more efficient when dealing with a
+ * sparsely populated address space. Between 25% and 50% of the memory is
+ * occupied with valid pointers. When densely populated, radix trees contain
+ * ~98% pointers - hard to beat. Very sparse radix trees contain only ~2%
+ * pointers.
+ *
+ * This particular implementation stores pointers identified by a long value.
+ * Storing NULL pointers is illegal, lookup will return NULL when no entry
+ * was found.
+ *
+ * Two tricks were used that are not commonly found in textbooks. First, the
+ * lowest values are to the right, not to the left. All used slots within a
+ * node are on the left, all unused slots contain NUL values. Most operations
+ * simply loop once over all slots and terminate on the first NUL.
+ *
+ * Second trick is to special-case the key "0" or NUL. As seen above, this
+ * value indicates an unused slot, so such a value should not be stored in the
+ * tree itself. Instead it is stored in the null_ptr field in the btree_head.
+ */
+
+#include <linux/btree.h>
+#include <linux/cache.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+#define NODESIZE MAX(L1_CACHE_BYTES, 128)
+
+struct btree_geo btree_geo32 = {
+ .keylen = 1,
+ .no_pairs = NODESIZE / sizeof(long) / 2,
+};
+
+#define LONG_PER_U64 (64 / BITS_PER_LONG)
+struct btree_geo btree_geo64 = {
+ .keylen = LONG_PER_U64,
+ .no_pairs = NODESIZE / sizeof(long) / (1 + LONG_PER_U64),
+};
+
+struct btree_geo btree_geo128 = {
+ .keylen = 2 * LONG_PER_U64,
+ .no_pairs = NODESIZE / sizeof(long) / (1 + 2 * LONG_PER_U64),
+};
+
+static struct kmem_cache *btree_cachep;
+
+void *btree_alloc(gfp_t gfp_mask, void *pool_data)
+{
+ return kmem_cache_alloc(btree_cachep, gfp_mask);
+}
+
+void btree_free(void *element, void *pool_data)
+{
+ kmem_cache_free(btree_cachep, element);
+}
+
+static unsigned long *btree_node_alloc(struct btree_head *head)
+{
+ unsigned long *node;
+
+ node = mempool_alloc(head->mempool, head->gfp_mask);
+ if (node)
+ memset(node, 0, NODESIZE);
+ return node;
+}
+
+static int longcmp(const unsigned long *l1, const unsigned long *l2, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++) {
+ if (l1[i] < l2[i])
+ return -1;
+ if (l1[i] > l2[i])
+ return 1;
+ }
+ return 0;
+}
+
+static unsigned long *longcpy(unsigned long *dest, const unsigned long *src,
+ size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++)
+ dest[i] = src[i];
+ return dest;
+}
+
+static unsigned long *longset(unsigned long *s, unsigned long c, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++)
+ s[i] = c;
+ return s;
+}
+
+/*
+ * B+Tree node format:
+ * [key0, key1, ..., keyN] [val0, val1, ..., valN]
+ * Each key is an array of unsigned longs, geo->keylen in total.
+ * Total number of keys and vals (N) is geo->no_pairs.
+ */
+
+static unsigned long *bkey(struct btree_geo *geo, unsigned long *node, int n)
+{
+ return &node[n * geo->keylen];
+}
+
+static unsigned long bval(struct btree_geo *geo, unsigned long *node, int n)
+{
+ return node[geo->no_pairs * geo->keylen + n];
+}
+
+static void setkey(struct btree_geo *geo, unsigned long *node,
+ unsigned long *key, int n)
+{
+ longcpy(bkey(geo, node, n), key, geo->keylen);
+}
+
+static void setval(struct btree_geo *geo, unsigned long *node,
+ unsigned long val, int n)
+{
+ node[geo->no_pairs * geo->keylen + n] = val;
+}
+
+static void clearpair(struct btree_geo *geo, unsigned long *node, int n)
+{
+ longset(bkey(geo, node, n), 0, geo->keylen);
+ node[geo->no_pairs * geo->keylen + n] = 0;
+}
+
+#if 0
+static void dumpkey(struct btree_geo *geo, unsigned long *key)
+{
+ int k;
+
+ printk("(%lx", key[0]);
+ for (k = 1; k < geo->keylen; k++)
+ printk(",%lx", key[k]);
+ printk(")");
+}
+
+static void dumpnode(struct btree_geo *geo, unsigned long *node)
+{
+ int i;
+ unsigned long *key;
+
+ printk("%p: ", node);
+ for (i = 0; i < geo->no_pairs; i++) {
+ key = bkey(geo, node, i);
+ dumpkey(geo, key);
+ printk(" %lx ", bval(geo, node, i));
+ }
+ printk("\n");
+}
+
+static void __dumptree(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *node, int height)
+{
+ int i;
+ unsigned long *child;
+
+ if (!height)
+ return;
+
+ printk("%2x ", height);
+ dumpnode(geo, node);
+ for (i = 0; i < geo->no_pairs; i++) {
+ child = (void *)bval(geo, node, i);
+ if (!child)
+ return;
+ __dumptree(head, geo, child, height - 1);
+ }
+}
+
+static void dumptree(struct btree_head *head, struct btree_geo *geo)
+{
+ __dumptree(head, geo, head->node, head->height);
+}
+#endif
+
+static inline void __btree_init(struct btree_head *head)
+{
+ head->node = NULL;
+ head->height = 0;
+}
+
+void btree_init(struct btree_head *head, mempool_t *mempool, gfp_t gfp)
+{
+ __btree_init(head);
+ head->mempool = mempool;
+ head->gfp_mask = gfp;
+}
+
+unsigned long *btree_last(struct btree_head *head, struct btree_geo *geo)
+{
+ int height = head->height;
+ unsigned long *node = head->node;
+
+ if (height == 0)
+ return NULL;
+
+ for ( ; height > 1; height--)
+ node = (unsigned long *)bval(geo, node, 0);
+
+ return bkey(geo, node, 0);
+}
+
+static int keycmp(struct btree_geo *geo, unsigned long *node, int pos,
+ unsigned long *key)
+{
+ return longcmp(bkey(geo, node, pos), key, geo->keylen);
+}
+
+void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key)
+{
+ int i, height = head->height;
+ unsigned long *node = head->node;
+
+ if (height == 0)
+ return NULL;
+
+ for ( ; height > 1; height--) {
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+ if (i == geo->no_pairs)
+ return NULL;
+ node = (unsigned long *)bval(geo, node, i);
+ if (!node)
+ return NULL;
+ }
+
+ if (!node)
+ return NULL;
+
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) == 0)
+ return (void *)bval(geo, node, i);
+ return NULL;
+}
+
+static int getpos(struct btree_geo *geo, unsigned long *node,
+ unsigned long *key)
+{
+ int i;
+
+ for (i = 0; i < geo->no_pairs; i++) {
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+ }
+ return i;
+}
+
+static int getfill(struct btree_geo *geo, unsigned long *node, int start)
+{
+ int i;
+
+ for (i = start; i < geo->no_pairs; i++)
+ if (bval(geo, node, i) == 0)
+ break;
+ return i;
+}
+
+/*
+ * locate the correct leaf node in the btree
+ */
+static unsigned long *find_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level)
+{
+ unsigned long *node = head->node;
+ int i, height;
+
+ for (height = head->height; height > level; height--) {
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+
+ if ((i == geo->no_pairs) || !bval(geo, node, i)) {
+ /* right-most key is too large, update it */
+ i--;
+ setkey(geo, node, key, i);
+ }
+ BUG_ON(i < 0);
+ node = (unsigned long *)bval(geo, node, i);
+ }
+ BUG_ON(!node);
+ return node;
+}
+
+static int btree_grow(struct btree_head *head, struct btree_geo *geo)
+{
+ unsigned long *node;
+ int fill;
+
+ node = btree_node_alloc(head);
+ if (!node)
+ return -ENOMEM;
+ if (head->node) {
+ fill = getfill(geo, head->node, 0);
+ setkey(geo, node, bkey(geo, head->node, fill - 1), 0);
+ setval(geo, node, (unsigned long)head->node, 0);
+ }
+ head->node = node;
+ head->height++;
+ return 0;
+}
+
+static int btree_insert_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, unsigned long val, int level)
+{
+ unsigned long *node;
+ int i, pos, fill, err;
+
+ BUG_ON(!val);
+ if (head->height < level) {
+ err = btree_grow(head, geo);
+ if (err)
+ return err;
+ }
+
+retry:
+ node = find_level(head, geo, key, level);
+ pos = getpos(geo, node, key);
+ fill = getfill(geo, node, pos);
+ /* two identical keys are not allowed */
+ BUG_ON(pos < fill && keycmp(geo, node, pos, key) == 0);
+
+ if (fill == geo->no_pairs) {
+ /* need to split node */
+ unsigned long *new;
+
+ new = btree_node_alloc(head);
+ if (!new)
+ return -ENOMEM;
+ err = btree_insert_level(head, geo,
+ bkey(geo, node, fill / 2 - 1),
+ (unsigned long)new, level + 1);
+ if (err) {
+ mempool_free(new, head->mempool);
+ return err;
+ }
+ for (i = 0; i < fill / 2; i++) {
+ setkey(geo, new, bkey(geo, node, i), i);
+ setval(geo, new, bval(geo, node, i), i);
+ setkey(geo, node, bkey(geo, node, i + fill / 2), i);
+ setval(geo, node, bval(geo, node, i + fill / 2), i);
+ clearpair(geo, node, i + fill / 2);
+ }
+ if (fill & 1) {
+ setkey(geo, node, bkey(geo, node, fill - 1), i);
+ setval(geo, node, bval(geo, node, fill - 1), i);
+ clearpair(geo, node, fill - 1);
+ }
+ goto retry;
+ }
+ BUG_ON(fill >= geo->no_pairs);
+
+ /* shift and insert */
+ for (i = fill; i > pos; i--) {
+ setkey(geo, node, bkey(geo, node, i - 1), i);
+ setval(geo, node, bval(geo, node, i - 1), i);
+ }
+ setkey(geo, node, key, pos);
+ setval(geo, node, val, pos);
+
+ return 0;
+}
+
+int btree_insert(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val)
+{
+ return btree_insert_level(head, geo, key, (unsigned long)val, 1);
+}
+
+static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level)
+{
+ unsigned long *node;
+ int i, pos, fill;
+ void *ret;
+
+ if (level > head->height) {
+ /* we recursed all the way up */
+ head->height = 0;
+ head->node = NULL;
+ return NULL;
+ }
+
+ node = find_level(head, geo, key, level);
+ pos = getpos(geo, node, key);
+ fill = getfill(geo, node, pos);
+ if ((level == 1) && (keycmp(geo, node, pos, key) != 0))
+ return NULL;
+ ret = (void *)bval(geo, node, pos);
+
+ /* remove and shift */
+ for (i = pos; i < fill - 1; i++) {
+ setkey(geo, node, bkey(geo, node, i + 1), i);
+ setval(geo, node, bval(geo, node, i + 1), i);
+ }
+ clearpair(geo, node, fill - 1);
+
+ if (fill - 1 < geo->no_pairs / 2) {
+ /*
+ * At this point there *should* be code to either merge with
+ * a neighboring node or steal some entries from it to preserve
+ * the btree invariant of only having nodes with n/2..n
+ * elements.
+ *
+ * As you can see, that code is left as an exercise to the
+ * reader or anyone noticing severe performance problems in
+ * very rare cases.
+ *
+ * As-is this code "implements" a method called lazy deletion,
+ * which according to text books is relatively common in
+ * databases and usually works quite well.
+ * In the not-so-usual case though, the btree can degrade into very
+ * long lists of 1-element nodes and perform accordingly.
+ */
+ }
+ if (fill - 1 == 0) {
+ btree_remove_level(head, geo, key, level + 1);
+ mempool_free(node, head->mempool);
+ }
+
+ return ret;
+}
+
+void *btree_remove(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key)
+{
+ if (head->height == 0)
+ return NULL;
+
+ return btree_remove_level(head, geo, key, 1);
+}
+
+int btree_merge(struct btree_head *target, struct btree_head *victim,
+ struct btree_geo *geo, unsigned long *duplicate)
+{
+ unsigned long *key;
+ void *val;
+ int err;
+
+ BUG_ON(target == victim);
+
+ if (!(target->node)) {
+ /* target is empty, just copy fields over */
+ target->node = victim->node;
+ target->height = victim->height;
+ __btree_init(victim);
+ return 0;
+ }
+
+ for (;;) {
+ key = btree_last(victim, geo);
+ if (!key)
+ break;
+ val = btree_lookup(victim, geo, key);
+ err = btree_insert(target, geo, key, val);
+ if (err)
+ return err;
+ /* We must make a copy of the key, as the original will get
+ * mangled inside btree_remove. */
+ longcpy(duplicate, key, geo->keylen);
+ btree_remove(victim, geo, duplicate);
+ }
+ return 0;
+}
+
+static size_t __btree_for_each(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *node, long opaque,
+ void (*func)(void *elem, long opaque,
+ unsigned long *key, size_t index, void *func2),
+ void *func2, int reap, int height, size_t count)
+{
+ int i;
+ unsigned long *child;
+
+ for (i = 0; i < geo->no_pairs; i++) {
+ child = (void *)bval(geo, node, i);
+ if (!child)
+ break;
+ if (height > 1)
+ count = __btree_for_each(head, geo, child, opaque,
+ func, func2, reap, height - 1, count);
+ else
+ func(child, opaque, bkey(geo, node, i), count++,
+ func2);
+ }
+ if (reap)
+ mempool_free(node, head->mempool);
+ return count;
+}
+
+static void empty(void *elem, long opaque, unsigned long *key, size_t index,
+ void *func2)
+{
+}
+
+void visitorl(void *elem, long opaque, unsigned long *key, size_t index,
+ void *__func)
+{
+ visitorl_t func = __func;
+
+ func(elem, opaque, *key, index);
+}
+
+void visitor32(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func)
+{
+ visitor32_t func = __func;
+ u32 *key = (void *)__key;
+
+ func(elem, opaque, *key, index);
+}
+
+void visitor64(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func)
+{
+ visitor64_t func = __func;
+ u64 *key = (void *)__key;
+
+ func(elem, opaque, *key, index);
+}
+
+void visitor128(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func)
+{
+ visitor128_t func = __func;
+ u64 *key = (void *)__key;
+
+ func(elem, opaque, key[0], key[1], index);
+}
+
+size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
+ long opaque,
+ void (*func)(void *elem, long opaque, unsigned long *key,
+ size_t index, void *func2), void *func2)
+{
+ size_t count = 0;
+
+ if (!func)
+ func = empty;
+ if (head->node)
+ count = __btree_for_each(head, geo, head->node, opaque, func,
+ func2, 0, head->height, 0);
+ return count;
+}
+
+size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
+ long opaque,
+ void (*func)(void *elem, long opaque, unsigned long *key,
+ size_t index, void *func2), void *func2)
+{
+ size_t count = 0;
+
+ if (!func)
+ func = empty;
+ if (head->node)
+ count = __btree_for_each(head, geo, head->node, opaque, func,
+ func2, 1, head->height, 0);
+ __btree_init(head);
+ return count;
+}
+
+static int __init btree_module_init(void)
+{
+ btree_cachep = kmem_cache_create("btree_node", NODESIZE, 0,
+ SLAB_HWCACHE_ALIGN, NULL);
+ return 0;
+}
+
+/* If core code starts using btree, initialization should happen even earlier */
+module_init(btree_module_init);


2008-10-28 01:26:24

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Sun, Oct 26, 2008 at 01:46:44PM +0100, Jörn Engel wrote:
> The idea of using btrees in the kernel is not exactly new. They have a
> number of advantages over rbtrees and hashtables, but also a number of
> drawbacks. More importantly, logfs already makes use of them and -
> since I don't want any incompatible code to get merged and suffer the
> trouble it creates - I would like to discuss my implementation and where
> it makes sense and where it doesn't.
>
> General advantages of btrees are memory density and efficient use of
> cachelines. Hashtables are either too small and degrade into linked
> list performance, or they are too large and waste memory. With changing
> workloads, both may be true on the same system. Rbtrees have a bad
> fanout of less than 2 (they are not actually balanced binary trees),
> hence a lookup has to read a fairly large number of cachelines.
>
> Main disadvantage of btrees is that they are complicated, come in a
> gazillion subtly different variants that differ mainly in the balance
> between read efficiency and write efficiency. Comparing btrees against
> anything is a bit like comparing apples and random fruits.
>
> This implementation is extremely simple. It splits nodes when they
> overflow. It does not move elements to neighboring nodes. It does not
> try fancy 2:3 splits. It does not even merge nodes when they shrink,
> making degenerate cases possible. And it requires callers to do
> tree-global locking. In effect, it will be hard to find anything less
> sophisticated.
>
> The one aspect where my implementation is actually nice is in allowing
> variable key length. Btree keys are interpreted as an array of unsigned
> long. So by passing the correct geometry to the core functions, it is
> possible to handle 32bit, 64bit or 128bit btrees, which logfs uses. If
> so desired, any other weird data format can be used as well (Zach, are
> you reading this?).
>
> So would something like this be merged once some users are identified?
> Would it be useful for anything but logfs? Or am I plain nuts?

I think a btree library would be useful - there are places where
people are using rb-trees instead of btree simply because it's
easier to use the rbtree than it is to implement a btree library.
I can think of several places I could use such a library for
in-memory extent representation....

That being said, I haven't had a chance to look at that code yet....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-10-30 17:43:21

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

Hi!

> Main disadvantage of btrees is that they are complicated, come in a
> gazillion subtly different variants that differ mainly in the balance
> between read efficiency and write efficiency. Comparing btrees against
> anything is a bit like comparing apples and random fruits.

:-)))))





+ * Disks have fulfilled the prerequite for a long time. More recently DRAM

prerequisite?

+#define MAX(a, b) ((a) > (b) ? (a) : (b))

We already have that in the headers somewhere.


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-10-30 17:59:03

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Thu, 30 October 2008 18:43:00 +0100, Pavel Machek wrote:
>
> + * Disks have fulfilled the prerequite for a long time. More recently DRAM
>
> prerequisite?

In the paragraph above:

+ * B+Trees can be used similar to Linux radix trees (which don't have anything
+ * in common with textbook radix trees, beware). Prerequisite for them working
+ * well is that access to a random tree node is much faster than a large number
+ * of operations within each node.

> +#define MAX(a, b) ((a) > (b) ? (a) : (b))
>
> We already have that in the headers somewhere.

Except that min/max in include/linux/kernel.h have some type-safety
added. In this particular case that is actually a disadvantage:
CC lib/btree.o
lib/btree.c:56: error: braced-group within expression allowed only inside a function
lib/btree.c:62: error: braced-group within expression allowed only inside a function
lib/btree.c:67: error: braced-group within expression allowed only inside a function

So I need something more stupid. :)
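
A reduced example of the problem (my reduction, not from the patch):
max() expands to a ({ ... }) statement expression, which is no constant
expression and hence cannot appear in a file-scope initializer:

	static int broken = max(L1_CACHE_BYTES, 128);	/* does not compile */

	#define MAX(a, b) ((a) > (b) ? (a) : (b))
	static int works = MAX(L1_CACHE_BYTES, 128);	/* plain ?: is fine */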

Jörn

--
It does not require a majority to prevail, but rather an irate,
tireless minority keen to set brush fires in people's minds.
-- Samuel Adams

2008-10-30 19:13:30

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Thu 2008-10-30 18:58:44, Jörn Engel wrote:
> On Thu, 30 October 2008 18:43:00 +0100, Pavel Machek wrote:
> >
> > + * Disks have fulfilled the prerequite for a long time. More recently DRAM
> >
> > prerequisite?
>
> In the paragraph above:

I don't think "prerequite" is a valid English word... is it?

> > +#define MAX(a, b) ((a) > (b) ? (a) : (b))
> >
> > We already have that in the headers somewhere.
>
> Except that min/max in include/linux/kernel.h have some type-safety
> added. In this particular case that is actually a disadvantage:
> CC lib/btree.o
> lib/btree.c:56: error: braced-group within expression allowed only inside a function
> lib/btree.c:62: error: braced-group within expression allowed only inside a function
> lib/btree.c:67: error: braced-group within expression allowed only inside a function
>
> So I need something more stupid. :)

Aha, ok.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-10-30 20:21:19

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Thu, 30 October 2008 20:14:51 +0100, Pavel Machek wrote:
> On Thu 2008-10-30 18:58:44, Jörn Engel wrote:
> > On Thu, 30 October 2008 18:43:00 +0100, Pavel Machek wrote:
> > >
> > > + * Disks have fulfilled the prerequite for a long time. More recently DRAM
> > > prerequisite?
> >
> > In the paragraph above:
>
> I don't thing "prerequite" is valid english word... is it?

Indeed. I could have read it another dozen times and not noticed
myself. Thanks.

Jörn

--
The wise man seeks everything in himself; the ignorant man tries to get
everything from somebody else.
-- unknown

2008-10-31 06:40:46

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

Am Donnerstag, 30. Oktober 2008 schrieb Jörn Engel:
> Except that min/max in include/linux/kernel.h have some type-safety
> added. In this particular case that is actually a disadvantage:

So what about min_t and max_t from include/linux/kernel.h? Can that be used?

Christian

2008-10-31 07:35:33

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 31 October 2008 07:38:46 +0100, Christian Borntraeger wrote:
> Am Donnerstag, 30. Oktober 2008 schrieb Jörn Engel:
> > Except that min/max in include/linux/kernel.h have some type-safety
> > added. In this particular case that is actually a disadvantage:
>
> So what about min_t and max_t from include/linux/kernel.h? Can that be used?

Sorry, no. The problem comes from using MAX in structure initializers.

struct btree_geo btree_geo32 = {
.keylen = 1,
.no_pairs = NODESIZE / sizeof(long) / 2,
};

Jörn

--
Correctness comes second.
Features come third.
Performance comes last.
Maintainability is easily forgotten.

2008-10-31 09:16:47

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 31 Oct 2008, Jörn Engel wrote:
> On Fri, 31 October 2008 07:38:46 +0100, Christian Borntraeger wrote:
> > Am Donnerstag, 30. Oktober 2008 schrieb Jörn Engel:
> > > Except that min/max in include/linux/kernel.h have some type-safety
> > > added. In this particular case that is actually a disadvantage:
> >
> > So what about min_t and max_t from include/linux/kernel.h? Can that be used?
>
> Sorry, no. The problem comes from using MAX in structure initializers.
>
> struct btree_geo btree_geo32 = {
> .keylen = 1,
> .no_pairs = NODESIZE / sizeof(long) / 2,
> };

Where's the MIN/MAX? ;-)

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2008-10-31 09:22:17

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 31 October 2008 10:16:27 +0100, Geert Uytterhoeven wrote:
>
> Where's the MIN/MAX? ;-)

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define NODESIZE MAX(L1_CACHE_BYTES, 128)

struct btree_geo btree_geo32 = {
.keylen = 1,
.no_pairs = NODESIZE / sizeof(long) / 2,
};

/me senses a bikeshed discussion.

Jörn

--
You cannot suppose that Moliere ever troubled himself to be original in the
matter of ideas. You cannot suppose that the stories he tells in his plays
have never been told before. They were culled, as you very well know.
-- Andre-Louis Moreau in Scarabouche

2008-10-31 10:35:29

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

[looks like this hitting LWN means everyone suddenly found it...]

> The one aspect where my implementation is actually nice is in allowing
> variable key length. Btree keys are interpreted as an array of unsigned
> long. So by passing the correct geometry to the core functions, it is
> possible to handle 32bit, 64bit or 128bit btrees, which logfs uses. If
> so desired, any other weird data format can be used as well (Zach, are
> you reading this?).

Would there be an easy way to use 48-bit keys? Or variable length keys?

> So would something like this be merged once some users are identified?
> Would it be useful for anything but logfs? Or am I plain nuts?

I could imagine using it instead of the hash-table for stations and APs
in the wireless code, stations are identified by the MAC address (48
bit) and APs (BSSs) are identified by their BSSID+SSID (or mesh ID), so
variable length. Currently we use a hash table with 256 slots which is
quite large for the typical case of mostly less than a hundred entries.

> This implementation is extremely simple. It splits nodes when they
> overflow. It does not move elements to neighboring nodes. It does not
> try fancy 2:3 splits. It does not even merge nodes when they shrink,
> making degenerate cases possible. And it requires callers to do
> tree-global locking. In effect, it will be hard to find anything less
> sophisticated.

I think the wireless case would probably want to have real shrinking
because it's well possible that you're moving, with your laptop, from an
area with a large number of APs to say your home out in the countryside
that only has your single AP.

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2008-10-31 11:27:17

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 31 October 2008 11:35:14 +0100, Johannes Berg wrote:
>
> [looks like this hitting LWN means everyone suddenly found it...]

It did?

> > The one aspect where my implementation is actually nice is in allowing
> > variable key length. Btree keys are interpreted as an array of unsigned
> > long. So by passing the correct geometry to the core functions, it is
> > possible to handle 32bit, 64bit or 128bit btrees, which logfs uses. If
> > so desired, any other weird data format can be used as well (Zach, are
> > you reading this?).
>
> Would there be an easy way to use 48-bit keys? Or variable length keys?

Variable as in one implementation for several trees with different
sizes, yes. Variable as in one tree with differently sized keys, no.

With the existing code, 48bit keys would need to be expressed as an
array of longs, thereby growing to 64bit. Alternatively one could
replace the longset/longcmp/longcpy with memset/memcmp/memcpy.
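
The padding variant would look roughly like this (just a sketch, the
helper name is invented):

	static u64 mac_to_key(const u8 *addr)
	{
		u64 key = 0;

		memcpy(&key, addr, ETH_ALEN);	/* 48 bits, zero-padded */
		return key;
	}

	/* then simply: btree_insert64(&head, mac_to_key(sta->addr), sta); */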

> > So would something like this be merged once some users are identified?
> > Would it be useful for anything but logfs? Or am I plain nuts?
>
> I could imagine using it instead of the hash-table for stations and APs
> in the wireless code, stations are identified by the MAC address (48
> bit) and APs (BSSs) are identified by their BSSID+SSID (or mesh ID), so
> variable length. Currently we use a hash table with 256 slots which is
> quite large for the typical case of mostly less than a hundred entries.

MAC address wouldn't be a problem. For BSSID+SSID one could just keep
the hashing and use the btree as an 'implementation detail' of the
'hashtable'. Beware that I don't allow two identical keys. The
implications of that make my toenails curl up.
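
A sketch of what I mean (all names invented): key the tree by the BSSID
alone and make the value a short list that is searched linearly for the
matching SSID:

	struct bss_bucket {
		struct list_head bss_list;	/* almost always one entry */
	};

	struct bss_bucket *bucket;

	bucket = btree_lookup64(&head, bssid_to_key(bssid));
	if (!bucket)
		bucket = add_bucket(&head, bssid);	/* insert new bucket */
	list_add(&bss->list, &bucket->bss_list);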

> > This implementation is extremely simple. It splits nodes when they
> > overflow. It does not move elements to neighboring nodes. It does not
> > try fancy 2:3 splits. It does not even merge nodes when they shrink,
> > making degenerate cases possible. And it requires callers to do
> > tree-global locking. In effect, it will be hard to find anything less
> > sophisticated.
>
> I think the wireless case would probably want to have real shrinking
> because it's well possible that you're moving, with your laptop, from an
> area with a large number of APs to say your home out in the countryside
> that only has your single AP.

Indeed.

Jörn

--
Joern's library part 7:
http://www.usenix.org/publications/library/proceedings/neworl/full_papers/mckusick.a

2008-10-31 11:33:00

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 2008-10-31 at 12:26 +0100, Jörn Engel wrote:
> On Fri, 31 October 2008 11:35:14 +0100, Johannes Berg wrote:
> >
> > [looks like this hitting LWN means everyone suddenly found it...]
>
> It did?

Yes, on the weekly kernel page, under "Patches and updates / Core kernel
code"

> > > The one aspect where my implementation is actually nice is in allowing
> > > variable key length. Btree keys are interpreted as an array of unsigned
> > > long. So by passing the correct geometry to the core functions, it is
> > > possible to handle 32bit, 64bit or 128bit btrees, which logfs uses. If
> > > so desired, any other weird data format can be used as well (Zach, are
> > > you reading this?).
> >
> > Would there be an easy way to use 48-bit keys? Or variable length keys?
>
> Variable as in one implementation for several trees with different
> sizes, yes. Variable as in one tree with differently sized keys, no.

Ok, I guess that would blow up the key size to 6+1+32 bytes, or 10 (5)
longs. Bit large.

> > > So would something like this be merged once some users are identified?
> > > Would it be useful for anything but logfs? Or am I plain nuts?
> >
> > I could imagine using it instead of the hash-table for stations and APs
> > in the wireless code, stations are identified by the MAC address (48
> > bit) and APs (BSSs) are identified by their BSSID+SSID (or mesh ID), so
> > variable length. Currently we use a hash table with 256 slots which is
> > quite large for the typical case of mostly less than a hundred entries.
>
> MAC address wouldn't be a problem. For BSSID+SSID one could just keep
> the hashing and use the btree as an 'implementation detail' of the
> 'hashtable'. Beware that I don't allow two identical keys. The
> implications of that make my toenails curl up.

Heh. We don't actually hash on the SSID right now, only on the BSSID,
and then use the SSID to distinguish (it isn't common to have two SSIDs
on the same BSSID)

> > > This implementation is extremely simple. It splits nodes when they
> > > overflow. It does not move elements to neighboring nodes. It does not
> > > try fancy 2:3 splits. It does not even merge nodes when they shrink,
> > > making degenerate cases possible. And it requires callers to do
> > > tree-global locking. In effect, it will be hard to find anything less
> > > sophisticated.
> >
> > I think the wireless case would probably want to have real shrinking
> > because it's well possible that you're moving, with your laptop, from an
> > area with a large number of APs to say your home out in the countryside
> > that only has your single AP.
>
> Indeed.

I guess not then for now, unless one of us wants to implement
resizing... I'll look for replacements another time, nobody has
complained yet about the huge hash table :)

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2008-10-31 12:55:18

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 31 October 2008 12:32:41 +0100, Johannes Berg wrote:
> > >
> > > Would there be an easy way to use 48-bit keys? Or variable length keys?
> >
> > Variable as in one implementation for several trees with different
> > sizes, yes. Variable as in one tree with differently sized keys, no.
>
> Ok, I guess that would blow up the key size to 6+1+32 bytes, or 10 (5)
> longs. Bit large.

Yes. Insanely large keys are a good indication to better avoid btrees.

> > > I think the wireless case would probably want to have real shrinking
> > > because it's well possible that you're moving, with your laptop, from an
> > > area with a large number of APs to say your home out in the countryside
> > > that only has your single AP.
> >
> > Indeed.
>
> I guess not then for now, unless one of us wants to implement
> resizing... I'll look for replacements another time, nobody has
> complained yet about the huge hash table :)

I actually have something that compiles now. It still needs a bit of
water and soap before I'd consider it presentable, but turned out to be
less complicated than expected.

Jörn

--
More computing sins are committed in the name of efficiency (without
necessarily achieving it) than for any other single reason - including
blind stupidity.
-- W. A. Wulf

2008-10-31 13:07:31

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 2008-10-31 at 13:54 +0100, Jörn Engel wrote:
> On Fri, 31 October 2008 12:32:41 +0100, Johannes Berg wrote:
> > > >
> > > > Would there be an easy way to use 48-bit keys? Or variable length keys?
> > >
> > > Variable as in one implementation for several trees with different
> > > sizes, yes. Variable as in one tree with differently sized keys, no.
> >
> > Ok, I guess that would blow up the key size to 6+1+32 bytes, or 10 (5)
> > longs. Bit large.
>
> Yes. Insanely large keys are a good indication to better avoid btrees.

OTOH, there is no need to put the SSID in if I put a small list into
each node, effectively using the tree instead of the hash table and then
disambiguating the unlikely case of multiple SSIDs in a list.

> I actually have something that compiles now. It still needs a bit of
> water and soap before I'd consider it presentable, but turned out to be
> less complicated than expected.

:)

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2008-10-31 13:15:42

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 31 October 2008 14:07:16 +0100, Johannes Berg wrote:
>
> OTOH, there is no need to put the SSID in if I put a small list into
> each node, effectively using the tree instead of the hash table and then
> disambiguating the unlikely case of multiple SSID in a list.

Exactly. In sunny weather btrees can be a good implementation for hash
tables. They can grow and shrink as needed, without taking the latency
hit for rehashing every single item. Until lock contention or pingpong
shows otherwise. :(

Most likely the answer to bad weather is RCU. Patches are welcome.

Jörn

--
They laughed at Galileo. They laughed at Copernicus. They laughed at
Columbus. But remember, they also laughed at Bozo the Clown.
-- unknown

2008-10-31 13:16:24

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library


> +static inline size_t btree_visitorl(struct btree_headl *head, long opaque,
> + visitorl_t func2)
> +{
> + return btree_visitor(&head->h, &btree_geo32, opaque, visitorl, func2);
> +}

Incidentally, do you think it would be possible to implement a kind of

btree_for_each_entry(e, ...) {
do something with e
}

macro or function/macro combination? You seem to be doing a recursive
walk across the tree, would it be useful to have a linked list at the
lowest level of nodes to be able to iterate more easily?

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2008-10-31 13:29:40

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 31 October 2008 14:16:14 +0100, Johannes Berg wrote:
>
> > +static inline size_t btree_visitorl(struct btree_headl *head, long opaque,
> > + visitorl_t func2)
> > +{
> > + return btree_visitor(&head->h, &btree_geo32, opaque, visitorl, func2);
> > +}
>
> Incidentally, do you think it would be possible to implement a kind of
>
> btree_for_each_entry(e, ...) {
> do something with e
> }
>
> macro or function/macro combination? You seem to be doing a recursive
> walk across the tree, would it be useful to have a linked list at the
> lowest level of nodes to be able to iterate more easily?

Maybe. On the whole, I've tried hard not to optimize much. The idea
was that a simple design can still receive some fundamental changes.
But once a number of neat tricks like a linked list at the lowest level
are used, the code becomes more resistant to changes.

That being said, the maze of function pointer around btree_visitor()
does not make me very proud. Might be worth it.

Jörn

--
Prosperity makes friends, adversity tries them.
-- Publilius Syrus

2008-10-31 13:46:12

by Bert Wesarg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, Oct 31, 2008 at 14:16, Johannes Berg <[email protected]> wrote:
>
>> +static inline size_t btree_visitorl(struct btree_headl *head, long opaque,
>> + visitorl_t func2)
>> +{
>> + return btree_visitor(&head->h, &btree_geo32, opaque, visitorl, func2);
>> +}
>
> Incidentally, do you think it would be possible to implement a kind of
>
> btree_for_each_entry(e, ...) {
> do something with e
> }

#define btree_for_each_entryl(__elem, __head, __opaque, ...) \
({ \
void f2(void *__elem, long opaque, unsigned long key, size_t index) \
{ \
(__VA_ARGS__); \
} \
btree_visitorl(__head, __opaque, f2); \
});

usage:

btree_for_each_entryl(e, head, opaque, {
do something with e
})

looks strange but works.

Bert

>
> macro or function/macro combination? You seem to be doing a recursive
> walk across the tree, would it be useful to have a linked list at the
> lowest level of nodes to be able to iterate more easily?
>
> johannes
>

2008-10-31 15:18:18

by Tim Gardner

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

Johannes Berg wrote:
>> +static inline size_t btree_visitorl(struct btree_headl *head, long opaque,
>> + visitorl_t func2)
>> +{
>> + return btree_visitor(&head->h, &btree_geo32, opaque, visitorl, func2);
>> +}
>
> Incidentally, do you think it would be possible to implement a kind of
>
> btree_for_each_entry(e, ...) {
> do something with e
> }
>
> macro or function/macro combination? You seem to be doing a recursive
> walk across the tree, would it be useful to have a linked list at the
> lowest level of nodes to be able to iterate more easily?
>
> johannes

What would you expect to be the behavior if you remove 'e' ? That might
cause the tree to get re-ordered. Do you restart the list traversal?

I had a similar issue once with a hash table algorithm where you could
either access elements with a hash key, or efficiently traverse the hash
table using a linked list of elements. The solution wasn't too difficult
in that case because removing an element didn't cause the table to get
re-ordered.

rtg
--
Tim Gardner [email protected]

2008-10-31 15:36:00

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 31 October 2008 09:18:06 -0600, Tim Gardner wrote:
> Johannes Berg wrote:
> >> +static inline size_t btree_visitorl(struct btree_headl *head, long opaque,
> >> + visitorl_t func2)
> >> +{
> >> + return btree_visitor(&head->h, &btree_geo32, opaque, visitorl, func2);
> >> +}
> >
> > Incidentally, do you think it would be possible to implement a kind of
> >
> > btree_for_each_entry(e, ...) {
> > do something with e
> > }
> >
> > macro or function/macro combination? You seem to be doing a recursive
> > walk across the tree, would it be useful to have a linked list at the
> > lowest level of nodes to be able to iterate more easily?
>
> What would you expect to be the behavior if you remove 'e' ? That might
> cause the tree to get re-ordered. Do you restart the list traversal?

BUG(), if you're lucky. Silent data corruption, if you're not so lucky.

The btree_grim_visitor() exists to remove every element after being handed
to the visitor. btree_visitor() removes none. Either function requires
the caller to do proper locking. Calling two functions from two threads
in parallel or calling a function from the visitor callback will give
you undefined results - with the exception of parallel lookups, which
are obviously fine.
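
For a teardown that frees all elements this means something like the
following (sketch, free_elem is made up):

	static void free_elem(void *elem, long opaque, unsigned long key,
			size_t index)
	{
		kfree(elem);
	}

	/* with the tree lock held: */
	btree_grim_visitorl(&head, 0, free_elem);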

Jörn

--
Beware of bugs in the above code; I have only proved it correct, but
not tried it.
-- Donald Knuth

2008-10-31 20:45:17

by Sean Young

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Sun, Oct 26, 2008 at 01:46:44PM +0100, Joern Engel wrote:
> General advantages of btrees are memory density and efficient use of
> cachelines. Hashtables are either too small and degrade into linked
> list performance, or they are too large and waste memory. With changing
> workloads, both may be true on the same system. Rbtrees have a bad
> fanout of less than 2 (they are not actually balanced binary trees),
> hence a lookup has to read a fairly large number of cachelines.

Which reminds me:

find_vma() uses rbtrees. Now I assume find_vma() is called far more than
mmap() and friends. Since avltrees are balanced (unlike rbtrees), lookups
will be faster at the expense of extra rotations during updates.

Would patches for avltrees be accepted?

Thanks
Sean

2008-10-31 23:36:31

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Fri, 31 October 2008 20:17:45 +0000, Sean Young wrote:
> On Sun, Oct 26, 2008 at 01:46:44PM +0100, Joern Engel wrote:
> > General advantages of btrees are memory density and efficient use of
> > cachelines. Hashtables are either too small and degrade into linked
> > list performance, or they are too large and waste memory. With changing
> > workloads, both may be true on the same system. Rbtrees have a bad
> > fanout of less than 2 (they are not actually balanced binary trees),
> > > hence a lookup has to read a fairly large number of cachelines.
>
> Which reminds me:
>
> find_vma() uses rbtrees. Now I assume find_vma() is called far more than
> mmap() and friends. Since avltrees are balanced (unlike rbtrees), lookups
> will be faster at the expense of extra rotations during updates.

Maybe I should have been clearer. Rbtrees _are_ balanced trees. They
are not balanced _binary_ trees, but balanced 234-trees in a binary
representation.

> Would patches for avltrees be accepted?

The question is whether they are an improvement. As always.

Jörn

--
Das Aufregende am Schreiben ist es, eine Ordnung zu schaffen, wo
vorher keine existiert hat.
-- Doris Lessing

2008-11-01 10:17:51

by Sean Young

[permalink] [raw]
Subject: Re: [RFC] B+Tree library

On Sat, Nov 01, 2008 at 12:36:15AM +0100, Joern Engel wrote:
> On Fri, 31 October 2008 20:17:45 +0000, Sean Young wrote:
> > On Sun, Oct 26, 2008 at 01:46:44PM +0100, Joern Engel wrote:
> > > General advantages of btrees are memory density and efficient use of
> > > cachelines. Hashtables are either too small and degrade into linked
> > > list performance, or they are too large and waste memory. With changing
> > > workloads, both may be true on the same system. Rbtrees have a bad
> > > fanout of less than 2 (they are not actually balanced binary trees),
> > > > hence a lookup has to read a fairly large number of cachelines.
> >
> > Which reminds me:
> >
> > find_vma() uses rbtrees. Now I assume find_vma() is called far more than
> > mmap() and friends. Since avltrees are balanced (unlike rbtrees), lookups
> > will be faster at the expense of extra rotations during updates.
>
> Maybe I should have been clearer. Rbtrees _are_ balanced trees. They
> are not balanced _binary_ trees, but balanced 234-trees in a binary
> representation.

I should have been clearer. avltrees are more rigidly balanced than
rbtrees, making them faster for lookups but slower for modification:

http://en.wikipedia.org/wiki/AVL_tree#Comparison_to_other_structures

The difference between the shortest and the longest path to a leaf node
is at most +1 for avl trees, but up to a factor of 2 for red-black trees.


Sean

2008-11-01 16:00:27

by Jörn Engel

[permalink] [raw]
Subject: [RFC] B+Tree library V2

On Fri, 31 October 2008 13:54:53 +0100, Jörn Engel wrote:
> On Fri, 31 October 2008 12:32:41 +0100, Johannes Berg wrote:
> >
> > I guess not then for now, unless one of us wants to implement
> > resizing... I'll look for replacements another time, nobody has
> > complained yet about the huge hash table :)
>
> I actually have something that compiles now. It still needs a bit of
> water and soap before I'd consider it presentable, but turned out to be
> less complicated than expected.

Not only compiles, it also survives my userspace test harness and no longer
looks like a bulldog with lipstick.

You might note that "less complicated than expected" means I'm still
bending the rules a bit. I only merge underpopulated nodes with
neighbours. If the neighbour is too full for a merge I don't steal
entries. That bit is for someone else to implement. :)
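
In terms of the V1 helpers, the merge path boils down to something like
this (a sketch of the idea, not the actual V2 code):

	if (fill <= geo->no_pairs / 2 && lfill + fill <= geo->no_pairs) {
		/* append this node's pairs behind the left neighbour's */
		for (i = 0; i < fill; i++) {
			setkey(geo, left, bkey(geo, node, i), lfill + i);
			setval(geo, left, bval(geo, node, i), lfill + i);
		}
		/* then remove node from its parent and free it */
	}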

Jörn

--
on a false concept. This concept is that
people actually want to look at source code.
-- Rob Enderle

diff --git a/include/linux/btree.h b/include/linux/btree.h
new file mode 100644
index 0000000..d56a691
--- /dev/null
+++ b/include/linux/btree.h
@@ -0,0 +1,272 @@
+#ifndef BTREE_H
+#define BTREE_H
+
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+
+/*
+ * B+Tree node format:
+ * [key0, key1, ..., keyN] [val0, val1, ..., valN]
+ * Each key is an array of unsigned longs, geo->keylen in total.
+ * Total number of keys and vals (N) is geo->no_pairs.
+ */
+
+struct btree_head {
+ unsigned long *node;
+ mempool_t *mempool;
+ int height;
+ gfp_t gfp_mask;
+};
+
+struct btree_geo {
+ int keylen;
+ int no_pairs;
+};
+extern struct btree_geo btree_geo32;
+extern struct btree_geo btree_geo64;
+extern struct btree_geo btree_geo128;
+
+struct btree_headl { struct btree_head h; };
+struct btree_head32 { struct btree_head h; };
+struct btree_head64 { struct btree_head h; };
+struct btree_head128 { struct btree_head h; };
+
+/*
+ * These couple of functions are all there is to it. The rest of this header
+ * consists only of wrappers that try to add some typesafety, make the code
+ * a little self-documenting and generally be nice to people.
+ */
+void *btree_alloc(gfp_t gfp_mask, void *pool_data);
+void btree_free(void *element, void *pool_data);
+void btree_init(struct btree_head *head, mempool_t *mempool, gfp_t gfp);
+void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key);
+int btree_insert(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val);
+void *btree_remove(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key);
+int btree_merge(struct btree_head *target, struct btree_head *victim,
+ struct btree_geo *geo, unsigned long *duplicate);
+unsigned long *btree_last(struct btree_head *head, struct btree_geo *geo);
+size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
+ long opaque,
+ void (*func)(void *elem, long opaque, unsigned long *key,
+ size_t index, void *func2), void *func2);
+size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
+ long opaque,
+ void (*func)(void *elem, long opaque, unsigned long *key,
+ size_t index, void *func2), void *func2);
+
+/* key is unsigned long */
+static inline void btree_initl(struct btree_headl *head, mempool_t *mempool,
+ gfp_t gfp)
+{
+ btree_init(&head->h, mempool, gfp);
+}
+
+static inline void *btree_lookupl(struct btree_headl *head, unsigned long key)
+{
+ return btree_lookup(&head->h, &btree_geo32, &key);
+}
+
+static inline int btree_insertl(struct btree_headl *head, unsigned long key,
+ void *val)
+{
+ return btree_insert(&head->h, &btree_geo32, &key, val);
+}
+
+static inline void *btree_removel(struct btree_headl *head, unsigned long key)
+{
+ return btree_remove(&head->h, &btree_geo32, &key);
+}
+
+static inline int btree_mergel(struct btree_headl *target,
+ struct btree_headl *victim)
+{
+ unsigned long scratch;
+
+ return btree_merge(&target->h, &victim->h, &btree_geo32, &scratch);
+}
+
+void visitorl(void *elem, long opaque, unsigned long *key, size_t index,
+ void *__func);
+
+typedef void (*visitorl_t)(void *elem, long opaque, unsigned long key,
+ size_t index);
+
+static inline size_t btree_visitorl(struct btree_headl *head, long opaque,
+ visitorl_t func2)
+{
+ return btree_visitor(&head->h, &btree_geo32, opaque, visitorl, func2);
+}
+
+static inline size_t btree_grim_visitorl(struct btree_headl *head, long opaque,
+ visitorl_t func2)
+{
+ return btree_grim_visitor(&head->h, &btree_geo32, opaque, visitorl, func2);
+}
+
+/* key is u32 */
+static inline void btree_init32(struct btree_head32 *head, mempool_t *mempool,
+ gfp_t gfp)
+{
+ btree_init(&head->h, mempool, gfp);
+}
+
+static inline void *btree_lookup32(struct btree_head32 *head, u32 key)
+{
+ return btree_lookup(&head->h, &btree_geo32, (unsigned long *)&key);
+}
+
+static inline int btree_insert32(struct btree_head32 *head, u32 key,
+ void *val)
+{
+ return btree_insert(&head->h, &btree_geo32, (unsigned long *)&key, val);
+}
+
+static inline void *btree_remove32(struct btree_head32 *head, u32 key)
+{
+ return btree_remove(&head->h, &btree_geo32, (unsigned long *)&key);
+}
+
+static inline int btree_merge32(struct btree_head32 *target,
+ struct btree_head32 *victim)
+{
+ unsigned long scratch;
+
+ return btree_merge(&target->h, &victim->h, &btree_geo32, &scratch);
+}
+
+void visitor32(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func);
+
+typedef void (*visitor32_t)(void *elem, long opaque, u32 key, size_t index);
+
+static inline size_t btree_visitor32(struct btree_head32 *head, long opaque,
+ visitor32_t func2)
+{
+ return btree_visitor(&head->h, &btree_geo32, opaque, visitor32, func2);
+}
+
+static inline size_t btree_grim_visitor32(struct btree_head32 *head, long opaque,
+ visitor32_t func2)
+{
+ return btree_grim_visitor(&head->h, &btree_geo32, opaque, visitor32, func2);
+}
+
+/* key is u64 */
+static inline void btree_init64(struct btree_head64 *head, mempool_t *mempool,
+ gfp_t gfp)
+{
+ btree_init(&head->h, mempool, gfp);
+}
+
+static inline void *btree_lookup64(struct btree_head64 *head, u64 key)
+{
+ return btree_lookup(&head->h, &btree_geo64, (unsigned long *)&key);
+}
+
+static inline int btree_insert64(struct btree_head64 *head, u64 key,
+ void *val)
+{
+ return btree_insert(&head->h, &btree_geo64, (unsigned long *)&key, val);
+}
+
+static inline void *btree_remove64(struct btree_head64 *head, u64 key)
+{
+ return btree_remove(&head->h, &btree_geo64, (unsigned long *)&key);
+}
+
+static inline int btree_merge64(struct btree_head64 *target,
+ struct btree_head64 *victim)
+{
+ u64 scratch;
+
+ return btree_merge(&target->h, &victim->h, &btree_geo64,
+ (unsigned long *)&scratch);
+}
+
+void visitor64(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func);
+
+typedef void (*visitor64_t)(void *elem, long opaque, u64 key, size_t index);
+
+static inline size_t btree_visitor64(struct btree_head64 *head, long opaque,
+ visitor64_t func2)
+{
+ return btree_visitor(&head->h, &btree_geo64, opaque, visitor64, func2);
+}
+
+static inline size_t btree_grim_visitor64(struct btree_head64 *head, long opaque,
+ visitor64_t func2)
+{
+ return btree_grim_visitor(&head->h, &btree_geo64, opaque, visitor64, func2);
+}
+
+/* key is 128bit (two u64) */
+static inline void btree_init128(struct btree_head128 *head, mempool_t *mempool,
+ gfp_t gfp)
+{
+ btree_init(&head->h, mempool, gfp);
+}
+
+static inline void *btree_lookup128(struct btree_head128 *head, u64 k1, u64 k2)
+{
+ u64 key[2] = {k1, k2};
+ return btree_lookup(&head->h, &btree_geo128, (unsigned long *)&key);
+}
+
+static inline int btree_insert128(struct btree_head128 *head, u64 k1, u64 k2,
+ void *val)
+{
+ u64 key[2] = {k1, k2};
+ return btree_insert(&head->h, &btree_geo128, (unsigned long *)&key, val);
+}
+
+static inline void *btree_remove128(struct btree_head128 *head, u64 k1, u64 k2)
+{
+ u64 key[2] = {k1, k2};
+ return btree_remove(&head->h, &btree_geo128, (unsigned long *)&key);
+}
+
+static inline void btree_last128(struct btree_head128 *head, u64 *k1, u64 *k2)
+{
+ u64 *key = (u64 *)btree_last(&head->h, &btree_geo128);
+
+ if (key) {
+ *k1 = key[0];
+ *k2 = key[1];
+ } else {
+ *k1 = 0;
+ *k2 = 0;
+ }
+}
+
+static inline int btree_merge128(struct btree_head128 *target,
+ struct btree_head128 *victim)
+{
+ u64 scratch[2];
+
+ return btree_merge(&target->h, &victim->h, &btree_geo128,
+ (unsigned long *)scratch);
+}
+
+void visitor128(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func);
+
+typedef void (*visitor128_t)(void *elem, long opaque, u64 key1, u64 key2,
+ size_t index);
+
+static inline size_t btree_visitor128(struct btree_head128 *head, long opaque,
+ visitor128_t func2)
+{
+ return btree_visitor(&head->h, &btree_geo128, opaque, visitor128, func2);
+}
+
+static inline size_t btree_grim_visitor128(struct btree_head128 *head, long opaque,
+ visitor128_t func2)
+{
+ return btree_grim_visitor(&head->h, &btree_geo128, opaque, visitor128, func2);
+}
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 8cc8e87..7614216 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -129,6 +129,9 @@ config TEXTSEARCH_FSM
config PLIST
boolean

+config BTREE
+ boolean
+
config HAS_IOMEM
boolean
depends on !NO_IOMEM
diff --git a/lib/Makefile b/lib/Makefile
index 2d7001b..b1eed60 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -34,6 +34,7 @@ lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_PLIST) += plist.o
+obj-$(CONFIG_BTREE) += btree.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o

diff --git a/lib/btree.c b/lib/btree.c
new file mode 100644
index 0000000..9562356
--- /dev/null
+++ b/lib/btree.c
@@ -0,0 +1,668 @@
+/*
+ * lib/btree.c - Simple In-memory B+Tree
+ *
+ * As should be obvious for Linux kernel code, license is GPLv2
+ *
+ * Copyright (c) 2007-2008 Joern Engel <[email protected]>
+ * Bits and pieces stolen from Peter Zijlstra's code, which is
+ * Copyright 2007, Red Hat Inc. Peter Zijlstra <[email protected]>
+ * GPLv2
+ *
+ * see http://programming.kicks-ass.net/kernel-patches/vma_lookup/btree.patch
+ *
+ * A relatively simple B+Tree implementation. I have written it as a learning
+ * exercise to understand how B+Trees work. Turned out to be useful as well.
+ *
+ * B+Trees can be used similar to Linux radix trees (which don't have anything
+ * in common with textbook radix trees, beware). Prerequisite for them working
+ * well is that access to a random tree node is much faster than a large number
+ * of operations within each node.
+ *
+ * Disks have fulfilled the prerequisite for a long time. More recently DRAM
+ * has gained similar properties, as memory access times, when measured in cpu
+ * cycles, have increased. Cacheline sizes have increased as well, which also
+ * helps B+Trees.
+ *
+ * Compared to radix trees, B+Trees are more efficient when dealing with a
+ * sparsely populated address space. Between 25% and 50% of the memory is
+ * occupied with valid pointers. When densely populated, radix trees contain
+ * ~98% pointers - hard to beat. Very sparse radix trees contain only ~2%
+ * pointers.
+ *
+ * This particular implementation stores pointers identified by a long value.
+ * Storing NULL pointers is illegal, lookup will return NULL when no entry
+ * was found.
+ *
+ * Two tricks were used that are not commonly found in textbooks. First, the
+ * lowest values are to the right, not to the left. All used slots within a
+ * node are on the left, all unused slots contain NUL values. Most operations
+ * simply loop once over all slots and terminate on the first NUL.
+ *
+ * Second trick is to special-case the key "0" or NUL. As seen above, this
+ * value indicates an unused slot, so such a value should not be stored in the
+ * tree itself. Instead it is stored in the null_ptr field in the btree_head.
+ */
+
+#include <linux/btree.h>
+#include <linux/cache.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+#define NODESIZE MAX(L1_CACHE_BYTES, 128)
+
+struct btree_geo btree_geo32 = {
+ .keylen = 1,
+ .no_pairs = NODESIZE / sizeof(long) / 2,
+};
+
+#define LONG_PER_U64 (64 / BITS_PER_LONG)
+struct btree_geo btree_geo64 = {
+ .keylen = LONG_PER_U64,
+ .no_pairs = NODESIZE / sizeof(long) / (1 + LONG_PER_U64),
+};
+
+struct btree_geo btree_geo128 = {
+ .keylen = 2 * LONG_PER_U64,
+ .no_pairs = NODESIZE / sizeof(long) / (1 + 2 * LONG_PER_U64),
+};
+
+static struct kmem_cache *btree_cachep;
+
+void *btree_alloc(gfp_t gfp_mask, void *pool_data)
+{
+ return kmem_cache_alloc(btree_cachep, gfp_mask);
+}
+
+void btree_free(void *element, void *pool_data)
+{
+ kmem_cache_free(btree_cachep, element);
+}
+
+static unsigned long *btree_node_alloc(struct btree_head *head)
+{
+ unsigned long *node;
+
+ node = mempool_alloc(head->mempool, head->gfp_mask);
+ memset(node, 0, NODESIZE);
+ return node;
+}
+
+static int longcmp(const unsigned long *l1, const unsigned long *l2, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++) {
+ if (l1[i] < l2[i])
+ return -1;
+ if (l1[i] > l2[i])
+ return 1;
+ }
+ return 0;
+}
+
+static unsigned long *longcpy(unsigned long *dest, const unsigned long *src,
+ size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++)
+ dest[i] = src[i];
+ return dest;
+}
+
+static unsigned long *longset(unsigned long *s, unsigned long c, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++)
+ s[i] = c;
+ return s;
+}
+
+/*
+ * B+Tree node format:
+ * [key0, key1, ..., keyN] [val0, val1, ..., valN]
+ * Each key is an array of unsigned longs, geo->keylen in total.
+ * Total number of keys and vals (N) is geo->no_pairs.
+ */
+
+static unsigned long *bkey(struct btree_geo *geo, unsigned long *node, int n)
+{
+ return &node[n * geo->keylen];
+}
+
+static unsigned long bval(struct btree_geo *geo, unsigned long *node, int n)
+{
+ return node[geo->no_pairs * geo->keylen + n];
+}
+
+static void setkey(struct btree_geo *geo, unsigned long *node,
+ unsigned long *key, int n)
+{
+ longcpy(bkey(geo, node, n), key, geo->keylen);
+}
+
+static void setval(struct btree_geo *geo, unsigned long *node,
+ unsigned long val, int n)
+{
+ node[geo->no_pairs * geo->keylen + n] = val;
+}
+
+static void clearpair(struct btree_geo *geo, unsigned long *node, int n)
+{
+ longset(bkey(geo, node, n), 0, geo->keylen);
+ node[geo->no_pairs * geo->keylen + n] = 0;
+}
+
+#if 0
+static void dumpkey(struct btree_geo *geo, unsigned long *key)
+{
+ int k;
+
+ printk("(%lx", key[0]);
+ for (k = 1; k < geo->keylen; k++)
+ printk(",%lx", key[k]);
+ printk(")");
+}
+
+static void dumpnode(struct btree_geo *geo, unsigned long *node)
+{
+ int i;
+ unsigned long *key;
+
+ printk("%p: ", node);
+ for (i = 0; i < geo->no_pairs; i++) {
+ key = bkey(geo, node, i);
+ dumpkey(geo, key);
+ printk(" %lx ", bval(geo, node, i));
+ }
+ printk("\n");
+}
+
+static void __dumptree(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *node, int height)
+{
+ int i;
+ unsigned long *child;
+
+ if (!height)
+ return;
+
+ printk("%2x ", height);
+ dumpnode(geo, node);
+ for (i = 0; i < geo->no_pairs; i++) {
+ child = (void *)bval(geo, node, i);
+ if (!child)
+ return;
+ __dumptree(head, geo, child, height - 1);
+ }
+}
+
+static void dumptree(struct btree_head *head, struct btree_geo *geo)
+{
+ __dumptree(head, geo, head->node, head->height);
+}
+#endif
+
+static inline void __btree_init(struct btree_head *head)
+{
+ head->node = NULL;
+ head->height = 0;
+}
+
+void btree_init(struct btree_head *head, mempool_t *mempool, gfp_t gfp)
+{
+ __btree_init(head);
+ head->mempool = mempool;
+ head->gfp_mask = gfp;
+}
+
+unsigned long *btree_last(struct btree_head *head, struct btree_geo *geo)
+{
+ int height = head->height;
+ unsigned long *node = head->node;
+
+ if (height == 0)
+ return NULL;
+
+ for ( ; height > 1; height--)
+ node = (unsigned long *)bval(geo, node, 0);
+
+ return bkey(geo, node, 0);
+}
+
+static int keycmp(struct btree_geo *geo, unsigned long *node, int pos,
+ unsigned long *key)
+{
+ return longcmp(bkey(geo, node, pos), key, geo->keylen);
+}
+
+void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key)
+{
+ int i, height = head->height;
+ unsigned long *node = head->node;
+
+ if (height == 0)
+ return NULL;
+
+ for ( ; height > 1; height--) {
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+ if (i == geo->no_pairs)
+ return NULL;
+ node = (unsigned long *)bval(geo, node, i);
+ if (!node)
+ return NULL;
+ }
+
+ if (!node)
+ return NULL;
+
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) == 0)
+ return (void *)bval(geo, node, i);
+ return NULL;
+}
+
+static int getpos(struct btree_geo *geo, unsigned long *node,
+ unsigned long *key)
+{
+ int i;
+
+ for (i = 0; i < geo->no_pairs; i++) {
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+ }
+ return i;
+}
+
+static int getfill(struct btree_geo *geo, unsigned long *node, int start)
+{
+ int i;
+
+ for (i = start; i < geo->no_pairs; i++)
+ if (bval(geo, node, i) == 0)
+ break;
+ return i;
+}
+
+/*
+ * locate the correct leaf node in the btree
+ */
+static unsigned long *find_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level)
+{
+ unsigned long *node = head->node;
+ int i, height;
+
+ for (height = head->height; height > level; height--) {
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+
+ if ((i == geo->no_pairs) || !bval(geo, node, i)) {
+ /* right-most key is too large, update it */
+ /* FIXME: If the right-most key on higher levels is
+ * always zero, this wouldn't be necessary. */
+ i--;
+ setkey(geo, node, key, i);
+ }
+ BUG_ON(i < 0);
+ node = (unsigned long *)bval(geo, node, i);
+ }
+ BUG_ON(!node);
+ return node;
+}
+
+static int btree_grow(struct btree_head *head, struct btree_geo *geo)
+{
+ unsigned long *node;
+ int fill;
+
+ node = btree_node_alloc(head);
+ if (!node)
+ return -ENOMEM;
+ if (head->node) {
+ fill = getfill(geo, head->node, 0);
+ setkey(geo, node, bkey(geo, head->node, fill - 1), 0);
+ setval(geo, node, (unsigned long)head->node, 0);
+ }
+ head->node = node;
+ head->height++;
+ return 0;
+}
+
+static void btree_shrink(struct btree_head *head, struct btree_geo *geo)
+{
+ unsigned long *node;
+ int fill;
+
+ if (head->height <= 1)
+ return;
+
+ node = head->node;
+ fill = getfill(geo, node, 0);
+ BUG_ON(fill > 1);
+ head->node = (unsigned long *)bval(geo, node, 0);
+ head->height--;
+ mempool_free(node, head->mempool);
+}
+
+static int btree_insert_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, unsigned long val, int level)
+{
+ unsigned long *node;
+ int i, pos, fill, err;
+
+ BUG_ON(!val);
+ if (head->height < level) {
+ err = btree_grow(head, geo);
+ if (err)
+ return err;
+ }
+
+retry:
+ node = find_level(head, geo, key, level);
+ pos = getpos(geo, node, key);
+ fill = getfill(geo, node, pos);
+ /* two identical keys are not allowed */
+ BUG_ON(pos < fill && keycmp(geo, node, pos, key) == 0);
+
+ if (fill == geo->no_pairs) {
+ /* need to split node */
+ unsigned long *new;
+
+ new = btree_node_alloc(head);
+ if (!new)
+ return -ENOMEM;
+ err = btree_insert_level(head, geo,
+ bkey(geo, node, fill / 2 - 1),
+ (unsigned long)new, level + 1);
+ if (err) {
+ mempool_free(new, head->mempool);
+ return err;
+ }
+ for (i = 0; i < fill / 2; i++) {
+ setkey(geo, new, bkey(geo, node, i), i);
+ setval(geo, new, bval(geo, node, i), i);
+ setkey(geo, node, bkey(geo, node, i + fill / 2), i);
+ setval(geo, node, bval(geo, node, i + fill / 2), i);
+ clearpair(geo, node, i + fill / 2);
+ }
+ if (fill & 1) {
+ setkey(geo, node, bkey(geo, node, fill - 1), i);
+ setval(geo, node, bval(geo, node, fill - 1), i);
+ clearpair(geo, node, fill - 1);
+ }
+ goto retry;
+ }
+ BUG_ON(fill >= geo->no_pairs);
+
+ /* shift and insert */
+ for (i = fill; i > pos; i--) {
+ setkey(geo, node, bkey(geo, node, i - 1), i);
+ setval(geo, node, bval(geo, node, i - 1), i);
+ }
+ setkey(geo, node, key, pos);
+ setval(geo, node, val, pos);
+
+ return 0;
+}
+
+int btree_insert(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val)
+{
+ return btree_insert_level(head, geo, key, (unsigned long)val, 1);
+}
+
+static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level);
+static void merge(struct btree_head *head, struct btree_geo *geo, int level,
+ unsigned long *left, int lfill,
+ unsigned long *right, int rfill,
+ unsigned long *parent, int lpos)
+{
+ int i;
+
+ for (i = 0; i < rfill; i++) {
+ /* Move all keys to the left */
+ setkey(geo, left, bkey(geo, right, i), lfill + i);
+ setval(geo, left, bval(geo, right, i), lfill + i);
+ }
+ /* Exchange left and right child in parent */
+ setval(geo, parent, (unsigned long)right, lpos);
+ setval(geo, parent, (unsigned long)left, lpos + 1);
+ /* Remove left (formerly right) child from parent */
+ btree_remove_level(head, geo, bkey(geo, parent, lpos), level + 1);
+ mempool_free(right, head->mempool);
+}
+
+static void rebalance(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level, unsigned long *child, int fill)
+{
+ unsigned long *parent, *left = NULL, *right = NULL;
+ int i, no_left, no_right;
+
+ parent = find_level(head, geo, key, level + 1);
+ i = getpos(geo, parent, key);
+ BUG_ON(bval(geo, parent, i) != (unsigned long)child);
+
+ if (i > 0) {
+ left = (unsigned long *)bval(geo, parent, i - 1);
+ no_left = getfill(geo, left, 0);
+ if (fill + no_left <= geo->no_pairs) {
+ merge(head, geo, level,
+ left, no_left,
+ child, fill,
+ parent, i - 1);
+ return;
+ }
+ }
+ if (i + 1 < getfill(geo, parent, i)) {
+ right = (unsigned long *)bval(geo, parent, i + 1);
+ no_right = getfill(geo, right, 0);
+ if (fill + no_right <= geo->no_pairs) {
+ merge(head, geo, level,
+ child, fill,
+ right, no_right,
+ parent, i);
+ return;
+ }
+ }
+ /*
+ * We could also try to steal one entry from the left or right
+ * neighbor. By not doing so we changed the invariant from
+ * "all nodes are at least half full" to "no two neighboring
+ * nodes can be merged". Which means that the average fill of
+ * all nodes is still half or better.
+ */
+}
+
+static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level)
+{
+ unsigned long *node;
+ int i, pos, fill;
+ void *ret;
+
+ if (level > head->height) {
+ /* we recursed all the way up */
+ head->height = 0;
+ head->node = NULL;
+ return NULL;
+ }
+
+ node = find_level(head, geo, key, level);
+ pos = getpos(geo, node, key);
+ fill = getfill(geo, node, pos);
+ if ((level == 1) && (keycmp(geo, node, pos, key) != 0))
+ return NULL;
+ ret = (void *)bval(geo, node, pos);
+
+ /* remove and shift */
+ for (i = pos; i < fill - 1; i++) {
+ setkey(geo, node, bkey(geo, node, i + 1), i);
+ setval(geo, node, bval(geo, node, i + 1), i);
+ }
+ clearpair(geo, node, fill - 1);
+
+ if (fill - 1 < geo->no_pairs / 2) {
+ if (level < head->height)
+ rebalance(head, geo, key, level, node, fill - 1);
+ else if (fill - 1 == 1)
+ btree_shrink(head, geo);
+ }
+
+ return ret;
+}
+
+void *btree_remove(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key)
+{
+ if (head->height == 0)
+ return NULL;
+
+ return btree_remove_level(head, geo, key, 1);
+}
+
+int btree_merge(struct btree_head *target, struct btree_head *victim,
+ struct btree_geo *geo, unsigned long *duplicate)
+{
+ unsigned long *key;
+ void *val;
+ int err;
+
+ BUG_ON(target == victim);
+
+ if (!(target->node)) {
+ /* target is empty, just copy fields over */
+ target->node = victim->node;
+ target->height = victim->height;
+ __btree_init(victim);
+ return 0;
+ }
+
+ for (;;) {
+ key = btree_last(victim, geo);
+ if (!key)
+ break;
+ val = btree_lookup(victim, geo, key);
+ err = btree_insert(target, geo, key, val);
+ if (err)
+ return err;
+ /* We must make a copy of the key, as the original will get
+ * mangled inside btree_remove. */
+ longcpy(duplicate, key, geo->keylen);
+ btree_remove(victim, geo, duplicate);
+ }
+ return 0;
+}
+
+static size_t __btree_for_each(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *node, long opaque,
+ void (*func)(void *elem, long opaque,
+ unsigned long *key, size_t index, void *func2),
+ void *func2, int reap, int height, size_t count)
+{
+ int i;
+ unsigned long *child;
+
+ for (i = 0; i < geo->no_pairs; i++) {
+ child = (void *)bval(geo, node, i);
+ if (!child)
+ break;
+ if (height > 1)
+ count = __btree_for_each(head, geo, child, opaque,
+ func, func2, reap, height - 1, count);
+ else
+ func(child, opaque, bkey(geo, node, i), count++,
+ func2);
+ }
+ if (reap)
+ mempool_free(node, head->mempool);
+ return count;
+}
+
+static void empty(void *elem, long opaque, unsigned long *key, size_t index,
+ void *func2)
+{
+}
+
+void visitorl(void *elem, long opaque, unsigned long *key, size_t index,
+ void *__func)
+{
+ visitorl_t func = __func;
+
+ func(elem, opaque, *key, index);
+}
+
+void visitor32(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func)
+{
+ visitor32_t func = __func;
+ u32 *key = (void *)__key;
+
+ func(elem, opaque, *key, index);
+}
+
+void visitor64(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func)
+{
+ visitor64_t func = __func;
+ u64 *key = (void *)__key;
+
+ func(elem, opaque, *key, index);
+}
+
+void visitor128(void *elem, long opaque, unsigned long *__key, size_t index,
+ void *__func)
+{
+ visitor128_t func = __func;
+ u64 *key = (void *)__key;
+
+ func(elem, opaque, key[0], key[1], index);
+}
+
+size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
+ long opaque,
+ void (*func)(void *elem, long opaque, unsigned long *key,
+ size_t index, void *func2), void *func2)
+{
+ size_t count = 0;
+
+ if (!func)
+ func = empty;
+ if (head->node)
+ count = __btree_for_each(head, geo, head->node, opaque, func,
+ func2, 0, head->height, 0);
+ return count;
+}
+
+size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
+ long opaque,
+ void (*func)(void *elem, long opaque, unsigned long *key,
+ size_t index, void *func2), void *func2)
+{
+ size_t count = 0;
+
+ if (!func)
+ func = empty;
+ if (head->node)
+ count = __btree_for_each(head, geo, head->node, opaque, func,
+ func2, 1, head->height, 0);
+ __btree_init(head);
+ return count;
+}
+
+static int __init btree_module_init(void)
+{
+ btree_cachep = kmem_cache_create("btree_node", NODESIZE, 0,
+ SLAB_HWCACHE_ALIGN, NULL);
+ return 0;
+}
+
+/* If core code starts using btree, initialization should happen even earlier */
+module_init(btree_module_init);
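
A minimal user of the above, for illustration only (function and names
are made up, not part of the patch; the mempool plumbing is the
caller's job):

#include <linux/btree.h>
#include <linux/mempool.h>

static struct btree_head64 tree;
static mempool_t *pool;
static int object;	/* dummy payload; NULL values are illegal */

static int __init btree_example_init(void)
{
	int err;

	/* keep one spare node; btree_alloc/btree_free wrap the slab cache */
	pool = mempool_create(1, btree_alloc, btree_free, NULL);
	if (!pool)
		return -ENOMEM;
	btree_init64(&tree, pool, GFP_KERNEL);

	err = btree_insert64(&tree, 42, &object);
	if (!err) {
		BUG_ON(btree_lookup64(&tree, 42) != &object);
		BUG_ON(btree_remove64(&tree, 42) != &object);
	}

	/* reap any remaining nodes before dropping the pool */
	btree_grim_visitor(&tree.h, &btree_geo64, 0, NULL, NULL);
	mempool_destroy(pool);
	return err;
}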

2008-11-05 19:57:29

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

Sorry for the late reply.

On Sat, 2008-11-01 at 16:59 +0100, Jörn Engel wrote:

> Not only compiles, also survives muserspace test harness and no longer
> looks like a bulldog with lipstick.

cool :)

> You might note that "less complicated than expected" means I'm still
> bending the rules a bit. I only merge underpopulated nodes with
> neighbours. If the neighbour is too full for a merge I don't steal
> entries. That bit is for someone else to implement. :)

That sounds fair, after all at least it limits the tree size, but I'm
too lazy to calculate the worst case right now.

I'd use this as-is for the WIP cfg80211 code that keeps track of BSSes,
as a hash-table/list on steroids.

johannes



2008-11-05 20:06:54

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Wed, 5 November 2008 20:57:20 +0100, Johannes Berg wrote:
>
> That sounds fair, after all at least it limits the tree size, but I'm
> too lazy to calculate the worst case right now.

Worst case you have a single entry next to a full node, then split the
full node, etc. So slightly more than 1/4 populated.

> I'd use this as-is for the WIP cfg80211 code that keeps track of BSSes,
> as a hash-table/list on steroids.

Excellent. With this trojan horse in place I don't have to worry much
about a competing library with a different interface. Famous last
words. :)

Jörn

--
The wise man seeks everything in himself; the ignorant man tries to get
everything from somebody else.
-- unknown

2008-11-05 20:12:43

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Wed, 2008-11-05 at 21:06 +0100, Jörn Engel wrote:
> On Wed, 5 November 2008 20:57:20 +0100, Johannes Berg wrote:
> >
> > That sounds fair, after all at least it limits the tree size, but I'm
> > too lazy to calculate the worst case right now.
>
> Worst case you have a single entry next to a full node, then split the
> full node, etc. So slightly more than 1/4 populated.

That would be roughly 1/3, no? But it doesn't really matter much.

> > I'd use this as-is for the WIP cfg80211 code that keeps track of BSSes,
> > as a hash-table/list on steroids.
>
> Excellent. With this trojan horse in place I don't have to worry much
> about a competing library with a different interface. Famous last
> words. :)

:)
How are we going to synchronise this? I'm not in a hurry with this
scanning code and I still need to work on it anyway.

johannes



2008-11-05 20:22:18

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Wed, 5 November 2008 21:12:34 +0100, Johannes Berg wrote:
> On Wed, 2008-11-05 at 21:06 +0100, Jörn Engel wrote:
> > On Wed, 5 November 2008 20:57:20 +0100, Johannes Berg wrote:
> > >
> > > That sounds fair, after all at least it limits the tree size, but I'm
> > > too lazy to calculate the worst case right now.
> >
> > Worst case you have a single entry next to a full node, then split the
> > full node, etc. So slightly more than 1/4 populated.
>
> That would be roughly 1/3, no? But it doesn't really matter much.

When entries/node approaches infinity it is 1/4. With 4 entries/node,
which is the smallest useful number, it is 3/8.
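
Spelled out: the stable degenerate pattern alternates nodes holding 1
and no_pairs/2 entries. Nothing ever merges them, because rebalancing
only runs on a node that a removal actually hit. Average fill is then
(1 + no_pairs/2) / (2 * no_pairs) of capacity, i.e. 1/4 + 1/(2*no_pairs):
3/8 for no_pairs = 4, falling towards 1/4 as no_pairs grows.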

> How we going to synchronise this? I'm not in a hurry with this scanning
> code and I need to work on it still anyway.

I could create a git tree and ask for inclusion in -next. Or you could
combine it with your code. I slightly favor the latter, as a library
without any users is somewhat... useless.

Jörn

--
When in doubt, use brute force.
-- Ken Thompson

2008-11-05 20:25:35

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Wed, 2008-11-05 at 21:21 +0100, Jörn Engel wrote:
> On Wed, 5 November 2008 21:12:34 +0100, Johannes Berg wrote:
> > On Wed, 2008-11-05 at 21:06 +0100, Jörn Engel wrote:
> > > On Wed, 5 November 2008 20:57:20 +0100, Johannes Berg wrote:
> > > >
> > > > That sounds fair, after all at least it limits the tree size, but I'm
> > > > too lazy to calculate the worst case right now.
> > >
> > > Worst case you have a single entry next to a full node, then split the
> > > full node, etc. So slightly more than 1/4 populated.
> >
> > That would be roughly 1/3, no? But it doesn't really matter much.
>
> When entries/node approaches infinity it is 1/4. With 4 entries/node,
> which is the smallest useful number, it is 3/8.

Ah, right.

> > How we going to synchronise this? I'm not in a hurry with this scanning
> > code and I need to work on it still anyway.
>
> I could create a git tree and ask for inclusion in -next. Or you could
> combine it with your code. I slightly favor the latter, as a library
> without any users is somewhat... useless.

I can do that, though I can't promise right now when I'll actually be
able to solve the remaining problems with that code. And I also thought
you already had another user in mind :)

johannes



2008-11-07 07:52:38

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Wed, 5 November 2008 21:25:24 +0100, Johannes Berg wrote:
> On Wed, 2008-11-05 at 21:21 +0100, Jörn Engel wrote:
> >
> > I could create a git tree and ask for inclusion in -next. Or you could
> > combine it with your code. I slightly favor the latter, as a library
> > without any users is somewhat... useless.
>
> I can do that, though I can't promise right now when I'll actually be
> able to solve the remaining problems with that code. And I also thought
> you already had another user in mind :)

Of course I do. And if my code gets merged first I'll take the btree
along, promised.

Jörn

--
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu

2009-01-08 12:03:37

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Sat, 2008-11-01 at 16:59 +0100, Jörn Engel wrote:

[btree library]

Alright, back to this. I'm going to need something like the patch below
(except the update facility, which I thought I needed but then changed
my mind about), which sits on top of your patch. Does that look sane?

One thing that seems a little odd is requiring a btree_init() but not
having a btree_destroy(), instead making the caller manage the mempool
(which invariably leads to two mempool pointers being carried around
unless callers want to poke into the btree struct details)... Do you
think this is required? Could we have a btree_init/btree_destroy pair
instead that takes care of the mempool stuff?

Another thing that I need is a visitor that deletes _some_ entries. How
can I do that? Additionally, it would be nice to be able to abort a tree
walk, when you have determined that you can't continue for whatever
reason.

johannes
---
include/linux/btree.h | 31 +++++++++++++++++++++++++-
lib/Kconfig | 2 -
lib/btree.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 90 insertions(+), 2 deletions(-)

--- wireless-testing.orig/lib/Kconfig 2009-01-07 22:58:29.000000000 +0100
+++ wireless-testing/lib/Kconfig 2009-01-07 22:58:32.000000000 +0100
@@ -137,7 +137,7 @@ config PLIST
boolean

config BTREE
- boolean
+ tristate

config HAS_IOMEM
boolean
--- wireless-testing.orig/lib/btree.c 2009-01-07 22:58:35.000000000 +0100
+++ wireless-testing/lib/btree.c 2009-01-08 00:50:37.000000000 +0100
@@ -47,6 +47,7 @@
#include <linux/cache.h>
#include <linux/kernel.h>
#include <linux/slab.h>
+#include <linux/module.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define NODESIZE MAX(L1_CACHE_BYTES, 128)
@@ -55,17 +56,20 @@ struct btree_geo btree_geo32 = {
.keylen = 1,
.no_pairs = NODESIZE / sizeof(long) / 2,
};
+EXPORT_SYMBOL_GPL(btree_geo32);

#define LONG_PER_U64 (64 / BITS_PER_LONG)
struct btree_geo btree_geo64 = {
.keylen = LONG_PER_U64,
.no_pairs = NODESIZE / sizeof(long) / (1 + LONG_PER_U64),
};
+EXPORT_SYMBOL_GPL(btree_geo64);

struct btree_geo btree_geo128 = {
.keylen = 2 * LONG_PER_U64,
.no_pairs = NODESIZE / sizeof(long) / (1 + 2 * LONG_PER_U64),
};
+EXPORT_SYMBOL_GPL(btree_geo128);

static struct kmem_cache *btree_cachep;

@@ -73,11 +77,13 @@ void *btree_alloc(gfp_t gfp_mask, void *
{
return kmem_cache_alloc(btree_cachep, gfp_mask);
}
+EXPORT_SYMBOL_GPL(btree_alloc);

void btree_free(void *element, void *pool_data)
{
kmem_cache_free(btree_cachep, element);
}
+EXPORT_SYMBOL_GPL(btree_free);

static unsigned long *btree_node_alloc(struct btree_head *head)
{
@@ -217,6 +223,7 @@ void btree_init(struct btree_head *head,
head->mempool = mempool;
head->gfp_mask = gfp;
}
+EXPORT_SYMBOL_GPL(btree_init);

unsigned long *btree_last(struct btree_head *head, struct btree_geo *geo)
{
@@ -231,6 +238,7 @@ unsigned long *btree_last(struct btree_h

return bkey(geo, node, 0);
}
+EXPORT_SYMBOL_GPL(btree_last);

static int keycmp(struct btree_geo *geo, unsigned long *node, int pos,
unsigned long *key)
@@ -266,6 +274,39 @@ void *btree_lookup(struct btree_head *he
return (void *)bval(geo, node, i);
return NULL;
}
+EXPORT_SYMBOL_GPL(btree_lookup);
+
+int btree_update(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val)
+{
+ int i, height = head->height;
+ unsigned long *node = head->node;
+
+ if (height == 0)
+ return -ENOENT;
+
+ for ( ; height > 1; height--) {
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+ if (i == geo->no_pairs)
+ return -ENOENT;
+ node = (unsigned long *)bval(geo, node, i);
+ if (!node)
+ return -ENOENT;
+ }
+
+ if (!node)
+ return -ENOENT;
+
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) == 0) {
+ setval(geo, node, (unsigned long)val, i);
+ return 0;
+ }
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(btree_update);

static int getpos(struct btree_geo *geo, unsigned long *node,
unsigned long *key)
@@ -417,6 +458,7 @@ int btree_insert(struct btree_head *head
{
return btree_insert_level(head, geo, key, (unsigned long)val, 1);
}
+EXPORT_SYMBOL_GPL(btree_insert);

static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, int level);
@@ -527,6 +569,7 @@ void *btree_remove(struct btree_head *he

return btree_remove_level(head, geo, key, 1);
}
+EXPORT_SYMBOL_GPL(btree_remove);

int btree_merge(struct btree_head *target, struct btree_head *victim,
struct btree_geo *geo, unsigned long *duplicate)
@@ -560,6 +603,7 @@ int btree_merge(struct btree_head *targe
}
return 0;
}
+EXPORT_SYMBOL_GPL(btree_merge);

static size_t __btree_for_each(struct btree_head *head, struct btree_geo *geo,
unsigned long *node, long opaque,
@@ -598,6 +642,7 @@ void visitorl(void *elem, long opaque, u

func(elem, opaque, *key, index);
}
+EXPORT_SYMBOL_GPL(visitorl);

void visitor32(void *elem, long opaque, unsigned long *__key, size_t index,
void *__func)
@@ -607,6 +652,7 @@ void visitor32(void *elem, long opaque,

func(elem, opaque, *key, index);
}
+EXPORT_SYMBOL_GPL(visitor32);

void visitor64(void *elem, long opaque, unsigned long *__key, size_t index,
void *__func)
@@ -616,6 +662,7 @@ void visitor64(void *elem, long opaque,

func(elem, opaque, *key, index);
}
+EXPORT_SYMBOL_GPL(visitor64);

void visitor128(void *elem, long opaque, unsigned long *__key, size_t index,
void *__func)
@@ -625,6 +672,7 @@ void visitor128(void *elem, long opaque,

func(elem, opaque, key[0], key[1], index);
}
+EXPORT_SYMBOL_GPL(visitor128);

size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
long opaque,
@@ -640,6 +688,7 @@ size_t btree_visitor(struct btree_head *
func2, 0, head->height, 0);
return count;
}
+EXPORT_SYMBOL_GPL(btree_visitor);

size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
long opaque,
@@ -656,6 +705,7 @@ size_t btree_grim_visitor(struct btree_h
__btree_init(head);
return count;
}
+EXPORT_SYMBOL_GPL(btree_grim_visitor);

static int __init btree_module_init(void)
{
@@ -664,5 +714,14 @@ static int __init btree_module_init(void
return 0;
}

+static void __exit btree_module_exit(void)
+{
+ kmem_cache_destroy(btree_cachep);
+}
+
/* If core code starts using btree, initialization should happen even earlier */
module_init(btree_module_init);
+module_exit(btree_module_exit);
+
+MODULE_AUTHOR("Joern Engel <[email protected]>");
+MODULE_LICENSE("GPL");
--- wireless-testing.orig/include/linux/btree.h 2009-01-07 23:16:54.000000000 +0100
+++ wireless-testing/include/linux/btree.h 2009-01-08 00:48:29.000000000 +0100
@@ -43,6 +43,8 @@ void *btree_lookup(struct btree_head *he
unsigned long *key);
int btree_insert(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, void *val);
+int btree_update(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val);
void *btree_remove(struct btree_head *head, struct btree_geo *geo,
unsigned long *key);
int btree_merge(struct btree_head *target, struct btree_head *victim,
@@ -75,6 +77,12 @@ static inline int btree_insertl(struct b
return btree_insert(&head->h, &btree_geo32, &key, val);
}

+static inline int btree_updatel(struct btree_headl *head, unsigned long key,
+ void *val)
+{
+ return btree_update(&head->h, &btree_geo32, &key, val);
+}
+
static inline void *btree_removel(struct btree_headl *head, unsigned long key)
{
return btree_remove(&head->h, &btree_geo32, &key);
@@ -124,6 +132,12 @@ static inline int btree_insert32(struct
return btree_insert(&head->h, &btree_geo32, (unsigned long *)&key, val);
}

+static inline int btree_update32(struct btree_head32 *head, u32 key,
+ void *val)
+{
+ return btree_update(&head->h, &btree_geo32, (unsigned long *)&key, val);
+}
+
static inline void *btree_remove32(struct btree_head32 *head, u32 key)
{
return btree_remove(&head->h, &btree_geo32, (unsigned long *)&key);
@@ -172,6 +186,12 @@ static inline int btree_insert64(struct
return btree_insert(&head->h, &btree_geo64, (unsigned long *)&key, val);
}

+static inline int btree_update64(struct btree_head64 *head, u64 key,
+ void *val)
+{
+ return btree_update(&head->h, &btree_geo64, (unsigned long *)&key, val);
+}
+
static inline void *btree_remove64(struct btree_head64 *head, u64 key)
{
return btree_remove(&head->h, &btree_geo64, (unsigned long *)&key);
@@ -220,7 +240,16 @@ static inline int btree_insert128(struct
void *val)
{
u64 key[2] = {k1, k2};
- return btree_insert(&head->h, &btree_geo128, (unsigned long *)&key, val);
+ return btree_insert(&head->h, &btree_geo128,
+ (unsigned long *)&key, val);
+}
+
+static inline int btree_update128(struct btree_head128 *head, u64 k1, u64 k2,
+ void *val)
+{
+ u64 key[2] = {k1, k2};
+ return btree_update(&head->h, &btree_geo128,
+ (unsigned long *)&key, val);
}

static inline void *btree_remove128(struct btree_head128 *head, u64 k1, u64 k2)

2009-01-08 16:24:48

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Thu, 8 January 2009 01:57:21 +0100, Johannes Berg wrote:
> On Sat, 2008-11-01 at 16:59 +0100, Jörn Engel wrote:
>
> [btree library]
>
> Alright, back to this. I'm going to need something like the patch below
> (except the update facility, which I thought I needed but then changed
> my mind) which is on top of your patch. Does that look sane?

At first glance it does. The update looks like someone might need it,
but doesn't have a user yet. The btree_merge is similar, as I used to
need it, but no longer do. Maybe #if 0 those two.

> One thing that seems a little odd is requiring a btree_init(), but not
> having a btree_destroy(), but rather requiring managing the mempool in
> the caller (which invariably leads to two mempool pointers being
> required unless callers want to look into the btree struct details)...
> Do you think this is required? Could we have a btree_init/btree_destroy
> instead that take care of the mempool stuff?

Well, I used to have heaps of btrees around and wanted to share the
mempool between all of them. Not sure if I still do, I believe not.
Will check.

If mempools aren't shared, I agree that encapsulating those bits is much
nicer.

> Another thing that I need is a visitor that deletes _some_ entries. How
> can I do that? Additionally, it would be nice to be able to abort a tree
> walk, when you have determined that you can't continue for whatever
> reason.

If you want to open-code it, you can use btree_lookup_less(). I added
that function sometime last month. Basically something like this:
key = btree_last(head, geo);
while (key) {
	/* do something with key */
	key = btree_lookup_less(head, geo, key);
}
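
Fleshed out for the delete-some-entries case you asked about
(should_drop() is a stand-in for your predicate and I hardcode
btree_geo64; both are illustration only). Note the key copy -
btree_remove() mangles the node the returned pointer aims into, which
is why btree_merge() makes a copy as well:

	unsigned long *key;
	u64 copy;

	key = btree_last(head, &btree_geo64);
	while (key) {
		copy = *(u64 *)key;	/* key storage we own */
		if (should_drop(btree_lookup(head, &btree_geo64, key)))
			btree_remove(head, &btree_geo64,
					(unsigned long *)&copy);
		key = btree_lookup_less(head, &btree_geo64,
					(unsigned long *)&copy);
	}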

Maybe it should be renamed to btree_get_prev_key(). I never noticed how
confusing it was because my head was busy with other problems. The code
is currently buried within my logfs tree:
http://logfs.org/git?p=logfs;a=summary

Will merge it with your patch and respin later tonight or tomorrow...

Jörn

--
I can say that I spend most of my time fixing bugs even if I have lots
of new features to implement in mind, but I give bugs more priority.
-- Andrea Arcangeli, 2000

2009-01-08 16:33:59

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Thu, 2009-01-08 at 17:24 +0100, Jörn Engel wrote:

> > Alright, back to this. I'm going to need something like the patch below
> > (except the update facility, which I thought I needed but then changed
> > my mind) which is on top of your patch. Does that look sane?
>
> At first glance it does. The update looks like someone might need it,
> but doesn't have a user yet. The btree_merge is similar, as I used to
> need it, but no longer do. Maybe #if 0 those two.

Sure, works for me. I just had written it and didn't want to keep it
from the world after changing my mind about needing it.

> Well, I used to have heaps of btrees around and wanted to share the
> mempool between all of them. Not sure if I still do, I believe not.
> Will check.
>
> If mempools aren't shared, I agree that encapsulating those bits is much
> nicer.

That just made the API a bit quirky; maybe support both ways of using
things? It would only need a flag somewhere in the btree structure, I'd
think; ultimately it doesn't matter much to me though.

> > Another thing that I need is a visitor that deletes _some_ entries. How
> > can I do that? Additionally, it would be nice to be able to abort a tree
> > walk, when you have determined that you can't continue for whatever
> > reason.
>
> If you want to open-code it, you can use btree_lookup_less(). I added
> that function sometime last month. Basically something like this:
> key = btree_last(head, geo);
> while (key) {
> /* do something with key */
> key = btree_lookup_less(head, geo, key);
> }
>
> Maybe it should be renamed to btree_get_prev_key(). I never noticed how
> confusing it was because my head was busy with other problems. The code
> is currently buried within my logfs tree:
> http://logfs.org/git?p=logfs;a=summary

I don't think I've understood this yet. That code above there is a
backwards walk over the key space, and it's safe to call
btree_remove(key) at /* do something with key */?

> Will merge it with your patch and respin later tonight or tomorrow...

Cool, thanks. Mainly making it a module for testing anyway.

johannes



2009-01-08 16:50:12

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2


> Maybe it should be renamed to btree_get_prev_key(). I never noticed how
> confusing it was because my head was busy with other problems. The code
> is currently buried within my logfs tree:
> http://logfs.org/git?p=logfs;a=summary

I see this:


- * Second trick is to special-case the key "0" or NUL. As seen above, this
- * value indicates an unused slot, so such a value should not be stored in the
- * tree itself. Instead it is stored in the null_ptr field in the btree_head.

Does that mean that wasn't true, and I can store a 0 key?

johannes



2009-01-08 17:09:44

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Thu, 2009-01-08 at 17:24 +0100, Jörn Engel wrote:

> If you want to open-code it, you can use btree_lookup_less(). I added
> that function sometime last month. Basically something like this:
> key = btree_last(head, geo);
> while (key) {
> /* do something with key */
> key = btree_lookup_less(head, geo, key);
> }

Ok, so looking deeper into this, how about adding

#define btree_for_each_key(head, geo, key, tmp) \
	for (key = btree_last(head, geo), tmp = btree_lookup_less(head, geo, key); \
	     key; key = tmp, tmp = btree_lookup_less(head, geo, key))

(and possibly some type-checking variants that hardcode the geo)
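
Usage would then look something like this, with the geo hardcoded the
way such a variant would do it (a sketch; tree setup and body made up):

	struct btree_head64 tree;	/* initialised elsewhere */
	unsigned long *key, *tmp;

	btree_for_each_key(&tree.h, &btree_geo64, key, tmp)
		printk(KERN_DEBUG "key %lx\n", *key);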

Does that seem correct? And would it be possible to provide btree_last()
that takes an void ** and fills it with the last entry, and the same for
lookup_less(), so we can write btree_for_each_entry() too?

johannes



2009-01-08 19:40:48

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Thu, 8 January 2009 17:34:10 +0100, Johannes Berg wrote:
>
> > Well, I used to have heaps of btrees around and wanted to share the
> > mempool between all of them. Not sure if I still do, I believe not.
> > Will check.
> >
> > If mempools aren't shared, I agree that encapsulating those bits is much
> > nicer.
>
> Just made the API a bit quirky, maybe just support both ways of using
> things? Would only need a flag somewhere in the btree structure, I'd
> think; ultimately it doesn't matter much to me though.

I don't think we need a flag. Just a regular pair of init/cleanup
functions and some __init_i_want_to_share_mempools_and_crawl_on_my_knees
without a cleanup for those who need it.

> > If you want to open-code it, you can use btree_lookup_less(). I added
> > that function sometime last month. Basically something like this:
> > key = btree_last(head, geo);
> > while (key) {
> > /* do something with key */
> > key = btree_lookup_less(head, geo, key);
> > }
> >
> > Maybe it should be renamed to btree_get_prev_key(). I never noticed how
> > confusing it was because my head was busy with other problems. The code
> > is currently burried within my logfs tree:
> > http://logfs.org/git?p=logfs;a=summary
>
> I don't think I've understood this yet. That code above there is a
> backwards walk over the key space, and it's safe to call
> btree_remove(key) at /* do something with key */?

Yes, it's a backwards walk. The btree is sorted backwards as a
micro-optimization. Loops will terminate a bit earlier without a
complicated test this way.

And you can do absolutely anything at /* do something with key */. The
btree_get_prev_key() will simply tell you what the next-lowest key in
the btree is at that time. You no longer need locking around the whole
visitor, but can release the lock after each step, if you want.

Downside is that each step will walk the btree from the root again.
Twice, because you also need to lookup/remove each element. Tanstaafl.

Jörn

--
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

2009-01-08 19:46:56

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Thu, 8 January 2009 17:50:04 +0100, Johannes Berg wrote:
>
> I see this:
>
>
> - * Second trick is to special-case the key "0" or NUL. As seen above, this
> - * value indicates an unused slot, so such a value should not be stored in the
> - * tree itself. Instead it is stored in the null_ptr field in the btree_head.
>
> Does that mean that wasn't true, and I can store a 0 key?

Ahh, don't look at that! The embarrassment is unbearable! Go away!

Yes, I used to have a special exception for a 0 key. But I also have
a special exception for a NULL value and can test against that instead
of testing against a 0 key. It is even faster, because the value is not
variable-sized. In hindsight it is hard to explain why I ever did that.

Jörn

--
Unless something dramatically changes, by 2015 we'll be largely
wondering what all the fuss surrounding Linux was really about.
-- Rob Enderle

2009-01-08 20:03:06

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Thu, 8 January 2009 18:10:01 +0100, Johannes Berg wrote:
> On Thu, 2009-01-08 at 17:24 +0100, Jörn Engel wrote:
>
> > If you want to open-code it, you can use btree_lookup_less(). I added
> > that function sometime last month. Basically something like this:
> > key = btree_last(head, geo);
> > while (key) {
> > /* do something with key */
> > key = btree_lookup_less(head, geo, key);
> > }
>
> Ok, so looking deeper into this, how about adding
>
> #define btree_for_each_key(head, geo, key, tmp) \
> for (key = btree_last(head, geo), tmp = btree_get_prev_key(head, geo, key);
> key; key = tmp, tmp = btree_get_prev_key(head, geo, key))

[ Changed the function name above. It really isn't a lookup, it returns
a key, not a value. My fault. ]

Looks correct otherwise. Probably needs a comment that without "tmp" we
would skip a 0 key. Or am I the only one who wants to simplify the code
before spotting this little subtlety?

> (and possibly some type-checking variants that hardcode the geo)
>
> Does that seem correct? And would it be possible to provide btree_last()
> that takes an void ** and fills it with the last entry, and the same for
> lookup_less(), so we can write btree_for_each_entry() too?

Not sure what you mean. Something with the same effect as this?

#define btree_for_each_val(head, geo, key, val) \
	for (key = btree_last(head, geo), \
	     val = btree_lookup(head, geo, key); \
	     val; \
	     key = btree_get_prev_key(head, geo, key), \
	     val = btree_lookup(head, geo, key))

Jörn

--
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt

2009-01-08 20:18:52

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Thu, 2009-01-08 at 21:02 +0100, Jörn Engel wrote:

> > #define btree_for_each_key(head, geo, key, tmp) \
> > for (key = btree_last(head, geo), tmp = btree_get_prev_key(head, geo, key);
> > key; key = tmp, tmp = btree_get_prev_key(head, geo, key))
>
> [ Changed the function name above. It really isn't a lookup, it returns
> a key, not a value. My fault. ]

Right.

> Looks correct otherwise. Probably needs a comment that without "tmp" we
> would skip a 0 key. Or am I the only one who wants to simplify the code
> before spotting this little subtlety?

I, uh, I didn't even realise that. I think the code for
btree_last/btree_get_prev_key isn't correct as is since the 0 key is
valid, but you can't tell whether it returned 0 because it didn't find
anything, or because there was no more entry. Or am I missing something?

> > (and possibly some type-checking variants that hardcode the geo)
> >
> > Does that seem correct? And would it be possible to provide btree_last()
> > that takes an void ** and fills it with the last entry, and the same for
> > lookup_less(), so we can write btree_for_each_entry() too?
>
> Not sure what you mean. Something with the same effect as this?
>
> #define btree_for_each_val(head, geo, key, val) \
> for (key = btree_last(head, geo), \
> val = btree_lookup(head, geo, key); \
> val; \
> key = btree_get_prev_key(head, geo, key), \
> val = btree_lookup(head, geo, key))

Well, that does lots of lookups that don't seem necessary, since a
function like btree_last should be able to return the value right away.
Also, if it was

#define btree_for_each_val(head, geo, key, val) \
	for (val = btree_last(head, geo, &key); \
	     val; \
	     val = btree_get_prev(head, geo, &key))

it would be more correct, I think?
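
I.e. signatures along the lines of (a sketch, not from any posted
patch):

void *btree_last(struct btree_head *head, struct btree_geo *geo,
		 unsigned long *key);	/* copy last key to *key, return value */
void *btree_get_prev(struct btree_head *head, struct btree_geo *geo,
		     unsigned long *key);	/* step *key downwards */

Copying the key into caller-owned storage would also sidestep the
key-mangling issue around btree_remove().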

johannes



2009-01-08 21:09:28

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] B+Tree library V2

On Thu, 8 January 2009 21:18:55 +0100, Johannes Berg wrote:
>
> > Looks correct otherwise. Probably needs a comment that without "tmp" we
> > would skip a 0 key. Or am I the only one who wants to simplify the code
> > before spotting this little subtlety?
>
> I, uh, I didn't even realise that. I think the code for
> btree_last/btree_get_prev_key isn't correct as is since the 0 key is
> valid, but you can't tell whether it returned 0 because it didn't find
> anything, or because there was no more entry. Or am I missing something?

Correct or not is a matter of opinion, so let's not go there. It
certainly is unexpected and also inefficient. The alternative would be
to return two values, the key and a flag to indicate the end.

> > > (and possibly some type-checking variants that hardcode the geo)
> > >
> > > Does that seem correct? And would it be possible to provide btree_last()
> > > that takes an void ** and fills it with the last entry, and the same for
> > > lookup_less(), so we can write btree_for_each_entry() too?
> >
> > Not sure what you mean. Something with the same effect as this?
^^^^^^^^^^^ ;)
> >
> > #define btree_for_each_val(head, geo, key, val) \
> > for (key = btree_last(head, geo), \
> > val = btree_lookup(head, geo, key); \
> > val; \
> > key = btree_get_prev_key(head, geo, key), \
> > val = btree_lookup(head, geo, key))
>
> Well, that does lots of lookups that don't seem necessary, since a
> function like btree_last should be able to return the value right away.
> Also, if it was
>
> #define btree_for_each_val(head, geo, key, val)
> for (val = btree_last(head, geo, &key);
> val;
> val = btree_get_prev(head, geo, &key))
>
> it would be more correct, I think?

More efficient, certainly. Half the tree walks are gone. Let's do it.

Note, btw, that this changes effort from O(2n) to O(n), while the old
visitor is O(1) *). That was the reason I wrote it in the first place.
If the code weren't so horrible and hard to use, it would be a clear
winner. Guess we'll have to keep both variants.

*) Counting tree walks. In time it is O(2n*log(n)), O(n*log(n)) and O(n)
respectively - each walk from the root costs O(log(n)), while the visitor
touches every node exactly once.

Jörn

--
Joern's library part 6:
http://www.gzip.org/zlib/feldspar.html