2011-01-10 15:19:15

by Lukas Czerner

Subject: [RFC PATCH 0/2] Add rbtree backend for storing bitmaps

Hi all,

as I mentioned a while ago, I am working on reducing e2fsprogs memory
utilization. The first thing that needs to be addressed is the mechanism
for storing bitmaps. So far bitarrays have been used for this purpose;
however, this approach might hit its limits with today's huge data
storage devices, because of its memory utilization.

Bitarrays store bitmaps as ...well, as bitmaps. In most cases this is
highly inefficient, because we need to allocate memory even for the big
parts of the bitmaps we will never use, resulting in high memory
utilization especially for huge filesystems, where bitmaps might occupy
gigabytes of space.

To address this I have created an rbtree backend for storing bitmaps. It
stores just the used extents of the bitmaps, which means that it can be more
memory efficient in most cases, and with careful use it might be much faster
as well.
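
For illustration, every node of the tree describes one contiguous run of
set bits; this is the extent structure patch 2/2 uses for it:

	struct bmap_rb_extent {
		struct rb_node node;	/* tree linkage */
		__u64 start;		/* first set bit of the run */
		__u32 count;		/* number of set bits in the run */
	};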

The rbtree implementation itself is taken from the Linux kernel, with some
minor changes to make it work outside the kernel. So this should not cause
any problems at all, since it has been really extensively tested throughout
its life.

I have done a lot of testing to validate the new backend and to find as many
bugs as possible. Now it seems to be quite reliable, however since this is
the most crucial part of e2fsprogs it has to be tested widely in various
scenarios. So this is my appeal to you guys: please take it, build it, test
it, and test it some more, to really make sure it is doing what it is
supposed to.

The other side of the question is: does it really help with memory
utilization? My answer is YES (...but). Of course there is a "but". One
rbtree node on a 64-bit system takes 40B of memory, which is 320 bits of
information. So there might be situations where a single rbtree node does
not cover even 320 bits of the bitmap it stores; in that case the node is
not very efficient. If we have a lot of inefficient nodes we might actually
end up with higher memory utilization than the bitarrays themselves, and
that's bad.
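
Here is a back-of-the-envelope accounting of those 40B, assuming 8-byte
longs and pointers and the usual struct padding (both structures are from
the patches below):

	struct rb_node {			/* 3 * 8B = 24B */
		unsigned long rb_parent_color;
		struct rb_node *rb_right;
		struct rb_node *rb_left;
	};

	struct bmap_rb_extent {
		struct rb_node node;		/* 24B */
		__u64 start;			/* 8B */
		__u32 count;			/* 4B + 4B of padding */
	};					/* = 40B = 320 bits */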

Now, the answer to this problem is benchmarking, to figure out how probable
this situation is and how (and when) bad it can get. We will probably need
to create some fallback switch which converts bitmaps from one backend
(rbtree) to the other (bitarray) depending on which appears more efficient
for the situation, but let's keep it simple for now and let's figure out how
bad the problem actually is.
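
To make the fallback idea concrete, here is a minimal sketch of such a
conversion. It is purely hypothetical and not part of this series; it
assumes the bitmap_ops, start and real_end fields of the generic 64-bit
bitmap in gen_bitmap64.c, copies chunk by chunk through the backend ops
added in patch 2/2, and omits all error handling:

	#define CONV_CHUNK	32768	/* bits copied per step */

	/* Hypothetical sketch: copy the contents of one 64-bit bitmap
	 * into another through the generic backend ops. */
	static void copy_bmap_contents(ext2fs_generic_bitmap src,
				       ext2fs_generic_bitmap dest)
	{
		unsigned char buf[CONV_CHUNK / 8];
		__u64 pos;
		size_t n;

		for (pos = src->start; pos <= src->real_end; pos += CONV_CHUNK) {
			n = src->real_end - pos + 1;
			if (n > CONV_CHUNK)
				n = CONV_CHUNK;
			/* read from one backend, write into the other */
			src->bitmap_ops->get_bmap_range(src, pos, n, buf);
			dest->bitmap_ops->set_bmap_range(dest, pos, n, buf);
		}
	}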

I have done some limited benchmarking. Limited, because it takes time (a lot
of it) and also huge storage, because we do not really care about memory
utilization differences in megabytes, but rather in hundreds and thousands
of megabytes. So this is my second appeal to you guys: take it, build it,
test it and benchmark it.

For measuring memory utilization I have used valgrind (the massif tool, to
be specific). All the numbers I will show you are peak memory utilization.
So here is an example of how to use massif.

valgrind --tool=massif ./e2fsck/e2fsck -fn /dev/loop0
ms_print massif.out.5467 | less

Now I can show you some (rather optimistic) graphs of e2fsck memory
utilization I have made. Here is the link (description included):

http://people.redhat.com/lczerner/e2fsprogs_memory/graphs.pdf

Now a simple description of the graphs. The first one shows e2fsck memory
utilization depending on filesystem size. The benchmark was performed right
after fs creation, so it shows the best-case scenario for the rbtree
backend. The amount of saved memory grows by approx. 8.5MB per 100MB of
filesystem size - this is the best we can get.

The second graph shows memory utilization per inode depending on inode
count. We can see that with a growing number of inodes the bytes-per-inode
ratio drops significantly; moreover, it drops faster for bitarrays than for
rbtrees, which tells us that inode count is in fact the main factor that
impacts the memory utilization difference between the rbtree and bitarray
backends on a filesystem of constant size. However, it also strongly depends
on how the files are created on the filesystem - some loads might be more
rbtree-friendly than others.

The threshold is where the bytes per inode are equal for both backends; this
is the last point at which we would need to convert the rbtree backend to
bitarrays, because above this threshold the rbtree approach is no longer
efficient. In my load (copying the content of /usr/share several times) the
rbtree memory utilization grows by 20B per inode, whereas the bitarray
memory utilization grows by 4B per inode (on a 10TB filesystem). So the
rbtree memory consumption grows 5 times faster than the bitarrays'.

Let's simplify the situation and say that the bytes-per-inode growth rate
does not change with inode count (which is not true, but this is yet to be
tested). In this case we hit the threshold when we fill 33% of the inodes,
which means 20% of the filesystem size. Not very promising; however, is this
load a typical real-life load? That is what we need to find out.

The next graph shows e2fsck memory utilization depending on inode count,
which is practically the same thing as the previous graph.

The last two graphs were created on a filesystem aged with Impression in its
default configuration, rather than by copying /usr/share. It should provide
a more real-life filesystem image. The filesystem is rather small for now,
only 312GB. The fourth graph shows the running times of e2fsck for both
backends, rbtree and bitarrays. We can see that the rbtree backend performs
quite badly; however, I believe that I know the source of the slowdown. It
is the rb_test_bit() function, which needs to walk the tree every time
e2fsck tests a bit, and since this usually happens in sequential order
(bit-by-bit), we can easily improve performance by adding a simple cache
(similar to the write cache in rb_set_bit()). Also, with a little
improvement of e2fsck's bitmap handling (taking advantage of extents) we can
improve performance even more. But I leave those optimizations until after
we are done with the memory utilization optimization.
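
A minimal sketch of what such a read cache could look like (the rcache
field is hypothetical and not part of this series; the tree walk is the one
from rb_test_bit() in patch 2/2):

	static int rb_test_bit_cached(struct ext2fs_rb_private *bp, __u64 bit)
	{
		struct bmap_rb_extent *ext = bp->rcache; /* hypothetical field */
		struct rb_node **n = &bp->root.rb_node;

		/* 1. try the cached extent first */
		if (ext && (bit >= ext->start) &&
		    (bit < (ext->start + ext->count)))
			return 1;

		/* 2. fall back to the ordinary tree walk */
		while (*n) {
			ext = rb_entry(*n, struct bmap_rb_extent, node);
			if (bit < ext->start)
				n = &(*n)->rb_left;
			else if (bit >= (ext->start + ext->count))
				n = &(*n)->rb_right;
			else {
				bp->rcache = ext; /* 3. remember the hit */
				return 1;
			}
		}
		return 0;
	}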

Finally, the last graph shows memory utilization per inode depending on
inode count. This is similar to the second graph, however this one is a bit
more optimistic. The rbtree memory utilization grows by avg. 8B per inode,
and the bitarray grows by avg. 7B per inode, which is quite good, and in
this scenario, with a stable bytes-per-inode growth, the rbtree backend will
always be better than the bitarray. This shows us that it really depends on
the type of load as well as on the size of the filesystem. So we need to
test more and more real-life filesystems to see if the rbtree is really a
win for most users, or just for some small group who do not even care about
memory utilization at all (it is clearly the numbers from huge filesystems
we care about!).


There were some people having OOM problems with e2fsck. It would be great if
those people could try these patches to see if they help, and get back to us
to share the experience. Although I feel that I need to warn you guys: this
is not yet stable, and even though I have not seen any problems, that does
not mean it is bug free, so at least try it with the '-n' option, which
should be quite safe.

Please, comments are highly appreciated!

Thanks!
-Lukas


--
[PATCH 1/2] e2fsprogs: Add rbtree library
[PATCH 2/2] e2fsprogs: Add rbtree backend for bitmaps, use it instead of bitarrays

lib/ext2fs/Makefile.in | 13 +-
lib/ext2fs/blkmap64_rb.c | 815 +++++++++++++++++++++++++++++++++++++++++++++
lib/ext2fs/bmap64.h | 1 +
lib/ext2fs/ext2fsP.h | 1 +
lib/ext2fs/gen_bitmap64.c | 3 +
lib/ext2fs/rbtree.c | 451 +++++++++++++++++++++++++
lib/ext2fs/rbtree.h | 179 ++++++++++
7 files changed, 1461 insertions(+), 2 deletions(-)


2011-01-10 15:19:16

by Lukas Czerner

Subject: [PATCH 1/2] e2fsprogs: Add rbtree library

This commit adds an rbtree library to e2fsprogs so that it can be used for
various internal data structures. The rbtree implementation is taken from
the kernel rbtree implementation, with small changes needed for it to work
outside the kernel.

Signed-off-by: Lukas Czerner <[email protected]>
---
lib/ext2fs/Makefile.in | 11 +-
lib/ext2fs/rbtree.c | 451 ++++++++++++++++++++++++++++++++++++++++++++++++
lib/ext2fs/rbtree.h | 179 +++++++++++++++++++
3 files changed, 639 insertions(+), 2 deletions(-)
create mode 100644 lib/ext2fs/rbtree.c
create mode 100644 lib/ext2fs/rbtree.h

diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index 9c1c273..b24a53a 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -80,7 +80,8 @@ OBJS= $(DEBUGFS_LIB_OBJS) $(RESIZE_LIB_OBJS) $(E2IMAGE_LIB_OBJS) \
unix_io.o \
unlink.o \
valid_blk.o \
- version.o
+ version.o \
+ rbtree.o

SRCS= ext2_err.c \
$(srcdir)/alloc.c \
@@ -155,7 +156,8 @@ SRCS= ext2_err.c \
$(srcdir)/unlink.c \
$(srcdir)/valid_blk.c \
$(srcdir)/version.c \
- $(srcdir)/write_bb_file.c
+ $(srcdir)/write_bb_file.c \
+ $(srcdir)/rbtree.c

HFILES= bitops.h ext2fs.h ext2_io.h ext2_fs.h ext2_ext_attr.h ext3_extents.h \
tdb.h
@@ -762,3 +764,8 @@ write_bb_file.o: $(srcdir)/write_bb_file.c $(srcdir)/ext2_fs.h \
$(srcdir)/ext2_fs.h $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h \
$(srcdir)/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
$(srcdir)/ext2_ext_attr.h $(srcdir)/bitops.h
+rbtree.o: $(srcdir)/rbtree.c $(srcdir)/ext2_fs.h \
+ $(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2fs.h \
+ $(srcdir)/ext2_fs.h $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h \
+ $(srcdir)/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
+ $(srcdir)/ext2_ext_attr.h $(srcdir)/bitops.h
diff --git a/lib/ext2fs/rbtree.c b/lib/ext2fs/rbtree.c
new file mode 100644
index 0000000..0b638cf
--- /dev/null
+++ b/lib/ext2fs/rbtree.c
@@ -0,0 +1,451 @@
+/*
+ Red Black Trees
+ (C) 1999 Andrea Arcangeli <[email protected]>
+ (C) 2002 David Woodhouse <[email protected]>
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+
+ linux/lib/rbtree.c
+*/
+
+#include "rbtree.h"
+
+static void __rb_rotate_left(struct rb_node *node, struct rb_root *root)
+{
+ struct rb_node *right = node->rb_right;
+ struct rb_node *parent = rb_parent(node);
+
+ if ((node->rb_right = right->rb_left))
+ rb_set_parent(right->rb_left, node);
+ right->rb_left = node;
+
+ rb_set_parent(right, parent);
+
+ if (parent)
+ {
+ if (node == parent->rb_left)
+ parent->rb_left = right;
+ else
+ parent->rb_right = right;
+ }
+ else
+ root->rb_node = right;
+ rb_set_parent(node, right);
+}
+
+static void __rb_rotate_right(struct rb_node *node, struct rb_root *root)
+{
+ struct rb_node *left = node->rb_left;
+ struct rb_node *parent = rb_parent(node);
+
+ if ((node->rb_left = left->rb_right))
+ rb_set_parent(left->rb_right, node);
+ left->rb_right = node;
+
+ rb_set_parent(left, parent);
+
+ if (parent)
+ {
+ if (node == parent->rb_right)
+ parent->rb_right = left;
+ else
+ parent->rb_left = left;
+ }
+ else
+ root->rb_node = left;
+ rb_set_parent(node, left);
+}
+
+void rb_insert_color(struct rb_node *node, struct rb_root *root)
+{
+ struct rb_node *parent, *gparent;
+
+ while ((parent = rb_parent(node)) && rb_is_red(parent))
+ {
+ gparent = rb_parent(parent);
+
+ if (parent == gparent->rb_left)
+ {
+ {
+ register struct rb_node *uncle = gparent->rb_right;
+ if (uncle && rb_is_red(uncle))
+ {
+ rb_set_black(uncle);
+ rb_set_black(parent);
+ rb_set_red(gparent);
+ node = gparent;
+ continue;
+ }
+ }
+
+ if (parent->rb_right == node)
+ {
+ register struct rb_node *tmp;
+ __rb_rotate_left(parent, root);
+ tmp = parent;
+ parent = node;
+ node = tmp;
+ }
+
+ rb_set_black(parent);
+ rb_set_red(gparent);
+ __rb_rotate_right(gparent, root);
+ } else {
+ {
+ register struct rb_node *uncle = gparent->rb_left;
+ if (uncle && rb_is_red(uncle))
+ {
+ rb_set_black(uncle);
+ rb_set_black(parent);
+ rb_set_red(gparent);
+ node = gparent;
+ continue;
+ }
+ }
+
+ if (parent->rb_left == node)
+ {
+ register struct rb_node *tmp;
+ __rb_rotate_right(parent, root);
+ tmp = parent;
+ parent = node;
+ node = tmp;
+ }
+
+ rb_set_black(parent);
+ rb_set_red(gparent);
+ __rb_rotate_left(gparent, root);
+ }
+ }
+
+ rb_set_black(root->rb_node);
+}
+
+static void __rb_erase_color(struct rb_node *node, struct rb_node *parent,
+ struct rb_root *root)
+{
+ struct rb_node *other;
+
+ while ((!node || rb_is_black(node)) && node != root->rb_node)
+ {
+ if (parent->rb_left == node)
+ {
+ other = parent->rb_right;
+ if (rb_is_red(other))
+ {
+ rb_set_black(other);
+ rb_set_red(parent);
+ __rb_rotate_left(parent, root);
+ other = parent->rb_right;
+ }
+ if ((!other->rb_left || rb_is_black(other->rb_left)) &&
+ (!other->rb_right || rb_is_black(other->rb_right)))
+ {
+ rb_set_red(other);
+ node = parent;
+ parent = rb_parent(node);
+ }
+ else
+ {
+ if (!other->rb_right || rb_is_black(other->rb_right))
+ {
+ rb_set_black(other->rb_left);
+ rb_set_red(other);
+ __rb_rotate_right(other, root);
+ other = parent->rb_right;
+ }
+ rb_set_color(other, rb_color(parent));
+ rb_set_black(parent);
+ rb_set_black(other->rb_right);
+ __rb_rotate_left(parent, root);
+ node = root->rb_node;
+ break;
+ }
+ }
+ else
+ {
+ other = parent->rb_left;
+ if (rb_is_red(other))
+ {
+ rb_set_black(other);
+ rb_set_red(parent);
+ __rb_rotate_right(parent, root);
+ other = parent->rb_left;
+ }
+ if ((!other->rb_left || rb_is_black(other->rb_left)) &&
+ (!other->rb_right || rb_is_black(other->rb_right)))
+ {
+ rb_set_red(other);
+ node = parent;
+ parent = rb_parent(node);
+ }
+ else
+ {
+ if (!other->rb_left || rb_is_black(other->rb_left))
+ {
+ rb_set_black(other->rb_right);
+ rb_set_red(other);
+ __rb_rotate_left(other, root);
+ other = parent->rb_left;
+ }
+ rb_set_color(other, rb_color(parent));
+ rb_set_black(parent);
+ rb_set_black(other->rb_left);
+ __rb_rotate_right(parent, root);
+ node = root->rb_node;
+ break;
+ }
+ }
+ }
+ if (node)
+ rb_set_black(node);
+}
+
+void rb_erase(struct rb_node *node, struct rb_root *root)
+{
+ struct rb_node *child, *parent;
+ int color;
+
+ if (!node->rb_left)
+ child = node->rb_right;
+ else if (!node->rb_right)
+ child = node->rb_left;
+ else
+ {
+ struct rb_node *old = node, *left;
+
+ node = node->rb_right;
+ while ((left = node->rb_left) != NULL)
+ node = left;
+
+ if (rb_parent(old)) {
+ if (rb_parent(old)->rb_left == old)
+ rb_parent(old)->rb_left = node;
+ else
+ rb_parent(old)->rb_right = node;
+ } else
+ root->rb_node = node;
+
+ child = node->rb_right;
+ parent = rb_parent(node);
+ color = rb_color(node);
+
+ if (parent == old) {
+ parent = node;
+ } else {
+ if (child)
+ rb_set_parent(child, parent);
+ parent->rb_left = child;
+
+ node->rb_right = old->rb_right;
+ rb_set_parent(old->rb_right, node);
+ }
+
+ node->rb_parent_color = old->rb_parent_color;
+ node->rb_left = old->rb_left;
+ rb_set_parent(old->rb_left, node);
+
+ goto color;
+ }
+
+ parent = rb_parent(node);
+ color = rb_color(node);
+
+ if (child)
+ rb_set_parent(child, parent);
+ if (parent)
+ {
+ if (parent->rb_left == node)
+ parent->rb_left = child;
+ else
+ parent->rb_right = child;
+ }
+ else
+ root->rb_node = child;
+
+ color:
+ if (color == RB_BLACK)
+ __rb_erase_color(child, parent, root);
+}
+
+static void rb_augment_path(struct rb_node *node, rb_augment_f func, void *data)
+{
+ struct rb_node *parent;
+
+up:
+ func(node, data);
+ parent = rb_parent(node);
+ if (!parent)
+ return;
+
+ if (node == parent->rb_left && parent->rb_right)
+ func(parent->rb_right, data);
+ else if (parent->rb_left)
+ func(parent->rb_left, data);
+
+ node = parent;
+ goto up;
+}
+
+/*
+ * after inserting @node into the tree, update the tree to account for
+ * both the new entry and any damage done by rebalance
+ */
+void rb_augment_insert(struct rb_node *node, rb_augment_f func, void *data)
+{
+ if (node->rb_left)
+ node = node->rb_left;
+ else if (node->rb_right)
+ node = node->rb_right;
+
+ rb_augment_path(node, func, data);
+}
+
+/*
+ * before removing the node, find the deepest node on the rebalance path
+ * that will still be there after @node gets removed
+ */
+struct rb_node *rb_augment_erase_begin(struct rb_node *node)
+{
+ struct rb_node *deepest;
+
+ if (!node->rb_right && !node->rb_left)
+ deepest = rb_parent(node);
+ else if (!node->rb_right)
+ deepest = node->rb_left;
+ else if (!node->rb_left)
+ deepest = node->rb_right;
+ else {
+ deepest = rb_next(node);
+ if (deepest->rb_right)
+ deepest = deepest->rb_right;
+ else if (rb_parent(deepest) != node)
+ deepest = rb_parent(deepest);
+ }
+
+ return deepest;
+}
+
+/*
+ * after removal, update the tree to account for the removed entry
+ * and any rebalance damage.
+ */
+void rb_augment_erase_end(struct rb_node *node, rb_augment_f func, void *data)
+{
+ if (node)
+ rb_augment_path(node, func, data);
+}
+
+/*
+ * This function returns the first node (in sort order) of the tree.
+ */
+struct rb_node *rb_first(const struct rb_root *root)
+{
+ struct rb_node *n;
+
+ n = root->rb_node;
+ if (!n)
+ return NULL;
+ while (n->rb_left)
+ n = n->rb_left;
+ return n;
+}
+
+struct rb_node *rb_last(const struct rb_root *root)
+{
+ struct rb_node *n;
+
+ n = root->rb_node;
+ if (!n)
+ return NULL;
+ while (n->rb_right)
+ n = n->rb_right;
+ return n;
+}
+
+struct rb_node *rb_next(const struct rb_node *node)
+{
+ struct rb_node *parent;
+
+ if (rb_parent(node) == node)
+ return NULL;
+
+ /* If we have a right-hand child, go down and then left as far
+ as we can. */
+ if (node->rb_right) {
+ node = node->rb_right;
+ while (node->rb_left)
+ node=node->rb_left;
+ return (struct rb_node *)node;
+ }
+
+ /* No right-hand children. Everything down and left is
+ smaller than us, so any 'next' node must be in the general
+ direction of our parent. Go up the tree; any time the
+ ancestor is a right-hand child of its parent, keep going
+ up. First time it's a left-hand child of its parent, said
+ parent is our 'next' node. */
+ while ((parent = rb_parent(node)) && node == parent->rb_right)
+ node = parent;
+
+ return parent;
+}
+
+struct rb_node *rb_prev(const struct rb_node *node)
+{
+ struct rb_node *parent;
+
+ if (rb_parent(node) == node)
+ return NULL;
+
+ /* If we have a left-hand child, go down and then right as far
+ as we can. */
+ if (node->rb_left) {
+ node = node->rb_left;
+ while (node->rb_right)
+ node=node->rb_right;
+ return (struct rb_node *)node;
+ }
+
+ /* No left-hand children. Go up till we find an ancestor which
+ is a right-hand child of its parent */
+ while ((parent = rb_parent(node)) && node == parent->rb_left)
+ node = parent;
+
+ return parent;
+}
+
+void rb_replace_node(struct rb_node *victim, struct rb_node *new,
+ struct rb_root *root)
+{
+ struct rb_node *parent = rb_parent(victim);
+
+ /* Set the surrounding nodes to point to the replacement */
+ if (parent) {
+ if (victim == parent->rb_left)
+ parent->rb_left = new;
+ else
+ parent->rb_right = new;
+ } else {
+ root->rb_node = new;
+ }
+ if (victim->rb_left)
+ rb_set_parent(victim->rb_left, new);
+ if (victim->rb_right)
+ rb_set_parent(victim->rb_right, new);
+
+ /* Copy the pointers/colour from the victim to the replacement */
+ *new = *victim;
+}
diff --git a/lib/ext2fs/rbtree.h b/lib/ext2fs/rbtree.h
new file mode 100644
index 0000000..8820e28
--- /dev/null
+++ b/lib/ext2fs/rbtree.h
@@ -0,0 +1,179 @@
+/*
+ Red Black Trees
+ (C) 1999 Andrea Arcangeli <[email protected]>
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+
+ linux/include/linux/rbtree.h
+
+ To use rbtrees you'll have to implement your own insert and search cores.
+ This will avoid us to use callbacks and to drop drammatically performances.
+ I know it's not the cleaner way, but in C (not in C++) to get
+ performances and genericity...
+
+ Some example of insert and search follows here. The search is a plain
+ normal search over an ordered tree. The insert instead must be implemented
+ in two steps: First, the code must insert the element in order as a red leaf
+ in the tree, and then the support library function rb_insert_color() must
+ be called. Such function will do the not trivial work to rebalance the
+ rbtree, if necessary.
+
+-----------------------------------------------------------------------
+static inline struct page * rb_search_page_cache(struct inode * inode,
+ unsigned long offset)
+{
+ struct rb_node * n = inode->i_rb_page_cache.rb_node;
+ struct page * page;
+
+ while (n)
+ {
+ page = rb_entry(n, struct page, rb_page_cache);
+
+ if (offset < page->offset)
+ n = n->rb_left;
+ else if (offset > page->offset)
+ n = n->rb_right;
+ else
+ return page;
+ }
+ return NULL;
+}
+
+static inline struct page * __rb_insert_page_cache(struct inode * inode,
+ unsigned long offset,
+ struct rb_node * node)
+{
+ struct rb_node ** p = &inode->i_rb_page_cache.rb_node;
+ struct rb_node * parent = NULL;
+ struct page * page;
+
+ while (*p)
+ {
+ parent = *p;
+ page = rb_entry(parent, struct page, rb_page_cache);
+
+ if (offset < page->offset)
+ p = &(*p)->rb_left;
+ else if (offset > page->offset)
+ p = &(*p)->rb_right;
+ else
+ return page;
+ }
+
+ rb_link_node(node, parent, p);
+
+ return NULL;
+}
+
+static inline struct page * rb_insert_page_cache(struct inode * inode,
+ unsigned long offset,
+ struct rb_node * node)
+{
+ struct page * ret;
+ if ((ret = __rb_insert_page_cache(inode, offset, node)))
+ goto out;
+ rb_insert_color(node, &inode->i_rb_page_cache);
+ out:
+ return ret;
+}
+-----------------------------------------------------------------------
+*/
+
+#ifndef _LINUX_RBTREE_H
+#define _LINUX_RBTREE_H
+
+#include <stdlib.h>
+
+#undef offsetof
+#ifdef __compiler_offsetof
+#define offsetof(TYPE,MEMBER) __compiler_offsetof(TYPE,MEMBER)
+#else
+#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
+#endif
+
+#define container_of(ptr, type, member) ({ \
+ const typeof( ((type *)0)->member ) *__mptr = (ptr); \
+ (type *)( (char *)__mptr - offsetof(type,member) );})
+
+struct rb_node
+{
+ unsigned long rb_parent_color;
+#define RB_RED 0
+#define RB_BLACK 1
+ struct rb_node *rb_right;
+ struct rb_node *rb_left;
+} __attribute__((aligned(sizeof(long))));
+ /* The alignment might seem pointless, but allegedly CRIS needs it */
+
+struct rb_root
+{
+ struct rb_node *rb_node;
+};
+
+
+#define rb_parent(r) ((struct rb_node *)((r)->rb_parent_color & ~3))
+#define rb_color(r) ((r)->rb_parent_color & 1)
+#define rb_is_red(r) (!rb_color(r))
+#define rb_is_black(r) rb_color(r)
+#define rb_set_red(r) do { (r)->rb_parent_color &= ~1; } while (0)
+#define rb_set_black(r) do { (r)->rb_parent_color |= 1; } while (0)
+
+static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p)
+{
+ rb->rb_parent_color = (rb->rb_parent_color & 3) | (unsigned long)p;
+}
+static inline void rb_set_color(struct rb_node *rb, int color)
+{
+ rb->rb_parent_color = (rb->rb_parent_color & ~1) | color;
+}
+
+#define RB_ROOT (struct rb_root) { NULL, }
+#define rb_entry(ptr, type, member) container_of(ptr, type, member)
+
+#define RB_EMPTY_ROOT(root) ((root)->rb_node == NULL)
+#define RB_EMPTY_NODE(node) (rb_parent(node) == node)
+#define RB_CLEAR_NODE(node) (rb_set_parent(node, node))
+
+extern void rb_insert_color(struct rb_node *, struct rb_root *);
+extern void rb_erase(struct rb_node *, struct rb_root *);
+
+typedef void (*rb_augment_f)(struct rb_node *node, void *data);
+
+extern void rb_augment_insert(struct rb_node *node,
+ rb_augment_f func, void *data);
+extern struct rb_node *rb_augment_erase_begin(struct rb_node *node);
+extern void rb_augment_erase_end(struct rb_node *node,
+ rb_augment_f func, void *data);
+
+/* Find logical next and previous nodes in a tree */
+extern struct rb_node *rb_next(const struct rb_node *);
+extern struct rb_node *rb_prev(const struct rb_node *);
+extern struct rb_node *rb_first(const struct rb_root *);
+extern struct rb_node *rb_last(const struct rb_root *);
+
+/* Fast replacement of a single node without remove/rebalance/add/rebalance */
+extern void rb_replace_node(struct rb_node *victim, struct rb_node *new,
+ struct rb_root *root);
+
+static inline void rb_link_node(struct rb_node * node, struct rb_node * parent,
+ struct rb_node ** rb_link)
+{
+ node->rb_parent_color = (unsigned long )parent;
+ node->rb_left = node->rb_right = NULL;
+
+ *rb_link = node;
+}
+
+#endif /* _LINUX_RBTREE_H */
--
1.7.2.3


2011-01-10 15:19:20

by Lukas Czerner

Subject: [PATCH 2/2] e2fsprogs: Add rbtree backend for bitmaps, use it instead of bitarrays

For a long time we have had a bitarray backend for storing filesystem
metadata bitmaps; however, today this approach might hit its limits with
today's huge data storage devices, because of its memory utilization.

Bitarrays store bitmaps as ...well, as bitmaps. In most cases this is
highly inefficient, because we need to allocate memory even for the big
parts of the bitmaps we will never use, resulting in high memory
utilization especially for huge filesystems, where bitmaps might occupy
gigabytes of space.

This commit adds another backend to store bitmaps. It is based on
rbtrees and stores just the used extents of the bitmaps. This means that it
can be more memory efficient in most cases, and with careful use it might
be much faster than simple bitarrays.

I have done some limited benchmarking and it shows that the rbtree backend
consumes approx. 65% less memory than the bitarray on a 312GB filesystem
aged with Impression (default config). This number may grow significantly
with the filesystem size, but it may also be a lot lower (even negative)
with a much bigger number of inodes (this needs more benchmarking).
---
lib/ext2fs/Makefile.in | 2 +
lib/ext2fs/bitmaps.c | 4 +-
lib/ext2fs/blkmap64_rb.c | 815 +++++++++++++++++++++++++++++++++++++++++++++
lib/ext2fs/bmap64.h | 1 +
lib/ext2fs/ext2fsP.h | 1 +
lib/ext2fs/gen_bitmap64.c | 3 +
6 files changed, 824 insertions(+), 2 deletions(-)
create mode 100644 lib/ext2fs/blkmap64_rb.c

diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index b24a53a..29d1994 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -27,6 +27,7 @@ OBJS= $(DEBUGFS_LIB_OBJS) $(RESIZE_LIB_OBJS) $(E2IMAGE_LIB_OBJS) \
bitmaps.o \
bitops.o \
blkmap64_ba.o \
+ blkmap64_rb.o \
blknum.o \
block.o \
bmap.o \
@@ -94,6 +95,7 @@ SRCS= ext2_err.c \
$(srcdir)/bitmaps.c \
$(srcdir)/bitops.c \
$(srcdir)/blkmap64_ba.c \
+ $(srcdir)/blkmap64_rb.c \
$(srcdir)/block.c \
$(srcdir)/bmap.c \
$(srcdir)/check_desc.c \
diff --git a/lib/ext2fs/bitmaps.c b/lib/ext2fs/bitmaps.c
index c53d61e..b671336 100644
--- a/lib/ext2fs/bitmaps.c
+++ b/lib/ext2fs/bitmaps.c
@@ -66,7 +66,7 @@ errcode_t ext2fs_allocate_inode_bitmap(ext2_filsys fs,
if (fs->flags & EXT2_FLAG_64BITS)
return (ext2fs_alloc_generic_bmap(fs,
EXT2_ET_MAGIC_INODE_BITMAP64,
- EXT2FS_BMAP64_BITARRAY,
+ EXT2FS_BMAP64_RBTREE,
start, end, real_end, descr, ret));

/* Otherwise, check to see if the file system is small enough
@@ -98,7 +98,7 @@ errcode_t ext2fs_allocate_block_bitmap(ext2_filsys fs,
if (fs->flags & EXT2_FLAG_64BITS)
return (ext2fs_alloc_generic_bmap(fs,
EXT2_ET_MAGIC_BLOCK_BITMAP64,
- EXT2FS_BMAP64_BITARRAY,
+ EXT2FS_BMAP64_RBTREE,
start, end, real_end, descr, ret));

if ((end > ~0U) || (real_end > ~0U))
diff --git a/lib/ext2fs/blkmap64_rb.c b/lib/ext2fs/blkmap64_rb.c
new file mode 100644
index 0000000..6de8eb9
--- /dev/null
+++ b/lib/ext2fs/blkmap64_rb.c
@@ -0,0 +1,815 @@
+/*
+ * blkmap64_rb.c --- Simple rb-tree implementation for bitmaps
+ *
+ * (C)2010 Red Hat, Inc., Lukas Czerner <[email protected]>
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License.
+ * %End-Header%
+ */
+
+#include <stdio.h>
+#include <string.h>
+#if HAVE_UNISTD_H
+#include <unistd.h>
+#endif
+#include <fcntl.h>
+#include <time.h>
+#if HAVE_SYS_STAT_H
+#include <sys/stat.h>
+#endif
+#if HAVE_SYS_TYPES_H
+#include <sys/types.h>
+#endif
+
+#include "ext2_fs.h"
+#include "ext2fsP.h"
+#include "bmap64.h"
+#include "rbtree.h"
+
+#include <limits.h>
+
+struct bmap_rb_extent {
+ struct rb_node node;
+ __u64 start;
+ __u32 count;
+};
+
+struct ext2fs_rb_private {
+ struct rb_root root;
+ struct bmap_rb_extent *cache;
+};
+
+static int _rb_insert_extent(struct bmap_rb_extent *, struct rb_root *, int);
+static int rb_insert_extent(struct bmap_rb_extent *,
+ struct ext2fs_rb_private *);
+static void rb_flush_cache(struct ext2fs_rb_private *);
+static int rb_get_new_extent(struct bmap_rb_extent **, __u64, __u64);
+
+/*#define DEBUG*/
+
+#ifdef DEBUG
+static void print_tree(struct rb_root *root)
+{
+ struct rb_node *node = NULL;
+ struct bmap_rb_extent *ext;
+
+ printf("\t\t\t=================================\n");
+ node = rb_first(root);
+ for (node = rb_first(root); node != NULL; node=rb_next(node)) {
+ ext = rb_entry(node, struct bmap_rb_extent, node);
+ printf("\t\t\t--> (%llu -> %llu)\n",
+ ext->start, ext->start + ext->count);
+ }
+ printf("\t\t\t=================================\n");
+}
+
+static int check_tree(struct rb_root *root, const char *msg)
+{
+ struct rb_node *new_node, *node, *next;
+ struct bmap_rb_extent *ext, *old = NULL;
+
+ for (node = rb_first(root); node; node = rb_next(node)) {
+ ext = rb_entry(node, struct bmap_rb_extent, node);
+ if (ext->count <= 0) {
+ printf("Tree Error: count is crazy\n");
+ printf("extent: %llu -> %llu (%llu)\n", ext->start,
+ ext->start + ext->count, ext->count);
+ goto err_out;
+ }
+ if (ext->start < 0) {
+ printf("Tree Error: start is crazy\n");
+ printf("extent: %llu -> %llu (%llu)\n", ext->start,
+ ext->start + ext->count, ext->count);
+ goto err_out;
+ }
+
+ if (old) {
+ if (old->start > ext->start) {
+ printf("Tree Error: start is crazy\n");
+ printf("extent: %llu -> %llu (%llu)\n",
+ old->start, old->start + old->count,
+ old->count);
+ printf("extent next: %llu -> %llu (%llu)\n",
+ ext->start, ext->start + ext->count,
+ ext->count);
+ goto err_out;
+ }
+ if ((old->start + old->count) >= ext->start) {
+ printf("Tree Error: extent is crazy\n");
+ printf("extent: %llu -> %llu (%llu)\n",
+ old->start, old->start + old->count,
+ old->count);
+ printf("extent next: %llu -> %llu (%llu)\n",
+ ext->start, ext->start + ext->count,
+ ext->count);
+ goto err_out;
+ }
+ }
+ old = ext;
+ }
+ return 0;
+
+err_out:
+ printf("%s\n", msg);
+ print_tree(root);
+ exit(1);
+}
+#else
+#define check_tree(root, msg) 0
+#define print_tree(root, msg) 0
+#endif
+
+static int rb_get_new_extent(struct bmap_rb_extent **ext, __u64 start,
+ __u64 count)
+{
+ struct bmap_rb_extent *new_ext;
+ int retval = 0;
+
+ retval = ext2fs_get_mem(sizeof (struct bmap_rb_extent),
+ &new_ext);
+ /*
+ * Not quite sure how to do this, but passing the error up the stack
+ * probably is not the best idea.
+ */
+ if (retval) {
+ fprintf(stderr, "Error: NOT ENOUGH MEMORY!\n");
+ exit(ENOMEM);
+ }
+
+ new_ext->start = start;
+ new_ext->count = count;
+ *ext = new_ext;
+ return retval;
+}
+
+static void rb_flush_cache(struct ext2fs_rb_private *bp)
+{
+ if (!bp->cache)
+ return;
+
+ _rb_insert_extent(bp->cache, &bp->root, 0);
+ check_tree(&bp->root, __func__);
+ bp->cache = NULL;
+}
+
+/*
+ * Wrapper around _rb_insert_extent.
+ * First check the cache; when it is empty, put ext into the cache.
+ * Otherwise, try to merge ext with the cache. If they are mutually
+ * exclusive, insert the old cache into the tree and put ext into the cache.
+ */
+static int rb_insert_extent(struct bmap_rb_extent *ext,
+ struct ext2fs_rb_private *bp)
+{
+ struct bmap_rb_extent *cache = bp->cache;
+ __u64 cend, eend;
+ int ret = 1;
+
+ if (!cache) {
+ bp->cache = ext;
+ return 0;
+ }
+
+ cend = cache->start + cache->count;
+ eend = ext->start + ext->count;
+
+ /* Mutually exclusive extents */
+ if ((ext->start > cend) ||
+ (cache->start > (ext->start + ext->count))) {
+ ret = _rb_insert_extent(bp->cache, &bp->root, 0);
+ bp->cache = ext;
+ check_tree(&bp->root, __func__);
+ return ret;
+ }
+
+ if (ext->start < cache->start) {
+ /* embedded interval */
+ if (cend >= eend)
+ return 1;
+ cache->count += cache->start - ext->start;
+ cache->start = ext->start;
+ } else
+ cache->count += (eend - cend);
+
+
+ bp->cache = cache;
+ return ret;
+}
+
+static errcode_t rb_alloc_private_data (ext2fs_generic_bitmap bitmap)
+{
+ struct ext2fs_rb_private *bp;
+ errcode_t retval;
+
+ retval = ext2fs_get_mem(sizeof (struct ext2fs_rb_private), &bp);
+ if (retval)
+ return retval;
+
+ bp->root = RB_ROOT;
+ bp->cache = NULL;
+
+
+ bitmap->private = (void *) bp;
+ return 0;
+}
+
+static errcode_t rb_new_bmap(ext2_filsys fs EXT2FS_ATTR((unused)),
+ ext2fs_generic_bitmap bitmap)
+{
+ errcode_t retval;
+
+ retval = rb_alloc_private_data (bitmap);
+ if (retval)
+ return retval;
+
+ return 0;
+}
+
+static void rb_free_tree(struct rb_root *root)
+{
+ struct bmap_rb_extent *ext;
+ struct rb_node *node;
+ int count = 0;
+
+ node = rb_first(root);
+ while (node) {
+ ext = rb_entry(node, struct bmap_rb_extent, node);
+ rb_erase(node, root);
+ ext2fs_free_mem(&ext);
+ node = rb_first(root);
+ count++;
+ }
+}
+
+static void rb_free_bmap(ext2fs_generic_bitmap bitmap)
+{
+ struct ext2fs_rb_private *bp;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+
+ rb_free_tree(&bp->root);
+ ext2fs_free_mem(&bp->cache);
+ ext2fs_free_mem(&bp);
+ bp = 0;
+}
+
+static errcode_t rb_copy_bmap(ext2fs_generic_bitmap src,
+ ext2fs_generic_bitmap dest)
+{
+ struct ext2fs_rb_private *src_bp, *dest_bp;
+ struct bmap_rb_extent *src_ext, *dest_ext;
+ struct rb_node *dest_node, *src_node, *dest_last, **n;
+ errcode_t retval = 0;
+
+ retval = rb_alloc_private_data (dest);
+ if (retval)
+ return retval;
+
+ src_bp = (struct ext2fs_rb_private *) src->private;
+ dest_bp = (struct ext2fs_rb_private *) dest->private;
+ rb_flush_cache(src_bp);
+
+ src_node = rb_first(&src_bp->root);
+ while (src_node) {
+ src_ext = rb_entry(src_node, struct bmap_rb_extent, node);
+ retval = ext2fs_get_mem(sizeof (struct bmap_rb_extent),
+ &dest_ext);
+ if (retval)
+ break;
+
+ memcpy(dest_ext, src_ext, sizeof(struct bmap_rb_extent));
+
+ dest_node = &dest_ext->node;
+ n = &dest_bp->root.rb_node;
+
+ dest_last = NULL;
+ if (*n) {
+ dest_last = rb_last(&dest_bp->root);
+ n = &(dest_last)->rb_right;
+ }
+
+ rb_link_node(dest_node, dest_last, n);
+ rb_insert_color(dest_node, &dest_bp->root);
+
+ src_node = rb_next(src_node);
+ }
+
+ return retval;
+}
+
+static void rb_truncate(__u64 new_max, struct rb_root *root)
+{
+ struct bmap_rb_extent *ext;
+ struct rb_node *node;
+
+ node = rb_last(root);
+ while (node) {
+ ext = rb_entry(node, struct bmap_rb_extent, node);
+
+ if ((ext->start + ext->count - 1) <= new_max)
+ break;
+ else if (ext->start > new_max) {
+ rb_erase(node, root);
+ ext2fs_free_mem(&ext);
+ node = rb_last(root);
+ continue;
+ } else
+ ext->count = new_max - ext->start + 1;
+ }
+}
+
+static errcode_t rb_resize_bmap(ext2fs_generic_bitmap bmap,
+ __u64 new_end, __u64 new_real_end)
+{
+ struct ext2fs_rb_private *bp;
+
+ if (new_real_end >= bmap->real_end) {
+ bmap->end = new_end;
+ bmap->real_end = new_real_end;
+ return 0;
+ }
+
+ bp = (struct ext2fs_rb_private *) bmap->private;
+ rb_flush_cache(bp);
+
+ /* truncate tree to new_real_end size */
+ rb_truncate(new_real_end, &bp->root);
+
+ bmap->end = new_end;
+ bmap->real_end = new_real_end;
+ return 0;
+
+}
+
+/*
+ * It would be nice to have read cache for this, so when walking bitmap
+ * in sequential manner we do not have to go through the list for every bit.
+ *
+ * The idea is:
+ * 1. check if there is a cache.
+ * 2. look for match in the cached extent
+ * 3. if not found, search the tree.
+ * 4. put found extent into the cache.
+ */
+static int
+rb_test_bit(struct ext2fs_rb_private *bp, __u64 bit)
+{
+ struct rb_node *parent = NULL;
+ struct rb_node **n = &bp->root.rb_node;
+ struct bmap_rb_extent *ext;
+
+ while (*n) {
+ parent = *n;
+ ext = rb_entry(parent, struct bmap_rb_extent, node);
+ if (bit < ext->start)
+ n = &(*n)->rb_left;
+ else if (bit >= (ext->start + ext->count))
+ n = &(*n)->rb_right;
+ else
+ return 1;
+ }
+ return 0;
+}
+
+static int
+rb_set_bit(struct ext2fs_rb_private *bp, __u64 bit)
+{
+ struct bmap_rb_extent *cache = bp->cache;
+ struct bmap_rb_extent *ext;
+ int retval;
+
+ if (!cache)
+ goto new_cache;
+
+ if (bit == (cache->start + cache->count)) {
+ cache->count++;
+ return 0;
+ }
+
+ /* Mutually exclusive */
+ if ((bit > (cache->start + cache->count)) ||
+ (bit < cache->start)) {
+ _rb_insert_extent(bp->cache, &bp->root, 0);
+ goto new_cache;
+ }
+
+ return 1;
+
+new_cache:
+ retval = rb_get_new_extent(&ext, bit, 1);
+ if (retval)
+ return retval;
+ bp->cache = ext;
+ return 0;
+}
+
+static int _rb_insert_extent(struct bmap_rb_extent *new_ext,
+ struct rb_root *root, int in)
+{
+ struct rb_node *parent = NULL, **n = &root->rb_node;
+ struct rb_node *new_node, *node, *next;
+ struct bmap_rb_extent *ext;
+ __u64 start, count, old_start, old_count;
+ int retval = 0;
+
+ start = old_start = new_ext->start;
+ count = old_count = new_ext->count;
+
+ while (*n) {
+ parent = *n;
+ ext = rb_entry(parent, struct bmap_rb_extent, node);
+
+ if (start < ext->start) {
+ n = &(*n)->rb_left;
+ } else if (start > (ext->start + ext->count)) {
+ n = &(*n)->rb_right;
+ } else {
+ ext2fs_free_mem(&new_ext);
+ if ((start + count) <= (ext->start + ext->count))
+ return 1;
+
+ count += (start - ext->start);
+ start = ext->start;
+ new_ext = ext;
+ new_node = &ext->node;
+ retval = 1;
+
+ goto skip_insert;
+ }
+ }
+
+ new_node = &new_ext->node;
+ rb_link_node(new_node, parent, n);
+ rb_insert_color(new_node, root);
+
+ node = rb_prev(new_node);
+ if (node) {
+ ext = rb_entry(node, struct bmap_rb_extent, node);
+ if ((ext->start + ext->count) == start) {
+ start = ext->start;
+ count += ext->count;
+ rb_erase(node, root);
+ ext2fs_free_mem(&ext);
+ }
+ }
+
+skip_insert:
+ /* See if we can merge extent to the right */
+ for (node = rb_next(new_node); node != NULL; node = next) {
+ next = rb_next(node);
+ ext = rb_entry(node, struct bmap_rb_extent, node);
+
+ if ((ext->start + ext->count) <= start)
+ continue;
+
+ /* No more merging */
+ if ((start + count) < ext->start)
+ break;
+
+ /* ext is embedded in new_ext interval */
+ if ((start + count) >= (ext->start + ext->count)) {
+ rb_erase(node, root);
+ ext2fs_free_mem(&ext);
+ continue;
+ } else {
+ /* merge ext with new_ext */
+ count += ((ext->start + ext->count) -
+ (start + count));
+ rb_erase(node, root);
+ ext2fs_free_mem(&ext);
+ break;
+ }
+ }
+
+ new_ext->start = start;
+ new_ext->count = count;
+
+ return retval;
+}
+
+static int rb_remove_extent(struct bmap_rb_extent *del_ext,
+ struct rb_root *root)
+{
+ struct rb_node *parent = NULL, **n = &root->rb_node;
+ struct rb_node *next;
+ struct bmap_rb_extent *ext;
+ __u64 start, count, old_count, old_start;
+ int retval = 0;
+
+ if (RB_EMPTY_ROOT(root))
+ return 0;
+
+ start = old_start = del_ext->start;
+ count = old_count = del_ext->count;
+
+ while (*n) {
+ parent = *n;
+ ext = rb_entry(parent, struct bmap_rb_extent, node);
+ if (start < ext->start) {
+ n = &(*n)->rb_left;
+ continue;
+ } else if (start >= (ext->start + ext->count)) {
+ n = &(*n)->rb_right;
+ continue;
+ }
+
+ if ((start > ext->start) &&
+ (start + count) < (ext->start + ext->count)) {
+ /* We have to split extent into two */
+ del_ext->start = start + count;
+ del_ext->count = (ext->start + ext->count) -
+ del_ext->start;
+
+ ext->count = start - ext->start;
+ retval = _rb_insert_extent(del_ext, root, 0);
+ if (retval)
+ printf("\t%s Warning - extent should be "
+ "unique, but it is not\n", __func__);
+ return 1;
+ }
+
+ if ((start + count) >= (ext->start + ext->count))
+ ext->count = start - ext->start;
+
+ if (0 == ext->count) {
+ parent = rb_next(&ext->node);
+ rb_erase(&ext->node, root);
+ ext2fs_free_mem(&ext);
+ goto search_right;
+ }
+
+ if (start == ext->start) {
+ ext->start += count;
+ ext->count -= count;
+ return retval;
+ }
+ }
+
+search_right:
+ /* See if we should delete or truncate extent on the right */
+ for (; parent != NULL; parent = next) {
+ next = rb_next(parent);
+ ext = rb_entry(parent, struct bmap_rb_extent, node);
+ if ((ext->start + ext->count) <= start)
+ continue;
+
+ /* No more merging */
+ if ((start + count) < ext->start)
+ break;
+
+ /* ext is embedded in new_ext interval */
+ if ((start + count) >= (ext->start + ext->count)) {
+ rb_erase(&ext->node, root);
+ ext2fs_free_mem(&ext);
+ continue;
+ } else {
+ /* merge ext with new_ext */
+ ext->count -= ((start + count) - ext->start);
+ ext->start = start + count;
+ break;
+ }
+ }
+
+ return retval;
+}
+
+static int rb_mark_bmap(ext2fs_generic_bitmap bitmap, __u64 arg)
+{
+ struct ext2fs_rb_private *bp;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+ arg -= bitmap->start;
+
+ return rb_set_bit(bp, arg);
+}
+
+static int rb_unmark_bmap(ext2fs_generic_bitmap bitmap, __u64 arg)
+{
+ struct ext2fs_rb_private *bp;
+ struct bmap_rb_extent *del_ext;
+ int retval;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+ rb_flush_cache(bp);
+ arg -= bitmap->start;
+
+ retval = rb_get_new_extent(&del_ext, arg, 1);
+ if (retval)
+ return retval;
+
+ retval = rb_remove_extent(del_ext, &bp->root);
+ if (!retval)
+ ext2fs_free_mem(&del_ext);
+ check_tree(&bp->root, __func__);
+
+ return retval;
+}
+
+static int rb_test_bmap(ext2fs_generic_bitmap bitmap, __u64 arg)
+{
+ struct ext2fs_rb_private *bp;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+ rb_flush_cache(bp);
+ arg -= bitmap->start;
+
+ return rb_test_bit(bp, arg);
+}
+
+static void rb_mark_bmap_extent(ext2fs_generic_bitmap bitmap, __u64 arg,
+ unsigned int num)
+{
+ struct ext2fs_rb_private *bp;
+ struct bmap_rb_extent *new_ext;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+ arg -= bitmap->start;
+
+ /* We should check and return the error value since the allocation
+ * may not be successful
+ */
+ rb_get_new_extent(&new_ext, arg, num);
+ rb_insert_extent(new_ext, bp);
+}
+
+static void rb_unmark_bmap_extent(ext2fs_generic_bitmap bitmap, __u64 arg,
+ unsigned int num)
+{
+ struct ext2fs_rb_private *bp;
+ struct bmap_rb_extent *del_ext;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+ rb_flush_cache(bp);
+ arg -= bitmap->start;
+
+ /* We should check and return the error value since the allocation
+ * may not be successful
+ */
+ rb_get_new_extent(&del_ext, arg, num);
+ rb_remove_extent(del_ext, &bp->root);
+ check_tree(&bp->root, __func__);
+}
+
+static int rb_test_clear_bmap_extent(ext2fs_generic_bitmap bitmap,
+ __u64 start, unsigned int len)
+{
+ struct rb_node *parent = NULL, **n;
+ struct rb_node *node, *next;
+ struct ext2fs_rb_private *bp;
+ struct bmap_rb_extent *ext;
+ int retval = 1;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+ n = &bp->root.rb_node;
+ rb_flush_cache(bp);
+ start -= bitmap->start;
+
+ if ((len == 0) ||
+ RB_EMPTY_ROOT(&bp->root))
+ return 1;
+
+ /*
+ * If the search finds no extent containing start, we still need to
+ * examine the rest of the range below; when we find a match, the
+ * range is not clean, thus we return false.
+ */
+ while (*n) {
+ parent = *n;
+ ext = rb_entry(parent, struct bmap_rb_extent, node);
+ if (start < ext->start) {
+ n = &(*n)->rb_left;
+ } else if (start >= (ext->start + ext->count)) {
+ n = &(*n)->rb_right;
+ } else {
+ /*
+ * We found the extent in the tree -> the range is not
+ * clean
+ */
+ return 0;
+ }
+ }
+
+ node = parent;
+ while (node) {
+ next = rb_next(node);
+ ext = rb_entry(node, struct bmap_rb_extent, node);
+ node = next;
+
+ if ((ext->start + ext->count) <= start)
+ continue;
+
+ /* No more merging */
+ if ((start + len) <= ext->start)
+ break;
+
+ retval = 0;
+ break;
+ }
+ return retval;
+}
+
+static errcode_t rb_set_bmap_range(ext2fs_generic_bitmap bitmap,
+ __u64 start, size_t num, void *in)
+{
+ struct ext2fs_rb_private *bp;
+ size_t i;
+ int ret;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+ rb_flush_cache(bp);
+
+ for (i = 0; i < num; i++) {
+ ret = ext2fs_test_bit(i, in);
+ if (ret)
+ rb_set_bit(bp, start + i - bitmap->start);
+ }
+
+ return 0;
+}
+
+static errcode_t rb_get_bmap_range(ext2fs_generic_bitmap bitmap,
+ __u64 start, size_t num, void *out)
+{
+
+ struct rb_node *parent = NULL, *next, **n;
+ struct ext2fs_rb_private *bp;
+ struct bmap_rb_extent *ext;
+ __u64 pos;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+ n = &bp->root.rb_node;
+ rb_flush_cache(bp);
+ start -= bitmap->start;
+
+ if (RB_EMPTY_ROOT(&bp->root)) {
+ return 0;
+ }
+
+ while (*n) {
+ parent = *n;
+ ext = rb_entry(parent, struct bmap_rb_extent, node);
+ if (start < ext->start) {
+ n = &(*n)->rb_left;
+ } else if (start >= (ext->start + ext->count)) {
+ n = &(*n)->rb_right;
+ } else
+ break;
+ }
+
+ pos = start;
+ for (; parent != NULL; parent = next) {
+ next = rb_next(parent);
+ ext = rb_entry(parent, struct bmap_rb_extent, node);
+
+ while (((pos - start) < num) &&
+ (pos < ext->start)) {
+ ext2fs_fast_clear_bit64((pos - start), out);
+ pos++;
+ }
+
+ if ((pos - start) >= num)
+ return 0;
+
+ while (((pos - start) < num) &&
+ (pos < (ext->start + ext->count))) {
+ ext2fs_fast_set_bit64((pos - start), out);
+ pos++;
+ }
+ }
+
+ while ((pos - start) < num) {
+ ext2fs_fast_clear_bit64((pos - start), out);
+ pos++;
+ }
+
+ return 0;
+}
+
+static void rb_clear_bmap(ext2fs_generic_bitmap bitmap)
+{
+ struct ext2fs_rb_private *bp;
+
+ bp = (struct ext2fs_rb_private *) bitmap->private;
+
+ rb_free_tree(&bp->root);
+ ext2fs_free_mem(&bp->cache);
+}
+
+struct ext2_bitmap_ops ext2fs_blkmap64_rbtree = {
+ .type = EXT2FS_BMAP64_RBTREE,
+ .new_bmap = rb_new_bmap,
+ .free_bmap = rb_free_bmap,
+ .copy_bmap = rb_copy_bmap,
+ .resize_bmap = rb_resize_bmap,
+ .mark_bmap = rb_mark_bmap,
+ .unmark_bmap = rb_unmark_bmap,
+ .test_bmap = rb_test_bmap,
+ .test_clear_bmap_extent = rb_test_clear_bmap_extent,
+ .mark_bmap_extent = rb_mark_bmap_extent,
+ .unmark_bmap_extent = rb_unmark_bmap_extent,
+ .set_bmap_range = rb_set_bmap_range,
+ .get_bmap_range = rb_get_bmap_range,
+ .clear_bmap = rb_clear_bmap,
+};
diff --git a/lib/ext2fs/bmap64.h b/lib/ext2fs/bmap64.h
index b0aa84c..31021b9 100644
--- a/lib/ext2fs/bmap64.h
+++ b/lib/ext2fs/bmap64.h
@@ -59,3 +59,4 @@ struct ext2_bitmap_ops {
};

extern struct ext2_bitmap_ops ext2fs_blkmap64_bitarray;
+extern struct ext2_bitmap_ops ext2fs_blkmap64_rbtree;
diff --git a/lib/ext2fs/ext2fsP.h b/lib/ext2fs/ext2fsP.h
index ab9ee76..aa45c43 100644
--- a/lib/ext2fs/ext2fsP.h
+++ b/lib/ext2fs/ext2fsP.h
@@ -108,6 +108,7 @@ extern void ext2fs_numeric_progress_close(ext2_filsys fs,
*/

#define EXT2FS_BMAP64_BITARRAY 1
+#define EXT2FS_BMAP64_RBTREE 2

extern errcode_t ext2fs_alloc_generic_bmap(ext2_filsys fs, errcode_t magic,
int type, __u64 start, __u64 end,
diff --git a/lib/ext2fs/gen_bitmap64.c b/lib/ext2fs/gen_bitmap64.c
index df095ac..eb233f4 100644
--- a/lib/ext2fs/gen_bitmap64.c
+++ b/lib/ext2fs/gen_bitmap64.c
@@ -91,6 +91,9 @@ errcode_t ext2fs_alloc_generic_bmap(ext2_filsys fs, errcode_t magic,
case EXT2FS_BMAP64_BITARRAY:
ops = &ext2fs_blkmap64_bitarray;
break;
+ case EXT2FS_BMAP64_RBTREE:
+ ops = &ext2fs_blkmap64_rbtree;
+ break;
default:
return EINVAL;
}
--
1.7.2.3


2011-01-10 16:30:12

by Andreas Dilger

Subject: Re: [PATCH 2/2] e2fsprogs: Add rbtree backend for bitmaps, use it instead of bitarrays

On 2011-01-10, at 8:18, Lukas Czerner <[email protected]> wrote:
> This commit adds another backend to store bitmaps. It is based on
> rbtrees and stores just the used extents of the bitmaps. This means that it
> can be more memory efficient in most cases, and with careful use it might
> be much faster than simple bitarrays.
>
>
> @@ -66,7 +66,7 @@ errcode_t ext2fs_allocate_inode_bitmap(ext2_filsys fs,
> if (fs->flags & EXT2_FLAG_64BITS)
> return (ext2fs_alloc_generic_bmap(fs,
> EXT2_ET_MAGIC_INODE_BITMAP64,
> - EXT2FS_BMAP64_BITARRAY,
> + EXT2FS_BMAP64_RBTREE,
> start, end, real_end, descr, ret));

It would be really useful to allow runtime selection of the bitmap backend.
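
Something as simple as an override at the existing selection point (the
type argument passed to ext2fs_alloc_generic_bmap()) would do. A
hypothetical sketch, not part of this series - the environment variable
is made up:

	/* Hypothetical: pick the backend from an environment variable,
	 * defaulting to rbtree. */
	static int get_bmap64_type(void)
	{
		const char *s = getenv("EXT2FS_BMAP64_TYPE");

		if (s && strcmp(s, "bitarray") == 0)
			return EXT2FS_BMAP64_BITARRAY;
		return EXT2FS_BMAP64_RBTREE;
	}
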
> /* Otherwise, check to see if the file system is small enough
> @@ -98,7 +98,7 @@ errcode_t ext2fs_allocate_block_bitmap(ext2_filsys fs,
> if (fs->flags & EXT2_FLAG_64BITS)
> return (ext2fs_alloc_generic_bmap(fs,
> EXT2_ET_MAGIC_BLOCK_BITMAP64,
> - EXT2FS_BMAP64_BITARRAY,
> + EXT2FS_BMAP64_RBTREE,
> start, end, real_end, descr, ret));
>
> if ((end > ~0U) || (real_end > ~0U))
> diff --git a/lib/ext2fs/blkmap64_rb.c b/lib/ext2fs/blkmap64_rb.c
> new file mode 100644
> index 0000000..6de8eb9
> --- /dev/null
> +++ b/lib/ext2fs/blkmap64_rb.c
> @@ -0,0 +1,815 @@
> +/*
> + * blkmap64_rb.c --- Simple rb-tree implementation for bitmaps
> + *
> + * (C)2010 Red Hat, Inc., Lukas Czerner <[email protected]>
> + *
> + * %Begin-Header%
> + * This file may be redistributed under the terms of the GNU Public
> + * License.
> + * %End-Header%
> + */
> +
> +#include <stdio.h>
> +#include <string.h>
> +#if HAVE_UNISTD_H
> +#include <unistd.h>
> +#endif
> +#include <fcntl.h>
> +#include <time.h>
> +#if HAVE_SYS_STAT_H
> +#include <sys/stat.h>
> +#endif
> +#if HAVE_SYS_TYPES_H
> +#include <sys/types.h>
> +#endif
> +
> +#include "ext2_fs.h"
> +#include "ext2fsP.h"
> +#include "bmap64.h"
> +#include "rbtree.h"
> +
> +#include <limits.h>
> +
> +struct bmap_rb_extent {
> + struct rb_node node;
> + __u64 start;
> + __u32 count;
> +};
> +
> +struct ext2fs_rb_private {
> + struct rb_root root;
> + struct bmap_rb_extent *cache;
> +};
> +
> +static int _rb_insert_extent(struct bmap_rb_extent *, struct rb_root *, int);
> +static int rb_insert_extent(struct bmap_rb_extent *,
> + struct ext2fs_rb_private *);
> +static void rb_flush_cache(struct ext2fs_rb_private *);
> +static int rb_get_new_extent(struct bmap_rb_extent **, __u64, __u64);
> +
> +/*#define DEBUG*/
> +
> +#ifdef DEBUG
> +static void print_tree(struct rb_root *root)
> +{
> + struct rb_node *node = NULL;
> + struct bmap_rb_extent *ext;
> +
> + printf("\t\t\t=================================\n");
> + node = rb_first(root);
> + for (node = rb_first(root); node != NULL; node=rb_next(node)) {
> + ext = rb_entry(node, struct bmap_rb_extent, node);
> + printf("\t\t\t--> (%llu -> %llu)\n",
> + ext->start, ext->start + ext->count);
> + }
> + printf("\t\t\t=================================\n");
> +}
> +
> +static int check_tree(struct rb_root *root, const char *msg)
> +{
> + struct rb_node *new_node, *node, *next;
> + struct bmap_rb_extent *ext, *old = NULL;
> +
> + for (node = rb_first(root); node; node = rb_next(node)) {
> + ext = rb_entry(node, struct bmap_rb_extent, node);
> + if (ext->count <= 0) {
> + printf("Tree Error: count is crazy\n");
> + printf("extent: %llu -> %llu (%llu)\n", ext->start,
> + ext->start + ext->count, ext->count);
> + goto err_out;
> + }
> + if (ext->start < 0) {
> + printf("Tree Error: start is crazy\n");
> + printf("extent: %llu -> %llu (%llu)\n", ext->start,
> + ext->start + ext->count, ext->count);
> + goto err_out;
> + }
> +
> + if (old) {
> + if (old->start > ext->start) {
> + printf("Tree Error: start is crazy\n");
> + printf("extent: %llu -> %llu (%llu)\n",
> + old->start, old->start + old->count,
> + old->count);
> + printf("extent next: %llu -> %llu (%llu)\n",
> + ext->start, ext->start + ext->count,
> + ext->count);
> + goto err_out;
> + }
> + if ((old->start + old->count) >= ext->start) {
> + printf("Tree Error: extent is crazy\n");
> + printf("extent: %llu -> %llu (%llu)\n",
> + old->start, old->start + old->count,
> + old->count);
> + printf("extent next: %llu -> %llu (%llu)\n",
> + ext->start, ext->start + ext->count,
> + ext->count);
> + goto err_out;
> + }
> + }
> + old = ext;
> + }
> + return 0;
> +
> +err_out:
> + printf("%s\n", msg);
> + print_tree(root);
> + exit(1);
> +}
> +#else
> +#define check_tree(root, msg) 0
> +#define print_tree(root, msg) 0
> +#endif
> +
> +static int rb_get_new_extent(struct bmap_rb_extent **ext, __u64 start,
> + __u64 count)
> +{
> + struct bmap_rb_extent *new_ext;
> + int retval = 0;
> +
> + retval = ext2fs_get_mem(sizeof (struct bmap_rb_extent),
> + &new_ext);
> + /*
> + * Not quite sure how to do this, but passing the error up the stack
> + * probably is not the best idea.
> + */
> + if (retval) {
> + fprintf(stderr, "Error: NOT ENOUGH MEMORY!\n");
> + exit(ENOMEM);
> + }
> +
> + new_ext->start = start;
> + new_ext->count = count;
> + *ext = new_ext;
> + return retval;
> +}
> +
> +static void rb_flush_cache(struct ext2fs_rb_private *bp)
> +{
> + if (!bp->cache)
> + return;
> +
> + _rb_insert_extent(bp->cache, &bp->root, 0);
> + check_tree(&bp->root, __func__);
> + bp->cache = NULL;
> +}
> +
> +/*
> + * Wrapper around _rb_insert_extent.
> + * First check cache, when it is emtpy put ext into cache, in the other
> + * case, try to merge ext with cache. If they are mutually exclusive
> + * insert old cache into tree and put ext into cache.
> + */
> +static int rb_insert_extent(struct bmap_rb_extent *ext,
> + struct ext2fs_rb_private *bp)
> +{
> + struct bmap_rb_extent *cache = bp->cache;
> + __u64 cend, eend;
> + int ret = 1;
> +
> + if (!cache) {
> + bp->cache = ext;
> + return 0;
> + }
> +
> + cend = cache->start + cache->count;
> + eend = ext->start + ext->count;
> +
> + /* Mutually exclusive extents */
> + if ((ext->start > cend) ||
> + (cache->start > (ext->start + ext->count))) {
> + ret = _rb_insert_extent(bp->cache, &bp->root, 0);
> + bp->cache = ext;
> + check_tree(&bp->root, __func__);
> + return ret;
> + }
> +
> + if (ext->start < cache->start) {
> + /* embedded interval */
> + if (cend >= eend)
> + return 1;
> + cache->count += cache->start - ext->start;
> + cache->start = ext->start;
> + } else
> + cache->count += (eend - cend);
> +
> +
> + bp->cache = cache;
> + return ret;
> +}
> +
> +static errcode_t rb_alloc_private_data (ext2fs_generic_bitmap bitmap)
> +{
> + struct ext2fs_rb_private *bp;
> + errcode_t retval;
> +
> + retval = ext2fs_get_mem(sizeof (struct ext2fs_rb_private), &bp);
> + if (retval)
> + return retval;
> +
> + bp->root = RB_ROOT;
> + bp->cache = NULL;
> +
> +
> + bitmap->private = (void *) bp;
> + return 0;
> +}
> +
> +static errcode_t rb_new_bmap(ext2_filsys fs EXT2FS_ATTR((unused)),
> + ext2fs_generic_bitmap bitmap)
> +{
> + errcode_t retval;
> +
> + retval = rb_alloc_private_data (bitmap);
> + if (retval)
> + return retval;
> +
> + return 0;
> +}
> +
> +static void rb_free_tree(struct rb_root *root)
> +{
> + struct bmap_rb_extent *ext;
> + struct rb_node *node;
> + int count = 0;
> +
> + node = rb_first(root);
> + while (node) {
> + ext = rb_entry(node, struct bmap_rb_extent, node);
> + rb_erase(node, root);
> + ext2fs_free_mem(&ext);
> + node = rb_first(root);
> + count++;
> + }
> +}
> +
> +static void rb_free_bmap(ext2fs_generic_bitmap bitmap)
> +{
> + struct ext2fs_rb_private *bp;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> +
> + rb_free_tree(&bp->root);
> + ext2fs_free_mem(&bp->cache);
> + ext2fs_free_mem(&bp);
> + bp = 0;
> +}
> +
> +static errcode_t rb_copy_bmap(ext2fs_generic_bitmap src,
> + ext2fs_generic_bitmap dest)
> +{
> + struct ext2fs_rb_private *src_bp, *dest_bp;
> + struct bmap_rb_extent *src_ext, *dest_ext;
> + struct rb_node *dest_node, *src_node, *dest_last, **n;
> + errcode_t retval = 0;
> +
> + retval = rb_alloc_private_data (dest);
> + if (retval)
> + return retval;
> +
> + src_bp = (struct ext2fs_rb_private *) src->private;
> + dest_bp = (struct ext2fs_rb_private *) dest->private;
> + rb_flush_cache(src_bp);
> +
> + src_node = rb_first(&src_bp->root);
> + while (src_node) {
> + src_ext = rb_entry(src_node, struct bmap_rb_extent, node);
> + retval = ext2fs_get_mem(sizeof (struct bmap_rb_extent),
> + &dest_ext);
> + if (retval)
> + break;
> +
> + memcpy(dest_ext, src_ext, sizeof(struct bmap_rb_extent));
> +
> + dest_node = &dest_ext->node;
> + n = &dest_bp->root.rb_node;
> +
> + dest_last = NULL;
> + if (*n) {
> + dest_last = rb_last(&dest_bp->root);
> + n = &(dest_last)->rb_right;
> + }
> +
> + rb_link_node(dest_node, dest_last, n);
> + rb_insert_color(dest_node, &dest_bp->root);
> +
> + src_node = rb_next(src_node);
> + }
> +
> + return retval;
> +}
> +
> +static void rb_truncate(__u64 new_max, struct rb_root *root)
> +{
> + struct bmap_rb_extent *ext;
> + struct rb_node *node;
> +
> + node = rb_last(root);
> + while (node) {
> + ext = rb_entry(node, struct bmap_rb_extent, node);
> +
> + if ((ext->start + ext->count - 1) <= new_max)
> + break;
> + else if (ext->start > new_max) {
> + rb_erase(node, root);
> + ext2fs_free_mem(&ext);
> + node = rb_last(root);
> + continue;
> + } else
> + ext->count = new_max - ext->start + 1;
> + }
> +}
> +
> +static errcode_t rb_resize_bmap(ext2fs_generic_bitmap bmap,
> + __u64 new_end, __u64 new_real_end)
> +{
> + struct ext2fs_rb_private *bp;
> +
> + if (new_real_end >= bmap->real_end) {
> + bmap->end = new_end;
> + bmap->real_end = new_real_end;
> + return 0;
> + }
> +
> + bp = (struct ext2fs_rb_private *) bmap->private;
> + rb_flush_cache(bp);
> +
> + /* truncate tree to new_real_end size */
> + rb_truncate(new_real_end, &bp->root);
> +
> + bmap->end = new_end;
> + bmap->real_end = new_real_end;
> + return 0;
> +}
> +
> +/*
> + * It would be nice to have a read cache for this, so that when walking
> + * the bitmap sequentially we do not have to search the tree for every
> + * single bit.
> + *
> + * The idea is:
> + * 1. check if there is a cache.
> + * 2. look for a match in the cached extent.
> + * 3. if not found, search the tree.
> + * 4. put the found extent into the cache.
> + */
> +static int
> +rb_test_bit(struct ext2fs_rb_private *bp, __u64 bit)
> +{
> + struct rb_node *parent = NULL;
> + struct rb_node **n = &bp->root.rb_node;
> + struct bmap_rb_extent *ext;
> +
> + while (*n) {
> + parent = *n;
> + ext = rb_entry(parent, struct bmap_rb_extent, node);
> + if (bit < ext->start)
> + n = &(*n)->rb_left;
> + else if (bit >= (ext->start + ext->count))
> + n = &(*n)->rb_right;
> + else
> + return 1;
> + }
> + return 0;
> +}
> +
> +static int
> +rb_set_bit(struct ext2fs_rb_private *bp, __u64 bit)
> +{
> + struct bmap_rb_extent *cache = bp->cache;
> + struct bmap_rb_extent *ext;
> + int retval;
> +
> + if (!cache)
> + goto new_cache;
> +
> + if (bit == (cache->start + cache->count)) {
> + cache->count++;
> + return 0;
> + }
> +
> + /* Mutually exclusive */
> + if ((bit > (cache->start + cache->count)) ||
> + (bit < cache->start)) {
> + _rb_insert_extent(bp->cache, &bp->root, 0);
> + goto new_cache;
> + }
> +
> + return 1;
> +
> +new_cache:
> + retval = rb_get_new_extent(&ext, bit, 1);
> + if (retval)
> + return retval;
> + bp->cache = ext;
> + return 0;
> +}
> +
> +static int _rb_insert_extent(struct bmap_rb_extent *new_ext,
> + struct rb_root *root, int in)
> +{
> + struct rb_node *parent = NULL, **n = &root->rb_node;
> + struct rb_node *new_node, *node, *next;
> + struct bmap_rb_extent *ext;
> + __u64 start, count, old_start, old_count;
> + int retval = 0;
> +
> + start = old_start = new_ext->start;
> + count = old_count = new_ext->count;
> +
> + while (*n) {
> + parent = *n;
> + ext = rb_entry(parent, struct bmap_rb_extent, node);
> +
> + if (start < ext->start) {
> + n = &(*n)->rb_left;
> + } else if (start > (ext->start + ext->count)) {
> + n = &(*n)->rb_right;
> + } else {
> + ext2fs_free_mem(&new_ext);
> + if ((start + count) <= (ext->start + ext->count))
> + return 1;
> +
> + count += (start - ext->start);
> + start = ext->start;
> + new_ext = ext;
> + new_node = &ext->node;
> + retval = 1;
> +
> + goto skip_insert;
> + }
> + }
> +
> + new_node = &new_ext->node;
> + rb_link_node(new_node, parent, n);
> + rb_insert_color(new_node, root);
> +
> + node = rb_prev(new_node);
> + if (node) {
> + ext = rb_entry(node, struct bmap_rb_extent, node);
> + if ((ext->start + ext->count) == start) {
> + start = ext->start;
> + count += ext->count;
> + rb_erase(node, root);
> + ext2fs_free_mem(&ext);
> + }
> + }
> +
> +skip_insert:
> + /* See if we can merge extent to the right */
> + for (node = rb_next(new_node); node != NULL; node = next) {
> + next = rb_next(node);
> + ext = rb_entry(node, struct bmap_rb_extent, node);
> +
> + if ((ext->start + ext->count) <= start)
> + continue;
> +
> + /* No more merging */
> + if ((start + count) < ext->start)
> + break;
> +
> + /* ext is embedded in new_ext interval */
> + if ((start + count) >= (ext->start + ext->count)) {
> + rb_erase(node, root);
> + ext2fs_free_mem(&ext);
> + continue;
> + } else {
> + /* merge ext with new_ext */
> + count += ((ext->start + ext->count) -
> + (start + count));
> + rb_erase(node, root);
> + ext2fs_free_mem(&ext);
> + break;
> + }
> + }
> +
> + new_ext->start = start;
> + new_ext->count = count;
> +
> + return retval;
> +}
> +
> +static int rb_remove_extent(struct bmap_rb_extent *del_ext,
> + struct rb_root *root)
> +{
> + struct rb_node *parent = NULL, **n = &root->rb_node;
> + struct rb_node *next;
> + struct bmap_rb_extent *ext;
> + __u64 start, count, old_count, old_start;
> + int retval = 0;
> +
> + if (RB_EMPTY_ROOT(root))
> + return 0;
> +
> + start = old_start = del_ext->start;
> + count = old_count = del_ext->count;
> +
> + while (*n) {
> + parent = *n;
> + ext = rb_entry(parent, struct bmap_rb_extent, node);
> + if (start < ext->start) {
> + n = &(*n)->rb_left;
> + continue;
> + } else if (start >= (ext->start + ext->count)) {
> + n = &(*n)->rb_right;
> + continue;
> + }
> +
> + if ((start > ext->start) &&
> + (start + count) < (ext->start + ext->count)) {
> + /* We have to split extent into two */
> + del_ext->start = start + count;
> + del_ext->count = (ext->start + ext->count) -
> + del_ext->start;
> +
> + ext->count = start - ext->start;
> + retval = _rb_insert_extent(del_ext, root, 0);
> + if (retval)
> + printf("\t%s Warning - extent should be "
> + "unique, but it is not\n", __func__);
> + return 1;
> + }
> +
> + if ((start + count) >= (ext->start + ext->count))
> + ext->count = start - ext->start;
> +
> + if (0 == ext->count) {
> + parent = rb_next(&ext->node);
> + rb_erase(&ext->node, root);
> + ext2fs_free_mem(&ext);
> + goto search_right;
> + }
> +
> + if (start == ext->start) {
> + ext->start += count;
> + ext->count -= count;
> + return retval;
> + }
> + }
> +
> +search_right:
> + /* See if we should delete or truncate extent on the right */
> + for (; parent != NULL; parent = next) {
> + next = rb_next(parent);
> + ext = rb_entry(parent, struct bmap_rb_extent, node);
> + if ((ext->start + ext->count) <= start)
> + continue;
> +
> + /* No more overlap */
> + if ((start + count) < ext->start)
> + break;
> +
> + /* ext is fully covered by the removed range */
> + if ((start + count) >= (ext->start + ext->count)) {
> + rb_erase(&ext->node, root);
> + ext2fs_free_mem(&ext);
> + continue;
> + } else {
> + /* trim the overlapping head of ext */
> + ext->count -= ((start + count) - ext->start);
> + ext->start = start + count;
> + break;
> + }
> + }
> +
> + return retval;
> +}
> +
> +static int rb_mark_bmap(ext2fs_generic_bitmap bitmap, __u64 arg)
> +{
> + struct ext2fs_rb_private *bp;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> + arg -= bitmap->start;
> +
> + return rb_set_bit(bp, arg);
> +}
> +
> +static int rb_unmark_bmap(ext2fs_generic_bitmap bitmap, __u64 arg)
> +{
> + struct ext2fs_rb_private *bp;
> + struct bmap_rb_extent *del_ext;
> + int retval;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> + rb_flush_cache(bp);
> + arg -= bitmap->start;
> +
> + retval = rb_get_new_extent(&del_ext, arg, 1);
> + if (retval)
> + return retval;
> +
> + retval = rb_remove_extent(del_ext, &bp->root);
> + if (!retval)
> + ext2fs_free_mem(&del_ext);
> + check_tree(&bp->root, __func__);
> +
> + return retval;
> +}
> +
> +static int rb_test_bmap(ext2fs_generic_bitmap bitmap, __u64 arg)
> +{
> + struct ext2fs_rb_private *bp;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> + rb_flush_cache(bp);
> + arg -= bitmap->start;
> +
> + return rb_test_bit(bp, arg);
> +}
> +
> +static void rb_mark_bmap_extent(ext2fs_generic_bitmap bitmap, __u64 arg,
> + unsigned int num)
> +{
> + struct ext2fs_rb_private *bp;
> + struct bmap_rb_extent *new_ext;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> + arg -= bitmap->start;
> +
> + /* The interface gives us no way to report the error when
> + * the allocation fails, so at least do not touch the tree
> + * in that case.
> + */
> + if (rb_get_new_extent(&new_ext, arg, num))
> + return;
> + rb_insert_extent(new_ext, bp);
> +}
> +
> +static void rb_unmark_bmap_extent(ext2fs_generic_bitmap bitmap, __u64 arg,
> + unsigned int num)
> +{
> + struct ext2fs_rb_private *bp;
> + struct bmap_rb_extent *del_ext;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> + rb_flush_cache(bp);
> + arg -= bitmap->start;
> +
> + /* The interface gives us no way to report the error when
> + * the allocation fails, so at least do not touch the tree
> + * in that case.
> + */
> + if (rb_get_new_extent(&del_ext, arg, num))
> + return;
> + if (!rb_remove_extent(del_ext, &bp->root))
> + ext2fs_free_mem(&del_ext);
> + check_tree(&bp->root, __func__);
> +}
> +
> +static int rb_test_clear_bmap_extent(ext2fs_generic_bitmap bitmap,
> + __u64 start, unsigned int len)
> +{
> + struct rb_node *parent = NULL, **n;
> + struct rb_node *node, *next;
> + struct ext2fs_rb_private *bp;
> + struct bmap_rb_extent *ext;
> + int retval = 1;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> + n = &bp->root.rb_node;
> + rb_flush_cache(bp);
> + start -= bitmap->start;
> +
> + if ((len == 0) ||
> + RB_EMPTY_ROOT(&bp->root))
> + return 1;
> +
> + /*
> + * Walk down the tree looking for 'start'. If we find an extent
> + * containing it, the range is not clean and we return false
> + * right away. Otherwise we still have to check whether any
> + * extent overlaps the rest of the range.
> + */
> + while (*n) {
> + parent = *n;
> + ext = rb_entry(parent, struct bmap_rb_extent, node);
> + if (start < ext->start) {
> + n = &(*n)->rb_left;
> + } else if (start >= (ext->start + ext->count)) {
> + n = &(*n)->rb_right;
> + } else {
> + /*
> + * We found the bit in the tree -> the range is not
> + * clean
> + */
> + return 0;
> + }
> + }
> +
> + node = parent;
> + while (node) {
> + next = rb_next(node);
> + ext = rb_entry(node, struct bmap_rb_extent, node);
> + node = next;
> +
> + if ((ext->start + ext->count) <= start)
> + continue;
> +
> + /* No more overlap possible */
> + if ((start + len) <= ext->start)
> + break;
> +
> + retval = 0;
> + break;
> + }
> + return retval;
> +}
> +
> +static errcode_t rb_set_bmap_range(ext2fs_generic_bitmap bitmap,
> + __u64 start, size_t num, void *in)
> +{
> + struct ext2fs_rb_private *bp;
> + size_t i;
> + int ret;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> + rb_flush_cache(bp);
> +
> + for (i = 0; i < num; i++) {
> + ret = ext2fs_test_bit(i, in);
> + if (ret)
> + rb_set_bit(bp, start + i - bitmap->start);
> + }
> +
> + return 0;
> +}
> +
> +static errcode_t rb_get_bmap_range(ext2fs_generic_bitmap bitmap,
> + __u64 start, size_t num, void *out)
> +{
> + struct rb_node *parent = NULL, *next, **n;
> + struct ext2fs_rb_private *bp;
> + struct bmap_rb_extent *ext;
> + __u64 pos;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> + n = &bp->root.rb_node;
> + rb_flush_cache(bp);
> + start -= bitmap->start;
> +
> + /*
> + * Note that even with an empty tree we must fall through to
> + * the clearing loop below, so that every bit in 'out' gets
> + * written.
> + */
> +
> + while (*n) {
> + parent = *n;
> + ext = rb_entry(parent, struct bmap_rb_extent, node);
> + if (start < ext->start) {
> + n = &(*n)->rb_left;
> + } else if (start >= (ext->start + ext->count)) {
> + n = &(*n)->rb_right;
> + } else
> + break;
> + }
> +
> + pos = start;
> + for (; parent != NULL; parent = next) {
> + next = rb_next(parent);
> + ext = rb_entry(parent, struct bmap_rb_extent, node);
> +
> + while (((pos - start) < num) &&
> + (pos < ext->start)) {
> + ext2fs_fast_clear_bit64((pos - start), out);
> + pos++;
> + }
> +
> + if ((pos - start) >= num)
> + return 0;
> +
> + while (((pos - start) < num) &&
> + (pos < (ext->start + ext->count))) {
> + ext2fs_fast_set_bit64((pos - start), out);
> + pos++;
> + }
> + }
> +
> + while ((pos - start) < num) {
> + ext2fs_fast_clear_bit64((pos - start), out);
> + pos++;
> + }
> +
> + return 0;
> +}
> +
> +static void rb_clear_bmap(ext2fs_generic_bitmap bitmap)
> +{
> + struct ext2fs_rb_private *bp;
> +
> + bp = (struct ext2fs_rb_private *) bitmap->private;
> +
> + rb_free_tree(&bp->root);
> + ext2fs_free_mem(&bp->cache);
> +}
> +
> +struct ext2_bitmap_ops ext2fs_blkmap64_rbtree = {
> + .type = EXT2FS_BMAP64_RBTREE,
> + .new_bmap = rb_new_bmap,
> + .free_bmap = rb_free_bmap,
> + .copy_bmap = rb_copy_bmap,
> + .resize_bmap = rb_resize_bmap,
> + .mark_bmap = rb_mark_bmap,
> + .unmark_bmap = rb_unmark_bmap,
> + .test_bmap = rb_test_bmap,
> + .test_clear_bmap_extent = rb_test_clear_bmap_extent,
> + .mark_bmap_extent = rb_mark_bmap_extent,
> + .unmark_bmap_extent = rb_unmark_bmap_extent,
> + .set_bmap_range = rb_set_bmap_range,
> + .get_bmap_range = rb_get_bmap_range,
> + .clear_bmap = rb_clear_bmap,
> +};
> diff --git a/lib/ext2fs/bmap64.h b/lib/ext2fs/bmap64.h
> index b0aa84c..31021b9 100644
> --- a/lib/ext2fs/bmap64.h
> +++ b/lib/ext2fs/bmap64.h
> @@ -59,3 +59,4 @@ struct ext2_bitmap_ops {
> };
>
> extern struct ext2_bitmap_ops ext2fs_blkmap64_bitarray;
> +extern struct ext2_bitmap_ops ext2fs_blkmap64_rbtree;
> diff --git a/lib/ext2fs/ext2fsP.h b/lib/ext2fs/ext2fsP.h
> index ab9ee76..aa45c43 100644
> --- a/lib/ext2fs/ext2fsP.h
> +++ b/lib/ext2fs/ext2fsP.h
> @@ -108,6 +108,7 @@ extern void ext2fs_numeric_progress_close(ext2_filsys fs,
> */
>
> #define EXT2FS_BMAP64_BITARRAY 1
> +#define EXT2FS_BMAP64_RBTREE 2
>
> extern errcode_t ext2fs_alloc_generic_bmap(ext2_filsys fs, errcode_t magic,
> int type, __u64 start, __u64 end,
> diff --git a/lib/ext2fs/gen_bitmap64.c b/lib/ext2fs/gen_bitmap64.c
> index df095ac..eb233f4 100644
> --- a/lib/ext2fs/gen_bitmap64.c
> +++ b/lib/ext2fs/gen_bitmap64.c
> @@ -91,6 +91,9 @@ errcode_t ext2fs_alloc_generic_bmap(ext2_filsys fs, errcode_t magic,
> case EXT2FS_BMAP64_BITARRAY:
> ops = &ext2fs_blkmap64_bitarray;
> break;
> + case EXT2FS_BMAP64_RBTREE:
> + ops = &ext2fs_blkmap64_rbtree;
> + break;
> default:
> return EINVAL;
> }
> --
> 1.7.2.3
>

2011-01-10 16:42:40

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 2/2] e2fsprogs: Add rbtree backend for bitmaps, use it instead of bitarrays

Oops, hit send by accident.

On 2011-01-10, at 8:18, Lukas Czerner <[email protected]> wrote:
>> This commit adds another backend to store bitmaps. It is based on
>> rbtrees and it stores just the used extents of bitmaps. This means that
>> it can be more memory efficient in most cases and, with careful use, it
>> might also be much faster than simple bitarrays.
>>
>>
>> @@ -66,7 +66,7 @@ errcode_t ext2fs_allocate_inode_bitmap(ext2_filsys fs,
>> if (fs->flags & EXT2_FLAG_64BITS)
>> return (ext2fs_alloc_generic_bmap(fs,
>> EXT2_ET_MAGIC_INODE_BITMAP64,
>> - EXT2FS_BMAP64_BITARRAY,
>> + EXT2FS_BMAP64_RBTREE,
>> start, end, real_end, descr, ret));
>

It would be really useful to allow runtime selection of the bitmap type, ideally separately for block and inode bitmaps. This will allow easier testing of this patch on different filesystems, as well as allow users to select the bitmap type for performance over space, if they have plenty of memory.

Perhaps in the future by default e2fsck can select one type based on system memory vs. filesystem size? That will be important if there continues to be a performance gap between the two types.
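
For illustration only, a policy helper along these lines could make that call at bitmap-allocation time (the helper and its memory-budget parameter are invented here; nothing like it exists in e2fsprogs):

#include <ext2fs/ext2fs.h>
#include "ext2fsP.h"	/* EXT2FS_BMAP64_* backend ids */

/* Sketch: use the flat bitarray while it fits into a caller-supplied
 * memory budget, otherwise gamble on the bitmap being sparse and pick
 * the rbtree backend. */
static int choose_bitmap_backend(__u64 nbits, __u64 mem_budget_bytes)
{
	/* A bitarray needs nbits/8 bytes up front */
	if ((nbits >> 3) <= mem_budget_bytes)
		return EXT2FS_BMAP64_BITARRAY;
	return EXT2FS_BMAP64_RBTREE;
}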

>> +/*#define DEBUG*/
>> +
>> +#ifdef DEBUG
>> +static void print_tree(struct rb_root *root)

Probably should be DEBUG_RB or similar to allow separate enabling from other debug.

Cheers, Andreas

2011-01-12 12:44:38

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Add rbtree backend for storing bitmaps

On Mon, Jan 10, 2011 at 5:18 PM, Lukas Czerner <[email protected]> wrote:
> [...]

Hi Lukas,

Your work is very impressive.
I would be very much interested in reducing the memory usage of e2fsck
and reducing its run-time.
We have appliances with up to 16TB of storage and only 1GB of RAM.

I have investigated the issue of reducing e2fsck memory usage in the past
and have a lot of comments/suggestions. I will try to summarize them by topic:

1. Analysis of usage pattern per bitmap instance.
2. Yet another bitmap backend.
3. More ways to reduce memory usage of e2fsck.


1. Analysis of usage pattern per bitmap instance.

As you know, e2fsck uses bitmaps to mark many different things.
For block bitmaps you have "in-use block map", "multiply claimed block
map" and "ext attr block map".
While memory usage analysis of an "empty" and a "healthy" file system
is important, it is very important to look at the worst case as well.
Even a single multiply claimed block or a single extended attribute
block will result in considerable memory savings with rb_trees, which
is not shown in your graphs.

There are several inode bitmaps, such as "icount->multiple",
which are relatively sparse and would benefit a lot from being stored
in an rb_tree.
Again, you will only see the benefit once you have hard links in your
file system.

As a conclusion, you may want to add a statistics report of the memory usage of
each bitmap instance, so the optimal backend can be chosen per bitmap instance.
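
As a rough illustration of what such a report could hang off of, each bitmap instance could carry a counter block like the following (entirely hypothetical; none of these fields exist today):

#include <ext2fs/ext2fs.h>

/* Hypothetical per-instance statistics, updated by each backend. */
struct ext2_bmap_stats {
	const char	*descr;		/* e.g. "in-use block map" */
	__u64		total_bits;	/* size of the bit range */
	__u64		bits_set;	/* bits currently set */
	__u64		mem_bytes;	/* backend memory footprint */
};

/* The rbtree backend wins whenever its nodes cost less than the
 * flat array of total_bits/8 bytes would. */
static int rbtree_wins(const struct ext2_bmap_stats *st)
{
	return st->mem_bytes < (st->total_bits >> 3);
}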

2. Yet another bitmap backend.

It is my *hunch* that rb_trees are a suitable backend for very sparse bitmaps,
like the ones I mentioned above, and maybe for the "in-use block map".

However, I have a *feeling* that rb_trees may not be the optimal choice for
"in-use inode map" and maybe also not for "in-use block map".
The reason is that the average use case can have very dense and quite fragmented
"in-use" bitmaps.

It seems to me that if half of the block groups are not in use and the
other half is dense and fragmented, then neither rb_trees nor arrays
are the optimal backend.
A data structure similar to a page table could be quite efficient in
this use case, both from the memory usage aspect and the test/set_bit
speed aspect.

While it is rare to see a block group occupied by only a few used
blocks, it could certainly happen for inodes, so I would choose a
"page" size equal to the block size for the "in-use block" bitmap
and a much smaller "page" size for the "in-use inode" bitmap.
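
A minimal sketch of that two-level idea, purely to make the shape concrete (untested; all names invented):

#include <stdlib.h>
#include <ext2fs/ext2fs.h>	/* for __u64 */

#define PT_PAGE_BITS	32768U	/* one 4k block worth of bits */

struct pt_bitmap {
	__u64		npages;
	unsigned char	**pages;	/* NULL page = all bits clear */
};

static struct pt_bitmap *pt_new(__u64 nbits)
{
	struct pt_bitmap *b = calloc(1, sizeof(*b));

	if (!b)
		return NULL;
	b->npages = (nbits + PT_PAGE_BITS - 1) / PT_PAGE_BITS;
	b->pages = calloc(b->npages, sizeof(*b->pages));
	if (!b->pages) {
		free(b);
		return NULL;
	}
	return b;
}

static int pt_set_bit(struct pt_bitmap *b, __u64 bit)
{
	__u64 page = bit / PT_PAGE_BITS;
	__u64 off = bit % PT_PAGE_BITS;

	/* Allocate a page lazily, only once a bit inside it is set */
	if (!b->pages[page]) {
		b->pages[page] = calloc(1, PT_PAGE_BITS / 8);
		if (!b->pages[page])
			return -1;
	}
	b->pages[page][off >> 3] |= 1 << (off & 7);
	return 0;
}

static int pt_test_bit(struct pt_bitmap *b, __u64 bit)
{
	unsigned char *page = b->pages[bit / PT_PAGE_BITS];
	__u64 off = bit % PT_PAGE_BITS;

	if (!page)
		return 0;
	return (page[off >> 3] >> (off & 7)) & 1;
}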


3. More ways to reduce memory usage of e2fsck.

The most recent case of e2fsck OOM I remember being shown on this list
was caused by a file system with many hard links that were created by
rsnapshot (http://rsnapshot.org/).

If I recall correctly, the overblown memory usage was caused by the
icount cache,
which creates an entry for every inode with nlink > 1.

A similar problem can be caused by a large ea_count cache, when a file
system has many multiply referenced extended attribute blocks (ACLs?).

For that problem, I have a somewhat "crazy" solution.
If the hard linked inodes are in fact sparsely distributed
and if the larger the refcounts, the fewer the inodes
(let's neglect directory hard links, because we've given up on them
for #subdirs > 32000 anyway), then it is possible to replace
the icount cache with 16 inode bitmaps, each one representing
a single bit in the u16 i_links_count.

Assuming that in a normal file system the maximum links count is bounded,
then most of these bitmaps will be empty and the rest very sparse.

Even in a highly linked file system generated by 256 rsnapshots,
the memory usage is still only about 8 bits per inode, while the
icount cache stores 64 bits per inode.
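
To make the idea concrete, incrementing such a per-bit representation is just a ripple carry across the bitmaps (hypothetical sketch; the array-of-bitmaps layout is the assumption from above, the bitmap calls are the existing libext2fs API):

#include <ext2fs/ext2fs.h>

/* nlink_bmap[b] holds bit b of every inode's u16 link count. */
static void icount_increment(ext2fs_inode_bitmap nlink_bmap[16],
			     ext2_ino_t ino)
{
	int b;

	for (b = 0; b < 16; b++) {
		if (!ext2fs_test_inode_bitmap(nlink_bmap[b], ino)) {
			ext2fs_mark_inode_bitmap(nlink_bmap[b], ino);
			return;
		}
		/* Bit was 1: clear it and carry into the next bitmap */
		ext2fs_unmark_inode_bitmap(nlink_bmap[b], ino);
	}
}

/* Reading the count back gathers one bit from each bitmap. */
static __u16 icount_fetch(ext2fs_inode_bitmap nlink_bmap[16],
			  ext2_ino_t ino)
{
	__u16 count = 0;
	int b;

	for (b = 0; b < 16; b++)
		if (ext2fs_test_inode_bitmap(nlink_bmap[b], ino))
			count |= 1U << b;
	return count;
}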


So that's it. A lot of talking and no benchmarking... Sorry about that.

If we can come to a definite conclusion of a way to reduce memory usage
and run-time of e2fsck for both the average and worst case, I think I will be
able to turn more resources into the implementation and testing of
such a scheme.

So Lukas, if you could please try to apply the rb_trees backend to selected
bitmap instances, like "ext attribute block", and show that it is a clear
win, I am most likely to take the rb_tree patches and run some tests
on our appliances.

Thanks,
Amir.

2011-01-12 15:19:37

by Lukas Czerner

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Add rbtree backend for storing bitmaps

On Wed, 12 Jan 2011, Amir Goldstein wrote:

> [...]
>
> As a conclusion, you may want to add a statistics report of the memory usage of
> each bitmap instance, so the optimal backend can be chosen per bitmap instance.
>
> [...]

Hi Amir,

Thanks for the useful tips and inspiration. It seems like a good strategy to
split the memory statistics out per bitmap instance, so I'll certainly do
that. Also, as Andreas noted, it would be useful to allow choosing a
different bitmap backend for different bitmap types; I'll try to look at it.
Right now I am a little bit swamped by other work, but I'll do my best.

Thanks!
-Lukas

2011-01-12 19:41:30

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Add rbtree backend for storing bitmaps

On 2011-01-12, at 05:44, Amir Goldstein wrote:
> On Mon, Jan 10, 2011 at 5:18 PM, Lukas Czerner <[email protected]> wrote:
>> To address this I have created rbtree backend for storing bitmaps. It stores
>> just used extents of bitmaps, it means, that it can be more memory efficient
>> in most cases and also with careful use it might be much faster as well.
>
> 2. Yet another bitmap backend.
>
> It seems to me, that if half of the block groups are not in use and
> the other half is dense and fragmented, then neither rb_trees, nor
> arrays are the optimal backend. A data structure similar to the page
> table, could be quite efficient in this use case, both from the memory
> usage aspect and the test/set_bit speed aspect.
>
> While it is rare to see a block group occupied with few used block, it
> could certainly happen for inodes, so I would choose a "page" size equal
> to block size for "in-use block bitmap" and a much smaller "page" size
> for "in-use inode" bitmap.

This is essentially the "ebitmap" proposal that I had made some years ago.
It is a simple tree of data structures, and the leaf nodes are either
bitmaps (if a lot of bits are set) or a simple list (if very few bits are
set). We might consider a third type of leaf with extents/RLE for when
there are a lot of bits set but mostly in contiguous ranges.

Initial selection of the leaf type for a given bitmap could be hinted by
the caller based on expected usage, but conversion between leaf types is
done transparently based on usage (e.g. list->bitmap when the list memory
usage exceeds a bitmap block).
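
A sketch of what such leaves might look like, just to pin down the idea (all names invented; nothing like this is in the tree):

#include <ext2fs/ext2fs.h>

enum eb_leaf_type { EB_LIST, EB_BITMAP, EB_EXTENTS };

struct eb_extent { __u32 start; __u32 len; };

struct eb_leaf {
	enum eb_leaf_type	type;
	unsigned int		nr_set;
	union {
		__u32		*list;		/* sorted bit offsets */
		unsigned char	*bits;		/* raw bitmap */
		struct eb_extent *extents;	/* RLE ranges */
	} u;
};

/* Promote a list leaf to a bitmap leaf once the list (4 bytes per
 * set bit) would cost more memory than the flat bitmap itself. */
static int eb_should_promote(const struct eb_leaf *leaf,
			     unsigned int leaf_bits)
{
	return leaf->type == EB_LIST &&
	       leaf->nr_set * sizeof(__u32) >= leaf_bits / 8;
}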

> 3. More ways to reduce memory usage of e2fsck.
>
> The most recent case of e2fsck OOM I remember showing on this list
> was cause by a file system with many hard links that were created by
> rsnapshot (http://rsnapshot.org/)
>
> If I recall correctly, the overblown memory usage was caused by the
> icount cache, which creates an entry for every inode with nlink > 1.

I posted a patch to at least partly avoid this problem, by allowing
the icount cache to be allocated in chunks, instead of a single huge
allocation. That allows e2fsck/kernel to swap out parts of the icount
array, and avoids the need to use 2x the memory when it is doing the
realloc() of the array to increase its size.
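
The shape of that change is roughly the following (a sketch of the idea as described above, not the actual patch; the chunk size and struct names are made up):

#include <stdlib.h>
#include <ext2fs/ext2fs.h>

#define ICOUNT_CHUNK	4096		/* entries per chunk; made up */

struct ic_el {
	ext2_ino_t	ino;
	__u16		count;
};

struct ic_chunked {
	struct ic_el	**chunks;	/* array of fixed-size chunks */
	__u64		nchunks;
	__u64		count;		/* entries in use */
};

/* Index without ever realloc()ing one huge array: growing the list
 * only ever allocates one more ICOUNT_CHUNK-sized piece. */
static struct ic_el *ic_entry(struct ic_chunked *ic, __u64 i)
{
	return &ic->chunks[i / ICOUNT_CHUNK][i % ICOUNT_CHUNK];
}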

> If the hard linked inodes are in fact sparsely distributed
> and if the larger the refcounts, the fewer the inodes
> (let's neglect directory hard links, because we've given up on them
> for #subdirs > 32000 anyway),

Actually, for directories I think we could optimize the icount usage
somewhat as well. We could use the inode_dir_map/S_ISDIR as a +1 link
count for the directory "." reference, so that icount only needs to
count the other links. That means that directories will not need any entries
in the icount array, and I think this would reduce memory usage for
icount considerably in normal usage.
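
In other words, something like this (hypothetical helper; inode_dir_map is the existing e2fsck directory bitmap):

#include <ext2fs/ext2fs.h>

/* Treat a directory's "." self-reference as an implicit +1, so the
 * icount machinery only has to track the remaining links. */
static __u16 effective_links(ext2fs_inode_bitmap inode_dir_map,
			     ext2_ino_t ino, __u16 counted_links)
{
	if (ext2fs_test_inode_bitmap(inode_dir_map, ino))
		return counted_links + 1;	/* implicit "." link */
	return counted_links;
}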

> then it is possible to replace the icount cache with 16 inode
> bitmaps, each one representing a single bit in the u16 i_links_count.
>
> Assuming that in a normal file system the maximum links count is bounded,
> then most of these bitmaps will be empty and the rest very sparse.

I had thought of something similar, with a limited number of bitmaps
(say 2 or 3) to handle the common hard-link cases, and then fall back
to the old icount list for inodes with more links than that.

With rbtree the cost of a very sparse bitmap for the high bits would be
fairly low, so the memory usage of having 16 bitmaps is not so bad, but
there would be 16x bitmap lookups for each inode. That might still be a
good trade-off (if memory is tight) because CPU speed is improving relative
to memory size over time.

> Even in a highly linked file system generated by 256 rsnapshots,
> the memory usage is still only about 8bits per inode, while icount
> cache stores 64bit per inode.

That isn't quite correct. With rbtree the memory usage is in the order
of several bytes per bit in the worst case, IIRC from Lukas' email. I
think this idea would need a lot of benchmarking before it is accepted.


Cheers, Andreas