2010-03-28 09:11:57

by jing zhang

[permalink] [raw]
Subject: [PATCH] ext4: add rb_tree cache to struct ext4_group_info

From: Jing Zhang <[email protected]>

Date: Sun Mar 28 17:09:33 2010

rb_tree cache is added to struct ext4_group_info at minor cost.

Cc: Theodore Ts'o <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: "Aneesh Kumar K. V" <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>

---

--- linux-2.6.32/fs/ext4/ext4.h 2009-12-03 11:51:22.000000000 +0800
+++ ext4_mm_leak/ext4-12.h 2010-03-28 16:47:56.000000000 +0800
@@ -1622,6 +1622,7 @@ static inline void ext4_update_i_disksiz
struct ext4_group_info {
unsigned long bb_state;
struct rb_root bb_free_root;
+ struct rb_node *bb_free_cache;
ext4_grpblk_t bb_first_free; /* first free block */
ext4_grpblk_t bb_free; /* total free blocks */
ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
--- linux-2.6.32/fs/ext4/mballoc.c 2009-12-03 11:51:22.000000000 +0800
+++ ext4_mm_leak/mballoc-12.c 2010-03-28 16:43:08.000000000 +0800
@@ -2548,6 +2548,8 @@ static void release_blocks_on_commit(jou
count2++;
ext4_lock_group(sb, entry->group);
/* Take it out of per group rb tree */
+ if (db->bb_free_cache == &entry->node)
+ db->bb_free_cache = rb_next(&entry->node);
rb_erase(&entry->node, &(db->bb_free_root));
mb_free_blocks(NULL, &e4b, entry->start_blk, entry->count);

@@ -3188,7 +3190,10 @@ static void ext4_mb_generate_from_freeli
struct ext4_free_data *entry;

grp = ext4_get_group_info(sb, group);
- n = rb_first(&(grp->bb_free_root));
+ if (grp->bb_free_cache)
+ n = grp->bb_free_cache;
+ else
+ n = rb_first(&(grp->bb_free_root));

while (n) {
entry = rb_entry(n, struct ext4_free_data, node);
@@ -4353,6 +4358,7 @@ ext4_mb_free_metadata(handle_t *handle,
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct rb_node **n = &db->bb_free_root.rb_node, *node;
struct rb_node *parent = NULL, *new_node;
+ int left = 1;

BUG_ON(!ext4_handle_valid(handle));
BUG_ON(e4b->bd_bitmap_page == NULL);
@@ -4369,6 +4375,8 @@ ext4_mb_free_metadata(handle_t *handle,
* blocks */
page_cache_get(e4b->bd_buddy_page);
page_cache_get(e4b->bd_bitmap_page);
+ /* just for sure */
+ db->bb_free_cache = 0;
}
while (*n) {
parent = *n;
@@ -4376,7 +4384,7 @@ ext4_mb_free_metadata(handle_t *handle,
if (block < entry->start_blk)
n = &(*n)->rb_left;
else if (block >= (entry->start_blk + entry->count))
- n = &(*n)->rb_right;
+ n = &(*n)->rb_right, left = 0;
else {
ext4_grp_locked_error(sb, e4b->bd_group, __func__,
"Double free of blocks %d (%d %d)",
@@ -4387,6 +4395,8 @@ ext4_mb_free_metadata(handle_t *handle,

rb_link_node(new_node, parent, n);
rb_insert_color(new_node, &db->bb_free_root);
+ if (left)
+ db->bb_free_cache = new_node;

/* Now try to see the extent can be merged to left and right */
node = rb_prev(new_node);


2010-03-28 15:03:28

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: add rb_tree cache to struct ext4_group_info

jing zhang wrote:
> From: Jing Zhang <[email protected]>
>
> Date: Sun Mar 28 17:09:33 2010
>
> rb_tree cache is added to struct ext4_group_info at minor cost.

Please include a reason for the change in the commit message,
so that in the future people don't have to figure out for
themselves why a change was made. It also helps reviewers. :)

...

> @@ -4376,7 +4384,7 @@ ext4_mb_free_metadata(handle_t *handle,
> if (block < entry->start_blk)
> n = &(*n)->rb_left;
> else if (block >= (entry->start_blk + entry->count))
> - n = &(*n)->rb_right;
> + n = &(*n)->rb_right, left = 0;
> else {
> ext4_grp_locked_error(sb, e4b->bd_group, __func__,
> "Double free of blocks %d (%d %d)",

I think it's better if you don't use that style, but instead:

else if (block >= (entry->start_blk + entry->count)) {
n = &(*n)->rb_right;
left = 0;
} else {
...

2010-03-29 13:40:29

by jing zhang

[permalink] [raw]
Subject: Re: [PATCH] ext4: add rb_tree cache to struct ext4_group_info

2010/3/28, Eric Sandeen <[email protected]>:
> jing zhang wrote:
>> From: Jing Zhang <[email protected]>
>>
>> Date: Sun Mar 28 17:09:33 2010
>>
>> rb_tree cache is added to struct ext4_group_info at minor cost.
>
> Please include a reason for the change in the commit message,
> so that in the future people don't have to figure out for
> themselves why a change was made. It also helps reviewers. :)

good advice

>
> ...
>
>> @@ -4376,7 +4384,7 @@ ext4_mb_free_metadata(handle_t *handle,
>> if (block < entry->start_blk)
>> n = &(*n)->rb_left;
>> else if (block >= (entry->start_blk + entry->count))
>> - n = &(*n)->rb_right;
>> + n = &(*n)->rb_right, left = 0;
>> else {
>> ext4_grp_locked_error(sb, e4b->bd_group, __func__,
>> "Double free of blocks %d (%d %d)",
>
> I think it's better if you don't use that style, but instead:

fine
- zj
>
> else if (block >= (entry->start_blk + entry->count)) {
> n = &(*n)->rb_right;
> left = 0;
> } else {
> ...
>

Cache of rb_tree is added to struct ext4_group_info, and is maintained
at minor cost.
Then rb_tree traversing in the function,
ext4_mb_generate_from_freelist() is accelerated.

---

--- linux-2.6.32/fs/ext4/mballoc.c 2009-12-03 11:51:22.000000000 +0800
+++ ext4_mm_leak/mballoc-12.c 2010-03-29 21:31:18.000000000 +0800
@@ -2548,6 +2548,8 @@ static void release_blocks_on_commit(jou
count2++;
ext4_lock_group(sb, entry->group);
/* Take it out of per group rb tree */
+ if (db->bb_free_cache == &entry->node)
+ db->bb_free_cache = rb_next(&entry->node);
rb_erase(&entry->node, &(db->bb_free_root));
mb_free_blocks(NULL, &e4b, entry->start_blk, entry->count);

@@ -3188,7 +3190,10 @@ static void ext4_mb_generate_from_freeli
struct ext4_free_data *entry;

grp = ext4_get_group_info(sb, group);
- n = rb_first(&(grp->bb_free_root));
+ if (grp->bb_free_cache)
+ n = grp->bb_free_cache;
+ else
+ n = rb_first(&(grp->bb_free_root));

while (n) {
entry = rb_entry(n, struct ext4_free_data, node);
@@ -4353,6 +4358,7 @@ ext4_mb_free_metadata(handle_t *handle,
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct rb_node **n = &db->bb_free_root.rb_node, *node;
struct rb_node *parent = NULL, *new_node;
+ int left = 1;

BUG_ON(!ext4_handle_valid(handle));
BUG_ON(e4b->bd_bitmap_page == NULL);
@@ -4369,15 +4375,18 @@ ext4_mb_free_metadata(handle_t *handle,
* blocks */
page_cache_get(e4b->bd_buddy_page);
page_cache_get(e4b->bd_bitmap_page);
+ /* just for sure */
+ db->bb_free_cache = 0;
}
while (*n) {
parent = *n;
entry = rb_entry(parent, struct ext4_free_data, node);
if (block < entry->start_blk)
n = &(*n)->rb_left;
- else if (block >= (entry->start_blk + entry->count))
+ else if (block >= (entry->start_blk + entry->count)) {
n = &(*n)->rb_right;
- else {
+ left = 0;
+ } else {
ext4_grp_locked_error(sb, e4b->bd_group, __func__,
"Double free of blocks %d (%d %d)",
block, entry->start_blk, entry->count);
@@ -4387,6 +4396,8 @@ ext4_mb_free_metadata(handle_t *handle,

rb_link_node(new_node, parent, n);
rb_insert_color(new_node, &db->bb_free_root);
+ if (left)
+ db->bb_free_cache = new_node;

/* Now try to see the extent can be merged to left and right */
node = rb_prev(new_node);

2010-03-30 18:29:31

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext4: add rb_tree cache to struct ext4_group_info

On Sun, 28 Mar 2010 17:11:55 +0800, jing zhang <[email protected]> wrote:
> From: Jing Zhang <[email protected]>
>
> Date: Sun Mar 28 17:09:33 2010
>
> rb_tree cache is added to struct ext4_group_info at minor cost.
>

Did we measure the cost of not doing this ? Is there a profile data that
we can look at ?

-aneesh

2010-03-31 14:26:47

by jing zhang

[permalink] [raw]
Subject: Re: [PATCH] ext4: add rb_tree cache to struct ext4_group_info

2010/3/28, jing zhang <[email protected]>:
> From: Jing Zhang <[email protected]>
>
> Date: Sun Mar 28 17:09:33 2010
>
> rb_tree cache is added to struct ext4_group_info at minor cost.
>
> Cc: Theodore Ts'o <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Cc: Dave Kleikamp <[email protected]>
> Cc: "Aneesh Kumar K. V" <[email protected]>
> Signed-off-by: Jing Zhang <[email protected]>
>
> ---
>
> --- linux-2.6.32/fs/ext4/ext4.h 2009-12-03 11:51:22.000000000 +0800
> +++ ext4_mm_leak/ext4-12.h 2010-03-28 16:47:56.000000000 +0800
> @@ -1622,6 +1622,7 @@ static inline void ext4_update_i_disksiz
> struct ext4_group_info {
> unsigned long bb_state;
> struct rb_root bb_free_root;
> + struct rb_node *bb_free_cache;
> ext4_grpblk_t bb_first_free; /* first free block */
> ext4_grpblk_t bb_free; /* total free blocks */
> ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
> --- linux-2.6.32/fs/ext4/mballoc.c 2009-12-03 11:51:22.000000000 +0800
> +++ ext4_mm_leak/mballoc-12.c 2010-03-28 16:43:08.000000000 +0800
> @@ -2548,6 +2548,8 @@ static void release_blocks_on_commit(jou
> count2++;
> ext4_lock_group(sb, entry->group);
> /* Take it out of per group rb tree */
> + if (db->bb_free_cache == &entry->node)
> + db->bb_free_cache = rb_next(&entry->node);
> rb_erase(&entry->node, &(db->bb_free_root));
> mb_free_blocks(NULL, &e4b, entry->start_blk, entry->count);
>
> @@ -3188,7 +3190,10 @@ static void ext4_mb_generate_from_freeli
> struct ext4_free_data *entry;
>
> grp = ext4_get_group_info(sb, group);
> - n = rb_first(&(grp->bb_free_root));
> + if (grp->bb_free_cache)
> + n = grp->bb_free_cache;
> + else
> + n = rb_first(&(grp->bb_free_root));
>
> while (n) {
> entry = rb_entry(n, struct ext4_free_data, node);
> @@ -4353,6 +4358,7 @@ ext4_mb_free_metadata(handle_t *handle,
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> struct rb_node **n = &db->bb_free_root.rb_node, *node;
> struct rb_node *parent = NULL, *new_node;
> + int left = 1;
>
> BUG_ON(!ext4_handle_valid(handle));
> BUG_ON(e4b->bd_bitmap_page == NULL);
> @@ -4369,6 +4375,8 @@ ext4_mb_free_metadata(handle_t *handle,
> * blocks */
> page_cache_get(e4b->bd_buddy_page);
> page_cache_get(e4b->bd_bitmap_page);
> + /* just for sure */
> + db->bb_free_cache = 0;
> }
> while (*n) {
> parent = *n;
> @@ -4376,7 +4384,7 @@ ext4_mb_free_metadata(handle_t *handle,
> if (block < entry->start_blk)
> n = &(*n)->rb_left;
> else if (block >= (entry->start_blk + entry->count))
> - n = &(*n)->rb_right;
> + n = &(*n)->rb_right, left = 0;
> else {
> ext4_grp_locked_error(sb, e4b->bd_group, __func__,
> "Double free of blocks %d (%d %d)",
> @@ -4387,6 +4395,8 @@ ext4_mb_free_metadata(handle_t *handle,
>
> rb_link_node(new_node, parent, n);
> rb_insert_color(new_node, &db->bb_free_root);
> + if (left)
> + db->bb_free_cache = new_node;
>
> /* Now try to see the extent can be merged to left and right */
> node = rb_prev(new_node);
>



2010/3/31, Aneesh Kumar K. V <[email protected]>:
> On Sun, 28 Mar 2010 17:11:55 +0800, jing zhang <[email protected]> wrote:
>> From: Jing Zhang <[email protected]>
>>
>> Date: Sun Mar 28 17:09:33 2010
>>
>> rb_tree cache is added to struct ext4_group_info at minor cost.
>>
>
> Did we measure the cost of not doing this ? Is there a profile data that
> we can look at ?
>
> -aneesh
>

With the added cache, there is over 50% probability that the operation,
rb_first(&(grp->bb_free_root));
can be saved, when there are multiple nodes in tree.

It seems what is added is following what is called O(1), one of the
works by Mr. Ingo Molnar, but I am not sure, and let's ask Mr. Ingo
Molnar.

- zj

2010-04-03 15:34:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: add rb_tree cache to struct ext4_group_info

On Wed, Mar 31, 2010 at 10:26:45PM +0800, jing zhang wrote:
>
> With the added cache, there is over 50% probability that the operation,
> rb_first(&(grp->bb_free_root));
> can be saved, when there are multiple nodes in tree.
>
> It seems what is added is following what is called O(1), one of the
> works by Mr. Ingo Molnar, but I am not sure, and let's ask Mr. Ingo
> Molnar.

Sure, but does it matter? The red-black tree is per-block group, and
rb_first() is O(ln n), and it's cleared after every transaction
commit. Have you measured how deep it gets? Have you measured how
much CPU time this would actually save?

I'm almost certiain the code complexity isn't worth it. For example,
your patch is buggy. There are places where the red black tree is
manipulated, and where the node pointed at by bb_free_cache could get
freed. For example, see release_blocks_on_commit() and
ext4_mb_free_metadata().

That being said, I'm not convinced ext4_mb_generate_from_freelist() is
(a) necessary, or (b) bug-free, either. The whole point of having
extents in bb_free_root tree is that those extents aren't safe to be
placed in the buddy bitmap. And ext4_mb_generate_from_freelist()
isn't freeing the nodes from the rbtree. Fortunately it looks like
ext4_mb_generate_from_freelist is only getting called when the buddy
bitmap is being set up, so the rbtree should be empty during those
times.

I need to do some more investigation, but I think the function can be
removed entirely.

- Ted

2010-04-04 01:26:32

by jing zhang

[permalink] [raw]
Subject: Re: [PATCH] ext4: add rb_tree cache to struct ext4_group_info

2010/4/3, [email protected] <[email protected]>:
> On Wed, Mar 31, 2010 at 10:26:45PM +0800, jing zhang wrote:
>>
>> With the added cache, there is over 50% probability that the operation,
>> rb_first(&(grp->bb_free_root));
>> can be saved, when there are multiple nodes in tree.
>>
>> It seems what is added is following what is called O(1), one of the
>> works by Mr. Ingo Molnar, but I am not sure, and let's ask Mr. Ingo
>> Molnar.
>
> Sure, but does it matter? The red-black tree is per-block group, and
> rb_first() is O(ln n), and it's cleared after every transaction
> commit. Have you measured how deep it gets? Have you measured how
> much CPU time this would actually save?
>

Thanks for these questions, which were out of my consideration.

> I'm almost certiain the code complexity isn't worth it. For example,
> your patch is buggy. There are places where the red black tree is
> manipulated, and where the node pointed at by bb_free_cache could get
> freed. For example, see release_blocks_on_commit() and
> ext4_mb_free_metadata().

I will check release_blocks_on_commit() and ext4_mb_free_metadata()
again next week.

>
> That being said, I'm not convinced ext4_mb_generate_from_freelist() is
> (a) necessary, or (b) bug-free, either. The whole point of having
> extents in bb_free_root tree is that those extents aren't safe to be
> placed in the buddy bitmap. And ext4_mb_generate_from_freelist()
> isn't freeing the nodes from the rbtree. Fortunately it looks like
> ext4_mb_generate_from_freelist is only getting called when the buddy
> bitmap is being set up, so the rbtree should be empty during those
> times.
>
> I need to do some more investigation, but I think the function can be
> removed entirely.

Do you mean that ext4_mb_generate_from_freelist() can be removed entirely?

- zj
>
> - Ted
>

2010-04-04 02:07:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: add rb_tree cache to struct ext4_group_info

On Sun, Apr 04, 2010 at 09:26:26AM +0800, jing zhang wrote:
> > That being said, I'm not convinced ext4_mb_generate_from_freelist() is
> > (a) necessary, or (b) bug-free, either. The whole point of having
> > extents in bb_free_root tree is that those extents aren't safe to be
> > placed in the buddy bitmap. And ext4_mb_generate_from_freelist()
> > isn't freeing the nodes from the rbtree. Fortunately it looks like
> > ext4_mb_generate_from_freelist is only getting called when the buddy
> > bitmap is being set up, so the rbtree should be empty during those
> > times.
> >
> > I need to do some more investigation, but I think the function can be
> > removed entirely.
>
> Do you mean that ext4_mb_generate_from_freelist() can be removed entirely?

Maybe. I need to do more investigation to be sure. The code in
mballoc is subtle, and in some places there is stuff which is vestigal
or misnamed, but it means that I'm going to be very careful before
changing things.

It also means that if you submit patches, you need to be very clear
what you think the surrounding code is doing, why you think it's
wrong, and how your patch make things better. For example, this:

The function, ext4_mb_discard_group_preallocations(), works alone as
hard as possible without correct understanding its caller's good
thinking.

And now try to relieve it in simple way.

is almost useless as a comment. It doesn't help me understand the
code. "hard as possible"? Huh? "without correct understanding"?
How can code, unless it's artificially intelligent, have
understanding? And if you meant the original author had no
understanding, how do you know that? "caller's good thinking"? Same
question; the calling code doesn't think.

This sort of explanation isn't helpful in understanding the patch.

- Ted

2010-04-04 06:26:03

by jing zhang

[permalink] [raw]
Subject: Re: [PATCH] ext4: add rb_tree cache to struct ext4_group_info

2010/4/4, [email protected] <[email protected]>:
> On Sun, Apr 04, 2010 at 09:26:26AM +0800, jing zhang wrote:
>> > That being said, I'm not convinced ext4_mb_generate_from_freelist() is
>> > (a) necessary, or (b) bug-free, either. The whole point of having
>> > extents in bb_free_root tree is that those extents aren't safe to be
>> > placed in the buddy bitmap. And ext4_mb_generate_from_freelist()
>> > isn't freeing the nodes from the rbtree. Fortunately it looks like
>> > ext4_mb_generate_from_freelist is only getting called when the buddy
>> > bitmap is being set up, so the rbtree should be empty during those
>> > times.
>> >
>> > I need to do some more investigation, but I think the function can be
>> > removed entirely.
>>
>> Do you mean that ext4_mb_generate_from_freelist() can be removed entirely?
>
> Maybe. I need to do more investigation to be sure. The code in
> mballoc is subtle, and in some places there is stuff which is vestigal
> or misnamed, but it means that I'm going to be very careful before
> changing things.
>
> It also means that if you submit patches, you need to be very clear
> what you think the surrounding code is doing, why you think it's
> wrong, and how your patch make things better. For example, this:
>
> The function, ext4_mb_discard_group_preallocations(), works alone as
> hard as possible without correct understanding its caller's good
> thinking.
>
> And now try to relieve it in simple way.
>
> is almost useless as a comment. It doesn't help me understand the
> code. "hard as possible"? Huh? "without correct understanding"?
> How can code, unless it's artificially intelligent, have
> understanding? And if you meant the original author had no
> understanding, how do you know that? "caller's good thinking"? Same
> question; the calling code doesn't think.
>
> This sort of explanation isn't helpful in understanding the patch.
>
> - Ted
>

Another good lesson, I got it.

Thanks
- zj