2004-04-18 16:14:05

by Kevin O'Connor

Subject: Questions on prio_tree

Hi Rajesh,

I've been reading through your recently posted prio_tree code. There are a
couple of things in the code that I do not understand, and I am concerned
that they may be bugs or omissions. Hopefully you'll just be able to
clarify the code for me. :-)

1 - Why does prio_tree_expand use the max_heap_index to determine the
height of the radix tree? Only the bits of the radix_index are ever used
(never the bits of the heap_index), so shouldn't max_radix_index be used?
I understand that using the max_heap_index isn't necessarily broken (it is
guaranteed to be greater than or equal to the max_radix_index), but it
seems inefficient. As an example, consider the case where 100 vmas are
created all starting at address 0 and ending at addresses 0 through 100.
The current code will force the address part of the tree to a height of 7,
but since there is only one start address (0) the tree really could have a
height of just 1.
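
To make the arithmetic concrete, here is a tiny stand-alone sketch of how
I picture the height calculation (bits_needed() is my own illustrative
helper, not the actual kernel code):

#include <stdio.h>

/* number of index bits needed to cover a given maximum index value */
static unsigned int bits_needed(unsigned long max_index)
{
        unsigned int bits = 1;

        while (max_index >> bits)
                bits++;
        return bits;
}

int main(void)
{
        /* 100 vmas starting at 0, ending at addresses 0 through 100 */
        printf("%u\n", bits_needed(0));      /* sized by max_radix_index: 1  */
        printf("%u\n", bits_needed(100));    /* sized by max_heap_index:  7  */
        printf("%u\n", bits_needed(100000)); /* the end address in question 2
                                                below needs 17 bits */
        return 0;
}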

2 - When traversing the tree, after the first root->index_bits levels of
the tree have been walked, the code starts inspecting the "size" portion of
the radix key. Why does the code set the start bit for the size portion to
root->index_bits? Shouldn't the code set the start bit to (BITS_PER_LONG -
PAGE_SHIFT)?

Consider the case where 100 vmas are created all starting at address 0 and
ending at addresses 0 through 100, followed by the creation of a vma from
address 100,000 - 100,001. After the first 100 vmas, the tree will have a
height of 7 (reasons cited above) for the start address and an additional
height of 7 for the size portion of the index. This will cause the tree to
have 7 records in the start_address portion of the tree, and the remaining
93 will be distributed pretty evenly in the size portion of the radix tree
that hangs off the zero address. When the 101st vma is added, the start
address portion of the tree will be expanded to a height of 17. The
current code will expand the tree to contain 17 records in the size portion
of the radix tree and leave the remaining 84 records in the size portion of
the tree hanging off the zero address.

The problem, as I see it, is that new inserts into the tree that start at
address zero will not be inspecting the same bits that were used when the
first 100 were inserted. (When the first 100 were inserted, bits 1-7 were
used to branch in the size portion of the tree, but inserts 102 and on will
use bits 1-17.) I don't think this can cause a "crash", but I think it
skews the tree ordering and could disturb the O(log n + m) behavior of the
algorithm.

Did I miss something?
-Kevin

--
---------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| [email protected] 'IMHO', 'FAQ', 'BTW', etc. !" |
---------------------------------------------------------------------


Subject: Re: Questions on prio_tree


Hi Kevin,

Thanks for your analysis and comments.

> 1 - Why does prio_tree_expand use the max_heap_index to determine the
> height of the radix tree? Only the bits of the radix_index are ever used
> (never the bits of the heap_index), so shouldn't max_radix_index be used?

In some cases the heap_index is also used. As you noted below, sometimes
size = heap_index - radix_index is used for indexing the tree. If you
consider radix_index == 0UL, then size == heap_index.
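
For reference, this is roughly how I think of the three quantities (an
illustrative sketch assuming the usual vma fields, not verbatim from the
posted code; the indices are in units of pages):

#include <linux/mm.h>   /* struct vm_area_struct, PAGE_SHIFT */

/* illustrative helpers; not the names used in the patch */
static inline unsigned long vma_radix_index(struct vm_area_struct *vma)
{
        return vma->vm_pgoff;                                     /* start */
}

static inline unsigned long vma_heap_index(struct vm_area_struct *vma)
{
        return vma->vm_pgoff +
               ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1; /* end */
}

static inline unsigned long vma_size_index(struct vm_area_struct *vma)
{
        return vma_heap_index(vma) - vma_radix_index(vma);
}

With vma_radix_index() equal to 0 for every vma, vma_size_index() collapses
to vma_heap_index().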

> I understand that using the max_heap_index isn't necessarily broken (it is
> guaranteed to be greater than or equal to the max_radix_index), but it
> seems inefficient. As an example, consider the case where 100 vmas are
> created all starting at address 0 and ending at addresses 0 through 100.
> The current code will force the address part of the tree to a height of 7,
> but since there is only one start address (0) the tree really could have a
> height of just 1.

Good example. Actually, your example is the exact reason why we use
max_heap_index rather than max_radix_index.

It is clear that we cannot store 101 vmas if the height of the tree is
just 1. We need at least 7 levels to store 101 vmas. Yes, the first 7
levels of the tree will be skewed because the radix_index (all 0) is used
for indexing the first 7 (root->index_bits) levels. From level 8 to 14 the
size (== heap_index, because radix_index is always 0) is used for indexing
the tree. So from level 8 to 14 it will be a fully packed (dense) radix
tree where size (== heap_index) is used as the index.
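
In other words, the per-level choice of key and bit looks roughly like
this (a sketch with my own names, for the index_bits == 7 example above;
not the actual traversal code):

/* which way to branch at a given level (levels counted from 1) */
static int branch_right(unsigned long radix_index, unsigned long size,
                        unsigned int index_bits, unsigned int level)
{
        unsigned long key, mask;

        if (level <= index_bits) {
                /* levels 1..7: radix_index bits, all zero here -> skewed */
                key  = radix_index;
                mask = 1UL << (index_bits - level);
        } else {
                /* levels 8..14: size (== heap_index) bits, 0..100 -> dense */
                key  = size;
                mask = 1UL << (2 * index_bits - level);
        }
        return (key & mask) != 0;
}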

Yes, this is a bit inefficient. But it provides a clean solution for a
corner case where all vmas start from the same pgoff (radix_index)
with different end addresses. This is a rare corner case, so we don't
have to worry much about the performance here.

> 2 - When traversing the tree, after the first root->index_bits levels of
> the tree have been walked, the code starts inspecting the "size" portion of
> the radix key. Why does the code set the start bit for the size portion to
> root->index_bits? Shouldn't the code set the start bit to (BITS_PER_LONG -
> PAGE_SHIFT)?
>
> Consider the case where 100 vmas are created all starting at address 0 and
> ending at addresses 0 through 100, followed by the creation of a vma from
> address 100,000 - 100,001. After the first 100 vmas, the tree will have a
> height of 7 (reasons cited above) for the start address and an additional
> height of 7 for the size portion of the index.

Yeah.

> This will cause the tree to
> have 7 records in the start_address portion of the tree, and the remaining
> 93 will be distributed pretty evenly in the size portion of the radix tree
> that hangs off the zero address.

Agreed.

> When the 101st vma is added, the start
> address portion of the tree will be expanded to a height of 17. The
> current code will expand the tree to contain 17 records in the size portion
I think this is a typo: it should read "17 records in the start_address
portion".

> of the radix tree and leave the remaining 84 records in the size portion of
> the tree hanging off the zero address.

Right.

> The problem, as I see it, is that new inserts into the tree that start at
> address zero will not be inspecting the same bits that were used when the
> first 100 were inserted. (When the first 100 were inserted, bits 1-7 were
> used to branch in the size portion of the tree, but inserts 102 and on will
> use bits 1-17.)

Thanks for your analysis. This is the reason why the expand algorithm is
an O((log n)^2) algorithm.

Check the prio_tree_remove call in prio_tree_expand. If the tree is
expanded by more than one index bit, then we start removing nodes from the
current tree and form a new skewed tree.

In your example, the index_bits changes from 7 to 17. So, (17 - 7) - 1
= 9 nodes will be removed from the current tree before adding the 101st
vma. So, at the end of the removal of 9 nodes from the current tree,
there are 91 vmas (7 in the skewed start_address portion and the
remaining 84 in the size portion that hangs off the radix_index zero).
Now, the 9 removed vmas and the newly added vma form a skewed tree of
height 10. The current tree with 91 vmas is appended at the end of this
10-vma skewed tree, forming a 17-vma skewed tree in the start_address
portion and an 84-vma dense tree in the size portion.
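
A quick back-of-the-envelope check of those numbers (plain user-space C,
just the arithmetic, nothing taken from the kernel sources):

#include <stdio.h>

int main(void)
{
        unsigned int old_bits = 7, new_bits = 17, vmas = 100;

        unsigned int removed = (new_bits - old_bits) - 1; /* 9 nodes pulled out    */
        unsigned int left    = vmas - removed;            /* 91 remain in old tree */
        unsigned int chain   = removed + 1;               /* 10: removed + new vma */
        unsigned int skewed  = old_bits + chain;          /* 17 in start_address   */
        unsigned int dense   = left - old_bits;           /* 84 in size portion    */

        printf("removed=%u left=%u chain=%u skewed=%u dense=%u\n",
               removed, left, chain, skewed, dense);
        return 0;
}

which matches the 17 + 84 split described above.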

> I don't think this can cause a "crash", but I think it
> skews the tree ordering and could disturb the O(log n + m) behavior of the
> algorithm.

Even when the tree is skewed we guarantee O(log n + m). Yeah, we hide
the multiplicative constant (2). But then, that is the beauty of big-O
notation, right? Hiding the practical performance issues.

The tree gets skewed only in very rare corner cases. So the performance
issues should not matter much, IMHO.

Thanks,
Rajesh