Hello,
on one of my machines the kernel is oopsing every so often with
the following message:
Sep 19 18:59:19 www2 kernel: ------------[ cut here ]------------
Sep 19 18:59:19 www2 kernel: kernel BUG at mm/prio_tree.c:538!
Sep 19 18:59:19 www2 kernel: invalid operand: 0000 [#1]
Sep 19 18:59:19 www2 kernel: SMP
Sep 19 18:59:19 www2 kernel: Modules linked in: dm_mod
Sep 19 18:59:19 www2 kernel: CPU: 0
Sep 19 18:59:19 www2 kernel: EIP: 0060:[vma_prio_tree_add+21/160] Not tainted
Sep 19 18:59:19 www2 kernel: EFLAGS: 00010206 (2.6.8.1)
Sep 19 18:59:19 www2 kernel: EIP is at vma_prio_tree_add+0x15/0xa0
Sep 19 18:59:19 www2 kernel: eax: d8e2b5b8 ebx: d5298f90 ecx: 00000000 edx: 0000000a
Sep 19 18:59:19 www2 kernel: esi: d8e2b5b8 edi: 00000000 ebp: 0fa9b000 esp: d8e91ef0
Sep 19 18:59:19 www2 kernel: ds: 007b es: 007b ss: 0068
Sep 19 18:59:19 www2 kernel: Process cfserver (pid: 1317, threadinfo=d8e91000 task=d7fb4650)
Sep 19 18:59:20 www2 kernel: Stack: d5298fb8 d5298f90 00000000 c0135b19 d5298f90 d8e2b5b8 d6004bf4 d6fa68ac
Sep 19 18:59:20 www2 kernel: c013ed0f d5298f90 ded2be58 d6fa68ac d6fa6900 d5298f90 0fa9b000 bd7fed90
Sep 19 18:59:20 www2 kernel: 00000000 00000000 def6a780 ded2be58 ded2be38 00000000 df84e300 c013fdf4
Sep 19 18:59:20 www2 kernel: Call Trace:
Sep 19 18:59:20 www2 kernel: [vma_prio_tree_insert+37/44] vma_prio_tree_insert+0x25/0x2c
Sep 19 18:59:20 www2 kernel: [vma_adjust+363/808] vma_adjust+0x16b/0x328
Sep 19 18:59:20 www2 kernel: [split_vma+252/264] split_vma+0xfc/0x108
Sep 19 18:59:20 www2 kernel: [do_munmap+126/280] do_munmap+0x7e/0x118
Sep 19 18:59:20 www2 kernel: [sys_munmap+56/88] sys_munmap+0x38/0x58
Sep 19 18:59:20 www2 kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Sep 19 18:59:20 www2 kernel: Code: 0f 0b 1a 02 9f 51 35 c0 8b 43 04 8b 7b 08 29 c7 89 f8 c1 e8
This happens with 2.6.8.1 as well as with earlier kernels.
This is a Quad-Xeon-Box from Dell (SMP kernel).
I've no idea what is causing these failures and didn't found anything useful
by searching. Therefore I would be grateful for any help.
With regards
Racke
--
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team
On Mon, 20 Sep 2004, Stefan Hornburg wrote:
>
> on one of my machines the kernel is oopsing every so often with
> the following message:
>
> Sep 19 18:59:19 www2 kernel: ------------[ cut here ]------------
> Sep 19 18:59:19 www2 kernel: kernel BUG at mm/prio_tree.c:538!
That is worrying, we've not seen any such problem before. You mention it
happens with several kernels, so it doesn't feel like random corruption.
Would you mind mailing me and Rajesh (not the list) the output
of "objdump -rd mm/prio_tree.o" from your kernel build tree,
to help us make sense of what's shown in the registers here?
Though I very much doubt that will shed enough light: we'll probably
need to add printks to dump out the vmas involved when this happens.
Would running such a debug kernel build on this machine be possible?
Thanks,
Hugh
> Sep 19 18:59:19 www2 kernel: invalid operand: 0000 [#1]
> Sep 19 18:59:19 www2 kernel: SMP
> Sep 19 18:59:19 www2 kernel: Modules linked in: dm_mod
> Sep 19 18:59:19 www2 kernel: CPU: 0
> Sep 19 18:59:19 www2 kernel: EIP: 0060:[vma_prio_tree_add+21/160] Not tainted
> Sep 19 18:59:19 www2 kernel: EFLAGS: 00010206 (2.6.8.1)
> Sep 19 18:59:19 www2 kernel: EIP is at vma_prio_tree_add+0x15/0xa0
> Sep 19 18:59:19 www2 kernel: eax: d8e2b5b8 ebx: d5298f90 ecx: 00000000 edx: 0000000a
> Sep 19 18:59:19 www2 kernel: esi: d8e2b5b8 edi: 00000000 ebp: 0fa9b000 esp: d8e91ef0
> Sep 19 18:59:19 www2 kernel: ds: 007b es: 007b ss: 0068
> Sep 19 18:59:19 www2 kernel: Process cfserver (pid: 1317, threadinfo=d8e91000 task=d7fb4650)
> Sep 19 18:59:20 www2 kernel: Stack: d5298fb8 d5298f90 00000000 c0135b19 d5298f90 d8e2b5b8 d6004bf4 d6fa68ac
> Sep 19 18:59:20 www2 kernel: c013ed0f d5298f90 ded2be58 d6fa68ac d6fa6900 d5298f90 0fa9b000 bd7fed90
> Sep 19 18:59:20 www2 kernel: 00000000 00000000 def6a780 ded2be58 ded2be38 00000000 df84e300 c013fdf4
> Sep 19 18:59:20 www2 kernel: Call Trace:
> Sep 19 18:59:20 www2 kernel: [vma_prio_tree_insert+37/44] vma_prio_tree_insert+0x25/0x2c
> Sep 19 18:59:20 www2 kernel: [vma_adjust+363/808] vma_adjust+0x16b/0x328
> Sep 19 18:59:20 www2 kernel: [split_vma+252/264] split_vma+0xfc/0x108
> Sep 19 18:59:20 www2 kernel: [do_munmap+126/280] do_munmap+0x7e/0x118
> Sep 19 18:59:20 www2 kernel: [sys_munmap+56/88] sys_munmap+0x38/0x58
> Sep 19 18:59:20 www2 kernel: [syscall_call+7/11] syscall_call+0x7/0xb
> Sep 19 18:59:20 www2 kernel: Code: 0f 0b 1a 02 9f 51 35 c0 8b 43 04 8b 7b 08 29 c7 89 f8 c1 e8
>
> This happens with 2.6.8.1 as well as with earlier kernels.
> This is a Quad-Xeon-Box from Dell (SMP kernel).
> I've no idea what is causing these failures and didn't found anything useful
> by searching. Therefore I would be grateful for any help.
On Mon, 20 Sep 2004, Hugh Dickins wrote:
> On Mon, 20 Sep 2004, Stefan Hornburg wrote:
> >
> > on one of my machines the kernel is oopsing every so often with
> > the following message:
> >
> > Sep 19 18:59:19 www2 kernel: ------------[ cut here ]------------
> > Sep 19 18:59:19 www2 kernel: kernel BUG at mm/prio_tree.c:538!
>
> That is worrying, we've not seen any such problem before. You mention it
> happens with several kernels, so it doesn't feel like random corruption.
Few months ago Andrea got a similar report. But, he tagged that
as a hardware problem.
I am inclined to claim this report as a memory problem, but then
I may be fooling myself. The BUG_ONs in the vma_prio_tree_add are
very redundant and I would like to remove it but for the rare
reports like this.
If you see the call trace, vma_prio_tree_insert called
vma_prio_tree_add because prio_tree_insert returned a tree node
that is not equal to currently inserted &vma->shared.
If we poke into prio_tree_insert, such a tree node is returned
if and only if (r_index == radix_index && h_index == heap_index).
Therefore, we just checked the indices to be equal. However, this
bug report claims that the indices are not equal.
The only chance I see for this to happen: we are changing vm_start,
vm_end, vm_pgoff of a vma that is already in an i_mmap tree
without holding the corresponding i_mmap_lock. The last time I
did an audit, all such changes were inside the i_mmap_lock.
I will look again tonight.
Rajesh
> Would you mind mailing me and Rajesh (not the list) the output
> of "objdump -rd mm/prio_tree.o" from your kernel build tree,
> to help us make sense of what's shown in the registers here?
>
> Though I very much doubt that will shed enough light: we'll probably
> need to add printks to dump out the vmas involved when this happens.
> Would running such a debug kernel build on this machine be possible?
>
> Thanks,
> Hugh
>
> > Sep 19 18:59:19 www2 kernel: invalid operand: 0000 [#1]
> > Sep 19 18:59:19 www2 kernel: SMP
> > Sep 19 18:59:19 www2 kernel: Modules linked in: dm_mod
> > Sep 19 18:59:19 www2 kernel: CPU: 0
> > Sep 19 18:59:19 www2 kernel: EIP: 0060:[vma_prio_tree_add+21/160] Not tainted
> > Sep 19 18:59:19 www2 kernel: EFLAGS: 00010206 (2.6.8.1)
> > Sep 19 18:59:19 www2 kernel: EIP is at vma_prio_tree_add+0x15/0xa0
> > Sep 19 18:59:19 www2 kernel: eax: d8e2b5b8 ebx: d5298f90 ecx: 00000000 edx: 0000000a
> > Sep 19 18:59:19 www2 kernel: esi: d8e2b5b8 edi: 00000000 ebp: 0fa9b000 esp: d8e91ef0
> > Sep 19 18:59:19 www2 kernel: ds: 007b es: 007b ss: 0068
> > Sep 19 18:59:19 www2 kernel: Process cfserver (pid: 1317, threadinfo=d8e91000 task=d7fb4650)
> > Sep 19 18:59:20 www2 kernel: Stack: d5298fb8 d5298f90 00000000 c0135b19 d5298f90 d8e2b5b8 d6004bf4 d6fa68ac
> > Sep 19 18:59:20 www2 kernel: c013ed0f d5298f90 ded2be58 d6fa68ac d6fa6900 d5298f90 0fa9b000 bd7fed90
> > Sep 19 18:59:20 www2 kernel: 00000000 00000000 def6a780 ded2be58 ded2be38 00000000 df84e300 c013fdf4
> > Sep 19 18:59:20 www2 kernel: Call Trace:
> > Sep 19 18:59:20 www2 kernel: [vma_prio_tree_insert+37/44] vma_prio_tree_insert+0x25/0x2c
> > Sep 19 18:59:20 www2 kernel: [vma_adjust+363/808] vma_adjust+0x16b/0x328
> > Sep 19 18:59:20 www2 kernel: [split_vma+252/264] split_vma+0xfc/0x108
> > Sep 19 18:59:20 www2 kernel: [do_munmap+126/280] do_munmap+0x7e/0x118
> > Sep 19 18:59:20 www2 kernel: [sys_munmap+56/88] sys_munmap+0x38/0x58
> > Sep 19 18:59:20 www2 kernel: [syscall_call+7/11] syscall_call+0x7/0xb
> > Sep 19 18:59:20 www2 kernel: Code: 0f 0b 1a 02 9f 51 35 c0 8b 43 04 8b 7b 08 29 c7 89 f8 c1 e8
> >
> > This happens with 2.6.8.1 as well as with earlier kernels.
> > This is a Quad-Xeon-Box from Dell (SMP kernel).
> > I've no idea what is causing these failures and didn't found anything useful
> > by searching. Therefore I would be grateful for any help.
>
>
On Mon, Sep 20, 2004 at 11:14:53AM -0400, Rajesh Venkatasubramanian wrote:
> The only chance I see for this to happen: we are changing vm_start,
> vm_end, vm_pgoff of a vma that is already in an i_mmap tree
> without holding the corresponding i_mmap_lock. The last time I
> did an audit, all such changes were inside the i_mmap_lock.
they should yes. Only the anon_vma_lock can be dropped before vm_end
modifications (but only for vm_end, vm_start and vm_pgoff updates still
needs the anon_vma_lock hold to be coherent with vm_pgoff reads).
operations on memory-seeking data structures like lists and trees tends
to be the most likely to show bad ram, we had it for some time in the
dcache layer for the dcache shrinking (still todate AFIK they were all
hardware issues), this could be a similar case.
though I agree the same BUG_ON triggering worth a second check.
Thanks a lot!
Currently we use prio_tree_root->index_bits to optimize the height
of both the initial heap-and-radix indexed levels of a prio_tree
as well as the heap-and-size indexed overflow-sub-trees. Please see
the accompanying prio_tree documentation patch for further details.
When index_bits is increased in prio_tree_expand we shuffle the
initial heap-and-radix indexed levels to make sure that vmas are
placed in the tree at appropriate places. Similarly, since the
overflow-sub-trees' heights also depend on prio_tree_root->index_bits
we should shuffle all the overflow-sub-trees when index_bits changes.
However, I missed to take care of this in my implementation.
Recently Stefan Hornburg (Racke) reported the problem and patiently
tested the trace patches. Hugh Dickins produced the trace patches
that helped to detect the bug. Moreover, Hugh reduced the crash
test case to few lines of code. Thanks to both of them.
The easy fix is to disable prio_tree_expand code completely. That
may lead to skewed trees in some common cases. Hence, this patch
takes a different approach.
This patch fixes the problem by not optimizing the height of the
overflow-sub-trees using prio_tree_root->index_bits. Now all
overflow-sub-trees use full BITS_PER_LONG bits of size_index to place
the vmas (that have the same start_vm_pgoff) in an overflow-sub-tree.
This may result in skewed overflow-sub-trees because all bits in
vm_pgoff above prio_tree_root->index_bits will be 0 (zero). However,
processes rarely map many vmas with the same start_vm_pgoff and different
end_vm_pgoff. Therefore, such skewed sub-trees should be very rare.
Signed-off-by: Rajesh Venkatasubramanian <[email protected]>
mm/prio_tree.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff -puN mm/prio_tree.c~prio_expand_fix mm/prio_tree.c
--- linux-2.6.9/mm/prio_tree.c~prio_expand_fix 2004-10-31 18:27:09.000000000 -0500
+++ linux-2.6.9-jaya/mm/prio_tree.c 2004-10-31 18:27:09.000000000 -0500
@@ -245,7 +245,7 @@ static struct prio_tree_node *prio_tree_
mask >>= 1;
if (!mask) {
- mask = 1UL << (root->index_bits - 1);
+ mask = 1UL << (BITS_PER_LONG - 1);
size_flag = 1;
}
}
@@ -334,7 +334,7 @@ static struct prio_tree_node *prio_tree_
iter->mask = ULONG_MAX;
} else {
iter->size_level = 1;
- iter->mask = 1UL << (iter->root->index_bits - 1);
+ iter->mask = 1UL << (BITS_PER_LONG - 1);
}
}
return iter->cur;
@@ -376,7 +376,7 @@ static struct prio_tree_node *prio_tree_
iter->mask = ULONG_MAX;
} else {
iter->size_level = 1;
- iter->mask = 1UL << (iter->root->index_bits - 1);
+ iter->mask = 1UL << (BITS_PER_LONG - 1);
}
}
return iter->cur;
_
Add prio_tree.c documentation.
Signed-off-by: Rajesh Venkatasubramanian <[email protected]>
Documentation/prio_tree.txt | 107 ++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 107 insertions(+)
diff -puN /dev/null Documentation/prio_tree.txt
--- /dev/null 2003-01-30 05:24:37.000000000 -0500
+++ linux-2.6.9-jaya/Documentation/prio_tree.txt 2004-10-31 18:27:12.000000000 -0500
@@ -0,0 +1,107 @@
+The prio_tree.c code indexes vmas using 3 different indexes:
+ * heap_index = vm_pgoff + vm_size_in_pages : end_vm_pgoff
+ * radix_index = vm_pgoff : start_vm_pgoff
+ * size_index = vm_size_in_pages
+
+A regular radix-priority-search-tree indexes vmas using only heap_index and
+radix_index. The conditions for indexing are:
+ * ->heap_index >= ->left->heap_index &&
+ ->heap_index >= ->right->heap_index
+ * if (->heap_index == ->left->heap_index)
+ then ->radix_index < ->left->radix_index;
+ * if (->heap_index == ->right->heap_index)
+ then ->radix_index < ->right->radix_index;
+ * nodes are hashed to left or right subtree using radix_index
+ similar to a pure binary radix tree.
+
+A regular radix-priority-search-tree helps to store and query
+intervals (vmas). However, a regular radix-priority-search-tree is only
+suitable for storing vmas with different radix indices (vm_pgoff).
+
+Therefore, the prio_tree.c extends the regular radix-priority-search-tree
+to handle many vmas with the same vm_pgoff. Such vmas are handled in
+2 different ways: 1) All vmas with the same radix _and_ heap indices are
+linked using vm_set.list, 2) if there are many vmas with the same radix
+index, but different heap indices and if the regular radix-priority-search
+tree cannot index them all, we build an overflow-sub-tree that indexes such
+vmas using heap and size indices instead of heap and radix indices. For
+example, in the figure below some vmas with vm_pgoff = 0 (zero) are
+indexed by regular radix-priority-search-tree whereas others are pushed
+into an overflow-subtree. Note that all vmas in an overflow-sub-tree have
+the same vm_pgoff (radix_index) and if necessary we build different
+overflow-sub-trees to handle each possible radix_index. For example,
+in figure we have 3 overflow-sub-trees corresponding to radix indices
+0, 2, and 4.
+
+In the final tree the first few (prio_tree_root->index_bits) levels
+are indexed using heap and radix indices whereas the overflow-sub-trees below
+those levels (i.e. levels prio_tree_root->index_bits + 1 and higher) are
+indexed using heap and size indices. In overflow-sub-trees the size_index
+is used for hashing the nodes to appropriate places.
+
+Now, an example prio_tree:
+
+ vmas are represented [radix_index, size_index, heap_index]
+ i.e., [start_vm_pgoff, vm_size_in_pages, end_vm_pgoff]
+
+level prio_tree_root->index_bits = 3
+-----
+ _
+ 0 [0,7,7] |
+ / \ |
+ ------------------ ------------ | Regular
+ / \ | radix priority
+ 1 [1,6,7] [4,3,7] | search tree
+ / \ / \ |
+ ------- ----- ------ ----- | heap-and-radix
+ / \ / \ | indexed
+ 2 [0,6,6] [2,5,7] [5,2,7] [6,1,7] |
+ / \ / \ / \ / \ |
+ 3 [0,5,5] [1,5,6] [2,4,6] [3,4,7] [4,2,6] [5,1,6] [6,0,6] [7,0,7] |
+ / / / _
+ / / / _
+ 4 [0,4,4] [2,3,5] [4,1,5] |
+ / / / |
+ 5 [0,3,3] [2,2,4] [4,0,4] | Overflow-sub-trees
+ / / |
+ 6 [0,2,2] [2,1,3] | heap-and-size
+ / / | indexed
+ 7 [0,1,1] [2,0,2] |
+ / |
+ 8 [0,0,0] |
+ _
+
+Note that we use prio_tree_root->index_bits to optimize the height
+of the heap-and-radix indexed tree. Since prio_tree_root->index_bits is
+set according to the maximum end_vm_pgoff mapped, we are sure that all
+bits (in vm_pgoff) above prio_tree_root->index_bits are 0 (zero). Therefore,
+we only use the first prio_tree_root->index_bits as radix_index.
+Whenever index_bits is increased in prio_tree_expand, we shuffle the tree
+to make sure that the first prio_tree_root->index_bits levels of the tree
+is indexed properly using heap and radix indices.
+
+We do not optimize the height of overflow-sub-trees using index_bits.
+The reason is: there can be many such overflow-sub-trees and all of
+them have to be suffled whenever the index_bits increases. This may involve
+walking the whole prio_tree in prio_tree_insert->prio_tree_expand code
+path which is not desirable. Hence, we do not optimize the height of the
+heap-and-size indexed overflow-sub-trees using prio_tree->index_bits.
+Instead the overflow sub-trees are indexed using full BITS_PER_LONG bits
+of size_index. This may lead to skewed sub-trees because most of the
+higher significant bits of the size_index are likely to be be 0 (zero). In
+the example above, all 3 overflow-sub-trees are skewed. This may marginally
+affect the performance. However, processes rarely map many vmas with the
+same start_vm_pgoff but different end_vm_pgoffs. Therefore, we normally
+do not require overflow-sub-trees to index all vmas.
+
+From the above discussion it is clear that the maximum height of
+a prio_tree can be prio_tree_root->index_bits + BITS_PER_LONG.
+However, in most of the common cases we do not need overflow-sub-trees,
+so the tree height in the common cases will be prio_tree_root->index_bits.
+
+It is fair to mention here that the prio_tree_root->index_bits
+is increased on demand, however, the index_bits is not decreased when
+vmas are removed from the prio_tree. That's tricky to do. Hence, it's
+left as a home work problem.
+
+
_