Hi,
After a kernel upgrade from 2.6.11 to 2.6.15.4, we have been experiencing
crashes on all four of our web servers. These web servers read their data
over NFSv3 from a NetApp server. The servers were under heavy load,
mostly reading, but also a lot of writing to NFS.
Hardware: Compaq ProLiant with two (physical) Xeon 2.4 GHz CPUs, 4 GB
memory, Broadcom Tigon3 network interfaces. The kernel config is appended
to this mail.
After one of the crashes, an administrator took a screenshot
(http://www.duempel.org/~max/linux/nfs_radix_tree_crash.png) and
rebooted. Unfortunately, part of the stack trace is missing (the
console shows only 25 lines), and I had no access to the KDB console.
I am waiting for the next crash so that I can provide more
information.
The BUG_ON() failed in lib/radix-tree.c:372 :
    slot = slot->slots[offset];
    BUG_ON(slot == NULL);
I believe the calls missing from the stack trace are
nfs_mark_request_dirty(), nfs_flush_one(), nfs_flush_list() and
nfs_flush_inode().
That would mean that the request at index req->wb_index was somehow
removed from nfsi->nfs_page_tree, maybe by another thread on another
CPU. I see that the spinlock nfsi->req_lock is only held for very short
periods. Is it possible that another CPU tries to flush the same NFS
write request while the first CPU is still in the middle of handling it?
Is there any other explanation?
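To make the suspected sequence concrete, here is a minimal user-space
sketch. It is not kernel code: toy_tree, toy_insert(), toy_delete(),
toy_tag_set() and toy_replay_race() are hypothetical names, the "tree"
is a single flat node, and the two CPUs are replayed one after the
other instead of running concurrently. It only shows that setting a tag
on an index whose entry has already been removed is the situation in
which the real code would run into BUG_ON(slot == NULL).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 64            /* hypothetical: one node's worth of slots */

/* Toy stand-in for nfsi->nfs_page_tree: one flat node, no locking. */
struct toy_tree {
    void *slots[TOY_SLOTS];
    bool  dirty[TOY_SLOTS];     /* stand-in for the "dirty" tag */
};

/* Store item at index (models adding a request to the tree). */
void toy_insert(struct toy_tree *t, unsigned long index, void *item)
{
    assert(index < TOY_SLOTS);
    t->slots[index] = item;
}

/* Remove and return the entry at index, or NULL if there is none. */
void *toy_delete(struct toy_tree *t, unsigned long index)
{
    void *item;

    if (index >= TOY_SLOTS)
        return NULL;
    item = t->slots[index];
    t->slots[index] = NULL;
    t->dirty[index] = false;
    return item;
}

/*
 * Set the dirty tag on an existing entry. Returns 0 on success and -1
 * if the slot is empty; an empty slot is the case where the kernel
 * would hit BUG_ON(slot == NULL) instead of returning.
 */
int toy_tag_set(struct toy_tree *t, unsigned long index)
{
    if (index >= TOY_SLOTS || t->slots[index] == NULL)
        return -1;
    t->dirty[index] = true;
    return 0;
}

/* Replay the suspected interleaving sequentially; returns -1 if it "crashes". */
int toy_replay_race(void)
{
    struct toy_tree t = {0};
    int req = 0;                    /* hypothetical request object */
    unsigned long wb_index = 5;     /* arbitrary page index */

    toy_insert(&t, wb_index, &req);      /* request is queued            */
    toy_delete(&t, wb_index);            /* "CPU 1" removes the request  */
    return toy_tag_set(&t, wb_index);    /* "CPU 0" marks it dirty later */
}
```

In this model toy_replay_race() returns -1, whereas tagging an entry
that is still present returns 0. If the hypothesis is right, the lookup
of the request and the later tag update would have to happen under the
same lock, or the tag update would have to tolerate a missing entry.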
Max
On 2006/03/02 14:05, Max Kellermann wrote:
> The BUG_ON() failed in lib/radix-tree.c:372 :
I just found the patches by Tony Griffiths and Neil Brown - I'm going
to try them now.
Max