2006-03-02 13:06:20

by Max Kellermann

[permalink] [raw]
Subject: 2.6.15.4: NFS-related BUG in radix_tree_tag_set()

Hi,

after a kernel upgrade from 2.6.11 to 2.6.15.4, we were experiencing
crashes on all four web servers. These web servers obtain their data
from NFSv3 from a NetApp server. The servers were under heavy load -
mostly reading, but also a lot of writing to NFS.

Hardware: Compaq ProLiant with two (physical) Xeon 2.4 CPUs, 4 GB
memory, Broadcom Tigon3 network interfaces. Kernel config is appended
to this mail.

After one of the crashes, an administrator made a screenshot
(http://www.duempel.org/~max/linux/nfs_radix_tree_crash.png) and
rebooted. Unfortunately, part of the stack trace is missing (25 lines
console only), and I had no access to the KDB console. I am currently
waiting for the next crash to happen so I can provide more
information.

The BUG_ON() failed in lib/radix-tree.c:372 :

slot = slot->slots[offset];
BUG_ON(slot == NULL);

I believe the missing stack trace calls are nfs_mark_request_dirty(),
nfs_flush_one(), nfs_flush_list(), nfs_flush_inode().

That would mean that req->wb_index was somehow removed from
nfsi->nfs_page_tree, maybe in another thread on another CPU? I see
the spinlock nfsi->req_lock is only held for very short timespans - is
it possible that another CPU tries to flush the same NFS write request
which is currently in the middle of being handled by the first CPU?
Any other explanation?

Max


Attachments:
(No filename) (1.32 kB)
.config.gz (6.84 kB)
Download all attachments

2006-03-02 13:33:09

by Max Kellermann

[permalink] [raw]
Subject: Re: 2.6.15.4: NFS-related BUG in radix_tree_tag_set()

On 2006/03/02 14:05, Max Kellermann <[email protected]> wrote:
> The BUG_ON() failed in lib/radix-tree.c:372 :

I just found the patches by Tony Griffiths and Neil Brown - I'm going
to try them now.

Max