LinuxLists.cc - NFS oopses on smp servers

2002-12-15 14:58:15

Subject: NFS oopses on smp servers

Hello,

we keep getting nfs related oopses on our servers. we tried stock
2.4.19 and 2.4.20, ac kernels, aa kernels, with and without different
mixes of trond and neilb patches.
the setup is:
- intel 4-way smp general-purpose servers with debian 3.0
- intel and sparc fileservers with solaris8
- intel workstations with solaris7/8, redhat 7.2/7.3 and debian 3.0

the workload is mostly software development. developers are running
simultaneous builds on our general-purpose servers, accessing a
multitude of files exported from fileservers and workstations in
parallel. there's no nfsd running on the workservers. we use autofs with
no special mount options, so we get
rw,nosuid,v3,rsize=8192,wsize=8192,hard,intr,udp,lock
for linux exports and
rw,nosuid,v3,rsize=32768,wsize=32768,hard,intr,udp,lock
for solaris exports.

oopses are hard to track: they happen once or twice a week on a random
server with nothing unusual in the workload or logs prior to it. none of
our tests (high load, network disconnects, lost packets, etc.)
triggered the problem, so we can't provide a test case.

when we had nfs compiled as a module (autoloaded) we got these oopses
(ksymoops from kern.log):
==================================================================
Nov 14 11:44:43 server kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Nov 14 11:44:43 server kernel: f8a4d441
Nov 14 11:44:43 server kernel: *pde = 00000000
Nov 14 11:44:43 server kernel: Oops: 0000
Nov 14 11:44:43 server kernel: CPU: 2
Nov 14 11:44:43 server kernel: EIP: 0010:[nfs:__insmod_nfs_S.text_L62016+21473/62016] Not tainted
Nov 14 11:44:43 server kernel: EFLAGS: 00010246
Nov 14 11:44:43 server kernel: eax: 00000000 ebx: e7c48f80 ecx: e7c48f88 edx: e7c48f88
Nov 14 11:44:43 server kernel: esi: f1ac9e00 edi: f28b30fc ebp: e7c48f80 esp: f1ac9de0
Nov 14 11:44:43 server kernel: ds: 0018 es: 0018 ss: 0018
Nov 14 11:44:43 server kernel: Process test1 (pid: 10621, stackpage=f1ac9000)
Nov 14 11:44:43 server kernel: Stack: 00000000 f8a4dabc e7c48f80 e7c48f80 ec8897c0 f28b30fc 00000000 f1ac8000
Nov 14 11:44:43 server kernel: f1ac9e00 f1ac9e00 f8a4d2e8 f28b30fc 00000000 ebb1a780 00000000 ebb1a938
Nov 14 11:44:43 server kernel: f8a5204e f8a507c8 00000000 ebb1a780 c418f168 00000000 00001000 c418f168
Nov 14 11:44:43 server kernel: Call Trace: [nfs:__insmod_nfs_S.text_L62016+23132/62016] [nfs:__insmod_nfs_S.text_L62016+21128/62016] [nfs:__insmod_nfs_S.text_L62016+40942/62016] [nfs:__insmod_nfs_S.text_L62016+34664/62016] [nfs:__insmod_nfs_S.text_L62016+32758/62016]
Nov 14 11:44:43 server kernel: Code: 8b 00 85 c0 7d 08 0f 0b a9 00 57 86 a5 f8 53 e8 0b ff ff ff
Using defaults from ksymoops -t elf32-i386 -a i386

>>ebx; e7c48f80 <_end+2790d20c/386b428c>
>>ecx; e7c48f88 <_end+2790d214/386b428c>
>>edx; e7c48f88 <_end+2790d214/386b428c>
>>esi; f1ac9e00 <_end+3178e08c/386b428c>
>>edi; f28b30fc <_end+32577388/386b428c>
>>ebp; e7c48f80 <_end+2790d20c/386b428c>
>>esp; f1ac9de0 <_end+3178e06c/386b428c>

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: 8b 00 mov (%eax),%eax
Code; 00000002 Before first symbol
2: 85 c0 test %eax,%eax
Code; 00000004 Before first symbol
4: 7d 08 jge e <_EIP+0xe> 0000000e Before first symbol
Code; 00000006 Before first symbol
6: 0f 0b ud2a
Code; 00000008 Before first symbol
8: a9 00 57 86 a5 test $0xa5865700,%eax
Code; 0000000d Before first symbol
d: f8 clc
Code; 0000000e Before first symbol
e: 53 push %ebx
Code; 0000000f Before first symbol
f: e8 0b ff ff ff call ffffff1f <_EIP+0xffffff1f> ffffff1f <END_OF_CODE+75a5ba4/????>

Nov 15 18:58:33 server kernel: 3136MB HIGHMEM available.
Nov 15 18:58:34 server kernel: cpu: 0, clocks: 1002260, slice: 200452
Nov 15 18:58:34 server kernel: cpu: 1, clocks: 1002260, slice: 200452
Nov 15 18:58:34 server kernel: cpu: 2, clocks: 1002260, slice: 200452
Nov 15 18:58:34 server kernel: cpu: 3, clocks: 1002260, slice: 200452
Nov 15 18:58:34 server kernel: Receiver lock-up bug exists -- enabling work-around.
Nov 15 18:58:34 server kernel: e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
==================================================================

when we compiled nfs into the kernel we started getting these oopses:
==================================================================
Dec 12 04:09:46 server kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Dec 12 04:09:46 server kernel: c018d381
Dec 12 04:09:46 server kernel: *pde = 00000000
Dec 12 04:09:46 server kernel: Oops: 0000 2.4.20-aa1 #1 SMP Thu Dec 5 12:01:04 GMT 2002
Dec 12 04:09:46 server kernel: CPU: 0
Dec 12 04:09:46 server kernel: EIP: 0010:[nfs_release_request+137/180] Not tainted
Dec 12 04:09:46 server kernel: EFLAGS: 00010246
Dec 12 04:09:46 server kernel: eax: 00000000 ebx: dab0f240 ecx: dab0f248 edx: dab0f248
Dec 12 04:09:46 server kernel: esi: c68fb4fc edi: c68fb4fc ebp: dab0f240 esp: edfd9e4c
Dec 12 04:09:46 server kernel: ds: 0018 es: 0018 ss: 0018
Dec 12 04:09:46 server kernel: Process test2 (pid: 19656, stackpage=edfd9000)
Dec 12 04:09:46 server kernel: Stack: 00000000 c018da0c dab0f240 dab0f240 c82d3700 c68fb4fc 00000000 edfd8000
Dec 12 04:09:46 server kernel: edfd9e6c edfd9e6c c018d228 c68fb4fc 00000000 00000000 00000000 d83d4bb8
Dec 12 04:09:46 server kernel: c544f4d4 c01906c8 d3962c20 d83d4a00 c14ff2e0 00000000 00000200 d4cd3ce0
Dec 12 04:09:46 server kernel: Call Trace: [nfs_try_to_free_pages+268/288] [nfs_create_request+168/288] [nfs_update_request+544/828] [nfs_updatepage+165/516] [nfs_commit_write+63/108]
Dec 12 04:09:46 server kernel: Code: 8b 00 85 c0 7d 08 0f 0b a9 00 52 0c 2a c0 53 e8 0b ff ff ff

>>ebx; dab0f240 <END_OF_CODE+13fecfbd/????>
>>ecx; dab0f248 <END_OF_CODE+13fecfc5/????>
>>edx; dab0f248 <END_OF_CODE+13fecfc5/????>
>>esi; c68fb4fc <_end+651f120/6723c24>
>>edi; c68fb4fc <_end+651f120/6723c24>
>>ebp; dab0f240 <END_OF_CODE+13fecfbd/????>
>>esp; edfd9e4c <END_OF_CODE+274b7bc9/????>

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: 8b 00 mov (%eax),%eax
Code; 00000002 Before first symbol
2: 85 c0 test %eax,%eax
Code; 00000004 Before first symbol
4: 7d 08 jge e <_EIP+0xe> 0000000e Before first symbol
Code; 00000006 Before first symbol
6: 0f 0b ud2a
Code; 00000008 Before first symbol
8: a9 00 52 0c 2a test $0x2a0c5200,%eax
Code; 0000000d Before first symbol
d: c0 53 e8 0b rclb $0xb,0xffffffe8(%ebx)
Code; 00000011 Before first symbol
11: ff (bad)
Code; 00000012 Before first symbol
12: ff (bad)
Code; 00000013 Before first symbol
13: ff 00 incl (%eax)

Dec 15 11:06:01 server kernel: 3136MB HIGHMEM available.
Dec 15 11:06:01 server kernel: cpu: 0, clocks: 1002300, slice: 200460
Dec 15 11:06:01 server kernel: cpu: 1, clocks: 1002300, slice: 200460
Dec 15 11:06:01 server kernel: cpu: 3, clocks: 1002300, slice: 200460
Dec 15 11:06:01 server kernel: cpu: 2, clocks: 1002300, slice: 200460
Dec 15 11:06:01 server kernel: Receiver lock-up bug exists -- enabling work-around.
Dec 15 11:06:01 server kernel: e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
==================================================================

-------------------------------------------------------
This sf.net email is sponsored by:
With Great Power, Comes Great Responsibility
Learn to use your power at OSDN's High Performance Computing Channel
http://hpc.devchannel.org/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2002-12-16 14:38:08

by Bernhard Kaindl

Subject: Re: NFS oopses on smp servers

On Sun, 15 Dec 2002, Peter Lojkin wrote:

> we keep getting nfs related oopses on our servers. we tried stock
[Trace reformatted for better reading:]
> EIP: 0010:[nfs_release_request+137/180]
> [nfs_try_to_free_pages+268/288]
> [nfs_create_request+168/288]
> [nfs_update_request+544/828]
> [nfs_updatepage+165/516]
> [nfs_commit_write+63/108]
> Code: 8b 00 85 c0 7d 08 0f 0b a9 00 52 0c 2a c0 53 e8 0b ff ff ff

Hi,
we've nailed an identical-looking oops, which we were also able to reproduce
on an 8-CPU machine running the NFS client under extreme load. It
also happened with 2.4.20-NFS-ALL without any additional patches.

Chris Mason found the cause of this oops; I've attached his patch with
an enhanced description from me. I've only tested the other, smaller diff
(the 2nd attachment) extensively, but the patch from Chris should
be ok as well.

Configuration:

NFS client: Dell PowerEdge 6650, eight (8) 1.6GHz Xeon processors, 4GB RAM
NFS server: Network Appliance NetApp F825 (storage appliance, high-perf)
Network: Direct crossover 1000Mbit Ethernet without switch, using fiber.

With this configuration I was able to trigger the oops within one hour of starting
an extreme load test that reached a load of >45 on the NFS client, which was doing
only NFS work (setting up ~1GB chroot environments and compiling in them, many
of them in parallel).

Problem description:

The problem is that nfs_clear_request and nfs_release_request
contain accesses to the file's inode which are done after the
fput of the file which could have freed the inode also.

In extreme circumstances the inode pointer used for these accesses
could point to reused/cleared memory which would lead to an oops.

Fix discussion:

The idea of the fix is not to fput the file until the request
is completely gone.

The fput needs to be done after the last access to the inode
because the fput could free the inode also.

There are two different ways to do it, one would be to change
nfs_clear_request to not do an fput at all, and do the fput in
nfs_release_request.

But since nfs_release_request calls nfs_clear_request, it looks
like nfs_clear_request wasn't intended to be called separately
at a few places before nfs_release_request is called.

So the calls to nfs_clear_request before nfs_release_request
should be removed; nfs_release_request calls nfs_clear_request
itself at the right time, after the last cleanup and the consistency
checks.

To achieve this completely, nfs_clear_request must be also
reordered to do the fput last since the unpatched code accesses
the inode after the fput which can also lead to an oops on SMP
under extreme conditions.

Patch Description:
------------------

Remove the calls to nfs_clear_request before nfs_release_request,
to avoid freeing the file in nfs_clear_request because this could
also free the inode which is then accessed in nfs_release_request
directly afterwards.

Also reorder nfs_clear_request to do the fput last, since
the unpatched code accesses the inode after the fput.

This patch also changes nfs_clear_request to static and removes
the extern declaration, because it's not called from outside of
pagelist.o anymore after the diff is applied.

Possible Improvement after applying the patch:
----------------------------------------------

In a second step, the code from nfs_clear_request could
be moved to nfs_release_request to reduce code size and
number of instructions since nfs_release_request should
be the only caller of nfs_clear_request then.

Best Regards,
Bernhard Kaindl
UnitedLinux Development
SuSE Linux - http://www.suse.com

PS: Other sample traces from this test/oops (different kernels)

>>EIP; c0197001 <nfs_release_request+101/130> <=====
Trace; c0197616 <nfs_try_to_free_pages+36/240>
Trace; c0196e21 <nfs_create_request+b1/120>
Trace; c019b006 <nfs_update_request+126/490>
Trace; c019b3cc <nfs_strategy+5c/70>
Trace; c019b578 <nfs_updatepage+c8/2c0>
Trace; c0192c72 <nfs_commit_write+72/d0>
Trace; c013cf15 <generic_file_write+495/800>
Trace; c0192dc8 <nfs_file_write+98/100>
Trace; c014da67 <sys_write+97/1d0>
Trace; c010984f <system_call+33/38>

>>EIP; c01a48ce <nfs_release_request+8e/c0> <=====
Trace; c01a4ef6 <nfs_try_to_free_pages+36/150>
Trace; c02613cc <skb_copy_datagram_iovec+4c/280>
Trace; c01a4768 <nfs_create_request+a8/110>
Trace; c025f0f6 <__kfree_skb+106/170>
Trace; c01a85d1 <nfs_update_request+c1/350>
Trace; c01a8a3e <nfs_updatepage+9e/270>
Trace; c0143067 <do_generic_file_write+447/7e0>
Trace; c025b098 <sock_recvmsg+58/f0>
Trace; c014349b <generic_file_write+9b/d0>
Trace; c01a0aab <nfs_file_write+bb/140>
Trace; c0154c47 <sys_write+97/140>
Trace; c01095ef <system_call+33/38>

Code; c01a48ce <nfs_release_request+8e/c0>
00000000 <_EIP>:
Code; c01a48ce <nfs_release_request+8e/c0> <=====
0: 8b 00 mov (%eax),%eax <=====
Code; c01a48d0 <nfs_release_request+90/c0>
2: 85 c0 test %eax,%eax
Code; c01a48d2 <nfs_release_request+92/c0>
4: 78 1e js 24 <_EIP+0x24> c01a48f2 <nfs_release_request+b2/c0>
Code; c01a48d4 <nfs_release_request+94/c0>
6: 89 1c 24 mov %ebx,(%esp,1)
Code; c01a48d7 <nfs_release_request+97/c0>
9: e8 f4 fe ff ff call ffffff02 <_EIP+0xffffff02> c01a47d0 <nfs_clear_request+0/70>
Code; c01a48dc <nfs_release_request+9c/c0>
e: a1 04 15 42 c0 mov 0xc0421504,%eax
Code; c01a48e1 <nfs_release_request+a1/c0>
13: 89 00 mov %eax,(%eax)

Attachments:

mason-nfs_clear_request-fput (4.31 kB)
nfs_clear_request-fput (2.34 kB)

2002-12-16 15:34:49

by Trond Myklebust

Subject: Re: NFS oopses on smp servers

>>>>> " " == Bernhard Kaindl <[email protected]> writes:

> The idea of the fix is not to fput the file until the
> request is completely gone.

No. There's a very good reason for fputting the file ASAP, and that is
that in the case of an open() for write, the struct file grabs a lease.
If you defer fputting the file until the request is gone, then you'll see
ETXTBSY errors when trying to execute a file immediately after
linking.

Try the appended patch instead:

Cheers,
Trond

--- linux-2.4.20-smp/fs/nfs/pagelist.c.orig Fri Nov 29 00:53:15 2002
+++ linux-2.4.20-smp/fs/nfs/pagelist.c Mon Dec 16 16:32:05 2002
@@ -126,6 +126,7 @@
 {
 	/* Release struct file or cached credential */
 	if (req->wb_file) {
+		atomic_dec(&NFS_REQUESTLIST(req->wb_inode)->nr_requests);
 		fput(req->wb_file);
 		req->wb_file = NULL;
 	}
@@ -136,7 +137,6 @@
 	if (req->wb_page) {
 		page_cache_release(req->wb_page);
 		req->wb_page = NULL;
-		atomic_dec(&NFS_REQUESTLIST(req->wb_inode)->nr_requests);
 	}
 }

@@ -165,8 +165,6 @@
 		BUG();
 	if (NFS_WBACK_BUSY(req))
 		BUG();
-	if (atomic_read(&NFS_REQUESTLIST(req->wb_inode)->nr_requests) < 0)
-		BUG();
 #endif

 	/* Release struct file or cached credential */

2002-12-16 16:58:49

by Trond Myklebust

Subject: Re: NFS oopses on smp servers

>>>>> " " == Trond Myklebust <[email protected]> writes:

> Try the appended patch instead:

Duh: typo...

Cheers,
Trond

--- linux-2.4.20-smp/fs/nfs/pagelist.c.orig Fri Nov 29 00:53:15 2002
+++ linux-2.4.20-smp/fs/nfs/pagelist.c Mon Dec 16 17:36:56 2002
@@ -134,9 +134,9 @@
 		req->wb_cred = NULL;
 	}
 	if (req->wb_page) {
+		atomic_dec(&NFS_REQUESTLIST(req->wb_inode)->nr_requests);
 		page_cache_release(req->wb_page);
 		req->wb_page = NULL;
-		atomic_dec(&NFS_REQUESTLIST(req->wb_inode)->nr_requests);
 	}
 }

@@ -165,8 +165,6 @@
 		BUG();
 	if (NFS_WBACK_BUSY(req))
 		BUG();
-	if (atomic_read(&NFS_REQUESTLIST(req->wb_inode)->nr_requests) < 0)
-		BUG();
 #endif

 	/* Release struct file or cached credential */

2002-12-26 14:31:47

by Peter Lojkin

Subject: Re: NFS oopses on smp servers

Hello,

we were unable to trigger the oops with this patch, so we started
installing it on the failing servers. none of the updated servers has
failed with this oops (the first server was updated 8 days ago), so i
guess this patch solves the problem. big thanks to all involved
for the quick help!

now we have other nfs-related oopses. these oopses happened on
both updated and non-updated servers, so this patch wasn't the
cause. see my next message to the list for full info...

-----Original Message-----
From: Trond Myklebust <[email protected]>
Date: 16 Dec 2002 17:58:30 +0100
Subject: Re: [NFS] NFS oopses on smp servers

> >>>>> " " == Trond Myklebust <[email protected]> writes:
>
> > Try the appended patch instead:
>
> Duh: typo...
>
> Cheers,
> Trond
[patch skipped]
