2008-06-16 05:25:33

by howard chen

Subject: [NFS] Sudden high load average and abnormal behavior

Hi,

I have a dedicated NFS server running on RAID 5 disks, and I recently
observed a sudden increase in load average along with some abnormal
behavior (e.g. the command "df -h" hangs without returning).

Dell OpenManage reports that the hardware is okay. The load average
used to be around 3 to 4.


Some info that might be useful:


>> top

top - 13:17:53 up 382 days, 23:44, 6 users, load average: 20.53, 20.21, 18.93
Tasks: 286 total, 1 running, 285 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1% us, 1.1% sy, 0.0% ni, 68.4% id, 29.9% wa, 0.0% hi, 0.5% si
Mem: 4045256k total, 4028028k used, 17228k free, 437428k buffers
Swap: 9775512k total, 160k used, 9775352k free, 2814332k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2049 root 15 0 0 0 0 S 1 0.0 861:21.26 kjournald
26094 root 15 0 0 0 0 S 0 0.0 85:02.82 nfsd
26106 root 15 0 0 0 0 S 0 0.0 83:49.86 nfsd
26110 root 15 0 0 0 0 S 0 0.0 84:33.23 nfsd
26124 root 15 0 0 0 0 S 0 0.0 84:37.47 nfsd
2839 root 16 0 6280 1172 780 R 0 0.0 0:00.02 top
..

>> iostat

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.06    0.00    1.34   21.60   77.00

Device:     tps  Blk_read/s  Blk_wrtn/s   Blk_read   Blk_wrtn
sda      114.89        4.33       18.05  143391021  597126208
sda1       1.07        0.69        8.26   22771290  273100496
sda2       0.00        0.00        0.00          2          0
sda5       0.00        0.00        0.00       1010        408
sda6     110.49        3.63        9.79  119979495  323992464
dm-0       0.58        2.91        3.22   96295602  106444120
dm-1       0.55        0.60        4.31   19996266  142435600
dm-2       0.02        0.08        0.18    2673626    5953184
dm-3     109.53        1.52        2.09   50389354   69192400


>> df -h

Filesystem Size Used Avail Use% Mounted on
/dev/sda1 9.7G 1.6G 7.6G 18% /
none 2.0G 0 2.0G 0% /dev/shm
/dev/mapper/lvm01-lvm01_usr
20G 1.5G 18G 8% /usr
/dev/mapper/lvm01-lvm01_var
9.9G 327M 9.1G 4% /var
/dev/mapper/lvm01-lvm01_home
9.9G 56M 9.3G 1% /home
/dev/mapper/lvm01-lvm01_data0
492G 285G 182G 62% /data0

# !!! == The command stopped here without returning == !!!



Any idea?

Howard

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
_______________________________________________
Please note that [email protected] is being discontinued.
Please subscribe to [email protected] instead.
http://vger.kernel.org/vger-lists.html#linux-nfs



2008-06-16 16:08:09

by howard chen

Subject: Re: [NFS] Sudden high load average and abnormal behavior

Hi

On Mon, Jun 16, 2008 at 11:18 PM, Wendy Cheng <[email protected]> wrote:
> howard chen wrote:
> This will write all the thread backtraces into the system log file
> /var/log/messages so people can get a rough idea of what is going wrong.
> The *trick* here is to make sure /var/log/messages doesn't live on
> the particular filesystem that has the high load issue (otherwise the
> writes to /var/log/messages will hang as well). So you may want to
> put /var on a separate filesystem. Remember each ext3 filesystem
> has its own kjournald (again, I have not touched ext3 for a while, so this
> is from old memory).
>
> Another option is to google whether other people on the same kernel
> level have the same issue as yours and pull their fix into your system.
> However, that is more of a long shot (since you'd be guessing).
>
> -- Wendy

Thanks.

I will run some more detailed tests.

Howard



2008-06-16 15:15:04

by Wendy Cheng

Subject: Re: [NFS] Sudden high load average and abnormal behavior

howard chen wrote:
>
>
> top - 13:17:53 up 382 days, 23:44, 6 users, load average: 20.53, 20.21, 18.93
> Tasks: 286 total, 1 running, 285 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.1% us, 1.1% sy, 0.0% ni, 68.4% id, 29.9% wa, 0.0% hi, 0.5% si
> Mem: 4045256k total, 4028028k used, 17228k free, 437428k buffers
> Swap: 9775512k total, 160k used, 9775352k free, 2814332k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2049 root 15 0 0 0 0 S 1 0.0 861:21.26 kjournald
> 26094 root 15 0 0 0 0 S 0 0.0 85:02.82 nfsd
> 26106 root 15 0 0 0 0 S 0 0.0 83:49.86 nfsd
> 26110 root 15 0 0 0 0 S 0 0.0 84:33.23 nfsd
> 26124 root 15 0 0 0 0 S 0 0.0 84:37.47 nfsd
> 2839 root 16 0 6280 1172 780 R 0 0.0 0:00.02 top
>

I haven't used ext3 for a very long time, so I'm not sure whether things
have changed. IIRC, if kjournald is up and running (implying ext3 is
flushing its data to disk), it holds the journal lock, so access to
that particular filesystem is temporarily suspended. So the thing to
check here is why kjournald takes such a long time to do the flushing.
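
As a rough first check before going after kjournald itself: Linux counts
tasks in uninterruptible sleep (state D) toward the load average, so a
load of 20 on mostly idle CPUs usually means many tasks are blocked on
I/O. The sketch below lists those tasks. It assumes the procps version
of ps; the helper names filter_d_state and list_d_state are made up for
illustration and do not come from any existing tool.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Keep only lines whose first field (the task state) starts with "D",
# i.e. uninterruptible sleep, which is usually a task waiting on I/O.
filter_d_state() {
    awk '$1 ~ /^D/'
}

# Print state, PID, kernel wait channel and command name for every task,
# without headers, then keep the D-state ones. Assumes procps ps.
list_d_state() {
    ps -e -o state= -o pid= -o wchan= -o comm= | filter_d_state
}

# Usage: list_d_state
```

If most of the listed tasks are nfsd threads or kjournald, that would be
consistent with the journal flush being the bottleneck, though it does
not prove it.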

Normally we want to see the thread backtrace of "kjournald" by asking
for a "sysrq-t" output via:

shell> cd /proc
shell> echo t > sysrq-trigger

This will write all the thread backtraces into the system log file
/var/log/messages so people can get a rough idea of what is going
wrong. The *trick* here is to make sure /var/log/messages
doesn't live on the particular filesystem that has the high load issue
(otherwise the writes to /var/log/messages will hang as well). So
you may want to put /var on a separate filesystem. Remember
each ext3 filesystem has its own kjournald (again, I have not touched
ext3 for a while, so this is from old memory).
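
The precaution above can be sketched roughly as follows: find out which
filesystem holds /var/log, trigger the dump, and read the traces back
from the kernel ring buffer with dmesg, which does not depend on any
disk write. This is only a sketch under assumptions: the helper names
are made up for illustration, the path passed in is a placeholder for
whichever filesystem is suspected, and the ring buffer is finite, so a
long dump may be cut off at the start.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Read "df -P" output on stdin and print the mount point from the first
# data line. Assumes the mount point contains no spaces.
df_mount_column() {
    awk 'NR == 2 { print $6 }'
}

# Print the mount point of the filesystem that holds the given path.
mount_point_of() {
    df -P "$1" | df_mount_column
}

# Show where the log directory lives next to the suspect filesystem,
# request the task dump, then print the tail of the kernel ring buffer.
# Must run as root. $1 is the mount point you suspect (a placeholder).
dump_task_traces() {
    echo "/var/log is on: $(mount_point_of /var/log)"
    echo "suspect filesystem: $1"
    echo t > /proc/sysrq-trigger
    dmesg | tail -n 300
}

# Usage (as root): dump_task_traces /path/to/suspect/mount
```

If the two mount points printed at the top are the same, the copy in
/var/log/messages may never be written, and the dmesg output is the one
to save.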

Another option is to google whether other people on the same kernel
level have the same issue as yours and pull their fix into your
system. However, that is more of a long shot (since you'd be
guessing).

-- Wendy


