2004-09-08 05:41:07

by Yaroslav Halchenko

Subject: proc stalls

Dear All,

Please give me some hints on where to look and what I could do to
bring the system back to a usable state without rebooting, or at least
to pin down what the problem might be.

I discovered that mozilla-firefox would not start; it just hung during
startup. I then found that 'fuser -m /dev/dsp' causes it to stall, and
that in fact any fuser process stalls. Using strace, I found that they
stall around the following point:
getdents64(4, /* 0 entries */, 1024) = 0
close(4) = 0
chdir("/proc/26336") = 0
stat64("root", {st_mode=S_IFDIR|0755, st_size=720, ...}) = 0
lstat64("root", {st_mode=S_IFLNK|0777, st_size=0, ...}) = 0
stat64("cwd",

I also tried to cd to /proc/26336, and my bash froze as well.


Now my system load is around 20, probably due to all the unkillable fuser
processes I've run so far. They look like:
yoh 29083 0.0 0.0 1532 560 ? D 00:48 0:00 fuser -m /dev/dsp
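
The "D" in that ps line is the uninterruptible-sleep state. As a minimal
sketch (not part of the original report), such processes could be listed
by reading only /proc/<pid>/stat, which should not follow the cwd or root
links that block in the strace above. The function names here are
illustrative assumptions.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, right = stat_line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for processes currently in state D."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no procfs available at proc_root
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # the process exited, or its stat file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling uninterruptible_processes() on the affected machine would be
expected to return the stuck fuser processes, without touching the
per-process links that hang.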

Nothing abnormal is reported in syslog.


What can I do to find the cause, or to resolve the situation somehow
without a reboot? What other hints can I provide? I have put the main
system parameters and the Linux kernel config at

http://www.onerussian.com/Linux/bugs/bug.proc/

(Many other tools like df stall as well)

Thank you in advance

--
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers
Office (973) 353-5440 x263
Ph.D. Student CS Dept. NJIT
Key http://www.onerussian.com/gpg-yoh.asc
GPG fingerprint 3BB6 E124 0643 A615 6F00 6854 8D11 4563 75C0 24C8


2004-09-08 14:26:39

by Yaroslav Halchenko

Subject: Re: proc stalls

As I suspected yesterday, but did not report because it did not seem like
a very reasonable idea:

the problem was linked to an NFS-mounted directory becoming
unavailable. I have a host (belka) which exports a couple of
directories, which are NFS-mounted using am-utils:

ravana[0] ~>mount | grep belka
belka:/home/yoh.m on /amd/belka/root/home/yoh.m type nfs (rw,proto=tcp)

Last evening belka halted for some reason and that directory became
unavailable, with multiple error messages:
nfs_stat_to_errno: bad nfs status return value: 11
nfs_stat_to_errno: bad nfs status return value: 11
nfs_statfs: statfs error = 5
RPC: error 5 connecting to server belka

I tried to unmount it, but that didn't work.

This morning I saw that the hung fuser processes I had tried to kill
had finally died, but new ones still did not work correctly.
When I arrived at work and rebooted belka, ravana became fine again:
fuser, df, etc. work fine.

I'm running Debian with a vanilla 2.6.8-rc3-bk3 kernel and
ii nfs-common 1.0.6-3 NFS support files common to client and server
ii nfs-kernel-ser 1.0.6-3 Kernel NFS server support

I think that problems with NFS should not affect the rest of the
system this much. Maybe I should switch to the user-space NFS server.
Any ideas on how to debug this situation further, to avoid future
problems?
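
One way to notice such a dead mount without getting stuck oneself is to
do the stat in a child process and give up after a timeout. The following
is only a sketch under that assumption; probe_path and its 5-second
default are made-up names and values, not an existing tool.

```python
import subprocess
import sys


def probe_path(path, timeout_seconds=5.0):
    """Return 'ok', 'error', or 'unresponsive' for a stat of path."""
    # Run the stat in a separate interpreter so that only the child can
    # end up blocked in the kernel, never the calling process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return_code = child.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Request a kill but deliberately do not wait for the child again:
        # a process in state D may not exit until the server comes back.
        child.kill()
        return "unresponsive"
    return "ok" if return_code == 0 else "error"
```

For the mount shown above, probe_path("/amd/belka/root/home/yoh.m")
would be the call to try; an "unresponsive" result suggests the server
is not answering. A child that really is stuck stays around until the
kernel lets it go.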

--
Yarik



2004-09-08 15:58:37

by Tommy Reynolds

Subject: Re: proc stalls

Uttered Yaroslav Halchenko <[email protected]>, spake thus:

> that problem was linked to the fact that nfs-mounted directory became
> unavailable...
> Any ideas on how to further debug this situation to avoid future
> problems?

This is the required behavior for "hard" NFS mounts. NFS doesn't
deal with servers that drop off-line very well.

Perhaps you should use the "soft" and/or the "timeo=N" mount options. A
"soft" mount will not cause your client to hang if the server goes
away. Unfortunately, this also has implications for application
programs' assumptions about file integrity, but there you go.

HTH.

2004-09-08 18:52:45

by Denis Vlasenko

Subject: Re: proc stalls

On Wednesday 08 September 2004 18:56, Tommy Reynolds wrote:
> Uttered Yaroslav Halchenko <[email protected]>, spake thus:
> > that problem was linked to the fact that nfs-mounted directory became
> > unavailable...
> > Any ideas on how to further debug this situation to avoid future
> > problems?
>
> This is the required behavior for "hard" NFS mounts. NFS doesn't
> deal with servers that drop off-line very well.
>
> Perhaps you should use the "soft" and/or the "timeo=N" mount options. A
> "soft" mount will not cause your client to hang if the server goes
> away. Unfortunately, this also has implications for application
> programs' assumptions about file integrity, but there you go.

Think about some very important work being run on an NFS-mounted file,
with the server brought down while you're at lunch. I much prefer the
client to hang forever (i.e., no 'soft' option!), waiting for the admin
to take action.

I use 'hard,intr' so the admin can kill 'hung' processes if (s)he wants to.
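
To check which of these behaviours a given mount actually has, one could
look at its options in /proc/mounts. A rough sketch follows, assuming the
usual "device mountpoint fstype options dump pass" layout; the function
name and the sample line are illustrative, not taken from the reporter's
machine. A mount without 'soft' is treated as hard, since hard is the
default.

```python
def nfs_mount_behaviour(mounts_line):
    """Describe one /proc/mounts line, or return None if it is not NFS."""
    fields = mounts_line.split()
    if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
        return None
    options = set(fields[3].split(","))
    return {
        "mount_point": fields[1],
        # 'hard' is the default, so only an explicit 'soft' makes it soft.
        "soft": "soft" in options,
        "intr": "intr" in options,
    }


# Hypothetical line for a 'hard,intr' mount of the export discussed above.
sample = "belka:/home/yoh.m /amd/belka/root/home/yoh.m nfs rw,hard,intr,proto=tcp 0 0"
behaviour = nfs_mount_behaviour(sample)
```

Here behaviour["soft"] is False and behaviour["intr"] is True, i.e. the
client waits for the server but the admin can still interrupt the
waiting processes.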
--
vda