2011-08-27 19:22:55

by Rüdiger Meier

Subject: processes hanging in state D when reading from nfs

Hi,


I've got an annoying problem with my nfs4 clients.
Lately I see many processes hanging in state D when reading from an NFS
mount. Sometimes they can be killed, sometimes not.

This occurs mostly with shell scripts started by cron.

For example, on one machine there is a file on which all reads suddenly
hang, although ls -ls still works:

rwxr-xr-x 1 tk users 128 2010-09-08 15:54 /home/tk/usr/local/scripts/plain_ALLMAJOR.sh

As you can see, it's an old script that hasn't been modified for a long
time. It has been running a few times per day for months.

This is the process list now:

tk 8829 0.0 0.0 11372 800 ? Ds Aug25 0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
tk 8830 0.0 0.0 11372 824 ? Ds Aug25 0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
tk 18864 0.0 0.0 11372 844 ? Ds Aug26 0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
tk 18865 0.0 0.0 11372 860 ? Ds Aug26 0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
rudi 23745 0.0 0.0 10300 748 pts/20 D 20:39 0:00 file /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
rudi 24361 0.0 0.0 10300 748 pts/20 D 20:40 0:00 file /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
root 30417 0.0 0.0 10056 472 ? D Aug24 0:00 less /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
rudi 30569 0.0 0.0 10064 1128 pts/1 D+ 20:41 0:00 less /home/tk/usr/local/scripts/plain_ALLMAJOR.sh

The /bin/sh processes hang forever in state "Ds" but can be killed.
The less and file commands can't be killed.
On other clients I can read that file without problems.

The logs on server and clients don't tell me anything.
What can I do to find out what the problem is?


BTW, each hanging process increases the load by 1, but the affected machines
are still quite usable even with a load of 800 on a single-core CPU!
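
The load figure fits how Linux computes the load average: tasks in
uninterruptible sleep (state D) are counted along with runnable ones, so
blocked NFS readers raise the load without using any CPU. A minimal
sketch for counting such tasks in "ps aux"-style output follows. The
function name is made up for illustration, and the third sample line is
hypothetical; the first two are taken from the process list above.

```python
def count_uninterruptible(ps_lines):
    """Count entries whose STAT column starts with 'D' in `ps aux`-style lines."""
    count = 0
    for line in ps_lines:
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # skip blank or malformed lines
        if fields[7].startswith("D"):
            count += 1
    return count


sample = [
    "tk 8829 0.0 0.0 11372 800 ? Ds Aug25 0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh",
    "rudi 30569 0.0 0.0 10064 1128 pts/1 D+ 20:41 0:00 less /home/tk/usr/local/scripts/plain_ALLMAJOR.sh",
    # Hypothetical non-blocked process, added only for contrast.
    "rudi 1234 0.0 0.0 10064 1128 pts/1 S+ 20:41 0:00 bash",
]
print(count_uninterruptible(sample))
```

Comparing this count with the first field of /proc/loadavg should show
roughly how much of the load comes from blocked tasks.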


Here are my specs:
2.6.37.6-0.7-desktop
openSUSE 11.4 (x86_64)


cu,
Rudi


2011-09-21 23:40:04

by Rüdiger Meier

Subject: Re: processes hanging in state D when reading from nfs

On Tuesday 20 September 2011, J. Bruce Fields wrote:
> On Sat, Aug 27, 2011 at 09:22:53PM +0200, Rüdiger Meier wrote:
> > I've got an annoying problem with my nfs4 clients.
> > Lately I see many processes hanging in state D when reading from
> > nfs mount. Sometimes they can be killed sometimes not.
>
> Is this still happening?

Yes, although we've managed to avoid the "dangerous" things.
Sometimes we also have problems like those in the other current thread,
"Writing / Locking problem with NFSv4".

Moreover, I've got issues with wrong file permissions on newly created
files, mostly when using make: the first make fails and you see some
object files with strange permissions. If you look at these files on
other clients, they are OK. Then the second make finishes
successfully.
I still suspect the damn readdir cache changes in 2.6.37.

> Running wireshark and watching the network traffic may sometimes give
> an idea whether the client or server is to blame.

I should do that, but I'm a bit tired of debugging my NFS issues
after spending the whole last year on it, only to get one thing fixed and
run into another. Maybe I'll try a current kernel first, hoping that
things get better without my doing anything.

cu,
Rudi

2011-09-20 13:52:52

by J. Bruce Fields

Subject: Re: processes hanging in state D when reading from nfs

On Sat, Aug 27, 2011 at 09:22:53PM +0200, Rüdiger Meier wrote:
> I've got an annoying problem with my nfs4 clients.
> Lately I see many processes hanging in state D when reading from nfs
> mount. Sometimes they can be killed sometimes not.

Is this still happening?

> This occurs mostly with shell scripts started by cron.
>
> For example on one machine there is a file where suddenly all reads on
> it are hanging, ls -ls still works:
>
> rwxr-xr-x 1 tk users 128 2010-09-08 15:54 /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
>
> As you see it's an old script, not modified since long time. It was
> running a few times per day since months.
>
> Now this is the processlist:
>
> tk 8829 0.0 0.0 11372 800 ? Ds Aug25 0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
> tk 8830 0.0 0.0 11372 824 ? Ds Aug25 0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
> tk 18864 0.0 0.0 11372 844 ? Ds Aug26 0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
> tk 18865 0.0 0.0 11372 860 ? Ds Aug26 0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
> rudi 23745 0.0 0.0 10300 748 pts/20 D 20:39 0:00 file /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
> rudi 24361 0.0 0.0 10300 748 pts/20 D 20:40 0:00 file /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
> root 30417 0.0 0.0 10056 472 ? D Aug24 0:00 less /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
> rudi 30569 0.0 0.0 10064 1128 pts/1 D+ 20:41 0:00 less /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
>
> The /bin/sh processes are hanging forever in state "Ds" but can be killed.
> The less and file commands can't be killed.
> On other clients I can read that file without probs.
>
> The logs on server and clients don't tell me anything.
> What can I do to find out what's the problem?

Running wireshark and watching the network traffic may sometimes give an
idea whether the client or server is to blame.
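
If an interactive wireshark session is impractical, the same traffic can
be captured to a file with tcpdump and opened in wireshark later. The
sketch below only builds the command line. The server name "nfsserver",
the interface "eth0", and the output file name are placeholders, not
values from this thread. Port 2049 is the standard NFS port; with NFSv3,
mount and lock traffic usually run on other ports and would not match
this filter.

```python
def build_capture_command(server, interface="eth0", outfile="nfs-hang.pcap"):
    """Return a tcpdump argv list that records NFS traffic to/from `server`."""
    return [
        "tcpdump",
        "-i", interface,   # interface to listen on (placeholder default)
        "-s", "0",         # capture whole packets, not just the headers
        "-w", outfile,     # write raw packets to a file for later analysis
        "host", server, "and", "port", "2049",
    ]


# "nfsserver" is a hypothetical host name.
print(" ".join(build_capture_command("nfsserver")))
```

The list can be handed to subprocess.run() as root on the affected
client; stopping the capture shortly after a read hangs keeps the file
small.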

> BTW each hanging process increases the load by 1 but the affected machines
> are still quite usable even with a load of 800 on a single core CPU!
>
>
> here my specs:
> 2.6.37.6-0.7-desktop
> openSUSE 11.4 (x86_64)

On both client and server?

--b.

2011-09-22 15:45:10

by Michael Gutteridge

Subject: Re: processes hanging in state D when reading from nfs

Rüdiger Meier <sweet_f_a@...> writes:

>
> On Tuesday 20 September 2011, J. Bruce Fields wrote:
> > On Sat, Aug 27, 2011 at 09:22:53PM +0200, Rüdiger Meier wrote:
> > > I've got an annoying problem with my nfs4 clients.
> > > Lately I see many processes hanging in state D when reading from
> > > nfs mount. Sometimes they can be killed sometimes not.
> >
> > Is this still happening?
>
> Yes, although we've managed to avoid the "dangerous" things.
> Sometimes we have also probs like the other current thread
> "Writing / Locking problem with NFSv4".
>

For what it's worth: we have been seeing very similar behavior on our OpenSuSE
11.3 (x86_64, 2.6.34.10-0.2) systems, though one other difference is that we are
using NFSv3 for these mounts.

I was able to get some traces via sysrq, though no ethernet dumps (the
problems happen only occasionally, and it is impossible to predict when or
where). These are heavily loaded systems, doing lots of compute and IO.
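
For anyone wanting to collect similar data: writing "w" to
/proc/sysrq-trigger makes the kernel log the stacks of all blocked
tasks, and on kernels built with stack trace support root can usually
read /proc/<pid>/stack for a single task. A rough sketch for finding the
candidate PIDs first is shown here; the function names are made up for
illustration and this is not the procedure used for the trace in this
message.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    left, right = stat_line.rsplit(")", 1)
    pid, comm = left.split(" (", 1)
    state = right.split()[0]
    return int(pid), comm, state


def blocked_pids(proc="/proc"):
    """List (pid, comm) for every task currently in state D."""
    result = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # the process exited while we were looking
        if state == "D":
            result.append((pid, comm))
    return result
```

Running blocked_pids() from cron every few minutes and logging the
result may help narrow down when a hang starts.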

[3754730.533669] R D ffffffff810dc3e0 0 22621 1 0x00000004
[3754730.533671] ffff88165f993cb8 0000000000000086 ffff881037174600 ffffffffa0332bbd
[3754730.533673] 0000000000013e80 0000000000013e80 ffff88165f993fd8 0000000000013e80
[3754730.533675] ffff88165f993fd8 ffff881e5cd521c0 0000000000013e80 0000000000013e80
[3754730.533676] Call Trace:
[3754730.533678] [<ffffffff8145004e>] io_schedule+0x6e/0xb0
[3754730.533681] [<ffffffff810dc418>] sync_page+0x38/0x50
[3754730.533683] [<ffffffff814505da>] __wait_on_bit_lock+0x4a/0xb0
[3754730.533685] [<ffffffff810dc3be>] __lock_page+0x5e/0x70
[3754730.533687] [<ffffffff810dd2f8>] filemap_fault+0x2f8/0x410
[3754730.533690] [<ffffffff810f7c12>] __do_fault+0x52/0x4f0
[3754730.533692] [<ffffffff810fbf82>] handle_mm_fault+0x1b2/0xbd0
[3754730.533694] [<ffffffff81455799>] do_page_fault+0x169/0x3a0
[3754730.533697] [<ffffffff8145271f>] page_fault+0x1f/0x30
[3754730.533699] [<00007f79e2486ce0>] 0x7f79e2486ce0

This is pretty representative of the processes in D. Does this help, or are
there too many differences from the original?
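
To check how representative one trace is when many tasks are blocked, it
can help to reduce each trace to its list of kernel symbols and compare
those. A small sketch, assuming the "[<address>] symbol+offset/size"
frame format seen in the trace above; the function name is made up for
illustration.

```python
import re

# Matches frames such as "[<ffffffff8145004e>] io_schedule+0x6e/0xb0".
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frame_symbols(trace_lines):
    """Return the kernel symbol names from a call trace, innermost first."""
    symbols = []
    for line in trace_lines:
        match = FRAME.search(line)
        if match:
            symbols.append(match.group(1))
    return symbols


sample = [
    "[3754730.533676] Call Trace:",
    "[3754730.533678] [<ffffffff8145004e>] io_schedule+0x6e/0xb0",
    "[3754730.533681] [<ffffffff810dc418>] sync_page+0x38/0x50",
    "[3754730.533699] [<00007f79e2486ce0>] 0x7f79e2486ce0",
]
print(frame_symbols(sample))
```

The header line and the bare user-space address are skipped because they
carry no symbol name. Two tasks whose symbol lists are identical are
most likely waiting at the same point in the kernel.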

Thanks

Michael