2005-03-08 04:54:01

by Bernardo Innocenti

Subject: NFS client bug in 2.6.8-2.6.11

Hello,

This problem was previously described by Neil Conway.
All relevant information here:

http://lkml.org/lkml/2005/2/10/97


I still see this very same problem on 2.6.11 vanilla and in
Fedora/RawHide kernels. It has haunted me for a couple of
months on several Fedora clients. Strangely, a Gentoo
client isn't affected, but I couldn't investigate further.

When the current directory becomes inaccessible, it remains
so until I cd somewhere else and then cd back to it.
Sometimes I must wait a few seconds before cd succeeds.

Here's a sample session:

[executing a find / in another shell to trigger the bug]
beetle:/pub/linux/distro/fedora-devel# ll
ls: .: No such file or directory
beetle:/pub/linux/distro/fedora-devel# cd -
/
beetle:/# cd -
bash: cd: /pub/linux/distro/fedora-devel: No such file or directory
beetle:/#
[...a few seconds later...]
beetle:/# cd -
/pub/linux/distro/fedora-devel


Appears to be a client bug. The problem only happens
when there's heavy filesystem activity on other
filesystems (local or NFS).
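
The "cd away, wait, cd back" behaviour in the session above can be watched from a script. What follows is only a minimal sketch, not something used in this thread: `probe_directory` and its parameters are hypothetical names, and it looks the directory up by absolute path, so it corresponds to the `cd -` step rather than to `ls .` in an already-open working directory.

```python
import errno
import os
import time


def probe_directory(path, attempts=5, delay=1.0, lister=os.listdir):
    """Try to list `path` up to `attempts` times and record each outcome.

    Returns a list of errno names (for example "ENOENT" or "ESTALE"),
    ending with "OK" if a listing eventually succeeds.
    `lister` is injectable so the loop can be exercised without NFS.
    """
    outcomes = []
    for attempt in range(attempts):
        try:
            lister(path)
        except OSError as exc:
            # Record the symbolic errno name instead of the localized message.
            outcomes.append(errno.errorcode.get(exc.errno, str(exc.errno)))
            if attempt < attempts - 1:
                time.sleep(delay)
        else:
            outcomes.append("OK")
            break
    return outcomes
```

Running `probe_directory("/pub/linux/distro/fedora-devel")` while a `find /` is active in another shell would show whether the lookup fails at first and then recovers after a few seconds.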

NFS mount options: rw,_netdev,rsize=32768,wsize=32768,hard,intr,proto=udp,addr=10.3.3.1

--
// Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/ http://www.develer.com/


2005-03-08 05:30:52

by Trond Myklebust

Subject: Re: NFS client bug in 2.6.8-2.6.11

On Tue, 08.03.2005 at 05:53 (+0100), Bernardo Innocenti wrote:

> Appears to be a client bug.

Why?

--
Trond Myklebust

2005-03-08 06:39:21

by Bernardo Innocenti

Subject: Re: NFS client bug in 2.6.8-2.6.11

Trond Myklebust wrote:
> On Tue, 08.03.2005 at 05:53 (+0100), Bernardo Innocenti wrote:
>
>>Appears to be a client bug.
>
> Why?

Two clients started showing the problem after
being upgraded from FC2 to FC3, while the server
remained unchanged.

I also can't reproduce the problem on an older
client running 2.4.21.

I'll test with 2.6.7 as soon as I can reboot the
client I'm using right now.

--
// Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/ http://www.develer.com/

2005-03-08 06:52:53

by Trond Myklebust

Subject: Re: NFS client bug in 2.6.8-2.6.11

On Tue, 08.03.2005 at 07:38 (+0100), Bernardo Innocenti wrote:
> Trond Myklebust wrote:
> > On Tue, 08.03.2005 at 05:53 (+0100), Bernardo Innocenti wrote:
> >
> >>Appears to be a client bug.
> >
> > Why?
>
> Two clients started showing the problem after
> being upgraded from FC2 to FC3, while the server
> remained unchanged.

Can you produce tcpdumps to back that up?

Neil's problem appeared rather to be server-related. Neither of us could
reproduce his problem when the server was exporting an XFS partition.

The other thing to try is to turn off subtree checking on the server.

Cheers,
Trond

--
Trond Myklebust

2005-03-08 07:07:09

by Bernardo Innocenti

Subject: Re: NFS client bug in 2.6.8-2.6.11

Bernardo Innocenti wrote:
> Trond Myklebust wrote:
>
> I also can't reproduce the problem on an older
> client running 2.4.21.

Well, actually I tried harder with the 2.4.21
client and I obtained a similar effect:

naraku:/pub/linux/distro/fedora-devel# ll
ls: .: Stale NFS file handle
naraku:/pub/linux/distro/fedora-devel# cd -
/arc/linux
naraku:/arc/linux# cd -
/pub/linux/distro/fedora-devel
naraku:/pub/linux/distro/fedora-devel# ll
... (lots of files)


So, instead of ENOENT I get ESTALE on 2.4.21.

May well be a server bug then. The server is running
2.6.10-1.766_FC3. Do you think I should try installing
a vanilla kernel on the server?

--
// Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/ http://www.develer.com/

2005-03-08 08:55:23

by Anders Saaby

Subject: Re: NFS client bug in 2.6.8-2.6.11

On Tuesday 08 March 2005 08:03, Bernardo Innocenti wrote:
> Bernardo Innocenti wrote:
> > Trond Myklebust wrote:
> >
> > I also can't reproduce the problem on an older
> > client running 2.4.21.
>
> Well, actually I tried harder with the 2.4.21
> client and I obtained a similar effect:
>
> So, instead of ENOENT I get ESTALE on 2.4.21.
>
> May well be a server bug then. The server is running
> 2.6.10-1.766_FC3. Do you think I should try installing
> a vanilla kernel on the server?

We have seen lots of ESTALEs/ENOENTs when the server is running 2.6.10
(vanilla). I don't know whether this was supposed to be fixed in the 2.6.10-FC
kernels, but vanilla 2.6.11 doesn't seem to have this bug at all.

You mention a lot of kernel versions, including 2.6.11, and I can't really
figure out whether you are talking about the clients or the server.
Anyway, if your server has only run 2.6.10, try 2.6.11.

- Apologies if I missed something obvious.

--
Med venlig hilsen - Best regards - Meilleures salutations

Anders Saaby
Systems Engineer
------------------------------------------------
Cohaesio A/S - Maglebjergvej 5D - DK-2800 Lyngby
Phone: +45 45 880 888 - Fax: +45 45 880 777
http://www.cohaesio.com
------------------------------------------------

2005-03-08 09:26:18

by Bernardo Innocenti

Subject: Re: NFS client bug in 2.6.8-2.6.11

Trond Myklebust wrote:
> On Tue, 08.03.2005 at 07:38 (+0100), Bernardo Innocenti wrote:
>>
>>Two clients started showing the problem after
>>being upgraded from FC2 to FC3, while the server
>>remained unchanged.
>
> Can you produce tcpdumps to back that up?
>
> Neil's problem appeared rather to be server-related. Neither of us could
> reproduce his problem when the server was exporting an XFS partition.

Actually, I was mistaken: running a background "find / >/dev/null"
triggers the problem even on the old RedHat (2.4.26) and
Gentoo (2.6.11) clients.


> The other thing to try is to turn off subtree checking on the server.

It's already turned off on all shares. For the record, these are the
contents of my /etc/exports:

/home gss/krb5(rw,no_root_squash,no_subtree_check,async) beetle(rw,no_root_squash,no_subtree_check,async) deimos(rw,async,no_subtree_check,anonuid=134,anongid=100) haring(rw,async,no_subtree_check,anonuid=127,anongid=100) murphy(rw,async,no_subtree_check,anonuid=158,anongid=100) daneel(rw,async,no_subtree_check,anonuid=100,anongid=100) 10.0.0.0/8(rw,no_subtree_check,async)
/arc 10.0.0.0/8(rw,no_root_squash,no_subtree_check,async,anonuid=14,anongid=113)

#
# NFSv4
#
/export beetle(rw,fsid=0,no_root_squash,insecure,no_subtree_check,async)
/export 10.0.0.0/8(rw,fsid=0,insecure,no_subtree_check,async)
/export gss/krb5(rw,fsid=0,insecure,no_subtree_check,async)
/export/home beetle(rw,nohide,no_root_squash,insecure,no_subtree_check,async)
/export/home 10.0.0.0/8(rw,nohide,insecure,no_subtree_check,async)
/export/home gss/krb5(rw,nohide,no_root_squash,insecure,no_subtree_check,async)
/export/arc 10.0.0.0/8(rw,nohide,no_root_squash,insecure,no_subtree_check,async,anonuid=14,anongid=113)
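
With this many entries it is easy to overlook one that lacks the option. The following is a rough sketch of an automated check, under stated assumptions: `missing_no_subtree_check` is a hypothetical helper, and the parser is deliberately simplified (no quoted paths, no backslash line continuations, no default-option lines).

```python
def missing_no_subtree_check(exports_text):
    """Return (path, client) pairs whose option list lacks no_subtree_check.

    Simplified reading of the exports format: one export per line,
    whitespace-separated "client(opt,opt,...)" entries, "#" starts a comment.
    """
    missing = []
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        path, *clients = line.split()
        for entry in clients:
            client, _, rest = entry.partition("(")
            options = rest.rstrip(")").split(",") if rest else []
            if "no_subtree_check" not in options:
                missing.append((path, client))
    return missing
```

Passing the file contents, for example `missing_no_subtree_check(open("/etc/exports").read())`, returns an empty list when every entry carries the option.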

--
// Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/ http://www.develer.com/

2005-03-08 22:25:30

by Bernardo Innocenti

Subject: Re: NFS client bug in 2.6.8-2.6.11

Anders Saaby wrote:
> On Tuesday 08 March 2005 08:03, Bernardo Innocenti wrote:
>
>>Bernardo Innocenti wrote:
>>
>>>Trond Myklebust wrote:
>>>
>>>I also can't reproduce the problem on an older
>>>client running 2.4.21.
>>
>>Well, actually I tried harder with the 2.4.21
>>client and I obtained a similar effect:
>>
>>So, instead of ENOENT I get ESTALE on 2.4.21.
>>
>>May well be a server bug then. The server is running
>>2.6.10-1.766_FC3. Do you think I should try installing
>>a vanilla kernel on the server?
>
>
> We have seen lots of ESTALEs/ENOENTs when the server is running 2.6.10
> (vanilla). I don't know whether this was supposed to be fixed in the 2.6.10-FC
> kernels, but vanilla 2.6.11 doesn't seem to have this bug at all.
>
> You mention a lot of kernel versions, including 2.6.11, and I can't really
> figure out whether you are talking about the clients or the server.
> Anyway, if your server has only run 2.6.10, try 2.6.11.

Thank you, I've finally nailed it down by upgrading the
*server* kernel from 2.6.10-1.770_FC3 to 2.6.10-1.770_FC3.

The latter is basically 2.6.10-ac12 plus a bunch of vendor
specific patches.


> - Apologies if I missed something obvious.

No, *I* did. All the clues I had led me to the client
side, while the problem was in the server instead.

--
// Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/ http://www.develer.com/

2005-03-15 23:45:36

by Neil Conway

Subject: Re: NFS client bug in 2.6.8-2.6.11

Hi Bernardo (et al). Apologies - I've not been reading my account for
a wee while. Then again, I probably don't have much useful to add to
the debate right now ;-)

--- Bernardo Innocenti wrote:
> Anders Saaby wrote:
> > Anyways if your server has only run with 2.6.10 - try 2.6.11.
>
> Thank you, I've finally nailed it down by upgrading the
> *server* kernel from 2.6.10-1.770_FC3 to 2.6.10-1.770_FC3.

Hmm, I will infer from a previous email you sent that you mean 766_FC3
for the "from" kernel.

> The latter is basically 2.6.10-ac12 plus a bunch of vendor
> specific patches.

766 -> 770 sounds like a "small" (ish) number of patches to check, if
we're lucky. Did you wade through 'em all yet? Any smoking guns?

Regards,
Neil
PS: oh bugger, just remembered that I also reproduced my bug with a
2.6.8 kernel on the server; admittedly though it was an FC2 kernel so
who knows what extra patches it had.





2005-03-16 02:50:49

by Bernardo Innocenti

Subject: Re: NFS client bug in 2.6.8-2.6.11

Neil Conway wrote:

> 766 -> 770 sounds like a "small" (ish) number of patches to check, if
> we're lucky. Did you wade through 'em all yet? Any smoking guns?

The RPM changelog doesn't contain anything relevant
between 766 and 770:

---CUT---
* Thu Feb 24 2005 Dave Jones

- Use old scheme first when probing USB. (#145273)

* Wed Feb 23 2005 Dave Jones

- Try as you may, there's no escape from crap SCSI hardware. (#149402)

* Mon Feb 21 2005 Dave Jones

- Disable some experimental USB EHCI features.

* Tue Feb 15 2005 Dave Jones

- Fix bio leak in md layer.
---CUT---

Perhaps the changelog is incomplete. I don't have the
two SRPMs at hand to make a comparison.
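
For longer changelogs, scanning by hand gets tedious. Here is a small sketch of a keyword filter over text in the format quoted above. The helper names and the keyword choice are assumptions for illustration only, and an empty result proves little if the changelog itself is incomplete.

```python
import re

# Matches header lines such as "* Thu Feb 24 2005 Dave Jones".
ENTRY_HEADER = re.compile(r"^\* (\w{3} \w{3} +\d{1,2} \d{4}) (.*)$")


def parse_changelog(text):
    """Split RPM-style changelog text into entries with their change lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        header = ENTRY_HEADER.match(line)
        if header:
            entries.append(
                {"date": header.group(1), "author": header.group(2), "changes": []}
            )
        elif line.startswith("- ") and entries:
            entries[-1]["changes"].append(line[2:])
    return entries


def matching_changes(entries, keywords):
    """Return (date, change) pairs whose text contains any lowercase keyword."""
    hits = []
    for entry in entries:
        for change in entry["changes"]:
            lowered = change.lower()
            if any(keyword in lowered for keyword in keywords):
                hits.append((entry["date"], change))
    return hits
```

Calling `matching_changes(parse_changelog(text), ("nfs", "sunrpc", "lockd"))` on the excerpt above returns no hits, which agrees with the manual reading.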

By the way, it seems upgrading to 2.6.10-1.770_FC3 just made
the bug much harder to trigger: I've definitely seen it once
again when I had left a shell sitting in an NFS directory
overnight. I couldn't reproduce it a second time.


> PS: oh bugger, just remembered that I also reproduced my bug with a
> 2.6.8 kernel on the server; admittedly though it was an FC2 kernel so
> who knows what extra patches it had.

You can easily find out by downloading the SRPM. Now that
Fedora provides a public CVS, perhaps it could be used to
make such investigations directly with the cvsweb interface
without downloading and unpacking a 40MB file.

--
// Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/ http://www.develer.com/