2002-11-26 23:39:06

by Jeremy Fitzhardinge

Subject: htree+NFS (NFS client bug?)

I'm having problems with exporting htree ext3 filesystems over NFS.
Quite often, if a program on the client side is reading a directory it
will simply stop and consume a vast amount of memory. It seems that the
readdir never terminates, and endlessly returns the same dir entries
again and again.

This seems to be happening entirely on the NFS client side; after the
initial readdir and a couple of getattrs, there's no further network
traffic.

It looks to me like some sort of problem managing the NFS readdir
cookies, but it isn't clear to me whether this is the NFS server/ext3
generating bad cookies, or the NFS client handling them wrongly.

Both the client and server machines are running 2.4.20-rc1, with Ted's
extfs-update-2.4.20-rc1 patch (dating from Nov 8th or so). Is there a
more up to date patch?

This is the network traffic:

# tcpdump -s1500 -vvv -ieth0 host server and port 2049
tcpdump: listening on eth0
15:13:39.784538 client.3095412229 > server.nfs: 140 lookup fh Unknown/1 "coregrind-" (DF) (ttl 64, id 0, len 168)
15:13:39.784814 server.nfs > client.3095412229: reply ok 128 lookup fh Unknown/1 DIR 40775 ids 1043/1043 sz 12288 nlink 3 rdev ffffffff fsid 301 nodeid 47976 a/m/ctime 1038043539.000000 1038351994.000000 1038351994.000000 (DF) (ttl 64, id 0, len 156)
15:13:39.785247 client.3112189445 > server.nfs: 124 getattr fh Unknown/1 (DF) (ttl 64, id 0, len 152)
15:13:39.785434 server.nfs > client.3112189445: reply ok 96 getattr DIR 40775 ids 1043/1043 sz 12288 (DF) (ttl 64, id 0, len 124)
15:13:39.785938 client.3128966661 > server.nfs: 132 readdir fh Unknown/1 4096 bytes @ 0 (DF) (ttl 64, id 0, len 160)
15:13:39.786939 server.nfs > client.3128966661: reply ok 1472 readdir offset 1 size 293238 eof (frag 48814:1480@0+) (ttl 64, len 1500)
15:14:50.727869 client.3246407173 > server.nfs: 116 getattr fh 0,1/16777984 (DF) (ttl 64, id 0, len 144)
15:14:50.728127 server.nfs > client.3246407173: reply ok 96 getattr DIR 40755 ids 0/0 sz 4096 (DF) (ttl 64, id 0, len 124)

For this command:
$ strace -s1500 ls -f .
execve("/bin/ls", ["ls", "-f", "."], [/* 122 vars */]) = 0
[...]
lstat64(".", {st_mode=S_IFDIR|0775, st_size=12288, ...}) = 0
open("/dev/null", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0775, st_size=12288, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(3, /* 94 entries */, 4096) = 4096
brk(0x8059000) = 0x8059000
getdents64(3, /* 13 entries */, 4096) = 496
brk(0x805f000) = 0x805f000
getdents64(3, /* 94 entries */, 4096) = 4096
brk(0x806b000) = 0x806b000
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
brk(0x8082000) = 0x8082000
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
[...]

A simple program which does an opendir/while(readdir != NULL)/closedir
loop shows this:
$ readdir .
name=".", ino=293238
name="..", ino=298200
name="vg_symtypes.h~", ino=163580
name="vg_include.h", ino=163476
name=".deps", ino=293390
name="vg_to_ucode.c~47-chained-bb", ino=293901
name="vg_from_ucode.c~47-chained-bb", ino=293896
name="vg_memory.c", ino=293462
name="vg_symtypes.c~", ino=163620
name="vg_constants.h~48-chained-indirect", ino=293925
name="valgrind", ino=293394
name="vg_needs.c", ino=293431
name="vg_from_ucode.i", ino=293860
name="vg_symtypes.h~44-symbolic-addr", ino=293651
name="vg_libpthread.c~11-timedwait-rel", ino=293628
name="vg_libpthread_unimp.c", ino=293999
name="vg_constants.h~47-chained-bb", ino=293904
name="vg_libpthread.c~03-accept-nonblock", ino=293652
name="vg_transtab.c~47-chained-bb", ino=293900
name="vg_include.h~48-chained-indirect", ino=293939
name="vg_symtypes.c~44-symbolic-addr", ino=293639
name="vg_main.c~14-hg-mmap-magic-virgin", ino=293941
name="vg_symtab2-test.c", ino=293903
name="vg_startup.S", ino=298417
name="vg_main.c~51-kill-inceip", ino=294086
name="vg_default.c", ino=293439
name="vg_translate.c~01-partial-mul", ino=293615
name="vg_dispatch.S~48-chained-indirect", ino=293856
name="vg_scheduler.c~47-chained-bb", ino=293890
name="vg_clientfuncs.c", ino=298391
name="Makefile.in", ino=298437
name="gmon.out", ino=293911
name="vg_main.c~47-chained-bb", ino=293897
name="vg_procselfmaps.c", ino=298414
name="vg_symtab2.h", ino=163541
name="vg_main.c~48-chained-indirect", ino=293946
name="vg_translate.c~51-kill-inceip", ino=294087
name="vg_symtypes.h", ino=163543
name="vg_symtab_stabs.c", ino=163530
name="dosyms", ino=298389
name="vg_constants.h", ino=163142
name="vg_demangle.c", ino=298395
name="vg_messages.c", ino=293882
name="vg_include.h~50-fast-cond", ino=294050
name="vg_to_ucode.c", ino=163552
name="Makefile~", ino=293393
name="vg_symtab2.c~44-symbolic-addr", ino=293650
name="vg_dispatch.S~47-chained-bb", ino=293878
name="vg_libpthread.c", ino=163595
name="vg_kerneliface.h", ino=163597
name="vg_from_ucode.c~48-chained-indirect", ino=293880
name="vg_from_ucode.c", ino=163451
name="vg_from_ucode.c~49-no-inceip", ino=294060
name="vg_malloc2.c", ino=163531
name="vg_instrument.c", ino=298403
name="vg_signals.c~28-sigtrap", ino=293637
name="vg_main.c.rej", ino=163538
name="vg_symtab2-test", ino=293691
name="vg_errcontext.c", ino=163586
name="vg_translate.c", ino=163539
name="vg_symtypes.c", ino=163647
name="out.pid11357", ino=293908
name="vg_execontext.c", ino=293436
name="Makefile", ino=293504
name="vg_main.c", ino=163523
name="vg_errcontext.c~47-chained-bb", ino=293898
name="vg_signals.c", ino=163536
name="vg_symtab2.c~offset-dehack", ino=293949
name="vg_from_ucode.c~01-partial-mul", ino=293605
name="Makefile.am~44-symbolic-addr", ino=293671
name="vg_main.c.orig", ino=163519
name="vg_symtab2.c", ino=163599
name="vg_kerneliface.h~28-sigtrap", ino=293638
name="vg_ldt.c", ino=298405
name="vg_from_ucode.c~00-lazy-fp", ino=293269
name="vg_transtab.c", ino=163622
name="vg_mylibc.c~12-vgprof", ino=293636
name="vg_from_ucode.c~50-fast-cond", ino=293862
name="vg_symtab2.h~44-symbolic-addr", ino=293851
name="vg_symtab_dwarf.c~44-symbolic-addr", ino=293858
name="vg_unsafe.h", ino=298424
name="vg_include.h~51-kill-inceip", ino=294052
name="vg_symtab_stabs.c~44-symbolic-addr", ino=293842
name="vg_mylibc.c", ino=163177
name="vg_dummy_profile.c", ino=298397
name="vg_main.c~50-fast-cond", ino=293956
name="vg_mylibc.c~09-rdtsc-calibration", ino=293625
name="vg_include.h~47-chained-bb", ino=293879
name="vg_translate.c~47-chained-bb", ino=293894
name="vg_libpthread.c~46-fix-writeable_or_erring-proto", ino=293863
name="vg_syscall.S", ino=298419
name="vg_from_ucode.c~51-kill-inceip", ino=294051
name="vg_mylibc.c~28-sigtrap", ino=298763
name="vg_dispatch.S", ino=163511
name="vg_symtab_dwarf.c", ino=163448
name="vg_intercept.c", ino=293614
name="vg_symtab_stabs.c~", ino=163548
name="vg_clientmalloc.c", ino=298392
name="vg_valgrinq_dummy.c", ino=298425
name="vg_scheduler.c", ino=163617
name="vg_to_ucode.c~01-partial-mul", ino=293607
name="Makefile.am", ino=163594
name="vg_helpers.S", ino=298401
name="out.pid11334", ino=293907
name="vg_syscalls.c", ino=293575
name="valgrind.in", ino=298390
name="vg_libpthread.vs", ino=298407
name=".", ino=293238
name="..", ino=298200
name="vg_symtypes.h~", ino=163580
name="vg_include.h", ino=163476
name=".deps", ino=293390
name="vg_to_ucode.c~47-chained-bb", ino=293901
name="vg_from_ucode.c~47-chained-bb", ino=293896
name="vg_memory.c", ino=293462
name="vg_symtypes.c~", ino=163620
name="vg_constants.h~48-chained-indirect", ino=293925
name="valgrind", ino=293394
name="vg_needs.c", ino=293431
name="vg_from_ucode.i", ino=293860
name="vg_symtypes.h~44-symbolic-addr", ino=293651
name="vg_libpthread.c~11-timedwait-rel", ino=293628
name="vg_libpthread_unimp.c", ino=293999
name="vg_constants.h~47-chained-bb", ino=293904
name="vg_libpthread.c~03-accept-nonblock", ino=293652
name="vg_transtab.c~47-chained-bb", ino=293900
name="vg_include.h~48-chained-indirect", ino=293939
name="vg_symtypes.c~44-symbolic-addr", ino=293639
name="vg_main.c~14-hg-mmap-magic-virgin", ino=293941
name="vg_symtab2-test.c", ino=293903
name="vg_startup.S", ino=298417
name="vg_main.c~51-kill-inceip", ino=294086
name="vg_default.c", ino=293439
name="vg_translate.c~01-partial-mul", ino=293615
name="vg_dispatch.S~48-chained-indirect", ino=293856
name="vg_scheduler.c~47-chained-bb", ino=293890
name="vg_clientfuncs.c", ino=298391
name="Makefile.in", ino=298437
name="gmon.out", ino=293911
name="vg_main.c~47-chained-bb", ino=293897
name="vg_procselfmaps.c", ino=298414
name="vg_symtab2.h", ino=163541
name="vg_main.c~48-chained-indirect", ino=293946
name="vg_translate.c~51-kill-inceip", ino=294087
name="vg_symtypes.h", ino=163543
name="vg_symtab_stabs.c", ino=163530
name="dosyms", ino=298389
name="vg_constants.h", ino=163142
name="vg_demangle.c", ino=298395
name="vg_messages.c", ino=293882
name="vg_include.h~50-fast-cond", ino=294050
name="vg_to_ucode.c", ino=163552
name="Makefile~", ino=293393
name="vg_symtab2.c~44-symbolic-addr", ino=293650
name="vg_dispatch.S~47-chained-bb", ino=293878
name="vg_libpthread.c", ino=163595
name="vg_kerneliface.h", ino=163597
name="vg_from_ucode.c~48-chained-indirect", ino=293880
name="vg_from_ucode.c", ino=163451
name="vg_from_ucode.c~49-no-inceip", ino=294060
name="vg_malloc2.c", ino=163531
name="vg_instrument.c", ino=298403
name="vg_signals.c~28-sigtrap", ino=293637
name="vg_main.c.rej", ino=163538
name="vg_symtab2-test", ino=293691
name="vg_errcontext.c", ino=163586
name="vg_translate.c", ino=163539
name="vg_symtypes.c", ino=163647
name="out.pid11357", ino=293908
name="vg_execontext.c", ino=293436
name="Makefile", ino=293504
name="vg_main.c", ino=163523
name="vg_errcontext.c~47-chained-bb", ino=293898
name="vg_signals.c", ino=163536
name="vg_symtab2.c~offset-dehack", ino=293949
name="vg_from_ucode.c~01-partial-mul", ino=293605
name="Makefile.am~44-symbolic-addr", ino=293671
name="vg_main.c.orig", ino=163519
name="vg_symtab2.c", ino=163599
name="vg_kerneliface.h~28-sigtrap", ino=293638
name="vg_ldt.c", ino=298405
name="vg_from_ucode.c~00-lazy-fp", ino=293269
name="vg_transtab.c", ino=163622
name="vg_mylibc.c~12-vgprof", ino=293636
name="vg_from_ucode.c~50-fast-cond", ino=293862
name="vg_symtab2.h~44-symbolic-addr", ino=293851
name="vg_symtab_dwarf.c~44-symbolic-addr", ino=293858
name="vg_unsafe.h", ino=298424
name="vg_include.h~51-kill-inceip", ino=294052
name="vg_symtab_stabs.c~44-symbolic-addr", ino=293842
name="vg_mylibc.c", ino=163177
name="vg_dummy_profile.c", ino=298397
name="vg_main.c~50-fast-cond", ino=293956
name="vg_mylibc.c~09-rdtsc-calibration", ino=293625
name="vg_include.h~47-chained-bb", ino=293879
name="vg_translate.c~47-chained-bb", ino=293894
name="vg_libpthread.c~46-fix-writeable_or_erring-proto", ino=293863
name="vg_syscall.S", ino=298419
name="vg_from_ucode.c~51-kill-inceip", ino=294051
name="vg_mylibc.c~28-sigtrap", ino=298763
[...etc...]
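
The test program itself isn't attached; a minimal sketch of the
opendir/readdir loop described above, producing the same output format,
might look like the following (an illustration, not the exact program
used):

/* Minimal sketch of the opendir/while(readdir != NULL)/closedir test;
 * on the affected NFS mount this loop never sees the NULL return. */
#include <stdio.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    DIR *d = opendir(argc > 1 ? argv[1] : ".");
    struct dirent *de;

    if (!d) {
        perror("opendir");
        return 1;
    }
    while ((de = readdir(d)) != NULL)
        printf("name=\"%s\", ino=%lu\n",
               de->d_name, (unsigned long)de->d_ino);
    closedir(d);
    return 0;
}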


J


2002-11-27 03:19:43

by Trond Myklebust

Subject: Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:

> It looks to me like some sort of problem managing the NFS
> readdir cookies, but it isn't clear to me whether this is the
> NFS server/ext3 generating bad cookies, or the NFS client
> handling them wrongly.

In order to determine which of the two needs to be fixed, it would
help if you could print out the cookies from that listing or better
still: if you could provide us with the raw tcpdump output. Please
remember to use an 8k snaplen for the tcpdump...

Cheers,
Trond

2002-11-27 03:47:58

by Christopher Li

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

I am not sure whether this will help or not:

Htree readdir returns a file position bigger than the real size of the
directory; it stores the hash information in the lower bits.

I have just gone through the source code, but haven't found out how NFS
handles that yet.

Yes, printing out the cookies will help.

Chris

On Wed, Nov 27, 2002 at 04:26:46AM +0100, Trond Myklebust wrote:
> >>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:
>
> > It looks to me like some sort of problem managing the NFS
> > readdir cookies, but it isn't clear to me whether this is the
> > NFS server/ext3 generating bad cookies, or the NFS client
> > handling them wrongly.
>
> In order to determine which of the two needs to be fixed, it would
> help if you could print out the cookies from that listing or better
> still: if you could provide us with the raw tcpdump output. Please
> remember to use an 8k snaplen for the tcpdump...
>
> Cheers,
> Trond
>


2002-11-27 08:51:26

by Jeremy Fitzhardinge

Subject: Re: [NFS] htree+NFS (NFS client bug?)

On Tue, 2002-11-26 at 19:26, Trond Myklebust wrote:
> >>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:
>
> > It looks to me like some sort of problem managing the NFS
> > readdir cookies, but it isn't clear to me whether this is the
> > NFS server/ext3 generating bad cookies, or the NFS client
> > handling them wrongly.
>
> In order to determine which of the two needs to be fixed, it would
> help if you could print out the cookies from that listing or better
> still: if you could provide us with the raw tcpdump output. Please
> remember to use an 8k snaplen for the tcpdump...

Here's a pcap dump file of the transaction. I changed [rw]size to 1024
because I couldn't work out how to get ethereal to reassemble the
fragments. It doesn't seem to have affected the behaviour at all.

This was with a 2.5.47 client (same server as before).

J


Attachments:
nfs-htree.pcap.gz (2.26 kB)

2002-11-27 13:26:42

by Theodore Ts'o

Subject: Re: htree+NFS (NFS client bug?)

On Tue, Nov 26, 2002 at 03:44:46PM -0800, Jeremy Fitzhardinge wrote:
> I'm having problems with exporting htree ext3 filesystems over NFS.
> Quite often, if a program on the client side is reading a directory it
> will simply stop and consume a vast amount of memory. It seems that the
> readdir never terminates, and endlessly returns the same dir entries
> again and again.
>
> This seems to be happening entirely on the NFS client side; after the
> initial readdir and a couple of getattrs, there's no further network
> traffic.
>
> It looks to me like some sort of problem managing the NFS readdir
> cookies, but it isn't clear to me whether this is the NFS server/ext3
> generating bad cookies, or the NFS client handling them wrongly.

Well, even if the NFS server is generating bad cookies (and that may
be possible), the NFS client should be more robust and not spin in a
loop forever.

Can you send me a directory list (from the server) of the directory in
question? Also, can you send me the output of dumpe2fs -h on the
filesystem. I'll need the latter to get the seed for the htree hash,
so I can try replicating this on my end.

Also, have you tried running e2fsck on the filesystem on the server? It
would be very interesting to confirm whether or not the filesystem is
consistent.

- Ted

2002-11-27 14:54:01

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Wed, Nov 27, 2002 at 12:58:42AM -0800, Jeremy Fitzhardinge wrote:
> On Tue, 2002-11-26 at 19:26, Trond Myklebust wrote:
> > >>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:
> >
> > > It looks to me like some sort of problem managing the NFS
> > > readdir cookies, but it isn't clear to me whether this is the
> > > NFS server/ext3 generating bad cookies, or the NFS client
> > > handling them wrongly.
> >
> > In order to determine which of the two needs to be fixed, it would
> > help if you could print out the cookies from that listing or better
> > still: if you could provide us with the raw tcpdump output. Please
> > remember to use an 8k snaplen for the tcpdump...
>
> Here's a pcap dump file of the transaction. I changed [rw]size to 1024
> because I couldn't work out how to get ethereal to reassemble the
> fragments. It doesn't seem to have affected the behaviour at all.

The dump looks problematic, but I'm not sure whether it's a client or
a server problem.

Frame 6 is a READDIR reply, ending with

Entry: file ID 294052, name vg_include.h~51-kill-inceip
File ID: 294052
Name: vg_include.h~51-kill-inceip
length: 27
contents: vg_include.h~51-kill-inceip
fill bytes: opaque data
Cookie: 1611747420
Value Follows: No
EOF: 0

There follows in frame 7 a request for the next chunk, starting with
the same cookie, and the reply in frame 8 ends with:

Entry: file ID 298407, name vg_libpthread.vs
File ID: 298407
Name: vg_libpthread.vs
length: 16
contents: vg_libpthread.vs
Cookie: 1611747420
Value Follows: No
EOF: 1

Now, this final frame has EOF = 1, so the client should not be looking
for any more data after it. If the client _does_ try to progress,
though, it will use that final cookie for the next request, and will
naturally find the results already in cache so will repeat the final
chunk of directory entries.

So I suspect that this is at root a client problem --- the client has
repeated a READDIR despite being told that the previous reply was EOF
--- but that the best solution is actually to change the server to
poke a dedicated EOF cookie into the final dirent of the stream. One
of the reasons we need to do that is to cope with the unclear
situation when an application is using telldir/seekdir to reposition
the directory stream (tcsh is really bad for trying to do that with
32-bit offsets when globbing, for example, although the right answer
there is "use readdir64".)

Cheers,
Stephen

2002-11-27 17:19:24

by Jeremy Fitzhardinge

Subject: Re: htree+NFS (NFS client bug?)

.
..
vg_symtypes.h~
vg_include.h
.deps
vg_to_ucode.c~47-chained-bb
vg_from_ucode.c~47-chained-bb
vg_memory.c
vg_symtypes.c~
vg_constants.h~48-chained-indirect
valgrind
vg_needs.c
vg_from_ucode.i
vg_symtypes.h~44-symbolic-addr
vg_libpthread.c~11-timedwait-rel
vg_libpthread_unimp.c
vg_constants.h~47-chained-bb
vg_libpthread.c~03-accept-nonblock
vg_transtab.c~47-chained-bb
vg_include.h~48-chained-indirect
vg_symtypes.c~44-symbolic-addr
vg_main.c~14-hg-mmap-magic-virgin
vg_symtab2-test.c
vg_startup.S
vg_main.c~51-kill-inceip
vg_default.c
vg_translate.c~01-partial-mul
vg_dispatch.S~48-chained-indirect
vg_scheduler.c~47-chained-bb
vg_clientfuncs.c
Makefile.in
gmon.out
vg_main.c~47-chained-bb
vg_procselfmaps.c
vg_symtab2.h
vg_main.c~48-chained-indirect
vg_translate.c~51-kill-inceip
vg_symtypes.h
vg_symtab_stabs.c
dosyms
vg_constants.h
vg_demangle.c
vg_messages.c
vg_include.h~50-fast-cond
vg_to_ucode.c
Makefile~
vg_symtab2.c~44-symbolic-addr
vg_dispatch.S~47-chained-bb
vg_libpthread.c
vg_kerneliface.h
vg_from_ucode.c~48-chained-indirect
vg_from_ucode.c
vg_from_ucode.c~49-no-inceip
vg_malloc2.c
vg_instrument.c
vg_signals.c~28-sigtrap
vg_main.c.rej
vg_symtab2-test
vg_errcontext.c
vg_translate.c
vg_symtypes.c
out.pid11357
vg_execontext.c
Makefile
vg_main.c
vg_errcontext.c~47-chained-bb
vg_signals.c
vg_symtab2.c~offset-dehack
vg_from_ucode.c~01-partial-mul
Makefile.am~44-symbolic-addr
vg_main.c.orig
vg_symtab2.c
vg_kerneliface.h~28-sigtrap
vg_ldt.c
vg_from_ucode.c~00-lazy-fp
vg_transtab.c
vg_mylibc.c~12-vgprof
vg_from_ucode.c~50-fast-cond
vg_symtab2.h~44-symbolic-addr
vg_symtab_dwarf.c~44-symbolic-addr
vg_unsafe.h
vg_include.h~51-kill-inceip
vg_symtab_stabs.c~44-symbolic-addr
vg_mylibc.c
vg_dummy_profile.c
vg_main.c~50-fast-cond
vg_mylibc.c~09-rdtsc-calibration
vg_include.h~47-chained-bb
vg_translate.c~47-chained-bb
vg_libpthread.c~46-fix-writeable_or_erring-proto
vg_syscall.S
vg_from_ucode.c~51-kill-inceip
vg_mylibc.c~28-sigtrap
vg_dispatch.S
vg_symtab_dwarf.c
vg_intercept.c
vg_symtab_stabs.c~
vg_clientmalloc.c
vg_valgrinq_dummy.c
vg_scheduler.c
vg_to_ucode.c~01-partial-mul
Makefile.am
vg_helpers.S
out.pid11334
vg_syscalls.c
valgrind.in
vg_libpthread.vs


Attachments:
dirlist (2.13 kB)

2002-11-27 20:18:50

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

> So I suspect that this is at root a client problem --- the
> client has repeated a READDIR despite being told that the
> previous reply was EOF

I disagree. As far as the client is concerned, it has just been asked
to read the entry that corresponds to that particular cookie. If
glibc issued a new readdir request (which is what I suspect has
happened here), the NFS client has no idea what the previous reply
was, or even where it was positioned within the page cache (the latter
may have been cleared). All it can do is look up the cookie afresh and
start reading from there.

IOW: A cookie should *always* be unique. There are no exceptions to
this rule.

Cheers,
Trond

2002-11-27 20:35:06

by Trond Myklebust

Subject: Re: htree+NFS (NFS client bug?)

>>>>> " " == Theodore Ts'o <[email protected]> writes:

> Well, even if the NFS server is generating bad cookies (and
> that may be possible), the NFS client should be more robust and
> not spin in a loop forever.

No. In this case, the problem appears to be that the cookie issued by
the server for the end of directory case is not unique.

If userland calls an extra 'readdir()' after EOF is reached, then as
far as the client is concerned, this sort of thing cannot be
distinguished from readdir()+seekdir()+readdir()..

Cheers,
Trond

2002-11-27 20:48:57

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Wed, Nov 27, 2002 at 09:25:35PM +0100, Trond Myklebust wrote:
> >>>>> " " == Stephen C Tweedie <[email protected]> writes:
> > So I suspect that this is at root a client problem --- the
> > client has repeated a READDIR despite being told that the
> > previous reply was EOF

> I disagree. As far as the client is concerned, it has just been asked
> to read the entry that corresponds to that particular cookie.

No, it hasn't --- at least not unless there has been a seekdir in
between. If the client has already been told that we're at EOF, then
it's wrong to go back to the server again for more data.

Having said that, the server is clearly in error in sending a
duplicate cookie in the first place, and if it did so we'd never get
into such a state.

> If
> glibc issued a new readdir request (which is what I suspect has
> happened here), the NFS client has no idea what the previous reply
> was

Well, glibc will *always* issue another readdir, because the only way
we can ever tell glibc that we're at EOF on the directory is when we
eventually return 0 from getdents. The question about client
behaviour is, if we've already been told that the stream is at EOF,
should the client simply discard that info and keep reading
regardless, or should it cache the EOF status?

> IOW: A cookie should *always* be unique. There are no exceptions to
> this rule.

Agreed.

--Stephen

2002-11-27 22:36:50

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

>> If glibc issued a new readdir request (which is what I suspect
>> has happened here), the NFS client has no idea what the
>> previous reply was

> Well, glibc will *always* issue another readdir, because the
> only way we can ever tell glibc that we're at EOF on the
> directory is when we eventually return 0 from getdents. The
> question about client behaviour is, if we've already been told
> that the stream is at EOF, should the client simply discard
> that info and keep reading regardless, or should it cache the
> EOF status?

We could possibly cache the EOF status by overloading some other field
in the struct file. f_version comes to mind as a useful candidate,
since it automatically gets reset by llseek.
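
To make the idea concrete, a rough and purely hypothetical sketch (the
helper names and the sentinel value are invented here, and this is not
the actual NFS client code) might be:

#include <linux/fs.h>

/* Hypothetical sentinel: remember that the cookie in filp->f_pos was
 * returned with the EOF flag set.  Because llseek resets f_version,
 * a seekdir()/llseek() on the directory clears the marker again. */
#define NFS_READDIR_EOF_SEEN 1UL

static void nfs_dir_mark_eof(struct file *filp)
{
    filp->f_version = NFS_READDIR_EOF_SEEN;
}

static int nfs_dir_at_cached_eof(const struct file *filp)
{
    return filp->f_version == NFS_READDIR_EOF_SEEN;
}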

Cheers,
Trond

2002-11-28 01:59:46

by Jeremy Fitzhardinge

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

On Wed, 2002-11-27 at 07:00, Stephen C. Tweedie wrote:
> The dump looks problematic, but I'm not sure whether it's a client or
> a server problem.
>
> Frame 6 is a READDIR reply, ending with
> Entry: file ID 294052, name vg_include.h~51-kill-inceip
> Cookie: 1611747420
> Entry: file ID 298407, name vg_libpthread.vs
> Cookie: 1611747420
> Value Follows: No
> EOF: 1
>
> Now, this final frame has EOF = 1, so the client should not be looking
> for any more data after it. If the client _does_ try to progress,
> though, it will use that final cookie for the next request, and will
> naturally find the results already in cache so will repeat the final
> chunk of directory entries.

Hm, I just did a run with 8k NFS packets, and the results are slightly
different. There's only a single READDIR reply, fragmented over 3 IP
packets. vg_include.h~51-kill-inceip still has a cookie of 1611747420,
and vg_libpthread.vs is still the last entry returned, but it has a
cookie of 0. As far as I can see, all the other cookies are unique.

Trace attached.

J


Attachments:
nfs-8k.pcap.gz (2.17 kB)

2002-11-28 02:38:53

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:

> Hm, I just did a run with 8k NFS packets, and the results are
> slightly different. There's only a single READDIR reply,
> fragmented over 3 IP packets. vg_include.h~51-kill-inceip
> still has a cookie of 1611747420, and vg_libpthread.vs is still
> the last entry returned, but it has a cookie of 0. As far as I
> can see, all the other cookies are unique.

AFAICS, the ext2/ext3 code is simply failing to initialize that last
cookie, and so knfsd is giving it the value of the cookie that was
passed by the client.

Cheers,
Trond

2002-11-28 16:37:37

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Wed, Nov 27, 2002 at 08:55:54PM +0000, Stephen C. Tweedie wrote:

> Having said that, the server is clearly in error in sending a
> duplicate cookie in the first place, and if it did so we'd never get
> into such a state.

And it's ext3's fault. Reproducer below. Run the attached readdir
against an htree directory and you get something like:

[root@host1 htest]# ~sct/test/fs/readdir
getdents at f_pos 0000000000000000 returned 4084.
getdents at f_pos 0X0000000B753BE7 returned 4080.
getdents at f_pos 0X000000158C4C61 returned 4080.
getdents at f_pos 0X00000021E86BDC returned 4080.
getdents at f_pos 0X0000002D60F25D returned 4080.
getdents at f_pos 0X00000037BC95D7 returned 4096.
getdents at f_pos 0X000000434E2AA3 returned 4080.
getdents at f_pos 0X0000004EF11AE6 returned 4080.
getdents at f_pos 0X000000596EBC2F returned 4080.
getdents at f_pos 0X00000065A76668 returned 4080.
getdents at f_pos 0X0000007060CF8B returned 4080.
getdents at f_pos 0X0000007B9213FA returned 1464.
getdents at f_pos 0X0000007B9213FA returned 0.
Final f_pos is 0X0000007B9213FA.
[root@host1 htest]#

The problem is that the htree readdir code is not updating f_pos after
returning the very last chunk of data to the caller. That doesn't
hurt most callers because the location is cached in the
filp->private_data, but it really upsets NFS.
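
The readdir.c attachment isn't reproduced in this archive; a sketch of a
reproducer with the same output format, assuming it reads f_pos via
lseek() before each getdents64() call, would be roughly:

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char **argv)
{
    char buf[4096];
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    long n;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    do {
        /* Print the directory position before each getdents64 call so a
         * missing final f_pos update shows up as a repeated position. */
        long long pos = lseek(fd, 0, SEEK_CUR);
        n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        if (n < 0) {
            perror("getdents64");
            return 1;
        }
        printf("getdents at f_pos 0x%016llX returned %ld.\n", pos, n);
    } while (n > 0);
    printf("Final f_pos is 0x%016llX.\n", (long long)lseek(fd, 0, SEEK_CUR));
    return 0;
}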

--Stephen


Attachments:
(No filename) (1.32 kB)
readdir.c (1.06 kB)

2002-11-28 16:34:45

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Wed, Nov 27, 2002 at 11:44:01PM +0100, Trond Myklebust wrote:

> We could possibly cache the EOF status by overloading some other field
> in the struct file. f_version comes to mind as a useful candidate,
> since it automatically gets reset by llseek.

If you want it to be preserved in cache, it needs to be in the inode,
not in the file.

--Stephen

2002-11-28 16:50:53

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

>> We could possibly cache the EOF status by overloading some
>> other field in the struct file. f_version comes to mind as a
>> useful candidate, since it automatically gets reset by llseek.

> If you want it to be preserved in cache, it needs to be in the
> inode, not in the file.

You misunderstand the problem. It isn't that the page cache has an
incorrect representation of the stream: the page cache does indeed
stop filling as soon as it hits the EOF marker.

The problem is that we are stuffing the server-supplied cookies into
file->f_pos and using them to reconstruct where we are in the readdir
stream.
As there is no reserved 'EOF cookie' defined by the protocol that we
might use, we must either rely on the server giving us a unique cookie
also for the EOF case, or else mark the fact that filp->f_pos points
to EOF in some other way.

Cheers,
Trond

2002-11-28 17:02:22

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Thu, Nov 28, 2002 at 05:58:04PM +0100, Trond Myklebust wrote:
> >>>>> " " == Stephen C Tweedie <[email protected]> writes:

> > If you want it to be preserved in cache, it needs to be in the
> > inode, not in the file.
>
> You misunderstand the problem. It isn't that the page cache has an
> incorrect representation of the stream: the page cache does indeed
> stop filling as soon as it hits the EOF marker.

I know that.

> The problem is that we are stuffing the server-supplied cookies into
> file->f_pos and using them to reconstruct where we are in the readdir
> stream.
> As there is no reserved 'EOF cookie' defined by the protocol that we
> might use, we must either rely on the server giving us a unique cookie
> also for the EOF case, or else mark the fact that filp->f_pos points
> to EOF in some other way.

Right. But marking the fact can be done in the inode. We do that for
regular files, after all --- we have an i_size field which marks the
value of f_pos which represents EOF. And _that_ is what I'm
suggesting for the NFS case --- record in the inode the cookie which
represents EOF, so that in future reads from cache, we still know when
we've got to the end of the stream.

This gets complicated by the possible presence of hash collisions,
though.

Ted, I think there's a problem with the logic for that case. We can
really only deal sensibly with hash collisions if we do everything in
our power to prevent returning f_pos back to the user in the middle of
a collision, but there's nothing in the ext3 code to do that right now
afaict. (And even if there were, NFS would just keep calling us to
get more dirents, so fixing it just in ext3 doesn't stop a hash
collision from spanning an NFS reply.)

--Stephen

2002-11-28 17:06:25

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Thu, Nov 28, 2002 at 04:44:39PM +0000, Stephen C. Tweedie wrote:

> And it's ext3's fault. Reproducer below. Run the attached readdir
> against an htree directory and you get something like:
> ...
> getdents at f_pos 0X0000007060CF8B returned 4080.
> getdents at f_pos 0X0000007B9213FA returned 1464.
> getdents at f_pos 0X0000007B9213FA returned 0.
> Final f_pos is 0X0000007B9213FA.
> [root@host1 htest]#
>
> The problem is that the htree readdir code is not updating f_pos after
> returning the very last chunk of data to the caller. That doesn't
> hurt most callers because the location is cached in the
> filp->private_data, but it really upsets NFS.

In fact, it's not clear what we _can_ return as f_pos after the last
dirent.

We're only using 31-bit hashes right now. Trond, how will other NFS
clients react if we return an NFS cookie 32-bits wide? We could
easily use something like 0x80000000 as an f_pos to represent EOF in
the Linux side of things, but will that cookie work if passed over the
wire on NFSv2?

The alternative is to hack in a special case so that (for example) we
consider a major htree hash of 0x7fffffff to map to an f_pos of
0x7ffffffe and just consider that a possible collision, so that
0x7fffffff is a unique EOF for the htree tree walker.
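
As a tiny illustration of that special case (hypothetical helper, not
the actual ext3 code), the mapping would be something like:

/* Reserve 0x7fffffff as a unique EOF position for the htree walker;
 * a real major hash of 0x7fffffff is folded down to 0x7ffffffe and
 * simply treated as a possible collision. */
#define HTREE_EOF_POS 0x7fffffffUL

static unsigned long htree_hash_to_fpos(unsigned long major_hash)
{
    return (major_hash >= HTREE_EOF_POS) ? HTREE_EOF_POS - 1 : major_hash;
}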

--Stephen

2002-11-28 17:37:05

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

> We're only using 31-bit hashes right now. Trond, how will
> other NFS clients react if we return an NFS cookie 32-bits
> wide? We could easily use something like 0x80000000 as an
> f_pos to represent EOF in the Linux side of things, but will
> that cookie work if passed over the wire on NFSv2?

For all other NFS clients that I know of, this is perfectly
acceptable. As far as the Linux kernel goes, it is quite OK, but when
you get to userland, glibc-2.2 and above will insist that this is an
illegal value (they like to sign extend 32-bit values). Causes no end
of trouble, since XFS tends to use '0xffffffff' as the EOF cookie.

I have a patch that hacks the values of such cookies so that glibc
will accept them. That hack will never go into the official kernel,
so it would be nice if ext2/ext3 could avoid the need for it.

> The alternative is to hack in a special case so that (for
> example) we consider a major htree hash of 0x7fffffff to map to
> an f_pos of 0x7ffffffe and just consider that a possible
> collision, so that 0x7fffffff is a unique EOF for the htree
> tree walker.

That would be fine.

Cheers,
Trond

2002-11-28 17:50:30

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

> Right. But marking the fact can be done in the inode. We do
> that for regular files, after all --- we have an i_size field
> which marks the value of f_pos which represents EOF. And
> _that_ is what I'm suggesting for the NFS case --- record in
> the inode the cookie which represents EOF, so that in future
> reads from cache, we still know when we've got to the end of
> the stream.

That wouldn't really help much here, since that scheme also has a
uniqueness assumption on the cookie. I agree that it would cause a
more graceful failure, since we'd end up truncating the readdir rather
than endlessly looping, but as far as the user is concerned, it's
still wrong behaviour.

The real solution here would be to augment the cookie information in
the struct file with something like a filename in order to define
where we are in the readdir stream.
That doesn't work too well with telldir/seekdir though, and glibc
(grr....) relies heavily on using those in its 'heuristic' that
corrects for glibc's dirent being larger than the kernel dirent.

Cheers,
Trond

2002-11-28 19:53:08

by Jeremy Fitzhardinge

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

On Thu, 2002-11-28 at 09:13, Stephen C. Tweedie wrote:
> In fact, it's not clear what we _can_ return as f_pos after the last
> dirent.
>
> We're only using 31-bit hashes right now. Trond, how will other NFS
> clients react if we return an NFS cookie 32-bits wide? We could
> easily use something like 0x80000000 as an f_pos to represent EOF in
> the Linux side of things, but will that cookie work if passed over the
> wire on NFSv2?
>
> The alternative is to hack in a special case so that (for example) we
> consider a major htree hash of 0x7fffffff to map to an f_pos of
> 0x7ffffffe and just consider that a possible collision, so that
> 0x7fffffff is a unique EOF for the htree tree walker.

Even if you fix this, there's another problem.

It seems that htree basically can't work with NFS in its current state -
it only works at all on small directories, which aren't hashed and
therefore use the non-htree cookie scheme. This can be fixed by creating
a distinct EOF cookie.

However, in the transformation from a non-hashed to a hashed directory,
the cookie scheme completely changes, and in effect invalidates all
cookies currently known by clients. The obvious problem is that sometimes
adding a single entry to a directory will kill all concurrent readdirs.

I know that changing a directory while scanning it has at least some
undefined effects (allowed to miss entries, but not allowed to
duplicate, if I remember correctly), but if you add a single entry to a
directory, is it allowed to completely break any pending readdir
operation?

One solution I can think of is to always use name hashes as directory
cookies, even for non-hashed directories. This means that scans of a
small directory will require linear searching to find the entry
corresponding to a particular cookie, but since the directory is small
by definition it shouldn't be a bad performance hit.

J