2002-11-26 23:39:06

by Jeremy Fitzhardinge

Subject: htree+NFS (NFS client bug?)

I'm having problems with exporting htree ext3 filesystems over NFS.
Quite often, if a program on the client side is reading a directory it
will simply stop and consume a vast amount of memory. It seems that the
readdir never terminates, and endlessly returns the same dir entries
again and again.

This seems to be happening entirely on the NFS client side; after the
initial readdir and a couple of getattrs, there's no further network
traffic.

It looks to me like some sort of problem managing the NFS readdir
cookies, but it isn't clear to me whether this is the NFS server/ext3
generating bad cookies, or the NFS client handling them wrongly.

Both the client and server machines are running 2.4.20-rc1, with Ted's
extfs-update-2.4.20-rc1 patch (dating from Nov 8th or so). Is there a
more up to date patch?

This is the network traffic:

# tcpdump -s1500 -vvv -ieth0 host server and port 2049
tcpdump: listening on eth0
15:13:39.784538 client.3095412229 > server.nfs: 140 lookup fh Unknown/1 "coregrind-" (DF) (ttl 64, id 0, len 168)
15:13:39.784814 server.nfs > client.3095412229: reply ok 128 lookup fh Unknown/1 DIR 40775 ids 1043/1043 sz 12288 nlink 3 rdev ffffffff fsid 301 nodeid 47976 a/m/ctime 1038043539.000000 1038351994.000000 1038351994.000000 (DF) (ttl 64, id 0, len 156)
15:13:39.785247 client.3112189445 > server.nfs: 124 getattr fh Unknown/1 (DF) (ttl 64, id 0, len 152)
15:13:39.785434 server.nfs > client.3112189445: reply ok 96 getattr DIR 40775 ids 1043/1043 sz 12288 (DF) (ttl 64, id 0, len 124)
15:13:39.785938 client.3128966661 > server.nfs: 132 readdir fh Unknown/1 4096 bytes @ 0 (DF) (ttl 64, id 0, len 160)
15:13:39.786939 server.nfs > client.3128966661: reply ok 1472 readdir offset 1 size 293238 eof (frag 48814:1480@0+) (ttl 64, len 1500)
15:14:50.727869 client.3246407173 > server.nfs: 116 getattr fh 0,1/16777984 (DF) (ttl 64, id 0, len 144)
15:14:50.728127 server.nfs > client.3246407173: reply ok 96 getattr DIR 40755 ids 0/0 sz 4096 (DF) (ttl 64, id 0, len 124)

For this command:
$ strace -s1500 ls -f .
execve("/bin/ls", ["ls", "-f", "."], [/* 122 vars */]) = 0
[...]
lstat64(".", {st_mode=S_IFDIR|0775, st_size=12288, ...}) = 0
open("/dev/null", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0775, st_size=12288, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(3, /* 94 entries */, 4096) = 4096
brk(0x8059000) = 0x8059000
getdents64(3, /* 13 entries */, 4096) = 496
brk(0x805f000) = 0x805f000
getdents64(3, /* 94 entries */, 4096) = 4096
brk(0x806b000) = 0x806b000
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
brk(0x8082000) = 0x8082000
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
getdents64(3, /* 13 entries */, 4096) = 496
getdents64(3, /* 94 entries */, 4096) = 4096
[...]

A simple program which does an opendir/while(readdir != NULL)/closedir
loop shows this:
$ readdir .
name=".", ino=293238
name="..", ino=298200
name="vg_symtypes.h~", ino=163580
name="vg_include.h", ino=163476
name=".deps", ino=293390
name="vg_to_ucode.c~47-chained-bb", ino=293901
name="vg_from_ucode.c~47-chained-bb", ino=293896
name="vg_memory.c", ino=293462
name="vg_symtypes.c~", ino=163620
name="vg_constants.h~48-chained-indirect", ino=293925
name="valgrind", ino=293394
name="vg_needs.c", ino=293431
name="vg_from_ucode.i", ino=293860
name="vg_symtypes.h~44-symbolic-addr", ino=293651
name="vg_libpthread.c~11-timedwait-rel", ino=293628
name="vg_libpthread_unimp.c", ino=293999
name="vg_constants.h~47-chained-bb", ino=293904
name="vg_libpthread.c~03-accept-nonblock", ino=293652
name="vg_transtab.c~47-chained-bb", ino=293900
name="vg_include.h~48-chained-indirect", ino=293939
name="vg_symtypes.c~44-symbolic-addr", ino=293639
name="vg_main.c~14-hg-mmap-magic-virgin", ino=293941
name="vg_symtab2-test.c", ino=293903
name="vg_startup.S", ino=298417
name="vg_main.c~51-kill-inceip", ino=294086
name="vg_default.c", ino=293439
name="vg_translate.c~01-partial-mul", ino=293615
name="vg_dispatch.S~48-chained-indirect", ino=293856
name="vg_scheduler.c~47-chained-bb", ino=293890
name="vg_clientfuncs.c", ino=298391
name="Makefile.in", ino=298437
name="gmon.out", ino=293911
name="vg_main.c~47-chained-bb", ino=293897
name="vg_procselfmaps.c", ino=298414
name="vg_symtab2.h", ino=163541
name="vg_main.c~48-chained-indirect", ino=293946
name="vg_translate.c~51-kill-inceip", ino=294087
name="vg_symtypes.h", ino=163543
name="vg_symtab_stabs.c", ino=163530
name="dosyms", ino=298389
name="vg_constants.h", ino=163142
name="vg_demangle.c", ino=298395
name="vg_messages.c", ino=293882
name="vg_include.h~50-fast-cond", ino=294050
name="vg_to_ucode.c", ino=163552
name="Makefile~", ino=293393
name="vg_symtab2.c~44-symbolic-addr", ino=293650
name="vg_dispatch.S~47-chained-bb", ino=293878
name="vg_libpthread.c", ino=163595
name="vg_kerneliface.h", ino=163597
name="vg_from_ucode.c~48-chained-indirect", ino=293880
name="vg_from_ucode.c", ino=163451
name="vg_from_ucode.c~49-no-inceip", ino=294060
name="vg_malloc2.c", ino=163531
name="vg_instrument.c", ino=298403
name="vg_signals.c~28-sigtrap", ino=293637
name="vg_main.c.rej", ino=163538
name="vg_symtab2-test", ino=293691
name="vg_errcontext.c", ino=163586
name="vg_translate.c", ino=163539
name="vg_symtypes.c", ino=163647
name="out.pid11357", ino=293908
name="vg_execontext.c", ino=293436
name="Makefile", ino=293504
name="vg_main.c", ino=163523
name="vg_errcontext.c~47-chained-bb", ino=293898
name="vg_signals.c", ino=163536
name="vg_symtab2.c~offset-dehack", ino=293949
name="vg_from_ucode.c~01-partial-mul", ino=293605
name="Makefile.am~44-symbolic-addr", ino=293671
name="vg_main.c.orig", ino=163519
name="vg_symtab2.c", ino=163599
name="vg_kerneliface.h~28-sigtrap", ino=293638
name="vg_ldt.c", ino=298405
name="vg_from_ucode.c~00-lazy-fp", ino=293269
name="vg_transtab.c", ino=163622
name="vg_mylibc.c~12-vgprof", ino=293636
name="vg_from_ucode.c~50-fast-cond", ino=293862
name="vg_symtab2.h~44-symbolic-addr", ino=293851
name="vg_symtab_dwarf.c~44-symbolic-addr", ino=293858
name="vg_unsafe.h", ino=298424
name="vg_include.h~51-kill-inceip", ino=294052
name="vg_symtab_stabs.c~44-symbolic-addr", ino=293842
name="vg_mylibc.c", ino=163177
name="vg_dummy_profile.c", ino=298397
name="vg_main.c~50-fast-cond", ino=293956
name="vg_mylibc.c~09-rdtsc-calibration", ino=293625
name="vg_include.h~47-chained-bb", ino=293879
name="vg_translate.c~47-chained-bb", ino=293894
name="vg_libpthread.c~46-fix-writeable_or_erring-proto", ino=293863
name="vg_syscall.S", ino=298419
name="vg_from_ucode.c~51-kill-inceip", ino=294051
name="vg_mylibc.c~28-sigtrap", ino=298763
name="vg_dispatch.S", ino=163511
name="vg_symtab_dwarf.c", ino=163448
name="vg_intercept.c", ino=293614
name="vg_symtab_stabs.c~", ino=163548
name="vg_clientmalloc.c", ino=298392
name="vg_valgrinq_dummy.c", ino=298425
name="vg_scheduler.c", ino=163617
name="vg_to_ucode.c~01-partial-mul", ino=293607
name="Makefile.am", ino=163594
name="vg_helpers.S", ino=298401
name="out.pid11334", ino=293907
name="vg_syscalls.c", ino=293575
name="valgrind.in", ino=298390
name="vg_libpthread.vs", ino=298407
name=".", ino=293238
name="..", ino=298200
name="vg_symtypes.h~", ino=163580
name="vg_include.h", ino=163476
name=".deps", ino=293390
name="vg_to_ucode.c~47-chained-bb", ino=293901
name="vg_from_ucode.c~47-chained-bb", ino=293896
name="vg_memory.c", ino=293462
name="vg_symtypes.c~", ino=163620
name="vg_constants.h~48-chained-indirect", ino=293925
name="valgrind", ino=293394
name="vg_needs.c", ino=293431
name="vg_from_ucode.i", ino=293860
name="vg_symtypes.h~44-symbolic-addr", ino=293651
name="vg_libpthread.c~11-timedwait-rel", ino=293628
name="vg_libpthread_unimp.c", ino=293999
name="vg_constants.h~47-chained-bb", ino=293904
name="vg_libpthread.c~03-accept-nonblock", ino=293652
name="vg_transtab.c~47-chained-bb", ino=293900
name="vg_include.h~48-chained-indirect", ino=293939
name="vg_symtypes.c~44-symbolic-addr", ino=293639
name="vg_main.c~14-hg-mmap-magic-virgin", ino=293941
name="vg_symtab2-test.c", ino=293903
name="vg_startup.S", ino=298417
name="vg_main.c~51-kill-inceip", ino=294086
name="vg_default.c", ino=293439
name="vg_translate.c~01-partial-mul", ino=293615
name="vg_dispatch.S~48-chained-indirect", ino=293856
name="vg_scheduler.c~47-chained-bb", ino=293890
name="vg_clientfuncs.c", ino=298391
name="Makefile.in", ino=298437
name="gmon.out", ino=293911
name="vg_main.c~47-chained-bb", ino=293897
name="vg_procselfmaps.c", ino=298414
name="vg_symtab2.h", ino=163541
name="vg_main.c~48-chained-indirect", ino=293946
name="vg_translate.c~51-kill-inceip", ino=294087
name="vg_symtypes.h", ino=163543
name="vg_symtab_stabs.c", ino=163530
name="dosyms", ino=298389
name="vg_constants.h", ino=163142
name="vg_demangle.c", ino=298395
name="vg_messages.c", ino=293882
name="vg_include.h~50-fast-cond", ino=294050
name="vg_to_ucode.c", ino=163552
name="Makefile~", ino=293393
name="vg_symtab2.c~44-symbolic-addr", ino=293650
name="vg_dispatch.S~47-chained-bb", ino=293878
name="vg_libpthread.c", ino=163595
name="vg_kerneliface.h", ino=163597
name="vg_from_ucode.c~48-chained-indirect", ino=293880
name="vg_from_ucode.c", ino=163451
name="vg_from_ucode.c~49-no-inceip", ino=294060
name="vg_malloc2.c", ino=163531
name="vg_instrument.c", ino=298403
name="vg_signals.c~28-sigtrap", ino=293637
name="vg_main.c.rej", ino=163538
name="vg_symtab2-test", ino=293691
name="vg_errcontext.c", ino=163586
name="vg_translate.c", ino=163539
name="vg_symtypes.c", ino=163647
name="out.pid11357", ino=293908
name="vg_execontext.c", ino=293436
name="Makefile", ino=293504
name="vg_main.c", ino=163523
name="vg_errcontext.c~47-chained-bb", ino=293898
name="vg_signals.c", ino=163536
name="vg_symtab2.c~offset-dehack", ino=293949
name="vg_from_ucode.c~01-partial-mul", ino=293605
name="Makefile.am~44-symbolic-addr", ino=293671
name="vg_main.c.orig", ino=163519
name="vg_symtab2.c", ino=163599
name="vg_kerneliface.h~28-sigtrap", ino=293638
name="vg_ldt.c", ino=298405
name="vg_from_ucode.c~00-lazy-fp", ino=293269
name="vg_transtab.c", ino=163622
name="vg_mylibc.c~12-vgprof", ino=293636
name="vg_from_ucode.c~50-fast-cond", ino=293862
name="vg_symtab2.h~44-symbolic-addr", ino=293851
name="vg_symtab_dwarf.c~44-symbolic-addr", ino=293858
name="vg_unsafe.h", ino=298424
name="vg_include.h~51-kill-inceip", ino=294052
name="vg_symtab_stabs.c~44-symbolic-addr", ino=293842
name="vg_mylibc.c", ino=163177
name="vg_dummy_profile.c", ino=298397
name="vg_main.c~50-fast-cond", ino=293956
name="vg_mylibc.c~09-rdtsc-calibration", ino=293625
name="vg_include.h~47-chained-bb", ino=293879
name="vg_translate.c~47-chained-bb", ino=293894
name="vg_libpthread.c~46-fix-writeable_or_erring-proto", ino=293863
name="vg_syscall.S", ino=298419
name="vg_from_ucode.c~51-kill-inceip", ino=294051
name="vg_mylibc.c~28-sigtrap", ino=298763
[...etc...]
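
The test program itself isn't attached; a minimal sketch of the
opendir/readdir loop described above, producing the same output format,
might look like the following (an illustration, not the exact program
used):

/* Minimal sketch of the opendir/while(readdir != NULL)/closedir test;
 * on the affected NFS mount this loop never sees the NULL return. */
#include <stdio.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    DIR *d = opendir(argc > 1 ? argv[1] : ".");
    struct dirent *de;

    if (!d) {
        perror("opendir");
        return 1;
    }
    while ((de = readdir(d)) != NULL)
        printf("name=\"%s\", ino=%lu\n",
               de->d_name, (unsigned long)de->d_ino);
    closedir(d);
    return 0;
}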


J


2002-11-27 03:19:43

by Trond Myklebust

Subject: Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:

> It looks to me like some sort of problem managing the NFS
> readdir cookies, but it isn't clear to me whether this is the
> NFS server/ext3 generating bad cookies, or the NFS client
> handling them wrongly.

In order to determine which of the two needs to be fixed, it would
help if you could print out the cookies from that listing or better
still: if you could provide us with the raw tcpdump output. Please
remember to use an 8k snaplen for the tcpdump...

Cheers,
Trond

2002-11-27 03:47:58

by Christopher Li

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

I am not sure whether this will help or not:

Htree readdir returns a file position bigger than the real size of the
directory; it stores the hash information in the lower bits.

I have just gone through the source code, but haven't found out how NFS
handles that yet.

Yes, printing out the cookies will help.

Chris

On Wed, Nov 27, 2002 at 04:26:46AM +0100, Trond Myklebust wrote:
> >>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:
>
> > It looks to me like some sort of problem managing the NFS
> > readdir cookies, but it isn't clear to me whether this is the
> > NFS server/ext3 generating bad cookies, or the NFS client
> > handling them wrongly.
>
> In order to determine which of the two needs to be fixed, it would
> help if you could print out the cookies from that listing or better
> still: if you could provide us with the raw tcpdump output. Please
> remember to use an 8k snaplen for the tcpdump...
>
> Cheers,
> Trond
>


2002-11-27 08:51:26

by Jeremy Fitzhardinge

Subject: Re: [NFS] htree+NFS (NFS client bug?)

On Tue, 2002-11-26 at 19:26, Trond Myklebust wrote:
> >>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:
>
> > It looks to me like some sort of problem managing the NFS
> > readdir cookies, but it isn't clear to me whether this is the
> > NFS server/ext3 generating bad cookies, or the NFS client
> > handling them wrongly.
>
> In order to determine which of the two needs to be fixed, it would
> help if you could print out the cookies from that listing or better
> still: if you could provide us with the raw tcpdump output. Please
> remember to use an 8k snaplen for the tcpdump...

Here's a pcap dump file of the transaction. I changed [rw]size to 1024
because I couldn't work out how to get ethereal to reassemble the
fragments. It doesn't seem to have affected the behaviour at all.

This was with a 2.5.47 client (same server as before).

J


Attachments:
nfs-htree.pcap.gz (2.26 kB)

2002-11-27 13:26:42

by Theodore Ts'o

Subject: Re: htree+NFS (NFS client bug?)

On Tue, Nov 26, 2002 at 03:44:46PM -0800, Jeremy Fitzhardinge wrote:
> I'm having problems with exporting htree ext3 filesystems over NFS.
> Quite often, if a program on the client side is reading a directory it
> will simply stop and consume a vast amount of memory. It seems that the
> readdir never terminates, and endlessly returns the same dir entries
> again and again.
>
> This seems to be happening entirely on the NFS client side; after the
> initial readdir and a couple of getattrs, there's no further network
> traffic.
>
> It looks to me like some sort of problem managing the NFS readdir
> cookies, but it isn't clear to me whether this is the NFS server/ext3
> generating bad cookies, or the NFS client handling them wrongly.

Well, even if the NFS server is generating bad cookies (and that may
be possible), the NFS client should be more robust and not spin in a
loop forever.

Can you send me a directory list (from the server) of the directory in
question? Also, can you send me the output of dumpe2fs -h on the
filesystem. I'll need the latter to get the seed for the htree hash,
so I can try replicating this on my end.

Also, have you tried running e2fsck on the filesystem on the server? It
would be very interesting to confirm whether or not the filesystem is
consistent.

- Ted

2002-11-27 14:54:01

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Wed, Nov 27, 2002 at 12:58:42AM -0800, Jeremy Fitzhardinge wrote:
> On Tue, 2002-11-26 at 19:26, Trond Myklebust wrote:
> > >>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:
> >
> > > It looks to me like some sort of problem managing the NFS
> > > readdir cookies, but it isn't clear to me whether this is the
> > > NFS server/ext3 generating bad cookies, or the NFS client
> > > handling them wrongly.
> >
> > In order to determine which of the two needs to be fixed, it would
> > help if you could print out the cookies from that listing or better
> > still: if you could provide us with the raw tcpdump output. Please
> > remember to use an 8k snaplen for the tcpdump...
>
> Here's a pcap dump file of the transaction. I changed [rw]size to 1024
> because I couldn't work out how to get ethereal to reassemble the
> fragments. It doesn't seem to have affected the behaviour at all.

The dump looks problematic, but I'm not sure whether it's a client or
a server problem.

Frame 6 is a READDIR reply, ending with

Entry: file ID 294052, name vg_include.h~51-kill-inceip
File ID: 294052
Name: vg_include.h~51-kill-inceip
length: 27
contents: vg_include.h~51-kill-inceip
fill bytes: opaque data
Cookie: 1611747420
Value Follows: No
EOF: 0

There follows in frame 7 a request for the next chunk, starting with
the same cookie, and the reply in frame 8 ends with:

Entry: file ID 298407, name vg_libpthread.vs
File ID: 298407
Name: vg_libpthread.vs
length: 16
contents: vg_libpthread.vs
Cookie: 1611747420
Value Follows: No
EOF: 1

Now, this final frame has EOF = 1, so the client should not be looking
for any more data after it. If the client _does_ try to progress,
though, it will use that final cookie for the next request, and will
naturally find the results already in cache so will repeat the final
chunk of directory entries.

So I suspect that this is at root a client problem --- the client has
repeated a READDIR despite being told that the previous reply was EOF
--- but that the best solution is actually to change the server to
poke a dedicated EOF cookie into the final dirent of the stream. One
of the reasons we need to do that is to cope with the unclear
situation when an application is using telldir/seekdir to reposition
the directory stream (tcsh is really bad for trying to do that with
32-bit offsets when globbing, for example, although the right answer
there is "use readdir64".)

Cheers,
Stephen

2002-11-27 17:19:24

by Jeremy Fitzhardinge

Subject: Re: htree+NFS (NFS client bug?)

.
..
vg_symtypes.h~
vg_include.h
.deps
vg_to_ucode.c~47-chained-bb
vg_from_ucode.c~47-chained-bb
vg_memory.c
vg_symtypes.c~
vg_constants.h~48-chained-indirect
valgrind
vg_needs.c
vg_from_ucode.i
vg_symtypes.h~44-symbolic-addr
vg_libpthread.c~11-timedwait-rel
vg_libpthread_unimp.c
vg_constants.h~47-chained-bb
vg_libpthread.c~03-accept-nonblock
vg_transtab.c~47-chained-bb
vg_include.h~48-chained-indirect
vg_symtypes.c~44-symbolic-addr
vg_main.c~14-hg-mmap-magic-virgin
vg_symtab2-test.c
vg_startup.S
vg_main.c~51-kill-inceip
vg_default.c
vg_translate.c~01-partial-mul
vg_dispatch.S~48-chained-indirect
vg_scheduler.c~47-chained-bb
vg_clientfuncs.c
Makefile.in
gmon.out
vg_main.c~47-chained-bb
vg_procselfmaps.c
vg_symtab2.h
vg_main.c~48-chained-indirect
vg_translate.c~51-kill-inceip
vg_symtypes.h
vg_symtab_stabs.c
dosyms
vg_constants.h
vg_demangle.c
vg_messages.c
vg_include.h~50-fast-cond
vg_to_ucode.c
Makefile~
vg_symtab2.c~44-symbolic-addr
vg_dispatch.S~47-chained-bb
vg_libpthread.c
vg_kerneliface.h
vg_from_ucode.c~48-chained-indirect
vg_from_ucode.c
vg_from_ucode.c~49-no-inceip
vg_malloc2.c
vg_instrument.c
vg_signals.c~28-sigtrap
vg_main.c.rej
vg_symtab2-test
vg_errcontext.c
vg_translate.c
vg_symtypes.c
out.pid11357
vg_execontext.c
Makefile
vg_main.c
vg_errcontext.c~47-chained-bb
vg_signals.c
vg_symtab2.c~offset-dehack
vg_from_ucode.c~01-partial-mul
Makefile.am~44-symbolic-addr
vg_main.c.orig
vg_symtab2.c
vg_kerneliface.h~28-sigtrap
vg_ldt.c
vg_from_ucode.c~00-lazy-fp
vg_transtab.c
vg_mylibc.c~12-vgprof
vg_from_ucode.c~50-fast-cond
vg_symtab2.h~44-symbolic-addr
vg_symtab_dwarf.c~44-symbolic-addr
vg_unsafe.h
vg_include.h~51-kill-inceip
vg_symtab_stabs.c~44-symbolic-addr
vg_mylibc.c
vg_dummy_profile.c
vg_main.c~50-fast-cond
vg_mylibc.c~09-rdtsc-calibration
vg_include.h~47-chained-bb
vg_translate.c~47-chained-bb
vg_libpthread.c~46-fix-writeable_or_erring-proto
vg_syscall.S
vg_from_ucode.c~51-kill-inceip
vg_mylibc.c~28-sigtrap
vg_dispatch.S
vg_symtab_dwarf.c
vg_intercept.c
vg_symtab_stabs.c~
vg_clientmalloc.c
vg_valgrinq_dummy.c
vg_scheduler.c
vg_to_ucode.c~01-partial-mul
Makefile.am
vg_helpers.S
out.pid11334
vg_syscalls.c
valgrind.in
vg_libpthread.vs


Attachments:
dirlist (2.13 kB)

2002-11-27 20:18:50

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

> So I suspect that this is at root a client problem --- the
> client has repeated a READDIR despite being told that the
> previous reply was EOF

I disagree. As far as the client is concerned, it has just been asked
to read the entry that corresponds to that particular cookie. If
glibc issued a new readdir request (which is what I suspect has
happened here), the NFS client has no idea what the previous reply
was, or even where it was positioned within the page cache (the latter
may have been cleared). All it can do is look up the cookie afresh and
start reading from there.

IOW: A cookie should *always* be unique. There are no exceptions to
this rule.

Cheers,
Trond

2002-11-27 20:35:06

by Trond Myklebust

Subject: Re: htree+NFS (NFS client bug?)

>>>>> " " == Theodore Ts'o <[email protected]> writes:

> Well, even if the NFS server is generating bad cookies (and
> that may be possible), the NFS client should be more robust and
> not spin in a loop forever.

No. In this case, the problem appears to be that the cookie issued by
the server for the end of directory case is not unique.

If userland calls an extra 'readdir()' after EOF is reached, then as
far as the client is concerned, this sort of thing cannot be
distinguished from readdir()+seekdir()+readdir()..

Cheers,
Trond

2002-11-27 20:48:57

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Wed, Nov 27, 2002 at 09:25:35PM +0100, Trond Myklebust wrote:
> >>>>> " " == Stephen C Tweedie <[email protected]> writes:
> > So I suspect that this is at root a client problem --- the
> > client has repeated a READDIR despite being told that the
> > previous reply was EOF

> I disagree. As far as the client is concerned, it has just been asked
> to read the entry that corresponds to that particular cookie.

No, it hasn't --- at least not unless there has been a seekdir in
between. If the client has already been told that we're at EOF, then
it's wrong to go back to the server again for more data.

Having said that, the server is clearly in error in sending a
duplicate cookie in the first place, and if it did so we'd never get
into such a state.

> If
> glibc issued a new readdir request (which is what I suspect has
> happened here), the NFS client has no idea what the previous reply
> was

Well, glibc will *always* issue another readdir, because the only way
we can ever tell glibc that we're at EOF on the directory is when we
eventually return 0 from getdents. The question about client
behaviour is, if we've already been told that the stream is at EOF,
should the client simply discard that info and keep reading
regardless, or should it cache the EOF status?

> IOW: A cookie should *always* be unique. There are no exceptions to
> this rule.

Agreed.

--Stephen

2002-11-27 22:36:50

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

>> If glibc issued a new readdir request (which is what I suspect
>> has happened here), the NFS client has no idea what the
>> previous reply was

> Well, glibc will *always* issue another readdir, because the
> only way we can ever tell glibc that we're at EOF on the
> directory is when we eventually return 0 from getdents. The
> question about client behaviour is, if we've already been told
> that the stream is at EOF, should the client simply discard
> that info and keep reading regardless, or should it cache the
> EOF status?

We could possibly cache the EOF status by overloading some other field
in the struct file. f_version comes to mind as a useful candidate,
since it automatically gets reset by llseek.
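
To make the idea concrete, a rough and purely hypothetical sketch (the
helper names and the sentinel value are invented here, and this is not
the actual NFS client code) might be:

#include <linux/fs.h>

/* Hypothetical sentinel: remember that the cookie in filp->f_pos was
 * returned with the EOF flag set.  Because llseek resets f_version,
 * a seekdir()/llseek() on the directory clears the marker again. */
#define NFS_READDIR_EOF_SEEN 1UL

static void nfs_dir_mark_eof(struct file *filp)
{
    filp->f_version = NFS_READDIR_EOF_SEEN;
}

static int nfs_dir_at_cached_eof(const struct file *filp)
{
    return filp->f_version == NFS_READDIR_EOF_SEEN;
}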

Cheers,
Trond

2002-11-28 01:59:46

by Jeremy Fitzhardinge

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

On Wed, 2002-11-27 at 07:00, Stephen C. Tweedie wrote:
> The dump looks problematic, but I'm not sure whether it's a client or
> a server problem.
>
> Frame 6 is a READDIR reply, ending with
> Entry: file ID 294052, name vg_include.h~51-kill-inceip
> Cookie: 1611747420
> Entry: file ID 298407, name vg_libpthread.vs
> Cookie: 1611747420
> Value Follows: No
> EOF: 1
>
> Now, this final frame has EOF = 1, so the client should not be looking
> for any more data after it. If the client _does_ try to progress,
> though, it will use that final cookie for the next request, and will
> naturally find the results already in cache so will repeat the final
> chunk of directory entries.

Hm, I just did a run with 8k NFS packets, and the results are slightly
different. There's only a single READDIR reply, fragmented over 3 IP
packets. vg_include.h~51-kill-inceip still has a cookie of 1611747420,
and vg_libpthread.vs is still the last entry returned, but it has a
cookie of 0. As far as I can see, all the other cookies are unique.

Trace attached.

J


Attachments:
nfs-8k.pcap.gz (2.17 kB)

2002-11-28 02:38:53

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Jeremy Fitzhardinge <[email protected]> writes:

> Hm, I just did a run with 8k NFS packets, and the results are
> slightly different. There's only a single READDIR reply,
> fragmented over 3 IP packets. vg_include.h~51-kill-inceip
> still has a cookie of 1611747420, and vg_libpthread.vs is still
> the last entry returned, but it has a cookie of 0. As far as I
> can see, all the other cookies are unique.

AFAICS, the ext2/ext3 code is simply failing to initialize that last
cookie, and so knfsd is giving it the value of the cookie that was
passed by the client.

Cheers,
Trond

2002-11-28 16:37:37

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Wed, Nov 27, 2002 at 08:55:54PM +0000, Stephen C. Tweedie wrote:

> Having said that, the server is clearly in error in sending a
> duplicate cookie in the first place, and if it did so we'd never get
> into such a state.

And it's ext3's fault. Reproducer below. Run the attached readdir
against an htree directory and you get something like:

[root@host1 htest]# ~sct/test/fs/readdir
getdents at f_pos 0000000000000000 returned 4084.
getdents at f_pos 0X0000000B753BE7 returned 4080.
getdents at f_pos 0X000000158C4C61 returned 4080.
getdents at f_pos 0X00000021E86BDC returned 4080.
getdents at f_pos 0X0000002D60F25D returned 4080.
getdents at f_pos 0X00000037BC95D7 returned 4096.
getdents at f_pos 0X000000434E2AA3 returned 4080.
getdents at f_pos 0X0000004EF11AE6 returned 4080.
getdents at f_pos 0X000000596EBC2F returned 4080.
getdents at f_pos 0X00000065A76668 returned 4080.
getdents at f_pos 0X0000007060CF8B returned 4080.
getdents at f_pos 0X0000007B9213FA returned 1464.
getdents at f_pos 0X0000007B9213FA returned 0.
Final f_pos is 0X0000007B9213FA.
[root@host1 htest]#

The problem is that the htree readdir code is not updating f_pos after
returning the very last chunk of data to the caller. That doesn't
hurt most callers because the location is cached in the
filp->private_data, but it really upsets NFS.
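
The readdir.c attachment isn't reproduced in this archive; a sketch of a
reproducer with the same output format, assuming it reads f_pos via
lseek() before each getdents64() call, would be roughly:

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char **argv)
{
    char buf[4096];
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    long n;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    do {
        /* Print the directory position before each getdents64 call so a
         * missing final f_pos update shows up as a repeated position. */
        long long pos = lseek(fd, 0, SEEK_CUR);
        n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        if (n < 0) {
            perror("getdents64");
            return 1;
        }
        printf("getdents at f_pos 0x%016llX returned %ld.\n", pos, n);
    } while (n > 0);
    printf("Final f_pos is 0x%016llX.\n", (long long)lseek(fd, 0, SEEK_CUR));
    return 0;
}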

--Stephen


Attachments:
(No filename) (1.32 kB)
readdir.c (1.06 kB)

2002-11-28 16:34:45

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Wed, Nov 27, 2002 at 11:44:01PM +0100, Trond Myklebust wrote:

> We could possibly cache the EOF status by overloading some other field
> in the struct file. f_version comes to mind as a useful candidate,
> since it automatically gets reset by llseek.

If you want it to be preserved in cache, it needs to be in the inode,
not in the file.

--Stephen

2002-11-28 16:50:53

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

>> We could possibly cache the EOF status by overloading some
>> other field in the struct file. f_version comes to mind as a
>> useful candidate, since it automatically gets reset by llseek.

> If you want it to be preserved in cache, it needs to be in the
> inode, not in the file.

You misunderstand the problem. It isn't that the page cache has an
incorrect representation of the stream: the page cache does indeed
stop filling as soon as it hits the EOF marker.

The problem is that we are stuffing the server-supplied cookies into
file->f_pos and using them to reconstruct where we are in the readdir
stream.
As there is no reserved 'EOF cookie' defined by the protocol that we
might use, we must either rely on the server giving us a unique cookie
also for the EOF case, or else mark the fact that filp->f_pos points
to EOF in some other way.

Cheers,
Trond

2002-11-28 17:02:22

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Thu, Nov 28, 2002 at 05:58:04PM +0100, Trond Myklebust wrote:
> >>>>> " " == Stephen C Tweedie <[email protected]> writes:

> > If you want it to be preserved in cache, it needs to be in the
> > inode, not in the file.
>
> You misunderstand the problem. It isn't that the page cache has an
> incorrect representation of the stream: the page cache does indeed
> stop filling as soon as it hits the EOF marker.

I know that.

> The problem is that we are stuffing the server-supplied cookies into
> file->f_pos and using them to reconstruct where we are in the readdir
> stream.
> As there is no reserved 'EOF cookie' defined by the protocol that we
> might use, we must either rely on the server giving us a unique cookie
> also for the EOF case, or else mark the fact that filp->f_pos points
> to EOF in some other way.

Right. But marking the fact can be done in the inode. We do that for
regular files, after all --- we have an i_size field which marks the
value of f_pos which represents EOF. And _that_ is what I'm
suggesting for the NFS case --- record in the inode the cookie which
represents EOF, so that in future reads from cache, we still know when
we've got to the end of the stream.

This gets complicated by the possible presence of hash collisions,
though.

Ted, I think there's a problem with the logic for that case. We can
really only deal sensibly with hash collisions if we do everything in
our power to prevent returning f_pos back to the user in the middle of
a collision, but there's nothing in the ext3 code to do that right now
afaict. (And even if there were, NFS would just keep calling us to
get more dirents, so fixing it just in ext3 doesn't stop a hash
collision from spanning an NFS reply.)

--Stephen

2002-11-28 17:06:25

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

Hi,

On Thu, Nov 28, 2002 at 04:44:39PM +0000, Stephen C. Tweedie wrote:

> And it's ext3's fault. Reproducer below. Run the attached readdir
> against an htree directory and you get something like:
> ...
> getdents at f_pos 0X0000007060CF8B returned 4080.
> getdents at f_pos 0X0000007B9213FA returned 1464.
> getdents at f_pos 0X0000007B9213FA returned 0.
> Final f_pos is 0X0000007B9213FA.
> [root@host1 htest]#
>
> The problem is that the htree readdir code is not updating f_pos after
> returning the very last chunk of data to the caller. That doesn't
> hurt most callers because the location is cached in the
> filp->private_data, but it really upsets NFS.

In fact, it's not clear what we _can_ return as f_pos after the last
dirent.

We're only using 31-bit hashes right now. Trond, how will other NFS
clients react if we return an NFS cookie 32-bits wide? We could
easily use something like 0x80000000 as an f_pos to represent EOF in
the Linux side of things, but will that cookie work if passed over the
wire on NFSv2?

The alternative is to hack in a special case so that (for example) we
consider a major htree hash of 0x7fffffff to map to an f_pos of
0x7ffffffe and just consider that a possible collision, so that
0x7fffffff is a unique EOF for the htree tree walker.
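
As a tiny illustration of that special case (hypothetical helper, not
the actual ext3 code), the mapping would be something like:

/* Reserve 0x7fffffff as a unique EOF position for the htree walker;
 * a real major hash of 0x7fffffff is folded down to 0x7ffffffe and
 * simply treated as a possible collision. */
#define HTREE_EOF_POS 0x7fffffffUL

static unsigned long htree_hash_to_fpos(unsigned long major_hash)
{
    return (major_hash >= HTREE_EOF_POS) ? HTREE_EOF_POS - 1 : major_hash;
}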

--Stephen

2002-11-28 17:37:05

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

> We're only using 31-bit hashes right now. Trond, how will
> other NFS clients react if we return an NFS cookie 32-bits
> wide? We could easily use something like 0x80000000 as an
> f_pos to represent EOF in the Linux side of things, but will
> that cookie work if passed over the wire on NFSv2?

For all other NFS clients that I know of, this is perfectly
acceptable. As far as the Linux kernel goes, it is quite OK, but when
you get to userland, glibc-2.2 and above will insist that this is an
illegal value (they like to sign extend 32-bit values). Causes no end
of trouble, since XFS tends to use '0xffffffff' as the EOF cookie.

I have a patch that hacks the values of such cookies so that glibc
will accept them. That hack will never go into the official kernel,
so it would be nice if ext2/ext3 could avoid the need for it.

> The alternative is to hack in a special case so that (for
> example) we consider a major htree hash of 0x7fffffff to map to
> an f_pos of 0x7ffffffe and just consider that a possible
> collision, so that 0x7fffffff is a unique EOF for the htree
> tree walker.

That would be fine.

Cheers,
Trond

2002-11-28 17:50:30

by Trond Myklebust

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

>>>>> " " == Stephen C Tweedie <[email protected]> writes:

> Right. But marking the fact can be done in the inode. We do
> that for regular files, after all --- we have an i_size field
> which marks the value of f_pos which represents EOF. And
> _that_ is what I'm suggesting for the NFS case --- record in
> the inode the cookie which represents EOF, so that in future
> reads from cache, we still know when we've got to the end of
> the stream.

That wouldn't really help much here, since that scheme also has a
uniqueness assumption on the cookie. I agree that it would cause a
more graceful failure, since we'd end up truncating the readdir rather
than endlessly looping, but as far as the user is concerned, it's
still wrong behaviour.

The real solution here would be to augment the cookie information in
the struct file with something like a filename in order to define
where we are in the readdir stream.
That doesn't work too well with telldir/seekdir though, and glibc
(grr....) relies heavily on using those in its 'heuristic' that
corrects for glibc's dirent being larger than the kernel dirent.

Cheers,
Trond

2002-11-28 19:53:08

by Jeremy Fitzhardinge

Subject: Re: [Ext2-devel] Re: [NFS] htree+NFS (NFS client bug?)

On Thu, 2002-11-28 at 09:13, Stephen C. Tweedie wrote:
> In fact, it's not clear what we _can_ return as f_pos after the last
> dirent.
>
> We're only using 31-bit hashes right now. Trond, how will other NFS
> clients react if we return an NFS cookie 32-bits wide? We could
> easily use something like 0x80000000 as an f_pos to represent EOF in
> the Linux side of things, but will that cookie work if passed over the
> wire on NFSv2?
>
> The alternative is to hack in a special case so that (for example) we
> consider a major htree hash of 0x7fffffff to map to an f_pos of
> 0x7ffffffe and just consider that a possible collision, so that
> 0x7fffffff is a unique EOF for the htree tree walker.

Even if you fix this, there's another problem.

It seems that htree basically can't work with NFS in its current state -
it only works at all on small directories, which aren't hashed and
therefore use the non-htree cookie scheme. This can be fixed by creating
a distinct EOF cookie.

However, in the transformation from a non-hashed to a hashed directory,
the cookie scheme completely changes, and in effect invalidates all
cookies currently known by clients. The obvious problem is that sometimes
adding a single entry to a directory will kill all concurrent readdirs.

I know that changing a directory while scanning it has at least some
undefined effects (allowed to miss entries, but not allowed to
duplicate, if I remember correctly), but if you add a single entry to a
directory, is it allowed to completely break any pending readdir
operation?

One solution I can think of is to always use name hashes as directory
cookies, even for non-hashed directories. This means that scans of a
small directory will require linear searching to find the entry
corresponding to a particular cookie, but since the directory is small
by definition it shouldn't be a bad performance hit.

J