Hello all,
I just upgraded a host from 2.2.19 to 2.2.21-pre3 and discovered a problem with kernel nfs. Setup is this:
knfs-server is 2.4.19-pre2
knfs-client is 2.2.21-pre3
First mount some fs (mountpoint /backup). Then go and mount some other fs from the same server (mountpoint /mnt), do some i/o on the latter and umount it again. Now try to access /backup. You see:
1) /backup (as a fs) vanished, you get a stale nfs handle.
2) umount /backup; mount /backup does not work. client tells "permission denied". server tells "rpc.mountd: getfh failed: Operation not permitted"
Only solution: restart nfs-server (no reboot required), then everything works again.
Same setup works with 2.2.19-client.
Any hints?
Regards,
Stephan
> I just upgraded a host from 2.2.19 to 2.2.21-pre3 and discovered a problem with kernel nfs. Setup is this:
>
> knfs-server is 2.4.19-pre2
> knfs-client is 2.2.21-pre3
Do you see this with 2.2.20 (2.2.20 has NFS changes in the client, 2.2.21pre
does not) ?
On Sat, 9 Mar 2002 19:10:52 +0000 (GMT)
Alan Cox <[email protected]> wrote:
> > I just upgraded a host from 2.2.19 to 2.2.21-pre3 and discovered a problem with kernel nfs. Setup is this:
> >
> > knfs-server is 2.4.19-pre2
> > knfs-client is 2.2.21-pre3
>
> Do you see this with 2.2.20 (2.2.20 has NFS changes in the client, 2.2.21pre
> does not) ?
Hello Alan,
sorry for the delayed answer. The machine in question is in production, and I had to find another test candidate first. So here we go:
1) The problem is reproducible on different hardware with a 2.2.21-pre3 client.
2) The problem is also reproducible with 2.2.20.
What should I test next?
Regards,
Stephan
>>>>> " " == Stephan von Krawczynski <[email protected]> writes:
> Hello all, I just upgraded a host from 2.2.19 to 2.2.21-pre3
> and discovered a problem with kernel nfs. Setup is this:
> knfs-server is 2.4.19-pre2 knfs-client is 2.2.21-pre3
> First mount some fs (mountpoint /backup). Then go and mount
> some other fs from the same server (mountpoint /mnt), do some
> i/o on the latter and umount it again. Now try to access
> /backup. You see:
> 1) /backup (as a fs) vanished, you get a stale nfs handle.
> 2) umount /backup; mount /backup does not work. client tells
> "permission denied". server tells "rpc.mountd: getfh failed:
> Operation not permitted"
By 'some fs' do you mean ext2?
Not all filesystems work well with knfsd when things start to drop out
of the (d|i)caches. In particular things like /backup == VFAT might
give the above behaviour, since VFAT does not know how to map the NFS
file handles into on-disk inodes.
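To make it concrete, the contract is roughly this (purely a sketch of a
hypothetical fs, untested; example_iget() is made up and stands in for the
fs-specific lookup):

/* Given the opaque words knfsd stored in the file handle, the fs must be
 * able to rebuild a dentry even when nothing is left in the dcache/icache.
 * A fs that cannot do that ends up handing ESTALE back to the client once
 * the cached dentry/inode is gone.
 */
static struct dentry *example_fh_to_dentry(struct super_block *sb, __u32 *data,
					   int len, int fhtype, int parent)
{
	struct inode *inode = example_iget(sb, data[0]); /* data[0]: inode number */
	struct dentry *result;

	if (!inode || is_bad_inode(inode) ||
	    (len > 1 && data[1] != inode->i_generation)) { /* data[1]: generation */
		if (inode)
			iput(inode);
		return ERR_PTR(-ESTALE);
	}
	result = d_alloc_root(inode);	/* 2.4-style disconnected dentry */
	if (!result) {
		iput(inode);
		return ERR_PTR(-ENOMEM);
	}
	result->d_flags |= DCACHE_NFSD_DISCONNECTED;
	return result;
}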
Cheers,
Trond
> >>>>> " " == Stephan von Krawczynski <[email protected]> writes:
>
> > Hello all, I just upgraded a host from 2.2.19 to 2.2.21-pre3
> > and discovered a problem with kernel nfs. Setup is this:
>
> > knfs-server is 2.4.19-pre2 knfs-client is 2.2.21-pre3
>
> > First mount some fs (mountpoint /backup). Then go and mount
> > some other fs from the same server (mountpoint /mnt), do some
> > i/o on the latter and umount it again. Now try to access
> > /backup. You see:
> > 1) /backup (as a fs) vanished, you get a stale nfs handle.
> > 2) umount /backup; mount /backup does not work. client tells
> > "permission denied". server tells "rpc.mountd: getfh failed:
> > Operation not permitted"
>
> By 'some fs' do you mean ext2?
>
> Not all filesystems work well with knfsd when things start to drop out
> of the (d|i)caches. In particular things like /backup == VFAT might
> give the above behaviour, since VFAT does not know how to map the NFS
> file handles into on-disk inodes.
Sorry Trond,
this is a weak attempt at an explanation. All involved fs types are
reiserfs. The problem occurs reproducibly only in 2.2.20 and above,
and _not_ in 2.2.19. There must be some problem.
I can't really tell whether the problem is on the client side, or
merely triggered by the client and actually located on the 2.4.18
server. But giving me something to try might clear the picture.
Any hints?
Stephan
>>>>> " " == Stephan von Krawczynski <[email protected]> writes:
> this is a weak attempt at an explanation. All involved fs types are
> reiserfs. The problem occurs reproducibly only in 2.2.20
Which ReiserFS format? Is it version 3.5?
'cat /proc/fs/reiserfs/device/version'
> and above,
> and _not_ in 2.2.19. There must be some
> problem.
The client code in 2.2.20 is supposed to be the same as in 2.4.x. The
only thing I can think might be missing is the fix to cope with broken
servers that reuse filehandles (this violates the RFCs). Reiserfs 3.5
+ knfsd is one such broken combination. Another broken server is
unfsd...
> I can't really tell whether the problem is on the client side, or
> merely triggered by the client and actually located
> on the 2.4.18 server. But giving me something to try might
> clear the picture.
You might try keeping a file open on /backup while you play with /mnt...
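Untested, but even something as trivial as this (the path is just an example)
would hold a file open for the duration of the test:

/* Opens a file on the NFS mount and just sits there, so the client keeps
 * a reference to that inode while you mount/umount other filesystems.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/backup/somefile", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	pause();	/* hold the file open until you kill the program */
	return 0;
}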
Cheers,
Trond
Hello!
On Mon, Mar 11, 2002 at 01:28:42AM +0100, Trond Myklebust wrote:
> > this is a weak attempt at an explanation. All involved fs types are
> > reiserfs. The problem occurs reproducibly only in 2.2.20
> Which ReiserFS format? Is it version 3.5?
> 'cat /proc/fs/reiserfs/device/version'
If this does not work because you have no such file, then look through your
kernel logs: if you use reiserfs v3.5 on a 2.4 kernel, it will show up
in the log file as: "reiserfs: using 3.5.x disk format"
Bye,
Oleg
On Mon, 11 Mar 2002 09:14:58 +0300
Oleg Drokin <[email protected]> wrote:
> Hello!
>
> On Mon, Mar 11, 2002 at 01:28:42AM +0100, Trond Myklebust wrote:
> > > this is a weak attempt at an explanation. All involved fs types are
> > > reiserfs. The problem occurs reproducibly only in 2.2.20
> > Which ReiserFS format? Is it version 3.5?
>
> > 'cat /proc/fs/reiserfs/device/version'
>
> If this does not work because you have no such file, then look through your
> kernel logs: if you use reiserfs v3.5 on a 2.4 kernel, it will show up
> in the log file as: "reiserfs: using 3.5.x disk format"
Hello Oleg, hello Trond, hello Alan,
I have several reiserfs fs in use on this server. boot.msg looks like
<4>reiserfs: checking transaction log (device 08:03) ...
<4>Using r5 hash to sort names
<4>ReiserFS version 3.6.25
<4>VFS: Mounted root (reiserfs filesystem) readonly.
<4>Freeing unused kernel memory: 224k freed
<6>Adding Swap: 265032k swap-space (priority 42)
<4>reiserfs: checking transaction log (device 21:01) ...
<4>Using r5 hash to sort names
<4>ReiserFS version 3.6.25
<4>reiserfs: checking transaction log (device 22:01) ...
<4>Using tea hash to sort names
<4>reiserfs: using 3.5.x disk format
<4>ReiserFS version 3.6.25
<4>reiserfs: checking transaction log (device 08:04) ...
<4>Using r5 hash to sort names
<4>ReiserFS version 3.6.25
I tried to find out which device has which numbers and did a cat /proc/devices:
Block devices:
2 fd
8 sd
11 sr
22 ide1
33 ide2
34 ide3
65 sd
66 sd
Interestingly there is no #21. Shouldn't I see a block device 21 here?
More strangely, the only two existing IDE drives in this system are located on
ide2 and ide3 and should therefore have major numbers 33 and 34, shouldn't they?
There is no hard disk on ide1, only a CDROM (not used during the test). ide0 is
completely empty.
As mentioned earlier, this is kernel 2.4.19-pre2.
Enlighten me, please...
Stephan
Hello!
On Mon, Mar 11, 2002 at 11:46:54AM +0100, Stephan von Krawczynski wrote:
> <4>reiserfs: checking transaction log (device 22:01) ...
> <4>Using tea hash to sort names
> <4>reiserfs: using 3.5.x disk format
This means you have reiserfs v3.5 format on /dev/hdc1
And this one won't behave very well with nfs.
Does this one contain your nfs exports?
Bye,
Oleg
On Mon, 11 Mar 2002 13:52:56 +0300
Oleg Drokin <[email protected]> wrote:
> Hello!
>
> On Mon, Mar 11, 2002 at 11:46:54AM +0100, Stephan von Krawczynski wrote:
> > <4>reiserfs: checking transaction log (device 22:01) ...
> > <4>Using tea hash to sort names
> > <4>reiserfs: using 3.5.x disk format
>
> This means you have reiserfs v3.5 format on /dev/hdc1
> And this one won't behave very well with nfs.
> Does this one contain your nfs exports?
There is _no_ /dev/hdc1.
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda3 6297280 6146232 151048 98% /
/dev/sda2 31111 24695 4810 84% /boot
/dev/hde1 60049096 30161576 29887520 51% /p2
/dev/hdg1 20043416 16419444 3623972 82% /p3
/dev/sda4 29245432 27525524 1719908 95% /p5
shmfs 1035112 0 1035112 0% /dev/shm
Exported fs is on /dev/hde1.
/dev/hdc could only be a cdrom, but it is neither in use nor mounted, and it has
certainly never had anything to do with reiserfs.
Regards,
Stephan
Hello!
On Mon, Mar 11, 2002 at 12:00:16PM +0100, Stephan von Krawczynski wrote:
> > On Mon, Mar 11, 2002 at 11:46:54AM +0100, Stephan von Krawczynski wrote:
> > > <4>reiserfs: checking transaction log (device 22:01) ...
> > > <4>Using tea hash to sort names
> > > <4>reiserfs: using 3.5.x disk format
> > This means you have reiserfs v3.5 format on /dev/hdc1
> > And this one won't behave very well with nfs.
> > Does this one contain your nfs exports?
> There is _no_ /dev/hdc1.
Stupid me! Numbers are in hex! ;) (0x22 is 34 decimal, the ide3 major, and
0x21 is 33, the ide2 major.)
So that's /dev/hdg1 that is reiserfs v3.5.
> /dev/hdg1 20043416 16419444 3623972 82% /p3
> Exported fs is on /dev/hde1.
Hm. Strange. Are you sure you do not export /dev/hdg1?
Bye,
Oleg
On Mon, 11 Mar 2002 14:11:54 +0300
Oleg Drokin <[email protected]> wrote:
> Hello!
> [...]
> > There is _no_ /dev/hdc1.
>
> Stupid me! Numbers are in hex! ;)
aahh, shoot me, a lot of trees, but no forest in sight ... ;-)
> So that's /dev/hdg1 that is reiserfs v3.5
>
> > /dev/hdg1 20043416 16419444 3623972 82% /p3
> > Exported fs is on /dev/hde1.
>
> Hm. Strange. Are you sure you do not export /dev/hdg1?
Yes, you are right, one of the exports comes from hdg1. I will reformat with 3.6 and re-check the problem. Anyway, I find it interesting that the problem does not occur with a 2.2.19 client ...
Stay tuned, I'll be back.
Regards,
Stephan
On Mon, 11 Mar 2002 14:11:54 +0300
Oleg Drokin <[email protected]> wrote:
> So that's /dev/hdg1 that is reiserfs v3.5
>
> > /dev/hdg1 20043416 16419444 3623972 82% /p3
> > Exported fs is on /dev/hde1.
>
> Hm. Strange. Are you sure you do not export /dev/hdg1?
Ok, Oleg,
I re-checked the setup with all server fs as reiserfs 3.6 and the problem stays
the same.
Mar 11 13:05:07 admin kernel: reiserfs: checking transaction log (device 22:01)
...
Mar 11 13:05:09 admin kernel: Using r5 hash to sort names
Mar 11 13:05:09 admin kernel: ReiserFS version 3.6.25
What else can I try?
I checked the setup with another client kernel 2.4.18, and guess what: it has
the same problem. I have the impression that the problem is somewhere on the
nfs server side - possibly around the umount case. Trond, Ken?
Can anyone reproduce this? It should be fairly simple to check.
Regards,
Stephan
Hello!
On Mon, Mar 11, 2002 at 01:47:17PM +0100, Stephan von Krawczynski wrote:
> What else can I try?
> I checked the setup with another client kernel 2.4.18, and guess what: it has
> the same problem. I have the impression that the problem is somewhere on the
> nfs server side - possibly around the umount case. Trond, Ken?
Just to be sure - have you tried 2.4.17 at the server?
2.4.18 has 2 patches included that were supposed to resolve another
stale filehandle problem.
Our tests have not shown any problems, but I am interested in whether you can still
reproduce it with these 2 patches reversed out of 2.4.18.
Also, if you can still trigger it, apply back only the 1st hunk of the G-... patch.
Bye,
Oleg
On Mon, 11 Mar 2002 15:59:37 +0300
Oleg Drokin <[email protected]> wrote:
> Hello!
>
> On Mon, Mar 11, 2002 at 01:47:17PM +0100, Stephan von Krawczynski wrote:
> > What else can I try?
> > I checked the setup with another client kernel 2.4.18, and guess what: it has
> > the same problem. I have the impression that the problem is somewhere on the
> > nfs server side - possibly around the umount case. Trond, Ken?
> Just to be sure - have you tried 2.4.17 at the server?
Hello Oleg,
I just checked with 2.4.17 on the server side: the problem stays.
I guess it will not make any sense to try your patches (reversing).
Regards,
Stephan
Hello!
On Mon, Mar 11, 2002 at 03:48:52PM +0100, Stephan von Krawczynski wrote:
> > Just to be sure - have you tried 2.4.17 at the server?
> I just checked with 2.4.17 on the server side: the problem stays.
> I guess it will not make any sense to try your patches (reversing).
Yes.
Hm. Can you make a non-reiserfs partition (say ext2) and try to reproduce the
problem on it? That way we will know which direction to dig further.
Trond, do you think that'll work or should some other non-ext2 fs be tried?
Bye,
Oleg
>>>>> " " == Oleg Drokin <[email protected]> writes:
> Trond, do you think that'll work or should some other non-ext2
> fs be tried?
Ext2 should work fine: I've never seen any problems such as that which
Stephan describes, and certainly not with 2.4.18 clients.
In any case, any occurrence of an ESTALE error *must* first have
originated from the server. The client itself cannot determine that a
filehandle is stale.
Cheers,
Trond
>>>>> " " == Trond Myklebust <[email protected]> writes:
> In any case, any occurrence of an ESTALE error *must* first have
> originated from the server. The client itself cannot determine
> that a filehandle is stale.
BTW: a tcpdump would prove this...
Cheers,
Trond
On Mon, 11 Mar 2002 14:59:04 +0100
Trond Myklebust <[email protected]> wrote:
> >>>>> " " == Oleg Drokin <[email protected]> writes:
>
> > Trond, do you think that'll work or should some other non-ext2
> > fs be tried?
>
> Ext2 should work fine: I've never seen any problems such as that which
> Stephan describes, and certainly not with 2.4.18 clients.
>
> In any case, any occurrence of an ESTALE error *must* first have
> originated from the server. The client itself cannot determine that a
> filehandle is stale.
Next try:
I have now in addition to the /backup and /mnt reiserfs exports created another
ext2 export. First test case:
mount /backup, mount the ext2 fs on /test, then mount /mnt, do i/o on /mnt and
umount /mnt.
After that everything works! /test works _and_ /backup works!
Second test case: (server and client have several network cards, so I can mount
on other ips as well)
mount /backup, mount /mnt on ip1, mount /test on ip2 (from same server). do i/o
on /mnt and umount /mnt.
After that /test works, but /backup is stale.
Conclusion: reiserfs has a problem being nfs-mounted as the only fs to a
client. If you add another fs (here ext2) mount, then even reiserfs is happy.
The problem originates on the server side.
Any ideas for a fix?
Regards,
Stephan
Stephan von Krawczynski wrote:
>On Mon, 11 Mar 2002 14:59:04 +0100
>Trond Myklebust <[email protected]> wrote:
>
>>>>>>>" " == Oleg Drokin <[email protected]> writes:
>> > Trond, do you think that'll work or should some other non-ext2
>> > fs be tried?
>>
>>Ext2 should work fine: I've never seen any problems such as that which
>>Stephan describes, and certainly not with 2.4.18 clients.
>>
>>In any case, any occurrence of an ESTALE error *must* first have
>>originated from the server. The client itself cannot determine that a
>>filehandle is stale.
>>
>
>Next try:
>I have now in addition to the /backup and /mnt reiserfs exports created another
>ext2 export. First test case:
>mount /backup, mount the ext2 fs on /test, then mount /mnt, do i/o on /mnt and
>umount /mnt.
>After that everything works! /test works _and_ /backup works!
>
>Second test case: (server and client have several network cards, so I can mount
>on other ips as well)
>mount /backup, mount /mnt on ip1, mount /test on ip2 (from same server). do i/o
>on /mnt and umount /mnt.
>After that /test works, but /backup is stale.
>
>Conclusion: reiserfs has a problem being nfs-mounted as the only fs to a
>client. If you add another fs (here ext2) mount, then even reiserfs is happy.
>The problem originates on the server side.
>
>Any ideas for a fix?
>
>Regards,
>Stephan
>
Oleg will be back at work in 16 hours;-)
Hans
Hello!
On Mon, Mar 11, 2002 at 04:57:22PM +0100, Stephan von Krawczynski wrote:
> Conclusion: reiserfs has a problem being nfs-mounted as the only fs to a
> client. If you add another fs (here ext2) mount, then even reiserfs is happy.
> The problem originates on the server side.
> Any ideas for a fix?
Ok I tried your scenario of mounting fs1, then mounting fs2, do io on fs2,
umount fs2 and access fs1 and everything went fine.
I cannot reproduce this at all. :(
Bye,
Oleg
On Fri, 15 Mar 2002 13:32:41 +0300
Oleg Drokin <[email protected]> wrote:
> Hello!
>
> On Mon, Mar 11, 2002 at 04:57:22PM +0100, Stephan von Krawczynski wrote:
>
> > Conclusion: reiserfs has a problem being nfs-mounted as the only fs to a
> > client. If you add another fs (here ext2) mount, then even reiserfs is happy.
> > The problem originates on the server side.
> > Any ideas for a fix?
>
> Ok I tried your scenario of mounting fs1, then mounting fs2, do io on fs2,
> umount fs2 and access fs1 and everything went fine.
> I cannot reproduce this at all. :(
There must be a reason for this. One "non-standard" option in my setup is in /etc/exports:
/p2/backup 192.168.1.1(rw,no_root_squash,no_subtree_check)
Can the "no_subtree_check" be a cause?
What kernels are you using (client,server)?
I am pretty sure we can find it; the setup here is pretty straightforward and the problem is continuously reproducible on several clients.
Regards,
Stephan
Hello!
On Fri, Mar 15, 2002 at 12:02:32PM +0100, Stephan von Krawczynski wrote:
> > Ok I tried your scenario of mounting fs1, then mounting fs2, do io on fs2,
> > umount fs2 and access fs1 and everything went fine.
> > I cannot reproduce this at all. :(
> There must be a reason for this. One "non-standard" option in my setup is in /etc/exports:
> /p2/backup 192.168.1.1(rw,no_root_squash,no_subtree_check)
> Can the "no_subtree_check" be a cause?
I will try with this one.
BTW, how much i/o do you usually do to observe the effect?
Do the exported filesystems actually reside on one physical filesystem on the server,
or are they separate physical filesystems too?
> What kernels are you using (client,server)?
2.4.18 at both sides.
Bye,
Oleg
On Fri, 15 Mar 2002 14:13:28 +0300
Oleg Drokin <[email protected]> wrote:
> Hello!
>
> On Fri, Mar 15, 2002 at 12:02:32PM +0100, Stephan von Krawczynski wrote:
> > > Ok I tried your scenario of mounting fs1, then mounting fs2, do io on fs2,
> > > umount fs2 and access fs1 and everything went fine.
> > > I cannot reproduce this at all. :(
> > There must be a reason for this. One "non-standard" option in my setup is in /etc/exports:
> > /p2/backup 192.168.1.1(rw,no_root_squash,no_subtree_check)
> > Can the "no_subtree_check" be a cause?
> I will try with this one.
> BTW, how much i/o do you usually do to observe the effect?
Very low. Something like 10 MB of reading files on the server. My standard way of producing it is mounting the backup fs and then starting yast (the SuSE config tool), which is configured to read its data from (the same) nfs-server. Scroll around in the package selection a bit and exit the tool without installing anything. However, I cannot see how YaST mounts the fs (which options it passes to the mount command). Maybe someone from SuSE can clarify?
> Do the exported filesystems actually reside on one physical filesystem on the server,
No. They are on different filesystems.
> or are they separate physical filesystems too?
>
> > What kernels are you using (client,server)?
> 2.4.18 at both sides.
Another point to clarify, my client fstab entry looks like this:
192.168.1.2:/p2/backup /backup nfs timeo=20,dev,suid,rw,exec,user,rsize=8192,wsize=8192 0 0
I cannot say anything about the second fs mounted via YaST.
Regards,
Stephan
commence Stephan von Krawczynski quotation:
> Another point to clarify, my client fstab entry looks like this:
>
> 192.168.1.2:/p2/backup /backup nfs timeo=20,dev,suid,rw,exec,user,rsize=8192,wsize=8192 0 0
>
> I cannot say anything about the second fs mounted via YaST.
Surely running mount from another window/console after starting YaST would
reveal this information?
--
///////////////// | | The spark of a pin
<[email protected]> | (require 'gnu) | dropping, falling feather-like.
\\\\\\\\\\\\\\\\\ | | There is too much noise.
Hello!
On Fri, Mar 15, 2002 at 12:30:08PM +0100, Stephan von Krawczynski wrote:
> Very low. Something like 10 MB of reading files on the server. My standard way of producing it is mounting the backup fs and then starting yast (the SuSE config tool), which is configured to read its data from (the same) nfs-server. Scroll around in the package selection a bit and exit the tool without installing anything. However, I cannot see how YaST mounts the fs (which options it passes to the mount command). Maybe someone from SuSE can clarify?
Well, I tried with the options you provided and still no luck.
Bye,
Oleg
On Fri, 15 Mar 2002 11:36:30 +0000
Sean Neakums <[email protected]> wrote:
> commence Stephan von Krawczynski quotation:
>
> > Another point to clarify, my client fstab entry looks like this:
> >
> > 192.168.1.2:/p2/backup /backup nfs timeo=20,dev,suid,rw,exec,user,rsize=8192,wsize=8192 0 0
> >
> > I cannot say anything about the second fs mounted via YaST.
>
> Surely running mount from another window/console after starting YaST would
> reveal this information?
Sorry, weekend in sight ;-)
admin:/p2/backup on /backup type nfs (rw,noexec,nosuid,nodev,timeo=20,rsize=8192,wsize=8192,addr=192.168.1.2)
admin:/p3/suse/6.4 on /var/adm/mount type nfs (ro,intr,addr=192.168.1.2)
BTW: another fs mounted from a different server on the same client is not affected at all by these troubles.
Are there any userspace tools involved that could have problems? mount? Maybe I should replace something ...
Regards,
Stephan
Hello!
On Fri, Mar 15, 2002 at 01:03:38PM +0100, Stephan von Krawczynski wrote:
> Sorry, weekend in sight ;-)
> admin:/p2/backup on /backup type nfs (rw,noexec,nosuid,nodev,timeo=20,rsize=8192,wsize=8192,addr=192.168.1.2)
> admin:/p3/suse/6.4 on /var/adm/mount type nfs (ro,intr,addr=192.168.1.2)
> BTW: another fs mounted from a different server on the same client is not affected at all by these troubles.
> Are there any userspace tools involved that could have problems? mount? Maybe I should replace something ...
Do not know about the tools, can you run reiserfsck on all exported volumes just
in case?
Bye,
Oleg
On Fri, 15 Mar 2002 15:05:36 +0300
Oleg Drokin <[email protected]> wrote:
> Hello!
>
> On Fri, Mar 15, 2002 at 01:03:38PM +0100, Stephan von Krawczynski wrote:
>
> > Sorry, weekend in sight ;-)
> > admin:/p2/backup on /backup type nfs (rw,noexec,nosuid,nodev,timeo=20,rsize=8192,wsize=8192,addr=192.168.1.2)
> > admin:/p3/suse/6.4 on /var/adm/mount type nfs (ro,intr,addr=192.168.1.2)
> > BTW: another fs mounted from a different server on the same client is not affected at all by these troubles.
> > Are there any userspace tools involved that could have problems? mount? Maybe I should replace something ...
>
> Do not know about the tools, can you run reiserfsck on all exported volumes just
> in case?
Runs without problems on both exported filesystems.
Regards,
Stephan
Hello Trond,
Trond Myklebust wrote:
> The only thing I can think might be missing is the fix to cope with
> broken servers that reuse filehandles (this violates the
> RFCs). Reiserfs 3.5 + knfsd is one such broken combination. Another
> broken server is unfsd...
Yes, unfsd...
A problem is easily reproducible with user-space nfsd (on ext3, in my case).
We see the message (say, when installing a package with dpkg -i):
nfs_refresh_inode: inode XXXXXXX mode changed, OOOO to OOOO
Which means, same file handle but different type.
FWIW, I'm using the patch attached. It works for me.
--- linux-2.4.18/fs/nfs/inode.c~ Wed Mar 13 17:56:48 2002
+++ linux-2.4.18.superh/fs/nfs/inode.c Mon Mar 18 13:27:39 2002
@@ -680,8 +680,10 @@ nfs_find_actor(struct inode *inode, unsi
if (is_bad_inode(inode))
return 0;
/* Force an attribute cache update if inode->i_count == 0 */
- if (!atomic_read(&inode->i_count))
+ if (!atomic_read(&inode->i_count)) {
NFS_CACHEINV(inode);
+ inode->i_mode = 0;
+ }
return 1;
}
--
>>>>> " " == NIIBE Yutaka <[email protected]> writes:
> A problem is easily reproducible with user-space nfsd (on ext3,
> in my case). We see the message (say, when installing a
> package with dpkg -i):
> nfs_refresh_inode: inode XXXXXXX mode changed, OOOO to OOOO
> Which means, same file handle but different type.
> FWIW, I'm using the patch attached. It works for me.
> --- linux-2.4.18/fs/nfs/inode.c~ Wed Mar 13 17:56:48 2002
> +++ linux-2.4.18.superh/fs/nfs/inode.c Mon Mar 18 13:27:39 2002
> @@ -680,8 +680,10 @@ nfs_find_actor(struct inode *inode, unsi
> if (is_bad_inode(inode))
> return 0;
> /* Force an attribute cache update if inode->i_count
> == 0 */
> - if (!atomic_read(&inode->i_count))
> + if (!atomic_read(&inode->i_count)) {
> NFS_CACHEINV(inode);
> + inode->i_mode = 0;
> + }
> return 1;
> }
Er... Why?
If you really want to change something in nfs_find_actor() then the
following works better w.r.t. init_special_inode() on character
devices:
if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT))
return 0;
That doesn't fix all the races w.r.t. unfsd though: if someone on the
server removes a file that you have open for writing and replaces it
with a new one, you can still corrupt the new file.
Cheers,
Trond
Trond Myklebust wrote:
> Er... Why?
Because the inode could be on inode_unused while still being on the hash at
the client side, and the server could reuse the inode (in the case of
unfsd/ext3). When the inode is reused for a different type, it
will result in an error. Here is a scenario for non-patched 2.4.18:
(1) Symbolic link has been removed. The inode is put on inode_unused.
Say the inode # was 0x1234.
(2) Client issues "creat", the server returns inode # 0x1234 (reusing it).
(3) Call chain is:
nfs_create -> nfs_instantiate -> nfs_fhget -> __nfs_fhget -> iget4
iget4 returns the cached inode object on inode_unused.
(4) nfs_fill_inode doesn't fill it, because inode->i_mode is not 0.
(5) nfs_refresh_inode results in an error because inode->i_mode != fattr->mode.
Note that this is a _real_ case.
> If you really want to change something in nfs_find_actor() then the
> following works better w.r.t. init_special_inode() on character
> devices:
>
> if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT))
> return 0;
Well, I've just tested this. This works well, thank you.
--
>>>>> " " == NIIBE Yutaka <[email protected]> writes:
> Because the inode could be on inode_unused while still being on the
> hash at the client side, and the server could reuse the inode (in
> the case of unfsd/ext3). When the inode is reused for a
> different type, it will result in an error. Here is a scenario for
> non-patched 2.4.18:
> (1) Symbolic link has been removed. The inode is put on inode_unused.
> Say the inode # was 0x1234.
> (2) Client issues "creat", the server returns inode # 0x1234 (reusing
> it).
> (3) Call chain is:
> nfs_create -> nfs_instantiate -> nfs_fhget -> __nfs_fhget ->
> iget4
> iget4 returns the cached inode object on inode_unused.
> (4) nfs_fill_inode doesn't fill it, because inode->i_mode is
> not 0.
> (5) nfs_refresh_inode results in an error because inode->i_mode !=
> fattr->mode.
> Note that this is a _real_ case.
Sure, but it is a consequence of a badly broken server that violates
the NFS specs concerning file handles. Rigging the client in order to
cope with *all* the consequences in terms of unfsd races is an
exercise in futility - it cannot be done.
The solution is not to keep flogging the dead horse that is unfsd. It
is to put the effort into fixing knfsd so that it can cope with all
those cases where people are using unfsd today.
Cheers,
Trond
Trond Myklebust wrote:
> Rigging the client in order to cope with *all* the consequences in
> terms of unfsd races is an exercise in futility - it cannot be
> done.
[...]
> The solution is not to keep flogging the dead horse that is unfsd. It
> is to put the effort into fixing knfsd so that it can cope with all
> those cases where people are using unfsd today.
Agreed in general. That's the way to go.
* * *
> Sure, but it is a consequence of a badly broken server that violates
> the NFS specs concerning file handles.
I have a technical concern here. Is the server violating the specs?
Please correct me if I am wrong. I've read through rfc1094, rfc1813,
the XNFS specification from the Open Group and the NFS v3 specification by Sun,
and I cannot find any statement to the effect of:
reuse of a file handle on the server side is wrong.
A file handle must be unique. But I think that it may be reused (for
a different type). The client-side cache should handle this case, IMO.
There is an explanation for the file handle in NFS v3 specification by Sun
(http://www.connectathon.org/nfsv3.pdf):
----------------------
Servers should try to maintain a one-to-one correspondence between
file handles and files, but this is not required. Clients should use
file handle comparisons only to improve performance, not for correct
behavior.
----------------------
The current Linux client implementation uses the file handle for
correctness in the scenario I've described.
--
On Tuesday 19. March 2002 00:57, NIIBE Yutaka wrote:
> A file handle must be unique. But I think that it may be reused (for
> a different type). The client-side cache should handle this case, IMO.
No...
From RFC1094:
----------------
2.3.3. fhandle
typedef opaque fhandle[FHSIZE];
The "fhandle" is the file handle passed between the server and the
client. All file operations are done using file handles to refer
to a file or directory. The file handle can contain whatever
information the server needs to distinguish an individual file.
-----------------
IOW: the server is required to distinguish an individual file.
Note that there is no time limit on this: if I try to write to a file that
was deleted behind my back, the server is supposed to be able to determine
which file I was writing to.
This is further clarified in RFC1813:
-----------------
If two file handles from the same server are equal, they must refer to
the same file
------------------
Again: at no point does the RFC say that there is a time limit on the above
(unlike the so-called 'volatile filehandles' that were introduced for NFSv4)
Indeed if you think about it, then there is no way the RFC *can* allow the
client to take the burden: we are talking about a stateless system. Unless
the server has a way of notifying the client that a filehandle is invalid,
and/or the file was deleted there is no way that the client can know...
Cheers,
Trond
In article <[email protected]>,
Trond Myklebust <[email protected]> writes:
> The solution is not to keep flogging the dead horse that is unfsd. It
> is to put the effort into fixing knfsd so that it can cope with all
> those cases where people are using unfsd today.
>
> Cheers,
> Trond
<HINT_HINT>
well, the only reasons I still use unfsd is link_relative and re-export
</HINT_HINT>
> > is to put the effort into fixing knfsd so that it can cope with all
> > those cases where people are using unfsd today.
>
> <HINT_HINT>
> well, the only reasons I still use unfsd is link_relative and re-export
> </HINT_HINT>
<HINT#2>
Ask Trond for a price quote 8)
</HINT#2>
>>>>> " " == Ton Hospel <[email protected]> writes:
> <HINT_HINT> well, the only reasons I still use unfsd is
> link_relative and re-export </HINT_HINT>
Link_relative should be very easy to implement: the only reason I can
think of why it hasn't been done yet is that it is so rarely useful that
nobody has bothered.
As for re-exporting: that can be done pretty easily too unless of
course you actually expect it to be reliable. The tough cookie is to
get it to survive server reboots.
Then again, I'm not sure unfsd was too good at handling that sort of
thing either...
Cheers,
Trond
While I agree that your argument is correct in general, what I'd like
to discuss is a specific technical concern with the _change_. Well,
you may have a bigger picture of NFS than me, so I will just say my thoughts
here...
I attach the change (of my current version against 2.4.18) again, so
that we can focus on the issue.
Here, the non-patched implementation checks inode correctness and
fails in:
nfs_refresh_inode: inode XXXXXXX mode changed, OOOO to OOOO
I think there is nothing useful in the client-side inode data when
inode->i_count == 0. So it doesn't make sense to check correctness
against it.
Provided the file handle is an eternal thing, it doesn't fail...
IMO, there is no use checking the correctness of the inode when ->i_count == 0.
We can reuse the memory of the client-side inode, that's true,
but we don't need to check the old data against the new one at that time.
BTW, I got positive feedback with this change.
2002-03-20 NIIBE Yutaka <[email protected]>
* fs/nfs/inode.c (nfs_read_inode): Don't set inode->i_rdev here.
(nfs_fill_inode): But set it here, instead.
(nfs_find_actor): Reusing cached inode, clear ->i_mode.
--- fs/nfs/inode.c 19 Mar 2002 23:57:40 -0000 1.1.2.1
+++ fs/nfs/inode.c 20 Mar 2002 00:05:30 -0000
@@ -104,7 +104,6 @@ nfs_read_inode(struct inode * inode)
{
inode->i_blksize = inode->i_sb->s_blocksize;
inode->i_mode = 0;
- inode->i_rdev = 0;
/* We can't support UPDATE_ATIME(), since the server will reset it */
inode->i_flags |= S_NOATIME;
INIT_LIST_HEAD(&inode->u.nfs_i.read);
@@ -638,6 +637,7 @@ nfs_fill_inode(struct inode *inode, stru
* that's precisely what we have in nfs_file_inode_operations.
*/
inode->i_op = &nfs_file_inode_operations;
+ inode->i_rdev = 0;
if (S_ISREG(inode->i_mode)) {
inode->i_fop = &nfs_file_operations;
inode->i_data.a_ops = &nfs_file_aops;
@@ -679,7 +679,7 @@ nfs_find_actor(struct inode *inode, unsi
return 0;
/* Force an attribute cache update if inode->i_count == 0 */
if (!atomic_read(&inode->i_count))
- NFS_CACHEINV(inode);
+ inode->i_mode = 0;
return 1;
}
--
On Wednesday 20. March 2002 01:42, NIIBE Yutaka wrote:
> IMO, there is no use checking the correctness of the inode when ->i_count ==
> 0. We can reuse the memory of the client-side inode, that's true,
> but we don't need to check the old data against the new one at that time.
I don't understand what you mean by this. As I see it, close-to-open
consistency checking mandates that you do this. What if somebody changed the
data on the server while you had the file closed?
Furthermore, inode->i_count == 0 offers no guarantees that the client doesn't
for instance have dirty pages to write out.
Messing around with the value of i_mode in nfs_find_actor as you want to do
in your patch is going to introduce new dimensions to this problem. For
instance, magically changing a regular file into a symlink without first
flushing out dirty pages and clearing the page cache is certainly going to
produce some "interesting" results...
As I said yesterday: a test of the form
if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT))
return 0;
in nfs_find_actor might make sense since that forces the creation of a new
inode. However it doesn't help at all with the same race if inode->i_mode
hasn't changed. There is simply no way you can test for whether or not the
file is the same on the server.
Cheers,
Trond
On Fri, 15 Mar 2002 15:05:36 +0300
Oleg Drokin <[email protected]> wrote:
> Hello!
>
> On Fri, Mar 15, 2002 at 01:03:38PM +0100, Stephan von Krawczynski wrote:
>
> > Sorry, weekend in sight ;-)
> > admin:/p2/backup on /backup type nfs
> > (rw,noexec,nosuid,nodev,timeo=20,rsize=8192,wsize=8192,addr=192.168.1.2)
> > admin:/p3/suse/6.4 on /var/adm/mount type nfs (ro,intr,addr=192.168.1.2)
> > BTW: another fs mounted from a different server on the same client is not
> > affected at all by these troubles. Are there any userspace tools involved
> > that could have problems? mount? Maybe I should replace something ...
>
> Do not know about the tools, can you run reiserfsck on all exported volumes
> just in case?
Hello,
just in case there is still somebody interested:
the problem stays the same after upgrading the server to 2.4.19-pre4.
Trond: can you please tell me in short, what the common case (or your guess) is
why I see these stale file handles on the client side. I am going to try and
find out myself what the problem with reiserfs is here; it is getting a bit on my
nerves now. Do you suspect the fs to drop some inodes under the nfs-server?
Regards,
Stephan
On Thu, 21 Mar 2002 15:45:00 +0100
Stephan von Krawczynski <[email protected]> wrote:
> Hello,
>
> just in case there is still somebody interested:
> the problem stays the same after upgrading the server to 2.4.19-pre4.
Hello Oleg,
detailed investigation showed the following interesting results:
1) the problem is not dependent on any nfs flags; whatever I try, it shows up
2) the problem _is_ dependent on the fs mounted in the following form:
mounting two fs that are located on the _same_ reiserfs _works_.
mounting two fs that are located on _different_ reiserfs _does not work_.
How about that?
Regards,
Stephan
Hello!
On Thu, Mar 21, 2002 at 03:57:31PM +0100, Stephan von Krawczynski wrote:
> 2) the problem _is_ dependent on the fs mounted in the following form:
> mounting two fs that are located on the _same_ reiserfs _works_.
> mounting two fs that are located on _different_ reiserfs _does not work_.
> How about that?
I cannot reproduce it locally, that's it.
And if you have reiserfs v3.6 (that is, not v3.5 converted to 3.6,
but v3.6 created with mkreiserfs), then I am out of ideas for you :(
Bye,
Oleg
On Thu, 21 Mar 2002 18:01:17 +0300
Oleg Drokin <[email protected]> wrote:
> Hello!
>
> On Thu, Mar 21, 2002 at 03:57:31PM +0100, Stephan von Krawczynski wrote:
>
> > 2) the problem _is_ dependent on the fs mounted in the following form:
> > mounting two fs that are located on the _same_ reiserfs _works_.
> > mounting two fs that are located on _different_ reiserfs _does not work_.
> > How about that?
>
> I cannot reproduce it locally, that's it.
> And if you have reiserfs v3.6 (that is, not v3.5 converted to 3.6,
> but v3.6 created with mkreiserfs), then I am out of ideas for you :(
I never did any conversion. I just don't trust it.
Maybe my mkreiserfs util is old, and I should try recreating the volumes with a
newer version? Were there "suspicious" changes during 3.6 format?
Regards,
Stephan
Hello!
On Thu, Mar 21, 2002 at 04:05:26PM +0100, Stephan von Krawczynski wrote:
> > I cannot reproduce it locally, that's it.
> > And if you have reiserfs v3.6 (that is, not v3.5 converted to 3.6,
> > but v3.6 created with mkreiserfs), then I am out of ideas for you :(
> I never did any conversion. I just don't trust it.
;)
> Maybe my mkreiserfs util is old, and I should try recreating the volumes with a
> newer version? Were there "suspicious" changes during 3.6 format?
Not any I am aware of.
Bye,
Oleg
On Thu, 21 Mar 2002 18:07:50 +0300
Oleg Drokin <[email protected]> wrote:
> > Maybe my mkreiserfs util is old, and I should try recreating the volumes
> > with a newer version? Were there "suspicious" changes during 3.6 format?
>
> Not any I am aware of.
Hello Oleg,
I just re-created the questionable fs (both) with a freshly compiled
util-package (reiserfsprogs-3.x.1b) and now things are even more weird:
It now works, depending on which fs I mount first. Remember, both are completely
new 3.6 fs. I can reliably reproduce it: mounting "a", then "b" works, but first
mounting "b", then "a" has the problem. Did you try something like this (play
with the mounting sequence)?
Regards,
Stephan
>>>>> " " == Stephan von Krawczynski <[email protected]> writes:
> Trond: can you please tell me in short, what the common case
> (or your guess) is why I see these stale file handles on the
> client side. I am going to try and find out myself what the
> problem with reiserfs is here; it is getting a bit on my nerves
> now. Do you suspect the fs to drop some inodes under the
> nfs-server?
Hold on thar: are you using nfs-server (a.k.a. unfsd) or are you using
knfsd?
The client will only return ESTALE if the server has first told it to
do so. For knfsd, this is only supposed to occur if the file has
actually been deleted on the server (knfsd is supposed to be able to
retrieve ReiserFS files that have fallen out of cache).
For unfsd, the 'falling out of cache' business might indeed be a
problem...
Cheers,
Trond
Hello!
On Thu, Mar 21, 2002 at 06:15:16PM +0100, Stephan von Krawczynski wrote:
> It now works, depending on which fs I mount first. Remember, both are completely
> new 3.6 fs. I can reliably reproduce it: mounting "a", then "b" works, but first
> mounting "b", then "a" has the problem. Did you try something like this (play
> with the mounting sequence)?
Yes, I tried to change order of mounts with no apparent success (or perhaps
failure).
Bye,
Oleg
[email protected] said:
> As for re-exporting: that can be done pretty easily too unless of
> course you actually expect it to be reliable. The tough cookie is to
> get it to survive server reboots.
The problem here is that we're using the anonymous device which the NFS
mount happens to have as sb->s_dev as the device ID in our exported file
handles. We don't have to do that; we could use something slightly more
useful, based on the root fh we got from the _real_ server, surely?
--
dwmw2
On Fri, 22 Mar 2002 01:19:56 +0100
Trond Myklebust <[email protected]> wrote:
> >>>>> " " == Stephan von Krawczynski <[email protected]> writes:
>
> > Trond: can you please tell me in short, what the common case
> > (or your guess) is why I see these stale file handles on the
> > client side. I am going to try and find out myself what the
> > problem with reiserfs is here; it is getting a bit on my nerves
> > now. Do you suspect the fs to drop some inodes under the
> > nfs-server?
>
> Hold on thar: are you using nfs-server (a.k.a. unfsd) or are you using
> knfsd?
This is a knfsd setup.
> The client will only return ESTALE if the server has first told it to
> do so. For knfsd, this is only supposed to occur if the file has
> actually been deleted on the server (knfsd is supposed to be able to
> retrieve ReiserFS files that have fallen out of cache).
The files are obviously not deleted from the server. Can you give me a short
hint on where to look for this specific case (source location)? I will try to
do some debugging around that place to see what is going on.
Thank you for your help
Stephan
>>>>> " " == David Woodhouse <[email protected]> writes:
> [email protected] said:
>> As for re-exporting: that can be done pretty easily too unless
>> of course you actually expect it to be reliable. The tough
>> cookie is to get it to survive server reboots.
> The problem here is that we're using the anonymous device which
> the NFS mount happens to have as sb->s_dev as the device ID in
> our exported file handles. We don't have to do that; we could
> use something slightly more useful, based on the root fh we got
> from the _real_ server, surely?
That is an issue, but it is really only a minor one.
The real problem is that whereas the tuple (sb->s_dev,i_ino) suffices
in order to be able to iget() a typical ext2 file, you require the the
tuple (sb->s_dev, 32/64 byte opaque filehandle) if you want to
iget() an NFS file.
Basically, if you want to be able to recover gracefully from the
situation in which the re-exporting server reboots, you would need to
compress the entire filehandle from the original server + the
sb->s_dev (in some manner that survives a reboot, I'll grant you) and
fit that into the filehandle that the NFS client uses.
To complicate matters a bit further, you have the fact that NFSv3
filehandles are 0-64 bytes long, and NFSv2 filehandles are always 32
bytes long...
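Just to show the arithmetic (a purely hypothetical layout, not something that
exists in the tree):

/* Hypothetical re-export filehandle layout.  An NFSv2 handle is 32 bytes
 * in total, so the original server's handle (32 bytes for v2, up to 64
 * for v3) cannot be embedded verbatim; it has to go through a digest or
 * a mapping table, and that mapping is exactly the thing that would have
 * to survive a reboot of the re-exporting server.
 */
struct reexport_fh {
	__u32	export_id;	 /* which NFS mount on the re-exporting server */
	__u32	generation;	 /* guard against reuse */
	__u8	orig_fh_ref[24]; /* digest or table index of the original fh */
};				 /* 32 bytes: the whole NFSv2 handle is used up */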
Cheers,
Trond
>>>>> " " == Stephan von Krawczynski <[email protected]> writes:
> This is a knfsd setup.
Good...
>> The client will only return ESTALE if the server has first told
>> it to do so. For knfsd, this is only supposed to occur if the
>> file has actually been deleted on the server (knfsd is supposed
>> to be able to retrieve ReiserFS files that have fallen out of
>> cache).
> The files are obviously not deleted from the server. Can you
> give me a short hint on where to look for this specific case
> (source location)? I will try to do some debugging around that
> place to see what is going on.
Those decisions are supposed to be made in the fh_to_dentry()
'struct super_operations' method. For ReiserFS, that would be in
fs/reiserfs/inode.c:reiserfs_fh_to_dentry().
It would indeed be a good idea to try sticking some debugging
'printk's in there in order to see what is failing...
Cheers,
Trond
On Fri, 22 Mar 2002 12:07:55 +0100
Trond Myklebust <[email protected]> wrote:
> >>>>> " " == Stephan von Krawczynski <[email protected]> writes:
> > The files are obviously not deleted from the server. Can you
> > give me a short hint on where to look for this specific case
> > (source location)? I will try to do some debugging around that
> > place to see what is going on.
>
> Those decisions are supposed to be made in the fh_to_dentry()
> 'struct super_operations' method. For ReiserFS, that would be in
> fs/reiserfs/inode.c:reiserfs_fh_to_dentry().
>
> It would indeed be a good idea to try sticking some debugging
> 'printk's in there in order to see what is failing...
Well, I seem to be really stupid, watch this:
struct dentry *reiserfs_fh_to_dentry(struct super_block *sb, __u32 *data,
int len, int fhtype, int parent) {
struct cpu_key key ;
struct inode *inode = NULL ;
struct list_head *lp;
struct dentry *result;
/* fhtype happens to reflect the number of u32s encoded.
* due to a bug in earlier code, fhtype might indicate there
* are more u32s then actually fitted.
* so if fhtype seems to be more than len, reduce fhtype.
* Valid types are:
* 2 - objectid + dir_id - legacy support
* 3 - objectid + dir_id + generation
* 4 - objectid + dir_id + objectid and dirid of parent - legacy
* 5 - objectid + dir_id + generation + objectid and dirid of parent
* 6 - as above plus generation of directory
* 6 does not fit in NFSv2 handles
*/
if (fhtype > len) {
if (fhtype != 6 || len != 5)
printk(KERN_WARNING "nfsd/reiserfs, fhtype=%d, len=%d - odd\n",
fhtype, len);
fhtype = 5;
}
if (fhtype < 2 || (parent && fhtype < 4)) {
printk(KERN_WARNING "fh: 1\n");
goto out ;
}
if (! parent) {
/* this works for handles from old kernels because the default
** reiserfs generation number is the packing locality.
*/
key.on_disk_key.k_objectid = data[0] ;
key.on_disk_key.k_dir_id = data[1] ;
inode = reiserfs_iget(sb, &key) ;
if (!inode)
printk(KERN_WARNING "fh: 2a\n");
if (inode && !IS_ERR(inode) && (fhtype == 3 || fhtype >= 5) &&
data[2] != inode->i_generation) {
iput(inode) ;
printk(KERN_WARNING "fh: 2\n");
inode = NULL ;
}
} else {
key.on_disk_key.k_objectid = data[fhtype>=5?3:2] ;
key.on_disk_key.k_dir_id = data[fhtype>=5?4:3] ;
inode = reiserfs_iget(sb, &key) ;
if (!inode)
printk(KERN_WARNING "fh: 3a\n");
if (inode && !IS_ERR(inode) && fhtype == 6 &&
data[5] != inode->i_generation) {
iput(inode) ;
printk(KERN_WARNING "fh: 3\n");
inode = NULL ;
}
}
out:
if (IS_ERR(inode))
return ERR_PTR(PTR_ERR(inode));
if (!inode)
return ERR_PTR(-ESTALE) ;
/* now to find a dentry.
* If possible, get a well-connected one
*/
spin_lock(&dcache_lock);
for (lp = inode->i_dentry.next; lp != &inode->i_dentry ; lp=lp->next) {
result = list_entry(lp,struct dentry, d_alias);
if (! (result->d_flags & DCACHE_NFSD_DISCONNECTED)) {
dget_locked(result);
result->d_vfs_flags |= DCACHE_REFERENCED;
spin_unlock(&dcache_lock);
iput(inode);
return result;
}
}
spin_unlock(&dcache_lock);
result = d_alloc_root(inode);
if (result == NULL) {
iput(inode);
printk(KERN_WARNING "fh: 4\n");
return ERR_PTR(-ENOMEM);
}
result->d_flags |= DCACHE_NFSD_DISCONNECTED;
return result;
}
As you can see, I put printks just about everywhere a false exit seems possible. But I get _no_ output from them in my test case.
Is this really the correct spot to look at?
What else can I do?
Regards,
Stephan