Hi Neil et al.,
over the last few days, Trond and I were able to track a problem down
to $subject, which happens as follows:
It occurs with all 2.4 kernels I've tested so far, but for reference:
Server: 2.4.18-pre3 on Dual P3/500 exports reiserfs partitions
Client: Diskless 2.4.18-pre3 on Athlon 1.2 GHz
When building lm_sensors-2.6.2 on the client, I could easily reproduce
this:
gcc -shared -Wl,-soname,libsensors.so.1 -o lib/libsensors.so.1.2.0
lib/data.lo lib/general.lo lib/error.lo lib/chips.lo lib/proc.lo
lib/access.lo lib/init.lo lib/conf-parse.lo lib/conf-lex.lo -lc
rm -f lib/libsensors.so.1
ln -sfn libsensors.so.1.2.0 lib/libsensors.so.1
make: stat:lib/libsensors.so.1: Input/output error
rm -f lib/libsensors.so
ln -sfn libsensors.so.1.2.0 lib/libsensors.so
In syslog, this message appears:
Jan 15 00:21:03 elfe kernel: nfs_refresh_inode: inode 50066 mode changed,
0100664 to 0120777
In this case, ln managed to create an invalid link in the above sequence.
What is really bad is that you cannot work around this on the client. On the
server, the link is fine, but on the client, ls -l lib fails with
ls: lib/libsensors.so.1: Input/output error
A comment from Trond << EOC
It is telling you that the server has a blatant bug: it is first
telling the client that the inode 50066 is a regular file, then
it changes it to a link.
When this happens, the RFCs state that the server is supposed to
change the NFS filehandle. The client *does* check the filehandle, so
if the server had updated it correctly, you would not have had a
problem.
EOC
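For readers who do not decode octal modes in their head, here is a minimal sketch using Python's standard stat module. It only interprets the two mode values from the syslog line; the helper name describe_mode is made up for illustration.

```python
import stat


def describe_mode(mode):
    """Describe the file type and permission bits encoded in an st_mode value."""
    if stat.S_ISREG(mode):
        kind = "regular file"
    elif stat.S_ISLNK(mode):
        kind = "symbolic link"
    elif stat.S_ISDIR(mode):
        kind = "directory"
    else:
        kind = "other"
    # S_IMODE strips the file-type bits and keeps only the permission bits.
    return f"{kind}, permissions {stat.S_IMODE(mode):04o}"


# The two values reported by nfs_refresh_inode for inode 50066.
print(describe_mode(0o100664))
print(describe_mode(0o120777))
```

So the same inode number and filehandle were first reported as a regular file and then as a symbolic link, which is exactly the change Trond describes.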
At least Trond was able to get me around it with this patch, but
it would be nice to fix the real problem instead (because the build
feels noticeably slower with it):
--- linux-2.4.18-up/fs/nfs/dir.c.orig Fri Jan 11 23:06:38 2002
+++ linux-2.4.18-up/fs/nfs/dir.c Mon Jan 14 23:52:17 2002
@@ -619,6 +619,8 @@
 		nfs_complete_unlink(dentry);
 		unlock_kernel();
 	}
+	if (is_bad_inode(inode))
+		force_delete(inode);
 	iput(inode);
 }

--- linux-2.4.18-up/fs/nfs/inode.c.orig Fri Jan 11 23:08:00 2002
+++ linux-2.4.18-up/fs/nfs/inode.c Mon Jan 14 23:53:10 2002
@@ -699,6 +699,8 @@
 		return 0;
 	if (memcmp(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh)) != 0)
 		return 0;
+	if (is_bad_inode(inode))
+		return 0;
 	/* Force an attribute cache update if inode->i_count == 0 */
 	if (!atomic_read(&inode->i_count))
 		NFS_CACHEINV(inode);
As noted, this is a long-standing problem here, but with the libsensors build
I can easily reproduce it. Can you?
Any chance of getting this fixed soon?
Cheers,
Hans-Peter
>>>>> " " == Hans-Peter Jansen <[email protected]> writes:
> In syslog, this message appears: Jan 15 00:21:03 elfe kernel:
> nfs_refresh_inode: inode 50066 mode changed, 0100664 to 0120777
The error is basically telling you that ReiserFS filehandles are being
reused by the server. Doesn't Reiser provide a generation count to
guard against this sort of thing?
My 'fix' just solves the immediate problem of the wrong file mode. It
does not solve the problems of data corruption that can occur when the
client is incapable of distinguishing the 'old' and 'new' files that
share the same filehandle.
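To make the difference between the workaround and a real fix concrete, here is a toy model of a client inode cache keyed by (fileid, filehandle). It is only a sketch under loud assumptions: ToyClient, CachedInode and try_lookup are invented names, and the logic is loosely modelled on the two is_bad_inode() checks in the patch, not on the actual kernel code paths.

```python
import errno
from dataclasses import dataclass


@dataclass
class CachedInode:
    fileid: int
    fh: bytes
    file_type: str
    bad: bool = False


class ToyClient:
    """Hypothetical client cache; not the real NFS client logic."""

    def __init__(self, skip_bad_inodes):
        self.skip_bad_inodes = skip_bad_inodes
        self.cache = []

    def lookup(self, fileid, fh, file_type):
        for inode in self.cache:
            if inode.fileid != fileid or inode.fh != fh:
                continue
            if inode.bad:
                if self.skip_bad_inodes:
                    continue  # workaround: never match a bad inode again
                raise OSError(errno.EIO, "cached inode is bad")
            if inode.file_type != file_type:
                # Same handle, different file type: the cached inode goes bad.
                inode.bad = True
                raise OSError(errno.EIO, "mode changed")
            return inode
        inode = CachedInode(fileid, fh, file_type)
        self.cache.append(inode)
        return inode


def try_lookup(client, file_type):
    """Look up inode 50066 under one fixed handle; return its type or the errno name."""
    try:
        return client.lookup(50066, b"same-handle", file_type).file_type
    except OSError as exc:
        return errno.errorcode[exc.errno]


for skip in (False, True):
    client = ToyClient(skip_bad_inodes=skip)
    print(skip, [try_lookup(client, t) for t in ("regular", "symlink", "symlink")])
```

In this model the unpatched client keeps failing once the type changes, while the patched one recovers on the next lookup. Neither can tell an old file from a new one of the same type behind the same handle, which is the data corruption risk mentioned above.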
Cheers,
Trond
Trond Myklebust writes:
> >>>>> " " == Hans-Peter Jansen <[email protected]> writes:
>
> > In syslog, this message appears: Jan 15 00:21:03 elfe kernel:
> > nfs_refresh_inode: inode 50066 mode changed, 0100664 to 0120777
>
> The error is basically telling you that ReiserFS filehandles are being
> reused by the server. Doesn't Reiser provide a generation count to
> guard against this sort of thing?
Yes, inode->i_generation is stored in the file handle:
fs/reiserfs/inode.c:reiserfs_dentry_to_fh().
Hans-Peter, what version of NFS are you using and have you remounted
clients after upgrading to the newer kernel?
>
> My 'fix' just solves the immediate problem of the wrong file mode. It
> does not solve the problems of data corruption that can occur when the
> client is incapable of distinguishing the 'old' and 'new' files that
> share the same filehandle.
This requires i_generation overflow (modulo bug in reiserfs).
>
> Cheers,
> Trond
Nikita.
Hrm, this sounds a lot like a problem I've been having as well. Server
is 2.2.19 (RedHat) + reiserfs; client is 2.4.12. There are times when I
mv the contents of public_html to a new location, then rm -rf
public_html, then make public_html a symlink to the new location. On
the client, I think what I would get is an I/O error when trying to list
or cd to public_html. On the server everything is fine, and the only
fix for this I've found is rebooting the client. The original
public_html directory is on a knfsd exported reiserfs fs.
Hopefully this is informative; the bug has only bitten me 2 or 3 times
in several months, and I've not been able to reproduce it at will. :-\
regards,
David
--
David L. Parsley
Network Administrator, Roanoke College
"If I have seen further it is by standing on ye shoulders of Giants."
--Isaac Newton
>>>>> " " == Nikita Danilov <[email protected]> writes:
> Yes, inode->i_generation is stored in the file handle:
> fs/reiserfs/inode.c:reiserfs_dentry_to_fh().
But what is stored in inode->i_generation? AFAICS
inode->i_generation = le32_to_cpu (INODE_PKEY (inode)->k_dir_id);
which appears not to be a unique generation count. Isn't that instead
the directory's object id?
The point of i_generation is to provide a unique number that changes
every time you reuse the inode number.
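A small sketch of why that matters, assuming (as the line quoted above suggests) that the "generation" is simply the parent directory's id. The handle layout here is a deliberate simplification, and toy_handle and the directory id 1234 are hypothetical.

```python
def toy_handle(objectid, parent_dir_id):
    """Simplified filehandle: (inode number, 'generation' taken from k_dir_id)."""
    i_generation = parent_dir_id  # assumption: mirrors the assignment quoted above
    return (objectid, i_generation)


lib_dir_id = 1234  # hypothetical object id of the lib/ directory

# A regular file is created, removed, and its objectid is reused for a
# symlink in the same directory, as in the rm/ln sequence of the build.
old_handle = toy_handle(50066, lib_dir_id)
new_handle = toy_handle(50066, lib_dir_id)

# Both objects end up with the same handle, so nothing tells them apart.
print(old_handle == new_handle)
```

With a real generation count, the second handle would differ from the first even though the inode number is reused.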
Cheers,
Trond
Trond Myklebust writes:
> >>>>> " " == Nikita Danilov <[email protected]> writes:
>
> > Yes, inode->i_generation is stored in the file handle:
> > fs/reiserfs/inode.c:reiserfs_dentry_to_fh().
>
> But what is stored in inode->i_generation? AFAICS
>
> inode->i_generation = le32_to_cpu (INODE_PKEY (inode)->k_dir_id);
>
> which appears not to be a unique generation count. Isn't that instead
> the directory's object id?
This is only true for the 3.5 reiserfs format (the default for 2.2 kernels). In
the 3.6 format, the generation is stored on disk (in the same place where rdev
is stored for device files). 3.5 cannot work with NFS reliably.
Hans-Peter, you can check which version of reiserfs you use with
/sbin/debugreiserfs /dev/device
or
cat /proc/fs/reiserfs/device/version
>
> The point of i_generation is to provide a unique number that changes
> every time you reuse the inode number.
In reiserfs there is no static inode table, so we keep a global generation
counter in the super block, which is incremented on each inode deletion;
this generation is stored in new inodes. This is not as good as a per-inode
generation, but we cannot do better without changing the disk format.
>
> Cheers,
> Trond
>
Nikita.
On Tuesday, 15. January 2002 15:07, Nikita Danilov wrote:
> Trond Myklebust writes:
> > >>>>> " " == Hans-Peter Jansen <[email protected]> writes:
> > > In syslog, this message appears: Jan 15 00:21:03 elfe kernel:
> > > nfs_refresh_inode: inode 50066 mode changed, 0100664 to 0120777
> >
> > The error is basically telling you that ReiserFS filehandles are being
> > reused by the server. Doesn't Reiser provide a generation count to
> > guard against this sort of thing?
>
> Yes, inode->i_generation is stored in the file handle:
> fs/reiserfs/inode.c:reiserfs_dentry_to_fh().
>
> Hans-Peter, what version of NFS are you using and have you remounted
> clients after upgrading to the newer kernel?
I can reproduce it with 2.4.5, 2.4.6, 2.4.13-ac7 and 2.4.18-pre3, with and
without Trond's NFS-ALL patches applied. I don't understand your question, but
testing this involved several reboots of the server and a few dozen reboots of
the client. The lm_sensors build reproduces it quite reliably (with a few
exceptions, and at different commands among ar, gcc and ln).
The test is pretty simple: a clean make of lm_sensors triggers it almost
every time. If not, just rm lib/libsensors* and make again. This
reliably creates stale files lib/libsensors.so and lib/libsensors.so.1
as seen from the client. You can only get rid of them by rebooting or
by removing them on the server.
> > My 'fix' just solves the immediate problem of the wrong file mode. It
> > does not solve the problems of data corruption that can occur when the
> > client is incapable of distinguishing the 'old' and 'new' files that
> > share the same filehandle.
>
> This requires i_generation overflow (modulo bug in reiserfs).
Please try to reproduce it, and let us know whether you can.
> > Cheers,
> > Trond
>
> Nikita.
Cheers,
Hans-Peter
On Tuesday 15. January 2002 16:27, Nikita Danilov wrote:
> In reiserfs there is no static inode table, so we keep global generation
> counter in a super block which is incremented on each inode deletion,
> this generation is stored in the new inodes. Not that good as per-inode
> generation, but we cannot do better without changing disk format.
Am I right in assuming that you therefore cannot check that the filehandle is
stale if the client presents you with the filehandle of the 'old' inode
(prior to deletion)?
However if the client compares the 'old' and 'new' filehandle, it will find
them to be different?
Cheers,
Trond
Trond Myklebust writes:
> On Tuesday 15. January 2002 16:27, Nikita Danilov wrote:
>
> > In reiserfs there is no static inode table, so we keep global generation
> > counter in a super block which is incremented on each inode deletion,
> > this generation is stored in the new inodes. Not that good as per-inode
> > generation, but we cannot do better without changing disk format.
>
> Am I right in assuming that you therefore cannot check that the filehandle is
> stale if the client presents you with the filehandle of the 'old' inode
> (prior to deletion)?
> However if the client compares the 'old' and 'new' filehandle, it will find
> them to be different?
Sorry for being vague. Reiserfs keeps a global "inode generation counter",
->s_inode_generation, in the super block. This counter is incremented each
time a reiserfs inode is deleted on disk. When a new inode is
created, the current value of ->s_inode_generation is stored in the inode's
on-disk representation. An inode number (objectid in reiserfs parlance) is
reusable once the inode has been deleted. The same pair (i_ino, i_generation) can
be assigned to a different inode only after ->s_inode_generation
overflows, which requires 2**32 file deletions.
So, no, reiserfs can detect a stale filehandle, although not as reliably as
file systems with static inode tables.
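The scheme described above can be sketched as a toy model. This is an illustration under stated assumptions, not reiserfs code: ToyReiserFS, its objectid allocator and the gen_bits parameter are invented here, and only the counter behaviour follows the description (increment on deletion, stamp on creation, wrap after 2**32).

```python
class ToyReiserFS:
    """Toy model of a super-block generation counter (hypothetical names)."""

    def __init__(self, gen_bits=32):
        self.gen_mask = (1 << gen_bits) - 1
        self.s_inode_generation = 0
        self.free_objectids = []
        self.next_objectid = 100
        self.live = {}  # objectid -> generation stamped at creation

    def create(self):
        # Reuse a freed objectid if there is one, otherwise take a fresh one.
        if self.free_objectids:
            objectid = self.free_objectids.pop()
        else:
            objectid = self.next_objectid
            self.next_objectid += 1
        self.live[objectid] = self.s_inode_generation
        return (objectid, self.s_inode_generation)

    def delete(self, objectid):
        del self.live[objectid]
        self.free_objectids.append(objectid)
        # The global counter advances on every deletion and wraps on overflow.
        self.s_inode_generation = (self.s_inode_generation + 1) & self.gen_mask

    def is_stale(self, handle):
        objectid, generation = handle
        return self.live.get(objectid) != generation


fs = ToyReiserFS()
old = fs.create()
fs.delete(old[0])
new = fs.create()
print(old, new, fs.is_stale(old), fs.is_stale(new))
```

Shrinking gen_bits (for example to 2) makes the overflow case easy to observe: after 2**gen_bits deletions the same (objectid, generation) pair comes back, which is the only way a stale handle can go unnoticed in this model.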
Hans-Peter, please tell me which reiserfs format you are using. 3.5
doesn't support NFS reliably. If you are using 3.5, you'll have to
upgrade to the 3.6 format (copy the data to a new file system). mount -o conv
will not eliminate this problem completely, but will make it much less
probable, so you can try this first.
>
> Cheers,
> Trond
>
Nikita.
On Tuesday, 15. January 2002 17:47, Nikita Danilov wrote:
> Trond Myklebust writes:
> > On Tuesday 15. January 2002 16:27, Nikita Danilov wrote:
> > > In reiserfs there is no static inode table, so we keep global
> > > generation counter in a super block which is incremented on each inode
> > > deletion, this generation is stored in the new inodes. Not that good
> > > as per-inode generation, but we cannot do better without changing disk
> > > format.
> >
> > Am I right in assuming that you therefore cannot check that the
> > filehandle is stale if the client presents you with the filehandle of
> > the 'old' inode (prior to deletion)?
> > However if the client compares the 'old' and 'new' filehandle, it will
> > find them to be different?
>
> Sorry for being vague. Reiserfs keeps global "inode generation counter"
> ->s_inode_generation in a super block. This counter is incremented each
> time reiserfs inode is being deleted on a disk. When new inode is
> created, current value of ->s_inode_generation is stored in inode's
> on-disk representation. Inode number (objectid in reiserfs parlance) is
> reusable once inode was deleted. The same pair (i_ino, i_generation) can
> be assigned to different inode only after ->s_inode_generation
> overflows, which requires 2**32 file deletions.
Except if it's in 3.5 format, which requires just one deletion then?
> So, no, reiserfs can tell stale filehandle, although not as reliable as
> file systems with static inode tables.
>
> Hans-Peter, please tell me, what reiserfs format are you using. 3.5
> doesn't support NFS reliably. If you are using 3.5 you'll have to
> upgrade to 3.6 format (copy data to the new file system). mount -o conv
> will not eliminate this problem completely, but will make it much less
> probable, so you can try this first.
Bad luck for me, obviously :-(
<4>reiserfs: checking transaction log (device 03:09) ...
<4>Using r5 hash to sort names
<4>reiserfs: using 3.5.x disk format
<4>ReiserFS version 3.6.25
<4>reiserfs: checking transaction log (device 03:08) ...
<4>Using r5 hash to sort names
<4>reiserfs: using 3.5.x disk format
<4>ReiserFS version 3.6.25
<4>reiserfs: checking transaction log (device 03:06) ...
<4>Using r5 hash to sort names
<4>reiserfs: using 3.5.x disk format
<4>ReiserFS version 3.6.25
<4>reiserfs: checking transaction log (device 03:07) ...
<4>Using r5 hash to sort names
<4>reiserfs: using 3.5.x disk format
<4>ReiserFS version 3.6.25
<4>reiserfs: checking transaction log (device 03:0a) ...
<4>Using r5 hash to sort names
<4>reiserfs: using 3.5.x disk format
<4>ReiserFS version 3.6.25
<4>reiserfs: checking transaction log (device 21:02) ...
<4>Using r5 hash to sort names
<4>reiserfs: using 3.5.x disk format
<4>ReiserFS version 3.6.25
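With several partitions, pairing each device with its reported disk format by eye gets tedious. The following is a minimal sketch that extracts the pairs from kernel log text of the shape shown above. It assumes that the "using ... disk format" line always follows the "checking transaction log" line of the same device; the function name disk_formats is made up.

```python
import re

DEVICE_RE = re.compile(r"reiserfs: checking transaction log \(device ([0-9a-f]+:[0-9a-f]+)\)")
FORMAT_RE = re.compile(r"reiserfs: using (\S+) disk format")


def disk_formats(log_text):
    """Map each device (major:minor) to the disk format reported right after it."""
    formats = {}
    current_device = None
    for line in log_text.splitlines():
        match = DEVICE_RE.search(line)
        if match:
            current_device = match.group(1)
            continue
        match = FORMAT_RE.search(line)
        if match and current_device is not None:
            formats[current_device] = match.group(1)
            current_device = None
    return formats


sample = """\
<4>reiserfs: checking transaction log (device 03:09) ...
<4>Using r5 hash to sort names
<4>reiserfs: using 3.5.x disk format
<4>ReiserFS version 3.6.25
<4>reiserfs: checking transaction log (device 21:02) ...
<4>Using r5 hash to sort names
<4>reiserfs: using 3.5.x disk format
<4>ReiserFS version 3.6.25
"""
print(disk_formats(sample))
```

Note that the "ReiserFS version 3.6.25" line is ignored on purpose: it appears alongside the 3.5.x disk format lines, so it evidently does not describe the on-disk format.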
We're talking about 100 GB on _this_ server.
How big is the risk of losing data with -o conv?
Is there any paper around that describes this conversion
in a bit more detail? If I understand you correctly, the inode
generation counter doesn't work at all with 3.5?
> > Cheers,
> > Trond
>
> Nikita.
Cheers,
Hans-Peter
Hans-Peter Jansen writes:
> On Tuesday, 15. January 2002 17:47, Nikita Danilov wrote:
> > Trond Myklebust writes:
> > > On Tuesday 15. January 2002 16:27, Nikita Danilov wrote:
> > > > In reiserfs there is no static inode table, so we keep global
> > > > generation counter in a super block which is incremented on each inode
> > > > deletion, this generation is stored in the new inodes. Not that good
> > > > as per-inode generation, but we cannot do better without changing disk
> > > > format.
> > >
> > > Am I right in assuming that you therefore cannot check that the
> > > filehandle is stale if the client presents you with the filehandle of
> > > the 'old' inode (prior to deletion)?
> > > However if the client compares the 'old' and 'new' filehandle, it will
> > > find them to be different?
> >
> > Sorry for being vague. Reiserfs keeps global "inode generation counter"
> > ->s_inode_generation in a super block. This counter is incremented each
> > time reiserfs inode is being deleted on a disk. When new inode is
> > created, current value of ->s_inode_generation is stored in inode's
> > on-disk representation. Inode number (objectid in reiserfs parlance) is
> > reusable once inode was deleted. The same pair (i_ino, i_generation) can
> > be assigned to different inode only after ->s_inode_generation
> > overflows, which requires 2**32 file deletions.
>
> Except it's in 3.5 format, which requires one deletion then?
In the same directory, yes.
>
> > So, no, reiserfs can tell stale filehandle, although not as reliable as
> > file systems with static inode tables.
> >
> > Hans-Peter, please tell me, what reiserfs format are you using. 3.5
> > doesn't support NFS reliably. If you are using 3.5 you'll have to
> > upgrade to 3.6 format (copy data to the new file system). mount -o conv
> > will not eliminate this problem completely, but will make it much less
> > probable, so you can try this first.
>
> Bad luck for me, obviously :-(
>
> <4>reiserfs: checking transaction log (device 03:09) ...
> <4>Using r5 hash to sort names
> <4>reiserfs: using 3.5.x disk format
[...]
> <4>reiserfs: using 3.5.x disk format
> <4>ReiserFS version 3.6.25
>
> We're talking about 100 GB on _this_ server.
3.6 is advantageous for many other reasons as well, like LFS, etc.
>
> How big is the chance to loose data with -o conv?
There were problems with -o conv and remount (for the root file system), but
they were fixed in Marcelo's latest kernels.
>
> Is there any paper around, which describes this conversion
> a bit more detailed? If I understand you correctly, the inode
> generation counter doesn't work at all with 3.5?
After the file system is mounted with -o conv, all new files will be created
in the new format. The file system will then no longer be mountable as
3.5 (and is thus inaccessible from 2.2 kernels).
New files will store generation counters. A stale
handle can still lurk undetected when an old-format file was deleted, its
objectid was reused for a new-format file, and the super-block generation
counter at that time happens to coincide with the objectid of the parent
directory of the old file. Not exactly a likely thing to happen, but
still.
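The remaining collision case can be written down as a tiny check. This is a sketch with a simplified two-field handle and hypothetical function names; it only restates the condition from the paragraph above and says nothing about how likely it is in practice.

```python
def old_format_handle(objectid, parent_dir_id):
    """Old-format file: the 'generation' slot holds the parent directory's objectid."""
    return (objectid, parent_dir_id)


def new_format_handle(objectid, sb_generation_at_creation):
    """New-format file: the generation is the super-block counter at creation time."""
    return (objectid, sb_generation_at_creation)


def stale_handle_undetected(objectid, old_parent_dir_id, sb_generation_at_reuse):
    """True if a client's old handle is indistinguishable from the new file's handle."""
    old = old_format_handle(objectid, old_parent_dir_id)
    new = new_format_handle(objectid, sb_generation_at_reuse)
    return old == new


# Hypothetical numbers: parent directory objectid 1234, counter at 7 vs. 1234.
print(stale_handle_undetected(50066, 1234, 7))
print(stale_handle_undetected(50066, 1234, 1234))
```

Only the second call reports a collision, because the counter happened to equal the old parent directory's objectid when the inode number was reused.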
>
> > > Cheers,
> > > Trond
> >
Nikita.
>
> Cheers,
> Hans-Peter
>
Nikita,
To be clear: if I upgrade the kernel on my nfs server to 2.4.latest and
mount -o conv my reiserfs partition, that will almost certainly fix my
knfsd problem with a very small likelihood of generation problems?
regards,
David
--
David L. Parsley
Network Administrator, Roanoke College
"If I have seen further it is by standing on ye shoulders of Giants."
--Isaac Newton
On Tuesday, 15. January 2002 18:53, Nikita Danilov wrote:
> Hans-Peter Jansen writes:
> > On Tuesday, 15. January 2002 17:47, Nikita Danilov wrote:
>
> 3.6. is advantageous because of many other things, like LFS, etc.
>
> > How big is the chance to loose data with -o conv?
>
> There were problems with -o conv and remount (for root file system), but
> they were cured in latest Marcelo's kernels.
>
> > Is there any paper around, which describes this conversion
> > a bit more detailed? If I understand you correctly, the inode
> > generation counter doesn't work at all with 3.5?
>
> After file system is mounted with -o conv, all new files will be created
> in a new format. This file system will then no longer be mountable as
> 3.5 (and thus, inaccessible from 2.2 kernels).
>
> New files will store generation counters. The possibility of a stale
> handle lurking undetected is when old-format file was deleted, its
> objectid was reused for new format file, and super-block generation
> counter at that time happens to coincide with objectid of parent
> directory of the old file. Not exactly likely thing to happen, but
> still.
I will meditate over the last paragraph later. I decided to follow your
first advice...
I think this is worth a note in the reiserfs FAQ. And remember: almost
all Linux distributions will use 3.5 to ensure backward compatibility.
Also note that the web man page and mkreiserfs -h disagree on the -v option.
I will believe mkreiserfs.
If I use the notail mount option on an already populated partition, what happens
to the "tailed" files? I expect only newly created ones get their own block.
>
> Nikita.
>
Cheers,
Hans-Peter
Hans-Peter Jansen writes:
> On Tuesday, 15. January 2002 18:53, Nikita Danilov wrote:
> > Hans-Peter Jansen writes:
> > > On Tuesday, 15. January 2002 17:47, Nikita Danilov wrote:
[...]
>
> If I use notail mount option on a already populated partition, what happens
> to the "tailed" files? I expect, only newly created ones get there own block.
Right.
>
> >
Nikita.
> >
> Cheers,
> Hans-Peter
>
David L. Parsley writes:
> Nikita,
>
> To be clear: if I upgrade the kernel on my nfs server to 2.4.latest and
> mount -o conv my reiserfs partition, that will almost certain fix my
> knfsd problem with a very small likelihood of generation problems?
It will make the likelihood smaller than before. As to whether it will be
sufficiently small for your purposes, I cannot make any claims. I
recommend that you upgrade to 3.6.
>
> regards,
> David
Nikita.
> --
> David L. Parsley
> Network Administrator, Roanoke College
> "If I have seen further it is by standing on ye shoulders of Giants."
> --Isaac Newton
>
>