NFS server: Linux 2.6.25
NFS client: Linux debian 2.6.25-2 (or 2.6.23.1)
If I do:
NFS client: fd1 = creat("foo"); write(fd1, "xx", 2); fsync(fd1);
NFS server: unlink("foo"); creat("foo");
NFS client: fd2 = open("foo"); fstat(fd1, &st1); fstat(fd2, &st2);
fstat(fd1, &st3);
The result is usually that the fstat(fd1) fails with ESTALE. But
sometimes the result is st1.st_ino == st2.st_ino == st3.st_ino and
st1.st_size == 2 but st2.st_size == 0. So I see two different files
using the same inode number. I'd really want to avoid seeing that
condition.
So what I'd want to know is:
a) Why does this happen only sometimes? I can't really figure out from
the code what invalidates the fd1 inode. Apparently the second open()
somehow, but since it uses the new "foo" file with a different struct
inode, where does the old struct inode get invalidated?
b) Can this be fixed? Or is it just luck that it works as well as it
does now?
Attached a test program. Usage:
NFS client: Mount with actimeo=2
NFS client: ./t
(Run the next two commands within 2 seconds)
NFS server: rm -f foo;touch foo
NFS client: hit enter
Once in a while the result will be:
1a: ino=15646940 size=2
1b: ino=15646940 size=2
1c: ino=15646940 size=2
2: ino=15646940 size=0
1d: ino=15646940 size=2
Timo Sirainen wrote:
> NFS server: Linux 2.6.25
> NFS client: Linux debian 2.6.25-2 (or 2.6.23.1)
>
> If I do:
>
> NFS client: fd1 = creat("foo"); write(fd1, "xx", 2); fsync(fd1);
> NFS server: unlink("foo"); creat("foo");
> NFS client: fd2 = open("foo"); fstat(fd1, &st1); fstat(fd2, &st2);
> fstat(fd1, &st3);
>
> The result is usually that the fstat(fd1) fails with ESTALE. But
> sometimes the result is st1.st_ino == st2.st_ino == st3.st_ino and
> st1.st_size == 2 but st2.st_size == 0. So I see two different files
> using the same inode number. I'd really want to avoid seeing that
> condition.
>
>
This is really up to the file system on the server. It is the one
that selects the inode number when creating a new file.
> So what I'd want to know is:
>
> a) Why does this happen only sometimes? I can't really figure out from
> the code what invalidates the fd1 inode. Apparently the second open()
> somehow, but since it uses the new "foo" file with a different struct
> inode, where does the old struct inode get invalidated?
>
>
This will happen always, but you may see occasional successful
fstat() calls on the client due to attribute caching and/or
dentry caching.
> b) Can this be fixed? Or is it just luck that it works as well as it
> does now?
>
> =20
This can be fixed, somewhat. I have some changes to address the
ESTALE situation in system calls that take filenames as arguments,
but I need to work with some more people to get them included.
The system calls which do not take file names as arguments cannot
be recovered from because the file they are referring to is really
gone or at least not accessible anymore.
The reuse of the inode number is just a fact of life and the way
that file systems work. I would suggest rethinking your application
in order to reduce or eliminate any dependence that it might have.
All this said, making changes on both the server and the client is
dangerous and can easily lead to consistency and/or performance
issues.
Thanx...
ps
> Attached a test program. Usage:
>
> NFS client: Mount with actimeo=2
> NFS client: ./t
> (Run the next two commands within 2 seconds)
> NFS server: rm -f foo;touch foo
> NFS client: hit enter
>
> Once in a while the result will be:
> 1a: ino=15646940 size=2
> 1b: ino=15646940 size=2
> 1c: ino=15646940 size=2
> 2: ino=15646940 size=0
> 1d: ino=15646940 size=2
>
>
> ------------------------------------------------------------------------
>
> #include <errno.h>
> #include <string.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <sys/stat.h>
>
> int main(void) {
>     struct stat st;
>     int fd, fd2;
>     char buf[100];
>
>     fd = open("foo", O_RDWR | O_CREAT, 0666);
>     write(fd, "xx", 2); fsync(fd);
>     if (fstat(fd, &st) < 0) perror("fstat()");
>     printf("1a: ino=%ld size=%ld\n", (long)st.st_ino, (long)st.st_size);
>
>     fgets(buf, sizeof(buf), stdin);
>     if (fstat(fd, &st) < 0) perror("fstat()");
>     else printf("1b: ino=%ld size=%ld\n", (long)st.st_ino, (long)st.st_size);
>
>     fd2 = open("foo", O_RDWR);
>     if (fstat(fd, &st) < 0) perror("fstat()");
>     else printf("1c: ino=%ld size=%ld\n", (long)st.st_ino, (long)st.st_size);
>     if (fstat(fd2, &st) < 0) perror("fstat()");
>     else printf("2: ino=%ld size=%ld\n", (long)st.st_ino, (long)st.st_size);
>     if (fstat(fd, &st) < 0) perror("fstat()");
>     else printf("1d: ino=%ld size=%ld\n", (long)st.st_ino, (long)st.st_size);
>     return 0;
> }
>
On Tue, 2008-05-27 at 08:48 -0400, Peter Staubach wrote:
> Timo Sirainen wrote:
> > NFS server: Linux 2.6.25
> > NFS client: Linux debian 2.6.25-2 (or 2.6.23.1)
> >
> > If I do:
> >
> > NFS client: fd1 = creat("foo"); write(fd1, "xx", 2); fsync(fd1);
> > NFS server: unlink("foo"); creat("foo");
> > NFS client: fd2 = open("foo"); fstat(fd1, &st1); fstat(fd2, &st2);
> > fstat(fd1, &st3);
> >
> > The result is usually that the fstat(fd1) fails with ESTALE. But
> > sometimes the result is st1.st_ino == st2.st_ino == st3.st_ino and
> > st1.st_size == 2 but st2.st_size == 0. So I see two different files
> > using the same inode number. I'd really want to avoid seeing that
> > condition.
> >
> >
>
> This is really up to the file system on the server. It is the one
> that selects the inode number when creating a new file.
I don't mind that the inode gets reused; I mind that I can't reliably
detect that situation.
> > So what I'd want to know is:
> >
> > a) Why does this happen only sometimes? I can't really figure out from
> > the code what invalidates the fd1 inode. Apparently the second open()
> > somehow, but since it uses the new "foo" file with a different struct
> > inode, where does the old struct inode get invalidated?
> >
> >
>
> This will happen always, but you may see occasional successful
> fstat() calls on the client due to attribute caching and/or
> dentry caching.
I would understand if it always failed or always succeeded, but it seems
to be somewhat random now. And it's not "occasional successful fstat()",
but it's "occasional failed fstat()". The difference shouldn't be
because of attribute caching, because I explicitly set it to two
seconds and run the test within those 2 seconds. So the test should always
hit the attribute cache, and according to you that should always cause
it to succeed (but it rarely does). I think dentry caching also more or
less depends on attribute cache timeout?
> > b) Can this be fixed? Or is it just luck that it works as well as it
> > does now?
> >
> >
>
> This can be fixed, somewhat. I have some changes to address the
> ESTALE situation in system calls that take filenames as arguments,
> but I need to work with some more people to get them included.
> The system calls which do not take file names as arguments cannot
> be recovered from because the file they are referring to is really
> gone or at least not accessible anymore.
>
> The reuse of the inode number is just a fact of life and the way
> that file systems work. I would suggest rethinking your application
> in order to reduce or eliminate any dependence that it might have.
The problem I have is that I need to reliably find out if a file has
been replaced with a new file. So I first flush the dentry cache
(chowning parent directory), stat() the file and fstat() the opened
file. If fstat() fails with ESTALE or if the inodes don't match, I know
that the file has been replaced and I need to re-open and re-read it.
This seems to work nearly always.
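In code the check is roughly the following (an untested sketch only; the
directory and file names are placeholders, error handling is simplified,
and the chown()-to-the-same-owner call is just the trick mentioned above
to force the client to revalidate the dentry):

#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>

/* Returns 1 if the file opened as `fd` apparently is no longer the file
 * that `path` points to (replaced or deleted), 0 if it still looks like
 * the same file. */
static int file_was_replaced(const char *dir, const char *path, int fd)
{
    struct stat dst, pst, fst;

    /* chown the parent directory to its current owner just to flush the
     * client's dentry cache for it */
    if (stat(dir, &dst) == 0)
        (void)chown(dir, dst.st_uid, dst.st_gid);

    if (stat(path, &pst) < 0)
        return errno == ENOENT || errno == ESTALE;
    if (fstat(fd, &fst) < 0)
        return errno == ESTALE;

    /* different inode (or device) => the path now points elsewhere */
    return pst.st_ino != fst.st_ino || pst.st_dev != fst.st_dev;
}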
Timo Sirainen wrote:
> On Tue, 2008-05-27 at 08:48 -0400, Peter Staubach wrote:
>
>> Timo Sirainen wrote:
>>
>>> NFS server: Linux 2.6.25
>>> NFS client: Linux debian 2.6.25-2 (or 2.6.23.1)
>>>
>>> If I do:
>>>
>>> NFS client: fd1 = creat("foo"); write(fd1, "xx", 2); fsync(fd1);
>>> NFS server: unlink("foo"); creat("foo");
>>> NFS client: fd2 = open("foo"); fstat(fd1, &st1); fstat(fd2, &st2);
>>> fstat(fd1, &st3);
>>>
>>> The result is usually that the fstat(fd1) fails with ESTALE. But
>>> sometimes the result is st1.st_ino == st2.st_ino == st3.st_ino and
>>> st1.st_size == 2 but st2.st_size == 0. So I see two different files
>>> using the same inode number. I'd really want to avoid seeing that
>>> condition.
>>>
>>>
>>>
>> This is really up to the file system on the server. It is the one
>> that selects the inode number when creating a new file.
>>
>
> I don't mind that the inode gets reused; I mind that I can't reliably
> detect that situation.
>
>
Outside of shortening up the attribute cache timeout values,
with the current implementation, I don't think that you are
going to be able to reliably detect when the file on the server
has been removed and recreated.
>>> So what I'd want to know is:
>>>
>>> a) Why does this happen only sometimes? I can't really figure out from
>>> the code what invalidates the fd1 inode. Apparently the second open()
>>> somehow, but since it uses the new "foo" file with a different struct
>>> inode, where does the old struct inode get invalidated?
>>>
>>>
>>>
>> This will happen always, but you may see occasional successful
>> fstat() calls on the client due to attribute caching and/or
>> dentry caching.
>>
>
> I would understand if it always failed or always succeeded, but it seems
> to be somewhat random now. And it's not "occasional successful fstat()",
> but it's "occasional failed fstat()". The difference shouldn't be
> because of attribute caching, because I explicitly set it to two
> seconds and run the test within those 2 seconds. So the test should always
> hit the attribute cache, and according to you that should always cause
> it to succeed (but it rarely does). I think dentry caching also more or
> less depends on attribute cache timeout?
>
>
How did you specify the attribute cache to be 2 seconds?
The dentry-based caching is also subject to timeout-based
verification, but typically on much longer time scales.
>>> b) Can this be fixed? Or is it just luck that it works as well as it
>>> does now?
>>>
>>>
>>>
>> This can be fixed, somewhat. I have some changes to address the
>> ESTALE situation in system calls that take filenames as arguments,
>> but I need to work with some more people to get them included.
>> The system calls which do not take file names as arguments cannot
>> be recovered from because the file they are referring to is really
>> gone or at least not accessible anymore.
>>
>> The reuse of the inode number is just a fact of life and the way
>> that file systems work. I would suggest rethinking your application
>> in order to reduce or eliminate any dependence that it might have.
>>
>
> The problem I have is that I need to reliably find out if a file has
> been replaced with a new file. So I first flush the dentry cache
> (chowning parent directory), stat() the file and fstat() the opened
> file. If fstat() fails with ESTALE or if the inodes don't match, I know
> that the file has been replaced and I need to re-open and re-read it.
> This seems to work nearly always.
>
This would seem to be quite implementation specific and also has
some timing dependencies built-in. These would seem to me to be
dangerous assumptions and heuristics to be depending upon.
Have you considered making the contents of the file itself versioned
in some fashion, thus removing dependencies on how the NFS client
works and/or the file system on the NFS server?
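For example, something along these lines (only a rough sketch of the idea,
not anything Dovecot actually does; the header layout and the names are
invented): store a random file identity in a small header that is
regenerated whenever the file is recreated, and have readers compare it
against the header they last saw instead of trusting st_ino:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

struct file_header {
    uint64_t file_id;   /* regenerated each time the file is recreated */
};

/* Returns 1 if `path` no longer carries the header the caller cached. */
static int file_was_recreated(const char *path,
                              const struct file_header *cached)
{
    struct file_header cur;
    int fd, replaced = 1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return 1;   /* gone (or inaccessible): treat as replaced */
    if (pread(fd, &cur, sizeof(cur), 0) == (ssize_t)sizeof(cur))
        replaced = cur.file_id != cached->file_id;
    close(fd);
    return replaced;
}

The fresh open() should also give the NFS client a chance to revalidate
its cached attributes and data under close-to-open consistency.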
Thanx...
ps
On May 27, 2008, at 9:09 PM, Peter Staubach wrote:
>>>> So what I'd want to know is:
>>>>
>>>> a) Why does this happen only sometimes? I can't really figure out
>>>> from
>>>> the code what invalidates the fd1 inode. Apparently the second
>>>> open()
>>>> somehow, but since it uses the new "foo" file with a different
>>>> struct
>>>> inode, where does the old struct inode get invalidated?
>>>>
>>>>
>>> This will happen always, but you may see occasional successful
>>> fstat() calls on the client due to attribute caching and/or
>>> dentry caching.
>>>
>>
>> I would understand if it always failed or always succeeded, but it
>> seems
>> to be somewhat random now. And it's not "occasional successful
>> fstat()",
>> but it's "occasional failed fstat()". The difference shouldn't be
>> because of attribute caching, because I explicitly set it to two
>> seconds and run the test within those 2 seconds. So the test should
>> always
>> hit the attribute cache, and according to you that should always
>> cause
>> it to succeed (but it rarely does). I think dentry caching also
>> more or
>> less depends on attribute cache timeout?
>>
>>
>
> How did you specify the attribute cache to be 2 seconds?
mount -o actimeo=2
>>>> b) Can this be fixed? Or is it just luck that it works as well as
>>>> it
>>>> does now?
>>>>
>>>>
>>> This can be fixed, somewhat. I have some changes to address the
>>> ESTALE situation in system calls that take filenames as arguments,
>>> but I need to work with some more people to get them included.
>>> The system calls which do not take file names as arguments cannot
>>> be recovered from because the file they are referring to is really
>>> gone or at least not accessible anymore.
>>>
>>> The reuse of the inode number is just a fact of life and the way
>>> that file systems work. I would suggest rethinking your application
>>> in order to reduce or eliminate any dependence that it might have.
>>>
>>
>> The problem I have is that I need to reliably find out if a file has
>> been replaced with a new file. So I first flush the dentry cache
>> (chowning parent directory), stat() the file and fstat() the opened
>> file. If fstat() fails with ESTALE or if the inodes don't match, I
>> know
>> that the file has been replaced and I need to re-open and re-read it.
>> This seems to work nearly always.
>>
>
> This would seem to be quite implementation specific and also has
> some timing dependencies built-in. These would seem to me to be
> dangerous assumptions and heuristics to be depending upon.
>
> Have you considered making the contents of the file itself versioned
> in some fashion, thus removing dependencies on how the NFS client
> works and/or the file system on the NFS server?
I guess one possibility would be to link() the file elsewhere for "a
while", so that the inode wouldn't get reused until everyone's
attribute caches have been flushed. That feels like a bit of a dirty
solution too, though. (This is about handling Dovecot IMAP/POP3's
metadata files.)
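Something like this, maybe (untested sketch of that idea; the trash
directory and the inode-number naming scheme are made up, and the cleanup
of the old links once attribute caches have expired is left out):

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Before unlinking/replacing `path`, give the current inode an extra name
 * under `trash_dir` so its inode number can't be recycled while other
 * clients may still have the old file open or cached. */
static int park_old_inode(const char *path, const char *trash_dir)
{
    struct stat st;
    char trash_path[4096];

    if (stat(path, &st) < 0)
        return -1;
    /* e.g. trash/15646940 - keyed by inode number */
    snprintf(trash_path, sizeof(trash_path), "%s/%lu",
             trash_dir, (unsigned long)st.st_ino);
    return link(path, trash_path);
}
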
I'd still like to understand why exactly this happens though. Maybe
there's a chance that this is just a bug in the current NFS
implementation so I could keep using my current code (which is
actually very difficult to break even with stress testing, so if this
doesn't get fixed on the kernel side, I'll probably just leave my code as
it is). I guess I'll start debugging the NFS code to find out what's
really going on.
On May. 27, 2008, 22:13 +0300, Timo Sirainen <[email protected]> wrote:
> On May 27, 2008, at 9:09 PM, Peter Staubach wrote:
>
>>>>> So what I'd want to know is:
>>>>>
>>>>> a) Why does this happen only sometimes? I can't really figure out
>>>>> from
>>>>> the code what invalidates the fd1 inode. Apparently the second
>>>>> open()
>>>>> somehow, but since it uses the new "foo" file with a different
>>>>> struct
>>>>> inode, where does the old struct inode get invalidated?
>>>>>
>>>>>
>>>> This will happen always, but you may see occasional successful
>>>> fstat() calls on the client due to attribute caching and/or
>>>> dentry caching.
>>>>
>>> I would understand if it always failed or always succeeded, but it
>>> seems
>>> to be somewhat random now. And it's not "occasional successful
>>> fstat()",
>>> but it's "occasional failed fstat()". The difference shouldn't be
>>> because of attribute caching, because I explicitly set it to two
>>> seconds and run the test within those 2 seconds. So the test should
>>> always
>>> hit the attribute cache, and according to you that should always
>>> cause
>>> it to succeed (but it rarely does). I think dentry caching also
>>> more or
>>> less depends on attribute cache timeout?
>>>
>>>
>> How did you specify the attribute cache to be 2 seconds?
>
> mount -o actimeo=2
>
>>>>> b) Can this be fixed? Or is it just luck that it works as well as
>>>>> it
>>>>> does now?
>>>>>
>>>>>
>>>> This can be fixed, somewhat. I have some changes to address the
>>>> ESTALE situation in system calls that take filenames as arguments,
>>>> but I need to work with some more people to get them included.
>>>> The system calls which do not take file names as arguments cannot
>>>> be recovered from because the file they are referring to is really
>>>> gone or at least not accessible anymore.
>>>>
>>>> The reuse of the inode number is just a fact of life and the way
>>>> that file systems work. I would suggest rethinking your application
>>>> in order to reduce or eliminate any dependence that it might have.
>>>>
>>> The problem I have is that I need to reliably find out if a file has
>>> been replaced with a new file. So I first flush the dentry cache
>>> (chowning parent directory), stat() the file and fstat() the opened
>>> file. If fstat() fails with ESTALE or if the inodes don't match, I
>>> know
>>> that the file has been replaced and I need to re-open and re-read it.
>>> This seems to work nearly always.
>>>
>> This would seem to be quite implementation specific and also has
>> some timing dependencies built-in. These would seem to me to be
>> dangerous assumptions and heuristics to be depending upon.
>>
>> Have you considered making the contents of the file itself versioned
>> in some fashion, thus removing dependencies on how the NFS client
>> works and/or the file system on the NFS server?
>
> I guess one possibility would be to link() the file elsewhere for "a
> while", so that the inode wouldn't get reused until everyone's
> attribute caches have been flushed. That feels like a bit of a dirty
> solution too, though. (This is about handling Dovecot IMAP/POP3's
> metadata files.)
The NFS (v2/v3) server can't guarantee you traditional Unix semantics
where the inode is kept around until the last close. Hard linking it to keep
it around is the cleanest way you can go IMO.
>
> I'd still like to understand why exactly this happens though. Maybe
> there's a chance that this is just a bug in the current NFS
> implementation so I could keep using my current code (which is
> actually very difficult to break even with stress testing, so if this
> doesn't get fixed on the kernel side, I'll probably just leave my code as
> it is). I guess I'll start debugging the NFS code to find out what's
> really going on.
My guess would be that the new incarnation of the inode generates the
same filehandle as the old one, not just the same inode number.
Benny
On Wed, May 28, 2008 at 08:38:29AM +0300, Benny Halevy wrote:
> On May. 27, 2008, 22:13 +0300, Timo Sirainen <[email protected]> wrote:
> > I'd still like to understand why exactly this happens though. Maybe
> > there's a chance that this is just a bug in the current NFS
> > implementation so I could keep using my current code (which is
> > actually very difficult to break even with stress testing, so if this
> > doesn't get fixed on the kernel side, I'll probably just leave my code as
> > it is). I guess I'll start debugging the NFS code to find out what's
> > really going on.
>
> My guess would be that the new incarnation of the inode generates the
> same filehandle as the old one, not just the same inode number.
That sounds like a server bug (either in the server itself, or the
filesystem it's exporting); the generation number is supposed to prevent
this.
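One way to check that from userspace (untested sketch; it assumes the
exported filesystem is ext2/3/4, which expose the inode generation via the
FS_IOC_GETVERSION ioctl) is to print the inode number and generation on the
server before the rm and again after the touch; if both values repeat, the
filehandle would indeed be reused:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "foo";
    struct stat st;
    int fd, gen = 0;

    fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    /* i_generation of the underlying inode, as exposed by ext2/3/4 */
    if (ioctl(fd, FS_IOC_GETVERSION, &gen) < 0)
        perror("FS_IOC_GETVERSION");
    printf("ino=%lu generation=%d\n", (unsigned long)st.st_ino, gen);
    close(fd);
    return 0;
}

I believe lsattr -v on the server prints the same generation value, if
that's easier.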
--b.
On Wed, 2008-05-28 at 09:59 -0400, J. Bruce Fields wrote:
> On Wed, May 28, 2008 at 08:38:29AM +0300, Benny Halevy wrote:
> > On May. 27, 2008, 22:13 +0300, Timo Sirainen <[email protected]> wrote:
> > > I'd still like to understand why exactly this happens though. Maybe
> > > there's a chance that this is just a bug in the current NFS
> > > implementation so I could keep using my current code (which is
> > > actually very difficult to break even with stress testing, so if this
> > > doesn't get fixed on the kernel side, I'll probably just leave my code as
> > > it is). I guess I'll start debugging the NFS code to find out what's
> > > really going on.
> >
> > My guess would be that the new incarnation of the inode generates the
> > same filehandle as the old one, not just the same inode number.
>
> That sounds like a server bug (either in the server itself, or the
> filesystem it's exporting); the generation number is supposed to prevent
> this.
And if it happens, wouldn't the struct inode on the NFS client be the same
for both files then? I'm seeing different results for fstat() calls
(just with the same st_ino).