On both Ubuntu 16.04 (kernel 4.4.0-130) and CentOS 7.3 (kernel
3.10.0-862.11.6.el7.x86_64) with NFS 4.1, I'm seeing an issue where
stale data is shown if a file remains open on one machine, and the
file is overwritten via a rename() on another. Here's my test:
1. On node A, create two different files on a shared NFS mount:
"test1.txt" and "test2.txt".
2. On node B, continuously show the contents of the first file: "while
true; do cat test1.txt; done"
3. On node B, run a process that keeps "test1.txt" open. For example,
with Python, run:
f = open('/nfs-mount/test1.txt', 'r')
4. Rename test2.txt via "mv -f test2.txt test1.txt"
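For concreteness, here is the whole sequence as a shell sketch (the file
contents and the fd-based stand-in for the Python open() are illustrative,
and the export is assumed to be mounted at /nfs-mount on both nodes):
# Node A: create the two files on the shared mount.
echo one > /nfs-mount/test1.txt
echo two > /nfs-mount/test2.txt
# Node B, terminal 1: continuously read the first file.
while true; do cat /nfs-mount/test1.txt; done
# Node B, terminal 2: keep test1.txt open (stand-in for the Python open()).
exec 3< /nfs-mount/test1.txt
# Node A: overwrite test1.txt via a rename.
mv -f /nfs-mount/test2.txt /nfs-mount/test1.txt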
On node B, I see the contents of the original test1.txt indefinitely,
even after I disabled attribute caching and the lookup cache. I can
make the while loop in step 2 show the new content if I perform one of
these actions:
1. Run "ls /nfs-mount"
2. Close the open file in step 3
I suspect the first causes the readdir cache revalidation to happen.
Is this intended behavior, or is there a better way to achieve
consistency here without performing one of these actions?
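(For reference, "disabled attribute caching and the lookup cache" here means
client mount options; the exact invocation isn't shown above, but it is
likely something along these lines, with server name and export path as
placeholders:)
# actimeo=0 disables attribute caching, lookupcache=none disables the
# lookup (dentry) cache; vers matches the NFS 4.1 setup described above.
mount -t nfs -o vers=4.1,actimeo=0,lookupcache=none server:/export /nfs-mount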
Note that with an Isilon NFS server, instead of seeing stale content,
I see "Stale file handle" errors indefinitely unless I perform one of
the corrective steps.
On Mon, Sep 17, 2018 at 01:57:17PM -0700, Stan Hu wrote:
> On both kernels in Ubuntu 16.04 (4.4.0-130) and CentOS 7.3
> (3.10.0-862.11.6.el7.x86_64) with NFS 4.1, I'm seeing an issue where
> stale data is shown if a file remains open on one machine, and the
> file is overwritten via a rename() on another. Here's my test:
>
> 1. On node A, create two different files on a shared NFS mount:
> "test1.txt" and "test2.txt".
> 2. On node B, continuously show the contents of the first file: "while
> true; do cat test1.txt; done"
> 3. On node B, run a process that keeps "test1.txt" open. For example,
> with Python, run:
> f = open('/nfs-mount/test.txt', 'r')
> 4. Rename test2.txt via "mv -f test2.txt test1.txt"
>
> On node B, I see the contents of the original test1.txt indefinitely,
> even after I disabled attribute caching and the lookup cache. I can
> make the while loop in step 2 show the new content if I perform one of
> these actions:
>
> 1. Run "ls /nfs-mount"
> 2. Close the open file in step 3
>
> I suspect the first causes the readdir cache revalidation to happen.
>
> Is this intended behavior, or is there a better way to achieve
> consistency here without performing one of these actions?
Sounds like a bug to me, but I'm not sure where. What filesystem are
you exporting? How much time do you think passes between steps 1 and 4?
(I *think* it's possible you could hit a bug caused by low ctime
granularity if you could get from step 1 to step 4 in less than a
millisecond.)
Those kernel versions--are those the client (node A and B) versions, or
the server versions?
> Note that with an Isilon NFS server, instead of seeing stale content,
> I see "Stale file handle" errors indefinitely unless I perform one of
> the corrective steps.
You see "stale file handle" errors from the "cat test1.txt"? That's
also weird.
--b.
On Mon, Sep 17, 2018 at 2:15 PM J. Bruce Fields <[email protected]> wrote:
> Sounds like a bug to me, but I'm not sure where. What filesystem are
> you exporting? How much time do you think passes between steps 1 and 4?
> (I *think* it's possible you could hit a bug caused by low ctime
> granularity if you could get from step 1 to step 4 in less than a
> millisecond.)
For CentOS, I am exporting xfs. In Ubuntu, I think I was using ext4.
Steps 1 through 4 are all done by hand, so I don't think we're hitting
a millisecond issue. Just for good measure, I've done experiments
where I waited a few minutes between steps 1 and 4.
> Those kernel versions--are those the client (node A and B) versions, or
> the server versions?
The client and server kernel versions are the same across the board. I
didn't mix and match kernels.
> > Note that with an Isilon NFS server, instead of seeing stale content,
> > I see "Stale file handle" errors indefinitely unless I perform one of
> > the corrective steps.
>
> You see "stale file handle" errors from the "cat test1.txt"? That's
> also weird.
Yes, this is the problem I'm actually more concerned about, which led
to this investigation in the first place.
On Mon, Sep 17, 2018 at 02:37:16PM -0700, Stan Hu wrote:
> On Mon, Sep 17, 2018 at 2:15 PM J. Bruce Fields <[email protected]> wrote:
>
> > Sounds like a bug to me, but I'm not sure where. What filesystem are
> > you exporting? How much time do you think passes between steps 1 and 4?
> > (I *think* it's possible you could hit a bug caused by low ctime
> > granularity if you could get from step 1 to step 4 in less than a
> > millisecond.)
>
> For CentOS, I am exporting xfs. In Ubuntu, I think I was using ext4.
>
> Steps 1 through 4 are all done by hand, so I don't think we're hitting
> a millisecond issue. Just for good measure, I've done experiments
> where I waited a few minutes between steps 1 and 4.
>
> > Those kernel versions--are those the client (node A and B) versions, or
> > the server versions?
>
> The client and server kernel versions are the same across the board. I
> didn't mix and match kernels.
>
> > > Note that with an Isilon NFS server, instead of seeing stale content,
> > > I see "Stale file handle" errors indefinitely unless I perform one of
> > > the corrective steps.
> >
> > You see "stale file handle" errors from the "cat test1.txt"? That's
> > also weird.
>
> Yes, this is the problem I'm actually more concerned about, which led
> to this investigation in the first place.
It might be useful to look at the packets on the wire. So, run
something on the server like:
tcpdump -wtmp.pcap -s0 -ieth0
(replace eth0 by the relevant interface), then run the test, then kill
the tcpdump and take a look at tmp.pcap in wireshark, or send tmp.pcap
to the list (as long as there's no sensitive info in there).
What we'd be looking for:
- does the rename cause the directory's change attribute to
change?
- does the server give out a delegation, and, if so, does it
return it before allowing the rename?
- does the client do an open by filehandle or an open by name
after the rename?
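A quick way to skim the resulting capture for those three questions is
tshark's one-line-per-packet summary; the filter below only narrows to NFS
traffic, since more specific display-filter fields vary by Wireshark version:
# One line per NFS packet, in order: look for the RENAME, the DELEGRETURN,
# any OPENs, and the GETATTRs on the directory around the rename.
tshark -r tmp.pcap -Y nfs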
--b.
I'm not sure if the binary pcap made it to the list, but here's a
publicly available link:
https://s3.amazonaws.com/gitlab-support/nfs/nfs-rename-test1.pcap.gz
Some things to note:
* 10.138.0.14 is the NFS server.
* 10.138.0.12 is Node A (the NFS client where the RENAME happened).
* 10.138.0.13 is Node B (the NFS client that has test.txt open and the cat loop)
* Packet 13762 shows the first RENAME request, to which the server
responds with an NFS4ERR_DELAY
* Packet 13769 shows an OPEN request for "test.txt"
* Packet 14564 shows the RENAME retry
* Packet 14569 the server responded with a RENAME NFS4_OK
I don't see a subsequent OPEN request after that. Should there be one?
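For anyone following along with the capture, those frames can be pulled out
of the decompressed pcap by frame number, e.g.:
# Show the frames called out above (first RENAME, OPEN, RENAME retry,
# RENAME NFS4_OK) with full protocol detail.
tshark -r nfs-rename-test1.pcap -V \
  -Y "frame.number==13762 || frame.number==13769 || frame.number==14564 || frame.number==14569"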
On Mon, Sep 17, 2018 at 3:16 PM Stan Hu <[email protected]> wrote:
>
> Attached is the compressed pcap of port 2049 traffic. The file is
> pretty large because the while loop generated a fair amount of
> traffic.
Here are more traces that have been pared down by introducing a
10-second sleep in the while loop (the modified loop is sketched below,
after the links):
1. Experiment 1: A normal rename from test2.txt to test.txt without a
file open: https://s3.amazonaws.com/gitlab-support/nfs/exp1-normal-rename-over-nfs.pcap
2. Experiment 2: A rename from test2.txt to test.txt with a file open:
https://s3.amazonaws.com/gitlab-support/nfs/exp2-rename-over-nfs-with-file-open.pcap
3. Experiment 3: Same as experiment 2 except an `ls` was issued in the
directory where the files reside:
https://s3.amazonaws.com/gitlab-support/nfs/exp3-rename-over-nfs-with-file-open-and-ls.pcap
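(The pared-down reader loop is just the step-2 loop from the original report
with a pause added, so each on-the-wire OPEN/READ is easy to pick out:)
# Node B: poll the renamed-over file every 10 seconds.
while true; do cat /nfs-mount/test.txt; sleep 10; done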
In the pcap of experiment 2, as before, we see the NFS server respond to
the first RENAME request with an NFS4ERR_DELAY.
The next RENAME succeeds, and it looks to me like the GETATTR call for
the file handle receives a modification time 3 minutes prior to the
current timestamp. I could see how this might make sense, since there
is an open file handle out there for this file.
Is this behavior up to the server implementation, and does the Linux
implementation choose to allow references to old handles? I see this in
https://tools.ietf.org/html/rfc3010#page-153:
A filehandle may or may not become stale or expire on a rename.
However, server implementors are strongly encouraged to attempt to
keep file handles from becoming stale or expiring in this fashion.
In experiment 3, I see a burst of activity after the "ls" call. In
both experiments 1 and 3, I see more than 1 OPEN request for test.txt,
which seems to refresh the inode of test.txt.
Any idea what's going on here?
On Mon, Sep 17, 2018 at 03:48:12PM -0700, Stan Hu wrote:
> I'm not sure if the binary pcap made it on the list, but here's s a
> publicly available link:
> https://s3.amazonaws.com/gitlab-support/nfs/nfs-rename-test1.pcap.gz
>
> Some things to note:
>
> * 10.138.0.14 is the NFS server.
> * 10.138.0.12 is Node A (the NFS client where the RENAME happened).
> * 10.138.0.13 is Node B (the NFS client that has test.txt open and the cat loop)
>
> * Packet 13762 shows the first RENAME request, which the server
> responds with an NFS4ERR_DELAY
> * Packet 13769 shows an OPEN request for "test.txt"
> * Packet 14564 shows the RENAME retry
> * Packet 14569 the server responded with a RENAME NFS4_OK
>
> I don't see a subsequent OPEN request after that. Should there be one?
Yes.
We know node B has that cat loop that will keep reopening the file.
The only way node B could avoid translating those open syscalls into
on-the-wire OPENs is if the client holds a delegation.
But it can't hold a delegation on the file that was newly renamed to
test.txt--delegations are revoked on rename, and it would need to do
another OPEN after the rename to get a new delegation. Similarly the
file that gets renamed over should have its delegation revoked--and we
can see that the client does return that delegation. The OPEN here is
actually part of that delegation return process--the CLAIM_DELEGATE_CUR
value on "claim type" is telling the server that this is an open that
the client had cached locally under the delegation it is about to
return.
Looks like a client bug to me, possibly some sort of race handling the
delegation return and the new open.
It might help if it were possible to confirm that this is still
reproducible on the latest upstream kernel.
--b.
On Tue, Sep 18, 2018 at 10:42:03AM -0700, Stan Hu wrote:
> Here are more traces that have been pared down by introducing a
> 10-second sleep in the while loop:
>
> 1. Experiment 1: A normal rename from test2.txt to test.txt without a
> file open: https://s3.amazonaws.com/gitlab-support/nfs/exp1-normal-rename-over-nfs.pcap
> 2. Experiment 2: A rename from test2.txt to test.txt with a file open:
> https://s3.amazonaws.com/gitlab-support/nfs/exp2-rename-over-nfs-with-file-open.pcap
> 3. Experiment 3: Same as experiment 2 except an `ls` was issued in the
> directory where the files reside:
> https://s3.amazonaws.com/gitlab-support/nfs/exp3-rename-over-nfs-with-file-open-and-ls.pcap
>
> In the pcap of experiment 2, as before we see the NFS respond to the
> first RENAME request with a NFS4ERR_DELAY.
>
> The next RENAME succeeds, and it looks to me that the GETATTR call for
> the file handle receives a modification time of 3 minutes prior to the
> current timestamp.
In this case (rename within a directory, over an existing file), there
are three objects affected: the file that's being renamed, the file
that's renamed over, and the parent directory.
Neither file should see its mtime change: mtime only changes when a
file's data changes.
The ctime changes whenever data or attributes change. I'm not sure
whether the ctime of the renamed file is expected to change. The ctime
of the renamed-over file will change (if for no other reason than that
its nlink value changes from 1 to 0).
The directory mtime and ctime will both change, I think.
In any case, what the client actually relies on here is that the rename
changes the NFSv4 attribute called "Change". Depending on filesystem
and kernel version, that "Change" attribute may be derived from the ctime
or from a separate counter. Either way, if that's not changing,
there's a problem. I bet it is changing, though, so I doubt that's the
bug here.
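One quick way to check what the exported filesystem itself does to these
timestamps is to run something like the following directly on the server
against the export (path and file names are illustrative) and compare the
values across the rename:
# On the server, in the exported directory: record timestamps, rename, compare.
cd /export/dir
stat -c '%n mtime=%y ctime=%z' test.txt test2.txt .
mv -f test2.txt test.txt
stat -c '%n mtime=%y ctime=%z' test.txt .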
> I could see how this might make sense since there
> is an open file handle out there for this file.
> Is this behavior up to the server implementation, and the Linux
> implementation chooses to allow references to old handles? I see in
> https://tools.ietf.org/html/rfc3010#page-153:
>
> A filehandle may or may not become stale or expire on a rename.
> However, server implementors are strongly encouraged to attempt to
> keep file handles from becoming stale or expiring in this fashion.
>
> In experiment 3, I see a burst of activity after the "ls" call. In
> both experiments 1 and 3, I see more than 1 OPEN request for test.txt,
> which seems to refresh the inode of test.txt.
>
> Any idea what's going on here?
The filehandle of the renamed file should definitely stay good after the
rename. The filehandle of the file that was renamed over (the old
"test.txt") might or might not. I know the Linux server will still keep
it around as long as an NFSv4 client has it opened and there's not a
server reboot. Your Isilon server might start returning STALE as soon
as it's unlinked by the rename, and that's probably an OK implementation
choice.
--b.
> The ctime changes whenever datat or attributes change. I'm not sure
> whether the ctime of the renamed file is expected to change. The
> ctime of the renamed-over file will change (if for no other reason
> that that it's nlink value changes from 1 to 0).
The ctime of the renamed file is usually expected (and required)
to change (at least on local filesystems). Among other things, this
is how backup programs traditionally detect renamed files during
incremental backups.
Some Linux filesystems have had problems here in the past (changing
ctime on rename is an obscure corner case), but I think all mainstream
ones work right these days.
- cks
On Tue, Sep 18, 2018 at 03:06:06PM -0400, Chris Siebenmann wrote:
> > The ctime changes whenever datat or attributes change. I'm not sure
> > whether the ctime of the renamed file is expected to change. The
> > ctime of the renamed-over file will change (if for no other reason
> > that that it's nlink value changes from 1 to 0).
>
> The ctime of the renamed file is usually expected (and required)
> to change (at least on local filesystems). Among other things, this
> is how backup programs traditionally detect renamed files during
> incremental backups.
>
> Some Linux filesystems have had problems here in the past (changing
> ctime on rename is an obscure corner case), but I think all mainstream
> ones work right these days.
Got it, thanks!
--b.
On Tue, Sep 18, 2018 at 11:19 AM J. Bruce Fields <[email protected]> wrote:
> We know node B has that cat loop that will keep reopening the file.
>
> The only way node B could avoid translating those open syscalls into
> on-the-wire OPENs is if the client holds a delegation.
>
> But it can't hold a delegation on the file that was newly renamed to
> test.txt--delegations are revoked on rename, and it would need to do
> another OPEN after the rename to get a new delegation. Similarly the
> file that gets renamed over should have its delegation revoked--and we
> can see that the client does return that delegation. The OPEN here is
> actually part of that delegation return process--the CLAIM_DELEGATE_CUR
> value on "claim type" is telling the server that this is an open that
> the client had cached locally under the delegation it is about to
> return.
>
> Looks like a client bug to me, possibly some sort of race handling the
> delegation return and the new open.
>
> It might help if it were possible to confirm that this is still
> reproduceable on the latest upstream kernel.
Thanks for that information. I did more testing, and it looks like
this stale file problem only appears to happen when the NFS client
protocol is 4.0 (via the vers=4.0 mount option). 4.1 doesn't appear to
have the problem.
I've also confirmed this problem happens on the mainline kernel
version (4.19.0-rc4). Do you have any idea why 4.1 would be working
but 4.0 has this bug?
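(For reference, the two cases are the same mount with only the minor version
pinned; the server and export names below are placeholders:)
# Shows the stale-read problem in this test:
mount -t nfs -o vers=4.0 server:/export /nfs-mount
# Behaves correctly in the same test:
mount -t nfs -o vers=4.1 server:/export /nfs-mount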
https://s3.amazonaws.com/gitlab-support/nfs/nfs-4.0-kernel-4.19-0-rc4-rename.pcap
is the latest capture that also includes the NFS callbacks. Here's
what I see after the first RENAME from Node A:
Node B: DELEGRETURN StateId: 0xa93
NFS server: DELEGRETURN
Node A: RENAME From: test2.txt To: test.txt
NFS server: RENAME
Node B: GETATTR
NFS Server: GETATTR (with old inode)
Node B: READ StateId: 0xa93
NFS Server: READ
In comparison, if I don't have a process with an open file to
test.txt, things work and the trace looks like:
Node B: DELEGRETURN StateId: 0xa93
NFS server: DELEGRETURN
Node A: RENAME From: test2.txt To: test.txt
NFS server: RENAME
Node B: OPEN test.txt
NFS Server: OPEN StateID: 0xa93
Node B: CLOSE StateID: 0xa93
NFS Server: CLOSE
Node B: OPEN test.txt
NFS Server: OPEN StateId: 0xa93
Node B: READ StateID: 0xa93
NFS Server: READ
In the first case, since the client reused the StateId that it should
have released in DELEGRETURN, does this suggest that perhaps the
client isn't properly releasing that delegation? How might the open
file affect this behavior? Any pointers to where things might be going
awry in the code base would be appreciated here.
On Wed, Sep 19, 2018 at 10:39:19AM -0700, Stan Hu wrote:
> On Tue, Sep 18, 2018 at 11:19 AM J. Bruce Fields <[email protected]> wrote:
>
> > We know node B has that cat loop that will keep reopening the file.
> >
> > The only way node B could avoid translating those open syscalls into
> > on-the-wire OPENs is if the client holds a delegation.
> >
> > But it can't hold a delegation on the file that was newly renamed to
> > test.txt--delegations are revoked on rename, and it would need to do
> > another OPEN after the rename to get a new delegation. Similarly the
> > file that gets renamed over should have its delegation revoked--and we
> > can see that the client does return that delegation. The OPEN here is
> > actually part of that delegation return process--the CLAIM_DELEGATE_CUR
> > value on "claim type" is telling the server that this is an open that
> > the client had cached locally under the delegation it is about to
> > return.
> >
> > Looks like a client bug to me, possibly some sort of race handling the
> > delegation return and the new open.
> >
> > It might help if it were possible to confirm that this is still
> > reproduceable on the latest upstream kernel.
>
> Thanks for that information. I did more testing, and it looks like
> this stale file problem only appears to happen when the NFS client
> protocol is 4.0 (via the vers=4.0 mount option). 4.1 doesn't appear to
> have the problem.
>
> I've also confirmed this problem happens on the mainline kernel
> version (4.19.0-rc4). Do you have any idea why 4.1 would be working
> but 4.0 has this bug?
No. I mean, the 4.1/4.0 differences are complicated, so it's not too
surprising a bug could hit one and not the other, but I don't have an
explanation for this one off the top of my head.
> https://s3.amazonaws.com/gitlab-support/nfs/nfs-4.0-kernel-4.19-0-rc4-rename.pcap
> is the latest capture that also includes the NFS callbacks. Here's
> what I see after the first RENAME from Node A:
>
> Node B: DELEGRETURN StateId: 0xa93
> NFS server: DELEGRETURN
> Node A: RENAME From: test2.txt To: test.txt
> NFS server: RENAME
> Node B: GETATTR
> NFS Server: GETATTR (with old inode)
> Node B: READ StateId: 0xa93
> NFS Server: READ
Presumably the GETATTR and READ use a filehandle for the old file (the
one that was renamed over)?
That's what's weird, and indicates a possible client bug. It should be
doing a new OPEN("test.txt").
Also, READ shouldn't be using the stateid that was returned in
DELEGRETURN. And the server should reject any attempt to use that
stateid. I wonder if you misread the stateids--may be worth taking a
closer look to see if they're really bit-for-bit identical. (They're
128 bits, so that 0xa93 is either a hash or just some subset of the
stateid.)
(Apologies, I haven't gotten a chance to look at it myself.)
> In comparison, if I don't have a process with an open file to
> test.txt, things work and the trace looks like:
>
> Node B: DELEGRETURN StateId: 0xa93
> NFS server: DELEGRETURN
> Node A: RENAME From: test2.txt To: test.txt
> NFS server: RENAME
> Node B: OPEN test.txt
> NFS Server: OPEN StateID: 0xa93
> Node B: CLOSE StateID: 0xa93
> NFS Server: CLOSE
> Node B: OPEN test.txt
> NFS Server: OPEN StateId: 0xa93
> Node B: READ StateID: 0xa93
> NFS Server: READ
>
> In the first case, since the client reused the StateId that it should
> have released in DELEGRETURN, does this suggest that perhaps the
> client isn't properly releasing that delegation? How might the open
> file affect this behavior? Any pointers to where things might be going
> awry in the code base would be appreciated here.
I'd expect the first trace to look more like this one, with new OPENs
and CLOSEs after the rename.
--b.
On Wed, Sep 19, 2018 at 04:02:14PM -0400, Bruce Fields wrote:
> On Wed, Sep 19, 2018 at 10:39:19AM -0700, Stan Hu wrote:
> > https://s3.amazonaws.com/gitlab-support/nfs/nfs-4.0-kernel-4.19-0-rc4-rename.pcap
> > is the latest capture that also includes the NFS callbacks. Here's
> > what I see after the first RENAME from Node A:
> >
> > Node B: DELEGRETURN StateId: 0xa93
> > NFS server: DELEGRETURN
> > Node A: RENAME From: test2.txt To: test.txt
> > NFS server: RENAME
> > Node B: GETATTR
> > NFS Server: GETATTR (with old inode)
> > Node B: READ StateId: 0xa93
> > NFS Server: READ
>
> Presumably the GETATTR and READ use a filehandle for the old file (the
> one that was renamed over)?
>
> That's what's weird, and indicates a possible client bug. It should be
> doing a new OPEN("test.txt").
Or, maybe that part's expected behavior.
I'd assumed that every open syscall would result in an on-the-wire
open or lookup, or at least a getattr of the directory change
attribute to verify that the directory hasn't been modified.
Trond tells me that's guaranteed only on the first open (of a set of
overlapping opens) from that client. You've already got another open in
your test case (if I remember correctly), so you may have to wait for
the client's cache of the parent directory to time out (could be a few
minutes). There are some mount options to control this--see the nfs man
page.
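Concretely, the directory attribute-cache timeouts are acdirmin/acdirmax
(actimeo sets acregmin/acregmax/acdirmin/acdirmax all at once); something
like the following shortens how long the client trusts its cached view of
the parent directory, with the values here purely illustrative:
# Force the client to revalidate directory attributes (almost) immediately.
mount -t nfs -o vers=4.0,acdirmin=0,acdirmax=0 server:/export /nfs-mount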
(And Trond reminds me the same is true of data revalidation for the
purpose of close-to-open cache consistency; if you have a set of
overlapping opens from one client, only the first one is guaranteed to
revalidate the cache. This is contrary to my understanding (and every
definition of close-to-open that I can find online), but I take it there
are some good reasons for it. But this should be better documented; it
causes way too much confusion (see another ongoing thread on this
list).)
Anyway, that doesn't explain the 4.0/4.1 difference.
--b.
On Wed, Sep 19, 2018 at 1:02 PM Bruce Fields <[email protected]> wrote:
>
> On Wed, Sep 19, 2018 at 10:39:19AM -0700, Stan Hu wrote:
> > On Tue, Sep 18, 2018 at 11:19 AM J. Bruce Fields <[email protected]> wrote:
> >
> > > We know node B has that cat loop that will keep reopening the file.
> > >
> > > The only way node B could avoid translating those open syscalls into
> > > on-the-wire OPENs is if the client holds a delegation.
> > >
> > > But it can't hold a delegation on the file that was newly renamed to
> > > test.txt--delegations are revoked on rename, and it would need to do
> > > another OPEN after the rename to get a new delegation. Similarly the
> > > file that gets renamed over should have its delegation revoked--and we
> > > can see that the client does return that delegation. The OPEN here is
> > > actually part of that delegation return process--the CLAIM_DELEGATE_CUR
> > > value on "claim type" is telling the server that this is an open that
> > > the client had cached locally under the delegation it is about to
> > > return.
> > >
> > > Looks like a client bug to me, possibly some sort of race handling the
> > > delegation return and the new open.
> > >
> > > It might help if it were possible to confirm that this is still
> > > reproduceable on the latest upstream kernel.
> >
> > Thanks for that information. I did more testing, and it looks like
> > this stale file problem only appears to happen when the NFS client
> > protocol is 4.0 (via the vers=4.0 mount option). 4.1 doesn't appear to
> > have the problem.
> >
> > I've also confirmed this problem happens on the mainline kernel
> > version (4.19.0-rc4). Do you have any idea why 4.1 would be working
> > but 4.0 has this bug?
>
> No. I mean, the 4.1/4.0 differences are complicated, so it's not too
> surprising a bug could hit one and not the other, but I don't have an
> explanation for this one off the top of my head.
>
> > https://s3.amazonaws.com/gitlab-support/nfs/nfs-4.0-kernel-4.19-0-rc4-rename.pcap
> > is the latest capture that also includes the NFS callbacks. Here's
> > what I see after the first RENAME from Node A:
> >
> > Node B: DELEGRETURN StateId: 0xa93
> > NFS server: DELEGRETURN
> > Node A: RENAME From: test2.txt To: test.txt
> > NFS server: RENAME
> > Node B: GETATTR
> > NFS Server: GETATTR (with old inode)
> > Node B: READ StateId: 0xa93
> > NFS Server: READ
>
> Presumably the GETATTR and READ use a filehandle for the old file (the
> one that was renamed over)?
>
> That's what's weird, and indicates a possible client bug. It should be
> doing a new OPEN("test.txt").
>
> Also, READ shouldn't be using the stateid that was returned in
> DELEGRETURN. And the server should reject any attempt to use that
> stateid. I wonder if you misread the stateids--may be worth taking a
> closer look to see if they're really bit-for-bit identical. (They're
> 128 bits, so that 0xa93 is either a hash or just some subset of the
> stateid.)
>
> (Apologies, I haven't gotten a chance to look at it myself.)
Thanks again for the information.
It looks like Wireshark is showing the CRC-16 hash of both the
sequence ID and the other state ID, which makes things a little
confusing. Here's the abbreviated flow with the 12-byte StateId
(excludes the sequence number) and hashed values:
Node B: OPEN test.txt
NFS server: OPEN StateId: 96:49:9c:5b:e0:41:30:32:27:00:00:00 (0x7cd5)
--- 5 seconds later, presumably when I start the while loop ---
Node B: READ StateId: 96:49:9c:5b:e0:41:30:32:28:00:00:00 (0xec2c)
Node B: DELEGRETURN StateId: 96:49:9c:5b:e0:41:30:32:28:00:00:00 (0xce2c)
NFS server: DELEGRETURN
Node A: RENAME From: test2.txt To: test.txt
NFS server: RENAME
Node B: GETATTR
NFS Server: GETATTR (with old inode)
Node B: READ StateId: 96:49:9c:5b:e0:41:30:32:27:00:00:00 (0x7cd5)
NFS Server: READ
I'm a bit confused why the READ call would be using a StateId that
looks like it's off by one (27 vs 28). It's possible I'm not seeing a
message in the trace, but is the NFS client doing something behind the
scenes?
Note the DELEGRETURN returned the StateId with the 28, and the last
READ request used 27.
I repeated the same test using NFS 4.1
(https://s3.amazonaws.com/gitlab-support/nfs/nfs-4.1-kernel-4.19-0-rc4-rename-test1.pcap).
The behavior looks quite similar here with the DELEGRETURN and StateId
values that appear to me as off-by-1. But in this case the NFS client
requests a new OPEN and READ.
> I'd assumed that the every open syscall would result in an on-the-wire
> open or lookup, or at least a getattr to verify of the directory change
> attribute to verify that the directory hasn't been modified.
>
> Trond tells me that's guaranteed only on the first open (of a set of
> overlapping opens) from that client. You've already got another open in
> your test case (if I remember correctly), so you may have to wait for
> the client's cache of the parent directory to time out (could be a few
> minutes). There are some mount options to control this--see the nfs man
> page.
Last night I left my test running for more than 30 minutes, and the
while loop still showed the stale data. I think I even turned off
attribute caching entirely to see if this would help, and it did not.
On Thu, Sep 20, 2018 at 11:23:22AM -0700, Stan Hu wrote:
> On Wed, Sep 19, 2018 at 1:02 PM Bruce Fields <[email protected]> wrote:
> >
> > On Wed, Sep 19, 2018 at 10:39:19AM -0700, Stan Hu wrote:
> > > On Tue, Sep 18, 2018 at 11:19 AM J. Bruce Fields <[email protected]> wrote:
> > >
> > > > We know node B has that cat loop that will keep reopening the file.
> > > >
> > > > The only way node B could avoid translating those open syscalls into
> > > > on-the-wire OPENs is if the client holds a delegation.
> > > >
> > > > But it can't hold a delegation on the file that was newly renamed to
> > > > test.txt--delegations are revoked on rename, and it would need to do
> > > > another OPEN after the rename to get a new delegation. Similarly the
> > > > file that gets renamed over should have its delegation revoked--and we
> > > > can see that the client does return that delegation. The OPEN here is
> > > > actually part of that delegation return process--the CLAIM_DELEGATE_CUR
> > > > value on "claim type" is telling the server that this is an open that
> > > > the client had cached locally under the delegation it is about to
> > > > return.
> > > >
> > > > Looks like a client bug to me, possibly some sort of race handling the
> > > > delegation return and the new open.
> > > >
> > > > It might help if it were possible to confirm that this is still
> > > > reproduceable on the latest upstream kernel.
> > >
> > > Thanks for that information. I did more testing, and it looks like
> > > this stale file problem only appears to happen when the NFS client
> > > protocol is 4.0 (via the vers=4.0 mount option). 4.1 doesn't appear to
> > > have the problem.
> > >
> > > I've also confirmed this problem happens on the mainline kernel
> > > version (4.19.0-rc4). Do you have any idea why 4.1 would be working
> > > but 4.0 has this bug?
> >
> > No. I mean, the 4.1/4.0 differences are complicated, so it's not too
> > surprising a bug could hit one and not the other, but I don't have an
> > explanation for this one off the top of my head.
> >
> > > https://s3.amazonaws.com/gitlab-support/nfs/nfs-4.0-kernel-4.19-0-rc4-rename.pcap
> > > is the latest capture that also includes the NFS callbacks. Here's
> > > what I see after the first RENAME from Node A:
> > >
> > > Node B: DELEGRETURN StateId: 0xa93
> > > NFS server: DELEGRETURN
> > > Node A: RENAME From: test2.txt To: test.txt
> > > NFS server: RENAME
> > > Node B: GETATTR
> > > NFS Server: GETATTR (with old inode)
> > > Node B: READ StateId: 0xa93
> > > NFS Server: READ
> >
> > Presumably the GETATTR and READ use a filehandle for the old file (the
> > one that was renamed over)?
> >
> > That's what's weird, and indicates a possible client bug. It should be
> > doing a new OPEN("test.txt").
> >
> > Also, READ shouldn't be using the stateid that was returned in
> > DELEGRETURN. And the server should reject any attempt to use that
> > stateid. I wonder if you misread the stateids--may be worth taking a
> > closer look to see if they're really bit-for-bit identical. (They're
> > 128 bits, so that 0xa93 is either a hash or just some subset of the
> > stateid.)
> >
> > (Apologies, I haven't gotten a chance to look at it myself.)
>
> Thanks again for the information.
>
> It looks like Wireshark is showing the CRC-16 hash of both the
> sequence ID and the other state ID, which makes things a little
> confusing. Here's the abbreviated flow with the 12-byte StateId
> (excludes the sequence number) and hashed values:
Those are 16 bytes, with the last 4 (27/28:00:00:00) the sequence
number:
> Node B: OPEN test.txt
> NFS server: OPEN StateId: 96:49:9c:5b:e0:41:30:32:27:00:00:00 (0x7cd5)
> --- 5 seconds later, presumably when I start the while loop ---
> Node B: READ StateId: 96:49:9c:5b:e0:41:30:32:28:00:00:00 (0xec2c)
> Node B: DELEGRETURN StateId: 96:49:9c:5b:e0:41:30:32:28:00:00:00 (0xce2c)
> NFS server: DELEGRETURN
> Node A: RENAME From: test2.txt To: test.txt
> NFS server: RENAME
> Node B: GETATTR
> NFS Server: GETATTR (with old inode)
> Node B: READ StateId: 96:49:9c:5b:e0:41:30:32:27:00:00:00 (0x7cd5)
> NFS Server: READ
>
> I'm a bit confused why the READ call would be using a StateId that
> looks like it's one bit off (27 vs 28). It's possible I'm not seeing a
> message in the trace, but is the NFS client doing something behind the
> scenes?
>
> Note the DELEGRETURN returned the StateId with the 28, and the last
> READ request used 27.
To be honest I'd need a few minutes to refresh my memory of the spec,
but I think this is all normal, thanks.
> I repeated the same test using NFS 4.1
> (https://s3.amazonaws.com/gitlab-support/nfs/nfs-4.1-kernel-4.19-0-rc4-rename-test1.pcap).
> The behavior looks quite similar here with the DELEGRETURN and StateId
> values that appear to me as off-by-1. But in this case the NFS client
> requests a new OPEN and READ.
>
> > I'd assumed that the every open syscall would result in an on-the-wire
> > open or lookup, or at least a getattr to verify of the directory change
> > attribute to verify that the directory hasn't been modified.
> >
> > Trond tells me that's guaranteed only on the first open (of a set of
> > overlapping opens) from that client. You've already got another open in
> > your test case (if I remember correctly), so you may have to wait for
> > the client's cache of the parent directory to time out (could be a few
> > minutes). There are some mount options to control this--see the nfs man
> > page.
>
> Last night I left my test running on for more than 30 minutes, and the
> while loop still showed the stale data. I think I even turned off
> attribute caching entirely to see if this would help, and it did not.
Huh. Then I'm back to thinking there's a client bug in the 4.0 case.
Thanks for your persistence....
--b.
On Thu, Sep 20, 2018 at 11:39 AM Bruce Fields <[email protected]> wrote:
> > Last night I left my test running on for more than 30 minutes, and the
> > while loop still showed the stale data. I think I even turned off
> > attribute caching entirely to see if this would help, and it did not.
>
> Huh. Then I'm back to thinking there's a client bug in the 4.0 case.
>
I've been doing more digging, and I think there is some issue with the
cache validation here. In NFS 4.1, it looks like in dir.c
nfs4_lookup_revalidate() calls nfs_lookup_revalidate() since the
NFS_CAP_ATOMIC_OPEN_V1 flag is active
(https://github.com/torvalds/linux/blob/v4.19-rc4/fs/nfs/dir.c#L1591).
On the other hand, since that flag isn't active for NFS 4.0, the
validation is much briefer
(https://github.com/torvalds/linux/blob/v4.19-rc4/fs/nfs/dir.c#L1599-L1628).
I'm not sure if the comment in
https://github.com/torvalds/linux/blob/v4.19-rc4/fs/nfs/dir.c#L1630
actually reflects what's happening. If I look at the stack trace of
the next file open call, I don't see additional validation:
Sep 24 20:20:38 test-kernel kernel: [ 1145.233460] Call Trace:
Sep 24 20:20:38 test-kernel kernel: [ 1145.233462] dump_stack+0x8e/0xd5
Sep 24 20:20:38 test-kernel kernel: [ 1145.233480] nfs4_file_open+0x56/0x2a0 [nfsv4]
Sep 24 20:20:38 test-kernel kernel: [ 1145.233488] ? nfs42_clone_file_range+0x1c0/0x1c0 [nfsv4]
Sep 24 20:20:38 test-kernel kernel: [ 1145.233490] do_dentry_open+0x1f6/0x360
Sep 24 20:20:38 test-kernel kernel: [ 1145.233492] vfs_open+0x2f/0x40
Sep 24 20:20:38 test-kernel kernel: [ 1145.233493] path_openat+0x2e8/0x1690
Sep 24 20:20:38 test-kernel kernel: [ 1145.233496] ? mem_cgroup_try_charge+0x8b/0x190
Sep 24 20:20:38 test-kernel kernel: [ 1145.233497] do_filp_open+0x9b/0x110
Sep 24 20:20:38 test-kernel kernel: [ 1145.233499] ? __check_object_size+0xb8/0x1b0
Sep 24 20:20:38 test-kernel kernel: [ 1145.233501] ? __alloc_fd+0x46/0x170
Sep 24 20:20:38 test-kernel kernel: [ 1145.233503] do_sys_open+0x1ba/0x250
Sep 24 20:20:38 test-kernel kernel: [ 1145.233505] ? do_sys_open+0x1ba/0x250
Sep 24 20:20:38 test-kernel kernel: [ 1145.233507] __x64_sys_openat+0x20/0x30
Sep 24 20:20:38 test-kernel kernel: [ 1145.233508] do_syscall_64+0x65/0x130
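(For what it's worth, a trace like the one above can also be captured without
patching a dump_stack() call into the client, using a kprobe event; this
assumes tracefs is available under /sys/kernel/debug/tracing and kprobe
events are enabled in the kernel config:)
# Log a stack trace on every nfs4_file_open() call.
cd /sys/kernel/debug/tracing
echo 'p:nfs4open nfs4_file_open' >> kprobe_events
echo stacktrace > events/kprobes/nfs4open/trigger
echo 1 > events/kprobes/nfs4open/enable
cat trace_pipe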
If I naively apply this patch:
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8bfaa658b2c1..6e3ece2e6984 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1631,7 +1631,7 @@ static int nfs4_lookup_revalidate(struct dentry *dentry, unsigned int flags)
 	ret = 1;
 
 out:
-	return ret;
+	return nfs_lookup_revalidate(dentry, flags);
 
 no_open:
 	return nfs_lookup_revalidate(dentry, flags);
Things behave as expected on NFS 4.0. What's the right fix here?
On Tue, Sep 25, 2018 at 11:56:16AM -0700, Stan Hu wrote:
> From https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/5.6_technical_notes/known_issues-kernel,
> I see this bug has been known for a while:
That specific server bug was fixed several years ago, and I forget
what kernel version you were using on the server, but I'm pretty sure it
has the fix. From the network trace it looks like delegations were
being returned when they should be.
--b.
>
> In some cases the NFS server fails to notify NFSv4 clients about
> renames and unlinks done by other clients, or by non-NFS users of the
> server. An application on a client may then be able to open the file
> at its old pathname (and read old cached data from it, and perform
> read locks on it), long after the file no longer exists at that
> pathname on the server.
>
> To work around this issue, use NFSv3 instead of NFSv4. Alternatively,
> turn off support for leases by writing 0 to /proc/sys/fs/leases-enable
> (ideally on boot, before the nfs server is started). This change
> prevents NFSv4 delegations from being given out, restoring correctness
> at the expense of some performance.
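(For completeness, the lease-disabling workaround quoted above amounts to the
following on the NFS server, done before nfsd starts; the sysctl name is
taken from the quoted note:)
# Disable leases so the server hands out no NFSv4 delegations.
echo 0 > /proc/sys/fs/leases-enable
# equivalently:
sysctl -w fs.leases-enable=0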
Thanks. I belatedly realized this is an NFS client issue, not an NFS
server issue.
As I mentioned in the previous e-mail, the following patch fixes the
NFS client validation issues for me, but it's not clear to me why in
NFS 4.0 we bail out earlier than in NFS 4.1:
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8bfaa658b2c1..6e3ece2e6984 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1631,7 +1631,7 @@ static int nfs4_lookup_revalidate(struct dentry *dentry, unsigned int flags)
 	ret = 1;
 
 out:
-	return ret;
+	return nfs_lookup_revalidate(dentry, flags);
 
 no_open:
 	return nfs_lookup_revalidate(dentry, flags);