The program below (provided by a customer) demonstrates a livelock that can
trigger in NFS.
"/mnt" should be an NFSv4 mount from a server which will hand out READ
delegations (e.g. the Linux NFS server) and should contain a subdirectory
"/mnt/data".
The program forks off some worker threads which poll a particular file in
that directory until it disappears. Then each worker will exit.
The main program waits a few seconds and then unlinks the file.
The number of threads can be set with the first arg (4 is good). The number of
seconds with the second. Both have useful defaults.
The unlink should happen promptly and then all the workers should exit. But
they don't.
What happens is that between when the "unlink" returns the delegation that it
will inevitably have due to all those "open"s, and when it performs the
required silly-rename (because some thread will have the file open), some
other thread opens the file and gets a delegation.
So the NFSv4 RENAME returns NFS4ERR_DELAY while it tries to reclaim the
delegation. 15 seconds later the rename will be retried, but there will still
(or again) be an active delegation. So the pattern repeats indefinitely.
All this time the i_mutex on the directory and file are held so "ls" on the
directory will also hang.
As an interesting twist, if you remove the file on the server, the main
process will duly notice when it next tries to rename it, and so will exit.
The clients will continue to successfully open and close the file, even
though "ls -l" shows that it doesn't exist.
If you then rm the file on the client, it will tell you that it doesn't
exist, and the workers will suddenly notice that too.
I haven't really looked into the cause of this second issue, but I can work
around the original problem with the patch below. It effectively serialises
'open' with 'unlink' (and with anything else that grabs the file's i_mutex).
I think some sort of serialisation is needed. I don't know whether i_mutex
is suitable or if we should find (or make) some other locks.
Thoughts?
Thanks,
NeilBrown
Patch:
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 8de3407e0360..96108f88d3f9 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -33,6 +33,10 @@ nfs4_file_open(struct inode *inode, struct file *filp)
dprintk("NFS: open file(%pd2)\n", dentry);
+ // hack - if being unlinked, pause for it to complete
+ mutex_lock(&inode->i_mutex);
+ mutex_unlock(&inode->i_mutex);
+
if ((openflags & O_ACCMODE) == 3)
openflags--;
Program:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>	/* for atoi() */
// nfsv4 mount point /mnt
const char check[] = "/mnt/data/checkTST";
const char data[] = "/mnt/data/dummy.data";
int num_client = 4;
int wait_sec = 3;
// call open/close in endless loop until open fails
void client (void)
{
for (;;)
{
int f = open (check, O_RDONLY);
if (f == -1)
{
printf ("client: done\n");
_exit(0);
}
close (f);
}
}
int main (int argc, char **argv)
{
int i, fd;
if (argc > 1)
num_client = atoi (argv[1]);
if (argc > 2)
wait_sec = atoi (argv[2]);
fd = open (check, O_RDWR|O_CREAT, S_IRWXU);
if (fd == -1)
{
perror ("open failed");
_exit (1);
}
close (fd);
for (i=0; i<num_client; i++)
{
// fork children to poll the 'checkTST' file
pid_t p = fork ();
if (p==0)
client();
else if (p==-1)
{
perror ("fork failed");
_exit (1);
}
}
// parent process
// - wait a few seconds and unlink the 'checkTST' file
// - all children polling 'checkTST' should
// end now
sleep (wait_sec);
unlink (check);
printf ("server: done\n");
return 0;
}
On Fri, 30 May 2014 17:55:23 -0400 "J. Bruce Fields" <[email protected]>
wrote:
> On Fri, May 30, 2014 at 01:44:42PM +1000, NeilBrown wrote:
> > On Thu, 29 May 2014 20:44:23 -0400 "J. Bruce Fields" <[email protected]>
> > wrote:
> >
> > > Yes, it's a known server bug.
> > >
> > > As a first attempt I was thinking of just sticking a timestamp in struct
> > > inode to record the time of the most recent conflicting access and deny
> > > delegations if the timestamp is too recent, for some definition of too
> > > recent.
> > >
> >
> > Hmmm... I'll have a look next week and see what I can come up with.
>
> Thanks!
>
> If we didn't think it was worth another struct inode field, we could
> probably get away with global state. Even just refusing to give out any
> delegations for a few seconds after any delegation break would be enough
> to fix this bug.
>
> Or you could make it a little less harsh with a small hash table: "don't
> give out a delegation on any inode whose inode number hashes to X for a
> few seconds."
I was thinking of using a bloom filter - or possibly two.
- avoid handing out delegations if either bloom filter reports a match
- when reclaiming a delegation add the inode to the second bloom filter
- every so-often zero-out the older filter and swap them.
Might be a bit of overkill, but I won't know until I implement it.
NeilBrown
>
> As long as the delegations can be turned down at the whim of the server,
> we've got a lot of leeway.
>
> --b.
On Fri, May 30, 2014 at 01:44:42PM +1000, NeilBrown wrote:
> On Thu, 29 May 2014 20:44:23 -0400 "J. Bruce Fields" <[email protected]>
> wrote:
>
> > Yes, it's a known server bug.
> >
> > As a first attempt I was thinking of just sticking a timestamp in struct
> > inode to record the time of the most recent conflicting access and deny
> > delegations if the timestamp is too recent, for some definition of too
> > recent.
> >
>
> Hmmm... I'll have a look next week and see what I can come up with.
Thanks!
If we didn't think it was worth another struct inode field, we could
probably get away with global state. Even just refusing to give out any
delegations for a few seconds after any delegation break would be enough
to fix this bug.
Or you could make it a little less harsh with a small hash table: "don't
give out a delegation on any inode whose inode number hashes to X for a
few seconds."
As long as the delegations can be turned down at the whim of the server,
we've got a lot of leeway.
--b.
Apologies Neil. Resending due to gmail defaulting to html formatting
which gets rejected by vger.kernel.org...
On Thu, May 29, 2014 at 2:45 AM, NeilBrown <[email protected]> wrote:
>
> The program below (provided by a customer) demonstrates a livelock that can
> trigger in NFS.
>
> "/mnt" should be an NFSv4 mount from a server which will hand out READ
> delegations (e.g. the Linux NFS server) and should contain a subdirectory
> "/mnt/data".
>
> The program forks off some worker threads which poll a particular file in
> that directory until it disappears. Then each worker will exit.
> The main program waits a few seconds and then unlinks the file.
>
> The number of threads can be set with the first arg (4 is good). The number of
> seconds with the second. Both have useful defaults.
>
> The unlink should happen promptly and then all the workers should exit. But
> they don't.
>
> What happens is that between when the "unlink" returns the delegation that it
> will inevitably have due to all those "open"s, and when it performs the
> required silly-rename (because some thread will have the file open), some
> other thread opens the file and gets a delegation.
> So the NFSv4 RENAME returns NFS4ERR_DELAY while it tries to reclaim the
> delegation. 15 seconds later the rename will be retried, but there will still
> (or again) be an active delegation. So the pattern repeats indefinitely.
> All this time the i_mutex on the directory and file are held so "ls" on the
> directory will also hang.
Why would the server hand out another delegation just moments after it
recalled the last one? That sounds like a nasty server bug.
You can invent variants of this problem with a second client trying to
open() the file while the first one is trying to unlink(), where the
i_mutex hack will not suffice to prevent client 2 from getting the
delegation back.
> [...]
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Fri, May 30, 2014 at 07:51:35AM +1000, NeilBrown wrote:
> On Thu, 29 May 2014 12:38:19 -0400 Trond Myklebust
> <[email protected]> wrote:
>
> > Apologies Neil. Resending due to gmail defaulting to html formatting
> > which gets rejected by vger.kernel.org...
> >
> > [...]
> >
> > Why would the server hand out another delegation just moments after it
> > recalled the last one? That sounds like a nasty server bug.
>
> Exactly how long should it wait?
> Bruce: do you have any thoughts on whether the server should hold off giving
> out new delegations after trying to recall one (e.g. due to a lease-break
> caused by rename)??
> I don't suppose the RFC addresses this?
Yes, it's a known server bug.
As a first attempt I was thinking of just sticking a timestamp in struct
inode to record the time of the most recent conflicting access and deny
delegations if the timestamp is too recent, for some definition of too
recent.
--b.
>
> >
> > You can invent variants of this problem with a second client trying to
> > open() the file while the first one is trying to unlink(), where the
> > i_mutex hack will not suffice to prevent client 2 from getting the
> > delegation back.
>
> True. So we do need support from the server.
>
> Thanks,
> NeilBrown
>
On Thu, 29 May 2014 20:44:23 -0400 "J. Bruce Fields" <[email protected]>
wrote:
> On Fri, May 30, 2014 at 07:51:35AM +1000, NeilBrown wrote:
> > On Thu, 29 May 2014 12:38:19 -0400 Trond Myklebust
> > <[email protected]> wrote:
> >
> > > Apologies Neil. Resending due to gmail defaulting to html formatting
> > > which gets rejected by vger.kernel.org...
> > >
> > > [...]
> > >
> > > Why would the server hand out another delegation just moments after it
> > > recalled the last one? That sounds like a nasty server bug.
> >
> > Exactly how long should it wait?
> > Bruce: do you have any thoughts on whether the server should hold off giving
> > out new delegations after trying to recall one (e.g. due to a lease-break
> > caused by rename)??
> > I don't suppose the RFC addresses this?
>
> Yes, it's a known server bug.
>
> As a first attempt I was thinking of just sticking a timestamp in struct
> inode to record the time of the most recent conflicting access and deny
> delegations if the timestamp is too recent, for some definition of too
> recent.
>
Hmmm... I'll have a look next week and see what I can come up with.
Thanks,
NeilBrown
On Thu, Jun 05, 2014 at 10:34:23AM +1000, NeilBrown wrote:
> On Wed, 4 Jun 2014 18:05:31 -0400 "J. Bruce Fields" <[email protected]>
> wrote:
>
> > On Wed, Jun 04, 2014 at 05:39:26PM +1000, NeilBrown wrote:
> > > Below is my suggestion. It seems easy enough. It even works.
> >
> > Woah!
> >
> > Anyway, looks reasonable to me, and it fixes an immediate problem so I'm
> > inclined to just apply.
...
> And if you are going to apply it, you'll want:
>
> Signed-off-by: NeilBrown <[email protected]>
Oh, gah, then I forgot to actually apply.
Anyway, it's a reasonably self-contained fix for an important bug so
I'll send it as part of a later bugfix pull request.
I thought it could also use a more explicit description of the resulting
problem, so I added:
... so further delegations should not be handed out.
The current code fails to do so, and the result is effectively a
live-lock under some workloads: a client attempting a
conflicting operation on a read-delegated file receives
NFS4ERR_DELAY and retries the operation, but by the time it
retries the server may already have given out another
delegation.
--b.
On Wed, 4 Jun 2014 08:48:02 -0400 Trond Myklebust
<[email protected]> wrote:
> On Wed, Jun 4, 2014 at 3:39 AM, NeilBrown <[email protected]> wrote:
> > [...]
> >
> > Below is my suggestion. It seems easy enough. It even works.
> >
> > However it does raise an issue with the NFS client.
> >
> > NFS performs a silly-rename as an 'asynchronous' operation. One consequence
> > of this is that NFS4ERR_DELAY always results in a delay of
> > NFS4_POLL_RETRY_MAX (15*HZ), where as sync requests use an exponential scale
> > from _MIN to _MAX.
> >
> > So in my test case there is always a 15second delay:
> > - try to silly-rename
> > - get NFS4ERR_DELAY
> > - server reclaim delegation
> > - 15 seconds passes
> > - retry silly-rename - it works.
> >
> > I hacked the NFS server to store a timeout in 'struct nfs_renamedata', and
> > use the same exponential retry pattern and the 15 seconds (obviously)
> > disappeared.
> >
> > Trond: would you accept a patch which did that more generally? e.g. pass a
> > timeout pointer to nfs4_async_handle_error() and various *_done function pass
> > a pointer to a field in their calldata?
>
> It depends. If we're touching nfs4_async_handle_error, then I think we
> should also convert nfs4_async_handle_error to use the same "struct
> nfs4_exception" argument that we use for the synchronous case so that
> we can share a bit more code.
>
Yes, it certainly would be an improvement if we could unite
nfs4_handle_exception and nfs4_async_handle_error.
I might have a look but it probably won't be straight away.
Thanks,
NeilBrown
On Wed, 4 Jun 2014 09:27:39 -0400 "J. Bruce Fields" <[email protected]>
wrote:
> On Wed, Jun 04, 2014 at 08:48:02AM -0400, Trond Myklebust wrote:
> > On Wed, Jun 4, 2014 at 3:39 AM, NeilBrown <[email protected]> wrote:
> > > [...]
> > >
> > > Below is my suggestion. It seems easy enough. It even works.
> > >
> > > However it does raise an issue with the NFS client.
> > >
> > > NFS performs a silly-rename as an 'asynchronous' operation. One consequence
> > > of this is that NFS4ERR_DELAY always results in a delay of
> > > NFS4_POLL_RETRY_MAX (15*HZ), where as sync requests use an exponential scale
> > > from _MIN to _MAX.
> > >
> > > So in my test case there is always a 15second delay:
> > > - try to silly-rename
> > > - get NFS4ERR_DELAY
> > > - server reclaim delegation
> > > - 15 seconds passes
> > > - retry silly-rename - it works.
> > >
> > > I hacked the NFS server to store a timeout in 'struct nfs_renamedata', and
> > > use the same exponential retry pattern and the 15 seconds (obviously)
> > > disappeared.
> > >
> > > Trond: would you accept a patch which did that more generally? e.g. pass a
> > > timeout pointer to nfs4_async_handle_error() and various *_done function pass
> > > a pointer to a field in their calldata?
> >
> > It depends. If we're touching nfs4_async_handle_error, then I think we
> > should also convert nfs4_async_handle_error to use the same "struct
> > nfs4_exception" argument that we use for the synchronous case so that
> > we can share a bit more code.
>
> I wonder why this hasn't been a major complaint before--is there
> something other servers are doing to mitigate the problem, or is
> renaming a delegated file just rarer than I would have expected?
Renaming a file isn't a problem, as that is synchronous and gets the
exponentially increasing sequence of timeouts which starts small.
It is only the silly-rename which causes a problem, as that is async
and so has a fixed large delay.
The async operations are:
close, unlink, rename, callback(?), write, commit, delegreturn,
unlock, layoutget, layoutreturn, layoutcommit, free_stateid
The async versions of 'unlink' and 'rename' are only used for silly-delete
processing: 'rename' when the last link is dropped, then 'unlink' on the
last close.
For the others, being async and possibly having a longer delay looks like it
would not be a problem.
'rename' is a problem because until the rename completes, the file is still
visible in the namespace...
I don't really get why an async rename is used for silly-rename as the
nfs_async_rename() call is followed immediately by
error = rpc_wait_for_completion_task(task);
so it looks synchronous. I suspect there is a subtlety....
NeilBrown
>
> --b.
On Wed, Jun 04, 2014 at 05:39:26PM +1000, NeilBrown wrote:
> Below is my suggestion. It seems easy enough. It even works.
Woah!
Anyway, looks reasonable to me, and it fixes an immediate problem so I'm
inclined to just apply.
The vfs knows about delegations at this point, and there's been this
vague idea we might let user space servers request them at some point,
so we might eventually want to move this code to fs/locks.c instead of here.
Wonder if filehandle is the right thing to hash on, as opposed to inode
number, or inode pointer, or something else?
--b.
> The approach taken here is to use a bloom filter to record the filehandles
> which are currently blocked from delegation, and to accept the cost of a few
> false positives.
>
> We have 2 bloom filters, each of which is valid for 30 seconds. When a
> delegation is recalled the filehandle is added to one filter and will remain
> disabled for between 30 and 60 seconds.
>
> We keep a count of the number of filehandles that have been added, so when
> that count is zero we can bypass all other tests.
>
> The bloom filters have 256 bits and 3 hash functions. This should allow a
> couple of dozen blocked filehandles with minimal false positives. If many
> more filehandles are all blocked at once, behaviour will degrade towards
> rejecting all delegations for between 30 and 60 seconds, then resetting and
> allowing new delegations.
>
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 9a77a5a21557..45101b41fb04 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -41,6 +41,7 @@
> #include <linux/ratelimit.h>
> #include <linux/sunrpc/svcauth_gss.h>
> #include <linux/sunrpc/addr.h>
> +#include <linux/hash.h>
> #include "xdr4.h"
> #include "xdr4cb.h"
> #include "vfs.h"
> @@ -364,6 +365,79 @@ static struct nfs4_ol_stateid * nfs4_alloc_stateid(struct nfs4_client *clp)
> return openlockstateid(nfs4_alloc_stid(clp, stateid_slab));
> }
>
> +/*
> + * When we recall a delegation, we should be careful not to hand it
> + * out again straight away.
> + * To ensure this we keep a pair of bloom filters ('new' and 'old')
> + * in which the filehandles of recalled delegations are "stored".
> + * If a filehandle appears in either filter, a delegation is blocked.
> + * When a delegation is recalled, the filehandle is stored in the "new"
> + * filter.
> + * Every 30 seconds we swap the filters and clear the "new" one,
> + * unless both are empty of course.
> + *
> + * Each filter is 256 bits. We hash the filehandle to 32bit and use the
> + * low 3 bytes as hash-table indices.
> + *
> + * 'recall_lock', which is always held when block_delegations() is called,
> + * is used to manage concurrent access. Testing does not need the lock
> + * except when swapping the two filters.
> + */
> +static struct bloom_pair {
> + int entries, old_entries;
> + time_t swap_time;
> + int new; /* index into 'set' */
> + DECLARE_BITMAP(set[2], 256);
> +} blocked_delegations;
> +
> +static int delegation_blocked(struct knfsd_fh *fh)
> +{
> + u32 hash;
> + struct bloom_pair *bd = &blocked_delegations;
> +
> + if (bd->entries == 0)
> + return 0;
> + if (seconds_since_boot() - bd->swap_time > 30) {
> + spin_lock(&recall_lock);
> + if (seconds_since_boot() - bd->swap_time > 30) {
> + bd->entries -= bd->old_entries;
> + bd->old_entries = bd->entries;
> + memset(bd->set[bd->new], 0,
> + sizeof(bd->set[0]));
> + bd->new = 1-bd->new;
> + bd->swap_time = seconds_since_boot();
> + }
> + spin_unlock(&recall_lock);
> + }
> + hash = arch_fast_hash(&fh->fh_base, fh->fh_size, 0);
> + if (test_bit(hash&255, bd->set[0]) &&
> + test_bit((hash>>8)&255, bd->set[0]) &&
> + test_bit((hash>>16)&255, bd->set[0]))
> + return 1;
> +
> + if (test_bit(hash&255, bd->set[1]) &&
> + test_bit((hash>>8)&255, bd->set[1]) &&
> + test_bit((hash>>16)&255, bd->set[1]))
> + return 1;
> +
> + return 0;
> +}
> +
> +static void block_delegations(struct knfsd_fh *fh)
> +{
> + u32 hash;
> + struct bloom_pair *bd = &blocked_delegations;
> +
> + hash = arch_fast_hash(&fh->fh_base, fh->fh_size, 0);
> +
> + __set_bit(hash&255, bd->set[bd->new]);
> + __set_bit((hash>>8)&255, bd->set[bd->new]);
> + __set_bit((hash>>16)&255, bd->set[bd->new]);
> + if (bd->entries == 0)
> + bd->swap_time = seconds_since_boot();
> + bd->entries += 1;
> +}
> +
> static struct nfs4_delegation *
> alloc_init_deleg(struct nfs4_client *clp, struct nfs4_ol_stateid *stp, struct svc_fh *current_fh)
> {
> @@ -372,6 +446,8 @@ alloc_init_deleg(struct nfs4_client *clp, struct nfs4_ol_stateid *stp, struct sv
> dprintk("NFSD alloc_init_deleg\n");
> if (num_delegations > max_delegations)
> return NULL;
> +	if (delegation_blocked(&current_fh->fh_handle))
> + return NULL;
> dp = delegstateid(nfs4_alloc_stid(clp, deleg_slab));
> if (dp == NULL)
> return dp;
> @@ -2742,6 +2818,8 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
> /* Only place dl_time is set; protected by i_lock: */
> dp->dl_time = get_seconds();
>
> + block_delegations(&dp->dl_fh);
> +
> nfsd4_cb_recall(dp);
> }
>
On Wed, Jun 04, 2014 at 08:48:02AM -0400, Trond Myklebust wrote:
> On Wed, Jun 4, 2014 at 3:39 AM, NeilBrown <[email protected]> wrote:
> > [...]
> >
> > Below is my suggestion. It seems easy enough. It even works.
> >
> > However it does raise an issue with the NFS client.
> >
> > NFS performs a silly-rename as an 'asynchronous' operation. One consequence
> > of this is that NFS4ERR_DELAY always results in a delay of
> > NFS4_POLL_RETRY_MAX (15*HZ), whereas sync requests use an exponential scale
> > from _MIN to _MAX.
> >
> > So in my test case there is always a 15-second delay:
> > - try to silly-rename
> > - get NFS4ERR_DELAY
> > - server reclaims the delegation
> > - 15 seconds pass
> > - retry silly-rename - it works.
> >
> > I hacked the NFS client to store a timeout in 'struct nfs_renamedata' and
> > use the same exponential retry pattern, and the 15 seconds (obviously)
> > disappeared.
> >
> > Trond: would you accept a patch which did that more generally? e.g. pass a
> > timeout pointer to nfs4_async_handle_error() and have the various *_done
> > functions pass a pointer to a field in their calldata?
>
> It depends. If we're touching nfs4_async_handle_error, then I think we
> should also convert nfs4_async_handle_error to use the same "struct
> nfs4_exception" argument that we use for the synchronous case so that
> we can share a bit more code.
I wonder why this hasn't been a major complaint before--is there
something other servers are doing to mitigate the problem, or is
renaming a delegated file just rarer than I would have expected?
--b.
On Wed, 4 Jun 2014 18:05:31 -0400 "J. Bruce Fields" <[email protected]>
wrote:
> On Wed, Jun 04, 2014 at 05:39:26PM +1000, NeilBrown wrote:
> > Below is my suggestion. It seems easy enough. It even works.
>
> Woah!
>
> Anyway, looks reasonable to me, and it fixes an immediate problem so I'm
> inclined to just apply.
>
> The vfs knows about delegations at this point, and there's been this
> vague idea that we might let user space servers request them at some point,
> so we might eventually want to move this code to fs/locks.c instead of here.
>
> Wonder if filehandle is the right thing to hash on, as opposed to inode
> number, or inode pointer, or something else?
I pondered that question for a short while too.
inode number didn't seem right as you might export two filesystems with the
same inode numbers.
inode pointer didn't seem right as it could fall out of the cache and
something else could be loaded in the same place.
filehandle isn't perfect as it isn't 100% unique (you can have two
filehandles with different 'types' for the one inode). But I felt it was
close enough in practice.
superblock pointer plus inode number plus inode generation number would
probably be perfect, but that is so close to the filehandle, which was already
easily available, that it seems silly not to just use the filehandle.
Certainly if the code ever wants to move to locks.c, the question can be
revisited.
And if you are going to apply it, you'll want:
Signed-off-by: NeilBrown <[email protected]>
I forgot that in the original.
Thanks,
NeilBrown
On Wed, Jun 4, 2014 at 3:39 AM, NeilBrown <[email protected]> wrote:
> On Sat, 31 May 2014 08:13:58 +1000 NeilBrown <[email protected]> wrote:
>
>> On Fri, 30 May 2014 17:55:23 -0400 "J. Bruce Fields" <[email protected]>
>> wrote:
>>
>> > On Fri, May 30, 2014 at 01:44:42PM +1000, NeilBrown wrote:
>> > > On Thu, 29 May 2014 20:44:23 -0400 "J. Bruce Fields" <[email protected]>
>> > > wrote:
>> > >
>> > > > Yes, it's a known server bug.
>> > > >
>> > > > As a first attempt I was thinking of just sticking a timestamp in struct
>> > > > inode to record the time of the most recent conflicting access and deny
>> > > > delegations if the timestamp is too recent, for some definition of too
>> > > > recent.
>> > > >
>> > >
>> > > Hmmm... I'll have a look next week and see what I can come up with.
>> >
>> > Thanks!
>> >
>> > If we didn't think it was worth another struct inode field, we could
>> > probably get away with global state. Even just refusing to give out any
>> > delegations for a few seconds after any delegation break would be enough
>> > to fix this bug.
>> >
>> > Or you could make it a little less harsh with a small hash table: "don't
>> > give out a delegation on any inode whose inode number hashes to X for a
>> > few seconds."
>>
>> I was thinking of using a bloom filter - or possibly two.
>> - avoid handing out delegations if either bloom filter reports a match
>> - when reclaiming a delegation add the inode to the second bloom filter
>> - every so often, zero out the older filter and swap them.
>>
>> Might be a bit of overkill, but I won't know until I implement it.
>>
>
> Below is my suggestion. It seems easy enough. It even works.
>
> However it does raise an issue with the NFS client.
>
> NFS performs a silly-rename as an 'asynchronous' operation. One consequence
> of this is that NFS4ERR_DELAY always results in a delay of
> NFS4_POLL_RETRY_MAX (15*HZ), whereas sync requests use an exponential scale
> from _MIN to _MAX.
>
> So in my test case there is always a 15-second delay:
> - try to silly-rename
> - get NFS4ERR_DELAY
> - server reclaims the delegation
> - 15 seconds pass
> - retry silly-rename - it works.
>
> I hacked the NFS client to store a timeout in 'struct nfs_renamedata' and
> use the same exponential retry pattern, and the 15 seconds (obviously)
> disappeared.
>
> Trond: would you accept a patch which did that more generally? e.g. pass a
> timeout pointer to nfs4_async_handle_error() and have the various *_done
> functions pass a pointer to a field in their calldata?
It depends. If we're touching nfs4_async_handle_error, then I think we
should also convert nfs4_async_handle_error to use the same "struct
nfs4_exception" argument that we use for the synchronous case so that
we can share a bit more code.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Wed, 11 Jun 2014 10:21:02 -0400 "J. Bruce Fields" <[email protected]>
wrote:
> On Thu, Jun 05, 2014 at 10:34:23AM +1000, NeilBrown wrote:
> > On Wed, 4 Jun 2014 18:05:31 -0400 "J. Bruce Fields" <[email protected]>
> > wrote:
> >
> > > On Wed, Jun 04, 2014 at 05:39:26PM +1000, NeilBrown wrote:
> > > > Below is my suggestion. It seems easy enough. It even works.
> > >
> > > Woah!
> > >
> > > Anyway, looks reasonable to me, and it fixes an immediate problem so I'm
> > > inclined to just apply.
> ...
> > And if you are going to apply it, you'll want:
> >
> > Signed-off-by: NeilBrown <[email protected]>
>
> Oh, gah, then I forgot to actually apply.
The best laid plans of mice ....
>
> Anyway, it's a reasonably self-contained fix for an important bug so
> I'll send it as part of a later bugfix pull request.
>
> I thought it could also use a more explicit description of the resulting
> problem, so I added:
>
> ... so further delegations should not be handed out.
>
> The current code fails to do so, and the result is effectively a
> live-lock under some workloads: a client attempting a
> conflicting operation on a read-delegated file receives
> NFS4ERR_DELAY and retries the operation, but by the time it
> retries the server may already have given out another
> delegation.
Looks good, thanks.
NeilBrown
On Sat, 31 May 2014 08:13:58 +1000 NeilBrown <[email protected]> wrote:
> On Fri, 30 May 2014 17:55:23 -0400 "J. Bruce Fields" <[email protected]>
> wrote:
>
> > On Fri, May 30, 2014 at 01:44:42PM +1000, NeilBrown wrote:
> > > On Thu, 29 May 2014 20:44:23 -0400 "J. Bruce Fields" <[email protected]>
> > > wrote:
> > >
> > > > Yes, it's a known server bug.
> > > >
> > > > As a first attempt I was thinking of just sticking a timestamp in struct
> > > > inode to record the time of the most recent conflicting access and deny
> > > > delegations if the timestamp is too recent, for some definition of too
> > > > recent.
> > > >
> > >
> > > Hmmm... I'll have a look next week and see what I can come up with.
> >
> > Thanks!
> >
> > If we didn't think it was worth another struct inode field, we could
> > probably get away with global state. Even just refusing to give out any
> > delegations for a few seconds after any delegation break would be enough
> > to fix this bug.
> >
> > Or you could make it a little less harsh with a small hash table: "don't
> > give out a delegation on any inode whose inode number hashes to X for a
> > few seconds."
>
> I was thinking of using a bloom filter - or possibly two.
> - avoid handing out delegations if either bloom filter reports a match
> - when reclaiming a delegation add the inode to the second bloom filter
> - every so often, zero out the older filter and swap them.
>
> Might be a bit of overkill, but I won't know until I implement it.
>
Below is my suggestion. It seems easy enough. It even works.
However it does raise an issue with the NFS client.
NFS performs a silly-rename as an 'asynchronous' operation. One consequence
of this is that NFS4ERR_DELAY always results in a delay of
NFS4_POLL_RETRY_MAX (15*HZ), whereas sync requests use an exponential scale
from _MIN to _MAX.
So in my test case there is always a 15-second delay:
- try to silly-rename
- get NFS4ERR_DELAY
- server reclaims the delegation
- 15 seconds pass
- retry silly-rename - it works.
I hacked the NFS client to store a timeout in 'struct nfs_renamedata' and
use the same exponential retry pattern, and the 15 seconds (obviously)
disappeared.
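For illustration only, the exponential scale used by synchronous requests can be sketched roughly like this (a userspace sketch, not the kernel code; `nfs4_next_timeout` is a made-up name, and the constants assume HZ=100 with NFS4_POLL_RETRY_MIN = HZ/10 and NFS4_POLL_RETRY_MAX = 15*HZ):

```c
/* Illustrative sketch of doubling a retry delay between _MIN and _MAX,
 * as sync NFSv4 requests do on NFS4ERR_DELAY. Not the kernel code. */
#define HZ 100
#define NFS4_POLL_RETRY_MIN (HZ / 10)
#define NFS4_POLL_RETRY_MAX (15 * HZ)

/* Given the previous delay in jiffies (0 on the first retry),
 * return the next delay: start at _MIN, double, clamp to _MAX. */
static long nfs4_next_timeout(long timeout)
{
	if (timeout < NFS4_POLL_RETRY_MIN)
		return NFS4_POLL_RETRY_MIN;
	timeout <<= 1;
	if (timeout > NFS4_POLL_RETRY_MAX)
		timeout = NFS4_POLL_RETRY_MAX;
	return timeout;
}
```

So instead of a flat 15-second wait, the first retry happens after a tenth of a second and the delay only grows toward the cap if the server keeps returning NFS4ERR_DELAY.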
Trond: would you accept a patch which did that more generally? e.g. pass a
timeout pointer to nfs4_async_handle_error() and have the various *_done
functions pass a pointer to a field in their calldata?
NeilBrown
NFSD: Don't hand out delegations for 30 seconds after recalling them.
If nfsd needs to recall a delegation for some reason it implies that there is
contention on the file, so further delegations should not be handed out.
We could simply avoid delegations for (say) 30 seconds after any recall, but
this is probably too heavy-handed.
We could keep a list of inodes (or inode numbers or filehandles) for recalled
delegations, but that requires memory allocation and searching.
The approach taken here is to use a bloom filter to record the filehandles
which are currently blocked from delegation, and to accept the cost of a few
false positives.
We have two bloom filters, each of which is valid for 30 seconds. When a
delegation is recalled, the filehandle is added to one filter, and delegations
on that filehandle will remain blocked for between 30 and 60 seconds.
We keep a count of the number of filehandles that have been added, so when
that count is zero we can bypass all other tests.
The bloom filters have 256 bits and 3 hash functions. This should allow a
couple of dozen blocked filehandles with minimal false positives. If many
more filehandles are all blocked at once, behaviour will degrade towards
rejecting all delegations for between 30 and 60 seconds, then resetting and
allowing new delegations.
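As a rough model of the scheme described above, here is a simplified, single-threaded userspace sketch (the helper names and bit-twiddling here are illustrative; the real code in the patch below hashes the filehandle with arch_fast_hash, serialises with recall_lock, and orders the swap bookkeeping slightly differently):

```c
#include <stdint.h>
#include <string.h>

/* Two 256-bit filters; "new" indexes the one currently receiving entries. */
#define FILTER_BITS 256

struct bloom_pair {
	int entries, old_entries;
	long swap_time;                       /* seconds */
	int new;                              /* index into 'set' */
	unsigned char set[2][FILTER_BITS / 8];
};

static void set_bit8(unsigned char *f, uint32_t b)
{
	f[(b & 255) >> 3] |= 1 << (b & 7);
}

static int test_bit8(const unsigned char *f, uint32_t b)
{
	return f[(b & 255) >> 3] & (1 << (b & 7));
}

/* Record a recalled filehandle's 32-bit hash: three bits (the low
 * three bytes of the hash) are set in the "new" filter. */
static void block_hash(struct bloom_pair *bd, uint32_t hash, long now)
{
	set_bit8(bd->set[bd->new], hash);
	set_bit8(bd->set[bd->new], hash >> 8);
	set_bit8(bd->set[bd->new], hash >> 16);
	if (bd->entries == 0)
		bd->swap_time = now;
	bd->entries += 1;
}

/* A hash is blocked if all three bits are set in either filter.
 * Every 30 seconds the older filter is cleared and becomes the new
 * target, so an entry survives for between 30 and 60 seconds. */
static int hash_blocked(struct bloom_pair *bd, uint32_t hash, long now)
{
	int i;

	if (bd->entries == 0)
		return 0;
	if (now - bd->swap_time > 30) {
		bd->entries -= bd->old_entries;
		bd->old_entries = bd->entries;
		bd->new = 1 - bd->new;
		memset(bd->set[bd->new], 0, sizeof(bd->set[0]));
		bd->swap_time = now;
	}
	for (i = 0; i < 2; i++)
		if (test_bit8(bd->set[i], hash) &&
		    test_bit8(bd->set[i], hash >> 8) &&
		    test_bit8(bd->set[i], hash >> 16))
			return 1;
	return 0;
}
```

The `entries` counter lets the common case (no recent recalls) return immediately without touching the filters at all.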
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 9a77a5a21557..45101b41fb04 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -41,6 +41,7 @@
#include <linux/ratelimit.h>
#include <linux/sunrpc/svcauth_gss.h>
#include <linux/sunrpc/addr.h>
+#include <linux/hash.h>
#include "xdr4.h"
#include "xdr4cb.h"
#include "vfs.h"
@@ -364,6 +365,79 @@ static struct nfs4_ol_stateid * nfs4_alloc_stateid(struct nfs4_client *clp)
return openlockstateid(nfs4_alloc_stid(clp, stateid_slab));
}
+/*
+ * When we recall a delegation, we should be careful not to hand it
+ * out again straight away.
+ * To ensure this we keep a pair of bloom filters ('new' and 'old')
+ * in which the filehandles of recalled delegations are "stored".
+ * If a filehandle appears in either filter, a delegation is blocked.
+ * When a delegation is recalled, the filehandle is stored in the "new"
+ * filter.
+ * Every 30 seconds we swap the filters and clear the "new" one,
+ * unless both are empty of course.
+ *
+ * Each filter is 256 bits. We hash the filehandle to a 32-bit value and
+ * use the low 3 bytes as bit indices into the filter.
+ *
+ * 'recall_lock', which is always held when block_delegations() is called,
+ * is used to manage concurrent access. Testing does not need the lock
+ * except when swapping the two filters.
+ */
+static struct bloom_pair {
+ int entries, old_entries;
+ time_t swap_time;
+ int new; /* index into 'set' */
+ DECLARE_BITMAP(set[2], 256);
+} blocked_delegations;
+
+static int delegation_blocked(struct knfsd_fh *fh)
+{
+ u32 hash;
+ struct bloom_pair *bd = &blocked_delegations;
+
+ if (bd->entries == 0)
+ return 0;
+ if (seconds_since_boot() - bd->swap_time > 30) {
+ spin_lock(&recall_lock);
+ if (seconds_since_boot() - bd->swap_time > 30) {
+ bd->entries -= bd->old_entries;
+ bd->old_entries = bd->entries;
+ memset(bd->set[bd->new], 0,
+ sizeof(bd->set[0]));
+ bd->new = 1-bd->new;
+ bd->swap_time = seconds_since_boot();
+ }
+ spin_unlock(&recall_lock);
+ }
+ hash = arch_fast_hash(&fh->fh_base, fh->fh_size, 0);
+ if (test_bit(hash&255, bd->set[0]) &&
+ test_bit((hash>>8)&255, bd->set[0]) &&
+ test_bit((hash>>16)&255, bd->set[0]))
+ return 1;
+
+ if (test_bit(hash&255, bd->set[1]) &&
+ test_bit((hash>>8)&255, bd->set[1]) &&
+ test_bit((hash>>16)&255, bd->set[1]))
+ return 1;
+
+ return 0;
+}
+
+static void block_delegations(struct knfsd_fh *fh)
+{
+ u32 hash;
+ struct bloom_pair *bd = &blocked_delegations;
+
+ hash = arch_fast_hash(&fh->fh_base, fh->fh_size, 0);
+
+ __set_bit(hash&255, bd->set[bd->new]);
+ __set_bit((hash>>8)&255, bd->set[bd->new]);
+ __set_bit((hash>>16)&255, bd->set[bd->new]);
+ if (bd->entries == 0)
+ bd->swap_time = seconds_since_boot();
+ bd->entries += 1;
+}
+
static struct nfs4_delegation *
alloc_init_deleg(struct nfs4_client *clp, struct nfs4_ol_stateid *stp, struct svc_fh *current_fh)
{
@@ -372,6 +446,8 @@ alloc_init_deleg(struct nfs4_client *clp, struct nfs4_ol_stateid *stp, struct sv
dprintk("NFSD alloc_init_deleg\n");
if (num_delegations > max_delegations)
return NULL;
+ if (delegation_blocked(&current_fh->fh_handle))
+ return NULL;
dp = delegstateid(nfs4_alloc_stid(clp, deleg_slab));
if (dp == NULL)
return dp;
@@ -2742,6 +2818,8 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
/* Only place dl_time is set; protected by i_lock: */
dp->dl_time = get_seconds();
+ block_delegations(&dp->dl_fh);
+
nfsd4_cb_recall(dp);
}