Hi,
Apologies for this rather long email, but I thought there may be some interest out there in the community in how and why we've been doing something unsupported and barely documented - NFS re-exporting! And I'm not sure I can tell our story well in just a few short sentences so please bear with me (or stop now!).
Full disclosure - I am also rather hoping that this story piques some interest amongst developers to help make our rather niche setup even better and perhaps a little better documented. I also totally understand if this is something people wouldn't want to touch with a very long barge pole....
First a quick bit of history (I hope I have this right). Late in 2015, Jeff Layton proposed a patch series allowing knfsd to re-export an NFS client mount. The rationale then was to provide a "proxy" server that could mount an NFSv4-only server and re-export it to older clients that only supported NFSv3. One of the main sticking points then (as now) was the 63 byte limit of filehandles for NFSv3 and the fact that it couldn't be guaranteed that all re-exported filehandles would fit within that (in my experience it mostly works with "no_subtree_check"). There are also the usual locking and coherence concerns with NFSv3, but I'll get to those in a bit.
Then almost two years later, v4.13 was released including the parts of the patch series that actually allowed the re-export, and since then other relevant bits (such as the open file cache) have also been merged. I soon became interested in using this new functionality both to accelerate our on-premises NFS storage and to use it as a "WAN cache" providing cloud compute instances with locally cached proxy access to our on-premises storage.
Cut to a brief introduction to us and what we do... DNEG is an award-winning VFX company which uses large compute farms to generate complex final frame renders for movies and TV. This workload mostly consists of reads of common data shared between many render clients (e.g. textures, geometry) and a little unique data per frame. All file writes are to unique files per process (frames) and there is very little, if any, writing over existing files. Hence it's not very demanding on locking and coherence guarantees.
When our on-premises NFS storage is being overloaded or the server's network is maxed out, we can place multiple re-export servers in between them and our farm to improve performance. When our on-premises render farm is not quite big enough to meet a deadline, we spin up compute instances with a (reasonably local) cloud provider. Some of these cloud instances are Linux NFS servers which mount our on-premises NFS storage servers (~10ms away) and re-export these to the other cloud (render) instances. Since we know that the data we are reading doesn't change often, we can increase the actimeo and even use nocto to reduce the network chatter back to the on-prem servers. These re-export servers also use fscache/cachefiles to cache data to disk so that we can retain TBs of previously read data locally in the cloud over long periods of time. We also use NFSv4 (less network chatter) all the way from our on-prem storage to the re-export server and then on to the clients.
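To give a rough idea of what this looks like in practice (the hostnames, paths, fsid and exact option values here are made up for illustration rather than copied from our production config), a re-export server ends up with something along these lines:

reexport-server # mount -t nfs -o vers=4.2,actimeo=3600,nocto,nconnect=16,fsc onprem-server:/vol/projects /srv/projects
reexport-server # cat /etc/exports
/srv/projects *(rw,no_subtree_check,fsid=1234)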
The re-export server(s) quickly builds up both a memory cache and disk backed fscache/cachefiles storage cache of our working data set so the data being pulled from on-prem lessens over time. Data is only ever read once over the WAN network from on-prem storage and then read multiple times by the many render client instances in the cloud. Recent NFS features such as "nconnect" help to speed up the initial reading of data from on-prem by using multiple connections to offset TCP latency. At the end of the render, we write the files back through the re-export server to our on-prem storage. Our average read bandwidth is many times higher than our write bandwidth.
Rather surprisingly, this mostly works for our particular workloads. We've completed movies using this setup and saved money on commercial caching systems (e.g. Avere, GPFS, etc). But there are some remaining issues with doing something that is very much not widely supported (or recommended). In most cases we have worked around them, but it would be great if we didn't have to, so others could also benefit. I will list the main problems quickly now and provide more information and reproducers later if anyone is interested.
1) The kernel can drop entries out of the NFS client inode cache (under memory cache churn) when those filehandles are still being used by the knfsd's remote clients, resulting in sporadic and random stale filehandles. This seems to be mostly for directories from what I've seen. Does the NFS client not know that knfsd is still using those files/dirs? The workaround is to never drop inode & dentry caches on the re-export servers (vfs_cache_pressure=1; see the example commands after this list). This also helps to ensure that we actually make the most of our actimeo=3600,nocto mount options for the full specified time.
2) If we cache metadata on the re-export server using actimeo=3600,nocto we can cut the network packets back to the origin server to zero for repeated lookups. However, if a client of the re-export server walks paths and memory maps those files (i.e. loading an application), the re-export server starts issuing unexpected calls back to the origin server again, ignoring/invalidating the re-export server's NFS client cache. We worked around this by patching an inode/iversion validity check in inode.c so that the NFS client cache on the re-export server is used. I'm not sure about the correctness of this patch but it works for our corner case.
3) If we saturate an NFS client's network with reads from the server, all client metadata lookups become unbearably slow even if it's all cached in the NFS client's memory and no network RPCs should be required. This is the case for any NFS client regardless of re-exporting but it affects this case more because when we can't serve cached metadata we also can't serve the cached data. It feels like some sort of bottleneck in the client's ability to parallelise requests? We work around this by not maxing out our network.
4) With an NFSv4 re-export, lots of open/close requests (hundreds per second) quickly eat up the CPU on the re-export server and perf top shows we are mostly in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache like the one added for NFSv3? Our workaround is to either fix the thing doing lots of repeated open/closes (again, see the example commands after this list for how we spot them) or use NFSv3 instead.
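To make a couple of those workarounds a little more concrete, here is roughly what we do (the sysctl.d file name is arbitrary and the values are just what we happen to use, so treat this as a sketch rather than a recommendation). For problem 1 we pin the caches on every re-export server, and for problem 4 we usually spot the open/close-heavy clients by watching the v4 operation counters tick over on the re-export server:

reexport-server # echo 'vm.vfs_cache_pressure = 1' > /etc/sysctl.d/90-nfs-reexport.conf
reexport-server # sysctl -p /etc/sysctl.d/90-nfs-reexport.conf
reexport-server # watch -d nfsstat -s -4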
If you made it this far, I've probably taken up way too much of your valuable time already. If nobody is interested in this rather niche application of the Linux client & knfsd, then I totally understand and I will not mention it here again. If your interest is piqued however, I'm happy to go into more detail about any of this with the hope that this could become a better documented and understood type of setup that others with similar workloads could reference.
Also, many thanks to all the Linux NFS developers for the amazing work you do which, in turn, helps us to make great movies. :)
Daire (Head of Systems DNEG)
Just out of curiosity:
did you try, instead of re-exporting the nfs mount directly,
re-exporting an overlayfs mount on top of the original nfs mount?
Such a setup should cover most of your issues.
Regards,
Tigran.
Tigran,
I guess I never really considered overlayfs because we still want to seamlessly write through to the original servers from time to time, and post-processing the copies from upper to lower seems like it might be hard to make reliable or do with low latency. I would also worry about how overlayfs would deal with our lower filesystem being actively updated by processes outside of the overlay clients. And ultimately, the COW nature of overlayfs is a somewhat wasted feature for our workloads, since it's the caching of file reads (and metadata) we care most about.
I must confess to not having looked at overlayfs in a few years, so there may be lots of new tricks and options that would help our case. I'm aware that it gained support for NFS (re-)export a couple of years back.
But I'm certainly now interested to know if that NFS re-export implementation fares any better with the issues I experience with a direct knfsd re-export of an NFS client. So I will do some testing with overlayfs and see how it stacks up (see what I did there?).
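For anyone else curious, the sort of stacked setup I have in mind for testing looks roughly like this (the paths and fsid are made up, upperdir/workdir have to live on the same local filesystem, and my understanding is that overlayfs needs index=on and nfs_export=on to be exportable, so treat it as a sketch):

reexport-server # mount -t nfs -o vers=4.2,actimeo=3600,nocto onprem-server:/vol/software /mnt/nfs-lower
reexport-server # mount -t overlay overlay -o lowerdir=/mnt/nfs-lower,upperdir=/data/upper,workdir=/data/work,index=on,nfs_export=on /srv/merged
reexport-server # exportfs -o rw,no_subtree_check,fsid=2345 '*:/srv/merged'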
Thanks for the suggestion!
Daire
On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
> 1) The kernel can drop entries out of the NFS client inode cache (under memory cache churn) when those filehandles are still being used by the knfsd's remote clients resulting in sporadic and random stale filehandles. This seems to be mostly for directories from what I've seen. Does the NFS client not know that knfsd is still using those files/dirs? The workaround is to never drop inode & dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps to ensure that we actually make the most of our actimeo=3600,nocto mount options for the full specified time.
I thought reexport worked by embedding the original server's filehandles
in the filehandles given out by the reexporting server.
So, even if nothing's cached, when the reexporting server gets a
filehandle, it should be able to extract the original filehandle from it
and use that.
I wonder why that's not working?
> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> second) quickly eat up the CPU on the re-export server and perf top
> shows we are mostly in native_queued_spin_lock_slowpath.
Any statistics on who's calling that function?
> Does NFSv4
> also need an open file cache like that added to NFSv3? Our workaround
> is to either fix the thing doing lots of repeated open/closes or use
> NFSv3 instead.
NFSv4 uses the same file cache. It might be the file cache that's at
fault, in fact....
--b.
On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
> > 1) The kernel can drop entries out of the NFS client inode cache
> > (under memory cache churn) when those filehandles are still being
> > used by the knfsd's remote clients resulting in sporadic and random
> > stale filehandles. This seems to be mostly for directories from
> > what I've seen. Does the NFS client not know that knfsd is still
> > using those files/dirs? The workaround is to never drop inode &
> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
> > also helps to ensure that we actually make the most of our
> > actimeo=3600,nocto mount options for the full specified time.
>
> I thought reexport worked by embedding the original server's
> filehandles
> in the filehandles given out by the reexporting server.
>
> So, even if nothing's cached, when the reexporting server gets a
> filehandle, it should be able to extract the original filehandle from
> it
> and use that.
>
> I wonder why that's not working?
NFSv3? If so, I suspect it is because we never wrote a lookupp()
callback for it.
>
> > 4) With an NFSv4 re-export, lots of open/close requests (hundreds
> > per
> > second) quickly eat up the CPU on the re-export server and perf top
> > shows we are mostly in native_queued_spin_lock_slowpath.
>
> Any statistics on who's calling that function?
>
> > Does NFSv4
> > also need an open file cache like that added to NFSv3? Our
> > workaround
> > is to either fix the thing doing lots of repeated open/closes or
> > use
> > NFSv3 instead.
>
> NFSv4 uses the same file cache. It might be the file cache that's at
> fault, in fact....
>
> --b.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
Trond/Bruce,
----- On 15 Sep, 2020, at 20:59, Trond Myklebust [email protected] wrote:
> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>> > 1) The kernel can drop entries out of the NFS client inode cache
>> > (under memory cache churn) when those filehandles are still being
>> > used by the knfsd's remote clients resulting in sporadic and random
>> > stale filehandles. This seems to be mostly for directories from
>> > what I've seen. Does the NFS client not know that knfsd is still
>> > using those files/dirs? The workaround is to never drop inode &
>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>> > also helps to ensure that we actually make the most of our
>> > actimeo=3600,nocto mount options for the full specified time.
>>
>> I thought reexport worked by embedding the original server's
>> filehandles
>> in the filehandles given out by the reexporting server.
>>
>> So, even if nothing's cached, when the reexporting server gets a
>> filehandle, it should be able to extract the original filehandle from
>> it
>> and use that.
>>
>> I wonder why that's not working?
>
> NFSv3? If so, I suspect it is because we never wrote a lookupp()
> callback for it.
So in terms of the ESTALE counter on the reexport server, we see it increase if the end client mounts the reexport using either NFSv3 or NFSv4. But there is a difference in the client experience: with NFSv3 we quickly get input/output errors, whereas with NFSv4 we don't. Performance does seem to drop significantly though, which makes me think that NFSv4 retries the lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
This is the simplest reproducer I could come up with but it may still be specific to our workloads/applications and hard to replicate exactly.
nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro reexport-server:/vol/software /mnt/software
nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee /proc/sys/vm/drop_caches; done
reexport-server # sysctl -w vm.vfs_cache_pressure=100
reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep 10; done
Where "application" is some big application with lots of paths to scan with libs to memory map and "/vol/software" is an NFS mount on the reexport-server from another originating NFS server. I don't know why this application loading workload shows this best, but perhaps the access patterns of memory mapped binaries and libs are particularly susceptible to ESTALE?
With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches" repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache. The ESTALE count increases and the client running the application reports input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
As soon as we switch to vfs_cache_pressure=0, the repeating drop_caches on the reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter no longer increases and the client experiences no issues (NFSv3 & NFSv4).
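In case it helps anyone trying to reproduce this, watching the slab counts alongside the nfsd filehandle stats makes the culling fairly obvious (both are standard /proc interfaces):

reexport-server # watch -d "grep -E 'nfs_inode_cache|^dentry' /proc/slabinfo"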
>> > 4) With an NFSv4 re-export, lots of open/close requests (hundreds
>> > per
>> > second) quickly eat up the CPU on the re-export server and perf top
>> > shows we are mostly in native_queued_spin_lock_slowpath.
>>
>> Any statistics on who's calling that function?
I have not managed to devise a good reproducer for this as I suspect it requires large numbers of clients. So, I will have to use some production load to replicate it and it will take me a day or two to get something back to you.
Would something from a perf report be of particular interest (e.g. the call graph) or even a /proc/X/stack of a high CPU nfsd thread?
I do recall that nfsd_file_lru_cb and __list_lru_walk_one were usually right below native_queued_spin_lock_slowpath as the next most busy functions in perf top (with NFSv4 exporting). Perhaps this is less of an NFS reexport phenomenon and would be the case for any NFSv4 export of a particularly "slow" underlying filesystem?
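For reference, when I do get to capture it I'll just grab a system-wide call graph with perf on the re-export server, along these lines (the sampling frequency and duration are arbitrary):

reexport-server # perf record -a -g -F 99 -- sleep 60
reexport-server # perf report --stdio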
>> > Does NFSv4
>> > also need an open file cache like that added to NFSv3? Our
>> > workaround
>> > is to either fix the thing doing lots of repeated open/closes or
>> > use
>> > NFSv3 instead.
>>
>> NFSv4 uses the same file cache. It might be the file cache that's at
>> fault, in fact....
Ah, my misunderstanding. I had assumed the open file descriptor cache was of more benefit to NFSv3 and that NFSv4 did not necessarily require it for performance.
I might also be able to do a test with a kernel version from before when that feature landed to see if NFSv4 reexport performs any differently.
Cheers,
Daire
----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> second) quickly eat up the CPU on the re-export server and perf top
>> shows we are mostly in native_queued_spin_lock_slowpath.
>
> Any statistics on who's calling that function?
I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
The perf top summary looks like this:
# Overhead Command Shared Object Symbol
# ........ ............... ............................ .......................................................
#
82.91% nfsd [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
8.24% swapper [kernel.kallsyms] [k] intel_idle
4.66% nfsd [kernel.kallsyms] [k] __list_lru_walk_one
0.80% nfsd [kernel.kallsyms] [k] nfsd_file_lru_cb
And the call graph (not sure how this will format):
- nfsd
- 89.34% svc_process
- 88.94% svc_process_common
- 88.87% nfsd_dispatch
- 88.82% nfsd4_proc_compound
- 53.97% nfsd4_open
- 53.95% nfsd4_process_open2
- 53.87% nfs4_get_vfs_file
- 53.48% nfsd_file_acquire
- 33.31% nfsd_file_lru_walk_list
- 33.28% list_lru_walk_node
- 33.28% list_lru_walk_one
- 30.21% _raw_spin_lock
- 30.21% queued_spin_lock_slowpath
30.20% native_queued_spin_lock_slowpath
2.46% __list_lru_walk_one
- 19.39% list_lru_add
- 19.39% _raw_spin_lock
- 19.39% queued_spin_lock_slowpath
19.38% native_queued_spin_lock_slowpath
- 34.46% nfsd4_close
- 34.45% nfs4_put_stid
- 34.45% nfs4_free_ol_stateid
- 34.45% release_all_access
- 34.45% nfs4_file_put_access
- 34.45% __nfs4_file_put_access.part.81
- 34.45% nfsd_file_put
- 34.44% nfsd_file_lru_walk_list
- 34.40% list_lru_walk_node
- 34.40% list_lru_walk_one
- 31.27% _raw_spin_lock
- 31.27% queued_spin_lock_slowpath
31.26% native_queued_spin_lock_slowpath
2.50% __list_lru_walk_one
0.50% nfsd_file_lru_cb
The original NFS server is mounted by the reexport server using NFSv4.2. As soon as we switch the clients to mount the reexport server with NFSv3, the high CPU usage goes away and we start to see expected performance for this workload and server hardware.
I'm happy to share perf data or anything else that is useful and I can repeatedly run this production load as required.
Cheers,
Daire
On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>
> ----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
>
> >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> >> second) quickly eat up the CPU on the re-export server and perf top
> >> shows we are mostly in native_queued_spin_lock_slowpath.
> >
> > Any statistics on who's calling that function?
>
> I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
>
> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
That sounds a lot like what Frank Van der Linden reported:
https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
It looks like a bug in the filehandle caching code.
--b.
On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>
> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> >
> > ----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
> >
> > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> > >> second) quickly eat up the CPU on the re-export server and perf top
> > >> shows we are mostly in native_queued_spin_lock_slowpath.
> > >
> > > Any statistics on who's calling that function?
> >
> > I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
> >
> > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
>
> That sounds a lot like what Frank Van der Linden reported:
>
> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>
> It looks like a bug in the filehandle caching code.
>
> --b.
Yes, that does look like the same one.
I still think that not caching v4 files at all may be the best way to go
here, since the intent of the filecache code was to speed up v2/v3 I/O,
where you end up doing a lot of opens/closes, but it doesn't make as
much sense for v4.
However, short of that, I tested a local patch a few months back that
I never posted here, so I'll do so now. It just makes v4 opens into
'long term' opens, which do not get put on the LRU, since that doesn't
make sense (they are in the hash table, so they are still cached).
Also, the file caching code seems to walk the LRU a little too often,
but that's another issue - and this change keeps the LRU short, so it's
not a big deal.
I don't particularly love this patch, but it does keep the LRU short, and
did significantly speed up my testcase (by about 50%). So, maybe you can
give it a try.
I'll also attach a second patch, that converts the hash table to an rhashtable,
which automatically grows and shrinks in size with usage. That patch also
helped, but not by nearly as much (I think it yielded another 10%).
- Frank
On Thu, Sep 17, 2020 at 08:23:03PM +0000, Frank van der Linden wrote:
> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
> >
> > On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> > >
> > > ----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
> > >
> > > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> > > >> second) quickly eat up the CPU on the re-export server and perf top
> > > >> shows we are mostly in native_queued_spin_lock_slowpath.
> > > >
> > > > Any statistics on who's calling that function?
> > >
> > > I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
> > >
> > > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
> >
> > That sounds a lot like what Frank Van der Linden reported:
> >
> > https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> >
> > It looks like a bug in the filehandle caching code.
> >
> > --b.
>
> Yes, that does look like the same one.
>
> I still think that not caching v4 files at all may be the best way to go
> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> where you end up doing a lot of opens/closes, but it doesn't make as
> much sense for v4.
>
> However, short of that, I tested a local patch a few months back, that
> I never posted here, so I'll do so now. It just makes v4 opens in to
> 'long term' opens, which do not get put on the LRU, since that doesn't
> make sense (they are in the hash table, so they are still cached).
That makes sense to me. But I'm also not opposed to turning it off for
v4 at this point.
--b.
> Also, the file caching code seems to walk the LRU a little too often,
> but that's another issue - and this change keeps the LRU short, so it's
> not a big deal.
>
> I don't particularly love this patch, but it does keep the LRU short, and
> did significantly speed up my testcase (by about 50%). So, maybe you can
> give it a try.
>
> I'll also attach a second patch, that converts the hash table to an rhashtable,
> which automatically grows and shrinks in size with usage. That patch also
> helped, but not by nearly as much (I think it yielded another 10%).
>
> - Frank
> From 057a24e1b3744c716e4956eb34c2d15ed719db23 Mon Sep 17 00:00:00 2001
> From: Frank van der Linden <[email protected]>
> Date: Fri, 26 Jun 2020 22:35:01 +0000
> Subject: [PATCH 1/2] nfsd: don't put nfsd_files with long term refs on the LRU
> list
>
> Files with long term references, as created by v4 OPENs, will
> just clutter the LRU list without a chance of being reaped.
> So, don't put them there at all.
>
> When finding a file in the hash table for a long term ref, remove
> it from the LRU list.
>
> When dropping the last long term ref, add it back to the LRU list.
>
> Signed-off-by: Frank van der Linden <[email protected]>
> ---
> fs/nfsd/filecache.c | 81 ++++++++++++++++++++++++++++++++++++++++-----
> fs/nfsd/filecache.h | 6 ++++
> fs/nfsd/nfs4state.c | 2 +-
> 3 files changed, 79 insertions(+), 10 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 82198d747c4c..5ef6bb802f24 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -186,6 +186,7 @@ nfsd_file_alloc(struct inode *inode, unsigned int may, unsigned int hashval,
> nf->nf_inode = inode;
> nf->nf_hashval = hashval;
> refcount_set(&nf->nf_ref, 1);
> + atomic_set(&nf->nf_lref, 0);
> nf->nf_may = may & NFSD_FILE_MAY_MASK;
> if (may & NFSD_MAY_NOT_BREAK_LEASE) {
> if (may & NFSD_MAY_WRITE)
> @@ -297,13 +298,26 @@ nfsd_file_put_noref(struct nfsd_file *nf)
> }
> }
>
> -void
> -nfsd_file_put(struct nfsd_file *nf)
> +static void
> +__nfsd_file_put(struct nfsd_file *nf, unsigned int flags)
> {
> bool is_hashed;
> + int refs;
> +
> + refs = refcount_read(&nf->nf_ref);
> +
> + if (flags & NFSD_ACQ_FILE_LONGTERM) {
> + /*
> + * If we're dropping the last long term ref, and there
> + * are other references, put the file on the LRU list,
> + * as it now makes sense for it to be there.
> + */
> + if (atomic_dec_return(&nf->nf_lref) == 0 && refs > 2)
> + list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> + } else
> + set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>
> - set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> - if (refcount_read(&nf->nf_ref) > 2 || !nf->nf_file) {
> + if (refs > 2 || !nf->nf_file) {
> nfsd_file_put_noref(nf);
> return;
> }
> @@ -317,6 +331,18 @@ nfsd_file_put(struct nfsd_file *nf)
> nfsd_file_gc();
> }
>
> +void
> +nfsd_file_put(struct nfsd_file *nf)
> +{
> + __nfsd_file_put(nf, 0);
> +}
> +
> +void
> +nfsd_file_put_longterm(struct nfsd_file *nf)
> +{
> + __nfsd_file_put(nf, NFSD_ACQ_FILE_LONGTERM);
> +}
> +
> struct nfsd_file *
> nfsd_file_get(struct nfsd_file *nf)
> {
> @@ -934,13 +960,14 @@ nfsd_file_is_cached(struct inode *inode)
> return ret;
> }
>
> -__be32
> -nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> - unsigned int may_flags, struct nfsd_file **pnf)
> +static __be32
> +__nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + unsigned int may_flags, struct nfsd_file **pnf,
> + unsigned int flags)
> {
> __be32 status;
> struct net *net = SVC_NET(rqstp);
> - struct nfsd_file *nf, *new;
> + struct nfsd_file *nf, *new = NULL;
> struct inode *inode;
> unsigned int hashval;
> bool retry = true;
> @@ -1006,6 +1033,16 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> }
> }
> out:
> + if (flags & NFSD_ACQ_FILE_LONGTERM) {
> + /*
> + * A file with long term (v4) references will needlessly
> + * clutter the LRU, so remove it when adding the first
> + * long term ref.
> + */
> + if (!new && atomic_inc_return(&nf->nf_lref) == 1)
> + list_lru_del(&nfsd_file_lru, &nf->nf_lru);
> + }
> +
> if (status == nfs_ok) {
> *pnf = nf;
> } else {
> @@ -1021,7 +1058,18 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> refcount_inc(&nf->nf_ref);
> __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> - list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> +
> + /*
> + * Don't add a new file to the LRU if it's a long term reference.
> + * It is still added to the hash table, so it may be added to the
> + * LRU later, when the number of long term references drops back
> + * to zero, and there are other references.
> + */
> + if (flags & NFSD_ACQ_FILE_LONGTERM)
> + atomic_inc(&nf->nf_lref);
> + else
> + list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> +
> hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head);
> ++nfsd_file_hashtbl[hashval].nfb_count;
> nfsd_file_hashtbl[hashval].nfb_maxcount = max(nfsd_file_hashtbl[hashval].nfb_maxcount,
> @@ -1054,6 +1102,21 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> goto out;
> }
>
> +__be32
> +nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + unsigned int may_flags, struct nfsd_file **pnf)
> +{
> + return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf, 0);
> +}
> +
> +__be32
> +nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + unsigned int may_flags, struct nfsd_file **pnf)
> +{
> + return __nfsd_file_acquire(rqstp, fhp, may_flags, pnf,
> + NFSD_ACQ_FILE_LONGTERM);
> +}
> +
> /*
> * Note that fields may be added, removed or reordered in the future. Programs
> * scraping this file for info should test the labels to ensure they're
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index 7872df5a0fe3..6e1db77d7148 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -44,21 +44,27 @@ struct nfsd_file {
> struct inode *nf_inode;
> unsigned int nf_hashval;
> refcount_t nf_ref;
> + atomic_t nf_lref;
> unsigned char nf_may;
> struct nfsd_file_mark *nf_mark;
> struct rw_semaphore nf_rwsem;
> };
>
> +#define NFSD_ACQ_FILE_LONGTERM 0x0001
> +
> int nfsd_file_cache_init(void);
> void nfsd_file_cache_purge(struct net *);
> void nfsd_file_cache_shutdown(void);
> int nfsd_file_cache_start_net(struct net *net);
> void nfsd_file_cache_shutdown_net(struct net *net);
> void nfsd_file_put(struct nfsd_file *nf);
> +void nfsd_file_put_longterm(struct nfsd_file *nf);
> struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
> void nfsd_file_close_inode_sync(struct inode *inode);
> bool nfsd_file_is_cached(struct inode *inode);
> __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int may_flags, struct nfsd_file **nfp);
> +__be32 nfsd_file_acquire_longterm(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + unsigned int may_flags, struct nfsd_file **nfp);
> int nfsd_file_cache_stats_open(struct inode *, struct file *);
> #endif /* _FS_NFSD_FILECACHE_H */
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index bb3d2c32664a..451a1071daf4 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4838,7 +4838,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
>
> if (!fp->fi_fds[oflag]) {
> spin_unlock(&fp->fi_lock);
> - status = nfsd_file_acquire(rqstp, cur_fh, access, &nf);
> + status = nfsd_file_acquire_longterm(rqstp, cur_fh, access, &nf);
> if (status)
> goto out_put_access;
> spin_lock(&fp->fi_lock);
> --
> 2.17.2
>
> From 79e7ffd01482d90cd5f6e98b5a362bbf95ea9b2c Mon Sep 17 00:00:00 2001
> From: Frank van der Linden <[email protected]>
> Date: Thu, 16 Jul 2020 21:35:29 +0000
> Subject: [PATCH 2/2] nfsd: change file_hashtbl to an rhashtable
>
> file_hashtbl can grow quite large, so use rhashtable, which has
> automatic growing (and shrinking).
>
> Signed-off-by: Frank van der Linden <[email protected]>
> ---
> fs/nfsd/nfs4state.c | 112 +++++++++++++++++++++++++++++---------------
> fs/nfsd/nfsctl.c | 7 ++-
> fs/nfsd/nfsd.h | 4 ++
> fs/nfsd/state.h | 3 +-
> 4 files changed, 86 insertions(+), 40 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 451a1071daf4..ff81c0136224 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -417,13 +417,33 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
> kmem_cache_free(file_slab, fp);
> }
>
> +/* hash table for nfs4_file */
> +#define FILE_HASH_SIZE 256
> +
> +static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed);
> +static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed);
> +static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
> + const void *obj);
> +
> +static const struct rhashtable_params file_rhashparams = {
> + .head_offset = offsetof(struct nfs4_file, fi_hash),
> + .min_size = FILE_HASH_SIZE,
> + .automatic_shrinking = true,
> + .hashfn = nfsd4_file_key_hash,
> + .obj_hashfn = nfsd4_file_obj_hash,
> + .obj_cmpfn = nfsd4_file_obj_compare,
> +};
> +
> +struct rhashtable file_hashtbl;
> +
> void
> put_nfs4_file(struct nfs4_file *fi)
> {
> might_lock(&state_lock);
>
> if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
> - hlist_del_rcu(&fi->fi_hash);
> + rhashtable_remove_fast(&file_hashtbl, &fi->fi_hash,
> + file_rhashparams);
> spin_unlock(&state_lock);
> WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
> WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
> @@ -527,21 +547,33 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
> return ret & OWNER_HASH_MASK;
> }
>
> -/* hash table for nfs4_file */
> -#define FILE_HASH_BITS 8
> -#define FILE_HASH_SIZE (1 << FILE_HASH_BITS)
> -
> -static unsigned int nfsd_fh_hashval(struct knfsd_fh *fh)
> +static u32 nfsd4_file_key_hash(const void *data, u32 len, u32 seed)
> {
> - return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), 0);
> + struct knfsd_fh *fh = (struct knfsd_fh *)data;
> +
> + return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
> }
>
> -static unsigned int file_hashval(struct knfsd_fh *fh)
> +static u32 nfsd4_file_obj_hash(const void *data, u32 len, u32 seed)
> {
> - return nfsd_fh_hashval(fh) & (FILE_HASH_SIZE - 1);
> + struct nfs4_file *fp = (struct nfs4_file *)data;
> + struct knfsd_fh *fh;
> +
> + fh = &fp->fi_fhandle;
> +
> + return jhash2(fh->fh_base.fh_pad, XDR_QUADLEN(fh->fh_size), seed);
> }
>
> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
> +static int nfsd4_file_obj_compare(struct rhashtable_compare_arg *arg,
> + const void *obj)
> +{
> + struct nfs4_file *fp = (struct nfs4_file *)obj;
> +
> + if (fh_match(&fp->fi_fhandle, (struct knfsd_fh *)arg->key))
> + return 0;
> +
> + return 1;
> +}
>
> static void
> __nfs4_file_get_access(struct nfs4_file *fp, u32 access)
> @@ -4042,8 +4074,7 @@ static struct nfs4_file *nfsd4_alloc_file(void)
> }
>
> /* OPEN Share state helper functions */
> -static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
> - struct nfs4_file *fp)
> +static void nfsd4_init_file(struct knfsd_fh *fh, struct nfs4_file *fp)
> {
> lockdep_assert_held(&state_lock);
>
> @@ -4062,7 +4093,6 @@ static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
> INIT_LIST_HEAD(&fp->fi_lo_states);
> atomic_set(&fp->fi_lo_recalls, 0);
> #endif
> - hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
> }
>
> void
> @@ -4126,6 +4156,18 @@ nfsd4_init_slabs(void)
> return -ENOMEM;
> }
>
> +int
> +nfsd4_init_hash(void)
> +{
> + return rhashtable_init(&file_hashtbl, &file_rhashparams);
> +}
> +
> +void
> +nfsd4_free_hash(void)
> +{
> + rhashtable_destroy(&file_hashtbl);
> +}
> +
> static void init_nfs4_replay(struct nfs4_replay *rp)
> {
> rp->rp_status = nfserr_serverfault;
> @@ -4395,30 +4437,19 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
> }
>
> /* search file_hashtbl[] for file */
> -static struct nfs4_file *
> -find_file_locked(struct knfsd_fh *fh, unsigned int hashval)
> -{
> - struct nfs4_file *fp;
> -
> - hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
> - lockdep_is_held(&state_lock)) {
> - if (fh_match(&fp->fi_fhandle, fh)) {
> - if (refcount_inc_not_zero(&fp->fi_ref))
> - return fp;
> - }
> - }
> - return NULL;
> -}
> -
> struct nfs4_file *
> find_file(struct knfsd_fh *fh)
> {
> struct nfs4_file *fp;
> - unsigned int hashval = file_hashval(fh);
>
> rcu_read_lock();
> - fp = find_file_locked(fh, hashval);
> + fp = rhashtable_lookup(&file_hashtbl, fh, file_rhashparams);
> + if (fp) {
> + if (IS_ERR(fp) || !refcount_inc_not_zero(&fp->fi_ref))
> + fp = NULL;
> + }
> rcu_read_unlock();
> +
> return fp;
> }
>
> @@ -4426,22 +4457,27 @@ static struct nfs4_file *
> find_or_add_file(struct nfs4_file *new, struct knfsd_fh *fh)
> {
> struct nfs4_file *fp;
> - unsigned int hashval = file_hashval(fh);
>
> - rcu_read_lock();
> - fp = find_file_locked(fh, hashval);
> - rcu_read_unlock();
> + fp = find_file(fh);
> if (fp)
> return fp;
>
> + nfsd4_init_file(fh, new);
> +
> spin_lock(&state_lock);
> - fp = find_file_locked(fh, hashval);
> - if (likely(fp == NULL)) {
> - nfsd4_init_file(fh, hashval, new);
> +
> + fp = rhashtable_lookup_get_insert_key(&file_hashtbl, &new->fi_fhandle,
> + &new->fi_hash, file_rhashparams);
> + if (likely(fp == NULL))
> fp = new;
> - }
> + else if (IS_ERR(fp))
> + fp = NULL;
> + else
> + refcount_inc(&fp->fi_ref);
> +
> spin_unlock(&state_lock);
>
> +
> return fp;
> }
>
> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> index b68e96681522..bac5d8cff1d3 100644
> --- a/fs/nfsd/nfsctl.c
> +++ b/fs/nfsd/nfsctl.c
> @@ -1528,9 +1528,12 @@ static int __init init_nfsd(void)
> retval = nfsd4_init_slabs();
> if (retval)
> goto out_unregister_notifier;
> - retval = nfsd4_init_pnfs();
> + retval = nfsd4_init_hash();
> if (retval)
> goto out_free_slabs;
> + retval = nfsd4_init_pnfs();
> + if (retval)
> + goto out_free_hash;
> nfsd_fault_inject_init(); /* nfsd fault injection controls */
> nfsd_stat_init(); /* Statistics */
> retval = nfsd_drc_slab_create();
> @@ -1554,6 +1557,8 @@ static int __init init_nfsd(void)
> nfsd_stat_shutdown();
> nfsd_fault_inject_cleanup();
> nfsd4_exit_pnfs();
> +out_free_hash:
> + nfsd4_free_hash();
> out_free_slabs:
> nfsd4_free_slabs();
> out_unregister_notifier:
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 5343c771da18..fb0349d16158 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -141,6 +141,8 @@ nfsd_user_namespace(const struct svc_rqst *rqstp)
> extern unsigned long max_delegations;
> int nfsd4_init_slabs(void);
> void nfsd4_free_slabs(void);
> +int nfsd4_init_hash(void);
> +void nfsd4_free_hash(void);
> int nfs4_state_start(void);
> int nfs4_state_start_net(struct net *net);
> void nfs4_state_shutdown(void);
> @@ -151,6 +153,8 @@ bool nfsd4_spo_must_allow(struct svc_rqst *rqstp);
> #else
> static inline int nfsd4_init_slabs(void) { return 0; }
> static inline void nfsd4_free_slabs(void) { }
> +static inline int nfsd4_init_hash(void) { return 0; }
> +static inline void nfsd4_free_hash(void) { }
> static inline int nfs4_state_start(void) { return 0; }
> static inline int nfs4_state_start_net(struct net *net) { return 0; }
> static inline void nfs4_state_shutdown(void) { }
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 3b408532a5dc..bf66244a7a2d 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -38,6 +38,7 @@
> #include <linux/idr.h>
> #include <linux/refcount.h>
> #include <linux/sunrpc/svc_xprt.h>
> +#include <linux/rhashtable.h>
> #include "nfsfh.h"
> #include "nfsd.h"
>
> @@ -513,7 +514,7 @@ struct nfs4_clnt_odstate {
> struct nfs4_file {
> refcount_t fi_ref;
> spinlock_t fi_lock;
> - struct hlist_node fi_hash; /* hash on fi_fhandle */
> + struct rhash_head fi_hash; /* hash on fi_fhandle */
> struct list_head fi_stateids;
> union {
> struct list_head fi_delegations;
> --
> 2.17.2
>
----- On 17 Sep, 2020, at 22:57, bfields [email protected] wrote:
> On Thu, Sep 17, 2020 at 08:23:03PM +0000, Frank van der Linden wrote:
>> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>> >
>> > On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>> > >
>> > > ----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
>> > >
>> > > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> > > >> second) quickly eat up the CPU on the re-export server and perf top
>> > > >> shows we are mostly in native_queued_spin_lock_slowpath.
>> > > >
>> > > > Any statistics on who's calling that function?
>> > >
>> > > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>> > > the CPU of the nfsd threads increase rapidly and by the time we have 100
>> > > clients, we have maxed out the 32 cores of the server with most of that in
>> > > native_queued_spin_lock_slowpath.
>> >
>> > That sounds a lot like what Frank Van der Linden reported:
>> >
>> > https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>> >
>> > It looks like a bug in the filehandle caching code.
>> >
>> > --b.
>>
>> Yes, that does look like the same one.
>>
>> I still think that not caching v4 files at all may be the best way to go
>> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> where you end up doing a lot of opens/closes, but it doesn't make as
>> much sense for v4.
>>
>> However, short of that, I tested a local patch a few months back, that
>> I never posted here, so I'll do so now. It just makes v4 opens in to
>> 'long term' opens, which do not get put on the LRU, since that doesn't
>> make sense (they are in the hash table, so they are still cached).
>
> That makes sense to me. But I'm also not opposed to turning it off for
> v4 at this point.
>
> --b.
Thank you both, that's absolutely the issue with our (broken) production workload. I totally missed that thread while researching the archives.
I tried both of Frank's patches and the CPU returned to normal levels: native_queued_spin_lock_slowpath went from 88% to 2% usage, and the server performed pretty much the same as it does for an NFSv3 export.
So, ultimately this had nothing to do with NFS re-exporting; it's just that I was using a newer kernel with the filecache to do it. All our other NFSv4 originating servers are running older kernels, which is why our (broken) workload never caused us any problems before. Thanks for clearing that up for me.
With regard to dropping the filecache feature completely for NFSv4, I do wonder whether it still saves a few precious network round-trips (which is especially important for my re-export scenario)? We want to be able to choose the level of caching on the re-export server and minimise expensive lookups to originating servers that may be many milliseconds away (coherency be damned).
Seeing as there was some interest in issue #1 (drop caches = estale re-exports) and this #4 issue (NFSv4 filecache vs ridiculous open/close counts), I'll post some more detail & reproducers next week for #2 (invalidating the re-export server's NFS client cache) and #3 (cached client metadata lookups not returned quickly enough when the client is busy with reads).
That way anyone trying to follow in my (re-exporting) footsteps is fully aware of all the potential performance pitfalls I have discovered so far.
Many thanks,
Daire
Hi,
I just thought I'd flesh out the other two issues I have found with re-exporting that are ultimately responsible for the biggest performance bottlenecks. And both of them revolve around the caching of metadata file lookups in the NFS client.
Especially for the case where we are re-exporting a server many milliseconds away (i.e. on-premise -> cloud), we want to be able to control how much the client caches metadata and file data so that its many LAN clients all benefit from the re-export server only having to do the WAN lookups once (within a specified coherency time).
Keeping the file data in the vfs page cache or on disk using fscache/cachefiles is fairly straightforward, but keeping the metadata cached is particularly difficult. And without the cached metadata we introduce long delays before we can serve the already present and locally cached file data to many waiting clients.
----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected] wrote:
> 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> cut the network packets back to the origin server to zero for repeated lookups.
> However, if a client of the re-export server walks paths and memory maps those
> files (i.e. loading an application), the re-export server starts issuing
> unexpected calls back to the origin server again, ignoring/invalidating the
> re-export server's NFS client cache. We worked around this by patching an
> inode/iversion validity check in inode.c so that the NFS client cache on the
> re-export server is used. I'm not sure about the correctness of this patch but
> it works for our corner case.
If we use actimeo=3600,nocto (say) to mount a remote software volume on the re-export server, we can successfully cache the loading of applications and walking of paths directly on the re-export server such that after a couple of runs, there are practically zero packets back to the originating NFS server (great!). But if we then do the same thing on a client which is mounting that re-export server, the re-export server now starts issuing lots of calls back to the originating server and invalidating its client cache (bad!).
I'm not exactly sure why, but the iversion of the inode gets changed locally (due to atime modification?), most likely via a call to inode_inc_iversion_raw(). Each time it gets incremented, the next attribute validation detects a change, causing the inode to be reloaded from the originating server.
This patch helps to avoid this when applied to the re-export server but there may be other places where this happens too. I accept that this patch is probably not the right/general way to do this, but it helps to highlight the issue when re-exporting and it works well for our use case:
--- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27 00:23:03.000000000 +0000
+++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
@@ -1869,7 +1869,7 @@
/* More cache consistency checks */
if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
- if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
+ if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
/* Could it be a race with writeback? */
if (!(have_writers || have_delegation)) {
invalid |= NFS_INO_INVALID_DATA
With this patch, the re-export server's NFS client attribute cache is maintained and used by all the clients that then mount it. When many hundreds of clients are all doing similar things at the same time, the re-export server's NFS client cache is invaluable in accelerating the lookups (getattrs).
Perhaps a more correct approach would be to detect when it is knfsd that is accessing the client mount and change the cache consistency checks accordingly?
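A minimal sketch of what such a check could look like in fs/nfs/inode.c (nothing here is an existing patch; the helper name and the "is this an nfsd thread?" test are illustrative assumptions only):

/*
 * Hypothetical: keep the strict change-attribute comparison for local
 * access, and only use the relaxed "strictly newer" test when the
 * revalidation is driven by knfsd on a re-export server.
 */
static bool nfs_access_is_knfsd(void)
{
	/* crude illustration: knfsd runs in kernel threads named "nfsd" */
	return (current->flags & PF_KTHREAD) &&
	       strncmp(current->comm, "nfsd", 4) == 0;
}

static bool nfs_change_attr_invalid(struct inode *inode,
				    const struct nfs_fattr *fattr)
{
	if (nfs_access_is_knfsd())
		/* re-export path: only a strictly newer value invalidates */
		return inode_peek_iversion_raw(inode) < fattr->change_attr;
	/* local access keeps the upstream equality check */
	return !inode_eq_iversion_raw(inode, fattr->change_attr);
}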
> 3) If we saturate an NFS client's network with reads from the server, all client
> metadata lookups become unbearably slow even if it's all cached in the NFS
> client's memory and no network RPCs should be required. This is the case for
> any NFS client regardless of re-exporting but it affects this case more because
> when we can't serve cached metadata we also can't serve the cached data. It
> feels like some sort of bottleneck in the client's ability to parallelise
> requests? We work around this by not maxing out our network.
I spent a bit more time testing this issue and it's not quite as I've written it. Again the issue is that we have very little control over preserving complete metadata caches to avoid expensive contact with the originating NFS server. Even though we can use actimeo,nocto mount options, these provide no guarantees that we can keep all the required metadata in cache when the page cache is under constant churn (e.g. NFS reads).
This has very little to do with the re-export of an NFS client mount and is more a general observation of how the NFS client works. It is probably relevant to anyone who wants to cache metadata for long periods of time (e.g. read-only, non-changing, over the WAN).
Let's consider how we might try to keep as much metadata cached in memory....
nfsclient # echo 0 >/proc/sys/vm/vfs_cache_pressure
nfsclient # mount -o vers=3,actimeo=7200,nocto,ro,nolock nfsserver:/usr /mnt/nfsserver
nfsclient # for x in {1..3}; do /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null; sleep 5; done
53.23 <- first time so lots of network traffic
2.82 <- now cached for actimeo=7200 with almost no packets between nfsserver & nfsclient
2.85
This is ideal: as long as we don't touch the page cache, repeated walks of the remote server will all come from cache until the attribute cache times out.
We can even read from the remote server using either direct I/O (O_DIRECT) or fadvise so that we don't upset the client's page cache and so keep the complete metadata cache intact, e.g.
nfsclient # find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'dd if="X" iflag=direct of=/dev/null bs=1M &>/dev/null'
nfsclient # find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'nocache dd if="X" of=/dev/null bs=1M &>/dev/null'
nfsclient # /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null
2.82 <- still showing good complete cached metadata
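For completeness, here is a minimal userspace sketch of the fadvise variant of that trick (roughly what the nocache wrapper above does): read the file normally, then ask the kernel to drop the cached pages again so the file data doesn't push the cached metadata out. Error handling is kept minimal for brevity.

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	static char buf[1 << 20];	/* 1 MiB read buffer */
	ssize_t n;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		;	/* consume the data (checksum it, copy it, etc.) */
	/* drop this file's clean pages so they don't evict anything else */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);
	return 0;
}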
But as soon as we switch to the more normal reading of file data which then populates the page cache, we lose portions of our cached metadata (readdir?) even when there is plenty of RAM available.
nfsclient # find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'dd if="X" of=/dev/null bs=1M &>/dev/null'
nfsclient # /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null
10.82 <- still mostly cached metadata but we had to do some fresh lookups
Now, once our NFS client starts doing lots of sustained reads that max out the network, we end up both dropping useful cached metadata (before actimeo expires) and making it harder to get new metadata lookups back in a timely fashion, because the reads are so much more dominant (and need fewer round trips to get more done).
So if we do the reads and try to do the filesystem walk at the same time, we get even slower performance:
nfsclient # (find /mnt/nfsserver -type f -size +1M -print | shuf | xargs -n1 -P8 -iX bash -c 'dd if="X" of=/dev/null bs=1M &>/dev/null') &
nfsclient # /usr/bin/time -f %e ls -hlR /mnt/nfsserver/share > /dev/null
30.12
As we increase the number of simultaneous threads for the reads (e.g. knfsd threads), the single thread of metadata lookups gets slower and slower.
So even with vfs_cache_pressure=0 (to keep NFS inodes in memory), a large actimeo, and nocto to avoid extra lookups, we still can't keep a complete metadata cache in memory for any specified time when the server is doing lots of reads and churning through the page cache.
So, while I am not able to provide many answers or solutions to any of the issues I have highlighted in this email thread, hopefully I have described in enough detail all the main performance hurdles others will likely run into if they attempt this in production as we have.
And like I said from the outset, it's already stable enough for us to use in production and it's definitely better than nothing... ;)
Regards,
Daire
On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> Hi,
>
> I just thought I'd flesh out the other two issues I have found with
> re-exporting that are ultimately responsible for the biggest
> performance bottlenecks. And both of them revolve around the caching
> of metadata file lookups in the NFS client.
>
> Especially for the case where we are re-exporting a server many
> milliseconds away (i.e. on-premise -> cloud), we want to be able to
> control how much the client caches metadata and file data so that
> it's many LAN clients all benefit from the re-export server only
> having to do the WAN lookups once (within a specified coherency
> time).
>
> Keeping the file data in the vfs page cache or on disk using
> fscache/cachefiles is fairly straightforward, but keeping the
> metadata cached is particularly difficult. And without the cached
> metadata we introduce long delays before we can serve the already
> present and locally cached file data to many waiting clients.
>
> ----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected] wrote:
> > 2) If we cache metadata on the re-export server using
> > actimeo=3600,nocto we can
> > cut the network packets back to the origin server to zero for
> > repeated lookups.
> > However, if a client of the re-export server walks paths and memory
> > maps those
> > files (i.e. loading an application), the re-export server starts
> > issuing
> > unexpected calls back to the origin server again,
> > ignoring/invalidating the
> > re-export server's NFS client cache. We worked around this this by
> > patching an
> > inode/iversion validity check in inode.c so that the NFS client
> > cache on the
> > re-export server is used. I'm not sure about the correctness of
> > this patch but
> > it works for our corner case.
>
> If we use actimeo=3600,nocto (say) to mount a remote software volume
> on the re-export server, we can successfully cache the loading of
> applications and walking of paths directly on the re-export server
> such that after a couple of runs, there are practically zero packets
> back to the originating NFS server (great!). But, if we then do the
> same thing on a client which is mounting that re-export server, the
> re-export server now starts issuing lots of calls back to the
> originating server and invalidating it's client cache (bad!).
>
> I'm not exactly sure why, but the iversion of the inode gets changed
> locally (due to atime modification?) most likely via invocation of
> method inode_inc_iversion_raw. Each time it gets incremented the
> following call to validate attributes detects changes causing it to
> be reloaded from the originating server.
>
> This patch helps to avoid this when applied to the re-export server
> but there may be other places where this happens too. I accept that
> this patch is probably not the right/general way to do this, but it
> helps to highlight the issue when re-exporting and it works well for
> our use case:
>
> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27
> 00:23:03.000000000 +0000
> +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
> @@ -1869,7 +1869,7 @@
>
> /* More cache consistency checks */
> if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> /* Could it be a race with writeback? */
> if (!(have_writers || have_delegation)) {
> invalid |= NFS_INO_INVALID_DATA
There is nothing in the base NFSv4, and NFSv4.1 specs that allow you to
make assumptions about how the change attribute behaves over time.
The only safe way to do something like the above is if the server
supports NFSv4.2 and also advertises support for the 'change_attr_type'
attribute. In that case, you can check at mount time for whether or not
the change attribute on this filesystem is one of the monotonic types
which would allow the above optimisation.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <[email protected]> wrote:
>
> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>>
>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>>>
>>> ----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
>>>
>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>>>>> second) quickly eat up the CPU on the re-export server and perf top
>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>>>>
>>>> Any statistics on who's calling that function?
>>>
>>> I've always struggled to reproduce this with a simple open/close simulation, so I suspect some other operations need to be mixed in too. But I have one production workload that I know has lots of opens & closes (buggy software) included in amongst the usual reads, writes etc.
>>>
>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see the CPU of the nfsd threads increase rapidly and by the time we have 100 clients, we have maxed out the 32 cores of the server with most of that in native_queued_spin_lock_slowpath.
>>
>> That sounds a lot like what Frank Van der Linden reported:
>>
>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>>
>> It looks like a bug in the filehandle caching code.
>>
>> --b.
>
> Yes, that does look like the same one.
>
> I still think that not caching v4 files at all may be the best way to go
> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> where you end up doing a lot of opens/closes, but it doesn't make as
> much sense for v4.
>
> However, short of that, I tested a local patch a few months back, that
> I never posted here, so I'll do so now. It just makes v4 opens in to
> 'long term' opens, which do not get put on the LRU, since that doesn't
> make sense (they are in the hash table, so they are still cached).
>
> Also, the file caching code seems to walk the LRU a little too often,
> but that's another issue - and this change keeps the LRU short, so it's
> not a big deal.
>
> I don't particularly love this patch, but it does keep the LRU short, and
> did significantly speed up my testcase (by about 50%). So, maybe you can
> give it a try.
>
> I'll also attach a second patch, that converts the hash table to an rhashtable,
> which automatically grows and shrinks in size with usage. That patch also
> helped, but not by nearly as much (I think it yielded another 10%).
For what it's worth, I applied your two patches to my test server, along
with my patch that force-closes cached file descriptors during NFSv4
CLOSE processing. The patch combination improves performance (faster
elapsed time) for my workload as well.
--
Chuck Lever
On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > Hi,
> >
> > I just thought I'd flesh out the other two issues I have found with
> > re-exporting that are ultimately responsible for the biggest
> > performance bottlenecks. And both of them revolve around the caching
> > of metadata file lookups in the NFS client.
> >
> > Especially for the case where we are re-exporting a server many
> > milliseconds away (i.e. on-premise -> cloud), we want to be able to
> > control how much the client caches metadata and file data so that
> > it's many LAN clients all benefit from the re-export server only
> > having to do the WAN lookups once (within a specified coherency
> > time).
> >
> > Keeping the file data in the vfs page cache or on disk using
> > fscache/cachefiles is fairly straightforward, but keeping the
> > metadata cached is particularly difficult. And without the cached
> > metadata we introduce long delays before we can serve the already
> > present and locally cached file data to many waiting clients.
> >
> > ----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected] wrote:
> > > 2) If we cache metadata on the re-export server using
> > > actimeo=3600,nocto we can
> > > cut the network packets back to the origin server to zero for
> > > repeated lookups.
> > > However, if a client of the re-export server walks paths and memory
> > > maps those
> > > files (i.e. loading an application), the re-export server starts
> > > issuing
> > > unexpected calls back to the origin server again,
> > > ignoring/invalidating the
> > > re-export server's NFS client cache. We worked around this this by
> > > patching an
> > > inode/iversion validity check in inode.c so that the NFS client
> > > cache on the
> > > re-export server is used. I'm not sure about the correctness of
> > > this patch but
> > > it works for our corner case.
> >
> > If we use actimeo=3600,nocto (say) to mount a remote software volume
> > on the re-export server, we can successfully cache the loading of
> > applications and walking of paths directly on the re-export server
> > such that after a couple of runs, there are practically zero packets
> > back to the originating NFS server (great!). But, if we then do the
> > same thing on a client which is mounting that re-export server, the
> > re-export server now starts issuing lots of calls back to the
> > originating server and invalidating it's client cache (bad!).
> >
> > I'm not exactly sure why, but the iversion of the inode gets changed
> > locally (due to atime modification?) most likely via invocation of
> > method inode_inc_iversion_raw. Each time it gets incremented the
> > following call to validate attributes detects changes causing it to
> > be reloaded from the originating server.
> >
> > This patch helps to avoid this when applied to the re-export server
> > but there may be other places where this happens too. I accept that
> > this patch is probably not the right/general way to do this, but it
> > helps to highlight the issue when re-exporting and it works well for
> > our use case:
> >
> > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27
> > 00:23:03.000000000 +0000
> > +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
> > @@ -1869,7 +1869,7 @@
> >
> > /* More cache consistency checks */
> > if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> > /* Could it be a race with writeback? */
> > if (!(have_writers || have_delegation)) {
> > invalid |= NFS_INO_INVALID_DATA
>
>
> There is nothing in the base NFSv4, and NFSv4.1 specs that allow you to
> make assumptions about how the change attribute behaves over time.
>
> The only safe way to do something like the above is if the server
> supports NFSv4.2 and also advertises support for the 'change_attr_type'
> attribute. In that case, you can check at mount time for whether or not
> the change attribute on this filesystem is one of the monotonic types
> which would allow the above optimisation.
Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 ... I think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED?
The Linux server's ctime is monotonic and will advertise that with
change_attr_type since 4.19.
So I think it would be easy to patch the client to check
change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
server->caps, the hard part would be figuring out which optimisations
are OK.
--b.
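As a rough sketch of that idea (NFS_CAP_MONOTONIC_CHANGE is the flag proposed above, not an existing capability bit; the enum values simply follow the RFC 7862 names):

static void nfs4_note_change_attr_type(struct nfs_server *server,
				       unsigned int change_attr_type)
{
	switch (change_attr_type) {
	case NFS4_CHANGE_TYPE_IS_UNDEFINED:
	case NFS4_CHANGE_TYPE_IS_TIME_METADATA:
		/* may regress (clock steps, reboots): keep the strict checks */
		break;
	default:
		/* counter-like change attribute: "newer than" comparisons are safe */
		server->caps |= NFS_CAP_MONOTONIC_CHANGE;
	}
}

Optimisations such as the "<" comparison in the earlier inode.c patch could then be gated on that flag.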
On Wed, 2020-09-23 at 08:40 -0400, J. Bruce Fields wrote:
> On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > [...]
> >
> > There is nothing in the base NFSv4, and NFSv4.1 specs that allow
> > you to
> > make assumptions about how the change attribute behaves over time.
> >
> > The only safe way to do something like the above is if the server
> > supports NFSv4.2 and also advertises support for the
> > 'change_attr_type'
> > attribute. In that case, you can check at mount time for whether or
> > not
> > the change attribute on this filesystem is one of the monotonic
> > types
> > which would allow the above optimisation.
>
> Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 .... I
> think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED ?
>
> The Linux server's ctime is monotonic and will advertise that with
> change_attr_type since 4.19.
>
> So I think it would be easy to patch the client to check
> change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
> server->caps, the hard part would be figuring out which optimisations
> are OK.
>
The ctime is *not* monotonic. It can regress under server reboots and
it can regress if someone deliberately changes the time. We have code
that tries to handle all these issues (see fattr->gencount and nfsi->attr_gencount) because we've hit those issues before...
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed, Sep 23, 2020 at 01:09:01PM +0000, Trond Myklebust wrote:
> On Wed, 2020-09-23 at 08:40 -0400, J. Bruce Fields wrote:
> > On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > [...]
> > >
> > > There is nothing in the base NFSv4, and NFSv4.1 specs that allow
> > > you to
> > > make assumptions about how the change attribute behaves over time.
> > >
> > > The only safe way to do something like the above is if the server
> > > supports NFSv4.2 and also advertises support for the
> > > 'change_attr_type'
> > > attribute. In that case, you can check at mount time for whether or
> > > not
> > > the change attribute on this filesystem is one of the monotonic
> > > types
> > > which would allow the above optimisation.
> >
> > Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 .... I
> > think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED ?
> >
> > The Linux server's ctime is monotonic and will advertise that with
> > change_attr_type since 4.19.
> >
> > So I think it would be easy to patch the client to check
> > change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
> > server->caps, the hard part would be figuring out which optimisations
> > are OK.
> >
>
> The ctime is *not* monotonic. It can regress under server reboots and
> it can regress if someone deliberately changes the time.
So, anything other than IS_UNDEFINED or IS_TIME_METADATA?
Though the Linux server is susceptible to some of that even when it
returns MONOTONIC_INCR. If the admin replaces the filesystem with an older
snapshot, there's not much we can do. I'm not sure what degree of
guarantee we need.
--b.
> We have code
> that tries to handle all these issues (see fattr->gencount and
> nfsi->attr_gencount) because we've hit those issues before...
----- On 22 Sep, 2020, at 17:43, Chuck Lever [email protected] wrote:
>> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <[email protected]> wrote:
>>
>> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>>>
>>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>>>>
>>>> ----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
>>>>
>>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>>>>>> second) quickly eat up the CPU on the re-export server and perf top
>>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>>>>>
>>>>> Any statistics on who's calling that function?
>>>>
>>>> I've always struggled to reproduce this with a simple open/close simulation, so
>>>> I suspect some other operations need to be mixed in too. But I have one
>>>> production workload that I know has lots of opens & closes (buggy software)
>>>> included in amongst the usual reads, writes etc.
>>>>
>>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>>>> the CPU of the nfsd threads increase rapidly and by the time we have 100
>>>> clients, we have maxed out the 32 cores of the server with most of that in
>>>> native_queued_spin_lock_slowpath.
>>>
>>> That sounds a lot like what Frank Van der Linden reported:
>>>
>>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>>>
>>> It looks like a bug in the filehandle caching code.
>>>
>>> --b.
>>
>> Yes, that does look like the same one.
>>
>> I still think that not caching v4 files at all may be the best way to go
>> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> where you end up doing a lot of opens/closes, but it doesn't make as
>> much sense for v4.
>>
>> However, short of that, I tested a local patch a few months back, that
>> I never posted here, so I'll do so now. It just makes v4 opens in to
>> 'long term' opens, which do not get put on the LRU, since that doesn't
>> make sense (they are in the hash table, so they are still cached).
>>
>> Also, the file caching code seems to walk the LRU a little too often,
>> but that's another issue - and this change keeps the LRU short, so it's
>> not a big deal.
>>
>> I don't particularly love this patch, but it does keep the LRU short, and
>> did significantly speed up my testcase (by about 50%). So, maybe you can
>> give it a try.
>>
>> I'll also attach a second patch, that converts the hash table to an rhashtable,
>> which automatically grows and shrinks in size with usage. That patch also
>> helped, but not by nearly as much (I think it yielded another 10%).
>
> For what it's worth, I applied your two patches to my test server, along
> with my patch that force-closes cached file descriptors during NFSv4
> CLOSE processing. The patch combination improves performance (faster
> elapsed time) for my workload as well.
I tested Frank's NFSv4 filecache patches with some production workloads and I've hit the below refcount issue a couple of times in the last 48 hours with v5.8.10. This server was re-exporting an NFS client mount at the time.
Apologies for the spam if I've just hit something unrelated to the patches that is present in v5.8.10.... In truth, I have not used this kernel version before with this workload and just patched it because I had it ready to go. I'll remove the 2 patches and verify.
Daire
[ 8930.027838] ------------[ cut here ]------------
[ 8930.032769] refcount_t: addition on 0; use-after-free.
[ 8930.038251] WARNING: CPU: 2 PID: 3624 at lib/refcount.c:25 refcount_warn_saturate+0x6e/0xf0
[ 8930.046799] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4 dns_resolver act_mirred sch_ingress ifb nfsv3 nfs cls_u32 sch_fq sch_prio cachefiles fscache ext4 mbcache jbd2 sb_edac rapl sg virtio_rng i2c_piix4 input_leds nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs libcrc32c sd_mod t10_pi 8021q garp mrp virtio_net net_failover failover virtio_scsi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel scsi_transport_iscsi crypto_simd cryptd glue_helper virtio_pci virtio_ring virtio serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
[ 8930.098703] CPU: 2 PID: 3624 Comm: nfsd Tainted: G W 5.8.10-1.dneg.x86_64 #1
[ 8930.107391] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 8930.116775] RIP: 0010:refcount_warn_saturate+0x6e/0xf0
[ 8930.122078] Code: 49 91 18 01 01 e8 57 d6 c2 ff 0f 0b 5d c3 80 3d 38 91 18 01 00 75 d1 48 c7 c7 d0 5c 13 82 c6 05 28 91 18 01 01 e8 37 d6 c2 ff <0f> 0b 5d c3 80 3d 1a 91 18 01 00 75 b1 48 c7 c7 a8 5c 13 82 c6 05
[ 8930.141107] RSP: 0018:ffffc900012efc70 EFLAGS: 00010282
[ 8930.146497] RAX: 0000000000000000 RBX: ffff888cc12811e0 RCX: 0000000000000000
[ 8930.153793] RDX: ffff888d0bca8f20 RSI: ffff888d0bc98d40 RDI: ffff888d0bc98d40
[ 8930.161087] RBP: ffffc900012efc70 R08: ffff888d0bc98d40 R09: 0000000000000019
[ 8930.168380] R10: 000000000000072e R11: ffffc900012efad8 R12: ffff888b8bdad600
[ 8930.175680] R13: ffff888cd428ebe0 R14: ffff8889264f9170 R15: 0000000000000000
[ 8930.182976] FS: 0000000000000000(0000) GS:ffff888d0bc80000(0000) knlGS:0000000000000000
[ 8930.191231] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8930.197139] CR2: 00007fbe43ca1248 CR3: 0000000ce48ee004 CR4: 00000000001606e0
[ 8930.204436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8930.211734] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8930.219027] Call Trace:
[ 8930.221665] nfsd4_process_open2+0xa48/0xec0 [nfsd]
[ 8930.226724] ? nfsd_permission+0x6b/0x100 [nfsd]
[ 8930.231524] ? fh_verify+0x167/0x210 [nfsd]
[ 8930.235893] nfsd4_open+0x407/0x820 [nfsd]
[ 8930.240248] nfsd4_proc_compound+0x3c2/0x760 [nfsd]
[ 8930.245296] ? nfsd4_decode_compound.constprop.0+0x3a9/0x450 [nfsd]
[ 8930.251734] nfsd_dispatch+0xe2/0x220 [nfsd]
[ 8930.256213] svc_process_common+0x47b/0x6f0 [sunrpc]
[ 8930.261355] ? svc_sock_secure_port+0x16/0x30 [sunrpc]
[ 8930.266707] ? nfsd_svc+0x330/0x330 [nfsd]
[ 8930.270981] svc_process+0xc5/0x100 [sunrpc]
[ 8930.275423] nfsd+0xe8/0x150 [nfsd]
[ 8930.280028] kthread+0x114/0x150
[ 8930.283434] ? nfsd_destroy+0x60/0x60 [nfsd]
[ 8930.287875] ? kthread_park+0x90/0x90
[ 8930.291700] ret_from_fork+0x22/0x30
[ 8930.295447] ---[ end trace c551536c3520545c ]---
On Wed, Sep 23, 2020 at 09:25:07PM +0100, Daire Byrne wrote:
>
> ----- On 22 Sep, 2020, at 17:43, Chuck Lever [email protected] wrote:
> >> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <[email protected]> wrote:
> >>
> >> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
> >>>
> >>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
> >>>>
> >>>> ----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
> >>>>
> >>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
> >>>>>> second) quickly eat up the CPU on the re-export server and perf top
> >>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
> >>>>>
> >>>>> Any statistics on who's calling that function?
> >>>>
> >>>> I've always struggled to reproduce this with a simple open/close simulation, so
> >>>> I suspect some other operations need to be mixed in too. But I have one
> >>>> production workload that I know has lots of opens & closes (buggy software)
> >>>> included in amongst the usual reads, writes etc.
> >>>>
> >>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
> >>>> the CPU of the nfsd threads increase rapidly and by the time we have 100
> >>>> clients, we have maxed out the 32 cores of the server with most of that in
> >>>> native_queued_spin_lock_slowpath.
> >>>
> >>> That sounds a lot like what Frank Van der Linden reported:
> >>>
> >>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> >>>
> >>> It looks like a bug in the filehandle caching code.
> >>>
> >>> --b.
> >>
> >> Yes, that does look like the same one.
> >>
> >> I still think that not caching v4 files at all may be the best way to go
> >> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> >> where you end up doing a lot of opens/closes, but it doesn't make as
> >> much sense for v4.
> >>
> >> However, short of that, I tested a local patch a few months back, that
> >> I never posted here, so I'll do so now. It just makes v4 opens in to
> >> 'long term' opens, which do not get put on the LRU, since that doesn't
> >> make sense (they are in the hash table, so they are still cached).
> >>
> >> Also, the file caching code seems to walk the LRU a little too often,
> >> but that's another issue - and this change keeps the LRU short, so it's
> >> not a big deal.
> >>
> >> I don't particularly love this patch, but it does keep the LRU short, and
> >> did significantly speed up my testcase (by about 50%). So, maybe you can
> >> give it a try.
> >>
> >> I'll also attach a second patch, that converts the hash table to an rhashtable,
> >> which automatically grows and shrinks in size with usage. That patch also
> >> helped, but not by nearly as much (I think it yielded another 10%).
> >
> > For what it's worth, I applied your two patches to my test server, along
> > with my patch that force-closes cached file descriptors during NFSv4
> > CLOSE processing. The patch combination improves performance (faster
> > elapsed time) for my workload as well.
>
> I tested Frank's NFSv4 filecache patches with some production workloads and I've hit the below refcount issue a couple of times in the last 48 hours with v5.8.10. This server was re-exporting an NFS client mount at the time.
>
> Apologies for the spam if I've just hit something unrelated to the patches that is present in v5.8.10.... In truth, I have not used this kernel version before with this workload and just patched it because I had it ready to go. I'll remove the 2 patches and verify.
>
> Daire
>
>
> [ 8930.027838] ------------[ cut here ]------------
> [ 8930.032769] refcount_t: addition on 0; use-after-free.
> [ 8930.038251] WARNING: CPU: 2 PID: 3624 at lib/refcount.c:25 refcount_warn_saturate+0x6e/0xf0
> [ 8930.046799] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4 dns_resolver act_mirred sch_ingress ifb nfsv3 nfs cls_u32 sch_fq sch_prio cachefiles fscache ext4 mbcache jbd2 sb_edac rapl sg virtio_rng i2c_piix4 input_leds nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc ip_tables xfs libcrc32c sd_mod t10_pi 8021q garp mrp virtio_net net_failover failover virtio_scsi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel scsi_transport_iscsi crypto_simd cryptd glue_helper virtio_pci virtio_ring virtio serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
> [ 8930.098703] CPU: 2 PID: 3624 Comm: nfsd Tainted: G W 5.8.10-1.dneg.x86_64 #1
> [ 8930.107391] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> [ 8930.116775] RIP: 0010:refcount_warn_saturate+0x6e/0xf0
> [ 8930.122078] Code: 49 91 18 01 01 e8 57 d6 c2 ff 0f 0b 5d c3 80 3d 38 91 18 01 00 75 d1 48 c7 c7 d0 5c 13 82 c6 05 28 91 18 01 01 e8 37 d6 c2 ff <0f> 0b 5d c3 80 3d 1a 91 18 01 00 75 b1 48 c7 c7 a8 5c 13 82 c6 05
> [ 8930.141107] RSP: 0018:ffffc900012efc70 EFLAGS: 00010282
> [ 8930.146497] RAX: 0000000000000000 RBX: ffff888cc12811e0 RCX: 0000000000000000
> [ 8930.153793] RDX: ffff888d0bca8f20 RSI: ffff888d0bc98d40 RDI: ffff888d0bc98d40
> [ 8930.161087] RBP: ffffc900012efc70 R08: ffff888d0bc98d40 R09: 0000000000000019
> [ 8930.168380] R10: 000000000000072e R11: ffffc900012efad8 R12: ffff888b8bdad600
> [ 8930.175680] R13: ffff888cd428ebe0 R14: ffff8889264f9170 R15: 0000000000000000
> [ 8930.182976] FS: 0000000000000000(0000) GS:ffff888d0bc80000(0000) knlGS:0000000000000000
> [ 8930.191231] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8930.197139] CR2: 00007fbe43ca1248 CR3: 0000000ce48ee004 CR4: 00000000001606e0
> [ 8930.204436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 8930.211734] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 8930.219027] Call Trace:
> [ 8930.221665] nfsd4_process_open2+0xa48/0xec0 [nfsd]
> [ 8930.226724] ? nfsd_permission+0x6b/0x100 [nfsd]
> [ 8930.231524] ? fh_verify+0x167/0x210 [nfsd]
> [ 8930.235893] nfsd4_open+0x407/0x820 [nfsd]
> [ 8930.240248] nfsd4_proc_compound+0x3c2/0x760 [nfsd]
> [ 8930.245296] ? nfsd4_decode_compound.constprop.0+0x3a9/0x450 [nfsd]
> [ 8930.251734] nfsd_dispatch+0xe2/0x220 [nfsd]
> [ 8930.256213] svc_process_common+0x47b/0x6f0 [sunrpc]
> [ 8930.261355] ? svc_sock_secure_port+0x16/0x30 [sunrpc]
> [ 8930.266707] ? nfsd_svc+0x330/0x330 [nfsd]
> [ 8930.270981] svc_process+0xc5/0x100 [sunrpc]
> [ 8930.275423] nfsd+0xe8/0x150 [nfsd]
> [ 8930.280028] kthread+0x114/0x150
> [ 8930.283434] ? nfsd_destroy+0x60/0x60 [nfsd]
> [ 8930.287875] ? kthread_park+0x90/0x90
> [ 8930.291700] ret_from_fork+0x22/0x30
> [ 8930.295447] ---[ end trace c551536c3520545c ]---
It's entirely possible that my patch introduces a refcounting error - it was
intended as a proof-of-concept on how to fix the LRU locking issue for v4
open file caching (while keeping it enabled) - which is why I didn't
"formally" send it in.
Having said that, I don't immediately see the problem.
Maybe try it without the rhashtable patch, that is much less of an
optimization.
The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
of nfs4_file. If it's the latter, it's probably the rhashtable change.
- Frank
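For context, a sketch (reusing the names from the rhashtable patch above, but illustrative only) of the lookup pattern such a cache needs: an object found under rcu_read_lock() may already be on its way to being freed, so the lookup has to take its reference with refcount_inc_not_zero() and treat failure as a miss. A reference taken or dropped incorrectly on the lookup/insert paths is exactly the kind of bug that produces the "refcount_t: addition on 0; use-after-free" warning quoted above.

static struct nfs4_file *
find_file_sketch(struct knfsd_fh *fh)
{
	struct nfs4_file *fp;

	rcu_read_lock();
	fp = rhashtable_lookup(&file_hashtbl, fh, file_rhashparams);
	if (fp && !refcount_inc_not_zero(&fp->fi_ref))
		fp = NULL;	/* being torn down: behave as if not found */
	rcu_read_unlock();
	return fp;
}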
----- On 23 Sep, 2020, at 22:01, Frank van der Linden [email protected] wrote:
> On Wed, Sep 23, 2020 at 09:25:07PM +0100, Daire Byrne wrote:
>>
>> ----- On 22 Sep, 2020, at 17:43, Chuck Lever [email protected] wrote:
>> >> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <[email protected]> wrote:
>> >>
>> >> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>> >>>
>> >>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>> >>>>
>> >>>> ----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
>> >>>>
>> >>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> >>>>>> second) quickly eat up the CPU on the re-export server and perf top
>> >>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>> >>>>>
>> >>>>> Any statistics on who's calling that function?
>> >>>>
>> >>>> I've always struggled to reproduce this with a simple open/close simulation, so
>> >>>> I suspect some other operations need to be mixed in too. But I have one
>> >>>> production workload that I know has lots of opens & closes (buggy software)
>> >>>> included in amongst the usual reads, writes etc.
>> >>>>
>> >>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>> >>>> the CPU of the nfsd threads increase rapidly and by the time we have 100
>> >>>> clients, we have maxed out the 32 cores of the server with most of that in
>> >>>> native_queued_spin_lock_slowpath.
>> >>>
>> >>> That sounds a lot like what Frank Van der Linden reported:
>> >>>
>> >>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>> >>>
>> >>> It looks like a bug in the filehandle caching code.
>> >>>
>> >>> --b.
>> >>
>> >> Yes, that does look like the same one.
>> >>
>> >> I still think that not caching v4 files at all may be the best way to go
>> >> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> >> where you end up doing a lot of opens/closes, but it doesn't make as
>> >> much sense for v4.
>> >>
>> >> However, short of that, I tested a local patch a few months back, that
>> >> I never posted here, so I'll do so now. It just makes v4 opens in to
>> >> 'long term' opens, which do not get put on the LRU, since that doesn't
>> >> make sense (they are in the hash table, so they are still cached).
>> >>
>> >> Also, the file caching code seems to walk the LRU a little too often,
>> >> but that's another issue - and this change keeps the LRU short, so it's
>> >> not a big deal.
>> >>
>> >> I don't particularly love this patch, but it does keep the LRU short, and
>> >> did significantly speed up my testcase (by about 50%). So, maybe you can
>> >> give it a try.
>> >>
>> >> I'll also attach a second patch, that converts the hash table to an rhashtable,
>> >> which automatically grows and shrinks in size with usage. That patch also
>> >> helped, but not by nearly as much (I think it yielded another 10%).
>> >
>> > For what it's worth, I applied your two patches to my test server, along
>> > with my patch that force-closes cached file descriptors during NFSv4
>> > CLOSE processing. The patch combination improves performance (faster
>> > elapsed time) for my workload as well.
>>
>> I tested Frank's NFSv4 filecache patches with some production workloads and I've
>> hit the below refcount issue a couple of times in the last 48 hours with
>> v5.8.10. This server was re-exporting an NFS client mount at the time.
>>
>> Apologies for the spam if I've just hit something unrelated to the patches that
>> is present in v5.8.10.... In truth, I have not used this kernel version before
>> with this workload and just patched it because I had it ready to go. I'll
>> remove the 2 patches and verify.
>>
>> Daire
>>
>>
>> [...]
>
> It's entirely possible that my patch introduces a refcounting error - it was
> intended as a proof-of-concept on how to fix the LRU locking issue for v4
> open file caching (while keeping it enabled) - which is why I didn't
> "formally" send it in.
>
> Having said that, I don't immediately see the problem.
>
> Maybe try it without the rhashtable patch, that is much less of an
> optimization.
>
> The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
> of nfs4_file. If it's the latter, it's probably the rhashtable change.
Thanks Frank; I think you are right that it's a problem with the rhashtable patch. After another 48 hours running the same workload with just the main patch, I have not seen the issue again so far.
Also, it still has the effect of reducing the CPU usage dramatically such that there are plenty of cores still left idle. This is actually helping us buy some more time while we fix our obviously broken software so that it doesn't open/close so crazily.
So, many thanks for that.
Daire
On Sat, Sep 26, 2020 at 10:00:22AM +0100, Daire Byrne wrote:
>
>
> ----- On 23 Sep, 2020, at 22:01, Frank van der Linden [email protected] wrote:
> > It's entirely possible that my patch introduces a refcounting error - it was
> > intended as a proof-of-concept on how to fix the LRU locking issue for v4
> > open file caching (while keeping it enabled) - which is why I didn't
> > "formally" send it in.
> >
> > Having said that, I don't immediately see the problem.
> >
> > Maybe try it without the rhashtable patch, that is much less of an
> > optimization.
> >
> > The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
> > of nfs4_file. If it's the latter, it's probably the rhashtable change.
>
> Thanks Frank; I think you are right in that it seems to be a problem with the rhashtable patch. Another 48 hours using the same workload with just the main patch and I have not seen the same issue again so far.
>
> Also, it still has the effect of reducing the CPU usage dramatically such that there are plenty of cores still left idle. This is actually helping us buy some more time while we fix our obviously broken software so that it doesn't open/close so crazily.
>
> So, many thanks for that.
Cool. I'm glad the "don't put v4 files on the LRU list" patch works as intended for
you. The rhashtable patch was more of an afterthought, and obviously has an
issue. It did provide some extra gains, so I'll see if I can find the problem
if I get some time.
Bruce - if you want me to 'formally' submit a version of the patch, let me
know. Just disabling the cache for v4, which comes down to reverting a few
commits, is probably simpler - I'd be able to test that too.
- Frank
> On Sep 28, 2020, at 11:49 AM, Frank van der Linden <[email protected]> wrote:
>
> Bruce - if you want me to 'formally' submit a version of the patch, let me
> know. Just disabling the cache for v4, which comes down to reverting a few
> commits, is probably simpler - I'd be able to test that too.
I'd be interested in seeing that. From what I saw, the mechanics of
unhooking the cache from NFSv4 simply involve reverting patches, but
there appear to be some recent changes that depend on the open
filecache that might be difficult to deal with, like
b66ae6dd0c30 ("nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()")
--
Chuck Lever
On Mon, Sep 28, 2020 at 12:08:09PM -0400, Chuck Lever wrote:
>
>
> > On Sep 28, 2020, at 11:49 AM, Frank van der Linden <[email protected]> wrote:
> >
> > Bruce - if you want me to 'formally' submit a version of the patch, let me
> > know. Just disabling the cache for v4, which comes down to reverting a few
> > commits, is probably simpler - I'd be able to test that too.
>
> I'd be interested in seeing that. From what I saw, the mechanics of
> unhooking the cache from NFSv4 simply involve reverting patches, but
> there appear to be some recent changes that depend on the open
> filecache that might be difficult to deal with, like
>
> b66ae6dd0c30 ("nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()")
Hm, yes, I missed nf_rwsem being added to the struct.
Probably easier to keep nfsd_file, and have v4 use just straight alloc/free
functions for it that don't touch the cache at all.
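A rough sketch of that idea (purely hypothetical helper names, and none of
the locking or hashing the real filecache needs) might look something like
this, with NFSv4 opens getting a bare nfsd_file that never enters the
filecache hash or LRU:

/* Hypothetical sketch only, not actual kernel code: the v4 open state
 * would hold the only reference, so the final put just frees it. */
static struct nfsd_file *nfsd_file_alloc_v4(struct file *file)
{
        struct nfsd_file *nf;

        nf = kzalloc(sizeof(*nf), GFP_KERNEL);
        if (!nf)
                return NULL;
        refcount_set(&nf->nf_ref, 1);
        nf->nf_file = file;
        return nf;
}

static void nfsd_file_put_v4(struct nfsd_file *nf)
{
        if (!refcount_dec_and_test(&nf->nf_ref))
                return;
        if (nf->nf_file)
                fput(nf->nf_file);
        kfree(nf);
}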
- Frank
On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> Hi,
>
> I just thought I'd flesh out the other two issues I have found with re-exporting that are ultimately responsible for the biggest performance bottlenecks. And both of them revolve around the caching of metadata file lookups in the NFS client.
>
> Especially for the case where we are re-exporting a server many milliseconds away (i.e. on-premise -> cloud), we want to be able to control how much the client caches metadata and file data so that its many LAN clients all benefit from the re-export server only having to do the WAN lookups once (within a specified coherency time).
>
> Keeping the file data in the vfs page cache or on disk using fscache/cachefiles is fairly straightforward, but keeping the metadata cached is particularly difficult. And without the cached metadata we introduce long delays before we can serve the already present and locally cached file data to many waiting clients.
>
> ----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected] wrote:
> > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> > cut the network packets back to the origin server to zero for repeated lookups.
> > However, if a client of the re-export server walks paths and memory maps those
> > files (i.e. loading an application), the re-export server starts issuing
> > unexpected calls back to the origin server again, ignoring/invalidating the
> > re-export server's NFS client cache. We worked around this by patching an
> > inode/iversion validity check in inode.c so that the NFS client cache on the
> > re-export server is used. I'm not sure about the correctness of this patch but
> > it works for our corner case.
>
> If we use actimeo=3600,nocto (say) to mount a remote software volume on the re-export server, we can successfully cache the loading of applications and walking of paths directly on the re-export server such that after a couple of runs, there are practically zero packets back to the originating NFS server (great!). But, if we then do the same thing on a client which is mounting that re-export server, the re-export server now starts issuing lots of calls back to the originating server and invalidating its client cache (bad!).
>
> I'm not exactly sure why, but the iversion of the inode gets changed locally (due to atime modification?) most likely via invocation of method inode_inc_iversion_raw. Each time it gets incremented the following call to validate attributes detects changes causing it to be reloaded from the originating server.
>
I'd expect the change attribute to track what's in the actual inode on the
"home" server. The NFS client is supposed to (mostly) keep the raw
change attribute in its i_version field.
The only place we call inode_inc_iversion_raw is in
nfs_inode_add_request, which I don't think you'd be hitting unless you
were writing to the file while holding a write delegation.
What sort of server is hosting the actual data in your setup?
> This patch helps to avoid this when applied to the re-export server but there may be other places where this happens too. I accept that this patch is probably not the right/general way to do this, but it helps to highlight the issue when re-exporting and it works well for our use case:
>
> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27 00:23:03.000000000 +0000
> +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
> @@ -1869,7 +1869,7 @@
>
> /* More cache consistency checks */
> if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> /* Could it be a race with writeback? */
> if (!(have_writers || have_delegation)) {
> invalid |= NFS_INO_INVALID_DATA
>
> With this patch, the re-export server's NFS client attribute cache is maintained and used by all the clients that then mount it. When many hundreds of clients are all doing similar things at the same time, the re-export server's NFS client cache is invaluable in accelerating the lookups (getattrs).
>
> Perhaps a more correct approach would be to detect when it is knfsd that is accessing the client mount and change the cache consistency checks accordingly?
Yeah, I don't think you can do this for the reasons Trond outlined.
--
Jeff Layton <[email protected]>
----- On 30 Sep, 2020, at 20:30, Jeff Layton [email protected] wrote:
> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>> Hi,
>>
>> I just thought I'd flesh out the other two issues I have found with re-exporting
>> that are ultimately responsible for the biggest performance bottlenecks. And
>> both of them revolve around the caching of metadata file lookups in the NFS
>> client.
>>
>> Especially for the case where we are re-exporting a server many milliseconds
>> away (i.e. on-premise -> cloud), we want to be able to control how much the
>> client caches metadata and file data so that it's many LAN clients all benefit
>> from the re-export server only having to do the WAN lookups once (within a
>> specified coherency time).
>>
>> Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>> is fairly straightforward, but keeping the metadata cached is particularly
>> difficult. And without the cached metadata we introduce long delays before we
>> can serve the already present and locally cached file data to many waiting
>> clients.
>>
>> ----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected] wrote:
>> > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> > cut the network packets back to the origin server to zero for repeated lookups.
>> > However, if a client of the re-export server walks paths and memory maps those
>> > files (i.e. loading an application), the re-export server starts issuing
>> > unexpected calls back to the origin server again, ignoring/invalidating the
>> > re-export server's NFS client cache. We worked around this this by patching an
>> > inode/iversion validity check in inode.c so that the NFS client cache on the
>> > re-export server is used. I'm not sure about the correctness of this patch but
>> > it works for our corner case.
>>
>> If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>> re-export server, we can successfully cache the loading of applications and
>> walking of paths directly on the re-export server such that after a couple of
>> runs, there are practically zero packets back to the originating NFS server
>> (great!). But, if we then do the same thing on a client which is mounting that
>> re-export server, the re-export server now starts issuing lots of calls back to
>> the originating server and invalidating it's client cache (bad!).
>>
>> I'm not exactly sure why, but the iversion of the inode gets changed locally
>> (due to atime modification?) most likely via invocation of method
>> inode_inc_iversion_raw. Each time it gets incremented the following call to
>> validate attributes detects changes causing it to be reloaded from the
>> originating server.
>>
>
> I'd expect the change attribute to track what's in actual inode on the
> "home" server. The NFS client is supposed to (mostly) keep the raw
> change attribute in its i_version field.
>
> The only place we call inode_inc_iversion_raw is in
> nfs_inode_add_request, which I don't think you'd be hitting unless you
> were writing to the file while holding a write delegation.
>
> What sort of server is hosting the actual data in your setup?
We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of (older) Netapps too. The re-export server is running the latest mainline kernel(s).
As far as I can make out, both these originating (home) server types exhibit a similar (but not exactly the same) effect on the Linux NFS client cache when it is being re-exported and accessed by other clients. I can replicate it when only using a read-only mount at every hop so I don't think that writes are related.
Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates that might be causing this client invalidation (which is what I initially thought) are ultimately a wasted effort.
>> This patch helps to avoid this when applied to the re-export server but there
>> may be other places where this happens too. I accept that this patch is
>> probably not the right/general way to do this, but it helps to highlight the
>> issue when re-exporting and it works well for our use case:
>>
>> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27 00:23:03.000000000
>> +0000
>> +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
>> @@ -1869,7 +1869,7 @@
>>
>> /* More cache consistency checks */
>> if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
>> - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>> + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>> /* Could it be a race with writeback? */
>> if (!(have_writers || have_delegation)) {
>> invalid |= NFS_INO_INVALID_DATA
>>
>> With this patch, the re-export server's NFS client attribute cache is maintained
>> and used by all the clients that then mount it. When many hundreds of clients
>> are all doing similar things at the same time, the re-export server's NFS
>> client cache is invaluable in accelerating the lookups (getattrs).
>>
>> Perhaps a more correct approach would be to detect when it is knfsd that is
>> accessing the client mount and change the cache consistency checks accordingly?
>
> Yeah, I don't think you can do this for the reasons Trond outlined.
Yea, I kind of felt like it wasn't quite right, but I didn't know enough about the intricacies to say why exactly. So thanks to everyone for clearing that up for me.
We just followed the code and found that the re-export server spent a lot of time in this code block, when we had assumed that we should be able to serve the same read-only metadata requests to multiple clients out of the re-export server's NFS client cache. I guess the patch was more for us to see if we could (incorrectly) engineer our desired behaviour with a dirty hack.
While the patch definitely helps to better utilise the re-export server's NFS client cache when exporting via knfsd, we do still see many repeat getattrs per minute for the same files on the re-export server when 100s of clients are all reading the same files. So this is probably not the only place where reading an NFS client mount via a knfsd export invalidates the re-export server's NFS client cache.
Ultimately, I guess we are willing to take some risks with cache coherency (similar to actimeo=large,nocto) if it means that we can do expensive metadata lookups to a remote (WAN) server once and re-export that result to hundreds of (LAN) clients. For read-only or "almost" read-only workloads like ours where we repeatedly read the same files from many clients, it can lead to big savings over the WAN.
But I accept that it is a coherency and locking nightmare when you want to do writes to shared files.
Daire
On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
> ----- On 30 Sep, 2020, at 20:30, Jeff Layton [email protected] wrote:
>
> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > Hi,
> > >
> > > I just thought I'd flesh out the other two issues I have found with re-exporting
> > > that are ultimately responsible for the biggest performance bottlenecks. And
> > > both of them revolve around the caching of metadata file lookups in the NFS
> > > client.
> > >
> > > Especially for the case where we are re-exporting a server many milliseconds
> > > away (i.e. on-premise -> cloud), we want to be able to control how much the
> > > client caches metadata and file data so that it's many LAN clients all benefit
> > > from the re-export server only having to do the WAN lookups once (within a
> > > specified coherency time).
> > >
> > > Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
> > > is fairly straightforward, but keeping the metadata cached is particularly
> > > difficult. And without the cached metadata we introduce long delays before we
> > > can serve the already present and locally cached file data to many waiting
> > > clients.
> > >
> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected] wrote:
> > > > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> > > > cut the network packets back to the origin server to zero for repeated lookups.
> > > > However, if a client of the re-export server walks paths and memory maps those
> > > > files (i.e. loading an application), the re-export server starts issuing
> > > > unexpected calls back to the origin server again, ignoring/invalidating the
> > > > re-export server's NFS client cache. We worked around this this by patching an
> > > > inode/iversion validity check in inode.c so that the NFS client cache on the
> > > > re-export server is used. I'm not sure about the correctness of this patch but
> > > > it works for our corner case.
> > >
> > > If we use actimeo=3600,nocto (say) to mount a remote software volume on the
> > > re-export server, we can successfully cache the loading of applications and
> > > walking of paths directly on the re-export server such that after a couple of
> > > runs, there are practically zero packets back to the originating NFS server
> > > (great!). But, if we then do the same thing on a client which is mounting that
> > > re-export server, the re-export server now starts issuing lots of calls back to
> > > the originating server and invalidating it's client cache (bad!).
> > >
> > > I'm not exactly sure why, but the iversion of the inode gets changed locally
> > > (due to atime modification?) most likely via invocation of method
> > > inode_inc_iversion_raw. Each time it gets incremented the following call to
> > > validate attributes detects changes causing it to be reloaded from the
> > > originating server.
> > >
> >
> > I'd expect the change attribute to track what's in actual inode on the
> > "home" server. The NFS client is supposed to (mostly) keep the raw
> > change attribute in its i_version field.
> >
> > The only place we call inode_inc_iversion_raw is in
> > nfs_inode_add_request, which I don't think you'd be hitting unless you
> > were writing to the file while holding a write delegation.
> >
> > What sort of server is hosting the actual data in your setup?
>
> We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of (older) Netapps too. The re-export server is running the latest mainline kernel(s).
>
> As far as I can make out, both these originating (home) server types exhibit a similar (but not exactly the same) effect on the Linux NFS client cache when it is being re-exported and accessed by other clients. I can replicate it when only using a read-only mount at every hop so I don't think that writes are related.
>
> Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates that might be causing this client invalidation (which is what I initially thought) are ultimately a wasted effort.
>
Ok. I suspect there is a bug here somewhere, but with such a complicated
setup it's not clear to me where that bug would be. You might need to do
some packet sniffing and look at what the servers are sending for change
attributes.
nfsd4_change_attribute does mix in the ctime, so your hunch about the
atime may be correct. atime updates imply a ctime update and that could
cause nfsd to continually send a new one, even on files that aren't
being changed.
It might be interesting to doctor nfsd4_change_attribute() to not mix in
the ctime and see whether that improves things. If it does, then we may
want to teach nfsd how to avoid doing that for certain types of
filesystems.
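For reference, the mixing is roughly the following (a simplified
userspace model rather than the actual nfsd helper, whose exact form
varies between kernel versions):

#include <stdint.h>
#include <stdio.h>

/* Simplified model of how knfsd synthesises an NFSv4 change attribute
 * for filesystems without a native one: ctime is shifted up and the
 * i_version counter added into the low bits. Illustration only. */
static uint64_t change_attr_model(int64_t ctime_sec, long ctime_nsec,
                                  uint64_t i_version)
{
        uint64_t chattr = (uint64_t)ctime_sec;

        chattr <<= 30;                  /* ctime seconds dominate the value */
        chattr += (uint64_t)ctime_nsec;
        chattr += i_version;            /* filesystem change counter */
        return chattr;
}

int main(void)
{
        /* "Doctoring" the helper as suggested above would amount to
         * returning just i_version here, which would show whether the
         * ctime component is what keeps perturbing the value. */
        printf("%llu\n",
               (unsigned long long)change_attr_model(1601553600, 0, 42));
        return 0;
}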
>
> > > This patch helps to avoid this when applied to the re-export server but there
> > > may be other places where this happens too. I accept that this patch is
> > > probably not the right/general way to do this, but it helps to highlight the
> > > issue when re-exporting and it works well for our use case:
> > >
> > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27 00:23:03.000000000
> > > +0000
> > > +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
> > > @@ -1869,7 +1869,7 @@
> > >
> > > /* More cache consistency checks */
> > > if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> > > /* Could it be a race with writeback? */
> > > if (!(have_writers || have_delegation)) {
> > > invalid |= NFS_INO_INVALID_DATA
> > >
> > > With this patch, the re-export server's NFS client attribute cache is maintained
> > > and used by all the clients that then mount it. When many hundreds of clients
> > > are all doing similar things at the same time, the re-export server's NFS
> > > client cache is invaluable in accelerating the lookups (getattrs).
> > >
> > > Perhaps a more correct approach would be to detect when it is knfsd that is
> > > accessing the client mount and change the cache consistency checks accordingly?
> >
> > Yeah, I don't think you can do this for the reasons Trond outlined.
>
> Yea, I kind of felt like it wasn't quite right, but I didn't know enough about the intricacies to say why exactly. So thanks to everyone for clearing that up for me.
>
> We just followed the code and found that the re-export server spent a lot of time in this code block when we assumed that we should be able to serve the same read-only metadata requests to multiple clients out of the re-export server's NFS client cache. I guess the patch was more for us to see if we could (incorrectly) engineer our desired behaviour with a dirty hack.
>
> While the patch definitely helps to better utilise the re-export server's nfs client cache when exporting via knfsd, we do still see many repeat getattrs per minute for the same files on the re-export server when 100s of clients are all reading the same files. So this is probably not the only area where the reading via a knfsd export of an nfs client mount, invalidates the re-export server's nfs client cache.
>
> Ultimately, I guess we are willing to take some risks with cache coherency (similar to actimeo=large,nocto) if it means that we can do expensive metadata lookups to a remote (WAN) server once and re-export that result to hundreds of (LAN) clients. For read-only or "almost" read-only workloads like ours where we repeatedly read the same files from many clients, it can lead to big savings over the WAN.
>
> But I accept that it is a coherency and locking nightmare when you want to do writes to shared files.
>
> Daire
--
Jeff Layton <[email protected]>
On Thu, 2020-10-01 at 06:36 -0400, Jeff Layton wrote:
> On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
> > ----- On 30 Sep, 2020, at 20:30, Jeff Layton [email protected]
> > wrote:
> >
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > Hi,
> > > >
> > > > I just thought I'd flesh out the other two issues I have found
> > > > with re-exporting
> > > > that are ultimately responsible for the biggest performance
> > > > bottlenecks. And
> > > > both of them revolve around the caching of metadata file
> > > > lookups in the NFS
> > > > client.
> > > >
> > > > Especially for the case where we are re-exporting a server many
> > > > milliseconds
> > > > away (i.e. on-premise -> cloud), we want to be able to control
> > > > how much the
> > > > client caches metadata and file data so that it's many LAN
> > > > clients all benefit
> > > > from the re-export server only having to do the WAN lookups
> > > > once (within a
> > > > specified coherency time).
> > > >
> > > > Keeping the file data in the vfs page cache or on disk using
> > > > fscache/cachefiles
> > > > is fairly straightforward, but keeping the metadata cached is
> > > > particularly
> > > > difficult. And without the cached metadata we introduce long
> > > > delays before we
> > > > can serve the already present and locally cached file data to
> > > > many waiting
> > > > clients.
> > > >
> > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected]
> > > > wrote:
> > > > > 2) If we cache metadata on the re-export server using
> > > > > actimeo=3600,nocto we can
> > > > > cut the network packets back to the origin server to zero for
> > > > > repeated lookups.
> > > > > However, if a client of the re-export server walks paths and
> > > > > memory maps those
> > > > > files (i.e. loading an application), the re-export server
> > > > > starts issuing
> > > > > unexpected calls back to the origin server again,
> > > > > ignoring/invalidating the
> > > > > re-export server's NFS client cache. We worked around this
> > > > > this by patching an
> > > > > inode/iversion validity check in inode.c so that the NFS
> > > > > client cache on the
> > > > > re-export server is used. I'm not sure about the correctness
> > > > > of this patch but
> > > > > it works for our corner case.
> > > >
> > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > volume on the
> > > > re-export server, we can successfully cache the loading of
> > > > applications and
> > > > walking of paths directly on the re-export server such that
> > > > after a couple of
> > > > runs, there are practically zero packets back to the
> > > > originating NFS server
> > > > (great!). But, if we then do the same thing on a client which
> > > > is mounting that
> > > > re-export server, the re-export server now starts issuing lots
> > > > of calls back to
> > > > the originating server and invalidating it's client cache
> > > > (bad!).
> > > >
> > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > changed locally
> > > > (due to atime modification?) most likely via invocation of
> > > > method
> > > > inode_inc_iversion_raw. Each time it gets incremented the
> > > > following call to
> > > > validate attributes detects changes causing it to be reloaded
> > > > from the
> > > > originating server.
> > > >
> > >
> > > I'd expect the change attribute to track what's in actual inode
> > > on the
> > > "home" server. The NFS client is supposed to (mostly) keep the
> > > raw
> > > change attribute in its i_version field.
> > >
> > > The only place we call inode_inc_iversion_raw is in
> > > nfs_inode_add_request, which I don't think you'd be hitting
> > > unless you
> > > were writing to the file while holding a write delegation.
> > >
> > > What sort of server is hosting the actual data in your setup?
> >
> > We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a
> > couple of (older) Netapps too. The re-export server is running the
> > latest mainline kernel(s).
> >
> > As far as I can make out, both these originating (home) server
> > types exhibit a similar (but not exactly the same) effect on the
> > Linux NFS client cache when it is being re-exported and accessed by
> > other clients. I can replicate it when only using a read-only mount
> > at every hop so I don't think that writes are related.
> >
> > Our RHEL7 NFS servers actually mount XFS with noatime too so any
> > atime updates that might be causing this client invalidation (which
> > is what I initially thought) are ultimately a wasted effort.
> >
>
> Ok. I suspect there is a bug here somewhere, but with such a
> complicated
> setup though it's not clear to me where that bug would be though. You
> might need to do some packet sniffing and look at what the servers
> are
> sending for change attributes.
>
> nfsd4_change_attribute does mix in the ctime, so your hunch about the
> atime may be correct. atime updates imply a ctime update and that
> could
> cause nfsd to continually send a new one, even on files that aren't
> being changed.
No. Ordinary atime updates due to read() do not trigger a ctime or
change attribute update. Only an explicit atime update through, e.g. a
call to utimensat() will do that.
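For example, a small userspace illustration of the difference (assuming
a local filesystem with default atime semantics and a writable test
file):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Reading a file may update atime but leaves ctime alone, so the
 * synthesised change attribute is unaffected; an explicit utimensat()
 * sets atime/mtime and therefore bumps ctime too. Illustration only. */
int main(int argc, char **argv)
{
        struct stat st;
        char buf[16];
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;

        if (read(fd, buf, sizeof(buf)) < 0)     /* ordinary read */
                perror("read");
        fstat(fd, &st);
        printf("ctime after read():      %lld\n", (long long)st.st_ctime);

        utimensat(AT_FDCWD, argv[1], NULL, 0);  /* explicit atime update */
        fstat(fd, &st);
        printf("ctime after utimensat(): %lld\n", (long long)st.st_ctime);

        close(fd);
        return 0;
}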
>
> It might be interesting to doctor nfsd4_change_attribute() to not mix
> in
> the ctime and see whether that improves things. If it does, then we
> may
> want to teach nfsd how to avoid doing that for certain types of
> filesystems.
NACK. That would cause very incorrect behaviour for the change
attribute. It is supposed to change in all circumstances where you
ordinarily see a ctime change.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, 2020-10-01 at 12:38 +0000, Trond Myklebust wrote:
> On Thu, 2020-10-01 at 06:36 -0400, Jeff Layton wrote:
> > On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
> > > ----- On 30 Sep, 2020, at 20:30, Jeff Layton [email protected]
> > > wrote:
> > >
> > > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > > Hi,
> > > > >
> > > > > I just thought I'd flesh out the other two issues I have found
> > > > > with re-exporting
> > > > > that are ultimately responsible for the biggest performance
> > > > > bottlenecks. And
> > > > > both of them revolve around the caching of metadata file
> > > > > lookups in the NFS
> > > > > client.
> > > > >
> > > > > Especially for the case where we are re-exporting a server many
> > > > > milliseconds
> > > > > away (i.e. on-premise -> cloud), we want to be able to control
> > > > > how much the
> > > > > client caches metadata and file data so that it's many LAN
> > > > > clients all benefit
> > > > > from the re-export server only having to do the WAN lookups
> > > > > once (within a
> > > > > specified coherency time).
> > > > >
> > > > > Keeping the file data in the vfs page cache or on disk using
> > > > > fscache/cachefiles
> > > > > is fairly straightforward, but keeping the metadata cached is
> > > > > particularly
> > > > > difficult. And without the cached metadata we introduce long
> > > > > delays before we
> > > > > can serve the already present and locally cached file data to
> > > > > many waiting
> > > > > clients.
> > > > >
> > > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected]
> > > > > wrote:
> > > > > > 2) If we cache metadata on the re-export server using
> > > > > > actimeo=3600,nocto we can
> > > > > > cut the network packets back to the origin server to zero for
> > > > > > repeated lookups.
> > > > > > However, if a client of the re-export server walks paths and
> > > > > > memory maps those
> > > > > > files (i.e. loading an application), the re-export server
> > > > > > starts issuing
> > > > > > unexpected calls back to the origin server again,
> > > > > > ignoring/invalidating the
> > > > > > re-export server's NFS client cache. We worked around this
> > > > > > this by patching an
> > > > > > inode/iversion validity check in inode.c so that the NFS
> > > > > > client cache on the
> > > > > > re-export server is used. I'm not sure about the correctness
> > > > > > of this patch but
> > > > > > it works for our corner case.
> > > > >
> > > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > > volume on the
> > > > > re-export server, we can successfully cache the loading of
> > > > > applications and
> > > > > walking of paths directly on the re-export server such that
> > > > > after a couple of
> > > > > runs, there are practically zero packets back to the
> > > > > originating NFS server
> > > > > (great!). But, if we then do the same thing on a client which
> > > > > is mounting that
> > > > > re-export server, the re-export server now starts issuing lots
> > > > > of calls back to
> > > > > the originating server and invalidating it's client cache
> > > > > (bad!).
> > > > >
> > > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > > changed locally
> > > > > (due to atime modification?) most likely via invocation of
> > > > > method
> > > > > inode_inc_iversion_raw. Each time it gets incremented the
> > > > > following call to
> > > > > validate attributes detects changes causing it to be reloaded
> > > > > from the
> > > > > originating server.
> > > > >
> > > >
> > > > I'd expect the change attribute to track what's in actual inode
> > > > on the
> > > > "home" server. The NFS client is supposed to (mostly) keep the
> > > > raw
> > > > change attribute in its i_version field.
> > > >
> > > > The only place we call inode_inc_iversion_raw is in
> > > > nfs_inode_add_request, which I don't think you'd be hitting
> > > > unless you
> > > > were writing to the file while holding a write delegation.
> > > >
> > > > What sort of server is hosting the actual data in your setup?
> > >
> > > We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a
> > > couple of (older) Netapps too. The re-export server is running the
> > > latest mainline kernel(s).
> > >
> > > As far as I can make out, both these originating (home) server
> > > types exhibit a similar (but not exactly the same) effect on the
> > > Linux NFS client cache when it is being re-exported and accessed by
> > > other clients. I can replicate it when only using a read-only mount
> > > at every hop so I don't think that writes are related.
> > >
> > > Our RHEL7 NFS servers actually mount XFS with noatime too so any
> > > atime updates that might be causing this client invalidation (which
> > > is what I initially thought) are ultimately a wasted effort.
> > >
> >
> > Ok. I suspect there is a bug here somewhere, but with such a
> > complicated
> > setup though it's not clear to me where that bug would be though. You
> > might need to do some packet sniffing and look at what the servers
> > are
> > sending for change attributes.
> >
> > nfsd4_change_attribute does mix in the ctime, so your hunch about the
> > atime may be correct. atime updates imply a ctime update and that
> > could
> > cause nfsd to continually send a new one, even on files that aren't
> > being changed.
>
> No. Ordinary atime updates due to read() do not trigger a ctime or
> change attribute update. Only an explicit atime update through, e.g. a
> call to utimensat() will do that.
>
Oh, interesting. I didn't realize that.
> > It might be interesting to doctor nfsd4_change_attribute() to not mix
> > in
> > the ctime and see whether that improves things. If it does, then we
> > may
> > want to teach nfsd how to avoid doing that for certain types of
> > filesystems.
>
> NACK. That would cause very incorrect behaviour for the change
> attribute. It is supposed to change in all circumstances where you
> ordinarily see a ctime change.
I wasn't suggesting this as a real fix, just as a way to see whether we
understand the problem correctly. I doubt the re-exporting machine would
be bumping the change_attr on its own, and this may tell you whether
it's the "home" server changing it. There are other ways to determine
that too (a packet sniffer, for instance).
--
Jeff Layton <[email protected]>
On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > This patch helps to avoid this when applied to the re-export server but there may be other places where this happens too. I accept that this patch is probably not the right/general way to do this, but it helps to highlight the issue when re-exporting and it works well for our use case:
> >
> > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27 00:23:03.000000000 +0000
> > +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
> > @@ -1869,7 +1869,7 @@
> >
> > /* More cache consistency checks */
> > if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> > /* Could it be a race with writeback? */
> > if (!(have_writers || have_delegation)) {
> > invalid |= NFS_INO_INVALID_DATA
> >
> > With this patch, the re-export server's NFS client attribute cache is maintained and used by all the clients that then mount it. When many hundreds of clients are all doing similar things at the same time, the re-export server's NFS client cache is invaluable in accelerating the lookups (getattrs).
> >
> > Perhaps a more correct approach would be to detect when it is knfsd that is accessing the client mount and change the cache consistency checks accordingly?
>
> Yeah, I don't think you can do this for the reasons Trond outlined.
I'm not clear whether Trond thought that knfsd's behavior in the case it
returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough to allow
this or some other optimization.
--b.
On Thu, 2020-10-01 at 14:41 -0400, J. Bruce Fields wrote:
> On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > This patch helps to avoid this when applied to the re-export
> > > server but there may be other places where this happens too. I
> > > accept that this patch is probably not the right/general way to
> > > do this, but it helps to highlight the issue when re-exporting
> > > and it works well for our use case:
> > >
> > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27
> > > 00:23:03.000000000 +0000
> > > +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
> > > @@ -1869,7 +1869,7 @@
> > >
> > > /* More cache consistency checks */
> > > if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > - if (!inode_eq_iversion_raw(inode, fattr-
> > > >change_attr)) {
> > > + if (inode_peek_iversion_raw(inode) < fattr-
> > > >change_attr) {
> > > /* Could it be a race with writeback? */
> > > if (!(have_writers || have_delegation)) {
> > > invalid |= NFS_INO_INVALID_DATA
> > >
> > > With this patch, the re-export server's NFS client attribute
> > > cache is maintained and used by all the clients that then mount
> > > it. When many hundreds of clients are all doing similar things at
> > > the same time, the re-export server's NFS client cache is
> > > invaluable in accelerating the lookups (getattrs).
> > >
> > > Perhaps a more correct approach would be to detect when it is
> > > knfsd that is accessing the client mount and change the cache
> > > consistency checks accordingly?
> >
> > Yeah, I don't think you can do this for the reasons Trond outlined.
>
> I'm not clear whether Trond thought that knfsd's behavior in the case
> it
> returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough to
> allow
> this or some other optimization.
>
NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough to
allow the above optimisation, yes. I'm less sure about whether or not
we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when in
fact we are adding the ctime and filesystem-specific change attribute,
but we could fix that too.
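A minimal model of the optimisation under discussion, assuming the
server really is returning a monotonically increasing change attribute
(illustration only, not the client code):

#include <stdbool.h>
#include <stdint.h>

/* With an undefined change-attribute type, any difference between the
 * cached and returned values must invalidate the cache. If the server
 * guarantees a monotonic increase, the client could instead treat its
 * cache as valid until the server value moves past it, which is
 * essentially what the inode.c hack quoted above does. */
static bool cache_still_valid(bool monotonic, uint64_t cached,
                              uint64_t server)
{
        return monotonic ? cached >= server : cached == server;
}

int main(void)
{
        /* a delayed reply carrying an older change attribute no longer
         * forces an invalidation under the monotonic rule */
        return cache_still_valid(true, 101, 100) &&
               !cache_still_valid(false, 101, 100) ? 0 : 1;
}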
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, Oct 01, 2020 at 07:24:42PM +0000, Trond Myklebust wrote:
> On Thu, 2020-10-01 at 14:41 -0400, J. Bruce Fields wrote:
> > On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > This patch helps to avoid this when applied to the re-export
> > > > server but there may be other places where this happens too. I
> > > > accept that this patch is probably not the right/general way to
> > > > do this, but it helps to highlight the issue when re-exporting
> > > > and it works well for our use case:
> > > >
> > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27
> > > > 00:23:03.000000000 +0000
> > > > +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
> > > > @@ -1869,7 +1869,7 @@
> > > >
> > > > /* More cache consistency checks */
> > > > if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > - if (!inode_eq_iversion_raw(inode, fattr-
> > > > >change_attr)) {
> > > > + if (inode_peek_iversion_raw(inode) < fattr-
> > > > >change_attr) {
> > > > /* Could it be a race with writeback? */
> > > > if (!(have_writers || have_delegation)) {
> > > > invalid |= NFS_INO_INVALID_DATA
> > > >
> > > > With this patch, the re-export server's NFS client attribute
> > > > cache is maintained and used by all the clients that then mount
> > > > it. When many hundreds of clients are all doing similar things at
> > > > the same time, the re-export server's NFS client cache is
> > > > invaluable in accelerating the lookups (getattrs).
> > > >
> > > > Perhaps a more correct approach would be to detect when it is
> > > > knfsd that is accessing the client mount and change the cache
> > > > consistency checks accordingly?
> > >
> > > Yeah, I don't think you can do this for the reasons Trond outlined.
> >
> > I'm not clear whether Trond thought that knfsd's behavior in the case
> > it
> > returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough to
> > allow
> > this or some other optimization.
> >
>
> NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough to
> allow the above optimisation, yes. I'm less sure about whether or not
> we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when in
> fact we are adding the ctime and filesystem-specific change attribute,
> but we could fix that too.
Could you explain your concern?
--b.
On Thu, 2020-10-01 at 15:26 -0400, [email protected] wrote:
> On Thu, Oct 01, 2020 at 07:24:42PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-10-01 at 14:41 -0400, J. Bruce Fields wrote:
> > > On Wed, Sep 30, 2020 at 03:30:22PM -0400, Jeff Layton wrote:
> > > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > > This patch helps to avoid this when applied to the re-export
> > > > > server but there may be other places where this happens too.
> > > > > I
> > > > > accept that this patch is probably not the right/general way
> > > > > to
> > > > > do this, but it helps to highlight the issue when re-
> > > > > exporting
> > > > > and it works well for our use case:
> > > > >
> > > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c 2020-01-27
> > > > > 00:23:03.000000000 +0000
> > > > > +++ new/fs/nfs/inode.c 2020-02-13 16:32:09.013055074 +0000
> > > > > @@ -1869,7 +1869,7 @@
> > > > >
> > > > > /* More cache consistency checks */
> > > > > if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > > - if (!inode_eq_iversion_raw(inode, fattr-
> > > > > > change_attr)) {
> > > > > + if (inode_peek_iversion_raw(inode) < fattr-
> > > > > > change_attr) {
> > > > > /* Could it be a race with writeback?
> > > > > */
> > > > > if (!(have_writers ||
> > > > > have_delegation)) {
> > > > > invalid |=
> > > > > NFS_INO_INVALID_DATA
> > > > >
> > > > > With this patch, the re-export server's NFS client attribute
> > > > > cache is maintained and used by all the clients that then
> > > > > mount
> > > > > it. When many hundreds of clients are all doing similar
> > > > > things at
> > > > > the same time, the re-export server's NFS client cache is
> > > > > invaluable in accelerating the lookups (getattrs).
> > > > >
> > > > > Perhaps a more correct approach would be to detect when it is
> > > > > knfsd that is accessing the client mount and change the cache
> > > > > consistency checks accordingly?
> > > >
> > > > Yeah, I don't think you can do this for the reasons Trond
> > > > outlined.
> > >
> > > I'm not clear whether Trond thought that knfsd's behavior in the
> > > case
> > > it
> > > returns NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR might be good enough
> > > to
> > > allow
> > > this or some other optimization.
> > >
> >
> > NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough
> > to
> > allow the above optimisation, yes. I'm less sure about whether or
> > not
> > we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when
> > in
> > fact we are adding the ctime and filesystem-specific change
> > attribute,
> > but we could fix that too.
>
> Could you explain your concern?
>
Same as before: that the ctime could cause the value to regress if
someone messes with the system time on the server. Yes, we do add in
the change attribute, but the value of ctime.tv_sec dominates by a
factor of 2^30.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, Oct 01, 2020 at 07:29:51PM +0000, Trond Myklebust wrote:
> On Thu, 2020-10-01 at 15:26 -0400, [email protected] wrote:
> > On Thu, Oct 01, 2020 at 07:24:42PM +0000, Trond Myklebust wrote:
> > > NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR should normally be good enough
> > > to
> > > allow the above optimisation, yes. I'm less sure about whether or
> > > not
> > > we are correct in returning NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR when
> > > in
> > > fact we are adding the ctime and filesystem-specific change
> > > attribute,
> > > but we could fix that too.
> >
> > Could you explain your concern?
> >
>
> Same as before: that the ctime could cause the value to regress if
> someone messes with the system time on the server. Yes, we do add in
> the change attribute, but the value of ctime.tv_sec dominates by a
> factor 2^30.
Got it.
I'd like to just tell people not to do that....
If we think it's too easy a mistake to make, I can think of other
approaches, though filesystem assistance might be required:
- Ideal would be just never to expose uncommitted change attributes to
the client. Absent persistent RAM that could be terribly expensive.
- It would help just to have any number that's guaranteed to increase
after a boot. Of course, it would have to go forward at least as reliably
as the system time. We'd put it in the high bits of the on-disk
i_version, roughly as sketched below. (We'd rather not just mix it into
the returned change attribute as we do with ctime, because that would
cause clients to discard all their caches unnecessarily after boot.)
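Something along these lines, where the 16/48 bit split is an arbitrary
choice purely for illustration:

#include <stdint.h>

/* Hypothetical layout only: a per-boot generation number in the high
 * bits of the on-disk i_version, the normal change counter below it. */
#define BOOT_GEN_SHIFT  48
#define COUNTER_MASK    ((UINT64_C(1) << BOOT_GEN_SHIFT) - 1)

static uint64_t make_disk_iversion(uint64_t boot_gen, uint64_t counter)
{
        return (boot_gen << BOOT_GEN_SHIFT) | (counter & COUNTER_MASK);
}

int main(void)
{
        /* any value from a later boot compares higher than any value
         * from an earlier one, however far the counter had advanced */
        return make_disk_iversion(2, 0) > make_disk_iversion(1, 12345) ? 0 : 1;
}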
--b.
----- On 1 Oct, 2020, at 11:36, Jeff Layton [email protected] wrote:
> On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
>> ----- On 30 Sep, 2020, at 20:30, Jeff Layton [email protected] wrote:
>>
>> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>> > > Hi,
>> > >
>> > > I just thought I'd flesh out the other two issues I have found with re-exporting
>> > > that are ultimately responsible for the biggest performance bottlenecks. And
>> > > both of them revolve around the caching of metadata file lookups in the NFS
>> > > client.
>> > >
>> > > Especially for the case where we are re-exporting a server many milliseconds
>> > > away (i.e. on-premise -> cloud), we want to be able to control how much the
>> > > client caches metadata and file data so that it's many LAN clients all benefit
>> > > from the re-export server only having to do the WAN lookups once (within a
>> > > specified coherency time).
>> > >
>> > > Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>> > > is fairly straightforward, but keeping the metadata cached is particularly
>> > > difficult. And without the cached metadata we introduce long delays before we
>> > > can serve the already present and locally cached file data to many waiting
>> > > clients.
>> > >
>> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected] wrote:
>> > > > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> > > > cut the network packets back to the origin server to zero for repeated lookups.
>> > > > However, if a client of the re-export server walks paths and memory maps those
>> > > > files (i.e. loading an application), the re-export server starts issuing
>> > > > unexpected calls back to the origin server again, ignoring/invalidating the
>> > > > re-export server's NFS client cache. We worked around this this by patching an
>> > > > inode/iversion validity check in inode.c so that the NFS client cache on the
>> > > > re-export server is used. I'm not sure about the correctness of this patch but
>> > > > it works for our corner case.
>> > >
>> > > If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>> > > re-export server, we can successfully cache the loading of applications and
>> > > walking of paths directly on the re-export server such that after a couple of
>> > > runs, there are practically zero packets back to the originating NFS server
>> > > (great!). But, if we then do the same thing on a client which is mounting that
>> > > re-export server, the re-export server now starts issuing lots of calls back to
>> > > the originating server and invalidating it's client cache (bad!).
>> > >
>> > > I'm not exactly sure why, but the iversion of the inode gets changed locally
>> > > (due to atime modification?) most likely via invocation of method
>> > > inode_inc_iversion_raw. Each time it gets incremented the following call to
>> > > validate attributes detects changes causing it to be reloaded from the
>> > > originating server.
>> > >
>> >
>> > I'd expect the change attribute to track what's in actual inode on the
>> > "home" server. The NFS client is supposed to (mostly) keep the raw
>> > change attribute in its i_version field.
>> >
>> > The only place we call inode_inc_iversion_raw is in
>> > nfs_inode_add_request, which I don't think you'd be hitting unless you
>> > were writing to the file while holding a write delegation.
>> >
>> > What sort of server is hosting the actual data in your setup?
>>
>> We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of
>> (older) Netapps too. The re-export server is running the latest mainline
>> kernel(s).
>>
>> As far as I can make out, both these originating (home) server types exhibit a
>> similar (but not exactly the same) effect on the Linux NFS client cache when it
>> is being re-exported and accessed by other clients. I can replicate it when
>> only using a read-only mount at every hop so I don't think that writes are
>> related.
>>
>> Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates
>> that might be causing this client invalidation (which is what I initially
>> thought) are ultimately a wasted effort.
>>
>
> Ok. I suspect there is a bug here somewhere, but with such a complicated
> setup though it's not clear to me where that bug would be though. You
> might need to do some packet sniffing and look at what the servers are
> sending for change attributes.
>
> nfsd4_change_attribute does mix in the ctime, so your hunch about the
> atime may be correct. atime updates imply a ctime update and that could
> cause nfsd to continually send a new one, even on files that aren't
> being changed.
>
> It might be interesting to doctor nfsd4_change_attribute() to not mix in
> the ctime and see whether that improves things. If it does, then we may
> want to teach nfsd how to avoid doing that for certain types of
> filesystems.
Okay, I started to run back through all my tests again with various combinations of server, client mount options, NFS version etc. with the intention of packet capturing as Jeff has suggested.
But I quickly realised that I had mixed up some previous results before I reported them here. The summary is that using a RHEL7.6 NFS server, a client mounting with a recent mainline kernel and re-exporting using NFSv4.x all the way through does NOT invalidate the re-export server's NFS client cache (great!) like I had assumed before. It does when we mount the originating RHEL7 server using NFSv3 and re-export, but not with any version of NFSv4 on Linux.
But I think I know how I got confused - the Netapp NFSv4 case is different. When we mount our (old) 7-mode Netapp using NFSv4.0 and re-export that, the re-export server's client cache is invalidated often in the same way as for an NFSv3 server. On top of that, I think I wrongly mistook some of the NFSv4 client's natural dropping of metadata from the page cache as client invalidations caused by the re-export and client access (without vfs_cache_pressure=0; see my #3 bullet point).
Both of these conspired to make me think that both NFSv3 AND NFSv4 re-exporting showed the same issue when in fact it's just NFSv3 and the Netapp's v4.0 that require my "hack" to stop the client cache being invalidated. Sorry for any confusion (it is indeed a complicated setup!). Let me summarise then once and for all:
rhel76 server (xfs noatime) -> re-export server (vers=4.x,nocto,actimeo=3600,ro; vfs_cache_pressure=0) = good client cache metadata performance, my hacky patch is not required.
rhel76 server (xfs noatime) -> re-export server (vers=3,nocto,actimeo=3600,ro; vfs_cache_pressure=0) = bad performance (new lookups & getattrs), my hacky patch is required for better performance.
netapp (7-mode) -> re-export server (vers=4.0,nocto,actimeo=3600,ro; vfs_cache_pressure=0) = bad performance, my hacky patch is required for better performance.
So for Jeff's original intention of proxying an NFSv3 server -> NFSv4 clients by re-exporting, the metadata lookup performance will degrade severely as more clients access the same files, because the re-export server's client cache is not being used as effectively (when re-exported) and lookups are happening for the same files many times within the re-export server's actimeo, even with vfs_cache_pressure=0.
For our particular use case, we could live without NFSv3 (and my horrible hack) except for the fact that the Netapp shows similar behaviour with NFSv4.0 (but Linux servers do not). I don't know if turning off atime updates on the Netapp volume will change anything - I might try it. Of course, re-exporting NFSv3 with good metadata cache performance is still a nice thing to have too.
I'll now see if I can decipher the network calls back to the Netapp (NFSv4.0) as suggested by Jeff to see why it is different.
Daire
----- On 5 Oct, 2020, at 13:54, Daire Byrne [email protected] wrote:
> ----- On 1 Oct, 2020, at 11:36, Jeff Layton [email protected] wrote:
>
>> On Thu, 2020-10-01 at 01:09 +0100, Daire Byrne wrote:
>>> ----- On 30 Sep, 2020, at 20:30, Jeff Layton [email protected] wrote:
>>>
>>> > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>>> > > Hi,
>>> > >
>>> > > I just thought I'd flesh out the other two issues I have found with re-exporting
>>> > > that are ultimately responsible for the biggest performance bottlenecks. And
>>> > > both of them revolve around the caching of metadata file lookups in the NFS
>>> > > client.
>>> > >
>>> > > Especially for the case where we are re-exporting a server many milliseconds
>>> > > away (i.e. on-premise -> cloud), we want to be able to control how much the
>>> > > client caches metadata and file data so that it's many LAN clients all benefit
>>> > > from the re-export server only having to do the WAN lookups once (within a
>>> > > specified coherency time).
>>> > >
>>> > > Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>>> > > is fairly straightforward, but keeping the metadata cached is particularly
>>> > > difficult. And without the cached metadata we introduce long delays before we
>>> > > can serve the already present and locally cached file data to many waiting
>>> > > clients.
>>> > >
>>> > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne [email protected] wrote:
>>> > > > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>>> > > > cut the network packets back to the origin server to zero for repeated lookups.
>>> > > > However, if a client of the re-export server walks paths and memory maps those
>>> > > > files (i.e. loading an application), the re-export server starts issuing
>>> > > > unexpected calls back to the origin server again, ignoring/invalidating the
>>> > > > re-export server's NFS client cache. We worked around this this by patching an
>>> > > > inode/iversion validity check in inode.c so that the NFS client cache on the
>>> > > > re-export server is used. I'm not sure about the correctness of this patch but
>>> > > > it works for our corner case.
>>> > >
>>> > > If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>>> > > re-export server, we can successfully cache the loading of applications and
>>> > > walking of paths directly on the re-export server such that after a couple of
>>> > > runs, there are practically zero packets back to the originating NFS server
>>> > > (great!). But, if we then do the same thing on a client which is mounting that
>>> > > re-export server, the re-export server now starts issuing lots of calls back to
>>> > > the originating server and invalidating it's client cache (bad!).
>>> > >
>>> > > I'm not exactly sure why, but the iversion of the inode gets changed locally
>>> > > (due to atime modification?) most likely via invocation of method
>>> > > inode_inc_iversion_raw. Each time it gets incremented the following call to
>>> > > validate attributes detects changes causing it to be reloaded from the
>>> > > originating server.
>>> > >
>>> >
>>> > I'd expect the change attribute to track what's in actual inode on the
>>> > "home" server. The NFS client is supposed to (mostly) keep the raw
>>> > change attribute in its i_version field.
>>> >
>>> > The only place we call inode_inc_iversion_raw is in
>>> > nfs_inode_add_request, which I don't think you'd be hitting unless you
>>> > were writing to the file while holding a write delegation.
>>> >
>>> > What sort of server is hosting the actual data in your setup?
>>>
>>> We mostly use RHEL7.6 NFS servers with XFS backed filesystems and a couple of
>>> (older) Netapps too. The re-export server is running the latest mainline
>>> kernel(s).
>>>
>>> As far as I can make out, both these originating (home) server types exhibit a
>>> similar (but not exactly the same) effect on the Linux NFS client cache when it
>>> is being re-exported and accessed by other clients. I can replicate it when
>>> only using a read-only mount at every hop so I don't think that writes are
>>> related.
>>>
>>> Our RHEL7 NFS servers actually mount XFS with noatime too so any atime updates
>>> that might be causing this client invalidation (which is what I initially
>>> thought) are ultimately a wasted effort.
>>>
>>
>> Ok. I suspect there is a bug here somewhere, but with such a complicated
>> setup it's not clear to me where that bug would be. You
>> might need to do some packet sniffing and look at what the servers are
>> sending for change attributes.
>>
>> nfsd4_change_attribute does mix in the ctime, so your hunch about the
>> atime may be correct. atime updates imply a ctime update and that could
>> cause nfsd to continually send a new one, even on files that aren't
>> being changed.
>>
>> It might be interesting to doctor nfsd4_change_attribute() to not mix in
>> the ctime and see whether that improves things. If it does, then we may
>> want to teach nfsd how to avoid doing that for certain types of
>> filesystems.
>
> Okay, I started to run back through all my tests again with various combinations
> of server, client mount options, NFS version etc. with the intention of packet
> capturing as Jeff has suggested.
>
> But I quickly realised that I had mixed up some previous results before I
> reported them here. The summary is that using an NFS RHEL76 server, a client
> mounting with a recent mainline kernel and re-exporting using NFSv4.x all the
> way through does NOT invalidate the re-export server's NFS client cache
> (great!) like I had assumed before. It does when we mount the originating RHEL7
> server using NFSv3 and re-export, but not with any version of NFSv4 on Linux.
>
> But I think I know how I got confused - the Netapp NFSv4 case is different. When
> we mount our (old) 7-mode Netapp using NFSv4.0 and re-export that, the
> re-export server's client cache is invalidated often in the same way as for an
> NFSv3 server. On top of that, I think I mistook some of the NFSv4
> client's natural dropping of metadata from page cache as client invalidations
> caused by the re-export and client access (without vfs_cache_pressure=0 and see
> my #3 bullet point).
>
> Both of these conspired to make me think that both NFSv3 AND NFSv4 re-exporting
> showed the same issue when in fact, it's just NFSv3 and the Netapp's v4.0 that
> require my "hack" to stop the client cache being invalidated. Sorry for any
> confusion (it is indeed a complicated setup!). Let me summarise then once and
> for all:
>
> rhel76 server (xfs noatime) -> re-export server (vers=4.x,nocto,actimeo=3600,ro;
> vfs_cache_pressure=0) = good client cache metadata performance, my hacky patch
> is not required.
> rhel76 server (xfs noatime) -> re-export server (vers=3,nocto,actimeo=3600,ro;
> vfs_cache_pressure=0) = bad performance (new lookups & getattrs), my hacky
> patch is required for better performance.
> netapp (7-mode) -> re-export server (vers=4.0,nocto,actimeo=3600,ro;
> vfs_cache_pressure=0) = bad performance, my hacky patch is required for better
> performance.
>
> So for Jeff's original intention of proxying a NFSv3 server -> NFSv4 clients by
> re-exporting, the metadata lookup performance will degrade severely as more
> clients access the same files because the re-export server's client cache is
> not being used as effectively (re-exported) and lookups are happening for the
> same files many times within the re-export server's actimeo even with
> vfs_cache_pressure=0.
>
> For our particular use case, we could live without NFSv3 (and my horrible hack)
> except for the fact that the Netapp shows similar behaviour with NFSv4.0 (but
> Linux servers do not). I don't know if turning off atime updates on the Netapp
> volume will change anything - I might try it. Of course, re-exporting NFSv3
> with good metadata cache performance is still a nice thing to have too.
>
> I'll now see if I can decipher the network calls back to the Netapp (NFSv4.0) as
> suggested by Jeff to see why it is different.
I did a little more digging and the big jump in client ops on the re-export server back to the originating Netapp using NFSv4.0 seems to be mostly because it is issuing lots of READDIR calls. The same workload to a Linux NFS server does not issue a single READDIR/READDIRPLUS call (once cached). As to why these are not cached in the client for repeated lookups (without my hack), I have no idea.
However, I was eventually able to devise a workload that could also cause the NFSv4.2 client cache on the re-export server to unexpectedly "lose" entries such that it needed to reissue calls back to an originating Linux server. A large proportion of these were NFS4ERR_NOENT (but not all) so I don't know if maybe it is something specific to the negative entry cache.
It is really hard following the packets from the re-export server's clients, through the re-export server, and on to the originating server, but as far as I can make out, it was mostly issuing access/lookup/getattr for directories (that should already be cached) when the re-export server's clients are issuing calls like readlink (for example resolving a library directory with symlinks).
I have also noticed another couple of new curiosities. If we run a typical small workload against a client mount such that it is all cached for repeat runs and then re-export that same directory to a remote client and run the same workload, the reads that should already be cached are all fetched again from the originating server. Only then are they cached for repeat runs or for different clients. It's almost like the NFS client cache on the re-export server sees the locally accessed client mount as a different filesystem (and cache) to the knfsd re-exported one. A consequence of embedding the filehandles?
And while looking at the packet traces for this, I also noticed that when re-exported to a client, all the read calls back to the originating server are being chopped up into a maximum of 128k. It's as if I had mounted the originating server using rsize=131072 (it's definitely 1MB). So a client of the re-export server is receiving rsize=1MB reads, but the re-export server is pulling them from the originating server in 128k chunks. This was using NFSV4.2 all the way through.
Is this an expected side-effect of re-exporting? Is it some weird interaction with the nfs client's readahead? It has the effect of large reads requiring 8x more round-trips for re-export clients than if they had just gone direct to the originating server (and gotten 1MB reads).
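For anyone wanting to check the effective read size for themselves, the per-mount RPC counters make it fairly visible. This is only a rough sketch (the per-op fields in /proc/self/mountstats are ops, transmissions, timeouts, bytes sent, bytes received and some timing counters, but the layout can vary a little between kernel versions, and there is one READ: line per NFS mount):

# average READ reply size seen by the client mounts on the re-export server
reexport-server # awk '$1 == "READ:" && $2 > 0 {printf "avg READ reply: %.0f bytes over %d ops\n", $6/$2, $2}' /proc/self/mountstats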
Daire
----- On 16 Sep, 2020, at 17:01, Daire Byrne [email protected] wrote:
> Trond/Bruce,
>
> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust [email protected] wrote:
>
>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>> > (under memory cache churn) when those filehandles are still being
>>> > used by the knfsd's remote clients resulting in sporadic and random
>>> > stale filehandles. This seems to be mostly for directories from
>>> > what I've seen. Does the NFS client not know that knfsd is still
>>> > using those files/dirs? The workaround is to never drop inode &
>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>> > also helps to ensure that we actually make the most of our
>>> > actimeo=3600,nocto mount options for the full specified time.
>>>
>>> I thought reexport worked by embedding the original server's filehandles
>>> in the filehandles given out by the reexporting server.
>>>
>>> So, even if nothing's cached, when the reexporting server gets a
>>> filehandle, it should be able to extract the original filehandle from it
>>> and use that.
>>>
>>> I wonder why that's not working?
>>
>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>> callback for it.
>
> So in terms of the ESTALE counter on the reexport server, we see it increase if
> the end client mounts the reexport using either NFSv3 or NFSv4. But there is a
> difference in the client experience in that with NFSv3 we quickly get
> input/output errors but with NFSv4 we don't. But it does seem like the
> performance drops significantly which makes me think that NFSv4 retries the
> lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
>
> This is the simplest reproducer I could come up with but it may still be
> specific to our workloads/applications and hard to replicate exactly.
>
> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro
> reexport-server:/vol/software /mnt/software
> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee
> /proc/sys/vm/drop_caches; done
>
> reexport-server # sysctl -w vm.vfs_cache_pressure=100
> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep
> 10; done
>
> Where "application" is some big application with lots of paths to scan with libs
> to memory map and "/vol/software" is an NFS mount on the reexport-server from
> another originating NFS server. I don't know why this application loading
> workload shows this best, but perhaps the access patterns of memory mapped
> binaries and libs is particularly susceptible to estale?
>
> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
> The ESTALE count increases and the client running the application reports
> input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
>
> As soon as we switch to vfs_cache_pressure=0, the repeating drop_caches on the
> reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter
> no longer increases and the client experiences no issues (NFSv3 & NFSv4).
I don't suppose anyone has any more thoughts on this one? This is likely the first problem that anyone trying to NFS re-export is going to encounter. If they re-export NFSv3 they'll just get lots of ESTALE as the nfs inodes are dropped from cache (with the default vfs_cache_pressure=100) and if they re-export NFSv4, the lookup performance will drop significantly as an ESTALE triggers re-lookups.
For our particular use case, it is actually desirable to have vfs_cache_pressure=0 to keep nfs client inodes and dentry caches in memory to help with expensive metadata lookups, but it would still be nice to have the option of using a less drastic setting (such as vfs_cache_pressure=1) to help avoid OOM conditions.
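For reference, this is roughly what we set on the re-export servers today (illustrative only; with vfs_cache_pressure=0 the OOM risk is ours to manage):

# /etc/sysctl.d/90-nfs-reexport.conf (illustrative)
# never reclaim dentries/nfs inodes under memory pressure
vm.vfs_cache_pressure = 0
# a less drastic setting we would prefer to be able to use safely:
# vm.vfs_cache_pressure = 1

reexport-server # sysctl --system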
Daire
From: Trond Myklebust <[email protected]>
In order to use the open-by-filehandle functionality with NFSv3, we
need to ensure that the NFS client can convert disconnected dentries
into connected ones by doing a reverse walk of the filesystem path.
To do so, NFSv4 provides the LOOKUPP operation, which does not
exist in NFSv3, but which can usually be emulated using lookup("..").
Trond Myklebust (2):
NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
NFSv3: Add emulation of the lookupp() operation
fs/nfs/nfs3proc.c | 43 ++++++++++++++++++++++++++++++++-----------
1 file changed, 32 insertions(+), 11 deletions(-)
--
2.26.2
From: Trond Myklebust <[email protected]>
We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/nfs3proc.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 2397ceedba8a..a6a222435e9b 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -154,14 +154,13 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
}
static int
-nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
- struct nfs_fh *fhandle, struct nfs_fattr *fattr,
- struct nfs4_label *label)
+__nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
+ struct nfs_fh *fhandle, struct nfs_fattr *fattr)
{
struct nfs3_diropargs arg = {
.fh = NFS_FH(dir),
- .name = dentry->d_name.name,
- .len = dentry->d_name.len
+ .name = name,
+ .len = len
};
struct nfs3_diropres res = {
.fh = fhandle,
@@ -175,15 +174,10 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
int status;
unsigned short task_flags = 0;
- /* Is this is an attribute revalidation, subject to softreval? */
- if (nfs_lookup_is_soft_revalidate(dentry))
- task_flags |= RPC_TASK_TIMEOUT;
-
res.dir_attr = nfs_alloc_fattr();
if (res.dir_attr == NULL)
return -ENOMEM;
- dprintk("NFS call lookup %pd2\n", dentry);
nfs_fattr_init(fattr);
status = rpc_call_sync(NFS_CLIENT(dir), &msg, task_flags);
nfs_refresh_inode(dir, res.dir_attr);
@@ -198,6 +192,20 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
return status;
}
+static int
+nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
+ struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+ struct nfs4_label *label)
+{
+ /* Is this is an attribute revalidation, subject to softreval? */
+ if (nfs_lookup_is_soft_revalidate(dentry))
+ task_flags |= RPC_TASK_TIMEOUT;
+
+ dprintk("NFS call lookup %pd2\n", dentry);
+ return __nfs3_proc_lookup(dir, dentry->d_name.name,
+ dentry->d_name.len, fhandle, fattr);
+}
+
static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
{
struct nfs3_accessargs arg = {
--
2.26.2
From: Trond Myklebust <[email protected]>
In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/nfs3proc.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index a6a222435e9b..63d1979933f3 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -155,7 +155,8 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
static int
__nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
- struct nfs_fh *fhandle, struct nfs_fattr *fattr)
+ struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+ unsigned short task_flags)
{
struct nfs3_diropargs arg = {
.fh = NFS_FH(dir),
@@ -172,7 +173,6 @@ __nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
.rpc_resp = &res,
};
int status;
- unsigned short task_flags = 0;
res.dir_attr = nfs_alloc_fattr();
if (res.dir_attr == NULL)
@@ -197,13 +197,25 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
struct nfs_fh *fhandle, struct nfs_fattr *fattr,
struct nfs4_label *label)
{
+ unsigned short task_flags = 0;
+
/* Is this is an attribute revalidation, subject to softreval? */
if (nfs_lookup_is_soft_revalidate(dentry))
task_flags |= RPC_TASK_TIMEOUT;
dprintk("NFS call lookup %pd2\n", dentry);
return __nfs3_proc_lookup(dir, dentry->d_name.name,
- dentry->d_name.len, fhandle, fattr);
+ dentry->d_name.len, fhandle, fattr,
+ task_flags);
+}
+
+static int nfs3_proc_lookupp(struct inode *inode, struct nfs_fh *fhandle,
+ struct nfs_fattr *fattr, struct nfs4_label *label)
+{
+ const char *dotdot = "..";
+ const size_t len = sizeof(dotdot) - 1;
+
+ return __nfs3_proc_lookup(inode, dotdot, len, fhandle, fattr, 0);
}
static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
@@ -1012,6 +1024,7 @@ const struct nfs_rpc_ops nfs_v3_clientops = {
.getattr = nfs3_proc_getattr,
.setattr = nfs3_proc_setattr,
.lookup = nfs3_proc_lookup,
+ .lookupp = nfs3_proc_lookupp,
.access = nfs3_proc_access,
.readlink = nfs3_proc_readlink,
.create = nfs3_proc_create,
--
2.26.2
From: Trond Myklebust <[email protected]>
In order to use the open-by-filehandle functionality with NFSv3, we
need to ensure that the NFS client can convert disconnected dentries
into connected ones by doing a reverse walk of the filesystem path.
To do so, NFSv4 provides the LOOKUPP operation, which does not
exist in NFSv3, but which can usually be emulated using lookup("..").
v2:
- Fix compilation issues for "NFSv3: Refactor nfs3_proc_lookup() to
split out the dentry"
Trond Myklebust (2):
NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
NFSv3: Add emulation of the lookupp() operation
fs/nfs/nfs3proc.c | 43 ++++++++++++++++++++++++++++++++-----------
1 file changed, 32 insertions(+), 11 deletions(-)
--
2.26.2
From: Trond Myklebust <[email protected]>
In order to use the open-by-filehandle functionality with NFSv3, we
need to ensure that the NFS client can convert disconnected dentries
into connected ones by doing a reverse walk of the filesystem path.
To do so, NFSv4 provides the LOOKUPP operation, which does not
exist in NFSv3, but which can usually be emulated using lookup("..").
v2:
- Fix compilation issues for "NFSv3: Refactor nfs3_proc_lookup() to
split out the dentry"
v3:
- Fix the string length calculation
- Apply the NFS_MOUNT_SOFTREVAL flag in both the NFSv3 and NFSv4 lookupp
Trond Myklebust (3):
NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
NFSv3: Add emulation of the lookupp() operation
NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp
fs/nfs/nfs3proc.c | 48 ++++++++++++++++++++++++++++++++++++-----------
fs/nfs/nfs4proc.c | 6 +++++-
2 files changed, 42 insertions(+), 12 deletions(-)
--
2.26.2
From: Trond Myklebust <[email protected]>
We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/nfs3proc.c | 33 ++++++++++++++++++++++-----------
1 file changed, 22 insertions(+), 11 deletions(-)
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 2397ceedba8a..acbdf7496d31 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -154,14 +154,14 @@ nfs3_proc_setattr(struct dentry *dentry, struct nfs_fattr *fattr,
}
static int
-nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
- struct nfs_fh *fhandle, struct nfs_fattr *fattr,
- struct nfs4_label *label)
+__nfs3_proc_lookup(struct inode *dir, const char *name, size_t len,
+ struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+ unsigned short task_flags)
{
struct nfs3_diropargs arg = {
.fh = NFS_FH(dir),
- .name = dentry->d_name.name,
- .len = dentry->d_name.len
+ .name = name,
+ .len = len
};
struct nfs3_diropres res = {
.fh = fhandle,
@@ -173,17 +173,11 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
.rpc_resp = &res,
};
int status;
- unsigned short task_flags = 0;
-
- /* Is this is an attribute revalidation, subject to softreval? */
- if (nfs_lookup_is_soft_revalidate(dentry))
- task_flags |= RPC_TASK_TIMEOUT;
res.dir_attr = nfs_alloc_fattr();
if (res.dir_attr == NULL)
return -ENOMEM;
- dprintk("NFS call lookup %pd2\n", dentry);
nfs_fattr_init(fattr);
status = rpc_call_sync(NFS_CLIENT(dir), &msg, task_flags);
nfs_refresh_inode(dir, res.dir_attr);
@@ -198,6 +192,23 @@ nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
return status;
}
+static int
+nfs3_proc_lookup(struct inode *dir, struct dentry *dentry,
+ struct nfs_fh *fhandle, struct nfs_fattr *fattr,
+ struct nfs4_label *label)
+{
+ unsigned short task_flags = 0;
+
+ /* Is this is an attribute revalidation, subject to softreval? */
+ if (nfs_lookup_is_soft_revalidate(dentry))
+ task_flags |= RPC_TASK_TIMEOUT;
+
+ dprintk("NFS call lookup %pd2\n", dentry);
+ return __nfs3_proc_lookup(dir, dentry->d_name.name,
+ dentry->d_name.len, fhandle, fattr,
+ task_flags);
+}
+
static int nfs3_proc_access(struct inode *inode, struct nfs_access_entry *entry)
{
struct nfs3_accessargs arg = {
--
2.26.2
----- On 19 Oct, 2020, at 17:19, Daire Byrne [email protected] wrote:
> ----- On 16 Sep, 2020, at 17:01, Daire Byrne [email protected] wrote:
>
>> Trond/Bruce,
>>
>> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust [email protected] wrote:
>>
>>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>>> > (under memory cache churn) when those filehandles are still being
>>>> > used by the knfsd's remote clients resulting in sporadic and random
>>>> > stale filehandles. This seems to be mostly for directories from
>>>> > what I've seen. Does the NFS client not know that knfsd is still
>>>> > using those files/dirs? The workaround is to never drop inode &
>>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>>> > also helps to ensure that we actually make the most of our
>>>> > actimeo=3600,nocto mount options for the full specified time.
>>>>
>>>> I thought reexport worked by embedding the original server's filehandles
>>>> in the filehandles given out by the reexporting server.
>>>>
>>>> So, even if nothing's cached, when the reexporting server gets a
>>>> filehandle, it should be able to extract the original filehandle from it
>>>> and use that.
>>>>
>>>> I wonder why that's not working?
>>>
>>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>>> callback for it.
>>
>> So in terms of the ESTALE counter on the reexport server, we see it increase if
>> the end client mounts the reexport using either NFSv3 or NFSv4. But there is a
>> difference in the client experience in that with NFSv3 we quickly get
>> input/output errors but with NFSv4 we don't. But it does seem like the
>> performance drops significantly which makes me think that NFSv4 retries the
>> lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
>>
>> This is the simplest reproducer I could come up with but it may still be
>> specific to our workloads/applications and hard to replicate exactly.
>>
>> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro
>> reexport-server:/vol/software /mnt/software
>> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee
>> /proc/sys/vm/drop_caches; done
>>
>> reexport-server # sysctl -w vm.vfs_cache_pressure=100
>> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
>> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep
>> 10; done
>>
>> Where "application" is some big application with lots of paths to scan with libs
>> to memory map and "/vol/software" is an NFS mount on the reexport-server from
>> another originating NFS server. I don't know why this application loading
>> workload shows this best, but perhaps the access patterns of memory mapped
>> binaries and libs is particularly susceptible to estale?
>>
>> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
>> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
>> The ESTALE count increases and the client running the application reports
>> input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
>>
>> As soon as we switch to vfs_cache_pressure=0, the repeating drop_caches on the
>> reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter
>> no longer increases and the client experiences no issues (NFSv3 & NFSv4).
>
> I don't suppose anyone has any more thoughts on this one? This is likely the
> first problem that anyone trying to NFS re-export is going to encounter. If
> they re-export NFSv3 they'll just get lots of ESTALE as the nfs inodes are
> dropped from cache (with the default vfs_cache_pressure=100) and if they
> re-export NFSv4, the lookup performance will drop significantly as an ESTALE
> triggers re-lookups.
>
> For our particular use case, it is actually desirable to have
> vfs_cache_pressure=0 to keep nfs client inodes and dentry caches in memory to
> help with expensive metadata lookups, but it would still be nice to have the
> option of using a less drastic setting (such as vfs_cache_pressure=1) to help
> avoid OOM conditions.
Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million thanks!) so I applied them to v5.9.1 and ran some more tests using that on the re-export server. Again, I just pathologically dropped inode & dentry caches every second on the re-export server (vfs_cache_pressure=100) while a client looped through some application loading tests.
Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode & dentry caches (yay!).
However, my assumption that some of the input/output errors I was seeing were related to the estales seems to have been misguided. After running these tests again without any estales, it now looks like a different issue that is unique to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or Netapp). The lookups are all fine (no estale) but reading some files eventually gives an input/output error on multiple clients which remain consistent until the re-export nfs-server is restarted. Again, this only occurs while dropping inode + dentry caches.
So in summary, while continuously dropping inode/dentry caches on the re-export server:
originating server NFSv4.x -> NFSv4.x re-export server = good (no estale, no input/output errors)
originating server NFSv4.1/4.2 -> NFSv3 re-export server = good
originating server NFSv4.0 -> NFSv3 re-export server = no estale but lots of input/output errors
originating server NFSv3 -> NFSv3 re-export server = good (fixed by Trond's lookupp emulation patches)
originating server NFSv3 -> NFSv4.x re-export server = good (fixed by Trond's lookupp emulation patches)
In our case, we are stuck with some old 7-mode Netapps so we only have two mount choices, NFSv3 or NFSv4.0 (hence our particular interest in the NFSv4.0 re-export behaviour). And as discussed previously, a re-export of an NFSv3 server requires my horrible hack in order to avoid excessive lookups and client cache invalidations.
But these lookupp emulation patches fix the ESTALEs for the NFSv3 re-export cases, so many thanks again for that Trond. When re-exporting an NFSv3 client mount, we no longer need to change vfs_cache_pressure=0.
Daire
On Wed, Oct 21, 2020 at 10:33:52AM +0100, Daire Byrne wrote:
> Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million thanks!) so I applied them to v5.9.1 and ran some more tests using that on the re-export server. Again, I just pathologically dropped inode & dentry caches every second on the re-export server (vfs_cache_pressure=100) while a client looped through some application loading tests.
>
> Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode & dentry caches (yay!).
>
> However, my assumption that some of the input/output errors I was seeing were related to the estales seems to have been misguided. After running these tests again without any estales, it now looks like a different issue that is unique to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or Netapp). The lookups are all fine (no estale) but reading some files eventually gives an input/output error on multiple clients which remain consistent until the re-export nfs-server is restarted. Again, this only occurs while dropping inode + dentry caches.
>
> So in summary, while continuously dropping inode/dentry caches on the re-export server:
How continuously, exactly?
I recall that there are some situations where the best the client can do
to handle an ESTALE is just retry. And that our code generally just
retries once and then gives up.
I wonder if it's possible that the client or re-export server can get
stuck in a situation where they can't guarantee forward progress in the
face of repeated ESTALEs. I don't have a specific case in mind, though.
--b.
>
> originating server NFSv4.x -> NFSv4.x re-export server = good (no estale, no input/output errors)
> originating server NFSv4.1/4.2 -> NFSv3 re-export server = good
> originating server NFSv4.0 -> NFSv3 re-export server = no estale but lots of input/output errors
> originating server NFSv3 -> NFSv3 re-export server = good (fixed by Trond's lookupp emulation patches)
> originating server NFSv3 -> NFSv4.x re-export server = good (fixed by Trond's lookupp emulation patches)
>
> In our case, we are stuck with some old 7-mode Netapps so we only have two mount choices, NFSv3 or NFSv4.0 (hence our particular interest in the NFSv4.0 re-export behaviour). And as discussed previously, a re-export of an NFSv3 server requires my horrible hack in order to avoid excessive lookups and client cache invalidations.
>
> But these lookupp emulation patches fix the ESTALEs for the NFSv3 re-export cases, so many thanks again for that Trond. When re-exporting an NFSv3 client mount, we no longer need to change vfs_cache_pressure=0.
>
> Daire
----- On 9 Nov, 2020, at 16:02, bfields [email protected] wrote:
> On Wed, Oct 21, 2020 at 10:33:52AM +0100, Daire Byrne wrote:
>> Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million
>> thanks!) so I applied them to v5.9.1 and ran some more tests using that on the
>> re-export server. Again, I just pathologically dropped inode & dentry caches
>> every second on the re-export server (vfs_cache_pressure=100) while a client
>> looped through some application loading tests.
>>
>> Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I
>> no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode &
>> dentry caches (yay!).
>>
>> However, my assumption that some of the input/output errors I was seeing were
>> related to the estales seems to have been misguided. After running these tests
>> again without any estales, it now looks like a different issue that is unique
>> to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or
>> Netapp). The lookups are all fine (no estale) but reading some files eventually
>> gives an input/output error on multiple clients which remain consistent until
>> the re-export nfs-server is restarted. Again, this only occurs while dropping
>> inode + dentry caches.
>>
>> So in summary, while continuously dropping inode/dentry caches on the re-export
>> server:
>
> How continuously, exactly?
>
> I recall that there are some situations where the best the client can do
> to handle an ESTALE is just retry. And that our code generally just
> retries once and then gives up.
>
> I wonder if it's possible that the client or re-export server can get
> stuck in a situation where they can't guarantee forward progress in the
> face of repeated ESTALEs. I don't have a specific case in mind, though.
I was dropping caches every second in a loop on the NFS re-export server. Meanwhile a large python application that takes ~15 seconds to complete was also looping on a client of the re-export server. So we are clearing out the cache many times such that the same python paths are being re-populated many times.
Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's NFSv3 lookupp emulation patches, I can now revise my original list of issues that others will likely experience if they ever try to do this craziness:
1) Don't re-export NFSv4.0 unless you set vfs_cache_pressure=0 otherwise you will see random input/output errors on your clients when things are dropped out of the cache. In the end we gave up on using NFSv4.0 with our Netapps because the 7-mode implementation seemed a bit flakey with modern Linux clients (Linux NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with Trond's lookupp emulation patches instead.
2) In order to better utilise the re-export server's client cache when re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to use the horrible inode_peek_iversion_raw hack to maintain good metadata performance for large numbers of clients. Otherwise each re-export server's clients can cause invalidation of the re-export server's client cache. Once you have hundreds of clients they all combine to constantly invalidate the cache, resulting in an order of magnitude slower metadata performance. If you are re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not required.
3) For some reason, when a 1MB read call arrives at the re-export server from a client, it gets chopped up into 128k read calls that are issued back to the originating server despite rsize/wsize=1MB on all mounts. This results in a noticeable increase in rpc chatter for large reads. Writes on the other hand retain their 1MB size from client to re-export server and back to the originating server. I am using nconnect but I doubt that is related.
4) After some random time, the cachefilesd userspace daemon stops culling old data from the fscache disk storage. I thought it was to do with setting vfs_cache_pressure=0 but even with it set to the default 100 it just randomly decides to stop culling and never comes back to life until restarted or rebooted. Perhaps the fscache/cachefilesd rewrite that David Howells & David Wysochanski have been working on will improve matters. (A minimal sketch of the cachefilesd setup we use follows after this list.)
5) It's still really hard to cache nfs client metadata for any definite time (actimeo,nocto) due to the pagecache churn that reads cause. If all required metadata (i.e. directory contents) could be cached either locally to disk or in the inode cache, rather than the pagecache, then maybe we would have more control over the actual cache times we are comfortable with for our workloads. This has little to do with re-exporting and is just a general NFS performance over the WAN thing. I'm very interested to see how Trond's recent patches to improve readdir performance might at least help re-populate the dropped cached metadata more efficiently over the WAN.
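For anyone setting up the fscache/cachefilesd side of this, our configuration is essentially stock. The following is only an illustrative sketch (the cache directory, tag and culling thresholds are just the values we happen to use, not recommendations):

# /etc/cachefilesd.conf (illustrative)
dir /var/cache/fscache
tag nfs-reexport
# culling starts/stops based on the free space left in the cache filesystem
brun 10%
bcull 7%
bstop 3%

# mount the originating server with the fsc option so the cache is actually used
reexport-server # mount -t nfs -o vers=4.2,nocto,actimeo=3600,fsc onprem:/vol/prod /srv/prod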
I just want to finish with one more crazy thing we have been doing - a re-export server of a re-export server! Again, a locking and consistency nightmare so only possible for very specific workloads (like ours). The advantage of this topology is that you can pull all your data over the WAN once (e.g. on-premise to cloud) and then fan-out that data to multiple other NFS re-export servers in the cloud to improve the aggregate performance to many clients. This avoids having multiple re-export servers all needing to pull the same data across the WAN.
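To make that topology a bit more concrete, this is roughly how the two levels are plumbed together. It is only a sketch (hostnames, paths and fsid numbers are made up); note that each re-export needs an explicit fsid= because the underlying NFS client filesystem is not block-device backed, so knfsd has nothing else to identify the export by:

# tier 1: pulls from on-prem over the WAN once and re-exports
tier1 # mount -t nfs -o vers=4.2,nocto,actimeo=3600,nconnect=16,fsc onprem:/vol/prod /srv/prod
tier1 # exportfs -o rw,no_subtree_check,fsid=1000 *:/srv/prod

# tier 2: mounts tier 1 over the (cloud) LAN and fans out to the render clients
tier2 # mount -t nfs -o vers=4.2,nocto,actimeo=3600,fsc tier1:/srv/prod /srv/prod
tier2 # exportfs -o rw,no_subtree_check,fsid=1000 *:/srv/prod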
Daire
On Thu, Nov 12, 2020 at 01:01:24PM +0000, Daire Byrne wrote:
>
> ----- On 9 Nov, 2020, at 16:02, bfields [email protected] wrote:
> > On Wed, Oct 21, 2020 at 10:33:52AM +0100, Daire Byrne wrote:
> >> Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million
> >> thanks!) so I applied them to v5.9.1 and ran some more tests using that on the
> >> re-export server. Again, I just pathologically dropped inode & dentry caches
> >> every second on the re-export server (vfs_cache_pressure=100) while a client
> >> looped through some application loading tests.
> >>
> >> Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3), I
> >> no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode &
> >> dentry caches (yay!).
> >>
> >> However, my assumption that some of the input/output errors I was seeing were
> >> related to the estales seems to have been misguided. After running these tests
> >> again without any estales, it now looks like a different issue that is unique
> >> to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or
> >> Netapp). The lookups are all fine (no estale) but reading some files eventually
> >> gives an input/output error on multiple clients which remain consistent until
> >> the re-export nfs-server is restarted. Again, this only occurs while dropping
> >> inode + dentry caches.
> >>
> >> So in summary, while continuously dropping inode/dentry caches on the re-export
> >> server:
> >
> > How continuously, exactly?
> >
> > I recall that there are some situations where the best the client can do
> > to handle an ESTALE is just retry. And that our code generally just
> > retries once and then gives up.
> >
> > I wonder if it's possible that the client or re-export server can get
> > stuck in a situation where they can't guarantee forward progress in the
> > face of repeated ESTALEs. I don't have a specific case in mind, though.
>
> I was dropping caches every second in a loop on the NFS re-export server. Meanwhile a large python application that takes ~15 seconds to complete was also looping on a client of the re-export server. So we are clearing out the cache many times such that the same python paths are being re-populated many times.
>
> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's NFSv3 lookupp emulation patches, I can now revise my original list of issues that others will likely experience if they ever try to do this craziness:
>
> 1) Don't re-export NFSv4.0 unless you set vfs_cache_pressure=0 otherwise you will see random input/output errors on your clients when things are dropped out of the cache. In the end we gave up on using NFSv4.0 with our Netapps because the 7-mode implementation seemed a bit flakey with modern Linux clients (Linux NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with Trond's lookupp emulation patches instead.
So,
NFSv4.2 NFSv4.2
client --------> re-export server -------> original server
works as long as both servers are recent Linux, but when the original
server is Netapp, you need the protocol used in both places to be v3, is
that right?
> 2) In order to better utilise the re-export server's client cache when re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to use the horrible inode_peek_iversion_raw hack to maintain good metadata performance for large numbers of clients. Otherwise each re-export server's clients can cause invalidation of the re-export server client cache. Once you have hundreds of clients they all combine to constantly invalidate the cache resulting in an order of magnitude slower metadata performance. If you are re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not required.
Have we figured out why that's required, or found a longer-term
solution? (Apologies, the memory of the earlier conversation is
fading....)
--b.
----- On 12 Nov, 2020, at 13:57, bfields [email protected] wrote:
> On Thu, Nov 12, 2020 at 01:01:24PM +0000, Daire Byrne wrote:
>>
>> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's
>> NFSv3 lookupp emulation patches, I can now revise my original list of issues
>> that others will likely experience if they ever try to do this craziness:
>>
>> 1) Don't re-export NFSv4.0 unless you set vfs_cache_pressure=0 otherwise you will
>> see random input/output errors on your clients when things are dropped out of
>> the cache. In the end we gave up on using NFSv4.0 with our Netapps because the
>> 7-mode implementation seemed a bit flakey with modern Linux clients (Linux
>> NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with
>> Trond's lookupp emulation patches instead.
>
> So,
>
> NFSv4.2 NFSv4.2
> client --------> re-export server -------> original server
>
> works as long as both servers are recent Linux, but when the original
> server is Netapp, you need the protocol used in both places to be v3, is
> that right?
Well, yes NFSv4.2 all the way through works well for us but it's re-exporting an NFSv4.0 server (Linux OR Netapp) that seems to still show the input/output errors when dropping caches. Every other possible combination now seems to be working without ESTALE or input/output errors with the lookupp emulation patches.
So this is still not working when dropping caches on the re-export server:
NFSv3/4.x NFSv4.0
client --------> re-export server -------> original server
The bit specific to the Netapp is simply that our 7-mode only supports NFSv4.0 so I can't actually test NFSv4.1/4.2 on a more modern Netapp firmware release. So I have to use NFSv3 to mount the Netapp and can then happily re-export that using NFSv4.x or NFSv3 (if the filehandles fit in 63 bytes).
>> 2) In order to better utilise the re-export server's client cache when
>> re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to
>> use the horrible inode_peek_iversion_raw hack to maintain good metadata
>> performance for large numbers of clients. Otherwise each re-export server's
>> clients can cause invalidation of the re-export server client cache. Once you
>> have hundreds of clients they all combine to constantly invalidate the cache
>> resulting in an order of magnitude slower metadata performance. If you are
>> re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not
>> required.
>
> Have we figured out why that's required, or found a longer-term
> solution? (Apologies, the memory of the earlier conversation is
> fading....)
There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR allowing for the hack/optimisation but I guess that is only for the case when re-exporting NFSv4 to the eventual clients. It would not help if you were re-exporting an NFSv3 server with NFSv3 to the clients? I lack the deeper understanding to say anything more than that.
In our case we re-export everything to the clients using NFSv4.2 whether the originating server is NFSv3 (e.g our Netapp) or NFSv4.2 (our RHEL7 storage servers).
With NFSv4.2 as the originating server, we found that either this hack/optimisation was not required or the incidence rate of invalidating the re-export server's client cache was low enough not to cause significant performance problems when many clients requested the same metadata.
Daire
On Thu, Nov 12, 2020 at 06:33:45PM +0000, Daire Byrne wrote:
> Well, yes NFSv4.2 all the way through works well for us but it's re-exporting an NFSv4.0 server (Linux OR Netapp) that seems to still show the input/output errors when dropping caches. Every other possible combination now seems to be working without ESTALE or input/output errors with the lookupp emulation patches.
>
> So this is still not working when dropping caches on the re-export server:
>
> NFSv3/4.x NFSv4.0
> client --------> re-export server -------> original server
>
> The bit specific to the Netapp is simply that our 7-mode only supports NFSv4.0 so I can't actually test NFSv4.1/4.2 on a more modern Netapp firmware release. So I have to use NFSv3 to mount the Netapp and can then happily re-export that using NFSv4.x or NFSv3 (if the filehandles fit in 63 bytes).
Oh, got it, thanks, so it's just the minor-version difference (probably
the open-by-filehandle stuff that went into 4.1).
> There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR
> allowing for the hack/optimisation but I guess that is only for the
> case when re-exporting NFSv4 to the eventual clients. It would not
> help if you were re-exporting an NFSv3 server with NFSv3 to the
> clients? I lack the deeper understanding to say anything more than
> that.
Oh, right, thanks for the reminder. The CHANGE_TYPE_IS_MONOTONIC_INCR
optimization still looks doable to me.
How does that help, anyway? I guess it avoids false positives of some
kind when rpc's are processed out of order?
Looking back at
https://lore.kernel.org/linux-nfs/[email protected]/
this bothers me: "I'm not exactly sure why, but the iversion of the
inode gets changed locally (due to atime modification?) most likely via
invocation of method inode_inc_iversion_raw. Each time it gets
incremented the following call to validate attributes detects changes
causing it to be reloaded from the originating server."
The only call to that function outside afs or ceph code is in
fs/nfs/write.c, in the write delegation case. The Linux server doesn't
support write delegations, Netapp does but this shouldn't be causing
cache invalidations.
--b.
----- On 12 Nov, 2020, at 20:55, bfields [email protected] wrote:
> On Thu, Nov 12, 2020 at 06:33:45PM +0000, Daire Byrne wrote:
>> There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR
>> allowing for the hack/optimisation but I guess that is only for the
>> case when re-exporting NFSv4 to the eventual clients. It would not
>> help if you were re-exporting an NFSv3 server with NFSv3 to the
>> clients? I lack the deeper understanding to say anything more than
>> that.
>
> Oh, right, thanks for the reminder. The CHANGE_TYPE_IS_MONOTONIC_INCR
> optimization still looks doable to me.
>
> How does that help, anyway? I guess it avoids false positives of some
> kind when rpc's are processed out of order?
>
> Looking back at
>
> https://lore.kernel.org/linux-nfs/[email protected]/
>
> this bothers me: "I'm not exactly sure why, but the iversion of the
> inode gets changed locally (due to atime modification?) most likely via
> invocation of method inode_inc_iversion_raw. Each time it gets
> incremented the following call to validate attributes detects changes
> causing it to be reloaded from the originating server."
>
> The only call to that function outside afs or ceph code is in
> fs/nfs/write.c, in the write delegation case. The Linux server doesn't
> support write delegations, Netapp does but this shouldn't be causing
> cache invalidations.
So, I can't lay claim to identifying the exact optimisation/hack that improves the retention of the re-export server's client cache when re-exporting an NFSv3 server (which is then read by many clients). We were working with an engineer at the time who showed an interest in our use case, and after we supplied a reproducer he suggested modifying fs/nfs/inode.c:
- if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
+ if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
His reasoning at the time was:
"Fixes inode invalidation caused by read access. The least
important bit is ORed with 1 and causes the inode version to differ from the
one seen on the NFS share. This in turn causes unnecessary re-download
impacting the performance significantly. This fix makes it only re-fetch file
content if inode version seen on the server is newer than the one on the
client."
But I've always been puzzled by why this only seems to be the case when using knfsd to re-export the (NFSv3) client mount. Using multiple processes on a standard client mount never causes any similar re-validations. And this happens with a completely read-only share which is why I started to think it has something to do with atimes as that could perhaps still cause a "write" modification even when read-only?
In our case we saw this at its most extreme when we were re-exporting a read-only NFSv3 Netapp "software" share and loading large applications with many python search paths to trawl through. Multiple clients of the re-export server just kept causing the re-export server's client to re-validate and re-download from the Netapp even though no files or dirs had changed and actimeo was large (with nocto for good measure).
The patch made it such that the re-export server's client cache acted the same way if we ran 100 processes directly on the NFSv3 client mount (on the re-export server) or ran it on 100 clients of the re-export server - the data remained in client cache for the duration. So the re-export server fetches the data from the originating server once and then serves all those results many times over to all the clients from its cache - exactly what we want.
Daire
On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> So, I can't lay claim to identifying the exact optimisation/hack that
> improves the retention of the re-export server's client cache when
> re-exporting an NFSv3 server (which is then read by many clients). We
> were working with an engineer at the time who showed an interest in
> our use case and after we supplied a reproducer he suggested modifying
> the nfs/inode.c
>
> - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>
> His reasoning at the time was:
>
> "Fixes inode invalidation caused by read access. The least important
> bit is ORed with 1 and causes the inode version to differ from the one
> seen on the NFS share. This in turn causes unnecessary re-download
> impacting the performance significantly. This fix makes it only
> re-fetch file content if inode version seen on the server is newer
> than the one on the client."
>
> But I've always been puzzled by why this only seems to be the case
> when using knfsd to re-export the (NFSv3) client mount. Using multiple
> processes on a standard client mount never causes any similar
> re-validations. And this happens with a completely read-only share
> which is why I started to think it has something to do with atimes as
> that could perhaps still cause a "write" modification even when
> read-only?
Ah-hah! So, it's inode_query_iversion() that's modifying a nfs inode's
i_version. That's a special thing that only nfsd would do.
I think that's totally fixable, we'll just have to think a little about
how....
--b.
On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > So, I can't lay claim to identifying the exact optimisation/hack that
> > improves the retention of the re-export server's client cache when
> > re-exporting an NFSv3 server (which is then read by many clients). We
> > were working with an engineer at the time who showed an interest in
> > our use case and after we supplied a reproducer he suggested modifying
> > the nfs/inode.c
> >
> > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> >
> > His reasoning at the time was:
> >
> > "Fixes inode invalidation caused by read access. The least important
> > bit is ORed with 1 and causes the inode version to differ from the one
> > seen on the NFS share. This in turn causes unnecessary re-download
> > impacting the performance significantly. This fix makes it only
> > re-fetch file content if inode version seen on the server is newer
> > than the one on the client."
> >
> > But I've always been puzzled by why this only seems to be the case
> > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > processes on a standard client mount never causes any similar
> > re-validations. And this happens with a completely read-only share
> > which is why I started to think it has something to do with atimes as
> > that could perhaps still cause a "write" modification even when
> > read-only?
>
> Ah-hah! So, it's inode_query_iversion() that's modifying a nfs inode's
> i_version. That's a special thing that only nfsd would do.
>
> I think that's totally fixable, we'll just have to think a little about
> how....
I wonder if something like this helps?--b.
commit 0add88a9ccc5
Author: J. Bruce Fields <[email protected]>
Date: Fri Nov 13 17:03:04 2020 -0500
nfs: don't mangle i_version on NFS
The i_version on NFS is pretty much opaque to the client, so we don't
want to give the low bit any special interpretation.
Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
i_version on their own.
Signed-off-by: J. Bruce Fields <[email protected]>
diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
index 29ec8b09a52d..9b8dd5b713a7 100644
--- a/fs/nfs/fs_context.c
+++ b/fs/nfs/fs_context.c
@@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
.init_fs_context = nfs_init_fs_context,
.parameters = nfs_fs_parameters,
.kill_sb = nfs_kill_super,
- .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+ .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
+ FS_PRIVATE_I_VERSION,
};
MODULE_ALIAS_FS("nfs");
EXPORT_SYMBOL_GPL(nfs_fs_type);
@@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
.init_fs_context = nfs_init_fs_context,
.parameters = nfs_fs_parameters,
.kill_sb = nfs_kill_super,
- .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+ .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
+ FS_PRIVATE_I_VERSION,
};
MODULE_ALIAS_FS("nfs4");
MODULE_ALIAS("nfs4");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21cc971fd960..c5bb4268228b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2217,6 +2217,7 @@ struct file_system_type {
#define FS_HAS_SUBTYPE 4
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
+#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */
#define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
int (*init_fs_context)(struct fs_context *);
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 2917ef990d43..52c790a847de 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
u64 cur, old, new;
cur = inode_peek_iversion_raw(inode);
+ if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
+ return cur;
for (;;) {
/* If flag is already set, then no need to swap */
if (cur & I_VERSION_QUERIED) {
----- On 13 Nov, 2020, at 22:26, bfields [email protected] wrote:
> On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
>> On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
>> > So, I can't lay claim to identifying the exact optimisation/hack that
>> > improves the retention of the re-export server's client cache when
>> > re-exporting an NFSv3 server (which is then read by many clients). We
>> > were working with an engineer at the time who showed an interest in
>> > our use case and after we supplied a reproducer he suggested modifying
>> > the nfs/inode.c
>> >
>> > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>> > + if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>> >
>> > His reasoning at the time was:
>> >
>> > "Fixes inode invalidation caused by read access. The least important
>> > bit is ORed with 1 and causes the inode version to differ from the one
>> > seen on the NFS share. This in turn causes unnecessary re-download
>> > impacting the performance significantly. This fix makes it only
>> > re-fetch file content if inode version seen on the server is newer
>> > than the one on the client."
>> >
>> > But I've always been puzzled by why this only seems to be the case
>> > when using knfsd to re-export the (NFSv3) client mount. Using multiple
>> > processes on a standard client mount never causes any similar
>> > re-validations. And this happens with a completely read-only share
>> > which is why I started to think it has something to do with atimes as
>> > that could perhaps still cause a "write" modification even when
>> > read-only?
>>
>> Ah-hah! So, it's inode_query_iversion() that's modifying a nfs inode's
>> i_version. That's a special thing that only nfsd would do.
>>
>> I think that's totally fixable, we'll just have to think a little about
>> how....
>
> I wonder if something like this helps?--b.
>
> commit 0add88a9ccc5
> Author: J. Bruce Fields <[email protected]>
> Date: Fri Nov 13 17:03:04 2020 -0500
>
> nfs: don't mangle i_version on NFS
>
> The i_version on NFS is pretty much opaque to the client, so we don't
> want to give the low bit any special interpretation.
>
> Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> i_version on their own.
>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> index 29ec8b09a52d..9b8dd5b713a7 100644
> --- a/fs/nfs/fs_context.c
> +++ b/fs/nfs/fs_context.c
> @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> .init_fs_context = nfs_init_fs_context,
> .parameters = nfs_fs_parameters,
> .kill_sb = nfs_kill_super,
> - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> + FS_PRIVATE_I_VERSION,
> };
> MODULE_ALIAS_FS("nfs");
> EXPORT_SYMBOL_GPL(nfs_fs_type);
> @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> .init_fs_context = nfs_init_fs_context,
> .parameters = nfs_fs_parameters,
> .kill_sb = nfs_kill_super,
> - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> + FS_PRIVATE_I_VERSION,
> };
> MODULE_ALIAS_FS("nfs4");
> MODULE_ALIAS("nfs4");
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21cc971fd960..c5bb4268228b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,7 @@ struct file_system_type {
> #define FS_HAS_SUBTYPE 4
> #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
> +#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */
> #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
> #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename()
> internally. */
> int (*init_fs_context)(struct fs_context *);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..52c790a847de 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> u64 cur, old, new;
>
> cur = inode_peek_iversion_raw(inode);
> + if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> + return cur;
> for (;;) {
> /* If flag is already set, then no need to swap */
> if (cur & I_VERSION_QUERIED) {
Yes, I can confirm that this absolutely helps! I replaced our (brute force) iversion patch with this (much nicer) patch and we got the same improvement; nfsd and its clients no longer cause the re-export server's client cache to constantly be re-validated. The re-export server can now serve the same results to many clients from cache. Thanks so much for spending the time to track this down. If merged, future (crazy) NFS re-exporters will benefit from the metadata performance improvement/acceleration!
Now if anyone has any ideas why all the read calls to the originating server are limited to a maximum of 128k (with rsize=1M) when coming via the re-export server's nfsd threads, I see that as the next biggest performance issue. Reading directly on the re-export server with a userspace process issues 1MB reads as expected. It doesn't happen for writes (wsize=1MB all the way through) but I'm not sure if that has more to do with async and write back caching helping to build up the size before commit?
I figure the other remaining items on my (wish) list are probably more in the "won't fix" or "can't fix" category (except maybe the NFSv4.0 input/output errors?).
Daire
On Sat, Nov 14, 2020 at 12:57:24PM +0000, Daire Byrne wrote:
> ----- On 13 Nov, 2020, at 22:26, bfields [email protected] wrote:
> > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> >> Ah-hah! So, it's inode_query_iversion() that's modifying an nfs inode's
> >> i_version. That's a special thing that only nfsd would do.
> >>
> >> I think that's totally fixable, we'll just have to think a little about
> >> how....
> >
> > I wonder if something like this helps?--b.
> >
> > commit 0add88a9ccc5
> > Author: J. Bruce Fields <[email protected]>
> > Date: Fri Nov 13 17:03:04 2020 -0500
> >
> > nfs: don't mangle i_version on NFS
> >
> > The i_version on NFS is pretty much opaque to the client, so we don't
> > want to give the low bit any special interpretation.
> >
> > Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > i_version on their own.
> >
> > Signed-off-by: J. Bruce Fields <[email protected]>
> >
> > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > index 29ec8b09a52d..9b8dd5b713a7 100644
> > --- a/fs/nfs/fs_context.c
> > +++ b/fs/nfs/fs_context.c
> > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > .init_fs_context = nfs_init_fs_context,
> > .parameters = nfs_fs_parameters,
> > .kill_sb = nfs_kill_super,
> > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > + FS_PRIVATE_I_VERSION,
> > };
> > MODULE_ALIAS_FS("nfs");
> > EXPORT_SYMBOL_GPL(nfs_fs_type);
> > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > .init_fs_context = nfs_init_fs_context,
> > .parameters = nfs_fs_parameters,
> > .kill_sb = nfs_kill_super,
> > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > + FS_PRIVATE_I_VERSION,
> > };
> > MODULE_ALIAS_FS("nfs4");
> > MODULE_ALIAS("nfs4");
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 21cc971fd960..c5bb4268228b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > #define FS_HAS_SUBTYPE 4
> > #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> > #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
> > +#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */
> > #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
> > #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename()
> > internally. */
> > int (*init_fs_context)(struct fs_context *);
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 2917ef990d43..52c790a847de 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > u64 cur, old, new;
> >
> > cur = inode_peek_iversion_raw(inode);
> > + if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > + return cur;
> > for (;;) {
> > /* If flag is already set, then no need to swap */
> > if (cur & I_VERSION_QUERIED) {
>
> Yes, I can confirm that this absolutely helps! I replaced our (brute force) iversion patch with this (much nicer) patch and we got the same improvement; nfsd and its clients no longer cause the re-export server's client cache to constantly be re-validated. The re-export server can now serve the same results to many clients from cache. Thanks so much for spending the time to track this down. If merged, future (crazy) NFS re-exporters will benefit from the metadata performance improvement/acceleration!
>
> Now if anyone has any ideas why all the read calls to the originating server are limited to a maximum of 128k (with rsize=1M) when coming via the re-export server's nfsd threads, I see that as the next biggest performance issue. Reading directly on the re-export server with a userspace process issues 1MB reads as expected. It doesn't happen for writes (wsize=1MB all the way through) but I'm not sure if that has more to do with async and write back caching helping to build up the size before commit?
>
> I figure the other remaining items on my (wish) list are probably more in the "won't fix" or "can't fix" category (except maybe the NFSv4.0 input/output errors?).
>
> Daire
Jeff, does something like this look reasonable?
--b.
On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > improves the retention of the re-export server's client cache when
> > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > were working with an engineer at the time who showed an interest in
> > > our use case and after we supplied a reproducer he suggested modifying
> > > the nfs/inode.c
> > >
> > > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > + if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > {
> > >
> > > His reasoning at the time was:
> > >
> > > "Fixes inode invalidation caused by read access. The least important
> > > bit is ORed with 1 and causes the inode version to differ from the one
> > > seen on the NFS share. This in turn causes unnecessary re-download
> > > impacting the performance significantly. This fix makes it only
> > > re-fetch file content if inode version seen on the server is newer
> > > than the one on the client."
> > >
> > > But I've always been puzzled by why this only seems to be the case
> > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > processes on a standard client mount never causes any similar
> > > re-validations. And this happens with a completely read-only share
> > > which is why I started to think it has something to do with atimes as
> > > that could perhaps still cause a "write" modification even when
> > > read-only?
> >
> > Ah-hah! So, it's inode_query_iversion() that's modifying an nfs inode's
> > i_version. That's a special thing that only nfsd would do.
> >
> > I think that's totally fixable, we'll just have to think a little about
> > how....
>
> I wonder if something like this helps?--b.
>
> commit 0add88a9ccc5
> Author: J. Bruce Fields <[email protected]>
> Date: Fri Nov 13 17:03:04 2020 -0500
>
> nfs: don't mangle i_version on NFS
>
>
> The i_version on NFS is pretty much opaque to the client, so we don't
> want to give the low bit any special interpretation.
>
>
> Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> i_version on their own.
>
>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> index 29ec8b09a52d..9b8dd5b713a7 100644
> --- a/fs/nfs/fs_context.c
> +++ b/fs/nfs/fs_context.c
> @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> .init_fs_context = nfs_init_fs_context,
> .parameters = nfs_fs_parameters,
> .kill_sb = nfs_kill_super,
> - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> + FS_PRIVATE_I_VERSION,
> };
> MODULE_ALIAS_FS("nfs");
> EXPORT_SYMBOL_GPL(nfs_fs_type);
> @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> .init_fs_context = nfs_init_fs_context,
> .parameters = nfs_fs_parameters,
> .kill_sb = nfs_kill_super,
> - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> + FS_PRIVATE_I_VERSION,
> };
> MODULE_ALIAS_FS("nfs4");
> MODULE_ALIAS("nfs4");
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21cc971fd960..c5bb4268228b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,7 @@ struct file_system_type {
> #define FS_HAS_SUBTYPE 4
> #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
> +#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */
> #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
> #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
> int (*init_fs_context)(struct fs_context *);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..52c790a847de 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> u64 cur, old, new;
>
>
> cur = inode_peek_iversion_raw(inode);
> + if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> + return cur;
> for (;;) {
> /* If flag is already set, then no need to swap */
> if (cur & I_VERSION_QUERIED) {
It's probably more correct to just check the already-existing
SB_I_VERSION flag here (though in hindsight a fstype flag might have
made more sense).
--
Jeff Layton <[email protected]>
On Sat, Nov 14, 2020 at 12:57:24PM +0000, Daire Byrne wrote:
> Now if anyone has any ideas why all the read calls to the originating
> server are limited to a maximum of 128k (with rsize=1M) when coming
> via the re-export server's nfsd threads, I see that as the next
> biggest performance issue. Reading directly on the re-export server
> with a userspace process issues 1MB reads as expected. It doesn't
> happen for writes (wsize=1MB all the way through) but I'm not sure if
> that has more to do with async and write back caching helping to build
> up the size before commit?
I'm not sure where to start with this one....
Is this behavior independent of protocol version and backend server?
> I figure the other remaining items on my (wish) list are probably more
> in the "won't fix" or "can't fix" category (except maybe the NFSv4.0
> input/output errors?).
Well, sounds like you've found a case where this feature's actually
useful. We should make sure that's documented.
And I think it's also worth some effort to document and triage the list
of remaining issues.
--b.
On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > improves the retention of the re-export server's client cache when
> > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > were working with an engineer at the time who showed an interest in
> > > > our use case and after we supplied a reproducer he suggested modifying
> > > > the nfs/inode.c
> > > >
> > > > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > + if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > {
> > > >
> > > > His reasoning at the time was:
> > > >
> > > > "Fixes inode invalidation caused by read access. The least important
> > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > impacting the performance significantly. This fix makes it only
> > > > re-fetch file content if inode version seen on the server is newer
> > > > than the one on the client."
> > > >
> > > > But I've always been puzzled by why this only seems to be the case
> > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > processes on a standard client mount never causes any similar
> > > > re-validations. And this happens with a completely read-only share
> > > > which is why I started to think it has something to do with atimes as
> > > > that could perhaps still cause a "write" modification even when
> > > > read-only?
> > >
> > > Ah-hah! So, it's inode_query_iversion() that's modifying an nfs inode's
> > > i_version. That's a special thing that only nfsd would do.
> > >
> > > I think that's totally fixable, we'll just have to think a little about
> > > how....
> >
> > I wonder if something like this helps?--b.
> >
> > commit 0add88a9ccc5
> > Author: J. Bruce Fields <[email protected]>
> > Date: Fri Nov 13 17:03:04 2020 -0500
> >
> > nfs: don't mangle i_version on NFS
> >
> >
> > The i_version on NFS is pretty much opaque to the client, so we don't
> > want to give the low bit any special interpretation.
> >
> >
> > Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > i_version on their own.
> >
> >
> > Signed-off-by: J. Bruce Fields <[email protected]>
> >
> > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > index 29ec8b09a52d..9b8dd5b713a7 100644
> > --- a/fs/nfs/fs_context.c
> > +++ b/fs/nfs/fs_context.c
> > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > .init_fs_context = nfs_init_fs_context,
> > .parameters = nfs_fs_parameters,
> > .kill_sb = nfs_kill_super,
> > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > + FS_PRIVATE_I_VERSION,
> > };
> > MODULE_ALIAS_FS("nfs");
> > EXPORT_SYMBOL_GPL(nfs_fs_type);
> > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > .init_fs_context = nfs_init_fs_context,
> > .parameters = nfs_fs_parameters,
> > .kill_sb = nfs_kill_super,
> > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > + FS_PRIVATE_I_VERSION,
> > };
> > MODULE_ALIAS_FS("nfs4");
> > MODULE_ALIAS("nfs4");
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 21cc971fd960..c5bb4268228b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > #define FS_HAS_SUBTYPE 4
> > #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> > #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
> > +#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */
> > #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
> > #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
> > int (*init_fs_context)(struct fs_context *);
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 2917ef990d43..52c790a847de 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > u64 cur, old, new;
> >
> >
> > cur = inode_peek_iversion_raw(inode);
> > + if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > + return cur;
> > for (;;) {
> > /* If flag is already set, then no need to swap */
> > if (cur & I_VERSION_QUERIED) {
>
>
> It's probably more correct to just check the already-existing
> SB_I_VERSION flag here
So the check would be
if (!IS_I_VERSION(inode))
return cur;
?
> (though in hindsight a fstype flag might have made more sense).
I_VERSION support can vary by superblock (for example, xfs supports it
or not depending on on-disk format version).
--b.
On Mon, 2020-11-16 at 10:56 -0500, bfields wrote:
> On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > > improves the retention of the re-export server's client cache when
> > > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > > were working with an engineer at the time who showed an interest in
> > > > > our use case and after we supplied a reproducer he suggested modifying
> > > > > the nfs/inode.c
> > > > >
> > > > > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > > + if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > > {
> > > > >
> > > > > His reasoning at the time was:
> > > > >
> > > > > "Fixes inode invalidation caused by read access. The least important
> > > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > > impacting the performance significantly. This fix makes it only
> > > > > re-fetch file content if inode version seen on the server is newer
> > > > > than the one on the client."
> > > > >
> > > > > But I've always been puzzled by why this only seems to be the case
> > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > > processes on a standard client mount never causes any similar
> > > > > re-validations. And this happens with a completely read-only share
> > > > > which is why I started to think it has something to do with atimes as
> > > > > that could perhaps still cause a "write" modification even when
> > > > > read-only?
> > > >
> > > > Ah-hah! So, it's inode_query_iversion() that's modifying an nfs inode's
> > > > i_version. That's a special thing that only nfsd would do.
> > > >
> > > > I think that's totally fixable, we'll just have to think a little about
> > > > how....
> > >
> > > I wonder if something like this helps?--b.
> > >
> > > commit 0add88a9ccc5
> > > Author: J. Bruce Fields <[email protected]>
> > > Date: Fri Nov 13 17:03:04 2020 -0500
> > >
> > > nfs: don't mangle i_version on NFS
> > >
> > >
> > > The i_version on NFS is pretty much opaque to the client, so we don't
> > > want to give the low bit any special interpretation.
> > >
> > >
> > > Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > > i_version on their own.
> > >
> > >
> > > Signed-off-by: J. Bruce Fields <[email protected]>
> > >
> > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > > index 29ec8b09a52d..9b8dd5b713a7 100644
> > > --- a/fs/nfs/fs_context.c
> > > +++ b/fs/nfs/fs_context.c
> > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > > .init_fs_context = nfs_init_fs_context,
> > > .parameters = nfs_fs_parameters,
> > > .kill_sb = nfs_kill_super,
> > > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > + FS_PRIVATE_I_VERSION,
> > > };
> > > MODULE_ALIAS_FS("nfs");
> > > EXPORT_SYMBOL_GPL(nfs_fs_type);
> > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > > .init_fs_context = nfs_init_fs_context,
> > > .parameters = nfs_fs_parameters,
> > > .kill_sb = nfs_kill_super,
> > > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > + FS_PRIVATE_I_VERSION,
> > > };
> > > MODULE_ALIAS_FS("nfs4");
> > > MODULE_ALIAS("nfs4");
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 21cc971fd960..c5bb4268228b 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > > #define FS_HAS_SUBTYPE 4
> > > #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> > > #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
> > > +#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */
> > > #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
> > > #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
> > > int (*init_fs_context)(struct fs_context *);
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 2917ef990d43..52c790a847de 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > > u64 cur, old, new;
> > >
> > >
> > > cur = inode_peek_iversion_raw(inode);
> > > + if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > > + return cur;
> > > for (;;) {
> > > /* If flag is already set, then no need to swap */
> > > if (cur & I_VERSION_QUERIED) {
> >
> >
> > It's probably more correct to just check the already-existing
> > SB_I_VERSION flag here
>
> So the check would be
>
> if (!IS_I_VERSION(inode))
> return cur;
>
> ?
>
Yes, that looks about right.
> > (though in hindsight a fstype flag might have made more sense).
>
> I_VERSION support can vary by superblock (for example, xfs supports it
> or not depending on on-disk format version).
>
Good point!
--
Jeff Layton <[email protected]>
On Mon, Nov 16, 2020 at 11:03:00AM -0500, Jeff Layton wrote:
> On Mon, 2020-11-16 at 10:56 -0500, bfields wrote:
> > On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> > > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > > > improves the retention of the re-export server's client cache when
> > > > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > > > were working with an engineer at the time who showed an interest in
> > > > > > our use case and after we supplied a reproducer he suggested modifying
> > > > > > the nfs/inode.c
> > > > > >
> > > > > > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > > > + if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > > > {
> > > > > >
> > > > > > His reasoning at the time was:
> > > > > >
> > > > > > "Fixes inode invalidation caused by read access. The least important
> > > > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > > > impacting the performance significantly. This fix makes it only
> > > > > > re-fetch file content if inode version seen on the server is newer
> > > > > > than the one on the client."
> > > > > >
> > > > > > But I've always been puzzled by why this only seems to be the case
> > > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > > > processes on a standard client mount never causes any similar
> > > > > > re-validations. And this happens with a completely read-only share
> > > > > > which is why I started to think it has something to do with atimes as
> > > > > > that could perhaps still cause a "write" modification even when
> > > > > > read-only?
> > > > >
> > > > > Ah-hah! So, it's inode_query_iversion() that's modifying an nfs inode's
> > > > > i_version. That's a special thing that only nfsd would do.
> > > > >
> > > > > I think that's totally fixable, we'll just have to think a little about
> > > > > how....
> > > >
> > > > I wonder if something like this helps?--b.
> > > >
> > > > commit 0add88a9ccc5
> > > > Author: J. Bruce Fields <[email protected]>
> > > > Date: Fri Nov 13 17:03:04 2020 -0500
> > > >
> > > > nfs: don't mangle i_version on NFS
> > > >
> > > >
> > > > The i_version on NFS is pretty much opaque to the client, so we don't
> > > > want to give the low bit any special interpretation.
> > > >
> > > >
> > > > Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > > > i_version on their own.
> > > >
> > > >
> > > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > >
> > > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > > > index 29ec8b09a52d..9b8dd5b713a7 100644
> > > > --- a/fs/nfs/fs_context.c
> > > > +++ b/fs/nfs/fs_context.c
> > > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > > > .init_fs_context = nfs_init_fs_context,
> > > > .parameters = nfs_fs_parameters,
> > > > .kill_sb = nfs_kill_super,
> > > > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > + FS_PRIVATE_I_VERSION,
> > > > };
> > > > MODULE_ALIAS_FS("nfs");
> > > > EXPORT_SYMBOL_GPL(nfs_fs_type);
> > > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > > > .init_fs_context = nfs_init_fs_context,
> > > > .parameters = nfs_fs_parameters,
> > > > .kill_sb = nfs_kill_super,
> > > > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > + FS_PRIVATE_I_VERSION,
> > > > };
> > > > MODULE_ALIAS_FS("nfs4");
> > > > MODULE_ALIAS("nfs4");
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index 21cc971fd960..c5bb4268228b 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > > > #define FS_HAS_SUBTYPE 4
> > > > #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> > > > #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
> > > > +#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */
> > > > #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
> > > > #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
> > > > int (*init_fs_context)(struct fs_context *);
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 2917ef990d43..52c790a847de 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > > > u64 cur, old, new;
> > > >
> > > >
> > > > cur = inode_peek_iversion_raw(inode);
> > > > + if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > > > + return cur;
> > > > for (;;) {
> > > > /* If flag is already set, then no need to swap */
> > > > if (cur & I_VERSION_QUERIED) {
> > >
> > >
> > > It's probably more correct to just check the already-existing
> > > SB_I_VERSION flag here
> >
> > So the check would be
> >
> > if (!IS_I_VERSION(inode))
> > return cur;
> >
> > ?
> >
>
> Yes, that looks about right.
That doesn't sound right to me. NFS, for example, has a perfectly good
i_version that works as a change attribute, so it should set
SB_I_VERSION. But it doesn't want the vfs playing games with the low
bit.
(In fact, I'm confused now: the improvement Daire was seeing should only
be possible if the re-export server was seeing SB_I_VERSION set on the
NFS filesystem it was exporting, but a quick grep doesn't actually show
me where NFS is setting SB_I_VERSION. I'm missing something
obvious....)
--b.
On Mon, 2020-11-16 at 11:14 -0500, bfields wrote:
> On Mon, Nov 16, 2020 at 11:03:00AM -0500, Jeff Layton wrote:
> > On Mon, 2020-11-16 at 10:56 -0500, bfields wrote:
> > > On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote:
> > > > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote:
> > > > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
> > > > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
> > > > > > > So, I can't lay claim to identifying the exact optimisation/hack that
> > > > > > > improves the retention of the re-export server's client cache when
> > > > > > > re-exporting an NFSv3 server (which is then read by many clients). We
> > > > > > > were working with an engineer at the time who showed an interest in
> > > > > > > our use case and after we supplied a reproducer he suggested modifying
> > > > > > > the nfs/inode.c
> > > > > > >
> > > > > > > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > > > > + if (inode_peek_iversion_raw(inode) < fattr->change_attr)
> > > > > > > {
> > > > > > >
> > > > > > > His reasoning at the time was:
> > > > > > >
> > > > > > > "Fixes inode invalidation caused by read access. The least important
> > > > > > > bit is ORed with 1 and causes the inode version to differ from the one
> > > > > > > seen on the NFS share. This in turn causes unnecessary re-download
> > > > > > > impacting the performance significantly. This fix makes it only
> > > > > > > re-fetch file content if inode version seen on the server is newer
> > > > > > > than the one on the client."
> > > > > > >
> > > > > > > But I've always been puzzled by why this only seems to be the case
> > > > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple
> > > > > > > processes on a standard client mount never causes any similar
> > > > > > > re-validations. And this happens with a completely read-only share
> > > > > > > which is why I started to think it has something to do with atimes as
> > > > > > > that could perhaps still cause a "write" modification even when
> > > > > > > read-only?
> > > > > >
> > > > > > Ah-hah! So, it's inode_query_iversion() that's modifying an nfs inode's
> > > > > > i_version. That's a special thing that only nfsd would do.
> > > > > >
> > > > > > I think that's totally fixable, we'll just have to think a little about
> > > > > > how....
> > > > >
> > > > > I wonder if something like this helps?--b.
> > > > >
> > > > > commit 0add88a9ccc5
> > > > > Author: J. Bruce Fields <[email protected]>
> > > > > Date: Fri Nov 13 17:03:04 2020 -0500
> > > > >
> > > > > nfs: don't mangle i_version on NFS
> > > > >
> > > > >
> > > > > The i_version on NFS is pretty much opaque to the client, so we don't
> > > > > want to give the low bit any special interpretation.
> > > > >
> > > > >
> > > > > Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > > > > i_version on their own.
> > > > >
> > > > >
> > > > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > > >
> > > > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > > > > index 29ec8b09a52d..9b8dd5b713a7 100644
> > > > > --- a/fs/nfs/fs_context.c
> > > > > +++ b/fs/nfs/fs_context.c
> > > > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
> > > > > .init_fs_context = nfs_init_fs_context,
> > > > > .parameters = nfs_fs_parameters,
> > > > > .kill_sb = nfs_kill_super,
> > > > > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > > + FS_PRIVATE_I_VERSION,
> > > > > };
> > > > > MODULE_ALIAS_FS("nfs");
> > > > > EXPORT_SYMBOL_GPL(nfs_fs_type);
> > > > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
> > > > > .init_fs_context = nfs_init_fs_context,
> > > > > .parameters = nfs_fs_parameters,
> > > > > .kill_sb = nfs_kill_super,
> > > > > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> > > > > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> > > > > + FS_PRIVATE_I_VERSION,
> > > > > };
> > > > > MODULE_ALIAS_FS("nfs4");
> > > > > MODULE_ALIAS("nfs4");
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index 21cc971fd960..c5bb4268228b 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -2217,6 +2217,7 @@ struct file_system_type {
> > > > > #define FS_HAS_SUBTYPE 4
> > > > > #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> > > > > #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
> > > > > +#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */
> > > > > #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
> > > > > #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
> > > > > int (*init_fs_context)(struct fs_context *);
> > > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > > index 2917ef990d43..52c790a847de 100644
> > > > > --- a/include/linux/iversion.h
> > > > > +++ b/include/linux/iversion.h
> > > > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
> > > > > u64 cur, old, new;
> > > > >
> > > > >
> > > > > cur = inode_peek_iversion_raw(inode);
> > > > > + if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> > > > > + return cur;
> > > > > for (;;) {
> > > > > /* If flag is already set, then no need to swap */
> > > > > if (cur & I_VERSION_QUERIED) {
> > > >
> > > >
> > > > It's probably more correct to just check the already-existing
> > > > SB_I_VERSION flag here
> > >
> > > So the check would be
> > >
> > > if (!IS_I_VERSION(inode))
> > > return cur;
> > >
> > > ?
> > >
> >
> > Yes, that looks about right.
>
> That doesn't sound right to me. NFS, for example, has a perfectly good
> i_version that works as a change attribute, so it should set
> SB_I_VERSION. But it doesn't want the vfs playing games with the low
> bit.
>
> (In fact, I'm confused now: the improvement Daire was seeing should only
> be possible if the re-export server was seeing SB_I_VERSION set on the
> NFS filesystem it was exporting, but a quick grep doesn't actually show
> me where NFS is setting SB_I_VERSION. I'm missing something
> obvious....)
Hmm, ok... nfsd4_change_attribute() is called from nfs4 code but also
nfs3 code as well. The v4 caller (encode_change) only calls it when
IS_I_VERSION is set, but the v3 callers don't seem to pay attention to
that.
I think the basic issue here is that we're trying to use SB_I_VERSION
for two different things. Its main purpose is to tell the kernel that
when it's updating the file times that it should also (possibly)
increment the i_version counter too. (Some of this is documented in
include/linux/iversion.h too, fwiw)
nfsd needs a way to tell whether the field should be consulted at all.
For that we probably do need a different flag of some sort. Doing it at
the fstype level seems a bit wrong though -- v2/3 don't have a real
change attribute and it probably shouldn't be trusted when exporting
them.
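To illustrate the two different uses (a simplified sketch, not the actual upstream code):
    /* 1) the VFS timestamp path: only bump i_version when the sb opts in */
    if (IS_I_VERSION(inode))
            inode_maybe_inc_iversion(inode, false);
    /* 2) nfsd: is i_version usable as a change attribute at all? */
    if (IS_I_VERSION(inode))
            change_attr = inode_query_iversion(inode);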
--
Jeff Layton <[email protected]>
On Mon, Nov 16, 2020 at 11:38:44AM -0500, Jeff Layton wrote:
> Hmm, ok... nfsd4_change_attribute() is called from nfs4 code but also
> nfs3 code as well. The v4 caller (encode_change) only calls it when
> IS_I_VERSION is set, but the v3 callers don't seem to pay attention to
> that.
Weird. Looking back.... That goes back to the original patch adding
support for ext4's i_version, c654b8a9cba6 "nfsd: support ext4
i_version".
It's in nfs3xdr.c, but the fields it's filling in, fh_pre_change and
fh_post_change, are only used in nfs4xdr.c. Maybe moving it someplace
else (vfs.c?) would save some confusion.
Anyway, yes, that should be checking SB_I_VERSION too.
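i.e. the v3 wcc fill would presumably want something like (untested):
    /* only trust i_version as a change attribute when the sb opts in */
    if (IS_I_VERSION(inode))
            fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);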
> I think the basic issue here is that we're trying to use SB_I_VERSION
> for two different things. Its main purpose is to tell the kernel that
> when it's updating the file times that it should also (possibly)
> increment the i_version counter too. (Some of this is documented in
> include/linux/iversion.h too, fwiw)
>
> nfsd needs a way to tell whether the field should be consulted at all.
> For that we probably do need a different flag of some sort. Doing it at
> the fstype level seems a bit wrong though -- v2/3 don't have a real
> change attribute and it probably shouldn't be trusted when exporting
> them.
Oops, good point.
I suppose simplest is just another SB_ flag.
--b.
----- On 16 Nov, 2020, at 15:53, bfields [email protected] wrote:
> On Sat, Nov 14, 2020 at 12:57:24PM +0000, Daire Byrne wrote:
>> Now if anyone has any ideas why all the read calls to the originating
>> server are limited to a maximum of 128k (with rsize=1M) when coming
>> via the re-export server's nfsd threads, I see that as the next
>> biggest performance issue. Reading directly on the re-export server
>> with a userspace process issues 1MB reads as expected. It doesn't
>> happen for writes (wsize=1MB all the way through) but I'm not sure if
>> that has more to do with async and write back caching helping to build
>> up the size before commit?
>
> I'm not sure where to start with this one....
>
> Is this behavior independent of protocol version and backend server?
It seems to be the case for all combinations of backend server versions and re-export protocol versions.
But it does look like it is related to readahead somehow. The default for a client mount is 128k.
I just increased it to 1024 on the re-export server's client mount of the originating server, and now it's doing the expected 1MB (rsize) read requests back to on-prem from the clients all the way through, i.e.
echo 1024 > /sys/class/bdi/0:52/read_ahead_kb
So there is a difference in behaviour between reading from the client mount with userspace processes and reading via the knfsd threads on the re-export server.
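In case it's useful to anyone else, one way to find the right bdi for a given NFS client mount (the mount point below is just a placeholder) is roughly:
    mountpoint -d /mnt/onprem                      # prints the bdi device id, e.g. 0:52
    echo 1024 > /sys/class/bdi/0:52/read_ahead_kb  # match the 1MB rsize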
Daire
On Mon, 2020-11-16 at 14:03 -0500, bfields wrote:
> On Mon, Nov 16, 2020 at 11:38:44AM -0500, Jeff Layton wrote:
> > Hmm, ok... nfsd4_change_attribute() is called from nfs4 code but also
> > nfs3 code as well. The v4 caller (encode_change) only calls it when
> > IS_I_VERSION is set, but the v3 callers don't seem to pay attention to
> > that.
>
> Weird. Looking back.... That goes back to the original patch adding
> support for ext4's i_version, c654b8a9cba6 "nfsd: support ext4
> i_version".
>
> It's in nfs3xdr.c, but the fields it's filling in, fh_pre_change and
> fh_post_change, are only used in nfs4xdr.c. Maybe moving it someplace
> else (vfs.c?) would save some confusion.
>
> Anyway, yes, that should be checking SB_I_VERSION too.
>
> > I think the basic issue here is that we're trying to use SB_I_VERSION
> > for two different things. Its main purpose is to tell the kernel that
> > when it's updating the file times that it should also (possibly)
> > increment the i_version counter too. (Some of this is documented in
> > include/linux/iversion.h too, fwiw)
> >
> > nfsd needs a way to tell whether the field should be consulted at all.
> > For that we probably do need a different flag of some sort. Doing it at
> > the fstype level seems a bit wrong though -- v2/3 don't have a real
> > change attribute and it probably shouldn't be trusted when exporting
> > them.
>
> Oops, good point.
>
> I suppose simplest is just another SB_ flag.
>
Another idea might be to add a new fetch_iversion export operation that
returns a u64. Roll two generic functions -- one to handle the
xfs/ext4/btrfs case and another for the NFS/AFS/Ceph case (where we just
fetch it raw). When the op is a NULL pointer, treat it like the
!IS_I_VERSION case today.
--
Jeff Layton <[email protected]>
On Mon, Nov 16, 2020 at 03:03:15PM -0500, Jeff Layton wrote:
> Another idea might be to add a new fetch_iversion export operation that
> returns a u64. Roll two generic functions -- one to handle the
> xfs/ext4/btrfs case and another for the NFS/AFS/Ceph case (where we just
> fetch it raw). When the op is a NULL pointer, treat it like the
> !IS_I_VERSION case today.
OK, a rough attempt follows, mostly untested.--b.
From: "J. Bruce Fields" <[email protected]>
These functions are actually used by NFSv4 code as well, and having them
in nfs3xdr.c has caused some confusion.
This is just cleanup, no change in behavior.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/nfsd/nfs3xdr.c | 49 -----------------------------------------------
fs/nfsd/nfsfh.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+), 49 deletions(-)
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2277f83da250..14efb3aba6b2 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -252,55 +252,6 @@ encode_wcc_data(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
return encode_post_op_attr(rqstp, p, fhp);
}
-/*
- * Fill in the pre_op attr for the wcc data
- */
-void fill_pre_wcc(struct svc_fh *fhp)
-{
- struct inode *inode;
- struct kstat stat;
- __be32 err;
-
- if (fhp->fh_pre_saved)
- return;
-
- inode = d_inode(fhp->fh_dentry);
- err = fh_getattr(fhp, &stat);
- if (err) {
- /* Grab the times from inode anyway */
- stat.mtime = inode->i_mtime;
- stat.ctime = inode->i_ctime;
- stat.size = inode->i_size;
- }
-
- fhp->fh_pre_mtime = stat.mtime;
- fhp->fh_pre_ctime = stat.ctime;
- fhp->fh_pre_size = stat.size;
- fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
- fhp->fh_pre_saved = true;
-}
-
-/*
- * Fill in the post_op attr for the wcc data
- */
-void fill_post_wcc(struct svc_fh *fhp)
-{
- __be32 err;
-
- if (fhp->fh_post_saved)
- printk("nfsd: inode locked twice during operation.\n");
-
- err = fh_getattr(fhp, &fhp->fh_post_attr);
- fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
- d_inode(fhp->fh_dentry));
- if (err) {
- fhp->fh_post_saved = false;
- /* Grab the ctime anyway - set_change_info might use it */
- fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
- } else
- fhp->fh_post_saved = true;
-}
-
/*
* XDR decode functions
*/
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index c81dbbad8792..b3b4e8809aa9 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -711,3 +711,52 @@ enum fsid_source fsid_source(struct svc_fh *fhp)
return FSIDSOURCE_UUID;
return FSIDSOURCE_DEV;
}
+
+/*
+ * Fill in the pre_op attr for the wcc data
+ */
+void fill_pre_wcc(struct svc_fh *fhp)
+{
+ struct inode *inode;
+ struct kstat stat;
+ __be32 err;
+
+ if (fhp->fh_pre_saved)
+ return;
+
+ inode = d_inode(fhp->fh_dentry);
+ err = fh_getattr(fhp, &stat);
+ if (err) {
+ /* Grab the times from inode anyway */
+ stat.mtime = inode->i_mtime;
+ stat.ctime = inode->i_ctime;
+ stat.size = inode->i_size;
+ }
+
+ fhp->fh_pre_mtime = stat.mtime;
+ fhp->fh_pre_ctime = stat.ctime;
+ fhp->fh_pre_size = stat.size;
+ fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
+ fhp->fh_pre_saved = true;
+}
+
+/*
+ * Fill in the post_op attr for the wcc data
+ */
+void fill_post_wcc(struct svc_fh *fhp)
+{
+ __be32 err;
+
+ if (fhp->fh_post_saved)
+ printk("nfsd: inode locked twice during operation.\n");
+
+ err = fh_getattr(fhp, &fhp->fh_post_attr);
+ fhp->fh_post_change = nfsd4_change_attribute(&fhp->fh_post_attr,
+ d_inode(fhp->fh_dentry));
+ if (err) {
+ fhp->fh_post_saved = false;
+ /* Grab the ctime anyway - set_change_info might use it */
+ fhp->fh_post_attr.ctime = d_inode(fhp->fh_dentry)->i_ctime;
+ } else
+ fhp->fh_post_saved = true;
+}
--
2.28.0
From: "J. Bruce Fields" <[email protected]>
The i_version on NFS is pretty much opaque to the client, so we don't
want to give the low bit any special interpretation.
Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
i_version on their own.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/nfs/export.c | 1 +
include/linux/exportfs.h | 1 +
include/linux/iversion.h | 4 ++++
3 files changed, 6 insertions(+)
diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 3430d6891e89..c2eb915a54ca 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
.encode_fh = nfs_encode_fh,
.fh_to_dentry = nfs_fh_to_dentry,
.get_parent = nfs_get_parent,
+ .fetch_iversion = inode_peek_iversion_raw,
};
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 3ceb72b67a7a..6000121a201f 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,6 +213,7 @@ struct export_operations {
bool write, u32 *device_generation);
int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
int nr_iomaps, struct iattr *iattr);
+ u64 (*fetch_iversion)(const struct inode *);
};
extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 2917ef990d43..481b3debf6bb 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -3,6 +3,7 @@
#define _LINUX_IVERSION_H
#include <linux/fs.h>
+#include <linux/exportfs.h>
/*
* The inode->i_version field:
@@ -306,6 +307,9 @@ inode_query_iversion(struct inode *inode)
{
u64 cur, old, new;
+ if (inode->i_sb->s_export_op->fetch_iversion)
+ return inode->i_sb->s_export_op->fetch_iversion(inode);
+
cur = inode_peek_iversion_raw(inode);
for (;;) {
/* If flag is already set, then no need to swap */
--
2.28.0
On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <[email protected]>
>
> The i_version on NFS is pretty much opaque to the client, so we don't
> want to give the low bit any special interpretation.
>
> Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> i_version on their own.
>
Description here doesn't quite match the patch...
> Signed-off-by: J. Bruce Fields <[email protected]>
> ---
> fs/nfs/export.c | 1 +
> include/linux/exportfs.h | 1 +
> include/linux/iversion.h | 4 ++++
> 3 files changed, 6 insertions(+)
>
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index 3430d6891e89..c2eb915a54ca 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
> .encode_fh = nfs_encode_fh,
> .fh_to_dentry = nfs_fh_to_dentry,
> .get_parent = nfs_get_parent,
> + .fetch_iversion = inode_peek_iversion_raw,
> };
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 3ceb72b67a7a..6000121a201f 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -213,6 +213,7 @@ struct export_operations {
> bool write, u32 *device_generation);
> int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> int nr_iomaps, struct iattr *iattr);
> + u64 (*fetch_iversion)(const struct inode *);
> };
>
> extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..481b3debf6bb 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -3,6 +3,7 @@
> #define _LINUX_IVERSION_H
> #include <linux/fs.h>
> +#include <linux/exportfs.h>
>
> /*
> * The inode->i_version field:
> @@ -306,6 +307,9 @@ inode_query_iversion(struct inode *inode)
> {
> u64 cur, old, new;
>
> + if (inode->i_sb->s_export_op->fetch_iversion)
> + return inode->i_sb->s_export_op->fetch_iversion(inode);
> +
This looks dangerous -- s_export_op could be a NULL pointer.
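e.g. presumably it would need a guard along these lines (untested):
    if (inode->i_sb->s_export_op &&
        inode->i_sb->s_export_op->fetch_iversion)
            return inode->i_sb->s_export_op->fetch_iversion(inode);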
> cur = inode_peek_iversion_raw(inode);
> for (;;) {
> /* If flag is already set, then no need to swap */
--
Jeff Layton <[email protected]>
On Tue, Nov 17, 2020 at 07:27:03AM -0500, Jeff Layton wrote:
> On Mon, 2020-11-16 at 22:18 -0500, J. Bruce Fields wrote:
> > From: "J. Bruce Fields" <[email protected]>
> >
> > The i_version on NFS is pretty much opaque to the client, so we don't
> > want to give the low bit any special interpretation.
> >
> > Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> > i_version on their own.
> >
>
> Description here doesn't quite match the patch...
Oops, thanks.--b.
>
> > Signed-off-by: J. Bruce Fields <[email protected]>
> > ---
> > fs/nfs/export.c | 1 +
> > include/linux/exportfs.h | 1 +
> > include/linux/iversion.h | 4 ++++
> > 3 files changed, 6 insertions(+)
> >
> > diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> > index 3430d6891e89..c2eb915a54ca 100644
> > --- a/fs/nfs/export.c
> > +++ b/fs/nfs/export.c
> > @@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
> > .encode_fh = nfs_encode_fh,
> > .fh_to_dentry = nfs_fh_to_dentry,
> > .get_parent = nfs_get_parent,
> > + .fetch_iversion = inode_peek_iversion_raw,
> > };
> > diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> > index 3ceb72b67a7a..6000121a201f 100644
> > --- a/include/linux/exportfs.h
> > +++ b/include/linux/exportfs.h
> > @@ -213,6 +213,7 @@ struct export_operations {
> > bool write, u32 *device_generation);
> > int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> > int nr_iomaps, struct iattr *iattr);
> > + u64 (*fetch_iversion)(const struct inode *);
> > };
> >
> > extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 2917ef990d43..481b3debf6bb 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -3,6 +3,7 @@
> > #define _LINUX_IVERSION_H
> >
> > #include <linux/fs.h>
> > +#include <linux/exportfs.h>
> >
> > /*
> > * The inode->i_version field:
> > @@ -306,6 +307,9 @@ inode_query_iversion(struct inode *inode)
> > {
> > u64 cur, old, new;
> >
> > + if (inode->i_sb->s_export_op->fetch_iversion)
> > + return inode->i_sb->s_export_op->fetch_iversion(inode);
> > +
>
> This looks dangerous -- s_export_op could be a NULL pointer.
>
> > cur = inode_peek_iversion_raw(inode);
> > for (;;) {
> > /* If flag is already set, then no need to swap */
>
> --
> Jeff Layton <[email protected]>
----- On 12 Nov, 2020, at 13:01, Daire Byrne [email protected] wrote:
>
> Having just completed a bunch of fresh cloud rendering with v5.9.1 and Trond's
> NFSv3 lookupp emulation patches, I can now revise my original list of issues
> that others will likely experience if they ever try to do this craziness:
>
> 1) Don't re-export NFSv4.0 unless you set vfs_cache_presure=0 otherwise you will
> see random input/output errors on your clients when things are dropped out of
> the cache. In the end we gave up on using NFSv4.0 with our Netapps because the
> 7-mode implementation seemed a bit flakey with modern Linux clients (Linux
> NFSv4.2 servers on the other hand have been rock solid). We now use NFSv3 with
> Trond's lookupp emulation patches instead.
>
> 2) In order to better utilise the re-export server's client cache when
> re-exporting an NFSv3 server (using either NFSv3 or NFSv4), we still need to
> use the horrible inode_peek_iversion_raw hack to maintain good metadata
> performance for large numbers of clients. Otherwise each re-export server's
> clients can cause invalidation of the re-export server client cache. Once you
> have hundreds of clients they all combine to constantly invalidate the cache
> resulting in an order of magnitude slower metadata performance. If you are
> re-exporting an NFSv4.x server (with either NFSv3 or NFSv4.x) this hack is not
> required.
>
> 3) For some reason, when a 1MB read call arrives at the re-export server from a
> client, it gets chopped up into 128k read calls that are issued back to the
> originating server despite rsize/wsize=1MB on all mounts. This results in a
> noticeable increase in rpc chatter for large reads. Writes on the other hand
> retain their 1MB size from client to re-export server and back to the
> originating server. I am using nconnect but I doubt that is related.
>
> 4) After some random time, the cachefilesd userspace daemon stops culling old
> data from an fscache disk storage. I thought it was to do with setting
> vfs_cache_pressure=0 but even with it set to the default 100 it just randomly
> decides to stop culling and never comes back to life until restarted or
> rebooted. Perhaps the fscache/cachefilesd rewrite that David Howells & David
> Wysochanski have been working on will improve matters.
>
> 5) It's still really hard to cache nfs client metadata for any definitive time
> (actimeo,nocto) due to the pagecache churn that reads cause. If all required
> metadata (i.e. directory contents) could either be locally cached to disk or
> the inode cache rather than pagecache then maybe we would have more control
> over the actual cache times we are comfortable with for our workloads. This has
> little to do with re-exporting and is just a general NFS performance over the
> WAN thing. I'm very interested to see how Trond's recent patches to improve
> readdir performance might at least help re-populate the dropped cached metadata
> more efficiently over the WAN.
>
> I just want to finish with one more crazy thing we have been doing - a re-export
> server of a re-export server! Again, a locking and consistency nightmare so
> only possible for very specific workloads (like ours). The advantage of this
> topology is that you can pull all your data over the WAN once (e.g. on-premise
> to cloud) and then fan-out that data to multiple other NFS re-export servers in
> the cloud to improve the aggregate performance to many clients. This avoids
> having multiple re-export servers all needing to pull the same data across the
> WAN.
I will officially add another point to the wishlist that I mentioned in Bruce's recent patches thread (for dealing with the iversion change on NFS re-export). I had held off mentioning this one because I wasn't really sure whether it was just normal, expected NFS behaviour for a production workload, but the more I look into it, the more it seems like it could be optimised for the re-export case. But then I might also just be overly sensitive about metadata ops over the WAN at this point....
6) I see many fast repeating COMMITs & GETATTRs from the NFS re-export server to the originating server for the same file while writing through it from a client. If I do a write from userspace on the re-export server directly to the client mountpoint (i.e. no re-exporting) I do not see the GETATTRs or COMMITs.
I see something similar with both a re-export of a NFSv3 originating server and a re-export of a NFSv4.2 originating server (using either NFSv3 or NFSv4). Bruce mentioned an extra GETATTR in the NFSv4.2 re-export case for a COMMIT (pre/post).
For simplicity let's look at the NFSv3 re-export of an NFSv3 originating server. But first let's write a file from userspace directly on the re-export server back to the originating server mount point (i.e. no re-export):
3 0.772902 V3 GETATTR Call, FH: 0x6791bc70
6 0.781239 V3 SETATTR Call, FH: 0x6791bc70
3286 0.919601 V3 WRITE Call, FH: 0x6791bc70 Offset: 1048576 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
3494 0.921351 V3 WRITE Call, FH: 0x6791bc70 Offset: 8388608 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
...
...
48178 1.462670 V3 WRITE Call, FH: 0x6791bc70 Offset: 102760448 Len: 1048576 UNSTABLE
48210 1.472400 V3 COMMIT Call, FH: 0x6791bc70
So lots of uninterrupted 1MB write calls back to the originating server as expected with a final COMMIT (good). We can also set nconnect=16 back to the originating server and get the same trace but with the write packets going down different ports (also good).
Now let's do the same write through the re-export server from a client (NFSv4.2 or NFSv3, it doesn't matter much):
7 0.034411 V3 SETATTR Call, FH: 0x364ced2c
286 0.148066 V3 WRITE Call, FH: 0x364ced2c Offset: 0 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
343 0.152644 V3 WRITE Call, FH: 0x364ced2c Offset: 1048576 Len: 196608 UNSTABLE
V3 WRITE Call, FH: 0x364ced2c Offset: 1245184 Len: 8192 FILE_SYNC
580 0.168159 V3 WRITE Call, FH: 0x364ced2c Offset: 1253376 Len: 843776 UNSTABLE
671 0.174668 V3 COMMIT Call, FH: 0x364ced2c
1105 0.193805 V3 COMMIT Call, FH: 0x364ced2c
1123 0.201570 V3 WRITE Call, FH: 0x364ced2c Offset: 2097152 Len: 1048576 UNSTABLE [TCP segment of a reassembled PDU]
1592 0.242259 V3 WRITE Call, FH: 0x364ced2c Offset: 3145728 Len: 1048576 UNSTABLE
...
...
54571 3.668028 V3 WRITE Call, FH: 0x364ced2c Offset: 102760448 Len: 1048576 FILE_SYNC [TCP segment of a reassembled PDU]
54940 3.713392 V3 WRITE Call, FH: 0x364ced2c Offset: 103809024 Len: 1048576 UNSTABLE
55706 3.733284 V3 COMMIT Call, FH: 0x364ced2c
So now we have lots of pairs of COMMIT calls in between the WRITE calls. We also see sporadic FILE_SYNC write calls, which we don't see when we write directly to the originating server from userspace (all UNSTABLE).
Finally, if we add nconnect=16 when mounting the originating server (useful for increasing WAN throughput) and again write through from the client, we start to see lots of GETATTRs mixed with the WRITEs & COMMITs:
84 0.075830 V3 SETATTR Call, FH: 0x0e9698e8
608 0.201944 V3 WRITE Call, FH: 0x0e9698e8 Offset: 0 Len: 1048576 UNSTABLE
857 0.218760 V3 COMMIT Call, FH: 0x0e9698e8
968 0.231706 V3 WRITE Call, FH: 0x0e9698e8 Offset: 1048576 Len: 1048576 UNSTABLE
1042 0.246934 V3 COMMIT Call, FH: 0x0e9698e8
...
...
43754 3.033689 V3 WRITE Call, FH: 0x0e9698e8 Offset: 100663296 Len: 1048576 UNSTABLE
44085 3.044767 V3 COMMIT Call, FH: 0x0e9698e8
44086 3.044959 V3 GETATTR Call, FH: 0x0e9698e8
44087 3.044964 V3 GETATTR Call, FH: 0x0e9698e8
44088 3.044983 V3 COMMIT Call, FH: 0x0e9698e8
44615 3.079491 V3 WRITE Call, FH: 0x0e9698e8 Offset: 102760448 Len: 1048576 UNSTABLE
44700 3.082909 V3 WRITE Call, FH: 0x0e9698e8 Offset: 103809024 Len: 1048576 UNSTABLE
44978 3.092010 V3 COMMIT Call, FH: 0x0e9698e8
44982 3.092943 V3 COMMIT Call, FH: 0x0e9698e8
Sometimes I have seen clusters of 16 GETATTRs for the same file on the wire with nothing else in between. So if the re-export server is the only "client" writing these files to the originating server, why do we need to do so many repeated GETATTR calls when using nconnect>1? And why are the COMMIT calls required when the writes are coming via nfsd but not from userspace on the re-export server? Is that due to some sort of memory pressure or locking?
I picked the NFSv3 originating server case because my head starts to hurt tracking the equivalent packets, stateids and compound calls with NFSv4. But I think it's mostly the same for NFSv4. The writes through the re-export server lead to lots of COMMITs and (double) GETATTRs but using nconnect>1 at least doesn't seem to make it any worse like it does for NFSv3.
But maybe you actually want all the extra COMMITs to help better guarantee your writes when putting a re-export server in the way? Perhaps all of this is by design...
Daire
On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
> Sometimes I have seen clusters of 16 GETATTRs for the same file on the
> wire with nothing else inbetween. So if the re-export server is the
> only "client" writing these files to the originating server, why do we
> need to do so many repeat GETATTR calls when using nconnect>1? And why
> are the COMMIT calls required when the writes are coming via nfsd but
> not from userspace on the re-export server? Is that due to some sort
> of memory pressure or locking?
>
> I picked the NFSv3 originating server case because my head starts to
> hurt tracking the equivalent packets, stateids and compound calls with
> NFSv4. But I think it's mostly the same for NFSv4. The writes through
> the re-export server lead to lots of COMMITs and (double) GETATTRs but
> using nconnect>1 at least doesn't seem to make it any worse like it
> does for NFSv3.
>
> But maybe you actually want all the extra COMMITs to help better
> guarantee your writes when putting a re-export server in the way?
> Perhaps all of this is by design...
Maybe that's close-to-open combined with the server's tendency to
open/close on every IO operation? (Though the file cache should have
helped with that, I thought; as would using version >=4.0 on the final
client.)
Might be interesting to know whether the nocto mount option makes a
difference. (So, add "nocto" to the mount options for the NFS mount
that you're re-exporting on the re-export server.)
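A minimal sketch of what that might look like (hostname and paths made up); the underlying mount probably has to be unexported and unmounted before the option can be changed:

  umount /srv/origin
  mount -t nfs -o vers=4.2,nocto,actimeo=1800 origin.example.com:/export /srv/origin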
By the way I made a start at a list of issues at
http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export
but I was a little vague on which of your issues remained and didn't
take much time over it.
(If you want an account on that wiki BTW I seem to recall you just have
to ask Trond (for anti-spam reasons).)
--b.
> On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
> > Sometimes I have seen clusters of 16 GETATTRs for the same file on the
> > wire with nothing else inbetween. So if the re-export server is the
> > only "client" writing these files to the originating server, why do we
> > need to do so many repeat GETATTR calls when using nconnect>1? And why
> > are the COMMIT calls required when the writes are coming via nfsd but
> > not from userspace on the re-export server? Is that due to some sort
> > of memory pressure or locking?
> >
> > I picked the NFSv3 originating server case because my head starts to
> > hurt tracking the equivalent packets, stateids and compound calls with
> > NFSv4. But I think it's mostly the same for NFSv4. The writes through
> > the re-export server lead to lots of COMMITs and (double) GETATTRs but
> > using nconnect>1 at least doesn't seem to make it any worse like it
> > does for NFSv3.
> >
> > But maybe you actually want all the extra COMMITs to help better
> > guarantee your writes when putting a re-export server in the way?
> > Perhaps all of this is by design...
>
> Maybe that's close-to-open combined with the server's tendency to
> open/close on every IO operation? (Though the file cache should have
> helped with that, I thought; as would using version >=4.0 on the final
> client.)
>
> Might be interesting to know whether the nocto mount option makes a
> difference. (So, add "nocto" to the mount options for the NFS mount that
> you're re-exporting on the re-export server.)
>
> By the way I made a start at a list of issues at
>
> http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export
>
> but I was a little vague on which of your issues remained and didn't
> take much time over it.
>
> (If you want an account on that wiki BTW I seem to recall you just have
> to ask Trond (for anti-spam reasons).)
How much conversation about re-export has been had at the wider NFS
community level? I have an interest because Ganesha supports re-export via
the PROXY_V3 and PROXY_V4 FSALs. We currently don't have a data cache,
though there has been discussion of one; we do have attribute and dirent caches.
Looking over the wiki page, I have considered being able to specify a
re-export of a Ganesha export without encapsulating handles. Ganesha
encapsulates the export_fs handle in a way that could be coordinated between
the original server and the re-export server so that they would both
effectively have the same encapsulation layer.
I'd love to see some re-export best practices shared among server
implementations, and also what we can do to improve things when two server
implementations are interoperating via re-export.
Frank
On Tue, Nov 24, 2020 at 02:15:57PM -0800, Frank Filz wrote:
> How much conversation about re-export has been had at the wider NFS
> community level? I have an interest because Ganesha supports re-export via
> the PROXY_V3 and PROXY_V4 FSALs. We currently don't have a data cache though
> there has been discussion of such, we do have attribute and dirent caches.
>
> Looking over the wiki page, I have considered being able to specify a
> re-export of a Ganesha export without encapsulating handles. Ganesha
> encapsulates the export_fs handle in a way that could be coordinated between
> the original server and the re-export so they would both effectively have
> the same encapsulation layer.
In the case where the re-export server only serves a single export, I guess
you could do away with the encapsulation. (The only risk I see is that
a client of the re-export server could also access any export of the
original server if it could guess filehandles, which might surprise
admins.) Maybe that'd be useful.
Another advantage of not encapsulating filehandles is that clients could
more easily migrate between servers.
Cooperating servers could have an agreement on filehandles. And I guess
we could standardize that somehow. Are we ready for that? I'm not sure
what other re-exporting problems there are that I haven't thought of.
--b.
> I'd love to see some re-export best practices shared among server
> implementations, and also what we can do to improve things when two server
> implementations are interoperating via re-export.
> On Tue, Nov 24, 2020 at 02:15:57PM -0800, Frank Filz wrote:
> > How much conversation about re-export has been had at the wider NFS
> > community level? I have an interest because Ganesha supports
> > re-export via the PROXY_V3 and PROXY_V4 FSALs. We currently don't have
> > a data cache though there has been discussion of such, we do have
> > attribute and dirent caches.
> >
> > Looking over the wiki page, I have considered being able to specify a
> > re-export of a Ganesha export without encapsulating handles. Ganesha
> > encapsulates the export_fs handle in a way that could be coordinated
> > between the original server and the re-export so they would both
> > effectively have the same encapsulation layer.
>
> In the case where the re-export server only serves a single export, I
> guess you could do away with the encapsulation. (The only risk I see is
> that a client of the re-export server could also access any export of
> the original server if it could guess filehandles, which might surprise
> admins.) Maybe that'd be useful.
Ganesha handles have a minor downside that actually helps here if Ganesha
were re-exporting another Ganesha server. There is a 16-bit export_id that
comes from the export configuration and is part of the handle. We could
easily set things up so that, if the sysadmin configured it as such, each
re-exported Ganesha export would have the same export_id; then a client
handle for export_id 1 would be mirrored to the original server as
export_id 1, and the two servers could have the same export permissions and
everything.
There is some additional stuff we could easily implement in Ganesha to stop
handle manipulation being used to sneak around export permissions.
> Another advantage of not encapsulating filehandles is that clients
> could more easily migrate between servers.
Yea, with the idea I've been mulling for Ganesha, migration between original
server and re-export server would be simple with the same handles. Of course
state migration is a whole different ball of wax, but a clustered setup
could work just as well as Ganesha's clustered filesystems. On the other
hand, re-export with state has a pitfall. If the re-export server crashes,
the state is lost on the original server unless we make a protocol change to
allow state re-export such that a re-export server crashing doesn't cause
state loss. For this reason, I haven't rushed to implement lock state
re-export in Ganesha, rather allowing the re-export server to just manage
lock state locally.
> Cooperating servers could have an agreement on filehandles. And I
> guess we could standardize that somehow. Are we ready for that? I'm
> not sure what other re-exporting problems there are that I haven't
> thought of.
I'm not sure how far we want to go there, but potentially specific server
implementations could choose to be interoperable in a way that allows the
handle encapsulation to either be smaller or add no extra overhead. For
example, if we implemented what I've thought about for Ganesha-Ganesha
re-export, Ganesha COULD also be "taught" which portion of the knfsd handle
is the filesystem/export identifier, maintain a database of Ganesha
export/filesystem <-> knfsd export/filesystem mappings, and have Ganesha
re-encapsulate the exportfs/name_to_handle_at portion of the handle. Of
course in this case, trivial migration isn't possible since Ganesha will
have a different encapsulation than knfsd.
Incidentally, I also purposefully made Ganesha's encapsulation different so
it never collides with either version of knfsd handles (now if over the
course of the past 10 years another handle version has come along...).
Frank
> --b.
>
> > I'd love to see some re-export best practices shared among server
> > implementations, and also what we can do to improve things when two
> > server implementations are interoperating via re-export.
----- On 24 Nov, 2020, at 21:15, bfields [email protected] wrote:
> On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
>> Sometimes I have seen clusters of 16 GETATTRs for the same file on the
>> wire with nothing else inbetween. So if the re-export server is the
>> only "client" writing these files to the originating server, why do we
>> need to do so many repeat GETATTR calls when using nconnect>1? And why
>> are the COMMIT calls required when the writes are coming via nfsd but
>> not from userspace on the re-export server? Is that due to some sort
>> of memory pressure or locking?
>>
>> I picked the NFSv3 originating server case because my head starts to
>> hurt tracking the equivalent packets, stateids and compound calls with
>> NFSv4. But I think it's mostly the same for NFSv4. The writes through
>> the re-export server lead to lots of COMMITs and (double) GETATTRs but
>> using nconnect>1 at least doesn't seem to make it any worse like it
>> does for NFSv3.
>>
>> But maybe you actually want all the extra COMMITs to help better
>> guarantee your writes when putting a re-export server in the way?
>> Perhaps all of this is by design...
>
> Maybe that's close-to-open combined with the server's tendency to
> open/close on every IO operation? (Though the file cache should have
> helped with that, I thought; as would using version >=4.0 on the final
> client.)
>
> Might be interesting to know whether the nocto mount option makes a
> difference. (So, add "nocto" to the mount options for the NFS mount
> that you're re-exporting on the re-export server.)
The nocto option didn't really seem to help, but the NFSv4.2 re-export of a NFSv3 server did. I also realised I had done some tests with nconnect on the re-export server's client mount and consequently mixed things up a bit in my head. So I did some more tests and tried to make the results clear and simple. In all cases I'm just writing a big file with "dd" and capturing the traffic between the originating server and the re-export server.
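For reference, the write itself is nothing fancy - a rough sketch (sizes and paths are made up), with the capture running on the re-export server as before:

  # written directly to the originating server's mount point on the re-export server
  dd if=/dev/zero of=/srv/origin/ddtest.dat bs=1M count=100 conv=fsync

  # or the same write from a client through the re-export server's export
  dd if=/dev/zero of=/mnt/reexport/ddtest.dat bs=1M count=100 conv=fsync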
First off, writing direct to the originating server mount on the re-export server from userspace shows the ideal behaviour for all combinations:
originating server <- (vers=X,actimeo=1800,nconnect=X) <- reexport server writing = WRITE,WRITE .... repeating (good!)
Then re-exporting a NFSv4.2 server:
originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = GETATTR,COMMIT,WRITE .... repeating
originating server <- (vers=4.2) <- reexport server - (vers=4.2) <- client writing = GETATTR,WRITE .... repeating
And re-exporting a NFSv3 server:
originating server <- (vers=3) <- reexport server - (vers=4.2) <- client writing = WRITE,WRITE .... repeating (good!)
originating server <- (vers=3) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT .... repeating
So of all the combinations, a NFSv4.2 re-export of an NFSv3 server is the only one that matches the "ideal" case where we WRITE continuously without all the extra chatter.
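In case it helps anyone else trying this, here's a minimal sketch of that combination (the hostnames, paths and fsid value are only illustrative):

  # on the re-export server: mount the originating server with NFSv3...
  mount -t nfs -o vers=3,actimeo=1800 origin.example.com:/export /srv/origin

  # ...and re-export it (an NFS client mount needs an explicit fsid to be
  # exportable; no_subtree_check also keeps filehandles smaller)
  echo '/srv/origin *(rw,fsid=1000,no_subtree_check)' > /etc/exports.d/reexport.exports
  exportfs -ra

  # on the clients: mount the re-export server with NFSv4.2
  mount -t nfs -o vers=4.2 reexport.example.com:/srv/origin /mnt/reexport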
And for completeness, taking that "good" case and making it bad with nconnect:
originating server <- (vers=3,nconnect=16) <- reexport server - (vers=4.2) <- client writing = WRITE,WRITE .... repeating (good!)
originating server <- (vers=3) <- reexport server <- (vers=4.2,nconnect=16) <- client writing = WRITE,COMMIT,GETATTR .... randomly repeating
So using nconnect on the re-export server's client mount causes lots more metadata ops. There are good throughput reasons for doing that, but it could be that the gain is offset by the extra metadata round trips.
Similarly, we have mostly been using a NFSv4.2 re-export of a NFSv4.2 server over the WAN because of reduced metadata ops for reading, but it looks like we incur extra metadata ops for writing.
Side note: it's hard to decode nconnect-enabled packet captures because wireshark doesn't seem to like those extra port streams.
> By the way I made a start at a list of issues at
>
> http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export
>
> but I was a little vague on which of your issues remained and didn't
> take much time over it.
Cool. I'm glad there are some notes for others to reference - this thread is now too long for any human to read. The only things I'd consider adding are:
* re-export of NFSv4.0 filesystem can give input/output errors when the cache is dropped
* a weird interaction with nfs client readahead such that all reads are limited to the default 128k unless you manually increase it to match rsize.
The only other things I can offer are tips & tricks for doing this kind of thing over the WAN (vfs_cache_pressure, actimeo, nocto) and using fscache.
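Roughly the sort of knobs I mean (the values, device IDs and paths below are illustrative only, not recommendations):

  # keep dentries & inodes cached for as long as possible on the re-export server
  sysctl -w vm.vfs_cache_pressure=1

  # bump the NFS client readahead on the originating server mount to match a
  # 1MB rsize (it otherwise stays at the 128k default); the bdi is per-mount
  BDI=$(findmnt -rno MAJ:MIN /srv/origin)
  echo 1024 > /sys/class/bdi/$BDI/read_ahead_kb

  # typical WAN-friendly options for the originating server mount (fsc = fscache)
  mount -t nfs -o vers=4.2,actimeo=3600,nocto,fsc origin.example.com:/export /srv/origin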
Daire
On Wed, Nov 25, 2020 at 08:25:19AM -0800, Frank Filz wrote:
> On the other
> hand, re-export with state has a pitfall. If the re-export server crashes,
> the state is lost on the original server unless we make a protocol change to
> allow state re-export such that a re-export server crashing doesn't cause
> state loss.
Oh, yes, reboot recovery's an interesting problem that I'd forgotten
about; added to that wiki page.
By "state re-export" you mean you'd take the stateids the original
server returned to you, and return them to your own clients? So then
I guess you wouldn't need much state at all.
> For this reason, I haven't rushed to implement lock state
> re-export in Ganesha, rather allowing the re-export server to just manage
> lock state locally.
>
> > Cooperating servers could have an agreement on filehandles. And I guess
> we
> > could standardize that somehow. Are we ready for that? I'm not sure what
> > other re-exporting problems there are that I haven't thought of.
>
> I'm not sure how far we want to go there, but potentially specific server
> implementations could choose to be interoperable in a way that allows the
> handle encapsulation to either be smaller or no extra overhead. For example,
> if we implemented what I've thought about for Ganesha-Ganesha re-export,
> Ganesha COULD also be "taught" which portion of the knfsd handle is the
> filesystem/export identifier, and maintain a database of Ganesha
> export/filesystem <-> knfsd export/filesystem and have Ganesha
> re-encapsulate the exportfs/name_to_handle_at portion of the handle. Of
> course in this case, trivial migration isn't possible since Ganesha will
> have a different encapsulation than knfsd.
>
> Incidentally, I also purposefully made Ganesha's encapsulation different so
> it never collides with either version of knfsd handles (now if over the
> course of the past 10 years another handle version has come along...).
I don't think anything's changed there.
--b.
On Wed, Nov 25, 2020 at 05:14:51PM +0000, Daire Byrne wrote:
> Cool. I'm glad there are some notes for others to reference - this
> thread is now too long for any human to read. The only things I'd
> consider adding are:
Thanks, done.
> * re-export of NFSv4.0 filesystem can give input/output errors when the cache is dropped
Looking back at that thread.... I suspect that's just unfixable, so all
you can do is either use v4.1+ on the original server or 4.0+ on the
edge clients. Or I wonder if it would help if there was some way to
tell the 4.0 client just to try special stateids instead of attempting
an open?
> * a weird interaction with nfs client readahead such that all reads
> are limited to the default 128k unless you manually increase it to
> match rsize.
>
> The only other thing I can offer are tips & tricks for doing this kind
> of thing over the WAN (vfs_cache_pressure, actimeo, nocto) and using
> fscache.
OK, I haven't tried to pick that out of the thread yet.
--b.
On Wed, Nov 25, 2020 at 11:03 AM, 'bfields' wrote:
> On Wed, Nov 25, 2020 at 08:25:19AM -0800, Frank Filz wrote:
> > On the other
> > hand, re-export with state has a pitfall. If the re-export server
> > crashes, the state is lost on the original server unless we make a
> > protocol change to allow state re-export such that a re-export server
> > crashing doesn't cause state loss.
>
> Oh, yes, reboot recovery's an interesting problem that I'd forgotten
> about; added to that wiki page.
>
> By "state re-export" you mean you'd take the stateids the original server
> returned to you, and return them to your own clients? So then I guess you
> wouldn't need much state at all.
By state re-export I meant reflecting locks the end client takes on the
re-export server to the original server. Not necessarily by reflecting the
stateid (probably something to trip on there...) (Can we nail down a good
name for it? Proxy server or re-export server work well for the man in the
middle, but what about the back end server?)
Frank
> > For this reason, I haven't rushed to implement lock state re-export in
> > Ganesha, rather allowing the re-export server to just manage lock
> > state locally.
> >
> > > Cooperating servers could have an agreement on filehandles. And I
> > > guess
> > we
> > > could standardize that somehow. Are we ready for that? I'm not
> > > sure what other re-exporting problems there are that I haven't
> > > thought of.
> >
> > I'm not sure how far we want to go there, but potentially specific
> > server implementations could choose to be interoperable in a way that
> > allows the handle encapsulation to either be smaller or no extra
> > overhead. For example, if we implemented what I've thought about for
> > Ganesha-Ganesha re-export, Ganesha COULD also be "taught" which
> > portion of the knfsd handle is the filesystem/export identifier, and
> > maintain a database of Ganesha export/filesystem <-> knfsd
> > export/filesystem and have Ganesha re-encapsulate the
> > exportfs/name_to_handle_at portion of the handle. Of course in this
> > case, trivial migration isn't possible since Ganesha will have a
> > different encapsulation than knfsd.
> >
> > Incidentally, I also purposefully made Ganesha's encapsulation
> > different so it never collides with either version of knfsd handles
> > (now if over the course of the past 10 years another handle version
> > has come along...).
>
> I don't think anything's changed there.
>
> --b.
----- On 25 Nov, 2020, at 17:14, Daire Byrne [email protected] wrote:
> First off, writing direct to the originating server mount on the re-export
> server from userspace shows the ideal behaviour for all combinations:
>
> originating server <- (vers=X,actimeo=1800,nconnect=X) <- reexport server
> writing = WRITE,WRITE .... repeating (good!)
>
> Then re-exporting a NFSv4.2 server:
>
> originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing
> = GETATTR,COMMIT,WRITE .... repeating
> originating server <- (vers=4.2) <- reexport server - (vers=4.2) <- client
> writing = GETATTR,WRITE .... repeating
>
> And re-exporting a NFSv3 server:
>
> originating server <- (vers=3) <- reexport server - (vers=4.2) <- client writing
> = WRITE,WRITE .... repeating (good!)
> originating server <- (vers=3) <- reexport server - (vers=3) <- client writing =
> WRITE,COMMIT .... repeating
>
> So of all the combinations, a NFSv4.2 re-export of an NFSv3 server is the only
> one that matches the "ideal" case where we WRITE continuously without all the
> extra chatter.
>
> And for completeness, taking that "good" case and making it bad with nconnect:
>
> originating server <- (vers=3,nconnect=16) <- reexport server - (vers=4.2) <-
> client writing = WRITE,WRITE .... repeating (good!)
> originating server <- (vers=3) <- reexport server <- (vers=4.2,nconnect=16) <-
> client writing = WRITE,COMMIT,GETATTR .... randomly repeating
>
> So using nconnect on the re-export's client causes lots more metadata ops. There
> are reasons for doing that for increasing throughput but it could be that the
> gain is offset by the extra metadata roundtrips.
>
> Similarly, we have mostly been using a NFSv4.2 re-export of a NFSV4.2 server
> over the WAN because of reduced metadata ops for reading, but it looks like we
> incur extra metadata ops for writing.
Just a small update based on the most recent patchsets from Trond & Bruce:
https://patchwork.kernel.org/project/linux-nfs/list/?series=393567
https://patchwork.kernel.org/project/linux-nfs/list/?series=393561
For the write-through tests, the NFSv3 re-export of a NFSv4.2 server has trimmed an extra GETATTR:
Before:
originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT,GETATTR .... repeating
After:
originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT .... repeating
I'm assuming this is specifically due to the "EXPORT_OP_NOWCC" patch? All other combinations look the same as before (for write-through). An NFSv4.2 re-export of a NFSv3 server is still the best/ideal in terms of not incurring extra metadata roundtrips when writing.
It's great to see this re-export scenario becoming a better supported (and better performing) topology; many thanks, all.
Daire
On Thu, Dec 03, 2020 at 12:20:35PM +0000, Daire Byrne wrote:
> Just a small update based on the most recent patchsets from Trond &
> Bruce:
>
> https://patchwork.kernel.org/project/linux-nfs/list/?series=393567
> https://patchwork.kernel.org/project/linux-nfs/list/?series=393561
>
> For the write-through tests, the NFSv3 re-export of a NFSv4.2 server
> has trimmed an extra GETATTR:
>
> Before: originating server <- (vers=4.2) <- reexport server - (vers=3)
> <- client writing = WRITE,COMMIT,GETATTR .... repeating
>
> After: originating server <- (vers=4.2) <- reexport server - (vers=3)
> <- client writing = WRITE,COMMIT .... repeating
>
> I'm assuming this is specifically due to the "EXPORT_OP_NOWCC" patch?
Probably so, thanks for the update.
> All other combinations look the same as before (for write-through). An
> NFSv4.2 re-export of a NFSv3 server is still the best/ideal in terms
> of not incurring extra metadata roundtrips when writing.
>
> It's great to see this re-export scenario becoming a better supported
> (and performing) topology; many thanks all.
I've been scratching my head over how to handle reboot of a re-exporting
server. I think one way to fix it might be just to allow the re-export
server to pass along reclaims to the original server as it receives them
from its own clients. It might require some protocol tweaks, I'm not
sure. I'll try to get my thoughts in order and propose something.
--b.
On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> On Thu, Dec 03, 2020 at 12:20:35PM +0000, Daire Byrne wrote:
> > Just a small update based on the most recent patchsets from Trond &
> > Bruce:
> >
> > https://patchwork.kernel.org/project/linux-nfs/list/?series=393567
> > https://patchwork.kernel.org/project/linux-nfs/list/?series=393561
> >
> > For the write-through tests, the NFSv3 re-export of a NFSv4.2
> > server
> > has trimmed an extra GETATTR:
> >
> > Before: originating server <- (vers=4.2) <- reexport server -
> > (vers=3)
> > <- client writing = WRITE,COMMIT,GETATTR .... repeating
> >
> > After: originating server <- (vers=4.2) <- reexport server -
> > (vers=3)
> > <- client writing = WRITE,COMMIT .... repeating
> >
> > I'm assuming this is specifically due to the "EXPORT_OP_NOWCC"
> > patch?
>
> Probably so, thanks for the update.
>
> > All other combinations look the same as before (for write-through).
> > An
> > NFSv4.2 re-export of a NFSv3 server is still the best/ideal in
> > terms
> > of not incurring extra metadata roundtrips when writing.
> >
> > It's great to see this re-export scenario becoming a better
> > supported
> > (and performing) topology; many thanks all.
>
> I've been scratching my head over how to handle reboot of a re-
> exporting
> server. I think one way to fix it might be just to allow the re-
> export
> server to pass along reclaims to the original server as it receives
> them
> from its own clients. It might require some protocol tweaks, I'm not
> sure. I'll try to get my thoughts in order and propose something.
>
It's more complicated than that. If the re-exporting server reboots,
but the original server does not, then unless that re-exporting server
persisted its lease and a full set of stateids somewhere, it will not
be able to atomically reclaim delegation and lock state on the server
on behalf of its clients.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > I've been scratching my head over how to handle reboot of a re-
> > exporting
> > server. I think one way to fix it might be just to allow the re-
> > export
> > server to pass along reclaims to the original server as it receives
> > them
> > from its own clients. It might require some protocol tweaks, I'm not
> > sure. I'll try to get my thoughts in order and propose something.
> >
>
> It's more complicated than that. If the re-exporting server reboots,
> but the original server does not, then unless that re-exporting server
> persisted its lease and a full set of stateids somewhere, it will not
> be able to atomically reclaim delegation and lock state on the server
> on behalf of its clients.
By sending reclaims to the original server, I mean literally sending new
open and lock requests with the RECLAIM bit set, which would get brand
new stateids.
So, the original server would invalidate the existing client's previous
clientid and stateids--just as it normally would on reboot--but it would
optionally remember the underlying locks held by the client and allow
compatible lock reclaims.
Rough attempt:
https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
Think it would fly?
--b.
> On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > I've been scratching my head over how to handle reboot of a re-
> > > exporting server. I think one way to fix it might be just to allow
> > > the re-export server to pass along reclaims to the original server
> > > as it receives them from its own clients. It might require some
> > > protocol tweaks, I'm not sure. I'll try to get my thoughts in order
> > > and propose something.
> > >
> >
> > It's more complicated than that. If the re-exporting server reboots,
> > but the original server does not, then unless that re-exporting server
> > persisted its lease and a full set of stateids somewhere, it will not
> > be able to atomically reclaim delegation and lock state on the server
> > on behalf of its clients.
>
> By sending reclaims to the original server, I mean literally sending new
> open and lock requests with the RECLAIM bit set, which would get brand
> new stateids.
>
> So, the original server would invalidate the existing client's previous
> clientid and stateids--just as it normally would on reboot--but it would
> optionally remember the underlying locks held by the client and allow
> compatible lock reclaims.
>
> Rough attempt:
>
> https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
>
> Think it would fly?
At a quick read through, that sounds good. I'm sure there are some bits and bobs we need to fix up.
I'm cc:ing Jeff Layton because what the original server needs to do looks a bit like what he implemented in CephFS to allow HA restarts of nfs-ganesha instances.
Maybe we should take this to the IETF mailing list? I'm certainly interested in discussion on what we could do in the protocol to facilitate this from an nfs-ganesha perspective.
Frank
On Thu, 2020-12-03 at 16:13 -0500, [email protected] wrote:
> On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > I've been scratching my head over how to handle reboot of a re-
> > > exporting
> > > server. I think one way to fix it might be just to allow the re-
> > > export
> > > server to pass along reclaims to the original server as it
> > > receives
> > > them
> > > from its own clients. It might require some protocol tweaks, I'm
> > > not
> > > sure. I'll try to get my thoughts in order and propose
> > > something.
> > >
> >
> > It's more complicated than that. If the re-exporting server
> > reboots,
> > but the original server does not, then unless that re-exporting
> > server
> > persisted its lease and a full set of stateids somewhere, it will
> > not
> > be able to atomically reclaim delegation and lock state on the
> > server
> > on behalf of its clients.
>
> By sending reclaims to the original server, I mean literally sending
> new
> open and lock requests with the RECLAIM bit set, which would get
> brand
> new stateids.
>
> So, the original server would invalidate the existing client's
> previous
> clientid and stateids--just as it normally would on reboot--but it
> would
> optionally remember the underlying locks held by the client and allow
> compatible lock reclaims.
>
> Rough attempt:
>
> https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
>
> Think it would fly?
So this would be a variant of courtesy locks that can be reclaimed by
the client using the reboot reclaim variant of OPEN/LOCK outside the
grace period? The purpose being to allow reclaim without forcing the
client to persist the original stateid?
Hmm... That's doable, but how about the following alternative: Add a
function that allows the client to request the full list of stateids
that the server holds on its behalf?
I've been wanting such a function for quite a while anyway in order to
allow the client to detect state leaks (either due to soft timeouts, or
due to reordered close/open operations).
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
> On Thu, 2020-12-03 at 16:13 -0500, [email protected] wrote:
> > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > I've been scratching my head over how to handle reboot of a re-
> > > > exporting server. I think one way to fix it might be just to
> > > > allow the re-export server to pass along reclaims to the original
> > > > server as it receives them from its own clients. It might require
> > > > some protocol tweaks, I'm not sure. I'll try to get my thoughts
> > > > in order and propose something.
> > > >
> > >
> > > It's more complicated than that. If the re-exporting server reboots,
> > > but the original server does not, then unless that re-exporting
> > > server persisted its lease and a full set of stateids somewhere, it
> > > will not be able to atomically reclaim delegation and lock state on
> > > the server on behalf of its clients.
> >
> > By sending reclaims to the original server, I mean literally sending
> > new open and lock requests with the RECLAIM bit set, which would get
> > brand new stateids.
> >
> > So, the original server would invalidate the existing client's
> > previous clientid and stateids--just as it normally would on
> > reboot--but it would optionally remember the underlying locks held by
> > the client and allow compatible lock reclaims.
> >
> > Rough attempt:
> >
> >
> > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> >
> > Think it would fly?
>
> So this would be a variant of courtesy locks that can be reclaimed by the client
> using the reboot reclaim variant of OPEN/LOCK outside the grace period? The
> purpose being to allow reclaim without forcing the client to persist the original
> stateid?
>
> Hmm... That's doable, but how about the following alternative: Add a function
> that allows the client to request the full list of stateids that the server holds on
> its behalf?
>
> I've been wanting such a function for quite a while anyway in order to allow the
> client to detect state leaks (either due to soft timeouts, or due to reordered
> close/open operations).
Oh, that sounds interesting. So basically the re-export server would re-populate its state from the original server rather than relying on its clients doing reclaims? Hmm, but how does the re-export server rebuild its stateids? I guess it could make the clients repopulate them with the same "give me a dump of all my state", using the state details to match up with the old state and replacing stateids. Or did you have something different in mind?
Frank
On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 16:13 -0500, [email protected] wrote:
> > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > I've been scratching my head over how to handle reboot of a re-
> > > > exporting
> > > > server. I think one way to fix it might be just to allow the re-
> > > > export
> > > > server to pass along reclaims to the original server as it
> > > > receives
> > > > them
> > > > from its own clients. It might require some protocol tweaks, I'm
> > > > not
> > > > sure. I'll try to get my thoughts in order and propose
> > > > something.
> > > >
> > >
> > > It's more complicated than that. If the re-exporting server
> > > reboots,
> > > but the original server does not, then unless that re-exporting
> > > server
> > > persisted its lease and a full set of stateids somewhere, it will
> > > not
> > > be able to atomically reclaim delegation and lock state on the
> > > server
> > > on behalf of its clients.
> >
> > By sending reclaims to the original server, I mean literally sending
> > new
> > open and lock requests with the RECLAIM bit set, which would get
> > brand
> > new stateids.
> >
> > So, the original server would invalidate the existing client's
> > previous
> > clientid and stateids--just as it normally would on reboot--but it
> > would
> > optionally remember the underlying locks held by the client and allow
> > compatible lock reclaims.
> >
> > Rough attempt:
> >
> > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> >
> > Think it would fly?
>
> So this would be a variant of courtesy locks that can be reclaimed by
> the client using the reboot reclaim variant of OPEN/LOCK outside the
> grace period? The purpose being to allow reclaim without forcing the
> client to persist the original stateid?
Right.
> Hmm... That's doable,
Keep mulling it over and let me know if you see something that doesn't
work.
> but how about the following alternative: Add a
> function that allows the client to request the full list of stateids
> that the server holds on its behalf?
So, on the re-export server:
The client comes back up knowing nothing, so it requests that list of
stateids. A reclaim comes in from an end client. The client looks
through its list for a stateid that matches that reclaim somehow. So, I
guess the list of stateids also has to include filehandles and access
bits and lock ranges and such, so the client can pick an appropriate
stateid to use?
> I've been wanting such a function for quite a while anyway in order to
> allow the client to detect state leaks (either due to soft timeouts, or
> due to reordered close/open operations).
Yipes, I hadn't realized that was possible.
--b.
On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > On Thu, 2020-12-03 at 16:13 -0500, [email protected] wrote:
> > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > I've been scratching my head over how to handle reboot of a
> > > > > re-
> > > > > exporting server. I think one way to fix it might be just to
> > > > > allow the re-export server to pass along reclaims to the
> > > > > original
> > > > > server as it receives them from its own clients. It might
> > > > > require
> > > > > some protocol tweaks, I'm not sure. I'll try to get my
> > > > > thoughts
> > > > > in order and propose something.
> > > > >
> > > >
> > > > It's more complicated than that. If the re-exporting server
> > > > reboots,
> > > > but the original server does not, then unless that re-exporting
> > > > server persisted its lease and a full set of stateids
> > > > somewhere, it
> > > > will not be able to atomically reclaim delegation and lock
> > > > state on
> > > > the server on behalf of its clients.
> > >
> > > By sending reclaims to the original server, I mean literally
> > > sending
> > > new open and lock requests with the RECLAIM bit set, which would
> > > get
> > > brand new stateids.
> > >
> > > So, the original server would invalidate the existing client's
> > > previous clientid and stateids--just as it normally would on
> > > reboot--but it would optionally remember the underlying locks
> > > held by
> > > the client and allow compatible lock reclaims.
> > >
> > > Rough attempt:
> > >
> > >
> > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > >
> > > Think it would fly?
> >
> > So this would be a variant of courtesy locks that can be reclaimed
> > by the client
> > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > period? The
> > purpose being to allow reclaim without forcing the client to
> > persist the original
> > stateid?
> >
> > Hmm... That's doable, but how about the following alternative: Add
> > a function
> > that allows the client to request the full list of stateids that
> > the server holds on
> > its behalf?
> >
> > I've been wanting such a function for quite a while anyway in order
> > to allow the
> > client to detect state leaks (either due to soft timeouts, or due
> > to reordered
> > close/open operations).
>
> Oh, that sounds interesting. So basically the re-export server would
> re-populate it's state from the original server rather than relying
> on it's clients doing reclaims? Hmm, but how does the re-export
> server rebuild its stateids? I guess it could make the clients
> repopulate them with the same "give me a dump of all my state", using
> the state details to match up with the old state and replacing
> stateids. Or did you have something different in mind?
>
I was thinking that the re-export server could just use that list of
stateids to figure out which locks can be reclaimed atomically, and
which ones have been irredeemably lost. The assumption is that if you
have a lock stateid or a delegation, then that means the clients can
reclaim all the locks that were represented by that stateid.
I suppose the client would also need to know the lockowner for the
stateid, but presumably that information could also be returned by the
server?
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > On Thu, 2020-12-03 at 16:13 -0500, [email protected] wrote:
> > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust wrote:
> > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > I've been scratching my head over how to handle reboot of a
> > > > > > re-
> > > > > > exporting server. I think one way to fix it might be just to
> > > > > > allow the re-export server to pass along reclaims to the
> > > > > > original
> > > > > > server as it receives them from its own clients. It might
> > > > > > require
> > > > > > some protocol tweaks, I'm not sure. I'll try to get my
> > > > > > thoughts
> > > > > > in order and propose something.
> > > > > >
> > > > >
> > > > > It's more complicated than that. If the re-exporting server
> > > > > reboots,
> > > > > but the original server does not, then unless that re-exporting
> > > > > server persisted its lease and a full set of stateids
> > > > > somewhere, it
> > > > > will not be able to atomically reclaim delegation and lock
> > > > > state on
> > > > > the server on behalf of its clients.
> > > >
> > > > By sending reclaims to the original server, I mean literally
> > > > sending
> > > > new open and lock requests with the RECLAIM bit set, which would
> > > > get
> > > > brand new stateids.
> > > >
> > > > So, the original server would invalidate the existing client's
> > > > previous clientid and stateids--just as it normally would on
> > > > reboot--but it would optionally remember the underlying locks
> > > > held by
> > > > the client and allow compatible lock reclaims.
> > > >
> > > > Rough attempt:
> > > >
> > > >
> > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > > >
> > > > Think it would fly?
> > >
> > > So this would be a variant of courtesy locks that can be reclaimed
> > > by the client
> > > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > > period? The
> > > purpose being to allow reclaim without forcing the client to
> > > persist the original
> > > stateid?
> > >
> > > Hmm... That's doable, but how about the following alternative: Add
> > > a function
> > > that allows the client to request the full list of stateids that
> > > the server holds on
> > > its behalf?
> > >
> > > I've been wanting such a function for quite a while anyway in order
> > > to allow the
> > > client to detect state leaks (either due to soft timeouts, or due
> > > to reordered
> > > close/open operations).
> >
> > Oh, that sounds interesting. So basically the re-export server would
> > re-populate it's state from the original server rather than relying
> > on it's clients doing reclaims? Hmm, but how does the re-export
> > server rebuild its stateids? I guess it could make the clients
> > repopulate them with the same "give me a dump of all my state", using
> > the state details to match up with the old state and replacing
> > stateids. Or did you have something different in mind?
> >
>
> I was thinking that the re-export server could just use that list of
> stateids to figure out which locks can be reclaimed atomically, and
> which ones have been irredeemably lost. The assumption is that if you
> have a lock stateid or a delegation, then that means the clients can
> reclaim all the locks that were represented by that stateid.
I'm confused about how the re-export server uses that list. Are you
assuming it persisted its own list across its own crash/reboot? I guess
that's what I was trying to avoid having to do.
--b.
On Thu, 2020-12-03 at 17:04 -0500, [email protected] wrote:
> On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > On Thu, 2020-12-03 at 16:13 -0500, [email protected] wrote:
> > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > wrote:
> > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > I've been scratching my head over how to handle reboot of
> > > > > > > a
> > > > > > > re-
> > > > > > > exporting server. I think one way to fix it might be
> > > > > > > just to
> > > > > > > allow the re-export server to pass along reclaims to the
> > > > > > > original
> > > > > > > server as it receives them from its own clients. It
> > > > > > > might
> > > > > > > require
> > > > > > > some protocol tweaks, I'm not sure. I'll try to get my
> > > > > > > thoughts
> > > > > > > in order and propose something.
> > > > > > >
> > > > > >
> > > > > > It's more complicated than that. If the re-exporting server
> > > > > > reboots,
> > > > > > but the original server does not, then unless that re-
> > > > > > exporting
> > > > > > server persisted its lease and a full set of stateids
> > > > > > somewhere, it
> > > > > > will not be able to atomically reclaim delegation and lock
> > > > > > state on
> > > > > > the server on behalf of its clients.
> > > > >
> > > > > By sending reclaims to the original server, I mean literally
> > > > > sending
> > > > > new open and lock requests with the RECLAIM bit set, which
> > > > > would
> > > > > get
> > > > > brand new stateids.
> > > > >
> > > > > So, the original server would invalidate the existing
> > > > > client's
> > > > > previous clientid and stateids--just as it normally would on
> > > > > reboot--but it would optionally remember the underlying locks
> > > > > held by
> > > > > the client and allow compatible lock reclaims.
> > > > >
> > > > > Rough attempt:
> > > > >
> > > > >
> > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > > > >
> > > > > Think it would fly?
> > > >
> > > > So this would be a variant of courtesy locks that can be
> > > > reclaimed
> > > > by the client
> > > > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > > > period? The
> > > > purpose being to allow reclaim without forcing the client to
> > > > persist the original
> > > > stateid?
> > > >
> > > > Hmm... That's doable, but how about the following alternative:
> > > > Add
> > > > a function
> > > > that allows the client to request the full list of stateids
> > > > that
> > > > the server holds on
> > > > its behalf?
> > > >
> > > > I've been wanting such a function for quite a while anyway in
> > > > order
> > > > to allow the
> > > > client to detect state leaks (either due to soft timeouts, or
> > > > due
> > > > to reordered
> > > > close/open operations).
> > >
> > > Oh, that sounds interesting. So basically the re-export server
> > > would
> > > re-populate it's state from the original server rather than
> > > relying
> > > on it's clients doing reclaims? Hmm, but how does the re-export
> > > server rebuild its stateids? I guess it could make the clients
> > > repopulate them with the same "give me a dump of all my state",
> > > using
> > > the state details to match up with the old state and replacing
> > > stateids. Or did you have something different in mind?
> > >
> >
> > I was thinking that the re-export server could just use that list
> > of
> > stateids to figure out which locks can be reclaimed atomically, and
> > which ones have been irredeemably lost. The assumption is that if
> > you
> > have a lock stateid or a delegation, then that means the clients
> > can
> > reclaim all the locks that were represented by that stateid.
>
> I'm confused about how the re-export server uses that list. Are you
> assuming it persisted its own list across its own crash/reboot? I
> guess
> that's what I was trying to avoid having to do.
>
No. The server just uses the stateids as part of a check for 'do I hold
state for this file on this server?'. If the answer is 'yes' and the
lock owners are sane, then we should be able to assume the full set of
locks that lock owner held on that file are still valid.
BTW: if the lock owner is also returned by the server, then since the
lock owner is an opaque value, it could, for instance, be used by the
client to cache info on the server about which uid/gid owns these
locks.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, Dec 3, 2020 at 2:14 PM, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 17:04 -0500, [email protected] wrote:
> > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > On Thu, 2020-12-03 at 16:13 -0500, [email protected] wrote:
> > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > wrote:
> > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > I've been scratching my head over how to handle reboot of
> > > > > > > > a
> > > > > > > > re-
> > > > > > > > exporting server. I think one way to fix it might be just
> > > > > > > > to allow the re-export server to pass along reclaims to
> > > > > > > > the original server as it receives them from its own
> > > > > > > > clients. It might require some protocol tweaks, I'm not
> > > > > > > > sure. I'll try to get my thoughts in order and propose
> > > > > > > > something.
> > > > > > > >
> > > > > > >
> > > > > > > It's more complicated than that. If the re-exporting server
> > > > > > > reboots, but the original server does not, then unless that
> > > > > > > re-exporting server persisted its lease and a full set of
> > > > > > > stateids somewhere, it will not be able to atomically
> > > > > > > reclaim delegation and lock state on the server on behalf of
> > > > > > > its clients.
> > > > > >
> > > > > > By sending reclaims to the original server, I mean literally
> > > > > > sending new open and lock requests with the RECLAIM bit set,
> > > > > > which would get brand new stateids.
> > > > > >
> > > > > > So, the original server would invalidate the existing client's
> > > > > > previous clientid and stateids--just as it normally would on
> > > > > > reboot--but it would optionally remember the underlying locks
> > > > > > held by the client and allow compatible lock reclaims.
> > > > > >
> > > > > > Rough attempt:
> > > > > >
> > > > > >
> > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > > > > >
> > > > > > Think it would fly?
> > > > >
> > > > > So this would be a variant of courtesy locks that can be
> > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > OPEN/LOCK outside the grace period? The purpose being to allow
> > > > > reclaim without forcing the client to persist the original
> > > > > stateid?
> > > > >
> > > > > Hmm... That's doable, but how about the following alternative:
> > > > > Add
> > > > > a function
> > > > > that allows the client to request the full list of stateids that
> > > > > the server holds on its behalf?
> > > > >
> > > > > I've been wanting such a function for quite a while anyway in
> > > > > order to allow the client to detect state leaks (either due to
> > > > > soft timeouts, or due to reordered close/open operations).
> > > >
> > > > Oh, that sounds interesting. So basically the re-export server
> > > > would re-populate it's state from the original server rather than
> > > > relying on it's clients doing reclaims? Hmm, but how does the
> > > > re-export server rebuild its stateids? I guess it could make the
> > > > clients repopulate them with the same "give me a dump of all my
> > > > state", using the state details to match up with the old state and
> > > > replacing stateids. Or did you have something different in mind?
> > > >
> > >
> > > I was thinking that the re-export server could just use that list of
> > > stateids to figure out which locks can be reclaimed atomically, and
> > > which ones have been irredeemably lost. The assumption is that if
> > > you have a lock stateid or a delegation, then that means the clients
> > > can reclaim all the locks that were represented by that stateid.
> >
> > I'm confused about how the re-export server uses that list. Are you
> > assuming it persisted its own list across its own crash/reboot? I
> > guess that's what I was trying to avoid having to do.
> >
> No. The server just uses the stateids as part of a check for 'do I hold state for
> this file on this server?'. If the answer is 'yes' and the lock owners are sane, then
> we should be able to assume the full set of locks that lock owner held on that
> file are still valid.
>
> BTW: if the lock owner is also returned by the server, then since the lock owner
> is an opaque value, it could, for instance, be used by the client to cache info on
> the server about which uid/gid owns these locks.
Let me see if I'm understanding your idea right...
The re-export server reboots within the extended lease period it's been given by the original server. I'm assuming it uses the same clientid (but it would probably open new sessions)? It requests the list of stateids. Hmm, how to make the owner information useful: nfs-ganesha doesn't pass on the actual client's owner but rather just passes the address of its record for that client owner. Maybe it will have to do something a bit different for this degree of re-export support...
Now the re-export server knows which original client lock owners are allowed to reclaim state, so it just acquires locks using the original stateid as the client reclaims (what happens if the client doesn't reclaim a lock? I suppose the re-export server could unlock all regions not explicitly locked once reclaim is complete). Since the re-export server is acquiring new locks using the original stateid, it will just overlay the original lock with the new lock, and write locks don't conflict since they are being acquired by the same lock owner. Actually, the original server could even balk at a "reclaim" done this way for a lock that wasn't originally held... And the original server could "refresh" the locks, and discard any that aren't refreshed at the end of reclaim. That part assumes the original server is apprised that what is actually happening is a reclaim.
The re-export server can destroy any stateids that it doesn't receive reclaims for.
Hmm, I think if the re-export server is implemented as an HA cluster, it should establish a clientid on the original server for each virtual IP (assuming that's the unit of HA) that exists. Then when virtual IPs are moved, the re-export server just goes through the above reclaim process for that clientid.
Frank
On Thu, Dec 03, 2020 at 10:14:25PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 17:04 -0500, [email protected] wrote:
> > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > On Thu, 2020-12-03 at 16:13 -0500, [email protected] wrote:
> > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > wrote:
> > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > I've been scratching my head over how to handle reboot of
> > > > > > > > a
> > > > > > > > re-
> > > > > > > > exporting server. I think one way to fix it might be
> > > > > > > > just to
> > > > > > > > allow the re- export server to pass along reclaims to the
> > > > > > > > original
> > > > > > > > server as it receives them from its own clients. It
> > > > > > > > might
> > > > > > > > require
> > > > > > > > some protocol tweaks, I'm not sure. I'll try to get my
> > > > > > > > thoughts
> > > > > > > > in order and propose something.
> > > > > > > >
> > > > > > >
> > > > > > > It's more complicated than that. If the re-exporting server
> > > > > > > reboots,
> > > > > > > but the original server does not, then unless that re-
> > > > > > > exporting
> > > > > > > server persisted its lease and a full set of stateids
> > > > > > > somewhere, it
> > > > > > > will not be able to atomically reclaim delegation and lock
> > > > > > > state on
> > > > > > > the server on behalf of its clients.
> > > > > >
> > > > > > By sending reclaims to the original server, I mean literally
> > > > > > sending
> > > > > > new open and lock requests with the RECLAIM bit set, which
> > > > > > would
> > > > > > get
> > > > > > brand new stateids.
> > > > > >
> > > > > > So, the original server would invalidate the existing
> > > > > > client's
> > > > > > previous clientid and stateids--just as it normally would on
> > > > > > reboot--but it would optionally remember the underlying locks
> > > > > > held by
> > > > > > the client and allow compatible lock reclaims.
> > > > > >
> > > > > > Rough attempt:
> > > > > >
> > > > > >
> > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > > > > >
> > > > > > Think it would fly?
> > > > >
> > > > > So this would be a variant of courtesy locks that can be
> > > > > reclaimed
> > > > > by the client
> > > > > using the reboot reclaim variant of OPEN/LOCK outside the grace
> > > > > period? The
> > > > > purpose being to allow reclaim without forcing the client to
> > > > > persist the original
> > > > > stateid?
> > > > >
> > > > > Hmm... That's doable, but how about the following alternative:
> > > > > Add
> > > > > a function
> > > > > that allows the client to request the full list of stateids
> > > > > that
> > > > > the server holds on
> > > > > its behalf?
> > > > >
> > > > > I've been wanting such a function for quite a while anyway in
> > > > > order
> > > > > to allow the
> > > > > client to detect state leaks (either due to soft timeouts, or
> > > > > due
> > > > > to reordered
> > > > > close/open operations).
> > > >
> > > > Oh, that sounds interesting. So basically the re-export server
> > > > would
> > > > re-populate its state from the original server rather than
> > > > relying
> > > > on its clients doing reclaims? Hmm, but how does the re-export
> > > > server rebuild its stateids? I guess it could make the clients
> > > > repopulate them with the same "give me a dump of all my state",
> > > > using
> > > > the state details to match up with the old state and replacing
> > > > stateids. Or did you have something different in mind?
> > > >
> > >
> > > I was thinking that the re-export server could just use that list
> > > of
> > > stateids to figure out which locks can be reclaimed atomically, and
> > > which ones have been irredeemably lost. The assumption is that if
> > > you
> > > have a lock stateid or a delegation, then that means the clients
> > > can
> > > reclaim all the locks that were represented by that stateid.
> >
> > I'm confused about how the re-export server uses that list. Are you
> > assuming it persisted its own list across its own crash/reboot? I
> > guess
> > that's what I was trying to avoid having to do.
> >
> No. The server just uses the stateids as part of a check for 'do I hold
> state for this file on this server?'. If the answer is 'yes' and the
> lock owners are sane, then we should be able to assume the full set of
> locks that lock owner held on that file are still valid.
>
> BTW: if the lock owner is also returned by the server, then since the
> lock owner is an opaque value, it could, for instance, be used by the
> client to cache info on the server about which uid/gid owns these
> locks.
OK, so the list of stateids returned by the server has entries that look
like (type, filehandle, owner, stateid) (where type=open or lock?).
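If it really is just a flat list like that, then the state-leak check Trond wants out of it could be little more than a set difference on the client side. A rough sketch, with made-up record shapes (not a protocol definition):

def find_leaks(server_list, client_table):
    """Return stateids the server still holds that the client has forgotten."""
    known = {entry["stateid"] for entry in client_table}
    return [e for e in server_list if e["stateid"] not in known]

server_list = [
    {"type": "open", "fh": "fh1", "owner": "ownerA", "stateid": "st-1"},
    {"type": "lock", "fh": "fh1", "owner": "ownerA", "stateid": "st-2"},
]
client_table = [{"stateid": "st-1"}]

# st-2 is state the server holds that the client no longer knows about,
# e.g. after a soft timeout or a reordered close/open.
print(find_leaks(server_list, client_table))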
I guess I'd need to see this in more detail.
--b.
On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> I've been wanting such a function for quite a while anyway in order to
> allow the client to detect state leaks (either due to soft timeouts, or
> due to reordered close/open operations).
One sure way to fix any state leaks is to reboot the server. The server
throws everything away, the clients reclaim, all that's left is stuff
they still actually care about.
It's very disruptive.
But you could do a limited version of that: the server throws away the
state from one client (keeping the underlying locks on the exported
filesystem), lets the client go through its normal reclaim process, at
the end of that throws away anything that wasn't reclaimed. The only
delay is to anyone trying to acquire new locks that conflict with that
set of locks, and only for as long as it takes for the one client to
reclaim.
?
--b.
On Thu, 2020-12-03 at 14:39 -0800, Frank Filz wrote:
>
>
> > -----Original Message-----
> > From: Trond Myklebust [mailto:[email protected]]
> > Sent: Thursday, December 3, 2020 2:14 PM
> > To: [email protected]
> > Cc: [email protected]; [email protected]; linux-
> > [email protected]; [email protected]
> > Subject: Re: Adventures in NFS re-exporting
> >
> > On Thu, 2020-12-03 at 17:04 -0500, [email protected] wrote:
> > > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > > On Thu, 2020-12-03 at 16:13 -0500,
> > > > > > [email protected] wrote:
> > > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > > wrote:
> > > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > > I've been scratching my head over how to handle
> > > > > > > > > reboot of
> > > > > > > > > a
> > > > > > > > > re-
> > > > > > > > > exporting server. I think one way to fix it might be
> > > > > > > > > just
> > > > > > > > > to allow the re- export server to pass along reclaims
> > > > > > > > > to
> > > > > > > > > the original server as it receives them from its own
> > > > > > > > > clients. It might require some protocol tweaks, I'm
> > > > > > > > > not
> > > > > > > > > sure. I'll try to get my thoughts in order and
> > > > > > > > > propose
> > > > > > > > > something.
> > > > > > > > >
> > > > > > > >
> > > > > > > > It's more complicated than that. If the re-exporting
> > > > > > > > server
> > > > > > > > reboots, but the original server does not, then unless
> > > > > > > > that
> > > > > > > > re- exporting server persisted its lease and a full set
> > > > > > > > of
> > > > > > > > stateids somewhere, it will not be able to atomically
> > > > > > > > reclaim delegation and lock state on the server on
> > > > > > > > behalf of
> > > > > > > > its clients.
> > > > > > >
> > > > > > > By sending reclaims to the original server, I mean
> > > > > > > literally
> > > > > > > sending new open and lock requests with the RECLAIM bit
> > > > > > > set,
> > > > > > > which would get brand new stateids.
> > > > > > >
> > > > > > > So, the original server would invalidate the existing
> > > > > > > client's
> > > > > > > previous clientid and stateids--just as it normally would
> > > > > > > on
> > > > > > > reboot--but it would optionally remember the underlying
> > > > > > > locks
> > > > > > > held by the client and allow compatible lock reclaims.
> > > > > > >
> > > > > > > Rough attempt:
> > > > > > >
> > > > > > >
> > > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > > > > > >
> > > > > > > Think it would fly?
> > > > > >
> > > > > > So this would be a variant of courtesy locks that can be
> > > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > > OPEN/LOCK outside the grace period? The purpose being to
> > > > > > allow
> > > > > > reclaim without forcing the client to persist the original
> > > > > > stateid?
> > > > > >
> > > > > > Hmm... That's doable, but how about the following
> > > > > > alternative:
> > > > > > Add
> > > > > > a function
> > > > > > that allows the client to request the full list of stateids
> > > > > > that
> > > > > > the server holds on its behalf?
> > > > > >
> > > > > > I've been wanting such a function for quite a while anyway
> > > > > > in
> > > > > > order to allow the client to detect state leaks (either due
> > > > > > to
> > > > > > soft timeouts, or due to reordered close/open operations).
> > > > >
> > > > > Oh, that sounds interesting. So basically the re-export
> > > > > server
> > > > > would re-populate its state from the original server rather
> > > > > than
> > > > > relying on its clients doing reclaims? Hmm, but how does the
> > > > > re-export server rebuild its stateids? I guess it could make
> > > > > the
> > > > > clients repopulate them with the same "give me a dump of all
> > > > > my
> > > > > state", using the state details to match up with the old
> > > > > state and
> > > > > replacing stateids. Or did you have something different in
> > > > > mind?
> > > > >
> > > >
> > > > I was thinking that the re-export server could just use that
> > > > list of
> > > > stateids to figure out which locks can be reclaimed atomically,
> > > > and
> > > > which ones have been irredeemably lost. The assumption is that
> > > > if
> > > > you have a lock stateid or a delegation, then that means the
> > > > clients
> > > > can reclaim all the locks that were represented by that
> > > > stateid.
> > >
> > > I'm confused about how the re-export server uses that list. Are
> > > you
> > > assuming it persisted its own list across its own crash/reboot?
> > > I
> > > guess that's what I was trying to avoid having to do.
> > >
> > No. The server just uses the stateids as part of a check for 'do I
> > hold state for
> > this file on this server?'. If the answer is 'yes' and the lock
> > owners are sane, then
> > we should be able to assume the full set of locks that lock owner
> > held on that
> > file are still valid.
> >
> > BTW: if the lock owner is also returned by the server, then since
> > the lock owner
> > is an opaque value, it could, for instance, be used by the client
> > to cache info on
> > the server about which uid/gid owns these locks.
>
> Let me see if I'm understanding your idea right...
>
> Re-export server reboots within the extended lease period it's been
> given by the original server. I'm assuming it uses the same clientid?
Yes. It would have to use the same clientid.
> But would probably open new sessions. It requests the list of
> stateids. Hmm, how to make the owner information useful, nfs-ganesha
> doesn't pass on the actual client's owner but rather just passes the
> address of its record for that client owner. Maybe it will have to do
> something a bit different for this degree of re-export support...
>
> Now the re-export server knows which original client lock owners are
> allowed to reclaim state. So it just acquires locks using the
> original stateid as the client reclaims (what happens if the client
> doesn't reclaim a lock? I suppose the re-export server could unlock
> all regions not explicitly locked once reclaim is complete). Since
> the re-export server is acquiring new locks using the original
> stateid it will just overlay the original lock with the new lock and
> write locks don't conflict since they are being acquired by the same
> lock owner. Actually the original server could even balk at a
> "reclaim" in this way that wasn't originally held... And the original
> server could "refresh" the locks, and discard any that aren't
> refreshed at the end of reclaim. That part assumes the original
> server is apprised that what is actually happening is a reclaim.
>
> The re-export server can destroy any stateids that it doesn't receive
> reclaims for.
Right. That's in essence what I'm suggesting. There are corner cases to
be considered: e.g. "what happens if the re-export server crashes after
unlocking on the server, but before passing the LOCKU reply on to the
client", however I think it should be possible to figure out strategies
for those cases.
>
> Hmm, I think if the re-export server is implemented as an HA cluster,
> it should establish a clientid on the original server for each
> virtual IP (assuming that's the unit of HA) that exists. Then when
> virtual IPs are moved, the re-export server just goes through the
> above reclaim process for that clientid.
>
Yes, we could do something like that.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, 2020-12-03 at 17:45 -0500, [email protected] wrote:
> On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > I've been wanting such a function for quite a while anyway in order
> > to
> > allow the client to detect state leaks (either due to soft
> > timeouts, or
> > due to reordered close/open operations).
>
> One sure way to fix any state leaks is to reboot the server. The
> server
> throws everything away, the clients reclaim, all that's left is stuff
> they still actually care about.
>
> It's very disruptive.
>
> But you could do a limited version of that: the server throws away
> the
> state from one client (keeping the underlying locks on the exported
> filesystem), lets the client go through its normal reclaim process,
> at
> the end of that throws away anything that wasn't reclaimed. The only
> delay is to anyone trying to acquire new locks that conflict with
> that
> set of locks, and only for as long as it takes for the one client to
> reclaim.
One could do that, but that requires the existence of a quiescent
period where the client holds no state at all on the server. There are
definitely cases where that is not an option.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 17:45 -0500, [email protected] wrote:
> > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > I've been wanting such a function for quite a while anyway in
> > > order to allow the client to detect state leaks (either due to
> > > soft timeouts, or due to reordered close/open operations).
> >
> > One sure way to fix any state leaks is to reboot the server. The
> > server throws everything away, the clients reclaim, all that's left
> > is stuff they still actually care about.
> >
> > It's very disruptive.
> >
> > But you could do a limited version of that: the server throws away
> > the state from one client (keeping the underlying locks on the
> > exported filesystem), lets the client go through its normal reclaim
> > process, at the end of that throws away anything that wasn't
> > reclaimed. The only delay is to anyone trying to acquire new locks
> > that conflict with that set of locks, and only for as long as it
> > takes for the one client to reclaim.
>
> One could do that, but that requires the existence of a quiescent
> period where the client holds no state at all on the server.
No, as I said, the client performs reboot recovery for any state that it
holds when we do this.
--b.
> On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 17:45 -0500, [email protected] wrote:
> > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > I've been wanting such a function for quite a while anyway in
> > > > order to allow the client to detect state leaks (either due to
> > > > soft timeouts, or due to reordered close/open operations).
> > >
> > > One sure way to fix any state leaks is to reboot the server. The
> > > server throws everything away, the clients reclaim, all that's left
> > > is stuff they still actually care about.
> > >
> > > It's very disruptive.
> > >
> > > But you could do a limited version of that: the server throws away
> > > the state from one client (keeping the underlying locks on the
> > > exported filesystem), lets the client go through its normal reclaim
> > > process, at the end of that throws away anything that wasn't
> > > reclaimed. The only delay is to anyone trying to acquire new locks
> > > that conflict with that set of locks, and only for as long as it
> > > takes for the one client to reclaim.
> >
> > One could do that, but that requires the existence of a quiescent
> > period where the client holds no state at all on the server.
>
> No, as I said, the client performs reboot recovery for any state that it holds
> when we do this.
Yeah, but the original server goes through a period where it has dropped all state and isn't in grace, and if it's coordinating with non-NFS users, they don't know anything about grace anyway.
Frank
> > > -----Original Message-----
> > > From: Trond Myklebust [mailto:[email protected]]
> > > Sent: Thursday, December 3, 2020 2:14 PM
> > > To: [email protected]
> > > Cc: [email protected]; [email protected]; linux-
> > > [email protected]; [email protected]
> > > Subject: Re: Adventures in NFS re-exporting
> > >
> > > On Thu, 2020-12-03 at 17:04 -0500, [email protected] wrote:
> > > > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > > > On Thu, 2020-12-03 at 16:13 -0500, [email protected]
> > > > > > > wrote:
> > > > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > > > wrote:
> > > > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > > > I've been scratching my head over how to handle reboot
> > > > > > > > > > of a
> > > > > > > > > > re-
> > > > > > > > > > exporting server. I think one way to fix it might be
> > > > > > > > > > just to allow the re- export server to pass along
> > > > > > > > > > reclaims to the original server as it receives them
> > > > > > > > > > from its own clients. It might require some protocol
> > > > > > > > > > tweaks, I'm not sure. I'll try to get my thoughts in
> > > > > > > > > > order and propose something.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > It's more complicated than that. If the re-exporting
> > > > > > > > > server reboots, but the original server does not, then
> > > > > > > > > unless that
> > > > > > > > > re- exporting server persisted its lease and a full set
> > > > > > > > > of stateids somewhere, it will not be able to atomically
> > > > > > > > > reclaim delegation and lock state on the server on
> > > > > > > > > behalf of its clients.
> > > > > > > >
> > > > > > > > By sending reclaims to the original server, I mean
> > > > > > > > literally sending new open and lock requests with the
> > > > > > > > RECLAIM bit set, which would get brand new stateids.
> > > > > > > >
> > > > > > > > So, the original server would invalidate the existing
> > > > > > > > client's previous clientid and stateids--just as it
> > > > > > > > normally would on reboot--but it would optionally remember
> > > > > > > > the underlying locks held by the client and allow
> > > > > > > > compatible lock reclaims.
> > > > > > > >
> > > > > > > > Rough attempt:
> > > > > > > >
> > > > > > > >
> > > > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_re-export_servers
> > > > > > > >
> > > > > > > > Think it would fly?
> > > > > > >
> > > > > > > So this would be a variant of courtesy locks that can be
> > > > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > > > OPEN/LOCK outside the grace period? The purpose being to
> > > > > > > allow reclaim without forcing the client to persist the
> > > > > > > original stateid?
> > > > > > >
> > > > > > > Hmm... That's doable, but how about the following
> > > > > > > alternative:
> > > > > > > Add
> > > > > > > a function
> > > > > > > that allows the client to request the full list of stateids
> > > > > > > that the server holds on its behalf?
> > > > > > >
> > > > > > > I've been wanting such a function for quite a while anyway
> > > > > > > in order to allow the client to detect state leaks (either
> > > > > > > due to soft timeouts, or due to reordered close/open
> > > > > > > operations).
> > > > > >
> > > > > > Oh, that sounds interesting. So basically the re-export server
> > > > > > would re-populate its state from the original server rather
> > > > > > than relying on its clients doing reclaims? Hmm, but how does
> > > > > > the re-export server rebuild its stateids? I guess it could
> > > > > > make the clients repopulate them with the same "give me a dump
> > > > > > of all my state", using the state details to match up with the
> > > > > > old state and replacing stateids. Or did you have something
> > > > > > different in mind?
> > > > > >
> > > > >
> > > > > I was thinking that the re-export server could just use that
> > > > > list of stateids to figure out which locks can be reclaimed
> > > > > atomically, and which ones have been irredeemably lost. The
> > > > > assumption is that if you have a lock stateid or a delegation,
> > > > > then that means the clients can reclaim all the locks that were
> > > > > represented by that stateid.
> > > >
> > > > I'm confused about how the re-export server uses that list. Are
> > > > you assuming it persisted its own list across its own
> > > > crash/reboot?
> > > > I
> > > > guess that's what I was trying to avoid having to do.
> > > >
> > > No. The server just uses the stateids as part of a check for 'do I
> > > hold state for this file on this server?'. If the answer is 'yes'
> > > and the lock owners are sane, then we should be able to assume the
> > > full set of locks that lock owner held on that file are still valid.
> > >
> > > BTW: if the lock owner is also returned by the server, then since
> > > the lock owner is an opaque value, it could, for instance, be used
> > > by the client to cache info on the server about which uid/gid owns
> > > these locks.
> >
> > Let me see if I'm understanding your idea right...
> >
> > Re-export server reboots within the extended lease period it's been
> > given by the original server. I'm assuming it uses the same clientid?
>
> Yes. It would have to use the same clientid.
>
> > But would probably open new sessions. It requests the list of
> > stateids. Hmm, how to make the owner information useful, nfs-ganesha
> > doesn't pass on the actual client's owner but rather just passes the
> > address of its record for that client owner. Maybe it will have to do
> > something a bit different for this degree of re-export support...
> >
> > Now the re-export server knows which original client lock owners are
> > allowed to reclaim state. So it just acquires locks using the original
> > stateid as the client reclaims (what happens if the client doesn't
> > reclaim a lock? I suppose the re-export server could unlock all
> > regions not explicitly locked once reclaim is complete). Since the
> > re-export server is acquiring new locks using the original stateid it
> > will just overlay the original lock with the new lock and write locks
> > don't conflict since they are being acquired by the same lock owner.
> > Actually the original server could even balk at a "reclaim" in this
> > way that wasn't originally held... And the original server could
> > "refresh" the locks, and discard any that aren't refreshed at the end
> > of reclaim. That part assumes the original server is apprised that
> > what is actually happening is a reclaim.
> >
> > The re-export server can destroy any stateids that it doesn't receive
> > reclaims for.
>
> Right. That's in essence what I'm suggesting. There are corner cases to be
> considered: e.g. "what happens if the re-export server crashes after unlocking
> on the server, but before passing the LOCKU reply on to the client", however I
> think it should be possible to figure out strategies for those cases.
That's no different than a regular NFS server crashing before responding to an unlock. The client likely doesn't reclaim locks it was attempting to drop at server crash time. So that is one place where we would definitely have abandoned locks on the original server IF the unlock never made it there. But we're already talking about strategies to clean up abandoned locks.
I won't be surprised if we find a trickier corner case, but my gut feeling is that every corner case will have a relatively simple solution.
Another consideration is how to handle the size of the state list... Ideally we would have some way to break it up that is less clunky than readdir (at least the state list can be assumed to be static during the course of the fetch; even a regular client that is just interested in the list could pause state activity until it has been retrieved).
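Just to illustrate the shape of what I mean by breaking it up, something cookie-based along these lines would do (operation name and signature invented; a toy in-memory server stands in for the real thing so the loop actually runs):

def fetch_all_state(server, chunk=256):
    """Fetch the state list in pages; assumes the list stays static for
    the duration of the fetch (no new opens/locks in flight)."""
    entries, cookie = [], 0
    while True:
        page, cookie, eof = server.list_stateids(cookie=cookie, maxcount=chunk)
        entries.extend(page)
        if eof:
            return entries

class FakeServer:
    # Stand-in for the original server, only here to exercise the loop.
    def __init__(self, n):
        self.state = [("open", "fh%d" % i, "ownerA", "st-%d" % i) for i in range(n)]
    def list_stateids(self, cookie, maxcount):
        page = self.state[cookie:cookie + maxcount]
        nxt = cookie + len(page)
        return page, nxt, nxt >= len(self.state)

print(len(fetch_all_state(FakeServer(1000))))   # -> 1000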
Frank
On Thu, 2020-12-03 at 18:16 -0500, [email protected] wrote:
> On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 17:45 -0500, [email protected] wrote:
> > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > I've been wanting such a function for quite a while anyway in
> > > > order to allow the client to detect state leaks (either due to
> > > > soft timeouts, or due to reordered close/open operations).
> > >
> > > One sure way to fix any state leaks is to reboot the server. The
> > > server throws everything away, the clients reclaim, all that's
> > > left
> > > is stuff they still actually care about.
> > >
> > > It's very disruptive.
> > >
> > > But you could do a limited version of that: the server throws
> > > away
> > > the state from one client (keeping the underlying locks on the
> > > exported filesystem), lets the client go through its normal
> > > reclaim
> > > process, at the end of that throws away anything that wasn't
> > > reclaimed. The only delay is to anyone trying to acquire new
> > > locks
> > > that conflict with that set of locks, and only for as long as it
> > > takes for the one client to reclaim.
> >
> > One could do that, but that requires the existence of a quiescent
> > period where the client holds no state at all on the server.
>
> No, as I said, the client performs reboot recovery for any state that
> it
> holds when we do this.
>
Hmm... So how do the client and server coordinate what can and cannot
be reclaimed? The issue is that races can work both ways, with the
client sometimes believing that it holds a layout or a delegation that
the server thinks it has returned. If the server allows a reclaim of
such a delegation, then that could be problematic (because it breaks
lock atomicity on the client and because it may cause conflicts).
By the way, the other thing that I'd like to add to my wishlist is a
callback that allows the server to ask the client if it still holds a
given open or lock stateid. A server can recall a delegation or a
layout, so it can fix up leaks of those, however it has no remedy if
the client loses an open or lock stateid other than to possibly
forcibly revoke state. That could cause application crashes if the
server makes a mistake and revokes a lock that is actually in use.
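To make the wishlist item concrete, the server-side use would be roughly this (the callback name is invented, no such operation exists today; the point is that nothing gets revoked unless the client definitively says it has forgotten the stateid):

def audit_stateid(client_cb, stateid):
    """Probe the client about a possibly-leaked open/lock stateid."""
    answer = client_cb.probe_stateid(stateid)   # hypothetical new callback
    if answer is True:
        return "keep"       # still in use on the client; leave it alone
    if answer is False:
        return "revoke"     # client confirms it forgot it; safe to free
    return "unknown"        # no usable answer: do not revoke, it may be live

class FakeClient:
    # Toy stand-in so the example runs.
    def __init__(self, held):
        self.held = set(held)
    def probe_stateid(self, stateid):
        return stateid in self.held

c = FakeClient({"st-1"})
print(audit_stateid(c, "st-1"))   # keep
print(audit_stateid(c, "st-2"))   # revoke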
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Fri, Dec 04, 2020 at 01:02:20AM +0000, Trond Myklebust wrote:
> On Thu, 2020-12-03 at 18:16 -0500, [email protected] wrote:
> > On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > > On Thu, 2020-12-03 at 17:45 -0500, [email protected] wrote:
> > > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > > I've been wanting such a function for quite a while anyway in
> > > > > order to allow the client to detect state leaks (either due to
> > > > > soft timeouts, or due to reordered close/open operations).
> > > >
> > > > One sure way to fix any state leaks is to reboot the server. The
> > > > server throws everything away, the clients reclaim, all that's
> > > > left
> > > > is stuff they still actually care about.
> > > >
> > > > It's very disruptive.
> > > >
> > > > But you could do a limited version of that: the server throws
> > > > away
> > > > the state from one client (keeping the underlying locks on the
> > > > exported filesystem), lets the client go through its normal
> > > > reclaim
> > > > process, at the end of that throws away anything that wasn't
> > > > reclaimed. The only delay is to anyone trying to acquire new
> > > > locks
> > > > that conflict with that set of locks, and only for as long as it
> > > > takes for the one client to reclaim.
> > >
> > > One could do that, but that requires the existence of a quiescent
> > > period where the client holds no state at all on the server.
> >
> > No, as I said, the client performs reboot recovery for any state that
> > it
> > holds when we do this.
> >
>
> Hmm... So how do the client and server coordinate what can and cannot
> be reclaimed? The issue is that races can work both ways, with the
> client sometimes believing that it holds a layout or a delegation that
> the server thinks it has returned. If the server allows a reclaim of
> such a delegation, then that could be problematic (because it breaks
> lock atomicity on the client and because it may cause conflicts).
The server's not actually forgetting anything, it's just pretending to,
in order to trigger the client's reboot recovery. It can turn down the
client's attempt to reclaim something it doesn't have.
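In pseudocode, the per-client "pretend reboot" amounts to this (pure bookkeeping sketch with made-up names; a real server would of course key this by clientid and stateid rather than bare tuples):

def fake_reboot_one_client(previously_held, reclaims):
    """previously_held and reclaims are sets of (fh, owner, range) tuples.
    The server keeps its own record, so nothing is really forgotten."""
    granted, rejected = set(), set()
    for r in reclaims:
        if r in previously_held:
            granted.add(r)       # compatible reclaim: hand out a new stateid
        else:
            rejected.add(r)      # client never held this: turn it down
    leaked = previously_held - granted   # state nobody reclaimed
    return granted, rejected, leaked     # leaked locks get released at the end

held = {("fh1", "ownerA", (0, 100)), ("fh2", "ownerA", (0, 10))}
claims = {("fh1", "ownerA", (0, 100)), ("fh3", "ownerA", (0, 5))}
print(fake_reboot_one_client(held, claims))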
Though isn't it already game over by the time the client thinks it holds
some lock/open/delegation that the server doesn't? I guess I'd need to
see these cases written out in detail to understand.
--b.
> By the way, the other thing that I'd like to add to my wishlist is a
> callback that allows the server to ask the client if it still holds a
> given open or lock stateid. A server can recall a delegation or a
> layout, so it can fix up leaks of those, however it has no remedy if
> the client loses an open or lock stateid other than to possibly
> forcibly revoke state. That could cause application crashes if the
> server makes a mistake and revokes a lock that is actually in use.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>
On Thu, 2020-12-03 at 20:41 -0500, [email protected] wrote:
> On Fri, Dec 04, 2020 at 01:02:20AM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 18:16 -0500, [email protected] wrote:
> > > On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 17:45 -0500, [email protected] wrote:
> > > > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust
> > > > > wrote:
> > > > > > I've been wanting such a function for quite a while anyway
> > > > > > in
> > > > > > order to allow the client to detect state leaks (either due
> > > > > > to
> > > > > > soft timeouts, or due to reordered close/open operations).
> > > > >
> > > > > One sure way to fix any state leaks is to reboot the server.
> > > > > The
> > > > > server throws everything away, the clients reclaim, all
> > > > > that's
> > > > > left
> > > > > is stuff they still actually care about.
> > > > >
> > > > > It's very disruptive.
> > > > >
> > > > > But you could do a limited version of that: the server throws
> > > > > away
> > > > > the state from one client (keeping the underlying locks on
> > > > > the
> > > > > exported filesystem), lets the client go through its normal
> > > > > reclaim
> > > > > process, at the end of that throws away anything that wasn't
> > > > > reclaimed. The only delay is to anyone trying to acquire new
> > > > > locks
> > > > > that conflict with that set of locks, and only for as long as
> > > > > it
> > > > > takes for the one client to reclaim.
> > > >
> > > > One could do that, but that requires the existence of a
> > > > quiescent
> > > > period where the client holds no state at all on the server.
> > >
> > > No, as I said, the client performs reboot recovery for any state
> > > that
> > > it
> > > holds when we do this.
> > >
> >
> > Hmm... So how do the client and server coordinate what can and
> > cannot
> > be reclaimed? The issue is that races can work both ways, with the
> > client sometimes believing that it holds a layout or a delegation
> > that
> > the server thinks it has returned. If the server allows a reclaim
> > of
> > such a delegation, then that could be problematic (because it
> > breaks
> > lock atomicity on the client and because it may cause conflicts).
>
> The server's not actually forgetting anything, it's just pretending
> to,
> in order to trigger the client's reboot recovery. It can turn down
> the
> client's attempt to reclaim something it doesn't have.
>
> Though isn't it already game over by the time the client thinks it
> holds
> some lock/open/delegation that the server doesn't? I guess I'd need
> to
> see these cases written out in detail to understand.
>
Normally, the server will return NFS4ERR_BAD_STATEID or
NFS4ERR_OLD_STATEID if the client tries to use an invalid stateid. The
issue here is that you'd be discarding that machinery, because the
client is forgetting its stateids when it gets told that the server
rebooted.
That again puts the onus on the server to verify more strongly whether
or not the client is recovering state that it actually holds.
To elaborate a little more on the cases where we have seen the client
and server state get messed up here: typically it happens when we
build COMPOUNDs where there is a stateful operation followed by a slow
operation. Something like
Thread 1
========
OPEN(foo) + LAYOUTGET
  -> openstateid(01: blah)

        Thread 2
        ========
        OPEN(foo)
          -> openstateid(02: blah)
        CLOSE(openstateid(02: blah))

(thread 1 finally gets the reply to its OPEN).
Typically the client forgets about the stateid after the CLOSE, so when
it gets a reply to the original OPEN, it thinks it just got a
completely fresh stateid "openstateid(01: blah)", which it might try to
reclaim if the server declares a reboot.
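A toy model of that reordering, just to spell out why the client ends up trusting a stateid the server has already discarded (the stateid is reduced to a (seqid, other) pair and the seqid to a simple counter; an illustration only, not how the client code actually tracks this):

open_state = {}   # server side: "other" -> current seqid, absent once closed

def server_open(other):
    seq = open_state.get(other, 0) + 1
    open_state[other] = seq
    return (seq, other)

def server_close(stateid):
    open_state.pop(stateid[1], None)

# The server answers thread 1's OPEN(foo)+LAYOUTGET first...
reply_to_thread1 = server_open("blah")     # openstateid(01: blah)
# ...but the client processes thread 2's OPEN and CLOSE before that reply:
reply_to_thread2 = server_open("blah")     # openstateid(02: blah)
server_close(reply_to_thread2)             # client forgets "blah" entirely
# Only now does the client see thread 1's reply, which looks brand new:
print(reply_to_thread1, open_state)        # (1, 'blah') vs. {} on the server
# If the server then declares a reboot, the client may try to reclaim
# (01: blah) -- state the server no longer believes it holds.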
> --b.
>
> > By the way, the other thing that I'd like to add to my wishlist is
> > a
> > callback that allows the server to ask the client if it still holds
> > a
> > given open or lock stateid. A server can recall a delegation or a
> > layout, so it can fix up leaks of those, however it has no remedy
> > if
> > the client loses an open or lock stateid other than to possibly
> > forcibly revoke state. That could cause application crashes if the
> > server makes a mistake and revokes a lock that is actually in use.
> >
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]