LinuxLists.cc - Adventures in NFS re-exporting

2020-09-07 17:39:36

Subject: Adventures in NFS re-exporting

Hi,

Apologies for this rather long email, but I thought there may be some interest out there in the community in how and why we've been doing something unsupported and barely documented - NFS re-exporting! And I'm not sure I can tell our story well in just a few short sentences so please bear with me (or stop now!).

Full disclosure - I am also rather hoping that this story piques some interest amongst developers to help make our rather niche setup even better and perhaps a little better documented. I also totally understand if this is something people wouldn't want to touch with a very long barge pole....

First a quick bit of history (I hope I have this right). Late in 2015, Jeff Layton proposed a patch series allowing knfsd to re-export a NFS client mount. The rationale then was to provide a "proxy" server that could mount an NFSv4 only server and re-export it to older clients that only supported NFSv3. One of the main sticking points then (as now), was around the 63 byte limit of filehandles for NFSv3 and how it couldn't be guaranteed that all re-exported filehandles would fit within that (in my experience it mostly works with "no_subtree_check"). There are also the usual locking and coherence concerns with NFSv3 too but I'll get to that in a bit.

Then almost two years later, v4.13 was released including parts of the patch series that actually allowed the re-export and since then other relevant bits (such as the open file cache) have also been merged. I soon became interested in using this new functionality to both accelerate our on-premises NFS storage and use it as a "WAN cache" to provide cloud compute instances locally cached proxy access to our on-premises storage.

Cut to a brief introduction to us and what we do... DNEG is an award winning VFX company which uses large compute farms to generate complex final frame renders for movies and TV. This workload mostly consists of reads of common data shared between many render clients (e.g textures, geometry) and a little unique data per frame. All file writes are to unique files per process (frames) and there is very little if any writing over existing files. Hence it's not very demanding on locking and coherence guarantees.

When our on-premises NFS storage is being overloaded or the server's network is maxed out, we can place multiple re-export servers in between them and our farm to improve performance. When our on-premises render farm is not quite big enough to meet a deadline, we spin up compute instances with a (reasonably local) cloud provider. Some of these cloud instances are Linux NFS servers which mount our on-premises NFS storage servers (~10ms away) and re-export these to the other cloud (render) instances. Since we know that the data we are reading doesn't change often, we can increase the actimeo and even use nocto to reduce the network chatter back to the on-prem servers. These re-export servers also use fscache/cachefiles to cache data to disk so that we can retain TBs of previously read data locally in the cloud over long periods of time. We also use NFSv4 (less network chatter) all the way from our on-prem storage to the re-export server and then on to the clients.

The re-export server(s) quickly builds up both a memory cache and disk backed fscache/cachefiles storage cache of our working data set so the data being pulled from on-prem lessens over time. Data is only ever read once over the WAN network from on-prem storage and then read multiple times by the many render client instances in the cloud. Recent NFS features such as "nconnect" help to speed up the initial reading of data from on-prem by using multiple connections to offset TCP latency. At the end of the render, we write the files back through the re-export server to our on-prem storage. Our average read bandwidth is many times higher than our write bandwidth.

Rather surprisingly, this mostly works for our particular workloads. We've completed movies using this setup and saved money on commercial caching systems (e.g Avere, GPFS, etc). But there are still some remaining issues with doing something that is very much not widely supported (or recommended). In most cases we have worked around them, but it would be great if we didn't have to so others could also benefit. I will list the main problems quickly now and provide more information and reproducers later if anyone is interested.

1) The kernel can drop entries out of the NFS client inode cache (under memory cache churn) when those filehandles are still being used by the knfsd's remote clients resulting in sporadic and random stale filehandles. This seems to be mostly for directories from what I've seen. Does the NFS client not know that knfsd is still using those files/dirs? The workaround is to never drop inode & dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps to ensure that we actually make the most of our actimeo=3600,nocto mount options for the full specified time.

2) If we cache metadata on the re-export server using actimeo=3600,nocto we can cut the network packets back to the origin server to zero for repeated lookups. However, if a client of the re-export server walks paths and memory maps those files (i.e. loading an application), the re-export server starts issuing unexpected calls back to the origin server again, ignoring/invalidating the re-export server's NFS client cache. We worked around this this by patching an inode/iversion validity check in inode.c so that the NFS client cache on the re-export server is used. I'm not sure about the correctness of this patch but it works for our corner case.

3) If we saturate an NFS client's network with reads from the server, all client metadata lookups become unbearably slow even if it's all cached in the NFS client's memory and no network RPCs should be required. This is the case for any NFS client regardless of re-exporting but it affects this case more because when we can't serve cached metadata we also can't serve the cached data. It feels like some sort of bottleneck in the client's ability to parallelise requests? We work around this by not maxing out our network.

4) With an NFSv4 re-export, lots of open/close requests (hundreds per second) quickly eat up the CPU on the re-export server and perf top shows we are mostly in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache like that added to NFSv3? Our workaround is to either fix the thing doing lots of repeated open/closes or use NFSv3 instead.

If you made it this far, I've probably taken up way too much of your valuable time already. If nobody is interested in this rather niche application of the Linux client & knfsd, then I totally understand and I will not mention it here again. If your interest is piqued however, I'm happy to go into more detail about any of this with the hope that this could become a better documented and understood type of setup that others with similar workloads could reference.

Also, many thanks to all the Linux NFS developers for the amazing work you do which, in turn, helps us to make great movies. :)

Daire (Head of Systems DNEG)

2020-09-08 09:46:50

by Mkrtchyan, Tigran

[permalink] [raw]

Subject: Re: Adventures in NFS re-exporting

Just out of curiosity:

did you have tries instead of re-exporting nfs mount directly
re-export an overlayfs mount on top of the original nfs mount?
Such setup should cover most of your issues.

Regards,
Tigran.

----- Original Message -----
> From: "Daire Byrne" <[email protected]>
> To: "linux-nfs" <[email protected]>
> Cc: [email protected]
> Sent: Monday, September 7, 2020 7:31:00 PM
> Subject: Adventures in NFS re-exporting

> Hi,
>
> Apologies for this rather long email, but I thought there may be some interest
> out there in the community in how and why we've been doing something
> unsupported and barely documented - NFS re-exporting! And I'm not sure I can
> tell our story well in just a few short sentences so please bear with me (or
> stop now!).
>
> Full disclosure - I am also rather hoping that this story piques some interest
> amongst developers to help make our rather niche setup even better and perhaps
> a little better documented. I also totally understand if this is something
> people wouldn't want to touch with a very long barge pole....
>
> First a quick bit of history (I hope I have this right). Late in 2015, Jeff
> Layton proposed a patch series allowing knfsd to re-export a NFS client mount.
> The rationale then was to provide a "proxy" server that could mount an NFSv4
> only server and re-export it to older clients that only supported NFSv3. One of
> the main sticking points then (as now), was around the 63 byte limit of
> filehandles for NFSv3 and how it couldn't be guaranteed that all re-exported
> filehandles would fit within that (in my experience it mostly works with
> "no_subtree_check"). There are also the usual locking and coherence concerns
> with NFSv3 too but I'll get to that in a bit.
>
> Then almost two years later, v4.13 was released including parts of the patch
> series that actually allowed the re-export and since then other relevant bits
> (such as the open file cache) have also been merged. I soon became interested
> in using this new functionality to both accelerate our on-premises NFS storage
> and use it as a "WAN cache" to provide cloud compute instances locally cached
> proxy access to our on-premises storage.
>
> Cut to a brief introduction to us and what we do... DNEG is an award winning VFX
> company which uses large compute farms to generate complex final frame renders
> for movies and TV. This workload mostly consists of reads of common data shared
> between many render clients (e.g textures, geometry) and a little unique data
> per frame. All file writes are to unique files per process (frames) and there
> is very little if any writing over existing files. Hence it's not very
> demanding on locking and coherence guarantees.
>
> When our on-premises NFS storage is being overloaded or the server's network is
> maxed out, we can place multiple re-export servers in between them and our farm
> to improve performance. When our on-premises render farm is not quite big
> enough to meet a deadline, we spin up compute instances with a (reasonably
> local) cloud provider. Some of these cloud instances are Linux NFS servers
> which mount our on-premises NFS storage servers (~10ms away) and re-export
> these to the other cloud (render) instances. Since we know that the data we are
> reading doesn't change often, we can increase the actimeo and even use nocto to
> reduce the network chatter back to the on-prem servers. These re-export servers
> also use fscache/cachefiles to cache data to disk so that we can retain TBs of
> previously read data locally in the cloud over long periods of time. We also
> use NFSv4 (less network chatter) all the way from our on-prem storage to the
> re-export server and then on to the clients.
>
> The re-export server(s) quickly builds up both a memory cache and disk backed
> fscache/cachefiles storage cache of our working data set so the data being
> pulled from on-prem lessens over time. Data is only ever read once over the WAN
> network from on-prem storage and then read multiple times by the many render
> client instances in the cloud. Recent NFS features such as "nconnect" help to
> speed up the initial reading of data from on-prem by using multiple connections
> to offset TCP latency. At the end of the render, we write the files back
> through the re-export server to our on-prem storage. Our average read bandwidth
> is many times higher than our write bandwidth.
>
> Rather surprisingly, this mostly works for our particular workloads. We've
> completed movies using this setup and saved money on commercial caching systems
> (e.g Avere, GPFS, etc). But there are still some remaining issues with doing
> something that is very much not widely supported (or recommended). In most
> cases we have worked around them, but it would be great if we didn't have to so
> others could also benefit. I will list the main problems quickly now and
> provide more information and reproducers later if anyone is interested.
>
> 1) The kernel can drop entries out of the NFS client inode cache (under memory
> cache churn) when those filehandles are still being used by the knfsd's remote
> clients resulting in sporadic and random stale filehandles. This seems to be
> mostly for directories from what I've seen. Does the NFS client not know that
> knfsd is still using those files/dirs? The workaround is to never drop inode &
> dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps
> to ensure that we actually make the most of our actimeo=3600,nocto mount
> options for the full specified time.
>
> 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> cut the network packets back to the origin server to zero for repeated lookups.
> However, if a client of the re-export server walks paths and memory maps those
> files (i.e. loading an application), the re-export server starts issuing
> unexpected calls back to the origin server again, ignoring/invalidating the
> re-export server's NFS client cache. We worked around this this by patching an
> inode/iversion validity check in inode.c so that the NFS client cache on the
> re-export server is used. I'm not sure about the correctness of this patch but
> it works for our corner case.
>
> 3) If we saturate an NFS client's network with reads from the server, all client
> metadata lookups become unbearably slow even if it's all cached in the NFS
> client's memory and no network RPCs should be required. This is the case for
> any NFS client regardless of re-exporting but it affects this case more because
> when we can't serve cached metadata we also can't serve the cached data. It
> feels like some sort of bottleneck in the client's ability to parallelise
> requests? We work around this by not maxing out our network.
>
> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per second)
> quickly eat up the CPU on the re-export server and perf top shows we are mostly
> in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache
> like that added to NFSv3? Our workaround is to either fix the thing doing lots
> of repeated open/closes or use NFSv3 instead.
>
> If you made it this far, I've probably taken up way too much of your valuable
> time already. If nobody is interested in this rather niche application of the
> Linux client & knfsd, then I totally understand and I will not mention it here
> again. If your interest is piqued however, I'm happy to go into more detail
> about any of this with the hope that this could become a better documented and
> understood type of setup that others with similar workloads could reference.
>
> Also, many thanks to all the Linux NFS developers for the amazing work you do
> which, in turn, helps us to make great movies. :)
>
> Daire (Head of Systems DNEG)

2020-09-08 11:09:21

by Daire Byrne

[permalink] [raw]

Subject: Re: Adventures in NFS re-exporting

Tigran,

I guess I never really considered overlayfs because we still want to seamlessly write through to the original servers from time to time and post processing the copies from upper to lower seems like it might be hard to make reliable or do with low latency? I would also worry that our lower filesystem is being actively updated by processes outside of the overlay clients and how overlayfs would deal with that. And ultimately, the COW nature of overlayfs is a somewhat wasted feature for our workloads whereby it's the caching of file reads (and metadata) we care most about.

I must confess to not having looked at overlayfs in a few years so there may be lots of new tricks and options that would help our case. I'm aware that it gained the ability to NFS (re-)export a couple of years back.

But I'm certainly now interested to know if that NFS re-export implementation fares any better with the issues I experience with a direct knfsd re-export of an NFS client. So I will do some testing with overlayfs and see how it stacks up (see what I did there?).

Thanks for the suggestion!

Daire

----- On 8 Sep, 2020, at 10:40, Mkrtchyan, Tigran [email protected] wrote:

> Just out of curiosity:
>
> did you have tries instead of re-exporting nfs mount directly
> re-export an overlayfs mount on top of the original nfs mount?
> Such setup should cover most of your issues.
>
> Regards,
> Tigran.
>
> ----- Original Message -----
>> From: "Daire Byrne" <[email protected]>
>> To: "linux-nfs" <[email protected]>
>> Cc: [email protected]
>> Sent: Monday, September 7, 2020 7:31:00 PM
>> Subject: Adventures in NFS re-exporting
>
>> Hi,
>>
>> Apologies for this rather long email, but I thought there may be some interest
>> out there in the community in how and why we've been doing something
>> unsupported and barely documented - NFS re-exporting! And I'm not sure I can
>> tell our story well in just a few short sentences so please bear with me (or
>> stop now!).
>>
>> Full disclosure - I am also rather hoping that this story piques some interest
>> amongst developers to help make our rather niche setup even better and perhaps
>> a little better documented. I also totally understand if this is something
>> people wouldn't want to touch with a very long barge pole....
>>
>> First a quick bit of history (I hope I have this right). Late in 2015, Jeff
>> Layton proposed a patch series allowing knfsd to re-export a NFS client mount.
>> The rationale then was to provide a "proxy" server that could mount an NFSv4
>> only server and re-export it to older clients that only supported NFSv3. One of
>> the main sticking points then (as now), was around the 63 byte limit of
>> filehandles for NFSv3 and how it couldn't be guaranteed that all re-exported
>> filehandles would fit within that (in my experience it mostly works with
>> "no_subtree_check"). There are also the usual locking and coherence concerns
>> with NFSv3 too but I'll get to that in a bit.
>>
>> Then almost two years later, v4.13 was released including parts of the patch
>> series that actually allowed the re-export and since then other relevant bits
>> (such as the open file cache) have also been merged. I soon became interested
>> in using this new functionality to both accelerate our on-premises NFS storage
>> and use it as a "WAN cache" to provide cloud compute instances locally cached
>> proxy access to our on-premises storage.
>>
>> Cut to a brief introduction to us and what we do... DNEG is an award winning VFX
>> company which uses large compute farms to generate complex final frame renders
>> for movies and TV. This workload mostly consists of reads of common data shared
>> between many render clients (e.g textures, geometry) and a little unique data
>> per frame. All file writes are to unique files per process (frames) and there
>> is very little if any writing over existing files. Hence it's not very
>> demanding on locking and coherence guarantees.
>>
>> When our on-premises NFS storage is being overloaded or the server's network is
>> maxed out, we can place multiple re-export servers in between them and our farm
>> to improve performance. When our on-premises render farm is not quite big
>> enough to meet a deadline, we spin up compute instances with a (reasonably
>> local) cloud provider. Some of these cloud instances are Linux NFS servers
>> which mount our on-premises NFS storage servers (~10ms away) and re-export
>> these to the other cloud (render) instances. Since we know that the data we are
>> reading doesn't change often, we can increase the actimeo and even use nocto to
>> reduce the network chatter back to the on-prem servers. These re-export servers
>> also use fscache/cachefiles to cache data to disk so that we can retain TBs of
>> previously read data locally in the cloud over long periods of time. We also
>> use NFSv4 (less network chatter) all the way from our on-prem storage to the
>> re-export server and then on to the clients.
>>
>> The re-export server(s) quickly builds up both a memory cache and disk backed
>> fscache/cachefiles storage cache of our working data set so the data being
>> pulled from on-prem lessens over time. Data is only ever read once over the WAN
>> network from on-prem storage and then read multiple times by the many render
>> client instances in the cloud. Recent NFS features such as "nconnect" help to
>> speed up the initial reading of data from on-prem by using multiple connections
>> to offset TCP latency. At the end of the render, we write the files back
>> through the re-export server to our on-prem storage. Our average read bandwidth
>> is many times higher than our write bandwidth.
>>
>> Rather surprisingly, this mostly works for our particular workloads. We've
>> completed movies using this setup and saved money on commercial caching systems
>> (e.g Avere, GPFS, etc). But there are still some remaining issues with doing
>> something that is very much not widely supported (or recommended). In most
>> cases we have worked around them, but it would be great if we didn't have to so
>> others could also benefit. I will list the main problems quickly now and
>> provide more information and reproducers later if anyone is interested.
>>
>> 1) The kernel can drop entries out of the NFS client inode cache (under memory
>> cache churn) when those filehandles are still being used by the knfsd's remote
>> clients resulting in sporadic and random stale filehandles. This seems to be
>> mostly for directories from what I've seen. Does the NFS client not know that
>> knfsd is still using those files/dirs? The workaround is to never drop inode &
>> dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps
>> to ensure that we actually make the most of our actimeo=3600,nocto mount
>> options for the full specified time.
>>
>> 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> cut the network packets back to the origin server to zero for repeated lookups.
>> However, if a client of the re-export server walks paths and memory maps those
>> files (i.e. loading an application), the re-export server starts issuing
>> unexpected calls back to the origin server again, ignoring/invalidating the
>> re-export server's NFS client cache. We worked around this this by patching an
>> inode/iversion validity check in inode.c so that the NFS client cache on the
>> re-export server is used. I'm not sure about the correctness of this patch but
>> it works for our corner case.
>>
>> 3) If we saturate an NFS client's network with reads from the server, all client
>> metadata lookups become unbearably slow even if it's all cached in the NFS
>> client's memory and no network RPCs should be required. This is the case for
>> any NFS client regardless of re-exporting but it affects this case more because
>> when we can't serve cached metadata we also can't serve the cached data. It
>> feels like some sort of bottleneck in the client's ability to parallelise
>> requests? We work around this by not maxing out our network.
>>
>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per second)
>> quickly eat up the CPU on the re-export server and perf top shows we are mostly
>> in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache
>> like that added to NFSv3? Our workaround is to either fix the thing doing lots
>> of repeated open/closes or use NFSv3 instead.
>>
>> If you made it this far, I've probably taken up way too much of your valuable
>> time already. If nobody is interested in this rather niche application of the
>> Linux client & knfsd, then I totally understand and I will not mention it here
>> again. If your interest is piqued however, I'm happy to go into more detail
>> about any of this with the hope that this could become a better documented and
>> understood type of setup that others with similar workloads could reference.
>>
>> Also, many thanks to all the Linux NFS developers for the amazing work you do
>> which, in turn, helps us to make great movies. :)
>>
> > Daire (Head of Systems DNEG)

2020-09-15 17:24:33

Subject: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Attachments:

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: [Linux-cachefs] Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: [PATCH 0/2] Add NFSv3 emulation of the lookupp operation

Subject: [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry

Subject: [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation

Subject: [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation

Subject: [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation

Subject: [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c

Subject: [PATCH 3/4] nfs: don't mangle i_version on NFS

Subject: Re: [PATCH 3/4] nfs: don't mangle i_version on NFS

Subject: Re: [PATCH 3/4] nfs: don't mangle i_version on NFS

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: RE: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: RE: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: RE: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting

Subject: Re: Adventures in NFS re-exporting