Hi,
I've been experimenting a bit more with high latency NFSv4.2 (200ms).
I've noticed a difference between the file creation rates when you
have parallel processes running against a single client mount creating
files in multiple directories compared to in one shared directory.
If I start 100 processes on the same client creating unique files in a
single shared directory (with 200ms latency), the rate of new file
creates is limited to around 3 files per second. Something like this:
# add latency to the client
sudo tc qdisc replace dev eth0 root netem delay 200ms
sudo mount -o vers=4.2,nocto,actimeo=3600 server:/data /tmp/data
for x in {1..10000}; do
echo /tmp/data/dir1/touch.$x
done | xargs -n1 -P 100 -iX -t touch X 2>&1 | pv -l -a > /dev/null
It's a similar (slow) result for NFSv3. If we run it again just to
update the existing files, it's a lot faster because of the
nocto,actimeo and open file caching (32 files/s).
Then if I switch it so that each process on the client creates
hundreds of files in a unique directory per process, the aggregate
file create rate increases to 32 per second. For NFSv3 it's 162
aggregate new files per second. So much better parallelism is possible
when the creates are spread across multiple remote directories on the
same client.
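The multi-directory version of the test looked something like this (this sketch just spreads the creates over 100 directories via xargs rather than strictly pinning one directory per process, but the effect is much the same):

# pre-create the per-process directories first
for d in {1..100}; do mkdir -p /tmp/data/dir$d; done

for x in {1..10000}; do
echo /tmp/data/dir$((x % 100 + 1))/touch.$x
done | xargs -n1 -P 100 -iX -t touch X 2>&1 | pv -l -a > /dev/null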
If I then take the slow 3 creates per second example again and instead
use 10 client hosts (all with 200ms latency) and set them all creating
in the same remote server directory, then we get 3 x 10 = 30 creates
per second.
So we can achieve some parallel file create performance in the same
remote directory but just not from a single client running multiple
processes. Which makes me think it's more of a client limitation
rather than a server locking issue?
My interest in this (as always) is because while having hundreds of
processes creating files in the same directory might not be a common
workload, it is if you are re-exporting a filesystem and multiple
clients are creating new files for writing. For example a batch job
creating files in a common output directory.
Re-exporting is a useful way of caching mostly read heavy workloads
but then performance suffers for these metadata heavy or writing
workloads. The parallel performance (nfsd threads) with a single
client mountpoint just can't compete with directly connected clients
to the originating server.
Does anyone have any idea what the specific bottlenecks are here for
parallel file creates from a single client to a single directory?
Cheers,
Daire
Hi,
This seemed like a good test case for Neil Brown's "namespaces" patch:
https://lore.kernel.org/linux-nfs/[email protected]
The interesting thing about this is that we get independent slot
tables for the same remote server (and directory).
So we can test like this:
# mount server 10 times with a different namespace
for x in {0..9}; do
sudo mkdir -p /srv/data-$x
sudo mount -o vers=4.2,namespace=server${x},actimeo=3600,nocto \
    server:/data /srv/data-${x}
done
# create files across the namespace mounts but in same remote directory
for x in {1..2000}; do
echo /srv/data-$((RANDOM %10))/dir1/touch.$x
done | xargs -n1 -P 100 -iX -t touch X 2>&1 | pv -l -a >|/dev/null
Doing this we get the same file create rate (32/s) as if we had used
10 individual clients.
I can only assume this is because of the independent slot table RPC queues?
But I have no idea why that also seems to affect the rate depending on
whether you use multiple remote directories or a single shared
directory.
So in summary:
* concurrent processes creating files in a single remote directory = slow
* concurrent processes creating files across many directories = fast
* concurrent clients creating files in a shared remote directory = fast
* concurrent namespaces creating files in a shared remote directory = fast
There is probably also some overlap with my previous queries around
parallel io/metadata performance:
https://marc.info/?t=160199739400001&r=2&w=4
Daire
On Sun, Jan 23, 2022 at 11:53:08PM +0000, Daire Byrne wrote:
> I've been experimenting a bit more with high latency NFSv4.2 (200ms).
> I've noticed a difference between the file creation rates when you
> have parallel processes running against a single client mount creating
> files in multiple directories compared to in one shared directory.
The Linux VFS requires an exclusive lock on the directory while you're
creating a file.
So, if L is the time in seconds required to create a single file, you're
never going to be able to create more than 1/L files per second, because
there's no parallelism.
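(As a back-of-the-envelope example: with a 200ms round trip, L is at
least 0.2 seconds before you even count the server's own commit time,
so from one client into one directory you can never do much better than
1/0.2 = 5 creates per second, which is the same ballpark as the ~3/s
you measured.)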
So, it's not surprising you'd get a higher rate when creating in
multiple directories.
Also, that lock's taken on both client and server. So it makes sense
that you might get a little more parallelism from multiple clients.
So the usual advice is just to try to get that latency number as low as
possible, by using a low-latency network and storage that can commit
very quickly. (An NFS server isn't permitted to reply to the RPC
creating the new file until the new file actually hits stable storage.)
Are you really seeing 200ms in production?
--b.
On Mon, 24 Jan 2022 at 19:38, J. Bruce Fields <[email protected]> wrote:
>
> On Sun, Jan 23, 2022 at 11:53:08PM +0000, Daire Byrne wrote:
> > I've been experimenting a bit more with high latency NFSv4.2 (200ms).
> > I've noticed a difference between the file creation rates when you
> > have parallel processes running against a single client mount creating
> > files in multiple directories compared to in one shared directory.
>
> The Linux VFS requires an exclusive lock on the directory while you're
> creating a file.
Right. So when I mounted the same server/dir multiple times using
namespaces, all I was really doing was making the VFS *think* I wanted
locks on different directories even though the remote server directory
was actually the same?
> So, if L is the time in seconds required to create a single file, you're
> never going to be able to create more than 1/L files per second, because
> there's no parallelism.
And things like directory delegations can't help with this kind of
workload? You can't batch directory locks or file creates, I guess.
> So, it's not surprising you'd get a higher rate when creating in
> multiple directories.
>
> Also, that lock's taken on both client and server. So it makes sense
> that you might get a little more parallelism from multiple clients.
>
> So the usual advice is just to try to get that latency number as low as
> possible, by using a low-latency network and storage that can commit
> very quickly. (An NFS server isn't permitted to reply to the RPC
> creating the new file until the new file actually hits stable storage.)
>
> Are you really seeing 200ms in production?
Yea, it's just a (crazy) test for now. This is the latency between two
of our offices. Running batch jobs over this kind of latency with a
NFS re-export server doing all the caching works surprisingly well.
It's just these file creations that are the deal breaker. A batch job
might create 100,000+ files in a single directory across many clients.
Maybe many containerised re-export servers in round-robin with a
common cache is the only way to get more directory locks and file
creates in flight at the same time.
Cheers,
Daire
On Mon, Jan 24, 2022 at 08:10:07PM +0000, Daire Byrne wrote:
> On Mon, 24 Jan 2022 at 19:38, J. Bruce Fields <[email protected]> wrote:
> >
> > On Sun, Jan 23, 2022 at 11:53:08PM +0000, Daire Byrne wrote:
> > > I've been experimenting a bit more with high latency NFSv4.2 (200ms).
> > > I've noticed a difference between the file creation rates when you
> > > have parallel processes running against a single client mount creating
> > > files in multiple directories compared to in one shared directory.
> >
> > The Linux VFS requires an exclusive lock on the directory while you're
> > creating a file.
>
> Right. So when I mounted the same server/dir multiple times using
> namespaces, all I was really doing was making the VFS *think* I wanted
> locks on different directories even though the remote server directory
> was actually the same?
In that scenario the client-side locks are probably all different, but
they'd all have to wait for the same lock on the server side, yes.
> > So, if L is the time in seconds required to create a single file, you're
> > never going to be able to create more than 1/L files per second, because
> > there's no parallelism.
>
> And things like directory delegations can't help with this kind of
> workload? You can't batch directories locks or file creates I guess.
Alas, there are directory delegations specified in RFC 8881, but they
are read-only, and nobody's implemented them.
Directory write delegations could help a lot, if they existed.
> > So, it's not surprising you'd get a higher rate when creating in
> > multiple directories.
> >
> > Also, that lock's taken on both client and server. So it makes sense
> > that you might get a little more parallelism from multiple clients.
> >
> > So the usual advice is just to try to get that latency number as low as
> > possible, by using a low-latency network and storage that can commit
> > very quickly. (An NFS server isn't permitted to reply to the RPC
> > creating the new file until the new file actually hits stable storage.)
> >
> > Are you really seeing 200ms in production?
>
> Yea, it's just a (crazy) test for now. This is the latency between two
> of our offices. Running batch jobs over this kind of latency with a
> NFS re-export server doing all the caching works surprisingly well.
>
> It's just these file creations that's the deal breaker. A batch job
> might create 100,000+ files in a single directory across many clients.
>
> Maybe many containerised re-export servers in round-robin with a
> common cache is the only way to get more directory locks and file
> creates in flight at the same time.
ssh into the original server and create the files there?
I've got no help, sorry.
The client-side locking does seem redundant to some degree, but I don't
know what to do about it.
--b.
On Mon, 24 Jan 2022 at 20:50, J. Bruce Fields <[email protected]> wrote:
>
> On Mon, Jan 24, 2022 at 08:10:07PM +0000, Daire Byrne wrote:
> > On Mon, 24 Jan 2022 at 19:38, J. Bruce Fields <[email protected]> wrote:
> > >
> > > On Sun, Jan 23, 2022 at 11:53:08PM +0000, Daire Byrne wrote:
> > > > I've been experimenting a bit more with high latency NFSv4.2 (200ms).
> > > > I've noticed a difference between the file creation rates when you
> > > > have parallel processes running against a single client mount creating
> > > > files in multiple directories compared to in one shared directory.
> > >
> > > The Linux VFS requires an exclusive lock on the directory while you're
> > > creating a file.
> >
> > Right. So when I mounted the same server/dir multiple times using
> > namespaces, all I was really doing was making the VFS *think* I wanted
> > locks on different directories even though the remote server directory
> > was actually the same?
>
> In that scenario the client-side locks are probably all different, but
> they'd all have to wait for the same lock on the server side, yes.
Yea, I was totally overthinking the problem. Thanks for setting me straight.
> > > So, if L is the time in seconds required to create a single file, you're
> > > never going to be able to create more than 1/L files per second, because
> > > there's no parallelism.
> >
> > And things like directory delegations can't help with this kind of
> > workload? You can't batch directories locks or file creates I guess.
>
> Alas, there are directory delegations specified in RFC 8881, but they
> are read-only, and nobody's implemented them.
>
> Directory write delegations could help a lot, if they existed.
Shame. And tackling that problem is way past my ability.
> > > So, it's not surprising you'd get a higher rate when creating in
> > > multiple directories.
> > >
> > > Also, that lock's taken on both client and server. So it makes sense
> > > that you might get a little more parallelism from multiple clients.
> > >
> > > So the usual advice is just to try to get that latency number as low as
> > > possible, by using a low-latency network and storage that can commit
> > > very quickly. (An NFS server isn't permitted to reply to the RPC
> > > creating the new file until the new file actually hits stable storage.)
> > >
> > > Are you really seeing 200ms in production?
> >
> > Yea, it's just a (crazy) test for now. This is the latency between two
> > of our offices. Running batch jobs over this kind of latency with a
> > NFS re-export server doing all the caching works surprisingly well.
> >
> > It's just these file creations that's the deal breaker. A batch job
> > might create 100,000+ files in a single directory across many clients.
> >
> > Maybe many containerised re-export servers in round-robin with a
> > common cache is the only way to get more directory locks and file
> > creates in flight at the same time.
>
> ssh into the original server and crate the files there?
That might work. Perhaps we can figure out the expected file outputs
and make that a local LAN task that runs first.
Actually, I can probably make it better by having each batch process
(i.e. the clients of the re-export server) create its output files in a
unique directory rather than have all the files in one big shared
directory. There is still the slow creation of all those subdirs, but
it's an order of magnitude fewer creates in one directory. Then the
files created across the subdirs can be parallelised on the wire.
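Roughly speaking, each batch process would do something like this
before it starts writing (paths made up for illustration):

outdir=/mnt/reexport/output/$(hostname)-$$
mkdir -p "$outdir"
# ... the job then writes all of its files under $outdir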
> I've got no help, sorry.
>
> The client-side locking does seem redundant to some degree, but I don't
> know what to do about it.
Yea, it does seem like the server is the ultimate arbiter and the
fact that multiple clients can achieve much higher rates of
parallelism does suggest that the VFS locking per client is somewhat
redundant and limiting (in this super niche case).
I can have multiple round-robin re-export servers but then they all
have to fetch the same data multiple times.
I'll read up on containerised NFS but it's not clear that you could
ever have a shared pagecache or fscache across multiple instances of
re-exports on the same server. Similar for Neil's "namespaces" patch I
think.
Thanks again for your help (and patience).
Daire
On Tue, Jan 25, 2022 at 12:52:46PM +0000, Daire Byrne wrote:
> Yea, it does seem like the server is the ultimate arbitrar and the
> fact that multiple clients can achieve much higher rates of
> parallelism does suggest that the VFS locking per client is somewhat
> redundant and limiting (in this super niche case).
It doesn't seem *so* weird to have a server with fast storage a long
round-trip time away, in which case the client-side operation could take
several orders of magnitude longer than the server.
Though even if the client locking wasn't a factor, you might still have
to do some work to take advantage of that. (E.g. if your workload is
just a single "untar"--it still waits for one create before doing the
next one).
--b.
On Tue, 25 Jan 2022 at 14:00, J. Bruce Fields <[email protected]> wrote:
>
> On Tue, Jan 25, 2022 at 12:52:46PM +0000, Daire Byrne wrote:
> > Yea, it does seem like the server is the ultimate arbitrar and the
> > fact that multiple clients can achieve much higher rates of
> > parallelism does suggest that the VFS locking per client is somewhat
> > redundant and limiting (in this super niche case).
>
> It doesn't seem *so* weird to have a server with fast storage a long
> round-trip time away, in which case the client-side operation could take
> several orders of magnitude longer than the server.
Yea, I'm fine with the speed of light constraints for a single process
far away. But the best way to achieve aggregate performance in such
environments is to have multiple parallel streams in flight at once
(preferably bulk transfers).
Because I am writing through a single re-export server, I just so
happen to be killing any parallelism for single-directory file
creates, even though it works reasonably well for opens, reads, writes
and stat (and anything cacheable) which all retain a certain amount of
useful parallelism over a high latency network (nconnect helps too).
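For reference, the re-export server's client mount is along these lines
(exact option values are illustrative):

sudo mount -o vers=4.2,nocto,actimeo=3600,nconnect=16 server:/data /srv/data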
Each of our batch jobs reads 95% of the same files each time, and
they all tend to run within the same short (hour-long) time periods,
so it's all highly cacheable.
> Though even if the client locking wasn't a factor, you might still have
> to do some work to take advantage of that. (E.g. if your workload is
> just a single "untar"--it still waits for one create before doing the
> next one).
Yep. Again I'm okay with each client of a re-export server doing 3
creates per second, my problem is that all N instances together do 3
creates per second aggregate (in a single directory).
But I guess for this kind of workload, win some lose some. I just need
to figure out if I can engineer it to be less of a loser...
Of course, Hammerspace have a different approach to this kind of
problem with their global namespace and replicated MDT servers. And
that is probably a much more sensible way of going about this kind of
thing.
Daire
> On Jan 25, 2022, at 8:59 AM, J. Bruce Fields <[email protected]> wrote:
>
> On Tue, Jan 25, 2022 at 12:52:46PM +0000, Daire Byrne wrote:
>> Yea, it does seem like the server is the ultimate arbitrar and the
>> fact that multiple clients can achieve much higher rates of
>> parallelism does suggest that the VFS locking per client is somewhat
>> redundant and limiting (in this super niche case).
>
> It doesn't seem *so* weird to have a server with fast storage a long
> round-trip time away, in which case the client-side operation could take
> several orders of magnitude longer than the server.
>
> Though even if the client locking wasn't a factor, you might still have
> to do some work to take advantage of that. (E.g. if your workload is
> just a single "untar"--it still waits for one create before doing the
> next one).
Note that this is also an issue for data center area filesystems, where
back-end replication of metadata updates makes creates and deletes as
slow as if they were being done on storage hundreds of miles away.
The solution of choice appears to be to replace tar/rsync and such
tools with versions that are smarter about parallelizing file creation
and deletion.
--
Chuck Lever
On 1/24/22 13:37, J. Bruce Fields wrote:
> On Sun, Jan 23, 2022 at 11:53:08PM +0000, Daire Byrne wrote:
>> I've been experimenting a bit more with high latency NFSv4.2 (200ms).
>> I've noticed a difference between the file creation rates when you
>> have parallel processes running against a single client mount creating
>> files in multiple directories compared to in one shared directory.
>
> The Linux VFS requires an exclusive lock on the directory while you're
> creating a file.
>
> So, if L is the time in seconds required to create a single file, you're
> never going to be able to create more than 1/L files per second, because
> there's no parallelism.
So the directory is locked while the inode is created, or something like
this, which makes sense. File creation means the directory "file" is
being updated. Just to be clear, though, from your ssh suggestion below,
this limitation does not exist if an existing file is being updated?
>
> So, it's not surprising you'd get a higher rate when creating in
> multiple directories.
>
> Also, that lock's taken on both client and server. So it makes sense
> that you might get a little more parallelism from multiple clients.
>
> So the usual advice is just to try to get that latency number as low as
> possible, by using a low-latency network and storage that can commit
> very quickly. (An NFS server isn't permitted to reply to the RPC
> creating the new file until the new file actually hits stable storage.)
>
> Are you really seeing 200ms in production?
>
> --b.
On Tue, Jan 25, 2022 at 03:15:42PM -0600, Patrick Goetz wrote:
> So the directory is locked while the inode is created, or something
> like this, which makes sense.
It accomplishes a number of things, details in
https://www.kernel.org/doc/html/latest/filesystems/directory-locking.html
> File creation means the directory
> "file" is being updated. Just to be clear, though, from your ssh
> suggestion below, this limitation does not exist if an existing file
> is being updated?
You don't need to take the exclusive i_rwsem lock on the directory to
update an existing file, no.
(But I was only suggesting that creating a bunch of files by ssh'ing
into the server first and doing the create there would be faster,
because the latency of each file create is less when you're running it
directly on the server, as opposed to over a wide-area network
connection.)
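(For example, something as simple as:

ssh server 'cd /data/dir1 && touch output.{1..10000}'

avoids all the per-create WAN round trips.)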
--b.
On 1/25/22 09:30, Chuck Lever III wrote:
>
>
>> On Jan 25, 2022, at 8:59 AM, J. Bruce Fields <[email protected]> wrote:
>>
>> On Tue, Jan 25, 2022 at 12:52:46PM +0000, Daire Byrne wrote:
>>> Yea, it does seem like the server is the ultimate arbitrar and the
>>> fact that multiple clients can achieve much higher rates of
>>> parallelism does suggest that the VFS locking per client is somewhat
>>> redundant and limiting (in this super niche case).
>>
>> It doesn't seem *so* weird to have a server with fast storage a long
>> round-trip time away, in which case the client-side operation could take
>> several orders of magnitude longer than the server.
>>
>> Though even if the client locking wasn't a factor, you might still have
>> to do some work to take advantage of that. (E.g. if your workload is
>> just a single "untar"--it still waits for one create before doing the
>> next one).
>
> Note that this is also an issue for data center area filesystems, where
> back-end replication of metadata updates makes creates and deletes as
> slow as if they were being done on storage hundreds of miles away.
>
> The solution of choice appears to be to replace tar/rsync and such
> tools with versions that are smarter about parallelizing file creation
> and deletion.
>
Are these tools available to mere mortals? If so, what are they called?
This is a problem I'm currently dealing with: trying to back up
hundreds of terabytes of image data.
> On Jan 25, 2022, at 4:50 PM, Patrick Goetz <[email protected]> wrote:
>
>
>
> On 1/25/22 09:30, Chuck Lever III wrote:
>>> On Jan 25, 2022, at 8:59 AM, J. Bruce Fields <[email protected]> wrote:
>>>
>>> On Tue, Jan 25, 2022 at 12:52:46PM +0000, Daire Byrne wrote:
>>>> Yea, it does seem like the server is the ultimate arbitrar and the
>>>> fact that multiple clients can achieve much higher rates of
>>>> parallelism does suggest that the VFS locking per client is somewhat
>>>> redundant and limiting (in this super niche case).
>>>
>>> It doesn't seem *so* weird to have a server with fast storage a long
>>> round-trip time away, in which case the client-side operation could take
>>> several orders of magnitude longer than the server.
>>>
>>> Though even if the client locking wasn't a factor, you might still have
>>> to do some work to take advantage of that. (E.g. if your workload is
>>> just a single "untar"--it still waits for one create before doing the
>>> next one).
>> Note that this is also an issue for data center area filesystems, where
>> back-end replication of metadata updates makes creates and deletes as
>> slow as if they were being done on storage hundreds of miles away.
>> The solution of choice appears to be to replace tar/rsync and such
>> tools with versions that are smarter about parallelizing file creation
>> and deletion.
>
> Are these tools available to mere mortals? If so, what are they called. This is a problem I'm currently dealing with; trying to back up hundreds of terabytes of image data.
They are available to cloud customers (like Oracle and Amazon) I believe,
and possibly for Azure folks too. Try Google, I'm sorry I don't have a
link handy.
parcp? Something like that.
--
Chuck Lever
On Tue, Jan 25, 2022 at 03:50:05PM -0600, Patrick Goetz wrote:
> On 1/25/22 09:30, Chuck Lever III wrote:
> >>On Jan 25, 2022, at 8:59 AM, J. Bruce Fields <[email protected]> wrote:
> >>On Tue, Jan 25, 2022 at 12:52:46PM +0000, Daire Byrne wrote:
> >>>Yea, it does seem like the server is the ultimate arbitrar and the
> >>>fact that multiple clients can achieve much higher rates of
> >>>parallelism does suggest that the VFS locking per client is somewhat
> >>>redundant and limiting (in this super niche case).
> >>
> >>It doesn't seem *so* weird to have a server with fast storage a long
> >>round-trip time away, in which case the client-side operation could take
> >>several orders of magnitude longer than the server.
> >>
> >>Though even if the client locking wasn't a factor, you might still have
> >>to do some work to take advantage of that. (E.g. if your workload is
> >>just a single "untar"--it still waits for one create before doing the
> >>next one).
> >
> >Note that this is also an issue for data center area filesystems, where
> >back-end replication of metadata updates makes creates and deletes as
> >slow as if they were being done on storage hundreds of miles away.
> >
> >The solution of choice appears to be to replace tar/rsync and such
> >tools with versions that are smarter about parallelizing file creation
> >and deletion.
>
> Are these tools available to mere mortals? If so, what are they
> called. This is a problem I'm currently dealing with; trying to
> back up hundreds of terabytes of image data.
How many files, though?
Writes of file data *should* be limited mainly just by your network and
disk bandwidth.
Creation of files is limited by network and disk latency, is more
complicated, and is where multiple processes are more likely to help.
--b.
On 1/25/22 15:59, Bruce Fields wrote:
> On Tue, Jan 25, 2022 at 03:50:05PM -0600, Patrick Goetz wrote:
>> On 1/25/22 09:30, Chuck Lever III wrote:
>>>> On Jan 25, 2022, at 8:59 AM, J. Bruce Fields <[email protected]> wrote:
>>>> On Tue, Jan 25, 2022 at 12:52:46PM +0000, Daire Byrne wrote:
>>>>> Yea, it does seem like the server is the ultimate arbitrar and the
>>>>> fact that multiple clients can achieve much higher rates of
>>>>> parallelism does suggest that the VFS locking per client is somewhat
>>>>> redundant and limiting (in this super niche case).
>>>>
>>>> It doesn't seem *so* weird to have a server with fast storage a long
>>>> round-trip time away, in which case the client-side operation could take
>>>> several orders of magnitude longer than the server.
>>>>
>>>> Though even if the client locking wasn't a factor, you might still have
>>>> to do some work to take advantage of that. (E.g. if your workload is
>>>> just a single "untar"--it still waits for one create before doing the
>>>> next one).
>>>
>>> Note that this is also an issue for data center area filesystems, where
>>> back-end replication of metadata updates makes creates and deletes as
>>> slow as if they were being done on storage hundreds of miles away.
>>>
>>> The solution of choice appears to be to replace tar/rsync and such
>>> tools with versions that are smarter about parallelizing file creation
>>> and deletion.
>>
>> Are these tools available to mere mortals? If so, what are they
>> called. This is a problem I'm currently dealing with; trying to
>> back up hundreds of terabytes of image data.
>
> How many files, though?
>
IDK, 4000 images per collection, with hundreds of collections on disk?
Say at least 500,000 files? Maybe a million? With most files about 1GB
in size. I was trying to just rsync it all from the data server to a
ZFS-based backup server in our data center, but the backup started
failing constantly because the filesystem would change after rsync had
already constructed an index. Even after an initial copy, a backup like
that runs for over a week. The strategy I'm about to try and implement
is to NFS mount the data server's data partition to the backup server
and then have a script walk through the directory hierarchy, rsyncing
collections one at a time. ZFS send/receive would probably be better,
but the data server isn't configured with ZFS.
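Probably something as simple as this (paths made up):

# rsync one collection at a time over the NFS mount
for coll in /mnt/dataserver/collections/*/; do
  rsync -a "$coll" "/backup/collections/$(basename "$coll")/"
done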
On Tue, 25 Jan 2022 at 22:11, Patrick Goetz <[email protected]> wrote:
>
> IDK, 4000 images per collection, with hundreds of collections on disk?
> Say at least 500,000 files? Maybe a million? With most files about 1GB
> in size. I was trying to just rsync it all from the data server to a
> ZFS-based backup server in our data center, but the backup started
> failing constantly because the filesystem would change after rsync had
> already constructed an index. Even after an initial copy, a backup like
> that runs for over a week. The strategy I'm about to try and implement
> is to NFS mount the data server's data partition to the backup server
> and then have a script walk through the directory hierarchy, rsyncing
> collections one at a time. ZFS send/receive would probably be better,
> but the data server isn't configured with ZFS.
We've strayed slightly off topic (even if we are talking about file
creates over NFS) because you can get good parallel performance
(creates, reads, writes etc.) over NFS with simultaneous copies using
lots of processes if they are distributed across lots of directories.
Well, "good" being subjective. I get 1,500 creates/s in a single
directory on a LAN NFS server from a single client, and 160 creates/s
aggregate over my extreme 200ms link using 10 clients & 10 different
directories. It seems fair, all things considered.
But seeing as I do a lot of these kinds of big data moves (TBs) across
both the LAN and WAN, I can perhaps offer some advice from experience
that might be useful:
* walk the filesystem (locally) first to build a file list, split it
and then use rsync --files-from (e.g. https://github.com/jbd/msrsync)
to feed multiple simultaneous rsyncs.
* avoid NFS and use rsyncd directly between the servers (no ssh) so
filesystem walks are "local".
The advantage of rsync is that it will do the filesystem walks at both
ends locally and compare the directory trees as it goes along. The
other nice thing it does is open a connection between sender and
receiver and stream all the file data down it so it works really well
even for lists of small files. The TCP connection and window scaling
can sit at its maximum without any slow remote file metadata latency
disrupting it. Avoid the encapsulation of ssh and use rsyncd instead,
as it just speeds everything up (rough sketch below).
And as always with any WAN connection, large buffers, window scaling,
no firewall DPI and maybe some fancy congestion control like BBR/2
helps.
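A minimal sketch of the split-and-feed approach (the rsyncd module name
and paths here are made up; msrsync does essentially this for you):

cd /data && find . -type f > /tmp/filelist
# split the list into 8 chunks without breaking lines
split -n l/8 /tmp/filelist /tmp/chunk.
for c in /tmp/chunk.*; do
  rsync -a --files-from="$c" /data/ rsync://backupserver/data/ &
done
wait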
Daire
On 1/25/22 16:41, Daire Byrne wrote:
> On Tue, 25 Jan 2022 at 22:11, Patrick Goetz <[email protected]> wrote:
>>
>> IDK, 4000 images per collection, with hundreds of collections on disk?
>> Say at least 500,000 files? Maybe a million? With most files about 1GB
>> in size. I was trying to just rsync it all from the data server to a
>> ZFS-based backup server in our data center, but the backup started
>> failing constantly because the filesystem would change after rsync had
>> already constructed an index. Even after an initial copy, a backup like
>> that runs for over a week. The strategy I'm about to try and implement
>> is to NFS mount the data server's data partition to the backup server
>> and then have a script walk through the directory hierarchy, rsyncing
>> collections one at a time. ZFS send/receive would probably be better,
>> but the data server isn't configured with ZFS.
>
> We've strayed slightly off topic (even if we are talking about file
> creates over NFS) because you can get good parallel performance
> (creates, read, writes etc) over NFS with simultaneous copies using
> lots of processes if distributed across lots of directories.
>
> Well "good" being subjective. I get 1,500 creates/s in a single
> directory on a LAN NFS server from a single client and 160 creates/s
> aggregate over my extreme 200ms using 10 clients & 10 different
> directories. It seems fair all things considered.
>
> But seeing as I do a lot of these kinds of big data moves (TBs) across
> both the LAN and WAN, I can perhaps offer some advice from experience
> that might be useful:
>
> * walk the filesystem (locally) first to build a file list, split it
> and then use rsync --files-from (e.g. https://github.com/jbd/msrsync)
> to feed multiple simultaneous rsyncs.
> * avoid NFS and use rsyncd directly between the servers (no ssh) so
> filesystem walks are "local".
Thanks for this suggestion! This option didn't even occur to me. The
only downside is that this server gets really busy during image
processing, so I'm a bit worried about loading it down with dozens of
simultaneous rsync processes. Also, the biggest performance problem in
this system (which includes multiple GPU-laden workstations and 2 other
NFS servers) is always I/O bottlenecks. I suppose the solution is to
nice all the rsync processes to 19.
Question: given that I usually run backups from cron, and given that
they can take a long time, how does msrsync avoid stepping on itself?
On Tue, 25 Jan 2022 at 23:01, Patrick Goetz <[email protected]> wrote:
> On 1/25/22 16:41, Daire Byrne wrote:
> > On Tue, 25 Jan 2022 at 22:11, Patrick Goetz <[email protected]> wrote:
> Thanks for this suggestion! This option didn't even occur to me. The
> only downside is that this server gets really busy during image
> processing, so I'm a bit worried about loading it down with dozens of
> simultaneous rsync processes. Also, the biggest performance problem in
> this system (which includes multiple GPU-laden workstations and 2 other
> NFS servers) is always I/O bottlenecks. I suppose the solution is to
> nice all the rsync processes to 19.
Yea, that's always the problem with backups - when do you slow down
production because backups are more important? :)
You could also have another NFS client close (latency-wise) to the server,
which would free up CPU, but it's hard to limit IO. You can still use
rsync+rsyncd for the higher latency hop.
> Question: given that I usually run backups from cron, and given that
> they can take a long time, how does msrsync avoid stepping on itself?
I guess you would need a lockfile or pgrep for a running instance.
TBH, I wrote a simpler version of msrsync in bash that was more suited
to our environment but the principle is the same.
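e.g. wrapping the cron entry in flock would do it (the script path is
just a placeholder):

0 2 * * * flock -n /var/lock/backup.lock /usr/local/bin/backup-rsync.sh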
Daire
On Wed, 26 Jan 2022, J. Bruce Fields wrote:
> On Tue, Jan 25, 2022 at 03:15:42PM -0600, Patrick Goetz wrote:
> > So the directory is locked while the inode is created, or something
> > like this, which makes sense.
>
> It accomplishes a number of things, details in
> https://www.kernel.org/doc/html/latest/filesystems/directory-locking.html
Just in case anyone is interested, I wrote this a while back:
http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-November/008177.html
it includes a patch to allow parallel creates/deletes over NFS (and any
other filesystem which adds support).
I doubt it still applies, but it wouldn't be hard to make it work if
anyone was willing to make a strong case that we would benefit from
this.
NeilBrown
On Wed, 26 Jan 2022 at 00:02, NeilBrown <[email protected]> wrote:
>
> On Wed, 26 Jan 2022, J. Bruce Fields wrote:
> > On Tue, Jan 25, 2022 at 03:15:42PM -0600, Patrick Goetz wrote:
> > > So the directory is locked while the inode is created, or something
> > > like this, which makes sense.
> >
> > It accomplishes a number of things, details in
> > https://www.kernel.org/doc/html/latest/filesystems/directory-locking.html
>
> Just in case anyone is interested, I wrote this a while back:
>
> http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-November/008177.html
>
> it includes a patch to allow parallel creates/deletes over NFS (and any
> other filesystem which adds support).
> I doubt it still applies, but it wouldn't be hard to make it work if
> anyone was willing to make a strong case that we would benefit from
> this.
Oh wow! That would be really interesting to test for my (high latency)
use case and single directory parallel file creates.
However, what I'm doing is so niche that I doubt I could help make
much of a valid case for its inclusion.
Hopefully others might have better reasons than I...
Daire
On Wed, Jan 26, 2022 at 11:02:16AM +1100, NeilBrown wrote:
> On Wed, 26 Jan 2022, J. Bruce Fields wrote:
> > On Tue, Jan 25, 2022 at 03:15:42PM -0600, Patrick Goetz wrote:
> > > So the directory is locked while the inode is created, or something
> > > like this, which makes sense.
> >
> > It accomplishes a number of things, details in
> > https://www.kernel.org/doc/html/latest/filesystems/directory-locking.html
>
> Just in case anyone is interested, I wrote this a while back:
>
> http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-November/008177.html
>
> it includes a patch to allow parallel creates/deletes over NFS (and any
> other filesystem which adds support).
> I doubt it still applies, but it wouldn't be hard to make it work if
> anyone was willing to make a strong case that we would benefit from
> this.
Neato.
Removing the need to hold an exclusive lock on the directory across
server round trips seems compelling to me....
I also wonder: why couldn't you fire off the RPC without any locks, then
wait till you get a reply to take locks and update your local cache?
OK, for one thing, calls and replies and server processing could all get
reordered. We'd need to know what order the server processed operations
in, so we could process replies in the same order.
--b.
On Wed, 26 Jan 2022 at 02:57, J. Bruce Fields <[email protected]> wrote:
>
> On Wed, Jan 26, 2022 at 11:02:16AM +1100, NeilBrown wrote:
> > On Wed, 26 Jan 2022, J. Bruce Fields wrote:
> > > On Tue, Jan 25, 2022 at 03:15:42PM -0600, Patrick Goetz wrote:
> > > > So the directory is locked while the inode is created, or something
> > > > like this, which makes sense.
> > >
> > > It accomplishes a number of things, details in
> > > https://www.kernel.org/doc/html/latest/filesystems/directory-locking.html
> >
> > Just in case anyone is interested, I wrote this a while back:
> >
> > http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-November/008177.html
> >
> > it includes a patch to allow parallel creates/deletes over NFS (and any
> > other filesystem which adds support).
> > I doubt it still applies, but it wouldn't be hard to make it work if
> > anyone was willing to make a strong case that we would benefit from
> > this.
Well, I couldn't resist quickly testing Neil's patch. I found it
applied okay to v5.6.19 with minimal edits.
As before, without the patch, parallel file creates in a single
directory with 1000 threads topped out at an aggregate of 3 creates/s
over a 200ms link. With the patch it jumps up to 1,200 creates/s.
So a pretty dramatic difference. I can't say if there are any other
side effects or regressions as I only did this simple test.
It's great for our super niche workloads and use case anyway.
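(For anyone wanting to sanity check the on-the-wire rate, sampling the
client's per-op counters with nfsstat during the run gives a rough
number; something like this, assuming an NFSv3 mount - for NFSv4 the
equivalent op is OPEN:)
nfsstat -c -3 | grep -w -A1 create
sleep 10
nfsstat -c -3 | grep -w -A1 create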
Daire
> Neato.
>
> Removing the need to hold an exclusive lock on the directory across
> server round trips seems compelling to me....
>
> I also wonder: why couldn't you fire off the RPC without any locks, then
> wait till you get a reply to take locks and update your local cache?
>
> OK, for one thing, calls and replies and server processing could all get
> reordered. We'd need to know what order the server processed operations
> in, so we could process replies in the same order.
>
> --b.
I had a quick attempt at updating Neil's patch for mainline but I
quickly got stuck and confused. It looks like fs/namei.c in particular
underwent major changes and refactoring from v5.7+.
If there is ever any interest in updating this and getting it into
mainline, I'm more than willing to test it with production loads. I
just lack the skills to update it myself.
It definitely solves a big problem for us, but I also suspect we may
be the only ones with this problem.
Cheers,
Daire
On Tue, 8 Feb 2022 at 18:48, Daire Byrne <[email protected]> wrote:
>
> On Wed, 26 Jan 2022 at 02:57, J. Bruce Fields <[email protected]> wrote:
> >
> > On Wed, Jan 26, 2022 at 11:02:16AM +1100, NeilBrown wrote:
> > > On Wed, 26 Jan 2022, J. Bruce Fields wrote:
> > > > On Tue, Jan 25, 2022 at 03:15:42PM -0600, Patrick Goetz wrote:
> > > > > So the directory is locked while the inode is created, or something
> > > > > like this, which makes sense.
> > > >
> > > > It accomplishes a number of things, details in
> > > > https://www.kernel.org/doc/html/latest/filesystems/directory-locking.html
> > >
> > > Just in case anyone is interested, I wrote this a while back:
> > >
> > > http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-November/008177.html
> > >
> > > it includes a patch to allow parallel creates/deletes over NFS (and any
> > > other filesystem which adds support).
> > > I doubt it still applies, but it wouldn't be hard to make it work if
> > > anyone was willing to make a strong case that we would benefit from
> > > this.
>
> Well, I couldn't resist quickly testing Neil's patch. I found it
> applied okay to v5.6.19 with minimal edits.
>
> As before, without the patch, parallel file creates in a single
> directory with 1000 threads topped out at an aggregate of 3 creates/s
> over a 200ms link. With the patch it jumps up to 1,200 creates/s.
>
> So a pretty dramatic difference. I can't say if there are any other
> side effects or regressions as I only did this simple test.
>
> It's great for our super niche workloads and use case anyway.
>
> Daire
>
>
> > Neato.
> >
> > Removing the need to hold an exclusive lock on the directory across
> > server round trips seems compelling to me....
> >
> > I also wonder: why couldn't you fire off the RPC without any locks, then
> > wait till you get a reply to take locks and update your local cache?
> >
> > OK, for one thing, calls and replies and server processing could all get
> > reordered. We'd need to know what order the server processed operations
> > in, so we could process replies in the same order.
> >
> > --b.
On Thu, Feb 10, 2022 at 06:19:09PM +0000, Daire Byrne wrote:
> I had a quick attempt at updating Neil's patch for mainline but I
> quickly got stuck and confused. It looks like fs/namei.c in particular
> underwent major changes and refactoring from v5.7+.
>
> If there is ever any interest in updating this and getting it into
> mainline, I'm more than willing to test it with production loads. I
> just lack the skills to update it myself.
>
> It definitely solves a big problem for us, but I also suspect we may
> be the only ones with this problem.
It benefits anyone trying to do a lot of creates on an NFS
filesystem where the network round trip time is significant. That
doesn't seem so weird. And even if the case is a little weird, just
having a case and clear numbers to show the improvement is a big help.
I haven't had the chance to read Neil's patch or work out what the issue
is with the namei changes.
Al Viro is the expert on VFS locking. I was sure I'd seen him speculate
about what would be needed to make parallel directory modifications
possible, but I spent some time mining old mail and didn't find that.
I think the path forward would be to update Neil's patch, add your
performance data, send it to Al and linux-fsdevel, and see if we can get
some idea what remains to be done to get this right.
--b.
>
> Cheers,
>
> Daire
>
>
> On Tue, 8 Feb 2022 at 18:48, Daire Byrne <[email protected]> wrote:
> >
> > On Wed, 26 Jan 2022 at 02:57, J. Bruce Fields <[email protected]> wrote:
> > >
> > > On Wed, Jan 26, 2022 at 11:02:16AM +1100, NeilBrown wrote:
> > > > On Wed, 26 Jan 2022, J. Bruce Fields wrote:
> > > > > On Tue, Jan 25, 2022 at 03:15:42PM -0600, Patrick Goetz wrote:
> > > > > > So the directory is locked while the inode is created, or something
> > > > > > like this, which makes sense.
> > > > >
> > > > > It accomplishes a number of things, details in
> > > > > https://www.kernel.org/doc/html/latest/filesystems/directory-locking.html
> > > >
> > > > Just in case anyone is interested, I wrote this a while back:
> > > >
> > > > http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-November/008177.html
> > > >
> > > > it includes a patch to allow parallel creates/deletes over NFS (and any
> > > > other filesystem which adds support).
> > > > I doubt it still applies, but it wouldn't be hard to make it work if
> > > > anyone was willing to make a strong case that we would benefit from
> > > > this.
> >
> > Well, I couldn't resist quickly testing Neil's patch. I found it
> > applied okay to v5.6.19 with minimal edits.
> >
> > As before, without the patch, parallel file creates in a single
> > directory with 1000 threads topped out at an aggregate of 3 creates/s
> > over a 200ms link. With the patch it jumps up to 1,200 creates/s.
> >
> > So a pretty dramatic difference. I can't say if there are any other
> > side effects or regressions as I only did this simple test.
> >
> > It's great for our super niche workloads and use case anyway.
> >
> > Daire
> >
> >
> > > Neato.
> > >
> > > Removing the need to hold an exclusive lock on the directory across
> > > server round trips seems compelling to me....
> > >
> > > I also wonder: why couldn't you fire off the RPC without any locks, then
> > > wait till you get a reply to take locks and update your local cache?
> > >
> > > OK, for one thing, calls and replies and server processing could all get
> > > reordered. We'd need to know what order the server processed operations
> > > in, so we could process replies in the same order.
> > >
> > > --b.
On Fri, 11 Feb 2022 at 15:59, J. Bruce Fields <[email protected]> wrote:
>
> On Thu, Feb 10, 2022 at 06:19:09PM +0000, Daire Byrne wrote:
> > I had a quick attempt at updating Neil's patch for mainline but I
> > quickly got stuck and confused. It looks like fs/namei.c in particular
> > underwent major changes and refactoring from v5.7+.
> >
> > If there is ever any interest in updating this and getting it into
> > mainline, I'm more than willing to test it with production loads. I
> > just lack the skills to update it myself.
> >
> > It definitely solves a big problem for us, but I also suspect we may
> > be the only ones with this problem.
>
> It benefits anyone trying to do a lot of creates on an NFS
> filesystem where the network round trip time is significant. That
> doesn't seem so weird. And even if the case is a little weird, just
> having a case and clear numbers to show the improvement is a big help.
>
> I haven't had the chance to read Neil's patch or work out what the issue
> is with the namei changes.
As far as I can see, there was a flurry of changes in v5.7 from Al
Viro that changed lots of stuff in namei.c:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/fs/namei.c
https://lkml.org/lkml/2020/3/13/1057
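(From a kernel git tree the churn can be listed with something like:)
git log --oneline v5.6..v5.7 -- fs/namei.c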
> Al Viro is the expert on VFS locking. I was sure I'd seen him speculate
> about what would be needed to make parallel directory modifications
> possible, but I spent some time mining old mail and didn't find that.
I do recall reading about the work he did for parallel lookups some years back:
https://lwn.net/Articles/685108
> I think the path forward would be to update Neil's patch, add your
> performance data, send it to Al and linux-fsdevel, and see if we can get
> some idea what remains to be done to get this right.
If Neil or anyone else is able to do that work, I'm happy to test and
provide the numbers.
If I could update the patch myself, I would have happily contributed
but I lack the experience or knowledge. I'm great at identifying
problems, but not so hot at solving them :)
Daire
> --b.
>
> >
> > Cheers,
> >
> > Daire
> >
> >
> > On Tue, 8 Feb 2022 at 18:48, Daire Byrne <[email protected]> wrote:
> > >
> > > On Wed, 26 Jan 2022 at 02:57, J. Bruce Fields <[email protected]> wrote:
> > > >
> > > > On Wed, Jan 26, 2022 at 11:02:16AM +1100, NeilBrown wrote:
> > > > > On Wed, 26 Jan 2022, J. Bruce Fields wrote:
> > > > > > On Tue, Jan 25, 2022 at 03:15:42PM -0600, Patrick Goetz wrote:
> > > > > > > So the directory is locked while the inode is created, or something
> > > > > > > like this, which makes sense.
> > > > > >
> > > > > > It accomplishes a number of things, details in
> > > > > > https://www.kernel.org/doc/html/latest/filesystems/directory-locking.html
> > > > >
> > > > > Just in case anyone is interested, I wrote this a while back:
> > > > >
> > > > > http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-November/008177.html
> > > > >
> > > > > it includes a patch to allow parallel creates/deletes over NFS (and any
> > > > > other filesystem which adds support).
> > > > > I doubt it still applies, but it wouldn't be hard to make it work if
> > > > > anyone was willing to make a strong case that we would benefit from
> > > > > this.
> > >
> > > Well, I couldn't resist quickly testing Neil's patch. I found it
> > > applied okay to v5.6.19 with minimal edits.
> > >
> > > As before, without the patch, parallel file creates in a single
> > > directory with 1000 threads topped out at an aggregate of 3 creates/s
> > > over a 200ms link. With the patch it jumps up to 1,200 creates/s.
> > >
> > > So a pretty dramatic difference. I can't say if there are any other
> > > side effects or regressions as I only did this simple test.
> > >
> > > It's great for our super niche workloads and use case anyway.
> > >
> > > Daire
> > >
> > >
> > > > Neato.
> > > >
> > > > Removing the need to hold an exclusive lock on the directory across
> > > > server round trips seems compelling to me....
> > > >
> > > > I also wonder: why couldn't you fire off the RPC without any locks, then
> > > > wait till you get a reply to take locks and update your local cache?
> > > >
> > > > OK, for one thing, calls and replies and server processing could all get
> > > > reordered. We'd need to know what order the server processed operations
> > > > in, so we could process replies in the same order.
> > > >
> > > > --b.
On Fri, 18 Feb 2022, Daire Byrne wrote:
> On Fri, 11 Feb 2022 at 15:59, J. Bruce Fields <[email protected]> wrote:
>
> > I think the path forward would be to update Neil's patch, add your
> > performance data, send it to Al and linux-fsdevel, and see if we can get
> > some idea what remains to be done to get this right.
>
> If Neil or anyone else is able to do that work, I'm happy to test and
> provide the numbers.
>
> If I could update the patch myself, I would have happily contributed
> but I lack the experience or knowledge. I'm great at identifying
> problems, but not so hot at solving them :)
>
I've ported it to mainline without much trouble. I started some simple
testing (parallel create/delete of the same file) and hit a bug quite
easily. I fixed that (eventually) and then tried with more than 1 CPU,
and hit another bug. But then it was quitting time. If I can get rid
of all the easy to find bugs, I'll post it with a CC to you, and you can
find some more for me!
Thanks,
NeilBrown
On Fri, 18 Feb 2022 at 07:46, NeilBrown <[email protected]> wrote:
>
> On Fri, 18 Feb 2022, Daire Byrne wrote:
> > On Fri, 11 Feb 2022 at 15:59, J. Bruce Fields <[email protected]> wrote:
> >
> > > I think the path forward would be to update Neil's patch, add your
> > > performance data, send it to Al and linux-fsdevel, and see if we can get
> > > some idea what remains to be done to get this right.
> >
> > If Neil or anyone else is able to do that work, I'm happy to test and
> > provide the numbers.
> >
> > If I could update the patch myself, I would have happily contributed
> > but I lack the experience or knowledge. I'm great at identifying
> > problems, but not so hot at solving them :)
> >
>
> I've ported it to mainline without much trouble. I started some simple
> testing (parallel create/delete of the same file) and hit a bug quite
> easily. I fixed that (eventually) and then tried with more than 1 CPU,
> and hit another bug. But then it was quitting time. If I can get rid
> of all the easy to find bugs, I'll post it with a CC to you, and you can
> find some more for me!
That would be awesome! I have a real world production case for this
and it's a pretty heavy workload. If that doesn't shake out any bugs,
nothing will.
The only caveat being that it will likely be restricted to NFSv3
testing due to the concurrency limitations with NFSv4.1+ (from the
other thread).
Daire
On Mon, 21 Feb 2022 at 13:59, Daire Byrne <[email protected]> wrote:
>
> On Fri, 18 Feb 2022 at 07:46, NeilBrown <[email protected]> wrote:
> > I've ported it to mainline without much trouble. I started some simple
> > testing (parallel create/delete of the same file) and hit a bug quite
> > easily. I fixed that (eventually) and then tried with more than 1 CPU,
> > and hit another bug. But then it was quitting time. If I can get rid
> > of all the easy to find bugs, I'll post it with a CC to you, and you can
> > find some more for me!
>
> That would be awesome! I have a real world production case for this
> and it's a pretty heavy workload. If that doesn't shake out any bugs,
> nothing will.
>
> The only caveat being that it will likely be restricted to NFSv3
> testing due to the concurrency limitations with NFSv4.1+ (from the
> other thread).
>
> Daire
Just to follow up on this again - I have been using Neil's patch for
parallel file creates (thanks!) but I'm a bit confused as to why it
doesn't seem to help in my NFS re-export case.
With the patch, I can achieve much higher parallel (multi-process)
creates directly on my re-export server to a high latency remote
server mount, but when I re-export that to multiple clients, the
aggregate create rate degrades again to what we would expect either
without the patch or with only one process creating the files in
sequence.
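The setup is roughly the following (export options and client
mountpoints here are illustrative rather than my exact config). The
fsid= is needed because an NFS mount has no UUID that nfsd can export
by:
reexport1 # mount -o vers=3,nocto,actimeo=3600 server1:/data /srv/server1
reexport1 # echo '/srv/server1 *(rw,no_subtree_check,fsid=1000)' >> /etc/exports
reexport1 # exportfs -ra
client1 # mount -o vers=3 reexport1:/srv/server1 /mnt/server1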
My assumption was that the nfsd threads of the re-export server would
act as multiple independent processes and its clients would be spread
across them such that they would also benefit from the parallel
creates patch on the re-export server. So I expected many clients
creating files in the same directory would achieve much higher
aggregate performance.
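In case it matters, the nfsd thread count on the re-export server can
be checked and raised like this (128 is just an arbitrary example
value), so a shortage of nfsd threads should be easy to rule out:
reexport1 # cat /proc/fs/nfsd/threads
reexport1 # rpc.nfsd 128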
Am I missing some other interaction here that limits parallel
performance in my unusual re-export case?
Cheers,
Daire
On Mon, Apr 25, 2022 at 02:00:32PM +0100, Daire Byrne wrote:
> On Mon, 21 Feb 2022 at 13:59, Daire Byrne <[email protected]> wrote:
> >
> > On Fri, 18 Feb 2022 at 07:46, NeilBrown <[email protected]> wrote:
> > > I've ported it to mainline without much trouble. I started some simple
> > > testing (parallel create/delete of the same file) and hit a bug quite
> > > easily. I fixed that (eventually) and then tried with more than 1 CPU,
> > > and hit another bug. But then it was quitting time. If I can get rid
> > > of all the easy to find bugs, I'll post it with a CC to you, and you can
> > > find some more for me!
> >
> > That would be awesome! I have a real world production case for this
> > and it's a pretty heavy workload. If that doesn't shake out any bugs,
> > nothing will.
> >
> > The only caveat being that it will likely be restricted to NFSv3
> > testing due to the concurrency limitations with NFSv4.1+ (from the
> > other thread).
> >
> > Daire
>
> Just to follow up on this again - I have been using Neil's patch for
> parallel file creates (thanks!) but I'm a bit confused as to why it
> doesn't seem to help in my NFS re-export case.
>
> With the patch, I can achieve much higher parallel (multi process)
> creates directly on my re-export server to a high latency remote
> server mount, but when I re-export that to multiple clients, the
> aggregate create rate again degrades to that which we might expect
> either without the patch or if there was only one process creating the
> files in sequence.
>
> My assumption was that the nfsd threads of the re-export server would
> act as multiple independent processes and its clients would be spread
> across them such that they would also benefit from the parallel
> creates patch on the re-export server. So I expected many clients
> creating files in the same directory would achieve much higher
> aggregate performance.
That's the idea.
I've lost track, where's the latest version of Neil's patch?
--b.
>
> Am I missing some other interaction here that limits parallel
> performance in my unusual re-export case?
>
> Cheers,
>
> Daire
On Mon, 25 Apr 2022 at 14:22, J. Bruce Fields <[email protected]> wrote:
>
> On Mon, Apr 25, 2022 at 02:00:32PM +0100, Daire Byrne wrote:
> > On Mon, 21 Feb 2022 at 13:59, Daire Byrne <[email protected]> wrote:
> > >
> > > On Fri, 18 Feb 2022 at 07:46, NeilBrown <[email protected]> wrote:
> > > > I've ported it to mainline without much trouble. I started some simple
> > > > testing (parallel create/delete of the same file) and hit a bug quite
> > > > easily. I fixed that (eventually) and then tried with more than 1 CPU,
> > > > and hit another bug. But then it was quitting time. If I can get rid
> > > > of all the easy to find bugs, I'll post it with a CC to you, and you can
> > > > find some more for me!
> > >
> > > That would be awesome! I have a real world production case for this
> > > and it's a pretty heavy workload. If that doesn't shake out any bugs,
> > > nothing will.
> > >
> > > The only caveat being that it will likely be restricted to NFSv3
> > > testing due to the concurrency limitations with NFSv4.1+ (from the
> > > other thread).
> > >
> > > Daire
> >
> > Just to follow up on this again - I have been using Neil's patch for
> > parallel file creates (thanks!) but I'm a bit confused as to why it
> > doesn't seem to help in my NFS re-export case.
> >
> > With the patch, I can achieve much higher parallel (multi process)
> > creates directly on my re-export server to a high latency remote
> > server mount, but when I re-export that to multiple clients, the
> > aggregate create rate again degrades to that which we might expect
> > either without the patch or if there was only one process creating the
> > files in sequence.
> >
> > My assumption was that the nfsd threads of the re-export server would
> > act as multiple independent processes and its clients would be spread
> > across them such that they would also benefit from the parallel
> > creates patch on the re-export server. So I expected many clients
> > creating files in the same directory would achieve much higher
> > aggregate performance.
>
> That's the idea.
>
> I've lost track, where's the latest version of Neil's patch?
>
> --b.
The latest is still the one from this thread (with a minor update to
apply it to v5.18-rc):
https://lore.kernel.org/lkml/[email protected]/T/#m922999bf830cacb745f32cc464caf72d5ffa7c2c
My test is something like this:
reexport1 # for x in {1..5000}; do
echo /srv/server1/touch.$HOSTNAME.$x
done | xargs -n1 -P 200 -iX -t touch X 2>&1 | pv -l -a >|/dev/null
Without the patch this results in 3 creates/s and with the patch it's
~250 creates/s with 200 threads/processes (200ms latency) when run
directly against a remote RHEL8 server (server1).
Then I run something similar to this but simultaneously across 200
clients of the "reexport1" server's re-export of the originating
"server1". I get an aggregate of around 3 creates/s even with the
patch applied to reexport1 (v5.18-rc2), which is suspiciously similar
to the performance without the parallel vfs create patch.
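i.e. each client runs something along these lines against its mount of
the re-export (mountpoint and per-client counts are illustrative):
client1 # for x in {1..25}; do
echo /mnt/server1/touch.$HOSTNAME.$x
done | xargs -n1 -iX -t touch X 2>&1 | pv -l -a >|/dev/null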
The clients don't run any special kernels or configurations. I have
only tested NFSv3 so far.
Daire
On Mon, Apr 25, 2022 at 04:24:50PM +0100, Daire Byrne wrote:
> On Mon, 25 Apr 2022 at 14:22, J. Bruce Fields <[email protected]> wrote:
> >
> > On Mon, Apr 25, 2022 at 02:00:32PM +0100, Daire Byrne wrote:
> > > On Mon, 21 Feb 2022 at 13:59, Daire Byrne <[email protected]> wrote:
> > > >
> > > > On Fri, 18 Feb 2022 at 07:46, NeilBrown <[email protected]> wrote:
> > > > > I've ported it to mainline without much trouble. I started some simple
> > > > > testing (parallel create/delete of the same file) and hit a bug quite
> > > > > easily. I fixed that (eventually) and then tried with more than 1 CPU,
> > > > > and hit another bug. But then it was quitting time. If I can get rid
> > > > > of all the easy to find bugs, I'll post it with a CC to you, and you can
> > > > > find some more for me!
> > > >
> > > > That would be awesome! I have a real world production case for this
> > > > and it's a pretty heavy workload. If that doesn't shake out any bugs,
> > > > nothing will.
> > > >
> > > > The only caveat being that it will likely be restricted to NFSv3
> > > > testing due to the concurrency limitations with NFSv4.1+ (from the
> > > > other thread).
> > > >
> > > > Daire
> > >
> > > Just to follow up on this again - I have been using Neil's patch for
> > > parallel file creates (thanks!) but I'm a bit confused as to why it
> > > doesn't seem to help in my NFS re-export case.
> > >
> > > With the patch, I can achieve much higher parallel (multi process)
> > > creates directly on my re-export server to a high latency remote
> > > server mount, but when I re-export that to multiple clients, the
> > > aggregate create rate again degrades to that which we might expect
> > > either without the patch or if there was only one process creating the
> > > files in sequence.
> > >
> > > My assumption was that the nfsd threads of the re-export server would
> > > act as multiple independent processes and its clients would be spread
> > > across them such that they would also benefit from the parallel
> > > creates patch on the re-export server. So I expected many clients
> > > creating files in the same directory would achieve much higher
> > > aggregate performance.
> >
> > That's the idea.
> >
> > I've lost track, where's the latest version of Neil's patch?
> >
> > --b.
>
> The latest is still the one from this thread (with a minor update to
> apply it to v5.18-rc):
>
> https://lore.kernel.org/lkml/[email protected]/T/#m922999bf830cacb745f32cc464caf72d5ffa7c2c
Thanks!
I haven't really tried to understand that patch--but just looking at the
diffstat, it doesn't touch fs/nfsd/. And nfsd calls into the vfs only
after it locks the parent. So nfsd is probably still using
the old behavior, whereas local callers are using the new (parallel)
behavior.
So I bet what you're seeing is expected, and all that's needed is some
updates to fs/nfsd/vfs.c to reflect whatever Neil did in fs/namei.c.
--b.
> My test is something like this:
>
> reexport1 # for x in {1..5000}; do
> echo /srv/server1/touch.$HOSTNAME.$x
> done | xargs -n1 -P 200 -iX -t touch X 2>&1 | pv -l -a >|/dev/null
>
> Without the patch this results in 3 creates/s and with the patch it's
> ~250 creates/s with 200 threads/processes (200ms latency) when run
> directly against a remote RHEL8 server (server1).
>
> Then I run something similar to this but simultaneously across 200
> clients of the "reexport1" server's re-export of the originating
> "server1". I get an aggregate of around 3 creates/s even with the
> patch applied to reexport1 (v5.18-rc2) which is suspiciously similar
> to the performance without the parallel vfs create patch.
>
> The clients don't run any special kernels or configurations. I have
> only tested NFSv3 so far.
>
> Daire
On Mon, 25 Apr 2022 at 17:02, J. Bruce Fields <[email protected]> wrote:
>
> On Mon, Apr 25, 2022 at 04:24:50PM +0100, Daire Byrne wrote:
> > On Mon, 25 Apr 2022 at 14:22, J. Bruce Fields <[email protected]> wrote:
> > >
> > > On Mon, Apr 25, 2022 at 02:00:32PM +0100, Daire Byrne wrote:
> > > > On Mon, 21 Feb 2022 at 13:59, Daire Byrne <[email protected]> wrote:
> > > > >
> > > > > On Fri, 18 Feb 2022 at 07:46, NeilBrown <[email protected]> wrote:
> > > > > > I've ported it to mainline without much trouble. I started some simple
> > > > > > testing (parallel create/delete of the same file) and hit a bug quite
> > > > > > easily. I fixed that (eventually) and then tried with more than 1 CPU,
> > > > > > and hit another bug. But then it was quitting time. If I can get rid
> > > > > > of all the easy to find bugs, I'll post it with a CC to you, and you can
> > > > > > find some more for me!
> > > > >
> > > > > That would be awesome! I have a real world production case for this
> > > > > and it's a pretty heavy workload. If that doesn't shake out any bugs,
> > > > > nothing will.
> > > > >
> > > > > The only caveat being that it will likely be restricted to NFSv3
> > > > > testing due to the concurrency limitations with NFSv4.1+ (from the
> > > > > other thread).
> > > > >
> > > > > Daire
> > > >
> > > > Just to follow up on this again - I have been using Neil's patch for
> > > > parallel file creates (thanks!) but I'm a bit confused as to why it
> > > > doesn't seem to help in my NFS re-export case.
> > > >
> > > > With the patch, I can achieve much higher parallel (multi process)
> > > > creates directly on my re-export server to a high latency remote
> > > > server mount, but when I re-export that to multiple clients, the
> > > > aggregate create rate again degrades to that which we might expect
> > > > either without the patch or if there was only one process creating the
> > > > files in sequence.
> > > >
> > > > My assumption was that the nfsd threads of the re-export server would
> > > > act as multiple independent processes and its clients would be spread
> > > > across them such that they would also benefit from the parallel
> > > > creates patch on the re-export server. So I expected many clients
> > > > creating files in the same directory would achieve much higher
> > > > aggregate performance.
> > >
> > > That's the idea.
> > >
> > > I've lost track, where's the latest version of Neil's patch?
> > >
> > > --b.
> >
> > The latest is still the one from this thread (with a minor update to
> > apply it to v5.18-rc):
> >
> > https://lore.kernel.org/lkml/[email protected]/T/#m922999bf830cacb745f32cc464caf72d5ffa7c2c
>
> Thanks!
>
> I haven't really tried to understand that patch--but just looking at the
> diffstat, it doesn't touch fs/nfsd/. And nfsd calls into the vfs only
> after it locks the parent. So nfsd is probably still using
> the old behavior, where local callers are using the new (parallel)
> behavior.
>
> So I bet what you're seeing is expected, and all that's needed is some
> updates to fs/nfsd/vfs.c to reflect whatever Neil did in fs/namei.c.
>
> --b.
Ah right, that would explain it then - thanks. I just naively assumed
that nfsd would pass straight into the VFS and rely on those locks.
I'll stare at fs/nfsd/vfs.c for a bit but I probably lack the
expertise to make it work.
It's also not entirely clear that this parallel creates RFC patch will
ever make it into mainline?
Daire
> > My test is something like this:
> >
> > reexport1 # for x in {1..5000}; do
> > echo /srv/server1/touch.$HOSTNAME.$x
> > done | xargs -n1 -P 200 -iX -t touch X 2>&1 | pv -l -a >|/dev/null
> >
> > Without the patch this results in 3 creates/s and with the patch it's
> > ~250 creates/s with 200 threads/processes (200ms latency) when run
> > directly against a remote RHEL8 server (server1).
> >
> > Then I run something similar to this but simultaneously across 200
> > clients of the "reexport1" server's re-export of the originating
> > "server1". I get an aggregate of around 3 creates/s even with the
> > patch applied to reexport1 (v5.18-rc2) which is suspiciously similar
> > to the performance without the parallel vfs create patch.
> >
> > The clients don't run any special kernels or configurations. I have
> > only tested NFSv3 so far.
On Tue, 26 Apr 2022, Daire Byrne wrote:
>
> I'll stare at fs/nfsd/vfs.c for a bit but I probably lack the
> expertise to make it work.
Staring at code is good for the soul .... but I'll try to schedule time
to work on this patch again - make it work from nfsd and also make it
handle rename.
>
> It's also not entirely clear that this parallel creates RFC patch will
> ever make it into mainline?
I hope it will eventually, but I have no idea when that might be.
Thanks for your continued interest,
NeilBrown
On Tue, 26 Apr 2022 at 02:36, NeilBrown <[email protected]> wrote:
>
> On Tue, 26 Apr 2022, Daire Byrne wrote:
> >
> > I'll stare at fs/nfsd/vfs.c for a bit but I probably lack the
> > expertise to make it work.
>
> Staring at code is good for the soul .... but I'll try to schedule time
> to work on this patch again - make it work from nfsd and also make it
> handle rename.
Yea, I stared at it for quite a while this morning and no amount of
coffee was going to help me figure out how best to proceed.
If you are able to update this for nfsd then I'll be eternally
grateful and do my best to test it under load in an effort to get it
all merged.
The community here has been so good to us over the last couple of
years and it is very much appreciated. It has helped us deliver
(Oscar-winning) movies!
To give some brief context as to why this is useful to us (for anyone
interested), we utilise NFS re-export extensively to run our batch
jobs (movie frame renders) in various remote DCs. In combination with
fscache and long term attribute caching, this works great for exposing
our (read often) onprem storage to the remote DCs (with varying
latencies).
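A typical fscache-backed mount on one of these re-export servers looks
something like this (hostname and options are illustrative; -o fsc
needs cachefilesd running):
sudo systemctl start cachefilesd
sudo mount -o nocto,actimeo=3600,fsc onprem:/vol /srv/vol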
But batch jobs have a tendency to start related tasks on many clients
at the same time with their results or logs being written to big
common directories. And by writing through the re-export server, we
often hit this limitation with parallel file creates, which slows
everything down. We have tried to avoid large directories where
possible, but it's hard to catch and fix all the cases.
Using an NFS re-export server works 95% of the time for our workloads
(after much help from this community), so we are just trying to pick
away at the last 5% of edge cases. One of the disadvantages of the
re-export server in the middle is that we lose some of the natural
parallelism that directly connected clients would otherwise have. And
this becomes very noticeable once the latency goes above 20ms.
Cheers,
Daire
> > It's also not entirely clear that this parallel creates RFC patch will
> > ever make it into mainline?
>
> I hope it will eventually, but I have no idea when that might be.
>
> Thanks for your continued interest,
> NeilBrown
On Tue, 26 Apr 2022, Daire Byrne wrote:
> On Tue, 26 Apr 2022 at 02:36, NeilBrown <[email protected]> wrote:
> >
> > On Tue, 26 Apr 2022, Daire Byrne wrote:
> > >
> > > I'll stare at fs/nfsd/vfs.c for a bit but I probably lack the
> > > expertise to make it work.
> >
> > Staring at code is good for the soul .... but I'll try to schedule time
> > to work on this patch again - make it work from nfsd and also make it
> > handle rename.
>
> Yea, I stared at it for quite a while this morning and no amount of
> coffee was going to help me figure out how best to proceed.
Yes, it isn't at all straightforward, is it?
We probably need quite a bit of surgery in nfsd/vfs.c to make it more
similar to fs/namei.c. In particular, we will need to use
filename_create() instead of lookup_one_len().
There is a potential cost to doing this though. The NFS protocol allows
the server to report the change-id of the directory before and after a
create/unlink operation so that the client can determine if it is the
only one making changes to the directory, and so can keep its cache.
This requires the pre/post to be atomic - which requires an exclusive
lock.
If we change nfsd to use a shared lock on the directory, then it cannot
report atomic pre/post attributes, so the client will have to flush its
cache more often.
To support parallel creates and atomic attributes, we would need to enhance
the filesystem interface so the fs can report the attributes for each
create. Could get messy.
This doesn't actually matter for NFS re-export because it doesn't
support atomic attributes anyway. It also doesn't matter if multiple
clients are changing the one directory. But I think we do want to keep atomic
attributes for exporting other filesystems in other use-cases.
It's starting to get messy. Not impossible, just messy. Messy takes
longer :-)
NeilBrown
On Thu, 28 Apr 2022 at 06:46, NeilBrown <[email protected]> wrote:
>
> On Tue, 26 Apr 2022, Daire Byrne wrote:
> > On Tue, 26 Apr 2022 at 02:36, NeilBrown <[email protected]> wrote:
> > >
> > > On Tue, 26 Apr 2022, Daire Byrne wrote:
> > > >
> > > > I'll stare at fs/nfsd/vfs.c for a bit but I probably lack the
> > > > expertise to make it work.
> > >
> > > Staring at code is good for the soul .... but I'll try to schedule time
> > > to work on this patch again - make it work from nfsd and also make it
> > > handle rename.
> >
> > Yea, I stared at it for quite a while this morning and no amount of
> > coffee was going to help me figure out how best to proceed.
>
> Yes, it isn't at all straightforward, is it?
> We probably need quite a bit of surgery in nfsd/vfs.c to make it more
> similar to fs/namei.c. In particular, we will need to use
> filename_create() instead of lookup_one_len().
>
> There is a potential cost to doing this though. The NFS protocol allows
> the server to report the change-id of the directory before and after a
> create/unlink operation so that the client can determine if it is the
> only one making changes to the directory, and so can keep its cache.
> This requires the pre/post to be atomic - which requires an exclusive
> lock.
> If we change nfsd to use a shared lock on the directory, then it cannot
> report atomic pre/post attributes, so the client will have to flush its
> cache more often.
>
> To support parallel creates and atomic attributes, we would need to enhance
> the filesystem interface so the fs can report the attributes for each
> create. Could get messy.
>
> This doesn't actually matter for NFS re-export because it doesn't
> support atomic attributes anyway. It also doesn't matter if multiple
> clients are changing the one directory. But I think we do want to keep atomic
> attributes for exporting other filesystems in other use-cases.
>
> It's starting to get messy. Not impossible, just messy. Messy takes
> longer :-)
Thanks for looking, Neil. I don't feel quite so dumb admitting that it
was way above my level! :)
Like I said, this isn't a blocker for us but definitely a nice-to-have
feature. Slow progress is still good progress.
Cheers,
Daire