In response to this thread that died out a while back: http://marc.info/?t=142727765600001&r=1&w=2
I have the same issue and am looking for a solution.
My current setup involves nested ZFS datasets. It looks like this:
tank/homes/user{1..6000}
tank/stage/somedirname{1..100}
tank/homes and tank/stage are datasets and the immediate children are also datasets.
There are huge benefits to using datasets that I don't want to get into here, so let's leave it at this: I would prefer to keep this setup.
Now to the issue. Because of the nested mounts I need to use the 'crossmnt' export option, which seems to invoke a loop over '/etc/mtab'. During various straces I have seen /etc/mtab get read over 900 times (on ssh login) while a user's home directory is being mounted.
Since /etc/mtab has around 6000 lines (and growing) it takes a long time to complete. Around 8 seconds, in fact, from the time you start an ssh connection to when the password prompt appears (tested with 'time ssh -o BatchMode=yes root@client1'). It seems that most of this time is spent by rpc.mountd reading /etc/mtab about 900 times while the client waits for the server to finish.
# ls -l /etc/mtab
lrwxrwxrwx 1 root root 19 Jun 22 14:31 /etc/mtab -> ../proc/self/mounts
# cat /proc/self/mounts > mtab
# ls -lh mtab
-rw-rw---- 1 root root 442K Jul 6 16:51 mtab
# wc -l mtab
6051 mtab
Right now I am dealing with 'nfs-kernel-server' version '1:1.2.8-9ubuntu12.1' (on Xenial), which I already had to patch with http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=commitdiff;h=35640883cf34a32f893e9fecefbb193782e9bc75 suggested here http://marc.info/?t=138684383100001&r=1&w=2
(This works brilliantly BTW, it shaved a good 20 seconds off the mount time because all those devices were being skipped.)
I was hoping you folks might take a look, and maybe provide a patch. /crosses fingers/
steved suggested that this took place around: utils/mountd/cache.c:nfsd_fh():path = next_mnt(&mnt, exp->m_export.e_path);
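For anyone following along, here is a rough paraphrase of that next_mnt() loop as I read it (illustrative only, not the verbatim nfs-utils source; details differ between versions). The point is that each fresh iteration re-opens the mount table and walks every entry looking for submounts of the export, so with roughly 6000 mounts each pass is expensive:
/* Rough paraphrase of utils/mountd/cache.c:next_mnt(), illustrative only.
 * '*v' carries the open mount-table stream between calls; 'p' is the export
 * path whose submounts we want.  Every fresh iteration starts with
 * setmntent("/etc/mtab"), i.e. a full re-read of the mount table.
 */
#include <mntent.h>
#include <stdio.h>
#include <string.h>

static char *next_mnt(void **v, char *p)
{
        FILE *f;
        struct mntent *me;
        size_t l = strlen(p);

        if (*v == NULL) {
                f = setmntent("/etc/mtab", "r");
                *v = f;
        } else
                f = *v;

        while ((me = getmntent(f)) != NULL)
                /* return the next mount point strictly below the export path */
                if (strncmp(me->mnt_dir, p, l) == 0 && me->mnt_dir[l] == '/')
                        return me->mnt_dir;

        endmntent(f);
        *v = NULL;
        return NULL;
}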
Cheers,
Phil
P.S.: I can provide straces on demand if required, I didn't want to clutter up the message.
--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
[email protected]
On Thu, Jul 06 2017, Phil Kauffman wrote:
> In response to this thread that died out a while back: http://marc.info/?t=142727765600001&r=1&w=2
>
> I have the same issue and am looking for a solution.
>
> My current setup involves nested ZFS datasets. It looks like this:
> tank/homes/user{1..6000}
> tank/stage/somedirname{1..100}
>
> tank/homes and tank/stage are datasets and the immediate children are also datasets.
>
> There are huge benefits to using datasets that I don't want to get into here, so let's leave it at this: I would prefer to keep this setup.
>
>
> Now to the issue. Because of the nested mounts I need to use the 'crossmnt' export option, which seems to invoke a loop over '/etc/mtab'. During various straces I have seen /etc/mtab get read over 900 times (on ssh login) while a user's home directory is being mounted.
>
> Since /etc/mtab has around 6000 lines (and growing) it takes a long time to complete. Around 8 seconds, in fact, from the time you start an ssh connection to when the password prompt appears (tested with 'time ssh -o BatchMode=yes root@client1'). It seems that most of this time is spent by rpc.mountd reading /etc/mtab about 900 times while the client waits for the server to finish.
>
> # ls -l /etc/mtab
> lrwxrwxrwx 1 root root 19 Jun 22 14:31 /etc/mtab -> ../proc/self/mounts
>
> # cat /proc/self/mounts > mtab
>
> # ls -lh mtab
> -rw-rw---- 1 root root 442K Jul 6 16:51 mtab
>
> # wc -l mtab
> 6051 mtab
>
>
> Right now I am dealing with 'nfs-kernel-server' version '1:1.2.8-9ubuntu12.1' (on Xenial), which I already had to patch with http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=commitdiff;h=35640883cf34a32f893e9fecefbb193782e9bc75 suggested here http://marc.info/?t=138684383100001&r=1&w=2
> (This works brilliantly BTW, it shaved a good 20 seconds off the mount time because all those devices were being skipped.)
>
>
> I was hoping you folks might take a look, and maybe provide a patch. /crosses fingers/
>
> steved suggested that this took place around: utils/mountd/cache.c:nfsd_fh():path = next_mnt(&mnt, exp->m_export.e_path);
>
>
> Cheers,
>
> Phil
>
I can imagine /etc/mtab being read once for every line in /etc/exports,
but unless your /etc/exports is very big, I can't easily see why it
would be read 900 times.
Maybe lots of different mount points are being accessed by something and
each one triggers a few reads...
Can you show your /etc/exports file?
>
> P.S.: I can provide straces on demand if required, I didn't want to clutter up the message.
I don't recommend this approach. If you have data, just provide it. If
it is really really big, provide a link.
You have my attention now. You might not get it again when you send the
strace. You might have wasted your opportunity.
Yes - please provide the strace showing lots of reads through
/etc/mtab. Don't provide it privately to me. Either post it to the
list, or post a link to it.
NeilBrown
>
> --
> Phil Kauffman
> Systems Admin
> Dept. of Computer Science
> University of Chicago
> [email protected]
>
> I can imagine /etc/mtab being read once for every line in /etc/exports,
> but unless your /etc/exports is very big, I can't easily see why it
> would be read 900 times.
> Maybe lots of different mount points are being accessed by something and
> each one triggers a few reads...
>
> Can you show your /etc/exports file?
These get generated via puppet and dropped into /etc/exports.d/
Any given host will have 3 files associated with it:
root@charlie-storage1:~# cat /etc/exports.d/client1.cs.uchicago.edu.charliestorage1_fsidroot.charliestorage1_fsidroot.exports
/tank client1.cs.uchicago.edu(fsid=0,rw,async,no_subtree_check,insecure,crossmnt)
root@charlie-storage1:~# cat /etc/exports.d/client1.cs.uchicago.edu.charliestorage1_homes.charliestorage1_homes.exports
/tank/homes client1.cs.uchicago.edu(rw,async,no_subtree_check,insecure,crossmnt)
root@charlie-storage1:~# cat /etc/exports.d/client1.cs.uchicago.edu.charliestorage1_stage.charliestorage1_stage.exports
/tank/stage client1.cs.uchicago.edu(rw,async,no_subtree_check,insecure,crossmnt)
Here we can see that there are 960 files in /etc/exports.d
root@storage1:~# find /etc/exports.d/ -type f | wc
960 960 88701
You probably don't want or need this, but here is a link to all of my
/etc/exports.d/ files combined into one file, with the hostnames redacted:
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/all_etc_exports.d_in_one_file.txt
> I don't recommend this approach. If you have data, just provide it. If
> it is really really big, provide a link.
> You have my attention now. You might not get it again when you send the
> strace. You might have wasted your opportunity.
> Yes - please provide the strace showing lots of reads through
> /etc/mtab. Don't provide it privately to me. Either post it to the
> list, or post a link to it.
Thanks for the advice.
Here is a link to a best effort redacted strace:
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/redacted_strace.txt
I would prefer to keep user name leakage to a minimum.
Cheers,
Phil
--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
[email protected]
773-702-3913
On Fri, Jul 07 2017, Phil Kauffman wrote:
>> I can imagine /etc/mtab being read once for every line in /etc/exports,
>> but unless your /etc/exports is very big, I can't easily see why it
>> would be read 900 times.
>> Maybe lots of different mount points are being accessed by something and
>> each one triggers a few reads...
>>
>> Can you show your /etc/exports file?
>
> These get generated via puppet and dropped into /etc/exports.d/
...
>
> Here we can see that there are 960 files in /etc/exports.d
> root@storage1:~# find /etc/exports.d/ -type f | wc
> 960 960 88701
Ahhh.. 960 exports. That could do it.
$ grep -c crossmnt all_etc_exports.d_in_one_file.txt
957
I get 957 - much the same number.
$ grep open redacted_strace.txt | uniq -c
2 open("/var/lib/nfs/etab", O_RDONLY) = 6
942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6
1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
/etc/mtab was read 942 times for a single access. I would have expected 957,
but maybe the system is dynamic and something changed between the two
samples.
This makes it fairly clear what is happening. Now we just need to fix
it.
One option would be to cache some of the details extracted from
/etc/mtab, but that could get messy.
Another is to move the code around. In your case there are really just
3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
the combined exports file is really different in different files).
So any one client only needs to consider 3 exports, not 300.
There is room to optimize this code further than the below, but let's
start simple. Could you test with this patch applied and see what
difference it makes?
Thanks,
NeilBrown
diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index ca6c84f4d93d..e712cc166157 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -727,6 +727,17 @@ static void nfsd_fh(int f)
for (exp = exportlist[i].p_head; exp; exp = next_exp) {
char *path;
+ if (!is_ipaddr_client(dom)
+ && !namelist_client_matches(exp, dom)) {
+ next_exp = exp->m_next;
+ continue;
+ }
+ if (is_ipaddr_client(dom)
+ && !ipaddr_client_matches(exp, ai)) {
+ next_exp = exp->m_next;
+ continue;
+ }
+
if (exp->m_export.e_flags & NFSEXP_CROSSMOUNT) {
static nfs_export *prev = NULL;
static void *mnt = NULL;
@@ -751,9 +762,6 @@ static void nfsd_fh(int f)
next_exp = exp->m_next;
}
- if (!is_ipaddr_client(dom)
- && !namelist_client_matches(exp, dom))
- continue;
if (exp->m_export.e_mountpoint &&
!is_mountpoint(exp->m_export.e_mountpoint[0]?
exp->m_export.e_mountpoint:
@@ -762,9 +770,6 @@ static void nfsd_fh(int f)
if (!match_fsid(&parsed, exp, path))
continue;
- if (is_ipaddr_client(dom)
- && !ipaddr_client_matches(exp, ai))
- continue;
if (!found || subexport(&exp->m_export, found)) {
found = &exp->m_export;
free(found_path);
On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote:
> On Fri, Jul 07 2017, Phil Kauffman wrote:
>
> >> I can imagine /etc/mtab being read once for every line in /etc/exports,
> >> but unless your /etc/exports is very big, I can't easily see why it
> >> would be read 900 times.
> >> Maybe lots of different mount points are being accessed by something and
> >> each one triggers a few reads...
> >>
> >> Can you show your /etc/exports file?
> >
> > These get generated via puppet and dropped into /etc/exports.d/
> ...
> >
> > Here we can see that there are 960 files in /etc/exports.d
> > root@storage1:~# find /etc/exports.d/ -type f | wc
> > 960 960 88701
>
> Ahhh.. 960 exports. That could do it.
>
> $ grep -c crossmnt all_etc_exports.d_in_one_file.txt
> 957
>
> I get 957 - much the same number.
>
> $ grep open redacted_strace.txt | uniq -c
> 2 open("/var/lib/nfs/etab", O_RDONLY) = 6
> 942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6
> 1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
>
> /etc/mtab was read 942 times for a single access. I would have expected 957,
> but maybe the system is dynamic and something changed between the two
> samples.
>
> This makes it fairly clear what is happening. Now we just need to fix
> it.
> One option would be to cache some of the details extracted from
> /etc/mtab, but that could get messy.
Would it be possible to read /proc/self/mounts only once and use poll()
to detect changes?
The issue is not only the open()+read() in the process, but also that the
kernel has to compose the file from the mount tree. It's expensive.
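A minimal sketch of what I mean (my illustration only, untested, and not existing mountd code). Since Linux 2.6.15 the kernel marks an open /proc/self/mounts descriptor with a priority event (POLLPRI, and in practice POLLERR as well) whenever the mount table changes, so a cached copy only needs to be re-parsed when poll() reports that:
/* Illustrative sketch: keep a cached copy of the mount table and re-parse
 * it only when the kernel signals that it has changed.
 */
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/proc/self/mounts", O_RDONLY);
        struct pollfd pfd = { .fd = fd, .events = POLLPRI };

        if (fd < 0)
                return 1;

        /* ... parse /proc/self/mounts once here and cache the entries ... */

        for (;;) {
                /* block until a mount or umount happens */
                if (poll(&pfd, 1, -1) > 0 && (pfd.revents & (POLLERR | POLLPRI))) {
                        /* rewind and re-read the file to rearm the notification,
                         * then rebuild the cached mount list */
                        lseek(fd, 0, SEEK_SET);
                        /* ... re-parse and update the cache ... */
                }
        }
}
In mountd the descriptor could simply be added to the existing select() loop rather than a blocking poll() like this.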
Karel
--
Karel Zak <[email protected]>
http://karelzak.blogspot.com
On 07/10/2017 02:17 AM, NeilBrown wrote:
> Another is to move the code around. In your case there are really just
> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
> the combined exports file is really different in different files).
> So any one client only needs to consider 3 exports, not 300.
This assumption is correct. We have ~300 clients and growing (slowly).
> Could you test with this patch applied and see what
> difference it makes?
After being confused by conflicting test results and determining that the nfs-server service takes a bit longer to start than 'systemctl' would have you believe, I believe Neil's patch works.
The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2
Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt
Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.
I reboot everything: client and server
Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success).
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_result_no-strace.txt
I then used a larger sample size of users (~/f) and got a similar result.
Now let's strace rpc.mountd while sshing as one user:
$ time ssh -o BatchMode=yes $CLIENTHOSTNAME -l user_not_in_list_f
Permission denied (publickey,password).
real 0m0.352s
user 0m0.020s
sys 0m0.012s
The resulting strace seems entirely reasonable: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_one_user_strace.txt
$ for user in $(cat ~/users | grep -E "^g*"); do time ssh -o BatchMode=yes $CLIENTHOSTNAME -l $user; done
The first few dozen users whose usernames happen to start with 'g':
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/ssh_multiple_user_strace.txt
--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
[email protected]
On Mon, Jul 10 2017, Karel Zak wrote:
> On Mon, Jul 10, 2017 at 05:17:27PM +1000, NeilBrown wrote:
>> On Fri, Jul 07 2017, Phil Kauffman wrote:
>>
>> >> I can imagine /etc/mtab being read once for every line in /etc/exports,
>> >> but unless your /etc/exports is very big, I can't easily see why it
>> >> would be read 900 times.
>> >> Maybe lots of different mount points are being accessed by something and
>> >> each one triggers a few reads...
>> >>
>> >> Can you show your /etc/exports file?
>> >
>> > These get generated via puppet and dropped into /etc/exports.d/
>> ...
>> >
>> > Here we can see that there are 960 files in /etc/exports.d
>> > root@storage1:~# find /etc/exports.d/ -type f | wc
>> > 960 960 88701
>>
>> Ahhh.. 960 exports. That could do it.
>>
>> $ grep -c crossmnt all_etc_exports.d_in_one_file.txt
>> 957
>>
>> I get 957 - much the same number.
>>
>> $ grep open redacted_strace.txt | uniq -c
>> 2 open("/var/lib/nfs/etab", O_RDONLY) = 6
>> 942 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 6
>> 1 open("/proc/net/rpc/nfsd.export/channel", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
>>
>> /etc/mtab was read 942 times for a single access. I would have expected 957,
>> but maybe the system is dynamic and something changed between the two
>> samples.
>>
>> This makes it fairly clear what is happening. Now we just need to fix
>> it.
>> One option would be to cache some of the details extracted from
>> /etc/mtab, but that could get messy.
>
> Would it be possible to read /proc/self/mounts only once and use poll()
> to detect changes?
Sure it is possible. All we need is code....
>
> The issue is not only the open()+read() in the process, but also that the
> kernel has to compose the file from the mount tree. It's expensive.
That certainly contributes to the problem. If we cache the content, we
can process it fast. If we avoid wanting to look at the content, it
doesn't matter how fast the processing is (because we don't do it).
So yes, that might be part of the solution. But not the whole solution.
Thanks,
NeilBrown
>
> Karel
>
>
> --
> Karel Zak <[email protected]>
> http://karelzak.blogspot.com
On Mon, Jul 10 2017, Phil Kauffman wrote:
> On 07/10/2017 02:17 AM, NeilBrown wrote:
>> Another is to move the code around. In your case there are really just
>> 3 exports to each of 300+ clients (I assume "client.cs.uchicago.edu" in
>> the combined exports file is really different in different files).
>> So any one client only needs to consider 3 exports, not 300.
> This assumption is correct. We have ~300 clients and growing (slowly).
>
>> Could you test with this patch applied and see what
>> difference it makes?
> After being confused by conflicting test results and determining that the nfs-server service takes a bit longer to start than 'systemctl' would have you believe, I believe Neil's patch works.
>
>
> The new cache.c file I am using to test with: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/cache.c
> It also contains the official patch from this thread: http://marc.info/?l=linux-nfs&m=138900709311582&w=2
>
> Building the new deb packages: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch/build_deb.txt
>
> Install nfs-kernel-server and nfs-common debs on server and nfs-common on client.
>
>
> I reboot everything: client and server
>
> Run ssh for loop (no stracing) with the result that every user was able to ssh in under a second (I consider this to be a success).
This does look encouraging ... but I'm not sure we are comparing apples
with apples.
In the original strace we see:
read(4, ....)
check on etab
statfs(some path)
write(4,....
which is exactly like the new strace.
But then the old strace also has
read(5, ...)
check on etab
read /etc/mtab hundreds of times.
write(5, ...)
which is completely missing in the new strace. So the new code, which
doesn't read /etc/mtab as much, isn't being run at all.
file desc 4 is /proc/1926/net/rpc/nfsd.export/channel
file desc 5 is /proc/1926/net/rpc/nfsd.fh/channel
If the nfs server kernel has nothing cached, then when a request arrives
you should first see a transaction on fd 3
(/proc/1926/net/rpc/auth.unix.ip/channel) to map the IP address to a
"domain".
Then a transaction on fd 5 (nfsd.fh) to map the domain + a filehandle
fragment to an export point.
Then a transaction on fd 4 (nfsd.export) to map the domain + export
point to some export options.
If the request is a "LOOKUP", you might see another "fd 5" lookup after
the "fd 4" to find out if a sub-mountpoint is exported. That is what
the old strace showed.
If you don't see all of those, then something is cached.
Could you repeat your experiments after first running "exportfs -f" on
the nfs server? That should cause worst-case load on mountd.
Until we see the "read(5).... write(5)" sequence completing quickly, I
cannot be sure that we have fixed anything. Actually.. for future
traces, please add "-tt" so I can see where the delays are, and how much
they are.
(and thanks for providing all the tracing details - I love getting
unambiguous data!!)
Thanks,
NeilBrown
On 07/10/2017 06:51 PM, NeilBrown wrote:
> This does look encouraging ... but I'm not sure we are comparing apples
> with apples.
I had a feeling this was the case.
> Could you repeat your experiments after first running "exportfs -f" on
> the nfs server? That should cause worst-case load on mountd.
The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt
nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt
# cut -d' ' -f2 2017-07-11_100929_strace.txt| cut -d'(' -f1 | sort | uniq -c | sort -n
1 Process
73 select
143 write
216 close
216 fstat
216 open
48240 read
435599 stat
871162 statfs
871163 lstat
> (and thanks for providing all the tracing details - I love getting
> unambiguous data!!)
No problem, I'm available to run any necessary tests. I also appreciate your help on this.
--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
[email protected]
773-702-3913
On Tue, Jul 11 2017, Phil Kauffman wrote:
> On 07/10/2017 06:51 PM, NeilBrown wrote:
>> This does look encouraging ... but I'm not sure we are comparing apples
>> with apples.
> I had a feeling this was the case.
>
>
>> Could you repeat your experiments after first running "exportfs -f" on
>> the nfs server? That should cause worst-case load on mountd.
> The new strace with '-tt': http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/2017-07-11_100929_strace.txt
> SSH test: http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/test_with_patch_take2/ssh_output.txt
>
> nfsserver# exportfs -f; m=$(pgrep rpc.mountd); strace -tt -p ${m} 2>&1 | tee `date +%F_%T`_strace.txt
>
>
> # cut -d' ' -f2 2017-07-11_100929_strace.txt| cut -d'(' -f1 | sort | uniq -c | sort -n
> 1 Process
> 73 select
> 143 write
> 216 close
> 216 fstat
> 216 open
> 48240 read
> 435599 stat
> 871162 statfs
> 871163 lstat
>
>
>> (and thanks for providing all the tracing details - I love getting
>> unambiguous data!!)
> No problem, I'm available to run any necessary tests. I also appreciate your help on this.
>
So the new data shows about 7 seconds for a login, which is probably
a little longer than you would like, but might be acceptable?
Each login triggers 3 reads of /etc/mtab, and each read (from 'open' to
'close') takes either 3-4 seconds or about 0.2 seconds.
Actually reading /etc/mtab seems to take about 0.1-0.2 seconds at most.
The rest of the time is calling stat, with a pattern like
10:09:39.547280 stat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547318 lstat("/tank/homes/someuser", {st_mode=S_IFDIR|0750, st_size=52, ...}) = 0
10:09:39.547350 lstat("/tank/homes/someuser/..", {st_mode=S_IFDIR|0755, st_size=5967, ...}) = 0
10:09:39.547385 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba
10:09:39.547424 statfs("/tank/homes/someuser", {f_type=0x2fc12fc1, f_bsize=131072, f_blocks=40960, f_bfree=38830, f_ba
Of the three passes through /etc/mtab for each login, one generates
95 of these, one generates 6049, and one generates 5952.
The 'stat' is from match_fsid()
The 2 'lstat' are from is_mountpoint() (called by match_fsid())
The first statfs() is from uuid_by_path()
Maybe the second is somewhere in the blkid code.
Probably the next step is to cache the mapping from
path, fsidtype -> fsid.
That is more than a 2-line patch. I might have a go later this week.
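Roughly the shape I have in mind: an untested sketch with made-up helper names, just to show the idea of memoizing the per-path stat/statfs results. The real thing would need to be wired into match_fsid() and uuid_by_path():
/* Untested sketch, hypothetical names: cache the expensive per-path results
 * keyed by path, with a short expiry, so repeated filehandle lookups don't
 * redo stat()/statfs64() for every mount-table entry.
 */
#define _GNU_SOURCE
#include <search.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/stat.h>
#include <sys/vfs.h>

struct path_info {
        time_t expiry;                  /* when to refresh the cached data */
        struct stat stb;                /* cached result of stat(path)     */
        struct statfs64 stfs;           /* cached result of statfs64(path) */
};

static struct hsearch_data cache;
static int cache_ready;

static struct path_info *lookup_path(const char *path)
{
        ENTRY e = { .key = (char *)path }, *r;
        struct path_info *pi;

        if (!cache_ready) {
                hcreate_r(20000, &cache);       /* room for a few thousand mounts */
                cache_ready = 1;
        }
        if (hsearch_r(e, FIND, &r, &cache))
                return r->data;

        pi = calloc(1, sizeof(*pi));
        if (pi == NULL)
                return NULL;
        e.key = strdup(path);
        e.data = pi;
        hsearch_r(e, ENTER, &r, &cache);
        return pi;
}

/* Callers use this instead of calling stat()/statfs64() directly. */
static int cached_path_info(const char *path, struct path_info **out)
{
        struct path_info *pi = lookup_path(path);
        time_t now = time(NULL);

        if (pi == NULL)
                return 0;
        if (pi->expiry <= now) {
                if (stat(path, &pi->stb) != 0 || statfs64(path, &pi->stfs) != 0)
                        return 0;
                pi->expiry = now + 120; /* refresh at most every two minutes */
        }
        *out = pi;
        return 1;
}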
NeilBrown
On 07/11/2017 07:46 PM, NeilBrown wrote:
> So the new data shows about 7 seconds for a login, which is probably
> a little longer than you would like, but might be acceptable?
Unfortunately, the delay is not acceptable.
The ideal would be to achieve performance parity with when one is not
forced to use the 'crossmnt' option.
My current setup (home directory server) does not require 'crossmnt' and
does not incur a delay. It is a standard nfs server using mdadm, lvm,
and xfs.
While my current setup is probably "normal" and using nested datasets
with the 'crossmnt' option might be "weird" now, nested mounts will
probably only become more common as BTRFS, ZFS, and other filesystems
with similar features gain traction on Linux.
> That is more than a 2-line patch. I might have a go later this week.
That would be super! Thanks for taking a look.
--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
[email protected]
On Wed, Jul 12 2017, Phil Kauffman wrote:
> On 07/11/2017 07:46 PM, NeilBrown wrote:
>> So the new data shows about 7 seconds for a login, which is probably
>> a little longer than you would like, but might be acceptable?
> Unfortunately, the delay is not acceptable.
>
> The ideal would be to achieve performance parity with when one is not
> forced to use the 'crossmnt' option.
>
> My current setup (home directory server) does not require 'crossmnt' and
> does not incur a delay. It is a standard nfs server using mdadm, lvm,
> and xfs.
>
> While my current setup is probably "normal" and using nested datasets
> with the 'crossmnt' option might be "weird" now, nested mounts will
> probably only become more common as BTRFS, ZFS, and other filesystems
> with similar features gain traction on Linux.
>
>
>> That is more than a 2-line patch. I might have a go later this week.
> That would be super! Thanks for taking a look.
Please try this (against a clean nfs-utils. i.e. remove the previous
patch).
It is a hack and would need to be totally re-written, but hopefully the
measurements you make and the strace you report will be useful.
Also, for the strace, please use "-ttt" rather than "-tt" like I asked
before. It is easier to find the difference between two times with
-ttt.
And add -T as well.
Thanks,
NeilBrown
From f15571d747bb6272ec539e46b6b3aee3ccc53b30 Mon Sep 17 00:00:00 2001
From: NeilBrown <[email protected]>
Date: Fri, 14 Jul 2017 16:43:37 +1000
Subject: [PATCH] testing
Signed-off-by: NeilBrown <[email protected]>
---
utils/mountd/cache.c | 105 +++++++++++++++++++++++++++++++++++++++------------
1 file changed, 81 insertions(+), 24 deletions(-)
diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index ca6c84f4d93d..db89a4feb7ae 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -18,6 +18,7 @@
#include <time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
+#include <sys/sysmacros.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
@@ -299,7 +300,7 @@ static const long int nonblkid_filesystems[] = {
0 /* last */
};
-static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid)
+static int uuid_by_path(char *path, struct statfs64 *stp, int type, size_t uuidlen, char *uuid)
{
/* get a uuid for the filesystem found at 'path'.
* There are several possible ways of generating the
@@ -334,12 +335,17 @@ static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid)
const char *val;
int rc;
- rc = statfs64(path, &st);
+ if (stp)
+ rc = 0;
+ else {
+ stp = &st;
+ rc= statfs64(path, stp);
+ }
if (type == 0 && rc == 0) {
const long int *bad;
for (bad = nonblkid_filesystems; *bad; bad++) {
- if (*bad == st.f_type)
+ if (*bad == stp->f_type)
break;
}
if (*bad == 0)
@@ -347,9 +353,9 @@ static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid)
}
if (rc == 0 &&
- (st.f_fsid.__val[0] || st.f_fsid.__val[1]))
+ (stp->f_fsid.__val[0] || stp->f_fsid.__val[1]))
snprintf(fsid_val, 17, "%08x%08x",
- st.f_fsid.__val[0], st.f_fsid.__val[1]);
+ stp->f_fsid.__val[0], stp->f_fsid.__val[1]);
else
fsid_val[0] = 0;
@@ -603,25 +609,64 @@ static int parse_fsid(int fsidtype, int fsidlen, char *fsid,
return 0;
}
+#define entry FOO
+#include <search.h>
+struct pinfo {
+ unsigned int stb:1, mntp:1, stfs:1;
+ unsigned int mountpoint:1;
+ time_t valid;
+ struct stat statbuf;
+ struct statfs64 statfsbuf;
+};
+static struct hsearch_data hdata;
+static int hdata_init = 0;
+
+
static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
{
- struct stat stb;
+ struct stat *stb;
int type;
char u[16];
+ struct pinfo *pi;
+ ENTRY ent, *ret;
- if (stat(path, &stb) != 0)
- return false;
- if (!S_ISDIR(stb.st_mode) && !S_ISREG(stb.st_mode))
+ if (!hdata_init) {
+ hcreate_r(1000, &hdata);
+ hdata_init = 1;
+ }
+ ent.key = path;
+ if (hsearch_r(ent, FIND, &ret, &hdata) == 0) {
+ ent.key = strdup(path);
+ pi = xmalloc(sizeof(*pi));
+ pi->stb = pi->mntp = pi->stfs = 0;
+ pi->valid = 0;
+ ent.data = pi;
+ hsearch_r(ent, ENTER, &ret, &hdata);
+ } else
+ pi = ret->data;
+
+ if (pi->valid < time(NULL)+120) {
+ pi->stb = pi->mntp = pi->stfs = 0;
+ pi->valid = time(NULL);
+ }
+
+ stb = &pi->statbuf;
+ if (!pi->stb) {
+ if (stat(path, stb) != 0)
+ return false;
+ pi->stb = 1;
+ }
+ if (!S_ISDIR(stb->st_mode) && !S_ISREG(stb->st_mode))
return false;
switch (parsed->fsidtype) {
case FSID_DEV:
case FSID_MAJOR_MINOR:
case FSID_ENCODE_DEV:
- if (stb.st_ino != parsed->inode)
+ if (stb->st_ino != parsed->inode)
return false;
- if (parsed->major != major(stb.st_dev) ||
- parsed->minor != minor(stb.st_dev))
+ if (parsed->major != major(stb->st_dev) ||
+ parsed->minor != minor(stb->st_dev))
return false;
return true;
case FSID_NUM:
@@ -631,12 +676,16 @@ static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
return true;
case FSID_UUID4_INUM:
case FSID_UUID16_INUM:
- if (stb.st_ino != parsed->inode)
+ if (stb->st_ino != parsed->inode)
return false;
goto check_uuid;
case FSID_UUID8:
case FSID_UUID16:
- if (!is_mountpoint(path))
+ if (!pi->mntp) {
+ pi->mountpoint = is_mountpoint(path);
+ pi->mntp = 1;
+ }
+ if (!pi->mountpoint)
return false;
check_uuid:
if (exp->m_export.e_uuid) {
@@ -644,12 +693,18 @@ static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
if (memcmp(u, parsed->fhuuid, parsed->uuidlen) == 0)
return true;
}
- else
+ else {
+ if (!pi->stfs) {
+ if (statfs64(path, &pi->statfsbuf) != 0)
+ return false;
+ pi->stfs = 1;
+ }
for (type = 0;
- uuid_by_path(path, type, parsed->uuidlen, u);
+ uuid_by_path(path, &pi->statfsbuf, type, parsed->uuidlen, u);
type++)
if (memcmp(u, parsed->fhuuid, parsed->uuidlen) == 0)
return true;
+ }
}
/* Well, unreachable, actually: */
return false;
@@ -727,10 +782,18 @@ static void nfsd_fh(int f)
for (exp = exportlist[i].p_head; exp; exp = next_exp) {
char *path;
+ if (!is_ipaddr_client(dom)) {
+ if (!namelist_client_matches(exp, dom))
+ continue;
+ } else {
+ if (!ipaddr_client_matches(exp, ai))
+ continue;
+ }
+
if (exp->m_export.e_flags & NFSEXP_CROSSMOUNT) {
static nfs_export *prev = NULL;
static void *mnt = NULL;
-
+
if (prev == exp) {
/* try a submount */
path = next_mnt(&mnt, exp->m_export.e_path);
@@ -751,9 +814,6 @@ static void nfsd_fh(int f)
next_exp = exp->m_next;
}
- if (!is_ipaddr_client(dom)
- && !namelist_client_matches(exp, dom))
- continue;
if (exp->m_export.e_mountpoint &&
!is_mountpoint(exp->m_export.e_mountpoint[0]?
exp->m_export.e_mountpoint:
@@ -762,9 +822,6 @@ static void nfsd_fh(int f)
if (!match_fsid(&parsed, exp, path))
continue;
- if (is_ipaddr_client(dom)
- && !ipaddr_client_matches(exp, ai))
- continue;
if (!found || subexport(&exp->m_export, found)) {
found = &exp->m_export;
free(found_path);
@@ -906,7 +963,7 @@ static int dump_to_cache(int f, char *buf, int buflen, char *domain,
write_secinfo(&bp, &blen, exp, flag_mask);
if (exp->e_uuid == NULL || different_fs) {
char u[16];
- if (uuid_by_path(path, 0, 16, u)) {
+ if (uuid_by_path(path, NULL, 0, 16, u)) {
qword_add(&bp, &blen, "uuid");
qword_addhex(&bp, &blen, u, 16);
}
--
2.12.2
On 07/14/2017 01:51 AM, NeilBrown wrote:
> Please try this (against a clean nfs-utils. i.e. remove the previous
> patch).
> It is a hack and would need to be totally re-written, but hopefully the
> measurements you make and the strace you report will be useful.
> Also, for the strace, please use "-ttt" rather than "-tt" like I asked
> before. It is easier to find the difference between two times with
> -ttt. And add -T as well.
Sorry this took so long but I had to be sure of my results.
With your latest patch applied I am unable to mount my NFS shares and
thus continue with the tests from before.
Since the patch you provided cannot be applied cleanly to the source
used to build '1:1.2.8-9ubuntu12.1' for Ubuntu Xenial, I chose to build a
couple of VMs to continue this testing. They should be identical to the
machines I was using before, except they are underpowered and not using
the nfs-common and nfs-kernel-server packages at version 1:1.2.8-9ubuntu12.1.
Most importantly, the ZFS datasets and the files in /etc/exports.d are the same.
I found that utils/mountd/cache.c was last modified in 1.3.4, so I
tested both 1.3.4 and 2.1.1 without your patch as a baseline. I was able
to mount my shares in both versions.
I then reverted my VMs to a state before the baseline install and
installed again with your patch applied.
I get the following with your patch applied on 1.3.4 and 2.1.1:
root@nfsclient:~# mount -avv
/ : ignored
/boot : already mounted
none : ignored
mount.nfs4: timeout set for Tue Jul 25 15:43:13 2017
mount.nfs4: trying text-based options
'ac,hard,rsize=16384,wsize=16384,vers=4.2,addr=10.135.24.23,clientaddr=10.135.24.22'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: trying text-based options
'ac,hard,rsize=16384,wsize=16384,addr=10.135.24.23'
mount.nfs4: prog 100003, trying vers=3, prot=6
mount.nfs4: trying 10.135.24.23 prog 100003 vers 3 prot TCP port 2049
mount.nfs4: prog 100005, trying vers=3, prot=17
mount.nfs4: trying 10.135.24.23 prog 100005 vers 3 prot UDP port 41598
mount.nfs4: mount(2): Permission denied
mount.nfs4: access denied by server while mounting
nfsserver.cs.uchicago.edu:/homes
mount.nfs4: timeout set for Tue Jul 25 15:43:14 2017
mount.nfs4: trying text-based options
'ac,hard,rsize=16384,wsize=16384,vers=4.2,addr=10.135.24.23,clientaddr=10.135.24.22'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: trying text-based options
'ac,hard,rsize=16384,wsize=16384,addr=10.135.24.23'
mount.nfs4: prog 100003, trying vers=3, prot=6
mount.nfs4: trying 10.135.24.23 prog 100003 vers 3 prot TCP port 2049
mount.nfs4: prog 100005, trying vers=3, prot=17
mount.nfs4: trying 10.135.24.23 prog 100005 vers 3 prot UDP port 41598
mount.nfs4: mount(2): Permission denied
mount.nfs4: access denied by server while mounting
nfsserver.cs.uchicago.edu:/stage
Attached are my install notes and the strace during the failed mount above.
--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
[email protected]
On Tue, Jul 25 2017, Phil Kauffman wrote:
> On 07/14/2017 01:51 AM, NeilBrown wrote:
>> Please try this (against a clean nfs-utils. i.e. remove the previous
>> patch).
>> It is a hack and would need to be totally re-written, but hopefully the
>> measurements you make and the strace you report will be useful.
>> Also, for the strace, please use "-ttt" rather than "-tt" like I asked
>> before. It is easier to find the difference between two times with
>> -ttt. And add -T as well.
>
> Sorry this took so long but I had to be sure of my results.
It is worth being thorough.
>
> With your latest patch applied I am unable to mount my NFS shares and
> thus continue with the tests from before.
Weird.
You are experiencing a bug that was very recently fixed, where if
mount.nfs4 gets the error ENOENT from the server, it falls back to
NFSv3.
That explains some of the noise, but doesn't explain why you get ENOENT
for the v4 mount.
The strace output you've provided doesn't even show any attempts to read
/etc/mtab, which my patch doesn't change at all. So it seems like the
context is different in some way.
Your nfs_test_notes.txt doesn't show /etc/exports.d getting filled in
... maybe that is done automatically somehow...
Can you try with unpatched 2.1.1?
Also maybe provide an strace starting before any attempt to mount
anything, and with an extra option "-s 1000"..
Thanks,
NeilBrown
On 07/26/2017 11:37 PM, NeilBrown wrote:
> Weird.
I thought so as well.
> So it seems like the context is different in some way.
It is in the sense that I am now using 1.3.4 and 2.1.1 vanilla on Ubuntu Xenial (which provides 1.2.8) without any of Debian's patches or niceties.
> Your nfs_test_notes.txt doesn't show /etc/exports.d getting filled in
> ... maybe that is done automatically somehow...
I copied the files from the original NFS server into /etc/exports.d as part of the VM snapshot I revert back to. My notes say 'mkdir -p /etc/exports.d', but this step is not necessary.
root@nfsserver:~# ls /etc/exports.d/ | wc -l
972
I only added 3 extra files so that 'nfsclient' would be able to mount from the testing server (nfsserver).
The ZFS datasets were also precreated:
root@nfsserver:~# zfs list | wc -l
6062
> Can you try with unpatched 2.1.1?
I assume you mean without any patches: vanilla 2.1.1.
> Also maybe provide an strace starting before any attempt to mount
> anything, and with an extra option "-s 1000"..
This is all that happens *before* trying to mount from the client.
root@nfsserver:~# systemctl restart nfs-server; exportfs -f; m=$(pgrep rpc.mountd); strace -s 1000 -ttt -T -p $m 2>&1 | tee strace_rpc.mountd2.txt
strace: Process 2517 attached
1501190436.659800 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL
Stracing rpc.nfsd:
root@nfsserver:~# exportfs -f; strace -s1000 -ttt -T /usr/sbin/rpc.nfsd 0 2>&1| tee strace_nfsd.txt; /usr/sbin/sm-notify
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.nfsd.txt
When mounting from the client I get a strace that is 1.3G (it seems '-s' was not respected for some reason): http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.mountd.txt
--
Phil Kauffman
Systems Admin
Dept. of Computer Science
University of Chicago
[email protected]
On Thu, Jul 27 2017, Phil Kauffman wrote:
> On 07/26/2017 11:37 PM, NeilBrown wrote:
>> Weird.
> I thought so as well.
>
>> So it seems like the context is different in some way.
> It is in the sense that I am now using 1.3.4 and 2.1.1 vanilla on Ubuntu Xenial (which provides 1.2.8) without any of Debian's patches or niceties.
>
>
>> Your nfs_test_notes.txt doesn't show /etc/exports.d getting filled in
>> ... maybe that is done automatically somehow...
> I copied the files from the original NFS server into /etc/exports.d as part of the VM snapshot I revert back to. My notes say 'mkdir -p /etc/exports.d', but this step is not necessary.
>
> root@nfsserver:~# ls /etc/exports.d/ | wc -l
> 972
>
> I only added 3 extra files so that 'nfsclient' would be able to mount from the testing server (nfsserver).
>
> The ZFS datasets were also precreated:
>
> root@nfsserver:~# zfs list | wc -l
> 6062
>
>
>> Can you try with unpatched 2.1.1?
> I assume you mean without any patches: vanilla 2.1.1.
>
>> Also maybe provide an strace starting before any attempt to mount
>> anything, and with an extra option "-s 1000"..
>
> This is all that happens *before* trying to mount from the client.
> root@nfsserver:~# systemctl restart nfs-server; exportfs -f; m=$(pgrep rpc.mountd); strace -s 1000 -ttt -T -p $m 2>&1 | tee strace_rpc.mountd2.txt
> strace: Process 2517 attached
> 1501190436.659800 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL
>
> Stracing rpc.nfsd:
> root@nfsserver:~# exportfs -f; strace -s1000 -ttt -T /usr/sbin/rpc.nfsd 0 2>&1| tee strace_nfsd.txt; /usr/sbin/sm-notify
> http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.nfsd.txt
>
> When mounting from the client I get a strace that is 1.3G (it seems '-s' was not respected for some reason): http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.mountd.txt
>
I'm sorry but I cannot get anything useful from these. It isn't crystal
clear what was being tested in each case....
Could you try:
1/ build the current 'master' branch from the nfs-utils git tree, with
no extra patches.
2/ Install that and perform the same test which produced the 1.3G strace
file above. Provide the resulting strace of rpc.mountd.
3/ Tell me what happened when you mounted on the client. Did it work as
expected?
4/ Apply the patch I provided on top of the 'master' branch but change
hcreate_r(1000, &hdata);
to
hcreate_r(20000, &hdata);
5/ Install that and perform the same test as in step 2, producing
a new strace of rpc.mountd. Presumably it will be different.
6/ Tell me what happened when you mounted on the client. Did it work
as expected, or did it fail like I think it did last time?
If one worked and the other didn't, I can compare the strace and try to
find a difference.
If both worked, then we should have useful information about how much
speed-up the patch provides (if any).
If neither worked, then something weird is happening...
NeilBrown