2015-12-03 18:31:16

by Chuck Lever III

Subject: Call graph of nfsd_open

Hi-

I’m looking into NFS WRITE latency on the Linux NFS server.
With a high speed network fabric and ultra-fast storage (think
persistent memory) we can get the per-WRITE latency under 50us
(as observed on the wire at the server).

One source of the remaining latency appears to be nfsd_open.
Here is an example of a call graph captured during a simple
iozone 4KB direct write workload (v4.4-rc3):

2) | nfsd_open [nfsd]() {
2) | fh_verify [nfsd]() {
2) | rqst_exp_find [nfsd]() {
2) | exp_find [nfsd]() {
2) | exp_find_key [nfsd]() {
2) 0.043 us | svc_expkey_hash [nfsd]();
2) | sunrpc_cache_lookup [sunrpc]() {
2) 0.027 us | _raw_read_lock();
2) | getboottime64() {
2) 0.022 us | ns_to_timespec();
2) 0.210 us | }
2) 0.021 us | get_seconds();
2) 0.854 us | }
2) | cache_check [sunrpc]() {
2) | getboottime64() {
2) 0.021 us | ns_to_timespec();
2) 0.206 us | }
2) 0.022 us | get_seconds();
2) 0.596 us | }
2) 2.050 us | }
2) | exp_get_by_name [nfsd]() {
2) | svc_export_lookup [nfsd]() {
2) | sunrpc_cache_lookup [sunrpc]() {
2) 0.022 us | _raw_read_lock();
2) 0.024 us | svc_export_match [nfsd]();
2) | getboottime64() {
2) 0.023 us | ns_to_timespec();
2) 0.210 us | }
2) 0.022 us | get_seconds();
2) 0.997 us | }
2) 1.181 us | }
2) | cache_check [sunrpc]() {
2) | getboottime64() {
2) 0.022 us | ns_to_timespec();
2) 0.210 us | }
2) 0.022 us | get_seconds();
2) 0.592 us | }
2) 2.147 us | }
2) 4.560 us | }
2) 4.768 us | }
2) | prepare_creds() {
2) | kmem_cache_alloc() {
2) 0.021 us | _cond_resched();
2) 0.233 us | }
2) | security_prepare_creds() {
2) | selinux_cred_prepare() {
2) | kmemdup() {
2) | __kmalloc_track_caller() {
2) 0.022 us | kmalloc_slab();
2) 0.023 us | _cond_resched();
2) 0.448 us | }
2) 0.640 us | }
2) 0.838 us | }
2) 1.129 us | }
2) 1.808 us | }
2) 0.022 us | override_creds();
2) | exportfs_decode_fh() {
2) | shmem_fh_to_dentry() {
2) | ilookup5() {
2) | ilookup5_nowait() {
2) 0.026 us | _raw_spin_lock();
2) | find_inode() {
2) 0.031 us | shmem_match();
2) 0.029 us | _raw_spin_lock();
2) 0.429 us | }
2) 0.826 us | }
2) 0.022 us | _cond_resched();
2) 0.021 us | _cond_resched();
2) 1.417 us | }
2) | d_find_alias() {
2) 0.022 us | _raw_spin_lock();
2) 0.027 us | _raw_spin_lock();
2) 0.423 us | }
2) 0.035 us | iput();
2) 2.410 us | }
2) | find_acceptable_alias() {
2) 0.026 us | nfsd_acceptable [nfsd]();
2) 0.220 us | }
2) 3.032 us | }
2) | nfsd_setuser_and_check_port [nfsd]() {
2) 0.023 us | nfsexp_flags [nfsd]();
2) | nfsd_setuser [nfsd]() {
2) | revert_creds() {
2) | __put_cred() {
2) | call_rcu_sched() {
2) | __call_rcu() {
2) 0.049 us | __call_rcu_nocb_enqueue();
2) 0.265 us | }
2) 0.446 us | }
2) 0.661 us | }
2) 0.854 us | }
2) | prepare_creds() {
2) | kmem_cache_alloc() {
2) 0.022 us | _cond_resched();
2) 0.219 us | }
2) | security_prepare_creds() {
2) | selinux_cred_prepare() {
2) | kmemdup() {
2) | __kmalloc_track_caller() {
2) 0.024 us | kmalloc_slab();
2) 0.022 us | _cond_resched();
2) 0.412 us | }
2) 0.615 us | }
2) 0.801 us | }
2) 0.993 us | }
2) 1.651 us | }
2) 0.033 us | set_groups();
2) 0.023 us | override_creds();
2) 3.323 us | }
2) 0.029 us | nfserrno [nfsd]();
2) 3.919 us | }
2) 0.025 us | check_nfsd_access [nfsd]();
2) 0.022 us | nfsd_permission [nfsd]();
2) + 14.904 us | }

So, 15us for fh_verify by itself.

2) | dentry_open() {
2) | get_empty_filp() {
2) | kmem_cache_alloc() {
2) 0.021 us | _cond_resched();
2) 0.325 us | }
2) | security_file_alloc() {
2) | selinux_file_alloc_security() {
2) | kmem_cache_alloc() {
2) 0.022 us | _cond_resched();
2) 0.251 us | }
2) 0.441 us | }
2) 0.633 us | }
2) 0.022 us | __mutex_init();
2) 1.572 us | }
2) | vfs_open() {
2) | do_dentry_open() {
2) | path_get() {
2) 0.025 us | mntget();
2) 0.313 us | }
2) 0.029 us | try_module_get();
2) | security_file_open() {
2) | selinux_file_open() {
2) 0.024 us | avc_policy_seqno();
2) | inode_has_perm.isra.21() {
2) 0.148 us | avc_has_perm();
2) 0.332 us | }
2) 0.728 us | }
2) 0.023 us | __fsnotify_parent();
2) 0.034 us | fsnotify();
2) 1.359 us | }
2) 0.045 us | file_ra_state_init();
2) 2.548 us | }
2) 2.733 us | }
2) 4.667 us | }
2) | ima_file_check() {
2) 0.028 us | process_measurement();
2) 0.211 us | }
2) 0.023 us | nfserrno [nfsd]();
2) + 20.549 us | }

And that’s 20us total for this nfsd_open. Seems like a lot of
work for each NFSv3 WRITE operation, and plenty of opportunities
for lock contention if my server happened to be otherwise busy.

Note these figures are a little inflated because ftrace itself
adds some latency.
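
For reference, call graphs like this come from the ftrace
function_graph tracer. A minimal sketch of one way to set up such a
capture via tracefs, written as a small Python helper for concreteness
(the control files are the standard tracefs ones; the exact filtering
shown is only illustrative):

    TRACEFS = "/sys/kernel/debug/tracing"

    def tracefs_write(name, value):
        # Each control file takes a short string, just as "echo value > file" would.
        with open(TRACEFS + "/" + name, "w") as f:
            f.write(value)

    tracefs_write("set_graph_function", "nfsd_open")   # graph only nfsd_open and its callees
    tracefs_write("current_tracer", "function_graph")  # nested call/duration output, as above
    tracefs_write("tracing_on", "1")
    # ... run the iozone workload against the export ...
    tracefs_write("tracing_on", "0")
    print(open(TRACEFS + "/trace").read())             # the captured call graph
    tracefs_write("current_tracer", "nop")             # reset the tracer when done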

Anyway, making fh_verify faster could have a significant
impact on per-op latency with fast storage. The recently-proposed
NFSD open cache by itself could eliminate the need to perform
fh_verify so often, for example.

--
Chuck Lever






2015-12-03 19:53:37

by Trond Myklebust

Subject: Re: Call graph of nfsd_open

Hi Chuck,

On Thu, Dec 3, 2015 at 10:31 AM, Chuck Lever <[email protected]> wrote:
> Hi-
>
> I’m looking into NFS WRITE latency on the Linux NFS server.
> With a high speed network fabric and ultra-fast storage (think
> persistent memory) we can get the per-WRITE latency under 50us
> (as observed on the wire at the server).
>
> One source of the remaining latency appears to be nfsd_open.
> Here is an example of a call graph captured during a simple
> iozone 4KB direct write workload (v4.4-rc3):
>
> [ ... function call graph snipped ... ]
>
> So, 15us for fh_verify by itself.
>
> [ ... function call graph snipped ... ]
>
> And that’s 20us total for this nfsd_open. Seems like a lot of
> work for each NFSv3 WRITE operation, and plenty of opportunities
> for lock contention if my server happened to be otherwise busy.
>
> Note these figures are a little inflated because ftrace itself
> adds some latency.
>
> Anyway, making fh_verify faster could have a significant
> impact on per-op latency with fast storage. The recently-proposed
> NFSD open cache by itself could eliminate the need to perform
> fh_verify so often, for example.
>

Have you compared with Jeff's patchset? It would be very interesting
to see how that affects your numbers.

Cheers
Trond

2015-12-03 19:57:04

by Chuck Lever III

Subject: Re: Call graph of nfsd_open


> On Dec 3, 2015, at 2:53 PM, Trond Myklebust <[email protected]> wrote:
>
> Hi Chuck,
>
> On Thu, Dec 3, 2015 at 10:31 AM, Chuck Lever <[email protected]> wrote:
>> Hi-
>>
>> I’m looking into NFS WRITE latency on the Linux NFS server.
>> With a high speed network fabric and ultra-fast storage (think
>> persistent memory) we can get the per-WRITE latency under 50us
>> (as observed on the wire at the server).
>>
>> One source of the remaining latency appears to be nfsd_open.
>> Here is an example of a call graph captured during a simple
>> iozone 4KB direct write workload (v4.4-rc3):
>>
>> [ ... function call graph snipped ... ]
>>
>> So, 15us for fh_verify by itself.
>>
>> [ ... function call graph snipped ... ]
>>
>> And that’s 20us total for this nfsd_open. Seems like a lot of
>> work for each NFSv3 WRITE operation, and plenty of opportunities
>> for lock contention if my server happened to be otherwise busy.
>>
>> Note these figures are a little inflated because ftrace itself
>> adds some latency.
>>
>> Anyway, making fh_verify faster could have a significant
>> impact on per-op latency with fast storage. The recently-proposed
>> NFSD open cache by itself could eliminate the need to perform
>> fh_verify so often, for example.
>>
>
> Have you compared with Jeff's patchset? It would be very interesting
> to see how that affects your numbers.

Bruce asked me to post what I had so far, to help justify
adding the NFSD open cache, which has been tabled. Do you
know where Jeff keeps this work, and is it merged up to
4.4-rc?

--
Chuck Lever





2015-12-03 21:00:13

by Trond Myklebust

Subject: Re: Call graph of nfsd_open

On Thu, Dec 3, 2015 at 11:56 AM, Chuck Lever <[email protected]> wrote:
> Bruce asked me to post what I had so far, to help justify
> adding the NFSD open cache, which has been tabled. Do you
> know where Jeff keeps this work, and is it merged up to
> 4.4-rc?
>

IIRC, he posted them to the list a couple of months ago. He's on
vacation right now, but I'll see if I can dig them up for you.

2015-12-04 00:01:33

by Jeff Layton

Subject: Re: Call graph of nfsd_open

On Thu, 3 Dec 2015 13:31:13 -0500
Chuck Lever <[email protected]> wrote:

> Hi-
>
> I’m looking into NFS WRITE latency on the Linux NFS server.
> With a high speed network fabric and ultra-fast storage (think
> persistent memory) we can get the per-WRITE latency under 50us
> (as observed on the wire at the server).
>
> One source of the remaining latency appears to be nfsd_open.
> Here is an example of a call graph captured during a simple
> iozone 4KB direct write workload (v4.4-rc3):
>
> [ ... function call graph snipped ... ]
>
> So, 15us for fh_verify by itself.
>
> [ ... function call graph snipped ... ]
>
> And that’s 20us total for this nfsd_open. Seems like a lot of
> work for each NFSv3 WRITE operation, and plenty of opportunities
> for lock contention if my server happened to be otherwise busy.
>
> Note these figures are a little inflated because ftrace itself
> adds some latency.
>
> Anyway, making fh_verify faster could have a significant
> impact on per-op latency with fast storage. The recently-proposed
> NFSD open cache by itself could eliminate the need to perform
> fh_verify so often, for example.
>

No, we still have to do an fh_verify. In particular, we still have to
do an nfsd_permission check, etc. What we _can_ avoid is calling down
into the vfs layer to do the actual open. Maybe it's possible to
optimize away some of the fh_verify processing, though? Personally, I
think the fh_* routines do too much...
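
To make that split concrete, here is a toy model of the idea --
deliberately not the real patchset, and in Python rather than kernel C.
The cache only short-circuits the open itself; the verification and
permission checks (passed in here as stand-in callables) still run on
every request:

    import os

    class NfsdOpenCache:
        """Toy model only: map a filehandle to an already-open descriptor."""

        def __init__(self):
            self._files = {}    # filehandle (bytes) -> open fd

        def open_for_write(self, filehandle, path, verify, permission):
            # Stand-ins for fh_verify/nfsd_permission: never skipped.
            verify(filehandle)
            permission(filehandle)

            fd = self._files.get(filehandle)
            if fd is None:
                # Only a cache miss pays for the dentry_open()/vfs_open() path.
                fd = os.open(path, os.O_WRONLY)
                self._files[filehandle] = fd
            return fd

        def shutdown(self):
            for fd in self._files.values():
                os.close(fd)
            self._files.clear()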

In any case, the latest set that I have is here:

https://git.samba.org/?p=jlayton/linux.git;a=shortlog;h=refs/heads/nfsd-4.5

...but note that Trond and Tao have merged some fixes to our internal
repo for this that I have not yet had time to review. I'll probably
pick those up next week sometime, once I've gotten caught up.

--
Jeff Layton <[email protected]>

2015-12-04 14:42:32

by Chuck Lever III

Subject: Re: Call graph of nfsd_open


> On Dec 3, 2015, at 7:01 PM, Jeff Layton <[email protected]> wrote:
>
> On Thu, 3 Dec 2015 13:31:13 -0500
> Chuck Lever <[email protected]> wrote:
>
>> Hi-
>>
>> I’m looking into NFS WRITE latency on the Linux NFS server.
>> With a high speed network fabric and ultra-fast storage (think
>> persistent memory) we can get the per-WRITE latency under 50us
>> (as observed on the wire at the server).
>>
>> One source of the remaining latency appears to be nfsd_open.
>> Here is an example of a call graph captured during a simple
>> iozone 4KB direct write workload (v4.4-rc3):
>>
>> [ ... function call graph snipped ... ]
>>
>> So, 15us for fh_verify by itself.
>>
>> [ ... function call graph snipped ... ]
>>
>> And that’s 20us total for this nfsd_open. Seems like a lot of
>> work for each NFSv3 WRITE operation, and plenty of opportunities
>> for lock contention if my server happened to be otherwise busy.
>>
>> Note these figures are a little inflated because ftrace itself
>> adds some latency.
>>
>> Anyway, making fh_verify faster could have a significant
>> impact on per-op latency with fast storage. The recently-proposed
>> NFSD open cache by itself could eliminate the need to perform
>> fh_verify so often, for example.
>>
>
> No, we still have to do an fh_verify. In particular, we still have to
> do an nfsd_permission check, etc. What we _can_ avoid is calling down
> into the vfs layer to do the actual open. Maybe it's possible to
> optimize away some of the fh_verify processing, though? Personally, I
> think the fh_* routines do too much

fh_verify turns out to be the expensive part, about three-fourths
of the latency of nfsd_open in my environment. It would be very
nice to reduce or eliminate this expense.
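
That fraction comes straight from the totals in the call graph in my
original message; as a quick check:

    fh_verify_us = 14.904    # fh_verify subtotal from the trace
    nfsd_open_us = 20.549    # nfsd_open total from the trace
    print("fh_verify share: %.0f%%" % (100 * fh_verify_us / nfsd_open_us))   # ~73%

so roughly three of every four microseconds in nfsd_open go to
verifying the filehandle.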

But this is with 4KB NFSv3 WRITE operations. As the payload size
increases, latency is dominated by how long it takes to move the
data on the network and by other housekeeping I have yet to look
into. More to come.


> In any case, the latest set that I have is here:
>
> https://git.samba.org/?p=jlayton/linux.git;a=shortlog;h=refs/heads/nfsd-4.5
>
> ...but note that Trond and Tao have merged some fixes to our internal
> repo for this that I have not yet had time to review. I'll probably
> pick those up next week sometime once I've gotten caught up.
>
> --
> Jeff Layton <[email protected]>

--
Chuck Lever





2015-12-15 01:34:57

by Chuck Lever III

Subject: Re: Call graph of nfsd_open


> On Dec 3, 2015, at 2:53 PM, Trond Myklebust <[email protected]> wrote:
>
> Hi Chuck,
>
> On Thu, Dec 3, 2015 at 10:31 AM, Chuck Lever <[email protected]> wrote:
>> Hi-
>>
>> I’m looking into NFS WRITE latency on the Linux NFS server.
>> With a high speed network fabric and ultra-fast storage (think
>> persistent memory) we can get the per-WRITE latency under 50us
>> (as observed on the wire at the server).
>>
>> One source of the remaining latency appears to be nfsd_open.
>> Here is an example of a call graph captured during a simple
>> iozone 4KB direct write workload (v4.4-rc3):

[ ... function call graph snipped ... ]

>> And that’s 20us total for this nfsd_open. Seems like a lot of
>> work for each NFSv3 WRITE operation, and plenty of opportunities
>> for lock contention if my server happened to be otherwise busy.
>>
>> Note these figures are a little inflated because ftrace itself
>> adds some latency.
>>
>> Anyway, making fh_verify faster could have a significant
>> impact on per-op latency with fast storage. The recently-proposed
>> NFSD open cache by itself could eliminate the need to perform
>> fh_verify so often, for example.
>>
>
> Have you compared with Jeff's patchset? It would be very interesting
> to see how that affects your numbers.

I compared the latency of read(2) and write(2) as measured
by iozone on the client.

I pulled Jeff's patches from samba.org; they are based on
4.4-rc4. My patches are my for-4.5 series, also based on
4.4-rc4. I've disabled FRMR on the server for RDMA Reads.
The share is a tmpfs.

The client is 4.4-rc4 with my for-4.5 patches applied.
1MB rsize and wsize.

The fabric is FDR, both systems have CX-3 Pro adapters
and report 56Gbps link speed.

The test is "iozone -az -i0 -i1 -y1k -s128m -I -N". The
results are microseconds per system call. Lower is
better.

With Jeff's NFSD open cache:

     KB  reclen   write  rewrite    read  reread
 131072       1      48       48      43      43
 131072       2      49       49      43      43
 131072       4      50       50      43      43
 131072       8      54       52      45      46
 131072      16      59       58      48      48
 131072      32      71       69      56      56
 131072      64      97       92      75      78
 131072     128     147      142     105     105
 131072     256     237      216     171     162
 131072     512     492      457     268     267
 131072    1024     806      748     529     526
 131072    2048    1257     1189     711     696
 131072    4096    1884     1718    1019    1016
 131072    8192    3452     2958    1714    1710
 131072   16384    6835     5416    3132    3134

With my for-4.5 patches (no open cache):

     KB  reclen   write  rewrite    read  reread
 131072       1      49       50      43      43
 131072       2      50       49      43      43
 131072       4      51       50      43      43
 131072       8      55       53      45      45
 131072      16      60       58      48      48
 131072      32      70       68      53      54
 131072      64      91       85      69      69
 131072     128     140      130      95      96
 131072     256     214      203     145     147
 131072     512     480      410     253     249
 131072    1024     755      698     508     477
 131072    2048    1206     1049     667     656
 131072    4096    1786     1632     977     977
 131072    8192    3243     2851    1670    1672
 131072   16384    6579     5276    3091    3092

You can see that below a 64KB I/O size, Jeff's open cache saves
about a microsecond per NFS WRITE. NFS READ is not affected.

Above that, other effects dominate the cost per I/O. I can't
explain yet why my for-4.5 server looks better with larger I/Os.
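
For reference, the write-column comparison above can be reproduced
directly from the two tables (values are microseconds per call; a
negative delta means the open-cache run was faster):

    reclen_kb   = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                   1024, 2048, 4096, 8192, 16384]
    write_cache = [48, 49, 50, 54, 59, 71, 97, 147, 237, 492,
                   806, 1257, 1884, 3452, 6835]    # with Jeff's open cache
    write_45    = [49, 50, 51, 55, 60, 70, 91, 140, 214, 480,
                   755, 1206, 1786, 3243, 6579]    # my for-4.5 series

    for r, a, b in zip(reclen_kb, write_cache, write_45):
        print("%6d KB  delta %+5d us" % (r, a - b))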

NFS WRITE takes longer than NFS READ even with a memory
file system. The difference seems to be dealing with all
the pages: on the server, this is done by the local file
system; on the client, there's a palpable O/S cost per page
to set up memory registration (the FAST_REG WR itself is
quick), and lock contention dealing with the NFS WRITE
completions.

Looking at ways to speed up fh_verify would help the smaller
WRITEs, and reducing the cost of page management might be
good for large I/O.


--
Chuck Lever