Currently the idle timeout for a courtesy client is fixed at 1 day.
If lots of courtesy clients remain in the system, they can cause a
memory shortage that affects the operation of other modules in the
kernel. This problem can be observed by running the pynfs nfs4.0
CID5 test in a loop. Eventually the system runs out of memory and
rpc.gssd fails to add a new watch:
rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
No space left on device
and alloc_inode also fails with an out-of-memory error:
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x42
dump_header+0x4a/0x1ed
oom_kill_process+0x80/0x10d
out_of_memory+0x237/0x25f
__alloc_pages_slowpath.constprop.0+0x617/0x7b6
__alloc_pages+0x132/0x1e3
alloc_slab_page+0x15/0x33
allocate_slab+0x78/0x1ab
? alloc_inode+0x38/0x8d
___slab_alloc+0x2af/0x373
? alloc_inode+0x38/0x8d
? slab_pre_alloc_hook.constprop.0+0x9f/0x158
? alloc_inode+0x38/0x8d
__slab_alloc.constprop.0+0x1c/0x24
kmem_cache_alloc_lru+0x8c/0x142
alloc_inode+0x38/0x8d
iget_locked+0x60/0x126
kernfs_get_inode+0x18/0x105
kernfs_iop_lookup+0x6d/0xbc
__lookup_slow+0xb7/0xf9
lookup_slow+0x3a/0x52
walk_component+0x90/0x100
? inode_permission+0x87/0x128
link_path_walk.part.0.constprop.0+0x266/0x2ea
? path_init+0x101/0x2f2
path_lookupat+0x4c/0xfa
filename_lookup+0x63/0xd7
? getname_flags+0x32/0x17a
? kmem_cache_alloc+0x11f/0x144
? getname_flags+0x16c/0x17a
user_path_at_empty+0x37/0x4b
do_readlinkat+0x61/0x102
__x64_sys_readlinkat+0x18/0x1b
do_syscall_64+0x57/0x72
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This patch addresses this problem by:
. removing the fixed 1-day idle time limit for courtesy clients.
A courtesy client is now allowed to remain valid as long as
available system memory is above 80%.
. when available system memory drops below 80%, the laundromat
starts trimming older courtesy clients. The number of courtesy
clients to trim is a percentage of the total number of courtesy
clients in the system, computed from the current percentage of
available system memory.
. the percentage of courtesy clients to be trimmed is based on
this table:
----------------------------------
| % memory | % courtesy clients |
| available | to trim |
----------------------------------
| > 80 | 0 |
| > 70 | 10 |
| > 60 | 20 |
| > 50 | 40 |
| > 40 | 60 |
| > 30 | 80 |
| < 30 | 100 |
----------------------------------
. due to the overhead associated with removing client records,
at most 128 clients are trimmed in each laundromat run. This
prevents the laundromat from spending so long destroying clients
that it misses performing its other tasks in a timely manner.
. the laundromat is scheduled to run sooner when more courtesy
clients need to be destroyed.
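For illustration, the trim-percentage ladder and the per-run cap
described above can be modeled as plain functions (a userspace
sketch; the function names are hypothetical, the 128 cap comes from
this description, and exactly 30% available is treated as "trim
all", which the table leaves ambiguous):

```c
#include <assert.h>

/* Map the percentage of available system memory to the percentage
 * of courtesy clients the laundromat should trim, per the table
 * above. Sketch only; the name is illustrative. */
static int courtesy_trim_pct(int mem_avail_pct)
{
	if (mem_avail_pct > 80)
		return 0;
	if (mem_avail_pct > 70)
		return 10;
	if (mem_avail_pct > 60)
		return 20;
	if (mem_avail_pct > 50)
		return 40;
	if (mem_avail_pct > 40)
		return 60;
	if (mem_avail_pct > 30)
		return 80;
	return 100;
}

/* Apply the per-run cap of 128: return how many clients to destroy
 * this run. Callers would reschedule the laundromat sooner when
 * work remains. */
#define NFSD_CLIENT_MAX_TRIM_PER_RUN 128

static int courtesy_trim_this_run(int to_trim)
{
	if (to_trim > NFSD_CLIENT_MAX_TRIM_PER_RUN)
		return NFSD_CLIENT_MAX_TRIM_PER_RUN;
	return to_trim;
}
```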
The shrinker method was evaluated and found unsuitable for this
problem for these reasons:
. destroying the NFSv4 client in shrinker context can cause
deadlock, since nfsd_file_put calls into the underlying FS
code and we have no control over what it does, as seen in this
stack trace:
======================================================
WARNING: possible circular locking dependency detected
5.19.0-rc2_sk+ #1 Not tainted
------------------------------------------------------
lck/31847 is trying to acquire lock:
ffff88811d268850 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}, at: btrfs_inode_lock+0x38/0x70
but task is already holding lock:
ffffffffb41848c0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x506/0x1db0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0xc0/0x100
__kmalloc+0x51/0x320
btrfs_buffered_write+0x2eb/0xd90
btrfs_do_write_iter+0x6bf/0x11c0
do_iter_readv_writev+0x2bb/0x5a0
do_iter_write+0x131/0x630
nfsd_vfs_write+0x4da/0x1900 [nfsd]
nfsd4_write+0x2ac/0x760 [nfsd]
nfsd4_proc_compound+0xce8/0x23e0 [nfsd]
nfsd_dispatch+0x4ed/0xc10 [nfsd]
svc_process_common+0xd3f/0x1b00 [sunrpc]
svc_process+0x361/0x4f0 [sunrpc]
nfsd+0x2d6/0x570 [nfsd]
kthread+0x2a1/0x340
ret_from_fork+0x22/0x30
-> #0 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}:
__lock_acquire+0x318d/0x7830
lock_acquire+0x1bb/0x500
down_write+0x82/0x130
btrfs_inode_lock+0x38/0x70
btrfs_sync_file+0x280/0x1010
nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
nfsd_file_put+0xd4/0x110 [nfsd]
release_all_access+0x13a/0x220 [nfsd]
nfs4_free_ol_stateid+0x40/0x90 [nfsd]
free_ol_stateid_reaplist+0x131/0x210 [nfsd]
release_openowner+0xf7/0x160 [nfsd]
__destroy_client+0x3cc/0x740 [nfsd]
nfsd_cc_lru_scan+0x271/0x410 [nfsd]
shrink_slab.constprop.0+0x31e/0x7d0
shrink_node+0x54b/0xe50
try_to_free_pages+0x394/0xba0
__alloc_pages_slowpath.constprop.0+0x5d2/0x1db0
__alloc_pages+0x4d6/0x580
__handle_mm_fault+0xc25/0x2810
handle_mm_fault+0x136/0x480
do_user_addr_fault+0x3d8/0xec0
exc_page_fault+0x5d/0xc0
asm_exc_page_fault+0x27/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(&sb->s_type->i_mutex_key#16);
lock(fs_reclaim);
lock(&sb->s_type->i_mutex_key#16);
 *** DEADLOCK ***
. the shrinker kicks in only when memory drops really low, below
~5%. By that time, other components in the system have already
run into memory-shortage issues. For example, rpc.gssd starts
failing to add watches in /var/lib/nfs/rpc_pipefs/nfsd4_cb
once the memory consumed by these watches reaches about 1% of
available system memory.
. destroying the NFSv4 client has significant overhead due to
the upcall to user space to remove the client record, which
might access the storage device. There is a potential deadlock
if the storage subsystem needs to allocate memory.
Add a counter, courtesy_client_count, to keep track of the number
of courtesy clients in the system.
Signed-off-by: Dai Ngo <[email protected]>
---
fs/nfsd/nfs4state.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 9409a0dc1b76..a34ffb0d8c77 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -126,11 +126,13 @@ static const struct nfsd4_callback_ops nfsd4_cb_recall_ops;
static const struct nfsd4_callback_ops nfsd4_cb_notify_lock_ops;
static struct workqueue_struct *laundry_wq;
+static atomic_t courtesy_client_count;
int nfsd4_create_laundry_wq(void)
{
int rc = 0;
+ atomic_set(&courtesy_client_count, 0);
laundry_wq = alloc_workqueue("%s", WQ_UNBOUND, 0, "nfsd4");
if (laundry_wq == NULL)
rc = -ENOMEM;
@@ -169,7 +171,8 @@ static __be32 get_client_locked(struct nfs4_client *clp)
if (is_client_expired(clp))
return nfserr_expired;
atomic_inc(&clp->cl_rpc_users);
- clp->cl_state = NFSD4_ACTIVE;
+ if (xchg(&clp->cl_state, NFSD4_ACTIVE) != NFSD4_ACTIVE)
+ atomic_add_unless(&courtesy_client_count, -1, 0);
return nfs_ok;
}
@@ -190,7 +193,8 @@ renew_client_locked(struct nfs4_client *clp)
list_move_tail(&clp->cl_lru, &nn->client_lru);
clp->cl_time = ktime_get_boottime_seconds();
- clp->cl_state = NFSD4_ACTIVE;
+ if (xchg(&clp->cl_state, NFSD4_ACTIVE) != NFSD4_ACTIVE)
+ atomic_add_unless(&courtesy_client_count, -1, 0);
}
static void put_client_renew_locked(struct nfs4_client *clp)
@@ -2226,6 +2230,8 @@ __destroy_client(struct nfs4_client *clp)
nfsd4_shutdown_callback(clp);
if (clp->cl_cb_conn.cb_xprt)
svc_xprt_put(clp->cl_cb_conn.cb_xprt);
+ if (clp->cl_state != NFSD4_ACTIVE)
+ atomic_add_unless(&courtesy_client_count, -1, 0);
free_client(clp);
wake_up_all(&expiry_wq);
}
@@ -5803,8 +5809,11 @@ nfs4_get_client_reaplist(struct nfsd_net *nn, struct list_head *reaplist,
goto exp_client;
if (!state_expired(lt, clp->cl_time))
break;
- if (!atomic_read(&clp->cl_rpc_users))
- clp->cl_state = NFSD4_COURTESY;
+ if (!atomic_read(&clp->cl_rpc_users)) {
+ if (xchg(&clp->cl_state, NFSD4_COURTESY) ==
+ NFSD4_ACTIVE)
+ atomic_inc(&courtesy_client_count);
+ }
if (!client_has_state(clp) ||
ktime_get_boottime_seconds() >=
(clp->cl_time + NFSD_COURTESY_CLIENT_TIMEOUT))
--
2.9.5
Hello Dai -
I agree that tackling resource management is indeed an appropriate
next step for courteous server. Thanks for tackling this!
More comments are inline.
> On Jul 4, 2022, at 3:05 PM, Dai Ngo <[email protected]> wrote:
>
> Currently the idle timeout for courtesy client is fixed at 1 day. If
> there are lots of courtesy clients remain in the system it can cause
> memory resource shortage that effects the operations of other modules
> in the kernel. This problem can be observed by running pynfs nfs4.0
> CID5 test in a loop. Eventually system runs out of memory and rpc.gssd
> fails to add new watch:
>
> rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
> No space left on device
>
> and alloc_inode also fails with out of memory:
>
> [ ... stack trace snipped ... ]
These details are a little distracting. IMO you can summarize
the above with just this:
>> Currently the idle timeout for courtesy client is fixed at 1 day. If
>> there are lots of courtesy clients remain in the system it can cause
>> memory resource shortage. This problem can be observed by running
>> pynfs nfs4.0 CID5 test in a loop.
Now I'm going to comment in reverse order here. To add context
for others on-list, when we designed courteous server, we had
assumed that eventually a shrinker would be used to garbage
collect courtesy clients. Dai has found some issues with that
approach:
> The shrinker method was evaluated and found it's not suitable
> for this problem due to these reasons:
>
> . destroying the NFSv4 client on the shrinker context can cause
> deadlock since nfsd_file_put calls into the underlying FS
> code and we have no control what it will do as seen in this
> stack trace:
[ ... stack trace snipped ... ]
I think I always had in mind that only the laundromat would be
responsible for harvesting courtesy clients. A shrinker might
trigger that activity, but as you point out, a deadlock is pretty
likely if the shrinker itself had to do the harvesting.
> . destroying the NFSv4 client has significant overhead due to
> the upcall to user space to remove the client records which
> might access storage device. There is potential deadlock
> if the storage subsystem needs to allocate memory.
The issue is that harvesting a courtesy client will involve
an upcall to nfsdcltracker, and that will result in I/O that
updates the tracker's database. Very likely this will require
further allocation of memory and thus it could deadlock the
system.
Now this might also be all the demonstration that we need
that managing courtesy resources cannot be done using the
system's shrinker facility -- expiring a client can never
be done when there is a direct reclaim waiting on it. I'm
interested in other opinions on that. Neil? Bruce? Trond?
> . the shrinker kicks in only when memory drops really low, ~<5%.
> By this time, some other components in the system already run
> into issue with memory shortage. For example, rpc.gssd starts
> failing to add watches in /var/lib/nfs/rpc_pipefs/nfsd4_cb
> once the memory consumed by these watches reaches about 1% of
> available system memory.
Your claim is that a courtesy client shrinker would be invoked
too late. That might be true on a server with 2GB of RAM, but
on a big system (say, a server with 64GB of RAM), 5% is still
more than 3GB -- wouldn't that be enough to harvest safely?
We can't optimize for tiny server systems because that almost
always hobbles the scalability of larger systems for no good
reason. Can you test with a large-memory server as well as a
small-memory server?
I think the central question here is why is 5% not enough on
all systems. I would like to understand that better. It seems
like a primary scalability question that needs an answer so
a good harvesting heuristic can be derived.
One question in my mind is what is the maximum rate at which
the server converts active clients to courtesy clients, and
can the current laundromat scheme keep up with harvesting them
at that rate? The destructive scenario seems to be when courtesy
clients are manufactured faster than they can be harvested and
expunged.
(Also I recall Bruce fixed a problem recently with nfsdcltracker
where it was doing three fsync's for every database update,
which significantly slowed it down. You should look for that
fix in nfs-utils and ensure the above rate measurement is done
with the fix applied).
> This patch addresses this problem by:
>
> . removing the fixed 1-day idle time limit for courtesy client.
> Courtesy client is now allowed to remain valid as long as the
> available system memory is above 80%.
>
> . when available system memory drops below 80%, laundromat starts
> trimming older courtesy clients. The number of courtesy clients
> to trim is a percentage of the total number of courtesy clients
> exist in the system. This percentage is computed based on
> the current percentage of available system memory.
>
> . the percentage of number of courtesy clients to be trimmed
> is based on this table:
>
> ----------------------------------
> | % memory | % courtesy clients |
> | available | to trim |
> ----------------------------------
> | > 80 | 0 |
> | > 70 | 10 |
> | > 60 | 20 |
> | > 50 | 40 |
> | > 40 | 60 |
> | > 30 | 80 |
> | < 30 | 100 |
> ----------------------------------
"80% available memory" on a big system means there's still an
enormous amount of free memory on that system. It will be
surprising to administrators on those systems if the laundromat
is harvesting courtesy clients at that point.
Also, if a server is at 60-70% free memory all the time due to
non-NFSD-related memory consumption, would that mean that the
laundromat would always trim courtesy clients, even though doing
so would not be needed or beneficial?
I don't think we can use a fixed percentage ladder like this;
it might make sense for the CID5 test (or to stop other types of
inadvertent or malicious DoS attacks) but the common case
steady-state behavior doesn't seem very good.
I don't recall, are courtesy clients maintained on an LRU so
that the oldest ones would be harvested first? This mechanism
seems to harvest at random?
> . due to the overhead associated with removing client record,
> there is a limit of 128 clients to be trimmed for each
> laundromat run. This is done to prevent the laundromat from
> spending too long destroying the clients and misses performing
> its other tasks in a timely manner.
>
> . the laundromat is scheduled to run sooner if there are more
> courtesy clients need to be destroyed.
Both of these last two changes seem sensible. Can they be
broken out so they can be applied immediately?
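If those two changes were broken out, the rescheduling half might
look something like this (a userspace sketch only; the helper name
and the interval values are illustrative assumptions, not taken
from the patch):

```c
#include <assert.h>

/* Per-run trim cap from the patch description. */
#define NFSD_CLIENT_MAX_TRIM_PER_RUN 128

/* Given the number of courtesy clients still needing destruction,
 * pick the delay (in seconds) before the laundromat runs again:
 * almost immediately when trimming was capped, otherwise the
 * normal cadence. The 1s/90s values are placeholders. */
static int laundromat_next_run_secs(int remaining_to_trim)
{
	if (remaining_to_trim > NFSD_CLIENT_MAX_TRIM_PER_RUN)
		return 1;	/* work left over: run again soon */
	return 90;		/* normal laundromat cadence */
}
```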
--
Chuck Lever
On 7/5/22 7:50 AM, Chuck Lever III wrote:
> Hello Dai -
>
> I agree that tackling resource management is indeed an appropriate
> next step for courteous server. Thanks for tackling this!
>
> More comments are inline.
>
>
>> On Jul 4, 2022, at 3:05 PM, Dai Ngo <[email protected]> wrote:
>>
>> Currently the idle timeout for courtesy client is fixed at 1 day. If
>> there are lots of courtesy clients remain in the system it can cause
>> memory resource shortage that effects the operations of other modules
>> in the kernel. This problem can be observed by running pynfs nfs4.0
>> CID5 test in a loop. Eventually system runs out of memory and rpc.gssd
>> fails to add new watch:
>>
>> rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
>> No space left on device
>>
>> and alloc_inode also fails with out of memory:
>>
>> [ ... stack trace snipped ... ]
> These details are a little distracting. IMO you can summarize
> the above with just this:
>
>>> Currently the idle timeout for courtesy client is fixed at 1 day. If
>>> there are lots of courtesy clients remain in the system it can cause
>>> memory resource shortage. This problem can be observed by running
>>> pynfs nfs4.0 CID5 test in a loop.
>
>
> Now I'm going to comment in reverse order here. To add context
> for others on-list, when we designed courteous server, we had
> assumed that eventually a shrinker would be used to garbage
> collect courtesy clients. Dai has found some issues with that
> approach:
>
>
>> The shrinker method was evaluated and found it's not suitable
>> for this problem due to these reasons:
>>
>> . destroying the NFSv4 client on the shrinker context can cause
>> deadlock since nfsd_file_put calls into the underlying FS
>> code and we have no control what it will do as seen in this
>> stack trace:
> [ ... stack trace snipped ... ]
>
> I think I always had in mind that only the laundromat would be
> responsible for harvesting courtesy clients. A shrinker might
> trigger that activity, but as you point out, a deadlock is pretty
> likely if the shrinker itself had to do the harvesting.
>
>
>> . destroying the NFSv4 client has significant overhead due to
>> the upcall to user space to remove the client records which
>> might access storage device. There is potential deadlock
>> if the storage subsystem needs to allocate memory.
> The issue is that harvesting a courtesy client will involve
> an upcall to nfsdcltracker, and that will result in I/O that
> updates the tracker's database. Very likely this will require
> further allocation of memory and thus it could deadlock the
> system.
>
> Now this might also be all the demonstration that we need
> that managing courtesy resources cannot be done using the
> system's shrinker facility -- expiring a client can never
> be done when there is a direct reclaim waiting on it. I'm
> interested in other opinions on that. Neil? Bruce? Trond?
>
>
>> . the shrinker kicks in only when memory drops really low, ~<5%.
>> By this time, some other components in the system already run
>> into issue with memory shortage. For example, rpc.gssd starts
>> failing to add watches in /var/lib/nfs/rpc_pipefs/nfsd4_cb
>> once the memory consumed by these watches reaches about 1% of
>> available system memory.
> Your claim is that a courtesy client shrinker would be invoked
> too late. That might be true on a server with 2GB of RAM, but
> on a big system (say, a server with 64GB of RAM), 5% is still
> more than 3GB -- wouldn't that be enough to harvest safely?
>
> We can't optimize for tiny server systems because that almost
> always hobbles the scalability of larger systems for no good
> reason. Can you test with a large-memory server as well as a
> small-memory server?
I don't have a system with a large memory configuration; my VM has
only 6GB of memory.
I think the shrinker is not an option due to the deadlock problem,
so we should just concentrate on the laundromat route.
>
> I think the central question here is why is 5% not enough on
> all systems. I would like to understand that better. It seems
> like a primary scalability question that needs an answer so
> a good harvesting heuristic can be derived.
>
> One question in my mind is what is the maximum rate at which
> the server converts active clients to courtesy clients, and
> can the current laundromat scheme keep up with harvesting them
> at that rate? The destructive scenario seems to be when courtesy
> clients are manufactured faster than they can be harvested and
> expunged.
That seems to be the case. Currently the laundromat destroys idle
courtesy clients after 1 day, and running CID5 in a loop generates
a ton of courtesy clients. Before the 1-day expiration occurs,
available memory has already dropped to nearly 1%, and the problems
with rpc.gssd and memory allocation mentioned above were seen.
>
> (Also I recall Bruce fixed a problem recently with nfsdcltracker
> where it was doing three fsync's for every database update,
> which significantly slowed it down. You should look for that
> fix in nfs-utils and ensure the above rate measurement is done
> with the fix applied).
will do.
>
>
>> This patch addresses this problem by:
>>
>> . removing the fixed 1-day idle time limit for courtesy client.
>> Courtesy client is now allowed to remain valid as long as the
>> available system memory is above 80%.
>>
>> . when available system memory drops below 80%, laundromat starts
>> trimming older courtesy clients. The number of courtesy clients
>> to trim is a percentage of the total number of courtesy clients
>> exist in the system. This percentage is computed based on
>> the current percentage of available system memory.
>>
>> . the percentage of number of courtesy clients to be trimmed
>> is based on this table:
>>
>> ----------------------------------
>> | % memory | % courtesy clients |
>> | available | to trim |
>> ----------------------------------
>> | > 80 | 0 |
>> | > 70 | 10 |
>> | > 60 | 20 |
>> | > 50 | 40 |
>> | > 40 | 60 |
>> | > 30 | 80 |
>> | < 30 | 100 |
>> ----------------------------------
> "80% available memory" on a big system means there's still an
> enormous amount of free memory on that system. It will be
> surprising to administrators on those systems if the laundromat
> is harvesting courtesy clients at that point.
At 80% and above there is no harvesting going on.
>
> Also, if a server is at 60-70% free memory all the time due to
> non-NFSD-related memory consumption, would that mean that the
> laundromat would always trim courtesy clients, even though doing
> so would not be needed or beneficial?
It's true that there is no benefit to harvesting courtesy clients
at 60-70% if available memory stays in that range. But we don't
know whether available memory will stay in that range or continue
to drop (as in my test case with CID5). Shouldn't we start
harvesting some of the courtesy clients at this point to be on
the safe side?
>
> I don't think we can use a fixed percentage ladder like this;
> it might make sense for the CID5 test (or to stop other types of
> inadvertent or malicious DoS attacks) but the common case
> steady-state behavior doesn't seem very good.
I'm looking for suggestions for a better solution to this
problem.
>
> I don't recall, are courtesy clients maintained on an LRU so
> that the oldest ones would be harvested first?
Courtesy clients and 'normal' clients are on the same LRU list,
so the oldest ones would be harvested first.
> This mechanism seems to harvest at random?
I'm not sure what you mean here?
>
>
>> . due to the overhead associated with removing client record,
>> there is a limit of 128 clients to be trimmed for each
>> laundromat run. This is done to prevent the laundromat from
>> spending too long destroying the clients and misses performing
>> its other tasks in a timely manner.
>>
>> . the laundromat is scheduled to run sooner if there are more
>> courtesy clients need to be destroyed.
> Both of these last two changes seem sensible. Can they be
> broken out so they can be applied immediately?
Yes. Do you want me to rework the patch to include just these 2
changes for now, while we continue to look for a better solution
than the proposed fixed percentages?
Thanks,
-Dai
--
Chuck Lever
On Tue, 2022-07-05 at 14:50 +0000, Chuck Lever III wrote:
> Hello Dai -
>
> I agree that tackling resource management is indeed an appropriate
> next step for courteous server. Thanks for tackling this!
>
> More comments are inline.
>
>
> > On Jul 4, 2022, at 3:05 PM, Dai Ngo <[email protected]> wrote:
> >
> > Currently the idle timeout for courtesy client is fixed at 1 day. If
> > there are lots of courtesy clients remain in the system it can cause
> > memory resource shortage that effects the operations of other modules
> > in the kernel. This problem can be observed by running pynfs nfs4.0
> > CID5 test in a loop. Eventually system runs out of memory and rpc.gssd
> > fails to add new watch:
> >
> > rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
> > No space left on device
> >
> > and alloc_inode also fails with out of memory:
> >
> > [ ... stack trace snipped ... ]
>
> These details are a little distracting. IMO you can summarize
> the above with just this:
>
> > > Currently the idle timeout for courtesy client is fixed at 1 day. If
> > > there are lots of courtesy clients remain in the system it can cause
> > > memory resource shortage. This problem can be observed by running
> > > pynfs nfs4.0 CID5 test in a loop.
>
>
>
> Now I'm going to comment in reverse order here. To add context
> for others on-list, when we designed courteous server, we had
> assumed that eventually a shrinker would be used to garbage
> collect courtesy clients. Dai has found some issues with that
> approach:
>
>
> > The shrinker method was evaluated and found it's not suitable
> > for this problem due to these reasons:
> >
> > . destroying the NFSv4 client on the shrinker context can cause
> > deadlock since nfsd_file_put calls into the underlying FS
> > code and we have no control what it will do as seen in this
> > stack trace:
>
> [ ... stack trace snipped ... ]
>
> I think I always had in mind that only the laundromat would be
> responsible for harvesting courtesy clients. A shrinker might
> trigger that activity, but as you point out, a deadlock is pretty
> likely if the shrinker itself had to do the harvesting.
>
>
> > . destroying the NFSv4 client has significant overhead due to
> > the upcall to user space to remove the client records which
> > might access storage device. There is potential deadlock
> > if the storage subsystem needs to allocate memory.
>
> The issue is that harvesting a courtesy client will involve
> an upcall to nfsdcltracker, and that will result in I/O that
> updates the tracker's database. Very likely this will require
> further allocation of memory and thus it could deadlock the
> system.
>
> Now this might also be all the demonstration that we need
> that managing courtesy resources cannot be done using the
> system's shrinker facility -- expiring a client can never
> be done when there is a direct reclaim waiting on it. I'm
> interested in other opinions on that. Neil? Bruce? Trond?
>
That is potentially an ugly problem, but if you hit it then you really
are running the host at the redline.
Do you need to "shrink" synchronously? The scan_objects routine is
supposed to return the number of entries freed. We could (in principle)
always return 0, and wake up the laundromat to do the "real" shrinking.
It might not help out as much with direct reclaim, but it might still
help.
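A minimal userspace model of that idea: scan_objects destroys
nothing in reclaim context (it reports zero freed) and only flags
the laundromat to do the real harvesting later. All names here are
illustrative; a real implementation would hook struct shrinker and
use something like mod_delayed_work() to kick the laundromat.

```c
#include <assert.h>

/* Set when the (modeled) shrinker asks the laundromat to run. */
static int laundromat_kick_pending;

/* Modeled scan_objects callback: never harvest courtesy clients
 * here (deadlock risk under direct reclaim); just request a
 * laundromat run and report nothing freed. */
static unsigned long nfsd_courtesy_scan(unsigned long nr_to_scan)
{
	(void)nr_to_scan;		/* intentionally unused */
	laundromat_kick_pending = 1;	/* wake the laundromat soon */
	return 0;			/* zero objects freed here */
}
```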
>
> > . the shrinker kicks in only when memory drops really low, ~<5%.
> > By this time, some other components in the system already run
> > into issue with memory shortage. For example, rpc.gssd starts
> > failing to add watches in /var/lib/nfs/rpc_pipefs/nfsd4_cb
> > once the memory consumed by these watches reaches about 1% of
> > available system memory.
>
> Your claim is that a courtesy client shrinker would be invoked
> too late. That might be true on a server with 2GB of RAM, but
> on a big system (say, a server with 64GB of RAM), 5% is still
> more than 3GB -- wouldn't that be enough to harvest safely?
>
> We can't optimize for tiny server systems because that almost
> always hobbles the scalability of larger systems for no good
> reason. Can you test with a large-memory server as well as a
> small-memory server?
>
> I think the central question here is why is 5% not enough on
> all systems. I would like to understand that better. It seems
> like a primary scalability question that needs an answer so
> a good harvesting heuristic can be derived.
>
> One question in my mind is what is the maximum rate at which
> the server converts active clients to courtesy clients, and
> can the current laundromat scheme keep up with harvesting them
> at that rate? The destructive scenario seems to be when courtesy
> clients are manufactured faster than they can be harvested and
> expunged.
>
> (Also I recall Bruce fixed a problem recently with nfsdcltracker
> where it was doing three fsync's for every database update,
> which significantly slowed it down. You should look for that
> fix in nfs-utils and ensure the above rate measurement is done
> with the fix applied).
>
>
> > This patch addresses this problem by:
> >
> > . removing the fixed 1-day idle time limit for courtesy client.
> > Courtesy client is now allowed to remain valid as long as the
> > available system memory is above 80%.
> >
> > . when available system memory drops below 80%, laundromat starts
> > trimming older courtesy clients. The number of courtesy clients
> > to trim is a percentage of the total number of courtesy clients
> > exist in the system. This percentage is computed based on
> > the current percentage of available system memory.
> >
> > . the percentage of number of courtesy clients to be trimmed
> > is based on this table:
> >
> > ----------------------------------
> > | % memory | % courtesy clients |
> > | available | to trim |
> > ----------------------------------
> > | > 80 | 0 |
> > | > 70 | 10 |
> > | > 60 | 20 |
> > | > 50 | 40 |
> > | > 40 | 60 |
> > | > 30 | 80 |
> > | < 30 | 100 |
> > ----------------------------------
>
> "80% available memory" on a big system means there's still an
> enormous amount of free memory on that system. It will be
> surprising to administrators on those systems if the laundromat
> is harvesting courtesy clients at that point.
>
> Also, if a server is at 60-70% free memory all the time due to
> non-NFSD-related memory consumption, would that mean that the
> laundromat would always trim courtesy clients, even though doing
> so would not be needed or beneficial?
>
> I don't think we can use a fixed percentage ladder like this;
> it might make sense for the CID5 test (or to stop other types of
> inadvertent or malicious DoS attacks) but the common case
> steady-state behavior doesn't seem very good.
>
> I don't recall, are courtesy clients maintained on an LRU so
> that the oldest ones would be harvested first? This mechanism
> seems to harvest at random?
>
>
> > . due to the overhead associated with removing client records,
> > there is a limit of 128 clients to be trimmed for each
> > laundromat run. This is done to prevent the laundromat from
> > spending too long destroying clients and failing to perform
> > its other tasks in a timely manner.
> >
> > . the laundromat is scheduled to run sooner if there are more
> > courtesy clients that need to be destroyed.
>
> Both of these last two changes seem sensible. Can they be
> broken out so they can be applied immediately?
>
I forget...is there a hard (or soft) cap on the number of courtesy
clients that can be in play at a time? Adding such a cap might be
another option if we're concerned about this.
--
Jeff Layton <[email protected]>
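For concreteness, the trim ladder and the 128-client batch cap described in the quoted patch can be combined into a small sketch. This is illustrative Python, not the patch's actual kernel code:

```python
def trim_percentage(avail_mem_pct):
    """Map available-memory % to the % of courtesy clients to trim,
    following the ladder in the quoted patch description."""
    ladder = [(80, 0), (70, 10), (60, 20), (50, 40), (40, 60), (30, 80)]
    for threshold, pct in ladder:
        if avail_mem_pct > threshold:
            return pct
    return 100  # below 30% available: trim them all


def clients_to_trim(n_courtesy, avail_mem_pct, batch_limit=128):
    """Clients the laundromat would trim this run, capped at batch_limit
    so one pass never spends too long destroying clients."""
    want = n_courtesy * trim_percentage(avail_mem_pct) // 100
    return min(want, batch_limit)
```

So with 1000 courtesy clients at 65% available memory, the ladder asks for 200 but the batch cap holds a single pass to 128; the remainder waits for the next (sooner) laundromat run.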
> On Jul 5, 2022, at 2:42 PM, Dai Ngo <[email protected]> wrote:
>
>
> On 7/5/22 7:50 AM, Chuck Lever III wrote:
>> Hello Dai -
>>
>> I agree that tackling resource management is indeed an appropriate
>> next step for courteous server. Thanks for tackling this!
>>
>> More comments are inline.
>>
>>
>>> On Jul 4, 2022, at 3:05 PM, Dai Ngo <[email protected]> wrote:
>>>
>>> Currently the idle timeout for a courtesy client is fixed at 1 day. If
>>> lots of courtesy clients remain in the system, they can cause a memory
>>> resource shortage that affects the operation of other modules in the
>>> kernel. This problem can be observed by running the pynfs nfs4.0 CID5
>>> test in a loop. Eventually the system runs out of memory and rpc.gssd
>>> fails to add a new watch:
>>>
>>> rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
>>> No space left on device
>>>
>>> and alloc_inode also fails with out of memory:
>>>
>>> Call Trace:
>>> <TASK>
>>> dump_stack_lvl+0x33/0x42
>>> dump_header+0x4a/0x1ed
>>> oom_kill_process+0x80/0x10d
>>> out_of_memory+0x237/0x25f
>>> __alloc_pages_slowpath.constprop.0+0x617/0x7b6
>>> __alloc_pages+0x132/0x1e3
>>> alloc_slab_page+0x15/0x33
>>> allocate_slab+0x78/0x1ab
>>> ? alloc_inode+0x38/0x8d
>>> ___slab_alloc+0x2af/0x373
>>> ? alloc_inode+0x38/0x8d
>>> ? slab_pre_alloc_hook.constprop.0+0x9f/0x158
>>> ? alloc_inode+0x38/0x8d
>>> __slab_alloc.constprop.0+0x1c/0x24
>>> kmem_cache_alloc_lru+0x8c/0x142
>>> alloc_inode+0x38/0x8d
>>> iget_locked+0x60/0x126
>>> kernfs_get_inode+0x18/0x105
>>> kernfs_iop_lookup+0x6d/0xbc
>>> __lookup_slow+0xb7/0xf9
>>> lookup_slow+0x3a/0x52
>>> walk_component+0x90/0x100
>>> ? inode_permission+0x87/0x128
>>> link_path_walk.part.0.constprop.0+0x266/0x2ea
>>> ? path_init+0x101/0x2f2
>>> path_lookupat+0x4c/0xfa
>>> filename_lookup+0x63/0xd7
>>> ? getname_flags+0x32/0x17a
>>> ? kmem_cache_alloc+0x11f/0x144
>>> ? getname_flags+0x16c/0x17a
>>> user_path_at_empty+0x37/0x4b
>>> do_readlinkat+0x61/0x102
>>> __x64_sys_readlinkat+0x18/0x1b
>>> do_syscall_64+0x57/0x72
>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0
>> These details are a little distracting. IMO you can summarize
>> the above with just this:
>>
>>>> Currently the idle timeout for a courtesy client is fixed at 1 day.
>>>> If lots of courtesy clients remain in the system, they can cause a
>>>> memory resource shortage. This problem can be observed by running
>>>> the pynfs nfs4.0 CID5 test in a loop.
>>
>>
>> Now I'm going to comment in reverse order here. To add context
>> for others on-list, when we designed courteous server, we had
>> assumed that eventually a shrinker would be used to garbage
>> collect courtesy clients. Dai has found some issues with that
>> approach:
>>
>>
>>> The shrinker method was evaluated and found unsuitable for
>>> this problem for these reasons:
>>>
>>> . destroying the NFSv4 client in the shrinker context can cause
>>> deadlock, since nfsd_file_put calls into the underlying FS
>>> code and we have no control over what it will do, as seen in this
>>> stack trace:
>> [ ... stack trace snipped ... ]
>>
>> I think I always had in mind that only the laundromat would be
>> responsible for harvesting courtesy clients. A shrinker might
>> trigger that activity, but as you point out, a deadlock is pretty
>> likely if the shrinker itself had to do the harvesting.
>>
>>
>>> . destroying the NFSv4 client has significant overhead due to
>>> the upcall to user space to remove the client records, which
>>> might access the storage device. There is a potential deadlock
>>> if the storage subsystem needs to allocate memory.
>> The issue is that harvesting a courtesy client will involve
>> an upcall to nfsdcltracker, and that will result in I/O that
>> updates the tracker's database. Very likely this will require
>> further allocation of memory and thus it could deadlock the
>> system.
>>
>> Now this might also be all the demonstration that we need
>> that managing courtesy resources cannot be done using the
>> system's shrinker facility -- expiring a client can never
>> be done when there is a direct reclaim waiting on it. I'm
>> interested in other opinions on that. Neil? Bruce? Trond?
>>
>>
>>> . the shrinker kicks in only when memory drops really low, ~<5%.
>>> By this time, some other components in the system have already
>>> run into issues with memory shortage. For example, rpc.gssd starts
>>> failing to add watches in /var/lib/nfs/rpc_pipefs/nfsd4_cb
>>> once the memory consumed by these watches reaches about 1% of
>>> available system memory.
>> Your claim is that a courtesy client shrinker would be invoked
>> too late. That might be true on a server with 2GB of RAM, but
>> on a big system (say, a server with 64GB of RAM), 5% is still
>> more than 3GB -- wouldn't that be enough to harvest safely?
>>
>> We can't optimize for tiny server systems because that almost
>> always hobbles the scalability of larger systems for no good
>> reason. Can you test with a large-memory server as well as a
>> small-memory server?
>
> I don't have a system with a large memory configuration; my VM has
> only 6GB of memory.
Let's ask internally. Maybe Barry's group has a big system it
can lend us.
>> I think the central question here is why is 5% not enough on
>> all systems. I would like to understand that better. It seems
>> like a primary scalability question that needs an answer so
>> a good harvesting heuristic can be derived.
>>
>> One question in my mind is what is the maximum rate at which
>> the server converts active clients to courtesy clients, and
>> can the current laundromat scheme keep up with harvesting them
>> at that rate? The destructive scenario seems to be when courtesy
>> clients are manufactured faster than they can be harvested and
>> expunged.
>
> That seems to be the case. Currently the laundromat destroys idle
> courtesy clients after 1 day, and running CID5 in a loop generates
> a ton of courtesy clients. Before the 1-day expiration occurs,
> available memory has already dropped below 1%, and the rpc.gssd and
> memory-allocation problems mentioned above appear.
The issue is not the instantaneous amount of memory available,
it's the change in free memory. If available memory is relatively
constant, even if it's at 25%, there's no reason to trim the
courtesy list. The problem arises when the number of courtesy
clients is increasing quickly.
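Chuck's alternative heuristic, reacting to the rate of decline rather than the absolute level, might look roughly like this. An illustrative sketch only; the 5-points-per-interval threshold is invented for the example:

```python
class CourtesyTrimPolicy:
    """Trim only when available memory is falling quickly between
    laundromat runs, not merely because it is low."""

    def __init__(self, drop_pct_per_interval=5):
        self.prev_avail_pct = None
        self.drop_threshold = drop_pct_per_interval

    def should_trim(self, avail_pct):
        # Compare against the reading from the previous run: a steady
        # 25% available is fine; a fast slide from 30% to 20% is not.
        falling = (self.prev_avail_pct is not None and
                   self.prev_avail_pct - avail_pct >= self.drop_threshold)
        self.prev_avail_pct = avail_pct
        return falling
```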
>
>>
>> (Also I recall Bruce fixed a problem recently with nfsdcltracker
>> where it was doing three fsync's for every database update,
>> which significantly slowed it down. You should look for that
>> fix in nfs-utils and ensure the above rate measurement is done
>> with the fix applied).
>
> will do.
>
>>
>>
>>> This patch addresses this problem by:
>>>
>>> . removing the fixed 1-day idle time limit for courtesy clients.
>>> A courtesy client is now allowed to remain valid as long as the
>>> available system memory is above 80%.
>>>
>>> . when available system memory drops below 80%, the laundromat starts
>>> trimming older courtesy clients. The number of courtesy clients
>>> to trim is a percentage of the total number of courtesy clients
>>> existing in the system. This percentage is computed based on
>>> the current percentage of available system memory.
>>>
>>> . the percentage of courtesy clients to be trimmed
>>> is based on this table:
>>>
>>> ----------------------------------
>>> | % memory | % courtesy clients |
>>> | available | to trim |
>>> ----------------------------------
>>> | > 80 | 0 |
>>> | > 70 | 10 |
>>> | > 60 | 20 |
>>> | > 50 | 40 |
>>> | > 40 | 60 |
>>> | > 30 | 80 |
>>> | < 30 | 100 |
>>> ----------------------------------
>> "80% available memory" on a big system means there's still an
>> enormous amount of free memory on that system. It will be
>> surprising to administrators on those systems if the laundromat
>> is harvesting courtesy clients at that point.
>
> At 80% and above there is no harvesting going on.
You miss my point. Even 30% available on a big system is still
a lot of memory and not a reason (in itself) to start trimming.
>> Also, if a server is at 60-70% free memory all the time due to
>> non-NFSD-related memory consumption, would that mean that the
>> laundromat would always trim courtesy clients, even though doing
>> so would not be needed or beneficial?
>
> It's true that there is no benefit to harvesting courtesy clients
> at 60-70% if the available memory stays in this range. But we
> don't know whether available memory will stay in this range or
> continue to drop (as in my test case with CID5). Shouldn't
> we start harvesting some of the courtesy clients at this point to
> be on the safe side?
The Linux philosophy is to let the workload take as many resources
as it can. The common case is that workload resident sets nearly
always reside comfortably within available resources, so garbage
collection that happens too soon is wasted effort and can even
have negative impact.
The other side of that coin is that when we hit the knee, a Linux
system is easy to push into thrashing because then it will start
pushing things out desperately. That's kind of the situation I
would like to avoid, but I don't think trimming when there is
more than half of memory available is the answer.
>> I don't recall, are courtesy clients maintained on an LRU so
>> that the oldest ones would be harvested first?
>
> courtesy clients and 'normal' clients are in the same LRU list
> so the oldest ones would be harvested first.
OK, thanks for confirming.
>>> . due to the overhead associated with removing client records,
>>> there is a limit of 128 clients to be trimmed for each
>>> laundromat run. This is done to prevent the laundromat from
>>> spending too long destroying clients and failing to perform
>>> its other tasks in a timely manner.
>>>
>>> . the laundromat is scheduled to run sooner if there are more
>>> courtesy clients that need to be destroyed.
>> Both of these last two changes seem sensible. Can they be
>> broken out so they can be applied immediately?
>
> Yes. Do you want me to rework the patch to carry just these two
> changes for now, while we continue to look for a better solution
> than the proposed fixed percentage?
Yes. Two patches, one for each of these changes.
--
Chuck Lever
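The second of those two changes, rescheduling the laundromat sooner while a backlog of courtesy clients remains, can be sketched schematically. Illustrative only; the interval values are invented for the example:

```python
def laundromat_pass(backlog, batch_limit=128,
                    normal_interval=90, short_interval=5):
    """Destroy up to batch_limit clients this run, then choose when to
    run again: soon if work remains, else the normal interval."""
    destroyed = min(backlog, batch_limit)
    remaining = backlog - destroyed
    next_run_in = short_interval if remaining else normal_interval
    return remaining, next_run_in
```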
Hi Jeff-
> On Jul 5, 2022, at 2:48 PM, Jeff Layton <[email protected]> wrote:
>
> On Tue, 2022-07-05 at 14:50 +0000, Chuck Lever III wrote:
>> Hello Dai -
>>
>> I agree that tackling resource management is indeed an appropriate
>> next step for courteous server. Thanks for tackling this!
>>
>> More comments are inline.
>>
>>
>>> On Jul 4, 2022, at 3:05 PM, Dai Ngo <[email protected]> wrote:
>>>
>> [ ... patch description and stack trace snipped ... ]
>>
>> Now I'm going to comment in reverse order here. To add context
>> for others on-list, when we designed courteous server, we had
>> assumed that eventually a shrinker would be used to garbage
>> collect courtesy clients. Dai has found some issues with that
>> approach:
>>
>>
>>> The shrinker method was evaluated and found it's not suitable
>>> for this problem due to these reasons:
>>>
>>> . destroying the NFSv4 client on the shrinker context can cause
>>> deadlock since nfsd_file_put calls into the underlying FS
>>> code and we have no control what it will do as seen in this
>>> stack trace:
>>
>> [ ... stack trace snipped ... ]
>>
>> I think I always had in mind that only the laundromat would be
>> responsible for harvesting courtesy clients. A shrinker might
>> trigger that activity, but as you point out, a deadlock is pretty
>> likely if the shrinker itself had to do the harvesting.
>>
>>
>>> . destroying the NFSv4 client has significant overhead due to
>>> the upcall to user space to remove the client records which
>>> might access storage device. There is potential deadlock
>>> if the storage subsystem needs to allocate memory.
>>
>> The issue is that harvesting a courtesy client will involve
>> an upcall to nfsdcltracker, and that will result in I/O that
>> updates the tracker's database. Very likely this will require
>> further allocation of memory and thus it could deadlock the
>> system.
>>
>> Now this might also be all the demonstration that we need
>> that managing courtesy resources cannot be done using the
>> system's shrinker facility -- expiring a client can never
>> be done when there is a direct reclaim waiting on it. I'm
>> interested in other opinions on that. Neil? Bruce? Trond?
>>
>
> That is potentially an ugly problem, but if you hit it then you really
> are running the host at the redline.
Exactly. I'm just not sure how much we can do to keep a system
stable once it is pushed to that point, so I don't think
we should be optimizing for that state. My concern is whether
larger systems can be pushed to that state by drive-by DoS
attacks.
> Do you need to "shrink" synchronously? The scan_objects routine is
> supposed to return the number of entries freed. We could (in principle)
> always return 0, and wake up the laundromat to do the "real" shrinking.
> It might not help out as much with direct reclaim, but it might still
> help.
I suggested that as well. IIRC Dai said it still doesn't keep the
server from toppling over. I would like more information about
what the final straw is, and whether a "return 0 and kick the
laundromat" shrinker still provides some benefit.
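Jeff's "return 0 and wake the laundromat" idea corresponds to a count_objects/scan_objects pair where scan frees nothing itself. Sketched here in Python purely to show the control flow; the real interface is the kernel's struct shrinker, and the Laundromat class is a stand-in for nfsd's delayed work item:

```python
class Laundromat:
    """Stand-in for nfsd's laundromat work item."""
    def __init__(self):
        self.woken = False

    def wake(self):
        # In the kernel this would reschedule the delayed work to run soon.
        self.woken = True


class CourtesyClientShrinker:
    """Asynchronous shrinker: report the courtesy-client count to the VM,
    but defer all freeing to the laundromat so nothing is destroyed in
    reclaim context (where the cltracker upcall could deadlock)."""
    def __init__(self, laundromat):
        self.laundromat = laundromat
        self.courtesy_clients = []

    def count_objects(self):
        return len(self.courtesy_clients)

    def scan_objects(self, nr_to_scan):
        # Free nothing here; just kick the laundromat and report 0 freed.
        self.laundromat.wake()
        return 0
```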
>> [ ... quoted text snipped ... ]
>
> I forget...is there a hard (or soft) cap on the number of courtesy
> clients that can be in play at a time? Adding such a cap might be
> another option if we're concerned about this.
The current cap is that courtesy clients stay around no longer than 24
hours. The server doesn't cap the number of courtesy clients, though
it could impose a limit based on the physical memory size of the host,
as we do with other resources. Also imperfect, but it might be better
than nothing.
--
Chuck Lever
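A memory-scaled cap of the kind Chuck describes could be computed along these lines. A sketch only; the 1024-clients-per-GB ratio and the floor of 64 are invented example values, not anything taken from nfsd:

```python
def max_courtesy_clients(total_ram_bytes, clients_per_gb=1024):
    """Scale the courtesy-client cap with host memory, with a small
    floor so tiny systems still retain some courtesy clients."""
    gb = total_ram_bytes // (1 << 30)
    return max(64, gb * clients_per_gb)
```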
On Tue, Jul 05, 2022 at 07:08:32PM +0000, Chuck Lever III wrote:
>
>
> [ ... quoted text snipped ... ]
>
> >> Also, if a server is at 60-70% free memory all the time due to
> >> non-NFSD-related memory consumption, would that mean that the
> >> laundromat would always trim courtesy clients, even though doing
> >> so would not be needed or beneficial?
> >
> > it's true that there is no benefit to harvest courtesy clients
> > at 60-70% if the available memory stays in this range. But we
> > don't know whether available memory will stay in this range or
> > it will continue to drop (as in my test case with CID5). Shouldn't
> > we start harvest some of the courtesy clients at this point to
> > be on the safe side?
>
> The Linux philosophy is to let the workload take as many resources
> as it can. The common case is that workload resident sets nearly
> always reside comfortably within available resources, so garbage
> collection that happens too soon is wasted effort and can even
> have negative impact.
In this particular case (pynfs with repeated CID5), I think each client
is an NFSv4.0 client with a single open. I wonder how much memory that
ends up using per client? The client itself is only 1k; the inode,
file, dentry, nfs4 stateid, etc., probably add a few more k. If you're
filling up gigabytes of memory with that, then you may be talking about
tens to hundreds of thousands of clients, which your server probably
can't handle well anyway, and the bigger problem may be that, at a
synchronous file write per client, you're going to be waiting a long
time to expire them all.
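Bruce's back-of-the-envelope numbers can be made concrete with a quick sketch. The 4 KB per-client figure and the 10 ms per synchronous cltracker write are assumptions matching the rough sizes he mentions:

```python
def clients_per_gb(per_client_bytes):
    """How many courtesy clients fit in 1 GB at a given per-client cost."""
    return (1 << 30) // per_client_bytes


def expiry_time_seconds(n_clients, sync_write_ms=10):
    """Rough time to expire n_clients at one synchronous nfsdcltracker
    record removal each."""
    return n_clients * sync_write_ms / 1000

# ~1 KB for the client plus a few KB for inode/file/dentry/stateid:
# at ~4 KB per client, one GB holds roughly 262k courtesy clients,
# and expiring 100k of them at 10 ms each takes ~1000 seconds.
```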
I wonder what more realistic cases might look like?
In the 4.1 case you'll probably run into the session limits first.
Maybe nfsd4_get_drc_mem should be able to suggest purging courtesy
clients?
In the 4.0 case maybe we're more at risk of blowing up the nfs4 file
cache?
> The other side of that coin is that when we hit the knee, a Linux
> system is easy to push into thrashing because then it will start
> pushing things out desperately. That's kind of the situation I
> would like to avoid, but I don't think trimming when there is
> more than half of memory available is the answer.
I dunno, a (possibly somewhat arbitrary) limit on the number of courtesy
clients doesn't sound so bad to me, especially since we know the IO
required to expire them is proportional to that number.
--b.
> On Jul 6, 2022, at 11:46 AM, J. Bruce Fields <[email protected]> wrote:
>
> On Tue, Jul 05, 2022 at 07:08:32PM +0000, Chuck Lever III wrote:
>>
>>
>>> On Jul 5, 2022, at 2:42 PM, Dai Ngo <[email protected]> wrote:
>>>
>>>
>>> On 7/5/22 7:50 AM, Chuck Lever III wrote:
>>>> Hello Dai -
>>>>
>>>> I agree that tackling resource management is indeed an appropriate
>>>> next step for courteous server. Thanks for tackling this!
>>>>
>>>> More comments are inline.
>>>>
>>>>
>>>>> On Jul 4, 2022, at 3:05 PM, Dai Ngo <[email protected]> wrote:
>>>>>
>>>>> Currently the idle timeout for courtesy client is fixed at 1 day. If
>>>>> there are lots of courtesy clients remain in the system it can cause
>>>>> memory resource shortage that effects the operations of other modules
>>>>> in the kernel. This problem can be observed by running pynfs nfs4.0
>>>>> CID5 test in a loop. Eventually system runs out of memory and rpc.gssd
>>>>> fails to add new watch:
>>>>>
>>>>> rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
>>>>> No space left on device
>>>>>
>>>>> and alloc_inode also fails with out of memory:
>>>>>
>>>>> Call Trace:
>>>>> <TASK>
>>>>> dump_stack_lvl+0x33/0x42
>>>>> dump_header+0x4a/0x1ed
>>>>> oom_kill_process+0x80/0x10d
>>>>> out_of_memory+0x237/0x25f
>>>>> __alloc_pages_slowpath.constprop.0+0x617/0x7b6
>>>>> __alloc_pages+0x132/0x1e3
>>>>> alloc_slab_page+0x15/0x33
>>>>> allocate_slab+0x78/0x1ab
>>>>> ? alloc_inode+0x38/0x8d
>>>>> ___slab_alloc+0x2af/0x373
>>>>> ? alloc_inode+0x38/0x8d
>>>>> ? slab_pre_alloc_hook.constprop.0+0x9f/0x158
>>>>> ? alloc_inode+0x38/0x8d
>>>>> __slab_alloc.constprop.0+0x1c/0x24
>>>>> kmem_cache_alloc_lru+0x8c/0x142
>>>>> alloc_inode+0x38/0x8d
>>>>> iget_locked+0x60/0x126
>>>>> kernfs_get_inode+0x18/0x105
>>>>> kernfs_iop_lookup+0x6d/0xbc
>>>>> __lookup_slow+0xb7/0xf9
>>>>> lookup_slow+0x3a/0x52
>>>>> walk_component+0x90/0x100
>>>>> ? inode_permission+0x87/0x128
>>>>> link_path_walk.part.0.constprop.0+0x266/0x2ea
>>>>> ? path_init+0x101/0x2f2
>>>>> path_lookupat+0x4c/0xfa
>>>>> filename_lookup+0x63/0xd7
>>>>> ? getname_flags+0x32/0x17a
>>>>> ? kmem_cache_alloc+0x11f/0x144
>>>>> ? getname_flags+0x16c/0x17a
>>>>> user_path_at_empty+0x37/0x4b
>>>>> do_readlinkat+0x61/0x102
>>>>> __x64_sys_readlinkat+0x18/0x1b
>>>>> do_syscall_64+0x57/0x72
>>>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>> These details are a little distracting. IMO you can summarize
>>>> the above with just this:
>>>>
>>>>>> Currently the idle timeout for courtesy client is fixed at 1 day. If
>>>>>> there are lots of courtesy clients remain in the system it can cause
>>>>>> memory resource shortage. This problem can be observed by running
>>>>>> pynfs nfs4.0 CID5 test in a loop.
>>>>
>>>>
>>>> Now I'm going to comment in reverse order here. To add context
>>>> for others on-list, when we designed courteous server, we had
>>>> assumed that eventually a shrinker would be used to garbage
>>>> collect courtesy clients. Dai has found some issues with that
>>>> approach:
>>>>
>>>>
>>>>> The shrinker method was evaluated and found to be unsuitable
>>>>> for this problem for these reasons:
>>>>>
>>>>> . destroying the NFSv4 client in the shrinker context can cause
>>>>> deadlock, since nfsd_file_put calls into the underlying FS
>>>>> code and we have no control over what it will do, as seen in
>>>>> this stack trace:
>>>> [ ... stack trace snipped ... ]
>>>>
>>>> I think I always had in mind that only the laundromat would be
>>>> responsible for harvesting courtesy clients. A shrinker might
>>>> trigger that activity, but as you point out, a deadlock is pretty
>>>> likely if the shrinker itself had to do the harvesting.
>>>>
>>>>
>>>>> . destroying the NFSv4 client has significant overhead due to
>>>>> the upcall to user space to remove the client records, which
>>>>> might access the storage device. There is a potential deadlock
>>>>> if the storage subsystem needs to allocate memory.
>>>> The issue is that harvesting a courtesy client will involve
>>>> an upcall to nfsdcltracker, and that will result in I/O that
>>>> updates the tracker's database. Very likely this will require
>>>> further allocation of memory and thus it could deadlock the
>>>> system.
>>>>
>>>> Now this might also be all the demonstration that we need
>>>> that managing courtesy resources cannot be done using the
>>>> system's shrinker facility -- expiring a client can never
>>>> be done when there is a direct reclaim waiting on it. I'm
>>>> interested in other opinions on that. Neil? Bruce? Trond?
>>>>
>>>>
>>>>> . the shrinker kicks in only when memory drops really low, ~<5%.
>>>>> By this time, some other components in the system have already
>>>>> run into problems with memory shortage. For example, rpc.gssd
>>>>> starts failing to add watches in /var/lib/nfs/rpc_pipefs/nfsd4_cb
>>>>> once the memory consumed by these watches reaches about 1% of
>>>>> available system memory.
>>>> Your claim is that a courtesy client shrinker would be invoked
>>>> too late. That might be true on a server with 2GB of RAM, but
>>>> on a big system (say, a server with 64GB of RAM), 5% is still
>>>> more than 3GB -- wouldn't that be enough to harvest safely?
>>>>
>>>> We can't optimize for tiny server systems because that almost
>>>> always hobbles the scalability of larger systems for no good
>>>> reason. Can you test with a large-memory server as well as a
>>>> small-memory server?
>>>
>>> I don't have a system with large memory configuration, my VM has
>>> only 6GB of memory.
>>
>> Let's ask internally. Maybe Barry's group has a big system it
>> can lend us.
>>
>>
>>>> I think the central question here is why is 5% not enough on
>>>> all systems. I would like to understand that better. It seems
>>>> like a primary scalability question that needs an answer so
>>>> a good harvesting heuristic can be derived.
>>>>
>>>> One question in my mind is what is the maximum rate at which
>>>> the server converts active clients to courtesy clients, and
>>>> can the current laundromat scheme keep up with harvesting them
>>>> at that rate? The destructive scenario seems to be when courtesy
>>>> clients are manufactured faster than they can be harvested and
>>>> expunged.
>>>
>>> That seems to be the case. Currently the laundromat destroys idle
>>> courtesy clients after 1 day, and running CID5 in a loop generates
>>> a ton of courtesy clients. Before the 1-day expiration occurs,
>>> available memory already drops below 1%, and the problems with
>>> rpc.gssd and memory allocation mentioned above appear.
>>
>> The issue is not the instantaneous amount of memory available,
>> it's the change in free memory. If available memory is relatively
>> constant, even if it's at 25%, there's no reason to trim the
>> courtesy list. The problem arises when the number of courtesy
>> clients is increasing quickly.
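To illustrate the point above, a laundromat could key off the drop in available memory between passes rather than the absolute level. This is only a rough sketch of that idea; the struct, function, and 5% threshold are all hypothetical, not from any posted patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: trigger trimming on a fast drop in available memory
 * between laundromat runs, not on the absolute level. All names
 * and the 5% threshold are illustrative only. */
struct laundry_mem_state {
	unsigned long prev_avail_kb;	/* available memory at last run */
};

static bool should_trim_courtesy(struct laundry_mem_state *st,
				 unsigned long avail_kb,
				 unsigned long total_kb)
{
	unsigned long prev = st->prev_avail_kb;
	bool trim = false;

	/* Trim only when available memory fell by more than 5% of
	 * total RAM since the previous laundromat pass. */
	if (prev > avail_kb && (prev - avail_kb) * 100 > total_kb * 5)
		trim = true;

	st->prev_avail_kb = avail_kb;
	return trim;
}
```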
>>
>>
>>>
>>>>
>>>> (Also I recall Bruce fixed a problem recently with nfsdcltracker
>>>> where it was doing three fsync's for every database update,
>>>> which significantly slowed it down. You should look for that
>>>> fix in nfs-utils and ensure the above rate measurement is done
>>>> with the fix applied).
>>>
>>> will do.
>>>
>>>>
>>>>
>>>>> This patch addresses this problem by:
>>>>>
>>>>> . removing the fixed 1-day idle time limit for courtesy clients.
>>>>> A courtesy client is now allowed to remain valid as long as the
>>>>> available system memory is above 80%.
>>>>>
>>>>> . when available system memory drops below 80%, the laundromat
>>>>> starts trimming older courtesy clients. The number of courtesy
>>>>> clients to trim is a percentage of the total number of courtesy
>>>>> clients that exist in the system. This percentage is computed
>>>>> based on the current percentage of available system memory.
>>>>>
>>>>> . the percentage of courtesy clients to be trimmed is based
>>>>> on this table:
>>>>>
>>>>> ----------------------------------
>>>>> | % memory | % courtesy clients |
>>>>> | available | to trim |
>>>>> ----------------------------------
>>>>> | > 80 | 0 |
>>>>> | > 70 | 10 |
>>>>> | > 60 | 20 |
>>>>> | > 50 | 40 |
>>>>> | > 40 | 60 |
>>>>> | > 30 | 80 |
>>>>> | < 30 | 100 |
>>>>> ----------------------------------
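(As an aside, the table quoted above boils down to a simple step function. A minimal sketch, with a hypothetical helper name and treating exactly 30% available as the "trim everything" case, which the table leaves ambiguous:)

```c
#include <assert.h>

/* Hypothetical helper mirroring the quoted table: given the
 * percentage of available system memory, return the percentage of
 * courtesy clients the laundromat would trim. Not from the actual
 * patch; boundary behavior at exactly 30% is an assumption. */
static int courtesy_trim_pct(int mem_avail_pct)
{
	if (mem_avail_pct > 80)
		return 0;
	if (mem_avail_pct > 70)
		return 10;
	if (mem_avail_pct > 60)
		return 20;
	if (mem_avail_pct > 50)
		return 40;
	if (mem_avail_pct > 40)
		return 60;
	if (mem_avail_pct > 30)
		return 80;
	return 100;	/* <= 30% available: trim everything */
}
```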
>>>> "80% available memory" on a big system means there's still an
>>>> enormous amount of free memory on that system. It will be
>>>> surprising to administrators on those systems if the laundromat
>>>> is harvesting courtesy clients at that point.
>>>
>>> at 80% and above there is no harvesting going on.
>>
>> You miss my point. Even 30% available on a big system is still
>> a lot of memory and not a reason (in itself) to start trimming.
>>
>>
>>>> Also, if a server is at 60-70% free memory all the time due to
>>>> non-NFSD-related memory consumption, would that mean that the
>>>> laundromat would always trim courtesy clients, even though doing
>>>> so would not be needed or beneficial?
>>>
>>> it's true that there is no benefit to harvesting courtesy clients
>>> at 60-70% if the available memory stays in that range. But we
>>> don't know whether available memory will stay in this range or
>>> continue to drop (as in my test case with CID5). Shouldn't we
>>> start harvesting some of the courtesy clients at this point to
>>> be on the safe side?
>>
>> The Linux philosophy is to let the workload take as many resources
>> as it can. The common case is that workload resident sets nearly
>> always reside comfortably within available resources, so garbage
>> collection that happens too soon is wasted effort and can even
>> have negative impact.
>
> In this particular case (pynfs with repeated CID5), I think each client
> is an NFSv4.0 client with a single open. I wonder how much memory that
> ends up using per client? The client itself is only 1k, the inode,
> file, dentry, nfs4 stateid, etc., probably add a few more k. If you're
> filling up gigabytes of memory with that, then you may be talking about
> 10s-hundreds of thousands of clients, which your server probably can't
> handle well anyway, and the bigger problem may be that at a synchronous
> file write per client you're going to be waiting a long time to expire
> them all.
Exactly: the rate at which client leases/state can be created
exceeds the rate at which they can be garbage collected.
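A quick back-of-the-envelope check of Bruce's per-client estimate above (~1 KB for the client plus a few KB for inode/file/dentry/stateid; the 4 KB figure is an assumed round number, not a measurement):

```c
#include <assert.h>

/* Illustrative arithmetic only: at an assumed ~4 KB per courtesy
 * client, how many clients fit in 1 GB of memory? */
static unsigned long clients_per_gb(unsigned long kb_per_client)
{
	return (1024UL * 1024UL) / kb_per_client;
}
```

At 4 KB each, that is roughly 262k clients per GB, which is consistent with "10s-hundreds of thousands of clients" once a few GB are consumed.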
> I wonder what more realistic cases might look like?
>
> In the 4.1 case you'll probably run into the session limits first.
> Maybe nfsd4_get_drc_mem should be able to suggest purging courtesy
> clients?
>
> In the 4.0 case maybe we're more at risk of blowing up the nfs4 file
> cache?
>
>> The other side of that coin is that when we hit the knee, a Linux
>> system is easy to push into thrashing because then it will start
>> pushing things out desperately. That's kind of the situation I
>> would like to avoid, but I don't think trimming when there is
>> more than half of memory available is the answer.
>
> I dunno, a (possibly somewhat arbitrary) limit on the number of courtesy
> clients doesn't sound so bad to me, especially since we know the IO
> required to expire them is proportional to that number.
Given your analysis above I have to wonder if the issue is not the
number of courtesy clients, but the /total/ number of clients. The
server should perhaps limit the total number of clients, given
concerns about how much memory they consume and how long it takes
to expunge each of them.
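If we went that way, the check itself would be trivial; the hard part is picking the cap. A sketch with purely illustrative numbers and names (e.g. scaling the cap to roughly one client per MB of RAM is an assumption, not a recommendation):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical total-client cap check: both the helper name and the
 * one-client-per-MB scaling are illustrative only. */
static bool over_client_limit(unsigned long nr_clients,
			      unsigned long totalram_kb)
{
	unsigned long max_clients = totalram_kb / 1024;

	return nr_clients > max_clients;
}
```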
--
Chuck Lever