Hi,
I have a customer report of NFS getting stuck due to a workqueue
lockup.
This appears to be triggered by calling 'close' on a 5TB file.
The rpc_release set up by nfs4_do_close() calls a final iput()
on the inode which leads to nfs4_evict_inode() which calls
truncate_inode_pages_final(). On a 5TB file, this can take a little
while.
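For reference, the tail of that path looks roughly like this (my
paraphrase of fs/nfs/nfs4super.c; details vary by kernel version):

	/* superblock ->evict_inode, reached from the final iput() */
	static void nfs4_evict_inode(struct inode *inode)
	{
		/* Frees every page cached for this inode.  Cost is
		 * proportional to the number of in-memory pages; a
		 * fully-cached 5TB file is ~1.3 billion 4KiB pages. */
		truncate_inode_pages_final(&inode->i_data);
		clear_inode(inode);
		/* ... delegation return etc. elided ... */
	}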
Documentation for workqueue says
    Generally, work items are not expected to hog a CPU and consume many
    cycles.
so maybe that isn't a good idea.
truncate_inode_pages_final() does call cond_resched(), but workqueue
doesn't take notice of that. By default it only runs more threads on
the same CPU if the first thread actually sleeps. So the net result is
that there are lots of rpc_async_schedule tasks queued up behind the
iput, waiting for it to finish rather than running concurrently.
I believe this can be fixed by setting WQ_CPU_INTENSIVE on the nfsiod
workqueue. This flag causes the workqueue core to schedule more
threads as needed even if the existing threads never sleep.
I don't know if this is a good idea as it might spawn lots of threads
needlessly when rpc_release functions don't have lots of work to do.
Another option might be to create a separate nfsiod_intensive_workqueue
with this flag set, and hand all iput requests over to this workqueue.
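Concretely, that alternative might look something like this (untested
sketch; the queue name, the iput_work field, and the nfs_iput_work()
helper are my invention):

	/* a second workqueue reserved for potentially long-running iputs */
	struct workqueue_struct *nfsiod_intensive_workqueue;

	nfsiod_intensive_workqueue =
		alloc_workqueue("nfsiod-intensive",
				WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 0);

	/* then, instead of calling iput() directly from rpc_release,
	 * each caller would hand the inode over to that queue: */
	INIT_WORK(&data->iput_work, nfs_iput_work);
	queue_work(nfsiod_intensive_workqueue, &data->iput_work);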
I've asked the customer to test with this simple patch.
Any thoughts or suggestions most welcome,
Thanks,
NeilBrown
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b9d0921cb4fe..d40125a28a77 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2153,12 +2153,14 @@ EXPORT_SYMBOL_GPL(nfsiod_workqueue);
/*
* start up the nfsiod workqueue
+ * WQ_CPU_INTENSIVE is needed because we might call iput()
+ * which might spend a lot of time tearing down a large file.
*/
static int nfsiod_start(void)
{
struct workqueue_struct *wq;
dprintk("RPC: creating workqueue nfsiod\n");
- wq = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM, 0);
+ wq = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 0);
if (wq == NULL)
return -ENOMEM;
nfsiod_workqueue = wq;
On Thu, 2020-11-05 at 11:12 +1100, NeilBrown wrote:
>
> Hi,
> I have a customer report of NFS getting stuck due to a workqueue
> lockup.
>
> This appears to be triggered by calling 'close' on a 5TB file.
> The rpc_release set up by nfs4_do_close() calls a final iput()
> on the inode which leads to nfs4_evict_inode() which calls
> truncate_inode_pages_final(). On a 5TB file, this can take a little
> while.
>
> Documentation for workqueue says
>     Generally, work items are not expected to hog a CPU and consume many
>     cycles.
>
> so maybe that isn't a good idea.
> truncate_inode_pages_final() does call cond_resched(), but workqueue
> doesn't take notice of that. By default it only runs more threads on
> the same CPU if the first thread actually sleeps. So the net result is
> that there are lots of rpc_async_schedule tasks queued up behind the
> iput, waiting for it to finish rather than running concurrently.
>
> I believe this can be fixed by setting WQ_CPU_INTENSIVE on the nfsiod
> workqueue. This flag causes the workqueue core to schedule more
> threads as needed even if the existing threads never sleep.
> I don't know if this is a good idea as it might spawn lots of threads
> needlessly when rpc_release functions don't have lots of work to do.
>
> Another option might be to create a separate nfsiod_intensive_workqueue
> with this flag set, and hand all iput requests over to this workqueue.
>
> I've asked the customer to test with this simple patch.
>
> Any thoughts or suggestions most welcome,
>
Isn't this a general problem (i.e. one that is not specific to NFS)
when you have multi-terabyte caches? Why wouldn't io_uring be
vulnerable to the same issue, for instance?
The thing is that truncate_inode_pages() has plenty of latency-reducing
cond_resched() calls that should ensure that other threads get scheduled,
so this problem doesn't strictly meet the 'CPU intensive' criterion
that I understand WQ_CPU_INTENSIVE to be designed for.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, Nov 05 2020, Trond Myklebust wrote:
> On Thu, 2020-11-05 at 11:12 +1100, NeilBrown wrote:
>>
>> Hi,
>> I have a customer report of NFS getting stuck due to a workqueue
>> lockup.
>>
>> This appears to be triggered by calling 'close' on a 5TB file.
>> The rpc_release set up by nfs4_do_close() calls a final iput()
>> on the inode which leads to nfs4_evict_inode() which calls
>> truncate_inode_pages_final(). On a 5TB file, this can take a little
>> while.
>>
>> Documentation for workqueue says
>>     Generally, work items are not expected to hog a CPU and consume many
>>     cycles.
>>
>> so maybe that isn't a good idea.
>> truncate_inode_pages_final() does call cond_resched(), but workqueue
>> doesn't take notice of that. By default it only runs more threads on
>> the same CPU if the first thread actually sleeps. So the net result is
>> that there are lots of rpc_async_schedule tasks queued up behind the
>> iput, waiting for it to finish rather than running concurrently.
>>
>> I believe this can be fixed by setting WQ_CPU_INTENSIVE on the nfsiod
>> workqueue. This flag causes the workqueue core to schedule more
>> threads as needed even if the existing threads never sleep.
>> I don't know if this is a good idea as it might spawn lots of threads
>> needlessly when rpc_release functions don't have lots of work to do.
>>
>> Another option might be to create a separate nfsiod_intensive_workqueue
>> with this flag set, and hand all iput requests over to this workqueue.
>>
>> I've asked the customer to test with this simple patch.
>>
>> Any thoughts or suggestions most welcome,
>>
>
> Isn't this a general problem (i.e. one that is not specific to NFS)
> when you have multi-terabyte caches? Why wouldn't io_uring be
> vulnerable to the same issue, for instance?
Maybe it is. I haven't followed io_uring in any detail, but if it calls
iput() from a workqueue which isn't marked WQ_CPU_INTENSIVE, then it
will suffer the same problem.
And maybe that means we should look for a more general solution and I'm
not against that (though my customer needs a fix *now*).
>
> The thing is that truncate_inode_pages() has plenty of latency-reducing
> cond_resched() calls that should ensure that other threads get scheduled,
> so this problem doesn't strictly meet the 'CPU intensive' criterion
> that I understand WQ_CPU_INTENSIVE to be designed for.
>
It is relevant to observe the end of should_reclaim_retry() in mm/page_alloc.c.
	/*
	 * Memory allocation/reclaim might be called from a WQ context and the
	 * current implementation of the WQ concurrency control doesn't
	 * recognize that a particular WQ is congested if the worker thread is
	 * looping without ever sleeping. Therefore we have to do a short sleep
	 * here rather than calling cond_resched().
	 */
	if (current->flags & PF_WQ_WORKER)
		schedule_timeout_uninterruptible(1);
	else
		cond_resched();
This suggests that cond_resched() isn't sufficient in the context of a
workqueue worker. If cond_resched() were changed to include that test,
or if the cond_resched() in truncate_inode_pages_range() were changed to
use the above code fragment, either change would equally solve my current
problem. So if you, as NFS maintainer, were happy to second one of those,
I'm happy to accept your support and carry the mission further into the
core.
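For concreteness, the latter could be wrapped up as a helper along
these lines (my naming; just a sketch of the idea):

	/* Like cond_resched(), but guarantee a real sleep when running
	 * in a workqueue worker, so that CMWQ will release the CPU to
	 * other work items (mirrors the should_reclaim_retry() fragment
	 * quoted above). */
	static inline void cond_resched_wq(void)
	{
		if (current->flags & PF_WQ_WORKER)
			schedule_timeout_uninterruptible(1);
		else
			cond_resched();
	}

truncate_inode_pages_range() could then call this in its loops in place
of plain cond_resched().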
w.r.t WQ_CPU_INTENSIVE in general: the goal as I understand it is to
have multiple runnable workers on the same CPU, leaving it to the
scheduler to balance them. Unless kernel preempt is enabled, that means
that each thread *must* call cond_resched() periodically, otherwise
there is nothing the scheduler can do, and some lockup-detector would fire.
Probably the best "fix" would be to make WQ_CPU_INTENSIVE irrelevant by
getting the workqueue core to notice when a worker calls cond_resched(),
but I don't know how to do that.
NeilBrown
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
nfsiod is currently a concurrency-managed workqueue (CMWQ).
This means that work items scheduled to nfsiod on a given CPU are queued
behind all other work items queued on any CMWQ on the same CPU. This
can introduce unexpected latency.

Occasionally nfsiod can even cause excessive latency. If the work item
to complete a CLOSE request calls the final iput() on an inode, the
address_space of that inode will be dismantled. This takes time
proportional to the number of in-memory pages, which on a large host
working on large files (e.g. 5TB), can be a large number of pages
resulting in a noticeable number of seconds.

We can avoid these latency problems by switching nfsiod to WQ_UNBOUND.
This causes each concurrent work item to get a dedicated thread which
can be scheduled to an idle CPU.

There is precedent for this as several other filesystems use WQ_UNBOUND
workqueues for handling various async events.
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index aa6493905bbe..43af053f467a 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2180,7 +2180,7 @@ static int nfsiod_start(void)
{
struct workqueue_struct *wq;
dprintk("RPC: creating workqueue nfsiod\n");
- wq = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM, 0);
+ wq = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
if (wq == NULL)
return -ENOMEM;
nfsiod_workqueue = wq;
--
2.29.2
On Fri, Nov 27, 2020 at 11:24:33AM +1100, NeilBrown wrote:
>
> nfsiod is currently a concurrency-managed workqueue (CMWQ).
> This means that work items scheduled to nfsiod on a given CPU are queued
> behind all other work items queued on any CMWQ on the same CPU. This
> can introduce unexpected latency.
>
> Occasionally nfsiod can even cause excessive latency. If the work item
> to complete a CLOSE request calls the final iput() on an inode, the
> address_space of that inode will be dismantled. This takes time
> proportional to the number of in-memory pages, which on a large host
> working on large files (e.g. 5TB), can be a large number of pages
> resulting in a noticeable number of seconds.
>
> We can avoid these latency problems by switching nfsiod to WQ_UNBOUND.
> This causes each concurrent work item to get a dedicated thread which
> can be scheduled to an idle CPU.
>
> There is precedent for this as several other filesystems use WQ_UNBOUND
> workqueues for handling various async events.
>
> Signed-off-by: NeilBrown <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Thanks.
--
tejun