Return-Path: Received: from mx1.redhat.com ([209.132.183.28]:50706 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752789AbeA0PjI (ORCPT ); Sat, 27 Jan 2018 10:39:08 -0500 From: "Benjamin Coddington" To: "Jeff Layton" Cc: "open list:NFS, SUNRPC, AND..." , "Tom Haynes" , "Christoph Hellwig" , "Bruce Fields" Subject: Re: CB_LAYOUTRECALL "deadlock" with in-kernel flexfiles server and XFS Date: Sat, 27 Jan 2018 10:39:06 -0500 Message-ID: In-Reply-To: <1470929036.30238.14.camel@redhat.com> References: <1470929036.30238.14.camel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; format=flowed Sender: linux-nfs-owner@vger.kernel.org List-ID: On 11 Aug 2016, at 11:23, Jeff Layton wrote: > I was playing around with the in-kernel flexfiles server today, and I > seem to be hitting a deadlock when using it on an XFS-exported > filesystem. Here's the stack trace of how the CB_LAYOUTRECALL occurs: > > [ 928.736139] CPU: 0 PID: 846 Comm: nfsd Tainted: G OE > 4.8.0-rc1+ #3 > [ 928.737040] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.9.1-1.fc24 04/01/2014 > [ 928.738009] 0000000000000286 000000006125f50e ffff91153845b878 > ffffffff8f463853 > [ 928.738906] ffff91152ec194d0 ffff91152d31d9c0 ffff91153845b8a8 > ffffffffc045936f > [ 928.739788] ffff91152c051980 ffff91152d31d9c0 ffff91152c051540 > ffff9115361b8a58 > [ 928.740697] Call Trace: > [ 928.740998] [] dump_stack+0x86/0xc3 > [ 928.741570] [] > nfsd4_recall_file_layout+0x17f/0x190 [nfsd] > [ 928.742380] [] nfsd4_layout_lm_break+0x1d/0x30 > [nfsd] > [ 928.743115] [] __break_lease+0x118/0x6a0 > [ 928.743759] [] xfs_break_layouts+0x79/0x120 > [xfs] > [ 928.744462] [] > xfs_file_aio_write_checks+0x94/0x1f0 [xfs] > [ 928.745251] [] > xfs_file_buffered_aio_write+0x7b/0x330 [xfs] > [ 928.746063] [] xfs_file_write_iter+0xec/0x140 > [xfs] > [ 928.746803] [] do_iter_readv_writev+0xb9/0x140 > [ 928.747478] [] do_readv_writev+0x19b/0x240 > [ 928.748146] [] ? > xfs_file_buffered_aio_write+0x330/0x330 [xfs] > [ 928.748956] [] ? do_dentry_open+0x28b/0x310 > [ 928.749614] [] ? > xfs_extent_busy_ag_cmp+0x20/0x20 [xfs] > [ 928.750367] [] vfs_writev+0x3f/0x50 > [ 928.750934] [] nfsd_vfs_write+0xca/0x3a0 [nfsd] > [ 928.751608] [] nfsd_write+0x485/0x780 [nfsd] > [ 928.752263] [] nfsd3_proc_write+0xbc/0x150 > [nfsd] > [ 928.752973] [] nfsd_dispatch+0xb8/0x1f0 [nfsd] > [ 928.753642] [] svc_process_common+0x42f/0x690 > [sunrpc] > [ 928.754395] [] svc_process+0x118/0x330 [sunrpc] > [ 928.755080] [] nfsd+0x19c/0x2b0 [nfsd] > [ 928.755681] [] ? nfsd+0x5/0x2b0 [nfsd] > [ 928.756274] [] ? nfsd_destroy+0x190/0x190 [nfsd] > [ 928.756991] [] kthread+0x101/0x120 > [ 928.757563] [] ? > trace_hardirqs_on_caller+0xf5/0x1b0 > [ 928.758282] [] ret_from_fork+0x1f/0x40 > [ 928.758875] [] ? > kthread_create_on_node+0x250/0x250 > > > So the client gets a flexfiles layout, and then tries to issue a v3 > WRITE against the file. XFS then recalls the layout, but the client > can't return the layout until the v3 WRITE completes. Eventually this > should resolve itself after 2 lease periods, but that's quite a long > time. > > I guess XFS requires recalling block and SCSI layouts when the server > wants to issue a write (or someone writes to it locally), but that > seems like it shouldn't be happening when the layout is a flexfiles > layout. > > Any thoughts on what the right fix is here? > > On a related note, knfsd will spam the heck out of the client with > CB_LAYOUTRECALLs during this time. I think we ought to consider fixing > the server not to treat an NFS_OK return from the client like > NFS4ERR_DELAY there, but that would mean a different mechanism for > timing out a CB_LAYOUTRECALL. I'm getting into similar trouble with SCSI layouts when the client ends up submitting a WRITE because the IO is not page aligned, but it already holds a layout for that range. It looks like the server sends a CB_LAYOUTRECALL, but the client has to answer NFS4ERR_DELAY because it is still holding the layout. Probably, the client should return any layouts it holds for that range before doing IO through the MDS. Alternatively, shouldn't the MDS accept IO from the same client that holds a layout for that range, rather than recall that layout? RFC 5661 Section 20.3.4 talks about the client submitting WRITEs before responding to CB_LAYOUTRECALL: "As always, the client may write the data through the metadata server." I'm trying to find the discussion that resulted in this commit: commit 6b9b21073d3b250e17812cd562fffc9006962b39 Author: Jeff Layton Date: Tue Dec 8 07:23:48 2015 -0500 nfsd: give up on CB_LAYOUTRECALLs after two lease periods Why should we poll the client if the client answers with NFS4ERR_DELAY? Can we instead just wait for the layout to be returned? Also, I think the 2*lease period timeout is currently broken because we reset tk_start after every call.. but that's not really causing any trouble. Ben