Return-Path:
Message-ID: <1517089301.3516.9.camel@redhat.com>
Subject: Re: CB_LAYOUTRECALL "deadlock" with in-kernel flexfiles server and XFS
From: Jeff Layton
To: Benjamin Coddington
Cc: "open list:NFS, SUNRPC, AND...", Tom Haynes, Christoph Hellwig, Bruce Fields
Date: Sat, 27 Jan 2018 16:41:41 -0500
In-Reply-To:
References: <1470929036.30238.14.camel@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
List-ID:

On Sat, 2018-01-27 at 10:39 -0500, Benjamin Coddington wrote:
> On 11 Aug 2016, at 11:23, Jeff Layton wrote:
>
> > I was playing around with the in-kernel flexfiles server today, and I
> > seem to be hitting a deadlock when using it on an XFS-exported
> > filesystem. Here's the stack trace of how the CB_LAYOUTRECALL occurs:
> >
> > [ 928.736139] CPU: 0 PID: 846 Comm: nfsd Tainted: G OE
> > 4.8.0-rc1+ #3
> > [ 928.737040] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS 1.9.1-1.fc24 04/01/2014
> > [ 928.738009] 0000000000000286 000000006125f50e ffff91153845b878
> > ffffffff8f463853
> > [ 928.738906] ffff91152ec194d0 ffff91152d31d9c0 ffff91153845b8a8
> > ffffffffc045936f
> > [ 928.739788] ffff91152c051980 ffff91152d31d9c0 ffff91152c051540
> > ffff9115361b8a58
> > [ 928.740697] Call Trace:
> > [ 928.740998] [] dump_stack+0x86/0xc3
> > [ 928.741570] []
> > nfsd4_recall_file_layout+0x17f/0x190 [nfsd]
> > [ 928.742380] [] nfsd4_layout_lm_break+0x1d/0x30
> > [nfsd]
> > [ 928.743115] [] __break_lease+0x118/0x6a0
> > [ 928.743759] [] xfs_break_layouts+0x79/0x120
> > [xfs]
> > [ 928.744462] []
> > xfs_file_aio_write_checks+0x94/0x1f0 [xfs]
> > [ 928.745251] []
> > xfs_file_buffered_aio_write+0x7b/0x330 [xfs]
> > [ 928.746063] [] xfs_file_write_iter+0xec/0x140
> > [xfs]
> > [ 928.746803] [] do_iter_readv_writev+0xb9/0x140
> > [ 928.747478] [] do_readv_writev+0x19b/0x240
> > [ 928.748146] [] ?
> > xfs_file_buffered_aio_write+0x330/0x330 [xfs]
> > [ 928.748956] [] ? do_dentry_open+0x28b/0x310
> > [ 928.749614] [] ?
> > xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
> > [ 928.750367] [] vfs_writev+0x3f/0x50
> > [ 928.750934] [] nfsd_vfs_write+0xca/0x3a0 [nfsd]
> > [ 928.751608] [] nfsd_write+0x485/0x780 [nfsd]
> > [ 928.752263] [] nfsd3_proc_write+0xbc/0x150
> > [nfsd]
> > [ 928.752973] [] nfsd_dispatch+0xb8/0x1f0 [nfsd]
> > [ 928.753642] [] svc_process_common+0x42f/0x690
> > [sunrpc]
> > [ 928.754395] [] svc_process+0x118/0x330 [sunrpc]
> > [ 928.755080] [] nfsd+0x19c/0x2b0 [nfsd]
> > [ 928.755681] [] ? nfsd+0x5/0x2b0 [nfsd]
> > [ 928.756274] [] ? nfsd_destroy+0x190/0x190 [nfsd]
> > [ 928.756991] [] kthread+0x101/0x120
> > [ 928.757563] [] ?
> > trace_hardirqs_on_caller+0xf5/0x1b0
> > [ 928.758282] [] ret_from_fork+0x1f/0x40
> > [ 928.758875] [] ?
> > kthread_create_on_node+0x250/0x250
> >
> >
> > So the client gets a flexfiles layout, and then tries to issue a v3
> > WRITE against the file. XFS then recalls the layout, but the client
> > can't return the layout until the v3 WRITE completes. Eventually this
> > should resolve itself after 2 lease periods, but that's quite a long
> > time.
> >
> > I guess XFS requires recalling block and SCSI layouts when the server
> > wants to issue a write (or someone writes to it locally), but that
> > seems like it shouldn't be happening when the layout is a flexfiles
> > layout.
> >
> > Any thoughts on what the right fix is here?
> >
> > On a related note, knfsd will spam the heck out of the client with
> > CB_LAYOUTRECALLs during this time. I think we ought to consider fixing
> > the server not to treat an NFS_OK return from the client like
> > NFS4ERR_DELAY there, but that would mean a different mechanism for
> > timing out a CB_LAYOUTRECALL.
>
> I'm getting into similar trouble with SCSI layouts when the client ends up
> submitting a WRITE because the IO is not page aligned, but it already holds
> a layout for that range.
> It looks like the server sends a CB_LAYOUTRECALL, but the client has to
> answer NFS4ERR_DELAY because it is still holding the layout.
>
> Probably, the client should return any layouts it holds for that range
> before doing IO through the MDS.
>

Yes, that might be good. Could even prefix the WRITE compound with a
LAYOUTRETURN if you want to get fancy. :)

> Alternatively, shouldn't the MDS accept IO from the same client that
> holds a layout for that range, rather than recall that layout? RFC 5661
> Section 20.3.4 talks about the client submitting WRITEs before responding
> to CB_LAYOUTRECALL: "As always, the client may write the data through the
> metadata server."
>

Agreed. That seems reasonable too.

> I'm trying to find the discussion that resulted in this commit:
>
> commit 6b9b21073d3b250e17812cd562fffc9006962b39
> Author: Jeff Layton
> Date: Tue Dec 8 07:23:48 2015 -0500
>
>     nfsd: give up on CB_LAYOUTRECALLs after two lease periods
>
> Why should we poll the client if the client answers with NFS4ERR_DELAY?
> Can we instead just wait for the layout to be returned?
>

No. NFS4ERR_DELAY just means "I'm too busy to answer right now, please
call again later". You can't infer that the client has made any note of
the CB_LAYOUTRECALL at all since it didn't succeed.

Returning NFS4_OK on a CB_LAYOUTRECALL just means that you acknowledge
that it has been recalled and will eventually send a LAYOUTRETURN. It
doesn't mean that you are immediately returning it.

Probably what the client should do in this situation is mark the layout
as having been recalled and return NFS4_OK instead of NFS4ERR_DELAY. It
seems like that ought to be possible, but I haven't looked at the code
to see why that isn't occurring.

> Also, I think the 2*lease period timeout is currently broken because we
> reset tk_start after every call.. but that's not really causing any
> trouble.
>

It'd be good to fix that too, since you're in there...

--
Jeff Layton
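
To illustrate the tk_start point above, here is a minimal sketch in plain C. It is not the nfsd code. It assumes, as the quoted text describes, that the give-up check compares the current time against a start timestamp, and that resetting that timestamp on every retry keeps the 2*lease deadline from ever being reached. All names (recall_state, recall_expired, retries_until_giveup) are hypothetical, and time is modelled as a simple counter in seconds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one outstanding CB_LAYOUTRECALL (not nfsd's types). */
struct recall_state {
    long first_sent; /* time of the first attempt */
    long last_sent;  /* time of the most recent attempt */
};

/* Give up once more than two lease periods have passed since `start`. */
static bool recall_expired(long start, long now, long lease)
{
    return now - start > 2 * lease;
}

/*
 * Simulate a client that answers every recall with NFS4ERR_DELAY.
 * The server retries every `delay` seconds. If `reset_start` is true,
 * the deadline is measured from the previous attempt (the behaviour
 * the quoted text calls broken); otherwise from the first attempt.
 * Returns the number of retries before giving up, or -1 if the server
 * never gives up within `max_tries`.
 */
static int retries_until_giveup(long lease, long delay, bool reset_start,
                                int max_tries)
{
    struct recall_state rs = { .first_sent = 0, .last_sent = 0 };
    long now = 0;
    int tries = 0;

    assert(lease > 0 && delay > 0);

    while (tries < max_tries) {
        now += delay; /* wait, then retry the callback */
        tries++;

        long start = reset_start ? rs.last_sent : rs.first_sent;
        if (recall_expired(start, now, lease))
            return tries;

        rs.last_sent = now; /* record this attempt */
    }
    return -1;
}
```

With a 90 second lease and a 2 second retry interval, retries_until_giveup(90, 2, false, 1000) gives up on retry 91, just past the 180 second mark, while retries_until_giveup(90, 2, true, 1000) returns -1 because the measured interval never exceeds one retry delay. Keeping the first-attempt timestamp separate from the per-call one is the design choice being illustrated.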