From: Jeff Layton <jeff.layton@primarydata.com>
Date: Thu, 18 Sep 2014 08:01:07 -0400
To: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>,
        Jeff Layton <jeff.layton@primarydata.com>, linux-nfs@vger.kernel.org
Subject: Re: [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS
 filesystems.
Message-ID: <20140918080107.77bb52c5@tlielax.poochiereds.net>
In-Reply-To: <20140918060317.23854.40603.stgit@notabene.brown>
References: <20140918055907.23854.32118.stgit@notabene.brown>
	<20140918060317.23854.40603.stgit@notabene.brown>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-nfs-owner@vger.kernel.org

On Thu, 18 Sep 2014 16:03:17 +1000
NeilBrown <neilb@suse.de> wrote:

> Support for loop-back mounted NFS filesystems is useful when NFS is
> used to access shared storage in a high-availability cluster.
> 
> If the node running the NFS server fails, some other node can mount the
> filesystem and start providing NFS service.  If that node already had
> the filesystem NFS mounted, it will now have it loop-back mounted.
> 
> nfsd can suffer a deadlock when allocating memory and entering direct
> reclaim.
> While direct reclaim does not write to the NFS filesystem it can send
> and wait for a COMMIT through nfs_release_page().
> 
> This patch modifies nfs_release_page() to wait a limited time for the
> commit to complete - one second.  If the commit doesn't complete
> in this time, nfs_release_page() will fail.  This means it might now
> fail in some cases where it wouldn't before.  These cases are only
> when 'gfp' includes '__GFP_WAIT'.
> 
> nfs_release_page() is only called by try_to_release_page(), and that
> can only be called on an NFS page with required 'gfp' flags from
>  - page_cache_pipe_buf_steal() in splice.c
>  - shrink_page_list() in vmscan.c
>  - invalidate_inode_pages2_range() in truncate.c
> 
> The first two handle failure quite safely.  The last is only called
> after ->launder_page() has been called, and that will have waited
> for the commit to finish already.
> 
> So aborting if the commit takes longer than 1 second is perfectly safe.
> 
> If nfs_release_page() is called on a sequence of pages which are all
> in the same file which is blocked on COMMIT, each page could
> contribute a 1 second delay which could be come excessive.  I have
> seen delays of as much as 208 seconds.
> 
> To keep the delay to one second, the bdi is marked as write-congested
> if the commit didn't finished.  Once it does finish, the
> write-congested flag will be cleared.
> 
> With this, the longest total delay in try_to_free_pages that I have
> seen in under 3 seconds.  With no waiting in nfs_release_page at all
> I have seen delays of nearly 1.5 seconds.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfs/file.c  |   30 ++++++++++++++++++++----------
>  fs/nfs/write.c |    7 +++++++
>  2 files changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 524dd80d1898..febba950d8a6 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>  
>  	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
>  
> -	/* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
> -	 * doing this memory reclaim for a fs-related allocation.
> +	/* Always try to initiate a 'commit' if relevant, but only
> +	 * wait for it if __GFP_WAIT is set and the calling process is
> +	 * allowed to block.  Even then, only wait 1 second and only
> +	 * if the 'bdi' is not congested.
> +	 * Waiting indefinitely can cause deadlocks when the NFS
> +	 * server is on this machine, and there is no particular need
> +	 * to wait extensively here.  A short wait has the benefit
> +	 * that someone else can worry about the freezer.
>  	 */
> -	if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
> -	    !(current->flags & PF_FSTRANS)) {
> -		int how = FLUSH_SYNC;
> -
> -		/* Don't let kswapd deadlock waiting for OOM RPC calls */
> -		if (current_is_kswapd())
> -			how = 0;
> -		nfs_commit_inode(mapping->host, how);
> +	if (mapping) {
> +		struct nfs_server *nfss = NFS_SERVER(mapping->host);
> +		nfs_commit_inode(mapping->host, 0);
> +		if ((gfp & __GFP_WAIT) &&
> +		    !current_is_kswapd() &&
> +		    !(current->flags & PF_FSTRANS) &&
> +		    !bdi_write_congested(&nfss->backing_dev_info))
> +			wait_on_page_bit_killable_timeout(page, PG_private,
> +							  HZ);
> +		if (PagePrivate(page))
> +			set_bdi_congested(&nfss->backing_dev_info,
> +					  BLK_RW_ASYNC);

I've never had a great feel for the BDI congestion stuff, but won't
this have some unintended effects?

For instance, suppose the VM decides to try to free this page and
passes in a gfp mask that doesn't contain __GFP_WAIT. We issue the
COMMIT, but don't wait for it. The COMMIT is actually going to go
reasonably fast, but we now set the BDI congested because we didn't
wait for it to occur.

That in turn causes writeout for other inodes on this BDI to get
throttled even though there really is no congestion. It just looks that
way due to how releasepage got called.

Am I making mountains out of molehills here?

>  	}
>  	/* If PagePrivate() is set, then the page is not freeable */
>  	if (PagePrivate(page))
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 175d5d073ccf..3066c7fcb565 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
>  		if (likely(!PageSwapCache(head->wb_page))) {
>  			set_page_private(head->wb_page, 0);
>  			ClearPagePrivate(head->wb_page);
> +			smp_mb__after_atomic();
> +			wake_up_page(head->wb_page, PG_private);
>  			clear_bit(PG_MAPPED, &head->wb_flags);
>  		}
>  		nfsi->npages--;
> @@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
>  	struct nfs_page	*req;
>  	int status = data->task.tk_status;
>  	struct nfs_commit_info cinfo;
> +	struct nfs_server *nfss;
>  
>  	while (!list_empty(&data->pages)) {
>  		req = nfs_list_entry(data->pages.next);
> @@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
>  	next:
>  		nfs_unlock_and_release_request(req);
>  	}
> +	nfss = NFS_SERVER(data->inode);
> +	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> +		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> +
>  	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
>  	if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
>  		nfs_commit_clear_lock(NFS_I(data->inode));
> 
> 


-- 
Jeff Layton <jlayton@primarydata.com>