Date: Wed, 4 Apr 2018 11:55:18 +0200
From: Jan Kara
To: Dan Williams
Cc: linux-nvdimm@lists.01.org, Jan Kara, Dave Chinner,
	"Darrick J. Wong", Ross Zwisler, Christoph Hellwig,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, snitzer@redhat.com
Subject: Re: [PATCH v8 18/18] xfs, dax: introduce xfs_break_dax_layouts()
Message-ID: <20180404095518.65exgpxuca3tqhav@quack2.suse.cz>

On Fri 30-03-18 21:03:46, Dan Williams wrote:
> xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
> for busy / pinned dax pages and waits for those pages to go idle before
> any potential extent unmap operation.
>
> dax_layout_busy_page() handles synchronizing against new page-busy
> events (get_user_pages). It invalidates all mappings to trigger the
> get_user_pages slow path which will eventually block on the xfs inode
> lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
> busy page it returns it for xfs to wait for the page-idle event that
> will fire when the page reference count reaches 1 (recall ZONE_DEVICE
> pages are idle at count 1, see generic_dax_pagefree()).
>
> While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
> deadlock the process that might be trying to elevate the page count of
> more pages before arranging for any of them to go idle. I.e. the typical
> case of submitting I/O is that iov_iter_get_pages() elevates the
> reference count of all pages in the I/O before starting I/O on the first
> page.
> The process of elevating the reference count of all pages involved
> in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
>
> Cc: Jan Kara
> Cc: Dave Chinner
> Cc: "Darrick J. Wong"
> Cc: Ross Zwisler
> Reviewed-by: Christoph Hellwig
> Signed-off-by: Dan Williams
...
> ---
>  fs/xfs/xfs_file.c |   60 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 49 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 51e6506bdcb1..0342f6fb782f 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -752,6 +752,38 @@ xfs_file_write_iter(
>  	return ret;
>  }
>
> +static void
> +xfs_wait_var_event(
> +	struct inode		*inode,
> +	uint			iolock,
> +	bool			*did_unlock)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +
> +	*did_unlock = true;
> +	xfs_iunlock(ip, iolock);
> +	schedule();
> +	xfs_ilock(ip, iolock);
> +}

With this scheme, there's a problem that it can be easily livelocked. E.g.
when I created a program that maps a file on DAX fs and does AIO DIO from
it indefinitely (with 64 iocbs in flight), truncate of that file never gets
past xfs_break_layouts(). The reason is that once we drop all locks, new
iocbs can be submitted, they grab new page references and these prevent
truncation next time...

So I think we need to somehow fix this retry scheme so that we guarantee
forward progress of the truncate. E.g. if we kept IOLOCK locked, that would
prevent new iocbs from being submitted...
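
To make the failure mode concrete, here is a toy userspace model of the
retry loop. It is only a sketch: it is not kernel code and it is not the
reproducer mentioned above. All names in it (struct model, submit_new_io(),
complete_one_io(), truncate_wait()) are made up for illustration, and it
assumes that exactly one I/O completes per wait round and that the
submitter refills its queue whenever it is allowed to run.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not kernel code): a single DAX page. */
struct model {
	int refcount;		/* 1 means idle, as for ZONE_DEVICE pages */
	int inflight;		/* iocbs currently holding a page reference */
	int max_inflight;	/* queue depth, e.g. 64 iocbs */
};

/* The submitter tops the queue back up; each iocb pins the page. */
static void submit_new_io(struct model *m)
{
	while (m->inflight < m->max_inflight) {
		m->inflight++;
		m->refcount++;
	}
}

/* One I/O completes and drops its page reference. */
static void complete_one_io(struct model *m)
{
	if (m->inflight > 0) {
		m->inflight--;
		m->refcount--;
	}
}

/*
 * Model of truncate waiting for the page to go idle.
 * Returns the number of wait rounds needed, or -1 if the page is still
 * busy after max_rounds (i.e. no forward progress in this model).
 */
static int truncate_wait(int max_inflight, bool keep_iolock, int max_rounds)
{
	struct model m = { .refcount = 1, .inflight = 0,
			   .max_inflight = max_inflight };

	assert(max_inflight > 0 && max_rounds > 0);

	/* I/O is already in flight when truncate starts. */
	submit_new_io(&m);

	for (int round = 0; round < max_rounds; round++) {
		if (m.refcount == 1)
			return round;

		/* While truncate sleeps, one I/O completes... */
		complete_one_io(&m);

		/* ...and if all locks were dropped, new iocbs get in. */
		if (!keep_iolock)
			submit_new_io(&m);
	}
	return m.refcount == 1 ? max_rounds : -1;
}
```

In this model truncate_wait(64, false, 1000) never sees the page idle and
returns -1, while truncate_wait(64, true, 1000) drains the queue and
returns after 64 rounds.
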

								Honza

> +
> +static int
> +xfs_break_dax_layouts(
> +	struct inode		*inode,
> +	uint			iolock,
> +	bool			*did_unlock)
> +{
> +	struct page		*page;
> +
> +	*did_unlock = false;
> +	page = dax_layout_busy_page(inode->i_mapping);
> +	if (!page)
> +		return 0;
> +
> +	return ___wait_var_event(&page->_refcount,
> +			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> +			0, 0, xfs_wait_var_event(inode, iolock, did_unlock));
> +}
> +
>  int
>  xfs_break_layouts(
>  	struct inode		*inode,
> @@ -763,17 +795,23 @@ xfs_break_layouts(
>
>  	ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
>
> -	switch (reason) {
> -	case BREAK_UNMAP:
> -		ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
> -		/* fall through */
> -	case BREAK_WRITE:
> -		error = xfs_break_leased_layouts(inode, iolock, &retry);
> -		break;
> -	default:
> -		WARN_ON_ONCE(1);
> -		return -EINVAL;
> -	}
> +	do {
> +		switch (reason) {
> +		case BREAK_UNMAP:
> +			ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
> +
> +			error = xfs_break_dax_layouts(inode, *iolock, &retry);
> +			/* fall through */
> +		case BREAK_WRITE:
> +			if (error || retry)
> +				break;
> +			error = xfs_break_leased_layouts(inode, iolock, &retry);
> +			break;
> +		default:
> +			WARN_ON_ONCE(1);
> +			return -EINVAL;
> +		}
> +	} while (error == 0 && retry);
>
>  	return error;
>  }
>
--
Jan Kara
SUSE Labs, CR