Received: by 2002:a05:6a10:a0d1:0:0:0:0 with SMTP id j17csp4675706pxa; Mon, 10 Aug 2020 15:23:41 -0700 (PDT) X-Google-Smtp-Source: ABdhPJweYiatj4/oPft24Dbr8IMpS3PnBKYANw9UR7sD5+inXoNupH/lTdjV+N1HQhDT2frjf62t X-Received: by 2002:a17:906:bcf4:: with SMTP id op20mr24599592ejb.277.1597098221740; Mon, 10 Aug 2020 15:23:41 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1597098221; cv=none; d=google.com; s=arc-20160816; b=cCylKWhMY7wYd9K0rBAjk3GFyVd/+TleJ0/ejiDYYF8KWs9wy09cnoIWXtUVo/V9GZ SpfrpMdcHRfh8fD8EckdVcJFm8Tb+Nj1ZxdaCVGGh/U9dRFRGpMjNiAtAPundx4FLeBD flZETf+42xQCy9AJGB3mk6074p0iIMVdWMrpUFZ47I0uUWZSWlucUTZ7YtJHZzvuzi9H mwZ6hytgzOlMmc8ez3Igr3dPNn6Zp9lStsoKjpLd4QEH/lKYDbvoUCYNVkvBvcI5wQZc PcfrfBOHmfe8qZUL0L2VWhcySEzUO2WL0z2Pp+hSXdSbbwMDloWrlOersZl6cVBdbEbf 2uHA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=VBikFLZxzDBd9LG4MYtQ91msuK4Ik0qZ691jBjREJWc=; b=PPqBVIrNtiMO6w6+mvvTKxrNeZoIipZjSQsL9RTPvA2KJodhp2NRvIPujTOx1RDLmz JiBTjMjysAB1VH7hWQpPRswiVyBfxbz9fDg7SmcTYr7v3nrtImrIyvFZGC3mF8GzbuIt sEGqAk7lPKG6NtP7Qbroq+iCllwuutRQOLJ1ulNqE8X3P9j1qjS/69Bul92MMT6y3Y4R jm4sEV+xX/UtysnRNWSaI+VGsqDC2b6omL8t0+4WzjOuZZhN1wMIxu8h4SxqTi3UghD1 +C9MHBc8nv3lzWJ8bpEN7J8S2V7+OZUiY84wZ7BQxEeQHiNU3auG/IMUWYWWjAwOfNxH t+3A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id bl4si3887727ejb.420.2020.08.10.15.23.18; Mon, 10 Aug 2020 15:23:41 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726775AbgHJWWp (ORCPT + 99 others); Mon, 10 Aug 2020 18:22:45 -0400 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:36133 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726500AbgHJWWo (ORCPT ); Mon, 10 Aug 2020 18:22:44 -0400 Received: from dread.disaster.area (pa49-180-53-24.pa.nsw.optusnet.com.au [49.180.53.24]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id E7389D5AFD7; Tue, 11 Aug 2020 08:22:40 +1000 (AEST) Received: from dave by dread.disaster.area with local (Exim 4.92.3) (envelope-from ) id 1k5GBi-0007nF-LK; Tue, 11 Aug 2020 08:22:38 +1000 Date: Tue, 11 Aug 2020 08:22:38 +1000 From: Dave Chinner To: Vivek Goyal Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, virtio-fs@redhat.com, miklos@szeredi.hu, stefanha@redhat.com, dgilbert@redhat.com Subject: Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault Message-ID: <20200810222238.GD2079@dread.disaster.area> References: <20200807195526.426056-1-vgoyal@redhat.com> <20200807195526.426056-16-vgoyal@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200807195526.426056-16-vgoyal@redhat.com> User-Agent: Mutt/1.10.1 (2018-07-13) X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=moVtWZxmCkf3aAMJKIb/8g==:117 a=moVtWZxmCkf3aAMJKIb/8g==:17 a=kj9zAlcOel0A:10 a=y4yBn9ojGxQA:10 a=20KFwNOVAAAA:8 a=7-415B0cAAAA:8 a=rPINo3o76Op-SQJf9JMA:9 a=CjuIK1q_8ugA:10 a=biEYGPWJfzWAr4FL6Ov7:22 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote: > We need some kind of locking mechanism here. Normal file systems like > ext4 and xfs seems to take their own semaphore to protect agains > truncate while fault is going on. > > We have additional requirement to protect against fuse dax memory range > reclaim. When a range has been selected for reclaim, we need to make sure > no other read/write/fault can try to access that memory range while > reclaim is in progress. Once reclaim is complete, lock will be released > and read/write/fault will trigger allocation of fresh dax range. > > Taking inode_lock() is not an option in fault path as lockdep complains > about circular dependencies. So define a new fuse_inode->i_mmap_sem. That's precisely why filesystems like ext4 and XFS define their own rwsem. Note that this isn't a DAX requirement - the page fault serialisation is actually a requirement of hole punching... > Signed-off-by: Vivek Goyal > --- > fs/fuse/dir.c | 2 ++ > fs/fuse/file.c | 15 ++++++++++++--- > fs/fuse/fuse_i.h | 7 +++++++ > fs/fuse/inode.c | 1 + > 4 files changed, 22 insertions(+), 3 deletions(-) > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index 26f028bc760b..f40766c0693b 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, > */ > if ((is_truncate || !is_wb) && > S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { > + down_write(&fi->i_mmap_sem); > truncate_pagecache(inode, outarg.attr.size); > invalidate_inode_pages2(inode->i_mapping); > + up_write(&fi->i_mmap_sem); > } > > clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > index be7d90eb5b41..00ad27216cc3 100644 > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -2878,11 +2878,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf, > > if (write) > sb_start_pagefault(sb); > - > + /* > + * We need to serialize against not only truncate but also against > + * fuse dax memory range reclaim. While a range is being reclaimed, > + * we do not want any read/write/mmap to make progress and try > + * to populate page cache or access memory we are trying to free. > + */ > + down_read(&get_fuse_inode(inode)->i_mmap_sem); > ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops); > > if (ret & VM_FAULT_NEEDDSYNC) > ret = dax_finish_sync_fault(vmf, pe_size, pfn); > + up_read(&get_fuse_inode(inode)->i_mmap_sem); > > if (write) > sb_end_pagefault(sb); > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, > file_update_time(file); > } > > - if (mode & FALLOC_FL_PUNCH_HOLE) > + if (mode & FALLOC_FL_PUNCH_HOLE) { > + down_write(&fi->i_mmap_sem); > truncate_pagecache_range(inode, offset, offset + length - 1); > - > + up_write(&fi->i_mmap_sem); > + } > fuse_invalidate_attr(inode); I'm not sure this is sufficient. You have to lock page faults out for the entire time the hole punch is being performed, not just while the mapping is being invalidated. That is, once you've taken the inode lock and written back the dirty data over the range being punched, you can then take a page fault and dirty the page again. Then after you punch the hole out, you have a dirty page with non-zero data in it, and that can get written out before the page cache is truncated. IOWs, to do a hole punch safely, you have to both lock the inode and lock out page faults *before* you write back dirty data. Then you can invalidate the page cache so you know there is no cached data over the range about to be punched. Once the punch is done, then you can drop all locks.... The same goes for any other operation that manipulates extents directly (other fallocate ops, truncate, etc). /me also wonders if there can be racing AIO+DIO in progress over the range that is being punched and whether fuse needs to call inode_dio_wait() before punching holes, running truncates, etc... Cheers, Dave. -- Dave Chinner david@fromorbit.com