Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S932246AbdIFRJv (ORCPT ); Wed, 6 Sep 2017 13:09:51 -0400
Received: from mga01.intel.com ([192.55.52.88]:47037 "EHLO mga01.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1755638AbdIFRJt (ORCPT ); Wed, 6 Sep 2017 13:09:49 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.42,354,1500966000"; d="scan'208";a="146185868"
Date: Wed, 6 Sep 2017 11:09:46 -0600
From: Ross Zwisler
To: Jan Kara
Cc: Ross Zwisler, Andrew Morton, linux-kernel@vger.kernel.org,
        "Darrick J. Wong", "Theodore Ts'o", Andreas Dilger,
        Christoph Hellwig, Dan Williams, Dave Chinner,
        linux-ext4@vger.kernel.org, linux-nvdimm@lists.01.org,
        linux-xfs@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH 6/9] ext4: safely transition S_DAX on journaling changes
Message-ID: <20170906170946.GC17663@linux.intel.com>
References: <20170905223541.20594-1-ross.zwisler@linux.intel.com>
        <20170905223541.20594-7-ross.zwisler@linux.intel.com>
        <20170906094700.GC27916@quack2.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170906094700.GC27916@quack2.suse.cz>
User-Agent: Mutt/1.8.3 (2017-05-23)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4103
Lines: 86

On Wed, Sep 06, 2017 at 11:47:00AM +0200, Jan Kara wrote:
> On Tue 05-09-17 16:35:38, Ross Zwisler wrote:
> > The IOCTL path which switches the journaling mode for an inode is currently
> > unsafe because it doesn't properly do a writeback and invalidation on the
> > inode.  In XFS, for example, safe transitions of S_DAX are handled by
> > xfs_ioctl_setattr_dax_invalidate() which locks out page faults and I/O,
> > does a writeback via filemap_write_and_wait() and an invalidation via
> > invalidate_inode_pages2().
> >
> > Without this in place we can see the following kernel warning when we try
> > and insert a DAX exceptional entry but find that a dirty page cache page is
> > still in the mapping->radix_tree:
> >
> > WARNING: CPU: 4 PID: 1052 at mm/filemap.c:262 __delete_from_page_cache+0x375/0x550
> > Modules linked in: dax_pmem nd_pmem device_dax nd_btt nfit libnvdimm
> > CPU: 4 PID: 1052 Comm: small Not tainted 4.13.0-rc6-00055-gac26931 #3
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> > task: ffff88020ccd0000 task.stack: ffffc900021d4000
> > RIP: 0010:__delete_from_page_cache+0x375/0x550
> > RSP: 0000:ffffc900021d7b90 EFLAGS: 00010002
> > RAX: 002fffc00001123d RBX: ffffffffffffffff RCX: ffff8801d9440d68
> > RDX: 0000000000000000 RSI: ffffffff81fd5b84 RDI: ffffffff81f6f0e5
> > RBP: ffffc900021d7be0 R08: 0000000000000000 R09: ffff8801f9938c70
> > R10: 0000000000000021 R11: ffff8801f9938c91 R12: ffff8801d9440d70
> > R13: ffffea0007fdda80 R14: 0000000000000001 R15: ffff8801d9440d68
> > FS: 00007feacc041700(0000) GS:ffff880211800000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000010420000 CR3: 000000020cfd8000 CR4: 00000000000006e0
> > Call Trace:
> >  dax_insert_mapping_entry+0x158/0x2c0
> >  dax_iomap_fault+0x1020/0x1bb0
> >  ext4_dax_huge_fault+0xc8/0x160
> >  ext4_dax_fault+0x10/0x20
> >  __do_fault+0x20/0x110
> >  __handle_mm_fault+0x97d/0x1120
> >  handle_mm_fault+0x188/0x2f0
> >  __do_page_fault+0x28f/0x590
> >  trace_do_page_fault+0x58/0x2c0
> >  do_async_page_fault+0x2c/0x90
> >  async_page_fault+0x28/0x30
> >
> > I'm pretty sure we could make a test that shows userspace visible data
> > corruption as well in this scenario.
> >
> > Make it safe to change the journaling mode and turn on or off S_DAX by
> > adding locking to properly lock out page faults (i_mmap_sem) and then doing
> > the writeback and invalidate.
> > I/O is already held off because all callers
> > of ext4_ioctl_setflags() hold the inode lock.
>
> Yeah, this is a good point. It is just that this is not enough as I
> discovered in [1]. You also need to tear down & recreate VMAs when changing
> DAX flag which is a bit tricky. So for now I think returning EBUSY when
> file is mmaped and we'd like to flip DAX flag is the best solution. Hmm?
>
> [1] https://www.spinics.net/lists/linux-xfs/msg09859.html

Yea, thanks for the link, I totally missed this discussion (obviously).

Cool, I'll rework this for v2.

> > The locking for this new code is complex because of the following:
> >
> > 1) filemap_write_and_wait() eventually calls ext4_writepages(), which
> > acquires the sbi->s_journal_flag_rwsem. This lock ranks above the
> > jbd2_handle which is eventually taken by ext4_journal_start().  This
> > essentially means that the writeback has to happen outside of the context
> > of an active journal handle (outside of ext4_journal_start() to
> > ext4_journal_stop().)
> >
> > 2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and
> > this lock again ranks above the jbd2_handle taken by ext4_journal_start().
> > So, as with the writeback code in 1) above we have to take ei->i_mmap_sem
> > outside of the context of an active journal handle.
>
> Welcome to the joy of fs locking ;)

:)  Well, I feel like I learned a lot more about ext4 during this patch set!

> > Signed-off-by: Ross Zwisler
> > CC: stable@vger.kernel.org
>
>                                                               Honza
>
> --
> Jan Kara
> SUSE Labs, CR
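
For anyone following the ordering constraints in 1) and 2) of the quoted
patch description, here is a minimal userspace sketch of the rule "a lock
that ranks above the jbd2 handle must not be taken while a handle is
active". It is only a model and not ext4 code: lock_tracker, the rank
names and both order_*() helpers are hypothetical, and i_mmap_sem and
s_journal_flag_rwsem are given the same rank only because the description
says nothing about their order relative to each other.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only; none of these names exist in ext4.
 * Lower value = outer lock, i.e. it "ranks above" the higher values.
 * i_mmap_sem and s_journal_flag_rwsem share RANK_OUTER (an assumption).
 */
enum lock_rank { RANK_OUTER = 0, RANK_JBD2_HANDLE = 1, RANK_COUNT };

struct lock_tracker {
	int held[RANK_COUNT];	/* number of locks held at each rank */
};

/* Refuse an acquisition that would nest an outer lock inside an inner one. */
static bool tracker_acquire(struct lock_tracker *t, enum lock_rank rank)
{
	for (int i = rank + 1; i < RANK_COUNT; i++)
		if (t->held[i] > 0)
			return false;
	t->held[rank]++;
	return true;
}

static void tracker_release(struct lock_tracker *t, enum lock_rank rank)
{
	assert(t->held[rank] > 0);
	t->held[rank]--;
}

/* One ordering that satisfies both constraints: outer locks first. */
static bool order_outside_handle(void)
{
	struct lock_tracker t = { {0} };
	bool ok = true;

	ok = ok && tracker_acquire(&t, RANK_OUTER);	/* i_mmap_sem, write */
	ok = ok && tracker_acquire(&t, RANK_OUTER);	/* writeback: s_journal_flag_rwsem */
	tracker_release(&t, RANK_OUTER);		/* writeback finished */
	ok = ok && tracker_acquire(&t, RANK_JBD2_HANDLE); /* ext4_journal_start() */
	tracker_release(&t, RANK_JBD2_HANDLE);		/* ext4_journal_stop() */
	tracker_release(&t, RANK_OUTER);		/* drop i_mmap_sem */
	return ok;
}

/* Inverted ordering: the handle is already active when i_mmap_sem is taken. */
static bool order_inverted(void)
{
	struct lock_tracker t = { {0} };

	if (!tracker_acquire(&t, RANK_JBD2_HANDLE))	/* ext4_journal_start() */
		return false;
	return tracker_acquire(&t, RANK_OUTER);		/* i_mmap_sem: rejected */
}
```

In this model order_outside_handle() is accepted and order_inverted() is
rejected, which is the same shape of check lockdep does on the real locks.
It says nothing about the VMA teardown problem Jan raises above.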