Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932178AbbHXBL3 (ORCPT ); Sun, 23 Aug 2015 21:11:29 -0400 Received: from ipmail04.adl6.internode.on.net ([150.101.137.141]:60032 "EHLO ipmail04.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932066AbbHXBL1 (ORCPT ); Sun, 23 Aug 2015 21:11:27 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2BWCQACbtpVPEDvLHldgxtUZwKCV6cNDgEBAQEGg0OSHoV5AgIBAQJFXU0BAQEBAQEHAQEBAUE/hCQBAQQ6EQEKIxAIAw4FBQklDwURFAMHGhOILQ62JgGPQQEBAQEBBQEBAQEeGYYKhTSFCgeDGIEUBZU0hQaHaYFOhDCDHIVTgzOIM4Q2LDMBgksBAQE Date: Mon, 24 Aug 2015 11:11:23 +1000 From: Dave Chinner To: Eryu Guan Cc: Jens Axboe , Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, axboe@fb.com, linux-fsdevel@vger.kernel.org, Jan Kara , Tejun Heo , kernel-team@fb.com Subject: Re: [PATCH block/for-linus] writeback: fix syncing of I_DIRTY_TIME inodes Message-ID: <20150824011123.GA714@dastard> References: <20150813224415.GG4496@mtj.duckdns.org> <20150814111408.GB8710@quack.suse.cz> <20150817200254.GG21075@mtj.duckdns.org> <20150818091603.GA12317@quack.suse.cz> <20150818174718.GA15739@mtj.duckdns.org> <20150818195439.GB15739@mtj.duckdns.org> <20150818215611.GD3902@dastard> <20150821102053.GL17933@dhcp-13-216.nay.redhat.com> <20150822003025.GS3902@dastard> <20150822044609.GM17933@dhcp-13-216.nay.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150822044609.GM17933@dhcp-13-216.nay.redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5595 Lines: 138 On Sat, Aug 22, 2015 at 12:46:09PM +0800, Eryu Guan wrote: > On Sat, Aug 22, 2015 at 10:30:25AM +1000, Dave Chinner wrote: > > On Fri, Aug 21, 2015 at 06:20:53PM +0800, Eryu Guan wrote: > [snip] > > > > Eryu, can you try again, this time manually specifying the writeback > > tracepoints so you exclude the really noisy ones? You can also drop > > the xfs_file_buffered_write and xfs_file_fsync tracepoints as well, > > as we can see that the incoming side of the code is doing the right > > thing.... > > I excluded the writeback tracepoints you mentioned > > writeback_mark_inode_dirty > writeback_dirty_inode_start > writeback_dirty_inode > writeback_dirty_page > writeback_write_inode > > and left all other writeback tracepoints enabled, also dropped > xfs_file_buffered_write and xfs_file_fsync. > > This time I can reproduce generic/048 quickly and please download the > trace info from below > > http://128.199.137.77/writeback-v2/ ok: $ ls -li /mnt/scr total 102396 15688948 -rw------- 1 root root 0 Aug 22 14:31 978 15688950 -rw------- 1 root root 0 Aug 22 14:31 980 15688952 -rw------- 1 root root 10481664 Aug 22 14:31 982 15688957 -rw------- 1 root root 0 Aug 22 14:31 987 15688961 -rw------- 1 root root 0 Aug 22 14:31 991 15688963 -rw------- 1 root root 0 Aug 22 14:31 993 15688964 -rw------- 1 root root 0 Aug 22 14:31 994 15688966 -rw------- 1 root root 0 Aug 22 14:31 996 15688967 -rw------- 1 root root 0 Aug 22 14:31 997 15688968 -rw------- 1 root root 0 Aug 22 14:31 998 $ So, looking at what is on disk and what is in the log: Inode # Size block count flushiter dec hex inode log inode log inode log 15688948 0xef64f4 0 0 0xa00 0xa00 0 0 15688950 0xef64f6 0 0 0xa00 0xa00 0 0 15688952 0xef64f8 0x9ff000 0x9ff000 0x9ff 0xa00 1 0 15688957 0xef64fd 0 0 0xa00 0xa00 0 0 15688961 0xef6501 0 0 0xa00 0xa00 0 0 15688963 0xef6503 0 0 0xa00 0xa00 0 0 15688964 0xef6504 0 0 0xa00 0xa00 0 0 15688966 0xef6506 0 0 0xa00 0xa00 0 0 15688967 0xef6507 0 0 0xa00 0xa00 0 0 15688968 0xef6508 0 0 0xa00 0xa00 0 0 Now, inode #15688952 looks like there's some weirdness going on there with a non-zero flushiter and a block count that doesn't match between what is in the log and what is on disk. However, this is a result of the second mount that checks the file sizes and extent counts - it loads the inode into memory, checks it, and then when it is purged from the cache on unmount the blocks beyond EOF are punched away and the inode writen to disk. Hence there is a second transaction in the log for that inode after all the other inodes have been unlinked: INODE: #regs: 3 ino: 0xef64f8 flags: 0x5 dsize: 16 ..... size 0x9ff000 nblocks 0x9ff extsize 0x0 nextents 0x1 It is preceeded in the log by modifications to the AGF and frees space btree buffers. It's then followed by the superblock buffer and the unmount record. Hence this is not unexpected. What it does tell us, though, is that the log never recorded file size changes for all of the inode with zero size. We see the block count of 0xa00, which means the delayed allocation transaction during IO submission has hit the disk, but there are none of the IO completion transactions in the log. So let's go look at the event trace now now that we know the EOF size update transactions were not run before the filesystem shut down. Inode # writeback completion hex first last first last 0xef64f4 0-0x9ff000 yes no no 0xef64f6 0-0x9ff000 yes no no 0xef64f8 0-0x9ff000 yes no no 0xef64fd 0-0x9ff000 yes no no 0xef6501 0-0x9ff000 yes no no 0xef6503 no no no no 0xef6504 no no no no 0xef6506 no no no no 0xef6507 no no no no 0xef6508 no no no no Ok, so we still can't trust the event trace to be complete - we know from the log and the on-disk state that that allocation occurred for those last 5 inodes, so we can't read anything into the fact the traces for completions are missing. Eryu, can you change the way you run the event trace to be: $ sudo trace-cmd -o ./check rather than running the trace as a background operation elsewhere? Maybe that will give better results. Also, it would be informative to us if you can reproduce this with a v5 filesystem (i.e. mkfs.xfs -m crc=1) because it has much better on-disk information for sequence-of-event triage like this. If you can reproduce it with a v5 filesystem, can you post the trace and metadump? Other things to check (separately): - change godown to godown -f - add a "sleep 5" before running godown after sync - add a "sleep 5; sync" before running godown i.e. I'm wondering if sync is not waiting for everything, and so we aren't capturing the IO completions because the filesystem is already shut down by the time they are delivered... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/