Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932276AbbHXGYb (ORCPT ); Mon, 24 Aug 2015 02:24:31 -0400 Received: from ipmail04.adl6.internode.on.net ([150.101.137.141]:41123 "EHLO ipmail04.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932100AbbHXGYa (ORCPT ); Mon, 24 Aug 2015 02:24:30 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2D9CgBft9pVPEDvLHldgxtUZwKCV6cPDAEBAQEBAQaDQ5IehXkCAgEBAkVfTQEBAQEBAQcBAQEBQT+EJAEBBCcTEQEKIxAIAw4FBQklDwURFAMHGhOILQ62RAGPQQEBAQcCAR8ZhgqFNIUKB4QsBZU0hQaHaY5ti2aENiwzAYJLAQEB Date: Mon, 24 Aug 2015 16:24:25 +1000 From: Dave Chinner To: Eryu Guan Cc: Jens Axboe , Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, axboe@fb.com, Tejun Heo , Jan Kara , linux-fsdevel@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH block/for-linus] writeback: fix syncing of I_DIRTY_TIME inodes Message-ID: <20150824062425.GU3902@dastard> References: <20150817200254.GG21075@mtj.duckdns.org> <20150818091603.GA12317@quack.suse.cz> <20150818174718.GA15739@mtj.duckdns.org> <20150818195439.GB15739@mtj.duckdns.org> <20150818215611.GD3902@dastard> <20150821102053.GL17933@dhcp-13-216.nay.redhat.com> <20150822003025.GS3902@dastard> <20150822044609.GM17933@dhcp-13-216.nay.redhat.com> <20150824011123.GA714@dastard> <20150824031816.GO17933@dhcp-13-216.nay.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150824031816.GO17933@dhcp-13-216.nay.redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6537 Lines: 136 On Mon, Aug 24, 2015 at 11:18:16AM +0800, Eryu Guan wrote: > On Mon, Aug 24, 2015 at 11:11:23AM +1000, Dave Chinner wrote: > > > > Eryu, can you change the way you run the event trace to be: > > > > $ sudo trace-cmd -o ./check > > > > rather than running the trace as a background operation elsewhere? > > Maybe that will give better results. > > The results are here > > http://128.199.137.77/writeback-v3/ OK, the trace is still missing a few events, but it's much more complete, and contains the events that tell more of the story: $ ls -li /mnt/scr total 51196 15688955 -rw------- 1 root root 10481664 Aug 24 13:06 985 15688959 -rw------- 1 root root 0 Aug 24 13:06 989 15688961 -rw------- 1 root root 0 Aug 24 13:06 991 15688964 -rw------- 1 root root 0 Aug 24 13:06 994 15688966 -rw------- 1 root root 0 Aug 24 13:06 996 $ Inode # Size block count flushiter dec hex inode log inode log inode log 15688955 0xef64fb 0x9ff000 0x9ff000 0x9ff 0xa00 1 0 15688959 0xef64ff 0 0 0xa00 0xa00 0 0 15688961 0xef6501 0 0 0xa00 0xa00 0 0 15688964 0xef6504 0 0 0xa00 0xa00 0 0 15688966 0xef6506 0 0 0xa00 0xa00 0 0 Ok, that tallies with the previous symptoms. However, the trace tells us something different. the xfs_setfilesize events. here's the last 12 xfs_setfilesize events delivered in the trace (trimmed for brevity and focus): ino 0xef64fa isize 0xa00000 disize 0x0 offset 0x0 count 10481664 ino 0xef64fa isize 0xa00000 disize 0x9ff000 offset 0x9ff000 count 409 ino 0xef64fb isize 0xa00000 disize 0x0 offset 0x0 count 10481664 ino 0xef64fb isize 0xa00000 disize 0x9ff000 offset 0x9ff000 count 4096 ino 0xef64ff isize 0xa00000 disize 0x0 offset 0x0 count 10481664 ino 0xef64ff isize 0xa00000 disize 0x9ff000 offset 0x9ff000 count 4096 ino 0xef6501 isize 0xa00000 disize 0x0 offset 0x0 count 10481664 ino 0xef6501 isize 0xa00000 disize 0x9ff000 offset 0x9ff000 count 4096 ino 0xef6504 isize 0xa00000 disize 0x0 offset 0x0 count 10481664 ino 0xef6504 isize 0xa00000 disize 0x9ff000 offset 0x9ff000 count 4096 ino 0xef6506 isize 0xa00000 disize 0x0 offset 0x0 count 10481664 ino 0xef6506 isize 0xa00000 disize 0x9ff000 offset 0x9ff000 count 4096 Yeah, the last 9 transactions corresponding to these events are not in the log. Which means, most likely, that they occurred after ->sync_fs forced the log out. THe next thing that happens is the fs is shut down, and that's preventing the dirty log from being written to disk. Eryu, this can be confirmed by add the xfs_log_force event to the trace. What I can't see in the traces is where sync is doing a blocking sync pass on the fileystem. The wbc control structure being passed to XFS is: wbc_writepage: bdi 253:0: towrt=45569 skip=0 mode=0 kupd=0 bgrd=0 reclm=0 cyclic=0 start=0x0 end=0x7fffffffffffffff Which is not coming from sync_inodes_sb() as the sync mode is incorrect (i.e. not WB_SYNC_ALL). It looks to me that writeback is coming from a generic bdi flusher command rather than a directed superblock sync. i.e. through wakeup_flusher_threads() which sets: work->sync_mode = WB_SYNC_NONE; work->nr_pages = nr_pages; work->range_cyclic = range_cyclic; work->reason = reason; work->auto_free = 1; as the reason is "sync": sync-18849 writeback_queue: bdi 253:0: sb_dev 0:0 nr_pages=308986 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=sync sync-18849 writeback_queue: bdi 253:0: sb_dev 253:1 nr_pages=9223372036854775807 sync_mode=1 kupdate=0 range_cyclic=0 background=0 reason=sync .... kworker/u8:1-1563 writeback_exec: bdi 253:0: sb_dev 0:0 nr_pages=308986 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=sync kworker/u8:1-1563 writeback_start: bdi 253:0: sb_dev 0:0 nr_pages=308986 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=sync The next writeback_queue/writeback_exec tracepoint pair are: .... kworker/2:1-17163 xfs_setfilesize: dev 253:6 ino 0xef6506 isize 0xa00000 disize 0x0 offset 0x0 count 10481664 kworker/2:1-17163 xfs_setfilesize: dev 253:6 ino 0xef6506 isize 0xa00000 disize 0x9ff000 offset 0x9ff000 count 4096 sync-18849 wbc_writepage: bdi 253:0: towrt=9223372036854775798 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 start=0x0 end=0x7fffffffffffffff sync-18849 wbc_writepage: bdi 253:0: towrt=9223372036854775797 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 start=0x0 end=0x7fffffffffffffff sync-18849 wbc_writepage: bdi 253:0: towrt=9223372036854775796 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 start=0x0 end=0x7fffffffffffffff sync-18849 wbc_writepage: bdi 253:0: towrt=9223372036854775795 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 start=0x0 end=0x7fffffffffffffff umount-18852 writeback_queue: bdi 253:0: sb_dev 253:6 nr_pages=22059 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=sync kworker/u8:1-1563 writeback_exec: bdi 253:0: sb_dev 253:6 nr_pages=22059 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=sync .... which shows unmount being the next writeback event queued and executed after the IO completions have come in (that missed the log). What is missing is the specific queue/exec events for sync_sb_inodes() from the sync code for each filesystem. Anyway, Eryum long and short of it is that you don't need to worry about testing all the different combinations - we now know that the completion events are occurring, so let's focus on whether the sync code is not waiting for them correctly. Can you trace the following events: xfs_log_force xfs_setfilesize writeback_queue writeback_exec writeback_start writeback_queue_io writeback_written writeback_pages_written basically I'm trying to see if we've got all the BDI events as we'd expect then to be queued and run for sync, and when the ->sync_fs call occurs during the sync process before shutdown and unmount... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/