Date: Thu, 15 Aug 2013 16:01:49 +1000
From: Dave Chinner
To: Andy Lutomirski
Cc: "Theodore Ts'o", Dave Hansen, Dave Hansen, Linux FS Devel, xfs@oss.sgi.com, "linux-ext4@vger.kernel.org", Jan Kara, LKML, Tim Chen, Andi Kleen
Subject: Re: page fault scalability (ext3, ext4, xfs)
Message-ID: <20130815060149.GP6023@dastard>

On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner wrote:
> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> > > cost of the unwritten->written conversion.
> >> >
> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> > this part until writeback?
> >>
> >> Part of the work has to be done at write time because we need to
> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> problems).  The unwritten->written conversion does happen at writeback
> >> (as does the actual block allocation if we are doing delayed
> >> allocation).
> >>
> >> The point is that if the goal is to measure page fault scalability, we
> >> shouldn't have this other stuff happening at the same time as the page
> >> fault workload.
> >
> > Sure, but the real problem is not the block mapping or allocation
> > path - even if the test is changed to take that out of the picture,
> > we still have timestamp updates being done on every single page
> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > and have nanosecond granularity, so every page fault is resulting in
> > a transaction to update the timestamp of the file being modified.
>
> I have (unmergeable) patches to fix this:
>
> http://comments.gmane.org/gmane.linux.kernel.mm/92476

The big problem with this approach is that not doing the
timestamp update on page faults is going to break the inode change
version counting because for ext4, btrfs and XFS it takes a
transaction to bump that counter. NFS needs to know the moment a
file is changed in memory, not when it is written to disk. Also, NFS
requires the change to the counter to be persistent over server
failures, so it needs to be changed as part of a transaction....

IOWs, fixing the "filesystems need a transaction on each
page_mkwrite call" problem isn't as simple as changing how
timestamps are updated.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
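
[Editorial sketch, not part of the original message.] For readers who
want to see the access pattern being discussed, the following is a
minimal user-space sketch, assuming a Unix-like system. It pre-writes
zeros to a file (so no unwritten->written conversion is involved, as
suggested earlier in the thread), maps it shared, and stores one byte
into every page. On Linux, the first store to each clean page of a
shared file mapping is expected to take a write fault, which is where
the filesystem's page_mkwrite handler and the timestamp update come in.
The function name dirty_each_page and its parameters are hypothetical,
chosen only for illustration; this is not the benchmark used in the
thread and it measures nothing by itself.

```python
import mmap
import os

PAGE = mmap.PAGESIZE  # page size reported by the running system


def dirty_each_page(path, npages):
    """Store one byte into every page of a shared mapping of `path`.

    Hypothetical helper for illustration only. Returns the number of
    pages touched.
    """
    size = npages * PAGE

    # Pre-write zeros and sync them, so the pages start out backed by
    # written blocks and are clean before we map the file.
    with open(path, "wb") as f:
        f.write(bytes(size))
        f.flush()
        os.fsync(f.fileno())

    touched = 0
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size,
                       flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            for off in range(0, size, PAGE):
                # First store to this page through the mapping.
                m[off] = 1
                touched += 1
            # Push the dirty pages back to the file before unmapping.
            m.flush()
    return touched
```

Running this against files on different filesystems while tracing
(for example with perf) would be one way to observe the per-fault
work described above; the sketch itself makes no claim about how much
time any filesystem spends there.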