Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752394AbcDYXJY (ORCPT ); Mon, 25 Apr 2016 19:09:24 -0400 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:64452 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752364AbcDYXJV (ORCPT ); Mon, 25 Apr 2016 19:09:21 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2CBBgAIox5XVs/E03ZegziBUKVoBowUhWqEEYR+gQoEAgKBOk0BAQEBAQEHAYEFhEEBAQEDATocIwULCAMOCgklDwUlAwcaE4giB8MrAQEBAQYBAQEBHBmFQIUTh2qCKwWYD44MjxqPL4JnG4FdKjCJAAEBAQ Date: Tue, 26 Apr 2016 09:08:33 +1000 From: Dave Chinner To: Lucas Stach Cc: Brian Foster , linux-kernel@vger.kernel.org, xfs@oss.sgi.com Subject: Re: [PATCH] xfs: idle aild if the AIL is pushed up to the target LSN Message-ID: <20160425230833.GC18496@dastard> References: <1461570163-4083-1-git-send-email-dev@lynxeye.de> <20160425142444.GC33882@bfoster.bfoster> <1461607897.2364.27.camel@lynxeye.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1461607897.2364.27.camel@lynxeye.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3514 Lines: 85 On Mon, Apr 25, 2016 at 08:11:37PM +0200, Lucas Stach wrote: > Am Montag, den 25.04.2016, 10:24 -0400 schrieb Brian Foster: > > On Mon, Apr 25, 2016 at 09:42:43AM +0200, Lucas Stach wrote: > > > > > > The current logic only idles aild if the AIL is empty, rescheduling > > > the worker with a timeout of 50ms otherwise. If the target LSN > > > isn't > > > moved forward, the worker will not make any progress as it only > > > pushes the AIL up to the target LSN, leading to the empty AIL > > > condition to only be met after the log worker pushed the complete > > > AIL. As the log worker runs on the scale of minutes, this leaves > > > aild in the "no progress, 50ms timeout" state for extended periods > > > of > > > time, causing many unecessary wakeups. > > > > > Where do you get the "scale of minutes" from? I believe the default > > log > > worker interval is 30s, though it can be increased by the admin. > > > Uh, right. It's 30s, which is still a very long time when we consider > that new items can be added to the AIL right after the log worker > forced it out. Standard writeback timeout for data and metadata. > > The other thing to note is that the AIL target is also driven by > > demand > > on log space. E.g., see xlog_grant_push_ail(), which executes when a > > transaction attempts to reserve log space that isn't currently > > available > > in the log. > > > And that's where I think the problem stems from: the above function > pushes the AIL only up to a target LSN to make room for the required > log space. Only the log worker empties the AIL completely by issuing a > xfs_ail_push_all_sync(). xfs_ail_push_all(), actually, and the log worker runs every 30s by default. i.e. we move the target to the end to the AIL to try to empty it completely every 30s. > > That said, I'm not sure whether there's a notable benefit of idling > > for > > 50ms over just scheduling out when we've hit the target lsn. It seems > > like that anybody who pushes the target forward again is going to > > wake > > up the thread anyways. On the other hand, if the fs is idle the > > thread > > will eventually schedule out indefinitely. > > Is this a problem? The patch tries to do exactly that: schedule out > aild indefinitely when there is no more work to do as nobody is pushing > the target LSN forward. If the filesystem is slowly being dirtied, then the aild should't really idle at all.i Keep in mind that the xfsaild has multiple functions, one of which is a watchdog that catches log space stalls that would otherwise hang the filesystem. Every time we've removed the watchdog function (i.e. agressively idle the aild) we've had users report random, unreproducable hangs/stalls that have gone away when the watchdog function (i.e. don't idle until the log is covered and completely idle) was re-instated... > On an idle FS this will stop aild when the target LSN is hit, will wake > it up again as soon as the log worker requests the complete AIL to be > pushed out, then it goes back to sleep indefinitely until the log fills > up again. The behaviour suggests that your filesystem is not idle. The filesystem takes up to 90s to be marked idle (log needs to be covered, state machine takes 3x30s cycles to transition to idle "covered" state. If you want the filesytsem to idle quickly, then run the log worker more frequently to get the target updated more quickly. This will also speed up the log covering state machine as well. Cheers, Dave. -- Dave Chinner david@fromorbit.com