2008-02-17 16:41:42

by Török Edwin

Subject: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

Hi,

xfsaild is causing many wakeups; a quick investigation shows that
xfsaild_push is always returning a 30 msec timeout value.

This is on an idle system, running only gnome, and gnome-terminal.
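
To put numbers on it: a worker that sleeps for whatever its push
function returns will wake about 1000/30, i.e. 33+ times per second,
when that value is pegged at 30 ms. A minimal userspace sketch of the
loop (not the kernel code; only the fixed 30 ms return value is taken
from the observation above, everything else is illustrative):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* stand-in for xfsaild_push(), pegged at the observed 30 ms */
static int push(void)
{
	return 30;
}

int main(void)
{
	int wakeups = 0, elapsed_ms = 0;

	while (elapsed_ms < 1000) {
		int tout = push();
		struct timespec ts = { tout / 1000, (tout % 1000) * 1000000L };

		nanosleep(&ts, NULL);	/* the timer wakeup powertop counts */
		elapsed_ms += tout;
		wakeups++;
	}
	printf("~%d wakeups per second\n", wakeups);
	return 0;
}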

I suggest changing the timeout logic in xfsaild to be friendlier to
power consumption.

See below my original report to the powerTOP mailing list.

Best regards,
--Edwin


Attachments:
Re: new offender in 2.6.25-git: xfsaild.eml (2.49 kB)

2008-02-17 16:51:19

by Oliver Pinter

Subject: Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

On 2/17/08, Török Edwin <[email protected]> wrote:
> Hi,
>
> xfsaild is causing many wakeups; a quick investigation shows that
> xfsaild_push is always returning a 30 msec timeout value.
>
> This is on an idle system, running only gnome, and gnome-terminal.
>
> I suggest changing the timeout logic in xfsaild to be friendlier to
> power consumption.
>
> See below my original report to the powerTOP mailing list.
>
> Best regards,
> --Edwin
>


--
Thanks,
Oliver

2008-02-17 22:48:18

by David Chinner

Subject: Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

On Sun, Feb 17, 2008 at 05:51:08PM +0100, Oliver Pinter wrote:
> On 2/17/08, Török Edwin <[email protected]> wrote:
> > Hi,
> >
> > xfsaild is causing many wakeups; a quick investigation shows that
> > xfsaild_push is always returning a 30 msec timeout value.

That's a bug, and has nothing to do with power consumption. ;)

I can see that there is a dirty item in the filesystem:

Entering kdb (current=0xe00000b8f4fe0000, pid 30046) on processor 3 due to Breakpoint @ 0xa000000100454fc0
[3]kdb> bt
Stack traceback for pid 30046
0xe00000b8f4fe0000 30046 2 1 3 R 0xe00000b8f4fe0340 *xfsaild
0xa000000100454fc0 xfsaild_push
args (0xe00000b8ffff9090, 0xe00000b8f4fe7e30, 0x31b)
....
[3]kdb> xail 0xe00000b8ffff9090
AIL for mp 0xe00000b8ffff9090, oldest first
[0] type buf flags: 0x1 <in ail > lsn [1:13133]
buf 0xe00000b880258800 blkno 0x0 flags: 0x2 <dirty >
Superblock (at 0xe00000b8f9b3c000)
[3]kdb>

the superblock is dirty, and the lsn is well beyond the target of
the xfsaild, hence it *should* be idling. However, it isn't idling
because there is a dirty item in the list and the idle trigger of
"is list empty" is not tripping.

I only managed to reproduce this on a lazy superblock counter
filesystem (i.e. new mkfs and recent kernel), as it logs the
superblock every so often, and that is probably what is keeping the
fs dirty like this.

Can you see if the patch below fixes the problem?

---

Idle state is not being detected properly by the xfsaild push
code. Currently, idle state is detected by an empty list, which
may never happen on a mostly idle filesystem or on one using lazy
superblock counters. A single dirty item in the list will result
in repeated looping to push everything past the target, because
the code fails to check whether we managed to push anything.

Fix by treating a dirty list with everything past the target as
an idle state, and set the timeout appropriately.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_trans_ail.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_trans_ail.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_trans_ail.c 2008-02-18 09:14:34.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_trans_ail.c 2008-02-18 09:18:52.070682570 +1100
@@ -261,14 +261,17 @@ xfsaild_push(
 		xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE);
 	}
 
-	/*
-	 * We reached the target so wait a bit longer for I/O to complete and
-	 * remove pushed items from the AIL before we start the next scan from
-	 * the start of the AIL.
-	 */
-	if ((XFS_LSN_CMP(lsn, target) >= 0)) {
+	if (count && (XFS_LSN_CMP(lsn, target) >= 0)) {
+		/*
+		 * We reached the target so wait a bit longer for I/O to
+		 * complete and remove pushed items from the AIL before we
+		 * start the next scan from the start of the AIL.
+		 */
 		tout += 20;
 		last_pushed_lsn = 0;
+	} else if (!count) {
+		/* We're past our target or empty, so idle */
+		tout = 1000;
 	} else if ((restarts > XFS_TRANS_PUSH_AIL_RESTARTS) ||
 		   (count && ((stuck * 100) / count > 90))) {
 		/*
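
For anyone skimming past the diff markers, the timeout selection after
this change boils down to the following condensed model (a sketch only:
the base timeout, the back-off value and XFS_TRANS_PUSH_AIL_RESTARTS
are illustrative stand-ins, and the last branch, truncated in the diff
above, is approximated):

#include <stdio.h>

#define XFS_TRANS_PUSH_AIL_RESTARTS	1000	/* illustrative value */

static long
select_tout(long count, long stuck, long restarts, int past_target)
{
	long tout = 10;			/* illustrative base timeout (ms) */

	if (count && past_target) {
		/* reached the target: wait a bit for I/O to complete */
		tout += 20;
	} else if (!count) {
		/* pushed nothing: the AIL is idle, sleep a full second */
		tout = 1000;
	} else if (restarts > XFS_TRANS_PUSH_AIL_RESTARTS ||
		   (stuck * 100) / count > 90) {
		/* mostly stuck or looping: back off (sketch) */
		tout += 10;
	}
	return tout;
}

int main(void)
{
	/* the reported idle case: nothing pushable, so a 1 s timeout
	 * replaces the old 30 ms polling */
	printf("idle tout = %ld ms\n", select_tout(0, 0, 0, 1));
	return 0;
}

The key behavioural change is the middle branch: a pushed count of
zero now selects the 1000 ms idle timeout even when dirty items remain
on the list, which is exactly the state the kdb session above showed.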

2008-02-18 09:41:55

by Török Edwin

Subject: Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

David Chinner wrote:
> On Sun, Feb 17, 2008 at 05:51:08PM +0100, Oliver Pinter wrote:
>
>> On 2/17/08, Török Edwin <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> xfsaild is causing many wakeups; a quick investigation shows that
>>> xfsaild_push is always returning a 30 msec timeout value.
>>>
>
> That's a bug

Ok. Your patch fixes the 30+ wakeups :)

> , and has nothing to do with power consumption. ;)
>

I suggest using a sysctl value (such as
/proc/sys/vm/dirty_writeback_centisecs) instead of a hardcoded default
of 1000.
That would further reduce the wakeups.
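
For reference, that knob is an existing Linux sysctl and is trivial to
read from userspace; a throwaway C example (only the proc path is
real, the rest is illustrative):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/dirty_writeback_centisecs", "r");
	int centisecs;

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fscanf(f, "%d", &centisecs) != 1) {
		fclose(f);
		fprintf(stderr, "unexpected format\n");
		return 1;
	}
	fclose(f);
	/* the default is 500 centisecs, i.e. a 5 s writeback interval */
	printf("writeback interval: %d ms\n", centisecs * 10);
	return 0;
}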

>
> I only managed to reproduce this on a lazy superblock counter
> filesystem (i.e. new mkfs and recent kernel),

The filesystem was created in July 2007.

> Can you see if the patch below fixes the problem?

Yes, it reduces wakeups to 1/sec.

Thanks,
--Edwin

2008-02-18 10:22:23

by David Chinner

Subject: Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

On Mon, Feb 18, 2008 at 11:41:39AM +0200, Török Edwin wrote:
> David Chinner wrote:
> > On Sun, Feb 17, 2008 at 05:51:08PM +0100, Oliver Pinter wrote:
> >
> >> On 2/17/08, Török Edwin <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> xfsaild is causing many wakeups; a quick investigation shows that
> >>> xfsaild_push is always returning a 30 msec timeout value.
> >>>
> >
> > That's a bug
>
> Ok. Your patch fixes the 30+ wakeups :)

Good. I'll push it out for review then.

> > , and has nothing to do with power consumption. ;)
> >
>
> I suggest using a sysctl value (such as
> /proc/sys/vm/dirty_writeback_centisecs) instead of a hardcoded default
> of 1000.

No, too magic. I dislike adding knobs to work around issues that
really should be fixed by having sane default behaviour. Further
down the track, as we correct known issues with the AIL push code,
we'll be able to increase this idle timeout or even make it purely
wakeup driven once we get back to an idle state. However, right now
it still needs that once-a-second wakeup to work around a nasty
corner case that can hang the filesystem....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-02-18 23:22:21

by Linda Walsh

Subject: Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

Not to look excessively dumb, but what's xfsaild?

xfs seems to be sprouting daemons at a more rapid pace
these days...xfsbufd, xfssyncd, xfsdatad, xfslogd, xfs_mru_cache, and
now xfsaild?

Not a complaint if it ups performance, but I do sorta wonder what all
of them are for and why they are needed "now" but not for, say,
kernels before 2.6.18 (arbitrary number picked out of hat).

Like bufd writes out buffers, logd writes/handles the log, datad? Isn't
the data in buffers? mru_cache? -- isn't that handled by the Linux
block layer? Sorry...just a bit confused by the additions...

Are there any design docs (scribbles?) saying what these do and why
they were added so I can just go read 'em myself? I'm sure they
were added for good reason...just am curious more than anything.

Thanks,
-linda

2008-02-19 08:20:42

by David Chinner

Subject: Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

On Mon, Feb 18, 2008 at 03:22:02PM -0800, Linda Walsh wrote:
> Not to look excessively dumb, but what's xfsaild?

AIL = Active Item List

It is a sorted list of all the logged metadata objects that have not
yet been written back to disk. The xfsaild is responsible for tail
pushing the log, i.e. writing back objects in the AIL in the most
efficient manner possible.

Why a thread? Because allowing parallelism in tail pushing is a
scalability problem, and moving this to its own thread completely
isolates it from parallelism. Tail pushing only requires a small
amount of CPU time, but it requires a global-scope spinlock.
Isolating the pushing to a single CPU means the spinlock is not
contended across every CPU in the machine.
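
As a toy model of that design choice (illustrative userspace code, not
XFS internals: many producers insert into one lock-protected sorted
list, a single worker drains it, and a mutex stands in for the
spinlock, so the "push" side never contends on the lock with other
pushers):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct item {
	long lsn;			/* sort key, oldest first */
	struct item *next;
};

static struct item *ail;
static pthread_mutex_t ail_lock = PTHREAD_MUTEX_INITIALIZER;

/* producers take the lock only for a short sorted insert */
static void ail_insert(long lsn)
{
	struct item *it = malloc(sizeof(*it)), **p;

	if (!it)
		abort();
	it->lsn = lsn;
	pthread_mutex_lock(&ail_lock);
	for (p = &ail; *p && (*p)->lsn < lsn; p = &(*p)->next)
		;
	it->next = *p;
	*p = it;
	pthread_mutex_unlock(&ail_lock);
}

/* the single pushing thread: pops the oldest item and "writes it back" */
static void *aild(void *arg)
{
	(void)arg;
	for (;;) {
		struct item *it;

		pthread_mutex_lock(&ail_lock);
		it = ail;
		if (it)
			ail = it->next;
		pthread_mutex_unlock(&ail_lock);
		if (!it)
			break;		/* empty: a real aild would idle here */
		printf("pushed lsn %ld\n", it->lsn);
		free(it);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	ail_insert(3);
	ail_insert(1);
	ail_insert(2);
	pthread_create(&t, NULL, aild, NULL);
	pthread_join(&t, NULL);
	return 0;
}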

How much did it improve scalability? On a 2048p machine with an
MPI job that did a synchronised close of 12,000 files (6 per CPU),
the close time went from ~5400s without the thread to 9s with the
xfsaild. That's only about 600x faster. ;)

> xfs seems to be sprouting daemons at a more rapid pace
> these days...xfsbufd, xfssyncd, xfsdatad, xfslogd, xfs_mru_cache, and
> now xfsaild?

Why not? Got to make use of all those cores machines have these
days. ;)

Fundamentally, threads are cheap and simple. We'll keep adding
threads where it makes sense as long as it improves performance and
scalability.

> Are there any design docs (scribbles?) saying what these do and why
> they were added so I can just go read 'em myself? I'm sure they
> were added for good reason...just am curious more than anything.

'git log' is your friend. The commits that introduce the new threads
explain why they are necessary. ;)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group