2008-01-07 11:10:27

by Joerg Platte

Subject: regression: 100% io-wait with 2.6.24-rcX

Hi,

when booting kernel 2.6.24-rc{4,5,6,7}, top reports up to 100% iowait, even if
no program accesses the disk on my Thinkpad T40p. Kernel 2.6.23.12 does not
suffer from this. Is there anything I can do to find out which process or
which part of the kernel is responsible for this? I can try to bisect it, but
maybe there are other ways to debug this, since I cannot reboot this
computer frequently. I discovered that there is no iowait within the first
few seconds after waking up from suspend to RAM...

regards,
Jörg


2008-01-07 11:19:48

by Ingo Molnar

Subject: Re: regression: 100% io-wait with 2.6.24-rcX


* Joerg Platte <[email protected]> wrote:

> Hi,
>
> when booting kernel 2.6.24-rc{4,5,6,7}, top reports up to 100% iowait,
> even if no program accesses the disk on my Thinkpad T40p. Kernel
> 2.6.23.12 does not suffer from this. Is there anything I can do to
> find out which process or which part of the kernel is responsible for
> this? I can try to bisect it, but maybe there are other ways
> to debug this, since I cannot reboot this computer frequently. I
> discovered that there is no iowait within the first few seconds after
> waking up from suspend to RAM...

do:

echo t > /proc/sysrq-trigger

and send us the dmesg output. If the dmesg output does not include the
bootup bits then increase CONFIG_LOG_BUF_SHIFT to 20 or so:

CONFIG_LOG_BUF_SHIFT=20

to have a large enough kernel message buffer.
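
For completeness, the whole capture sequence looks like this (a sketch; it
assumes a root shell and a kernel with CONFIG_MAGIC_SYSRQ=y, and the first
line is only needed if your distribution disables sysrq by default):

echo 1 > /proc/sys/kernel/sysrq   # allow all sysrq commands
echo t > /proc/sysrq-trigger      # dump all task states into the kernel log
dmesg > sysrq-t.txt               # save the log for posting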

Ingo

2008-01-07 13:24:26

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Monday, 7 January 2008, Ingo Molnar wrote:

> do:
>
> echo t > /proc/sysrq-trigger
>
> and send us the dmesg output. If the dmesg output does not include the
> bootup bits then increase CONFIG_LOG_BUF_SHIFT to 20 or so:
>
> CONFIG_LOG_BUF_SHIFT=20
>
> to have a large enough kernel message buffer.

The buffer was too small, hence I copied the relevant parts
of /var/log/kern.log. I hope it contains all the required information as well.

regards,
Jörg


Attachments:
(No filename) (459.00 B)
dmesg (17.12 kB)

2008-01-07 13:32:56

by Peter Zijlstra

Subject: Re: regression: 100% io-wait with 2.6.24-rcX


On Mon, 2008-01-07 at 14:24 +0100, Joerg Platte wrote:

This is from: 2.6.24-rc7

> kernel: pdflush D f41c2f14 0 18822 2
> kernel: f673f000 00000046 00000286 f41c2f14 f5194ce0 00000286 00000286 f41c2f14
> kernel: 00175279 f41c2f6c 00000000 c0271f6c f5ff363c f5ff3644 c0354a90 c0354a90
> kernel: 00175279 c0123251 f5194b80 c03546c0 c0271f67 6c666470 00687375 00000000
> kernel: Call Trace:
> kernel: [<c0271f6c>] schedule_timeout+0x6e/0x8b
> kernel: [<c0123251>] process_timeout+0x0/0x5
> kernel: [<c0271f67>] schedule_timeout+0x69/0x8b
> kernel: [<c027179a>] __sched_text_start+0x3a/0x70
> kernel: [<c014d34b>] congestion_wait+0x4e/0x62
> kernel: [<c012be2b>] autoremove_wake_function+0x0/0x33
> kernel: [<c014971e>] pdflush+0x0/0x1bf
> kernel: [<c01493c6>] wb_kupdate+0x8c/0xd1
> kernel: [<c014971e>] pdflush+0x0/0x1bf
> kernel: [<c0149839>] pdflush+0x11b/0x1bf
> kernel: [<c014933a>] wb_kupdate+0x0/0xd1
> kernel: [<c012bd71>] kthread+0x36/0x5d
> kernel: [<c012bd3b>] kthread+0x0/0x5d
> kernel: [<c010493b>] kernel_thread_helper+0x7/0x10
> kernel: =======================

What filesystem are you using?

2008-01-07 13:40:28

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Monday, 7 January 2008, Peter Zijlstra wrote:
> On Mon, 2008-01-07 at 14:24 +0100, Joerg Platte wrote:
>
> This is from: 2.6.24-rc7
>
> > kernel: pdflush D f41c2f14 0 18822 2
> > kernel: f673f000 00000046 00000286 f41c2f14 f5194ce0 00000286
> > 00000286 f41c2f14 kernel: 00175279 f41c2f6c 00000000 c0271f6c
> > f5ff363c f5ff3644 c0354a90 c0354a90 kernel: 00175279 c0123251
> > f5194b80 c03546c0 c0271f67 6c666470 00687375 00000000 kernel: Call Trace:
> > kernel: [<c0271f6c>] schedule_timeout+0x6e/0x8b
> > kernel: [<c0123251>] process_timeout+0x0/0x5
> > kernel: [<c0271f67>] schedule_timeout+0x69/0x8b
> > kernel: [<c027179a>] __sched_text_start+0x3a/0x70
> > kernel: [<c014d34b>] congestion_wait+0x4e/0x62
> > kernel: [<c012be2b>] autoremove_wake_function+0x0/0x33
> > kernel: [<c014971e>] pdflush+0x0/0x1bf
> > kernel: [<c01493c6>] wb_kupdate+0x8c/0xd1
> > kernel: [<c014971e>] pdflush+0x0/0x1bf
> > kernel: [<c0149839>] pdflush+0x11b/0x1bf
> > kernel: [<c014933a>] wb_kupdate+0x0/0xd1
> > kernel: [<c012bd71>] kthread+0x36/0x5d
> > kernel: [<c012bd3b>] kthread+0x0/0x5d
> > kernel: [<c010493b>] kernel_thread_helper+0x7/0x10
> > kernel: =======================
>
> What filesystem are you using?

Here you can see all currently mounted filesystems:

/dev/sda6 on / type ext3 (rw,noatime,errors=remount-ro,acl)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
/dev/sda7 on /tmp type ext2 (rw,noatime,errors=remount-ro,acl)
/dev/sda8 on /export type ext3 (rw,noatime,errors=remount-ro,acl)
/dev/sda1 on /winxp type ntfs (rw,umask=002,gid=10000,nls=utf8)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
automount(pid4936) on /amnt type autofs (rw,fd=4,pgrp=4936,minproto=2,maxproto=4)
automount(pid4965) on /home type autofs (rw,fd=4,pgrp=4965,minproto=2,maxproto=4)
nfsd on /proc/fs/nfsd type nfsd (rw)
/export/home/jplatte on /home/jplatte type none (rw,bind)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,noexec,nosuid,nodev)

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-09 03:27:32

by Wu Fengguang

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Mon, Jan 07, 2008 at 02:40:13PM +0100, Joerg Platte wrote:
> On Monday, 7 January 2008, Peter Zijlstra wrote:
> > On Mon, 2008-01-07 at 14:24 +0100, Joerg Platte wrote:
> >
> > This is from: 2.6.24-rc7
> >
> > > kernel: pdflush D f41c2f14 0 18822 2
> > > kernel: f673f000 00000046 00000286 f41c2f14 f5194ce0 00000286
> > > 00000286 f41c2f14 kernel: 00175279 f41c2f6c 00000000 c0271f6c
> > > f5ff363c f5ff3644 c0354a90 c0354a90 kernel: 00175279 c0123251
> > > f5194b80 c03546c0 c0271f67 6c666470 00687375 00000000 kernel: Call Trace:
> > > kernel: [<c0271f6c>] schedule_timeout+0x6e/0x8b
> > > kernel: [<c0123251>] process_timeout+0x0/0x5
> > > kernel: [<c0271f67>] schedule_timeout+0x69/0x8b
> > > kernel: [<c027179a>] __sched_text_start+0x3a/0x70
> > > kernel: [<c014d34b>] congestion_wait+0x4e/0x62
> > > kernel: [<c012be2b>] autoremove_wake_function+0x0/0x33
> > > kernel: [<c014971e>] pdflush+0x0/0x1bf
> > > kernel: [<c01493c6>] wb_kupdate+0x8c/0xd1
> > > kernel: [<c014971e>] pdflush+0x0/0x1bf
> > > kernel: [<c0149839>] pdflush+0x11b/0x1bf
> > > kernel: [<c014933a>] wb_kupdate+0x0/0xd1
> > > kernel: [<c012bd71>] kthread+0x36/0x5d
> > > kernel: [<c012bd3b>] kthread+0x0/0x5d
> > > kernel: [<c010493b>] kernel_thread_helper+0x7/0x10
> > > kernel: =======================
> >
> > What filesystem are you using?
>
> Here you can see all currently mounted filesystems:
>
> /dev/sda6 on / type ext3 (rw,noatime,errors=remount-ro,acl)
> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
> proc on /proc type proc (rw,noexec,nosuid,nodev)
> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> procbususb on /proc/bus/usb type usbfs (rw)
> udev on /dev type tmpfs (rw,mode=0755)
> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
> fusectl on /sys/fs/fuse/connections type fusectl (rw)
> /dev/sda7 on /tmp type ext2 (rw,noatime,errors=remount-ro,acl)
> /dev/sda8 on /export type ext3 (rw,noatime,errors=remount-ro,acl)
> /dev/sda1 on /winxp type ntfs (rw,umask=002,gid=10000,nls=utf8)

So they are ext3/ext2/ntfs. What if you umount ntfs, and ext2 if possible?
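
As a quick test, something like this (a sketch; it assumes nothing still
holds files open on those mount points, which lsof can verify first):

lsof /winxp /tmp    # list anything still using the two filesystems
umount /winxp       # take ntfs out of the picture
umount /tmp         # and ext2, if no process is blocking it

Then watch whether the iowait is gone with only ext3 left mounted.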

2008-01-09 06:13:32

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Wednesday, 9 January 2008, Fengguang Wu wrote:
> > /dev/sda6 on / type ext3 (rw,noatime,errors=remount-ro,acl)
> > tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
> > proc on /proc type proc (rw,noexec,nosuid,nodev)
> > sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> > procbususb on /proc/bus/usb type usbfs (rw)
> > udev on /dev type tmpfs (rw,mode=0755)
> > tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> > devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
> > fusectl on /sys/fs/fuse/connections type fusectl (rw)
> > /dev/sda7 on /tmp type ext2 (rw,noatime,errors=remount-ro,acl)
> > /dev/sda8 on /export type ext3 (rw,noatime,errors=remount-ro,acl)
> > /dev/sda1 on /winxp type ntfs (rw,umask=002,gid=10000,nls=utf8)
>
> So they are ext3/ext2/ntfs. What if you umount ntfs, and ext2 if possible?

Unmounting ntfs doesn't help, hence I converted the remaining ext2 filesystem
to ext3, modified the fstab entry accordingly and rebooted. Now everything
seems to be fine! Top reports an idle system and there is no abnormal iowait
any longer! It seems ext2 was causing this! Later today I can try to
remount the filesystem as ext2 to make sure the bug shows up again.
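
For reference, the conversion itself is a one-liner (a sketch; tune2fs -j
adds a journal to an existing ext2 filesystem, turning it into ext3 in
place, after which the fstab type can be changed):

umount /tmp            # safest on an unmounted filesystem
tune2fs -j /dev/sda7   # add a journal to the /tmp partition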

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-09 12:05:08

by Wu Fengguang

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Wed, Jan 09, 2008 at 07:13:14AM +0100, Joerg Platte wrote:
> On Wednesday, 9 January 2008, Fengguang Wu wrote:
> > > /dev/sda6 on / type ext3 (rw,noatime,errors=remount-ro,acl)
> > > tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
> > > proc on /proc type proc (rw,noexec,nosuid,nodev)
> > > sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> > > procbususb on /proc/bus/usb type usbfs (rw)
> > > udev on /dev type tmpfs (rw,mode=0755)
> > > tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> > > devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
> > > fusectl on /sys/fs/fuse/connections type fusectl (rw)
> > > /dev/sda7 on /tmp type ext2 (rw,noatime,errors=remount-ro,acl)
> > > /dev/sda8 on /export type ext3 (rw,noatime,errors=remount-ro,acl)
> > > /dev/sda1 on /winxp type ntfs (rw,umask=002,gid=10000,nls=utf8)
> >
> > So they are ext3/ext2/ntfs. What if you umount ntfs, and ext2 if possible?
>
> Unmounting ntfs doesn't help, hence I converted the remaining ext2 filesystem
> to ext3, modified the fstab entry accordingly and rebooted. Now everything
> seems to be fine! Top reports an idle system and there is no abnormal iowait
> any longer! It seems ext2 was causing this! Later today I can try to
> remount the filesystem as ext2 to make sure the bug shows up again.

Thank you for the clue. However, I cannot reproduce the bug on
ext2/2.6.24-rc7. Can you provide more details? Thank you.

Fengguang

2008-01-09 12:23:18

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Wednesday, 9 January 2008, Fengguang Wu wrote:

Thank you for the hint with the filesystems!

> Thank you for the clue. However, I cannot reproduce the bug on
> ext2/2.6.24-rc7. Can you provide more details? Thank you.

I attached some more information. I'm using the ata_piix driver for my PATA
disk and cdrom drive and booted with hpet=force. Kernel 2.6.23.X has been
patched with the -hrt patches to enable the hidden HPET timer on the ICH4-M
chipset. I just rebooted the notebook and mounted /tmp again as ext2, and now
the iowait problem is back. It seems to be reproducible on my computer. What
additional information do you need?
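
The remount test itself (a sketch; an ext3 filesystem can normally be
mounted with the ext2 driver as long as it was cleanly unmounted, since the
journal is then simply not used):

umount /tmp
mount -t ext2 /dev/sda7 /tmp   # force the ext2 driver for the converted partition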

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605


Attachments:
(No filename) (783.00 B)
lspci (1.58 kB)
lsmod (7.30 kB)
dmesg (15.12 kB)

2008-01-09 12:57:23

by Wu Fengguang

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Wed, Jan 09, 2008 at 01:22:33PM +0100, Joerg Platte wrote:
> On Wednesday, 9 January 2008, Fengguang Wu wrote:
>
> Thank you for the hint with the filesystems!
>
> > Thank you for the clue. However, I cannot reproduce the bug on
> > ext2/2.6.24-rc7. Can you provide more details? Thank you.
>
> I attached some more information. I'm using the ata_piix driver for my PATA
> disk and cdrom drive and booted with hpet=force. Kernel 2.6.23.X has been
~~~~~~~~
not 2.6.24-rc7?

> patched with the -hrt patches to enable the hidden HPET timer on the ICH4-M
> chipset. I just rebooted the notebook and mounted /tmp again as ext2, and now
> the iowait problem is back. It seems to be reproducible on my computer. What
> additional information do you need?

I mounted an ext2 filesystem as /tmp and found no problem. My config options are:

CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
# CONFIG_EXT2_FS_XIP is not set

Fengguang

2008-01-09 13:04:41

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Wednesday, 9 January 2008, Fengguang Wu wrote:
> On Wed, Jan 09, 2008 at 01:22:33PM +0100, Joerg Platte wrote:
> > On Wednesday, 9 January 2008, Fengguang Wu wrote:
> >
> > Thank you for the hint with the filesystems!
> >
> > > Thank you for the clue. However, I cannot reproduce the bug on
> > > ext2/2.6.24-rc7. Can you provide more details? Thank you.
> >
> > I attached some more information. I'm using the ata_piix driver for my
> > PATA disk and cdrom drive and booted with hpet=force. Kernel 2.6.23.X has
> > been
>
> ~~~~~~~~
> not 2.6.24-rc7?
>
No, 2.6.23.X was the last working kernel without this problem; the bug shows
up with 2.6.24-rcX. I just wanted to emphasize that I don't think it has
anything to do with the HPET stuff. Kernel 2.6.24-rcX is unpatched, because
the hrt patches have been merged.

> > patched with the -hrt patches to enable the hidden HPET timer on the
> > ICH4-M chipset. I just rebooted the notebook and mounted /tmp again as
> > ext2, and now the iowait problem is back. It seems to be reproducible on my
> > computer. What additional information do you need?
>
> I mounted an ext2 filesystem as /tmp and found no problem. My config options are:
>
> CONFIG_EXT2_FS=y
> CONFIG_EXT2_FS_XATTR=y
> CONFIG_EXT2_FS_POSIX_ACL=y
> CONFIG_EXT2_FS_SECURITY=y
> # CONFIG_EXT2_FS_XIP is not set
>
> Fengguang

CONFIG_EXT2_FS=m
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y

Here it is modular and I enabled CONFIG_EXT2_FS_XIP, but the same is enabled
on 2.6.23.X. Maybe it is related to libata? I additionally discovered that
the problem disappears for a few seconds when I press the eject button for
the ultra bay of my Thinkpad. Pressing the button unregisters the cdrom drive
so that it can be replaced with a hard drive or a battery. Maybe this bug is
Thinkpad related?

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-10 06:58:30

by Wu Fengguang

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Wed, Jan 09, 2008 at 02:04:29PM +0100, Joerg Platte wrote:
> On Wednesday, 9 January 2008, Fengguang Wu wrote:
> > On Wed, Jan 09, 2008 at 01:22:33PM +0100, Joerg Platte wrote:
> > > On Wednesday, 9 January 2008, Fengguang Wu wrote:
> > >
> > > Thank you for the hint with the filesystems!
> > >
> > > > Thank you for the clue. However, I cannot reproduce the bug on
> > > > ext2/2.6.24-rc7. Can you provide more details? Thank you.
> > >
> > > I attached some more information. I'm using the ata_piix driver for my
> > > PATA disk and cdrom drive and booted with hpet=force. Kernel 2.6.23.X has
> > > been
> >
> > ~~~~~~~~
> > not 2.6.24-rc7?
> >
> No, 2.6.23.X was the last working kernel without this problem; the bug shows
> up with 2.6.24-rcX. I just wanted to emphasize that I don't think it has
> anything to do with the HPET stuff. Kernel 2.6.24-rcX is unpatched, because
> the hrt patches have been merged.
>
> > > patched with the -hrt patches to enable the hidden HPET timer on the
> > > ICH4-M chipset. I just rebooted the notebook and mounted /tmp again as
> > > ext2, and now the iowait problem is back. It seems to be reproducible on my
> > > computer. What additional information do you need?
> >
> > I mounted an ext2 filesystem as /tmp and found no problem. My config options are:
> >
> > CONFIG_EXT2_FS=y
> > CONFIG_EXT2_FS_XATTR=y
> > CONFIG_EXT2_FS_POSIX_ACL=y
> > CONFIG_EXT2_FS_SECURITY=y
> > # CONFIG_EXT2_FS_XIP is not set
> >
> > Fengguang
>
> CONFIG_EXT2_FS=m
> CONFIG_EXT2_FS_XATTR=y
> CONFIG_EXT2_FS_POSIX_ACL=y
> CONFIG_EXT2_FS_SECURITY=y
> CONFIG_EXT2_FS_XIP=y
>
> Here it is modular and I enabled CONFIG_EXT2_FS_XIP, but the same is enabled
> on 2.6.23.X. Maybe it is related to libata? I additionally discovered that
> the problem disappears for a few seconds when I press the eject button for
> the ultra bay of my Thinkpad. Pressing the button unregisters the cdrom drive
> so that it can be replaced with a hard drive or a battery. Maybe this bug is
> Thinkpad related?

At last I caught it :-)

[ 1862.219189] requeue_io 301: inode 50948 size 0 at 03:02(hda2)
[ 1862.219199] requeue_io 301: inode 51616 size 0 at 03:02(hda2)
[ 1862.219204] requeue_io 301: inode 51656 size 0 at 03:02(hda2)
[ 1862.219208] requeue_io 301: inode 51655 size 0 at 03:02(hda2)
[ 1862.219216] mm/page-writeback.c 668 wb_kupdate: pdflush(182) 10768 global 3100 0 0 wc _M tw 1024 sk 1
[ 1862.319039] requeue_io 301: inode 50948 size 0 at 03:02(hda2)
[ 1862.319050] requeue_io 301: inode 51616 size 0 at 03:02(hda2)
[ 1862.319055] requeue_io 301: inode 51656 size 0 at 03:02(hda2)
[ 1862.319059] requeue_io 301: inode 51655 size 0 at 03:02(hda2)
[ 1862.319068] mm/page-writeback.c 668 wb_kupdate: pdflush(182) 10768 global 3100 0 0 wc _M tw 1024 sk 1

They are all zero-sized files; maybe something goes wrong in the truncate code.
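
To see which files these inodes belong to, the numbers from the requeue_io
lines can be mapped back to path names (a sketch; debugfs opens the device
read-only by default, and the inode numbers are the ones from the messages
above):

debugfs -R 'ncheck 50948 51616 51655 51656' /dev/hda2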

2008-01-10 07:31:10

by Wu Fengguang

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

Joerg,

Can you try the attached patches? Thank you.
I cannot reliably reproduce the bug yet.

Fengguang


Attachments:
(No filename) (105.00 B)
writeback-ext2-fix.patch (437.00 B)
writeback-debug.patch (2.61 kB)
requeue_io-debug.patch (1.11 kB)

2008-01-10 07:53:21

by Wu Fengguang

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Thu, Jan 10, 2008 at 03:30:46PM +0800, Fengguang Wu wrote:
> Joerg,
>
> Can you try the attached patches? Thank you.
> I cannot reliably reproduce the bug yet.

Please ignore the first patch and only apply the two debugging
patches. They will produce many printk messages. The output of
`dmesg` will be enough for me.
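
Applying them would look roughly like this (a sketch, assuming the patches
are saved next to the 2.6.24-rc7 source tree):

cd linux-2.6.24-rc7
patch -p1 < ../writeback-debug.patch    # adds the wb_kupdate printk messages
patch -p1 < ../requeue_io-debug.patch   # adds the requeue_io printk messages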

Thank you,
Fengguang

2008-01-10 08:38:28

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Thursday, 10 January 2008, Fengguang Wu wrote:
> On Thu, Jan 10, 2008 at 03:30:46PM +0800, Fengguang Wu wrote:
> > Joerg,
> >
> > Can you try the attached patches? Thank you.
> > I cannot reliably reproduce the bug yet.
>
> Please ignore the first patch and only apply the two debugging
> patches. They will produce many printk messages. The output of
> `dmesg` will be enough for me.

Too late, I'm already compiling the kernel. But I currently have another
problem, because the iowait problem disappeared today after the regular
Debian update. I'll try to install the old package versions to make it show
up again. Maybe that helps to debug it.

Regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-10 08:44:19

by Wu Fengguang

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Thu, Jan 10, 2008 at 09:37:53AM +0100, Joerg Platte wrote:
> On Thursday, 10 January 2008, Fengguang Wu wrote:
> > On Thu, Jan 10, 2008 at 03:30:46PM +0800, Fengguang Wu wrote:
> > > Joerg,
> > >
> > > Can you try the attached patches? Thank you.
> > > I cannot reliably reproduce the bug yet.
> >
> > Please ignore the first patch and only apply the two debugging
> > patches. They will produce many printk messages. The output of
> > `dmesg` will be enough for me.
>
> Too late, I'm already compiling the kernel. But I currently have another

It won't hurt, as you don't use XIP.

> problem, because the iowait problem disappeared today after the regular
> Debian update. I'll try to install the old package versions to make it show
> up again. Maybe that helps to debug it.

Thank you. I'm running sid, ext2 as rootfs now ;-)

2008-01-10 10:03:27

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Thursday, 10 January 2008, Fengguang Wu wrote:
> > problem, because the iowait problem disappeared today after the regular
> > Debian update. I'll try to install the old package versions to make it
> > show up again. Maybe that helps to debug it.
>
> Thank you. I'm running sid, ext2 as rootfs now ;-)

The error is back and I'm getting thousands of messages like this with the
patched kernel:

mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3936 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3936 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3936 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3936 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3936 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3937 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3937 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3937 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3937 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3937 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3937 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3937 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3937 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3938 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3938 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3939 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3939 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3939 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3939 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3939 0 0 wc _M tw 1024 sk 2
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3939 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3939 0 0 wc _M tw 1024 sk 0
requeue_io 301: inode 81441 size 0 at 08:07(sda7)
mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3939 0 0 wc _M tw 1024 sk 2

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-11 05:23:31

by Wu Fengguang

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Thu, Jan 10, 2008 at 11:03:05AM +0100, Joerg Platte wrote:
> On Thursday, 10 January 2008, Fengguang Wu wrote:
> > > problem, because the iowait problem disappeared today after the regular
> > > Debian update. I'll try to install the old package versions to make it
> > > show up again. Maybe that helps to debug it.
> >
> > Thank you. I'm running sid, ext2 as rootfs now ;-)
>
> The error is back and I'm getting thousands of messages like this with the
> patched kernel:
>
> mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3936 0 0 wc _M tw 1024 sk 0
> requeue_io 301: inode 81441 size 0 at 08:07(sda7)
> mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
> requeue_io 301: inode 81441 size 0 at 08:07(sda7)
> mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
> requeue_io 301: inode 81441 size 0 at 08:07(sda7)

Joerg, what's the output of `dumpe2fs /dev/sda7` and `lsof|grep /tmp`?

2008-01-11 06:00:40

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Friday, 11 January 2008, Fengguang Wu wrote:

> Joerg, what's the output of `dumpe2fs /dev/sda7` and `lsof|grep /tmp`?

Fengguang, here is the output (kernel 2.6.24-rc7 without your patches):

Filesystem volume name: TMP
Last mounted on: <not available>
Filesystem UUID: e23ae961-bbdc-44bc-b662-f28f7314939b
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal resize_inode dir_index filetype
sparse_super large_file
Filesystem flags: signed directory hash
Default mount options: (none)
Filesystem state: not clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 244320
Block count: 487966
Reserved block count: 24398
Free blocks: 468728
Free inodes: 244162
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 119
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16288
Inode blocks per group: 509
Filesystem created: Thu Feb 8 18:46:19 2007
Last mount time: Thu Jan 10 11:09:35 2008
Last write time: Fri Jan 11 06:12:58 2008
Mount count: 25
Maximum mount count: 39
Last checked: Tue Dec 11 22:07:03 2007
Check interval: 15552000 (6 months)
Next check after: Sun Jun 8 23:07:03 2008
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: 79e271e9-b874-4e11-92b0-cdb36d07e1c1
Journal backup: inode blocks
Journal size: 32M


Group 0: (Blocks 0-32767)
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-120
Block bitmap at 121 (+121), Inode bitmap at 122 (+122)
Inode table at 123-631 (+123)
23926 free blocks, 16275 free inodes, 2 directories
Free blocks: 638-10239, 16384, 18444-30719, 30721-32767
Free inodes: 14-16288
Group 1: (Blocks 32768-65535)
Backup superblock at 32768, Group descriptors at 32769-32769
Reserved GDT blocks at 32770-32888
Block bitmap at 32889 (+121), Inode bitmap at 32890 (+122)
Inode table at 32891-33399 (+123)
32135 free blocks, 16287 free inodes, 1 directories
Free blocks: 33400-57343, 57345-65535
Free inodes: 16290-32576
Group 2: (Blocks 65536-98303)
Block bitmap at 65536 (+0), Inode bitmap at 65537 (+1)
Inode table at 65538-66046 (+2)
32256 free blocks, 16286 free inodes, 1 directories
Free blocks: 66047-96255, 96257-98303
Free inodes: 32579-48864
Group 3: (Blocks 98304-131071)
Backup superblock at 98304, Group descriptors at 98305-98305
Reserved GDT blocks at 98306-98424
Block bitmap at 98425 (+121), Inode bitmap at 98426 (+122)
Inode table at 98427-98935 (+123)
32135 free blocks, 16286 free inodes, 1 directories
Free blocks: 98936-120831, 120833-131071
Free inodes: 48867-65152
Group 4: (Blocks 131072-163839)
Block bitmap at 131072 (+0), Inode bitmap at 131073 (+1)
Inode table at 131074-131582 (+2)
32256 free blocks, 16287 free inodes, 1 directories
Free blocks: 131583-133119, 133121-163839
Free inodes: 65154-81440
Group 5: (Blocks 163840-196607)
Backup superblock at 163840, Group descriptors at 163841-163841
Reserved GDT blocks at 163842-163960
Block bitmap at 163961 (+121), Inode bitmap at 163962 (+122)
Inode table at 163963-164471 (+123)
32135 free blocks, 16286 free inodes, 1 directories
Free blocks: 164472-182271, 182273-196607
Free inodes: 81443-97728
Group 6: (Blocks 196608-229375)
Block bitmap at 196608 (+0), Inode bitmap at 196609 (+1)
Inode table at 196610-197118 (+2)
32087 free blocks, 16265 free inodes, 1 directories
Free blocks: 197119-200703, 200705-210943, 210945-210951, 210970-210975,
210988-215039, 215041-215047, 215058-215071, 215090-217087, 217089-219135,
219137-219143, 219177-219199, 219226-219255, 219257-219263, 219265-219271,
219274-219279, 219305-219335, 219337-219343, 219345-219351, 219354-219359,
219371-219383, 219385-227327, 227331-229375
Free inodes: 97752-114016
Group 7: (Blocks 229376-262143)
Backup superblock at 229376, Group descriptors at 229377-229377
Reserved GDT blocks at 229378-229496
Block bitmap at 229497 (+121), Inode bitmap at 229498 (+122)
Inode table at 229499-230007 (+123)
32134 free blocks, 16278 free inodes, 1 directories
Free blocks: 230008-253951, 253953-260095, 260097-262143
Free inodes: 114022, 114025-114027, 114031-130304
Group 8: (Blocks 262144-294911)
Block bitmap at 262144 (+0), Inode bitmap at 262145 (+1)
Inode table at 262146-262654 (+2)
29809 free blocks, 16193 free inodes, 17 directories
Free blocks: 262661-264191, 264244, 264250, 264263, 264272, 264281,
264288-266239, 266870-266871, 266877-268287, 268289-270335, 270965-274431,
274434-282623, 282625-290815, 291897-292863, 292866-294911
Free inodes: 130316, 130319, 130388, 130402-130406, 130408-146592
Group 9: (Blocks 294912-327679)
Backup superblock at 294912, Group descriptors at 294913-294913
Reserved GDT blocks at 294914-295032
Block bitmap at 295033 (+121), Inode bitmap at 295034 (+122)
Inode table at 295035-295543 (+123)
32135 free blocks, 16286 free inodes, 1 directories
Free blocks: 295544-313343, 313345-327679
Free inodes: 146595-162880
Group 10: (Blocks 327680-360447)
Block bitmap at 327680 (+0), Inode bitmap at 327681 (+1)
Inode table at 327682-328190 (+2)
32256 free blocks, 16283 free inodes, 1 directories
Free blocks: 328191-342015, 342017-360447
Free inodes: 162886-179168
Group 11: (Blocks 360448-393215)
Block bitmap at 360448 (+0), Inode bitmap at 360449 (+1)
Inode table at 360450-360958 (+2)
32248 free blocks, 16286 free inodes, 1 directories
Free blocks: 360959-372735, 372744-380927, 380929-393215
Free inodes: 179171-195456
Group 12: (Blocks 393216-425983)
Block bitmap at 393216 (+0), Inode bitmap at 393217 (+1)
Inode table at 393218-393726 (+2)
32257 free blocks, 16288 free inodes, 0 directories
Free blocks: 393727-425983
Free inodes: 195457-211744
Group 13: (Blocks 425984-458751)
Block bitmap at 425984 (+0), Inode bitmap at 425985 (+1)
Inode table at 425986-426494 (+2)
32256 free blocks, 16287 free inodes, 1 directories
Free blocks: 426495-436223, 436225-458751
Free inodes: 211746-228032
Group 14: (Blocks 458752-487965)
Block bitmap at 458752 (+0), Inode bitmap at 458753 (+1)
Inode table at 458754-459262 (+2)
28703 free blocks, 16288 free inodes, 0 directories
Free blocks: 459263-487965
Free inodes: 228033-244320

konqueror 987 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
konqueror 987 jplatte 12r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
konqueror 987 jplatte 13u REG 8,7 579
97731 /tmp/kde-jplatte/konqueror-crash-Jv2u8a.log
konqueror 987 jplatte 14u REG 8,7 128289
97734 /tmp/kde-jplatte/khtmlcacheA7VjAa.tmp (deleted)
konqueror 987 jplatte 16u REG 8,7 43334
97750 /tmp/kde-jplatte/khtmlcacheZUNlsc.tmp (deleted)
konqueror 987 jplatte 22u REG 8,7 797
97751 /tmp/kde-jplatte/khtmlcache76bZYa.tmp (deleted)
konqueror 987 jplatte 27u REG 8,7 102347
97735 /tmp/kde-jplatte/khtmlcacheMhlDJb.tmp (deleted)
konqueror 987 jplatte 31u REG 8,7 354
97738 /tmp/kde-jplatte/khtmlcacheq21uoc.tmp (deleted)
konqueror 987 jplatte 32u REG 8,7 1360
97742 /tmp/kde-jplatte/khtmlcacheo2Ms2a.tmp (deleted)
konqueror 987 jplatte 34u REG 8,7 6378
97745 /tmp/kde-jplatte/khtmlcacheLETtgc.tmp (deleted)
konqueror 987 jplatte 35u REG 8,7 97350
97746 /tmp/kde-jplatte/khtmlcache5jit8a.tmp (deleted)
konqueror 987 jplatte 36u REG 8,7 354
97747 /tmp/kde-jplatte/khtmlcache7VBSNa.tmp (deleted)
konqueror 987 jplatte 37u REG 8,7 1360
97748 /tmp/kde-jplatte/khtmlcachetSNbub.tmp (deleted)
konqueror 987 jplatte 38u REG 8,7 6073
97749 /tmp/kde-jplatte/khtmlcacheKAKhuc.tmp (deleted)
xfs 5503 root 3u unix 0xdfdd7e00
15884 /tmp/.font-unix/fs7100
xfs 5503 root 4u unix 0xf724ee00
18389 /tmp/.font-unix/fs7100
Xorg 6031 root 1u unix 0xf7151000
17447 /tmp/.X11-unix/X0
Xorg 6031 root 3u unix 0xf40f5540
108479 /tmp/.X11-unix/X0
Xorg 6031 root 10u unix 0xf400ec40
107863 /tmp/.X11-unix/X0
Xorg 6031 root 13u unix 0xf724ea80
18532 /tmp/.X11-unix/X0
Xorg 6031 root 14u unix 0xf6702000
18880 /tmp/.X11-unix/X0
Xorg 6031 root 15u unix 0xf5d6ac40
19298 /tmp/.X11-unix/X0
Xorg 6031 root 16u unix 0xf5297c40
25334 /tmp/.X11-unix/X0
Xorg 6031 root 17u unix 0xf410b1c0
31056 /tmp/.X11-unix/X0
Xorg 6031 root 18u unix 0xf67021c0
19002 /tmp/.X11-unix/X0
Xorg 6031 root 19u unix 0xf5c7f700
19014 /tmp/.X11-unix/X0
Xorg 6031 root 20u unix 0xf5051380
20514 /tmp/.X11-unix/X0
Xorg 6031 root 21u unix 0xf5ccc540
19100 /tmp/.X11-unix/X0
Xorg 6031 root 22u unix 0xf5c7fa80
19086 /tmp/.X11-unix/X0
Xorg 6031 root 23u unix 0xf725c380
19123 /tmp/.X11-unix/X0
Xorg 6031 root 24u unix 0xf5d29380
19147 /tmp/.X11-unix/X0
Xorg 6031 root 25u unix 0xf5174700
31277 /tmp/.X11-unix/X0
Xorg 6031 root 26u unix 0xf5db3380
19252 /tmp/.X11-unix/X0
Xorg 6031 root 27u unix 0xf51ce8c0
41610 /tmp/.X11-unix/X0
Xorg 6031 root 28u unix 0xf5de8e00
19379 /tmp/.X11-unix/X0
Xorg 6031 root 29u unix 0xf5e0b8c0
19401 /tmp/.X11-unix/X0
Xorg 6031 root 30u unix 0xf5e5c540
19419 /tmp/.X11-unix/X0
Xorg 6031 root 31u unix 0xf5e9ba80
19500 /tmp/.X11-unix/X0
Xorg 6031 root 32u unix 0xf5e0ba80
19515 /tmp/.X11-unix/X0
Xorg 6031 root 33u unix 0xf5e7fa80
19531 /tmp/.X11-unix/X0
Xorg 6031 root 34u unix 0xf5ea61c0
19982 /tmp/.X11-unix/X0
Xorg 6031 root 35u unix 0xf5ea6a80
19991 /tmp/.X11-unix/X0
Xorg 6031 root 36u unix 0xf7220000
20286 /tmp/.X11-unix/X0
Xorg 6031 root 37u unix 0xf5187700
54906 /tmp/.X11-unix/X0
Xorg 6031 root 39u unix 0xf5187e00
43577 /tmp/.X11-unix/X0
Xorg 6031 root 40u unix 0xf5ec9a80
20377 /tmp/.X11-unix/X0
Xorg 6031 root 41u unix 0xf51b2380
20809 /tmp/.X11-unix/X0
Xorg 6031 root 42u unix 0xf71921c0
20455 /tmp/.X11-unix/X0
Xorg 6031 root 43u unix 0xf528a1c0
99356 /tmp/.X11-unix/X0
Xorg 6031 root 44u unix 0xf5297540
21298 /tmp/.X11-unix/X0
Xorg 6031 root 45u unix 0xf5099c40
20549 /tmp/.X11-unix/X0
Xorg 6031 root 46u unix 0xf50ff380
20600 /tmp/.X11-unix/X0
Xorg 6031 root 47u unix 0xf510c8c0
20655 /tmp/.X11-unix/X0
Xorg 6031 root 48u unix 0xf510cc40
20657 /tmp/.X11-unix/X0
Xorg 6031 root 49u unix 0xf5138540
20663 /tmp/.X11-unix/X0
Xorg 6031 root 50u unix 0xf50cb380
20683 /tmp/.X11-unix/X0
Xorg 6031 root 51u unix 0xf51b2e00
20883 /tmp/.X11-unix/X0
Xorg 6031 root 52u unix 0xf52c3a80
21316 /tmp/.X11-unix/X0
Xorg 6031 root 53u unix 0xf52f3a80
21373 /tmp/.X11-unix/X0
Xorg 6031 root 54u unix 0xf528ac40
21871 /tmp/.X11-unix/X0
Xorg 6031 root 55u unix 0xdfd301c0
25107 /tmp/.X11-unix/X0
Xorg 6031 root 56u unix 0xf67ab380
25380 /tmp/.X11-unix/X0
Xorg 6031 root 58u unix 0xf512c700
49503 /tmp/.X11-unix/X0
Xorg 6031 root 59u unix 0xf51cee00
55093 /tmp/.X11-unix/X0
Xorg 6031 root 60u unix 0xf5187a80
54949 /tmp/.X11-unix/X0
kontact 6669 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kontact 6669 jplatte 12r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kontact 6669 jplatte 20u unix 0xf5005000
109900 /tmp/ksocket-jplatte/kontactb5GSMb.slave-socket
kontact 6669 jplatte 33u unix 0xf712a700
108568 /tmp/ksocket-jplatte/kontact7somnc.slave-socket
kontact 6669 jplatte 34u unix 0xf50d78c0
108570 /tmp/ksocket-jplatte/kontactQdchTb.slave-socket
kontact 6669 jplatte 36u unix 0xf50d7000
108569 /tmp/ksocket-jplatte/kontactdlqZ9b.slave-socket
kio_imap4 6691 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kio_imap4 6691 jplatte 10r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kio_imap4 6698 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kio_imap4 6698 jplatte 10r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kio_imap4 6699 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kio_imap4 6699 jplatte 10r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
gam_serve 6972 root cwd DIR 8,7 0
211745 /tmp/1878472609 (deleted)
ssh-agent 7097 jplatte 3u unix 0xf724f1c0
18873 /tmp/ssh-gdxKKv7049/agent.7049
gpg-agent 7098 jplatte 5u unix 0xf724f540
18877 /tmp/gpg-cKJSSb/S.gpg-agent
kdeinit 7159 jplatte 9u unix 0xf5c2ba80
18966 /tmp/ksocket-jplatte/kdeinit__0
kdeinit 7159 jplatte 10u unix 0xf5c2bc40
18968 /tmp/ksocket-jplatte/kdeinit-
dcopserve 7162 jplatte 6u unix 0xf5c7f380
19011 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 8u unix 0xf5c891c0
18977 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 11u unix 0xf5c89e00
18993 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 12u unix 0xf5cd6700
20495 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 13u unix 0xf5c2b1c0
19092 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 14u unix 0xf5d1d540
19107 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 15u unix 0xf5dd5e00
19295 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 16u unix 0xf725c000
19118 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 17u unix 0xf725ce00
19142 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 18u unix 0xf5ee9000
108474 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 20u unix 0xf5db3000
19249 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 21u unix 0xf5db3540
19264 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 22u unix 0xf51cea80
108608 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 23u unix 0xf5de81c0
19376 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 24u unix 0xf7161700
20283 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 25u unix 0xf400e000
108491 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 26u unix 0xf5e0b540
19398 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 27u unix 0xf5e9b700
19497 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 28u unix 0xf5e5c1c0
19416 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 29u unix 0xf5de8c40
19541 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 30u unix 0xf400e700
108493 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 31u unix 0xf5ea4a80
20007 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 32u unix 0xf5e7f700
19526 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 33u unix 0xf5ea6e00
19993 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 34u unix 0xf400e380
108495 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 35u unix 0xf50ff540
108497 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 36u unix 0xf5cb0000
108499 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 37u unix 0xf5ec9380
20374 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 38u unix 0xf5cb0700
108501 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 39u unix 0xf5cb01c0
108503 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 40u unix 0xf514b8c0
20914 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 41u unix 0xf506be00
21019 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 42u unix 0xf5cd6a80
20506 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 43u unix 0xf51ce540
108610 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 44u unix 0xf5099380
20542 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 45u unix 0xf50998c0
20546 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 46u unix 0xf5138c40
20680 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 47u unix 0xdfd4fc40
99363 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 49u unix 0xf50ff000
20597 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 50u unix 0xf724f700
99470 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 51u unix 0xf50ffe00
20616 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 54u unix 0xf50fd8c0
20623 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 56u unix 0xf510c380
20649 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 58u unix 0xf6702c40
108612 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 60u unix 0xf50d7e00
20764 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 74u unix 0xf5255540
43572 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 88u unix 0xf407da80
25329 /tmp/.ICE-unix/dcop7162-1199959838
dcopserve 7162 jplatte 89u unix 0xf6702540
25375 /tmp/.ICE-unix/dcop7162-1199959838
klauncher 7164 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
klauncher 7164 jplatte 10r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
klauncher 7164 jplatte 11u unix 0xf5c89540
18999 /tmp/ksocket-jplatte/klauncherDCSyUa.slave-socket
klauncher 7164 jplatte 12u unix 0xf5255a80
110308 /tmp/ksocket-jplatte/klauncherDCSyUa.slave-socket
kded 7166 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kded 7166 jplatte 14r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
ksmserver 7183 jplatte 7u unix 0xf5ccc8c0
19076 /tmp/ksocket-jplatte/kdeinit__0
ksmserver 7183 jplatte 12u unix 0xf5c2b8c0
19093 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 13u unix 0xf5d1d1c0
19102 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 14u unix 0xf5d1d8c0
19110 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 15u unix 0xf725ca80
19127 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 16u unix 0xf5d298c0
19151 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 17u unix 0xf5db3a80
19256 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 18u unix 0xf5c2b000
19306 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 19u unix 0xdfdd7700
19472 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 20u unix 0xf5c7fc40
19384 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 21u unix 0xf5dddc40
19405 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 22u unix 0xf712a380
20290 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 23u unix 0xf5e5ca80
19504 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 24u unix 0xf5e7f1c0
19517 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 25u unix 0xf5e7fe00
19533 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 26u unix 0xf5ea6700
19984 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 27u unix 0xf5ea4380
19995 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 28u unix 0xf5d6aa80
43581 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 29u unix 0xf40f5700
108483 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 30u unix 0xdfd4f1c0
99358 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 31u unix 0xf5ee9e00
20381 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 32u unix 0xf5005700
20467 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 34u unix 0xf512c1c0
21389 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 35u unix 0xf50cbc40
20568 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 36u unix 0xf50e4000
20571 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 37u unix 0xf400e540
25338 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 38u unix 0xf50e4700
20575 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 39u unix 0xf50ffa80
20604 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 40u unix 0xf51381c0
20665 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 41u unix 0xf514b380
20693 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 42u unix 0xf514b540
20851 /tmp/.ICE-unix/7183
ksmserver 7183 jplatte 43u unix 0xf52c3700
25384 /tmp/.ICE-unix/7183
kicker 7192 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kicker 7192 jplatte 12r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kxkb 7215 jplatte 12r REG 8,7 11912
97730 /tmp/kde-jplatte/de..xkm
konqueror 7255 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
konqueror 7255 jplatte 12r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
konqueror 7255 jplatte 13u REG 8,7 283
97732 /tmp/kde-jplatte/konqueror-crash-LPnWVb.log
konqueror 7255 jplatte 18u REG 8,7 67736
97740 /tmp/kde-jplatte/khtmlcacheJfHsua.tmp (deleted)
konqueror 7255 jplatte 21u REG 8,7 48368
97741 /tmp/kde-jplatte/khtmlcache8KkbLa.tmp (deleted)
konqueror 7337 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
konqueror 7337 jplatte 12r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
konqueror 7337 jplatte 13u REG 8,7 102
97733 /tmp/kde-jplatte/konqueror-crash-YtoSma.log
konqueror 7337 jplatte 16u REG 8,7 37907
97743 /tmp/kde-jplatte/khtmlcacheemjbkb.tmp (deleted)
konqueror 7337 jplatte 22u REG 8,7 66887
97744 /tmp/kde-jplatte/khtmlcache1u4D4b.tmp (deleted)
amarokapp 7351 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
amarokapp 7351 jplatte 18r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kopete 7447 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kopete 7447 jplatte 12r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kdeinit 7647 root cwd DIR 8,7 4096
130305 /tmp/1878472609
kdeinit 7647 root 7u unix 0xf51ce1c0
21255 /tmp/1878472609/.kde/socket-ibm/kdeinit__0
kdeinit 7647 root 8u unix 0xf67ab8c0
21257 /tmp/1878472609/.kde/socket-ibm/kdeinit-
dcopserve 7651 root cwd DIR 8,7 4096
130305 /tmp/1878472609
dcopserve 7651 root 5u unix 0xf5297000
21279 /tmp/.ICE-unix/dcop7651-1199959912
dcopserve 7651 root 7u unix 0xf67abc40
21265 /tmp/.ICE-unix/dcop7651-1199959912
dcopserve 7651 root 10u unix 0xf52c3000
21308 /tmp/.ICE-unix/dcop7651-1199959912
dcopserve 7651 root 11u unix 0xdfd30c40
25104 /tmp/.ICE-unix/dcop7651-1199959912
klauncher 7653 root cwd DIR 8,7 4096
130305 /tmp/1878472609
klauncher 7653 root mem REG 8,7
130321 /tmp/1878472609/.kde/cache-ibm/ksycoca (path inode=130326)
klauncher 7653 root 9r REG 8,7 2564654
130321 /tmp/1878472609/.kde/cache-ibm/ksycoca
klauncher 7653 root 10u unix 0xf5297a80
21295 /tmp/1878472609/.kde/socket-ibm/klauncher1FOdCb.slave-socket
kded 7662 root cwd DIR 8,7 4096
130305 /tmp/1878472609
kded 7662 root mem REG 8,7 2564654
130326 /tmp/1878472609/.kde/cache-ibm/ksycoca
kded 7662 root 12r REG 8,7 2564654
130326 /tmp/1878472609/.kde/cache-ibm/ksycoca
kio_smtp 7710 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kio_smtp 7710 jplatte 9r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kio_file 7786 jplatte mem REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
kio_file 7786 jplatte 9r REG 8,6 2590525
1336104 /var/tmp/kdecache-jplatte/ksycoca
java 10922 jplatte mem REG 8,7 32768
179170 /tmp/hsperfdata_jplatte/10922

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-11 06:41:19

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Friday, 11 January 2008, Joerg Platte wrote:

> konqueror 987 jplatte mem REG 8,6 2590525
> 1336104 /var/tmp/kdecache-jplatte/ksycoca
> konqueror 987 jplatte 12r REG 8,6 2590525
> 1336104 /var/tmp/kdecache-jplatte/ksycoca
> konqueror 987 jplatte 13u REG 8,7 579
> 97731 /tmp/kde-jplatte/konqueror-crash-Jv2u8a.log
> konqueror 987 jplatte 14u REG 8,7 128289
> 97734 /tmp/kde-jplatte/khtmlcacheA7VjAa.tmp (deleted)
> konqueror 987 jplatte 16u REG 8,7 43334
> 97750 /tmp/kde-jplatte/khtmlcacheZUNlsc.tmp (deleted)
> konqueror 987 jplatte 22u REG 8,7 797
> 97751 /tmp/kde-jplatte/khtmlcache76bZYa.tmp (deleted)
> konqueror 987 jplatte 27u REG 8,7 102347
> 97735 /tmp/kde-jplatte/khtmlcacheMhlDJb.tmp (deleted)
> konqueror 987 jplatte 31u REG 8,7 354
> 97738 /tmp/kde-jplatte/khtmlcacheq21uoc.tmp (deleted)
> konqueror 987 jplatte 32u REG 8,7 1360
> 97742 /tmp/kde-jplatte/khtmlcacheo2Ms2a.tmp (deleted)
> konqueror 987 jplatte 34u REG 8,7 6378
> 97745 /tmp/kde-jplatte/khtmlcacheLETtgc.tmp (deleted)
> konqueror 987 jplatte 35u REG 8,7 97350
> 97746 /tmp/kde-jplatte/khtmlcache5jit8a.tmp (deleted)
> konqueror 987 jplatte 36u REG 8,7 354
> 97747 /tmp/kde-jplatte/khtmlcache7VBSNa.tmp (deleted)
> konqueror 987 jplatte 37u REG 8,7 1360
> 97748 /tmp/kde-jplatte/khtmlcachetSNbub.tmp (deleted)
> konqueror 987 jplatte 38u REG 8,7 6073
> 97749 /tmp/kde-jplatte/khtmlcacheKAKhuc.tmp (deleted)

Fengguang, a few minutes after killing this konqueror process, the high iowait
load disappeared. Let's see if it comes back...

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-12 22:33:43

by Joerg Platte

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Friday, 11 January 2008, Fengguang Wu wrote:
> On Thu, Jan 10, 2008 at 11:03:05AM +0100, Joerg Platte wrote:
> > On Thursday, 10 January 2008, Fengguang Wu wrote:
> > > > problem, because the iowait problem disappeared today after the
> > > > regular Debian update. I'll try to install the old package versions
> > > > to make it show up again. Maybe that helps to debug it.
> > >
> > > Thank you. I'm running sid, ext2 as rootfs now ;-)
> >
> > The error is back and I'm getting thousands of messages like this with
> > the patched kernel:
> >
> > mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3936 0 0 wc _M tw 1024 sk 0
> > requeue_io 301: inode 81441 size 0 at 08:07(sda7)
> > mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
> > requeue_io 301: inode 81441 size 0 at 08:07(sda7)
> > mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
> > requeue_io 301: inode 81441 size 0 at 08:07(sda7)
>
> Joerg, what's the output of `dumpe2fs /dev/sda7` and `lsof|grep /tmp`?

After another reboot I tried to get more information about the konqueror
process possibly causing the iowait load by using strace -p. Here is the
output:

gettimeofday({1200180588, 878508}, NULL) = 0
setitimer(ITIMER_VIRTUAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
rt_sigaction(SIGVTALRM, {SIG_DFL}, {0xb5cffed0, [VTALRM], SA_RESTART}, 8) = 0
gettimeofday({1200180588, 879942}, NULL) = 0
time(NULL) = 1200180588
gettimeofday({1200180588, 880838}, NULL) = 0
gettimeofday({1200180588, 881284}, NULL) = 0
time(NULL) = 1200180588
gettimeofday({1200180588, 882131}, NULL) = 0
gettimeofday({1200180588, 882572}, NULL) = 0
ioctl(5, FIONREAD, [0]) = 0
gettimeofday({1200180588, 883477}, NULL) = 0
select(16, [5 6 7 9 11 14 15], [], [], {0, 150095}) = 0 (Timeout)
gettimeofday({1200180589, 34269}, NULL) = 0
gettimeofday({1200180589, 34885}, NULL) = 0
time(NULL) = 1200180589
gettimeofday({1200180589, 36672}, NULL) = 0
rt_sigaction(SIGVTALRM, {0xb5cffed0, [VTALRM], SA_RESTART}, {SIG_DFL}, 8) = 0
setitimer(ITIMER_VIRTUAL, {it_interval={10, 0}, it_value={5, 0}},
{it_interval={0, 0}, it_value={0, 0}}) = 0
gettimeofday({1200180589, 38555}, NULL) = 0
time(NULL) = 1200180589
gettimeofday({1200180589, 39802}, NULL) = 0
setitimer(ITIMER_VIRTUAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
rt_sigaction(SIGVTALRM, {SIG_DFL}, {0xb5cffed0, [VTALRM], SA_RESTART}, 8) = 0
gettimeofday({1200180589, 40912}, NULL) = 0
time(NULL) = 1200180589
gettimeofday({1200180589, 42019}, NULL) = 0
gettimeofday({1200180589, 42458}, NULL) = 0
time(NULL) = 1200180589
gettimeofday({1200180589, 43303}, NULL) = 0
gettimeofday({1200180589, 43747}, NULL) = 0
ioctl(5, FIONREAD, [0]) = 0
gettimeofday({1200180589, 45834}, NULL) = 0
select(16, [5 6 7 9 11 14 15], [], [], {0, 149913}) = 0 (Timeout)
gettimeofday({1200180589, 194815}, NULL) = 0
ioctl(5, FIONREAD, [0]) = 0
gettimeofday({1200180589, 195730}, NULL) = 0
select(16, [5 6 7 9 11 14 15], [], [], {0, 17}) = 0 (Timeout)
gettimeofday({1200180589, 197555}, NULL) = 0
gettimeofday({1200180589, 198020}, NULL) = 0
time(NULL) = 1200180589
gettimeofday({1200180589, 198884}, NULL) = 0
rt_sigaction(SIGVTALRM, {0xb5cffed0, [VTALRM], SA_RESTART}, {SIG_DFL}, 8) = 0
setitimer(ITIMER_VIRTUAL, {it_interval={10, 0}, it_value={5, 0}},
{it_interval={0, 0}, it_value={0, 0}}) = 0
gettimeofday({1200180589, 200702}, NULL) = 0
time(NULL) = 1200180589
gettimeofday({1200180589, 200806}, NULL) = 0
setitimer(ITIMER_VIRTUAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
rt_sigaction(SIGVTALRM, {SIG_DFL}, {0xb5cffed0, [VTALRM], SA_RESTART}, 8) = 0
gettimeofday({1200180589, 202975}, NULL) = 0
time(NULL) = 1200180589
gettimeofday({1200180589, 203837}, NULL) = 0
gettimeofday({1200180589, 204319}, NULL) = 0
time(NULL) = 1200180589
gettimeofday({1200180589, 205169}, NULL) = 0
gettimeofday({1200180589, 205613}, NULL) = 0
ioctl(5, FIONREAD, [0]) = 0
gettimeofday({1200180589, 206515}, NULL) = 0
select(16, [5 6 7 9 11 14 15], [], [], {0, 149098} <unfinished ...>
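
A more detailed capture could be attached next time, along these lines (a
sketch; -f follows child threads, -tt timestamps every call, -c prints a
per-syscall time summary instead, and <PID> stands for the konqueror
process id):

strace -f -tt -o konqueror.strace -p <PID>   # full timestamped trace to a file
strace -c -p <PID>                           # or just a time summary per syscall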

Fengguang, do you have any idea what's going wrong here?

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-13 06:44:55

by Wu Fengguang

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Sun, Jan 13, 2008 at 12:32:30AM +0100, Joerg Platte wrote:
> On Friday, 11 January 2008, Fengguang Wu wrote:
> > On Thu, Jan 10, 2008 at 11:03:05AM +0100, Joerg Platte wrote:
> > > On Thursday, 10 January 2008, Fengguang Wu wrote:
> > > > > problem, because the iowait problem disappeared today after the
> > > > > regular Debian update. I'll try to install the old package versions
> > > > > to make it show up again. Maybe that helps to debug it.
> > > >
> > > > Thank you. I'm running sid, ext2 as rootfs now ;-)
> > >
> > > The error is back and I'm getting thousands of messages like this with
> > > the patched kernel:
> > >
> > > mm/page-writeback.c 668 wb_kupdate: pdflush(146) 21115 global 3936 0 0 wc _M tw 1024 sk 0
> > > requeue_io 301: inode 81441 size 0 at 08:07(sda7)
> > > mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
> > > requeue_io 301: inode 81441 size 0 at 08:07(sda7)
> > > mm/page-writeback.c 668 wb_kupdate: pdflush(147) 17451 global 3936 0 0 wc _M tw 1024 sk 2
> > > requeue_io 301: inode 81441 size 0 at 08:07(sda7)
> >
> > Joerg, what's the output of `dumpe2fs /dev/sda7` and `lsof|grep /tmp`?
>
> After another reboot I tried to get more information about the konqueror
> process possibly causing the iowait load by using strace -p. Here is the
> output:
>
> gettimeofday({1200180588, 878508}, NULL) = 0
> setitimer(ITIMER_VIRTUAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
> rt_sigaction(SIGVTALRM, {SIG_DFL}, {0xb5cffed0, [VTALRM], SA_RESTART}, 8) = 0
> gettimeofday({1200180588, 879942}, NULL) = 0
> time(NULL) = 1200180588
> gettimeofday({1200180588, 880838}, NULL) = 0
> gettimeofday({1200180588, 881284}, NULL) = 0

No idea yet :-/ I'm afraid I have to trouble you again - the bug just
refuses to appear on my system. I prepared a kernel module for you to
gather more information:

make && insmod ext2-writeback-debug.ko && sleep 1s && rmmod ext2-writeback-debug
dmesg > ext2-writeback-debug.dmesg

Please do it when 100% iowait appears, and send the dmesg file.

Thank you,
Fengguang


Attachments:
(No filename) (2.07 kB)
Makefile (187.00 B)
ext2-writeback-debug.c (3.52 kB)
Download all attachments

2008-01-13 08:05:59

by Joerg Platte

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

Am Sonntag, 13. Januar 2008 schrieb Fengguang Wu:

> No idea yet :-/ I'm afraid I have to trouble you again - the bug just
> refused to appear in my system. I prepared a kernel module for you to
> gather more information:

> make && insmod ext2-writeback-debug.ko && sleep 1s && rmmod ext2-writeback-debug
> dmesg > ext2-writeback-debug.dmesg

Unfortunately, I am unable to compile the module:

make -C /lib/modules/2.6.24-rc7-ext2/build SUBDIRS=/export/src modules
make[1]: Entering directory `/export/src/linux-2.6.24-rc7'
CC [M] /export/src/ext2-writeback-debug.o
/export/src/ext2-writeback-debug.c:89: error: variable ‘my_kprobe’ has
initializer but incomplete type
/export/src/ext2-writeback-debug.c:90: error: unknown field ‘pre_handler’
specified in initializer
/export/src/ext2-writeback-debug.c:90: warning: excess elements in struct
initializer
/export/src/ext2-writeback-debug.c:90: warning: (near initialization
for ‘my_kprobe’)
/export/src/ext2-writeback-debug.c:91: error: unknown field ‘post_handler’
specified in initializer
/export/src/ext2-writeback-debug.c:91: warning: excess elements in struct
initializer
/export/src/ext2-writeback-debug.c:91: warning: (near initialization
for ‘my_kprobe’)
/export/src/ext2-writeback-debug.c:92: error: unknown field ‘symbol_name’
specified in initializer
/export/src/ext2-writeback-debug.c:93: warning: excess elements in struct
initializer
/export/src/ext2-writeback-debug.c:93: warning: (near initialization
for ‘my_kprobe’)
/export/src/ext2-writeback-debug.c:109: error:
variable ‘jprobe_ext2_writepage’ has initializer but incomplete type
/export/src/ext2-writeback-debug.c:110: error: unknown field ‘entry’ specified
in initializer
/export/src/ext2-writeback-debug.c:110: warning: excess elements in struct
initializer
/export/src/ext2-writeback-debug.c:110: warning: (near initialization
for ‘jprobe_ext2_writepage’)
/export/src/ext2-writeback-debug.c:111: error: unknown field ‘kp’ specified in
initializer
/export/src/ext2-writeback-debug.c:112: warning: excess elements in struct
initializer
/export/src/ext2-writeback-debug.c:112: warning: (near initialization
for ‘jprobe_ext2_writepage’)
/export/src/ext2-writeback-debug.c:124: error: variable ‘jprobe_requeue_io’
has initializer but incomplete type
/export/src/ext2-writeback-debug.c:125: error: unknown field ‘entry’ specified
in initializer
/export/src/ext2-writeback-debug.c:125: warning: excess elements in struct
initializer
/export/src/ext2-writeback-debug.c:125: warning: (near initialization
for ‘jprobe_requeue_io’)
/export/src/ext2-writeback-debug.c:126: error: unknown field ‘kp’ specified in
initializer
/export/src/ext2-writeback-debug.c:127: warning: excess elements in struct
initializer
/export/src/ext2-writeback-debug.c:127: warning: (near initialization
for ‘jprobe_requeue_io’)
/export/src/ext2-writeback-debug.c: In function ‘setup_kprobe’:
/export/src/ext2-writeback-debug.c:132: error: dereferencing pointer to
incomplete type
/export/src/ext2-writeback-debug.c: In function ‘setup_jprobe’:
/export/src/ext2-writeback-debug.c:139: error: dereferencing pointer to
incomplete type
make[2]: *** [/export/src/ext2-writeback-debug.o] Error 1
make[1]: *** [_module_/export/src] Error 2
make[1]: Leaving directory `/export/src/linux-2.6.24-rc7'
make: *** [default] Error 2

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-13 08:21:39

by Wu Fengguang

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Sun, Jan 13, 2008 at 09:05:43AM +0100, Joerg Platte wrote:
> Am Sonntag, 13. Januar 2008 schrieb Fengguang Wu:
>
> > No idea yet :-/ I'm afraid I have to trouble you again - the bug just
> > refused to appear in my system. I prepared a kernel module for you to
> > gather more information:
>
> > make && insmod ext2-writeback-debug.ko && sleep 1s && rmmod ext2-writeback-debug
> > dmesg > ext2-writeback-debug.dmesg
>
> Unfortunately, I am unable to compile the module:

> make -C /lib/modules/2.6.24-rc7-ext2/build SUBDIRS=/export/src modules
> make[1]: Entering directory `/export/src/linux-2.6.24-rc7'
> CC [M] /export/src/ext2-writeback-debug.o
> /export/src/ext2-writeback-debug.c:89: error: variable ‘my_kprobe’ has
> initializer but incomplete type

Do you have CONFIG_KPROBES=y?

2008-01-13 09:49:55

by Joerg Platte

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

Am Sonntag, 13. Januar 2008 schrieb Fengguang Wu:
> On Sun, Jan 13, 2008 at 09:05:43AM +0100, Joerg Platte wrote:
> > Am Sonntag, 13. Januar 2008 schrieb Fengguang Wu:
> > > No idea yet :-/ I'm afraid I have to trouble you again - the bug just
> > > refused to appear in my system. I prepared a kernel module for you to
> > > gather more information:
> > >
> > > make && insmod ext2-writeback-debug.ko && sleep 1s && rmmod ext2-writeback-debug
> > > dmesg > ext2-writeback-debug.dmesg
> >
> > Unfortunately, I am unable to compile the module:
> >
> > make -C /lib/modules/2.6.24-rc7-ext2/build SUBDIRS=/export/src modules
> > make[1]: Entering directory `/export/src/linux-2.6.24-rc7'
> > CC [M] /export/src/ext2-writeback-debug.o
> > /export/src/ext2-writeback-debug.c:89: error: variable ‘my_kprobe’ has
> > initializer but incomplete type
>
> Do you have CONFIG_KPROBES=y?

Now I have :)

Here is the result.

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605


Attachments:
(No filename) (1.04 kB)
ext2-writeback-debug.dmesg (47.96 kB)
Download all attachments

2008-01-14 01:47:54

by Wu Fengguang

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Sun, Jan 13, 2008 at 10:49:31AM +0100, Joerg Platte wrote:
> register_jprobe(ext2_writepage) = 0
> register_jprobe(requeue_io) = 0
> register_kprobe(submit_bio) = 0
> requeue_io:
> inode 114019(sda7/.kde) count 2,2 size 0 pages 1
> 0 2 0 U____
> requeue_io:
> inode 114025(sda7/cache-ibm) count 2,1 size 0 pages 1
> 0 2 0 U____
> requeue_io:
> inode 114029(sda7/socket-ibm) count 2,3 size 0 pages 1
> 0 2 0 U____
> requeue_io:
> inode 114017(sda7/0266584877) count 3,6 size 0 pages 1
> 0 2 0 U____

It helps. Thank you, Joerg!

The .kde/cache-ibm/socket-ibm/0266584877 above are directories.
It's weird that dirs would have their own mappings in ext2. In
particular this bug is triggered because the dir mapping page has
PAGECACHE_TAG_DIRTY set and PG_dirty cleared, staying in an
inconsistent state.

Fengguang

2008-01-14 03:55:10

by Wu Fengguang

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Sun, Jan 13, 2008 at 07:59:33PM +0800, Fengguang Wu wrote:
> On Sun, Jan 13, 2008 at 10:49:31AM +0100, Joerg Platte wrote:
> > register_jprobe(ext2_writepage) = 0
> > register_jprobe(requeue_io) = 0
> > register_kprobe(submit_bio) = 0
> > requeue_io:
> > inode 114019(sda7/.kde) count 2,2 size 0 pages 1
> > 0 2 0 U____
> > requeue_io:
> > inode 114025(sda7/cache-ibm) count 2,1 size 0 pages 1
> > 0 2 0 U____
> > requeue_io:
> > inode 114029(sda7/socket-ibm) count 2,3 size 0 pages 1
> > 0 2 0 U____
> > requeue_io:
> > inode 114017(sda7/0266584877) count 3,6 size 0 pages 1
> > 0 2 0 U____
>
> It helps. Thank you, Joerg!
>
> The .kde/cache-ibm/socket-ibm/0266584877 above are directories.
> It's weird that dirs would have their own mappings in ext2. In

Oh, ext2 dirs have their own mapping pages, unlike ext3.

> particular this bug is triggered because the dir mapping page has
> PAGECACHE_TAG_DIRTY set and PG_dirty cleared, staying in an
> inconsistent state.

Just found that a deleted dir will enter that inconsistent state when
someone still has a reference to it...
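
(To make the failure mode concrete: writeback finds candidate pages via the
radix-tree tag, but decides whether to write them via the PG_dirty page flag.
Below is a rough sketch of the resulting livelock. The helper names are
invented for illustration; this is not the actual mm/page-writeback.c code.)

	/*
	 * Illustration only, invented helper names. A truncated but still
	 * referenced dir page has PAGECACHE_TAG_DIRTY set in the radix tree
	 * while PG_dirty is already clear:
	 */
	int sketch_writepages(struct address_space *mapping)
	{
		struct page *page;
		int written = 0;

		/* assume each call advances past the page it returned */
		while ((page = next_page_tagged(mapping, PAGECACHE_TAG_DIRTY))) {
			if (!clear_page_dirty_for_io(page))
				continue;	/* PG_dirty clear: page skipped */
			written += write_one(page);
		}
		return written;	/* 0 written, yet the dirty tag persists */
	}
	/*
	 * With nothing written and the inode still looking dirty, pdflush
	 * requeues the inode and calls congestion_wait(WRITE, HZ/10), over
	 * and over: the 100% iowait reported in this thread.
	 */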

2008-01-14 09:55:35

by Wu Fengguang

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Mon, Jan 14, 2008 at 11:54:39AM +0800, Fengguang Wu wrote:
> > particular this bug is triggered because the dir mapping page has
> > PAGECACHE_TAG_DIRTY set and PG_dirty cleared, staying in an
> > inconsistent state.
>
> Just found that a deleted dir will enter that inconsistent state when
> someone still has a reference to it...

Joerg, this patch fixed the bug for me :-)

Fengguang
---

clear PAGECACHE_TAG_DIRTY for truncated page in block_write_full_page()

The `truncated' page in block_write_full_page() may stick for a long time.
E.g. ext2_rmdir() will set i_size to 0. The dir may still be referenced by
someone, and have dirty pages in it.

So clear PAGECACHE_TAG_DIRTY to prevent pdflush from retrying and iowaiting on
it.

Signed-off-by: Fengguang Wu <[email protected]>
---
fs/buffer.c | 2 ++
1 files changed, 2 insertions(+)

Index: linux/fs/buffer.c
===================================================================
--- linux.orig/fs/buffer.c
+++ linux/fs/buffer.c
@@ -2820,7 +2820,9 @@ int block_write_full_page(struct page *p
* freeable here, so the page does not leak.
*/
do_invalidatepage(page, 0);
+ set_page_writeback(page);
unlock_page(page);
+ end_page_writeback(page);
return 0; /* don't care */
}
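
(Why the two added lines work: in this era of the kernel, set_page_writeback()
also drops PAGECACHE_TAG_DIRTY from the radix tree when the page itself is no
longer PageDirty, and end_page_writeback() then clears the writeback flag and
tag again, leaving the truncated page fully clean. A simplified sketch of
test_set_page_writeback() from 2.6.24-era mm/page-writeback.c, quoted from
memory; see the real source for the exact code:)

	int test_set_page_writeback(struct page *page)
	{
		struct address_space *mapping = page_mapping(page);
		unsigned long flags;
		int ret;

		write_lock_irqsave(&mapping->tree_lock, flags);
		ret = TestSetPageWriteback(page);
		if (!ret)
			radix_tree_tag_set(&mapping->page_tree,
					page_index(page), PAGECACHE_TAG_WRITEBACK);
		if (!PageDirty(page))
			/* this is what removes the stale dirty tag */
			radix_tree_tag_clear(&mapping->page_tree,
					page_index(page), PAGECACHE_TAG_DIRTY);
		write_unlock_irqrestore(&mapping->tree_lock, flags);
		return ret;
	}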

2008-01-14 11:30:27

by Joerg Platte

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

Am Montag, 14. Januar 2008 schrieb Fengguang Wu:

> Joerg, this patch fixed the bug for me :-)

Fengguang, congratulations, I can confirm that your patch fixed the bug! With
previous kernels the bug showed up after each reboot. Now, when booting the
patched kernel everything is fine and there is no longer any suspicious
iowait!

Do you have an idea why this problem appeared in 2.6.24? Did somebody change
the ext2 code or is it related to the changes in the scheduler?

regards,
Jörg

--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605

2008-01-14 11:41:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX


On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
>
> > Joerg, this patch fixed the bug for me :-)
>
> Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> previous kernels the bug showed up after each reboot. Now, when booting the
> patched kernel everything is fine and there is no longer any suspicious
> iowait!
>
> Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> the ext2 code or is it related to the changes in the scheduler?

It was Fengguang who changed the inode writeback code, and I guess the
new and improved code was less able to deal with these funny corner
cases. But he has been very good in tracking them down and solving them,
kudos to him for that work!


2008-01-15 00:56:19

by Wu Fengguang

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
>
> On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> >
> > > Joerg, this patch fixed the bug for me :-)
> >
> > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > previous kernels the bug showed up after each reboot. Now, when booting the
> > patched kernel everything is fine and there is no longer any suspicious
> > iowait!
> >
> > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > the ext2 code or is it related to the changes in the scheduler?
>
> It was Fengguang who changed the inode writeback code, and I guess the
> new and improved code was less able to deal with these funny corner
> cases. But he has been very good in tracking them down and solving them,
> kudos to him for that work!

Thank you.

In particular the bug is triggered by the patch named:
"writeback: introduce writeback_control.more_io to indicate more io"
That patch is meant to speed up writeback, but unfortunately its
aggressiveness has uncovered bugs in reiserfs, jfs and now ext2.

Linus, given the number of bugs it triggered, I'd recommend reverting
this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
push it back to the -mm tree for more testing?

Regards,
Fengguang

2008-01-15 21:13:34

by Mike Snitzer

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Jan 14, 2008 7:50 AM, Fengguang Wu <[email protected]> wrote:
> On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> >
> > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > >
> > > > Joerg, this patch fixed the bug for me :-)
> > >
> > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > patched kernel everything is fine and there is no longer any suspicious
> > > iowait!
> > >
> > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > the ext2 code or is it related to the changes in the scheduler?
> >
> > It was Fengguang who changed the inode writeback code, and I guess the
> > new and improved code was less able to deal with these funny corner
> > cases. But he has been very good in tracking them down and solving them,
> > kudos to him for that work!
>
> Thank you.
>
> In particular the bug is triggered by the patch named:
> "writeback: introduce writeback_control.more_io to indicate more io"
> That patch is meant to speed up writeback, but unfortunately its
> aggressiveness has uncovered bugs in reiserfs, jfs and now ext2.
>
> Linus, given the number of bugs it triggered, I'd recommend reverting
> this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> push it back to the -mm tree for more testing?

Fengguang,

I'd like to better understand where your writeback work stands
relative to 2.6.24-rcX and -mm. To be clear, your changes in
2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
performance improvement with ext3 (as compared to 2.6.22, CFS could be
helping, etc but...). Very impressive!

Given this improvement it is unfortunate to see your request to revert
2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if
you're not confident in it for 2.6.24.

That said, you recently posted an -mm patchset that first reverts
2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
the "slow writes for concurrent large and small file writes" bug:
http://lkml.org/lkml/2008/1/15/132

For those interested in using your writeback improvements in
production sooner rather than later (primarily with ext3); what
recommendations do you have? Just heavily test our own 2.6.24 + your
evolving "close, but not ready for merge" -mm writeback patchset?

regards,
Mike

2008-01-15 21:42:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX


* Fengguang Wu <[email protected]> wrote:

> On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> >
> > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > >
> > > > Joerg, this patch fixed the bug for me :-)
> > >
> > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > patched kernel everything is fine and there is no longer any suspicious
> > > iowait!
> > >
> > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > the ext2 code or is it related to the changes in the scheduler?
> >
> > It was Fengguang who changed the inode writeback code, and I guess the
> > new and improved code was less able to deal with these funny corner
> > cases. But he has been very good in tracking them down and solving them,
> > kudos to him for that work!
>
> Thank you.
>
> In particular the bug is triggered by the patch named:
> "writeback: introduce writeback_control.more_io to indicate more io"
> That patch is meant to speed up writeback, but unfortunately its
> aggressiveness has uncovered bugs in reiserfs, jfs and now ext2.
>
> Linus, given the number of bugs it triggered, I'd recommend reverting
> this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> push it back to the -mm tree for more testing?

i don't think a revert at this stage is a good idea and i'm not sure
pushing it back into -mm would really expose more of these bugs. And
these are real bugs in filesystems - bugs which we want to see fixed
anyway. You are also tracking down those bugs very fast.

[ perhaps, if it's possible technically (and if it is clean enough), you
might want to offer a runtime debug tunable that can be used to switch
off the new aspects of your code. That would speed up testing, in case
anyone suspects the new writeback code. ]

Ingo
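
(A minimal sketch of the kind of runtime tunable suggested above. This is
hypothetical, not a real patch; all names are invented. The idea is a sysctl
flag that the writeback code could consult to switch the new more_io
behaviour off at runtime:)

	int sysctl_writeback_more_io __read_mostly = 1;	/* 1 = new behaviour */

	static struct ctl_table writeback_debug_table[] = {
		{
			.ctl_name	= CTL_UNNUMBERED,
			.procname	= "writeback_more_io",
			.data		= &sysctl_writeback_more_io,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= &proc_dointvec,
		},
		{ .ctl_name = 0 }
	};

	/*
	 * Hooked up under /proc/sys/vm/ via register_sysctl_table(), and then
	 * tested at the decision points, e.g. in sync_sb_inodes():
	 *
	 *	if (sysctl_writeback_more_io && !list_empty(&sb->s_more_io))
	 *		wbc->more_io = 1;
	 */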

2008-01-16 05:14:33

by Wu Fengguang

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Tue, Jan 15, 2008 at 10:42:13PM +0100, Ingo Molnar wrote:
>
> * Fengguang Wu <[email protected]> wrote:
>
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > >
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > >
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > > patched kernel everything is fine and there is no longer any suspicious
> > > > iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > > the ext2 code or is it related to the changes in the scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess the
> > > new and improved code was less able to deal with these funny corner
> > > cases. But he has been very good in tracking them down and solving them,
> > > kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> > "writeback: introduce writeback_control.more_io to indicate more io"
> > That patch is meant to speed up writeback, but unfortunately its
> > aggressiveness has uncovered bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend reverting
> > this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> > push it back to the -mm tree for more testing?
>
> i don't think a revert at this stage is a good idea and i'm not sure
> pushing it back into -mm would really expose more of these bugs. And
> these are real bugs in filesystems - bugs which we want to see fixed
> anyway. You are also tracking down those bugs very fast.
>
> [ perhaps, if it's possible technically (and if it is clean enough), you
> might want to offer a runtime debug tunable that can be used to switch
> off the new aspects of your code. That would speed up testing, in case
> anyone suspects the new writeback code. ]

The patch is too aggressive in itself; we'd better not risk it.
The iowait is only unpleasant, not destructive, but it will hurt if
many users complain. The comment says that "nfs_writepages() sometimes
bales out without doing anything."

However, I have an improved and safer patch now. It won't iowait
when nfs_writepages() bales out without increasing pages_skipped, or
even when some buggy filesystem forgets to clear PAGECACHE_TAG_DIRTY.
(The magic lies in the first chunk below.)

Mike, you can use this one on 2.6.24.


---
fs/fs-writeback.c | 17 +++++++++++++++--
include/linux/writeback.h | 1 +
mm/page-writeback.c | 9 ++++++---
3 files changed, 22 insertions(+), 5 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
* soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
- requeue_io(inode);
+ if (wbc->nr_to_write <= 0)
+ /*
+ * slice used up: queue for next turn
+ */
+ requeue_io(inode);
+ else
+ /*
+ * somehow blocked: retry later
+ */
+ redirty_tail(inode);
} else {
/*
* Otherwise fully redirty the inode so that
@@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
iput(inode);
cond_resched();
spin_lock(&inode_lock);
- if (wbc->nr_to_write <= 0)
+ if (wbc->nr_to_write <= 0) {
+ wbc->more_io = 1;
break;
+ }
+ if (!list_empty(&sb->s_more_io))
+ wbc->more_io = 1;
}
return; /* Leave any unwritten inodes on s_io */
}
--- linux.orig/include/linux/writeback.h
+++ linux/include/linux/writeback.h
@@ -62,6 +62,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
+ unsigned more_io:1; /* more io to be dispatched */
};

/*
--- linux.orig/mm/page-writeback.c
+++ linux/mm/page-writeback.c
@@ -558,6 +558,7 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
@@ -565,8 +566,9 @@ static void background_writeout(unsigned
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
+ congestion_wait(WRITE, HZ/10);
+ else
break;
}
}
@@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
while (nr_to_write > 0) {
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */
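
(For readers following the first chunk: the choice there is between two
queues. requeue_io() keeps the inode on s_more_io, so it is retried within
the same writeback run; redirty_tail() sends it back to s_dirty with a
refreshed timestamp, so a blocked inode is only revisited after the usual
kupdate delay instead of being spun on. Roughly, these helpers look like the
following in 2.6.24-era fs/fs-writeback.c, reproduced from memory and lightly
simplified:)

	static void redirty_tail(struct inode *inode)
	{
		struct super_block *sb = inode->i_sb;

		/* keep s_dirty ordered by dirtied_when */
		if (!list_empty(&sb->s_dirty)) {
			struct inode *tail_inode;

			tail_inode = list_entry(sb->s_dirty.next,
						struct inode, i_list);
			if (!time_after_eq(inode->dirtied_when,
					   tail_inode->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_list, &sb->s_dirty);
	}

	/* requeue inode for re-scanning after the sb->s_io list is exhausted */
	static void requeue_io(struct inode *inode)
	{
		list_move(&inode->i_list, &inode->i_sb->s_more_io);
	}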

2008-01-16 05:26:54

by Wu Fengguang

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Tue, Jan 15, 2008 at 04:13:22PM -0500, Mike Snitzer wrote:
> On Jan 14, 2008 7:50 AM, Fengguang Wu <[email protected]> wrote:
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > >
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > >
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > > patched kernel everything is fine and there is no longer any suspicious
> > > > iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > > the ext2 code or is it related to the changes in the scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess the
> > > new and improved code was less able to deal with these funny corner
> > > cases. But he has been very good in tracking them down and solving them,
> > > kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> > "writeback: introduce writeback_control.more_io to indicate more io"
> > That patch is meant to speed up writeback, but unfortunately its
> > aggressiveness has uncovered bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend reverting
> > this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> > push it back to the -mm tree for more testing?
>
> Fengguang,
>
> I'd like to better understand where your writeback work stands
> relative to 2.6.24-rcX and -mm. To be clear, your changes in
> 2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
> performance improvement with ext3 (as compared to 2.6.22, CFS could be
> helping, etc but...). Very impressive!

Wow, glad to hear that.

> Given this improvement it is unfortunate to see your request to revert
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if
> you're not confident in it for 2.6.24.
>
> That said, you recently posted an -mm patchset that first reverts
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
> the "slow writes for concurrent large and small file writes" bug:
> http://lkml.org/lkml/2008/1/15/132
>
> For those interested in using your writeback improvements in
> production sooner rather than later (primarily with ext3); what
> recommendations do you have? Just heavily test our own 2.6.24 + your
> evolving "close, but not ready for merge" -mm writeback patchset?

It's not ready mainly because it is freshly made and needs more
feedback. It's doing OK on my desktop :-)

2008-01-16 09:26:55

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Mike Snitzer <[email protected]>
> To: Fengguang Wu <[email protected]>
> Cc: Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>; Andrew Morton <akpm@li>
> Sent: Tuesday, January 15, 2008 10:13:22 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Jan 14, 2008 7:50 AM, Fengguang Wu wrote:
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > >
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > >
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > > patched kernel everything is fine and there is no longer any suspicious
> > > > iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > > the ext2 code or is it related to the changes in the scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess the
> > > new and improved code was less able to deal with these funny corner
> > > cases. But he has been very good in tracking them down and solving them,
> > > kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> > "writeback: introduce writeback_control.more_io to indicate more io"
> > That patch is meant to speed up writeback, but unfortunately its
> > aggressiveness has uncovered bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend reverting
> > this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> > push it back to the -mm tree for more testing?
>
> Fengguang,
>
> I'd like to better understand where your writeback work stands
> relative to 2.6.24-rcX and -mm. To be clear, your changes in
> 2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
> performance improvement with ext3 (as compared to 2.6.22, CFS could be
> helping, etc but...). Very impressive!
>
> Given this improvement it is unfortunate to see your request to revert
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if
> you're not confident in it for 2.6.24.
>
> That said, you recently posted an -mm patchset that first reverts
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
> the "slow writes for concurrent large and small file writes" bug:
> http://lkml.org/lkml/2008/1/15/132
>
> For those interested in using your writeback improvements in
> production sooner rather than later (primarily with ext3); what
> recommendations do you have? Just heavily test our own 2.6.24 + your
> evolving "close, but not ready for merge" -mm writeback patchset?
>
Hi Fengguang, Mike,

I can add myself to Mike's question. It would be good to know a "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been showing quite nice improvement of the overall writeback situation and it would be sad to see this [partially] gone in 2.6.24-final. Linus apparently already has reverted "...2250b". I will definitely repeat my tests with -rc8 and report.

Cheers
Martin



2008-01-16 12:00:21

by Wu Fengguang

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > For those interested in using your writeback improvements in
> > production sooner rather than later (primarily with ext3); what
> > recommendations do you have? Just heavily test our own 2.6.24 + your
> > evolving "close, but not ready for merge" -mm writeback patchset?
> >
> Hi Fengguang, Mike,
>
> I can add myself to Mike's question. It would be good to know a "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been showing quite nice improvement of the overall writeback situation and it would be sad to see this [partially] gone in 2.6.24-final. Linus apparently already has reverted "...2250b". I will definitely repeat my tests with -rc8 and report.

Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
Maybe we can push it to 2.6.24 after your testing.

Fengguang
---
fs/fs-writeback.c | 17 +++++++++++++++--
include/linux/writeback.h | 1 +
mm/page-writeback.c | 9 ++++++---
3 files changed, 22 insertions(+), 5 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
* soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
- requeue_io(inode);
+ if (wbc->nr_to_write <= 0)
+ /*
+ * slice used up: queue for next turn
+ */
+ requeue_io(inode);
+ else
+ /*
+ * somehow blocked: retry later
+ */
+ redirty_tail(inode);
} else {
/*
* Otherwise fully redirty the inode so that
@@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
iput(inode);
cond_resched();
spin_lock(&inode_lock);
- if (wbc->nr_to_write <= 0)
+ if (wbc->nr_to_write <= 0) {
+ wbc->more_io = 1;
break;
+ }
+ if (!list_empty(&sb->s_more_io))
+ wbc->more_io = 1;
}
return; /* Leave any unwritten inodes on s_io */
}
--- linux.orig/include/linux/writeback.h
+++ linux/include/linux/writeback.h
@@ -62,6 +62,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
+ unsigned more_io:1; /* more io to be dispatched */
};

/*
--- linux.orig/mm/page-writeback.c
+++ linux/mm/page-writeback.c
@@ -558,6 +558,7 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
@@ -565,8 +566,9 @@ static void background_writeout(unsigned
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
+ congestion_wait(WRITE, HZ/10);
+ else
break;
}
}
@@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
while (nr_to_write > 0) {
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */

2008-01-16 14:16:03

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Fengguang Wu <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have? Just heavily test our own 2.6.24 + your
> > > evolving "close, but not ready for merge" -mm writeback patchset?
> > >
> > Hi Fengguang, Mike,
> >
> > I can add myself to Mike's question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > showing quite nice improvement of the overall writeback situation and it
> > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > apparently already has reverted "...2250b". I will definitely repeat my
> > tests with -rc8 and report.
>
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
>

Will do tomorrow or Friday. Actually a patch against -rc8 would be nicer for me, as I have not looked at -rc7 due to holidays and some of the reported problems with it.

Cheers
Martin

> Fengguang
> ---
> fs/fs-writeback.c | 17 +++++++++++++++--
> include/linux/writeback.h | 1 +
> mm/page-writeback.c | 9 ++++++---
> 3 files changed, 22 insertions(+), 5 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
> * soon as the queue becomes uncongested.
> */
> inode->i_state |= I_DIRTY_PAGES;
> - requeue_io(inode);
> + if (wbc->nr_to_write <= 0)
> + /*
> + * slice used up: queue for next turn
> + */
> + requeue_io(inode);
> + else
> + /*
> + * somehow blocked: retry later
> + */
> + redirty_tail(inode);
> } else {
> /*
> * Otherwise fully redirty the inode so that
> @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
> iput(inode);
> cond_resched();
> spin_lock(&inode_lock);
> - if (wbc->nr_to_write <= 0)
> + if (wbc->nr_to_write <= 0) {
> + wbc->more_io = 1;
> break;
> + }
> + if (!list_empty(&sb->s_more_io))
> + wbc->more_io = 1;
> }
> return; /* Leave any unwritten inodes on s_io */
> }
> --- linux.orig/include/linux/writeback.h
> +++ linux/include/linux/writeback.h
> @@ -62,6 +62,7 @@ struct writeback_control {
> unsigned for_reclaim:1; /* Invoked from the page allocator */
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> + unsigned more_io:1; /* more io to be dispatched */
> };
>
> /*
> --- linux.orig/mm/page-writeback.c
> +++ linux/mm/page-writeback.c
> @@ -558,6 +558,7 @@ static void background_writeout(unsigned
> global_page_state(NR_UNSTABLE_NFS) < background_thresh
> && min_pages <= 0)
> break;
> + wbc.more_io = 0;
> wbc.encountered_congestion = 0;
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> wbc.pages_skipped = 0;
> @@ -565,8 +566,9 @@ static void background_writeout(unsigned
> min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> /* Wrote less than expected */
> - congestion_wait(WRITE, HZ/10);
> - if (!wbc.encountered_congestion)
> + if (wbc.encountered_congestion || wbc.more_io)
> + congestion_wait(WRITE, HZ/10);
> + else
> break;
> }
> }
> @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
> global_page_state(NR_UNSTABLE_NFS) +
> (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> while (nr_to_write > 0) {
> + wbc.more_io = 0;
> wbc.encountered_congestion = 0;
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> writeback_inodes(&wbc);
> if (wbc.nr_to_write > 0) {
> - if (wbc.encountered_congestion)
> + if (wbc.encountered_congestion || wbc.more_io)
> congestion_wait(WRITE, HZ/10);
> else
> break; /* All the old data is written */
>
>
>

2008-01-16 16:28:25

by Mike Snitzer

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Jan 16, 2008 9:15 AM, Martin Knoblauch <[email protected]> wrote:
> ----- Original Message ----
> > From: Fengguang Wu <[email protected]>
> > To: Martin Knoblauch <[email protected]>
> > Cc: Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> >
>
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have? Just heavily test our own 2.6.24 + your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > >
> > > Hi Fengguang, Mike,
> > >
> > > I can add myself to Mike's question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > > showing quite nice improvement of the overall writeback situation and it
> > > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > > apparently already has reverted "...2250b". I will definitely repeat my
> > > tests with -rc8 and report.
> >
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> >
>
> Will do tomorrow or Friday. Actually a patch against -rc8 would be nicer for me, as I have not looked at -rc7 due to holidays and some of the reported problems with it.

Fengguang's latest writeback patch applies cleanly, builds, boots on 2.6.24-rc8.

I'll be able to share ext3 performance results (relative to 2.6.24-rc7) shortly.

Mike
>
>
> > Fengguang
> > ---
> > fs/fs-writeback.c | 17 +++++++++++++++--
> > include/linux/writeback.h | 1 +
> > mm/page-writeback.c | 9 ++++++---
> > 3 files changed, 22 insertions(+), 5 deletions(-)
> >
> > --- linux.orig/fs/fs-writeback.c
> > +++ linux/fs/fs-writeback.c
> > @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
> > * soon as the queue becomes uncongested.
> > */
> > inode->i_state |= I_DIRTY_PAGES;
> > - requeue_io(inode);
> > + if (wbc->nr_to_write <= 0)
> > + /*
> > + * slice used up: queue for next turn
> > + */
> > + requeue_io(inode);
> > + else
> > + /*
> > + * somehow blocked: retry later
> > + */
> > + redirty_tail(inode);
> > } else {
> > /*
> > * Otherwise fully redirty the inode so that
> > @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
> > iput(inode);
> > cond_resched();
> > spin_lock(&inode_lock);
> > - if (wbc->nr_to_write <= 0)
> > + if (wbc->nr_to_write <= 0) {
> > + wbc->more_io = 1;
> > break;
> > + }
> > + if (!list_empty(&sb->s_more_io))
> > + wbc->more_io = 1;
> > }
> > return; /* Leave any unwritten inodes on s_io */
> > }
> > --- linux.orig/include/linux/writeback.h
> > +++ linux/include/linux/writeback.h
> > @@ -62,6 +62,7 @@ struct writeback_control {
> > unsigned for_reclaim:1; /* Invoked from the page allocator */
> > unsigned for_writepages:1; /* This is a writepages() call */
> > unsigned range_cyclic:1; /* range_start is cyclic */
> > + unsigned more_io:1; /* more io to be dispatched */
> > };
> >
> > /*
> > --- linux.orig/mm/page-writeback.c
> > +++ linux/mm/page-writeback.c
> > @@ -558,6 +558,7 @@ static void background_writeout(unsigned
> > global_page_state(NR_UNSTABLE_NFS) < background_thresh
> > && min_pages <= 0)
> > break;
> > + wbc.more_io = 0;
> > wbc.encountered_congestion = 0;
> > wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> > wbc.pages_skipped = 0;
> > @@ -565,8 +566,9 @@ static void background_writeout(unsigned
> > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > /* Wrote less than expected */
> > - congestion_wait(WRITE, HZ/10);
> > - if (!wbc.encountered_congestion)
> > + if (wbc.encountered_congestion || wbc.more_io)
> > + congestion_wait(WRITE, HZ/10);
> > + else
> > break;
> > }
> > }
> > @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
> > global_page_state(NR_UNSTABLE_NFS) +
> > (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> > while (nr_to_write > 0) {
> > + wbc.more_io = 0;
> > wbc.encountered_congestion = 0;
> > wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> > writeback_inodes(&wbc);
> > if (wbc.nr_to_write > 0) {
> > - if (wbc.encountered_congestion)
> > + if (wbc.encountered_congestion || wbc.more_io)
> > congestion_wait(WRITE, HZ/10);
> > else
> > break; /* All the old data is written */
> >
> >
> >
>
>
>

2008-01-17 08:43:23

by Wu Fengguang

[permalink] [raw]
Subject: [PATCH] writeback: speed up writeback of big dirty files

> On Jan 16, 2008 9:15 AM, Martin Knoblauch <[email protected]> wrote:
> Fengguang's latest writeback patch applies cleanly, builds, boots on 2.6.24-rc8.

Linus, if possible, I'd suggest this patch be merged for 2.6.24.

It's a safer version of the reverted patch. It was tested on
ext2/ext3/jfs/xfs/reiserfs and won't hit 100% iowait even without the
other bug-fixing patches.

Fengguang
---

writeback: speed up writeback of big dirty files

After making a 100M file dirty, the normal behavior is to
start writeback for all its data after a 30s delay. But
sometimes the following happens instead:

- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M

Some analysis shows that the internal io dispatch queues go like this:

           s_io       s_more_io
        -------------------------
   1)   100M,1K      0
   2)   1K           96M
   3)   0            96M

   1) initial state with a 100M file and a 1K file
   2) 4M written, nr_to_write <= 0, so write more
   3) 1K written, nr_to_write > 0, no more writes (BUG)

nr_to_write > 0 in (3) fools the upper layer into thinking that all data has
been written out. The big dirty file is actually still sitting in s_more_io.
We cannot simply splice s_more_io back to s_io as soon as s_io becomes empty,
and let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty. It is also not an option to draw inodes from both
s_more_io and s_dirty and let the loop go on: this might lead to livelocks,
and might also starve other superblocks in sync time (well, kupdate may still
starve some superblocks, but that's another bug).

We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data are written". This patch introduces a flag
writeback_control.more_io to indicate that more io should be done. With it the
big dirty file no longer has to wait for the next kupdate invocation 5s later.

In sync_sb_inodes() we only set more_io on super_blocks we actually visited.
This avoids interaction between two pdflush daemons.

Also in __sync_single_inode() we don't blindly keep requeuing the io if the
filesystem cannot make progress. Failing to do so may lead to 100% iowait.

Tested-by: Mike Snitzer <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
---
fs/fs-writeback.c | 18 ++++++++++++++++--
include/linux/writeback.h | 1 +
mm/page-writeback.c | 9 ++++++---
3 files changed, 23 insertions(+), 5 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -284,7 +284,17 @@ __sync_single_inode(struct inode *inode,
* soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
- requeue_io(inode);
+ if (wbc->nr_to_write <= 0) {
+ /*
+ * slice used up: queue for next turn
+ */
+ requeue_io(inode);
+ } else {
+ /*
+ * somehow blocked: retry later
+ */
+ redirty_tail(inode);
+ }
} else {
/*
* Otherwise fully redirty the inode so that
@@ -479,8 +489,12 @@ sync_sb_inodes(struct super_block *sb, s
iput(inode);
cond_resched();
spin_lock(&inode_lock);
- if (wbc->nr_to_write <= 0)
+ if (wbc->nr_to_write <= 0) {
+ wbc->more_io = 1;
break;
+ }
+ if (!list_empty(&sb->s_more_io))
+ wbc->more_io = 1;
}
return; /* Leave any unwritten inodes on s_io */
}
--- linux.orig/include/linux/writeback.h
+++ linux/include/linux/writeback.h
@@ -62,6 +62,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
+ unsigned more_io:1; /* more io to be dispatched */
};

/*
--- linux.orig/mm/page-writeback.c
+++ linux/mm/page-writeback.c
@@ -558,6 +558,7 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
@@ -565,8 +566,9 @@ static void background_writeout(unsigned
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
+ congestion_wait(WRITE, HZ/10);
+ else
break;
}
}
@@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
while (nr_to_write > 0) {
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */

2008-01-17 13:53:18

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Fengguang Wu <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have? Just heavily test our own 2.6.24 + your
> > > evolving "close, but not ready for merge" -mm writeback patchset?
> > >
> > Hi Fengguang, Mike,
> >
> > I can add myself to Mike's question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > showing quite nice improvement of the overall writeback situation and it
> > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > apparently already has reverted "...2250b". I will definitely repeat my
> > tests with -rc8 and report.
>
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
>
Hi Fengguang,

something really bad has happened between -rc3 and -rc6. Embarrassingly I did not catch that earlier :-(

Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only test that is still good is mix3, which I attribute to the per-BDI stuff.

At the moment I am frantically trying to find when things went down. I did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that I cannot provide any input to your patch.

Depressed
Martin

2008-01-17 16:12:12

by Mike Snitzer

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Jan 17, 2008 8:52 AM, Martin Knoblauch <[email protected]> wrote:
>
> ----- Original Message ----
> > From: Fengguang Wu <[email protected]>
> > To: Martin Knoblauch <[email protected]>
> > Cc: Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> >
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have? Just heavily test our own 2.6.24 + your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > >
> > > Hi Fengguang, Mike,
> > >
> > > I can add myself to Mike's question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > > showing quite nice improvement of the overall writeback situation and it
> > > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > > apparently already has reverted "...2250b". I will definitely repeat my
> > > tests with -rc8 and report.
> >
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> >
> Hi Fengguang,
>
> something really bad has happened between -rc3 and -rc6. Embarrassingly I did not catch that earlier :-(
>
> Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only test that is still good is mix3, which I attribute to the per-BDI stuff.
>
> At the moment I am frantically trying to find when things went down. I did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that I cannot provide any input to your patch.
>
> Depressed
> Martin

Martin,

I've backported Peter's perbdi patchset to 2.6.22.x. I can share it
with anyone who might be interested.

As expected, it has yielded 2.6.24-rcX level scaling. Given the test
result matrix you previously posted, 2.6.22.x+perbdi might give you
what you're looking for (sans improved writeback that 2.6.24 was
thought to be providing). That is, much improved scaling with better
O_DIRECT and network throughput. Just a thought...

Unfortunately, my priorities (and computing resources) have shifted
and I won't be able to thoroughly test Fengguang's new writeback patch
on 2.6.24-rc8... whereby missing out on providing
justification/testing to others on _some_ improved writeback being
included in 2.6.24 final.

Not to mention the window for writeback improvement is all but closed
considering the 2.6.24-rc8 announcement's 2.6.24 final release
timetable.

regards,
Mike

2008-01-17 17:44:50

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Martin Knoblauch <[email protected]>
> To: Fengguang Wu <[email protected]>
> Cc: Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>
> Sent: Thursday, January 17, 2008 2:52:58 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> ----- Original Message ----
> > From: Fengguang Wu
> > To: Martin Knoblauch
> > Cc: Mike Snitzer; Peter Zijlstra; [email protected]; Ingo Molnar; [email protected]; "[email protected]"; Linus Torvalds
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> >
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have? Just heavily test our own 2.6.24 + your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > >
> > > Hi Fengguang, Mike,
> > >
> > > I can add myself to Mike's question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > > showing quite nice improvement of the overall writeback situation and it
> > > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > > apparently already has reverted "...2250b". I will definitely repeat my
> > > tests with -rc8 and report.
> >
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> >
> Hi Fengguang,
>
> something really bad has happened between -rc3 and -rc6.
> Embarrassingly I did not catch that earlier :-(
>
> Compared to the numbers I posted in
> http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight
> plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only
> test that is still good is mix3, which I attribute to the per-BDI stuff.
>
> At the moment I am frantically trying to find when things went down.
> I did run -rc8 and rc8+yourpatch. No difference to what I see with
> -rc6. Sorry that I cannot provide any input to your patch.
>

OK, the change happened between rc5 and rc6. Just following a gut feeling, I reverted

#commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
#Author: Mel Gorman <[email protected]>
#Date: Mon Dec 17 16:20:05 2007 -0800
#
# mm: fix page allocation for larger I/O segments
#
# In some cases the IO subsystem is able to merge requests if the pages are
# adjacent in physical memory. This was achieved in the allocator by having
# expand() return pages in physically contiguous order in situations were a
# large buddy was split. However, list-based anti-fragmentation changed the
# order pages were returned in to avoid searching in buffered_rmqueue() for a
# page of the appropriate migrate type.
#
# This patch restores behaviour of rmqueue_bulk() preserving the physical
# order of pages returned by the allocator without incurring increased search
# costs for anti-fragmentation.
#
# Signed-off-by: Mel Gorman <[email protected]>
# Cc: James Bottomley <[email protected]>
# Cc: Jens Axboe <[email protected]>
# Cc: Mark Lord <[email protected]>
# Signed-off-by: Andrew Morton <[email protected]>
# Signed-off-by: Linus Torvalds <[email protected]>
diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c 2007-12-21 04:14:11.305633890 +0000
+++ linux-2.6.24-rc6/mm/page_alloc.c 2007-12-21 04:14:17.746985697 +0000
@@ -847,8 +847,19 @@
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
+
+ /*
+ * Split buddy pages returned by expand() are received here
+ * in physical page order. The page is added to the callers and
+ * list and the list head then moves forward. From the callers
+ * perspective, the linked list is ordered by page number in
+ * some conditions. This is useful for IO devices that can
+ * merge IO requests if the physical pages are ordered
+ * properly.
+ */
list_add(&page->lru, list);
set_page_private(page, migratetype);
+ list = &page->lru;
}
spin_unlock(&zone->lock);
return i;
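
(A toy userspace illustration, not kernel code, of what the one-line
"list = &page->lru;" cursor advance changes: linking every page after a fixed
head hands the caller the pages in reverse physical order, which defeats
request merging in the IO layer, while advancing the insertion point preserves
the order expand() produced:)

	#include <stdio.h>

	struct node {
		int pfn;		/* stand-in for a physical page frame number */
		struct node *next;
	};

	int main(void)
	{
		struct node pages[5] = { {0, 0}, {1, 0}, {2, 0}, {3, 0}, {4, 0} };
		struct node head;
		struct node *cursor, *p;
		int i, advance;

		for (advance = 0; advance <= 1; advance++) {
			head.next = NULL;
			cursor = &head;
			for (i = 0; i < 5; i++) {
				/* list_add(&page->lru, list): insert after cursor */
				pages[i].next = cursor->next;
				cursor->next = &pages[i];
				if (advance)
					cursor = &pages[i];	/* list = &page->lru; */
			}
			printf(advance ? "moving cursor: " : "fixed head:    ");
			for (p = head.next; p; p = p->next)
				printf("%d ", p->pfn);
			printf("\n");
		}
		/*
		 * prints:
		 *   fixed head:    4 3 2 1 0   (reverse physical order)
		 *   moving cursor: 0 1 2 3 4   (physical order preserved)
		 */
		return 0;
	}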



This has brought back the good results I observed and reported.
I do not know what to make of this. At least on the systems I care
about (HP DL380g4, dual CPUs, HT-enabled, 8 GB memory, SmartArray 6i
controller with 4x72GB SCSI disks as RAID5 (battery-protected writeback
cache enabled) and gigabit networking (tg3)) this optimisation is a disaster.

On the other hand, it is not a regression against 2.6.22/23. Those had
bad IO scaling too. It would just be a shame to lose an apparently great
performance win.

Cheers
Martin

2008-01-17 17:51:33

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Mike Snitzer <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Fengguang Wu <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>
> Sent: Thursday, January 17, 2008 5:11:50 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
>
> I've backported Peter's perbdi patchset to 2.6.22.x. I can share it
> with anyone who might be interested.
>
> As expected, it has yielded 2.6.24-rcX level scaling. Given the test
> result matrix you previously posted, 2.6.22.x+perbdi might give you
> what you're looking for (sans improved writeback that 2.6.24 was
> thought to be providing). That is, much improved scaling with better
> O_DIRECT and network throughput. Just a thought...
>
> Unfortunately, my priorities (and computing resources) have shifted
> and I won't be able to thoroughly test Fengguang's new writeback patch
> on 2.6.24-rc8... whereby missing out on providing
> justification/testing to others on _some_ improved writeback being
> included in 2.6.24 final.
>
> Not to mention the window for writeback improvement is all but closed
> considering the 2.6.24-rc8 announcement's 2.6.24 final release
> timetable.
>
Mike,

thanks for the offer, but the improved throughput is my #1 priority nowadays.
And while the better scaling for different targets is nothing to frown upon, the
much better scaling when writing to the same target would have been the big
winner for me.

Anyway, I located the "offending" commit. Let's see what the experts say.


Cheers
Martin


2008-01-17 20:24:18

by Mel Gorman

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > > > For those interested in using your writeback improvements in
> > > > > > production sooner rather than later (primarily with ext3); what
> > > > > > recommendations do you have? Just heavily test our own 2.6.24
> > > > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > > > >
> > > > >
> > > > > I can add myself to Mike's question. It would be good to know a
> > > >
> > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > > been showing quite nice improvement of the overall writeback situation and
> > > > it would be sad to see this [partially] gone in 2.6.24-final.
> > > > Linus apparently already has reverted "...2250b". I will definitely
> > > > repeat my tests with -rc8. and report.
> > > >
> > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > Maybe we can push it to 2.6.24 after your testing.
> > >
> > Hi Fengguang,
> >
> > something really bad has happened between -rc3 and -rc6.
> > Embarrassingly I did not catch that earlier :-(
> > Compared to the numbers I posted in
> > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > The only test that is still good is mix3, which I attribute to
> > the per-BDI stuff.

I suspect that the IO hardware you have is very sensitive to the color of the
physical page. I wonder, do you boot the system cleanly and then run these
tests? If so, it would be interesting to know what happens if you stress
the system first (many kernel compiles for example, basically anything that
would use a lot of memory in different ways for some time) to randomise the
free lists a bit and then run your test. You'd need to run the test
three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch you
identified reverted.

> > At the moment I am frantically trying to find when things went down. I
> > did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6.
> >
> > Sorry that I cannot provide any input to your patch.
>
> OK, the change happened between rc5 and rc6. Just following a gut feeling, I reverted
>
> #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> #Author: Mel Gorman <[email protected]>
> #Date: Mon Dec 17 16:20:05 2007 -0800
> #
> # mm: fix page allocation for larger I/O segments
> #
> # In some cases the IO subsystem is able to merge requests if the pages are
> # adjacent in physical memory. This was achieved in the allocator by having
> # expand() return pages in physically contiguous order in situations where a
> # large buddy was split. However, list-based anti-fragmentation changed the
> # order pages were returned in to avoid searching in buffered_rmqueue() for a
> # page of the appropriate migrate type.
> #
> # This patch restores behaviour of rmqueue_bulk() preserving the physical
> # order of pages returned by the allocator without incurring increased search
> # costs for anti-fragmentation.
> #
> # Signed-off-by: Mel Gorman <[email protected]>
> # Cc: James Bottomley <[email protected]>
> # Cc: Jens Axboe <[email protected]>
> # Cc: Mark Lord <[email protected]>
> # Signed-off-by: Andrew Morton <[email protected]>
> # Signed-off-by: Linus Torvalds <[email protected]>
> diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
> --- linux-2.6.24-rc5/mm/page_alloc.c 2007-12-21 04:14:11.305633890 +0000
> +++ linux-2.6.24-rc6/mm/page_alloc.c 2007-12-21 04:14:17.746985697 +0000
> @@ -847,8 +847,19 @@
> struct page *page = __rmqueue(zone, order, migratetype);
> if (unlikely(page == NULL))
> break;
> +
> + /*
> + * Split buddy pages returned by expand() are received here
> + * in physical page order. The page is added to the callers and
> + * list and the list head then moves forward. From the callers
> + * perspective, the linked list is ordered by page number in
> + * some conditions. This is useful for IO devices that can
> + * merge IO requests if the physical pages are ordered
> + * properly.
> + */
> list_add(&page->lru, list);
> set_page_private(page, migratetype);
> + list = &page->lru;
> }
> spin_unlock(&zone->lock);
> return i;
>
> This has brought back the good results I observed and reported.
> I do not know what to make of this. At least on the systems I care
> about (HP DL380g4, dual CPUs, HT-enabled, 8 GB memory, SmartArray 6i
> controller with 4x72GB SCSI disks as RAID5 (battery-protected writeback
> cache enabled) and gigabit networking (tg3)) this optimisation is a disaster.
>

That patch was not an optimisation, it was a regression fix against 2.6.23
and I don't believe reverting it is an option. Other IO hardware benefits
from having the allocator supply pages in PFN order. Your controller would
seem to suffer when presented with the same situation but I don't know why
that is. I've added James to the cc in case he has seen this sort of
situation before.

> On the other hand, it is not a regression against 2.6.22/23. Those had
> bad IO scaling too. It would just be a shame to lose an apparently great
> performance win.

Could you try running your tests again when the system has been stressed
with some other workload first?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-01-17 21:51:55

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Mel Gorman <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Fengguang Wu <[email protected]>; Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>; [email protected]
> Sent: Thursday, January 17, 2008 9:23:57 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > > > > For those interested in using your writeback improvements in
> > > > > > > production sooner rather than later (primarily with ext3); what
> > > > > > > recommendations do you have? Just heavily test our own 2.6.24
> > > > > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > > > > >
> > > > > >
> > > > > > I can add myself to Mike's question. It would be good to know a
> > > > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > > > > been showing quite nice improvement of the overall writeback situation and
> > > > > > it would be sad to see this [partially] gone in 2.6.24-final.
> > > > > > Linus apparently already has reverted "...2250b". I will definitely
> > > > > > repeat my tests with -rc8. and report.
> > > > > >
> > > > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > > > Maybe we can push it to 2.6.24 after your testing.
> > > > >
> > > > Hi Fengguang,
> > > >
> > > > something really bad has happened between -rc3 and -rc6.
> > > > Embarrassingly I did not catch that earlier :-(
> > > > Compared to the numbers I posted in
> > > > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > > > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > > > The only test that is still good is mix3, which I attribute to
> > > > the per-BDI stuff.
>
> I suspect that the IO hardware you have is very sensitive to the
> color of the physical page. I wonder, do you boot the system cleanly
> and then run these tests? If so, it would be interesting to know what
> happens if you stress the system first (many kernel compiles for example,
> basically anything that would use a lot of memory in different ways for some
> time) to randomise the free lists a bit and then run your test. You'd need to run
> the test three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch you
> identified reverted.
>

The effect definitely depends on the IO hardware. I performed the same tests
on a different box with an AACRAID controller, and there things look different. Basically,
the "offending" commit helps single-stream performance on that box, while dual/triple
stream are not affected. So I suspect that the CCISS is just not behaving well.

And yes, the tests are usually done on a freshly booted box. Of course, I repeat them
a few times. On the CCISS box the numbers are very constant. On the AACRAID box
they vary quite a bit.

I can certainly stress the box before doing the tests. Please define "many" for the kernel
compiles :-)

> >
> > OK, the change happened between rc5 and rc6. Just following a
> > gut feeling, I reverted
> >
> > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > #Author: Mel Gorman
> > #Date: Mon Dec 17 16:20:05 2007 -0800
> > #

> >
> > This has brought back the good results I observed and reported.
> > I do not know what to make of this. At least on the systems
> > I care about (HP DL380g4, dual CPUs, HT-enabled, 8 GB memory,
> > SmartArray 6i controller with 4x72GB SCSI disks as RAID5 (battery-
> > protected writeback cache enabled) and gigabit networking (tg3)) this
> > optimisation is a disaster.
> >
>
> That patch was not an optimisation, it was a regression fix
> against 2.6.23 and I don't believe reverting it is an option. Other IO
> hardware benefits from having the allocator supply pages in PFN order.

I think this late in the 2.6.24 game we should just leave things as they are. But
we should try to find a way to make CCISS faster, as it apparently can be faster.

> Your controller would seem to suffer when presented with the same situation
> but I don't know why that is. I've added James to the cc in case he has seen this
> sort of situation before.
>
> > On the other hand, it is not a regression against 2.6.22/23. Those
> > had bad IO scaling too. It would just be a shame to lose an apparently
> > great performance win.
>
> Could you try running your tests again when the system has been
> stressed with some other workload first?
>

Will do.

Cheers
Martin


2008-01-17 22:12:35

by Mel Gorman

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> > <mail manglement snipped>
>
> The effect definitely depends on the IO hardware. I performed the same tests
> on a different box with an AACRAID controller and there things look different.

I take it different also means it does not show this odd performance
behaviour and is similar whether the patch is applied or not?

> Basically
> the "offending" commit helps seingle stream performance on that box, while dual/triple
> stream are not affected. So I suspect that the CCISS is just not behaving well.
>
> And yes, the tests are usually done on a freshly booted box. Of course, I repeat them
> a few times. On the CCISS box the numbers are very constant. On the AACRAID box
> they vary quite a bit.
>
> I can certainly stress the box before doing the tests. Please define "many" for the kernel
> compiles :-)
>

With 8GiB of RAM, try making 24 copies of the kernel and compiling them
all simultaneously. Running that for 20-30 minutes should be enough to
randomise the freelists, affecting what color of page is used for the dd test.

> > > OK, the change happened between rc5 and rc6. Just following a
> > > gut feeling, I reverted
> > >
> > > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > > #Author: Mel Gorman
> > > #Date: Mon Dec 17 16:20:05 2007 -0800
> > > #
>
> > >
> > > This has brought back the good results I observed and reported.
> > > I do not know what to make of this. At least on the systems
> > > I care about (HP DL380g4, dual CPUs, HT-enabled, 8 GB memory,
> > > SmartArray 6i controller with 4x72GB SCSI disks as RAID5 (battery-
> > > protected writeback cache enabled) and gigabit networking (tg3)) this
> > > optimisation is a disaster.
> > >
> >
> > That patch was not an optimisation, it was a regression fix
> > against 2.6.23 and I don't believe reverting it is an option. Other IO
> > hardware benefits from having the allocator supply pages in PFN order.
>
> I think this late in the 2.6.24 game we just should leave things as they are. But
> we should try to find a way to make CCISS faster, as it apparently can be faster.
>
> > Your controller would seem to suffer when presented with the same situation
> > but I don't know why that is. I've added James to the cc in case he has seen this
> > sort of situation before.
> >
> > > On the other hand, it is not a regression against 2.6.22/23. Those
> > > had bad IO scaling too. It would just be a shame to lose an apparently
> > > great performance win.
> >
> > Could you try running your tests again when the system has been
> > stressed with some other workload first?
> >
>
> Will do.
>

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-01-18 08:19:20

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Mel Gorman <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Fengguang Wu <[email protected]>; Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>; [email protected]
> Sent: Thursday, January 17, 2008 11:12:21 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> > The effect definitely depends on the IO hardware. I performed the same
> > tests on a different box with an AACRAID controller and there things
> > look different.
>
> I take it different also means it does not show this odd performance
> behaviour and is similar whether the patch is applied or not?
>

Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:

Test      2.6.19.2   2.6.24-rc6   2.6.24-rc6-81eabcbe0b991ddef5216f30ae91c4b226d54b6d
dd1       325        350          290
dd1-dir   180        160          160
dd2       2x90       2x113        2x110
dd2-dir   2x120      2x92         2x93
dd3       3x54       3x70         3x70
dd3-dir   3x83       3x64         3x64
mix3      55,2x30    400,2x25     310,2x25

What we are seeing here is that:

a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system compared to the CCISS box
b) Reverting your patch hurts single stream
c) dual/triple stream are not affected by your patch and are improved over 2.6.19
d) the mix3 performance is improved compared to 2.6.19.
d1) reverting your patch hurts the local-disk part of mix3
e) the AACRAID setup is definitely faster than the CCISS.

So, on this box your patch is definitely needed to get the pre-2.6.24 performance
when writing a single big file.

Actually, things on the CCISS box might be even more complicated. I forgot that
on that box we have ext2/LVM/DM/hardware, while on the AACRAID box we have
ext2/hardware. Do you think that LVM/MD are sensitive to the page order/coloring?

Anyway: does your patch only address this performance issue, or are there also
data integrity concerns without it? I may consider reverting the patch for my
production environment. It really helps two thirds of my boxes big time, while it does
not hurt the other third that much :-)

> >
> > I can certainly stress the box before doing the tests. Please
> > define "many" for the kernel compiles :-)
> >
>
> With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> all simultaneously. Running that for 20-30 minutes should be enough to
> randomise the freelists, affecting what color of page is used for the
> dd test.
>

ouch :-) OK, I will try that.

Martin


2008-01-18 16:01:37

by Mel Gorman

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On (18/01/08 00:19), Martin Knoblauch didst pronounce:
> > > The effect definitely depends on the IO hardware. I performed the same
> > > tests on a different box with an AACRAID controller and there things
> > > look different.
> >
> > I take it different also means it does not show this odd performance
> > behaviour and is similar whether the patch is applied or not?
> >
>
> Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:
>
> Test      2.6.19.2   2.6.24-rc6   2.6.24-rc6-81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> dd1       325        350          290
> dd1-dir   180        160          160
> dd2       2x90       2x113        2x110
> dd2-dir   2x120      2x92         2x93
> dd3       3x54       3x70         3x70
> dd3-dir   3x83       3x64         3x64
> mix3      55,2x30    400,2x25     310,2x25
>
> What we are seeing here is that:
>
> a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system compared to the CCISS box
> b) Reverting your patch hurts single stream

Right, and this is consistent with other complaints about the PFN of the
page mattering to some hardware.

> c) dual/triple stream are not affected by your patch and are improved over 2.6.19

I am not very surprised. The callers to the page allocator are probably
making no special effort to get a batch of pages in PFN-order. They are just
assuming that subsequent calls give contiguous pages. With two or more
threads involved, there will not be a correlation between physical pages
and what is on disk any more.
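
To make the merging angle concrete, here is a rough user-space illustration of the adjacency test behind scatter-gather merging; the mergeable() helper and the PFN arrays are invented for the example and are not the kernel's actual BIOVEC merge macros:

#include <stdio.h>

/* two pages can share a scatter-gather segment only if physically adjacent */
static int mergeable(unsigned long pfn_a, unsigned long pfn_b)
{
	return pfn_a + 1 == pfn_b;
}

int main(void)
{
	unsigned long in_order[4] = { 100, 101, 102, 103 };	/* patched allocator */
	unsigned long reversed[4] = { 103, 102, 101, 100 };	/* patch reverted */
	unsigned long *lists[2] = { in_order, reversed };
	const char *name[2] = { "PFN order", "reversed" };
	int l, i;

	for (l = 0; l < 2; l++) {
		int segments = 1;	/* first page always opens a segment */

		for (i = 0; i < 3; i++)
			if (!mergeable(lists[l][i], lists[l][i + 1]))
				segments++;
		printf("%-9s: %d scatter-gather segment(s)\n", name[l], segments);
	}
	return 0;
}

A single dd stream sees something like the first case with the fix applied; once two or more writers interleave their allocations, even a patched allocator tends toward the second.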

> d) the mix3 performance is improved compared to 2.6.19.
> d1) reverting your patch hurts the local-disk part of mix3
> e) the AACRAID setup is definitely faster than the CCISS.
>
> So, on this box your patch is definitely needed to get the pre-2.6.24 performance
> when writing a single big file.
>
> Actually things on the CCISS box might be even more complicated. I forgot the fact
> that on that box we have ext2/LVM/DM/Hardware, while on the AACRAID box we have
> ext2/Hardware. Do you think that the LVM/MD are sensitive to the page order/coloring?
>

I don't have enough experience with LVM setups to make an intelligent
guess.

> Anyway: does your patch only address this performance issue, or are there also
> data integrity concerns without it?

Performance issue only. There are no data integrity concerns with that
patch.

> I may consider reverting the patch for my
> production environment. It really helps two thirds of my boxes big time, while it does
> not hurt the other third that much :-)
>

That is certainly an option.

> > >
> > > I can certainly stress the box before doing the tests. Please
> > > define "many" for the kernel compiles :-)
> > >
> >
> > With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> > all simultaneously. Running that for 20-30 minutes should be enough to
> > randomise the freelists, affecting what color of page is used for the
> > dd test.
> >
>
> ouch :-) OK, I will try that.
>

Thanks.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-01-18 17:48:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX



On Fri, 18 Jan 2008, Mel Gorman wrote:
>
> Right, and this is consistent with other complaints about the PFN of the
> page mattering to some hardware.

I don't think it's actually the PFN per se.

I think it's simply that some controllers (quite probably affected by both
driver and hardware limits) have some subtle interactions with the size of
the IO commands.

For example, let's say that you have a controller that has some limit X on
the size of IO in flight (whether due to hardware or driver issues doesn't
really matter) in addition to a limit on the size of the scatter-gather
size. They all tend to have limits, and they differ.

Now, the PFN doesn't matter per se, but the allocation pattern definitely
matters for whether the IO's are physically contiguous, and thus matters
for the size of the scatter-gather thing.

Now, generally the rule-of-thumb is that you want big commands, so
physical merging is good for you, but I could well imagine that the IO
limits interact, and end up hurting each other. Let's say that a better
allocation order allows for bigger contiguous physical areas, and thus
fewer scatter-gather entries.

What does that result in? The obvious answer is

"Better performance obviously, because the controller needs to do fewer
scatter-gather lookups, and the requests are bigger, because there are
fewer IO's that hit scatter-gather limits!"

Agreed?

Except maybe the *real* answer for some controllers ends up being

"Worse performance, because individual commands grow because they don't
hit the per-command limits, but now we hit the global size-in-flight
limits and have many fewer of these good commands in flight. And while
the commands are larger, it means that there are fewer outstanding
commands, which can mean that the disk cannot schedule things
as well, or makes high latency of command generation by the controller
much more visible because there aren't enough concurrent requests
queued up to hide it"

Is this the reason? I have no idea. But somebody who knows the AACRAID
hardware and driver limits might think about interactions like that.
Sometimes you actually might want to have smaller individual commands if
there is some other limit that means that it can be more advantageous to
have many small requests over a few big ones.

RAID might well make it worse. Maybe small requests work better because
they are simpler to schedule because they only hit one disk (eg if you
have simple striping)! So that's another reason why one *large* request
may actually be slower than two requests half the size, even if it's
against the "normal rule".

And it may be that that AACRAID box takes a big hit on DIO exactly because
DIO has been optimized almost purely for making one command as big as
possible.

Just a theory.

Linus
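
The arithmetic behind that second answer is simple enough to spell out. A back-of-the-envelope sketch; the limits are invented round numbers, not real cciss or aacraid values:

#include <stdio.h>

int main(void)
{
	long in_flight_limit = 2048;	/* hypothetical global KB in flight */
	long small_cmd = 64;		/* KB per poorly-merged command */
	long large_cmd = 512;		/* KB per well-merged command */

	/* Bigger commands mean fewer commands outstanding under the same
	 * global limit: less concurrency for the controller and disks to
	 * schedule around, despite the smaller scatter-gather overhead. */
	printf("small commands in flight: %ld\n", in_flight_limit / small_cmd);
	printf("large commands in flight: %ld\n", in_flight_limit / large_cmd);
	return 0;
}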

2008-01-18 19:01:24

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX


--- Linus Torvalds <[email protected]> wrote:

>
>
> On Fri, 18 Jan 2008, Mel Gorman wrote:
> >
> > Right, and this is consistent with other complaints about the PFN
> > of the page mattering to some hardware.
>
> I don't think it's actually the PFN per se.
>
> I think it's simply that some controllers (quite probably affected by
> both driver and hardware limits) have some subtle interactions with
> the size of the IO commands.
>
> For example, let's say that you have a controller that has some limit
> X on the size of IO in flight (whether due to hardware or driver
> issues doesn't really matter) in addition to a limit on the size
> of the scatter-gather size. They all tend to have limits, and
> they differ.
>
> Now, the PFN doesn't matter per se, but the allocation pattern
> definitely matters for whether the IO's are physically
> contiguous, and thus matters for the size of the scatter-gather
> thing.
>
> Now, generally the rule-of-thumb is that you want big commands, so
> physical merging is good for you, but I could well imagine that the
> IO limits interact, and end up hurting each other. Let's say that a
> better allocation order allows for bigger contiguous physical areas,
> and thus fewer scatter-gather entries.
>
> What does that result in? The obvious answer is
>
> "Better performance obviously, because the controller needs to do
> fewer scatter-gather lookups, and the requests are bigger, because
> there are fewer IO's that hit scatter-gather limits!"
>
> Agreed?
>
> Except maybe the *real* answer for some controllers end up being
>
> "Worse performance, because individual commands grow because they
> don't hit the per-command limits, but now we hit the global
> size-in-flight limits and have many fewer of these good commands in
> flight. And while the commands are larger, it means that there
> are fewer outstanding commands, which can mean that the disk
> cannot schedule things as well, or makes high latency of command
> generation by the controller much more visible because there aren't
> enough concurrent requests queued up to hide it"
>
> Is this the reason? I have no idea. But somebody who knows the
> AACRAID hardware and driver limits might think about interactions
> like that. Sometimes you actually might want to have smaller
> individual commands if there is some other limit that means that
> it can be more advantageous to have many small requests over a
> few big ones.
>
> RAID might well make it worse. Maybe small requests work better
> because they are simpler to schedule because they only hit one
> disk (eg if you have simple striping)! So that's another reason
> why one *large* request may actually be slower than two requests
> half the size, even if it's against the "normal rule".
>
> And it may be that that AACRAID box takes a big hit on DIO
> exactly because DIO has been optimized almost purely for making
> one command as big as possible.
>
> Just a theory.
>
> Linus

just to make one thing clear - I am not so much concerned about the
performance of AACRAID. It is OK with or without Mel's patch. It is
better with Mel's patch. The regression in DIO compared to 2.6.19.2 is
completely independent of Mel's stuff.

What interests me much more is the behaviour of the CCISS+LVM based
system. Here I see a huge benefit of reverting Mel's patch.

I dirtied the system after reboot as Mel suggested (24 parallel kernel
builds) and repeated the tests. The dirtying did not make any
difference. Here are the results:

Test      -rc8     -rc8-without-Mels-Patch
dd1       57       94
dd1-dir   87       86
dd2       2x8.5    2x45
dd2-dir   2x43     2x43
dd3       3x7      3x30
dd3-dir   3x28.5   3x28.5
mix3      59,2x25  98,2x24

The big IO size with Mel's patch really has a devastating effect on
the parallel writes. Nowhere near the value one would expect, while the
numbers are perfect without Mel's patch, as in rc1-rc5. Too bad I did not
see this earlier. Maybe we could have found a solution for .24.

At least, rc1-rc5 have shown that the CCISS system can do well. Now
the question is which part of the system does not cope well with the
larger IO sizes. Is it the CCISS controller, LVM, or both? I am open to
suggestions on how to debug that.

Cheers
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

2008-01-18 19:24:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX



On Fri, 18 Jan 2008, Martin Knoblauch wrote:
>
> just to make one thing clear - I am not so much concerned about the
> performance of AACRAID. It is OK with or without Mel's patch. It is
> better with Mel's patch. The regression in DIO compared to 2.6.19.2 is
> completely independent of Mel's stuff.
>
> What interests me much more is the behaviour of the CCISS+LVM based
> system. Here I see a huge benefit of reverting Mel's patch.

Ok, I just got your usage cases confused.

The argument stays the same: some controllers/drivers may have subtle
behavioural differences that come from the IO limits themselves.

So it wasn't AACRAID, it was CCISS+LVM. The argument is the same: it may
well be that the *bigger* IO sizes are actually what hurts, even if the
conventional wisdom is traditionally that bigger submissions are better.

> At least, rc1-rc5 have shown that the CCISS system can do well. Now
> the question is which part of the system does not cope well with the
> larger IO sizes? Is it the CCISS controller, LVM or both. I am open to
> suggestions on how to debug that.

I think you need to ask the MD/DM people for suggestions... who aren't cc'd
here.

Linus

2008-01-18 20:00:42

by Mike Snitzer

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Jan 18, 2008 12:46 PM, Linus Torvalds <[email protected]> wrote:
>
>
> On Fri, 18 Jan 2008, Mel Gorman wrote:
> >
> > Right, and this is consistent with other complaints about the PFN of the
> > page mattering to some hardware.
>
> I don't think it's actually the PFN per se.
>
> I think it's simply that some controllers (quite probably affected by both
> driver and hardware limits) have some subtle interactions with the size of
> the IO commands.
>
> For example, let's say that you have a controller that has some limit X on
> the size of IO in flight (whether due to hardware or driver issues doesn't
> really matter) in addition to a limit on the size of the scatter-gather
> size. They all tend to have limits, and they differ.
>
> Now, the PFN doesn't matter per se, but the allocation pattern definitely
> matters for whether the IO's are physically contiguous, and thus matters
> for the size of the scatter-gather thing.
>
> Now, generally the rule-of-thumb is that you want big commands, so
> physical merging is good for you, but I could well imagine that the IO
> limits interact, and end up hurting each other. Let's say that a better
> allocation order allows for bigger contiguous physical areas, and thus
> fewer scatter-gather entries.
>
> What does that result in? The obvious answer is
>
> "Better performance obviously, because the controller needs to do fewer
> scatter-gather lookups, and the requests are bigger, because there are
> fewer IO's that hit scatter-gather limits!"
>
> Agreed?
>
> Except maybe the *real* answer for some controllers end up being
>
> "Worse performance, because individual commands grow because they don't
> hit the per-command limits, but now we hit the global size-in-flight
> limits and have many fewer of these good commands in flight. And while
> the commands are larger, it means that there are fewer outstanding
> commands, which can mean that the disk cannot schedule things
> as well, or makes high latency of command generation by the controller
> much more visible because there aren't enough concurrent requests
> queued up to hide it"
>
> Is this the reason? I have no idea. But somebody who knows the AACRAID
> hardware and driver limits might think about interactions like that.
> Sometimes you actually might want to have smaller individual commands if
> there is some other limit that means that it can be more advantageous to
> have many small requests over a few big ones.
>
> RAID might well make it worse. Maybe small requests work better because
> they are simpler to schedule because they only hit one disk (eg if you
> have simple striping)! So that's another reason why one *large* request
> may actually be slower than two requests half the size, even if it's
> against the "normal rule".
>
> And it may be that that AACRAID box takes a big hit on DIO exactly because
> DIO has been optimized almost purely for making one command as big as
> possible.
>
> Just a theory.

Oddly enough, I'm seeing the opposite here with 2.6.22.16 w/ AACRAID
configured with 5 LUNs (each a 2-disk HW RAID0, 1024k stripe size). That
is, with dd the avgrq-sz (from iostat) shows DIO to be ~130k whereas
non-DIO is a mere ~13k! (NOTE: with aacraid, max_hw_sectors_kb=192)

DIO cmdline: dd if=/dev/zero of=/dev/sdX bs=8192k count=1k oflag=direct
non-DIO cmdline: dd if=/dev/zero of=/dev/sdX bs=8192k count=1k

DIO is ~80MB/s on all 5 LUNs for a total of ~400MB/s
non-DIO is only ~12MB/s on all 5 LUNs for a mere ~70MB/s aggregate
(deadline w/ nr_requests=32)

Calls into question the theory of small requests being beneficial for
AACRAID. Martin, what are you seeing for the avg request size when
you're conducting your AACRAID tests?
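
For reference, iostat's avgrq-sz is just sectors transferred divided by I/Os completed, so throughput is IOPS times average request size. A quick sanity check against the numbers above; the IOPS figures are inferred from the reported rates, not measured:

#include <stdio.h>

static double mbps(double iops, double req_kb)
{
	return iops * req_kb / 1024.0;
}

int main(void)
{
	/* DIO: ~130k requests at roughly 630 IOPS -> ~80 MB/s per LUN */
	printf("DIO:     ~%.0f MB/s\n", mbps(630.0, 130.0));
	/* buffered: ~13k requests even at roughly 950 IOPS -> ~12 MB/s */
	printf("non-DIO: ~%.0f MB/s\n", mbps(950.0, 13.0));
	return 0;
}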

I can fire up 2.6.24-rc8 in short order to see if things are vastly
improved (as Martin seems to indicate that he is happy with AACRAID on
2.6.24-rc8). Although even Martin's AACRAID numbers from 2.6.19.2 are
still quite good (relative to mine). Martin can you share any tuning
you may have done to get AACRAID to where it is for you right now?

regards,
Mike

2008-01-18 22:47:18

by Mike Snitzer

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Jan 18, 2008 3:00 PM, Mike Snitzer <[email protected]> wrote:
>
> On Jan 18, 2008 12:46 PM, Linus Torvalds <[email protected]> wrote:
> >
> >
> > On Fri, 18 Jan 2008, Mel Gorman wrote:
> > >
> > > Right, and this is consistent with other complaints about the PFN of the
> > > page mattering to some hardware.
> >
> > I don't think it's actually the PFN per se.
> >
> > I think it's simply that some controllers (quite probably affected by both
> > driver and hardware limits) have some subtle interactions with the size of
> > the IO commands.
> >
> > For example, let's say that you have a controller that has some limit X on
> > the size of IO in flight (whether due to hardware or driver issues doesn't
> > really matter) in addition to a limit on the size of the scatter-gather
> > size. They all tend to have limits, and they differ.
> >
> > Now, the PFN doesn't matter per se, but the allocation pattern definitely
> > matters for whether the IO's are physically contiguous, and thus matters
> > for the size of the scatter-gather thing.
> >
> > Now, generally the rule-of-thumb is that you want big commands, so
> > physical merging is good for you, but I could well imagine that the IO
> > limits interact, and end up hurting each other. Let's say that a better
> > allocation order allows for bigger contiguous physical areas, and thus
> > fewer scatter-gather entries.
> >
> > What does that result in? The obvious answer is
> >
> > "Better performance obviously, because the controller needs to do fewer
> > scatter-gather lookups, and the requests are bigger, because there are
> > fewer IO's that hit scatter-gather limits!"
> >
> > Agreed?
> >
> > Except maybe the *real* answer for some controllers end up being
> >
> > "Worse performance, because individual commands grow because they don't
> > hit the per-command limits, but now we hit the global size-in-flight
> > limits and have many fewer of these good commands in flight. And while
> > the commands are larger, it means that there are fewer outstanding
> > commands, which can mean that the disk cannot schedule things
> > as well, or makes high latency of command generation by the controller
> > much more visible because there aren't enough concurrent requests
> > queued up to hide it"
> >
> > Is this the reason? I have no idea. But somebody who knows the AACRAID
> > hardware and driver limits might think about interactions like that.
> > Sometimes you actually might want to have smaller individual commands if
> > there is some other limit that means that it can be more advantageous to
> > have many small requests over a few big ones.
> >
> > RAID might well make it worse. Maybe small requests work better because
> > they are simpler to schedule because they only hit one disk (eg if you
> > have simple striping)! So that's another reason why one *large* request
> > may actually be slower than two requests half the size, even if it's
> > against the "normal rule".
> >
> > And it may be that that AACRAID box takes a big hit on DIO exactly because
> > DIO has been optimized almost purely for making one command as big as
> > possible.
> >
> > Just a theory.
>
> Oddly enough, I'm seeing the opposite here with 2.6.22.16 w/ AACRAID
> configured with 5 LUNS (each 2disk HW RAID0, 1024k stripesz). That
> is, with dd the avgrqsiz (from iostat) shows DIO to be ~130k whereas
> non-DIO is a mere ~13k! (NOTE: with aacraid, max_hw_sectors_kb=192)
...
> I can fire up 2.6.24-rc8 in short order to see if things are vastly
> improved (as Martin seems to indicate that he is happy with AACRAID on
> 2.6.24-rc8). Although even Martin's AACRAID numbers from 2.6.19.2 are
> still quite good (relative to mine). Martin can you share any tuning
> you may have done to get AACRAID to where it is for you right now?

I can confirm 2.6.24-rc8 behaves like Martin has posted for the
AACRAID: slower DIO with a smaller avgrq-sz, and much faster buffered IO
(for my config anyway) with a much larger avgrq-sz (180K).

I have no idea why 2.6.22.16's request size on non-DIO is _so_ small...

Mike

2008-01-19 10:06:16

by Martin Knoblauch

[permalink] [raw]
Subject: Re: [PATCH] writeback: speed up writeback of big dirty files

---- Original Message ----
> From: Fengguang Wu <[email protected]>
> To: Linus Torvalds <[email protected]>
> Cc: Mike Snitzer <[email protected]>; Martin Knoblauch <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>
> Sent: Thursday, January 17, 2008 6:28:18 AM
> Subject: [PATCH] writeback: speed up writeback of big dirty files
>
> On Jan 16, 2008 9:15 AM, Martin Knoblauch wrote:
> > Fengguang's latest writeback patch applies cleanly, builds, boots on
> > 2.6.24-rc8.
>
> Linus, if possible, I'd suggest this patch be merged for 2.6.24.
>
> It's a safer version of the reverted patch. It was tested on
> ext2/ext3/jfs/xfs/reiserfs and won't 100% iowait even without the
> other bug fixing patches.
>
> Fengguang
> ---
>
> writeback: speed up writeback of big dirty files
>
> After making dirty a 100M file, the normal behavior is to
> start the writeback for all data after 30s delays. But
> sometimes the following happens instead:
>
> - after 30s: ~4M
> - after 5s: ~4M
> - after 5s: all remaining 92M
>
> Some analysis shows that the internal io dispatch queues go like this:
>
>        s_io      s_more_io
>        -------------------
>    1)  100M,1K   0
>    2)  1K        96M
>    3)  0         96M
>
>    1) initial state with a 100M file and a 1K file
>    2) 4M written, nr_to_write <= 0, so write more
>    3) 1K written, nr_to_write > 0, no more writes (BUG)
> nr_to_write > 0 in (3) fools the upper layer to think that data have
> all been written out. The big dirty file is actually still sitting in
> s_more_io. We cannot simply splice s_more_io back to s_io as soon as
> s_io becomes empty, and let the loop in generic_sync_sb_inodes()
> continue: this may starve newly expired inodes in s_dirty. It is also
> not an option to draw inodes from both s_more_io and s_dirty, and let
> the loop go on: this might lead to livelocks, and might also starve
> other superblocks in sync time (well, kupdate may still starve some
> superblocks, that's another bug).
>
> We have to return when a full scan of s_io completes. So nr_to_write > 0
> does not necessarily mean that "all data are written". This patch
> introduces a flag writeback_control.more_io to indicate that more io
> should be done. With it the big dirty file no longer has to wait for
> the next kupdate invocation 5s later.
>
> In sync_sb_inodes() we only set more_io on super_blocks we actually
> visited. This avoids the interaction between two pdflush daemons.
>
> Also in __sync_single_inode() we don't blindly keep requeuing the io if
> the filesystem cannot progress. Failing to do so may lead to 100% iowait.
>
> Tested-by: Mike Snitzer
> Signed-off-by: Fengguang Wu
> ---
> fs/fs-writeback.c | 18 ++++++++++++++++--
> include/linux/writeback.h | 1 +
> mm/page-writeback.c | 9 ++++++---
> 3 files changed, 23 insertions(+), 5 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -284,7 +284,17 @@ __sync_single_inode(struct inode *inode,
> * soon as the queue becomes uncongested.
> */
> inode->i_state |= I_DIRTY_PAGES;
> - requeue_io(inode);
> + if (wbc->nr_to_write <= 0) {
> + /*
> + * slice used up: queue for next turn
> + */
> + requeue_io(inode);
> + } else {
> + /*
> + * somehow blocked: retry later
> + */
> + redirty_tail(inode);
> + }
> } else {
> /*
> * Otherwise fully redirty the inode so that
> @@ -479,8 +489,12 @@ sync_sb_inodes(struct super_block *sb, s
> iput(inode);
> cond_resched();
> spin_lock(&inode_lock);
> - if (wbc->nr_to_write <= 0)
> + if (wbc->nr_to_write <= 0) {
> + wbc->more_io = 1;
> break;
> + }
> + if (!list_empty(&sb->s_more_io))
> + wbc->more_io = 1;
> }
> return; /* Leave any unwritten inodes on s_io */
> }
> --- linux.orig/include/linux/writeback.h
> +++ linux/include/linux/writeback.h
> @@ -62,6 +62,7 @@ struct writeback_control {
> unsigned for_reclaim:1; /* Invoked from the page allocator */
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> + unsigned more_io:1; /* more io to be dispatched */
> };
>
> /*
> --- linux.orig/mm/page-writeback.c
> +++ linux/mm/page-writeback.c
> @@ -558,6 +558,7 @@ static void background_writeout(unsigned
> global_page_state(NR_UNSTABLE_NFS) < background_thresh
> && min_pages <= 0)
> break;
> + wbc.more_io = 0;
> wbc.encountered_congestion = 0;
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> wbc.pages_skipped = 0;
> @@ -565,8 +566,9 @@ static void background_writeout(unsigned
> min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> /* Wrote less than expected */
> - congestion_wait(WRITE, HZ/10);
> - if (!wbc.encountered_congestion)
> + if (wbc.encountered_congestion || wbc.more_io)
> + congestion_wait(WRITE, HZ/10);
> + else
> break;
> }
> }
> @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
> global_page_state(NR_UNSTABLE_NFS) +
> (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> while (nr_to_write > 0) {
> + wbc.more_io = 0;
> wbc.encountered_congestion = 0;
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> writeback_inodes(&wbc);
> if (wbc.nr_to_write > 0) {
> - if (wbc.encountered_congestion)
> + if (wbc.encountered_congestion || wbc.more_io)
> congestion_wait(WRITE, HZ/10);
> else
> break; /* All the old data is written */
>
>
>
Hi Fengguang,

sorry for not coming back earlier. I compiled -rc8 with your patch. It boots and works with my test cases; more than that I cannot say. The performance decrease I see compared to -rc5 has been discussed elsewhere in this thread and is not related to your work.

Cheers
Martin
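
A toy user-space model of the control flow the patch adds may help: the queue sizes and slice length are invented, and only the role of wbc.more_io (telling the caller to loop instead of sleeping until the next kupdate) mirrors the real code:

#include <stdio.h>

#define SLICE 4	/* stand-in for MAX_WRITEBACK_PAGES, in MB */

int main(void)
{
	int s_io = 0, s_more_io = 96;	/* state (3) above: 96M parked */
	int more_io, nr_to_write;

	/* Without more_io, the caller would see nr_to_write > 0 here and
	 * break out, leaving the 96M to wait ~5s for the next kupdate.
	 * With it, sync_sb_inodes() reports pending work and we loop: */
	do {
		nr_to_write = SLICE;
		if (s_io == 0) {	/* splice the parked inodes back */
			s_io = s_more_io;
			s_more_io = 0;
		}
		while (s_io > 0 && nr_to_write > 0) {	/* write one slice */
			s_io--;
			nr_to_write--;
		}
		more_io = (s_io > 0 || s_more_io > 0);
		printf("slice done, %dM still dirty, more_io=%d\n",
		       s_io + s_more_io, more_io);
	} while (more_io);
	return 0;
}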

2008-01-19 10:24:25

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Mike Snitzer <[email protected]>
> To: Linus Torvalds <[email protected]>
> Cc: Mel Gorman <[email protected]>; Martin Knoblauch <[email protected]>; Fengguang Wu <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; [email protected]
> Sent: Friday, January 18, 2008 11:47:02 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> > I can fire up 2.6.24-rc8 in short order to see if things are vastly
> > improved (as Martin seems to indicate that he is happy with
> > AACRAID on 2.6.24-rc8). Although even Martin's AACRAID
> > numbers from 2.6.19.2 are still quite good (relative to mine).
> > Martin can you share any tuning you may have done to get AACRAID
> > to where it is for you right now?
Mike,

I have always been happy with the AACRAID box compared to the CCISS system. Even with the "regression" in 2.6.24-rc1..rc5 it was more than acceptable to me. For me the differences between 2.6.19 and 2.6.24-rc8 on the AACRAID setup are:

- 11% (single stream) to 25% (dual/triple stream) regression in DIO. Something I do not care much about. I just measure it for reference.
+ the very nice behaviour when writing to different targets (mix3), which I attribute to Peter's per-bdi stuff.

And until -rc6 I was extremely pleased with the cool speedup I saw on my CCISS boxes. This would have been the next "production" kernel for me. But let's discuss this under a separate topic. It has nothing to do with the original io-wait issue.

Oh, before I forget: there has been no tuning for the AACRAID. The system is an IBM x3650 with built-in AACRAID and battery-backed write cache. The disks are 6x142GB/15krpm in a RAID5 setup. I see one big difference between your and my tests: I do 1MB writes to simulate the behaviour of the real applications, while yours seem to be much smaller.

Cheers
Martin


2008-01-22 14:41:34

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> At least, rc1-rc5 have shown that the CCISS system can do well. Now
> the question is which part of the system does not cope well with the
> larger IO sizes. Is it the CCISS controller, LVM, or both? I am open to
> suggestions on how to debug that.

What is your LVM device configuration?
E.g. 'dmsetup table' and 'dmsetup info -c' output.
Some configurations lead to large IOs getting split up on the way through
device-mapper.

See if these patches make any difference:
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/

dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
dm-introduce-merge_bvec_fn.patch
dm-linear-add-merge.patch
dm-table-remove-merge_bvec-sector-restriction.patch

Alasdair
--
[email protected]

2008-01-22 15:25:29

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX



----- Original Message ----
> From: Alasdair G Kergon <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Linus Torvalds <[email protected]>; Mel Gorman <[email protected]>; Fengguang Wu <[email protected]>; Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; [email protected]; Jens Axboe <[email protected]>; Milan Broz <[email protected]>; Neil Brown <[email protected]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> > At least, rc1-rc5 have shown that the CCISS system can do well. Now
> > the question is which part of the system does not cope well with the
> > larger IO sizes. Is it the CCISS controller, LVM, or both? I am open to
> > suggestions on how to debug that.
>
> What is your LVM device configuration?
> E.g. 'dmsetup table' and 'dmsetup info -c' output.
> Some configurations lead to large IOs getting split up on the way
> through device-mapper.
>
Hi Alasdair,

here is the output; the filesystem in question is on LogVol02:

[root@lpsdm52 ~]# dmsetup table
VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
VolGroup00-LogVol00: 0 67108864 linear 104:2 384
[root@lpsdm52 ~]# dmsetup info -c
Name Maj Min Stat Open Targ Event UUID
VolGroup00-LogVol02 253 1 L--w 1 1 0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmOZ4OzOgGQIdF3qDx6fJmlZukXXLIy39R
VolGroup00-LogVol01 253 2 L--w 1 1 0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4Ogmfn2CcAd2Fh7i48twe8PZc2XK5bSOe1Fq
VolGroup00-LogVol00 253 0 L--w 1 1 0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmfYjxQKFP3zw2fGsezJN7ypSrfmP7oSvE

> See if these patches make any difference:
> http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
>
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>

thanks for the suggestion. Are they supposed to apply to mainline?

Cheers
Martin


2008-01-22 18:51:43

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Alasdair G Kergon <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Linus Torvalds <[email protected]>; Mel Gorman <[email protected]>; Fengguang Wu <[email protected]>; Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; [email protected]; Jens Axboe <[email protected]>; Milan Broz <[email protected]>; Neil Brown <[email protected]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
>
> See if these patches make any difference:
> http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
>
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>


nope, exactly the same poor results. To rule out LVM/DM I really have to see what happens if I set up a system with filesystems directly on partitions. Might take some time though.

Cheers
Martin


2008-01-22 23:42:21

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Tue, Jan 22, 2008 at 07:25:15AM -0800, Martin Knoblauch wrote:
> [root@lpsdm52 ~]# dmsetup table
> VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
> VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
> VolGroup00-LogVol00: 0 67108864 linear 104:2 384

The IO should pass straight through simple linear targets like that without
needing to get broken up, so I wouldn't expect those patches to make any
difference in this particular case.

Alasdair
--
[email protected]

2008-01-23 11:12:53

by Martin Knoblauch

[permalink] [raw]
Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Alasdair G Kergon <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Linus Torvalds <[email protected]>; Mel Gorman <[email protected]>; Fengguang Wu <[email protected]>; Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; [email protected]; Jens Axboe <[email protected]>; Milan Broz <[email protected]>; Neil Brown <[email protected]>
> Sent: Wednesday, January 23, 2008 12:40:52 AM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Tue, Jan 22, 2008 at 07:25:15AM -0800, Martin Knoblauch wrote:
> > [root@lpsdm52 ~]# dmsetup table
> > VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
> > VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
> > VolGroup00-LogVol00: 0 67108864 linear 104:2 384
>
> The IO should pass straight through simple linear targets like
> that without needing to get broken up, so I wouldn't expect those patches to
> make any difference in this particular case.
>

Alasdair,

LVM/DM are off the hook :-) I converted one box to using partitions directly, and the performance is the same disappointment as with LVM/DM. Thanks anyway for looking at my problem.

I will move the discussion now to a new thread, targeting CCISS directly.

Cheers
Martin