2007-10-22 06:23:44

by Maxim Levitsky

Subject: 100% iowait on one of cpus in current -git

Hi,

I found a bug in current -git:

On my system one of the CPUs stays 100% in iowait (I have a Core 2 Duo).
Otherwise the system works fine: no disk activity and no slowdown.
Suspecting a swap-related problem I tried to turn swap off, but it didn't change anything.
It is probably an accounting bug.

If I boot with init=/bin/bash, the problem disappears.
I then tried starting the usual /etc/init.d scripts, and the first one to show this bug was gpm.
But then I rebooted the system into X without gpm, and I still see 100% iowait.

No additional messages in dmesg.
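The iowait figure under discussion comes from /proc/stat: the fifth value on each `cpu` line counts clock ticks the CPU sat idle while I/O was outstanding. A minimal sketch of how a monitoring tool turns that into a fraction (the sample line is invented, not taken from this report):

```python
def iowait_fraction(stat_line):
    # /proc/stat cpu fields: user nice system idle iowait irq softirq ...
    # (all in clock ticks since boot)
    fields = [int(x) for x in stat_line.split()[1:]]
    return fields[4] / sum(fields)

# Hypothetical sample: a CPU that spent 6500 of 10000 ticks in iowait.
sample = "cpu1 1000 0 500 2000 6500 0 0 0"
print(iowait_fraction(sample))  # → 0.65
```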

I tried to bisect this, but eventually I ran into other bugs that cause the system to oops early.


This is a very rough estimate of the bug location:


HEAD
......
c8f30ae54714abf494d79826d90b5e4844fbf355 - has the above bug, but otherwise works properly
.....
5c8e191e8437616a498a8e1cc0af3dd0d32bbff2 - fails early
.....
f4a1c2bce002f683801bcdbbc9fd89804614fb6b - last known working revision
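The stumbling block described above (untestable revisions sitting between the good and bad endpoints) is what `git bisect skip` addresses in later git versions. A toy sketch of the idea, with an invented ten-revision history in which two revisions oops before the bug can be checked:

```python
def first_bad(revs, test):
    """Binary-search for the first bad revision when some revisions are
    untestable -- the idea behind `git bisect skip` (a toy model, not git)."""
    lo, hi = 0, len(revs) - 1          # revs[lo] is good, revs[hi] is bad
    while hi - lo > 1:
        candidates = range(lo + 1, hi)
        mid = lo + 1 + (hi - lo - 1) // 2
        # Walk outward from the midpoint until a testable revision is found.
        for i in sorted(candidates, key=lambda c: abs(c - mid)):
            verdict = test(revs[i])
            if verdict == "skip":
                continue
            lo, hi = (i, hi) if verdict == "good" else (lo, i)
            break
        else:
            break  # everything left is untestable; the result is ambiguous
    return revs[hi]

# Hypothetical history: bug introduced at revision 6; 3 and 4 oops early.
verdicts = lambda r: "skip" if r in (3, 4) else ("bad" if r >= 6 else "good")
print(first_bad(list(range(10)), verdicts))  # → 6
```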


Best regards,
Maxim Levitsky

PS: .config attached.


Attachments:
(No filename) (0.99 kB)
.config (47.39 kB)

2007-10-22 09:12:11

by Paolo Ornati

Subject: Re: 100% iowait on one of cpus in current -git

On Mon, 22 Oct 2007 08:22:52 +0200
Maxim Levitsky <[email protected]> wrote:

> I tried to bisect this, but eventually I ran into other bugs that cause the system to oops early.

You can pick a different revision to test with:
git-reset --hard "SHA1"

Choose one with "git-bisect visualize".

--
Paolo Ornati
Linux 2.6.23-ge8b8c977 on x86_64

2007-10-22 09:42:16

by Peter Zijlstra

Subject: Re: 100% iowait on one of cpus in current -git

On Mon, 2007-10-22 at 08:22 +0200, Maxim Levitsky wrote:
> Hi,
>
> I found a bug in current -git:
>
> On my system one of the CPUs stays 100% in iowait (I have a Core 2 Duo).
> Otherwise the system works fine: no disk activity and no slowdown.
> Suspecting a swap-related problem I tried to turn swap off, but it didn't change anything.
> It is probably an accounting bug.
>
> If I boot with init=/bin/bash, the problem disappears.
> I then tried starting the usual /etc/init.d scripts, and the first one to show this bug was gpm.
> But then I rebooted the system into X without gpm, and I still see 100% iowait.
>
> No additional messages in dmesg.

does sysrq-t show any D state tasks?
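A task in D state is in uninterruptible sleep, typically waiting on I/O; iowait time is accounted while such waits are pending, so a permanently D-state task fits the symptom. Besides sysrq-t, such tasks can be spotted by reading the state field of /proc/&lt;pid&gt;/stat. A parsing sketch, using an invented stat line:

```python
def task_state(stat_line):
    """Extract the one-letter state (R, S, D, ...) from a /proc/<pid>/stat
    line.  The comm field is parenthesised and may itself contain spaces,
    so split on the *last* closing parenthesis first."""
    return stat_line.rsplit(")", 1)[1].split()[0]

# Hypothetical line for a pdflush thread stuck in uninterruptible sleep:
sample = "221 (pdflush) D 2 0 0 0 -1 2149613632 0 0"
print(task_state(sample))  # → D
```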

2007-10-22 09:44:06

by Maxim Levitsky

Subject: Re: 100% iowait on one of cpus in current -git

On Monday 22 October 2007 11:11:52 Paolo Ornati wrote:
> On Mon, 22 Oct 2007 08:22:52 +0200
> Maxim Levitsky <[email protected]> wrote:
>
> > I tried to bisect this, but eventually I ran into other bugs that cause the system to oops early.
>
> You can pick a different revision to test with:
> git-reset --hard "SHA1"
>
> Choose one with "git-bisect visualize".
>

Well, I know that, and I did try it a lot.

The problem is that between the good and the bad revisions there are a few bugs that cause the system to oops early,
so I can't tell whether the 100% iowait bug is present or not.

Best regards,
Maxim Levitsky

2007-10-22 10:00:17

by Maxim Levitsky

Subject: Re: 100% iowait on one of cpus in current -git

On Monday 22 October 2007 11:41:57 Peter Zijlstra wrote:
> On Mon, 2007-10-22 at 08:22 +0200, Maxim Levitsky wrote:
> > Hi,
> >
> > I found a bug in current -git:
> >
> > On my system one of the CPUs stays 100% in iowait (I have a Core 2 Duo).
> > Otherwise the system works fine: no disk activity and no slowdown.
> > Suspecting a swap-related problem I tried to turn swap off, but it didn't change anything.
> > It is probably an accounting bug.
> >
> > If I boot with init=/bin/bash, the problem disappears.
> > I then tried starting the usual /etc/init.d scripts, and the first one to show this bug was gpm.
> > But then I rebooted the system into X without gpm, and I still see 100% iowait.
> >
> > No additional messages in dmesg.
>
> does sysrq-t show any D state tasks?
>
>
This one (below).
Probably the per-block-device dirty writeback patches?
I am now compiling revision 1f7d6668c29b1dfa307a44844f9bb38356fc989b.
Thanks for the pointer.



[ 673.365631] pdflush D c21bdecc 0 221 2
[ 673.365635] c21bdee0 00000046 00000002 c21bdecc c21bdec4 00000000 c21b3000 00000002
[ 673.365643] c0134892 c21b3164 c1e00200 00000001 c7109280 c21bdec0 c03ff849 c21bdef0
[ 673.365650] 00052974 00000000 000000ff 00000000 00000000 00000000 c21bdef0 000529dc
[ 673.365657] Call Trace:
[ 673.365659] [<c03fd728>] schedule_timeout+0x48/0xc0
[ 673.365663] [<c03fd50e>] io_schedule_timeout+0x5e/0xb0
[ 673.365667] [<c0170d11>] congestion_wait+0x71/0x90
[ 673.365671] [<c016b92e>] wb_kupdate+0x9e/0xf0
[ 673.365675] [<c016beb2>] pdflush+0x102/0x1d0
[ 673.365679] [<c013fa82>] kthread+0x42/0x70
[ 673.365683] [<c01050df>] kernel_thread_helper+0x7/0x18


Best regards,
Maxim Levitsky

2007-10-22 10:22:23

by Peter Zijlstra

Subject: Re: 100% iowait on one of cpus in current -git

On Mon, 2007-10-22 at 11:59 +0200, Maxim Levitsky wrote:
> On Monday 22 October 2007 11:41:57 Peter Zijlstra wrote:
> > On Mon, 2007-10-22 at 08:22 +0200, Maxim Levitsky wrote:
> > > Hi,
> > >
> > > I found a bug in current -git:
> > >
> > > On my system one of the CPUs stays 100% in iowait (I have a Core 2 Duo).
> > > Otherwise the system works fine: no disk activity and no slowdown.
> > > Suspecting a swap-related problem I tried to turn swap off, but it didn't change anything.
> > > It is probably an accounting bug.
> > >
> > > If I boot with init=/bin/bash, the problem disappears.
> > > I then tried starting the usual /etc/init.d scripts, and the first one to show this bug was gpm.
> > > But then I rebooted the system into X without gpm, and I still see 100% iowait.
> > >
> > > No additional messages in dmesg.
> >
> > does sysrq-t show any D state tasks?
> >
> >
> This one:
> Probably per-block device dirty writeback?
> I am compiling now revision 1f7d6668c29b1dfa307a44844f9bb38356fc989b
> Thanks for the pointer.
>
>
>
> [ 673.365631] pdflush D c21bdecc 0 221 2
> [ 673.365635] c21bdee0 00000046 00000002 c21bdecc c21bdec4 00000000 c21b3000 00000002
> [ 673.365643] c0134892 c21b3164 c1e00200 00000001 c7109280 c21bdec0 c03ff849 c21bdef0
> [ 673.365650] 00052974 00000000 000000ff 00000000 00000000 00000000 c21bdef0 000529dc
> [ 673.365657] Call Trace:
> [ 673.365659] [<c03fd728>] schedule_timeout+0x48/0xc0
> [ 673.365663] [<c03fd50e>] io_schedule_timeout+0x5e/0xb0
> [ 673.365667] [<c0170d11>] congestion_wait+0x71/0x90
> [ 673.365671] [<c016b92e>] wb_kupdate+0x9e/0xf0
> [ 673.365675] [<c016beb2>] pdflush+0x102/0x1d0
> [ 673.365679] [<c013fa82>] kthread+0x42/0x70
> [ 673.365683] [<c01050df>] kernel_thread_helper+0x7/0x18
>

That looks more like the inode writeback patches from Wu than the per
bdi dirty stuff. The latter typically hangs in balance_dirty_pages().


2007-10-22 10:41:16

by Maxim Levitsky

Subject: Re: 100% iowait on one of cpus in current -git

On Monday 22 October 2007 12:22:10 Peter Zijlstra wrote:
> On Mon, 2007-10-22 at 11:59 +0200, Maxim Levitsky wrote:
> > On Monday 22 October 2007 11:41:57 Peter Zijlstra wrote:
> > > On Mon, 2007-10-22 at 08:22 +0200, Maxim Levitsky wrote:
> > > > Hi,
> > > >
> > > > I found a bug in current -git:
> > > >
> > > > On my system one of the CPUs stays 100% in iowait (I have a Core 2 Duo).
> > > > Otherwise the system works fine: no disk activity and no slowdown.
> > > > Suspecting a swap-related problem I tried to turn swap off, but it didn't change anything.
> > > > It is probably an accounting bug.
> > > >
> > > > If I boot with init=/bin/bash, the problem disappears.
> > > > I then tried starting the usual /etc/init.d scripts, and the first one to show this bug was gpm.
> > > > But then I rebooted the system into X without gpm, and I still see 100% iowait.
> > > >
> > > > No additional messages in dmesg.
> > >
> > > does sysrq-t show any D state tasks?
> > >
> > >
> > This one:
> > Probably per-block device dirty writeback?
> > I am compiling now revision 1f7d6668c29b1dfa307a44844f9bb38356fc989b
> > Thanks for the pointer.
> >
> >
> >
> > [ 673.365631] pdflush D c21bdecc 0 221 2
> > [ 673.365635] c21bdee0 00000046 00000002 c21bdecc c21bdec4 00000000 c21b3000 00000002
> > [ 673.365643] c0134892 c21b3164 c1e00200 00000001 c7109280 c21bdec0 c03ff849 c21bdef0
> > [ 673.365650] 00052974 00000000 000000ff 00000000 00000000 00000000 c21bdef0 000529dc
> > [ 673.365657] Call Trace:
> > [ 673.365659] [<c03fd728>] schedule_timeout+0x48/0xc0
> > [ 673.365663] [<c03fd50e>] io_schedule_timeout+0x5e/0xb0
> > [ 673.365667] [<c0170d11>] congestion_wait+0x71/0x90
> > [ 673.365671] [<c016b92e>] wb_kupdate+0x9e/0xf0
> > [ 673.365675] [<c016beb2>] pdflush+0x102/0x1d0
> > [ 673.365679] [<c013fa82>] kthread+0x42/0x70
> > [ 673.365683] [<c01050df>] kernel_thread_helper+0x7/0x18
> >
>
> That looks more like the inode writeback patches from Wu than the per
> bdi dirty stuff. The latter typically hangs in balance_dirty_pages().
>
>
>

Yes, you are right.

Both revisions 1f7d6668c29b1dfa307a44844f9bb38356fc989b and 3e26c149c358529b1605f8959341d34bc4b880a3 work fine,
but I hadn't paid attention that those are before f4a1c2bce002f683801bcdbbc9fd89804614fb6b.
So, back to the drawing board... :-)

Will test revision 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, just after the writeback patches.
Thanks,
Best regards,
Maxim Levitsky

2007-10-22 10:56:09

by Wu Fengguang

Subject: Re: 100% iowait on one of cpus in current -git

On Mon, Oct 22, 2007 at 12:40:24PM +0200, Maxim Levitsky wrote:
> On Monday 22 October 2007 12:22:10 Peter Zijlstra wrote:
> > > [ 673.365631] pdflush D c21bdecc 0 221 2
> > > [ 673.365635] c21bdee0 00000046 00000002 c21bdecc c21bdec4 00000000 c21b3000 00000002
> > > [ 673.365643] c0134892 c21b3164 c1e00200 00000001 c7109280 c21bdec0 c03ff849 c21bdef0
> > > [ 673.365650] 00052974 00000000 000000ff 00000000 00000000 00000000 c21bdef0 000529dc
> > > [ 673.365657] Call Trace:
> > > [ 673.365659] [<c03fd728>] schedule_timeout+0x48/0xc0
> > > [ 673.365663] [<c03fd50e>] io_schedule_timeout+0x5e/0xb0
> > > [ 673.365667] [<c0170d11>] congestion_wait+0x71/0x90
> > > [ 673.365671] [<c016b92e>] wb_kupdate+0x9e/0xf0
> > > [ 673.365675] [<c016beb2>] pdflush+0x102/0x1d0
> > > [ 673.365679] [<c013fa82>] kthread+0x42/0x70
> > > [ 673.365683] [<c01050df>] kernel_thread_helper+0x7/0x18
> > >
> >
> > That looks more like the inode writeback patches from Wu than the per
> > bdi dirty stuff. The latter typically hangs in balance_dirty_pages().
> >
> >
> >
>
> Yes, you are right,
>
> both revisions 1f7d6668c29b1dfa307a44844f9bb38356fc989b and 3e26c149c358529b1605f8959341d34bc4b880a3 work fine
> But I didn't pay attention that those are before f4a1c2bce002f683801bcdbbc9fd89804614fb6b.
> So, back to the drawing board.... :-)
>
> Will test revision 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, just after writeback patches.

Thank you. I'll see if I can reproduce it locally...

Fengguang

2007-10-22 10:58:59

by Maxim Levitsky

Subject: Re: 100% iowait on one of cpus in current -git

On Monday 22 October 2007 12:55:25 Fengguang Wu wrote:
> On Mon, Oct 22, 2007 at 12:40:24PM +0200, Maxim Levitsky wrote:
> > On Monday 22 October 2007 12:22:10 Peter Zijlstra wrote:
> > > > [ 673.365631] pdflush D c21bdecc 0 221 2
> > > > [ 673.365635] c21bdee0 00000046 00000002 c21bdecc c21bdec4 00000000 c21b3000 00000002
> > > > [ 673.365643] c0134892 c21b3164 c1e00200 00000001 c7109280 c21bdec0 c03ff849 c21bdef0
> > > > [ 673.365650] 00052974 00000000 000000ff 00000000 00000000 00000000 c21bdef0 000529dc
> > > > [ 673.365657] Call Trace:
> > > > [ 673.365659] [<c03fd728>] schedule_timeout+0x48/0xc0
> > > > [ 673.365663] [<c03fd50e>] io_schedule_timeout+0x5e/0xb0
> > > > [ 673.365667] [<c0170d11>] congestion_wait+0x71/0x90
> > > > [ 673.365671] [<c016b92e>] wb_kupdate+0x9e/0xf0
> > > > [ 673.365675] [<c016beb2>] pdflush+0x102/0x1d0
> > > > [ 673.365679] [<c013fa82>] kthread+0x42/0x70
> > > > [ 673.365683] [<c01050df>] kernel_thread_helper+0x7/0x18
> > > >
> > >
> > > That looks more like the inode writeback patches from Wu than the per
> > > bdi dirty stuff. The latter typically hangs in balance_dirty_pages().
> > >
> > >
> > >
> >
> > Yes, you are right,
> >
> > both revisions 1f7d6668c29b1dfa307a44844f9bb38356fc989b and 3e26c149c358529b1605f8959341d34bc4b880a3 work fine
> > But I didn't pay attention that those are before f4a1c2bce002f683801bcdbbc9fd89804614fb6b.
> > So, back to the drawing board.... :-)
> >
> > Will test revision 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, just after writeback patches.
>
> Thank you. I'll see if I can reproduce it locally...
>
> Fengguang
>
>

Bingo,

Revision 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b shows this bug.

I will now bisect to find the exact patch that caused this bug.
Thanks,
Maxim Levitsky

2007-10-22 11:21:55

by Wu Fengguang

Subject: Re: 100% iowait on one of cpus in current -git

On Mon, Oct 22, 2007 at 12:58:11PM +0200, Maxim Levitsky wrote:
> On Monday 22 October 2007 12:55:25 Fengguang Wu wrote:
> > On Mon, Oct 22, 2007 at 12:40:24PM +0200, Maxim Levitsky wrote:
> > > On Monday 22 October 2007 12:22:10 Peter Zijlstra wrote:
> > > > > [ 673.365631] pdflush D c21bdecc 0 221 2
> > > > > [ 673.365635] c21bdee0 00000046 00000002 c21bdecc c21bdec4 00000000 c21b3000 00000002
> > > > > [ 673.365643] c0134892 c21b3164 c1e00200 00000001 c7109280 c21bdec0 c03ff849 c21bdef0
> > > > > [ 673.365650] 00052974 00000000 000000ff 00000000 00000000 00000000 c21bdef0 000529dc
> > > > > [ 673.365657] Call Trace:
> > > > > [ 673.365659] [<c03fd728>] schedule_timeout+0x48/0xc0
> > > > > [ 673.365663] [<c03fd50e>] io_schedule_timeout+0x5e/0xb0
> > > > > [ 673.365667] [<c0170d11>] congestion_wait+0x71/0x90
> > > > > [ 673.365671] [<c016b92e>] wb_kupdate+0x9e/0xf0
> > > > > [ 673.365675] [<c016beb2>] pdflush+0x102/0x1d0
> > > > > [ 673.365679] [<c013fa82>] kthread+0x42/0x70
> > > > > [ 673.365683] [<c01050df>] kernel_thread_helper+0x7/0x18
> > > > >
> > > >
> > > > That looks more like the inode writeback patches from Wu than the per
> > > > bdi dirty stuff. The latter typically hangs in balance_dirty_pages().
> > > >
> > > >
> > > >
> > >
> > > Yes, you are right,
> > >
> > > both revisions 1f7d6668c29b1dfa307a44844f9bb38356fc989b and 3e26c149c358529b1605f8959341d34bc4b880a3 work fine
> > > But I didn't pay attention that those are before f4a1c2bce002f683801bcdbbc9fd89804614fb6b.
> > > So, back to the drawing board.... :-)
> > >
> > > Will test revision 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, just after writeback patches.
> >
> > Thank you. I'll see if I can reproduce it locally...
> >
> > Fengguang
> >
> >
>
> Bingo,
>
> Revision 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b shows this bug.
>
> I will now bisect to find the exact patch that caused this bug.

This one is most relevant:

1f7decf6d9f06dac008b8d66935c0c3b18e564f9
writeback: introduce writeback_control.more_io to indicate more io

Still compiling the kernel...

Thank you,
Fengguang

2007-10-22 12:22:18

by Maxim Levitsky

Subject: Re: 100% iowait on one of cpus in current -git

On Monday 22 October 2007 13:19:08 Fengguang Wu wrote:
> On Mon, Oct 22, 2007 at 12:58:11PM +0200, Maxim Levitsky wrote:
> > On Monday 22 October 2007 12:55:25 Fengguang Wu wrote:
> > > On Mon, Oct 22, 2007 at 12:40:24PM +0200, Maxim Levitsky wrote:
> > > > On Monday 22 October 2007 12:22:10 Peter Zijlstra wrote:
> > > > > > [ 673.365631] pdflush D c21bdecc 0 221 2
> > > > > > [ 673.365635] c21bdee0 00000046 00000002 c21bdecc c21bdec4 00000000 c21b3000 00000002
> > > > > > [ 673.365643] c0134892 c21b3164 c1e00200 00000001 c7109280 c21bdec0 c03ff849 c21bdef0
> > > > > > [ 673.365650] 00052974 00000000 000000ff 00000000 00000000 00000000 c21bdef0 000529dc
> > > > > > [ 673.365657] Call Trace:
> > > > > > [ 673.365659] [<c03fd728>] schedule_timeout+0x48/0xc0
> > > > > > [ 673.365663] [<c03fd50e>] io_schedule_timeout+0x5e/0xb0
> > > > > > [ 673.365667] [<c0170d11>] congestion_wait+0x71/0x90
> > > > > > [ 673.365671] [<c016b92e>] wb_kupdate+0x9e/0xf0
> > > > > > [ 673.365675] [<c016beb2>] pdflush+0x102/0x1d0
> > > > > > [ 673.365679] [<c013fa82>] kthread+0x42/0x70
> > > > > > [ 673.365683] [<c01050df>] kernel_thread_helper+0x7/0x18
> > > > > >
> > > > >
> > > > > That looks more like the inode writeback patches from Wu than the per
> > > > > bdi dirty stuff. The latter typically hangs in balance_dirty_pages().
> > > > >
> > > > >
> > > > >
> > > >
> > > > Yes, you are right,
> > > >
> > > > both revisions 1f7d6668c29b1dfa307a44844f9bb38356fc989b and 3e26c149c358529b1605f8959341d34bc4b880a3 work fine
> > > > But I didn't pay attention that those are before f4a1c2bce002f683801bcdbbc9fd89804614fb6b.
> > > > So, back to the drawing board.... :-)
> > > >
> > > > Will test revision 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, just after writeback patches.
> > >
> > > Thank you. I'll see if I can reproduce it locally...
> > >
> > > Fengguang
> > >
> > >
> >
> > Bingo,
> >
> > Revision 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b shows this bug.
> >
> > I will now bisect to find the exact patch that caused this bug.
>
> This one is most relevant:
>
> 1f7decf6d9f06dac008b8d66935c0c3b18e564f9
> writeback: introduce writeback_control.more_io to indicate more io
Exactly.
>
> Still compiling the kernel...
>
> Thank you,
> Fengguang
>
>
Hi,


I bisected this bug to exactly this commit:

2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b
writeback: introduce writeback_control.more_io to indicate more io

Reverting it and compiling the latest git shows no more signs of that bug.
Thanks,
Best regards,
Maxim Levitsky

2007-10-22 12:37:28

by Wu Fengguang

Subject: Re: 100% iowait on one of cpus in current -git

On Mon, Oct 22, 2007 at 02:21:21PM +0200, Maxim Levitsky wrote:
> I bisected this bug to exactly this commit:
>
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b
> writeback: introduce writeback_control.more_io to indicate more io
>
> Reverting it and compiling the latest git shows no more signs of that bug.

Thank you very much.

I guess your system has some difficulty in writing back some inodes.
(i.e. a bug disclosed by this patch, the 100% iowait only makes it
more obvious)
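The retry cycle Wu and Peter are describing can be modelled in a few lines. This is a toy model, not kernel code: the names mirror fs/fs-writeback.c (s_io, s_more_io, requeue_io) but the logic is reduced to the bare requeue-and-retry loop, and the stuck inode number is invented:

```python
from collections import deque

def kupdate_pass(s_io, s_more_io, can_write):
    """One simplified wb_kupdate-style sweep: try to write back every
    inode on s_io; anything that cannot make progress is requeued to
    s_more_io, and more_io is reported back to the caller."""
    more_io = False
    while s_io:
        inode = s_io.popleft()
        if not can_write(inode):
            s_more_io.append(inode)   # requeue_io()
            more_io = True
    return more_io

# One inode the filesystem keeps redirtying (as reiserfs apparently did):
# every sweep reports more_io, so the caller loops through
# congestion_wait() instead of going back to sleep -- showing up as iowait.
s_more_io = deque([6])                # hypothetical stuck inode number
sweeps = 0
while sweeps < 1000:                  # the real loop has no such cap
    s_io, s_more_io = s_more_io, deque()
    if not kupdate_pass(s_io, s_more_io, can_write=lambda i: False):
        break
    sweeps += 1                       # one congestion_wait() per sweep
print(sweeps)  # → 1000
```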

I cannot reproduce it with your .config, so would you recompile and
run the kernel with the above commit _and_ the below debugging patch?

Thank you,
Fengguang
---

fs/fs-writeback.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

--- linux-2.6.23-git17.orig/fs/fs-writeback.c
+++ linux-2.6.23-git17/fs/fs-writeback.c
@@ -164,12 +164,25 @@ static void redirty_tail(struct inode *i
list_move(&inode->i_list, &sb->s_dirty);
}

+#define requeue_io(inode) \
+ do { \
+ __requeue_io(inode, __LINE__); \
+ } while (0)
+
/*
* requeue inode for re-scanning after sb->s_io list is exhausted.
*/
-static void requeue_io(struct inode *inode)
+static void __requeue_io(struct inode *inode, int line)
{
list_move(&inode->i_list, &inode->i_sb->s_more_io);
+
+ printk(KERN_DEBUG "redirtied inode %lu size %llu at %02x:%02x(%s), line %d.\n",
+ inode->i_ino,
+ i_size_read(inode),
+ MAJOR(inode->i_sb->s_dev),
+ MINOR(inode->i_sb->s_dev),
+ inode->i_sb->s_id,
+ line);
}

static void inode_sync_complete(struct inode *inode)
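The `requeue_io` wrapper macro in the patch above exists only to smuggle `__LINE__` into the printk, so each message records which call site requeued the inode. The same call-site-logging trick can be sketched in Python via the caller's stack frame:

```python
import inspect

def requeue_io(inode):
    """Log the call site along with the event -- the same idea as the
    patch's requeue_io() macro passing __LINE__ into __requeue_io()."""
    caller = inspect.currentframe().f_back
    return "redirtied inode %d, line %d" % (inode, caller.f_lineno)

print(requeue_io(2582))  # e.g. "redirtied inode 2582, line 10"
```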

2007-10-22 13:06:44

by Maxim Levitsky

Subject: Re: 100% iowait on one of cpus in current -git

On Monday 22 October 2007 14:37:07 Fengguang Wu wrote:
> On Mon, Oct 22, 2007 at 02:21:21PM +0200, Maxim Levitsky wrote:
> > I bisected this bug to exactly this commit:
> >
> > 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b
> > writeback: introduce writeback_control.more_io to indicate more io
> >
> > Reverting it and compiling the latest git shows no more signs of that bug.
>
> Thank you very much.
>
> I guess your system has some difficulty in writing back some inodes.
> (i.e. a bug disclosed by this patch, the 100% iowait only makes it
> more obvious)
>
> I cannot reproduce it with your .config, so would you recompile and
> run the kernel with the above commit _and_ the below debugging patch?
>
> Thank you,
> Fengguang
> ---
>
> fs/fs-writeback.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> --- linux-2.6.23-git17.orig/fs/fs-writeback.c
> +++ linux-2.6.23-git17/fs/fs-writeback.c
> @@ -164,12 +164,25 @@ static void redirty_tail(struct inode *i
> list_move(&inode->i_list, &sb->s_dirty);
> }
>
> +#define requeue_io(inode) \
> + do { \
> + __requeue_io(inode, __LINE__); \
> + } while (0)
> +
> /*
> * requeue inode for re-scanning after sb->s_io list is exhausted.
> */
> -static void requeue_io(struct inode *inode)
> +static void __requeue_io(struct inode *inode, int line)
> {
> list_move(&inode->i_list, &inode->i_sb->s_more_io);
> +
> + printk(KERN_DEBUG "redirtied inode %lu size %llu at %02x:%02x(%s), line %d.\n",
> + inode->i_ino,
> + i_size_read(inode),
> + MAJOR(inode->i_sb->s_dev),
> + MINOR(inode->i_sb->s_dev),
> + inode->i_sb->s_id,
> + line);
> }
>
> static void inode_sync_complete(struct inode *inode)
>
>

Hi,
Thank you very much too, for helping me.


Applied.
Had to kill klogd, since the kernel generates tons of "redirtied inode" messages.
The kern.log is 863 KB, so I don't think it would be polite to attach it.
I don't know whether it is OK to put it on pastebin either.

Anyway, it shows lots of "redirtied inode ..." messages,
and while most of them are "at 08:02(sda2)", my reiserfs root partition, some are

"Oct 22 14:50:27 MAIN kernel: [ 73.643794] redirtied inode 2582 size 0 at 00:0f(tmpfs), line 300."

"line 300" is always shown.

(I have /var/run, /var/lock and /dev mounted as tmpfs, the default Kubuntu setup.)


Best regards,
Maxim Levitsky

2007-10-22 13:10:57

by Wu Fengguang

Subject: Re: 100% iowait on one of cpus in current -git

On Mon, Oct 22, 2007 at 03:05:35PM +0200, Maxim Levitsky wrote:
> Hi,
> Thank you very much too, for helping me.

You are welcome :-)

> Applied.
> Had to kill klogd, since the kernel generates tons of "redirtied inode" messages.
> The kern.log is 863 KB, so I don't think it would be polite to attach it.
> I don't know whether it is OK to put it on pastebin either.
>
> Anyway, it shows lots of "redirtied inode ..." messages,
> and while most of them are "at 08:02(sda2)", my reiserfs root partition, some are
>
> "Oct 22 14:50:27 MAIN kernel: [ 73.643794] redirtied inode 2582 size 0 at 00:0f(tmpfs), line 300."
>
> "line 300" is always shown.
>
> (I have /var/run, /var/lock and /dev mounted as tmpfs, the default Kubuntu setup.)

Thank you for testing it out.

Hmm, maybe it's a reiserfs-related issue. Do you have the full log file?

Thank you,
Fengguang

2007-10-22 13:29:16

by Maxim Levitsky

Subject: Re: 100% iowait on one of cpus in current -git

On Monday 22 October 2007 15:10:45 Fengguang Wu wrote:
> On Mon, Oct 22, 2007 at 03:05:35PM +0200, Maxim Levitsky wrote:
> > Hi,
> > Thank you very much too, for helping me.
>
> You are welcome :-)
>
> > Applied.
> > Had to kill klogd, since the kernel generates tons of "redirtied inode" messages.
> > The kern.log is 863 KB, so I don't think it would be polite to attach it.
> > I don't know whether it is OK to put it on pastebin either.
> >
> > Anyway, it shows lots of "redirtied inode ..." messages,
> > and while most of them are "at 08:02(sda2)", my reiserfs root partition, some are
> >
> > "Oct 22 14:50:27 MAIN kernel: [ 73.643794] redirtied inode 2582 size 0 at 00:0f(tmpfs), line 300."
> >
> > "line 300" is always shown.
> >
> > (I have /var/run, /var/lock and /dev mounted as tmpfs, the default Kubuntu setup.)
>
> Thank you for testing it out.
>
> Hmm, maybe it's a reiserfs-related issue. Do you have the full log file?
No, I don't think so; like I said, it sometimes shows the same message on tmpfs.
>
> Thank you,
> Fengguang
>
>
Best Regards,
Maxim Levitsky

2007-10-22 13:42:37

by Wu Fengguang

Subject: Re: 100% iowait on one of cpus in current -git

On Mon, Oct 22, 2007 at 09:10:45PM +0800, Fengguang Wu wrote:
> Hmm, maybe it's a reiserfs-related issue. Do you have the full log file?

Bingo! It can be reproduced in -mm on reiserfs:

# mkfs.reiserfs /dev/sdb1
# mount /dev/sdb1 /test
# cp bin /test
<wait for a while>
# dmesg
[...]
[ 418.346113] requeue_io 308: inode 6 size 302 at 08:11(sdb1)
[ 418.346119] requeue_io 308: inode 7 size 196 at 08:11(sdb1)
[ 418.346125] requeue_io 308: inode 8 size 85 at 08:11(sdb1)
[ 418.346131] requeue_io 308: inode 9 size 180 at 08:11(sdb1)
[ 418.346136] requeue_io 308: inode 10 size 1488 at 08:11(sdb1)
[ 418.346142] requeue_io 308: inode 12 size 1358 at 08:11(sdb1)
[ 418.346148] requeue_io 308: inode 13 size 482 at 08:11(sdb1)
[ 418.346153] requeue_io 308: inode 14 size 171 at 08:11(sdb1)
[ 418.346159] requeue_io 308: inode 15 size 93 at 08:11(sdb1)
[ 418.346164] requeue_io 308: inode 16 size 81 at 08:11(sdb1)
[ 418.346170] requeue_io 308: inode 17 size 212 at 08:11(sdb1)
[ 418.346176] requeue_io 308: inode 18 size 431 at 08:11(sdb1)
[ 418.346181] requeue_io 308: inode 19 size 231 at 08:11(sdb1)
[ 418.346187] requeue_io 308: inode 20 size 1756 at 08:11(sdb1)
[ 418.346193] requeue_io 308: inode 21 size 1229 at 08:11(sdb1)
[ 418.346198] requeue_io 308: inode 22 size 157 at 08:11(sdb1)
[ 418.346204] requeue_io 308: inode 23 size 3430 at 08:11(sdb1)
[ 418.346210] requeue_io 308: inode 24 size 200 at 08:11(sdb1)
[ 418.346215] requeue_io 308: inode 25 size 202 at 08:11(sdb1)
[ 418.346221] requeue_io 308: inode 26 size 386 at 08:11(sdb1)
[ 418.346226] requeue_io 308: inode 27 size 264 at 08:11(sdb1)
[ 418.346232] requeue_io 308: inode 28 size 268 at 08:11(sdb1)
[ 418.346238] requeue_io 308: inode 29 size 1228 at 08:11(sdb1)
[ 418.346243] requeue_io 308: inode 30 size 404 at 08:11(sdb1)
[ 418.346249] requeue_io 308: inode 31 size 2452 at 08:11(sdb1)
[ 418.346255] requeue_io 308: inode 32 size 1236 at 08:11(sdb1)
[ 418.346260] requeue_io 308: inode 33 size 655 at 08:11(sdb1)
[ 418.346266] requeue_io 308: inode 35 size 330 at 08:11(sdb1)
[ 418.346272] requeue_io 308: inode 36 size 248 at 08:11(sdb1)
[ 418.346277] requeue_io 308: inode 37 size 683 at 08:11(sdb1)
[ 418.346283] requeue_io 308: inode 38 size 1451 at 08:11(sdb1)
[ 418.346288] requeue_io 308: inode 39 size 894 at 08:11(sdb1)
[ 418.346294] requeue_io 308: inode 40 size 879 at 08:11(sdb1)
[ 418.346300] requeue_io 308: inode 42 size 797 at 08:11(sdb1)
[ 418.346305] requeue_io 308: inode 43 size 1314 at 08:11(sdb1)
[ 418.346311] requeue_io 308: inode 44 size 1463 at 08:11(sdb1)
[ 418.346317] requeue_io 308: inode 45 size 3032 at 08:11(sdb1)
[ 418.346322] requeue_io 308: inode 46 size 325 at 08:11(sdb1)
[ 418.346328] requeue_io 308: inode 47 size 583 at 08:11(sdb1)
[ 418.346334] requeue_io 308: inode 48 size 1660 at 08:11(sdb1)
[ 418.346339] requeue_io 308: inode 49 size 3159 at 08:11(sdb1)
[ 418.346345] requeue_io 308: inode 50 size 510 at 08:11(sdb1)
[ 418.346350] requeue_io 308: inode 51 size 100 at 08:11(sdb1)
[ 418.346356] requeue_io 308: inode 52 size 143 at 08:11(sdb1)
[ 418.346370] requeue_io 308: inode 53 size 954 at 08:11(sdb1)
[ 418.346373] requeue_io 308: inode 54 size 322 at 08:11(sdb1)
[ 418.346376] requeue_io 308: inode 55 size 970 at 08:11(sdb1)
[ 418.346379] requeue_io 308: inode 57 size 483 at 08:11(sdb1)
[ 418.346382] requeue_io 308: inode 58 size 1125 at 08:11(sdb1)
[ 418.346385] requeue_io 308: inode 59 size 2196 at 08:11(sdb1)
[ 418.346388] requeue_io 308: inode 60 size 104 at 08:11(sdb1)
[ 418.346391] requeue_io 308: inode 61 size 488 at 08:11(sdb1)
[ 418.346394] requeue_io 308: inode 62 size 116 at 08:11(sdb1)
[ 418.346397] requeue_io 308: inode 63 size 907 at 08:11(sdb1)
[ 418.346400] requeue_io 308: inode 64 size 1076 at 08:11(sdb1)
[ 418.346403] requeue_io 308: inode 65 size 460 at 08:11(sdb1)
[ 418.346406] requeue_io 308: inode 66 size 1092 at 08:11(sdb1)
[ 418.346409] requeue_io 308: inode 67 size 424 at 08:11(sdb1)
[ 418.346412] requeue_io 308: inode 68 size 696 at 08:11(sdb1)
[ 418.346415] requeue_io 308: inode 70 size 137 at 08:11(sdb1)
[ 418.346418] requeue_io 308: inode 71 size 201 at 08:11(sdb1)
[ 418.346421] requeue_io 308: inode 72 size 150 at 08:11(sdb1)
[ 418.346424] requeue_io 308: inode 73 size 188 at 08:11(sdb1)
[ 418.346427] requeue_io 308: inode 75 size 1208 at 08:11(sdb1)
[ 418.346431] requeue_io 308: inode 76 size 493 at 08:11(sdb1)
[ 418.346434] requeue_io 308: inode 77 size 484 at 08:11(sdb1)
[ 418.346437] requeue_io 308: inode 78 size 356 at 08:11(sdb1)
[ 418.346440] requeue_io 308: inode 79 size 895 at 08:11(sdb1)
[ 418.346443] requeue_io 308: inode 80 size 847 at 08:11(sdb1)
[ 418.346446] requeue_io 308: inode 81 size 3281 at 08:11(sdb1)
[ 418.346449] requeue_io 308: inode 82 size 3329 at 08:11(sdb1)
[ 418.346452] requeue_io 308: inode 83 size 115 at 08:11(sdb1)
[ 418.346455] requeue_io 308: inode 84 size 644 at 08:11(sdb1)
[ 418.346458] requeue_io 308: inode 85 size 125 at 08:11(sdb1)
[ 418.346461] requeue_io 308: inode 86 size 199 at 08:11(sdb1)
[ 418.346464] requeue_io 308: inode 87 size 204 at 08:11(sdb1)
[ 418.346467] requeue_io 308: inode 88 size 72 at 08:11(sdb1)
[ 418.346476] mm/page-writeback.c 658 wb_kupdate: pdflush(209) 17174 global 2012 0 0 wc _M tw 1024 sk 0
[ 418.366318] requeue_io 308: inode 6 size 302 at 08:11(sdb1)
[ 418.366325] requeue_io 308: inode 7 size 196 at 08:11(sdb1)
[ 418.366330] requeue_io 308: inode 8 size 85 at 08:11(sdb1)
[ 418.366334] requeue_io 308: inode 9 size 180 at 08:11(sdb1)
[ 418.366338] requeue_io 308: inode 10 size 1488 at 08:11(sdb1)
[ 418.366342] requeue_io 308: inode 12 size 1358 at 08:11(sdb1)
[ 418.366346] requeue_io 308: inode 13 size 482 at 08:11(sdb1)
[ 418.366350] requeue_io 308: inode 14 size 171 at 08:11(sdb1)
[ 418.366354] requeue_io 308: inode 15 size 93 at 08:11(sdb1)
[ 418.366358] requeue_io 308: inode 16 size 81 at 08:11(sdb1)
[ 418.366361] requeue_io 308: inode 17 size 212 at 08:11(sdb1)
[ 418.366365] requeue_io 308: inode 18 size 431 at 08:11(sdb1)
[ 418.366369] requeue_io 308: inode 19 size 231 at 08:11(sdb1)
[ 418.366373] requeue_io 308: inode 20 size 1756 at 08:11(sdb1)
[ 418.366378] requeue_io 308: inode 21 size 1229 at 08:11(sdb1)
[ 418.366382] requeue_io 308: inode 22 size 157 at 08:11(sdb1)
[ 418.366386] requeue_io 308: inode 23 size 3430 at 08:11(sdb1)
[ 418.366390] requeue_io 308: inode 24 size 200 at 08:11(sdb1)
[ 418.366394] requeue_io 308: inode 25 size 202 at 08:11(sdb1)
[ 418.366398] requeue_io 308: inode 26 size 386 at 08:11(sdb1)
[ 418.366402] requeue_io 308: inode 27 size 264 at 08:11(sdb1)
[ 418.366407] requeue_io 308: inode 28 size 268 at 08:11(sdb1)
[ 418.366411] requeue_io 308: inode 29 size 1228 at 08:11(sdb1)
[ 418.366415] requeue_io 308: inode 30 size 404 at 08:11(sdb1)
[ 418.366419] requeue_io 308: inode 31 size 2452 at 08:11(sdb1)
[ 418.366423] requeue_io 308: inode 32 size 1236 at 08:11(sdb1)
[ 418.366427] requeue_io 308: inode 33 size 655 at 08:11(sdb1)
[ 418.366431] requeue_io 308: inode 35 size 330 at 08:11(sdb1)
[ 418.366435] requeue_io 308: inode 36 size 248 at 08:11(sdb1)
[ 418.366439] requeue_io 308: inode 37 size 683 at 08:11(sdb1)
[ 418.366443] requeue_io 308: inode 38 size 1451 at 08:11(sdb1)
[ 418.366446] requeue_io 308: inode 39 size 894 at 08:11(sdb1)
[ 418.366450] requeue_io 308: inode 40 size 879 at 08:11(sdb1)
[ 418.366453] requeue_io 308: inode 42 size 797 at 08:11(sdb1)
[ 418.366457] requeue_io 308: inode 43 size 1314 at 08:11(sdb1)
[ 418.366460] requeue_io 308: inode 44 size 1463 at 08:11(sdb1)
[ 418.366464] requeue_io 308: inode 45 size 3032 at 08:11(sdb1)
[ 418.366468] requeue_io 308: inode 46 size 325 at 08:11(sdb1)
[ 418.366471] requeue_io 308: inode 47 size 583 at 08:11(sdb1)
[ 418.366475] requeue_io 308: inode 48 size 1660 at 08:11(sdb1)
[ 418.366478] requeue_io 308: inode 49 size 3159 at 08:11(sdb1)
[ 418.366482] requeue_io 308: inode 50 size 510 at 08:11(sdb1)
[ 418.366485] requeue_io 308: inode 51 size 100 at 08:11(sdb1)
[ 418.366489] requeue_io 308: inode 52 size 143 at 08:11(sdb1)
[ 418.366492] requeue_io 308: inode 53 size 954 at 08:11(sdb1)
[ 418.366496] requeue_io 308: inode 54 size 322 at 08:11(sdb1)
[ 418.366500] requeue_io 308: inode 55 size 970 at 08:11(sdb1)
[ 418.366503] requeue_io 308: inode 57 size 483 at 08:11(sdb1)
[ 418.366507] requeue_io 308: inode 58 size 1125 at 08:11(sdb1)
[ 418.366511] requeue_io 308: inode 59 size 2196 at 08:11(sdb1)
[ 418.366514] requeue_io 308: inode 60 size 104 at 08:11(sdb1)
[ 418.366518] requeue_io 308: inode 61 size 488 at 08:11(sdb1)
[ 418.366522] requeue_io 308: inode 62 size 116 at 08:11(sdb1)
[ 418.366525] requeue_io 308: inode 63 size 907 at 08:11(sdb1)
[ 418.366529] requeue_io 308: inode 64 size 1076 at 08:11(sdb1)
[ 418.366532] requeue_io 308: inode 65 size 460 at 08:11(sdb1)
[ 418.366536] requeue_io 308: inode 66 size 1092 at 08:11(sdb1)
[ 418.366539] requeue_io 308: inode 67 size 424 at 08:11(sdb1)
[ 418.366543] requeue_io 308: inode 68 size 696 at 08:11(sdb1)
[ 418.366546] requeue_io 308: inode 70 size 137 at 08:11(sdb1)
[ 418.366550] requeue_io 308: inode 71 size 201 at 08:11(sdb1)
[ 418.366553] requeue_io 308: inode 72 size 150 at 08:11(sdb1)
[ 418.366557] requeue_io 308: inode 73 size 188 at 08:11(sdb1)
[ 418.366561] requeue_io 308: inode 75 size 1208 at 08:11(sdb1)
[ 418.366564] requeue_io 308: inode 76 size 493 at 08:11(sdb1)
[ 418.366567] requeue_io 308: inode 77 size 484 at 08:11(sdb1)
[ 418.366571] requeue_io 308: inode 78 size 356 at 08:11(sdb1)
[ 418.366575] requeue_io 308: inode 79 size 895 at 08:11(sdb1)
[ 418.366578] requeue_io 308: inode 80 size 847 at 08:11(sdb1)
[ 418.366582] requeue_io 308: inode 81 size 3281 at 08:11(sdb1)
[ 418.366586] requeue_io 308: inode 82 size 3329 at 08:11(sdb1)
[ 418.366590] requeue_io 308: inode 83 size 115 at 08:11(sdb1)
[ 418.366593] requeue_io 308: inode 84 size 644 at 08:11(sdb1)
[ 418.366597] requeue_io 308: inode 85 size 125 at 08:11(sdb1)
[ 418.366600] requeue_io 308: inode 86 size 199 at 08:11(sdb1)
[ 418.366604] requeue_io 308: inode 87 size 204 at 08:11(sdb1)
[ 418.366607] requeue_io 308: inode 88 size 72 at 08:11(sdb1)
[ 418.366622] mm/page-writeback.c 658 wb_kupdate: pdflush(209) 17174 global 2012 0 0 wc _M tw 1024 sk 0

2007-10-23 07:55:27

by Wu Fengguang

[permalink] [raw]
Subject: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

This is not a new problem in 2.6.23-git17.
2.6.22/2.6.23 is buggy in the same way.

Reiserfs could leave newly created sub-page-size files in dirty state
for ever. They cannot be synced to disk by pdflush routines or
explicit `sync' commands. Only `umount' can do the trick.

The direct cause is: the dirty page's PG_dirty is wrongly _cleared_.
Call trace:
[<ffffffff8027e920>] cancel_dirty_page+0xd0/0xf0
[<ffffffff8816d470>] :reiserfs:reiserfs_cut_from_item+0x660/0x710
[<ffffffff8816d791>] :reiserfs:reiserfs_do_truncate+0x271/0x530
[<ffffffff8815872d>] :reiserfs:reiserfs_truncate_file+0xfd/0x3b0
[<ffffffff8815d3d0>] :reiserfs:reiserfs_file_release+0x1e0/0x340
[<ffffffff802a187c>] __fput+0xcc/0x1b0
[<ffffffff802a1ba6>] fput+0x16/0x20
[<ffffffff8029e676>] filp_close+0x56/0x90
[<ffffffff8029fe0d>] sys_close+0xad/0x110
[<ffffffff8020c41e>] system_call+0x7e/0x83

Fix the bug by removing the cancel_dirty_page() call. Tests show that
it causes no bad behaviors on various write sizes.


=== for the patient ===
Here are more detailed demonstrations of the problem.

1) the page has both PG_dirty(D)/PAGECACHE_TAG_DIRTY(d) after being written to;
and then only PAGECACHE_TAG_DIRTY(d) remains after the file is closed.

------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# cat > /test/tiny
[T1] hi
[T2] root /home/wfg#

------------------------------ screen 1 ------------------------------
[T1] root /home/wfg# echo /test/tiny > /proc/filecache
[T1] root /home/wfg# cat /proc/filecache
# file /test/tiny
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
# idx len state refcnt
0 1 ___UD__Bd_ 2
[T2] root /home/wfg# cat /proc/filecache
# file /test/tiny
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
# idx len state refcnt
0 1 ___U___Bd_ 2

2) note the non-zero 'cancelled_write_bytes' after /tmp/hi is copied.

------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# echo hi > /tmp/hi
[T1] root /home/wfg# cp /tmp/hi /dev/stdin /test
[T2] hi
[T3] root /home/wfg#

------------------------------ screen 1 ------------------------------
[T1] root /proc/4397# cd /proc/`pidof cp`
[T1] root /proc/4713# cat io
rchar: 8396
wchar: 3
syscr: 20
syscw: 1
read_bytes: 0
write_bytes: 20480
cancelled_write_bytes: 4096
[T2] root /proc/4713# cat io
rchar: 8399
wchar: 6
syscr: 21
syscw: 2
read_bytes: 0
write_bytes: 24576
cancelled_write_bytes: 4096

//Question: the 'write_bytes' is a bit more than expected ;-)

Cc: Maxim Levitsky <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
---
fs/reiserfs/stree.c | 3 ---
1 file changed, 3 deletions(-)

--- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
+++ linux-2.6.24-git17/fs/reiserfs/stree.c
@@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
}
bh = next;
} while (bh != head);
- if (PAGE_SIZE == bh->b_size) {
- cancel_dirty_page(page, PAGE_CACHE_SIZE);
- }
}
}
}

2007-10-23 10:07:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

[ adding reiserfs devs to the CC ]

On Tue, 2007-10-23 at 15:55 +0800, Fengguang Wu wrote:
> This is not a new problem in 2.6.23-git17.
> 2.6.22/2.6.23 is buggy in the same way.
>
> Reiserfs could leave newly created sub-page-size files in dirty state
> for ever. They cannot be synced to disk by pdflush routines or
> explicit `sync' commands. Only `umount' can do the trick.
>
> The direct cause is: the dirty page's PG_dirty is wrongly _cleared_.
> Call trace:
> [<ffffffff8027e920>] cancel_dirty_page+0xd0/0xf0
> [<ffffffff8816d470>] :reiserfs:reiserfs_cut_from_item+0x660/0x710
> [<ffffffff8816d791>] :reiserfs:reiserfs_do_truncate+0x271/0x530
> [<ffffffff8815872d>] :reiserfs:reiserfs_truncate_file+0xfd/0x3b0
> [<ffffffff8815d3d0>] :reiserfs:reiserfs_file_release+0x1e0/0x340
> [<ffffffff802a187c>] __fput+0xcc/0x1b0
> [<ffffffff802a1ba6>] fput+0x16/0x20
> [<ffffffff8029e676>] filp_close+0x56/0x90
> [<ffffffff8029fe0d>] sys_close+0xad/0x110
> [<ffffffff8020c41e>] system_call+0x7e/0x83
>
> Fix the bug by removing the cancel_dirty_page() call. Tests show that
> it causes no bad behaviors on various write sizes.
>
>
> === for the patient ===
> Here are more detailed demonstrations of the problem.
>
> 1) the page has both PG_dirty(D)/PAGECACHE_TAG_DIRTY(d) after being written to;
> and then only PAGECACHE_TAG_DIRTY(d) remains after the file is closed.
>
> ------------------------------ screen 0 ------------------------------
> [T0] root /home/wfg# cat > /test/tiny
> [T1] hi
> [T2] root /home/wfg#
>
> ------------------------------ screen 1 ------------------------------
> [T1] root /home/wfg# echo /test/tiny > /proc/filecache
> [T1] root /home/wfg# cat /proc/filecache
> # file /test/tiny
> # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
> # idx len state refcnt
> 0 1 ___UD__Bd_ 2
> [T2] root /home/wfg# cat /proc/filecache
> # file /test/tiny
> # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
> # idx len state refcnt
> 0 1 ___U___Bd_ 2
>
> 2) note the non-zero 'cancelled_write_bytes' after /tmp/hi is copied.
>
> ------------------------------ screen 0 ------------------------------
> [T0] root /home/wfg# echo hi > /tmp/hi
> [T1] root /home/wfg# cp /tmp/hi /dev/stdin /test
> [T2] hi
> [T3] root /home/wfg#
>
> ------------------------------ screen 1 ------------------------------
> [T1] root /proc/4397# cd /proc/`pidof cp`
> [T1] root /proc/4713# cat io
> rchar: 8396
> wchar: 3
> syscr: 20
> syscw: 1
> read_bytes: 0
> write_bytes: 20480
> cancelled_write_bytes: 4096
> [T2] root /proc/4713# cat io
> rchar: 8399
> wchar: 6
> syscr: 21
> syscw: 2
> read_bytes: 0
> write_bytes: 24576
> cancelled_write_bytes: 4096
>
> //Question: the 'write_bytes' is a bit more than expected ;-)
>
> Cc: Maxim Levitsky <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Fengguang Wu <[email protected]>
> ---
> fs/reiserfs/stree.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
> +++ linux-2.6.24-git17/fs/reiserfs/stree.c
> @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
> }
> bh = next;
> } while (bh != head);
> - if (PAGE_SIZE == bh->b_size) {
> - cancel_dirty_page(page, PAGE_CACHE_SIZE);
> - }
> }
> }
> }
>

2007-10-23 10:18:44

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

On Tuesday 23 October 2007 09:55:14 Fengguang Wu wrote:
> This is not a new problem in 2.6.23-git17.
> 2.6.22/2.6.23 is buggy in the same way.
>
> Reiserfs could leave newly created sub-page-size files in dirty state
> for ever. They cannot be synced to disk by pdflush routines or
> explicit `sync' commands. Only `umount' can do the trick.
>
> The direct cause is: the dirty page's PG_dirty is wrongly _cleared_.
> Call trace:
> [<ffffffff8027e920>] cancel_dirty_page+0xd0/0xf0
> [<ffffffff8816d470>] :reiserfs:reiserfs_cut_from_item+0x660/0x710
> [<ffffffff8816d791>] :reiserfs:reiserfs_do_truncate+0x271/0x530
> [<ffffffff8815872d>] :reiserfs:reiserfs_truncate_file+0xfd/0x3b0
> [<ffffffff8815d3d0>] :reiserfs:reiserfs_file_release+0x1e0/0x340
> [<ffffffff802a187c>] __fput+0xcc/0x1b0
> [<ffffffff802a1ba6>] fput+0x16/0x20
> [<ffffffff8029e676>] filp_close+0x56/0x90
> [<ffffffff8029fe0d>] sys_close+0xad/0x110
> [<ffffffff8020c41e>] system_call+0x7e/0x83
>
> Fix the bug by removing the cancel_dirty_page() call. Tests show that
> it causes no bad behaviors on various write sizes.
>
>
> === for the patient ===
> Here are more detailed demonstrations of the problem.
>
> 1) the page has both PG_dirty(D)/PAGECACHE_TAG_DIRTY(d) after being written to;
> and then only PAGECACHE_TAG_DIRTY(d) remains after the file is closed.
>
> ------------------------------ screen 0 ------------------------------
> [T0] root /home/wfg# cat > /test/tiny
> [T1] hi
> [T2] root /home/wfg#
>
> ------------------------------ screen 1 ------------------------------
> [T1] root /home/wfg# echo /test/tiny > /proc/filecache
> [T1] root /home/wfg# cat /proc/filecache
> # file /test/tiny
> # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
> # idx len state refcnt
> 0 1 ___UD__Bd_ 2
> [T2] root /home/wfg# cat /proc/filecache
> # file /test/tiny
> # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
> # idx len state refcnt
> 0 1 ___U___Bd_ 2
>
> 2) note the non-zero 'cancelled_write_bytes' after /tmp/hi is copied.
>
> ------------------------------ screen 0 ------------------------------
> [T0] root /home/wfg# echo hi > /tmp/hi
> [T1] root /home/wfg# cp /tmp/hi /dev/stdin /test
> [T2] hi
> [T3] root /home/wfg#
>
> ------------------------------ screen 1 ------------------------------
> [T1] root /proc/4397# cd /proc/`pidof cp`
> [T1] root /proc/4713# cat io
> rchar: 8396
> wchar: 3
> syscr: 20
> syscw: 1
> read_bytes: 0
> write_bytes: 20480
> cancelled_write_bytes: 4096
> [T2] root /proc/4713# cat io
> rchar: 8399
> wchar: 6
> syscr: 21
> syscw: 2
> read_bytes: 0
> write_bytes: 24576
> cancelled_write_bytes: 4096
>
> //Question: the 'write_bytes' is a bit more than expected ;-)
>
> Cc: Maxim Levitsky <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Fengguang Wu <[email protected]>
> ---
> fs/reiserfs/stree.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
> +++ linux-2.6.24-git17/fs/reiserfs/stree.c
> @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
> }
> bh = next;
> } while (bh != head);
> - if (PAGE_SIZE == bh->b_size) {
> - cancel_dirty_page(page, PAGE_CACHE_SIZE);
> - }
> }
> }
> }
>
>

One thing to say... Works perfectly!
Big thanks for fixing that bug.


Best regards,
Maxim Levitsky

2007-10-23 11:56:34

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote:
> [ adding reiserfs devs to the CC ]

Thank you.

This fix is kind of crude - even though it fixed Maxim's problem and
survived my stress testing of a lot of patching and kernel compiling,
I'd be glad to see better solutions.

Fengguang
---

reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

This is not a new problem in 2.6.23-git17.
2.6.22/2.6.23 is buggy in the same way.

Reiserfs could accumulate dirty sub-page-size files until umount time.
They cannot be synced to disk by pdflush routines or explicit `sync'
commands. Only `umount' can do the trick.

The direct cause is: the dirty page's PG_dirty is wrongly _cleared_.
Call trace:
[<ffffffff8027e920>] cancel_dirty_page+0xd0/0xf0
[<ffffffff8816d470>] :reiserfs:reiserfs_cut_from_item+0x660/0x710
[<ffffffff8816d791>] :reiserfs:reiserfs_do_truncate+0x271/0x530
[<ffffffff8815872d>] :reiserfs:reiserfs_truncate_file+0xfd/0x3b0
[<ffffffff8815d3d0>] :reiserfs:reiserfs_file_release+0x1e0/0x340
[<ffffffff802a187c>] __fput+0xcc/0x1b0
[<ffffffff802a1ba6>] fput+0x16/0x20
[<ffffffff8029e676>] filp_close+0x56/0x90
[<ffffffff8029fe0d>] sys_close+0xad/0x110
[<ffffffff8020c41e>] system_call+0x7e/0x83

Fix the bug by removing the cancel_dirty_page() call. Tests show that
it causes no bad behaviors on various write sizes.


=== for the patient ===
Here are more detailed demonstrations of the problem.

1) the page has both PG_dirty(D)/PAGECACHE_TAG_DIRTY(d) after being written to;
and then only PAGECACHE_TAG_DIRTY(d) remains after the file is closed.

------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# cat > /test/tiny
[T1] hi
[T2] root /home/wfg#

------------------------------ screen 1 ------------------------------
[T1] root /home/wfg# echo /test/tiny > /proc/filecache
[T1] root /home/wfg# cat /proc/filecache
# file /test/tiny
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
# idx len state refcnt
0 1 ___UD__Bd_ 2
[T2] root /home/wfg# cat /proc/filecache
# file /test/tiny
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
# idx len state refcnt
0 1 ___U___Bd_ 2

2) note the non-zero 'cancelled_write_bytes' after /tmp/hi is copied.

------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# echo hi > /tmp/hi
[T1] root /home/wfg# cp /tmp/hi /dev/stdin /test
[T2] hi
[T3] root /home/wfg#

------------------------------ screen 1 ------------------------------
[T1] root /proc/4397# cd /proc/`pidof cp`
[T1] root /proc/4713# cat io
rchar: 8396
wchar: 3
syscr: 20
syscw: 1
read_bytes: 0
write_bytes: 20480
cancelled_write_bytes: 4096
[T2] root /proc/4713# cat io
rchar: 8399
wchar: 6
syscr: 21
syscw: 2
read_bytes: 0
write_bytes: 24576
cancelled_write_bytes: 4096

//Question: the 'write_bytes' is a bit more than expected ;-)

Cc: Maxim Levitsky <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
---
fs/reiserfs/stree.c | 3 ---
1 file changed, 3 deletions(-)

--- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
+++ linux-2.6.24-git17/fs/reiserfs/stree.c
@@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
}
bh = next;
} while (bh != head);
- if (PAGE_SIZE == bh->b_size) {
- cancel_dirty_page(page, PAGE_CACHE_SIZE);
- }
}
}
}

2007-10-23 14:13:40

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

On Tue, 23 Oct 2007 19:56:20 +0800
Fengguang Wu <[email protected]> wrote:

> On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote:
> > [ adding reiserfs devs to the CC ]
>
> Thank you.
>
> This fix is kind of crude - even when it fixed Maxim's problem, and
> survived my stress testing of a lot of patching and kernel compiling.
> I'd be glad to see better solutions.

This should be safe, reiserfs has the buffer heads themselves clean and
the page should get cleaned eventually. The cancel_dirty_page call was
just an optimization to be VM friendly.

-chris

2007-10-23 14:40:29

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

On Tue, Oct 23, 2007 at 10:10:53AM -0400, Chris Mason wrote:
> On Tue, 23 Oct 2007 19:56:20 +0800
> Fengguang Wu <[email protected]> wrote:
>
> > On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote:
> > > [ adding reiserfs devs to the CC ]
> >
> > Thank you.
> >
> > This fix is kind of crude - even when it fixed Maxim's problem, and
> > survived my stress testing of a lot of patching and kernel compiling.
> > I'd be glad to see better solutions.
>
> This should be safe, reiserfs has the buffer heads themselves clean and
> the page should get cleaned eventually. The cancel_dirty_page call was
> just an optimization to be VM friendly.

> -chris

'chris' as in fs/reiserfs/{inode.c,namei.c}, and now in btrfs/*?

Nice to meet you ;-)

Fengguang

2007-10-23 14:41:52

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

On Tue, Oct 23, 2007 at 12:17:51PM +0200, Maxim Levitsky wrote:
> > ---
> > fs/reiserfs/stree.c | 3 ---
> > 1 file changed, 3 deletions(-)
> >
> > --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
> > +++ linux-2.6.24-git17/fs/reiserfs/stree.c
> > @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
> > }
> > bh = next;
> > } while (bh != head);
> > - if (PAGE_SIZE == bh->b_size) {
> > - cancel_dirty_page(page, PAGE_CACHE_SIZE);
> > - }
> > }
> > }
> > }
> >
> >
>
> One thing to say... Works perfectly!
> Big thanks for fixing that bug.

And many thanks for your testing~

Fengguang

2007-10-31 15:22:22

by Torsten Kaiser

[permalink] [raw]
Subject: Re: 100% iowait on one of cpus in current -git

On 10/22/07, Fengguang Wu <[email protected]> wrote:
> On Mon, Oct 22, 2007 at 09:10:45PM +0800, Fengguang Wu wrote:
> > Hmm, Maybe it's an reiserfs related issue. Do you have the full log file?
>
> Bingo! It can be reproduced in -mm on reiserfs:
>
> # mkfs.reiserfs /dev/sdb1
> # mount /dev/sdb1 /test
> # cp bin /test
> <wait for a while>
> # dmesg
> [...]
> [ 418.346113] requeue_io 308: inode 6 size 302 at 08:11(sdb1)
> [ 418.346119] requeue_io 308: inode 7 size 196 at 08:11(sdb1)
> [ 418.346125] requeue_io 308: inode 8 size 85 at 08:11(sdb1)

Since 2.6.23-mm1 I have also experienced strange hangs during heavy
writeouts.
Each time I noticed this I was using emerge (the package utility from
the Gentoo distribution) to install/upgrade a package. The last step,
where this hang occurred, is moving the prepared files from a tmpfs
partition to the main xfs filesystem.
The hangs were not fatal; after a few seconds everything returned to
normal, so I was not able to capture a good image of what was
happening.

Today it happened again, but a little more visibly. During the moving
process the writeout stalled completely for several minutes, until I hit
SysRq+W.

/proc/meminfo:
MemTotal: 4061808 kB
MemFree: 881332 kB
Buffers: 0 kB
Cached: 2566628 kB
SwapCached: 64 kB
Active: 926612 kB
Inactive: 1959136 kB
SwapTotal: 9775416 kB
SwapFree: 9775296 kB
Dirty: 44948 kB
Writeback: 0 kB
AnonPages: 319068 kB
Mapped: 52844 kB
Slab: 235572 kB
SReclaimable: 164408 kB
SUnreclaim: 71164 kB
PageTables: 9576 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 11806320 kB
Committed_AS: 544520 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 35004 kB
VmallocChunk: 34359702447 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

The 'Dirty' count did not decrease during this time and 'Writeback' stayed at 0.
I also have /proc/pagetypeinfo, but I see nothing interesting in
there. (But I will send it if needed.)
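
One way to watch for this kind of stall without SysRq is to sample
Dirty/Writeback repeatedly: if Dirty stays flat while Writeback stays
pinned at 0 across samples, writeback is stuck rather than merely slow.
A minimal sketch (the helper name is made up; it assumes only the
/proc/meminfo field names shown above):

```shell
#!/bin/sh
# sample_writeback: print the Dirty and Writeback values (in kB) from a
# meminfo-format file; defaults to /proc/meminfo. A writeback stall
# shows up as a flat Dirty value with Writeback pinned at 0 across
# repeated samples.
sample_writeback() {
    awk '$1 == "Dirty:"     { d = $2 }
         $1 == "Writeback:" { w = $2 }
         END { print d, w }' "${1:-/proc/meminfo}"
}
```

Running it from a loop (e.g. `while sleep 2; do sample_writeback; done`)
while reproducing the workload makes the stall window visible.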

The output from SysRq+W:
SysRq : Show Blocked State
task PC stack pid father
pdflush D ffff81001fcc2a88 0 285 2
ffff810005d55580 0000000000000046 0000000000000800 0000007000000001
0000000000000400 ffffffff8022d61c ffffffff80817b00 ffffffff80817b00
ffffffff80813f40 ffffffff80817b00 ffff810100893b18 0000000000000000
Call Trace:
[<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
[<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
[<ffffffff805b14c7>] __down+0xa7/0x11e
[<ffffffff8022da70>] default_wake_function+0x0/0x10
[<ffffffff805b1145>] __down_failed+0x35/0x3a
[<ffffffff803750be>] xfs_buf_lock+0x3e/0x40
[<ffffffff803771fe>] _xfs_buf_find+0x13e/0x240
[<ffffffff8037736f>] xfs_buf_get_flags+0x6f/0x190
[<ffffffff803774a2>] xfs_buf_read_flags+0x12/0xa0
[<ffffffff80368614>] xfs_trans_read_buf+0x64/0x340
[<ffffffff80352151>] xfs_itobp+0x81/0x1e0
[<ffffffff803759de>] xfs_buf_rele+0x2e/0xd0
[<ffffffff80354afe>] xfs_iflush+0xfe/0x520
[<ffffffff803ae3b2>] __down_read_trylock+0x42/0x60
[<ffffffff80355a72>] xfs_inode_item_push+0x12/0x20
[<ffffffff80368037>] xfs_trans_push_ail+0x267/0x2b0
[<ffffffff8035a33b>] xlog_ticket_get+0xfb/0x140
[<ffffffff8035c5ae>] xfs_log_reserve+0xee/0x120
[<ffffffff803669e8>] xfs_trans_reserve+0xa8/0x210
[<ffffffff8035703a>] xfs_iomap_write_allocate+0xfa/0x410
[<ffffffff804ce67a>] __split_bio+0x38a/0x3c0
[<ffffffff80373657>] xfs_start_page_writeback+0x27/0x60
[<ffffffff8035660c>] xfs_iomap+0x26c/0x310
[<ffffffff803735d8>] xfs_map_blocks+0x38/0x90
[<ffffffff80374a88>] xfs_page_state_convert+0x2b8/0x630
[<ffffffff80374f5f>] xfs_vm_writepage+0x6f/0x120
[<ffffffff8026acda>] __writepage+0xa/0x30
[<ffffffff8026b2ce>] write_cache_pages+0x23e/0x330
[<ffffffff8026acd0>] __writepage+0x0/0x30
[<ffffffff80354db7>] xfs_iflush+0x3b7/0x520
[<ffffffff80375782>] _xfs_buf_ioapply+0x222/0x320
[<ffffffff803ae451>] __up_read+0x21/0xb0
[<ffffffff8034f22c>] xfs_iunlock+0x5c/0xc0
[<ffffffff8026b410>] do_writepages+0x20/0x40
[<ffffffff802b36a0>] __writeback_single_inode+0xb0/0x380
[<ffffffff804d052e>] dm_table_any_congested+0x2e/0x80
[<ffffffff802b3d8d>] generic_sync_sb_inodes+0x20d/0x330
[<ffffffff802b4322>] writeback_inodes+0xa2/0xe0
[<ffffffff8026bde6>] wb_kupdate+0xa6/0x120
[<ffffffff8026c2a0>] pdflush+0x0/0x1e0
[<ffffffff8026c3b0>] pdflush+0x110/0x1e0
[<ffffffff8026bd40>] wb_kupdate+0x0/0x120
[<ffffffff8024a32b>] kthread+0x4b/0x80
[<ffffffff8020c9d8>] child_rip+0xa/0x12
[<ffffffff8024a2e0>] kthread+0x0/0x80
[<ffffffff8020c9ce>] child_rip+0x0/0x12

emerge D ffff81001fcc2a88 0 3221 8163
ffff81008c0679f8 0000000000000086 ffff81008c067988 ffffffff8024a719
ffff8100060fb008 ffffffff8022c8ea ffffffff80817b00 ffffffff80817b00
ffffffff80813f40 ffffffff80817b00 ffff810100893b18 0000000000000000
Call Trace:
[<ffffffff8024a719>] autoremove_wake_function+0x9/0x30
[<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
[<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
[<ffffffff805b14c7>] __down+0xa7/0x11e
[<ffffffff8022da70>] default_wake_function+0x0/0x10
[<ffffffff805b1145>] __down_failed+0x35/0x3a
[<ffffffff803750be>] xfs_buf_lock+0x3e/0x40
[<ffffffff803771fe>] _xfs_buf_find+0x13e/0x240
[<ffffffff8037736f>] xfs_buf_get_flags+0x6f/0x190
[<ffffffff803774a2>] xfs_buf_read_flags+0x12/0xa0
[<ffffffff80368614>] xfs_trans_read_buf+0x64/0x340
[<ffffffff80352151>] xfs_itobp+0x81/0x1e0
[<ffffffff803759de>] xfs_buf_rele+0x2e/0xd0
[<ffffffff80354afe>] xfs_iflush+0xfe/0x520
[<ffffffff803ae3b2>] __down_read_trylock+0x42/0x60
[<ffffffff80355a72>] xfs_inode_item_push+0x12/0x20
[<ffffffff80368037>] xfs_trans_push_ail+0x267/0x2b0
[<ffffffff8035c532>] xfs_log_reserve+0x72/0x120
[<ffffffff803669e8>] xfs_trans_reserve+0xa8/0x210
[<ffffffff80372fe2>] kmem_zone_zalloc+0x32/0x50
[<ffffffff8035242b>] xfs_itruncate_finish+0xfb/0x310
[<ffffffff8036d8db>] xfs_free_eofblocks+0x23b/0x280
[<ffffffff80371d83>] xfs_release+0x153/0x200
[<ffffffff80377e00>] xfs_file_release+0x10/0x20
[<ffffffff80294041>] __fput+0xb1/0x220
[<ffffffff80290e94>] filp_close+0x54/0x90
[<ffffffff802927af>] sys_close+0x9f/0x100
[<ffffffff8020bbbe>] system_call+0x7e/0x83


After this SysRq+W, writeback resumed again. It is possible that writing
the above into the syslog triggered it.
The source tmpfs is mounted without any special parameters, but the
target xfs filesystem resides on a dm-crypt device that is on top of a
3-disk RAID5 md.
During the hang all CPUs were idle.
The system is x86_64 with CONFIG_NO_HZ=y, but was still receiving ~330
interrupts per second because of the bttv driver. (But I was not using
that device at the time.)

I'm willing to test patches or provide more information, but I lack
a good testcase to trigger this on demand.
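
For what it's worth, the emerge step could be approximated by a rough
sketch like the one below: stage many small files (as on tmpfs), then
move them onto the target filesystem and sync. The function name, paths
and file count are placeholders; to resemble the report, the source
directory would sit on a tmpfs mount and the destination on the
xfs-on-dm-crypt target.

```shell
#!/bin/sh
# Rough approximation of the reported workload: stage many small files,
# move them to the target filesystem, then sync. src/dst are
# placeholder directories supplied by the caller.
stage_and_move() {
    src=$1; dst=$2; n=${3:-500}
    i=0
    while [ "$i" -lt "$n" ]; do
        # ~200 bytes each: small, sub-page-sized files
        printf '%0200d\n' 0 > "$src/f$i"
        i=$((i + 1))
    done
    mv "$src"/f* "$dst"/
    sync    # in the report, writeout stalled at roughly this point
}
```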

Torsten

2007-11-01 07:57:53

by Wu Fengguang

[permalink] [raw]
Subject: Re: 100% iowait on one of cpus in current -git

On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> On 10/22/07, Fengguang Wu <[email protected]> wrote:
> > On Mon, Oct 22, 2007 at 09:10:45PM +0800, Fengguang Wu wrote:
> > > Hmm, Maybe it's an reiserfs related issue. Do you have the full log file?
> >
> > Bingo! It can be reproduced in -mm on reiserfs:
> >
> > # mkfs.reiserfs /dev/sdb1
> > # mount /dev/sdb1 /test
> > # cp bin /test
> > <wait for a while>
> > # dmesg
> > [...]
> > [ 418.346113] requeue_io 308: inode 6 size 302 at 08:11(sdb1)
> > [ 418.346119] requeue_io 308: inode 7 size 196 at 08:11(sdb1)
> > [ 418.346125] requeue_io 308: inode 8 size 85 at 08:11(sdb1)
>
> Since 2.6.23-mm1 I have also experienced strange hangs during heavy
> writeouts.
> Each time I noticed this I was using emerge (the package utility from
> the Gentoo distribution) to install/upgrade a package. The last step,
> where this hang occurred, is moving the prepared files from a tmpfs
> partition to the main xfs filesystem.
> The hangs were not fatal; after a few seconds everything returned to
> normal, so I was not able to capture a good image of what was
> happening.

Thank you for the detailed report.

How severe were the hangs? Did only writeouts stall, did all apps
stall, or could you not even type and run new commands?

> Today it happened again, but a little more visibly. During the moving
> process the writeout stalled completely for several minutes, until I hit
> SysRq+W.
>
> /proc/meminfo:
> MemTotal: 4061808 kB
> MemFree: 881332 kB
> Buffers: 0 kB
> Cached: 2566628 kB
> SwapCached: 64 kB
> Active: 926612 kB
> Inactive: 1959136 kB
> SwapTotal: 9775416 kB
> SwapFree: 9775296 kB
> Dirty: 44948 kB
> Writeback: 0 kB
> AnonPages: 319068 kB
> Mapped: 52844 kB
> Slab: 235572 kB
> SReclaimable: 164408 kB
> SUnreclaim: 71164 kB
> PageTables: 9576 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> CommitLimit: 11806320 kB
> Committed_AS: 544520 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 35004 kB
> VmallocChunk: 34359702447 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
>
> The 'Dirty' count did not decrease during this time and 'Writeback' stayed at 0.
> I also have /proc/pagetypeinfo, but I see nothing interesting in
> there. (But I will send it if needed.)
>
> The output from SysRq+W:
> SysRq : Show Blocked State
> task PC stack pid father
> pdflush D ffff81001fcc2a88 0 285 2
> ffff810005d55580 0000000000000046 0000000000000800 0000007000000001
> 0000000000000400 ffffffff8022d61c ffffffff80817b00 ffffffff80817b00
> ffffffff80813f40 ffffffff80817b00 ffff810100893b18 0000000000000000
> Call Trace:
> [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
> [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
> [<ffffffff805b14c7>] __down+0xa7/0x11e
> [<ffffffff8022da70>] default_wake_function+0x0/0x10
> [<ffffffff805b1145>] __down_failed+0x35/0x3a
> [<ffffffff803750be>] xfs_buf_lock+0x3e/0x40
> [<ffffffff803771fe>] _xfs_buf_find+0x13e/0x240
> [<ffffffff8037736f>] xfs_buf_get_flags+0x6f/0x190
> [<ffffffff803774a2>] xfs_buf_read_flags+0x12/0xa0
> [<ffffffff80368614>] xfs_trans_read_buf+0x64/0x340
> [<ffffffff80352151>] xfs_itobp+0x81/0x1e0
> [<ffffffff803759de>] xfs_buf_rele+0x2e/0xd0
> [<ffffffff80354afe>] xfs_iflush+0xfe/0x520
> [<ffffffff803ae3b2>] __down_read_trylock+0x42/0x60
> [<ffffffff80355a72>] xfs_inode_item_push+0x12/0x20
> [<ffffffff80368037>] xfs_trans_push_ail+0x267/0x2b0
> [<ffffffff8035a33b>] xlog_ticket_get+0xfb/0x140
> [<ffffffff8035c5ae>] xfs_log_reserve+0xee/0x120
> [<ffffffff803669e8>] xfs_trans_reserve+0xa8/0x210
> [<ffffffff8035703a>] xfs_iomap_write_allocate+0xfa/0x410
> [<ffffffff804ce67a>] __split_bio+0x38a/0x3c0
> [<ffffffff80373657>] xfs_start_page_writeback+0x27/0x60
> [<ffffffff8035660c>] xfs_iomap+0x26c/0x310
> [<ffffffff803735d8>] xfs_map_blocks+0x38/0x90
> [<ffffffff80374a88>] xfs_page_state_convert+0x2b8/0x630
> [<ffffffff80374f5f>] xfs_vm_writepage+0x6f/0x120
> [<ffffffff8026acda>] __writepage+0xa/0x30
> [<ffffffff8026b2ce>] write_cache_pages+0x23e/0x330
> [<ffffffff8026acd0>] __writepage+0x0/0x30
> [<ffffffff80354db7>] xfs_iflush+0x3b7/0x520
> [<ffffffff80375782>] _xfs_buf_ioapply+0x222/0x320
> [<ffffffff803ae451>] __up_read+0x21/0xb0
> [<ffffffff8034f22c>] xfs_iunlock+0x5c/0xc0
> [<ffffffff8026b410>] do_writepages+0x20/0x40
> [<ffffffff802b36a0>] __writeback_single_inode+0xb0/0x380
> [<ffffffff804d052e>] dm_table_any_congested+0x2e/0x80
> [<ffffffff802b3d8d>] generic_sync_sb_inodes+0x20d/0x330
> [<ffffffff802b4322>] writeback_inodes+0xa2/0xe0
> [<ffffffff8026bde6>] wb_kupdate+0xa6/0x120
> [<ffffffff8026c2a0>] pdflush+0x0/0x1e0
> [<ffffffff8026c3b0>] pdflush+0x110/0x1e0
> [<ffffffff8026bd40>] wb_kupdate+0x0/0x120
> [<ffffffff8024a32b>] kthread+0x4b/0x80
> [<ffffffff8020c9d8>] child_rip+0xa/0x12
> [<ffffffff8024a2e0>] kthread+0x0/0x80
> [<ffffffff8020c9ce>] child_rip+0x0/0x12
>
> emerge D ffff81001fcc2a88 0 3221 8163
> ffff81008c0679f8 0000000000000086 ffff81008c067988 ffffffff8024a719
> ffff8100060fb008 ffffffff8022c8ea ffffffff80817b00 ffffffff80817b00
> ffffffff80813f40 ffffffff80817b00 ffff810100893b18 0000000000000000
> Call Trace:
> [<ffffffff8024a719>] autoremove_wake_function+0x9/0x30
> [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
> [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
> [<ffffffff805b14c7>] __down+0xa7/0x11e
> [<ffffffff8022da70>] default_wake_function+0x0/0x10
> [<ffffffff805b1145>] __down_failed+0x35/0x3a
> [<ffffffff803750be>] xfs_buf_lock+0x3e/0x40
> [<ffffffff803771fe>] _xfs_buf_find+0x13e/0x240
> [<ffffffff8037736f>] xfs_buf_get_flags+0x6f/0x190
> [<ffffffff803774a2>] xfs_buf_read_flags+0x12/0xa0
> [<ffffffff80368614>] xfs_trans_read_buf+0x64/0x340
> [<ffffffff80352151>] xfs_itobp+0x81/0x1e0
> [<ffffffff803759de>] xfs_buf_rele+0x2e/0xd0
> [<ffffffff80354afe>] xfs_iflush+0xfe/0x520
> [<ffffffff803ae3b2>] __down_read_trylock+0x42/0x60
> [<ffffffff80355a72>] xfs_inode_item_push+0x12/0x20
> [<ffffffff80368037>] xfs_trans_push_ail+0x267/0x2b0
> [<ffffffff8035c532>] xfs_log_reserve+0x72/0x120
> [<ffffffff803669e8>] xfs_trans_reserve+0xa8/0x210
> [<ffffffff80372fe2>] kmem_zone_zalloc+0x32/0x50
> [<ffffffff8035242b>] xfs_itruncate_finish+0xfb/0x310
> [<ffffffff8036d8db>] xfs_free_eofblocks+0x23b/0x280
> [<ffffffff80371d83>] xfs_release+0x153/0x200
> [<ffffffff80377e00>] xfs_file_release+0x10/0x20
> [<ffffffff80294041>] __fput+0xb1/0x220
> [<ffffffff80290e94>] filp_close+0x54/0x90
> [<ffffffff802927af>] sys_close+0x9f/0x100
> [<ffffffff8020bbbe>] system_call+0x7e/0x83
>
>
> After this SysRq+W, writeback resumed again. It is possible that writing
> the above into the syslog triggered it.

Maybe. Are the log files on another disk/partition?

> The source tmpfs is mounted without any special parameters, but the
> target xfs filesystem resides on a dm-crypt device that is on top of a
> 3-disk RAID5 md.
> During the hang all CPUs were idle.

No iowaits? ;-)

> The system is x86_64 with CONFIG_NO_HZ=y, but was still receiving ~330
> interrupts per second because of the bttv driver. (But I was not using
> that device at this time.)
>
> I'm willing to test patches or more provide more information, but lack
> a good testcase to trigger this on demand.

Thank you. Maybe we can start with the attached debug patch :-)

Fengguang


Attachments:
(No filename) (7.38 kB)
writeback-debug.patch (1.88 kB)
Download all attachments

2007-11-01 18:21:05

by Torsten Kaiser

[permalink] [raw]
Subject: Re: 100% iowait on one of cpus in current -git

On 11/1/07, Fengguang Wu <[email protected]> wrote:
> On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > Each time I noticed this I was using emerge (package util from the
> > gentoo distribution) to install/upgrade a package. The last step,
> > where this hang occurred, is moving the prepared files from a tmpfs
> > partion to the main xfs filesystem.
> > The hangs where not fatal, after a few second everything resumed
> > normal, so I was not able to capture a good image of what was
> > happening.
>
> Thank you for the detailed report.
>
> How severe was the hangs? Only writeouts stalled, all apps stalled, or
> cannot type and run new commands?

Only writeout stalled. The emerge that was moving the files hung, but
everything else worked normally.
I was able to run new commands, like copying /proc/meminfo.

[snip]
> > After this SysRq+W writeback resumed again. Possible that writing
> > above into the syslog triggered that.
>
> Maybe. Are the log files on another disk/partition?

No, everything was going to /

What might be interesting is that doing cat /proc/meminfo
>~/stall/meminfo did not resume the writeback. So there might be some
threshold that was only crossed by the additional write from
syslog-ng. Or syslog-ng does some flushing, I don't know. (I'm using the
syslog-ng package from gentoo:
http://www.balabit.com/products/syslog_ng/ , version 2.0.5)

> > The source tmpfs is mounted with any special parameters, but the
> > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > disk RAID5 md.
> > During the hang all CPUs where idle.
>
> No iowaits? ;-)

No, I have a KSysGuard in my taskbar that showed no activity at all.

OK, the subject does not match for my case, but there was also a tmpfs
involved. And I found no thread with stalls on xfs. :-)

> > The system is x86_64 with CONFIG_NO_HZ=y, but was still receiving ~330
> > interrupts per second because of the bttv driver. (But I was not using
> > that device at this time.)
> >
> > I'm willing to test patches or more provide more information, but lack
> > a good testcase to trigger this on demand.
>
> Thank you. Maybe we can start by the applied debug patch :-)

Will apply it and try to recreate this.

Thanks for looking into it.

Torsten

2007-11-01 19:00:37

by Torsten Kaiser

[permalink] [raw]
Subject: Re: 100% iowait on one of cpus in current -git

On 11/1/07, Torsten Kaiser <[email protected]> wrote:
> On 11/1/07, Fengguang Wu <[email protected]> wrote:
> > Thank you. Maybe we can start by the applied debug patch :-)
>
> Will applied it and try to recreate this.

Patch applied, used emerge to install a 2.6.24-rc1 kernel.

I had no complete stalls, but three times during the move from tmpfs
to the main xfs the emerge got noticeably slower. There was still
writeout happening, but as emerge prints out every file it has
written, during the pause not one file was processed.

vmstat 10:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 0 3146424 332 614768 0 0 134 1849 438 2515 3 4 91 2
0 0 0 3146644 332 614784 0 0 2 1628 507 646 0 2 85 13
0 0 0 3146868 332 614868 0 0 5 2359 527 1076 0 3 97 0
1 0 0 3144372 332 616148 0 0 96 2829 607 2666 2 5 92 0
-> normal writeout
0 0 0 3140560 332 618144 0 0 152 2764 633 3308 3 6 91 0
0 0 0 3137332 332 619908 0 0 114 1801 588 2858 3 4 93 0
0 0 0 3136912 332 620136 0 0 20 827 393 1605 1 2 98 0
-> first stall
0 0 0 3137088 332 620136 0 0 0 557 339 1437 0 1 99 0
0 0 0 3137160 332 620136 0 0 0 642 310 1400 0 1 99 0
0 0 0 3136588 332 620172 0 0 6 2972 527 1195 0 3 80 16
0 0 0 3136276 332 620348 0 0 10 2668 558 1195 0 3 96 0
0 0 0 3135228 332 620424 0 0 8 2712 522 1311 0 4 96 0
0 0 0 3131740 332 621524 0 0 75 2935 559 2457 2 5 93 0
0 0 0 3128348 332 622972 0 0 85 1470 490 2607 3 4 93 0
0 0 0 3129292 332 622972 0 0 0 527 353 1398 0 1 99 0
-> second longer stall
0 0 0 3128520 332 623028 0 0 6 488 249 1390 0 1 99 0
0 0 0 3128236 332 623028 0 0 0 482 222 1222 0 1 99 0
0 0 0 3128408 332 623028 0 0 0 585 269 1301 0 0 99 0
0 0 0 3128532 332 623028 0 0 0 610 262 1278 0 0 99 0
0 0 0 3128568 332 623028 0 0 0 636 345 1639 0 1 99 0
0 0 0 3129032 332 623040 0 0 1 664 337 1466 0 1 99 0
0 0 0 3129484 332 623040 0 0 0 658 300 1508 0 0 100 0
0 0 0 3129576 332 623040 0 0 0 562 271 1454 0 1 99 0
0 0 0 3129736 332 623040 0 0 0 627 278 1406 0 1 99 0
0 0 0 3129368 332 623040 0 0 0 507 274 1301 0 1 99 0
0 0 0 3129004 332 623040 0 0 0 444 211 1213 0 0 99 0
0 1 0 3127260 332 623040 0 0 0 1036 305 1242 0 1 95 4
0 0 0 3126280 332 623128 0 0 7 4241 555 1575 1 5 84 10
0 0 0 3124948 332 623232 0 0 6 4194 529 1505 1 4 95 0
0 0 0 3125228 332 624168 0 0 58 1966 586 1964 2 4 94 0
-> emerge resumed to normal speed, without any intervention from my side
0 0 0 3120932 332 625904 0 0 112 1546 546 2565 3 4 93 0
0 0 0 3118012 332 627568 0 0 128 1542 612 2705 3 4 93 0


>From syslog:
first stall:
[ 575.050000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47259
global 610 0 0 wc __ tw 1023 sk 0
[ 586.350000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50465
global 6117 0 0 wc _M tw 967 sk 0
[ 586.360000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50408
global 6117 0 0 wc __ tw 1022 sk 0
[ 599.900000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 53523
global 11141 0 0 wc __ tw 1009 sk 0
[ 635.780000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 59397
global 12757 124 0 wc __ tw 0 sk 0
[ 638.470000] mm/page-writeback.c 418 balance_dirty_pages:
emerge(6113) 1536 global 11405 51 0 wc __ tw 0 sk 0
[ 638.820000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 58373
global 11276 48 0 wc __ tw -1 sk 0
[ 641.260000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 57348
global 10565 100 0 wc __ tw 0 sk 0
[ 643.980000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 56324
global 9788 103 0 wc __ tw -1 sk 0
[ 646.120000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 55299
global 8912 6 0 wc __ tw 0 sk 0

second stall:
[ 664.040000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48117
global 2864 81 0 wc _M tw -13 sk 0
[ 664.400000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47080
global 1995 137 0 wc _M tw 176 sk 0
[ 664.510000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 46232
global 1929 267 0 wc __ tw 880 sk 0
cron[6927]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
[ 809.560000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 49422
global 19166 217 0 wc _M tw 380 sk 0
[ 811.720000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48778
global 17969 407 0 wc _M tw -4 sk 0
[ 813.880000] mm/page-writeback.c 418 balance_dirty_pages:
emerge(6113) 1537 global 16592 233 0 wc _M tw -1 sk 0
[ 814.710000] mm/page-writeback.c 418 balance_dirty_pages: find(6931)
1537 global 16132 179 0 wc __ tw -1 sk 0
[ 814.720000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47750
global 16040 271 0 wc _M tw -1 sk 0
[ 815.040000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 46725
global 15403 779 0 wc CM tw 324 sk 0

The third stall happened after the emerge was finished. There was
still ~120 MB of dirty data, but its writeout got much slower over
several seconds.
vmstat 10:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 3096152 332 630424 0 0 81 1503 640 2771 5 4 91 0
0 0 0 3101024 332 631588 0 0 279 473 510 1281 5 2 92 1
-> stall / slowdown starts
0 0 0 3147924 332 632384 0 0 78 626 449 1384 0 1 99 0
1 0 0 3147940 332 632384 0 0 0 611 388 1387 0 1 99 0
0 1 0 3147576 332 632384 0 0 0 939 449 1432 0 1 99 0
0 0 0 3145476 332 632384 0 0 0 3592 644 925 0 4 93 3
-> writeout resumes full speed
0 0 0 3147232 332 632480 0 0 0 3108 678 1053 0 3 97 0
0 0 0 3146860 332 632480 0 0 0 2497 677 859 0 3 97 0
0 0 0 3146720 332 632480 0 0 0 2433 648 839 0 3 97 0
0 0 0 3147844 332 632484 0 0 0 2394 625 889 0 3 97 0
0 0 0 3148128 332 632484 0 0 0 2204 671 848 0 2 97 0

from syslog:
[ 848.070000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48084
global 13805 0 0 wc _M tw 1008 sk 0
[ 848.080000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48068
global 13805 0 0 wc __ tw 1020 sk 0
[ 884.090000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 61811
global 30297 2 0 wc __ tw 862 sk 0
[ 921.760000] mm/page-writeback.c 418 balance_dirty_pages: cat(7170)
1541 global 28113 391 0 wc __ tw -5 sk 0
-> that cat was probably my watch cat /proc/meminfo
-> during the stall there were no updates visible there
[ 922.190000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 76871
global 27735 0 0 wc __ tw -5 sk 0
[ 923.550000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 75842
global 26688 106 0 wc _M tw -1 sk 0
[ 924.940000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 74817
global 25698 195 0 wc _M tw 0 sk 0

Apart from my normal kde desktop (no compiz) and the emerge the system was idle.

If I see the complete stall again, I will post that too.

Torsten
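[A hedged sketch of capturing what `watch cat /proc/meminfo` shows in a replayable form, keeping only the writeback-related fields. The log path and the fixed three iterations are arbitrary assumptions; in practice the loop would run until the stall is over.]

```shell
# Sample the dirty/writeback counters from /proc/meminfo once a second.
# Three iterations here for illustration; use `while true` when chasing a stall.
for i in 1 2 3; do
    date +%s
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 1
done > /tmp/meminfo.log
```

Timestamped samples like these make it possible to line a stall window up against the vmstat and syslog output afterwards.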

2007-11-02 01:54:33

by Wu Fengguang

[permalink] [raw]
Subject: writeout stalls in current -git

On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Fengguang Wu <[email protected]> wrote:
> > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > Each time I noticed this I was using emerge (package util from the
> > > gentoo distribution) to install/upgrade a package. The last step,
> > > where this hang occurred, is moving the prepared files from a tmpfs
> > > partion to the main xfs filesystem.
> > > The hangs where not fatal, after a few second everything resumed
> > > normal, so I was not able to capture a good image of what was
> > > happening.
> >
> > Thank you for the detailed report.
> >
> > How severe was the hangs? Only writeouts stalled, all apps stalled, or
> > cannot type and run new commands?
>
> Only writeout stalled. The emerge that was moving the files hung, but
> everything else worked normaly.
> I was able to run new commands, like coping the /proc/meminfo.

But you mentioned in the next mail that `watch cat /proc/meminfo`
could also be blocked for some time - I guess emerge was stalled at
the same time?

> [snip]
> > > After this SysRq+W writeback resumed again. Possible that writing
> > > above into the syslog triggered that.
> >
> > Maybe. Are the log files on another disk/partition?
>
> No, everything was going to /
>
> What might be interesting is, that doing cat /proc/meminfo
> >~/stall/meminfo did not resume the writeback. So there might some
> threshold that only was broken with the additional write from
> syslog-ng. Or syslog-ng does some flushing, I dont now. (I'm using the

Have you tried explicit `sync`? ;-)

> syslog-ng package from gentoo:
> http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
>
> > > The source tmpfs is mounted with any special parameters, but the
> > > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > > disk RAID5 md.
> > > During the hang all CPUs where idle.
> >
> > No iowaits? ;-)
>
> No, I have a KSysGuard in my taskbar that showed no activity at all.
>
> OK, the subject does not match for my case, but there was also a tmpfs
> involved. And I found no thread with stalls on xfs. :-)

Do you mean it is actually related to tmpfs?

> > > The system is x86_64 with CONFIG_NO_HZ=y, but was still receiving ~330
> > > interrupts per second because of the bttv driver. (But I was not using
> > > that device at this time.)
> > >
> > > I'm willing to test patches or more provide more information, but lack
> > > a good testcase to trigger this on demand.
> >
> > Thank you. Maybe we can start by the applied debug patch :-)
>
> Will applied it and try to recreate this.
>
> Thanks for looking into it.

Thank you for the rich information, too :-)

Fengguang

2007-11-02 02:21:57

by Wu Fengguang

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Torsten Kaiser <[email protected]> wrote:
> > On 11/1/07, Fengguang Wu <[email protected]> wrote:
> > > Thank you. Maybe we can start by the applied debug patch :-)
> >
> > Will applied it and try to recreate this.
>
> Patch applied, used emerge to install a 2.6.24-rc1 kernel.
>
> I had no complete stalls, but three times during the move from tmpfs
> to the main xfs the emerge got noticeable slower. There still was
> writeout happening, but as emerge prints out every file it has written
> during the pause not one file was processed.
>
> vmstat 10:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 1 0 3146424 332 614768 0 0 134 1849 438 2515 3 4 91 2
> 0 0 0 3146644 332 614784 0 0 2 1628 507 646 0 2 85 13
> 0 0 0 3146868 332 614868 0 0 5 2359 527 1076 0 3 97 0
> 1 0 0 3144372 332 616148 0 0 96 2829 607 2666 2 5 92 0
> -> normal writeout
> 0 0 0 3140560 332 618144 0 0 152 2764 633 3308 3 6 91 0
> 0 0 0 3137332 332 619908 0 0 114 1801 588 2858 3 4 93 0
> 0 0 0 3136912 332 620136 0 0 20 827 393 1605 1 2 98 0
> -> first stall

'stall': vmstat's output stalls for some time, or emerge stalls for
the next several vmstat lines?

> 0 0 0 3137088 332 620136 0 0 0 557 339 1437 0 1 99 0
> 0 0 0 3137160 332 620136 0 0 0 642 310 1400 0 1 99 0
> 0 0 0 3136588 332 620172 0 0 6 2972 527 1195 0 3 80 16
> 0 0 0 3136276 332 620348 0 0 10 2668 558 1195 0 3 96 0
> 0 0 0 3135228 332 620424 0 0 8 2712 522 1311 0 4 96 0
> 0 0 0 3131740 332 621524 0 0 75 2935 559 2457 2 5 93 0
> 0 0 0 3128348 332 622972 0 0 85 1470 490 2607 3 4 93 0
> 0 0 0 3129292 332 622972 0 0 0 527 353 1398 0 1 99 0
> -> second longer stall
> 0 0 0 3128520 332 623028 0 0 6 488 249 1390 0 1 99 0
> 0 0 0 3128236 332 623028 0 0 0 482 222 1222 0 1 99 0
> 0 0 0 3128408 332 623028 0 0 0 585 269 1301 0 0 99 0
> 0 0 0 3128532 332 623028 0 0 0 610 262 1278 0 0 99 0
> 0 0 0 3128568 332 623028 0 0 0 636 345 1639 0 1 99 0
> 0 0 0 3129032 332 623040 0 0 1 664 337 1466 0 1 99 0
> 0 0 0 3129484 332 623040 0 0 0 658 300 1508 0 0 100 0
> 0 0 0 3129576 332 623040 0 0 0 562 271 1454 0 1 99 0
> 0 0 0 3129736 332 623040 0 0 0 627 278 1406 0 1 99 0
> 0 0 0 3129368 332 623040 0 0 0 507 274 1301 0 1 99 0
> 0 0 0 3129004 332 623040 0 0 0 444 211 1213 0 0 99 0
> 0 1 0 3127260 332 623040 0 0 0 1036 305 1242 0 1 95 4
> 0 0 0 3126280 332 623128 0 0 7 4241 555 1575 1 5 84 10
> 0 0 0 3124948 332 623232 0 0 6 4194 529 1505 1 4 95 0
> 0 0 0 3125228 332 624168 0 0 58 1966 586 1964 2 4 94 0
> -> emerge resumed to normal speed, without any intervention from my side
> 0 0 0 3120932 332 625904 0 0 112 1546 546 2565 3 4 93 0
> 0 0 0 3118012 332 627568 0 0 128 1542 612 2705 3 4 93 0

Interesting, the 'bo' never falls to zero.

>
> >From syslog:
> first stall:
> [ 575.050000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47259
> global 610 0 0 wc __ tw 1023 sk 0
> [ 586.350000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50465
> global 6117 0 0 wc _M tw 967 sk 0
> [ 586.360000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50408
> global 6117 0 0 wc __ tw 1022 sk 0
> [ 599.900000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 53523
> global 11141 0 0 wc __ tw 1009 sk 0
> [ 635.780000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 59397
> global 12757 124 0 wc __ tw 0 sk 0
> [ 638.470000] mm/page-writeback.c 418 balance_dirty_pages:
> emerge(6113) 1536 global 11405 51 0 wc __ tw 0 sk 0
> [ 638.820000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 58373
> global 11276 48 0 wc __ tw -1 sk 0
> [ 641.260000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 57348
> global 10565 100 0 wc __ tw 0 sk 0
> [ 643.980000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 56324
> global 9788 103 0 wc __ tw -1 sk 0
> [ 646.120000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 55299
> global 8912 6 0 wc __ tw 0 sk 0
>
> second stall:
> [ 664.040000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48117
> global 2864 81 0 wc _M tw -13 sk 0
> [ 664.400000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47080
> global 1995 137 0 wc _M tw 176 sk 0
> [ 664.510000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 46232
> global 1929 267 0 wc __ tw 880 sk 0
> cron[6927]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> [ 809.560000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 49422
> global 19166 217 0 wc _M tw 380 sk 0
> [ 811.720000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48778
> global 17969 407 0 wc _M tw -4 sk 0
> [ 813.880000] mm/page-writeback.c 418 balance_dirty_pages:
> emerge(6113) 1537 global 16592 233 0 wc _M tw -1 sk 0
> [ 814.710000] mm/page-writeback.c 418 balance_dirty_pages: find(6931)
> 1537 global 16132 179 0 wc __ tw -1 sk 0
> [ 814.720000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47750
> global 16040 271 0 wc _M tw -1 sk 0
> [ 815.040000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 46725
> global 15403 779 0 wc CM tw 324 sk 0
>
> the third stall happend after the emerge was finished. There still was
> ~120Mb of dirty data, but its writeout got much slower over several
> seconds.
> vmstat 10:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 1 0 0 3096152 332 630424 0 0 81 1503 640 2771 5 4 91 0
> 0 0 0 3101024 332 631588 0 0 279 473 510 1281 5 2 92 1
> -> stall / slowdown starts
> 0 0 0 3147924 332 632384 0 0 78 626 449 1384 0 1 99 0
> 1 0 0 3147940 332 632384 0 0 0 611 388 1387 0 1 99 0
> 0 1 0 3147576 332 632384 0 0 0 939 449 1432 0 1 99 0
> 0 0 0 3145476 332 632384 0 0 0 3592 644 925 0 4 93 3
> -> writeout resumes full speed
> 0 0 0 3147232 332 632480 0 0 0 3108 678 1053 0 3 97 0
> 0 0 0 3146860 332 632480 0 0 0 2497 677 859 0 3 97 0
> 0 0 0 3146720 332 632480 0 0 0 2433 648 839 0 3 97 0
> 0 0 0 3147844 332 632484 0 0 0 2394 625 889 0 3 97 0
> 0 0 0 3148128 332 632484 0 0 0 2204 671 848 0 2 97 0
>
> from syslog:
> [ 848.070000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48084
> global 13805 0 0 wc _M tw 1008 sk 0
> [ 848.080000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48068
> global 13805 0 0 wc __ tw 1020 sk 0
> [ 884.090000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 61811
> global 30297 2 0 wc __ tw 862 sk 0
> [ 921.760000] mm/page-writeback.c 418 balance_dirty_pages: cat(7170)
> 1541 global 28113 391 0 wc __ tw -5 sk 0
> -> that cat was probably my watch cat /proc/meminfo
> -> during the stall there where no updates visible there
> [ 922.190000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 76871
> global 27735 0 0 wc __ tw -5 sk 0
> [ 923.550000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 75842
> global 26688 106 0 wc _M tw -1 sk 0
> [ 924.940000] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 74817
> global 25698 195 0 wc _M tw 0 sk 0
>
> Apart from my normal kde desktop (no compiz) and the emerge the system was idle.

Interestingly, no background_writeout() appears, but only
balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
block the process.

> If I see the complete stall again, I will post that too.

Thank you, could you run it with the attached new debug patch?

Fengguang


Attachments:
(No filename) (8.27 kB)
writeback-debug.patch (2.61 kB)
Download all attachments

2007-11-02 07:42:20

by Torsten Kaiser

[permalink] [raw]
Subject: Re: writeout stalls in current -git

The Subject is still misleading; I'm using 2.6.23-mm1.

On 11/2/07, Fengguang Wu <[email protected]> wrote:
> On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> > On 11/1/07, Fengguang Wu <[email protected]> wrote:
> > > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > > Each time I noticed this I was using emerge (package util from the
> > > > gentoo distribution) to install/upgrade a package. The last step,
> > > > where this hang occurred, is moving the prepared files from a tmpfs
> > > > partion to the main xfs filesystem.
> > > > The hangs where not fatal, after a few second everything resumed
> > > > normal, so I was not able to capture a good image of what was
> > > > happening.
> > >
> > > Thank you for the detailed report.
> > >
> > > How severe was the hangs? Only writeouts stalled, all apps stalled, or
> > > cannot type and run new commands?
> >
> > Only writeout stalled. The emerge that was moving the files hung, but
> > everything else worked normaly.
> > I was able to run new commands, like coping the /proc/meminfo.
>
> But you mentioned in the next mail that `watch cat /proc/meminfo`
> could also be blocked for some time - I guess in the same time emerge
> was stalled?

The behavior was different between these stalls.
In the first report the writeout stopped completely and the emerge
stopped, but at that time a cat /proc/meminfo >~/stall/meminfo did
succeed and did not stall.
About the watch cat /proc/meminfo, I will write in the answer to the
other mail...

> > [snip]
> > > > After this SysRq+W writeback resumed again. Possible that writing
> > > > above into the syslog triggered that.
> > >
> > > Maybe. Are the log files on another disk/partition?
> >
> > No, everything was going to /
> >
> > What might be interesting is, that doing cat /proc/meminfo
> > >~/stall/meminfo did not resume the writeback. So there might some
> > threshold that only was broken with the additional write from
> > syslog-ng. Or syslog-ng does some flushing, I dont now. (I'm using the
>
> Have you tried explicit `sync`? ;-)

No. I wanted to see what was stalled. So I started by collecting info
from /proc and then the SysRq+W. And after hitting SysRq the writeout
started to resume without any further action.

But I think I have seen a `sync` stall also. During another emerge I
noticed the system slowing down and wanted to use `sync` to speed up
the writeout. The result was that the writeout did not speed up
immediately, only after around a minute. The `sync` only returned at
that time.
Can writers starve `sync`?
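[One hedged way to test that question would be to time an explicit `sync` while a writer keeps dirtying pages. The path and size below are made-up; demonstrating real starvation would need a sustained writer on the affected filesystem.]

```shell
# Dirty a few hundred MB in the background, then see how long sync blocks.
dd if=/dev/zero of=/tmp/writeback-test bs=1M count=256 2>/dev/null &
writer=$!
time sync            # under heavy writeout this is where a stall would show up
wait "$writer"
rm -f /tmp/writeback-test
sync                 # flush the deletion so the test file is really gone
```

Comparing the reported `sync` time with and without the background writer would show whether concurrent writers delay it.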

> > syslog-ng package from gentoo:
> > http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> >
> > > > The source tmpfs is mounted with any special parameters, but the
> > > > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > > > disk RAID5 md.
> > > > During the hang all CPUs where idle.
> > >
> > > No iowaits? ;-)
> >
> > No, I have a KSysGuard in my taskbar that showed no activity at all.
> >
> > OK, the subject does not match for my case, but there was also a tmpfs
> > involved. And I found no thread with stalls on xfs. :-)
>
> Do you mean it is actually related with tmpfs?

I don't know. It's just that I have also seen tmpfs redirtying inodes
in these logs, and the stalling emerge is moving files from tmpfs to
xfs.
It could be, but I don't know enough about tmpfs internals to really be sure.
I just wanted to mention that tmpfs is involved somehow.

Torsten

2007-11-02 07:50:54

by Torsten Kaiser

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On 11/2/07, Fengguang Wu <[email protected]> wrote:
> On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
> > On 11/1/07, Torsten Kaiser <[email protected]> wrote:
> > > On 11/1/07, Fengguang Wu <[email protected]> wrote:
> > > > Thank you. Maybe we can start by the applied debug patch :-)
> > >
> > > Will applied it and try to recreate this.
> >
> > Patch applied, used emerge to install a 2.6.24-rc1 kernel.
> >
> > I had no complete stalls, but three times during the move from tmpfs
> > to the main xfs the emerge got noticeable slower. There still was
> > writeout happening, but as emerge prints out every file it has written
> > during the pause not one file was processed.
> >
> > vmstat 10:
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 0 1 0 3146424 332 614768 0 0 134 1849 438 2515 3 4 91 2
> > 0 0 0 3146644 332 614784 0 0 2 1628 507 646 0 2 85 13
> > 0 0 0 3146868 332 614868 0 0 5 2359 527 1076 0 3 97 0
> > 1 0 0 3144372 332 616148 0 0 96 2829 607 2666 2 5 92 0
> > -> normal writeout
> > 0 0 0 3140560 332 618144 0 0 152 2764 633 3308 3 6 91 0
> > 0 0 0 3137332 332 619908 0 0 114 1801 588 2858 3 4 93 0
> > 0 0 0 3136912 332 620136 0 0 20 827 393 1605 1 2 98 0
> > -> first stall
>
> 'stall': vmstat's output stalls for some time, or emerge stalls for
> the next several vmstat lines?

emerge stalls. The vmstat did work normally.

> > 0 0 0 3137088 332 620136 0 0 0 557 339 1437 0 1 99 0
> > 0 0 0 3137160 332 620136 0 0 0 642 310 1400 0 1 99 0

Meaning that these last three lines indicate that for ~30 seconds
the writeout was much slower than normal.

> > 0 0 0 3136588 332 620172 0 0 6 2972 527 1195 0 3 80 16
> > 0 0 0 3136276 332 620348 0 0 10 2668 558 1195 0 3 96 0
> > 0 0 0 3135228 332 620424 0 0 8 2712 522 1311 0 4 96 0
> > 0 0 0 3131740 332 621524 0 0 75 2935 559 2457 2 5 93 0
> > 0 0 0 3128348 332 622972 0 0 85 1470 490 2607 3 4 93 0
> > 0 0 0 3129292 332 622972 0 0 0 527 353 1398 0 1 99 0
> > -> second longer stall
> > 0 0 0 3128520 332 623028 0 0 6 488 249 1390 0 1 99 0
> > 0 0 0 3128236 332 623028 0 0 0 482 222 1222 0 1 99 0
> > 0 0 0 3128408 332 623028 0 0 0 585 269 1301 0 0 99 0
> > 0 0 0 3128532 332 623028 0 0 0 610 262 1278 0 0 99 0
> > 0 0 0 3128568 332 623028 0 0 0 636 345 1639 0 1 99 0
> > 0 0 0 3129032 332 623040 0 0 1 664 337 1466 0 1 99 0
> > 0 0 0 3129484 332 623040 0 0 0 658 300 1508 0 0 100 0
> > 0 0 0 3129576 332 623040 0 0 0 562 271 1454 0 1 99 0
> > 0 0 0 3129736 332 623040 0 0 0 627 278 1406 0 1 99 0
> > 0 0 0 3129368 332 623040 0 0 0 507 274 1301 0 1 99 0
> > 0 0 0 3129004 332 623040 0 0 0 444 211 1213 0 0 99 0

The second time the slowdown was much longer.

> > 0 1 0 3127260 332 623040 0 0 0 1036 305 1242 0 1 95 4
> > 0 0 0 3126280 332 623128 0 0 7 4241 555 1575 1 5 84 10
> > 0 0 0 3124948 332 623232 0 0 6 4194 529 1505 1 4 95 0
> > 0 0 0 3125228 332 624168 0 0 58 1966 586 1964 2 4 94 0
> > -> emerge resumed to normal speed, without any intervention from my side
> > 0 0 0 3120932 332 625904 0 0 112 1546 546 2565 3 4 93 0
> > 0 0 0 3118012 332 627568 0 0 128 1542 612 2705 3 4 93 0
>
> Interesting, the 'bo' never falls to zero.

Yes, I was not able to recreate the complete stall from the first
mail, but even this slowdown does not look completely healthy.
I "hope" this is the same bug, as I seem to be able to trigger this
slowdown much more easily.

[snip logs]
>
> Interestingly, no background_writeout() appears, but only
> balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> block the process.

Yes, I noticed that too.
The only time I have seen background_writeout was during bootup and shutdown.

As for the stalled watch cat /proc/meminfo: that happened on the
third slowdown/stall, when emerge was already finished.

> > If I see the complete stall again, I will post that too.
>
> Thank you, could you run it with the attached new debug patch?

I will, but it will have to wait until the evening.

Torsten

2007-11-02 07:52:40

by Wu Fengguang

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On Fri, Nov 02, 2007 at 08:42:05AM +0100, Torsten Kaiser wrote:
> The Subject is still missleading, I'm using 2.6.23-mm1.
>
> On 11/2/07, Fengguang Wu <[email protected]> wrote:
> > On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> > > On 11/1/07, Fengguang Wu <[email protected]> wrote:
> > > > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > > > Since 2.6.23-mm1 I also experience strange hangs during heavy writeouts.
> > > > > Each time I noticed this I was using emerge (package util from the
> > > > > gentoo distribution) to install/upgrade a package. The last step,
> > > > > where this hang occurred, is moving the prepared files from a tmpfs
> > > > > partion to the main xfs filesystem.
> > > > > The hangs where not fatal, after a few second everything resumed
> > > > > normal, so I was not able to capture a good image of what was
> > > > > happening.
> > > >
> > > > Thank you for the detailed report.
> > > >
> > > > How severe was the hangs? Only writeouts stalled, all apps stalled, or
> > > > cannot type and run new commands?
> > >
> > > Only writeout stalled. The emerge that was moving the files hung, but
> > > everything else worked normaly.
> > > I was able to run new commands, like coping the /proc/meminfo.
> >
> > But you mentioned in the next mail that `watch cat /proc/meminfo`
> > could also be blocked for some time - I guess in the same time emerge
> > was stalled?
>
> The behavior was different on these stalls.
> On first report the writeout stopped completly, the emerge stopped,
> but at that time a cat /proc/meminfo >~/stall/meminfo did succedd and
> not stall.
> About the watch cat /proc/meminfo, I will write in the answer to the
> other mail...

OK.

> > > [snip]
> > > > > After this SysRq+W writeback resumed again. Possible that writing
> > > > > above into the syslog triggered that.
> > > >
> > > > Maybe. Are the log files on another disk/partition?
> > >
> > > No, everything was going to /
> > >
> > > What might be interesting is, that doing cat /proc/meminfo
> > > >~/stall/meminfo did not resume the writeback. So there might some
> > > threshold that only was broken with the additional write from
> > > syslog-ng. Or syslog-ng does some flushing, I dont now. (I'm using the
> >
> > Have you tried explicit `sync`? ;-)
>
> No. I wanted to see what is stalled. So I startet by collecting info
> from /proc and then the SysRq+W. And after hitting SysRQ the writeout
> started to resume without any further action.
>
> But I think I have seen a `sync` stall also. During an other emerge I
> noticed the system slowing down and wanted to use `sync` to speed up
> the writeout. The result was, that the writeout did not speed up
> imiedetly only after around a minitue. The `sync` only returned at
> that time.
> Can writers starve `sync`?

I guess the new debug printks will provide more hints on it.

> > > syslog-ng package from gentoo:
> > > http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> > >
> > > > > The source tmpfs is mounted with any special parameters, but the
> > > > > target xfs filesystem resides on a dm-crypt device that is on top a 3
> > > > > disk RAID5 md.
> > > > > During the hang all CPUs where idle.
> > > >
> > > > No iowaits? ;-)
> > >
> > > No, I have a KSysGuard in my taskbar that showed no activity at all.
> > >
> > > OK, the subject does not match for my case, but there was also a tmpfs
> > > involved. And I found no thread with stalls on xfs. :-)
> >
> > Do you mean it is actually related with tmpfs?
>
> I don't know. It's just that I have seen tmpfs also redirtieing inodes
> in these logs and the stalling emerge is moving files from tmpfs to
> xfs.
> It could be, but I don't know enough about tmpfs internals to really be sure.
> I just wanted to mention, that tmpfs is involved somehow.

The requeue messages for tmpfs are not pleasant, but known to be fine ;-)

Fengguang

2007-11-02 10:15:58

by Peter Zijlstra

Subject: Re: writeout stalls in current -git

On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:

> Interestingly, no background_writeout() appears, but only
> balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> block the process.

Yeah, the background threshold is not (yet) scaled. So it can happen
that the bdi_dirty limit is below the background limit.

I'm curious about these stalls, though; I can't seem to think of
what goes wrong.. esp since most writeback seems to happen from pdflush.

(or I'm totally misreading it - quite possible as I'm still recovering
from a serious cold and not all the green stuff has yet figured out its
proper place wrt brain cells 'n stuff)


I still have this patch floating around:

---
Subject: mm: speed up writeback ramp-up on clean systems

We allow violation of bdi limits if there is a lot of room on the
system. Once we hit half the total limit we start enforcing bdi limits
and bdi ramp-up should happen. Doing it this way avoids many small
writeouts on an otherwise idle system and should also speed up the
ramp-up.

Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/page-writeback.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2007-09-28 10:08:33.937415368 +0200
+++ linux-2.6/mm/page-writeback.c 2007-09-28 10:54:26.018247516 +0200
@@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
*/
static void balance_dirty_pages(struct address_space *mapping)
{
- long bdi_nr_reclaimable;
- long bdi_nr_writeback;
+ long nr_reclaimable, bdi_nr_reclaimable;
+ long nr_writeback, bdi_nr_writeback;
long background_thresh;
long dirty_thresh;
long bdi_thresh;
@@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a

get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
+
+ nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS);
+ nr_writeback = global_page_state(NR_WRITEBACK);
+
bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+
if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
break;

+ /*
+ * Throttle it only when the background writeback cannot
+ * catch-up. This avoids (excessively) small writeouts
+ * when the bdi limits are ramping up.
+ */
+ if (nr_reclaimable + nr_writeback <
+ (background_thresh + dirty_thresh) / 2)
+ break;
+
if (!bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;



2007-11-02 10:33:42

by Wu Fengguang

Subject: Re: writeout stalls in current -git

On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
>
> > Interestingly, no background_writeout() appears, but only
> > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > block the process.
>
> Yeah, the background threshold is not (yet) scaled. So it can happen
> that the bdi_dirty limit is below the background limit.
>
> I'm curious about these stalls, though; I can't seem to think of
> what goes wrong.. esp since most writeback seems to happen from pdflush.

Me confused too. The new debug patch will confirm whether emerge is
waiting in balance_dirty_pages().

> (or I'm totally misreading it - quite possible as I'm still recovering
> from a serious cold and not all the green stuff has yet figured out its
> proper place wrt brain cells 'n stuff)

Do take care of yourself.

>
> I still have this patch floating around:

I think this patch is OK for 2.6.24 :-)

Reviewed-by: Fengguang Wu <[email protected]>

>
> ---
> Subject: mm: speed up writeback ramp-up on clean systems
>
> We allow violation of bdi limits if there is a lot of room on the
> system. Once we hit half the total limit we start enforcing bdi limits
> and bdi ramp-up should happen. Doing it this way avoids many small
> writeouts on an otherwise idle system and should also speed up the
> ramp-up.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
>
> ---
> mm/page-writeback.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c 2007-09-28 10:08:33.937415368 +0200
> +++ linux-2.6/mm/page-writeback.c 2007-09-28 10:54:26.018247516 +0200
> @@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
> */
> static void balance_dirty_pages(struct address_space *mapping)
> {
> - long bdi_nr_reclaimable;
> - long bdi_nr_writeback;
> + long nr_reclaimable, bdi_nr_reclaimable;
> + long nr_writeback, bdi_nr_writeback;
> long background_thresh;
> long dirty_thresh;
> long bdi_thresh;
> @@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
>
> get_dirty_limits(&background_thresh, &dirty_thresh,
> &bdi_thresh, bdi);
> +
> + nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> + global_page_state(NR_UNSTABLE_NFS);
> + nr_writeback = global_page_state(NR_WRITEBACK);
> +
> bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> +
> if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> break;
>
> + /*
> + * Throttle it only when the background writeback cannot
> + * catch-up. This avoids (excessively) small writeouts
> + * when the bdi limits are ramping up.
> + */
> + if (nr_reclaimable + nr_writeback <
> + (background_thresh + dirty_thresh) / 2)
> + break;
> +
> if (!bdi->dirty_exceeded)
> bdi->dirty_exceeded = 1;
>
>
>

2007-11-02 17:47:23

by Torsten Kaiser

Subject: Re: writeout stalls in current -git

On 11/2/07, Fengguang Wu <[email protected]> wrote:
> I guess the new debug printks will provide more hints on it.

The "throttle_vm_writeout" did not trigger for my new workload.
Except for one (the first), the "balance_dirty_pages" output came from
line 445, the newly added one.

But I found another workload that looks much more ... hmm ... 'mad'.

If I do an unmerge, the emerge program will read all files to
revalidate their checksums and then delete them. If I do this unmerge,
the progress of emerge stalls periodically for ~47 seconds. (Two times
I used a stopwatch to get this value. I think all other stalls were
identical; at least in KSysGuard they looked evenly spaced.)

What really counts as 'mad' is this output from vmstat 10:
0 0 0 3639044 332 177420 0 0 292 20 101 618 1 1 98 0
1 0 0 3624068 332 180628 0 0 323 22 137 663 5 2 93 0
0 0 0 3602456 332 183972 0 0 301 23 159 641 9 3 87 2
-> this was emerge collecting its package database
0 0 0 3600052 332 184264 0 0 19 7743 823 5543 3 8 89 0
0 0 0 3599332 332 184280 0 0 1 2532 517 2341 1 2 97 0
-> normal removing, now the emerge stalls
0 0 0 3599404 332 184280 0 0 0 551 323 1290 0 0 99 0
0 0 0 3599648 332 184280 0 0 0 644 314 1222 0 1 99 0
0 0 0 3599648 332 184284 0 0 0 569 296 1242 0 0 99 0
0 0 0 3599868 332 184288 0 0 0 2362 320 2735 1 2 97 0
-> resumes for a short time, then stalls again
0 0 0 3599488 332 184288 0 0 0 584 292 1395 0 0 99 0
0 0 0 3600216 332 184288 0 0 0 550 301 1361 0 0 99 0
0 0 0 3594176 332 184296 0 0 0 562 300 1373 2 1 97 0
0 0 0 3594648 332 184296 0 0 0 1278 336 1881 1 1 98 0
0 0 0 3594172 332 184308 0 0 1 2812 421 2840 1 4 95 0
-> and again
0 0 0 3594296 332 184308 0 0 0 545 342 1283 0 0 99 0
0 0 0 3594376 332 184308 0 0 0 561 319 1314 0 1 99 0
0 0 0 3594340 332 184308 0 0 0 586 327 1258 0 1 99 0
0 0 0 3594644 332 184308 0 0 0 498 248 1376 0 0 99 0
0 0 0 3595116 332 184348 0 0 0 3519 565 3452 2 4 95 0
-> and again
0 0 0 3595320 332 184348 0 0 0 483 284 1163 0 0 99 0
3 0 0 3595444 332 184352 0 0 0 498 247 1173 3 0 97 0
1 0 0 3585108 332 184600 0 0 0 1298 644 2394 1 1 98 0
1 0 0 3588152 332 184608 0 0 0 3154 520 3221 2 4 94 0
-> and again
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 3588540 332 184608 0 0 0 574 268 1332 0 1 99 0
1 0 0 3588744 332 184608 0 0 0 546 335 1289 0 0 99 0
1 0 0 3588628 332 184608 0 0 0 638 348 1257 0 1 99 0
1 0 0 3588952 332 184608 0 0 0 567 310 1226 0 1 99 0
1 0 0 3603644 332 184972 0 0 59 2821 531 2419 3 4 91 1
1 0 0 3649476 332 186272 0 0 370 395 380 1335 1 1 98 0
-> emerge finishes, and now the system goes 'mad'
The Dirty: line from /proc/meminfo stays at 8 or 12 kB, but the
system is writing like 'mad':
1 0 0 3650616 332 186276 0 0 0 424 296 1126 0 1 99 0
1 0 0 3650708 332 186276 0 0 0 418 249 1190 0 0 99 0
1 0 0 3650716 332 186276 0 0 0 418 256 1151 0 1 99 0
1 0 0 3650816 332 186276 0 0 0 420 257 1120 0 0 99 0
1 0 0 3651132 332 186276 0 0 0 418 269 1145 0 0 99 0
1 0 0 3651332 332 186280 0 0 0 419 294 1099 0 1 99 0
1 0 0 3651732 332 186280 0 0 0 423 311 1072 0 1 99 0
1 0 0 3652048 332 186280 0 0 0 400 317 1127 0 0 99 0
1 0 0 3652024 332 186280 0 0 0 426 346 1066 0 1 99 0
2 0 0 3652304 332 186280 0 0 0 425 357 1132 0 1 99 0
2 0 0 3652652 332 186280 0 0 0 416 364 1184 0 0 99 0
1 0 0 3652836 332 186280 0 0 0 413 397 1110 0 1 99 0
1 0 0 3652852 332 186284 0 0 0 426 427 1290 0 1 99 0
1 0 0 3652060 332 186420 0 0 14 404 421 1768 1 1 97 0
1 0 0 3652904 332 186420 0 0 0 418 437 1792 1 1 98 0
1 0 0 3653572 332 186420 0 0 0 410 442 1481 1 1 99 0
2 0 0 3653872 332 186420 0 0 0 410 451 1206 0 1 99 0
3 0 0 3654572 332 186420 0 0 0 414 479 1341 0 1 99 0
1 0 0 3651720 332 189832 0 0 341 420 540 1600 1 1 98 1
1 0 0 3653256 332 189832 0 0 0 411 499 1538 1 1 98 0
1 0 0 3654268 332 189832 0 0 0 428 505 1281 0 1 99 0
1 0 0 3655328 332 189832 0 0 0 394 532 1015 0 1 99 0
2 0 0 3655804 332 189832 0 0 0 355 546 964 0 1 99 0
1 0 0 3656804 332 189836 0 0 0 337 527 949 0 1 99 0
1 0 0 3658020 332 189836 0 0 0 348 522 937 0 1 99 0
1 0 0 3659992 332 189836 0 0 0 354 503 1078 0 1 99 0
1 0 0 3660068 332 189836 0 0 0 69 341 356 0 0 99 0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
3 0 0 3660208 332 189836 0 0 0 18 311 236 0 0 99 0
2 0 0 3660028 332 189836 0 0 0 1 297 210 0 0 100 0
... until it stops.
I tried this a second time; the same happened again.
Neither SysRq+S nor `sync` will stop this after-finish writeout.
During the unmerges I never saw more than 300 kB of dirty data,
but as watch only updated once every 2 seconds that is not really a
hard limit, just what I was able to see.

There was nothing else accessing the disks, only kcryptd, md1_raid5,
pdflush and emerge showed up with minimal cpu time in top / atop.

Before/during emerge stall:
[ 360.920000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 30759
global 2 0 0 wc __ tw 1023 sk 0
[ 364.910000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 30759
global 2 0 0 wc __ tw 1023 sk 0
[ 369.530000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 30759
global 2 0 0 wc __ tw 1024 sk 0
[ 374.560000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 30386
global 3 0 0 wc __ tw 1024 sk 0
[ 379.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28684
global 3 0 0 wc __ tw 1024 sk 0
[ 384.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28684
global 3 0 0 wc __ tw 1024 sk 0
[ 389.660000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28684
global 3 0 0 wc __ tw 1024 sk 0
[ 394.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28684
global 3 0 0 wc _M tw 1023 sk 0
[ 394.620000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28683
global 3 0 0 wc __ tw 1023 sk 0
[ 399.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28683
global 2 0 0 wc __ tw 1023 sk 0
[ 404.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28683
global 2 0 0 wc __ tw 1024 sk 0

At this point the stall was definitely happening, as I then hit SysRq+W:
SysRq : Show Blocked State
task PC stack pid father
xfssyncd D 0000000000000000 0 1040 2
ffff810006177b60 0000000000000046 0000000000000000 0000007000000001
0000000000000c31 0000000000000000 ffffffff80819b00 ffffffff80819b00
ffffffff80815f40 ffffffff80819b00 ffff810006177b20 ffff810006177b10
Call Trace:
[<ffffffff805b16a7>] __down+0xa7/0x11e
[<ffffffff8022da70>] default_wake_function+0x0/0x10
[<ffffffff805b1325>] __down_failed+0x35/0x3a
[<ffffffff8037528e>] xfs_buf_lock+0x3e/0x40
[<ffffffff803773ce>] _xfs_buf_find+0x13e/0x240
[<ffffffff8037753f>] xfs_buf_get_flags+0x6f/0x190
[<ffffffff80377672>] xfs_buf_read_flags+0x12/0xa0
[<ffffffff803687e4>] xfs_trans_read_buf+0x64/0x340
[<ffffffff80352321>] xfs_itobp+0x81/0x1e0
[<ffffffff8022da70>] default_wake_function+0x0/0x10
[<ffffffff80354cce>] xfs_iflush+0xfe/0x520
[<ffffffff8036d48f>] xfs_finish_reclaim+0x15f/0x1c0
[<ffffffff8036d5bb>] xfs_finish_reclaim_all+0xcb/0xf0
[<ffffffff8036b608>] xfs_syncsub+0x68/0x300
[<ffffffff8037cbe7>] xfs_sync_worker+0x17/0x40
[<ffffffff8037cea2>] xfssyncd+0x142/0x1d0
[<ffffffff8037cd60>] xfssyncd+0x0/0x1d0
[<ffffffff8024a32b>] kthread+0x4b/0x80
[<ffffffff8020c9d8>] child_rip+0xa/0x12
[<ffffffff80219bd0>] lapic_next_event+0x0/0x10
[<ffffffff8024a2e0>] kthread+0x0/0x80
[<ffffffff8020c9ce>] child_rip+0x0/0x12

emerge D ffff81010901b308 0 6130 6116
ffff81000c5939e8 0000000000000086 0000000000000000 ffff81000614ff80
ffff8101089dd7f0 ffffffff8022d61c ffffffff80819b00 ffffffff80819b00
ffffffff80815f40 ffffffff80819b00 0000000000000086 ffffffff8022d7f3
Call Trace:
[<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
[<ffffffff8022d7f3>] try_to_wake_up+0x63/0x2e0
[<ffffffff805b16a7>] __down+0xa7/0x11e
[<ffffffff8022da70>] default_wake_function+0x0/0x10
[<ffffffff805b1325>] __down_failed+0x35/0x3a
[<ffffffff8037528e>] xfs_buf_lock+0x3e/0x40
[<ffffffff803773ce>] _xfs_buf_find+0x13e/0x240
[<ffffffff8037753f>] xfs_buf_get_flags+0x6f/0x190
[<ffffffff80377672>] xfs_buf_read_flags+0x12/0xa0
[<ffffffff803687e4>] xfs_trans_read_buf+0x64/0x340
[<ffffffff80352321>] xfs_itobp+0x81/0x1e0
[<ffffffff80375bae>] xfs_buf_rele+0x2e/0xd0
[<ffffffff80354cce>] xfs_iflush+0xfe/0x520
[<ffffffff803ae592>] __down_read_trylock+0x42/0x60
[<ffffffff80355c42>] xfs_inode_item_push+0x12/0x20
[<ffffffff80368207>] xfs_trans_push_ail+0x267/0x2b0
[<ffffffff8035c702>] xfs_log_reserve+0x72/0x120
[<ffffffff80366bb8>] xfs_trans_reserve+0xa8/0x210
[<ffffffff803525fb>] xfs_itruncate_finish+0xfb/0x310
[<ffffffff80372364>] xfs_inactive+0x364/0x490
[<ffffffff8037c834>] xfs_fs_clear_inode+0xa4/0xf0
[<ffffffff802a8736>] clear_inode+0x66/0x150
[<ffffffff802a899c>] generic_delete_inode+0x12c/0x140
[<ffffffff8029e93d>] do_unlinkat+0x14d/0x1e0
[<ffffffff8020bbbe>] system_call+0x7e/0x83

Next debug outputs:
[ 410.310000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28685
global 4 0 0 wc __ tw 1024 sk 0
[ 414.600000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28685
global 4 0 0 wc __ tw 1024 sk 0
[ 419.620000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 28137
global 4 0 0 wc __ tw 1024 sk 0
[ 424.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25243
global 4 0 0 wc __ tw 1024 sk 0
[ 429.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25243
global 4 0 0 wc _M tw 1021 sk 0
[ 429.640000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25240
global 4 0 0 wc __ tw 1023 sk 0
[ 434.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241
global 2 0 0 wc __ tw 1024 sk 0
[ 439.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241
global 2 0 0 wc __ tw 1024 sk 0
[ 444.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241
global 2 0 0 wc __ tw 1024 sk 0
[ 449.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241
global 2 0 0 wc __ tw 1024 sk 0
[ 455.840000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241
global 2 0 0 wc __ tw 1024 sk 0
[ 459.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241
global 2 0 0 wc __ tw 1022 sk 0
[ 464.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241
global 2 0 0 wc __ tw 1024 sk 0
[ 469.720000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 25241
global 2 0 0 wc __ tw 1024 sk 0
[ 475.040000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 22342
global 2 0 0 wc __ tw 1024 sk 0
[ 480.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772
global 2 0 0 wc __ tw 1024 sk 0
[ 485.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772
global 2 0 0 wc __ tw 1024 sk 0
[ 490.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772
global 2 0 0 wc __ tw 1022 sk 0
[ 495.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772
global 2 0 0 wc __ tw 1024 sk 0
[ 500.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21774
global 3 0 0 wc __ tw 1024 sk 0
[ 506.580000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21774
global 3 0 0 wc __ tw 1024 sk 0
[ 510.760000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21774
global 3 0 0 wc __ tw 1024 sk 0
[ 515.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21835
global 65 0 0 wc __ tw 1024 sk 0
[ 520.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21835
global 65 0 0 wc __ tw 1024 sk 0
[ 525.060000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21835
global 9 56 0 wc _M tw 961 sk 0
[ 525.080000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21772
global 9 56 0 wc _M tw 1023 sk 0
[ 525.100000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21771
global 9 56 0 wc _M tw 1023 sk 0
[ 525.110000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21770
global 9 56 0 wc _M tw 1024 sk 0
[ 525.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21770
global 9 56 0 wc _M tw 1024 sk 0
[ 525.160000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21770
global 9 56 0 wc _M tw 1024 sk 0
[ 525.170000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21770
global 9 56 0 wc _M tw 1023 sk 0
[ 525.170000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21769
global 9 28 0 wc _M tw 1023 sk 0
[ 525.190000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21768
global 9 28 0 wc _M tw 1024 sk 0
[ 525.200000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21768
global 9 28 0 wc _M tw 1024 sk 0
[ 525.210000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21768
global 9 28 0 wc _M tw 1024 sk 0
[ 525.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 21768
global 9 28 0 wc __ tw 1023 sk 0
[ 530.080000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 19499
global 2 0 0 wc __ tw 1024 sk 0
[ 535.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676
global 2 0 0 wc __ tw 1024 sk 0
[ 540.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676
global 2 0 0 wc __ tw 1024 sk 0
[ 545.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676
global 2 0 0 wc __ tw 1024 sk 0
[ 550.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676
global 2 0 0 wc __ tw 1024 sk 0
[ 555.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676
global 2 0 0 wc __ tw 1024 sk 0
[ 561.990000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676
global 1 0 0 wc __ tw 1022 sk 0
[ 566.020000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676
global 2 0 0 wc __ tw 1024 sk 0
[ 570.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676
global 2 0 0 wc __ tw 1024 sk 0
[ 575.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 18676
global 2 0 0 wc __ tw 1024 sk 0
[ 580.170000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 8244
global 3 0 0 wc __ tw 1024 sk 0
[ 585.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 8695
global 8 0 0 wc __ tw 1024 sk 0
[ 590.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10161
global 8 0 0 wc __ tw 1024 sk 0
[ 595.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10161
global 8 0 0 wc _M tw 1020 sk 0
[ 595.240000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10157
global 8 0 0 wc __ tw 1023 sk 0
[ 600.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10159
global 6 0 0 wc __ tw 1024 sk 0
[ 605.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10159
global 6 0 0 wc __ tw 1024 sk 0
[ 610.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10159
global 6 0 0 wc __ tw 1024 sk 0
[ 615.230000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10159
global 6 0 0 wc __ tw 1020 sk 0
[ 620.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 625.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1023 sk 0
[ 630.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1023 sk 0
[ 635.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 640.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 645.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 650.350000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 655.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10156
global 3 0 0 wc __ tw 1024 sk 0
[ 660.290000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10156
global 3 0 0 wc _M tw 1023 sk 0
[ 660.300000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 3 0 0 wc _M tw 1023 sk 0
[ 660.310000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154
global 3 1 0 wc _M tw 1024 sk 0
[ 660.330000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154
global 3 1 0 wc _M tw 1024 sk 0
[ 660.350000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154
global 3 1 0 wc _M tw 1024 sk 0
[ 660.360000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154
global 3 1 0 wc _M tw 1024 sk 0
[ 660.370000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154
global 3 1 0 wc _M tw 1024 sk 0
[ 660.380000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10154
global 3 1 0 wc __ tw 1023 sk 0
[ 665.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1023 sk 0
[ 670.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 675.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 680.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 685.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 690.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 695.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1023 sk 0
[ 700.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1023 sk 0
[ 705.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 710.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0
[ 715.320000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 10155
global 2 0 0 wc __ tw 1024 sk 0

I'm not sure when emerge finished here...

Second unmerge:
[ 1177.110000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 16604
global 2 0 0 wc __ tw 1023 sk 0
[ 1182.110000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 16604
global 2 0 0 wc __ tw 1024 sk 0
[ 1187.130000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 15310
global 2 0 0 wc __ tw 1024 sk 0
[ 1192.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13335
global 2 0 0 wc __ tw 1024 sk 0
-> SysRq+W during one emerge stall
[ 1194.530000] SysRq : Show Blocked State
[ 1194.530000] task PC stack pid father
[ 1194.540000] xfssyncd D ffff8101065798f8 0 1040 2
[ 1194.540000] ffff810006177d28 0000000000000046 0000000000000000
ffff81010904ae80
[ 1194.550000] ffff81010904ae80 0000000000000001 ffffffff80819b00
ffffffff80819b00
[ 1194.560000] ffffffff80815f40 ffffffff80819b00 ffffffff8039d996
0000000000000000
[ 1194.570000] Call Trace:
[ 1194.570000] [<ffffffff8039d996>] submit_bio+0x66/0xf0
[ 1194.570000] [<ffffffff80375952>] _xfs_buf_ioapply+0x222/0x320
[ 1194.580000] [<ffffffff805b16a7>] __down+0xa7/0x11e
[ 1194.590000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
[ 1194.590000] [<ffffffff80376ad5>] xfs_buf_iostart+0x65/0x90
[ 1194.600000] [<ffffffff805b1325>] __down_failed+0x35/0x3a
[ 1194.600000] [<ffffffff8034f34b>] xfs_iflock+0x1b/0x20
[ 1194.600000] [<ffffffff8036d4d0>] xfs_finish_reclaim+0x1a0/0x1c0
[ 1194.600000] [<ffffffff8036d5bb>] xfs_finish_reclaim_all+0xcb/0xf0
[ 1194.600000] [<ffffffff8036b608>] xfs_syncsub+0x68/0x300
[ 1194.600000] [<ffffffff8037cbe7>] xfs_sync_worker+0x17/0x40
[ 1194.600000] [<ffffffff8037cea2>] xfssyncd+0x142/0x1d0
[ 1194.600000] [<ffffffff8037cd60>] xfssyncd+0x0/0x1d0
[ 1194.600000] [<ffffffff8024a32b>] kthread+0x4b/0x80
[ 1194.600000] [<ffffffff8020c9d8>] child_rip+0xa/0x12
[ 1194.600000] [<ffffffff80219bd0>] lapic_next_event+0x0/0x10
[ 1194.600000] [<ffffffff8024a2e0>] kthread+0x0/0x80
[ 1194.600000] [<ffffffff8020c9ce>] child_rip+0x0/0x12
[ 1194.600000]
[ 1194.600000] emerge D 0000000000000000 0 6742 6116
[ 1194.600000] ffff81000cc4d9e8 0000000000000086 0000000000000000
0000007000000001
[ 1194.600000] 0000000000000818 ffffffff00000000 ffffffff80819b00
ffffffff80819b00
[ 1194.600000] ffffffff80815f40 ffffffff80819b00 ffff81000cc4d9a8
ffff81000cc4d998
[ 1194.600000] Call Trace:
[ 1194.600000] [<ffffffff805b16a7>] __down+0xa7/0x11e
[ 1194.600000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
[ 1194.600000] [<ffffffff805b1325>] __down_failed+0x35/0x3a
[ 1194.600000] [<ffffffff8037528e>] xfs_buf_lock+0x3e/0x40
[ 1194.600000] [<ffffffff803773ce>] _xfs_buf_find+0x13e/0x240
[ 1194.600000] [<ffffffff8037753f>] xfs_buf_get_flags+0x6f/0x190
[ 1194.600000] [<ffffffff80377672>] xfs_buf_read_flags+0x12/0xa0
[ 1194.600000] [<ffffffff803687e4>] xfs_trans_read_buf+0x64/0x340
[ 1194.600000] [<ffffffff80352321>] xfs_itobp+0x81/0x1e0
[ 1194.600000] [<ffffffff80375bae>] xfs_buf_rele+0x2e/0xd0
[ 1194.600000] [<ffffffff80354cce>] xfs_iflush+0xfe/0x520
[ 1194.600000] [<ffffffff803ae592>] __down_read_trylock+0x42/0x60
[ 1194.600000] [<ffffffff80355c42>] xfs_inode_item_push+0x12/0x20
[ 1194.600000] [<ffffffff80368207>] xfs_trans_push_ail+0x267/0x2b0
[ 1194.600000] [<ffffffff8035c702>] xfs_log_reserve+0x72/0x120
[ 1194.600000] [<ffffffff80366bb8>] xfs_trans_reserve+0xa8/0x210
[ 1194.600000] [<ffffffff803525fb>] xfs_itruncate_finish+0xfb/0x310
[ 1194.600000] [<ffffffff80372364>] xfs_inactive+0x364/0x490
[ 1194.600000] [<ffffffff8037c834>] xfs_fs_clear_inode+0xa4/0xf0
[ 1194.600000] [<ffffffff802a8736>] clear_inode+0x66/0x150
[ 1194.600000] [<ffffffff802a899c>] generic_delete_inode+0x12c/0x140
[ 1194.600000] [<ffffffff8029e93d>] do_unlinkat+0x14d/0x1e0
[ 1194.600000] [<ffffffff8020bbbe>] system_call+0x7e/0x83
[ 1194.600000]
[ 1197.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13337
global 4 0 0 wc __ tw 1024 sk 0
[ 1202.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13337
global 4 0 0 wc __ tw 1024 sk 0
[ 1207.150000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13337
global 4 0 0 wc _M tw 1021 sk 0
[ 1207.240000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13334
global 4 0 0 wc _M tw 1023 sk 0
[ 1207.260000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 13333
global 4 0 0 wc __ tw 1023 sk 0
...
After emerge finished:
[ 1322.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11163
global 3 0 0 wc _M tw 1022 sk 0
[ 1322.650000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11161
global 3 0 0 wc __ tw 1023 sk 0
[ 1327.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162
global 2 0 0 wc __ tw 1024 sk 0
[ 1332.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162
global 2 0 0 wc __ tw 1024 sk 0
[ 1337.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162
global 2 0 0 wc __ tw 1024 sk 0
[ 1342.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162
global 2 0 0 wc __ tw 1024 sk 0
[ 1347.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11162
global 2 0 0 wc __ tw 1024 sk 0
-> After emerge finishes, xfssyncd seems to be the only blocked process.
Does this process do the continuing writeout?
[ 1351.880000] SysRq : Show Blocked State
[ 1351.880000] task PC stack pid father
[ 1351.880000] xfssyncd D ffff810104f0f6f8 0 1040 2
[ 1351.880000] ffff810006177d28 0000000000000046 0000000000000000
ffff810101359380
[ 1351.880000] ffff810101359380 0000000000000001 ffffffff80819b00
ffffffff80819b00
[ 1351.880000] ffffffff80815f40 ffffffff80819b00 ffffffff8039d996
0000000000000000
[ 1351.880000] Call Trace:
[ 1351.880000] [<ffffffff8039d996>] submit_bio+0x66/0xf0
[ 1351.880000] [<ffffffff80375952>] _xfs_buf_ioapply+0x222/0x320
[ 1351.880000] [<ffffffff805b16a7>] __down+0xa7/0x11e
[ 1351.880000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
[ 1351.880000] [<ffffffff80376ad5>] xfs_buf_iostart+0x65/0x90
[ 1351.880000] [<ffffffff805b1325>] __down_failed+0x35/0x3a
[ 1351.880000] [<ffffffff8034f34b>] xfs_iflock+0x1b/0x20
[ 1351.880000] [<ffffffff8036d4d0>] xfs_finish_reclaim+0x1a0/0x1c0
[ 1351.880000] [<ffffffff8036d5bb>] xfs_finish_reclaim_all+0xcb/0xf0
[ 1351.880000] [<ffffffff8036b608>] xfs_syncsub+0x68/0x300
[ 1351.880000] [<ffffffff8037cbe7>] xfs_sync_worker+0x17/0x40
[ 1351.880000] [<ffffffff8037cea2>] xfssyncd+0x142/0x1d0
[ 1351.880000] [<ffffffff8037cd60>] xfssyncd+0x0/0x1d0
[ 1351.880000] [<ffffffff8024a32b>] kthread+0x4b/0x80
[ 1351.880000] [<ffffffff8020c9d8>] child_rip+0xa/0x12
[ 1351.880000] [<ffffffff80219bd0>] lapic_next_event+0x0/0x10
[ 1351.880000] [<ffffffff8024a2e0>] kthread+0x0/0x80
[ 1351.880000] [<ffffffff8020c9ce>] child_rip+0x0/0x12
[ 1351.880000]
[ 1352.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11163
global 3 0 0 wc __ tw 1024 sk 0
[ 1357.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11216
global 3 0 0 wc _M tw 1022 sk 0
[ 1357.650000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11214
global 3 0 0 wc _M tw 1023 sk 0
[ 1357.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11213
global 1 3 0 wc _M tw 1024 sk 0
[ 1357.690000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11213
global 1 3 0 wc _M tw 1024 sk 0
[ 1357.700000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11213
global 1 3 0 wc __ tw 1023 sk 0
[ 1362.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11285
global 2 0 0 wc __ tw 1024 sk 0
[ 1367.630000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11289
global 2 0 0 wc __ tw 1024 sk 0
[ 1372.650000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11289
global 2 0 0 wc __ tw 1024 sk 0
-> Here I am trying SysRq+S to stop/finish the continued writeout of
8kB dirty data, but the disks were still working after that...
[ 1375.860000] SysRq : Emergency Sync
[ 1375.860000] mm/page-writeback.c 587 background_writeout:
pdflush(284) 0 global 2 0 0 wc __ tw 1022 sk 0
[ 1375.960000] Emergency Sync complete
[ 1377.650000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11288
global 1 0 0 wc __ tw 1024 sk 0
[ 1382.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11276
global 2 0 0 wc __ tw 1024 sk 0
[ 1387.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11276
global 2 0 0 wc __ tw 1024 sk 0
[ 1389.720000] mm/page-writeback.c 587 background_writeout:
pdflush(285) 0 global 2 0 0 wc __ tw 1022 sk 0
[ 1392.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11277
global 1 0 0 wc __ tw 1024 sk 0
[ 1397.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11278
global 2 0 0 wc __ tw 1024 sk 0
[ 1402.670000] mm/page-writeback.c 661 wb_kupdate: pdflush(285) 11278
global 2 0 0 wc __ tw 1024 sk 0

I also did a SysRq+T, but nothing interesting in it.
All processes were sleeping in schedule_timeout and other timer stuff,
except emerge and xfssyncd in state D (similar call traces to the
SysRq+W ones) and md1_raid5:
[ 495.640000] md1_raid5 D 0000000000000000 0 946 2
[ 495.640000] ffff810006145d20 0000000000000046 0000000000000000
00000000000000
[ 495.640000] 0000000000000010 ffffffff00000000 ffffffff80819b00
ffffffff80819b
[ 495.640000] ffffffff80815f40 ffffffff80819b00 ffff810006145ce0
ffff810006145c
[ 495.640000] Call Trace:
[ 495.640000] [<ffffffff8039d996>] submit_bio+0x66/0xf0
[ 495.640000] [<ffffffff804c41e5>] md_super_wait+0xb5/0xd0
[ 495.640000] [<ffffffff8024a710>] autoremove_wake_function+0x0/0x30
[ 495.640000] [<ffffffff804ccb60>] bitmap_unplug+0x1b0/0x1c0
[ 495.640000] [<ffffffff804cab90>] md_thread+0x0/0x100
[ 495.640000] [<ffffffff804bf3d6>] raid5d+0xa6/0x490
[ 495.640000] [<ffffffff805b0197>] schedule_timeout+0x67/0xd0
[ 495.640000] [<ffffffff8023e740>] process_timeout+0x0/0x10
[ 495.640000] [<ffffffff805b018a>] schedule_timeout+0x5a/0xd0
[ 495.640000] [<ffffffff804cab90>] md_thread+0x0/0x100
[ 495.640000] [<ffffffff804cabc0>] md_thread+0x30/0x100
[ 495.640000] [<ffffffff8024a710>] autoremove_wake_function+0x0/0x30
[ 495.640000] [<ffffffff804cab90>] md_thread+0x0/0x100
[ 495.640000] [<ffffffff8024a32b>] kthread+0x4b/0x80
[ 495.640000] [<ffffffff8020c9d8>] child_rip+0xa/0x12
[ 495.640000] [<ffffffff8024a2e0>] kthread+0x0/0x80
[ 495.640000] [<ffffffff8020c9ce>] child_rip+0x0/0x12

The following processes were running:
events/3 R running task 0 18 2
syslog-ng R running task 0 4616 1
X R running task 0 5814 5764

[snip]
> > I don't know. It's just that I have seen tmpfs also redirtying inodes
> > in these logs and the stalling emerge is moving files from tmpfs to
> > xfs.
> > It could be, but I don't know enough about tmpfs internals to really be sure.
> > I just wanted to mention, that tmpfs is involved somehow.
>
> The requeue messages for tmpfs are not pleasant, but known to be fine ;-)

OK, didn't know that. But it makes sense. Dirty tmpfs inodes do not
sound like a problem, but more like the normal case. ;-)

I will try the patch from Peter and see if this solves the
emerge/installing part, and post logs from that...

Torsten

2007-11-02 19:22:25

by Torsten Kaiser

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On 11/2/07, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
>
> > Interestingly, no background_writeout() appears, but only
> > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > block the process.
>
> Yeah, the background threshold is not (yet) scaled. So it can happen
> that the bdi_dirty limit is below the background limit.

I still have not seen a trigger of the "throttle_vm_writeout".
This time, installing 2.6.24-rc1 again did not even trigger any other
debug output apart from the one in wb_kupdate.
But 300Mb of new files might still not trigger this with 4Gb of RAM.
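Peter's point about the unscaled background threshold can be sketched
numerically. This is a rough illustrative model only (the ratios and
the per-BDI fraction below are invented; the real logic lives in
mm/page-writeback.c), but it shows how a device's dirty limit can end
up below the global background limit:

```python
# Illustrative model only: all ratios and the bdi fraction are made up.
# The point is the inequality, not the exact numbers.

def dirty_limits(total_pages, dirty_ratio=10, background_ratio=5,
                 bdi_fraction=0.3):
    """Return (global dirty, background, per-device) thresholds in pages."""
    dirty_thresh = total_pages * dirty_ratio // 100
    background_thresh = total_pages * background_ratio // 100
    # The per-BDI limit is the global limit scaled by this device's
    # share of recent writeback; the background limit is not scaled.
    bdi_thresh = int(dirty_thresh * bdi_fraction)
    return dirty_thresh, background_thresh, bdi_thresh

dirty, background, bdi = dirty_limits(1_000_000)
print(dirty, background, bdi)  # 100000 50000 30000
# bdi < background: balance_dirty_pages() can throttle writers against
# the per-device limit while background writeout never sees a reason
# to start.
```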

I'm currently testing 2.6.23-mm1 with this patch and the second
writeback-debug patch.

> I'm curious though as to these stalls, though, I can't seem to think of
> what goes wrong.. esp since most writeback seems to happen from pdflush.

I also don't know. But looking at the time the system takes to write
out 8kb, I'm starting to suspect that something is writing this out,
but not marking it clean... (Or redirtying it immediately?)
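One userspace way to check that suspicion is to watch the Dirty and
Writeback counters over time and see whether Dirty actually shrinks
while the disk is busy. A minimal parser for /proc/meminfo-style text
(field layout assumed from the standard format; the polling loop is
left to the usage note):

```python
import re

def parse_meminfo(text, keys=("Dirty", "Writeback")):
    """Extract selected kB counters from /proc/meminfo-style text."""
    out = {}
    for line in text.splitlines():
        m = re.match(r"(\w+):\s+(\d+)\s*kB", line)
        if m and m.group(1) in keys:
            out[m.group(1)] = int(m.group(2))
    return out

sample = "Dirty:               8 kB\nWriteback:           0 kB\n"
print(parse_meminfo(sample))  # {'Dirty': 8, 'Writeback': 0}
```

On a live system, feed it `open("/proc/meminfo").read()` once a second
or so; if Dirty stays flat while Writeback keeps cycling, pages are
being redirtied rather than cleaned.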

> (or I'm totally misreading it - quite a possible as I'm still recovering
> from a serious cold and not all the green stuff has yet figured out its
> proper place wrt brain cells 'n stuff)

Get well soon!

> I still have this patch floating around:
>
> ---
> Subject: mm: speed up writeback ramp-up on clean systems

applied, but did not fix the stalls.

Here is the complete log from vmstat 10 and the syslog from an install
of vanilla 2.6.24-rc1.
(Please note: I installed the source of vanilla 2.6.24-rc1, but I am
still using 2.6.23-mm1!)
All lines with [note] are my comments about what the system was doing;
both logs are from the same run, so the notes should be more or less
in sync. I used SysRq+L to insert the SysRq help text into the syslog
as a marker...

The visible effects are similar to the unmerge run, but the stalls
during the moving only started later. The same effect was visible
after emerge finished and almost all dirty data had been written: I
can still hear the disks and see the hdd light flickering (mostly on)
for much, much longer than it should take to write 8kb.

vmstat 10:
[note]emerge start
1 0 0 3668496 332 187748 0 0 0 29 39 491 3 0 96 0
1 0 0 3623940 332 188880 0 0 83 17 1724 3893 15 2 81 1
0 0 0 3559488 332 252432 0 0 1021 48 11719 4536 9 4 74 13
2 0 0 3482220 332 311916 0 0 70 60 93 3818 11 3 86 0
1 0 0 3289352 332 486932 0 0 2 35 33 11997 25 3 72 0
1 0 0 3174036 332 596412 0 0 10 33 35 3937 21 4 75 0
2 0 0 3215756 332 555292 0 0 6 28 85 742 12 12 76 0
2 0 0 3202128 332 559792 0 0 32 9 34 1566 31 1 68 0
2 0 0 3192804 332 568072 0 0 60 46 172 4206 30 2 67 1
3 0 0 3202424 332 572620 0 0 0 20 111 2223 27 1 72 0
1 0 0 3196112 332 578900 0 0 0 1649 149 2763 25 2 73 0
1 0 0 3190004 332 584956 0 0 0 17 110 2270 25 1 74 0
1 0 0 3183952 332 590840 0 0 0 11 104 2553 25 1 74 0
1 0 0 3176952 332 597068 0 0 0 2153 124 2886 25 2 72 0
1 0 0 3171044 332 602592 0 0 0 22 109 2580 26 1 73 0
1 0 0 3174896 332 605496 0 0 173 1441 312 2249 9 6 84 1
1 0 0 3165204 332 611856 0 0 569 3221 606 4236 4 7 87 1
0 0 0 3160856 332 613516 0 0 116 2281 570 3077 3 5 92 0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3154712 332 615200 0 0 108 2166 528 3038 3 4 93 0
0 0 0 3156008 332 615420 0 0 18 1941 537 1015 0 2 97 0
0 0 0 3156652 332 615504 0 0 8 2232 547 900 0 2 98 0
0 0 0 3156748 332 615672 0 0 12 1932 537 947 0 2 98 0
0 0 0 3154720 332 615900 0 0 14 2204 584 1256 1 2 97 0
0 0 0 3154256 332 616060 0 0 10 2676 610 1317 1 3 96 0
1 0 0 3152488 332 616284 0 0 9 1994 573 1024 1 2 97 0
0 0 0 3152404 332 616408 0 0 4 2218 540 904 0 2 97 0
0 0 0 3151244 332 617156 0 0 44 2198 598 1921 2 4 94 0
0 0 0 3147224 332 618672 0 0 110 1802 644 2575 3 4 93 0
0 0 0 3144608 332 619824 0 0 80 1590 543 1900 2 4 95 0
0 0 0 3140768 332 621448 0 0 111 1758 657 2735 3 4 93 0
0 0 0 3140816 332 621896 0 0 26 801 531 1667 1 2 98 0
[note] first stall, SysRq+W
1 0 0 3127620 332 621896 0 0 0 640 490 1381 2 1 97 0
0 0 0 3127780 332 621900 0 0 0 627 475 1531 2 1 98 0
0 0 0 3127560 332 621900 0 0 0 587 464 1428 0 0 99 0
1 0 0 3126272 332 622460 0 0 32 945 556 1922 1 2 97 0
[note] installing resumes
0 0 0 3120860 332 624048 0 0 94 1950 785 2582 4 5 91 0
0 0 0 3117392 332 625200 0 0 76 1258 742 2217 2 3 95 0
[note] second stall
0 0 0 3118192 332 625200 0 0 0 617 559 1617 0 1 99 0
0 0 0 3118836 332 625200 0 0 0 603 550 1576 5 1 94 0
0 0 0 3118728 332 625200 0 0 0 682 601 1454 0 0 99 0
0 0 0 3118860 332 625200 0 0 0 653 557 1382 0 1 99 0
[note] installing resumes
1 0 0 3111356 332 624576 0 0 91 1277 789 2086 11 4 84 1
0 0 0 3149768 332 627792 0 0 322 504 655 1444 1 2 96 1
0 0 0 3150064 332 627792 0 0 0 559 623 1340 0 0 99 0
[note] emerge is finished, ~200Mb dirty data
0 0 0 3150220 332 627792 0 0 0 622 553 1553 2 1 97 0
0 0 0 3150456 332 627792 0 0 0 518 595 1315 0 1 99 0
0 0 0 3149380 332 627792 0 0 0 3759 801 1277 0 3 97 0
0 0 0 3148664 332 627840 0 0 0 3925 873 1500 0 4 96 0
0 0 0 3149672 332 627868 0 0 0 2476 800 1355 0 3 97 0
0 0 0 3148012 332 627872 0 0 0 2865 806 1235 0 3 97 0
0 0 0 3150496 332 627936 0 0 0 3074 847 1288 0 3 97 0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3149568 332 627968 0 0 0 2238 751 1070 0 2 97 0
0 0 0 3150260 332 627988 0 0 0 872 607 1073 0 1 99 0
0 0 0 3150228 332 627988 0 0 0 1711 715 1214 0 2 98 0
0 0 0 3149300 332 627988 0 0 0 2195 752 1042 0 2 98 0
1 0 0 3150036 332 628032 0 0 0 2192 759 1118 0 2 97 0
0 0 0 3150868 332 628032 0 0 0 1035 639 1138 0 1 99 0
0 0 0 3150876 332 628068 0 0 0 1437 740 1153 0 1 98 0
0 0 0 3151152 332 628068 0 0 0 446 545 1381 0 0 100 0
0 0 0 3151212 332 628068 0 0 0 461 551 1412 2 0 98 0
[note] normal writeout finishes ~116kb dirty data left
1 0 0 3151088 332 628068 0 0 0 472 552 1468 0 0 99 0
0 0 0 3151260 332 628068 0 0 0 462 533 1369 0 0 100 0
0 0 0 3151296 332 628068 0 0 0 464 559 1325 0 0 100 0
0 0 0 3150992 332 628068 0 0 0 485 533 1350 0 0 100 0
0 0 0 3151092 332 628068 0 0 0 492 543 1378 0 0 100 0
[note] hit SysRq+W and SysRq+M
0 0 0 3150828 332 628076 0 0 0 430 541 1449 9 1 90 0
0 0 0 3150932 332 628076 0 0 0 459 535 1401 0 0 100 0
0 0 0 3151068 332 628076 0 0 0 465 536 1471 0 0 99 0
0 0 0 3151164 332 628076 0 0 0 453 525 1349 0 0 100 0
0 0 0 3151208 332 628076 0 0 0 474 530 1354 0 0 100 0
1 0 0 3151036 332 628076 0 0 0 449 506 1348 0 0 100 0
0 0 0 3151148 332 628076 0 0 0 476 520 1314 0 0 100 0
0 0 0 3151080 332 628076 0 0 0 467 521 1373 0 0 100 0
0 0 0 3151096 332 628076 0 0 0 464 521 1324 0 0 100 0
0 0 0 3151220 332 628076 0 0 0 461 548 1360 0 0 100 0
0 0 0 3151144 332 628076 0 0 0 417 480 1329 0 0 100 0
0 0 0 3150892 332 628076 0 0 0 492 543 1363 0 0 99 0
0 0 0 3151048 332 628076 0 0 0 436 515 1298 0 0 100 0
0 0 0 3151076 332 628076 0 0 0 434 513 1402 0 0 100 0
0 0 0 3151296 332 628076 0 0 0 430 508 1367 0 0 100 0
0 0 0 3150940 332 628076 0 0 0 472 527 1331 0 0 100 0
0 0 0 3151016 332 628076 0 0 0 472 527 1315 0 0 100 0
0 0 0 3151024 332 628076 0 0 0 227 409 703 0 0 100 0
0 0 0 3151272 332 628080 0 0 0 11 315 262 2 0 98 0
[note] writeout really finishes, disks go idle.

from syslog:
[note] emerge started, this unpacks the kernel into a tmpfs, patches
it to rc1, packs it into a tar.bz2 and then moves the files from the
tmpfs to my main xfs root fs
[ 322.230000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK
showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks
Unmount shoW-blocked-tasks
[ 323.120000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20090
global 25 0 0 wc __ tw 1024 sk 0
[ 328.230000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20091
global 26 0 0 wc __ tw 1024 sk 0
[ 333.290000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20131
global 29 0 0 wc _M tw 1023 sk 0
[ 333.360000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20130
global 29 0 0 wc _M tw 1023 sk 0
[ 333.390000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20129
global 29 0 0 wc __ tw 1023 sk 0
[ 338.300000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20131
global 28 0 0 wc __ tw 1024 sk 0
[ 343.360000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20196
global 1 28 0 wc __ tw 1000 sk 0
[ 348.330000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 20188
global 4 0 0 wc __ tw 1024 sk 0
[ 353.380000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 27417
global 4 0 0 wc __ tw 1024 sk 0
[ 358.380000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 31801
global 4 0 0 wc __ tw 1024 sk 0
[ 363.380000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 40783
global 4 0 0 wc __ tw 1021 sk 0
[ 368.460000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44080
global 1 0 0 wc __ tw 1023 sk 0
[ 373.460000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44085
global 1 0 0 wc __ tw 1024 sk 0
[ 378.460000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44631
global 1 0 0 wc __ tw 1024 sk 0
[ 383.510000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44709
global 1 0 0 wc __ tw 1024 sk 0
[note] around here the creation of the tar.bz2 started
[ 388.520000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45134
global 426 0 0 wc __ tw 1024 sk 0
[ 393.530000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45884
global 1148 0 0 wc __ tw 1024 sk 0
[ 398.530000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47002
global 2262 0 0 wc __ tw 1023 sk 0
[ 403.570000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47619
global 2888 0 0 wc __ tw 1024 sk 0
[ 408.570000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48276
global 3545 0 0 wc __ tw 1024 sk 0
[ 413.570000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48740
global 2997 1012 0 wc _M tw -1 sk 0
[ 413.570000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47715
global 2997 1012 0 wc _M tw 1024 sk 0
[ 413.580000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47715
global 1985 2024 0 wc _M tw -1 sk 0
[ 413.590000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46690
global 973 3036 0 wc _M tw -1 sk 0
[ 413.590000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45665
global 7 4002 0 wc __ tw 64 sk 0
[ 418.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45595
global 864 0 0 wc __ tw 1024 sk 0
[ 423.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46294
global 1563 0 0 wc __ tw 1024 sk 0
[ 428.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47036
global 2305 0 0 wc __ tw 1023 sk 0
[ 433.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47731
global 3000 0 0 wc __ tw 1024 sk 0
[ 438.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48525
global 3794 0 0 wc __ tw 1024 sk 0
[ 443.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49159
global 4428 0 0 wc __ tw 1024 sk 0
[note] around here the moving from the tmpfs to the xfs started
[ 448.630000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50047
global 4304 1012 0 wc _M tw -1 sk 0
[ 448.640000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49022
global 3292 2024 0 wc _M tw -1 sk 0
[ 448.650000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47997
global 2234 3082 0 wc _M tw -1 sk 0
[ 448.650000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46972
global 1222 4094 0 wc _M tw -1 sk 0
[ 448.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45947
global 210 5106 0 wc _M tw -1 sk 0
[ 448.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44922
global 0 5336 0 wc __ tw 812 sk 0
[ 453.700000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45385
global 654 0 0 wc __ tw 1024 sk 0
[ 458.700000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45881
global 1150 0 0 wc _M tw 1023 sk 0
[ 458.790000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45880
global 1196 0 0 wc _M tw 1023 sk 0
[ 458.810000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45879
global 1196 0 0 wc __ tw 1023 sk 0
[ 463.840000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44729
global 0 0 0 wc __ tw 1024 sk 0
[ 468.860000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45653
global 869 0 0 wc __ tw 1024 sk 0
[ 473.880000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51262
global 6380 0 0 wc __ tw 1024 sk 0
[ 478.920000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56488
global 11523 0 0 wc __ tw 1024 sk 0
[ 485.260000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58839
global 13842 0 0 wc __ tw 1024 sk 0
[ 490.260000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 60796
global 15746 0 0 wc __ tw 1023 sk 0
[ 495.270000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64003
global 18907 0 0 wc __ tw 1023 sk 0
[ 502.330000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 67524
global 21467 336 0 wc _M tw -5 sk 0
[ 505.350000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 66495
global 20615 51 0 wc _M tw 0 sk 0
[ 508.140000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 65471
global 19727 213 0 wc _M tw -1 sk 0
[ 508.550000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64446
global 19483 336 0 wc _M tw 760 sk 0
[ 509.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64182
global 19470 94 0 wc __ tw 1012 sk 0
[ 514.190000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 65780
global 19665 172 0 wc __ tw -1 sk 0
[ 517.310000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64755
global 18827 14 0 wc __ tw -1 sk 0
[ 520.100000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 63730
global 17929 96 0 wc _M tw -13 sk 0
[ 522.560000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 62693
global 16937 167 0 wc _M tw -1 sk 0
[ 527.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 61668
global 16021 95 0 wc _M tw -6 sk 0
[ 530.460000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 60638
global 15115 52 0 wc _M tw -1 sk 0
[ 534.470000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 59613
global 14222 27 0 wc _M tw -4 sk 0
[ 537.760000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58585
global 13386 54 0 wc _M tw 0 sk 0
[ 541.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57561
global 12737 58 0 wc _M tw 281 sk 0
[ 541.090000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56818
global 12737 58 0 wc __ tw 1022 sk 0
[ 547.200000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858
global 12829 72 0 wc __ tw 0 sk 0
[ 550.480000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834
global 12017 62 0 wc __ tw 0 sk 0
[ 552.710000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810
global 11133 83 0 wc __ tw 0 sk 0
[ 558.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786
global 10470 33 0 wc _M tw 0 sk 0
[ 562.750000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762
global 10555 69 0 wc _M tw 0 sk 0
[ 565.150000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738
global 9562 498 0 wc _M tw -2 sk 0
[ 569.490000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712
global 8960 2 0 wc _M tw 0 sk 0
[ 572.910000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688
global 8088 205 0 wc _M tw -13 sk 0
[ 574.610000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651
global 7114 188 0 wc _M tw -1 sk 0
[ 584.270000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626
global 14544 0 0 wc _M tw -1 sk 0
[ 593.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601
global 24583 736 0 wc _M tw -1 sk 0
[ 600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576
global 27004 6 0 wc _M tw 587 sk 0
[ 600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139
global 27004 6 0 wc __ tw 1014 sk 0
[note] first stall, the output from emerge stops, so it seems it
cannot start processing the next file until the stall ends
[ 630.000000] SysRq : Emergency Sync
[ 630.120000] Emergency Sync complete
[ 632.850000] SysRq : Show Blocked State
[ 632.850000] task PC stack pid father
[ 632.850000] pdflush D ffff81000f091788 0 285 2
[ 632.850000] ffff810005d4da80 0000000000000046 0000000000000800
0000007000000001
[ 632.850000] ffff81000fd52400 ffffffff8022d61c ffffffff80819b00
ffffffff80819b00
[ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810100316f98
0000000000000000
[ 632.850000] Call Trace:
[ 632.850000] [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
[ 632.850000] [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
[ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e
[ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
[ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a
[ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
[ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
[ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
[ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
[ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
[ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
[ 632.850000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330
[ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520
[ 632.850000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60
[ 632.850000] [<ffffffff8036ed49>] xfs_inode_flush+0x179/0x1b0
[ 632.850000] [<ffffffff8037ca8f>] xfs_fs_write_inode+0x2f/0x90
[ 632.850000] [<ffffffff802b3aac>] __writeback_single_inode+0x2ac/0x380
[ 632.850000] [<ffffffff804d074e>] dm_table_any_congested+0x2e/0x80
[ 632.850000] [<ffffffff802b3f9d>] generic_sync_sb_inodes+0x20d/0x330
[ 632.850000] [<ffffffff802b4532>] writeback_inodes+0xa2/0xe0
[ 632.850000] [<ffffffff8026bfd6>] wb_kupdate+0xa6/0x140
[ 632.850000] [<ffffffff8026c4b0>] pdflush+0x0/0x1e0
[ 632.850000] [<ffffffff8026c5c0>] pdflush+0x110/0x1e0
[ 632.850000] [<ffffffff8026bf30>] wb_kupdate+0x0/0x140
[ 632.850000] [<ffffffff8024a32b>] kthread+0x4b/0x80
[ 632.850000] [<ffffffff8020c9d8>] child_rip+0xa/0x12
[ 632.850000] [<ffffffff8024a2e0>] kthread+0x0/0x80
[ 632.850000] [<ffffffff8020c9ce>] child_rip+0x0/0x12
[ 632.850000]
[ 632.850000] emerge D 0000000000000000 0 6220 6129
[ 632.850000] ffff810103ced9f8 0000000000000086 0000000000000000
0000007000000001
[ 632.850000] ffff81000fd52cf8 ffffffff00000000 ffffffff80819b00
ffffffff80819b00
[ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810103ced9b8
ffff810103ced9a8
[ 632.850000] Call Trace:
[ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e
[ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
[ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a
[ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
[ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
[ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
[ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
[ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
[ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
[ 632.850000] [<ffffffff80375bee>] xfs_buf_rele+0x2e/0xd0
[ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520
[ 632.850000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60
[ 632.850000] [<ffffffff80355c82>] xfs_inode_item_push+0x12/0x20
[ 632.850000] [<ffffffff80368247>] xfs_trans_push_ail+0x267/0x2b0
[ 632.850000] [<ffffffff8035c742>] xfs_log_reserve+0x72/0x120
[ 632.850000] [<ffffffff80366bf8>] xfs_trans_reserve+0xa8/0x210
[ 632.850000] [<ffffffff803731f2>] kmem_zone_zalloc+0x32/0x50
[ 632.850000] [<ffffffff8035263b>] xfs_itruncate_finish+0xfb/0x310
[ 632.850000] [<ffffffff8036daeb>] xfs_free_eofblocks+0x23b/0x280
[ 632.850000] [<ffffffff80371f93>] xfs_release+0x153/0x200
[ 632.850000] [<ffffffff80378010>] xfs_file_release+0x10/0x20
[ 632.850000] [<ffffffff80294251>] __fput+0xb1/0x220
[ 632.850000] [<ffffffff802910a4>] filp_close+0x54/0x90
[ 632.850000] [<ffffffff802929bf>] sys_close+0x9f/0x100
[ 632.850000] [<ffffffff8020bbbe>] system_call+0x7e/0x83
[ 632.850000]
[ 662.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 73045
global 39157 0 0 wc __ tw 0 sk 0
[note] emerge resumed
[ 664.030000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK
showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks
Unmount shoW-blocked-tasks
[ 673.150000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 72021
global 44617 0 0 wc __ tw -3 sk 0
[note] emerge stalled again
[ 693.930000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK
showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks
Unmount shoW-blocked-tasks
[ 724.580000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 70994
global 48064 26 0 wc _M tw -5 sk 0
[note] emerge resumed again
[ 724.710000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK
showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks
Unmount shoW-blocked-tasks
[ 751.470000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 69965
global 47914 46 0 wc _M tw -1 sk 0
[note] emerge is finished, but 200Mb of dirty data remain
[ 761.950000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK
showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks
Unmount shoW-blocked-tasks
[ 775.520000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 68940
global 46911 414 0 wc _M tw 0 sk 0
[ 776.280000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 67916
global 45859 724 0 wc _M tw -2 sk 0
[ 777.370000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 66890
global 44834 325 0 wc _M tw -10 sk 0
[ 778.450000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 65856
global 43828 242 0 wc _M tw -1 sk 0
[ 779.020000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 64831
global 42807 484 0 wc _M tw -1 sk 0
[ 780.440000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 63806
global 41768 47 0 wc _M tw -7 sk 0
[ 781.560000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 62775
global 40730 445 0 wc _M tw 0 sk 0
[ 783.000000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 61751
global 39705 322 0 wc _M tw -3 sk 0
[ 785.140000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 60724
global 38732 310 0 wc _M tw -4 sk 0
[ 786.390000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 59696
global 37673 406 0 wc _M tw -6 sk 0
[ 787.310000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58666
global 36636 495 0 wc _M tw -9 sk 0
[ 787.720000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57633
global 35578 955 0 wc _M tw -1 sk 0
[ 789.100000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56608
global 34592 139 0 wc _M tw 0 sk 0
[ 790.400000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55584
global 33567 25 0 wc _M tw -3 sk 0
[ 791.780000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54557
global 32491 305 0 wc _M tw -11 sk 0
[ 793.790000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53522
global 31499 506 0 wc _M tw -5 sk 0
[ 796.680000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52493
global 30462 184 0 wc _M tw -3 sk 0
[ 798.930000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51466
global 29411 340 0 wc _M tw -11 sk 0
[ 800.330000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50431
global 28377 69 0 wc _M tw -4 sk 0
[ 803.900000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49403
global 27388 24 0 wc _M tw -2 sk 0
[ 805.600000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48377
global 26330 142 0 wc _M tw -6 sk 0
[ 807.740000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47347
global 25295 138 0 wc _M tw -1 sk 0
[ 809.680000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46322
global 24296 268 0 wc _M tw -2 sk 0
[ 812.120000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 45296
global 23269 81 0 wc _M tw -5 sk 0
[ 813.940000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 44267
global 22249 303 0 wc _M tw -1 sk 0
[ 815.940000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 43242
global 21205 220 0 wc _M tw -9 sk 0
[ 817.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 42209
global 20174 87 0 wc _M tw -7 sk 0
[ 819.430000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 41178
global 19142 31 0 wc _M tw -5 sk 0
[ 820.360000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 40149
global 18113 316 0 wc _M tw -7 sk 0
[ 822.310000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 39118
global 17098 85 0 wc _M tw 0 sk 0
[ 824.680000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 38094
global 16064 168 0 wc _M tw 0 sk 0
[ 829.250000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 37070
global 15059 44 0 wc _M tw 0 sk 0
[ 832.300000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 36046
global 14001 89 0 wc _M tw -2 sk 0
[ 836.030000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 35020
global 13741 0 0 wc _M tw 760 sk 0
[ 836.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 34756
global 13649 92 0 wc _M tw 922 sk 0
[ 836.290000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 34654
global 13649 0 0 wc _M tw 1022 sk 0
[ 836.720000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 34652
global 13650 0 0 wc __ tw 1023 sk 0
[ 843.210000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 60278
global 12631 110 0 wc __ tw 0 sk 0
[ 845.380000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 59254
global 11590 72 0 wc __ tw -1 sk 0
[ 852.340000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58229
global 10566 56 0 wc __ tw -1 sk 0
[ 854.360000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57204
global 9551 103 0 wc __ tw 0 sk 0
[ 857.140000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56180
global 8529 33 0 wc __ tw 0 sk 0
[ 860.800000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55156
global 7480 509 0 wc _M tw -9 sk 0
[ 863.350000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54123
global 6443 343 0 wc _M tw -10 sk 0
[ 866.020000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53089
global 5420 215 0 wc _M tw 0 sk 0
[ 870.080000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52065
global 4393 104 0 wc _M tw 0 sk 0
[ 872.210000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51041
global 3385 334 0 wc _M tw -5 sk 0
[ 874.280000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50012
global 2343 234 0 wc _M tw 0 sk 0
[ 884.350000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48988
global 1330 52 0 wc _M tw -4 sk 0
[ 889.810000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47960
global 294 122 0 wc _M tw 0 sk 0
[note] the system is down to 116kb dirty data, but still writing back heavily
[ 905.280000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK
showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks
Unmount shoW-blocked-tasks
[note] after a while in this state I hit SysRq+W and SysRq+M to
capture more state
[ 967.770000] SysRq : Show Blocked State
[ 967.770000] task PC stack pid father
[ 967.770000] pdflush D ffff810080043640 0 285 2
[ 967.770000] ffff810005d4da80 0000000000000046 ffff810005d4da48
0000007000000001
[ 967.770000] 0000000000000400 0000000000000001 ffffffff80819b00
ffffffff80819b00
[ 967.770000] ffffffff80815f40 ffffffff80819b00 ffff810005d4da40
ffff810005d4da30
[ 967.770000] Call Trace:
[ 967.770000] [<ffffffff805b16e7>] __down+0xa7/0x11e
[ 967.770000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
[ 967.770000] [<ffffffff805b1365>] __down_failed+0x35/0x3a
[ 967.770000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
[ 967.770000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
[ 967.770000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
[ 967.770000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
[ 967.770000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
[ 967.770000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
[ 967.770000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330
[ 967.770000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520
[ 967.770000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60
[ 967.770000] [<ffffffff8036ed49>] xfs_inode_flush+0x179/0x1b0
[ 967.770000] [<ffffffff8037ca8f>] xfs_fs_write_inode+0x2f/0x90
[ 967.770000] [<ffffffff802b3aac>] __writeback_single_inode+0x2ac/0x380
[ 967.770000] [<ffffffff804d074e>] dm_table_any_congested+0x2e/0x80
[ 967.770000] [<ffffffff802b3f9d>] generic_sync_sb_inodes+0x20d/0x330
[ 967.770000] [<ffffffff802b4532>] writeback_inodes+0xa2/0xe0
[ 967.770000] [<ffffffff8026bfd6>] wb_kupdate+0xa6/0x140
[ 967.770000] [<ffffffff8026c4b0>] pdflush+0x0/0x1e0
[ 967.770000] [<ffffffff8026c5c0>] pdflush+0x110/0x1e0
[ 967.770000] [<ffffffff8026bf30>] wb_kupdate+0x0/0x140
[ 967.770000] [<ffffffff8024a32b>] kthread+0x4b/0x80
[ 967.770000] [<ffffffff8020c9d8>] child_rip+0xa/0x12
[ 967.770000] [<ffffffff8024a2e0>] kthread+0x0/0x80
[ 967.770000] [<ffffffff8020c9ce>] child_rip+0x0/0x12
[ 967.770000]
[ 968.640000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46936
global 30 0 0 wc _M tw 757 sk 0
[ 968.670000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46669
global 2 28 0 wc __ tw 996 sk 0
[ 970.520000] SysRq : Show Memory
[ 970.530000] Mem-info:
[ 970.530000] Node 0 DMA per-cpu:
[ 970.530000] CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi:
0, btch: 1 usd: 0
[ 970.540000] CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi:
0, btch: 1 usd: 0
[ 970.540000] CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi:
0, btch: 1 usd: 0
[ 970.540000] CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi:
0, btch: 1 usd: 0
[ 970.540000] Node 0 DMA32 per-cpu:
[ 970.540000] CPU 0: Hot: hi: 186, btch: 31 usd: 66 Cold: hi:
62, btch: 15 usd: 15
[ 970.540000] CPU 1: Hot: hi: 186, btch: 31 usd: 159 Cold: hi:
62, btch: 15 usd: 17
[ 970.540000] CPU 2: Hot: hi: 186, btch: 31 usd: 0 Cold: hi:
62, btch: 15 usd: 0
[ 970.540000] CPU 3: Hot: hi: 186, btch: 31 usd: 0 Cold: hi:
62, btch: 15 usd: 0
[ 970.540000] Node 1 DMA32 per-cpu:
[ 970.540000] CPU 0: Hot: hi: 186, btch: 31 usd: 28 Cold: hi:
62, btch: 15 usd: 0
[ 970.540000] CPU 1: Hot: hi: 186, btch: 31 usd: 47 Cold: hi:
62, btch: 15 usd: 0
[ 970.540000] CPU 2: Hot: hi: 186, btch: 31 usd: 155 Cold: hi:
62, btch: 15 usd: 12
[ 970.540000] CPU 3: Hot: hi: 186, btch: 31 usd: 183 Cold: hi:
62, btch: 15 usd: 3
[ 970.540000] Node 1 Normal per-cpu:
[ 970.540000] CPU 0: Hot: hi: 186, btch: 31 usd: 0 Cold: hi:
62, btch: 15 usd: 0
[ 970.540000] CPU 1: Hot: hi: 186, btch: 31 usd: 0 Cold: hi:
62, btch: 15 usd: 0
[ 970.540000] CPU 2: Hot: hi: 186, btch: 31 usd: 118 Cold: hi:
62, btch: 15 usd: 19
[ 970.540000] CPU 3: Hot: hi: 186, btch: 31 usd: 163 Cold: hi:
62, btch: 15 usd: 13
[note] I do think that /proc/meminfo also showed only 8kB of dirty
remaining at this point, but I'm not 200% sure...
[ 970.540000] Active:70883 inactive:117017 dirty:2 writeback:0 unstable:0
[ 970.540000] free:787733 slab:25286 mapped:12000 pagetables:2237 bounce:0
[ 970.540000] Node 0 DMA free:9448kB min:16kB low:20kB high:24kB
active:0kB inactive:0kB present:8868kB pages_scanned:0
all_unreclaimable? no
[ 970.540000] lowmem_reserve[]: 0 2004 2004 2004
[ 970.540000] Node 0 DMA32 free:1465640kB min:4040kB low:5048kB
high:6060kB active:132340kB inactive:310048kB present:2052320kB
pages_scanned:0 all_unreclaimable? no
[ 970.540000] lowmem_reserve[]: 0 0 0 0
[ 970.540000] Node 1 DMA32 free:1476216kB min:3040kB low:3800kB
high:4560kB active:3528kB inactive:41952kB present:1544000kB
pages_scanned:0 all_unreclaimable? no
[ 970.540000] lowmem_reserve[]: 0 0 505 505
[ 970.540000] Node 1 Normal free:199628kB min:1016kB low:1268kB
high:1524kB active:147664kB inactive:116068kB present:517120kB
pages_scanned:0 all_unreclaimable? no
[ 970.540000] lowmem_reserve[]: 0 0 0 0
[ 970.540000] Node 0 DMA: 6*4kB 6*8kB 4*16kB 5*32kB 3*64kB 2*128kB
4*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 9448kB
[ 970.540000] Node 0 DMA32: 158*4kB 66*8kB 30*16kB 22*32kB 10*64kB
7*128kB 6*256kB 4*512kB 6*1024kB 5*2048kB 352*4096kB = 1465640kB
[ 970.540000] Node 1 DMA32: 866*4kB 446*8kB 228*16kB 122*32kB 50*64kB
32*128kB 23*256kB 17*512kB 16*1024kB 11*2048kB 342*4096kB = 1476216kB
[ 970.540000] Node 1 Normal: 511*4kB 618*8kB 471*16kB 325*32kB
185*64kB 92*128kB 72*256kB 55*512kB 38*1024kB 26*2048kB 3*4096kB =
199580kB
[ 970.540000] Swap cache: add 0, delete 0, find 0/0, race 0+0
[ 970.540000] Free swap = 9775416kB
[ 970.540000] Total swap = 9775416kB
[ 970.540000] Free swap: 9775416kB
[ 970.570000] 1048576 pages of RAM
[ 970.570000] 35174 reserved pages
[ 970.570000] 149150 pages shared
[ 970.570000] 0 pages swap cached
[ 1137.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46642
global 1 0 0 wc _M tw 1022 sk 0
[ 1137.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640
global 1 0 0 wc __ tw 1022 sk 0
[ 1138.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640
global 1 0 0 wc __ tw 1024 sk 0
[ 1143.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640
global 1 0 0 wc __ tw 1024 sk 0
[ 1148.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640
global 1 0 0 wc __ tw 1024 sk 0
[note] finally the disks go idle
[ 1149.020000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK
showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks
Unmount shoW-blocked-tasks
[ 1153.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46641
global 2 0 0 wc __ tw 1024 sk 0
[ 1158.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46641
global 2 0 0 wc __ tw 1024 sk 0
[ 1163.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46641
global 2 0 0 wc __ tw 1024 sk 0
[ 1168.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46641
global 2 0 0 wc _M tw 1023 sk 0
[ 1168.160000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640
global 2 0 0 wc _M tw 1023 sk 0
[ 1168.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46639
global 2 0 0 wc __ tw 1023 sk 0
[ 1173.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640
global 1 0 0 wc __ tw 1024 sk 0
[ 1178.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640
global 1 0 0 wc __ tw 1024 sk 0
[ 1183.110000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 46640
global 1 0 0 wc __ tw 1024 sk 0

Torsten

2007-11-02 20:43:48

by David Chinner

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> [ 630.000000] SysRq : Emergency Sync
> [ 630.120000] Emergency Sync complete
> [ 632.850000] SysRq : Show Blocked State
> [ 632.850000] task PC stack pid father
> [ 632.850000] pdflush D ffff81000f091788 0 285 2
> [ 632.850000] ffff810005d4da80 0000000000000046 0000000000000800
> 0000007000000001
> [ 632.850000] ffff81000fd52400 ffffffff8022d61c ffffffff80819b00
> ffffffff80819b00
> [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810100316f98
> 0000000000000000
> [ 632.850000] Call Trace:
> [ 632.850000] [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
> [ 632.850000] [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
> [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e
> [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
> [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a
> [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
> [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
> [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
> [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
> [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
> [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
> [ 632.850000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330
> [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520

That's stalled waiting on the inode cluster buffer lock. That implies
that the inode cluster is already being written out and the inode has
been redirtied during writeout.

Does the kernel you are testing have the "flush inodes in ascending
inode number order" patches applied? If so, can you remove that
patch and see if the problem goes away?
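The suggestion above can be tested by locating the commit by its subject and reverting it. A hedged sketch follows: the subject string is an assumption (substitute the real patch title), and a throwaway repo stands in for the kernel tree so the commands run standalone.

```shell
# Hypothetical demo of dropping a suspect commit by subject; the
# subject string is an assumption, and a throwaway repo stands in for
# the kernel tree so the commands run standalone.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m 'base'
echo 'suspect change' > "$repo/fs-writeback.c"
git -C "$repo" add fs-writeback.c
git -C "$repo" -c user.name=t -c user.email=t@example.com \
    commit -q -m 'writeback: flush inodes in ascending inode number order'
# Find the commit by subject, then revert it (on a real tree you would
# rebuild and retest afterwards).
sha=$(git -C "$repo" log --grep='ascending inode number' --format=%h -n 1)
git -C "$repo" -c user.name=t -c user.email=t@example.com \
    revert --no-edit "$sha"
test ! -e "$repo/fs-writeback.c" && echo reverted
```

Reverting leaves the rest of the series intact, which makes it easy to confirm whether this one patch is responsible.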

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-11-02 21:02:36

by Torsten Kaiser

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On 11/2/07, David Chinner <[email protected]> wrote:
> On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> > [ 630.000000] SysRq : Emergency Sync
> > [ 630.120000] Emergency Sync complete
> > [ 632.850000] SysRq : Show Blocked State
> > [ 632.850000] task PC stack pid father
> > [ 632.850000] pdflush D ffff81000f091788 0 285 2
> > [ 632.850000] ffff810005d4da80 0000000000000046 0000000000000800
> > 0000007000000001
> > [ 632.850000] ffff81000fd52400 ffffffff8022d61c ffffffff80819b00
> > ffffffff80819b00
> > [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810100316f98
> > 0000000000000000
> > [ 632.850000] Call Trace:
> > [ 632.850000] [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
> > [ 632.850000] [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
> > [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e
> > [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
> > [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a
> > [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
> > [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
> > [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
> > [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
> > [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
> > [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
> > [ 632.850000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330
> > [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520
>
> That's stalled waiting on the inode cluster buffer lock. That implies
> that the inode cluster is already being written out and the inode has
> been redirtied during writeout.
>
> Does the kernel you are testing have the "flush inodes in ascending
> inode number order" patches applied? If so, can you remove that
> patch and see if the problem goes away?

It's 2.6.23-mm1 with only some small fixes.

In it's broken-out directory I see:
git-xfs.patch

and

writeback-fix-periodic-superblock-dirty-inode-flushing.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-inode-lists-8.patch
writeback-introduce-writeback_controlmore_io-to-indicate-more-io.patch

I don't know if the patch you mentioned is part of that version of the
mm-patchset.

Torsten

2007-11-04 11:19:32

by Torsten Kaiser

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On 11/2/07, David Chinner <[email protected]> wrote:
> That's stalled waiting on the inode cluster buffer lock. That implies
> that the inode cluster is already being written out and the inode has
> been redirtied during writeout.
>
> Does the kernel you are testing have the "flush inodes in ascending
> inode number order" patches applied? If so, can you remove that
> patch and see if the problem goes away?

I can now confirm that I see this also with the current mainline git version.
I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
plus the fix for the sg changes in ieee1394.
Bisecting would be troublesome, as the sg changes prevent mainline from
booting with my normal config / kill my network.
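Unbootable intermediate revisions don't have to block a bisection: `git bisect skip` (exit code 125 from a run script) steps over them. A self-contained sketch, using a throwaway repo with made-up commits where the "bug" appears at c4:

```shell
# Self-contained bisect sketch: five commits, the "bug" appears at c4;
# `git bisect run` finds it automatically. (A revision that fails to
# boot for unrelated reasons can be stepped over with `git bisect skip`,
# i.e. exit code 125 from the run script.)
repo=$(mktemp -d)
git -C "$repo" init -q
for i in 1 2 3 4 5; do
  echo "$i" > "$repo/f"
  git -C "$repo" add f
  git -C "$repo" -c user.name=t -c user.email=t@example.com \
      commit -qm "c$i"
done
cat > "$repo/check.sh" <<'EOF'
[ "$(cat f)" -ge 4 ] && exit 1   # bug present: bad
exit 0                           # good
EOF
git -C "$repo" bisect start HEAD HEAD~4   # bad=c5, good=c1
git -C "$repo" bisect run sh check.sh
```

In the real case the run script would build and boot-test the kernel instead of reading a file.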

treogen ~ # vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
-> starting emerge
1 0 0 3627072 332 157724 0 0 97 13 41 189 2 2 94 2
0 0 0 3607240 332 163736 0 0 599 10 332 951 2 1 93 4
0 0 0 3601920 332 167592 0 0 380 2 218 870 1 1 98 0
0 0 0 3596356 332 171648 0 0 404 21 182 818 0 0 99 0
0 0 0 3579328 332 180436 0 0 878 12 147 912 1 1 97 2
0 0 0 3575376 332 182776 0 0 236 4 244 953 1 1 95 3
2 1 0 3571792 332 185084 0 0 232 7 256 1003 2 1 95 2
0 0 0 3564844 332 187364 0 0 228 605 246 1167 2 1 93 4
0 0 0 3562128 332 189784 0 0 230 4 527 1238 2 1 93 4
0 1 0 3558764 332 191964 0 0 216 24 438 1059 1 1 93 6
0 0 0 3555120 332 193868 0 0 199 36 406 959 0 0 92 8
0 0 0 3552008 332 195928 0 0 197 11 458 1023 1 1 90 8
0 0 0 3548728 332 197660 0 0 183 7 496 1086 1 1 90 8
0 0 0 3545560 332 199372 0 0 170 8 483 1017 1 1 90 9
0 1 0 3542124 332 201256 0 0 190 1 544 1137 1 1 88 10
1 0 0 3536924 332 203296 0 0 195 7 637 1209 2 1 89 8
1 1 0 3485096 332 249184 0 0 101 16 10372 4537 13 3 76 8
2 0 0 3442004 332 279728 0 0 1086 40 219 1349 7 3 87 4
-> emerge is done reading its package database
1 0 0 3254796 332 448636 0 0 0 27 128 8360 24 6 70 0
2 0 0 3143304 332 554016 0 0 47 33 213 4480 16 11 72 1
-> kernel unpacked
1 0 0 3125700 332 560416 0 0 1 20 122 1675 24 1 75 0
1 0 0 3117356 332 567968 0 0 0 674 157 2975 24 2 73 1
2 0 0 3111636 332 573736 0 0 0 1143 151 1924 23 1 75 1
2 0 0 3102836 332 581332 0 0 0 890 153 1330 24 1 75 0
1 0 0 3097236 332 587360 0 0 0 656 194 1593 24 1 74 0
1 0 0 3086824 332 595480 0 0 0 812 235 2657 25 1 74 0
-> tar.bz2 created, installing starts now
0 0 0 3091612 332 601024 0 0 82 708 499 2397 17 4 78 1
0 0 0 3086088 332 602180 0 0 69 2459 769 2237 3 4 88 6
0 0 0 3085916 332 602236 0 0 2 1752 693 949 1 2 96 1
0 0 0 3084544 332 603564 0 0 66 4057 1176 2850 3 6 91 0
0 0 0 3078780 332 605572 0 0 98 3194 1169 3288 5 6 89 0
0 0 0 3077940 332 605924 0 0 17 1139 823 1547 1 2 97 0
0 0 0 3078268 332 605924 0 0 0 888 807 1329 0 1 99 0
-> first short stall
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3077040 332 605924 0 0 0 1950 785 1495 0 2 89 8
0 0 0 3076588 332 605896 0 0 2 3807 925 2046 1 4 95 0
0 0 0 3076900 332 606052 0 0 11 2564 768 1471 1 3 95 1
0 0 0 3071584 332 607928 0 0 87 2499 1108 3433 4 6 90 0
-> second longer stall
(emerge was not able to complete a single file move until the 'resume' line)
0 0 0 3071592 332 607928 0 0 0 693 692 1289 0 0 99 0
0 0 0 3072584 332 607928 0 0 0 792 731 1507 0 1 99 0
0 0 0 3072840 332 607928 0 0 0 806 707 1521 0 1 99 0
0 0 0 3072724 332 607928 0 0 0 782 695 1372 0 0 99 0
0 0 0 3072972 332 607928 0 0 0 677 612 1301 0 0 99 0
0 0 0 3072772 332 607928 0 0 0 738 681 1352 1 1 99 0
0 0 0 3073020 332 607928 0 0 0 785 708 1328 0 1 99 0
0 0 0 3072896 332 607928 0 0 0 833 722 1383 0 0 99 0
-> emerge resumed
0 0 0 3069476 332 607972 0 0 2 4885 812 2062 1 4 90 5
1 0 0 3069648 332 608068 0 0 4 4658 833 2158 1 4 93 2
0 0 0 3064972 332 610364 0 0 106 2494 1095 3620 5 7 88 0
0 0 0 3057536 332 612444 0 0 86 2023 1012 3440 4 6 90 0
1 0 0 3054572 332 612368 0 0 102 1526 1024 2277 6 5 87 2
-> emerge finished, but still >100MB of dirty data according to /proc/meminfo
0 0 0 3048548 332 615764 0 0 337 659 796 1000 3 1 96 0
0 0 0 3092100 332 615860 0 0 15 616 606 1040 1 0 99 0
0 0 0 3092148 332 615860 0 0 0 641 622 1085 0 0 99 0
0 0 0 3092528 332 615860 0 0 0 766 654 1055 1 1 99 0
-> slow writeout until here, might be fixed with Peter's patch to scale
the background threshold
2 0 0 3090828 332 615860 0 0 0 1804 707 1215 0 2 98 0
0 0 0 3091056 332 615864 0 0 0 3877 831 2047 1 4 94 1
3 0 0 3090780 332 615864 0 0 0 2048 784 1154 1 2 97 1
0 0 0 3091096 332 615864 0 0 0 2690 751 1538 0 3 96 1
0 1 0 3091056 332 615864 0 0 0 2018 748 866 0 2 95 2
2 0 0 3092960 332 615864 0 0 0 2076 719 1118 0 2 97 0
-> writeout "done", /proc/meminfo showed 0kb of dirty data remaining
0 0 0 3093072 332 615864 0 0 0 645 646 1104 0 0 99 0
0 0 0 3093532 332 615864 0 0 0 726 658 1223 0 1 99 0
0 0 0 3093540 332 615864 0 0 0 801 699 1314 0 1 99 0
0 0 0 3093580 332 615864 0 0 0 783 738 1350 0 1 99 0
0 0 0 3093284 332 615920 0 0 6 746 655 1381 1 1 98 0
0 0 0 3092872 332 615920 0 0 0 862 703 1391 1 1 98 0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3093224 332 615920 0 0 0 799 676 1394 0 0 99 0
0 0 0 3093304 332 615920 0 0 0 835 672 1514 1 1 98 0
0 0 0 3093476 332 615920 0 0 0 784 641 1404 1 1 98 0
0 0 0 3093264 332 615920 0 0 0 722 626 1483 1 1 99 0
0 0 0 3093476 332 615920 0 0 0 7 328 350 0 0 99 0
0 0 0 3093628 332 615920 0 0 0 11 332 407 0 0 99 0
-> disks finally go idle

Torsten

.config for 2.6.24-rc1+git attached


Attachments:
(No filename) (7.29 kB)
config.gz (11.46 kB)

2007-11-05 01:45:55

by David Chinner

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote:
> On 11/2/07, David Chinner <[email protected]> wrote:
> > That's stalled waiting on the inode cluster buffer lock. That implies
> > that the inode cluster is already being written out and the inode has
> > been redirtied during writeout.
> >
> > Does the kernel you are testing have the "flush inodes in ascending
> > inode number order" patches applied? If so, can you remove that
> > patch and see if the problem goes away?
>
> I can now confirm that I see this also with the current mainline git version.
> I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
> plus the fix for the sg changes in ieee1394.

Ok, so it's probably a side effect of the writeback changes.

Attached are two patches (two because one was in a separate patchset as
a standalone change) that should prevent async writeback from blocking
on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
Can you see if this fixes the problem?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group



Attachments:
(No filename) (1.06 kB)
xfs-factor-inotobp (9.37 kB)
xfs-iflush-blocking-fix (6.25 kB)

2007-11-05 07:01:55

by Torsten Kaiser

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On 11/5/07, David Chinner <[email protected]> wrote:
> On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote:
> > I can now confirm that I see this also with the current mainline git version.
> > I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
> > plus the fix for the sg changes in ieee1394.
>
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Applied both patches against the kernel mentioned above.
This blows up at boot:
[ 80.807589] Filesystem "dm-0": Disabling barriers, not supported by
the underlying device
[ 80.820241] XFS mounting filesystem dm-0
[ 80.913144] ------------[ cut here ]------------
[ 80.914932] kernel BUG at drivers/md/raid5.c:143!
[ 80.916751] invalid opcode: 0000 [1] SMP
[ 80.918338] CPU 3
[ 80.919142] Modules linked in:
[ 80.920345] Pid: 974, comm: md1_raid5 Not tainted 2.6.24-rc1 #3
[ 80.922628] RIP: 0010:[<ffffffff804b6ee4>] [<ffffffff804b6ee4>]
__release_stripe+0x164/0x170
[ 80.925935] RSP: 0018:ffff8100060e7dd0 EFLAGS: 00010002
[ 80.927987] RAX: 0000000000000000 RBX: ffff81010141c288 RCX: 0000000000000000
[ 80.930738] RDX: 0000000000000000 RSI: ffff81010141c288 RDI: ffff810004fb3200
[ 80.933488] RBP: ffff810004fb3200 R08: 0000000000000000 R09: 0000000000000005
[ 80.936240] R10: 0000000000000e00 R11: ffffe200038465e8 R12: ffff81010141c298
[ 80.938990] R13: 0000000000000286 R14: ffff810004fb3330 R15: 0000000000000000
[ 80.941741] FS: 000000000060c870(0000) GS:ffff810100313700(0000)
knlGS:0000000000000000
[ 80.944861] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 80.947080] CR2: 00007fff7b295000 CR3: 0000000101842000 CR4: 00000000000006e0
[ 80.949830] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 80.952580] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 80.955332] Process md1_raid5 (pid: 974, threadinfo
ffff8100060e6000, task ffff81000645c730)
[ 80.958584] Stack: ffff81010141c288 00000000000001f4
ffff810004fb3200 ffffffff804b6f2d
[ 80.961761] 00000000000001f4 ffff81010141c288 ffffffff804c8bd0
0000000000000000
[ 80.964681] ffff8100060e7ee8 ffffffff804bd094 ffff81000645c730
ffff8100060e7e70
[ 80.967518] Call Trace:
[ 80.968558] [<ffffffff804b6f2d>] release_stripe+0x3d/0x60
[ 80.970677] [<ffffffff804c8bd0>] md_thread+0x0/0x100
[ 80.972629] [<ffffffff804bd094>] raid5d+0x344/0x450
[ 80.974549] [<ffffffff8023df10>] process_timeout+0x0/0x10
[ 80.976668] [<ffffffff805ae1ca>] schedule_timeout+0x5a/0xd0
[ 80.978855] [<ffffffff804c8bd0>] md_thread+0x0/0x100
[ 80.980807] [<ffffffff804c8c00>] md_thread+0x30/0x100
[ 80.982794] [<ffffffff80249f20>] autoremove_wake_function+0x0/0x30
[ 80.985214] [<ffffffff804c8bd0>] md_thread+0x0/0x100
[ 80.987167] [<ffffffff80249b3b>] kthread+0x4b/0x80
[ 80.989054] [<ffffffff8020c9c8>] child_rip+0xa/0x12
[ 80.990972] [<ffffffff80249af0>] kthread+0x0/0x80
[ 80.992824] [<ffffffff8020c9be>] child_rip+0x0/0x12
[ 80.994743]
[ 80.995588]
[ 80.995588] Code: 0f 0b eb fe 0f 1f 84 00 00 00 00 00 48 83 ec 28
48 89 5c 24
[ 80.999307] RIP [<ffffffff804b6ee4>] __release_stripe+0x164/0x170
[ 81.001711] RSP <ffff8100060e7dd0>

Switching back to unpatched 2.6.23-mm1 boots successfully...

Torsten

2007-11-05 18:27:29

by Torsten Kaiser

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On 11/5/07, David Chinner <[email protected]> wrote:
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Now testing v2.6.24-rc1-650-gb55d1b1 plus the fix for the misapplied raid5 patch.
Applying your two patches on top of that does not fix the stalls.

vmstat 10 output from unmerging (uninstalling) a kernel:
1 0 0 3512188 332 192644 0 0 185 12 368 735 10 3 85 1
-> emerge starts to remove the kernel source files
3 0 0 3506624 332 192836 0 0 15 9825 2458 8307 7 12 81 0
0 0 0 3507212 332 192836 0 0 0 554 630 1233 0 1 99 0
0 0 0 3507292 332 192836 0 0 0 537 580 1328 0 1 99 0
0 0 0 3507168 332 192836 0 0 0 633 626 1380 0 1 99 0
0 0 0 3507116 332 192836 0 0 0 1510 768 2030 1 2 97 0
0 0 0 3507596 332 192836 0 0 0 524 540 1544 0 0 99 0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3507540 332 192836 0 0 0 489 551 1293 0 0 99 0
0 0 0 3507528 332 192836 0 0 0 527 510 1432 1 1 99 0
0 0 0 3508052 332 192840 0 0 0 2088 910 2964 2 3 95 0
0 0 0 3507888 332 192840 0 0 0 442 565 1383 1 1 99 0
0 0 0 3508704 332 192840 0 0 0 497 529 1479 0 0 99 0
0 0 0 3508704 332 192840 0 0 0 594 595 1458 0 0 99 0
0 0 0 3511492 332 192840 0 0 0 2381 1028 2941 2 3 95 0
0 0 0 3510684 332 192840 0 0 0 699 600 1390 0 0 99 0
0 0 0 3511636 332 192840 0 0 0 741 661 1641 0 0 100 0
0 0 0 3524020 332 192840 0 0 0 2452 1080 3910 2 3 95 0
0 0 0 3524040 332 192844 0 0 0 530 617 1297 0 0 99 0
0 0 0 3524128 332 192844 0 0 0 812 674 1667 0 1 99 0
0 0 0 3527000 332 193672 0 0 339 721 754 1681 3 2 93 1
-> emerge is finished, no dirty or writeback data in /proc/meminfo
0 0 0 3571056 332 194768 0 0 111 639 632 1344 0 1 99 0
0 0 0 3571260 332 194768 0 0 0 757 688 1405 1 0 99 0
0 0 0 3571156 332 194768 0 0 0 753 641 1361 0 0 99 0
0 0 0 3571404 332 194768 0 0 0 766 653 1389 0 0 99 0
1 0 0 3571136 332 194768 0 0 6 764 669 1488 0 0 99 0
0 0 0 3571668 332 194824 0 0 0 764 657 1482 0 0 99 0
0 0 0 3571848 332 194824 0 0 0 673 659 1406 0 0 99 0
0 0 0 3571908 332 195052 0 0 22 753 638 1500 0 1 99 0
0 0 0 3573052 332 195052 0 0 0 765 631 1482 0 1 99 0
0 0 0 3574144 332 195052 0 0 0 771 640 1497 0 0 99 0
0 0 0 3573468 332 195052 0 0 0 458 485 1251 0 0 99 0
0 0 0 3574184 332 195052 0 0 0 427 474 1192 0 0 100 0
0 0 0 3575092 332 195052 0 0 0 461 482 1235 0 0 99 0
0 0 0 3576368 332 195056 0 0 0 582 556 1310 0 0 99 0
0 0 0 3579300 332 195056 0 0 0 695 571 1402 0 0 99 0
0 0 0 3580376 332 195056 0 0 0 417 568 906 0 0 99 0
0 0 0 3581212 332 195056 0 0 0 421 559 977 0 1 99 0
0 0 0 3583780 332 195060 0 0 0 494 555 1080 0 1 99 0
0 0 0 3584352 332 195060 0 0 0 99 347 559 0 0 99 0
0 0 0 3585232 332 195060 0 0 0 11 301 621 0 0 99 0
-> disks go idle.

So these patches do not seem to be the source of these excessive disk writes...

Torsten

2007-11-05 23:58:21

by Andrew Morton

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On Fri, 2 Nov 2007 18:33:29 +0800
Fengguang Wu <[email protected]> wrote:

> On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> > On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
> >
> > > Interestingly, no background_writeout() appears, but only
> > > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > > block the process.
> >
> > Yeah, the background threshold is not (yet) scaled. So it can happen
> > that the bdi_dirty limit is below the background limit.
> >
> > I'm curious though as to these stalls, though, I can't seem to think of
> > what goes wrong.. esp since most writeback seems to happen from pdflush.
>
> Me confused too. The new debug patch will confirm whether emerge is
> waiting in balance_dirty_pages().
>
> > (or I'm totally misreading it - quite a possible as I'm still recovering
> > from a serious cold and not all the green stuff has yet figured out its
> > proper place wrt brain cells 'n stuff)
>
> Do take care of yourself.
>
> >
> > I still have this patch floating around:
>
> I think this patch is OK for 2.6.24 :-)
>
> Reviewed-by: Fengguang Wu <[email protected]>

I would prefer Tested-by: :(

> >
> > ---
> > Subject: mm: speed up writeback ramp-up on clean systems
> >
> > We allow violation of bdi limits if there is a lot of room on the
> > system. Once we hit half the total limit we start enforcing bdi limits
> > and bdi ramp-up should happen. Doing it this way avoids many small
> > writeouts on an otherwise idle system and should also speed up the
> > ramp-up.
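As a toy sketch (made-up numbers, not the kernel's actual arithmetic), the policy quoted above amounts to:

```shell
# Toy sketch (made-up numbers) of the quoted policy: per-bdi dirty
# limits are only enforced once total dirty pages pass half the
# global limit, so an otherwise idle system ramps up unthrottled.
enforce_bdi_limit() {  # args: total_dirty global_limit -> prints yes/no
  if [ "$1" -lt $(( $2 / 2 )) ]; then echo no; else echo yes; fi
}
enforce_bdi_limit 100 1000   # lots of room: bdi limit not enforced
enforce_bdi_limit 600 1000   # past half the global limit: enforced
```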

Given the problems we're having in there I'm a bit reluctant to go tossing
hastily put together and inadequately tested stuff onto the fire. And
that's what this patch looks like to me.

Wanna convince me otherwise?

2007-11-06 04:26:07

by David Chinner

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> On 11/5/07, David Chinner <[email protected]> wrote:
> > Ok, so it's probably a side effect of the writeback changes.
> >
> > Attached are two patches (two because one was in a separate patchset as
> > a standalone change) that should prevent async writeback from blocking
> > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > Can you see if this fixes the problem?
>
> Now testing v2.6.24-rc1-650-gb55d1b1 plus the fix for the misapplied raid5 patch.
> Applying your two patches on top of that does not fix the stalls.

So you are having RAID5 problems as well?

I'm struggling to understand what possibly changed in XFS or writeback that
would lead to stalls like this, esp. as you appear to be removing files when
the stalls occur. Rather than vmstat, can you use something like iostat to
show how busy your disks are? I.e., are we seeing RMW cycles in the raid5 or
some such issue?

OOC, what is the 'xfs_info <mtpt>' output for your filesystem?

> vmstat 10 output from unmerging (uninstalling) a kernel:
> 1 0 0 3512188 332 192644 0 0 185 12 368 735 10 3 85 1
> -> emerge starts to remove the kernel source files
> 3 0 0 3506624 332 192836 0 0 15 9825 2458 8307 7 12 81 0
> 0 0 0 3507212 332 192836 0 0 0 554 630 1233 0 1 99 0
> 0 0 0 3507292 332 192836 0 0 0 537 580 1328 0 1 99 0
> 0 0 0 3507168 332 192836 0 0 0 633 626 1380 0 1 99 0
> 0 0 0 3507116 332 192836 0 0 0 1510 768 2030 1 2 97 0
> 0 0 0 3507596 332 192836 0 0 0 524 540 1544 0 0 99 0
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 0 3507540 332 192836 0 0 0 489 551 1293 0 0 99 0
> 0 0 0 3507528 332 192836 0 0 0 527 510 1432 1 1 99 0
> 0 0 0 3508052 332 192840 0 0 0 2088 910 2964 2 3 95 0
> 0 0 0 3507888 332 192840 0 0 0 442 565 1383 1 1 99 0
> 0 0 0 3508704 332 192840 0 0 0 497 529 1479 0 0 99 0
> 0 0 0 3508704 332 192840 0 0 0 594 595 1458 0 0 99 0
> 0 0 0 3511492 332 192840 0 0 0 2381 1028 2941 2 3 95 0
> 0 0 0 3510684 332 192840 0 0 0 699 600 1390 0 0 99 0
> 0 0 0 3511636 332 192840 0 0 0 741 661 1641 0 0 100 0
> 0 0 0 3524020 332 192840 0 0 0 2452 1080 3910 2 3 95 0
> 0 0 0 3524040 332 192844 0 0 0 530 617 1297 0 0 99 0
> 0 0 0 3524128 332 192844 0 0 0 812 674 1667 0 1 99 0
> 0 0 0 3527000 332 193672 0 0 339 721 754 1681 3 2 93 1
> -> emerge is finished, no dirty or writeback data in /proc/meminfo

At this point, can you run a "sync" and see how long that takes to
complete? The only thing I can think of that would be written out after
this point is inodes, but even then it seems to go on for a long,
long time and it really doesn't seem like XFS is holding up the
inode writes.

Another option is to use blktrace/blkparse to determine which process is
issuing this I/O.
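For attributing I/O to processes, the blkparse output can be summarized per process. A minimal sketch, assuming blkparse's default line format (device, cpu, sequence, timestamp, pid, action, RWBS, sector, +, size, [process]); the embedded sample lines are made up so the sketch runs standalone, and in practice you would pipe in something like `blktrace -d /dev/sdX -o - | blkparse -i -`:

```shell
# Sketch: sum up write completions ('C' actions with a W in the RWBS
# field) per process from blkparse-style output. The embedded sample
# lines are made up so the sketch runs standalone.
parse_writes() {
  awk '$6 == "C" && $7 ~ /W/ { gsub(/[][]/, "", $NF); n[$NF] += $10 }
       END { for (p in n) print p, n[p] }'
}
printf '%s\n' \
  '  8,0    1  123     1.000000000   285  C   W 1000 + 8 [pdflush]' \
  '  8,0    0  124     1.000100000   285  C   W 2000 + 8 [pdflush]' \
  '  8,0    1  125     1.000200000   974  C   W 3000 + 16 [md1_raid5]' |
  parse_writes | sort
```

For the sample input this prints one line per process with its total write size in blocks.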

> 0 0 0 3583780 332 195060 0 0 0 494 555 1080 0 1 99 0
> 0 0 0 3584352 332 195060 0 0 0 99 347 559 0 0 99 0
> 0 0 0 3585232 332 195060 0 0 0 11 301 621 0 0 99 0
> -> disks go idle.
>
> So these patches do not seem to be the source of these excessive disk writes...

Well, the patches I posted should prevent blocking in the places that it
was seen, so if that does not stop the slowdowns then either the writeback
code is not feeding us inodes fast enough or the block device below is
having some kind of problem....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-11-06 07:10:34

by Torsten Kaiser

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On 11/6/07, David Chinner <[email protected]> wrote:
> On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> > On 11/5/07, David Chinner <[email protected]> wrote:
> > > Ok, so it's probably a side effect of the writeback changes.
> > >
> > > Attached are two patches (two because one was in a separate patchset as
> > > a standalone change) that should prevent async writeback from blocking
> > > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > > Can you see if this fixes the problem?
> >
> > Now testing v2.6.24-rc1-650-gb55d1b1 plus the fix for the misapplied raid5 patch.
> > Applying your two patches on top of that does not fix the stalls.
>
> So you are having RAID5 problems as well?

The first 2.6.24-rc1 git kernel that I patched with your patches did
not boot for me. (Oops sent in one of my previous mails.) But given
that the stacktrace was not XFS-related and I had seen this patch on
lkml, I tried to fix the Oops this way.
I did not have any trouble with the RAID5 otherwise.

> I'm struggling to understand what possibly changed in XFS or writeback that
> would lead to stalls like this, esp. as you appear to be removing files when
> the stalls occur. Rather than vmstat, can you use something like iostat to
> show how busy your disks are? I.e., are we seeing RMW cycles in the raid5 or
> some such issue?

Will do this this evening.

> OOC, what is the 'xfs_info <mtpt>' output for your filesystem?

meta-data=/dev/mapper/root isize=256 agcount=32, agsize=4731132 blks
= sectsz=512 attr=1
data = bsize=4096 blocks=151396224, imaxpct=25
= sunit=0 swidth=0 blks, unwritten=1
naming =version 2 bsize=4096
log =internal bsize=4096 blocks=32768, version=1
= sectsz=512 sunit=0 blks, lazy-count=0
realtime =none extsz=4096 blocks=0, rtextents=0


> > vmstat 10 output from unmerging (uninstalling) a kernel:
> > 1 0 0 3512188 332 192644 0 0 185 12 368 735 10 3 85 1
> > -> emerge starts to remove the kernel source files
> > 3 0 0 3506624 332 192836 0 0 15 9825 2458 8307 7 12 81 0
> > 0 0 0 3507212 332 192836 0 0 0 554 630 1233 0 1 99 0
> > 0 0 0 3507292 332 192836 0 0 0 537 580 1328 0 1 99 0
> > 0 0 0 3507168 332 192836 0 0 0 633 626 1380 0 1 99 0
> > 0 0 0 3507116 332 192836 0 0 0 1510 768 2030 1 2 97 0
> > 0 0 0 3507596 332 192836 0 0 0 524 540 1544 0 0 99 0
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 0 0 0 3507540 332 192836 0 0 0 489 551 1293 0 0 99 0
> > 0 0 0 3507528 332 192836 0 0 0 527 510 1432 1 1 99 0
> > 0 0 0 3508052 332 192840 0 0 0 2088 910 2964 2 3 95 0
> > 0 0 0 3507888 332 192840 0 0 0 442 565 1383 1 1 99 0
> > 0 0 0 3508704 332 192840 0 0 0 497 529 1479 0 0 99 0
> > 0 0 0 3508704 332 192840 0 0 0 594 595 1458 0 0 99 0
> > 0 0 0 3511492 332 192840 0 0 0 2381 1028 2941 2 3 95 0
> > 0 0 0 3510684 332 192840 0 0 0 699 600 1390 0 0 99 0
> > 0 0 0 3511636 332 192840 0 0 0 741 661 1641 0 0 100 0
> > 0 0 0 3524020 332 192840 0 0 0 2452 1080 3910 2 3 95 0
> > 0 0 0 3524040 332 192844 0 0 0 530 617 1297 0 0 99 0
> > 0 0 0 3524128 332 192844 0 0 0 812 674 1667 0 1 99 0
> > 0 0 0 3527000 332 193672 0 0 339 721 754 1681 3 2 93 1
> > -> emerge is finished, no dirty or writeback data in /proc/meminfo
>
> At this point, can you run a "sync" and see how long that takes to
> complete?

Already tried that: http://lkml.org/lkml/2007/11/2/178
See the logs from the second unmerge in the second half of the mail.

The sync did not stop this writeout, but returned immediately.
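For reference, a generic way to check whether a sync actually has anything to push (this is a sketch, not the exact commands run at the time): snapshot the Dirty/Writeback counters from /proc/meminfo around the sync.

```shell
# Sketch: snapshot Dirty/Writeback around a sync; if sync returns at
# once while pages stay dirty, nothing was queued for writeback.
# Counters are in kB; reads /proc/meminfo (or a file given as $1).
dirty() { awk '/^(Dirty|Writeback):/ { print }' "${1:-/proc/meminfo}"; }
if [ -r /proc/meminfo ]; then
  echo "before:"; dirty
  sync
  echo "after:";  dirty
fi
```

If Dirty stays high after sync returns, the pages are pinned somewhere rather than merely queued behind a slow device.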

> The only thing I can think of that would be written out after
> this point is inodes, but even then it seems to go on for a long,
> long time and it really doesn't seem like XFS is holding up the
> inode writes.

Yes, I completely agree that this is much too long. That's why I included
the after-emerge-finished parts of the logs. But I still partly
suspect XFS, because xfssyncd shows up when I hit SysRq+W.

> Another option is to use blktrace/blkparse to determine which process is
> issuing this I/O.
>
> > 0 0 0 3583780 332 195060 0 0 0 494 555 1080 0 1 99 0
> > 0 0 0 3584352 332 195060 0 0 0 99 347 559 0 0 99 0
> > 0 0 0 3585232 332 195060 0 0 0 11 301 621 0 0 99 0
> > -> disks go idle.
> >
> > So these patches do not seem to be the source of these excessive disk writes...
>
> Well, the patches I posted should prevent blocking in the places that it
> was seen, so if that does not stop the slowdowns then either the writeback
> code is not feeding us inodes fast enough or the block device below is
> having some kind of problem....

I don't think it's the block device, because reading/writing larger
files does not seem to be troubled. It looks much more like an inode
problem. For example, both installing and uninstalling kernel source
trees show these stalls, but during uninstalling this is much more
noticeable.

But I agree that this might not be xfs-specific, as this showed up at
the same time as other people started reporting the 100% iowait
bug. Could be that this is the same bug and the differences between
reiserfs and xfs explain the iowait vs. idle. Or the reason I don't
see the 100% iowait is something else on my system...

Torsten

2007-11-06 09:18:01

by Wu Fengguang

Subject: Re: writeout stalls in current -git

On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> [ 547.200000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858 > global 12829 72 0 wc __ tw 0 sk 0
> [ 550.480000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834 > global 12017 62 0 wc __ tw 0 sk 0
> [ 552.710000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810 > global 11133 83 0 wc __ tw 0 sk 0
> [ 558.660000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786 > global 10470 33 0 wc _M tw 0 sk 0
4s
> [ 562.750000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762 > global 10555 69 0 wc _M tw 0 sk 0
3s
> [ 565.150000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738 > global 9562 498 0 wc _M tw -2 sk 0
4s
> [ 569.490000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712 > global 8960 2 0 wc _M tw 0 sk 0
3s
> [ 572.910000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688 > global 8088 205 0 wc _M tw -13 sk 0
2s
> [ 574.610000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651 > global 7114 188 0 wc _M tw -1 sk 0
10s
> [ 584.270000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626 > global 14544 0 0 wc _M tw -1 sk 0
9s
> [ 593.050000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601 > global 24583 736 0 wc _M tw -1 sk 0
7s
> [ 600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576 > global 27004 6 0 wc _M tw 587 sk 0
> [ 600.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139 > global 27004 6 0 wc __ tw 1014 sk 0

The above messages and the 'D'-state pdflush below indicate that a
single writeback_inodes(4MB) call can take a long time (up to 10s!) to complete.

Let's try reverting the patch below with `patch -R`. It looks like
the most relevant change, assuming it's not a low-level bug.

> [note] first stall, the output from emerge stops, so it seems it can
> not start processing the next file until the stall ends
> [ 630.000000] SysRq : Emergency Sync
> [ 630.120000] Emergency Sync complete
> [ 632.850000] SysRq : Show Blocked State
> [ 632.850000] task PC stack pid father
> [ 632.850000] pdflush D ffff81000f091788 0 285 2
> [ 632.850000] ffff810005d4da80 0000000000000046 0000000000000800
> 0000007000000001
> [ 632.850000] ffff81000fd52400 ffffffff8022d61c ffffffff80819b00
> ffffffff80819b00
> [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810100316f98
> 0000000000000000
> [ 632.850000] Call Trace:
> [ 632.850000] [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
> [ 632.850000] [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
> [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e
> [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
> [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a
> [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
> [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
> [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
> [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
> [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
> [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
> [ 632.850000] [<ffffffff8026b293>] write_cache_pages+0x123/0x330
> [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520
> [ 632.850000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60
> [ 632.850000] [<ffffffff8036ed49>] xfs_inode_flush+0x179/0x1b0
> [ 632.850000] [<ffffffff8037ca8f>] xfs_fs_write_inode+0x2f/0x90
> [ 632.850000] [<ffffffff802b3aac>] __writeback_single_inode+0x2ac/0x380
> [ 632.850000] [<ffffffff804d074e>] dm_table_any_congested+0x2e/0x80
> [ 632.850000] [<ffffffff802b3f9d>] generic_sync_sb_inodes+0x20d/0x330
> [ 632.850000] [<ffffffff802b4532>] writeback_inodes+0xa2/0xe0
> [ 632.850000] [<ffffffff8026bfd6>] wb_kupdate+0xa6/0x140
> [ 632.850000] [<ffffffff8026c4b0>] pdflush+0x0/0x1e0
> [ 632.850000] [<ffffffff8026c5c0>] pdflush+0x110/0x1e0
> [ 632.850000] [<ffffffff8026bf30>] wb_kupdate+0x0/0x140
> [ 632.850000] [<ffffffff8024a32b>] kthread+0x4b/0x80
> [ 632.850000] [<ffffffff8020c9d8>] child_rip+0xa/0x12
> [ 632.850000] [<ffffffff8024a2e0>] kthread+0x0/0x80
> [ 632.850000] [<ffffffff8020c9ce>] child_rip+0x0/0x12
> [ 632.850000]
> [ 632.850000] emerge D 0000000000000000 0 6220 6129
> [ 632.850000] ffff810103ced9f8 0000000000000086 0000000000000000
> 0000007000000001
> [ 632.850000] ffff81000fd52cf8 ffffffff00000000 ffffffff80819b00
> ffffffff80819b00
> [ 632.850000] ffffffff80815f40 ffffffff80819b00 ffff810103ced9b8
> ffff810103ced9a8
> [ 632.850000] Call Trace:
> [ 632.850000] [<ffffffff805b16e7>] __down+0xa7/0x11e
> [ 632.850000] [<ffffffff8022da70>] default_wake_function+0x0/0x10
> [ 632.850000] [<ffffffff805b1365>] __down_failed+0x35/0x3a
> [ 632.850000] [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
> [ 632.850000] [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
> [ 632.850000] [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
> [ 632.850000] [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
> [ 632.850000] [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
> [ 632.850000] [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
> [ 632.850000] [<ffffffff80375bee>] xfs_buf_rele+0x2e/0xd0
> [ 632.850000] [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520
> [ 632.850000] [<ffffffff803ae5d2>] __down_read_trylock+0x42/0x60
> [ 632.850000] [<ffffffff80355c82>] xfs_inode_item_push+0x12/0x20
> [ 632.850000] [<ffffffff80368247>] xfs_trans_push_ail+0x267/0x2b0
> [ 632.850000] [<ffffffff8035c742>] xfs_log_reserve+0x72/0x120
> [ 632.850000] [<ffffffff80366bf8>] xfs_trans_reserve+0xa8/0x210
> [ 632.850000] [<ffffffff803731f2>] kmem_zone_zalloc+0x32/0x50
> [ 632.850000] [<ffffffff8035263b>] xfs_itruncate_finish+0xfb/0x310
> [ 632.850000] [<ffffffff8036daeb>] xfs_free_eofblocks+0x23b/0x280
> [ 632.850000] [<ffffffff80371f93>] xfs_release+0x153/0x200
> [ 632.850000] [<ffffffff80378010>] xfs_file_release+0x10/0x20
> [ 632.850000] [<ffffffff80294251>] __fput+0xb1/0x220
> [ 632.850000] [<ffffffff802910a4>] filp_close+0x54/0x90
> [ 632.850000] [<ffffffff802929bf>] sys_close+0x9f/0x100
> [ 632.850000] [<ffffffff8020bbbe>] system_call+0x7e/0x83
> [ 632.850000]
> [ 662.180000] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 73045
> global 39157 0 0 wc __ tw 0 sk 0
> [note] emerge resumed
> [ 664.030000] SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK
> showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks
> Unmount shoW-blocked-tasks

------------------------------------------------------
Subject: writeback: remove pages_skipped accounting in __block_write_full_page()
From: Fengguang Wu <[email protected]>

Miklos Szeredi <[email protected]> and I identified a writeback bug:

> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.

It can be produced by the following test scheme:

# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done

# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023

% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x

% dmesg|grep 847824 # generated by a debug printk
[ 529.263184] redirtied inode 847824 line 548
[ 564.250872] redirtied inode 847824 line 548
[ 594.272797] redirtied inode 847824 line 548
[ 629.231330] redirtied inode 847824 line 548
[ 659.224674] redirtied inode 847824 line 548
[ 689.219890] redirtied inode 847824 line 548
[ 724.226655] redirtied inode 847824 line 548
[ 759.198568] redirtied inode 847824 line 548

# line 548 in fs/fs-writeback.c:
543 if (wbc->pages_skipped != pages_skipped) {
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
548 redirty_tail(inode);
549 }

Further debugging shows that __block_write_full_page()
never gets the chance to call submit_bh() for that big dirty file:
the buffer heads are *clean*. So basically no page I/O is issued by
__block_write_full_page(), hence pages_skipped goes up.

Also the comment in generic_sync_sb_inodes():

544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */

and the comment in __block_write_full_page():

1713 /*
1714 * The page was marked dirty, but the buffers were
1715 * clean. Someone wrote them back by hand with
1716 * ll_rw_block/submit_bh. A rare case.
1717 */

do not quite agree with each other. The page writeback should be skipped for
'locked buffers', but here the buffers are *clean*!

This patch fixes the bug, though I'm not sure why __block_write_full_page()
is called only to do nothing, or who actually issued the writeback for us.

These are the two possible new behaviors after the patch:

1) pretty nice: wait 30s and write ALL:)
2) not so good:
- during the dd: ~16M
- after 30s: ~4M
- after 5s: ~4M
- after 5s: ~176M

The next patch will fix case (2).

Cc: David Chinner <[email protected]>
Cc: Ken Chen <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
Signed-off-by: David Chinner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

fs/buffer.c | 1 -
fs/xfs/linux-2.6/xfs_aops.c | 5 ++---
2 files changed, 2 insertions(+), 4 deletions(-)

diff -puN fs/buffer.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page fs/buffer.c
--- a/fs/buffer.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page
+++ a/fs/buffer.c
@@ -1730,7 +1730,6 @@ done:
* The page and buffer_heads can be released at any time from
* here on.
*/
- wbc->pages_skipped++; /* We didn't write this page */
}
return err;

diff -puN fs/xfs/linux-2.6/xfs_aops.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page fs/xfs/linux-2.6/xfs_aops.c
--- a/fs/xfs/linux-2.6/xfs_aops.c~writeback-remove-pages_skipped-accounting-in-__block_write_full_page
+++ a/fs/xfs/linux-2.6/xfs_aops.c
@@ -402,10 +402,9 @@ xfs_start_page_writeback(
clear_page_dirty_for_io(page);
set_page_writeback(page);
unlock_page(page);
- if (!buffers) {
+ /* If no buffers on the page are to be written, finish it here */
+ if (!buffers)
end_page_writeback(page);
- wbc->pages_skipped++; /* We didn't write this page */
- }
}

static inline int bio_add_buffer(struct bio *bio, struct buffer_head *bh)
_

Patches currently in -mm which might be from [email protected] are

origin.patch

2007-11-06 10:20:58

by Peter Zijlstra

Subject: Re: writeout stalls in current -git

On Mon, 2007-11-05 at 15:57 -0800, Andrew Morton wrote:

> > > Subject: mm: speed up writeback ramp-up on clean systems
> > >
> > > We allow violation of bdi limits if there is a lot of room on the
> > > system. Once we hit half the total limit we start enforcing bdi limits
> > > and bdi ramp-up should happen. Doing it this way avoids many small
> > > writeouts on an otherwise idle system and should also speed up the
> > > ramp-up.
>
> Given the problems we're having in there I'm a bit reluctant to go tossing
> hastily put together and inadequately tested stuff onto the fire. And
> that's what this patch looks like to me.

Not really hastily; I think it was written before the stuff hit
mainline. Inadequately tested, perhaps; it's been in my and probably Wu's
kernels for a while. Granted, that's not a lot of testing in the face of
those who have problems atm.

> Wanna convince me otherwise?

I'm perfectly happy with this patch earning its credits in -mm for a
while and maybe going in around -rc4 or something like that (hoping that
by then we've fixed these nagging issues).

Another patch I did come up with yesterday - not driven by any problems
in that area - could perhaps join this one on that path:

---
Subject: mm: bdi: tweak task dirty penalty

Penalizing heavy dirtiers with 1/8-th the total dirty limit might be rather
excessive on large memory machines. Use sqrt to scale it sub-linearly.

Update the comment while we're there.

Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/page-writeback.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6-2/mm/page-writeback.c
===================================================================
--- linux-2.6-2.orig/mm/page-writeback.c
+++ linux-2.6-2/mm/page-writeback.c
@@ -213,17 +213,21 @@ static inline void task_dirties_fraction
}

/*
- * scale the dirty limit
+ * Task specific dirty limit:
*
- * task specific dirty limit:
+ * dirty -= 8 * sqrt(dirty) * p_{t}
*
- * dirty -= (dirty/8) * p_{t}
+ * Penalize tasks that dirty a lot of pages by lowering their dirty limit. This
+ * prevents infrequent dirtiers from getting stuck in these other guys' dirty
+ * pages.
+ *
+ * Use a sub-linear function to scale the penalty, we only need a little room.
*/
void task_dirty_limit(struct task_struct *tsk, long *pdirty)
{
long numerator, denominator;
long dirty = *pdirty;
- u64 inv = dirty >> 3;
+ u64 inv = 8*int_sqrt(dirty);

task_dirties_fraction(tsk, &numerator, &denominator);
inv *= numerator;


2007-11-06 16:25:25

by Jonathan Corbet

Subject: Patch tags [was writeout stalls in current -git]

Andrew wrote:

> > Reviewed-by: Fengguang Wu <[email protected]>
>
> I would prefer Tested-by: :(

This seems like as good an opportunity as any to toss my patch tags
document out there one more time. I still think it's a good idea to
codify some sort of consensus on what these tags mean...

jon

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index 299615d..1948a93 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -286,6 +286,8 @@ parport.txt
- how to use the parallel-port driver.
parport-lowlevel.txt
- description and usage of the low level parallel port functions.
+patch-tags
+ - description of the tags which can be added to patches
pci-error-recovery.txt
- info on PCI error recovery.
pci.txt
diff --git a/Documentation/patch-tags b/Documentation/patch-tags
new file mode 100644
index 0000000..6acde5e
--- /dev/null
+++ b/Documentation/patch-tags
@@ -0,0 +1,76 @@
+Patches headed for the mainline may contain a variety of tags documenting
+who played a hand in (or was at least aware of) their progress. All of
+these tags have the form:
+
+ Something-done-by: Full name <email@address> [optional random stuff]
+
+These tags are:
+
+From: The original author of the patch. This tag will ensure
+ that credit is properly given when somebody other than the
+ original author submits the patch.
+
+Signed-off-by: A person adding a Signed-off-by tag is attesting that the
+ patch is, to the best of his or her knowledge, legally able
+ to be merged into the mainline and distributed under the
+ terms of the GNU General Public License, version 2. See
+ the Developer's Certificate of Origin, found in
+ Documentation/SubmittingPatches, for the precise meaning of
+ Signed-off-by. This tag assures upstream maintainers that
+ the provenance of the patch is known and allows the origin
+ of the patch to be reviewed should copyright questions
+ arise.
+
+Acked-by: The person named (who should be an active developer in the
+ area addressed by the patch) is aware of the patch and has
+ no objection to its inclusion; it informs upstream
+ maintainers that a certain degree of consensus on the patch
+ has been achieved. An Acked-by tag does not imply any
+ involvement in the development of the patch or that a
+ detailed review was done.
+
+Reviewed-by: The patch has been reviewed and found acceptable according
+ to the Reviewer's Statement as found at the bottom of this
+ file. A Reviewed-by tag is a statement of opinion that the
+ patch is an appropriate modification of the kernel without
+ any remaining serious technical issues. Any interested
+ reviewer (who has done the work) can offer a Reviewed-by
+ tag for a patch. This tag serves to give credit to
+ reviewers and to inform maintainers of the degree of review
+ which has been done on the patch.
+
+Cc: The person named was given the opportunity to comment on
+ the patch. This is the only tag which might be added
+ without an explicit action by the person it names. This
+ tag documents that potentially interested parties have been
+ included in the discussion.
+
+Tested-by: The patch has been successfully tested (in some
+ environment) by the person named. This tag informs
+ maintainers that some testing has been performed, provides
+ a means to locate testers for future patches, and ensures
+ credit for the testers.
+
+
+----
+
+Reviewer's statement of oversight, v0.02
+
+By offering my Reviewed-by: tag, I state that:
+
+ (a) I have carried out a technical review of this patch to evaluate its
+ appropriateness and readiness for inclusion into the mainline kernel.
+
+ (b) Any problems, concerns, or questions relating to the patch have been
+ communicated back to the submitter. I am satisfied with the
+ submitter's response to my comments.
+
+ (c) While there may be things that could be improved with this submission,
+ I believe that it is, at this time, (1) a worthwhile modification to
+ the kernel, and (2) free of known issues which would argue against its
+ inclusion.
+
+ (d) While I have reviewed the patch and believe it to be sound, I do not
+ (unless explicitly stated elsewhere) make any warranties or guarantees
+ that it will achieve its stated purpose or function properly in any
+ given situation.

2007-11-06 17:04:06

by Balbir Singh

Subject: Re: Patch tags [was writeout stalls in current -git]

> This seems like as good an opportunity as any to toss my patch tags
> document out there one more time. I still think it's a good idea to
> codify some sort of consensus on what these tags mean...
>
> jon
>

[snip]

> +By offering my Reviewed-by: tag, I state that:
> +
> + (a) I have carried out a technical review of this patch to evaluate its
> + appropriateness and readiness for inclusion into the mainline kernel.
> +
> + (b) Any problems, concerns, or questions relating to the patch have been
> + communicated back to the submitter. I am satisfied with the
> + submitter's response to my comments.
> +
> + (c) While there may be things that could be improved with this submission,
> + I believe that it is, at this time, (1) a worthwhile modification to
> + the kernel, and (2) free of known issues which would argue against its
> + inclusion.
> +
> + (d) While I have reviewed the patch and believe it to be sound, I do not
> + (unless explicitly stated elsewhere) make any warranties or guarantees
> + that it will achieve its stated purpose or function properly in any
> + given situation.

How about adding a Commented-on-by?

Initial versions that are not yet suitable or are still shaping up get
commented on by several people. A person who comments on one version
might not do a thorough review of the entire code, but through a
series of comments has contributed by pushing the developer in the
right direction.

Balbir

2007-11-06 19:01:42

by Peter Zijlstra

Subject: Re: writeout stalls in current -git

On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:

> I'm struggling to understand what possible changed in XFS or writeback that
> would lead to stalls like this, esp. as you appear to be removing files when
> the stalls occur.

Just a crazy idea,..

Could there be a set_page_dirty() that doesn't have a
balance_dirty_pages() call nearby? For example, modifying metadata in
unlink?

Such a situation could lead to an excess of dirty pages and the next
call to balance_dirty_pages() would appear to stall, as it would
desperately try to get below the limit again.

2007-11-06 20:26:21

by Torsten Kaiser

Subject: Re: writeout stalls in current -git

On 11/6/07, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:
>
> > I'm struggling to understand what possible changed in XFS or writeback that
> > would lead to stalls like this, esp. as you appear to be removing files when
> > the stalls occur.
>
> Just a crazy idea,..
>
> Could there be a set_page_dirty() that doesn't have
> balance_dirty_pages() call near? For example modifying meta data in
> unlink?
>
> Such a situation could lead to an excess of dirty pages and the next
> call to balance_dirty_pages() would appear to stall, as it would
> desperately try to get below the limit again.

Only if accounting of the dirty pages is also broken.
In the unmerge testcase I see, most of the time, only <200kB of dirty
data in /proc/meminfo.

The system has 4GB of RAM, so I'm not sure it should ever be valid
to stall even in the emerge/install testcase.

Torsten

Now building a kernel with the skipped-pages-accounting-patch reverted...

2007-11-06 21:53:41

by Torsten Kaiser

[permalink] [raw]
Subject: Re: writeout stalls in current -git

On 11/6/07, Fengguang Wu <[email protected]> wrote:
> ------------------------------------------------------
> Subject: writeback: remove pages_skipped accounting in __block_write_full_page()
> From: Fengguang Wu <[email protected]>
>
> Miklos Szeredi <[email protected]> and me identified a writeback bug:
[snip]
> fs/buffer.c | 1 -
> fs/xfs/linux-2.6/xfs_aops.c | 5 ++---
> 2 files changed, 2 insertions(+), 4 deletions(-)

I have now tested v2.6.24-rc1-748-g2655e2c with the above patch reverted.
It still stalls.

On 11/6/07, David Chinner <[email protected]> wrote:
> Rather than vmstat, can you use something like iostat to show how busy your
> disks are? i.e. are we seeing RMW cycles in the raid5 or some such issue.

Both "vmstat 10" and "iostat -x 10" output from this test:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 3700592 0 85424 0 0 31 83 108 244 2 1 95 1
-> emerge reads something, don't know for sure what...
1 0 0 3665352 0 87940 0 0 239 2 343 585 2 1 97 0
0 0 0 3657728 0 91228 0 0 322 35 445 833 0 0 99 0
1 0 0 3653136 0 94692 0 0 330 33 455 844 1 1 98 0
0 0 0 3646836 0 97720 0 0 289 3 422 751 1 1 98 0
0 0 0 3616468 0 99692 0 0 185 33 399 614 9 3 87 1
-> starts to remove the kernel tree
0 0 0 3610452 0 102592 0 0 138 3598 1398 3945 3 6 90 1
0 0 0 3607136 0 104548 0 0 2 5962 1919 6070 4 9 87 0
0 0 0 3606636 0 105080 0 0 0 1539 810 2200 1 2 97 0
-> first stall 28 sec.
0 0 0 3606592 0 105292 0 0 0 698 679 1390 0 1 99 0
0 0 0 3606440 0 105532 0 0 0 658 690 1457 0 0 99 0
0 0 0 3606068 0 106128 0 0 1 1780 947 1982 1 3 96 0
-> second stall 24 sec.
0 0 0 3606036 0 106464 0 0 4 858 758 1457 0 1 98 0
0 0 0 3605380 0 106872 0 0 0 1173 807 1880 1 2 97 0
0 0 0 3605000 0 107748 0 0 1 2413 1103 2996 2 4 94 0
-> third stall 38 sec.
0 0 0 3604488 0 108472 0 0 45 897 748 1577 0 1 98 0
0 0 0 3604176 0 108764 0 0 0 824 752 1700 0 1 98 0
0 0 0 3604012 0 108988 0 0 0 660 643 1237 0 1 99 0
0 0 0 3608936 0 110120 0 0 1 3490 1232 3455 3 5 91 0
-> fourth stall 64 sec.
1 0 0 3609060 0 110296 0 0 0 568 669 1222 0 1 99 0
0 0 0 3609464 0 110496 0 0 0 604 638 1366 0 1 99 0
0 0 0 3609244 0 110740 0 0 0 844 714 1282 0 1 99 0
0 0 0 3609508 0 110912 0 0 0 552 584 1185 1 1 99 0
2 0 0 3609436 0 111132 0 0 0 658 643 1442 0 1 99 0
0 0 0 3609212 0 111348 0 0 0 714 637 1382 0 0 99 0
0 0 0 3619132 0 110492 0 0 130 1086 736 1870 4 3 91 2
0 0 0 3657016 0 115496 0 0 466 589 718 1367 1 1 98 0
-> emerge finishes; dirty data was <1MB the whole time, now stays below 300kB
(btrace running...)
0 0 0 3657844 0 115660 0 0 0 564 635 1226 1 1 99 0
0 0 0 3658236 0 115840 0 0 0 582 600 1248 1 0 99 0
0 0 0 3658296 0 116012 0 0 0 566 606 1232 1 1 99 0
0 0 0 3657924 0 116212 0 0 0 688 596 1321 1 0 99 0
0 0 0 3658252 0 116416 0 0 0 631 642 1356 1 0 98 0
0 0 0 3658184 0 116592 0 0 0 566 575 1273 0 0 99 0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 3658344 0 116772 0 0 0 649 606 1301 0 0 99 0
0 0 0 3658548 0 116976 0 0 0 617 624 1345 0 0 99 0
0 0 0 3659204 0 117160 0 0 0 550 576 1223 1 1 99 0
0 0 0 3659944 0 117344 0 0 0 620 583 1272 0 0 99 0
0 0 0 3660548 0 117540 0 0 0 605 611 1338 0 0 99 0
0 0 0 3661236 0 117732 0 0 0 582 569 1275 0 0 99 0
0 0 0 3662420 0 117888 0 0 0 590 571 1157 0 0 99 0
0 0 0 3664324 0 118068 0 0 0 566 553 1222 1 1 99 0
0 0 0 3665240 0 118168 0 0 0 401 574 862 0 0 99 0
0 0 0 3666984 0 118280 0 0 0 454 574 958 1 1 99 0
0 0 0 3668664 0 118400 0 0 0 396 559 946 0 0 99 0
0 0 0 3670628 0 118496 0 0 0 296 495 784 0 0 99 0
0 0 0 3671316 0 118496 0 0 0 36 334 307 0 0 99 0
-> disks go idle

I also saved the btrace output, but even with bzip2 it is ~1.6MB.

Summary from btrace
Total (253,0):
Reads Queued: 5,385, 21,540KiB Writes Queued: 91,076, 362,640KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 5,385, 21,540KiB Writes Completed: 91,076, 362,640KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
IO unplugs: 8,883 Timer unplugs: 0

Throughput (R/W): 38KiB/s / 654KiB/s
Events (253,0): 201,805 entries
Skips: 0 forward (0 - 0.0%)

The last 20% of the btrace looks more or less completely like this; no
other programs do any I/O...

253,0 3 104626 526.293450729 974 C WS 79344288 + 8 [0]
253,0 3 104627 526.293455078 974 C WS 79344296 + 8 [0]
253,0 1 36469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd]
253,0 1 36470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd]
253,0 1 36471 444.523967430 1068 Q WS 117078784 + 8 [xfssyncd]
253,0 1 36472 444.523970097 1068 Q WS 117078792 + 8 [xfssyncd]
253,0 1 36473 444.548753821 1068 Q WS 117078784 + 8 [xfssyncd]
253,0 1 36474 444.548756324 1068 Q WS 117078792 + 8 [xfssyncd]
253,0 1 36475 444.553960214 1068 Q WS 195314144 + 8 [xfssyncd]
253,0 1 36476 444.553962765 1068 Q WS 195314152 + 8 [xfssyncd]
253,0 3 104628 526.310490373 974 C WS 154998480 + 8 [0]
253,0 3 104629 526.310490374 974 C WS 154998488 + 8 [0]
253,0 3 104630 526.310490386 974 C WS 154998480 + 8 [0]
253,0 3 104631 526.310490387 974 C WS 154998488 + 8 [0]
253,0 3 104632 526.310565814 974 C WS 117078784 + 8 [0]
253,0 3 104633 526.310570195 974 C WS 117078792 + 8 [0]
253,0 3 104634 526.313450024 974 C WS 117078784 + 8 [0]
253,0 3 104635 526.313454317 974 C WS 117078792 + 8 [0]
253,0 1 36477 444.583070774 1068 Q WS 195314144 + 8 [xfssyncd]
253,0 1 36478 444.583075517 1068 Q WS 195314152 + 8 [xfssyncd]
253,0 1 36479 444.583954077 1068 Q WS 233141680 + 8 [xfssyncd]
253,0 1 36480 444.583956804 1068 Q WS 233141688 + 8 [xfssyncd]
253,0 1 36481 444.619241615 1068 Q WS 233165296 + 8 [xfssyncd]
253,0 1 36482 444.619247992 1068 Q WS 233165304 + 8 [xfssyncd]
253,0 3 104636 526.320490406 974 C WS 195314144 + 8 [0]
253,0 3 104637 526.320490407 974 C WS 195314152 + 8 [0]
253,0 3 104638 526.320490419 974 C WS 195314144 + 8 [0]
253,0 3 104639 526.320490420 974 C WS 195314152 + 8 [0]
253,0 3 104640 526.348720498 974 C WS 233141680 + 8 [0]
253,0 3 104641 526.348724614 974 C WS 233141688 + 8 [0]
253,0 1 36483 444.643863141 1068 Q WS 272297440 + 8 [xfssyncd]
253,0 1 36484 444.643863143 1068 Q WS 272297448 + 8 [xfssyncd]
253,0 1 36485 444.675408559 1068 Q WS 272297440 + 8 [xfssyncd]
253,0 1 36486 444.675412236 1068 Q WS 272297448 + 8 [xfssyncd]

iostat -x 10 output follows; each line from the above vmstat should
correspond to one iostat block:

Linux 2.6.24-rc1 (treogen) 11/06/07

avg-cpu: %user %nice %system %iowait %steal %idle
2.27 0.00 1.13 1.41 0.00 95.18

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 14.46 34.81 7.63 15.11 176.60 418.68 26.17 0.53 23.07 5.59 12.71
sdb 14.54 34.60 7.51 14.91 176.01 415.36 26.38 0.43 19.29 5.00 11.20
sdc 14.62 34.50 7.55 15.29 177.12 417.73 26.04 0.47 20.42 5.31 12.12
md1 0.00 0.00 31.99 80.06 254.70 636.20 7.95 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.04 0.00 8.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 31.88 80.06 254.53 636.20 7.96 24.99 223.19 2.56 28.68
sdd 0.46 0.00 0.10 0.00 0.88 0.00 8.99 0.00 5.25 1.91 0.02

avg-cpu: %user %nice %system %iowait %steal %idle
1.85 0.00 0.55 0.19 0.00 97.41

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 16.80 0.00 4.00 0.80 166.40 11.20 37.00 0.07 13.96 12.29 5.90
sdb 19.40 0.00 4.50 0.70 191.20 10.40 38.77 0.07 13.27 10.77 5.60
sdc 18.20 0.10 6.50 1.30 197.60 14.50 27.19 0.11 13.85 11.41 8.90
md1 0.00 0.00 69.50 0.20 556.00 0.50 7.98 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 69.50 0.20 556.00 0.50 7.98 0.67 9.53 2.38 16.60
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.45 0.00 0.47 0.05 0.00 99.03

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 6.70 0.10 21.90 1.10 228.80 15.20 10.61 0.43 18.70 16.70 38.40
sdb 6.00 0.10 19.30 1.00 201.60 14.40 10.64 0.33 16.40 15.22 30.90
sdc 5.70 0.20 21.50 1.50 217.60 49.30 11.60 0.40 17.22 15.13 34.80
md1 0.00 0.00 81.10 0.70 648.80 5.60 8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 81.10 0.70 648.80 5.60 8.00 1.61 19.73 12.11 99.10
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.94 0.00 0.79 0.02 0.00 98.24

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 4.70 0.00 21.10 1.50 206.40 22.40 10.12 0.40 17.65 16.11 36.40
sdb 6.20 0.10 20.80 1.50 216.00 23.20 10.73 0.35 15.70 13.50 30.10
sdc 5.50 0.10 23.60 2.40 232.80 57.00 11.15 0.46 17.65 14.96 38.90
md1 0.00 0.00 81.80 0.40 654.40 2.50 7.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 81.80 0.40 654.40 2.50 7.99 1.55 18.84 11.98 98.50
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
1.63 0.00 1.09 0.00 0.00 97.28

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 5.30 0.00 16.60 0.20 175.20 3.20 10.62 0.34 20.30 18.93 31.80
sdb 6.50 0.00 19.00 0.20 204.80 3.20 10.83 0.35 18.39 17.29 33.20
sdc 5.60 0.00 17.50 0.30 184.80 4.00 10.61 0.34 19.33 18.48 32.90
md1 0.00 0.00 70.50 0.00 564.00 0.00 8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 70.50 0.00 564.00 0.00 8.00 1.43 20.30 13.49 95.10
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
9.38 0.00 3.45 1.55 0.00 85.62

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 9.30 1.60 15.50 2.20 198.40 40.00
13.47 0.27 15.08 12.71 22.50
sdb 8.40 3.50 12.00 1.80 163.20 52.00
15.59 0.19 13.62 12.54 17.30
sdc 8.30 4.10 13.70 2.90 176.00 118.90
17.77 0.24 14.58 11.93 19.80
md1 0.00 0.00 61.00 5.50 488.00 42.30
7.97 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 61.00 5.50 488.00 42.30
7.97 1.19 17.80 8.00 53.20
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
3.26 0.00 8.36 0.07 0.00 88.30

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 207.00 584.40 181.70 79.40 3109.60 5391.20
32.56 2.28 8.74 2.42 63.10
sdb 209.60 584.30 184.20 77.50 3150.40 5373.60
32.57 2.02 7.74 1.93 50.50
sdc 195.20 589.40 198.20 80.10 3147.20 5760.80
32.01 2.37 8.51 2.33 64.80
md1 0.00 0.00 12.00 1182.20 96.00 9456.20
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 12.00 1182.80 96.00 9461.00
8.00 61.79 51.70 0.83 99.20
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
3.24 0.00 6.90 0.02 0.00 89.84

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 203.60 541.60 163.40 84.60 2936.80 5101.60
32.41 5.66 22.74 3.05 75.70
sdb 201.10 533.20 165.90 83.50 2936.80 5028.00
31.94 5.23 20.77 2.61 65.20
sdc 201.00 540.30 164.50 89.50 2924.00 5346.30
32.56 5.77 22.71 3.00 76.30
md1 0.00 0.00 0.50 1115.30 4.00 8877.30
7.96 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.50 1114.70 4.00 8872.50
7.96 93.14 81.84 0.89 99.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
1.16 0.00 2.32 0.00 0.00 96.52

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 56.60 154.10 32.90 54.90 716.00 1726.40
27.82 0.45 5.26 2.79 24.50
sdb 57.80 161.00 35.60 57.10 747.20 1801.60
27.50 0.48 5.65 2.80 26.00
sdc 58.00 162.30 32.00 56.70 720.00 1808.10
28.50 0.44 4.88 2.82 25.00
md1 0.00 0.00 0.00 355.90 0.00 2842.30
7.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 355.90 0.00 2842.30
7.99 9.02 30.64 2.71 96.50
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.31 0.00 0.52 0.00 0.00 99.17

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.70 45.00 6.10 71.40 54.40 974.40
13.27 3.15 40.67 3.50 27.10
sdb 1.90 20.90 7.50 46.70 75.20 584.00
12.16 1.64 30.18 3.69 20.00
sdc 1.80 35.50 6.80 62.40 68.80 1055.20
16.24 1.82 26.30 3.16 21.90
md1 0.00 0.00 0.00 135.20 0.00 1038.20
7.68 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 135.20 0.00 1038.20
7.68 14.41 106.61 7.32 99.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.36 0.00 0.00 99.14

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 2.10 48.90 3.10 64.00 41.60 952.00
14.81 0.64 9.60 2.91 19.50
sdb 2.60 26.20 3.90 40.80 52.00 584.80
14.25 0.52 11.59 3.31 14.80
sdc 2.20 55.60 3.40 72.90 44.00 1076.90
14.69 0.67 8.73 2.44 18.60
md1 0.00 0.00 0.00 144.80 0.00 1158.40
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 144.80 0.00 1158.40
8.00 5.74 39.59 6.88 99.60
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
1.46 0.00 3.01 0.00 0.00 95.53

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 61.10 183.80 39.10 84.70 801.60 2204.80
24.28 0.82 6.62 2.05 25.40
sdb 57.80 180.00 42.70 77.70 804.00 2113.60
24.23 0.92 7.66 1.87 22.50
sdc 57.40 182.70 41.60 85.80 792.80 2200.10
23.49 1.11 8.74 2.06 26.20
md1 0.00 0.00 1.20 438.50 9.60 3507.10
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 1.20 438.50 9.60 3507.10
8.00 15.63 35.55 2.26 99.40
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.43 0.00 1.07 0.21 0.00 98.29

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 1.90 22.00 7.50 42.20 75.20 557.60
12.73 2.54 51.07 4.47 22.20
sdb 1.10 58.70 6.50 82.00 60.80 1169.60
13.90 2.69 30.36 2.49 22.00
sdc 0.90 59.50 6.90 83.50 62.40 1409.60
16.28 3.13 34.67 2.78 25.10
md1 0.00 0.00 0.10 168.70 0.80 1305.80
7.74 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.10 168.70 0.80 1305.80
7.74 15.74 93.27 5.90 99.60
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
1.79 0.00 3.94 0.00 0.00 94.27

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 80.40 200.50 43.10 64.10 976.00 2158.40
29.24 0.39 3.64 1.73 18.50
sdb 77.30 232.60 44.80 93.80 968.80 2655.20
26.15 0.34 2.47 1.17 16.20
sdc 67.00 244.30 52.60 103.50 944.80 2826.50
24.16 0.39 2.45 1.13 17.60
md1 0.00 0.00 0.20 532.90 1.60 4260.50
7.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.20 533.70 1.60 4266.90
7.99 11.08 20.71 1.87 99.90
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
1.04 0.00 1.40 0.00 0.00 97.55

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 40.80 138.90 18.70 77.50 488.00 1768.00
23.45 0.46 4.76 1.57 15.10
sdb 41.30 115.90 19.80 58.80 496.80 1436.00
24.59 0.83 10.61 2.09 16.40
sdc 35.20 149.50 25.90 89.40 500.80 1952.10
21.27 1.01 8.77 1.59 18.30
md1 0.00 0.00 0.10 335.50 0.80 2681.70
7.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.10 334.70 0.80 2675.30
7.99 9.89 29.61 2.94 98.30
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.44 0.00 1.08 0.49 0.00 97.99

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 4.50 37.20 9.60 63.60 112.80 853.60
13.20 1.83 25.01 2.90 21.20
sdb 3.80 58.90 9.90 82.40 109.60 1177.60
13.95 1.68 18.15 2.36 21.80
sdc 3.90 49.30 9.20 72.70 104.80 1327.20
17.48 2.09 25.53 2.94 24.10
md1 0.00 0.00 11.20 176.20 89.60 1362.00
7.75 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 11.20 176.20 89.60 1362.00
7.75 10.78 57.52 5.34 100.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.37 0.00 1.26 0.00 0.00 98.37

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 1.70 48.70 2.10 69.40 30.40 983.20
14.18 0.22 3.02 1.48 10.60
sdb 2.40 55.90 2.80 76.20 41.60 1095.20
14.39 0.24 3.05 1.28 10.10
sdc 1.20 57.50 1.50 80.70 21.60 1143.50
14.17 0.23 2.76 1.24 10.20
md1 0.00 0.00 0.00 186.70 0.00 1493.60
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 186.70 0.00 1493.60
8.00 2.63 14.10 5.33 99.50
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.18 0.00 0.60 0.02 0.00 99.19

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 3.50 51.10 5.10 68.70 68.80 1005.60
14.56 1.82 24.61 4.32 31.90
sdb 4.10 46.00 6.50 62.50 84.80 915.20
14.49 1.29 18.68 3.74 25.80
sdc 4.90 32.40 7.00 47.70 95.20 688.90
14.33 1.31 23.97 4.26 23.30
md1 0.00 0.00 0.00 145.00 0.00 1160.00
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 145.00 0.00 1160.00
8.00 12.36 85.27 6.59 95.50
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
3.26 0.00 5.29 0.00 0.00 91.45

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 136.10 375.40 101.70 75.40 1902.40 3672.00
31.48 0.57 3.24 1.87 33.20
sdb 150.60 372.90 88.50 77.50 1912.80 3664.80
33.60 0.57 3.43 1.90 31.60
sdc 141.30 388.80 95.20 88.40 1892.00 4198.40
33.17 0.69 3.76 1.98 36.40
md1 0.00 0.00 0.30 813.90 2.40 6509.20
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.30 813.90 2.40 6509.20
8.00 22.48 27.60 1.22 99.20
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.23 0.00 0.54 0.00 0.00 99.22

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 1.60 33.50 7.80 55.20 75.20 759.20
13.24 2.85 45.32 4.52 28.50
sdb 1.40 36.20 7.40 58.30 70.40 805.60
13.33 3.34 50.84 4.35 28.60
sdc 1.10 32.20 7.10 53.40 65.60 733.90
13.21 3.20 52.89 4.60 27.80
md1 0.00 0.00 0.00 128.60 0.00 984.00
7.65 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 128.60 0.00 984.00
7.65 19.93 154.97 7.60 97.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.42 0.00 0.56 0.00 0.00 99.02

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 1.10 54.10 1.30 61.60 19.20 967.20
15.68 0.29 4.55 2.81 17.70
sdb 2.30 26.40 2.40 34.90 37.60 532.00
15.27 0.27 7.10 4.29 16.00
sdc 0.90 50.00 1.00 59.20 15.20 914.50
15.44 0.27 4.47 2.59 15.60
md1 0.00 0.00 0.00 134.80 0.00 1078.40
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 134.80 0.00 1078.40
8.00 2.40 17.86 7.31 98.60
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.45 0.00 0.66 0.00 0.00 98.88

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 4.70 56.50 7.40 75.90 96.80 1116.00
14.56 1.67 20.07 4.23 35.20
sdb 4.20 42.70 6.80 62.70 88.00 900.00
14.22 1.30 18.75 3.68 25.60
sdc 5.20 52.90 8.10 71.50 106.40 1168.80
16.02 1.73 21.68 4.48 35.70
md1 0.00 0.00 0.00 170.20 0.00 1361.60
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 170.20 0.00 1361.60
8.00 17.84 104.81 5.80 98.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.47 0.00 0.66 0.00 0.00 98.87

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 1.30 48.20 1.30 53.80 20.00 856.80
15.91 0.28 5.08 3.36 18.50
sdb 1.60 45.40 1.60 51.50 25.60 816.00
15.85 0.28 5.24 3.15 16.70
sdc 1.60 41.80 1.70 47.60 26.40 755.70
15.86 0.28 5.58 3.39 16.70
md1 0.00 0.00 0.00 136.40 0.00 1091.20
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 136.40 0.00 1091.20
8.00 2.48 18.15 7.17 97.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.45 0.00 0.78 0.00 0.00 98.77

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.70 48.00 0.80 52.30 12.80 833.60
15.94 0.22 4.16 2.34 12.40
sdb 1.50 38.20 1.50 42.90 24.00 680.00
15.86 0.19 4.30 2.36 10.50
sdc 0.40 60.20 0.40 64.80 6.40 1030.50
15.90 0.29 4.51 2.55 16.60
md1 0.00 0.00 0.00 147.20 0.00 1177.60
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 147.20 0.00 1177.60
8.00 2.34 15.89 6.72 98.90
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.42 0.00 0.37 0.00 0.00 99.21

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.60 53.00 0.70 63.30 10.40 971.20
15.34 0.26 4.05 2.64 16.90
sdb 1.00 38.80 1.10 53.90 16.80 782.40
14.53 0.50 9.09 3.55 19.50
sdc 0.90 40.50 1.00 51.60 15.20 906.40
17.52 0.24 4.54 2.68 14.10
md1 0.00 0.00 0.00 142.60 0.00 1140.80
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 142.60 0.00 1140.80
8.00 2.33 16.33 6.90 98.40
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
4.39 0.00 3.08 1.74 0.00 90.79

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 42.10 103.40 17.80 63.40 478.40 1374.40
22.82 0.84 10.38 4.25 34.50
sdb 42.80 95.10 17.20 48.30 480.00 1189.60
25.49 0.45 6.90 3.97 26.00
sdc 45.90 100.60 18.60 57.50 516.00 1304.90
23.93 0.60 7.83 4.34 33.00
md1 0.00 0.00 47.60 252.40 380.80 2017.10
7.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 47.60 252.40 380.80 2017.10
7.99 7.29 24.30 3.28 98.50
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.99 0.00 1.22 0.00 0.00 97.79

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 15.50 34.60 28.70 56.50 354.40 780.80
13.32 5.49 60.46 6.27 53.40
sdb 14.70 32.80 28.50 53.80 345.60 745.60
13.26 3.42 41.51 5.55 45.70
sdc 14.00 23.00 27.00 44.30 328.00 590.50
12.88 3.54 49.71 6.80 48.50
md1 0.00 0.00 101.40 119.50 811.20 912.60
7.80 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 101.40 119.50 811.20 912.60
7.80 22.75 32.05 4.51 99.70
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.70 0.00 0.68 0.00 0.00 98.62

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 1.50 35.50 2.20 44.10 29.60 682.40
15.38 0.29 13.59 3.82 17.70
sdb 1.50 42.50 2.10 51.90 28.80 799.20
15.33 0.29 5.39 3.20 17.30
sdc 1.60 36.90 1.90 47.70 28.00 908.80
18.89 0.29 5.87 3.55 17.60
md1 0.00 0.00 0.00 116.60 0.00 932.10
7.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 116.60 0.00 932.10
7.99 2.73 157.86 8.45 98.50
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.76 0.00 0.47 0.00 0.00 98.77

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.40 49.80 0.40 65.00 6.40 961.60
14.80 0.23 3.46 2.16 14.10
sdb 1.20 28.90 1.50 39.20 21.60 588.00
14.98 0.18 4.32 2.87 11.70
sdc 0.80 43.30 1.10 53.80 15.20 819.50
15.20 0.24 4.28 2.84 15.60
md1 0.00 0.00 0.00 131.80 0.00 1054.40
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 131.80 0.00 1054.40
8.00 2.00 15.14 7.48 98.60
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.70 0.00 0.75 0.00 0.00 98.55

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.90 43.80 1.60 64.30 20.00 904.80
14.03 0.30 4.63 2.47 16.30
sdb 1.20 30.30 1.70 50.70 23.20 688.00
13.57 0.25 4.85 2.67 14.00
sdc 0.90 28.50 1.90 46.50 22.40 639.30
13.67 0.30 6.28 3.33 16.10
md1 0.00 0.00 0.00 124.40 0.00 994.00
7.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 124.40 0.00 994.00
7.99 2.19 17.60 7.97 99.10
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.74 0.00 0.34 0.00 0.00 98.92

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.90 50.40 1.20 62.40 16.80 942.40
15.08 0.22 3.49 1.76 11.20
sdb 0.30 53.50 0.40 64.50 5.60 984.00
15.25 0.18 2.82 1.48 9.60
sdc 1.60 34.30 2.00 47.20 28.80 801.60
16.88 0.25 5.04 2.60 12.80
md1 0.00 0.00 0.00 148.40 0.00 1185.80
7.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 148.40 0.00 1185.80
7.99 2.11 14.23 6.70 99.50
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
1.18 0.00 0.35 0.00 0.00 98.47

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.50 47.80 0.70 60.00 9.60 892.80
14.87 0.20 3.29 1.86 11.30
sdb 1.10 38.10 1.30 48.40 19.20 722.40
14.92 0.17 3.48 2.15 10.70
sdc 0.80 41.90 0.80 57.00 12.80 821.10
14.43 0.21 3.55 1.87 10.80
md1 0.00 0.00 0.00 140.80 0.00 1126.40
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 140.80 0.00 1126.40
8.00 1.98 14.06 7.03 99.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.38 0.00 0.38 0.00 0.00 99.25

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 1.40 42.10 1.50 55.20 23.20 808.80
14.67 0.18 3.26 1.94 11.00
sdb 0.70 39.90 0.80 52.10 12.00 766.40
14.71 0.20 3.71 2.04 10.80
sdc 1.20 38.40 1.40 49.20 20.80 730.50
14.85 0.22 4.45 2.29 11.60
md1 0.00 0.00 0.00 132.60 0.00 1060.80
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 132.60 0.00 1060.80
8.00 1.97 14.83 7.39 98.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.33 0.00 0.49 0.00 0.00 99.18

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.80 40.70 1.10 47.20 15.20 735.20
15.54 0.20 4.12 2.28 11.00
sdb 0.50 47.00 0.90 53.60 11.20 836.80
15.56 0.22 4.04 2.51 13.70
sdc 0.90 40.80 1.00 48.40 15.20 857.60
17.67 0.21 4.23 2.31 11.40
md1 0.00 0.00 0.00 132.60 0.00 1060.80
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 132.60 0.00 1060.80
8.00 1.97 14.85 7.44 98.60
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.40 0.00 0.40 0.00 0.00 99.20

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.90 42.90 1.00 53.60 15.20 800.80
14.95 0.21 3.75 1.96 10.70
sdb 1.10 42.10 1.10 50.80 17.60 772.00
15.21 0.18 3.56 2.02 10.50
sdc 0.70 50.20 0.80 60.20 12.00 911.50
15.14 0.20 3.25 1.90 11.60
md1 0.00 0.00 0.00 144.40 0.00 1155.20
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 144.40 0.00 1155.20
8.00 1.98 13.70 6.85 98.90
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.75 0.00 0.71 0.00 0.00 98.54

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.70 45.00 0.70 49.70 11.20 783.20
15.76 0.18 3.63 2.12 10.70
sdb 0.70 38.00 0.70 43.20 11.20 675.20
15.64 0.17 3.78 2.10 9.20
sdc 1.00 40.20 1.00 46.20 16.00 716.10
15.51 0.20 4.13 2.27 10.70
md1 0.00 0.00 0.00 126.20 0.00 1009.60
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 126.20 0.00 1009.60
8.00 1.96 15.50 7.75 97.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.42 0.00 0.44 0.00 0.00 99.14

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.50 41.50 0.90 53.80 11.20 780.00
14.46 0.20 3.62 2.21 12.10
sdb 0.70 39.80 0.90 60.40 12.80 819.20
13.57 0.24 3.92 1.89 11.60
sdc 0.90 32.70 1.30 46.60 17.60 763.20
16.30 0.23 4.84 2.61 12.50
md1 0.00 0.00 0.00 133.80 0.00 1070.40
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 133.80 0.00 1070.40
8.00 1.97 14.73 7.37 98.60
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.40 0.00 0.35 0.00 0.00 99.25

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.80 43.20 1.00 55.50 14.40 809.60
14.58 0.20 3.52 1.98 11.20
sdb 0.60 45.20 0.80 56.30 11.20 832.00
14.77 0.20 3.43 1.89 10.80
sdc 1.10 39.80 1.10 53.70 17.60 767.30
14.32 0.20 3.72 2.04 11.20
md1 0.00 0.00 0.00 143.00 0.00 1144.00
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 143.00 0.00 1144.00
8.00 1.98 13.86 6.94 99.20
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.40 0.00 0.47 0.00 0.00 99.14

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 1.00 44.80 1.00 48.70 16.00 773.60
15.89 0.20 4.08 2.01 10.00
sdb 0.90 49.00 0.90 52.90 14.40 840.80
15.90 0.22 4.01 2.23 12.00
sdc 1.30 39.20 1.30 44.10 20.80 691.30
15.69 0.23 5.11 3.02 13.70
md1 0.00 0.00 0.00 134.40 0.00 1075.20
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 134.40 0.00 1075.20
8.00 1.95 14.51 7.25 97.40
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.45 0.00 0.45 0.00 0.00 99.10

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 1.40 35.30 1.50 44.00 23.20 679.20
15.44 0.24 5.34 3.21 14.60
sdb 0.30 51.10 0.50 59.00 6.40 925.60
15.66 0.24 4.12 2.69 16.00
sdc 1.10 29.50 1.40 39.20 20.00 705.60
17.87 0.24 5.94 3.52 14.30
md1 0.00 0.00 0.00 120.40 0.00 963.20
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 120.40 0.00 963.20
8.00 1.99 16.53 8.27 99.60
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.60 0.00 0.48 0.00 0.00 98.92

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.50 47.20 0.50 51.50 8.00 819.20
15.91 0.27 5.21 3.75 19.50
sdb 1.40 45.00 1.50 49.00 23.20 781.60
15.94 0.23 4.48 2.95 14.90
sdc 1.60 30.20 1.70 34.40 26.40 545.70
15.85 0.18 5.01 2.99 10.80
md1 0.00 0.00 0.00 123.00 0.00 984.00
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 123.00 0.00 984.00
8.00 1.96 15.92 7.95 97.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.47 0.00 0.00 99.03

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 5.10 21.60 6.30 36.80 91.20 548.00
14.83 0.46 10.56 8.12 35.00
sdb 6.10 20.80 7.10 35.90 105.60 534.40
14.88 0.38 8.79 6.67 28.70
sdc 3.80 22.80 4.70 38.70 68.00 572.10
14.75 0.43 9.86 7.26 31.50
md1 0.00 0.00 0.00 73.00 0.00 584.00
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 73.00 0.00 584.00
8.00 1.98 27.15 13.62 99.40
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.78 0.00 0.60 0.00 0.00 98.62

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 5.50 27.10 6.10 37.10 92.80 569.60
15.33 0.39 9.03 6.18 26.70
sdb 7.20 23.60 8.10 33.50 122.40 513.60
15.29 0.33 7.96 5.84 24.30
sdc 7.00 25.80 7.90 35.70 119.20 628.80
17.16 0.42 9.59 7.02 30.60
md1 0.00 0.00 0.00 80.40 0.00 643.20
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 80.40 0.00 643.20
8.00 1.99 24.70 12.39 99.60
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.21 0.00 0.37 0.00 0.00 99.42

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 4.70 25.70 5.60 34.80 82.40 524.00
15.01 0.29 7.13 5.25 21.20
sdb 4.60 26.20 5.40 35.80 80.00 535.20
14.93 0.28 6.77 4.59 18.90
sdc 4.90 25.70 5.70 35.60 84.80 529.10
14.86 0.35 8.38 6.54 27.00
md1 0.00 0.00 0.00 84.60 0.00 676.80
8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 84.60 0.00 676.80
8.00 1.98 23.43 11.69 98.90
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.30 0.00 0.28 0.00 0.00 99.41

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 2.20 17.20 2.40 23.40 36.80 357.60
15.29 0.20 7.83 5.97 15.40
sdb 2.10 15.20 2.50 20.70 36.80 320.00
15.38 0.16 7.03 4.91 11.40
sdc 3.90 13.00 4.20 19.30 64.80 290.50
15.12 0.26 10.94 8.30 19.50
md1 0.00 0.00 0.00 47.90 0.00 382.60
7.99 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 47.90 0.00 382.60
7.99 1.12 23.36 11.77 56.40
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.39 0.00 0.24 0.02 0.00 99.35

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.90 0.00 14.40
16.00 0.03 28.89 7.78 0.70
sdb 0.00 0.00 0.00 0.90 0.00 14.40
16.00 0.03 31.11 7.78 0.70
sdc 0.00 0.10 0.00 1.30 0.00 64.80
49.85 0.06 44.62 10.77 1.40
md1 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00
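One pattern worth noting in the dumps above: dm-0 repeatedly sits near 100% %util while moving only on the order of 1000 wsec/s, whereas the underlying sd* disks stay around 10-30% busy. A throwaway sketch of how such "busy but slow" devices could be flagged (thresholds are arbitrary; the sample values are copied from one 10s interval above):

```python
# Sketch: flag devices that look like dm-0 in the dumps above --
# near-saturated (%util) yet moving few sectors.  Thresholds are
# arbitrary; sample values copied from one interval above.
def saturated(rsec, wsec, util, util_min=90.0, sec_max=2000.0):
    """High utilisation with modest sector throughput suggests the
    device is bound by request latency, not bandwidth."""
    return util >= util_min and (rsec + wsec) < sec_max

samples = {
    # device: (rsec/s, wsec/s, %util)
    "sda":  (16.80, 942.40, 11.20),
    "dm-0": (0.00, 1185.80, 99.50),
}
flagged = [dev for dev, (r, w, u) in samples.items() if saturated(r, w, u)]
print(flagged)  # -> ['dm-0']
```

This is only a heuristic reading of the numbers, not a diagnosis; the thread below tracks the actual cause down into XFS.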

2007-11-06 23:27:24

by Adrian Bunk

Subject: Re: Patch tags [was writeout stalls in current -git]

On Tue, Nov 06, 2007 at 09:25:12AM -0700, Jonathan Corbet wrote:
> Andrew wrote:
>
> > > Reviewed-by: Fengguang Wu <[email protected]>
> >
> > I would prefer Tested-by: :(
>
> This seems like as good an opportunity as any to toss my patch tags
> document out there one more time. I still think it's a good idea to
> codify some sort of consensus on what these tags mean...

What's missing is a definition of which of them are formal tags that must
be explicitly given (look at point 13 in SubmittingPatches).

Signed-off-by: and Reviewed-by: are the formal tags someone must have
explicitly given and that correspond to some statement.

OTOH, I can translate a "sounds fine" or "works for me" someone else
gave me into an Acked-by: or Tested-by: tag, respectively.

> jon
>...
> --- /dev/null
> +++ b/Documentation/patch-tags
> @@ -0,0 +1,76 @@
> +Patches headed for the mainline may contain a variety of tags documenting
> +who played a hand in (or was at least aware of) their progress. All of
> +these tags have the form:
> +
> + Something-done-by: Full name <email@address> [optional random stuff]
> +
> +These tags are:
> +
> +From: The original author of the patch. This tag will ensure
> + that credit is properly given when somebody other than the
> + original author submits the patch.
> +
> +Signed-off-by: A person adding a Signed-off-by tag is attesting that the
> + patch is, to the best of his or her knowledge, legally able
> + to be merged into the mainline and distributed under the
> + terms of the GNU General Public License, version 2. See
> + the Developer's Certificate of Origin, found in
> + Documentation/SubmittingPatches, for the precise meaning of
> + Signed-off-by. This tag assures upstream maintainers that
> + the provenance of the patch is known and allows the origin
> + of the patch to be reviewed should copyright questions
> + arise.
> +
> +Acked-by: The person named (who should be an active developer in the
> + area addressed by the patch) is aware of the patch and has
> + no objection to its inclusion; it informs upstream
> + maintainers that a certain degree of consensus on the patch
> + has been achieved. An Acked-by tag does not imply any
> + involvement in the development of the patch or that a
> + detailed review was done.
> +
> +Reviewed-by: The patch has been reviewed and found acceptable according
> + to the Reviewer's Statement as found at the bottom of this
> + file. A Reviewed-by tag is a statement of opinion that the
> + patch is an appropriate modification of the kernel without
> + any remaining serious technical issues. Any interested
> + reviewer (who has done the work) can offer a Reviewed-by
> + tag for a patch. This tag serves to give credit to
> + reviewers and to inform maintainers of the degree of review
> + which has been done on the patch.
> +
> +Cc: The person named was given the opportunity to comment on
> + the patch. This is the only tag which might be added
> + without an explicit action by the person it names. This
> + tag documents that potentially interested parties have been
> + included in the discussion.
> +
> +Tested-by: The patch has been successfully tested (in some
> + environment) by the person named. This tag informs
> + maintainers that some testing has been performed, provides
> + a means to locate testers for future patches, and ensures
> + credit for the testers.
> +
> +
> +----
> +
> +Reviewer's statement of oversight, v0.02
> +
> +By offering my Reviewed-by: tag, I state that:
> +
> + (a) I have carried out a technical review of this patch to evaluate its
> + appropriateness and readiness for inclusion into the mainline kernel.
> +
> + (b) Any problems, concerns, or questions relating to the patch have been
> + communicated back to the submitter. I am satisfied with the
> + submitter's response to my comments.
> +
> + (c) While there may be things that could be improved with this submission,
> + I believe that it is, at this time, (1) a worthwhile modification to
> + the kernel, and (2) free of known issues which would argue against its
> + inclusion.
> +
> + (d) While I have reviewed the patch and believe it to be sound, I do not
> + (unless explicitly stated elsewhere) make any warranties or guarantees
> + that it will achieve its stated purpose or function properly in any
> + given situation.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
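For what it's worth, the trailer format Jonathan's quoted document describes (`Something-done-by: Full name <email@address>`) is regular enough to check mechanically. A rough sketch of a trailer parser follows; the tag list is just the set from the quoted document, and the names and addresses in the example message are made up:

```python
import re

# Tags defined in the quoted Documentation/patch-tags draft.
KNOWN_TAGS = {"From", "Signed-off-by", "Acked-by", "Reviewed-by",
              "Cc", "Tested-by"}
# "Full name <email@address>" -- optional trailing random stuff ignored.
NAME_EMAIL = re.compile(r"\s*(?P<name>[^<]+?)\s*<(?P<email>[^>]+)>")

def parse_tags(message, known=KNOWN_TAGS):
    """Return (tag, name, email) triples for recognised trailer lines."""
    found = []
    for line in message.splitlines():
        tag, sep, rest = line.partition(":")
        if sep and tag in known:
            m = NAME_EMAIL.match(rest)
            if m:
                found.append((tag, m.group("name"), m.group("email")))
    return found

# Hypothetical commit message; addresses are placeholders.
msg = """fix writeout stall

Signed-off-by: Jane Hacker <jane@example.org>
Reviewed-by: Fengguang Wu <wfg@example.com>
"""
print(parse_tags(msg))  # -> two (tag, name, email) triples
```

A checker like this obviously cannot tell whether a tag was explicitly given by the person named, which is exactly the distinction Adrian raises above.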

2007-11-06 23:31:53

by David Chinner

Subject: Re: writeout stalls in current -git

On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> On 11/6/07, David Chinner <[email protected]> wrote:
> > Rather than vmstat, can you use something like iostat to show how busy your
> > disks are? i.e. are we seeing RMW cycles in the raid5 or some such issue.
>
> Both "vmstat 10" and "iostat -x 10" output from this test:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 2 0 0 3700592 0 85424 0 0 31 83 108 244 2 1 95 1
> -> emerge reads something, don't know for sure what...
> 1 0 0 3665352 0 87940 0 0 239 2 343 585 2 1 97 0
....
>
> The last 20% of the btrace look more or less completely like this, no
> other programs do any IO...
>
> 253,0 3 104626 526.293450729 974 C WS 79344288 + 8 [0]
> 253,0 3 104627 526.293455078 974 C WS 79344296 + 8 [0]
> 253,0 1 36469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd]
> 253,0 1 36470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd]
^^
Apparently we are doing synchronous writes. That would explain why
it is slow. We shouldn't be doing synchronous writes here. I'll see if
I can reproduce this.

<goes off and looks>

Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
look into this further and send a patch when I have something concrete.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
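For reference, the WS flag David points at sits in the RWBS column of btrace/blkparse output. A quick sketch of tallying queued synchronous writes per process, with field positions assumed from the lines quoted above (dev cpu seq time pid action rwbs sector + count [process]):

```python
from collections import Counter

# Sample lines taken verbatim from the btrace excerpt above.
# "Q" = request queued, "WS" = synchronous write.
TRACE = """\
253,0 3 104626 526.293450729 974 C WS 79344288 + 8 [0]
253,0 3 104627 526.293455078 974 C WS 79344296 + 8 [0]
253,0 1 36469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd]
253,0 1 36470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd]
"""

def sync_writes_by_process(trace):
    """Count queue (Q) events with the WS rwbs flag per process name."""
    counts = Counter()
    for line in trace.splitlines():
        fields = line.split()
        if len(fields) >= 11 and fields[5] == "Q" and fields[6] == "WS":
            counts[fields[10].strip("[]")] += 1
    return counts

print(sync_writes_by_process(TRACE))  # -> Counter({'xfssyncd': 2})
```

Counting only Q events avoids double-counting the completion (C) records for the same requests.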

2007-11-07 02:14:14

by David Chinner

Subject: Re: writeout stalls in current -git

On Wed, Nov 07, 2007 at 10:31:14AM +1100, David Chinner wrote:
> On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> > On 11/6/07, David Chinner <[email protected]> wrote:
> > > Rather than vmstat, can you use something like iostat to show how busy your
> > > disks are? i.e. are we seeing RMW cycles in the raid5 or some such issue.
> >
> > Both "vmstat 10" and "iostat -x 10" output from this test:
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 2 0 0 3700592 0 85424 0 0 31 83 108 244 2 1 95 1
> > -> emerge reads something, don't know for sure what...
> > 1 0 0 3665352 0 87940 0 0 239 2 343 585 2 1 97 0
> ....
> >
> > The last 20% of the btrace look more or less completely like this, no
> > other programs do any IO...
> >
> > 253,0 3 104626 526.293450729 974 C WS 79344288 + 8 [0]
> > 253,0 3 104627 526.293455078 974 C WS 79344296 + 8 [0]
> > 253,0 1 36469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd]
> > 253,0 1 36470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd]
> ^^
> Apparently we are doing synchronous writes. That would explain why
> it is slow. We shouldn't be doing synchronous writes here. I'll see if
> I can reproduce this.
>
> <goes off and looks>
>
> Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
> look into this further and send a patch when I have something concrete.

Ok, so it's not synchronous writes that we are doing - we're just
submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely so we are writing the
same block multiple times i.e. once for each inode in the cluster we
have to write.

This must be a side effect of some other change as we haven't
changed anything in the reclaim code recently.....

/me scurries off to run some tests

Indeed it is. The patch below should fix the problem - the inode
clusters weren't getting set up properly when inodes were being
read in or allocated. This is a regression, introduced by this
mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64e070ae7c5342a0d56ec20ae9ef5cfb

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
fs/xfs/xfs_iget.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c 2007-11-02 13:44:46.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
icl = NULL;
if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
first_index, 1)) {
- if ((iq->i_ino & mask) == first_index)
+ if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
icl = iq->i_cluster;
}

2007-11-07 07:15:23

by Torsten Kaiser

Subject: Re: writeout stalls in current -git

On 11/7/07, David Chinner <[email protected]> wrote:
> Ok, so it's not synchronous writes that we are doing - we're just
> submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> The "synchronous" nature appears to be coming from higher level
> locking when reclaiming inodes (on the flush lock). It appears that
> inode write clustering is failing completely so we are writing the
> same block multiple times i.e. once for each inode in the cluster we
> have to write.

Works for me. The only remaining stalls are sub second and look
completely valid, considering the amount of files being removed.

vmstat 10 from this test:
3 0 0 3500192 332 204956 0 0 105 8512 1809 6473 6 10 83 1
0 0 0 3500200 332 204576 0 0 0 4367 1355 3712 2 6 92 0
2 0 0 3504264 332 203528 0 0 0 6805 1912 4967 4 8 88 0
0 0 0 3511632 332 203528 0 0 0 2843 805 1791 2 4 94 0
0 0 0 3516852 332 203516 0 0 0 3375 879 2712 3 5 93 0
0 0 0 3530544 332 202668 0 0 186 776 488 1152 4 2 89 4
0 0 0 3574788 332 204960 0 0 226 326 358 787 0 1 98 0
0 0 0 3576820 332 204960 0 0 0 376 332 737 0 0 99 0
0 0 0 3578432 332 204960 0 0 0 356 293 606 1 1 99 0
0 0 0 3580192 332 204960 0 0 0 101 104 384 0 0 99 0

I'm pleased to note that this is now much faster again.
Thanks!

Tested-by: Torsten Kaiser <[email protected]>

Cc's, please note: it looks like this was really a different problem
than the 100% iowait that was seen with reiserfs.
Also the one complete stall I have seen is probably something else.
But I have not been able to reproduce this again with -mm and have
never seen this on mainline, so I will just ignore that single event
until I see it again.

Torsten

> ---
> fs/xfs/xfs_iget.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
> ===================================================================
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c 2007-11-02 13:44:46.000000000 +1100
> +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
> @@ -248,7 +248,7 @@ finish_inode:
> icl = NULL;
> if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
> first_index, 1)) {
> - if ((iq->i_ino & mask) == first_index)
> + if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
> icl = iq->i_cluster;
> }
>
>

2007-11-08 00:38:51

by David Chinner

Subject: Re: writeout stalls in current -git

On Wed, Nov 07, 2007 at 08:15:06AM +0100, Torsten Kaiser wrote:
> On 11/7/07, David Chinner <[email protected]> wrote:
> > Ok, so it's not synchronous writes that we are doing - we're just
> > submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> > The "synchronous" nature appears to be coming from higher level
> > locking when reclaiming inodes (on the flush lock). It appears that
> > inode write clustering is failing completely so we are writing the
> > same block multiple times i.e. once for each inode in the cluster we
> > have to write.
>
> Works for me. The only remaining stalls are sub second and look
> completely valid, considering the amount of files being removed.
....
> Tested-by: Torsten Kaiser <[email protected]>

Great - thanks for reporting the problem and testing the fix.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-11-09 16:10:55

by Jonathan Corbet

Subject: Re: Patch tags [was writeout stalls in current -git]

Adrian Bunk <[email protected]> wrote:

> What's missing is a definition of which of them are formal tags that must
> be explicitly given (look at point 13 in SubmittingPatches).
>
> Signed-off-by: and Reviewed-by: are the formal tags someone must have
> explicitly given and that correspond to some statement.
>
> OTOH, I can translate a "sounds fine" or "works for me" someone else
> gave me into an Acked-by: or Tested-by: tag, respectively.

The discussion of the Cc: tag says:

This is the only tag which might be added without an explicit
action by the person it names.

I think that addresses your comment, no? Certainly I wouldn't feel that
I could add any of the other tags to a patch I posted - that's the job
of the person named there.

jon

2007-11-09 16:19:59

by Adrian Bunk

Subject: Re: Patch tags [was writeout stalls in current -git]

On Fri, Nov 09, 2007 at 09:10:47AM -0700, Jonathan Corbet wrote:
> Adrian Bunk <[email protected]> wrote:
>
> > What's missing is a definition of which of them are formal tags that must
> > be explicitly given (look at point 13 in SubmittingPatches).
> >
> > Signed-off-by: and Reviewed-by: are the formal tags someone must have
> > explicitely given and that correspond to some statement.
> >
> > OTOH, I can translate a "sounds fine" or "works for me" someone else
> > gave me into an Acked-by: or Tested-by: tag, respectively.
>
> The discussion of the Cc: tag says:
>
> This is the only tag which might be added without an explicit
> action by the person it names.
>
> I think that addresses your comment, no? Certainly I wouldn't feel that
> I could add any of the other tags to a patch I posted - that's the job
> of the person named there.

Acked-by: and Tested-by: do require explicit actions by the person they
name, but that person is not required to explicitly give the tag.

If a user said "the patch works for me", I would consider it overly
bureaucratic to ask the user for a formal tag.

> jon

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2007-11-20 13:18:29

by Damien Wyart

Subject: Re: writeout stalls in current -git

Hello,

> > > Ok, so it's not synchronous writes that we are doing - we're just
> > > submitting bio's tagged as WRITE_SYNC to get the I/O issued
> > > quickly. The "synchronous" nature appears to be coming from higher
> > > level locking when reclaiming inodes (on the flush lock). It
> > > appears that inode write clustering is failing completely so we
> > > are writing the same block multiple times i.e. once for each inode
> > > in the cluster we have to write.

> > Works for me. The only remaining stalls are sub second and look
> > completely valid, considering the amount of files being removed.
> ....
> > Tested-by: Torsten Kaiser <[email protected]>

* David Chinner <[email protected]> [2007-11-08 11:38]:
> Great - thanks for reporting the problem and testing the fix.

This patch has not yet made its way into 2.6.24 (rc3). Is it intended?
Maybe the fix can wait for 2.6.25, but wanted to make sure...

--
Damien Wyart

2007-11-20 21:10:50

by David Chinner

Subject: Re: writeout stalls in current -git

On Tue, Nov 20, 2007 at 02:16:17PM +0100, Damien Wyart wrote:
> Hello,
>
> > > > Ok, so it's not synchronous writes that we are doing - we're just
> > > > submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> > > > The "synchronous" nature appears to be coming from higher level
> > > > locking when reclaiming inodes (on the flush lock). It appears that
> > > > inode write clustering is failing completely so we are writing the
> > > > same block multiple times i.e. once for each inode in the cluster we
> > > > have to write.
>
> > > Works for me. The only remaining stalls are sub second and look
> > > completely valid, considering the amount of files being removed.
> > ....
> > > Tested-by: Torsten Kaiser <[email protected]>
>
> * David Chinner <[email protected]> [2007-11-08 11:38]:
> > Great - thanks for reporting the problem and testing the fix.
>
> This patch has not yet made its way into 2.6.24 (rc3). Is it intended?
> Maybe the fix can wait for 2.6.25, but wanted to make sure...

The patch is in the XFS dev tree being QA'd, and we will push it
to 2.6.24-rcX in the next few days.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group