2007-12-11 21:31:32

by Andrew Morton

Subject: Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered


(switching to email - please respond via emailed reply-to-all, not via the
bugzilla web interface)

On Tue, 11 Dec 2007 11:36:39 -0800 (PST)
[email protected] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9546
>
> Summary: Huge latency in concurrent I/O when using data=ordered
> Product: File System
> Version: 2.5
> KernelVersion: 2.6.24-rc4
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: ext3
> AssignedTo: [email protected]
> ReportedBy: [email protected]
>
>
> Most recent kernel where this bug did not occur:
> Unknown, certainly not a regression, but something specific to ext3 algorithm
>
> Distribution:
> Bluewhite64 12 (Slackware 12 64 bits port) and Slackware 12
>
> Hardware Environment:
> Athlon 64 3000+laptop IDE 5400 80GB+1.2GB RAM
> Athlon 64X2 4200+SATA 7200 200GB drive+1GB
> Athlon 2800+IDE 7200 40GB drive+512MB
>
> Software Environment:
> dd, cp, konqueror/KDE, mount/tune2fs
>
> Problem Description:
> When the system performs heavy I/O on big files, small-file accesses from
> other applications go unserved for very long periods. This causes huge
> latencies. The system is really not usable at all, even with all the
> recent improvements made to increase desktop interactivity.
>
> This behaviour is very visible with the following simple test case:
> 1. Build a DVD structure from big MPEG+PS files with dvdauthor (it copies the
> files into the DVD structure, then passes over them to fix VOBUs, but this
> part is not very long, so it is not the main problem).
> 2. While the computer is doing this, try to open a web browser such as
> konqueror. Then open a page from the bookmarks. Then open a new tab, then open
> another page from the bookmarks. Switch back to the first page.
>
> What I get is:
> 35 seconds to open Konqueror.
> 8 seconds to open the "bookmark menu". Incredible.
> 30 seconds to open the web page (DSL/10GBits).
> 5 seconds to open the second tab.
> 6 seconds to reopen the menu.
> 36 seconds to open the second page.
> 14 seconds to come back to first tab.
> This is unbelievable! The system is completely trashed, with more than 1GB RAM,
> whatever hardware configuration is used.
>
> Of course, I investigated the problem... First, DMA is OK. Second, I thought
> caching was causing memory to be swapped, so I used echo 0 > swappiness. Then
> (of course, the system was not swapping at all) I thought TEXT sections of
> programs were being discarded (that would be simply stupid, but who knows?).
> I then tried to throttle the writing process with dirty_background_ratio
> (say 10%) while reserving a greater portion of RAM for the rest of the system
> with dirty_ratio (say 70%). No way. Then I launched top and looked at the
> WCHAN to see what was blocking the frozen process (i.e. konqueror). Then I saw
> the faulty guy: log_wait_commit!
>
> So I concluded there is unfair access to the filesystem journal. So I tried
> other journaling options than the default "ordered" data mode. The results were
> really different: "5s, 2s, 4s, etc.", both with journal and write back mode!
>
> I therefore think there is a big lock, and maybe even a priority inversion, in
> log_wait_commit of the ext3 filesystem. I think that, even if it is throttled,
> the writing process always gets access to the journal in ordered mode, simply
> because it writes many pages at a time and because the ordered mode indeed
> implies... ordering of requests (as I understand it).
>
> It's sad that the default option is the one that gives the worst interactivity
> problems. Indeed, I think this undoes all the previous work done to enhance
> the desktop experience. Too bad!
>
> Btw, I've also seen on the Internet that some people reported that journal
> data mode gives "better" performance. I think the problem was actually related
> to latency rather than throughput (timing the writing process effectively
> shows an output rate halved with journal data mode, and twice the time to
> process).
>
> Steps to reproduce:
> I did a simple script:
> #!/bin/bash
>
> SRC1=src1.bin
> SRC2=src2.bin
> DEST_DIR=tmpdir
> DST1=dst.bin
>
> # First, create the source files:
> if [ ! -e $SRC1 ] ; then
>     dd if=/dev/zero of=$SRC1 bs=10k count=150000
> fi
> if [ ! -e $SRC2 ] ; then
>     dd if=/dev/zero of=$SRC2 bs=10k count=150000
> fi
> mkdir $DEST_DIR > /dev/null 2>&1
> sync
>
> # Do the test:
> echo "Trashing the system..."
> rm $DEST_DIR/$DST1 > /dev/null 2>&1
> cp $SRC1 $DEST_DIR/$DST1
> cat $SRC2 >> $DEST_DIR/$DST1
> echo "Done!"
>
> #rm -rf $DEST_DIR $SRC1 $SRC2
>
> While running it, try to use the interactive programs "normally", for example
> konqueror (the program should have to access files, such as cookies and cache
> in konqueror's case). Then remount/tune the filesystem to use another data
> mode for ext3.
>
> I didn't try with other journaling filesystems. I guess ext2 also doesn't
> exhibit the problem.
>

Interesting that data=writeback helped this. You don't give a lot of
details, but I assume that data=writeback made a large difference here?
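For reference, switching an ext3 filesystem between the three data modes goes roughly like this (a sketch; the device and mount point are placeholders, and kernels of this era typically refuse to change data= on a plain remount, so the filesystem is unmounted first):

```shell
# Change the data journaling mode of an ext3 filesystem (sketch).
umount /mnt/test
mount -t ext3 -o data=writeback /dev/sdXN /mnt/test   # or data=journal / data=ordered

# For the root filesystem, set a persistent default with tune2fs and reboot:
tune2fs -o journal_data_writeback /dev/sdXN           # or journal_data / journal_data_ordered
```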


2007-12-11 22:43:26

by Christian Casteyde

Subject: Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered

OK, I'll reply to all. I'm using the first mail received.

> Interesting that data=writeback helped this. You don't give a lot of
> details, but I assume that data=writeback made a large difference here?

Well, my measurements were taken by counting seconds on the desktop clock,
so the precision is roughly 1s. Here are the actual results for the three
journaling modes, taken yesterday:

Times in seconds:
ordered mode   : 17  5 38 19 33
journal mode   :  6  1  5  6  5
writeback mode :  4  1  7  6  8

I do not remember exactly the operation sequence, but the beginning was
opening Konqueror, opening the bookmarks menu, loading a page, and opening
another tab.

As you can see, anyway, ordered mode is completely broken. Opening the menu
takes about 1s in journal and writeback modes, which is perfectly acceptable
for a user (after all, the computer is loaded, and interactivity does not
simply mean running interactive tasks at real-time priority, so the user can
understand that it reacts more slowly). Launching Konqueror in 4 to 6 seconds
is a good time, given that the cache is poisoned by the writing process.
Opening another tab or a web page is tolerable within 5-6s, even if this is
still a little too long, especially for opening another tab of an already
running process (2-3s would be a perfect value I think, but getting there is
another story). The 5-6s of latency is certainly due to the block size used by
the writing process, and tuning the kernel for lower values would be dangerous
for general performance. I could run tests asking dd to write more but smaller
blocks to check this idea.
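That smaller-blocks experiment could be sketched like this (scaled down roughly 100x from the original 1.5 GB so it runs quickly; file names are illustrative, not from the thread):

```shell
#!/bin/bash
# Write the same total volume twice: once in 1 KiB blocks, once in 10 KiB
# blocks, to compare how block size affects latency for concurrent readers.
dd if=/dev/zero of=src_small.bin bs=1k  count=15000 2>/dev/null   # 15000 x 1 KiB
dd if=/dev/zero of=src_big.bin   bs=10k count=1500  2>/dev/null   # 1500 x 10 KiB
small_bytes=$(stat -c %s src_small.bin)
big_bytes=$(stat -c %s src_big.bin)
echo "small-block file: $small_bytes bytes; big-block file: $big_bytes bytes"
rm -f src_small.bin src_big.bin
```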

As far as writeback is concerned, the latency reduction was not clear-cut, as
you can see, at least with my measurement procedure. The performance of the
**writing process** was nearly the same as in ordered mode. I haven't written
down the timing results for the writing process, since I was still
investigating at that time. But I used:

time dowrite.sh

to time it, and I got the real time as well as dd's output (records written,
MB/s; journal mode's rate was half of ordered's). So writeback and ordered
seemed the same at this level, and journal mode was effectively twice as slow.
There were clearly twice as many write operations, which was the expected
behaviour.

I could redo the tests with a more formal approach (that is, fixing the order
of operations done in Konqueror and noting the times for the writing process);
it's easy. In fact, I'm considering writing another script that simulates
Konqueror/user activities and times them automatically, just to get a
reproducible benchmark suite.
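A minimal version of such an automated probe might look like this (a sketch; the burst of small-file reads stands in for Konqueror's cookie/cache accesses, and all names are illustrative):

```shell
#!/bin/bash
# Time a burst of small-file reads, as a stand-in for Konqueror opening its
# cookies and cache files. Run it alone, then again while the trashing
# script is running, and compare the reported times per ext3 data mode.
PROBE_DIR=$(mktemp -d)
for i in $(seq 1 100); do
    head -c 4096 /dev/zero > "$PROBE_DIR/f$i"    # 100 small (4 KiB) files
done
sync
start=$(date +%s%N)
for i in $(seq 1 100); do
    cat "$PROBE_DIR/f$i" > /dev/null
done
end=$(date +%s%N)
elapsed_ms=$(( (end - start) / 1000000 ))
echo "small-file read burst: $elapsed_ms ms"
rm -rf "$PROBE_DIR"
```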

Please note that not all programs are affected (kmail, for instance, was quite
quick), so the kernel does a good job running programs that do few I/Os while
the fs is stressed by another process. The difference with konqueror is that
it opens several files (as block_dump shows), so my guess is that processes
doing small I/O operations are not given "journal access" fairly enough in
ordered mode.

Indeed, I imagine (without knowing any internals, of course) that the data
writes are scheduled correctly by the disk scheduler and the writing process
is throttled, but that the priority information is lost when metadata are
written in ordered mode, because all I/Os are mixed together for metadata.
Therefore the writing process manages to fill the journal quite easily. With
journal and writeback modes, all data may be written sequentially, in the
context of the issuing process, so priorities can be tracked across the whole
write path. The writing process is then preempted correctly to the benefit of
Konqueror. This is of course just a "black box" hypothesis...

CC

2007-12-12 09:15:28

by Jan Kara

Subject: Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered

> > [bug report quoted in full; snipped, see above]
> Interesting that data=writeback helped this. You don't give a lot of
> details, but I assume that data=writeback made a large difference here?
What I think could be happening:
One process generates a data-intensive load. In ordered mode, the transaction
is then tiny but has tons of data buffers attached. When a commit happens, it
takes a long time to sync all that data before the commit can proceed... In
writeback mode we don't wait for data buffers at all, and in journal mode the
amount of data to be written is limited by the maximum size of a transaction,
so we write in much smaller chunks and better latency is thus ensured.
I'll try some tests to verify whether my theory is correct :).
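One way to probe this prediction from user space is to time a tiny synchronous write issued while a big streaming write is in flight (a sketch; sizes and file names are illustrative, and the directory must be on the ext3 mount under test):

```shell
#!/bin/bash
# In data=ordered, a journal commit must flush all dirty data first, so a
# tiny sync issued behind a big streaming write should stall; in
# data=writeback it should return much faster.
dd if=/dev/zero of=big.bin bs=1M count=64 2>/dev/null &   # background writer
WRITER=$!
sleep 1                                                   # let dirty data pile up
start=$(date +%s%N)
echo x > small.bin
sync                                                      # forces a commit
end=$(date +%s%N)
wait $WRITER
echo "sync latency behind streaming write: $(( (end - start) / 1000000 )) ms"
rm -f big.bin small.bin
```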

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-12-12 14:43:45

by Jan Kara

Subject: Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered

> > [bug report and Andrew's reply quoted in full; snipped, see above]
> What I think could be happening:
> One process generates a data-intensive load. In ordered mode, the transaction
> is then tiny but has tons of data buffers attached. When a commit happens, it
> takes a long time to sync all that data before the commit can proceed... In
> writeback mode we don't wait for data buffers at all, and in journal mode the
> amount of data to be written is limited by the maximum size of a transaction,
> so we write in much smaller chunks and better latency is thus ensured.
> I'll try some tests to verify whether my theory is correct :).
Hmm, I couldn't reproduce the bad behaviour... But anyway, if my theory
is correct, the attached patch should help - can you try it, please? It's
debugging quality only, but it shouldn't do anything bad to your data :)

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs


Attachments:
(No filename) (6.45 kB)
jbd-2.6.24-rc5-jbd_latency_fix.diff (1.58 kB)