LinuxLists.cc - Re: 2.6.36 io bring the system to its knees

2010-11-04 23:55:35

Subject: Re: 2.6.36 io bring the system to its knees

On Thu, 28 Oct 2010, Chris Mason wrote:

> On Thu, Oct 28, 2010 at 03:30:36PM +0200, Ingo Molnar wrote:
> >
> > "Many seconds freezes" and slowdowns won't be fixed via the VFS scalability patches
> > I'm afraid.
> >
> > This has the appearance of some really bad IO or VM latency problem. Unfixed and
> > present in stable kernel versions going from years ago all the way to v2.6.36.
>
> Hmmm, the workload you're describing here has two special parts. First
> it dramatically overloads the disk, and then it has guis doing things
> waiting for the disk.
>

Just want to chime in with a 'me too'.

I see something similar on Arch Linux when running 'pacman -Syyuv' with
many (as in more than 5-10) updates to apply. While the update is
running, even if that's all the system is doing, responsiveness is
terrible. Starting 'chromium', which is usually instant (less than 2
seconds at worst), can take upwards of 10 seconds. The mouse cursor in X
starts to jump a bit, and switching virtual desktops noticeably lags
while the new desktop is redrawn if a full-screen app like gimp or
OpenOffice is open there. This is on a Lenovo Thinkpad R61i with an
'Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz' CPU, 2GB of memory and
499996 kilobytes of swap.

--
Jesper Juhl <[email protected]> http://www.chaosbits.net/
Plain text mails only, please http://www.expita.com/nomime.html
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html

2010-11-04 23:59:17

by Jesper Juhl

Subject: Re: 2.6.36 io bring the system to its knees

Forgot to mention the kernel I currently see this with:

[jj@dragon ~]$ uname -a
Linux dragon 2.6.35-ARCH #1 SMP PREEMPT Sat Oct 30 21:22:26 CEST 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux

--
Jesper Juhl <[email protected]> http://www.chaosbits.net/
Plain text mails only, please http://www.expita.com/nomime.html
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html

2010-11-05 01:44:58

by Dave Chinner

Subject: Re: 2.6.36 io bring the system to its knees

On Fri, Nov 05, 2010 at 12:48:17AM +0100, Jesper Juhl wrote:
> Forgot to mention the kernel I currently experience this with :
>
> [jj@dragon ~]$ uname -a
> Linux dragon 2.6.35-ARCH #1 SMP PREEMPT Sat Oct 30 21:22:26 CEST 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux

I think anyone reporting an interactivity problem also needs to
indicate what their filesystem is, what mount parameters they are
using, what their storage config is, whether barriers are active or
not, what elevator they are using, whether one or more of the
applications are issuing fsync() or sync() calls, and so on.

Basically, what we need to know is whether these problems are
isolated to a particular filesystem or storage type because
they may simply be known problems (e.g. the ext3 fsync-the-world
problem).

Cheers,

Dave.
--
Dave Chinner
[email protected]
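
Most of the details requested above can be gathered in one go. The following is a minimal sketch, assuming a Linux system with sysfs mounted and a shell with the usual sed/grep tools; the function names active_scheduler and report_storage are made up for illustration, and the grep patterns are only a guess at which kernel messages are relevant.

```shell
#!/bin/sh
# Illustrative helpers; the names are hypothetical, not from the thread.

# Read a /sys/block/<dev>/queue/scheduler line on stdin and print the
# active elevator, which the kernel marks with square brackets.
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Print kernel version, elevator, mount options and journal/barrier
# messages for one disk, e.g. "report_storage sda".
report_storage() {
    dev="$1"
    echo "kernel:   $(uname -r)"
    echo "elevator: $(active_scheduler < "/sys/block/$dev/queue/scheduler")"
    echo "mounts:"
    grep "^/dev/$dev" /proc/mounts
    echo "journal/barrier messages:"
    dmesg | grep -i -E 'barrier|ext[34]|xfs'
}
```

Running "report_storage sda" as root prints one block of text that can be pasted into a report. Filesystems mounted by UUID or label will not match the simple "^/dev/sda" pattern, so the mount lines may need to be added by hand.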

2010-11-05 12:48:25

by Sanjoy Mahajan

Subject: Re: 2.6.36 io bring the system to its knees

Dave Chinner <[email protected]> wrote:

> I think anyone reporting an interactivity problem also needs to
> indicate what their filesystem is, what mount parameters they are
> using, what their storage config is, whether barriers are active or
> not, what elevator they are using, whether one or more of the
> applications are issuing fsync() or sync() calls, and so on.

Good idea.

The filesystems are all ext3 with default mount parameters. The dmesgs
say that the filesystems are mounted in ordered data mode and that
barriers are not enabled.

mount says:

/dev/sda2 on / type ext3 (rw,errors=remount-ro,commit=0)
/dev/sda1 on /boot type ext3 (rw,commit=0)
/dev/sda3 on /home type ext3 (rw,commit=0)

> storage config

Do you mean the partition sizes? Here's that:

$ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda2    72G   52G    17G   77%  /
tmpfs       755M  4.0K   755M    1%  /lib/init/rw
udev        750M  212K   750M    1%  /dev
tmpfs       755M     0   755M    0%  /dev/shm
/dev/sda1   274M  117M   143M   45%  /boot
/dev/sda3    74G   37G    33G   53%  /home

> elevator

CFQ

> sync-related calls

I don't have a test from the time I ran rsync (but I'll check that
tonight), but I traced the currently running emacs and iceweasel
(a.k.a. firefox) with "strace -p PID 2>&1 | grep sync". That didn't
turn up any sync-related calls.

(I checked the firefox because I seem to remember that it used to do
fsync absurdly often, but I also seem to remember that the outcry made
them stop.)
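
The strace check above can be narrowed to the sync family so that nothing else has to be filtered out. This is a sketch only: count_sync_calls is a hypothetical helper name, /tmp/sync.trace is an arbitrary file, and the list of syscalls is an assumption about which ones matter here.

```shell
# Hypothetical helper: count sync-family syscalls in strace output on stdin.
# The optional prefix covers the "[pid N] " and "N " forms strace emits
# when following children (-f), to a terminal or to an -o file.
count_sync_calls() {
    grep -c -E '^(\[pid +[0-9]+\] +|[0-9]+ +)?(fsync|fdatasync|sync|msync)\(' || true
}

# Example, run as a user allowed to trace the process; stop strace with
# Ctrl-C, then count:
#   strace -f -o /tmp/sync.trace -e trace=fsync,fdatasync,sync,msync -p "$PID"
#   count_sync_calls < /tmp/sync.trace
```

Writing the trace to a file with -o avoids losing the count when strace is interrupted in the middle of a pipeline.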

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb

2010-11-06 14:10:48

by db

Subject: Re: 2.6.36 io bring the system to its knees

I personally now think that this problem is the kernel not
keeping track of readers vs writers properly, or not giving
reading processes as much time as writing ones, which then look like
they are blocking the system....

If you want to do a simple test, run an unlimited dd (or two dd's of a
limited size, say 10 GB) and a find /
Tell me how it goes :) (the system will stall)
(obviously stop the dd after some time :) ).

http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/4561
iirc can reproduce this on plain ext3.
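
The dd-plus-find test described above could be scripted roughly as follows. This is a sketch under assumptions: run_stall_test is a hypothetical name, the scratch file path is whatever the caller chooses on the filesystem under test, and GNU dd is assumed for the bs=1M syntax.

```shell
# Hypothetical helper: run a background dd writer, time a find, clean up.
#   $1 scratch file on the filesystem under test
#   $2 megabytes to write (default 10240, i.e. the 10 GB suggested above)
#   $3 directory for find to walk (default /)
#   $4 seconds to let dd build up dirty data first (default 5)
run_stall_test() {
    target="$1"; mb="${2:-10240}"; root="${3:-/}"; warmup="${4:-5}"
    dd if=/dev/zero of="$target" bs=1M count="$mb" 2>/dev/null &
    dd_pid=$!
    sleep "$warmup"
    start=$(date +%s)
    find "$root" -xdev > /dev/null 2>&1 || true
    end=$(date +%s)
    # Stop the writer if it is still running, then remove the scratch file.
    kill "$dd_pid" 2>/dev/null || true
    wait "$dd_pid" 2>/dev/null || true
    rm -f "$target"
    echo "find took $((end - start)) s"
}
```

Something like "run_stall_test /home/scratch.img" gives a single number to compare against the same find run on an idle system. It measures throughput of the find only, so it does not capture how the desktop feels during the run.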

2010-11-06 15:14:29

by Dave Chinner

Subject: Re: 2.6.36 io bring the system to its knees

On Sun, Nov 07, 2010 at 01:10:24AM +1100, dave b wrote:
> I personally now think that this problem is the kernel not
> keeping track of readers vs writers properly, or not giving
> reading processes as much time as writing ones, which then look like
> they are blocking the system....

Could be anything from that description....

> If you want to do a simple test, run an unlimited dd (or two dd's of a
> limited size, say 10 GB) and a find /
> Tell me how it goes :)

The find runs at IO latency speed while the dd processes run at disk
bandwidth:

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
vda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00
vdb        0.00    0.00   58.00  1251.00   0.45  556.54    871.45     26.69  20.39   0.72  94.32
sda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00

That looks pretty normal to me for XFS and the noop IO scheduler,
and there are no signs of latency or interactive problems in
the system at all. Kill the dd's and:

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
vda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00
vdb        0.00    0.00  214.80     0.40   1.68    0.00     15.99      0.33   1.54   1.54  33.12
sda        0.00    0.00    0.00     0.00   0.00    0.00      0.00      0.00   0.00   0.00   0.00

And the find runs 3-4x faster, but ~200 iops is about the limit
I'd expect from 7200rpm SATA drives given a single thread issuing IO
(i.e. 5ms average seek time).
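
The arithmetic behind the ~200 iops figure, and a cross-check of the second iostat sample, can be worked through with awk. This is a simplified model that assumes one outstanding request at a time and a fixed cost per request; the function names are hypothetical.

```shell
# Requests per second if each request costs a fixed latency in ms.
iops_from_latency_ms() {
    awk -v ms="$1" 'BEGIN { printf "%.0f\n", 1000 / ms }'
}

# Device utilisation in percent from request rate and service time (ms).
util_percent() {
    awk -v rate="$1" -v svc="$2" 'BEGIN { printf "%.1f\n", rate * svc / 10 }'
}

iops_from_latency_ms 5      # 5 ms per request
util_percent 214.80 1.54    # r/s and svctm from the second iostat sample
```

This prints 200 and 33.1. The second value is close to the 33.12 %util that iostat reported, which suggests the disk was mostly waiting for the single find thread to issue its next request.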

> ( the system will stall)

No, the system doesn't stall at all. It runs just fine. Sure,
anything that requires IO on the loaded filesystem is _slower_, but
if you're writing huge files to it that's pretty much expected. The
root drive (on a different spindle) is still perfectly responsive on
a cold cache:

$ sudo time find / -xdev > /dev/null
0.10user 1.87system 0:03.39elapsed 58%CPU (0avgtext+0avgdata 7008maxresident)k
0inputs+0outputs (1major+844minor)pagefaults 0swap

So what you describe is not a systemic problem, but a problem that
your system configuration triggers. That's why we need to know
_exactly_ how your storage subsystem is configured....

> http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/4561
> iirc can reproduce this on plain ext3.

You're pointing to an "fsync-tester" program that exercises a
well-known problem with ext3 (sync-the-world-on-fsync). Other
filesystems do not have that design flaw, so they don't suffer from
interactivity problems under these workloads. As it is, your above
dd workload example is not related to this fsync problem, either.

This is what I'm trying to point out - you need to describe in
significant detail your setup and what your applications are doing
so we can identify if you are seeing a known problem or not. If you
are seeing problems as a result of the above ext3 fsync problem,
then the simple answer is "don't use ext3".

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-11-06 19:11:15

by Arjan van de Ven

Subject: Re: 2.6.36 io bring the system to its knees

On Fri, 5 Nov 2010 08:48:13 -0400
Sanjoy Mahajan <[email protected]> wrote:

> Dave Chinner <[email protected]> wrote:
>
> > I think anyone reporting an interactivity problem also needs to
> > indicate what their filesystem is, what mount parameters they are
> > using, what their storage config is, whether barriers are active or
> > not, what elevator they are using, whether one or more of the
> > applications are issuing fsync() or sync() calls, and so on.
>
> Good idea.
>
> The filesystems are all ext3 with default mount parameters. The
> dmesgs say that the filesystems are mounted in ordered data mode and
> that barriers are not enabled.

BTW, a few more things to try (from my standard rc.local script):

echo 4096 > /sys/block/sda/queue/nr_requests

for i in `pidof kjournald` ; do ionice -c1 -p $i ; done

echo 75 > /proc/sys/vm/dirty_ratio

(replace sda with whatever your disk is of course)
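
When experimenting with these settings it helps to record the previous value so the change can be undone. A minimal sketch, where set_tunable is a hypothetical helper and the paths and values are the ones from the mail above:

```shell
# Hypothetical helper: show the old value of a tunable, then write the new one.
set_tunable() {
    path="$1"; new="$2"
    printf '%s: %s -> %s\n' "$path" "$(cat "$path")" "$new"
    echo "$new" > "$path"
}

# Example (as root), using the values from the mail above:
#   set_tunable /sys/block/sda/queue/nr_requests 4096
#   set_tunable /proc/sys/vm/dirty_ratio 75
```

The printed "old -> new" line is enough to restore the defaults later; none of these settings survive a reboot unless they are placed in a boot script such as rc.local.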

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-11-07 06:06:45

by db

Subject: Re: 2.6.36 io bring the system to its knees

On 7 November 2010 02:12, Dave Chinner <[email protected]> wrote:
> This is what I'm trying to point out - you need to describe in
> significant detail your setup and what your applications are doing
> so we can identify if you are seeing a known problem or not. If you
> are seeing problems as a result of the above ext3 fsync problem,
> then the simple answer is "don't use ext3".

Thank you for your reply.
Well, I am not sure :)
Is the answer "don't use ext3"?
If it is, what should I really be using instead?

2010-11-07 12:08:34

by Jens Axboe

Subject: Re: 2.6.36 io bring the system to its knees

On 2010-11-06 15:10, dave b wrote:
> I personally now think that this problem is the kernel not
> keeping track of readers vs writers properly, or not giving
> reading processes as much time as writing ones, which then look like
> they are blocking the system....
>
> If you want to do a simple test, run an unlimited dd (or two dd's of a
> limited size, say 10 GB) and a find /
> Tell me how it goes :) (the system will stall)
> (obviously stop the dd after some time :) ).
>
> http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/4561
> iirc can reproduce this on plain ext3.

As already mentioned, ext3 is just not a good choice for this sort of
thing. Did you have atimes enabled?

--
Jens Axboe

2010-11-07 15:51:18

by Linus Torvalds

Subject: Re: 2.6.36 io bring the system to its knees

On Sun, Nov 7, 2010 at 4:08 AM, Jens Axboe <[email protected]> wrote:
>
> As already mentioned, ext3 is just not a good choice for this sort of
> thing. Did you have atimes enabled?

At least for ext3, more important than atimes is the "data=writeback"
setting. Especially since our atime default is sane these days (ie if
you don't specify anything, we end up using 'relatime').

If you compile your own kernel, answer "N" to the question

Default to 'data=ordered' in ext3?

at config time (CONFIG_EXT3_DEFAULTS_TO_ORDERED), or you can make sure
"data=writeback" is in the fstab (but I don't think everything honors
it for the root filesystem).
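
For the fstab route, the option goes in the fourth field of the entry. The sketch below checks whether a given mount point already has it; has_writeback is a hypothetical helper, and the example entry reuses /dev/sda3 on /home from the ext3 report earlier in the thread with the remaining fields assumed.

```shell
# Hypothetical check: read fstab-format lines on stdin and succeed if the
# entry for mount point $1 lists data=writeback among its options.
has_writeback() {
    awk -v mp="$1" '
        $1 !~ /^#/ && $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "data=writeback") found = 1
        }
        END { exit !found }
    '
}

# An fstab entry of the kind described above (dump/pass fields assumed):
#   /dev/sda3  /home  ext3  defaults,data=writeback  0  2
```

It can be used as "has_writeback /home < /etc/fstab && echo writeback". This only inspects the configuration file; the mode actually in effect is the one the kernel logs at mount time.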

Linus

2010-11-07 17:27:46

by Jesper Juhl

Subject: Re: 2.6.36 io bring the system to its knees

On Fri, 5 Nov 2010, Dave Chinner wrote:

> I think anyone reporting an interactivity problem also needs to
> indicate what their filesystem is, what mount parameters they are
> using, what their storage config is, whether barriers are active or
> not, what elevator they are using, whether one or more of the
> applications are issuing fsync() or sync() calls, and so on.
>
Some details below.

[jj@dragon ~]$ mount
proc on /proc type proc (rw,relatime)
sys on /sys type sysfs (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=10240k,nr_inodes=255749,mode=755)
/dev/disk/by-uuid/61d104a5-4f7b-40ef-a9c8-44ad2765513e on / type ext4 (rw,commit=0)
devpts on /dev/pts type devpts (rw)
shm on /dev/shm type tmpfs (rw,nosuid,nodev)

[root@dragon ~]# hdparm -v /dev/disk/by-uuid/61d104a5-4f7b-40ef-a9c8-44ad2765513e

/dev/disk/by-uuid/61d104a5-4f7b-40ef-a9c8-44ad2765513e:
multcount = 16 (on)
IO_support = 1 (32-bit)
readonly = 0 (off)
readahead = 256 (on)
geometry = 9729/255/63, sectors = 25220160, start = 119644560

[root@dragon ~]# dmesg | grep -i ext4
EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
EXT4-fs (sda4): re-mounted. Opts: (null)
EXT4-fs (sda4): re-mounted. Opts: (null)
EXT4-fs (sda4): re-mounted. Opts: commit=0

The elevator in use is CFQ.

The app that's causing the system to behave this way (the 'pacman' package
manager in Arch Linux) makes a few calls (2-4) to fsync() during its run,
but that's all.

--
Jesper Juhl <[email protected]> http://www.chaosbits.net/
Plain text mails only, please http://www.expita.com/nomime.html
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html

2010-11-09 19:53:45

by Evgeniy Ivanov

Subject: Re: 2.6.36 io bring the system to its knees

I have almost the same problem (the system is less interactive, but no freeze happens).
Here are the tests I use (written by Alexander Nekrasov):
logrotate.sh (heavy writer): http://pastebin.com/PPnSvP2f
writetest (small writer): http://pastebin.com/616JvWEK

If you run "writetest 15 realtime", timings will be OK, but if you also
run "logrotate.sh 300 3", you will see that the RT processes start
thrashing (timings periodically increase from 50ms to 2000-4000ms).
I ran the tests on 2.6.31, but the same happens on 2.6.36. CFQ with default
settings is used. I've played with page-background.c and noticed that
writeback still works for RT processes (no write-through/disk wait). I
even tried increasing dirty_ratio for RT processes. I've also limited
the memory consumed by dd (logrotate.sh), since I had a situation where it
consumed too much and the kernel started to reclaim pages.

The test does not behave well on ext3 (compiled and mounted as Linus
suggested in this thread), but works fine on ext4 with
"data=writeback" and on XFS. I'm not sure whether that means the problem
is in ext3 and in journaling (in the case of ext4 without data=writeback).
I'm also not sure whether "data=writeback" (which makes ext4 journaling
similar to XFS) really fixes the problem. It probably increases FS
bandwidth so that we just don't see the problem, but it may still be present.
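
For readers who cannot fetch the pastebin links, a rough stand-in for a small periodic writer is sketched below. It is not the writetest program referenced above: write_latency_probe is a hypothetical name, it does not use realtime scheduling, and it assumes GNU date (for %N) and GNU dd (for conv=fsync).

```shell
# Hypothetical stand-in for a small periodic writer: write one 4k block,
# force it to disk, and print how long that took in milliseconds.
#   $1 file to write on the filesystem under test
#   $2 number of rounds (default 10)
#   $3 pause between rounds in seconds (default 1)
write_latency_probe() {
    file="$1"; rounds="${2:-10}"; pause="${3:-1}"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        start=$(date +%s%N)
        dd if=/dev/zero of="$file" bs=4k count=1 conv=fsync 2>/dev/null
        end=$(date +%s%N)
        echo "$(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
        sleep "$pause"
    done
}
```

Running it once on an idle system and once alongside a heavy writer shows whether small synchronous writes are being held up, which is the pattern the timings above describe.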

--
Evgeniy Ivanov

2010-11-09 20:20:54

by Christoph Hellwig

Subject: Re: 2.6.36 io bring the system to its knees

> I'm not sure if "data=writeback" (makes ext4 journaling similar to
> XFS) really fixes the problem

It doesn't. XFS does not expose stale data after a crash, while ext3/4
data=writeback does.

2010-11-09 21:23:12

by Chris Mason

Subject: Re: 2.6.36 io bring the system to its knees

Excerpts from Dave Chinner's message of 2010-11-04 21:43:34 -0400:
> On Fri, Nov 05, 2010 at 12:48:17AM +0100, Jesper Juhl wrote:
>
> [ the disks are slow for me too!!!!!!!!!!!!!! ]
>
> > Forgot to mention the kernel I currently experience this with :
> >
> > [jj@dragon ~]$ uname -a
> > Linux dragon 2.6.35-ARCH #1 SMP PREEMPT Sat Oct 30 21:22:26 CEST 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
>
> I think anyone reporting an interactivity problem also needs to
> indicate what their filesystem is, what mount parameters they are
> using, what their storage config is, whether barriers are active or
> not, what elevator they are using, whether one or more of the
> applications are issuing fsync() or sync() calls, and so on.
>
> Basically, what we need to know is whether these problems are
> isolated to a particular filesystem or storage type because
> they may simply be known problems (e.g. the ext3 fsync-the-world
> problem).

latencytop does help quite a lot in nailing down why we're waiting on
the disk, but the interface doesn't lend itself very well to remote
debugging. We end up asking for screen shots that may or may not really
nail down what is going on.

I've got a patch that adds latencytop -c, which you use like this:

latencytop -c >& out

It spits out latency info for all the procs every 10 seconds or so,
along with a short stack trace that often helps figure things out.

The patch is below and works properly with the current latencytop
git. If some of the people hitting bad latencies could try it, it might
help narrow things down.

From: Chris Mason <[email protected]>
Subject: [PATCH] Add latencytop -c to dump process information to the console

This adds something similar to vmstat 1 to latencytop, where
it simply does a text dump of all the process latency information
to the console every 10 seconds. Back traces are included in the
dump.

Signed-off-by: Chris Mason <[email protected]>
---
src/Makefile | 2 +-
src/latencytop.c | 38 +++++++---
src/latencytop.h | 1 +
src/text_dump.c | 199 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 227 insertions(+), 13 deletions(-)
create mode 100644 src/text_dump.c

diff --git a/src/Makefile b/src/Makefile
index de24551..1ff9740 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -6,7 +6,7 @@ SBINDIR = /usr/sbin
XCFLAGS = -W -g `pkg-config --cflags glib-2.0` -D_FORTIFY_SOURCE=2 -Wno-sign-compare
LDF = -Wl,--as-needed `pkg-config --libs glib-2.0` -lncursesw

-OBJS= latencytop.o text_display.o translate.o fsync.o
+OBJS= latencytop.o text_display.o text_dump.o translate.o fsync.o

ifdef HAS_GTK_GUI
XCFLAGS += `pkg-config --cflags gtk+-2.0` -DHAS_GTK_GUI
diff --git a/src/latencytop.c b/src/latencytop.c
index f516f53..fe252d0 100644
--- a/src/latencytop.c
+++ b/src/latencytop.c
@@ -111,6 +111,10 @@ static void fixup_reason(struct latency_line *line, char *c)
*(c2++) = 0;
} else
strncpy(line->reason, c2, 1024);
+
+ c2 = strchr(line->reason, '\n');
+ if (c2)
+ *c2=0;
}

void parse_global_list(void)
@@ -538,19 +542,13 @@ static void cleanup_sysctl(void)
int main(int argc, char **argv)
{
int i, use_gtk = 0;
+ int console_dump = 0;

enable_sysctl();
enable_fsync_tracer();
atexit(cleanup_sysctl);

-#ifdef HAS_GTK_GUI
- if (preinitialize_gtk_ui(&argc, &argv))
- use_gtk = 1;
-#endif
- if (!use_gtk)
- preinitialize_text_ui(&argc, &argv);
-
- for (i = 1; i < argc; i++)
+ for (i = 1; i < argc; i++) {
if (strcmp(argv[i],"-d") == 0) {
init_translations("latencytop.trans");
parse_global_list();
@@ -558,6 +556,17 @@ int main(int argc, char **argv)
dump_global_to_console();
return EXIT_SUCCESS;
}
+ if (strcmp(argv[i],"-c") == 0)
+ console_dump = 1;
+ }
+
+#ifdef HAS_GTK_GUI
+ if (!console_dump && preinitialize_gtk_ui(&argc, &argv))
+ use_gtk = 1;
+#endif
+ if (!console_dump && !use_gtk)
+ preinitialize_text_ui(&argc, &argv);
+
for (i = 1; i < argc; i++)
if (strcmp(argv[i], "--unknown") == 0) {
noui = 1;
@@ -579,12 +588,17 @@ int main(int argc, char **argv)
sleep(5);
fprintf(stderr, ".");
}
+
+ if (console_dump) {
+ start_text_dump();
+ } else {
#ifdef HAS_GTK_GUI
- if (use_gtk)
- start_gtk_ui();
- else
+ if (use_gtk)
+ start_gtk_ui();
+ else
#endif
- start_text_ui();
+ start_text_ui();
+ }

prune_unused_procs();
delete_list();
diff --git a/src/latencytop.h b/src/latencytop.h
index 79775ac..f3e0934 100644
--- a/src/latencytop.h
+++ b/src/latencytop.h
@@ -50,6 +50,7 @@ extern void start_gtk_ui(void);

extern void preinitialize_text_ui(int *argc, char ***argv);
extern void start_text_ui(void);
+extern void start_text_dump(void);

extern char *translate(char *line);
extern void init_translations(char *filename);
diff --git a/src/text_dump.c b/src/text_dump.c
new file mode 100644
index 0000000..76fc7b1
--- /dev/null
+++ b/src/text_dump.c
@@ -0,0 +1,199 @@
+/*
+ * Copyright 2008, Intel Corporation
+ *
+ * This file is part of LatencyTOP
+ *
+ * This program file is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program in a file named COPYING; if not, write to the
+ * Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301 USA
+ *
+ * Authors:
+ * Arjan van de Ven <[email protected]>
+ * Chris Mason <[email protected]>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/time.h>
+#include <dirent.h>
+#include <time.h>
+#include <wchar.h>
+#include <signal.h>
+
+#include <glib.h>
+
+#include "latencytop.h"
+
+static GList *cursor_e = NULL;
+static volatile sig_atomic_t done = 0;
+
+static void print_global_list(void)
+{
+ GList *item;
+ struct latency_line *line;
+ int i = 1;
+
+ printf("Globals: Cause Maximum Percentage\n");
+ item = g_list_first(lines);
+ while (item && i < 10) {
+ line = item->data;
+ item = g_list_next(item);
+
+ if (line->max*0.001 < 0.1)
+ continue;
+ printf("%s", line->reason);
+ printf("\t%5.1f msec %5.1f %%\n",
+ line->max * 0.001,
+ (line->time * 100 +0.0001) / total_time);
+ i++;
+ }
+}
+
+static void print_one_backtrace(char *trace)
+{
+ char *p;
+ int pos;
+ int after;
+ int tabs = 0;
+
+ if (!trace || !trace[0])
+ return;
+ pos = 16;
+ while(*trace && *trace == ' ')
+ trace++;
+
+ if (!trace[0])
+ return;
+
+ while(*trace) {
+ p = strchr(trace, ' ');
+ if (p) {
+ pos += p - trace + 1;
+ *p = '\0';
+ }
+ if (!tabs) {
+ /* we haven't printed anything yet */
+ printf("\t\t");
+ tabs = 1;
+ } else if (pos > 79) {
+ /*
+ * we have printed something our line is going to be
+ * long
+ */
+ printf("\n\t\t");
+ pos = 16 + p - trace + 1;
+ }
+ printf("%s ", trace);
+ if (!p)
+ break;
+
+ trace = p + 1;
+ if (trace && pos > 70) {
+ printf("\n");
+ tabs = 0;
+ pos = 16;
+ }
+ }
+ printf("\n");
+}
+
+static void print_procs(void)
+{
+ struct process *proc;
+ GList *item;
+ double total;
+
+ printf("Process details:\n");
+ item = g_list_first(procs);
+ while (item) {
+ int printit = 0;
+ GList *item2;
+ struct latency_line *line;
+ proc = item->data;
+ item = g_list_next(item);
+
+ total = 0.0;
+
+ item2 = g_list_first(proc->latencies);
+ while (item2) {
+ line = item2->data;
+ item2 = g_list_next(item2);
+ total = total + line->time;
+ }
+ item2 = g_list_first(proc->latencies);
+ while (item2) {
+ char *p;
+ char *backtrace;
+ line = item2->data;
+ item2 = g_list_next(item2);
+ if (line->max*0.001 < 0.1)
+ continue;
+ if (!printit) {
+ printf("Process %s (%i) ", proc->name, proc->pid);
+ printf("Total: %5.1f msec\n", total*0.001);
+ printit = 1;
+ }
+ printf("\t%s", line->reason);
+ printf("\t%5.1f msec %5.1f %%\n",
+ line->max * 0.001,
+ (line->time * 100 +0.0001) / total
+ );
+ print_one_backtrace(line->backtrace);
+ }
+
+ }
+}
+
+static int done_yet(int time, struct timeval *p1)
+{
+ int seconds;
+ int usecs;
+ struct timeval p2;
+ gettimeofday(&p2, NULL);
+ seconds = p2.tv_sec - p1->tv_sec;
+ usecs = p2.tv_usec - p1->tv_usec;
+
+ usecs += seconds * 1000000;
+ if (usecs > time * 1000000)
+ return 1;
+ return 0;
+}
+
+void signal_func(int foobie)
+{
+ done = 1;
+}
+
+void start_text_dump(void)
+{
+ struct timeval now;
+ struct tm *tm;
+ signal(SIGINT, signal_func);
+ signal(SIGTERM, signal_func);
+
+ while (!done) {
+ gettimeofday(&now, NULL);
+ printf("=============== %s", asctime(localtime(&now.tv_sec)));
+ update_list();
+ print_global_list();
+ print_procs();
+ if (done)
+ break;
+ sleep(10);
+ }
+}
+
--
1.6.5.2

2010-11-10 01:34:57

by Dave Chinner

Subject: Re: 2.6.36 io bring the system to its knees

On Sun, Nov 07, 2010 at 07:50:13AM -0800, Linus Torvalds wrote:
> On Sun, Nov 7, 2010 at 4:08 AM, Jens Axboe <[email protected]> wrote:
> >
> > As already mentioned, ext3 is just not a good choice for this sort of
> > thing. Did you have atimes enabled?
>
> At least for ext3, more important than atimes is the "data=writeback"
> setting. Especially since our atime default is sane these days (ie if
> you don't specify anything, we end up using 'relatime').
>
> If you compile your own kernel, answer "N" to the question
>
> Default to 'data=ordered' in ext3?
>
> at config time (CONFIG_EXT3_DEFAULTS_TO_ORDERED), or you can make sure
> "data=writeback" is in the fstab (but I don't think everything honors
> it for the root filesystem).

Don't forget to mention data=writeback is not the default because if
your system crashes or you lose power running in this mode it will
*CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention
the significant security issues (e.g stale data exposure) that also
occur even if the filesystem is not corrupted by the crash. IOWs,
data=writeback is the "fast but I'll eat your data" option for ext3.

So I recommend that nobody follows this path because it only leads
to worse trouble down the road. Your best bet is to migrate away
from ext3 to a filesystem that doesn't have such inherent ordering
problems like ext4 or XFS....

Cheers,

Dave.
--
Dave Chinner
[email protected]
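
For readers who want to check which journalling mode their ext3 mounts
are actually using before weighing the advice above, here is a minimal
sketch in C. It assumes the mode shows up as a "data=" entry in the
options field of /proc/mounts, which may not hold on every kernel.
ext3_data_mode() and report_ext3_modes() are hypothetical helper names
made up for this illustration, not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the value of the "data=" entry from a
 * comma-separated mount option string (the fourth field of a
 * /proc/mounts line) into out. Returns 1 if found, 0 otherwise.
 */
int ext3_data_mode(const char *opts, char *out, size_t outlen)
{
    const char *p = opts;

    if (outlen == 0)
        return 0;

    while (p && *p) {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len > 5 && strncmp(p, "data=", 5) == 0) {
            size_t vlen = len - 5;

            if (vlen >= outlen)
                vlen = outlen - 1;      /* truncate to fit the buffer */
            memcpy(out, p + 5, vlen);
            out[vlen] = '\0';
            return 1;
        }
        p = end ? end + 1 : NULL;
    }
    return 0;
}

/*
 * Hypothetical helper: walk a stream in /proc/mounts format and print
 * the data= mode of every ext3 entry.
 */
void report_ext3_modes(FILE *mounts)
{
    char dev[256], dir[256], type[64], opts[512], mode[32];

    while (fscanf(mounts, "%255s %255s %63s %511s %*d %*d",
                  dev, dir, type, opts) == 4) {
        if (strcmp(type, "ext3") != 0)
            continue;
        if (ext3_data_mode(opts, mode, sizeof(mode)))
            printf("%s on %s: data=%s\n", dev, dir, mode);
        else
            printf("%s on %s: no data= entry listed\n", dev, dir);
    }
}
```

A caller would open /proc/mounts with fopen() and pass the stream to
report_ext3_modes(). If no data= entry is listed, the filesystem is
presumably running with the compiled-in default discussed above.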

2010-11-10 02:02:17

by db

Subject: Re: 2.6.36 io bring the system to its knees

OK, so all of us on ext3 who want to avoid these problems should just
up and move to ext4?

2010-11-10 08:08:21

by Evgeniy Ivanov

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, Nov 10, 2010 at 4:32 AM, Dave Chinner <[email protected]> wrote:
> Don't forget to mention data=writeback is not the default because if
> your system crashes or you lose power running in this mode it will
> *CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention
> the significant security issues (e.g stale data exposure) that also
> occur even if the filesystem is not corrupted by the crash. IOWs,
> data=writeback is the "fast but I'll eat your data" option for ext3.
>
> So I recommend that nobody follows this path because it only leads
> to worse trouble down the road. Your best bet is to migrate away
> from ext3 to a filesystem that doesn't have such inherent ordering
> problems like ext4 or XFS....

Is it safe to use "data=writeback" with ext4? At least, are there
security issues?
Why do you say that the fs can be corrupted? Metadata is still
journalled, so only data might be corrupted, but the FS should still be
consistent.

--
Evgeniy Ivanov

2010-11-10 08:26:14

by Dave Chinner

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, Nov 10, 2010 at 11:08:17AM +0300, Evgeniy Ivanov wrote:
> On Wed, Nov 10, 2010 at 4:32 AM, Dave Chinner <[email protected]> wrote:
> > Don't forget to mention data=writeback is not the default because if
> > your system crashes or you lose power running in this mode it will
> > *CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention
> > the significant security issues (e.g stale data exposure) that also
> > occur even if the filesystem is not corrupted by the crash. IOWs,
> > data=writeback is the "fast but I'll eat your data" option for ext3.
> >
> > So I recommend that nobody follows this path because it only leads
> > to worse trouble down the road. Your best bet is to migrate away
> > from ext3 to a filesystem that doesn't have such inherent ordering
> > problems like ext4 or XFS....
>
> Is it safe to use "data=writeback" with ext4?

I believe the same issues exist with data=writeback in ext4, but you
probably should have an ext4 developer answer that question for
certain.

> At least are there security issues?
> Why do you say, that fs can be corrupted? Metadata is still
> journalled, so only data might be corrupted, but FS should still be
> consistent.

Data corruption is still a filesystem corruption.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-11-10 14:20:47

by Pavel Machek

Subject: Re: 2.6.36 io bring the system to its knees

Hi!

> > > As already mentioned, ext3 is just not a good choice for this sort of
> > > thing. Did you have atimes enabled?
> >
> > At least for ext3, more important than atimes is the "data=writeback"
> > setting. Especially since our atime default is sane these days (ie if
> > you don't specify anything, we end up using 'relatime').
> >
> > If you compile your own kernel, answer "N" to the question
> >
> > Default to 'data=ordered' in ext3?
> >
> > at config time (CONFIG_EXT3_DEFAULTS_TO_ORDERED), or you can make sure
> > "data=writeback" is in the fstab (but I don't think everything honors
> > it for the root filesystem).
>
> Don't forget to mention data=writeback is not the default because if
> your system crashes or you lose power running in this mode it will
> *CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention

You will lose your data, but the filesystem should still be
consistent, right? Metadata are still journaled.

> the significant security issues (e.g stale data exposure) that also
> occur even if the filesystem is not corrupted by the crash. IOWs,

I agree on security issues.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-11-10 14:22:28

by Pavel Machek

Subject: Re: 2.6.36 io bring the system to its knees

Hi!

> > At least are there security issues?
> > Why do you say, that fs can be corrupted? Metadata is still
> > journalled, so only data might be corrupted, but FS should still be
> > consistent.
>
> Data corruption is still a filesystem corruption.

As far as I understand, apps should not expect anything unless they
use fsync(). And fsync() still works in ext3...

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
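
The point about fsync() can be illustrated with a minimal sketch in C:
data is only something an application may rely on after a crash once
fsync() has returned successfully. write_durably() is a hypothetical
helper name chosen for this example, and the sketch assumes a POSIX
system; it is not taken from any program discussed in this thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and return only after
 * fsync() has reported success. Returns 0 on success, -1 on error.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* write() may be interrupted or write less than asked; loop. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Without this call the data may still sit only in the page cache. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Anything written without the fsync() step falls into the "lost one way
or another" category regardless of the data= mode in use.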

2010-11-10 14:27:57

by Ingo Molnar

Subject: Re: 2.6.36 io bring the system to its knees

* Pavel Machek <[email protected]> wrote:

> Hi!
>
> > > > As already mentioned, ext3 is just not a good choice for this sort of
> > > > thing. Did you have atimes enabled?
> > >
> > > At least for ext3, more important than atimes is the "data=writeback"
> > > setting. Especially since our atime default is sane these days (ie if
> > > you don't specify anything, we end up using 'relatime').
> > >
> > > If you compile your own kernel, answer "N" to the question
> > >
> > > Default to 'data=ordered' in ext3?
> > >
> > > at config time (CONFIG_EXT3_DEFAULTS_TO_ORDERED), or you can make sure
> > > "data=writeback" is in the fstab (but I don't think everything honors
> > > it for the root filesystem).
> >
> > Don't forget to mention data=writeback is not the default because if your system
> > crashes or you lose power running in this mode it will *CORRUPT YOUR FILESYSTEM*
> > and you *WILL LOSE DATA*. Not to mention
>
> You will lose your data, but the filesystem should still be consistent, right?
> Metadata are still journaled.

That is data that was freshly touched around the time the system went down, right?

I.e. data that was probably half-modified by user-space to begin with.

Ingo

2010-11-10 14:34:10

by Theodore Ts'o

Subject: Re: 2.6.36 io bring the system to its knees

On Nov 9, 2010, at 8:32 PM, Dave Chinner wrote:

> Don't forget to mention data=writeback is not the default because if
> your system crashes or you lose power running in this mode it will
> *CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention
> the significant security issues (e.g stale data exposure) that also
> occur even if the filesystem is not corrupted by the crash. IOWs,
> data=writeback is the "fast but I'll eat your data" option for ext3.

This is strictly speaking not true. Using data=writeback will not
cause you to lose any data --- at least, not any more than you would
without the feature. If you have applications that write files in an
unsafe way, that data is going to be lost, one way or another (e.g.,
with XFS in a similar situation you'll get a zero-length file). The
difference is that in the case of a system crash, there may be
unwritten data revealed if you use data=writeback. This could be a
security exposure, especially if you are using your system as a
time-sharing system, where you could see the contents of deleted files
belonging to another user.

So it is not an "eat your data" situation, but rather a "possibly
expose old data" one. Whether or not you care in a single-user
workstation situation is an individual judgement call. There's been a
lot of controversy about this.

The chance that this occurs using data=writeback in ext4 is much less,
BTW, because with delayed allocation we delay updating the inode until
right before we write the block. I have a plan for changing things so
that we write the data blocks *first* and then update the metadata
blocks second, which will mean that ext4 data=ordered will go away
entirely, and we'll get the safety as well as avoiding the forced data
page writeouts during journal commits.

-- Ted

2010-11-10 14:55:48

by Christoph Hellwig

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, Nov 10, 2010 at 03:27:21PM +0100, Ingo Molnar wrote:
> That is data that was freshly touched around the time the system went down, right?
>
> I.e. data that was probably half-modified by user-space to begin with.

It's data that wasn't synced out yet, yes. Which isn't the problem per
se. With ext3/4 in ordered mode, or xfs, or btrfs the file size won't
be incremented until the data is written. In ext3/4 in writeback mode
(or various non-journaling filesystems), however, the inode size is
updated, and metadata changes are logged. Besides exposing stale
data which is a security risk in multi-user systems it also means the
inode looks modified (by size and timestamps), but contains other data
than actually written.

2010-11-10 14:57:21

by Christoph Hellwig

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, Nov 10, 2010 at 09:33:29AM -0500, Theodore Tso wrote:
> The chance that this occurs using data=writeback in ext4 is much less, BTW, because with delayed allocation we delay updating the inode until right before we write the block. I have a plan for changing things so that we write the data blocks *first* and then update the metadata blocks second, which will mean that ext4 data=ordered will go away entirely, and we'll get both the safety and as well as avoiding the forced data page writeouts during journal commits.

That's the scheme used by XFS and btrfs in one form or another. Chris
also had a patch to implement it for ext3, which unfortunately fell
under the floor.

2010-11-10 15:03:37

by Chris Mason

Subject: Re: 2.6.36 io bring the system to its knees

Excerpts from Christoph Hellwig's message of 2010-11-10 09:57:12 -0500:
> On Wed, Nov 10, 2010 at 09:33:29AM -0500, Theodore Tso wrote:
> > The chance that this occurs using data=writeback in ext4 is much less, BTW, because with delayed allocation we delay updating the inode until right before we write the block. I have a plan for changing things so that we write the data blocks *first* and then update the metadata blocks second, which will mean that ext4 data=ordered will go away entirely, and we'll get both the safety and as well as avoiding the forced data page writeouts during journal commits.
>
> That's the scheme used by XFS and btrfs in one form or another. Chris
> also had a patch to implement it for ext3, which unfortunately fell
> under the floor.

It probably still applies, but by the time I had it stable I realized
that ext4 was really a better place to fix this stuff. ext3 is what it
is (good and bad), and a big change like my data=guarded code probably
isn't the best way to help.

-chris

2010-11-10 16:05:11

by Linus Torvalds

Subject: Re: 2.6.36 io bring the system to its knees

On Tue, Nov 9, 2010 at 5:32 PM, Dave Chinner <[email protected]> wrote:
>
> Don't forget to mention data=writeback is not the default because if
> your system crashes or you lose power running in this mode it will
> *CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*.

You will lose data even with data=ordered. All the data that didn't
get logged before the crash is lost anyway.

So your argument is kind of dishonest. The thing is, if you have a
crash or power outage or whatever, the only data you can really rely
on is always going to be the data that you fsync'ed before the crash.
Everything else is just gravy.

Are there downsides to "data=writeback"? Absolutely. But anybody who
tries to push those downsides without taking the performance and
latency issues into account is just not thinking straight.

Too many people think that "correct" is somehow black-and-white. It's
not. "The correct answer too late" is not worth anything. Sane people
understand that "good enough" is important.

And quite frankly, "data=writeback" is not wonderful, but it's "good
enough". And it helps enormously with at least one class of serious
performance problems. Dismissing it because it doesn't have quite the
guarantees of "data=ordered" is like saying that you should never use
"pi=3.14" for any calculations because it's not as exact as
"pi=3.14159265". The thing is, for many things, three significant
digits (or even _one_ significant digit) is plenty.

ext3 [f]sync sucks. We know. All filesystems suck. They just tend to
do it in different dimensions.

Linus

2010-11-10 16:46:25

by Alexey Dobriyan

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, Nov 10, 2010 at 5:59 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Nov 9, 2010 at 5:32 PM, Dave Chinner <[email protected]> wrote:
>>
>> Don't forget to mention data=writeback is not the default because if
>> your system crashes or you lose power running in this mode it will
>> *CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*.
>
> You will lose data even with data=ordered. All the data that didn't
> get logged before the crash is lost anyway.

Linus, are you using data=writeback?

Those of us who did (without a UPS) will never do it again.

The probability of non-trivial FS corruption becomes so much bigger.
From my experience, I believe the average number of crashes before
one loses the FS becomes a single-digit number.

With data=ordered, it's quite hard.

2010-11-10 17:01:40

by Linus Torvalds

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, Nov 10, 2010 at 8:46 AM, Alexey Dobriyan <[email protected]> wrote:
>>
>> You will lose data even with data=ordered. All the data that didn't
>> get logged before the crash is lost anyway.
>
> Linus, are you using with data=writeback?

I used to, indeed. But since I upgrade computers fairly regularly, and
all the distros have moved towards ext4, I'm no longer using ext3 at
all.

But yes, to me ext3 was totally unusable with rotational media and
"data=ordered". Not just bad. Total crap. Whenever the mail client
wanted to write something out, the whole machine basically stopped.

Of course, part of that was that long ago I used reiserfs back when
SuSE had it as the default. So I didn't think that the hiccups were
"normal" like a lot of people probably do. I knew better. So it was
"bad latency, and I know it's the filesystem that is total crap".

> Those of us, who did (without UPS), will never do it again.

Before or after the change to make renaming on top of old files do the
IO flushing?

That made a big difference for some rather common cases.

Linus
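
The "renaming on top of old files" case refers to the common pattern
of writing a new copy of a file under a temporary name and then
renaming it over the old one. A minimal sketch of doing that without
relying on the filesystem's data= mode is shown below. replace_file()
and its tmp_path argument are hypothetical names for this example, and
a short write is simply treated as an error to keep the sketch small.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the contents of path by writing buf to
 * tmp_path, flushing it with fsync(), and renaming it over path.
 * Returns 0 on success, -1 on error (the temporary file is removed).
 */
int replace_file(const char *path, const char *tmp_path,
                 const void *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* Flush the new data before it becomes visible under path. */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* rename() atomically swaps the new file in place of the old one. */
    if (rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

Applications that skip the fsync() before rename() are the ones that
the rename-time flushing change mentioned above was meant to help. A
stricter version would probably also fsync() the containing directory
after the rename.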

2010-11-10 17:10:20

by Alexey Dobriyan

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, Nov 10, 2010 at 6:55 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Nov 10, 2010 at 8:46 AM, Alexey Dobriyan <[email protected]> wrote:
>> Those of us, who did (without UPS), will never do it again.
>
> Before or after the change to make renaming on top of old files do the
> IO flushing?

It was long ago, so before the patch.

> That made a big difference for some rather common cases.

That's good.
Maybe it's only one order of magnitude more likely to lose the FS now, instead of several.
:-)

2010-11-10 18:27:52

by Mike Galbraith

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, 2010-11-10 at 18:46 +0200, Alexey Dobriyan wrote:
> On Wed, Nov 10, 2010 at 5:59 PM, Linus Torvalds
> <[email protected]> wrote:
> > On Tue, Nov 9, 2010 at 5:32 PM, Dave Chinner <[email protected]> wrote:
> >>
> >> Don't forget to mention data=writeback is not the default because if
> >> your system crashes or you lose power running in this mode it will
> >> *CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*.
> >
> > You will lose data even with data=ordered. All the data that didn't
> > get logged before the crash is lost anyway.
>
> Linus, are you using with data=writeback?
>
> Those of us, who did (without UPS), will never do it again.

I've been using it for a looong time on my desktop box. Yeah, you can
be bitten easier than ordered, and I have been, but it's never been
anything major. The risk for me is worth it, as data=ordered sucked
really bad.

If I didn't need to maintain compatibility with 30+ old kernels for
regression testing, I'd upgrade desktop to ext4, and likely be happy.

> Propability of non-trivial FS corruption becomes so much bigger.
> I believe from my experience, average number of crashes before
> one loses FS becomes single digit number.

That's not my experience. I've yet to have to rebuild my ext3 fs since
upgrading box to shiny new opensuse 11.1 (however long ago and how many
many explosions ago that was;)

> With data=ordered, it's quite hard.

2010-11-10 18:55:13

by Mark Lord

Subject: Re: 2.6.36 io bring the system to its knees

On 10-11-10 12:10 PM, Alexey Dobriyan wrote:
> On Wed, Nov 10, 2010 at 6:55 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Wed, Nov 10, 2010 at 8:46 AM, Alexey Dobriyan<[email protected]> wrote:
>>> Those of us, who did (without UPS), will never do it again.

I've used ext2 and ext3 extensively on all of the boxes here,
ever since each first became available. I developed Linux IDE,
the first IDE DMA, lots of custom storage drivers, and more recently
worked on libata drivers. This meant a LOT of sudden and catastrophic
system failures, as the bugs and other kinks were worked on.

Never lost a nibble. Totally, utterly reliable stuff for everyday use.
*WITH* the write-caches all enabled on all of the drives, too.

Sure, sudden power-failures could have a better chance of corrupting data,
but those are really rare, and the few that have happened were again non-events
here.

That's the difference between theory and practice.

Cheers
-ml

2010-11-10 19:09:42

by Pavel Machek

Subject: Re: 2.6.36 io bring the system to its knees

Hi!

> > That is data that was freshly touched around the time the system went down, right?
> >
> > I.e. data that was probably half-modified by user-space to begin with.
>
> It's data that wasn't synced out yet, yes. Which isn't the problem per
> se. With ext3/4 in ordered mode, or xfs, or btrfs the file size won't
> be incremented until the data is written. in ext3/4 in writeback mode
> (or various non-journaling filesystems) however the inode size is
> updated, and metadagta changes are logged. Besides exposing stale
> data which is a security risk in multi-user systems it also means the
> inode looks modified (by size and timestamps), but contains other data
> than actually written.

Well, afaict that's traditional unix behaviour... while it is not user
friendly, I'd not call it 'corrupted filesystem'.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-11-10 23:38:23

by Dave Chinner

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, Nov 10, 2010 at 09:33:29AM -0500, Theodore Tso wrote:
>
> On Nov 9, 2010, at 8:32 PM, Dave Chinner wrote:
>
> > Don't forget to mention data=writeback is not the default because if
> > your system crashes or you lose power running in this mode it will
> > *CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*. Not to mention
> > the significant security issues (e.g stale data exposure) that also
> > occur even if the filesystem is not corrupted by the crash. IOWs,
> > data=writeback is the "fast but I'll eat your data" option for ext3.
>
> This is strictly speaking not true. Using data=writeback will not
> cause you to lose any data --- at least, not any more than you
> would without the feature. If you have applications that write
> files in an unsafe way, that data is going to be lost, one way or
> another. (i.e., with XFS in a similar situation you'll get a
> zero-length file) The difference is that in the case of a system
> crash, there may be unwritten data revealed if you use
> data=writeback. This could be a security exposure, especially if
> you are using your system in as time-sharing system, and where you
> see the contents of deleted files belonging to another user.

In theory, that's all that is _supposed_ to happen. However, my
recent experience is that massive ext3 filesystem corruption occurs
in data=writeback mode when the system crashes and that does not
happen in ordered mode.

Why do you think I posted the patches to change the default back to
ordered mode a few months back? I basically trashed the root ext3
partitions on three test machines (to the point where >5000 files
across /sbin, /bin, /lib and /usr were corrupted or missing and I
had to reinstall from scratch) when I'd forgotten to set the
ordered-is-default config option in the kernel I was testing. And
that is when the only thing being written to the root filesystems
was log files...

The worst part about this was that I also had ext3 filesystems
corrupted by crashes in such a way that e2fsck didn't detect it but
they would repeatedly trigger kernel crashes at runtime....

> So it is not an "eat your data" situation,

My experience says otherwise....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-11-10 23:44:46

by Dave Chinner

Subject: Re: 2.6.36 io bring the system to its knees

On Wed, Nov 10, 2010 at 07:59:10AM -0800, Linus Torvalds wrote:
> On Tue, Nov 9, 2010 at 5:32 PM, Dave Chinner <[email protected]> wrote:
> >
> > Don't forget to mention data=writeback is not the default because if
> > your system crashes or you lose power running in this mode it will
> > *CORRUPT YOUR FILESYSTEM* and you *WILL LOSE DATA*.
>
> You will lose data even with data=ordered. All the data that didn't
> get logged before the crash is lost anyway.
>
> So your argument is kind of dishonest. The thing is, if you have a
> crash or power outage or whatever, the only data you can really rely
> on is always going to be the data that you fsync'ed before the crash.
> Everything else is just gravy.

I crash kernels tens of times every day doing filesystem testing.
With data=ordered I have not seen a corrupted root filesystem as a
result of normal testing and crashing as long as I can remember.
With data=writeback, I'll have corrupted root ext3 partitions in
under a day. Hardly what I'd call stable or something you'd want
to deploy in production.

Cheers,

Dave.
--
Dave Chinner
[email protected]