Hi,
I've encountered a strange problem - if an application is sequentially
writing a large file on a busy machine, a single write() of 64KB may
take as long as 30 seconds. But if I do fsync() after each write() the
maximum time of write()+fsync() is about 0.5 second (the overall
performance is, of course, degraded).

The point is that some applications (samba+smbclient) time out after
20s waiting for write().

Does anybody have an idea how to tune the kernel to avoid this strange
delay in write()?

I've tried to experiment with the cfq and deadline IO schedulers -
without success. Decreasing /proc/sys/vm/dirty_ratio to 5% helps a
little.

If somebody wants to test it, the tool I've written for measuring the
maximum write() time is here: http://galaxy.agh.edu.pl/~szymans/writetimer
(a simplified sketch of its core loop follows the steps below).
1. Compile writetimer.c
2. Start a large background read from the disk
3. Simultaneously write 10 files of 200MB each (write() without fsync()):
for i in 1 2 3 4 5 6 7 8 9 0 ; do ./writetimer 200 > testfile$i & done
4. Then repeat with fsync() after each write():
for i in 1 2 3 4 5 6 7 8 9 0 ; do ./writetimer -200 > testfile$i & done
(a negative file size turns on fsync())
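For reference, the core loop looks roughly like this (a simplified
sketch - the chunk size and message details are approximations; see the
URL above for the real source):

/* writetimer (simplified sketch): write N megabytes to stdout, timing
 * each write(); a negative size enables fsync() after every write(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	static char buf[65536];		/* chunk size is a guess */
	int mb = argc > 1 ? atoi(argv[1]) : 200;
	int do_fsync = mb < 0;
	long chunks = labs(mb) * 1024L * 1024L / sizeof(buf);
	long ms, max_ms = 0;
	struct timeval t0, t1;

	memset(buf, 'x', sizeof(buf));
	fprintf(stderr, "Writing %ld MB to stdout %s fsync\n",
		labs(mb), do_fsync ? "with" : "without");
	while (chunks--) {
		gettimeofday(&t0, NULL);
		write(1, buf, sizeof(buf));
		if (do_fsync)
			fsync(1);
		gettimeofday(&t1, NULL);
		ms = (t1.tv_sec - t0.tv_sec) * 1000 +
		     (t1.tv_usec - t0.tv_usec) / 1000;
		if (ms > 3000)
			fprintf(stderr, "%d - Delayed %ld ms.\n",
				getpid(), ms);
		if (ms > max_ms)
			max_ms = ms;
	}
	fprintf(stderr, "Max write delay: %ld ms.\n", max_ms);
	return 0;
}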
Tested on
- 2.6.15-23 (512MB RAM, Pentium-M 1.7, Ubuntu 6.06, ATA disk)
- 2.6.17-1.2145_FC5 (512MB RAM, Pentium-M 1.7, Fedora Core 5, ATA disk)
- 2.6.12-2.3.legacy_FC3smp (2GB RAM, Fedora Core 3, software RAID 5 on 4
ATA disks)
Thanks,
Andrzej
Andrzej Szymanski wrote:
> Hi,
>
> I've encountered a strange problem - if an application is sequentially
> writing a large file on a busy machine, a single write() of 64KB may
> take as long as 30 seconds. But if I do fsync() after each write() the
> maximum time of write()+fsync() is about 0.5 second (the overall
> performance is, of course, degraded).
> [...]
Sounds like a problem with the elevator code. Given how often the memory
management gets rewritten and busted, there is no telling where the
condition is coming from. I have not seen this problem on released
kernels with the distros, but I have seen it on post-2.6.14 kernels,
which is why I am not using them. It seems related to user-space memory
usage in some way, and not the actual elevator code. Try an older kernel
and see if it goes away.
Jeff
On Mon, 14 Aug 2006 18:36:44 +0200, Andrzej Szymanski <[email protected]> wrote:
>I've encountered a strange problem - if an application is sequentially
>writing a large file on a busy machine, a single write() of 64KB may
>take as long as 30 seconds. But if I do fsync() after each write() the
>maximum time of write()+fsync() is about 0.5 second (the overall
>performance is, of course, degraded).
>[...]
Does top show what's holding up the file I/O? It rings a vague bell, but
I'm not able to run tests here for a little while, until some other
stuff completes.
Grant.
On Mon, 14 Aug 2006 18:36:44 +0200
Andrzej Szymanski <[email protected]> wrote:
> I've encountered a strange problem - if an application is sequentially
> writing a large file on a busy machine, a single write() of 64KB may
> take as long as 30 seconds. But if I do fsync() after each write() the
> maximum time of write()+fsync() is about 0.5 second (the overall
> performance is, of course, degraded).
> [...]
> I've tried to experiment with the cfq and deadline IO schedulers -
> without success. Decreasing /proc/sys/vm/dirty_ratio to 5% helps a
> little.
> [...]
Which filesystem?
If ext3 in ordered-data mode: any fsync() will sync the whole filesystem
(it has to). Mounting with `-o writeback' should help.
On Tue, 15 Aug 2006 00:50:25 -0700
Andrew Morton <[email protected]> wrote:
> Mounting with `-o writeback' should help.
err, `-o data=writeback'
On Tue, 15 Aug 2006, Andrew Morton wrote:
> Which filesystem?
>
> If ext3 in ordered-data mode: any fsync() will sync the whole filesystem
> (it has to).
Could you explain in more detail why it has to? Is this caused by the
design of ext3, or does any filesystem in ordered-data mode have to do it?
Thanks,
Grzegorz Kulewski
On Tue, 15 Aug 2006 14:40:48 +0200 (CEST)
Grzegorz Kulewski <[email protected]> wrote:
> On Tue, 15 Aug 2006, Andrew Morton wrote:
> > If ext3 in ordered-data mode: any fsync() will sync the whole filesystem
> > (it has to).
>
> Could you explain in more detail why it has to? Is this caused by the
> design of ext3, or does any filesystem in ordered-data mode have to do it?
>
The journal is a shared resource - a single swipe of disk which is written
(logically) atomically and which contains the metadata modifications for
many files. As filesystem activity proceeds we attach more and more
metadata blocks to the journal, and when a commit happens we write them
all out in one hit.
Hence an fsync() of a single file ends up journalling the metadata for all
files which have pending metadata writes.
And in data=ordered mode the filesystem must write all the user-data for a
file before it writes the metadata which refers to that data.
IOW, a single fsync() triggers the journalling of all file metadata which
requires the writing of all file data.
In data=writeback mode the fs doesn't have the requirement that file
user-data be written prior to the journalling of the metadata which refers
to that data, so we can leave the pagecache for all the non-fsynced files
floating about in memory, still dirty.
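The entanglement is easy to observe with a little test like this (an
untested sketch - file names are arbitrary): run one copy as `writer' to
keep dirtying a big file, run another with no arguments to time fsync()
of a tiny file on the same filesystem, and compare data=ordered against
data=writeback.

/* fsync-entanglement sketch: in data=ordered mode the tiny fsync()
 * below should stall behind the big writer's dirty data; in
 * data=writeback mode it should stay fast. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	static char buf[65536];
	struct timeval t0, t1;
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	if (argc > 1 && strcmp(argv[1], "writer") == 0) {
		/* Dirty ~256MB over and over without ever fsyncing. */
		fd = open("big.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
		for (;;) {
			for (i = 0; i < 4096; i++)
				write(fd, buf, sizeof(buf));
			lseek(fd, 0, SEEK_SET);	/* re-dirty the same pages */
		}
	}
	/* Probe: time fsync() of a one-block file once per second. */
	fd = open("tiny.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	for (;;) {
		write(fd, buf, 512);
		gettimeofday(&t0, NULL);
		fsync(fd);
		gettimeofday(&t1, NULL);
		printf("fsync took %ld ms\n",
		       (t1.tv_sec - t0.tv_sec) * 1000 +
		       (t1.tv_usec - t0.tv_usec) / 1000);
		fflush(stdout);
		sleep(1);
	}
}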
[email protected] said:
> I've encountered a strange problem - if an application is sequentially
> writing a large file on a busy machine, a single write() of 64KB may
> take as long as 30 seconds. [...]
If the sleeps are that long, and reproducible, then maybe you can find
the offending wait by using sysrq-t while the writer is blocked.
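For anyone without console access, writing 't' to /proc/sysrq-trigger
should produce the same dump in the kernel log (needs root and
CONFIG_MAGIC_SYSRQ); a trivial sketch:

/* Trigger the equivalent of Alt-SysRq-T; the task traces appear in
 * dmesg / the kernel log. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sysrq-trigger", "w");
	if (!f) {
		perror("/proc/sysrq-trigger");
		return 1;
	}
	fputc('t', f);
	fclose(f);
	return 0;
}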
Jason
Based on your comments I've made some more tests with different kernels,
schedulers and ext3 options (yes, I'm using ext3).
Using ext3 with data=writeback instead of data=ordered doesn't help, and
sometimes even makes things slightly worse.
Changing the kernel version from 2.6.9 to 2.6.17 has only a minor effect
on the write delay.
Different schedulers (CFQ and deadline) on different disks and
especially with different kinds of software RAID give huge differences:
Machine 1: PIV 3GHz 2GB RAM (kernel 2.6.11)
- (4 x ATA 7200 RPM 120GB):
  - software RAID5: deadline 10-30s, CFQ 10-36s (I guess the reads
    required to compute XOR are the killer)
  - software RAID10: deadline 3-6s, CFQ 4-14s
- software RAID0 (2 x SCSI 15000 RPM 36GB): deadline 1-3s, CFQ 2-10s
Machine 2: PIV 3.2GHz 512MB RAM (kernels 2.6.9, 2.6.11, 2.6.12, 2.6.17)
- single ATA 7200 RPM 60GB disk: deadline 3-8s, CFQ 10-20s
Laptop: Pentium-M 1.7, 512MB RAM (kernel 2.6.17)
- single 2.5" 5400 RPM 80GB ATA disk: deadline 6-23s, CFQ 6-17s
Each time given is the maximum write() time (for a 64KB write) I
observed across the simultaneously writing threads (the numbers come
from a single pass with a specific load, so they are not statistically
valid).
Just for comparison: on Machine 1 (RAID5, deadline scheduler), writes
with fsync() after each write() take at most 1.1-1.7s.
I've also made a single test with the XFS filesystem - the problem
remains, so it does not seem to be tied to ext3.
To me it seems that Linux has somehow hardcoded minimum expectations
about disk performance (maybe the number of pages flushed at once):
with SCSI everything is OK, with 7200 RPM ATA it is more or less OK,
with 5400 RPM ATA it is much worse, and with ATA RAID5 it is terrible.
Based on that I have moved the data disks on my server from RAID5 to
RAID10 and set the scheduler to deadline, so my main problem with samba
timeouts is (hopefully) resolved for now.
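(For the record, on kernels with runtime elevator switching the
scheduler change amounts to a write to sysfs - a sketch, with the device
name as a placeholder:)

/* Equivalent of: echo deadline > /sys/block/hda/queue/scheduler */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/block/hda/queue/scheduler", "w");
	if (!f) {
		perror("/sys/block/hda/queue/scheduler");
		return 1;
	}
	fputs("deadline\n", f);
	fclose(f);
	return 0;
}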
Thanks for help!
Andrzej.
In article <[email protected]>,
Andrzej Szymanski <[email protected]> wrote:
>I've encountered a strange problem - if an application is sequentially
>writing a large file on a busy machine, a single write() of 64KB may
>take as long as 30 seconds. But if I do fsync() after each write() the
>maximum time of write()+fsync() is about 0.5 second (the overall
>performance is, of course, degraded).
I'm seeing something similar.
I upgraded one of our newsrouters from 2.6.14.2 to 2.6.17.8 because
I needed ethernet bonding with vlan support.
It performs quite a bit worse now than before. The nightly report that
the INN software writes tells me that the average write() time (of the
single-threaded innd process to news storage) went up from min/avg/max
0.942/1.716/2.337 ms to min/avg/max 2.952/4.553/5.658 ms.
The innd process writes to a simple filesystem I wrote myself that
exposes a block device as a single large file. It's a bit more efficient
than using the block device directly.
So I don't see large write delays, but the average write() time
has gone up significantly (bad in my case, since it starves the
innd process).
Since I've also had several unexplained hangs, I'm going back to
2.6.14.x for now - this machine is too important. As soon as I've got
some more redundancy I'll experiment some more.
Mike.
On Thursday August 17, [email protected] wrote:
> >I've encountered a strange problem - if an application is sequentially
> >writing a large file on a busy machine, a single write() of 64KB may
> >take as long as 30 seconds. [...]
>
> I'm seeing something similar.
Can you report the contents of /proc/meminfo before, during, and after
the long pause?
I'm particularly interested in MemTotal, Dirty, and Writeback, but the
others are of interest too.
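Something like this, left running in the background, would capture them
once a second (a quick sketch):

/* Sample MemTotal/Dirty/Writeback from /proc/meminfo every second so
 * the values before, during and after a long pause are all logged. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[128];
	FILE *f;

	for (;;) {
		f = fopen("/proc/meminfo", "r");
		if (!f) {
			perror("/proc/meminfo");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (strstr(line, "MemTotal") ||
			    strstr(line, "Dirty") ||
			    strstr(line, "Writeback"))
				fputs(line, stdout);
		fclose(f);
		printf("---\n");
		fflush(stdout);
		sleep(1);
	}
}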
Thanks,
NeilBrown
Neil Brown wrote:
> On Thursday August 17, [email protected] wrote:
>
> Can you report the contents of /proc/meminfo before, during, and after
> the long pause?
> I'm particularly interested in MemTotal, Dirty, and Writeback, but the
> others are of interest too.
I've prepared two logs, from different machines, the first one is on
software RAID5 (4 ATA disks) with deadline scheduler, the second on a
single ATA disk with CFQ scheduler. In the first case 10 writer threads
are sufficient to give large delays, in the second case I've run an
additional tar thread reading from the disk.
The logs are here:
http://galaxy.agh.edu.pl/~szymans/logs/
Each writer starts with:
Writing 200 MB to stdout without fsync
then reports each write() that lasts > 3s, along with its pid:
6582 - Delayed 4806 ms.
And finishes with:
Max write delay: 14968 ms.
Andrzej.
On Monday August 21, [email protected] wrote:
> I've prepared two logs, from different machines, the first one is on
> software RAID5 (4 ATA disks) with deadline scheduler, the second on a
> single ATA disk with CFQ scheduler.
> [...]
> The logs are here:
> http://galaxy.agh.edu.pl/~szymans/logs/
Hmm...
I spent altogether too long exploring this :-)
I think all that is happening here is that the delay that has to be
imposed on the different write calls is being imposed somewhat
unfairly.
Suppose your filesystem/device can sustain 15MB/sec (like my
notebook).
Then with 10 concurrent threads, each should expect to write at
1.5MB/sec, or 1 megabyte every 667 milliseconds.
So you might expect each 1Meg write to take 667msec.
However Linux only imposes write throttling every 4Meg (if you have
128Meg of RAM or more). So you would expect 3 out of 4 requests to be
instantaneous, and 1 out of 4 to take 2.667 seconds.
However it is very hard to impose rate limiting completely uniformly.
For example, the block device driver imposes a queue length limit; when
space becomes available on the queue, it is fairly random which process
gets to use it.
Further, the block layer actually imposes some unfairness: when a
process gets an entry on the queue, it is (almost) guaranteed another
31, so it will make lots of progress at the expense of everyone else.
It does this to improve throughput (I believe).
So individual 1Meg writes taking a few seconds is quite likely, and
having a few take several seconds should be expected.
You report numbers up to 26 seconds. That is clearly quite a lot, but
I'm not sure there is much to be done about it - imposing strict
fairness will always have a cost. That 26 seconds was in a run with
2000 writes. You would expect 500 of them to hit delays; only 63 hit
delays > 3 seconds.
In my various experiments, the one thing that was effective in
improving fairness was to make Linux impose write throttling more
often.
In mm/page-writeback.c, in set_ratelimit() where ratelimit_pages is
calculated, cap it much lower - e.g. set it unconditionally to 16.
This might increase CPU load on multi-processor machines, but it does
seem to increase the fairness of write throttling.
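Roughly like this (a sketch against a 2.6.17-era mm/page-writeback.c,
not a real patch; the stock function scales ratelimit_pages with memory
and CPU count and caps it around 4MB worth of pages):

/* mm/page-writeback.c -- sketch of the experiment described above */
static void set_ratelimit(void)
{
	/* Throttle every 16 pages (64KB with 4KB pages) instead of the
	 * usual ~4MB, trading some CPU overhead for write fairness. */
	ratelimit_pages = 16;
}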
NeilBrown
On Tuesday August 22, [email protected] wrote:
>
> In my various experiments, the one thing that was effective in
> improving fairness was to make Linux impose write throttling more
> often.
I might have found something else too....
Were you using ext3?
If so, can you try mounting with data=writeback
and see if that makes any difference to the fairness?
Thanks,
NeilBrown
Neil Brown wrote:
> On Tuesday August 22, [email protected] wrote:
>> In my various experiments, the one thing that was effective in
>> improving fairness was to make Linux impose write throttling more
>> often.
>
> I might have found something else too....
>
> Were you using ext3?
>
> If so, can you try mounting with data=writeback
> and see if that makes any difference to the fairness?
I've already tried data=writeback - it makes almost no difference, or
even makes things slightly worse. I've briefly tested the XFS filesystem
and saw the same behavior as with ext3, so the problem does not seem to
be ext3-related.
Andrzej.