2003-03-26 20:20:11

by Erik Hensema

[permalink] [raw]
Subject: Delaying writes to disk when there's no need

In all kernels I've tested, writes to disk are delayed for a long time even
when there's no need to do so.

A very simple test shows this: on an otherwise idle system, create a tar of
a NFS-mounted filesystem to a local disk. The kernel starts writing out the
data after 30 seconds, while a slow and steady stream would be much nicer
to the system, I think.

On 2.4.x this can block the system for several seconds. 2.5.6x and
2.5.6x-mm (with AS) also show this behaviour, but the system doesn't block
anymore. I'm using a preemptible kernel.

I only started to notice this behaviour when I upgraded from 256 MB ram to
512 MB. In other words: Linux behaves more nicely with 256 MB.

Attached is a vmstat trace, it starts at the moment the tar process starts
too. It's a simple tar cf /local/test.tar /home (/home being mounted over
NFS).

Further data:
SIS5513: IDE controller at PCI slot 00:02.5
SIS5513: chipset revision 208
SIS5513: not 100% native mode: will probe irqs later
SiS745 ATA 100 controller
ide0: BM-DMA at 0xff00-0xff07, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0xff08-0xff0f, BIOS settings: hdc:DMA, hdd:DMA
hda: MAXTOR 6L080J4, ATA DISK drive

It's an 80 GB ATA 133 drive.

AMD Athlon XP 1800+, 512 MB ram.

Attached vmstat was created on:
Linux bender 2.5.66-mm1 #10 Wed Mar 26 15:16:17 CET 2003 i686 unknown

tar output was going to a logical volume formatted with reiserfs.

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
5 0 0 0 123720 45068 200044 0 0 9 10 1038 766 73 1 26
0 1 0 0 144640 45488 178568 0 0 420 0 1587 2139 9 15 76
0 1 0 0 137976 45496 184624 0 0 4 0 4543 3415 17 28 54
0 1 0 0 134848 45496 188112 0 0 0 0 2877 2311 21 29 50
0 1 0 0 124992 45500 197848 0 0 0 0 5313 2795 32 59 9
4 1 0 0 110528 45604 212088 0 0 0 496 7460 3510 24 72 4
3 1 0 0 105408 45608 217068 0 0 0 0 3594 2436 20 27 53
6 0 0 0 104192 45608 218160 0 0 0 0 2037 2068 15 18 68
3 0 0 0 99328 45608 222736 0 0 0 0 4064 3484 16 26 57
4 1 0 0 95040 45612 226888 0 0 0 0 3379 2500 24 30 45
5 1 0 0 90752 45656 230936 0 0 0 116 3497 2821 19 21 60
1 1 0 0 85056 45660 236568 0 0 0 0 3744 2369 25 45 30
1 1 0 0 83200 45660 238308 0 0 0 0 2786 3216 40 18 42
3 1 0 0 81400 45660 239876 0 0 0 0 3020 3587 22 21 57
1 1 0 0 71608 45664 249512 0 0 0 0 5742 3378 23 48 29
0 1 1 0 63872 45716 257040 0 0 0 124 5063 3405 25 36 39
2 1 1 0 56064 45720 264752 0 0 0 0 4652 2694 26 52 22
2 1 0 0 47936 45724 272660 0 0 0 0 5443 3850 27 41 32
1 1 0 0 46912 45724 273664 0 0 0 0 2300 4372 66 21 13
3 1 0 0 41728 45728 278732 0 0 0 0 3755 3370 50 37 13
2 1 0 0 36928 45780 283368 0 0 0 124 4549 4094 33 33 35
1 1 1 0 31680 45780 288508 0 0 0 0 4116 2931 23 31 46
0 1 0 0 28672 45780 291436 0 0 0 0 2694 2183 33 33 33
3 1 0 0 24448 45784 295556 0 0 0 0 3152 2322 19 30 51
0 1 0 0 24256 45784 295724 0 0 0 0 1354 1414 6 4 90
4 0 1 0 20800 45824 298184 0 0 0 21120 3267 3374 22 20 57
4 0 1 0 18304 45828 300440 0 0 0 36328 3583 4013 47 53 0
3 0 0 0 16640 45828 302468 0 0 0 0 3594 4084 16 15 69
4 1 1 0 14456 45828 304332 0 0 0 0 3531 3946 23 20 57


--
Erik Hensema <[email protected]>


2003-03-27 08:53:18

by Helge Hafting

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Erik Hensema wrote:
> In all kernels I've tested writes to disk are delayed a long time even when
> there's no need to do so.
>
Short answer - it is supposed to do that!

> A very simple test shows this: on an otherwise idle system, create a tar of
> a NFS-mounted filesystem to a local disk. The kernel starts writing out the
> data after 30 seconds, while a slow and steady stream would be much nicer
> to the system, I think.
>
You're wrong then. There's no need for a slow steady stream; why would
you want that? Of course you can set up cron to run sync at
regular (short) intervals to achieve this.

> On 2.4.x this can block the system for several seconds. 2.5.6x and
> 2.5.6x-mm (with AS) also show this behaviour, but the system doesn't block
> anymore. I'm using a preemtable kernel.
>
Writing out stuff is not supposed to block the machine, and as you say,
it is fixed in 2.5. No need for the steady writing.

> I only started to notice this behaviour when I upgraded from 256 MB ram to
> 512 MB. In other words: Linux behaves more nicely with 256 MB.
>
Why do you think that is more nice?

Writing is delayed because that accumulates bigger writes and
fewer seeks. This helps performance a lot. Delaying writes
has another advantage - some writes won't be done at all,
saving 100% of the writing time. This is the case for temporary
files that get written to, read, and deleted before they
ever reach the disk. It all happens in cache, improving
performance tremendously. To see the alternative,
try booting with mem=4M or 16M or some such, with _no_ swapping.

Another case is a file that gets overwritten several times.
This all happens in memory because of the delay; only the final
version gets written to disk. This is very common, even for files
that are written only once. (Slowly extending a file
and writing it to disk every time _will_ write the same stuff
over and over, because it is impossible to add just a few bytes.
You always write a whole block; adding anything less than that
involves reading the half-full block, updating it, and writing it
back. Keeping such operations in memory saves quite a few
disk writes.)

For more detailed information, read a book about how filesystems and
disk caching works.

Helge Hafting

2003-03-27 11:11:22

by Erik Hensema

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Helge Hafting ([email protected]) wrote:
> Erik Hensema wrote:
>> In all kernels I've tested writes to disk are delayed a long time even when
>> there's no need to do so.
>>
> Short answer - it is supposed to do that!
>
>> A very simple test shows this: on an otherwise idle system, create a tar of
>> a NFS-mounted filesystem to a local disk. The kernel starts writing out the
>> data after 30 seconds, while a slow and steady stream would be much nicer
>> to the system, I think.
>>
> You're wrong then. There's no need for a slow steady stream, why do
> you want that. Of course you can set up cron to run sync at
> regular (short) intervals to achieve this.
>
>> On 2.4.x this can block the system for several seconds. 2.5.6x and
>> 2.5.6x-mm (with AS) also show this behaviour, but the system doesn't block
>> anymore. I'm using a preemtable kernel.
>>
> Writing out stuff is not supposed to block the machine, and as you say,
> it is fixed in 2.5. No need for the steady writing.
>
>> I only started to notice this behaviour when I upgraded from 256 MB ram to
>> 512 MB. In other words: Linux behaves more nicely with 256 MB.
>>
> Why do you think that is more nice?

Because the interactivity of the system is better with less memory.

> Writing is delayed because that accumulate bigger writes and
> fewer seeks. This helps performance a lot. Delaying writes
> has another advantage - somw writes won't be done at all,
> saving 100% writing time. This is the case for temporary
> files that gets written to, read, and deleted before they
> get written to disk. It all happens in cache, improving
> performance tremendously. To see the alternative,
> try booting with mem=4M or 16M or some such, with _no_ swapping.

I see that. However, I don't see why the kernel is writing out data
as aggressively as it does now. Delaying a write for 30 seconds isn't the
problem: the aggressive writes are. Since the disks are otherwise idle, the
kernel can gently start writing out the dirty cache. No need to try to
write 40 MB in 1 second when you can write it at 10 MB/s over 4 seconds.

[...]

> For more detailed information, read a book about how filesystems and
> disk caching works.

I'm just reporting what's happening to me in practice, I don't really care
about what should happen in theory.

--
Erik Hensema <[email protected]>

2003-03-28 10:07:48

by Tim Connors

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

In linux.kernel, you wrote:
> Helge Hafting ([email protected]) wrote:
>> Erik Hensema wrote:
>>> In all kernels I've tested writes to disk are delayed a long time even when
>>> there's no need to do so.
>>>
>> Short answer - it is supposed to do that!
>>
>>> A very simple test shows this: on an otherwise idle system, create a tar of
>>> a NFS-mounted filesystem to a local disk. The kernel starts writing out the
>>> data after 30 seconds, while a slow and steady stream would be much nicer
>>> to the system, I think.

Agreed. We have a cluster which is writing on average something like
20 MB/sec/node. We had to lower the write threshold from 30% to 0%,
because with the constant writing, Linux would buffer it for 30 secs,
fill up RAM, try to empty the write-cache, stall, wash, rinse,
repeat. Because RAM was being filled at roughly the rate it was
being emptied, once it got 30% behind there was no catching up, so
the realtime system would lose data. Ouch.

>> You're wrong then. There's no need for a slow steady stream, why do
>> you want that. Of course you can set up cron to run sync at
>> regular (short) intervals to achieve this.

Last time I checked, cron had 1 minute resolution.

> I see that. However, I don't see why the kernel is writing out data
> as agressively as it does now. Delaying a write for 30 seconds isn't the
> problem: the aggressive writes are. Since the disks are otherwise idle, the
> kernel can gently start writing out the dirty cache. No need to try and
> write 40 MB in 1 sec when you can write 10 MB/sec in 4 seconds.
>
> [...]
>
>> For more detailed information, read a book about how filesystems and
>> disk caching works.
>
> I'm just reporting what's happening to me in practice, I don't really care
> about what should happen in theory.

Exactly.

Helge's comment about /tmp files and rewriting files multiple times:
in real life, how often does this happen? How often do you overwrite
one file many times in 30 seconds? The occasional 20 kilobyte /tmp
file perhaps, but I doubt it matters in real life. In real life, when
writing to disk constantly (not just scientific applications - I
believe this happens in the real world too!), waiting for 30 seconds
is a liability!


--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/

White dwarf seeks red giant star

2003-03-30 17:24:12

by Helge Hafting

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

On Fri, Mar 28, 2003 at 09:18:59PM +1100, Tim Connors wrote:
> In linux.kernel, you wrote:
> > Helge Hafting ([email protected]) wrote:
> >> Erik Hensema wrote:
> >>> In all kernels I've tested writes to disk are delayed a long time even when
> >>> there's no need to do so.
> >>>
> >> Short answer - it is supposed to do that!
> >>
> >>> A very simple test shows this: on an otherwise idle system, create a tar of
> >>> a NFS-mounted filesystem to a local disk. The kernel starts writing out the
> >>> data after 30 seconds, while a slow and steady stream would be much nicer
> >>> to the system, I think.
>
> Agreed. We have a cluster which is writing on average something like
> 20 Megs/sec/node. We had to lower the write threshold from 30% to 0%,
> because with the constant writing, linux will buffer it for 30 secs,
> fill up RAM, try to empty the write-cache, stall, wash, rinse,
> repeat. Because it was being filled up at roughly the rate it was
> being emptied, once it got 30% behind, there was no catching up, so
> the realtime system would lose data. Ouch.
>
Nothing can help you if you're getting input at the same rate you
can write it. You need the ability to write at least somewhat faster,
no matter how the buffering is done.

> >> You're wrong then. There's no need for a slow steady stream, why do
> >> you want that. Of course you can set up cron to run sync at
> >> regular (short) intervals to achieve this.
>
> Last time I checked, cron had 1 minute resolution.

If you need to sync more often than once per minute, consider
this shellscript:
for ((;;)) ; do sleep 1 ; sync ; done

[...]
> Helge's comment about /tmp files and rewriting files multiple times:
> in real life, how often does this happen? How often do you overwrite
> one file many times in 30 seconds?

_Every_ time you write a file in chunks not perfectly aligned
on block boundaries.

> The occasional 20 kilobyte /tmp
> file perhaps, but I doubt it matters in real life. In real life, when
> writing to disk constantly (not just scientific applications - I
> believe this happens in the real world too!), waiting for 30 seconds
> is a liability!

Only if that 30-second wait makes linux buffer more than it can handle.
That may indeed be a problem, but buffering up _some_ data
before writing is still a good idea.

Helge Hafting

2003-03-30 19:31:43

by Pavel Machek

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Hi!

> In all kernels I've tested writes to disk are delayed a long time even when
> there's no need to do so.
>
> A very simple test shows this: on an otherwise idle system, create a tar of
> a NFS-mounted filesystem to a local disk. The kernel starts writing out the
> data after 30 seconds, while a slow and steady stream would be much nicer
> to the system, I think.
>

Well, doing writeback sooner when disks
are idle might be a good idea; detecting
whether a disk is idle might not be too easy, though.

OTOH, raid resync already has some
such detection?
Pavel
--
Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...

2003-03-31 11:49:38

by Erik Hensema

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Pavel Machek ([email protected]) wrote:
> Hi!
>
>> In all kernels I've tested writes to disk are delayed a long time even when
>> there's no need to do so.
>>
>> A very simple test shows this: on an otherwise idle system, create a tar of
>> a NFS-mounted filesystem to a local disk. The kernel starts writing out the
>> data after 30 seconds, while a slow and steady stream would be much nicer
>> to the system, I think.
>>
>
> Well, doing writeback sooner when disks
> are idle might be good idea; detecting
> if disk is idle might not be too easy, through.

Helge Hafting already pointed out that writing out the data earlier isn't
desirable. The problem isn't in the waiting: the problem is in the writing.
I think the current kernel tries to write too much data too fast when
there's absolutely no reason to do so. It should probably gently write out
small amounts of data until there is a more pressing need for memory.

--
Erik Hensema <[email protected]>

2003-03-31 13:28:09

by Helge Hafting

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Erik Hensema wrote:
[...]
> Helge Hafting already pointed out that writing out the data earlier isn't
> desirable. The problem isn't in the waiting: the problem is in the writing.
> I think the current kernel tries to write too much data too fast when
> there's absolutely no reason to do so. It should probably gently write out
> small amounts of data until there is a more pressing need for memory.
>
I don't think the problem is "writing a large chunk", rather that this
chunk is scheduled for writing a bit too late. Memory is filling up
and the process producing data is throttled while waiting for
the write to free up pages. Then the "huge chunk" of pages is released,
and memory is allowed to fill up for too long again.

Seems to me the correct solution is to start writing out
things long before memory gets so full that we need to
throttle the producer.

This will result in somewhat smaller chunks and a somewhat
steadier stream of data. It will work better, but not because
the chunks are smaller. (Block devices are supposed to handle
enormous chunks with no problems, and bandwidth utilization
is generally better the bigger the chunks you can get. 50M to
disk in one go isn't "pushing" anything - it is "nice".)
The reason an earlier start works better is that memory never fills up
to the point where the producer is throttled, assuming
the IO system can keep up with the producer forever.
Throttling will _always_ happen when that isn't the case.

The tricky part here is knowing the bandwidth of the output
device, and starting to write at such a time that memory
won't have time to fill up in the case where a
producer is almost as fast as the output device.

The problem is that this depends on several things:
1. How much more memory is there? (Varies a lot, but
the kernel knows this one.)
2. How fast is the output device? (Varies a lot: different
areas on a disk have different speeds. Different
disks have different speeds. The speed of NFS depends
on network speed, network congestion,
roundtrip time, server load, and server disk speed.
You probably cannot get good estimates for all cases,
particularly not NFS on a shared net.
To get this right we need both bandwidth and latency.)
3. How fast is data produced? A global estimate may
be possible, looking at how fast memory is dirtied.
I have no idea if such an estimate is possible per
block device.
4. The big problem is that there may be several unrelated
processes dirtying memory to be written to several
very different block devices.
For this to work automatically we need a low estimate
of the bandwidth for each block device/filesystem,
and the memory-dirtying rate for each.


This seems hard to solve automatically. The specific
case of a realtime program writing near disk speed is solvable
by having an extra thread that issues an fsync whenever the
amount of written but unsynced data gets near the point
where the time needed to write it is long enough
for memory to fill up at the current rate of data production.
Of course one wants a substantial safety margin here,
perhaps an assumption that only one third or so of memory
will actually be available for caching the important stuff.

A manual solution is possible if we can have two "knobs"
for this:
1. Threshold for when to start writing out stuff.
2. Threshold for when to throttle processes.

The latter may or may not be necessary; the point is that the former
should kick in long before throttling is necessary.

This is usually expressed as the percentage of memory that is dirty, but
I'm not sure that is the right thing. It assumes that 100% will be
available after cleaning, which may be way off.

Something like the percentage of memory that is still available (free,
or instantly freeable by reclaiming clean unpinned cache) might work better.

Helge Hafting




2003-03-31 14:33:46

by Oliver Neukum

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need


> A manual solution is possible if we can have two "knobs"
> for this:
> 1. Treshold for when to start writing out stuff
> 2. Treshold for when to throttle processes.
>
> The latter may or may not be necessary, the point is that the former
> should kick in long before throttling is necessary.
>
> This is usually expressed as how many % of memory that is dirty, but
> I'm not sure that is the right thing. It assumes that 100% will be
> available after cleaning, which may be way off.
>
> Something like % of memory that is still available (free,
> or instantly freeable by reclaiming clean unpinned cache)

Is there any sense in allowing a task to keep dirty a certain percentage
of free memory? If you have a task that has to be throttled anyway,
isn't any memory that this task keeps dirty wasted, beyond what's
needed to send efficient IO requests to the device? Somebody
else might have better uses for that memory.

Regards
Oliver

2003-03-31 21:54:51

by Nick Piggin

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Helge Hafting wrote:

> Erik Hensema wrote:
> [...]
>
>> Helge Hafting already pointed out that writing out the data earlier
>> isn't
>> desirable. The problem isn't in the waiting: the problem is in the
>> writing.
>> I think the current kernel tries to write too much data too fast when
>> there's absolutely no reason to do so. It should probably gently
>> write out
>> small amounts of data until there is a more pressing need for memory.
>>
> I don't think the problem is "writing a large chunk", rather that this
> chunk is scheduled for writing a bit too late. Memory is filling up
> and the process producing data us throttled while waiting for
> the write to free up pages. Then the "huge chunk" of pages is released,
> and memory is allowed to fill up for too long again.
>
> Seem to me the correct solution is to start writing out
> things long before memory gets so full that we need to
> throttle the producer.

I haven't thought about this much, but it seems to me that
doing writeout whenever the disk would otherwise be idle
(and we have dirty memory to write out) would be a good
solution.

2003-03-31 22:19:16

by Chris Friesen

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Nick Piggin wrote:

> I haven't thought about this much, but it seems to me that
> doing writeout whenever the disk would otherwise be idle
> (and we have dirty memory to write out) would be a good
> solution.

The whole argument for waiting, though, is that there may be another write
coming to the same place, in which case you save the cost of the first
write because it never has to hit the disk.

Writing to disk isn't free, even if the disk would otherwise be idle. You have
the cost of the setup as well as the memory and pci bus traffic. You may have
disk bandwidth available but be already maxing out the PCI bus, in which case
your "free" disk write takes I/O away from other things.

Ultimately it's all a tradeoff. Do you write now, or do you hold off and hope
that you can throw away some of the writes because new data will come along
and overwrite them?

Chris

--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2003-03-31 22:24:19

by Nick Piggin

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need



Chris Friesen wrote:

> Nick Piggin wrote:
>
>> I haven't thought about this much, but it seems to me that
>> doing writeout whenever the disk would otherwise be idle
>> (and we have dirty memory to write out) would be a good
>> solution.
>
>
> The whole argument about waiting though is that there may be another
> write coming to the same place, in which case you could save the cost
> of the first write because it didn't have to be written.
>
> Writing to disk isn't free, even if the disk would otherwise be idle.
> You have the cost of the setup as well as the memory and pci bus
> traffic. You may have disk bandwidth available but be already maxing
> out the PCI bus, in which case your "free" disk write takes I/O away
> from other things.

Only if the memory gets dirtied again; otherwise, the earlier the better.
If the memory does get written to again before the writeout timeout, then
yes, it has used some CPU, memory, PCI bandwidth, etc. that it didn't have to.

>
>
> Ultimately its all a tradeoff. Do you write now, or do you hold off
> and hope that you can throw away some of the writes because new stuff
> will home in to overwrite them?

Yes it is a tradeoff. Having an idle disk gives more weight to "write now".

2003-03-31 22:34:03

by Andrew Morton

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Nick Piggin <[email protected]> wrote:
>
> it seems to me that
> doing writeout whenever the disk would otherwise be idle
> (and we have dirty memory to write out) would be a good
> solution.

This is what the recently-removed BDI_read_active flag in backing_dev_info
was supposed to be for. I let it go because I don't think it's terribly
important and it's time to stop fiddling with the vfs writeout code and it
wasn't right anyway.

Note that 2.5 starts pdflush writeout at 10% of memory dirty. Or even lower
if there is a lot of mapped memory around. Whereas 2.4 will start background
writeout at 30% or 40% dirty. That's a fairly significant tuning change.

The algorithm for utilisation of an idle disk should be, in
balance_dirty_pages():

	if (ps.nr_dirty + ps.nr_writeback < background_thresh) {
		if (time_after(jiffies, bdi->last_read + HZ/100)) {
			if (bdi->write_requests_in_flight < 2) {
				struct writeback_control wbc = {
					.bdi = bdi,
					.sync_mode = WB_SYNC_NONE,
					.nr_to_write = write_chunk,
				};

				writeback_inodes(&wbc);
			}
		}
		return;
	}


Or something like that. It's pretty close.

It could have pretty bad failure modes. Short-lived files in /tmp now
perform writeout, which needs to be waited on when those files are removed.

2003-03-31 22:40:50

by John Bradford

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

> If the memory does get written to again before the writeout timeout
> then yeah its used some cpu, memory, pci, etc that it didn't have
> to.

It will presumably also have filled the cache with the writeout data.

> > Ultimately its all a tradeoff. Do you write now, or do you hold off
> > and hope that you can throw away some of the writes because new stuff
> > will home in to overwrite them?
>
> Yes it is a tradeoff. Having an idle disk gives more weight to "write now".

Not necessarily. What if you are using a solid state disk which only
allows a relatively low number of re-write cycles? What if the disk
is spun down, and spinning it up uses a lot of power? On a laptop,
you don't necessarily want the disk spinning up just to write one
sector.

John.

2003-03-31 22:47:49

by Nick Piggin

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

John Bradford wrote:

>>If the memory does get written to again before the writeout timeout
>>then yeah its used some cpu, memory, pci, etc that it didn't have
>>to.
>>
>
>It will presumably also have filled the cache with the writeout data.
>
What cache?

>
>
>>>Ultimately its all a tradeoff. Do you write now, or do you hold off
>>>and hope that you can throw away some of the writes because new stuff
>>>will home in to overwrite them?
>>>
>>Yes it is a tradeoff. Having an idle disk gives more weight to "write now".
>>
>
>Not necessarily. What if you are using a solid state disk which only
>allows a relatively low number of re-write cycles? What if the disk
>is spun down, and spinning it up uses a lot of power? On a laptop,
>you don't necessarily want the disk spinning up just to write one
>sector.
>
Yes it does. The factors you mention just add (a lot) more
weight to "hold off".

2003-03-31 22:52:28

by Nick Piggin

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need



Andrew Morton wrote:

>Nick Piggin <[email protected]> wrote:
>
>
>>it seems to me that
>>doing writeout whenever the disk would otherwise be idle
>>(and we have dirty memory to write out) would be a good
>>solution.
>>
>>
>
>This is what the recently-removed BDI_read_active flag in backing_dev_info
>was supposed to be for. I let it go because I don't think it's terribly
>important and it's time to stop fiddling with the vfs writeout code and it
>wasn't right anyway.
>
>Note that 2.5 starts pdflush writeout at 10% of memory dirty. Or even lower
>if there is a lot of mapped memory around. Whereas 2.4 will start background
>writeout at 30% or 40% dirty. That's a fairly significant tuning change.
>
>The algorithm for utilisation of an idle disk should be, in
>balance_dirty_pages():
>
>	if (ps.nr_dirty + ps.nr_writeback < background_thresh) {
>		if (time_after(jiffies, bdi->last_read + HZ/100)) {
>			if (bdi->write_requests_in_flight < 2) {
>				struct writeback_control wbc = {
>					.bdi = bdi,
>					.sync_mode = WB_SYNC_NONE,
>					.nr_to_write = write_chunk,
>				};
>
>				writeback_inodes(&wbc);
>			}
>		}
>		return;
>	}
>
>
>Or something like that. It's pretty close.
>
Yeah something like that looks alright.

>
>It could have pretty bad failure modes. Short-lived files in /tmp now
>perform writeout, which needs to be waited on when those files are removed.
>
>
I didn't think of that.

2003-03-31 23:29:50

by Ingo Oeser

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

On Mon, Mar 31, 2003 at 02:45:00PM -0800, Andrew Morton wrote:
> It could have pretty bad failure modes. Short-lived files in /tmp now
> perform writeout, which needs to be waited on when those files are removed.

/tmp is not a problem, because this can be fixed by using tmpfs
(I use 2GB of it with 1GB of RAM).

What is bad is the small writes the proposed behavior generates.

The disk is idle, so this is not about performance but power
consumption. Spinning up a disk costs around 1-2 seconds, so for a
spun-down disk you should batch up at least as much data as you can
write in 1-2 seconds.

Regards

Ingo Oeser
--
Marketing is the art of selling people things they don't need, with
money they don't have, to impress people they don't like.

2003-03-31 23:51:14

by Andrew Morton

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Ingo Oeser <[email protected]> wrote:
>
> On Mon, Mar 31, 2003 at 02:45:00PM -0800, Andrew Morton wrote:
> > It could have pretty bad failure modes. Short-lived files in /tmp now
> > perform writeout, which needs to be waited on when those files are removed.
>
> /tmp is not a problem, because this can be fixed by using tmpfs
> (I use 2GB of it with 1GB of RAM).

I don't. These files get unlinked before they hit disk.

> The disk is idle, so this is not about performance, but power
> consumption. Spinning up a disk costs around 1-2 seconds, so you
> should come in with at least the amount of data you write in 1-2
> seconds for a spun down disk.

The requirements for portable computers are totally different. You'd turn
the whole thing off for them.

2003-04-01 00:32:01

by Daniel Pittman

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

On Mon, 31 Mar 2003, Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>>
>> it seems to me that
>> doing writeout whenever the disk would otherwise be idle
>> (and we have dirty memory to write out) would be a good
>> solution.
>
> This is what the recently-removed BDI_read_active flag in
> backing_dev_info was supposed to be for. I let it go because I don't
> think it's terribly important and it's time to stop fiddling with the
> vfs writeout code and it wasn't right anyway.
>
> Note that 2.5 starts pdflush writeout at 10% of memory dirty. Or even
> lower if there is a lot of mapped memory around. Whereas 2.4 will
> start background writeout at 30% or 40% dirty. That's a fairly
> significant tuning change.

I don't figure it's a very important thing, but even this change doesn't
resolve one of the issues I have with the default writeout scheduler.

Capturing a real-time video stream from an IEEE 1394 DV stream means
writing a steady 3.5 MB per second for two to two and a half hours.

Linux isn't great at this using the default writeout policy, even as
recent as 2.5.64. The writer goes OK for a while but eventually blocks
on writeout for long enough to drop a frame -- more than 8/25ths of a
second.


This can be resolved by tuning the default delay before write-out starts
down to 5 seconds from 30, or by running sync every second, or by doing
fsync tricks.


I think it's a good thing that you can delay writes for a long time, in
general, but there are cases where blocking *really* sucks and on a
system that does nothing else but produce 3.5MB per second of dirty
memory and write that to disk...

Well, something that allowed only that data stream to be preemptively
written out would be good without the need for the thread-and-fsync
trick.

Daniel

--
Anyone who stops learning is old, whether at twenty or eighty. Anyone who keeps
learning stays young. The greatest thing in life is to keep your mind young.
-- Henry Ford

2003-04-01 00:58:31

by Andrew Morton

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Daniel Pittman <[email protected]> wrote:
>
> Capturing a real-time video stream from an IEEE1394 DV stream means
> writing a steady 3.5MB per second for two to two and a half hours.
>
> Linux isn't great at this, using the default writeout policy, even as
> recent as 2.5.64. The writer goes OK for a while but, eventually, blocks
> on writeout for long enough to drop a frame -- more than 8/25ths of a
> second.
>
>
> This can be resolved by tuning the default delay before write-out starts
> to 5 seconds, down from 30, or by running sync every second, or by doing
> fsync tricks.

Interesting.

Yes, I expect that you could fix that up by altering dirty_background_ratio
and dirty_expire_centisecs.

The problem with fsync() is that it waits on the writeout. You don't want
that to happen - you just want to tell the kernel "I won't be overwriting or
deleting this data". Make the kernel queue up and start the IO but not wait
on its completion.

It is quite appropriate to do this in fadvise(FADV_DONTNEED) - as a
lower-latency fsync(). The app would need to call it once per second or so.

It would also throw away any written-back pagecache inside your (start, len)
which is exactly what your application wants to happen, so the app should be
calling fadvise _anyway_.

What do you think?


 25-akpm/include/linux/fs.h |    1 +
 25-akpm/mm/fadvise.c       |    1 +
 25-akpm/mm/filemap.c       |   18 ++++++++++++++++--
 3 files changed, 18 insertions(+), 2 deletions(-)

diff -puN include/linux/fs.h~fadvise-flush-data include/linux/fs.h
--- 25/include/linux/fs.h~fadvise-flush-data Mon Mar 31 17:03:39 2003
+++ 25-akpm/include/linux/fs.h Mon Mar 31 17:03:39 2003
@@ -1112,6 +1112,7 @@ unsigned long invalidate_inode_pages(str
extern void invalidate_inode_pages2(struct address_space *mapping);
extern void write_inode_now(struct inode *, int);
extern int filemap_fdatawrite(struct address_space *);
+extern int filemap_flush(struct address_space *);
extern int filemap_fdatawait(struct address_space *);
extern void sync_supers(void);
extern void sync_filesystems(int wait);
diff -puN mm/fadvise.c~fadvise-flush-data mm/fadvise.c
--- 25/mm/fadvise.c~fadvise-flush-data Mon Mar 31 17:03:39 2003
+++ 25-akpm/mm/fadvise.c Mon Mar 31 17:03:39 2003
@@ -61,6 +61,7 @@ long sys_fadvise64(int fd, loff_t offset
ret = 0;
break;
case POSIX_FADV_DONTNEED:
+ filemap_flush(mapping);
invalidate_mapping_pages(mapping, offset >> PAGE_CACHE_SHIFT,
(len >> PAGE_CACHE_SHIFT) + 1);
break;
diff -puN mm/filemap.c~fadvise-flush-data mm/filemap.c
--- 25/mm/filemap.c~fadvise-flush-data Mon Mar 31 17:03:39 2003
+++ 25-akpm/mm/filemap.c Mon Mar 31 17:03:39 2003
@@ -122,11 +122,11 @@ static inline int sync_page(struct page
* if a dirty page/buffer is encountered, it must be waited upon, and not just
* skipped over.
*/
-int filemap_fdatawrite(struct address_space *mapping)
+static int __filemap_fdatawrite(struct address_space *mapping, int sync_mode)
{
int ret;
struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
+ .sync_mode = sync_mode,
.nr_to_write = mapping->nrpages * 2,
};

@@ -140,6 +140,20 @@ int filemap_fdatawrite(struct address_sp
return ret;
}

+int filemap_fdatawrite(struct address_space *mapping)
+{
+ return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
+}
+
+/*
+ * This is a mostly non-blocking flush. Not suitable for data-integrity
+ * purposes.
+ */
+int filemap_flush(struct address_space *mapping)
+{
+ return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
+}
+
/**
* filemap_fdatawait - walk the list of locked pages of the given address
* space and wait for all of them.

_

2003-04-01 01:23:28

by Daniel Pittman

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

On Mon, 31 Mar 2003, Andrew Morton wrote:
> Daniel Pittman <[email protected]> wrote:
>>
>> Capturing a real-time video stream from an IEEE1394 DV stream means
>> writing a steady 3.5MB per second for two to two and a half hours.
>>
>> Linux isn't great at this, using the default writeout policy, even as
>> recent as 2.5.64. The writer goes OK for a while but, eventually,
>> blocks on writeout for long enough to drop a frame -- more than
>> 8/25ths of a second.
>>
>>
>> This can be resolved by tuning the default delay before write-out
>> starts to 5 seconds, down from 30, or by running sync every second, or
>> by doing fsync tricks.
>
> Interesting.
>
> Yes, I expect that you could fix that up by altering
> dirty_background_ratio and dirty_expire_centisecs.

Those are, in fact, the precise knobs I turned. Well, those and the XFS
pagebuf layer equivalents.
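A sketch of that tuning, assuming the /proc/sys/vm knobs as they appear in 2.5 kernels; the values are illustrative, not recommendations, and writing them needs root:

```shell
# Start background writeout earlier (percent of memory dirty) and expire
# dirty data after 5 seconds instead of the default 30.
echo 5   > /proc/sys/vm/dirty_background_ratio
echo 500 > /proc/sys/vm/dirty_expire_centisecs
```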

> The problem with fsync() is that it waits on the writeout. You don't
> want that to happen - you just want to tell the kernel "I won't be
> overwriting or deleting this data". Make the kernel queue up and start
> the IO but not wait on its completion.

Yes, that would be good, because then I wouldn't need to write an IPC
thing and fork or thread, so that the second thread can be busy blocking
on the writeout for me.

> It is quite appropriate to do this in fadvise(FADV_DONTNEED) - as a
> lower-latency fsync(). The app would need to call it once per second
> or so.
>
> It would also throw away any written-back pagecache inside your
> (start, len) which is exactly what your application wants to happen,
> so the app should be calling fadvise _anyway_.
>
> What do you think?

I will apply the patch and test later today. This, however, looks like
a *really* good thing to me.

Daniel

--
there's a party going on
we'll all be here dancing underground
there's a riot going on
we'll all be here dancing underground
-- Covenant, _Riot_

2003-04-01 01:28:22

by Andrew Morton

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Nick Piggin <[email protected]> wrote:
>
> Would the writeout on disk idle solve this without using the fadvise?

Yes, it would schedule the I/O in the desired manner. But it would do that
for _all_ files, not just the desired one.

And that app needs to be changed to use fadvise anyway, to take down the
useless pagecache.

> How often is balance_dirty_pages called? Enough to keep an otherwise
> idle disk busy?

Approximately once per 1000 dirtied pages per cpu. Say 4 megs. A nice
chunk.

> Would it be possible to fix the /tmp files case? Could you cancel IO
> to a file that gets deleted?

Well... One could place a hint in the inode somewhere.
balance_dirty_pages() is given the inode, so it could notice that this is an
"early sync" inode and flush it every 4 megabytes.

That would be appropriate for O_STREAMING, which is a better and more
efficient interface than fadvise.

But really, the right thing to do here is to modify the app to use fadvise.

> On a similar note, would it be useful and not difficult to do
> speculative LIFO swapin in case of lots of free memory and an idle
> disk? Probably too hard. I guess lowering swapiness would help.

Might do. I'm not sure how though.

Another possibility would be to perform speculative writes. Write some
random data to a disk block and later see if that was the data which the
application actually wanted to write. If so, we can optimise away the later
I/O.

2003-04-01 01:34:45

by Andrew Morton

[permalink] [raw]
Subject: Re: Delaying writes to disk when there's no need

Daniel Pittman <[email protected]> wrote:
>
> > What do you think?
>
> I will apply the patch and test later today. This, however, looks like
> a *really* good thing to me.

I think so too.

A possible enhancement is to only do the flush if the backing queue is not
write-congested. So the syscall will very probably be async. If the queue
is write congested then it's a sure bet that someone else is saturating the
disk anyway. We don't want to block in that case.


 25-akpm/include/linux/fs.h |    1 +
 25-akpm/mm/fadvise.c       |    2 ++
 25-akpm/mm/filemap.c       |   18 ++++++++++++++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

diff -puN include/linux/fs.h~fadvise-flush-data include/linux/fs.h
--- 25/include/linux/fs.h~fadvise-flush-data Mon Mar 31 17:03:39 2003
+++ 25-akpm/include/linux/fs.h Mon Mar 31 17:43:45 2003
@@ -1112,6 +1112,7 @@ unsigned long invalidate_inode_pages(str
extern void invalidate_inode_pages2(struct address_space *mapping);
extern void write_inode_now(struct inode *, int);
extern int filemap_fdatawrite(struct address_space *);
+extern int filemap_flush(struct address_space *);
extern int filemap_fdatawait(struct address_space *);
extern void sync_supers(void);
extern void sync_filesystems(int wait);
diff -puN mm/fadvise.c~fadvise-flush-data mm/fadvise.c
--- 25/mm/fadvise.c~fadvise-flush-data Mon Mar 31 17:03:39 2003
+++ 25-akpm/mm/fadvise.c Mon Mar 31 17:44:49 2003
@@ -61,6 +61,8 @@ long sys_fadvise64(int fd, loff_t offset
ret = 0;
break;
case POSIX_FADV_DONTNEED:
+ if (!bdi_write_congested(mapping->backing_dev_info))
+ filemap_flush(mapping);
invalidate_mapping_pages(mapping, offset >> PAGE_CACHE_SHIFT,
(len >> PAGE_CACHE_SHIFT) + 1);
break;
diff -puN mm/filemap.c~fadvise-flush-data mm/filemap.c
--- 25/mm/filemap.c~fadvise-flush-data Mon Mar 31 17:03:39 2003
+++ 25-akpm/mm/filemap.c Mon Mar 31 17:03:39 2003
@@ -122,11 +122,11 @@ static inline int sync_page(struct page
* if a dirty page/buffer is encountered, it must be waited upon, and not just
* skipped over.
*/
-int filemap_fdatawrite(struct address_space *mapping)
+static int __filemap_fdatawrite(struct address_space *mapping, int sync_mode)
{
int ret;
struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
+ .sync_mode = sync_mode,
.nr_to_write = mapping->nrpages * 2,
};

@@ -140,6 +140,20 @@ int filemap_fdatawrite(struct address_sp
return ret;
}

+int filemap_fdatawrite(struct address_space *mapping)
+{
+ return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
+}
+
+/*
+ * This is a mostly non-blocking flush. Not suitable for data-integrity
+ * purposes.
+ */
+int filemap_flush(struct address_space *mapping)
+{
+ return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
+}
+
/**
* filemap_fdatawait - walk the list of locked pages of the given address
* space and wait for all of them.

_