2007-06-18 23:08:25

by Tim Chen

Subject: Change in default vm_dirty_ratio

Andrew,

The default vm_dirty_ratio changed from 40 to 10
for the 2.6.22-rc kernels in this patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=07db59bd6b0f279c31044cba6787344f63be87ea;hp=de46c33745f5e2ad594c72f2cf5f490861b16ce1

IOZone write drops by about 60% when test file size is 50 percent of
memory. Rand-write drops by 90%.

Is there a good reason for turning down the default dirty ratio?
How will it help for most cases? Intuitively, it seems like
a less aggressive writeback will have better performance.

Thanks.

Tim


2007-06-18 23:47:28

by Andrew Morton

Subject: Re: Change in default vm_dirty_ratio

On Mon, 18 Jun 2007 14:14:30 -0700
Tim Chen <[email protected]> wrote:

> Andrew,
>
> The default vm_dirty_ratio changed from 40 to 10
> for the 2.6.22-rc kernels in this patch:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=07db59bd6b0f279c31044cba6787344f63be87ea;hp=de46c33745f5e2ad594c72f2cf5f490861b16ce1
>
> IOZone write drops by about 60% when test file size is 50 percent of
> memory. Rand-write drops by 90%.

heh.

(Or is that an inappropriate reaction?)

> Is there a good reason for turning down the default dirty ratio?

It seems too large. Memory sizes are going up faster than disk throughput
and it seems wrong to keep vast amounts of dirty data floating about in
memory like this. It can cause long stalls while the system writes back
huge amounts of data and is generally ill-behaved.

> How will it help for most cases? Intuitively, it seems like
> a less aggressive writeback will have better performance.

I assume that iozone is either doing a lot of file overwrites or is
unlinking/truncating files shortly after having written them.

And some benchmarks are silly. You have just demonstrated that IOZone
should have been called RAMZone....

Some workloads will work more nicely with this change and others will be
hurt. Where does the optimum lie? Don't know. Nowhere, really.



Frankly, I find it very depressing that the kernel defaults matter. These
things are trivially tunable and you'd think that after all these years,
distro initscripts would be establishing the settings, based upon expected
workload, amount of memory, number and bandwidth of attached devices, etc.

Heck, there should even be userspace daemons which observe ongoing system
behaviour and which adaptively tune these things to the most appropriate
level.

But nope, nothing.

2007-06-19 00:06:48

by Linus Torvalds

Subject: Re: Change in default vm_dirty_ratio



On Mon, 18 Jun 2007, Andrew Morton wrote:

> On Mon, 18 Jun 2007 14:14:30 -0700
> Tim Chen <[email protected]> wrote:
>
> > Andrew,
> >
> > The default vm_dirty_ratio changed from 40 to 10
> > for the 2.6.22-rc kernels in this patch:

Yup.

> > IOZone write drops by about 60% when test file size is 50 percent of
> > memory. Rand-write drops by 90%.
>
> heh.
>
> (Or is that an inappropriate reaction?)

I think it's probably appropriate.

I don't know what else to say.

For pure write testing, where writeback caching is good, you should
probably run all benchmarks with vm_dirty_ratio set as high as possible.
That's fairly obvious.

What's equally obvious is that for actual real-life use, such tuning is
not a good idea, and setting the vm_dirty_ratio down causes a more
pleasant user experience, thanks to smoother IO load behaviour.

Is it good to keep tons of dirty stuff around? Sure. It allows overwriting
(and thus avoiding doing the write in the first place), but it also allows
for a more aggressive IO scheduling, in that you have more writes that you
can schedule.

It does sound like IOZone just isn't a good benchmark. It doesn't actually
measure disk throughput, it really measures how good the OS is at *not*
doing the IO. And yes, in that case, set vm_dirty_ratio high to get better
numbers.

I'd rather have the defaults at something that is "pleasant", and then
make it easy for benchmarkers to put it at something "unpleasant, but
gives better numbers". And it's not like it's all that hard to just do

echo 50 > /proc/sys/vm/dirty_ratio

in your /etc/rc.local or something, if you know you want this.

Maybe somebody can make a small graphical config app, and the distros
could even ship it? Dunno. I *suspect* very few people actually end up
caring.

Linus

2007-06-19 00:12:59

by Arjan van de Ven

Subject: Re: Change in default vm_dirty_ratio


> Is it good to keep tons of dirty stuff around? Sure. It allows overwriting
> (and thus avoiding doing the write in the first place), but it also allows
> for a more aggressive IO scheduling, in that you have more writes that you
> can schedule.


it also allows for an elevator that can merge more so that there are
less seeks...


so it's not all purely artificial ;(

I really don't like doing just-for-benchmark tuning ... but I wonder how
much this will hit real workloads too (like installing or upgrading a
bunch of rpms)

As for the smoother IO thing.. there's already a kernel process that
writes this lot out after 5 seconds... so that ought to smooth some of
this out already.... I would hope.

(I'm not arguing this change is wrong, I'm just grinding my teeth on how
long updating rpms already takes... for no apparent reason)

2007-06-19 18:41:58

by John Stoffel

Subject: Re: Change in default vm_dirty_ratio

>>>>> "Andrew" == Andrew Morton <[email protected]> writes:

Andrew> On Mon, 18 Jun 2007 14:14:30 -0700
Andrew> Tim Chen <[email protected]> wrote:

>> IOZone write drops by about 60% when test file size is 50 percent of
>> memory. Rand-write drops by 90%.

Andrew> heh.

Andrew> (Or is that an inappropriate reaction?)

>> Is there a good reason for turning down the default dirty ratio?

Andrew> It seems too large. Memory sizes are going up faster than
Andrew> disk throughput and it seems wrong to keep vast amounts of
Andrew> dirty data floating about in memory like this. It can cause
Andrew> long stalls while the system writes back huge amounts of data
Andrew> and is generally ill-behaved.

Shouldn't the vm_dirty_ratio be based on the speed of the device, and
not the size of memory? So slower devices can't keep as much in
memory as fast devices? That would seem to be a better metric. And
of course those with hundreds of disks will then complain we're taking
too much memory as well, even though they can handle it.

So per-device vm_dirty_ratio, capped with a vm_dirty_total_ratio seems
to be what we want, right?
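
As a very rough, purely illustrative sketch of that arithmetic (user-space C;
the bandwidth figures, the 2GB of RAM and the 30-second backlog target are
made-up numbers, and none of this is an existing kernel interface):

#include <stdio.h>

struct dev {
    const char *name;
    unsigned long write_bw_mb;   /* assumed sustained write bandwidth, MB/s */
};

int main(void)
{
    struct dev devs[] = {
        { "fast-raid", 200 },
        { "sata-disk",  60 },
        { "usb-stick",   5 },
    };
    unsigned long mem_mb      = 2048;          /* total RAM, MB */
    unsigned long total_limit = mem_mb / 10;   /* global cap: 10% of memory */
    unsigned long target_secs = 30;            /* tolerable writeback backlog */
    unsigned long sum = 0;

    for (unsigned long i = 0; i < sizeof(devs) / sizeof(devs[0]); i++) {
        /* allow each device as much dirty data as it can retire in target_secs */
        unsigned long limit = devs[i].write_bw_mb * target_secs;

        if (limit > total_limit)   /* no single device may exceed the global cap */
            limit = total_limit;
        sum += limit;
        printf("%-10s per-device dirty limit: %4lu MB\n", devs[i].name, limit);
    }
    if (sum > total_limit)
        printf("limits sum to %lu MB > global cap of %lu MB, so scale them down\n",
               sum, total_limit);
    return 0;
}

The last step is where the "hundreds of disks" complaint above comes in: with
enough devices the per-device limits sum to more than the global cap and have
to be scaled back.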

John

2007-06-19 19:01:54

by Andi Kleen

Subject: Re: Change in default vm_dirty_ratio

Andrew Morton <[email protected]> writes:
>
> It seems too large. Memory sizes are going up faster than disk throughput
> and it seems wrong to keep vast amounts of dirty data floating about in
> memory like this. It can cause long stalls while the system writes back
> huge amounts of data and is generally ill-behaved.

A more continuous write out would be better I think. Perhaps the dirty ratio
needs to be per address space?

> things are trivially tunable and you'd think that after all these years,
> distro initscripts would be establishing the settings, based upon expected
> workload, amount of memory, number and bandwidth of attached devices, etc.

Distro initscripts normally don't have any better clue about any of this
than the kernel.

-Andi

2007-06-19 19:06:17

by Linus Torvalds

Subject: Re: Change in default vm_dirty_ratio



On Tue, 19 Jun 2007, John Stoffel wrote:
>
> Shouldn't the vm_dirty_ratio be based on the speed of the device, and
> not the size of memory?

Yes. It should depend on:
- speed of the device(s) in question
- seekiness of the workload
- wishes of the user as per the latency of other operations.

However, nobody has ever found the required algorithm.

So "at most 10% of memory dirty" is a simple (and fairly _good_)
heuristic. Nobody has actually ever ended up complaining about the change
from 40% -> 10%, and as far as I know this was the first report (and it's
not so much because the change was bad, but because it showed up on a
benchmark - and I don't think that actually says anything about anything
else than the behaviour of the benchmark itself)

So are there better algorithms in theory? Probably lots of them.

Linus

2007-06-19 19:07:39

by Linus Torvalds

Subject: Re: Change in default vm_dirty_ratio



On Tue, 19 Jun 2007, Linus Torvalds wrote:
>
> Yes. It should depend on:
> - speed of the device(s) in question

Btw, this one can be quite a big deal. Try connecting an iPod and syncing
8GB of data to it. Oops.

So yes, it would be nice to have some per-device logic too. Tested patches
would be very welcome ;)

Linus

2007-06-19 22:33:11

by David Miller

Subject: Re: Change in default vm_dirty_ratio

From: Linus Torvalds <[email protected]>
Date: Tue, 19 Jun 2007 12:04:33 -0700 (PDT)

>
>
> On Tue, 19 Jun 2007, John Stoffel wrote:
> >
> > Shouldn't the vm_dirty_ratio be based on the speed of the device, and
> > not the size of memory?
>
> Yes. It should depend on:
> - speed of the device(s) in question
> - seekiness of the workload
> - wishes of the user as per the latency of other operations.
>
> However, nobody has ever found the required algorithm.
>
> So "at most 10% of memory dirty" is a simple (and fairly _good_)
> heuristic. Nobody has actually ever ended up complaining about the change
> from 40% -> 10%, and as far as I know this was the first report (and it's
> not so much because the change was bad, but because it showed up on a
> benchmark - and I don't think that actually says anything about anything
> else than the behaviour of the benchmark itself)

I complained very early on that it makes my workstation basically just
hang doing disk writes when I do big git tree operations on my sparc64
workstation.

Restoring the old values makes that go away.

It's an arbitrary number because the correct setting is dependent
upon what you're doing and how fast your disks are.

So I really think the "only benchmarks like the old settings"
argument doesn't really apply.

2007-06-20 04:24:49

by Dave Jones

Subject: Re: Change in default vm_dirty_ratio

On Mon, Jun 18, 2007 at 04:47:11PM -0700, Andrew Morton wrote:

> Frankly, I find it very depressing that the kernel defaults matter. These
> things are trivially tunable and you'd think that after all these years,
> distro initscripts would be establishing the settings, based upon expected
> workload, amount of memory, number and bandwidth of attached devices, etc.

"This is hard, lets make it someone else's problem" shouldn't ever be the
answer, especially if the end result is that we become even more
dependant on bits of userspace running before the system becomes useful.

> Heck, there should even be userspace daemons which observe ongoing system
> behaviour and which adaptively tune these things to the most appropriate
> level.
>
> But nope, nothing.

See the 'libtune' crack that people have been trying to get distros to
adopt for a long time.
If we need some form of adaptive behaviour, the kernel needs to be
doing this monitoring/adapting, not some userspace daemon that may
not get scheduled before it's too late.

If the kernel can't get the defaults right, what makes you think
userspace can do better ? Just as the kernel can't get
"one size fits all" right, there's no silver bullet just by clicking
"this is a database server" button to have it configure random
sysctls etc. These things require thought and planning that
daemons will never get right in every case. And when they get
it wrong, the results can be worse than the stock defaults.

libtune is the latest in a series of attempts to do this dynamic
runtime adjustment (hell, I even started such a project myself
back circa 2000 which thankfully never really took off).
It's a bad idea that just won't die.

Dave

--
http://www.codemonkey.org.uk

2007-06-20 04:45:05

by Andrew Morton

Subject: Re: Change in default vm_dirty_ratio

On Wed, 20 Jun 2007 00:24:34 -0400 Dave Jones <[email protected]> wrote:

> On Mon, Jun 18, 2007 at 04:47:11PM -0700, Andrew Morton wrote:
>
> > Frankly, I find it very depressing that the kernel defaults matter. These
> > things are trivially tunable and you'd think that after all these years,
> > distro initscripts would be establishing the settings, based upon expected
> > workload, amount of memory, number and bandwidth of attached devices, etc.
>
> "This is hard, lets make it someone else's problem" shouldn't ever be the
> answer,

Bovine droppings. Nobody has even tried.

> especially if the end result is that we become even more
> dependent on bits of userspace running before the system becomes useful.

Cattle excreta. The kernel remains as it presently is. No less useful than it is
now.

> > Heck, there should even be userspace daemons which observe ongoing system
> > behaviour and which adaptively tune these things to the most appropriate
> > level.
> >
> > But nope, nothing.
>
> See the 'libtune' crack that people have been trying to get distros to
> adopt for a long time.
> If we need some form of adaptive behaviour, the kernel needs to be
> doing this monitoring/adapting, not some userspace daemon that may
> not get scheduled before it's too late.

Userspace has just as much info as the kernel has and there is no latency
concern here.

> If the kernel can't get the defaults right, what makes you think
> userspace can do better ?

Because userspace can implement more sophisticated algorithms and is more
easily configured.

For example, userspace can take a hotplug event for the just-added
usb-storage device then go look up its IO characteristics in a database
and then apply that to the configured policy. If the device was not found,
userspace can perform a test run to empirically measure that device's IO
characteristics and then record them in the database. I don't think we'll
be doing this in-kernel any time soon.

(And to preempt lkml-games: this is just an _example_. There are
others)

> Just as the kernel can't get
> "one size fits all" right, there's no silver bullet just by clicking
> "this is a database server" button to have it configure random
> sysctls etc. These things require thought and planning that
> daemons will never get right in every case. And when they get
> it wrong, the results can be worse than the stock defaults.
>
> libtune is the latest in a series of attempts to do this dynamic
> runtime adjustment (hell, I even started such a project myself
> back circa 2000 which thankfully never really took off).
> It's a bad idea that just won't die.
>

So libtune is the only possible way of implementing any of this?


If choosing the optimum settings cannot be done in userspace then it sure
as heck cannot be done in-kernel.


Anyway, this is all arse-about. What is the design? What algorithms
do we need to implement to do this successfully? Answer me that, then
we can decide upon these implementation details.

2007-06-20 08:35:51

by Peter Zijlstra

Subject: Re: Change in default vm_dirty_ratio

On Tue, 2007-06-19 at 21:44 -0700, Andrew Morton wrote:

> Anyway, this is all arse-about. What is the design? What algorithms
> do we need to implement to do this successfully? Answer me that, then
> we can decide upon these implementation details.

Building on the per BDI patches, how about integrating feedback from the
full-ness of device queues. That is, when we are happily doing IO and we
cannot possibly saturate the active devices (as measured by their queue
never reaching 75%?) then we can safely increase the total dirty limit.

OTOH, when even with the per BDI dirty limit the device queue is
constantly saturated (contended) we ought to lower the total dirty
limit.

Lots of detail here to work out, but does this sound workable?
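
Very roughly, and just as a toy model of that feedback loop (the 75%
threshold, the one-step adjustments and the 5-40% bounds below are invented,
and this is not taken from the actual per BDI patches):

#include <stdio.h>

static unsigned int dirty_ratio = 10;   /* current total limit, % of memory */

/* Called for each sample of device queue occupancy. */
static void adjust_dirty_limit(unsigned int queue_used, unsigned int queue_size)
{
    unsigned int pct = queue_used * 100 / queue_size;

    if (pct < 75 && dirty_ratio < 40)
        dirty_ratio++;      /* devices keep up easily: allow more dirty memory */
    else if (pct >= 100 && dirty_ratio > 5)
        dirty_ratio--;      /* queue saturated: hold less dirty memory */
}

int main(void)
{
    /* pretend occupancy samples of a 128-request queue, one per second */
    unsigned int samples[] = { 10, 40, 60, 128, 128, 128, 90, 30 };

    for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        adjust_dirty_limit(samples[i], 128);
        printf("queue %3u/128 -> dirty_ratio %u%%\n", samples[i], dirty_ratio);
    }
    return 0;
}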

2007-06-20 08:59:17

by Andrew Morton

Subject: Re: Change in default vm_dirty_ratio

> On Wed, 20 Jun 2007 10:35:36 +0200 Peter Zijlstra <[email protected]> wrote:
> On Tue, 2007-06-19 at 21:44 -0700, Andrew Morton wrote:
>
> > Anyway, this is all arse-about. What is the design? What algorithms
> > do we need to implement to do this successfully? Answer me that, then
> > we can decide upon these implementation details.
>
> Building on the per BDI patches, how about integrating feedback from the
> full-ness of device queues. That is, when we are happily doing IO and we
> cannot possibly saturate the active devices (as measured by their queue
> never reaching 75%?) then we can safely increase the total dirty limit.
>
> OTOH, when even with the per BDI dirty limit the device queue is
> constantly saturated (contended) we ought to lower the total dirty
> limit.
>
> Lots of detail here to work out, but does this sound workable?

It's pretty easy to fill the queues - I'd expect that there are a lot of
not-very-heavy workloads which cause the kernel to shove a lot of little
writes into the queue when it visits the blockdev mapping: a shower of
inodes, directory entries, indirect blocks, etc. With very little dirty
memory associated with it.

But back away further.

What do we actually want the kernel to *do*? Stated in terms of "when the
dirty memory state is A, do B" and "when userspace does C, the kernel should
do D".

Top-level statement: "when userspace does anything, the kernel should not
suck" ;) Some refinement is needed there.

I _think_ the problem is basically one of latency: a) writes starving reads
and b) dirty memory causing page reclaim to stall and c) inter-device
contention on the global memory limits.

Hard. If the device isn't doing anything else then we can shove data at it
freely. If reads (or synchronous writes) come in then perhaps the VM
should back off and permit dirty memory to go higher.

The anticipatory scheduler(s) are supposed to fix this.

Perhaps our queues are too long - if the VFS _does_ back off, it'll take
some time for that to have an effect.

Perhaps the fact that the queue size knows nothing about the _size_ of the
requests in the queue is a problem.


Back away even further here.

What user-visible problem(s) are we attempting to fix?

2007-06-20 09:14:51

by Jens Axboe

Subject: Re: Change in default vm_dirty_ratio

On Wed, Jun 20 2007, Andrew Morton wrote:
> Perhaps our queues are too long - if the VFS _does_ back off, it'll take
> some time for that to have an effect.
>
> Perhaps the fact that the queue size knows nothing about the _size_ of the
> requests in the queue is a problem.

It's complicated; the size may not matter a lot. 128 sequential 512KB IOs
may complete faster than 128 random 4KB IOs.
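
(For a rough sense of scale, assuming, say, ~60MB/sec of streaming bandwidth
and ~10ms per random seek: 128 sequential 512KB requests move 64MB in about a
second, while 128 random 4KB requests cost on the order of 1.3 seconds in
seeks alone despite moving only 512KB.)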

> Back away even further here.
>
> What user-visible problem(s) are we attempting to fix?

I'd like innocent-app-doing-little-write-or-fsync not being stalled by
big-bad-app-doing-lots-of-dirtying.

--
Jens Axboe

2007-06-20 09:19:40

by Peter Zijlstra

Subject: Re: Change in default vm_dirty_ratio

On Wed, 2007-06-20 at 01:58 -0700, Andrew Morton wrote:
> > On Wed, 20 Jun 2007 10:35:36 +0200 Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2007-06-19 at 21:44 -0700, Andrew Morton wrote:
> >
> > > Anyway, this is all arse-about. What is the design? What algorithms
> > > do we need to implement to do this successfully? Answer me that, then
> > > we can decide upon these implementation details.
> >
> > Building on the per BDI patches, how about integrating feedback from the
> > full-ness of device queues. That is, when we are happily doing IO and we
> > cannot possibly saturate the active devices (as measured by their queue
> > never reaching 75%?) then we can safely increase the total dirty limit.
> >
> > OTOH, when even with the per BDI dirty limit the device queue is
> > constantly saturated (contended) we ought to lower the total dirty
> > limit.
> >
> > Lots of detail here to work out, but does this sound workable?
>
> It's pretty easy to fill the queues - I'd expect that there are a lot of
> not-very-heavy workloads which cause the kernel to shove a lot of little
> writes into the queue when it visits the blockdev mapping: a shower of
> inodes, directory entries, indirect blocks, etc. With very little dirty
> memory associated with it.
>
> But back away further.
>
> What do we actually want the kernel to *do*? Stated in terms of "when the
> dirty memory state is A, do B" and "when userspace does C, the kernel should
> do D".
>
> Top-level statement: "when userspace does anything, the kernel should not
> suck" ;) Some refinement is needed there.
>
> I _think_ the problem is basically one of latency: a) writes starving reads
> and b) dirty memory causing page reclaim to stall and c) inter-device
> contention on the global memory limits.

well, I hope to have solved c)... :-)

> Hard. If the device isn't doing anything else then we can shove data at it
> freely. If reads (or synchronous writes) come in then perhaps the VM
> should back off and permit dirty memory to go higher.
>
> The anticipatory scheduler(s) are supposed to fix this.

I must plead ignorance here, I'll try to fill this hole in my
knowledge :-/

> Perhaps our queues are too long - if the VFS _does_ back off, it'll take
> some time for that to have an effect.
>
> Perhaps the fact that the queue size knows nothing about the _size_ of the
> requests in the queue is a problem.

Yes this would be an issue. By not knowing that, the queue limit is
basically useless. It doesn't limit very much.

> Back away even further here.
>
> What user-visible problem(s) are we attempting to fix?

Good point, the only report so far is DaveM saying git sucked on his
sparc64 box...

Dave, do you have any idea what caused that? could it be this global
fsync ext3 suffers from?

But the basic problem is balancing under-utilisation of disks vs.
keeping too much in memory.

2007-06-20 09:20:00

by Peter Zijlstra

Subject: Re: Change in default vm_dirty_ratio

On Wed, 2007-06-20 at 11:14 +0200, Jens Axboe wrote:
> On Wed, Jun 20 2007, Andrew Morton wrote:
> > Perhaps our queues are too long - if the VFS _does_ back off, it'll take
> > some time for that to have an effect.
> >
> > Perhaps the fact that the queue size knows nothing about the _size_ of the
> > requests in the queue is a problem.
>
> It's complicated; the size may not matter a lot. 128 sequential 512KB IOs
> may complete faster than 128 random 4KB IOs.

Yes, is there any way a queue could be limited to a certain amount of
'completion time' ?

> > Back away even further here.
> >
> > What user-visible problem(s) are we attempting to fix?
>
> I'd like innocent-app-doing-little-write-or-fsync not being stalled by
> big-bad-app-doing-lots-of-dirtying.

Could you please try this per BDI dirty limit -v7 patch series, the very
last patch tries to address this by taking the per task dirty rate into
account.

Although, on the fsync, ext3 seems to want to do a global fsync, which
will still make the experience suck. :-(

2007-06-20 09:21:47

by Jens Axboe

Subject: Re: Change in default vm_dirty_ratio

On Wed, Jun 20 2007, Peter Zijlstra wrote:
> On Wed, 2007-06-20 at 11:14 +0200, Jens Axboe wrote:
> > On Wed, Jun 20 2007, Andrew Morton wrote:
> > > Perhaps our queues are too long - if the VFS _does_ back off, it'll take
> > > some time for that to have an effect.
> > >
> > > Perhaps the fact that the queue size knows nothing about the _size_ of the
> > > requests in the queue is a problem.
> >
> > It's complicated; the size may not matter a lot. 128 sequential 512KB IOs
> > may complete faster than 128 random 4KB IOs.
>
> Yes, is there any way a queue could be limited to a certain amount of
> 'completion time' ?

Not easily, we'd need some sort of disk profile for that to be remotely
reliable.

> > > Back away even further here.
> > >
> > > What user-visible problem(s) are we attempting to fix?
> >
> > I'd like innocent-app-doing-little-write-or-fsync not being stalled by
> > big-bad-app-doing-lots-of-dirtying.
>
> Could you please try this per BDI dirty limit -v7 patch series, the very
> last patch tries to address this by taking the per task dirty rate into
> account.

Yeah, I've been watching your patchset with interest. Hope it'll get
merged some time soon, I think it's a real problem.

> Although, on the fsync, ext3 seems to want to do a global fsync, which
> will still make the experience suck. :-(

Yeah well, extX sucks on many levels :-)

--
Jens Axboe

2007-06-20 09:44:00

by Peter Zijlstra

Subject: Re: Change in default vm_dirty_ratio

On Wed, 2007-06-20 at 11:20 +0200, Jens Axboe wrote:
> On Wed, Jun 20 2007, Peter Zijlstra wrote:
> > On Wed, 2007-06-20 at 11:14 +0200, Jens Axboe wrote:
> > > On Wed, Jun 20 2007, Andrew Morton wrote:
> > > > Perhaps our queues are too long - if the VFS _does_ back off, it'll take
> > > > some time for that to have an effect.
> > > >
> > > > Perhaps the fact that the queue size knows nothing about the _size_ of the
> > > > requests in the queue is a problem.
> > >
> > > It's complicated; the size may not matter a lot. 128 sequential 512KB IOs
> > > may complete faster than 128 random 4KB IOs.
> >
> > Yes, is there any way a queue could be limited to a certain amount of
> > 'completion time' ?
>
> Not easily, we'd need some sort of disk profile for that to be remotely
> reliable.

Yes, I see the problem: benchmarking the device is hard; you don't want
to do it every time, nor for it to take too long. Also, benchmarking write
performance might be destructive, which is also not quite wanted :-/

/me sees this libtune doom on the horizon again.

Something adaptive would be best: something that inserts barriers,
measures the time to complete, and then solves for the read speed, write
speed and seek latency. All during normal operation.

That would entail storing a bunch of these sample points, solving the
equation for sets of 3, and (time-)averaging the results.. ugh
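
For what it's worth, a toy of what "solving the equation for sets of 3" could
look like, modelling each sample as elapsed = reads*t_read + writes*t_write +
seeks*t_seek (the sample numbers below are invented, and a real version would
keep (time-)averaging successive solutions instead of solving just once):

#include <stdio.h>

struct sample { double reads, writes, seeks, elapsed_ms; };

static double det3(double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/* Solve elapsed = reads*t_read + writes*t_write + seeks*t_seek for one set
 * of three samples, using Cramer's rule. Returns -1 if the samples are not
 * independent enough to solve. */
static int solve(const struct sample s[3], double cost[3])
{
    double a[3][3], d;

    for (int i = 0; i < 3; i++) {
        a[i][0] = s[i].reads;
        a[i][1] = s[i].writes;
        a[i][2] = s[i].seeks;
    }
    d = det3(a);
    if (d == 0.0)
        return -1;
    for (int col = 0; col < 3; col++) {
        double t[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                t[i][j] = (j == col) ? s[i].elapsed_ms : a[i][j];
        cost[col] = det3(t) / d;
    }
    return 0;
}

int main(void)
{
    /* invented measurements: reads, writes, seeks and elapsed time (ms) */
    struct sample s[3] = {
        { 100,  10,  20, 100.0 },
        {  10, 100,  80, 160.0 },
        {  50,  50, 200, 300.0 },
    };
    double cost[3];

    if (solve(s, cost) == 0)
        printf("per-read %.2f ms, per-write %.2f ms, per-seek %.2f ms\n",
               cost[0], cost[1], cost[2]);
    return 0;
}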



2007-06-20 17:18:36

by Linus Torvalds

Subject: Re: Change in default vm_dirty_ratio



On Wed, 20 Jun 2007, Peter Zijlstra wrote:
>
> Building on the per BDI patches, how about integrating feedback from the
> full-ness of device queues. That is, when we are happily doing IO and we
> cannot possibly saturate the active devices (as measured by their queue
> never reaching 75%?) then we can safely increase the total dirty limit.

The really annoying things are the one-off things. You've been happily
working for a while (never even being _close_ to saturating any IO
queues), and then you untar a large tree.

If the kernel now lets you dirty lots of memory, you'll have a very
unpleasant experience.

And with hot-pluggable devices (which is where most of the throughput
problems tend to be!), the "one-off" thing is not a "just after reboot"
kind of situation.

So you'd have to be pretty smart about it.

Linus

2007-06-20 18:16:37

by Arjan van de Ven

Subject: Re: Change in default vm_dirty_ratio

On Wed, 2007-06-20 at 10:17 -0700, Linus Torvalds wrote:
>
> On Wed, 20 Jun 2007, Peter Zijlstra wrote:
> >
> > Building on the per BDI patches, how about integrating feedback from the
> > full-ness of device queues. That is, when we are happily doing IO and we
> > cannot possibly saturate the active devices (as measured by their queue
> > never reaching 75%?) then we can safely increase the total dirty limit.
>
> The really annoying things are the one-off things. You've been happily
> working for a while (never even being _close_ to saturating any IO
> queues), and then you untar a large tree.
>
> If the kernel now lets you dirty lots of memory, you'll have a very
> unpleasant experience.


maybe that needs to be fixed? If you stopped dirtying after the initial
bump.. is there a reason for the kernel to dump all that data to the
disk in such a way that it disturbs interactive users?

so the question maybe is.. is the vm tunable the cause or the symptom of
the bad experience?


2007-06-20 18:28:31

by Linus Torvalds

Subject: Re: Change in default vm_dirty_ratio



On Wed, 20 Jun 2007, Arjan van de Ven wrote:
>
> maybe that needs to be fixed? If you stopped dirtying after the initial
> bump.. is there a reason for the kernel to dump all that data to the
> disk in such a way that it disturbs interactive users?

No. I would argue that the kernel should try to trickle things out, so
that it doesn't disturb anything, and a "big dump" becomes a "steady
trickle".

And that's what "vm_dirty_ratio" is all about.

> so the question maybe is.. is the vm tunable the cause or the symptom of
> the bad experience?

No, the vm tunable is exactly what it's all about.

Do a big "untar", and what you *want* to see is not "instant dump,
followed by long pause".

A much *smoother* behaviour is generally preferable, and most of the time
that's true even if it may be lower throughput in the end!

Of course, "synchronous writes" are *really* smooth (you never allow any
dumps at *all* to build up), so this is about a balance - not about
"perfect smoothness" vs "best throughput", but about a heuristic that
finds a reasonable middle ground.

There is no "perfect". There is only "stupid heuristics". Maybe the
"vm_dirty_ratio" is a bit *too* stupid, but it definitely is needed in
some form.

It can actually be more than just a "performance" vs "smoothness" issue:
the 40% thing was actually a *correctness* issue too, back when we counted
it as a percentage of total memory. A highmem machine would allow 40% of
all memory dirty and it was all in low memory, and that literally caused
lockups.

So the dirty_ratio is not *only* about smoothness, it's also simply about
the fact that the kernel must not allow too much memory to be dirtied,
because that leads to out-of-memory deadlocks and other nasty issues. So
it's not *purely* a tunable.

Linus

2007-06-21 12:32:59

by Nadia Derbey

Subject: Re: Change in default vm_dirty_ratio

Dave Jones wrote:
> On Mon, Jun 18, 2007 at 04:47:11PM -0700, Andrew Morton wrote:
>
> > Frankly, I find it very depressing that the kernel defaults matter. These
> > things are trivially tunable and you'd think that after all these years,
> > distro initscripts would be establishing the settings, based upon expected
> > workload, amount of memory, number and bandwidth of attached devices, etc.
>
> "This is hard, lets make it someone else's problem" shouldn't ever be the
> answer, especially if the end result is that we become even more
> dependent on bits of userspace running before the system becomes useful.
>
> > Heck, there should even be userspace daemons which observe ongoing system
> > behaviour and which adaptively tune these things to the most appropriate
> > level.
> >
> > But nope, nothing.
>
> See the 'libtune' crack that people have been trying to get distros to
> adopt for a long time.
> If we need some form of adaptive behaviour, the kernel needs to be
> doing this monitoring/adapting, not some userspace daemon that may
> not get scheduled before it's too late.
>

I'm wondering whether the AKT I proposed a couple of months ago wouldn't be
more appropriate (provided that we find the perfect heuristics to tune
the dirty_ratio ;-) )
see thread http://lkml.org/lkml/2007/1/16/16

Regards,
Nadia


2007-06-21 16:54:30

by Mark Lord

Subject: Re: Change in default vm_dirty_ratio

Andrew Morton wrote:
>
> What do we actually want the kernel to *do*? Stated in terms of "when the
> dirty memory state is A, do B" and "when userspace does C, the kernel should
> do D".

When we have dirty pages awaiting write-out,
and the write-out device is completely idle,
then we should be writing them out.

That's the easy bit taken care of. ;)

2007-06-21 16:56:20

by Peter Zijlstra

Subject: Re: Change in default vm_dirty_ratio

On Thu, 2007-06-21 at 12:54 -0400, Mark Lord wrote:
> Andrew Morton wrote:
> >
> > What do we actually want the kernel to *do*? Stated in terms of "when the
> > dirty memory state is A, do B" and "when userspace does C, the kernel should
> > do D".
>
> When we have dirty pages awaiting write-out,
> and the write-out device is completely idle,
> then we should be writing them out.
>
> That's the easy bit taken care of. ;)

Unless we're in laptop mode, in which case we want to procrastinate.. :-)

2007-06-21 22:54:11

by Matt Mackall

Subject: Re: Change in default vm_dirty_ratio

On Wed, Jun 20, 2007 at 11:20:59AM +0200, Jens Axboe wrote:
> On Wed, Jun 20 2007, Peter Zijlstra wrote:
> > On Wed, 2007-06-20 at 11:14 +0200, Jens Axboe wrote:
> > > On Wed, Jun 20 2007, Andrew Morton wrote:
> > > > Perhaps our queues are too long - if the VFS _does_ back off, it'll take
> > > > some time for that to have an effect.
> > > >
> > > > Perhaps the fact that the queue size knows nothing about the _size_ of the
> > > > requests in the queue is a problem.
> > >
> > > It's complicated; the size may not matter a lot. 128 sequential 512KB IOs
> > > may complete faster than 128 random 4KB IOs.
> >
> > Yes, is there any way a queue could be limited to a certain amount of
> > 'completion time' ?
>
> Not easily, we'd need some sort of disk profile for that to be remotely
> reliable.

Perhaps we want to throw some sliding window algorithms at it. We can
bound requests and total I/O and if requests get retired too slowly we
can shrink the windows. Alternately, we can grow the window if we're
retiring things within our desired timeframe.
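
Presumably something with the usual additive-increase/multiplicative-decrease
shape, e.g. (a toy user-space model; the 100ms target and the window bounds
are invented):

#include <stdio.h>

#define WINDOW_MIN    4
#define WINDOW_MAX  256
#define TARGET_MS   100     /* how quickly we want a request retired */

static unsigned int window = 32;   /* current cap on in-flight requests */

/* Called when a request completes; took_ms is how long it was outstanding. */
static void request_retired(unsigned int took_ms)
{
    if (took_ms > TARGET_MS) {
        window /= 2;                /* retiring too slowly: back off hard */
        if (window < WINDOW_MIN)
            window = WINDOW_MIN;
    } else if (window < WINDOW_MAX) {
        window++;                   /* within the target: probe for more */
    }
}

int main(void)
{
    /* invented completion times (ms) for a run of requests */
    unsigned int times[] = { 20, 30, 25, 250, 400, 90, 40, 35, 30 };

    for (unsigned int i = 0; i < sizeof(times) / sizeof(times[0]); i++) {
        request_retired(times[i]);
        printf("request took %3u ms -> window %u\n", times[i], window);
    }
    return 0;
}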

--
Mathematics is the supreme nostalgia of our time.

2007-06-21 23:08:53

by Linus Torvalds

Subject: Re: Change in default vm_dirty_ratio



On Thu, 21 Jun 2007, Matt Mackall wrote:
>
> Perhaps we want to throw some sliding window algorithms at it. We can
> bound requests and total I/O and if requests get retired too slowly we
> can shrink the windows. Alternately, we can grow the window if we're
> retiring things within our desired timeframe.

I suspect that would tend to be a good way to go. But it almost certainly
has to be per-device, which implies that somebody would have to do some
major coding/testing on this..

The vm_dirty_ratio thing is a global value, and I think we need that
regardless (for the independent issue of memory deadlocks etc), but if we
*additionally* had a per-device throttle that was based on some kind of
adaptive thing, we could probably raise the global (hard) vm_dirty_ratio a
lot.

Linus

2007-06-24 13:17:19

by Peter Zijlstra

Subject: Re: Change in default vm_dirty_ratio

On Thu, 2007-06-21 at 16:08 -0700, Linus Torvalds wrote:
>
> On Thu, 21 Jun 2007, Matt Mackall wrote:
> >
> > Perhaps we want to throw some sliding window algorithms at it. We can
> > bound requests and total I/O and if requests get retired too slowly we
> > can shrink the windows. Alternately, we can grow the window if we're
> > retiring things within our desired timeframe.
>
> I suspect that would tend to be a good way to go. But it almost certainly
> has to be per-device, which implies that somebody would have to do some
> major coding/testing on this..
>
> The vm_dirty_ratio thing is a global value, and I think we need that
> regardless (for the independent issue of memory deadlocks etc), but if we
> *additionally* had a per-device throttle that was based on some kind of
> adaptive thing, we could probably raise the global (hard) vm_dirty_ratio a
> lot.

I just did quite a bit of that:

http://lkml.org/lkml/2007/6/14/437

2007-06-24 16:42:18

by Linus Torvalds

Subject: Re: Change in default vm_dirty_ratio



On Sat, 23 Jun 2007, Peter Zijlstra wrote:

> On Thu, 2007-06-21 at 16:08 -0700, Linus Torvalds wrote:
> >
> > The vm_dirty_ratio thing is a global value, and I think we need that
> > regardless (for the independent issue of memory deadlocks etc), but if we
> > *additionally* had a per-device throttle that was based on some kind of
> > adaptive thing, we could probably raise the global (hard) vm_dirty_ratio a
> > lot.
>
> I just did quite a bit of that:
>
> http://lkml.org/lkml/2007/6/14/437

Ok, that does look interesting.

A few comments:

- Cosmetic: please please *please* don't do this:

- if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+ if (atomic_long_dec_return(&nfss->writeback) <
+ NFS_CONGESTION_OFF_THRESH) {

we had a discussion about this not that long ago, and it drives me wild
to see people split lines like that. It used to be readable. Now it's
not.

If it's the checkpatch.pl thing that caused you to split it like that,
I think we should just change the value where we start complaining.
Maybe set it at 95 characters per line or something.

- I appreciate the extensive comments on floating proportions, and the
code looks really quite clean (apart from the cosmetic thing above)
from a quick look-through.

HOWEVER. It does seem to be a bit of an overkill. Do we really need to
be quite that clever, and do we really need to do 64-bit calculations
for this? The 64-bit ops in particular seem quite iffy: if we ever
actually pass in an amount that doesn't fit in 32 bits, we'll turn
those per-cpu counters into totally synchronous global counters, which
seems to defeat the whole point of them. So I'm a bit taken aback by
that whole "mod64" thing

(I also hate the naming. I don't think "..._mod()" was a good name to
begin with: "mod" means "modulus" to me, not "modify". Making it be
called "mod64" just makes it even *worse*, since it's now _obviously_
about modulus - but isn't)

So I'd appreciate some more explanations, but I'd also appreciate some
renaming of those functions. What used to be pretty bad naming just turned
*really* bad, imnsho.

Linus

2007-06-25 00:16:40

by Peter Zijlstra

Subject: Re: Change in default vm_dirty_ratio

On Sun, 2007-06-24 at 09:40 -0700, Linus Torvalds wrote:
>
> On Sat, 23 Jun 2007, Peter Zijlstra wrote:
>
> > On Thu, 2007-06-21 at 16:08 -0700, Linus Torvalds wrote:
> > >
> > > The vm_dirty_ratio thing is a global value, and I think we need that
> > > regardless (for the independent issue of memory deadlocks etc), but if we
> > > *additionally* had a per-device throttle that was based on some kind of
> > > adaptive thing, we could probably raise the global (hard) vm_dirty_ratio a
> > > lot.
> >
> > I just did quite a bit of that:
> >
> > http://lkml.org/lkml/2007/6/14/437
>
> Ok, that does look interesting.
>
> A few comments:
>
> - Cosmetic: please please *please* don't do this:
>
> - if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
> + if (atomic_long_dec_return(&nfss->writeback) <
> + NFS_CONGESTION_OFF_THRESH) {
>
> we had a discussion about this not that long ago, and it drives me wild
> to see people split lines like that. It used to be readable. Now it's
> not.
>
> If it's the checkpatch.pl thing that caused you to split it like that,
> I think we should just change the value where we start complaining.
> Maybe set it at 95 characters per line or something.

It was.

> - I appreciate the extensive comments on floating proportions, and the
> code looks really quite clean (apart from the cosmetic thing above)
> from a quick look-through.
>
> HOWEVER. It does seem to be a bit of an overkill. Do we really need to
> be quite that clever,

Hehe, it is the simplest thing I could come up with. It is
deterministic, fast and has a nice physical model :-)

> and do we really need to do 64-bit calculations
> for this?

No we don't (well, on 64bit arches we do). I actually only use unsigned
long, and even cast whatever comes out of the percpu_counter thing to
unsigned long.

> The 64-bit ops in particular seem quite iffy: if we ever
> actually pass in an amount that doesn't fit in 32 bits, we'll turn
> those per-cpu counters into totally synchronous global counters, which
> seems to defeat the whole point of them. So I'm a bit taken aback by
> that whole "mod64" thing

That is only needed on 64bit arches, and even there, actually
encountering such large values will be rare at best.

Also, this re-normalisation event that uses the call is low frequency.
That is, that part will only be used about once every
~total_dirty_limit/nr_bdis of data written out.

> (I also hate the naming. I don't think "..._mod()" was a good name to
> begin with: "mod" means "modulus" to me, not "modify". Making it be
> called "mod64" just makes it even *worse*, since it's now _obviously_
> about modulus - but isn't)

Agreed.

> So I'd appreciate some more explanations, but I'd also appreciate some
> renaming of those functions. What used to be pretty bad naming just turned
> *really* bad, imnsho.

It all just stems from Andrew asking if I could please re-use something
instead of duplicating a lot of things. I picked percpu_counter because
that was the closest to what was needed. An unsigned long based per-cpu
counter would suit better.

There is another problem I have with this percpu_counter: it is rather
space hungry. It does a node-affine sizeof(s32) kmalloc on each cpu,
which will end up using the smallest slab, and that is quite a bit
bigger than needed. But it should be about the size of a cacheline
(otherwise we might still end up with false sharing).

I've been thinking of extending this per cpu allocator thing a bit to be
a little smarter about these things. What would be needed is a strict
per-cpu slab allocator. The current ones are node affine, which can
still cause false sharing (unless - as should be the case - these
objects are both cacheline aligned and of cacheline size). When we have
that, we can start using smaller objects.
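
For reference, the false-sharing point boils down to something like this
(plain user-space C; 64 bytes is an assumed cacheline size, and this only
shows the padding idea, not the kernel's allocator):

#include <stdio.h>

#define CACHELINE 64   /* assumed cache line size */

/* A bare s32-sized counter: several of these can share one cache line,
 * so two CPUs bumping "their own" counter still bounce the line around. */
struct percpu_count_packed {
    int count;
};

/* Padding/aligning the per-cpu object to a full cache line avoids that,
 * at the cost of ~16x the space for a 4-byte payload. */
struct percpu_count_aligned {
    int  count;
    char pad[CACHELINE - sizeof(int)];
} __attribute__((aligned(CACHELINE)));

int main(void)
{
    printf("packed:  %zu bytes\n", sizeof(struct percpu_count_packed));
    printf("aligned: %zu bytes\n", sizeof(struct percpu_count_aligned));
    return 0;
}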