Andrew,
We have some fairly large installations that are running into long
pauses while running fsync(). One of the issues noted is that
vm_dirty_ratio, while probably adequate for a desktop-type installation,
seems excessively large for a larger configuration. For your reference,
the machine that first reported this is running with 384GB of memory.
Others that reported the problem range from 256GB to 4TB. At those sizes,
we are talking dirty buffers in the range of 100GB to 1TB. That seems
a bit excessive.
Is there any chance of limiting vm_dirty_ratio to something other than
a hard-coded 40%? Maybe add something like the following two lines to
the beginning of page_writeback_init(). This would limit us to roughly
2GB of dirty buffers. I picked that number assuming that nobody would
want to affect machines in the 4GB and below range.
vm_dirty_ratio = min(40, TWO_GB_IN_PAGES * 100 / total_pages);
dirty_background_ratio = vm_dirty_ratio / 4;
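Spelled out a little more, the sort of thing I have in mind at the top of
page_writeback_init() -- TWO_GB_IN_PAGES is just a placeholder constant and
total_pages whatever page count that code already has handy, so treat this
as a sketch rather than a patch:

	/* Sketch only: never let the default exceed roughly 2GB of dirty
	 * pagecache, but keep at least the 1% granularity we have today. */
	unsigned long cap = (TWO_GB_IN_PAGES * 100) / total_pages;

	if (cap < 1)
		cap = 1;
	if (vm_dirty_ratio > cap)
		vm_dirty_ratio = cap;
	dirty_background_ratio = vm_dirty_ratio / 4;
	if (dirty_background_ratio < 1)
		dirty_background_ratio = 1;

On a 4GB machine that leaves the defaults alone; on the 384GB machine it
pulls the ratio down to the 1% floor (still ~3.8GB, which is part of the
granularity problem below).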
One other issue we have is the vm_dirty_ratio and background_ratio
adjustments are a little coarse with these memory sizes. Since our
minimum adjustment is 1%, we are adjusting by 40GB on the largest
configuration from above. The hardware we are shipping today is capable
of going to far greater amounts of memory, but we don't have customers
demanding that yet. I would like to plan ahead for that and change
vm_dirty_ratio from a straight percent into a millipercent (thousandth
of a percent). Would that type of change be acceptable?
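Conceptually the change is small -- anywhere the writeback code derives the
threshold as a straight percentage, the divisor becomes 100,000 instead of
100 (illustrative only, not the exact expression in mm/page-writeback.c):

	/* today, percent:           */
	dirty_thresh = (vm_dirty_ratio * available_memory) / 100;
	/* proposed, millipercent:   */
	dirty_thresh = (vm_dirty_ratio * available_memory) / 100000;

With that, the minimum step on the 4TB box above drops from ~40GB to ~40MB.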
Thanks,
Robin Holt
Robin Holt <[email protected]> wrote:
>
> Andrew,
>
> We have some fairly large installations that are running into long
> pauses while running fsync(). One of the issues noted is that
> vm_dirty_ratio, while probably adequate for a desktop-type installation,
> seems excessively large for a larger configuration. For your reference,
> the machine that first reported this is running with 384GB of memory.
> Others that reported the problem range from 256GB to 4TB. At those sizes,
> we are talking dirty buffers in the range of 100GB to 1TB. That seems
> a bit excessive.
I'd have thought that dirty_background_ratio is the problem here: you want
pdflush to kick in earlier to start the I/O while permitting the write()ing
application to keep running.
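The two knobs translate into two thresholds, roughly like this (variable
names are illustrative, not exactly what get_dirty_limits() uses):

	/* background_thresh: pdflush starts writeback behind the app's back
	 * dirty_thresh:      the write()ing task itself gets throttled      */
	background_thresh = (dirty_background_ratio * available_memory) / 100;
	dirty_thresh      = (vm_dirty_ratio * available_memory) / 100;

Lowering dirty_background_ratio moves the first threshold down, so the I/O
starts early while the application stays out of the throttle.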
> Is there any chance of limiting vm_dirty_ratio to something other than
> a hard-coded 40%? Maybe add something like the following two lines to
> the beginning of page_writeback_init(). This would limit us to roughly
> 2GB of dirty buffers. I picked that number assuming that nobody would
> want to affect machines in the 4GB and below range.
>
>
> vm_dirty_ratio = min(40, TWO_GB_IN_PAGES * 100 / total_pages);
> dirty_background_ratio = vm_dirty_ratio / 4;
All that dirty pagecache allows us to completely elide I/O when overwrites
are happening, to get better request queue merging, to get better file
layout if the fs does allocate-on-flush and, probably most importantly, to
avoid I/O completely for short-lived files.
So I'm sure there's someone out there who will say "hey, how come my
seeky-writing application just got 75% slower?".
That being said, perhaps reducing the default will help more people than it
hurts - I simply do not know. That's why it's tuneable ;)
Would it be correct to assume that these applications are simply doing
large, linear writes? If so, do they write quickly or at a relatively slow
rate? The latter, I assume.
Which fs are you using?
Other things we can think about are
- Setting the dirty limit on a per-inode basis (non-trivial)
- Adding a new fadvise command to start async writeback of a section of
the file (easy; see the sketch below).
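For the second one, the application-side usage would be something like
this -- POSIX_FADV_ASYNC_WRITE is an invented name for a command that does
not exist yet:

	#include <fcntl.h>

	/* Hypothetical: ask the kernel to start flushing [start, start+len)
	 * in the background, without blocking the way fsync() does. */
	#define POSIX_FADV_ASYNC_WRITE	32	/* made-up value */

	static void hint_writeback(int fd, off_t start, off_t len)
	{
		posix_fadvise(fd, start, len, POSIX_FADV_ASYNC_WRITE);
	}

The app dirties a chunk, tells the kernel it's done with it and moves on;
the flush proceeds in the background.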
>
> One other issue we have is the vm_dirty_ratio and background_ratio
> adjustments are a little coarse with these memory sizes. Since our
> minimum adjustment is 1%, we are adjusting by 40GB on the largest
> configuration from above. The hardware we are shipping today is capable
> of going to far greater amounts of memory, but we don't have customers
> demanding that yet. I would like to plan ahead for that and change
> vm_dirty_ratio from a straight percent into a millipercent (thousandth
> of a percent). Would that type of change be acceptable?
Oh drat. I think such a change would require a new set of /proc entries.
>>>>> "Andrew" == Andrew Morton <[email protected]> writes:
Andrew> Robin Holt <[email protected]> wrote:
>> One other issue we have is the vm_dirty_ratio and background_ratio
>> adjustments are a little coarse with these memory sizes. Since our
>> minimum adjustment is 1%, we are adjusting by 40GB on the largest
>> configuration from above. The hardware we are shipping today is
>> capable of going to far greater amounts of memory, but we don't
>> have customers demanding that yet. I would like to plan ahead for
>> that and change vm_dirty_ratio from a straight percent into a
>> millipercent (thousandth of a percent). Would that type of change
>> be acceptable?
Andrew> Oh drat. I think such a change would require a new set of
Andrew> /proc entries.
No, you could just extend them to understand fixed point. Keep
printing integers as integers, print non-integers with one (or two:
will we ever need 0.01% increments?) decimal places.
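The output side would be trivial -- something like this if the value were
stored internally in tenths of a percent (print_ratio is just an
illustrative helper, not an existing kernel function):

	static int print_ratio(char *buf, unsigned long val)
	{
		if (val % 10 == 0)	/* whole percent: looks like today */
			return sprintf(buf, "%lu\n", val / 10);
		return sprintf(buf, "%lu.%lu\n", val / 10, val % 10);
	}

The parsing side is the same in reverse: accept an optional fractional part
and scale it up before storing.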
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
On Fri, Mar 18, 2005 at 09:27:31AM +1100, Peter Chubb wrote:
> >>>>> "Andrew" == Andrew Morton <[email protected]> writes:
>
> Andrew> Robin Holt <[email protected]> wrote:
>
> >> One other issue we have is the vm_dirty_ratio and background_ratio
> >> adjustments are a little coarse with these memory sizes. Since our
> >> minimum adjustment is 1%, we are adjusting by 40GB on the largest
> >> configuration from above. The hardware we are shipping today is
> >> capable of going to far greater amounts of memory, but we don't
> >> have customers demanding that yet. I would like to plan ahead for
> >> that and change vm_dirty_ratio from a straight percent into a
> >> millipercent (thousandth of a percent). Would that type of change
> >> be acceptable?
>
> Andrew> Oh drat. I think such a change would require a new set of
> Andrew> /proc entries.
>
> No, you could just extend them to understand fixed point. Keep
> printing integers as integers, print non-integers with one (or two:
> will we ever need 0.01% increments?) decimal places.
Right now, it is possible to build our largest Altix configuration with
64TB of memory (unfortunately, we can't get any customers to pay that
large a bill ;). We are currently shipping a few 4TB systems and hope
to be selling 20TB systems by the end of the year (at least engineering
hopes to).
Given that, two decimal places are really not enough. We probably need
at least 3.
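To put numbers on the 2GB target from earlier in the thread:

	2GB of  4TB  =  2 / 4096   ~= 0.049%   (0.05% at two places)
	2GB of 20TB  =  2 / 20480  ~= 0.010%   (just representable)
	2GB of 64TB  =  2 / 65536  ~= 0.003%   (0.00% at two places)

so two places already round the 64TB case down to nothing.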
Is there any reason to not do 3 places? Is this the right direction to
head or does anybody know of problems this would cause?
Thanks,
Robin Holt
Robin Holt <[email protected]> wrote:
>
> > No, you could just extend them to understand fixed point. Keep
> > printing integers as integers, print non-integers with one (or two:
> > will we ever need 0.01% increments?) decimal places.
>
> Right now, it is possible to build our largest Altix configuration with
> 64TB of memory (unfortunately, we can't get any customers to pay that
> large a bill ;). We are currently shipping a few 4TB systems and hope
> to be selling 20TB systems by the end of the year (at least engineering
> hopes to).
>
> Given that, two decimal places are really not enough. We probably need
> at least 3.
>
> Is there any reason to not do 3 places? Is this the right direction to
> head or does anybody know of problems this would cause?
It's a rather unorthodox fix, but not illogical. I guess it depends upon
how much sysctl infrastructure it adds. Probably quite a lot.
Another approach would be to just say the ratio now has a range 0 ..
999,999 and then, if it happens to be less than 100, treat that as a
percentage for back-compatibility reasons. Although that's a bit kludgy
and perhaps a completely new /proc entry would be better.
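i.e. something like this on the consuming side (sketch only; the kludge is
that millipercent values below 100 can't be expressed at all):

	/* Back-compat interpretation: small values keep their old meaning. */
	static unsigned long dirty_millipercent(unsigned long val)
	{
		if (val < 100)		/* legacy: plain percent */
			return val * 1000;
		return val;		/* new style: already millipercent */
	}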