Hi,
I have observed a problem where write(2) can be blocked for a long time
if a system has several disks and is under heavy I/O pressure. This
patchset avoids the problem.
Example of the problem:
There are two processes on a system which has two disks. Process-A
writes heavily to disk-a, and process-B occasionally writes small data
(e.g. log files) to disk-b. A portion of system memory, whose size
depends on vm.dirty_ratio (typically 40%), is filled up with Dirty
and Writeback pages of disk-a.
In this situation, write(2) of process-B can be blocked for a very
long time (more than 60 seconds), although the load on disk-b is quite
low. In particular, the system becomes quite slow if disk-a is
slow (e.g. a backup to a USB disk).
This seems to be the same problem as discussed in LKML:
http://marc.theaimsgroup.com/?t=115559902900003
and
http://marc.theaimsgroup.com/?t=117182340400003
Root cause:
I found that this problem is caused by balance_dirty_pages().
While Dirty+Writeback pages occupy more than 40% of memory, process-B is
blocked in balance_dirty_pages() until writeback of some (`write_chunk',
typically = 1536) dirty pages on disk-b is started.
However, because disk-b has only a few dirty pages, process-B will
be blocked until writeback to disk-a is completed and Dirty+Writeback
goes below 40%.
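For reference, here is an abridged sketch of the balance_dirty_pages()
loop in mm/page-writeback.c around 2.6.20 (simplified; the dirty_exceeded
handling and the re-check after writeback are trimmed). Process-B spins
here because nearly-clean disk-b can never supply `write_chunk' pages,
so its only way out is the global threshold check:
==
/* Abridged sketch, not the verbatim kernel source. */
static void balance_dirty_pages(struct address_space *mapping)
{
	long nr_reclaimable, background_thresh, dirty_thresh;
	unsigned long pages_written = 0;
	unsigned long write_chunk = sync_writeback_pages(); /* ~1536 */
	struct backing_dev_info *bdi = mapping->backing_dev_info;

	for (;;) {
		struct writeback_control wbc = {
			.bdi		= bdi,
			.sync_mode	= WB_SYNC_NONE,
			.older_than_this = NULL,
			.nr_to_write	= write_chunk,
			.range_cyclic	= 1,
		};

		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
				 global_page_state(NR_UNSTABLE_NFS);

		/* Exit 1: global Dirty+Writeback below vm.dirty_ratio --
		 * never true while disk-a keeps the system above 40%. */
		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
		    dirty_thresh)
			break;

		if (nr_reclaimable) {
			writeback_inodes(&wbc);
			pages_written += write_chunk - wbc.nr_to_write;
			/* Exit 2: started writeback of write_chunk pages
			 * on this bdi -- impossible for disk-b, which has
			 * only a handful of dirty pages. */
			if (pages_written >= write_chunk)
				break;
		}
		congestion_wait(WRITE, HZ/10);
	}
}
==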
Solution:
If a process cannot start writeback of `write_chunk' pages for a disk
in balance_dirty_pages(), I consider that all of the dirty pages for
that disk have been written back and that the disk is clean.
To avoid using up free memory with dirty pages while bypassing the
blocking, this patchset adds a new threshold named vm.dirty_limit_ratio
to sysctl. It modifies balance_dirty_pages() not to block when the
amount of Dirty+Writeback is less than vm.dirty_limit_ratio percent of
memory. Otherwise, writers are throttled as current Linux does.
In this patchset, vm.dirty_limit_ratio, instead of vm.dirty_ratio, is
used as the clamping level of Dirty+Writeback, and vm.dirty_ratio is
used as the level at which a writer will itself start writeback of the
dirty pages.
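A rough sketch of the modified exit condition, continuing the abridged
loop above (the name dirty_limit_thresh is illustrative, not necessarily
the identifier used in the actual patches):
==
		if (nr_reclaimable) {
			writeback_inodes(&wbc);
			pages_written += write_chunk - wbc.nr_to_write;
			if (pages_written >= write_chunk)
				break;	/* this bdi had enough dirty pages */

			/*
			 * This bdi could not supply write_chunk pages, so
			 * treat it as clean: stop blocking the writer as
			 * long as the global total stays below the page
			 * count derived from vm.dirty_limit_ratio.
			 */
			if (wbc.nr_to_write > 0 &&
			    nr_reclaimable + global_page_state(NR_WRITEBACK) <
			    dirty_limit_thresh)
				break;
		}
==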
Testing Results:
In the situation explained in the "Example of the problem" section, I
measured the time of write(2) to disk-b.
Under the kernel with this patchset, the write completed in 30ms or
less.
When nr_requests is set too high (e.g. 8192), Dirty+Writeback grows
close to vm.dirty_limit_ratio (45% of system memory by default). In
that case, write(2) sometimes took about 1 second.
This patchset can be applied to 2.6.20-mm2.
It consists of 3 pieces:
1/3 - add a sysctl variable `vm.dirty_limit_ratio' (a registration
sketch follows below)
2/3 - modify get_dirty_limits() to return the limit of dirty pages.
3/3 - break out of the balance_dirty_pages() loop if the disk has no
remaining dirty pages and Dirty+Writeback < vm.dirty_limit_ratio.
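As a rough idea of what piece 1/3 involves, here is a hypothetical
ctl_table entry for kernel/sysctl.c, modeled on the existing
vm.dirty_ratio entry; the VM_DIRTY_LIMIT_RATIO constant and the 0..100
bounds are assumptions, not taken from the actual patch:
==
/* Hypothetical registration sketch, patterned after vm.dirty_ratio. */
{
	.ctl_name	= VM_DIRTY_LIMIT_RATIO,	/* assumed constant */
	.procname	= "dirty_limit_ratio",
	.data		= &dirty_limit_ratio,
	.maxlen		= sizeof(dirty_limit_ratio),
	.mode		= 0644,
	.proc_handler	= &proc_dointvec_minmax,
	.strategy	= &sysctl_intvec,
	.extra1		= &zero,		/* clamp to 0..100 */
	.extra2		= &one_hundred,
},
==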
--
Tomoki Sekiyama
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]
On Fri, 23 Feb 2007 21:03:37 +0900
Tomoki Sekiyama <[email protected]> wrote:
> Hi,
>
> I have observed a problem where write(2) can be blocked for a long time
> if a system has several disks and is under heavy I/O pressure. This
> patchset avoids the problem.
>
> Example of the problem:
>
> There are two processes on a system which has two disks. Process-A
> writes heavily to disk-a, and process-B occasionally writes small data
> (e.g. log files) to disk-b. A portion of system memory, whose size
> depends on vm.dirty_ratio (typically 40%), is filled up with Dirty
> and Writeback pages of disk-a.
>
> In this situation, write(2) of process-B can be blocked for a very
> long time (more than 60 seconds), although the load on disk-b is quite
> low. In particular, the system becomes quite slow if disk-a is
> slow (e.g. a backup to a USB disk).
>
> This seems to be the same problem as discussed in LKML:
> http://marc.theaimsgroup.com/?t=115559902900003
> and
> http://marc.theaimsgroup.com/?t=117182340400003
>
Interesting, but how about adjusting this parameter like below instead
of adding a new control knob? (This kind of knob is not easy to use.)
==
	struct writeback_control wbc = {
		.bdi		= bdi,
		.sync_mode	= WB_SYNC_NONE,
		.older_than_this = NULL,
		.nr_to_write	= 0,
		.range_cyclic	= 1,
	};
<snip>
	if (nr_reclaimable) {
		/* Just do what I can do */
		dirty_pages_on_device =
			count_dirty_pages_on_device_limited(bdi, writechunk);
		wbc.nr_to_write = dirty_pages_on_device;
		writeback_inodes(&wbc);
	}
==
count_dirty_pages_on_device_limited(bdi, writechunk) above returns the
number of dirty pages on bdi; if the number of dirty pages on bdi is
larger than writechunk, it just returns writechunk.
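A minimal sketch of that helper might look like the following. Note
that 2.6.20 has no per-bdi dirty-page accounting, so the
bdi_dirty_pages() counter below is an assumed primitive; the point is
only the capping behaviour, not a drop-in implementation:
==
static long count_dirty_pages_on_device_limited(struct backing_dev_info *bdi,
						long writechunk)
{
	/* bdi_dirty_pages() is assumed per-bdi accounting, which
	 * mainline does not provide today. */
	long nr_dirty = bdi_dirty_pages(bdi);

	/* Never ask for more than one write chunk per pass. */
	return min(nr_dirty, writechunk);
}
==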
-Kame
Tomoki Sekiyama writes:
> Hi,
Hello,
>
[...]
>
> While Dirty+Writeback pages occupy more than 40% of memory, process-B is
> blocked in balance_dirty_pages() until writeback of some (`write_chunk',
> typically = 1536) dirty pages on disk-b is started.
Maybe the simpler solution is to use separate variables to control the
ratelimit and the write chunk?
writeback_set_ratelimit() adjusts ratelimit_pages to avoid too frequent
calls to balance_dirty_pages(), but once we are inside of
writeback_inodes(), there is no need to write especially many pages in
one go: overhead of any additional looping is negligible, when compared
with the cost of writing.
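For context, the coupling in question looks roughly like this in
mm/page-writeback.c around 2.6.20 (reproduced from memory, so treat the
details as approximate): the same ratelimit_pages variable drives both
how often balance_dirty_pages() is entered and how large the write
chunk is.
==
static long ratelimit_pages = 32;

/* The write chunk is 1.5 * ratelimit_pages; with the typical adjusted
 * ratelimit_pages of 1024 this yields the write_chunk of 1536 quoted
 * above -- hence the suggestion to split the two knobs. */
static inline long sync_writeback_pages(void)
{
	return ratelimit_pages + ratelimit_pages / 2;
}

/* Scales ratelimit_pages with memory size and CPU count so that
 * balance_dirty_pages() is not entered too often -- and, as a side
 * effect, inflates the write chunk too. */
void writeback_set_ratelimit(void)
{
	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
	if (ratelimit_pages < 16)
		ratelimit_pages = 16;
	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
}
==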
Speaking of which, now that expensive get_writeback_state() is gone from
page-writeback.c why do we need adjustable ratelimiting at all? It looks
like writeback_set_ratelimit() can be dropped, and fixed ratelimit used
instead.
Nikita.
Hi Kamezawa-san,
thanks for your reply.
KAMEZAWA Hiroyuki wrote:
> Interesting, but how about adjusting this parameter like below instead
> of adding a new control knob? (This kind of knob is not easy to use.)
>
> ==
> 	struct writeback_control wbc = {
> 		.bdi		= bdi,
> 		.sync_mode	= WB_SYNC_NONE,
> 		.older_than_this = NULL,
> 		.nr_to_write	= 0,
> 		.range_cyclic	= 1,
> 	};
> <snip>
> 	if (nr_reclaimable) {
> 		/* Just do what I can do */
> 		dirty_pages_on_device =
> 			count_dirty_pages_on_device_limited(bdi, writechunk);
> 		wbc.nr_to_write = dirty_pages_on_device;
> 		writeback_inodes(&wbc);
> 	}
> ==
>
> count_dirty_pages_on_device_limited(bdi, writechunk) above returns the
> number of dirty pages on bdi; if the number of dirty pages on bdi is
> larger than writechunk, it just returns writechunk.
I think that way is not enough to control the total amount of
Dirty+Writeback.
In that way, while writeback_inodes() scans for dirty pages and writes
them back, the caller will be blocked only if the length of the write-
requests queue exceeds nr_requests. If so, Writeback may consume tens
of MB of memory for each queue, because nr_requests is 128 and the
maximum size of a request is 512KB. If you have several devices, it can
consume more than a hundred MB of memory.
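To make the arithmetic concrete, a worked version of the figures above
(defaults around 2.6.20; the four-device case is just an example):
==
/* Worked example of the figures above. */
#define NR_REQUESTS		128		/* default queue length */
#define MAX_REQUEST_BYTES	(512 * 1024)	/* max request size */

/* Writeback memory pinned behind one full request queue:
 * 128 * 512KB = 64MB. */
#define PER_QUEUE_BYTES		(NR_REQUESTS * MAX_REQUEST_BYTES)

/* With, say, four busy devices: 4 * 64MB = 256MB of Writeback,
 * on top of whatever is still Dirty. */
==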
I was concerned about that, so I introduced dirty_limit_ratio to limit
the total amount of Dirty+Writeback pages.
Regards
--
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory
Hi Nikita,
thanks for your comments.
Nikita Danilov wrote:
>> While Dirty+Writeback pages occupy more than 40% of memory, process-B is
>> blocked in balance_dirty_pages() until writeback of some (`write_chunk',
>> typically = 1536) dirty pages on disk-b is started.
>
> May be the simpler solution is to use separate variables to control
> ratelimit and write chunk?
No, I think it's difficult to throttle the total Dirty+Writeback only
with write_chunk, because write_chunk just affects the Dirty and
Writeback of each device (in this case, throttling is done in the
write-requests queue of each backing device, as I said in another mail).
Throttling of the total Dirty+Writeback should also be done in the VM
itself, and to control that, I added `dirty_limit_ratio'.
> writeback_set_ratelimit() adjusts ratelimit_pages to avoid too frequent
> calls to balance_dirty_pages(), but once we are inside of
> writeback_inodes(), there is no need to write especially many pages in
> one go: overhead of any additional looping is negligible, when compared
> with the cost of writing.
>
> Speaking of which, now that expensive get_writeback_state() is gone from
> page-writeback.c why do we need adjustable ratelimiting at all? It looks
> like writeback_set_ratelimit() can be dropped, and fixed ratelimit used
> instead.
As far as I can see, adjustable ratelimiting is the actual cause of the
long wait on writing to a disk with a light load.
I think removing adjustable ratelimiting should be done in a separate
patch...
Regards
--
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory
On Tue, 27 Feb 2007 09:50:16 +0900
Tomoki Sekiyama <[email protected]> wrote:
> Hi Kamezawa-san,
>
> thanks for your reply.
>
> KAMEZAWA Hiroyuki wrote:
> > Interesting, but how about adjusting this parameter like below instead
> > of adding a new control knob? (This kind of knob is not easy to use.)
> >
> > ==
> > 	struct writeback_control wbc = {
> > 		.bdi		= bdi,
> > 		.sync_mode	= WB_SYNC_NONE,
> > 		.older_than_this = NULL,
> > 		.nr_to_write	= 0,
> > 		.range_cyclic	= 1,
> > 	};
> > <snip>
> > 	if (nr_reclaimable) {
> > 		/* Just do what I can do */
> > 		dirty_pages_on_device =
> > 			count_dirty_pages_on_device_limited(bdi, writechunk);
> > 		wbc.nr_to_write = dirty_pages_on_device;
> > 		writeback_inodes(&wbc);
> > 	}
> > ==
> >
> > count_dirty_pages_on_device_limited(bdi, writechunk) above returns the
> > number of dirty pages on bdi; if the number of dirty pages on bdi is
> > larger than writechunk, it just returns writechunk.
>
>
> I think that way is not enough to control the total amount of
> Dirty+Writeback.
>
> In that way, while writeback_inodes() scans for dirty pages and writes
> them back, the caller will be blocked only if the length of the write-
> requests queue exceeds nr_requests.
What does nr_requests mean?
But OK, maybe I'm not understanding. What I want to ask you is: do
per-device write throttling rather than adding a new parameter.
Bye.
-Kame