2007-01-09 08:58:00

by NeilBrown

Subject: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines


Imagine a machine with lots of memory - say 100Gig.

Suppose there is one (largish) filesystem that is ext3 (or maybe
reiser) with the default data=ordered.

Suppose this filesystem is being written to steadily so that the
maximum amount of memory is always dirty. With the default
vm.dirty_ratio of 40%, this could be 40Gig.

When the journal triggers a commit, all the dirty data needs to be
flushed out in order to adhere to the "data=ordered" semantics.
This can take a while.
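
(To put a rough number on "a while": assuming the storage can sustain,
say, 200MB/sec of writeback - an optimistic figure for a single array -
flushing 40Gig of dirty data is 40960MB / 200MB/sec, or well over three
minutes.)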

While this is happening, small updates such as an 'atime' update can
block waiting for the journal to be unlocked again after the flush.

Waiting for 40gig to flush for an atime update to complete is clearly
unsatisfactory.

We can reduce the amount of dirty memory by lowering vm.dirty_ratio,
but even setting it down to 1 still allows 1Gig of dirty data, which
can cause unpleasant pauses (and that was on a kernel where '1' still
meant something; in current kernels, '5' is the effective minimum).

So this patch removes the minimum of '5' and introduces a new
tunable, 'vm.dirty_kb', which sets an upper limit in kibibytes.

This allows the amount of dirty memory to be limited to - say - 50M
which should flush fast enough.

So: is this patch acceptable? And should a lower default value for
vm_dirty_kb be used?


Some of the details in the above description might not be 100%
accurate (I'm not sure of the exact connection between atime updates
and journal commits). The symptoms are:
While generating constant write traffic on a machine with > 20Gig
of RAM, performing assorted read-only operations can sometimes
produce a pause of tens of seconds.
The pause can be removed by:
- mounting noatime
- mounting data=writeback
- setting vm.dirty_kb to 1000 with this patch.

Maybe the problem is really just in atime updates, but I feel that it
is broader than that.

Thanks for any comments.

NeilBrown

-----------------
Allow a fixed limit on the amount of dirty memory.


On large memory machines, an integer percentage (dirty_ratio) does not
allow sufficiently fine control over the limit on the amount of dirty
memory, especially when that percentage is forced to be >= 5.

So remove the >=5 restriction and introduce 'vm_dirty_kb', which sets
an upper limit, in kibibytes, on the amount of dirty memory.

Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./include/linux/writeback.h |    1 +
./kernel/sysctl.c           |   11 +++++++++++
./mm/page-writeback.c       |   19 +++++++++++++++----
3 files changed, 27 insertions(+), 4 deletions(-)

diff .prev/include/linux/writeback.h ./include/linux/writeback.h
--- .prev/include/linux/writeback.h 2007-01-09 17:16:00.000000000 +1100
+++ ./include/linux/writeback.h 2007-01-09 17:16:31.000000000 +1100
@@ -95,6 +95,7 @@ static inline int laptop_spinned_down(vo
/* These are exported to sysctl. */
extern int dirty_background_ratio;
extern int vm_dirty_ratio;
+extern int vm_dirty_kb;
extern int dirty_writeback_interval;
extern int dirty_expire_interval;
extern int block_dump;

diff .prev/kernel/sysctl.c ./kernel/sysctl.c
--- .prev/kernel/sysctl.c 2007-01-09 17:16:00.000000000 +1100
+++ ./kernel/sysctl.c 2007-01-09 17:17:57.000000000 +1100
@@ -860,6 +860,17 @@ static ctl_table vm_table[] = {
 		.extra2		= &one_hundred,
 	},
 	{
+		.ctl_name	= -2,
+		.procname	= "dirty_kb",
+		.data		= &vm_dirty_kb,
+		.maxlen		= sizeof(vm_dirty_kb),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= NULL,
+	},
+	{
 		.ctl_name	= VM_DIRTY_WB_CS,
 		.procname	= "dirty_writeback_centisecs",
 		.data		= &dirty_writeback_interval,

diff .prev/mm/page-writeback.c ./mm/page-writeback.c
--- .prev/mm/page-writeback.c 2007-01-09 17:16:00.000000000 +1100
+++ ./mm/page-writeback.c 2007-01-09 17:52:55.000000000 +1100
@@ -75,6 +75,11 @@ int dirty_background_ratio = 10;
 int vm_dirty_ratio = 40;
 
 /*
+ * If that percentage exceeds this limit, use this instead
+ */
+int vm_dirty_kb = 10000000; /* 10 gigabytes, way too much really */
+
+/*
  * The interval between `kupdate'-style writebacks, in jiffies
  */
 int dirty_writeback_interval = 5 * HZ;
@@ -149,15 +154,21 @@ get_dirty_limits(long *pbackground, long
 	if (dirty_ratio > unmapped_ratio / 2)
 		dirty_ratio = unmapped_ratio / 2;
 
-	if (dirty_ratio < 5)
-		dirty_ratio = 5;
-
 	background_ratio = dirty_background_ratio;
 	if (background_ratio >= dirty_ratio)
 		background_ratio = dirty_ratio / 2;
+	if (dirty_background_ratio && !background_ratio)
+		background_ratio = 1;
 
-	background = (background_ratio * available_memory) / 100;
 	dirty = (dirty_ratio * available_memory) / 100;
+	if (dirty > vm_dirty_kb / (PAGE_SIZE/1024))
+		dirty = vm_dirty_kb / (PAGE_SIZE/1024);
+	if (dirty_ratio == 0)
+		background = 0;
+	else if (background_ratio >= dirty_ratio)
+		background = dirty / 2;
+	else
+		background = dirty * background_ratio / dirty_ratio;
 	tsk = current;
 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
 		background += background / 4;
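
For anyone who wants to see what the clamping does with concrete
numbers, here is a stand-alone user-space sketch of the arithmetic
above (illustrative only: 4k pages and the memory/tunable values are
just the examples from this mail, and the background_ratio clamping
in the real function is simplified away):

#include <stdio.h>

#define PAGE_SIZE 4096	/* assume 4k pages for the example */

int main(void)
{
	long total_kb = 100L * 1024 * 1024;	/* the 100Gig machine */
	long available_memory = total_kb / (PAGE_SIZE / 1024);	/* in pages */
	int dirty_ratio = 40;		/* vm.dirty_ratio default */
	int background_ratio = 10;	/* vm.dirty_background_ratio default */
	long vm_dirty_kb = 50 * 1024;	/* the suggested 50M cap */
	long dirty, background;

	dirty = ((long)dirty_ratio * available_memory) / 100;
	/* the new clamp: never allow more dirty pages than vm_dirty_kb */
	if (dirty > vm_dirty_kb / (PAGE_SIZE / 1024))
		dirty = vm_dirty_kb / (PAGE_SIZE / 1024);

	/* background keeps the same proportion to the clamped 'dirty'
	 * figure that the two ratios have to each other */
	background = dirty * background_ratio / dirty_ratio;

	printf("dirty:      %ld pages (%ld kB)\n",
	       dirty, dirty * (PAGE_SIZE / 1024));
	printf("background: %ld pages (%ld kB)\n",
	       background, background * (PAGE_SIZE / 1024));
	return 0;
}

Without the clamp, 'dirty' comes out at 10485760 pages (40Gig); with
vm_dirty_kb=51200 it drops to 12800 pages (50M), and 'background'
scales down with it to 3200 pages (12.5M).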


2007-01-09 10:10:26

by Andrew Morton

Subject: Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

On Tue, 9 Jan 2007 19:57:50 +1100
Neil Brown <[email protected]> wrote:

>
> Imagine a machine with lots of memory - say 100Gig.
>
> Suppose there is one (largish) filesystem that is ext3 (or maybe
> reiser) with the default data=ordered.
>
> Suppose this filesystem is being written to steadily so that the
> maximum amount of memory is always dirty. With the default
> vm.dirty_ratio of 40%, this could be 40Gig.
>
> When the journal triggers a commit, all the dirty data needs to be
> flushed out in order to adhere to the "data=ordered" semantics.
> This can take a while.
>
> While this is happening, small updates such as an 'atime' update can
> block waiting for the journal to be unlocked again after the flush.

Actually, ext3 doesn't work that way. The atime update will go into the
"running transaction", which is an instance of journal_t which is separate
from the committing transaction.

But there are situations (i.e. journal free-space exhaustion) where things
can go synchronous. They're more likely to occur during metadata storms
though, and perhaps indicate an undersized journal.

But yeah, overall point agreed with.

> Waiting for 40gig to flush for an atime update to complete is clearly
> unsatisfactory.
>
> We can reduce the amount of dirty memory by lowering vm.dirty_ratio,
> but even setting it down to 1 still allows 1Gig of dirty data, which
> can cause unpleasant pauses (and that was on a kernel where '1' still
> meant something; in current kernels, '5' is the effective minimum).
>
> So this patch removes the minimum of '5' and introduces a new
> tunable, 'vm.dirty_kb', which sets an upper limit in kibibytes.

kibibytes? We're feeding the kernel catfood now?

> This allows the amount of dirty memory to be limited to - say - 50M
> which should flush fast enough.
>
> So: is this patch acceptable? And should a lower default value for
> vm_dirty_kb be used?
>
>
> Some of the details in the above description might not be 100%
> accurate (I'm not sure of the exact connection between atime updates
> and journal commits). The symptoms are:
> While generating constant write traffic on a machine with > 20Gig
> of RAM, performing assorted read-only operations can sometimes
> produce a pause of tens of seconds.
> The pause can be removed by:
> - mounting noatime
> - mounting data=writeback
> - setting vm.dirty_kb to 1000 with this patch.

Could be IO scheduler borkage, could be ext3 borkage. A well-timed sysrq-T
will tell us, and is worth doing (please).

Does increasing the journal size help?

> @@ -149,15 +154,21 @@ get_dirty_limits(long *pbackground, long
> 	if (dirty_ratio > unmapped_ratio / 2)
> 		dirty_ratio = unmapped_ratio / 2;
>
> -	if (dirty_ratio < 5)
> -		dirty_ratio = 5;
> -
> 	background_ratio = dirty_background_ratio;
> 	if (background_ratio >= dirty_ratio)
> 		background_ratio = dirty_ratio / 2;
> +	if (dirty_background_ratio && !background_ratio)
> +		background_ratio = 1;
>
> -	background = (background_ratio * available_memory) / 100;
> 	dirty = (dirty_ratio * available_memory) / 100;
> +	if (dirty > vm_dirty_kb / (PAGE_SIZE/1024))
> +		dirty = vm_dirty_kb / (PAGE_SIZE/1024);
> +	if (dirty_ratio == 0)
> +		background = 0;
> +	else if (background_ratio >= dirty_ratio)
> +		background = dirty / 2;
> +	else
> +		background = dirty * background_ratio / dirty_ratio;
> 	tsk = current;
> 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> 		background += background / 4;

It would be better if we can avoid creating the second global variable. Is
it not possible to remove dirty_ratio? Make everything work off
vm_dirty_kb and do arithmetricks at the /proc/sys/vm/dirty_ratio interface?

We should perform the same conversion to dirty_background_ratio, I suspect.

And these guys should be `long', not `int'. Otherwise things will go
pearshaped at 2 tabbybytes.

2007-01-10 03:29:57

by NeilBrown

Subject: Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

On Tuesday January 9, [email protected] wrote:
>
> Actually, ext3 doesn't work that way. The atime update will go into the
> "running transaction", which is an instance of journal_t which is separate
> from the committing transaction.

Hmm... fair enough. start_this_handle (which is called eventually
from ext3_dirty_inode) seems to wait for a few different things, and I
jumped to some conclusions.

>
> > But there are situations (i.e. journal free-space exhaustion) where things
> can go synchronous. They're more likely to occur during metadata storms
> though, and perhaps indicate an undersized journal.
>
> But yeah, overall point agreed with.

Thanks.

> > So this patch removes the minimum of '5' and introduces a new
> > tunable, 'vm.dirty_kb', which sets an upper limit in kibibytes.
>
> kibibytes? We're feeding the kernel catfood now?

:-)

> > and journal commits). The symptoms are:
> > While generating constant write traffic on a machine with > 20Gig
> > of RAM, performing assorted read-only operations can sometimes
> > produce a pause of tens of seconds.
> > The pause can be removed by:
> > - mounting noatime
> > - mounting data=writeback
> > - setting vm.dirty_kb to 1000 with this patch.
>
> Could be IO scheduler borkage, could be ext3 borkage. A well-timed sysrq-T
> will tell us, and is worth doing (please).
>
> Does increasing the journal size help?

No, that was tried.

>
> It would be better if we can avoid creating the second global variable. Is
> it not possible to remove dirty_ratio? Make everything work off
> vm_dirty_kb and do arithmetricks at the /proc/sys/vm/dirty_ratio interface?

Uhmmm... not sure what you are thinking.
I guess we could teach vm.dirty_ratio to take a floating point number
(but does sysctl understand that?) so we could set it to 0.01 or
similar, but that is missing the point in a way. We don't really want
to set a small ratio. We want to set a small maximum number.

It could make lots of sense to have two numbers. A ratio that wins on
a small memory machine and a fixed number that wins on a large memory
machine. Different trade-offs are more significant in the different
cases.

>
> We should perform the same conversion to dirty_background_ratio, I suspect.
>

I didn't add a fixed limit for dirty_background_ratio as it seemed
reasonable to assume that (dirty_background_ratio / dirty_ratio) was a
meaningful value, and just multiplied the final 'dirty' figure by this
ratio to get the 'background' figure.
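
(With the defaults - dirty_background_ratio=10 and dirty_ratio=40 -
'background' ends up at a quarter of whatever the final 'dirty' figure
is.)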

> And these guys should be `long', not `int'. Otherwise things will go
> pearshaped at 2 tabbybytes.

I don't think so. You would need to have blindingly fast storage
before there would be any interest in vm_dirty_kb getting anything
close to t*bytes. But I guess we can make it 'unsigned long' if it
helps.

Thanks,
NeilBrown

2007-01-10 03:04:57

by NeilBrown

Subject: Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

On Tuesday January 9, [email protected] wrote:
>
> Could be IO scheduler borkage, could be ext3 borkage. A well-timed sysrq-T
> will tell us, and is worth doing (please).

The problem has been reported against reiserfs and ext3, and against
SLES9 and SLES10. The big machine I can test with is currently
running SLES9 (2.6.5 (plus lots of stuff)) and has a reiserfs
filesystem. In that config, the blocked process seems to be:

Jan 10 02:19:18 macallan kernel: sh D a000000100037b60 0 9852 9815 (NOTLB)
Jan 10 02:19:18 macallan kernel:
Jan 10 02:19:18 macallan kernel: Call Trace:
Jan 10 02:19:18 macallan kernel: [<a000000100098970>] schedule+0xf50/0x2b60
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da79c0 bsp=e000001fe9da1550
Jan 10 02:19:18 macallan kernel: [<a000000100037b60>] __down+0x200/0x320
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da79f0 bsp=e000001fe9da14e8
Jan 10 02:19:18 macallan kernel: [<a00000021c2e0f30>] do_journal_begin_r+0x3b0/0x880 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7a20 bsp=e000001fe9da1468
Jan 10 02:19:18 macallan kernel: [<a00000021c2e1920>] journal_begin+0x260/0x360 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7a60 bsp=e000001fe9da1430
Jan 10 02:19:18 macallan kernel: [<a00000021c2b0ac0>] reiserfs_dirty_inode+0xe0/0x240 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7a60 bsp=e000001fe9da1408
Jan 10 02:19:18 macallan kernel: [<a0000001001c76f0>] __mark_inode_dirty+0x330/0x340
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7aa0 bsp=e000001fe9da13c0
Jan 10 02:19:18 macallan kernel: [<a0000001001b4400>] __update_atime+0x180/0x200
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7aa0 bsp=e000001fe9da1380
Jan 10 02:19:18 macallan kernel: [<a000000100111e80>] do_generic_mapping_read+0x8a0/0x1300
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7ac0 bsp=e000001fe9da12b0
Jan 10 02:19:18 macallan kernel: [<a0000001001141c0>] __generic_file_aio_read+0x2c0/0x400
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7b20 bsp=e000001fe9da1240
Jan 10 02:19:18 macallan kernel: [<a000000100114550>] generic_file_read+0x110/0x160
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7b40 bsp=e000001fe9da1200
Jan 10 02:19:18 macallan kernel: [<a00000010016ebb0>] vfs_read+0x250/0x3a0
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7c30 bsp=e000001fe9da11a8
Jan 10 02:19:18 macallan kernel: [<a00000010018ddb0>] kernel_read+0x50/0x80
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7c30 bsp=e000001fe9da1168
Jan 10 02:19:18 macallan kernel: [<a00000010018f490>] do_execve+0x210/0x760
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7c40 bsp=e000001fe9da10e8
Jan 10 02:19:18 macallan kernel: [<a0000001000184a0>] sys_execve+0x60/0xc0
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7e30 bsp=e000001fe9da10b0
Jan 10 02:19:18 macallan kernel: [<a00000010000f770>] ia64_execve+0x30/0x160
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7e30 bsp=e000001fe9da1060
Jan 10 02:19:18 macallan kernel: [<a000000100010060>] ia64_ret_from_syscall+0x0/0x20
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da7e30 bsp=e000001fe9da1060
Jan 10 02:19:18 macallan kernel: [<a000000000010640>] 0xa000000000010640
Jan 10 02:19:18 macallan kernel: sp=e000001fe9da8000 bsp=e000001fe9da1060

while the background generate-dirty-data process is:

Jan 10 02:19:18 macallan kernel: cp D a000000100037b60 0 9814 9800 (NOTLB)
Jan 10 02:19:18 macallan kernel:
Jan 10 02:19:18 macallan kernel: Call Trace:
Jan 10 02:19:18 macallan kernel: [<a000000100098970>] schedule+0xf50/0x2b60
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97b50 bsp=e000001fe2f91788
Jan 10 02:19:18 macallan kernel: [<a000000100037b60>] __down+0x200/0x320
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97b80 bsp=e000001fe2f91720
Jan 10 02:19:18 macallan kernel: [<a00000021c2d6340>] flush_commit_list+0x9e0/0xec0 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97bb0 bsp=e000001fe2f91678
Jan 10 02:19:18 macallan kernel: [<a00000021c2d69c0>] flush_older_commits+0x1a0/0x220 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97bc0 bsp=e000001fe2f91628
Jan 10 02:19:18 macallan kernel: [<a00000021c2d5f40>] flush_commit_list+0x5e0/0xec0 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97bc0 bsp=e000001fe2f91578
Jan 10 02:19:18 macallan kernel: [<a00000021c2e08d0>] do_journal_end+0x18f0/0x1ba0 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97bd0 bsp=e000001fe2f91428
Jan 10 02:19:18 macallan kernel: [<a00000021c2980c0>] restart_transaction+0x100/0x1e0 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97be0 bsp=e000001fe2f913e8
Jan 10 02:19:18 macallan kernel: [<a00000021c2a1130>] reiserfs_allocate_blocks_for_region+0x410/0x2d40 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97be0 bsp=e000001fe2f91300
Jan 10 02:19:18 macallan kernel: [<a00000021c2a45f0>] reiserfs_file_write+0xb90/0x1020 [reiserfs]
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97d10 bsp=e000001fe2f91218
Jan 10 02:19:18 macallan kernel: [<a00000010016e190>] vfs_write+0x250/0x3a0
Jan 10 02:19:18 macallan kernel: sp=e000001fe2f97de0 bsp=e000001fe2f911c0
Jan 10 02:19:18 macallan kernel: [<a00000010016e700>] sys_write+0x140/0x220


and pdflush is doing:

Jan 10 02:19:17 macallan kernel: pdflush D a00000010009b410 0 81 10 83 80 (L-TLB)
Jan 10 02:19:17 macallan kernel:
Jan 10 02:19:17 macallan kernel: Call Trace:
Jan 10 02:19:17 macallan kernel: [<a000000100098970>] schedule+0xf50/0x2b60
Jan 10 02:19:17 macallan kernel: sp=e000000039e179f0 bsp=e000000039e11710
Jan 10 02:19:17 macallan kernel: [<a00000010009b410>] io_schedule+0xb0/0x1a0
Jan 10 02:19:17 macallan kernel: sp=e000000039e17a20 bsp=e000000039e116e8
Jan 10 02:19:17 macallan kernel: [<a0000001003e9aa0>] get_request_wait+0x200/0x240
Jan 10 02:19:17 macallan kernel: sp=e000000039e17a20 bsp=e000000039e116a8
Jan 10 02:19:17 macallan kernel: [<a0000001003ea5c0>] __make_request+0x340/0xf40
Jan 10 02:19:17 macallan kernel: sp=e000000039e17a80 bsp=e000000039e11640
Jan 10 02:19:17 macallan kernel: [<a0000001003e6f40>] generic_make_request+0x2c0/0x440
Jan 10 02:19:17 macallan kernel: sp=e000000039e17a90 bsp=e000000039e115f0
Jan 10 02:19:17 macallan kernel: [<a0000001003e7220>] submit_bio+0x160/0x300
Jan 10 02:19:17 macallan kernel: sp=e000000039e17ab0 bsp=e000000039e115b8
Jan 10 02:19:17 macallan kernel: [<a000000100172d40>] submit_bh+0x360/0x420
Jan 10 02:19:17 macallan kernel: sp=e000000039e17ad0 bsp=e000000039e11578
Jan 10 02:19:17 macallan kernel: [<a00000021c2d4330>] write_ordered_chunk+0x110/0x180 [reiserfs]
Jan 10 02:19:17 macallan kernel: sp=e000000039e17ad0 bsp=e000000039e11530
Jan 10 02:19:17 macallan kernel: [<a00000021c2d30b0>] add_to_chunk+0xb0/0x140 [reiserfs]
Jan 10 02:19:17 macallan kernel: sp=e000000039e17ad0 bsp=e000000039e114e8
Jan 10 02:19:17 macallan kernel: [<a00000021c2d5610>] write_ordered_buffers+0x6d0/0x9c0 [reiserfs]
Jan 10 02:19:17 macallan kernel: sp=e000000039e17ad0 bsp=e000000039e11470
Jan 10 02:19:17 macallan kernel: [<a00000021c2d5c10>] flush_commit_list+0x2b0/0xec0 [reiserfs]
Jan 10 02:19:17 macallan kernel: sp=e000000039e17c00 bsp=e000000039e113c0
Jan 10 02:19:17 macallan kernel: [<a00000021c2d6ec0>] check_journal_end+0x480/0x740 [reiserfs]
Jan 10 02:19:17 macallan kernel: sp=e000000039e17c10 bsp=e000000039e11358
Jan 10 02:19:17 macallan kernel: [<a00000021c2df1c0>] do_journal_end+0x1e0/0x1ba0 [reiserfs]
Jan 10 02:19:17 macallan kernel: sp=e000000039e17c10 bsp=e000000039e11208
Jan 10 02:19:17 macallan kernel: [<a00000021c2b0480>] reiserfs_write_super+0x1a0/0x200 [reiserfs]
Jan 10 02:19:17 macallan kernel: sp=e000000039e17c20 bsp=e000000039e111e0
Jan 10 02:19:17 macallan kernel: [<a000000100181950>] sync_supers+0x290/0x2a0
Jan 10 02:19:17 macallan kernel: sp=e000000039e17c60 bsp=e000000039e111a0
Jan 10 02:19:17 macallan kernel: [<a000000100121c80>] wb_kupdate+0xa0/0x360
Jan 10 02:19:17 macallan kernel: sp=e000000039e17c60 bsp=e000000039e11148
Jan 10 02:19:17 macallan kernel: [<a000000100122dc0>] pdflush+0x320/0x580
Jan 10 02:19:17 macallan kernel: sp=e000000039e17de0 bsp=e000000039e110b0
Jan 10 02:19:17 macallan kernel: [<a0000001000ddc40>] kthread+0x220/0x280
Jan 10 02:19:17 macallan kernel: sp=e000000039e17e10 bsp=e000000039e11068
Jan 10 02:19:17 macallan kernel: [<a000000100018200>] kernel_thread_helper+0xe0/0x100
Jan 10 02:19:17 macallan kernel: sp=e000000039e17e30 bsp=e000000039e11040
Jan 10 02:19:17 macallan kernel: [<a000000100009060>] start_kernel_thread+0x20/0x40
Jan 10 02:19:17 macallan kernel: sp=e000000039e17e30 bsp=e000000039e11040


I'll see about getting an ext3 trace on a more recent kernel.


NeilBrown

2007-01-10 03:41:10

by Andrew Morton

Subject: Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

On Wed, 10 Jan 2007 14:29:35 +1100
Neil Brown <[email protected]> wrote:

> >
> > It would be better if we can avoid creating the second global variable. Is
> > it not possible to remove dirty_ratio? Make everything work off
> > vm_dirty_kb and do arithmetricks at the /proc/sys/vm/dirty_ratio interface?
>
> Uhmmm... not sure what you are thinking.
> I guess we could teach vm.dirty_ratio to take a floating point number
> (but does sysctl understand that?) so we could set it to 0.01 or
> similar, but that is missing the point in a way. We don't really want
> to set a small ratio. We want to set a small maximum number.

I mean remove the kernel-internal dirty_ratio variable and use
/proc/sys/vm/dirty_ratio as an accessor to `long vm_dirty_kb', with
appropriate conversions when /proc/sys/vm/dirty_ratio is written to and
read from.
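
Something along these lines, perhaps (a user-space sketch of just the
conversion arithmetic; the sysctl handler plumbing is left out, and in
the kernel total_kb would come from the global page counts rather than
being hard-coded):

#include <stdio.h>

static long vm_dirty_kb;			/* the only real variable */
static long total_kb = 8L * 1024 * 1024;	/* pretend 8Gig of memory */

/* a write to /proc/sys/vm/dirty_ratio converts percent -> kB */
static void dirty_ratio_write(int ratio)
{
	vm_dirty_kb = total_kb * ratio / 100;
}

/* a read converts kB -> percent */
static int dirty_ratio_read(void)
{
	return (int)(vm_dirty_kb * 100 / total_kb);
}

int main(void)
{
	dirty_ratio_write(40);
	printf("vm_dirty_kb = %ld\n", vm_dirty_kb);
	/* note the truncation: 40 reads back as 39 here, so a real
	 * accessor would have to round with some care */
	printf("dirty_ratio = %d\n", dirty_ratio_read());
	return 0;
}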

> It could make lots of sense to have two numbers. A ratio that wins on
> a small memory machine and a fixed number that wins on a large memory
> machine. Different trade-offs are more significant in the different
> cases.

hm.

> >
> > We should perform the same conversion to dirty_background_ratio, I suspect.
> >
>
> I didn't add a fixed limit for dirty_background_ratio as it seemed
> reasonable to assume that (dirty_background_ratio / dirty_ratio) was a
> meaningful value, and just multiplied the final 'dirty' figure by this
> ratio to get the 'background' figure.

Sounds complex. Better, I think, to create (and recommend) vm_dirty_kb and
vm_dirty_background_kb and deprecate the old knobs.

> > And these guys should be `long', not `int'. Otherwise things will go
> > pearshaped at 2 tabbybytes.
>
> I don't think so. You would need to have blindingly fast storage
> before there would be any interest in vm_dirty_kb getting anything
> close to t*bytes. But I guess we can make it 'unsigned long' if it
> helps.
>

A 16TB machine would overflow that int by default: 16TB is 2^34 kB,
so the 40% default works out to ~6.9e9 kB, well past the ~2.1e9 that
a 32-bit int can hold.

2007-01-11 11:04:03

by dean gaudet

Subject: Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

On Tue, 9 Jan 2007, Neil Brown wrote:

> Imagine a machine with lots of memory - say 100Gig.

i've had these problems on machines as "small" as 8GiB. the real problem
is that the kernel will let millions of potential (write) IO ops stack up
for a device which can handle mere 100s of IOs per second. (and i'm
not convinced it does the IOs in a sane order when it has millions to
choose from)

replacing the percentage-based dirty_ratio / dirty_background_ratio with
sane kibibyte units is a good fix... but i'm not sure it's sufficient.

it seems like the "flow control" mechanism (i.e. dirty_ratio) should be on
a device basis...

try running doug ledford's memtest.sh on an 8GiB box with a single disk,
let it go a few minutes then ^C and type "sync". i've had to wait 10
minutes (2.6.18 with default vm settings).

it makes it hard to guarantee a box can shut down quickly -- nasty for
setting up UPS on-battery timeouts for example.

-dean

2007-01-11 20:21:52

by Andrew Morton

Subject: Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

On Thu, 11 Jan 2007 03:04:00 -0800 (PST)
dean gaudet <[email protected]> wrote:

> On Tue, 9 Jan 2007, Neil Brown wrote:
>
> > Imagine a machine with lots of memory - say 100Gig.
>
> i've had these problems on machines as "small" as 8GiB. the real problem
> is that the kernel will let millions of potential (write) IO ops stack up
> > for a device which can handle mere 100s of IOs per second. (and i'm
> not convinced it does the IOs in a sane order when it has millions to
> choose from)
>
> replacing the percentage-based dirty_ratio / dirty_background_ratio with
> sane kibibyte units is a good fix... but i'm not sure it's sufficient.
>
> it seems like the "flow control" mechanism (i.e. dirty_ratio) should be on
> a device basis...
>
> try running doug ledford's memtest.sh on an 8GiB box with a single disk,
> let it go a few minutes then ^C and type "sync". i've had to wait 10
> minutes (2.6.18 with default vm settings).
>
> it makes it hard to guarantee a box can shut down quickly -- nasty for
> setting up UPS on-battery timeouts for example.
>

Increasing the request queue size should help there
(/sys/block/sda/queue/nr_requests). Maybe 25% or more benefit with that
test, at a guess.

Probably initscripts should do that rather than leaving the kernel defaults
in place. It's a bit tricky for the kernel to do because the decision
depends upon the number of disks in the system, as well as the amount of
memory.

Or perhaps the kernel should implement a system-wide limit on the number of
requests in flight. While avoiding per-device starvation. Tricky.

2007-01-11 22:35:09

by dean gaudet

Subject: Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

On Thu, 11 Jan 2007, Andrew Morton wrote:

> On Thu, 11 Jan 2007 03:04:00 -0800 (PST)
> dean gaudet <[email protected]> wrote:
>
> > On Tue, 9 Jan 2007, Neil Brown wrote:
> >
> > > Imagine a machine with lots of memory - say 100Gig.
> >
> > i've had these problems on machines as "small" as 8GiB. the real problem
> > is that the kernel will let millions of potential (write) IO ops stack up
> > for a device which can handle mere 100s of IOs per second. (and i'm
> > not convinced it does the IOs in a sane order when it has millions to
> > choose from)
> >
> > replacing the percentage-based dirty_ratio / dirty_background_ratio with
> > sane kibibyte units is a good fix... but i'm not sure it's sufficient.
> >
> > it seems like the "flow control" mechanism (i.e. dirty_ratio) should be on
> > a device basis...
> >
> > try running doug ledford's memtest.sh on an 8GiB box with a single disk,
> > let it go a few minutes then ^C and type "sync". i've had to wait 10
> > minutes (2.6.18 with default vm settings).
> >
> > it makes it hard to guarantee a box can shut down quickly -- nasty for
> > setting up UPS on-battery timeouts for example.
> >
>
> Increasing the request queue size should help there
> (/sys/block/sda/queue/nr_requests). Maybe 25% or more benefit with that
> test, at a guess.

hmm i've never had much luck with increasing nr_requests... if i get a
chance i'll reproduce the problem and try that.


> Probably initscripts should do that rather than leaving the kernel defaults
> in place. It's a bit tricky for the kernel to do because the decision
> depends upon the number of disks in the system, as well as the amount of
> memory.
>
> Or perhaps the kernel should implement a system-wide limit on the number of
> requests in flight. While avoiding per-device starvation. Tricky.

actually a global dirty_ratio causes interference between devices which
should otherwise not block each other...

if you set up a "dd if=/dev/zero of=/dev/sdb bs=1M" it shouldn't affect
write performance on sda -- but it does... because the dd basically
dirties all of the "dirty_background_ratio" pages and then any task
writing to sda has to block in the foreground... (i've had this happen in
practice -- my hack fix is oflag=direct on the dd... but the problem still
exists.)

i'm not saying fixing any of this is easy, i'm just being a user griping
about it :)

-dean

2007-01-11 22:48:52

by Andrew Morton

Subject: Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

On Thu, 11 Jan 2007 14:35:06 -0800 (PST)
dean gaudet <[email protected]> wrote:

> actually a global dirty_ratio causes interference between devices which
> should otherwise not block each other...
>
> if you set up a "dd if=/dev/zero of=/dev/sdb bs=1M" it shouldn't affect
> write performance on sda -- but it does... because the dd basically
> dirties all of the "dirty_background_ratio" pages and then any task
> writing to sda has to block in the foreground... (i've had this happen in
> practice -- my hack fix is oflag=direct on the dd... but the problem still
> exists.)

yeah. Plus your heavy-dd-to-/dev/sda tends to block light-writers to
/dev/sda in perhaps disproportionate ways.

This is on my list of things to look at. Hah.

> i'm not saying fixing any of this is easy, i'm just being a user griping
> about it :)

It's rather complex, I believe. Needs per-backing-dev dirty counts (already
in -mm) plus, I suspect, per-process dirty counts (possibly derivable from
per-task-io-accounting) plus some tricky logic to make all that work along
with global dirtiness (and later per-node dirtiness!) while meeting all the
constraints which that logic must satisfy.
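
To sketch the shape of that (purely illustrative - none of these
structures or names are real kernel API, and the real thing would need
locking and the per-task state mentioned above):

#include <stdio.h>

/* hypothetical per-backing-device dirty accounting */
struct bdi_accounting {
	long nr_dirty;		/* dirty pages backed by this device */
	long dirty_thresh;	/* this device's share of the limit */
};

static long global_nr_dirty;
static long global_dirty_thresh = 400000;

/* throttle a writer if its own device - or the system as a whole -
 * is over the limit, so a heavy dd to sdb can be throttled without
 * penalising a light writer to sda */
static int should_throttle(const struct bdi_accounting *bdi)
{
	return bdi->nr_dirty > bdi->dirty_thresh ||
	       global_nr_dirty > global_dirty_thresh;
}

int main(void)
{
	struct bdi_accounting sda = { .nr_dirty = 1000,   .dirty_thresh = 100000 };
	struct bdi_accounting sdb = { .nr_dirty = 150000, .dirty_thresh = 100000 };

	global_nr_dirty = sda.nr_dirty + sdb.nr_dirty;

	printf("throttle sda writer: %d\n", should_throttle(&sda));	/* 0 */
	printf("throttle sdb writer: %d\n", should_throttle(&sdb));	/* 1 */
	return 0;
}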