2010-08-16 04:17:23

by Nikanth Karthikesan

Subject: [RFC][PATCH] Per file dirty limit throttling

When the total dirty pages exceed vm_dirty_ratio, the dirtier is made to do
the writeback. But this dirtier may not be the one who took the system to this
state. Instead, if we can track the dirty count per file, we could throttle
the dirtier of a file when the file's dirty pages exceed a certain limit.
Even though this dirtier may not be the one who dirtied the other pages of
this file, it is fair to throttle this process, as it uses that file.

This patch
1. Adds per-file dirty page accounting.
2. Exports the number of pages of this file in the page cache and the number
of dirty pages via proc fdinfo.
3. Adds a new tunable, /proc/sys/vm/file_dirty_bytes. When a file's dirty data
exceeds this limit, the writeback of that inode is done by the current
dirtier.

This certainly will affect the throughput of certain heavy-dirtying workloads,
but should help interactive systems.

Signed-off-by: Nikanth Karthikesan <[email protected]>

---

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b606c2c..4f8bc06 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -133,6 +133,15 @@ Setting this to zero disables periodic writeback altogether.

==============================================================

+file_dirty_bytes
+
+When a files total dirty data exceeds file_dirty_bytes, the current generator
+of dirty data would be made to do the writeback of dirty pages of that file.
+
+0 disables this behaviour.
+
+==============================================================
+
drop_caches

Writing to this will cause the kernel to drop clean caches, dentries and
diff --git a/fs/proc/base.c b/fs/proc/base.c
index a1c43e7..23becb2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -83,6 +83,8 @@
#include <linux/pid_namespace.h>
#include <linux/fs_struct.h>
#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/writeback.h>
#include "internal.h"

/* NOTE:
@@ -1791,7 +1793,7 @@ out:
return ~0U;
}

-#define PROC_FDINFO_MAX 64
+#define PROC_FDINFO_MAX 128

static int proc_fd_info(struct inode *inode, struct path *path, char *info)
{
@@ -1805,6 +1807,7 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
put_task_struct(task);
}
if (files) {
+ int infolen = 0;
/*
* We are not taking a ref to the file structure, so we must
* hold ->file_lock.
@@ -1816,12 +1819,24 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
*path = file->f_path;
path_get(&file->f_path);
}
- if (info)
- snprintf(info, PROC_FDINFO_MAX,
+ if (info) {
+ struct address_space *as = file->f_mapping;
+
+ infolen = snprintf(info, PROC_FDINFO_MAX,
"pos:\t%lli\n"
"flags:\t0%o\n",
(long long) file->f_pos,
file->f_flags);
+ if (as) {
+ snprintf(info + infolen, PROC_FDINFO_MAX - infolen,
+ "cache_kb:\t%lu\n"
+ "dirty_kb:\t%lu\n",
+ (as->nrpages * PAGE_SIZE) / 1024,
+ (as->nrdirty * PAGE_SIZE) / 1024);
+ }
+
+ }
+
spin_unlock(&files->file_lock);
put_files_struct(files);
return 0;
diff --git a/fs/read_write.c b/fs/read_write.c
index 74e3658..8881b7d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,8 @@
#include <linux/syscalls.h>
#include <linux/pagemap.h>
#include <linux/splice.h>
+#include <linux/buffer_head.h>
+#include <linux/writeback.h>
#include "read_write.h"

#include <asm/uaccess.h>
@@ -414,9 +416,19 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,

file = fget_light(fd, &fput_needed);
if (file) {
+ struct address_space *as = file->f_mapping;
+ unsigned long file_dirty_pages;
loff_t pos = file_pos_read(file);
+
ret = vfs_write(file, buf, count, &pos);
file_pos_write(file, pos);
+ /* Start write-out ? */
+ if (file_dirty_bytes) {
+ file_dirty_pages = file_dirty_bytes / PAGE_SIZE;
+ if (as->nrdirty > file_dirty_pages)
+ write_inode_now(as->host, 0);
+ }
+
fput_light(file, fput_needed);
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7a0625e..5de7f32 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -638,6 +638,7 @@ struct address_space {
spinlock_t i_mmap_lock; /* protect tree, count, list */
unsigned int truncate_count; /* Cover race condition with truncate */
unsigned long nrpages; /* number of total pages */
+ unsigned long nrdirty;
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 72a5d64..c31bf16 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -99,6 +99,7 @@ extern int dirty_background_ratio;
extern unsigned long dirty_background_bytes;
extern int vm_dirty_ratio;
extern unsigned long vm_dirty_bytes;
+extern unsigned long file_dirty_bytes;
extern unsigned int dirty_writeback_interval;
extern unsigned int dirty_expire_interval;
extern int vm_highmem_is_dirtyable;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ca38e8e..8b76b9f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1041,6 +1041,13 @@ static struct ctl_table vm_table[] = {
.extra1 = &dirty_bytes_min,
},
{
+ .procname = "file_dirty_bytes",
+ .data = &file_dirty_bytes,
+ .maxlen = sizeof(file_dirty_bytes),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ },
+ {
.procname = "dirty_writeback_centisecs",
.data = &dirty_writeback_interval,
.maxlen = sizeof(dirty_writeback_interval),
diff --git a/mm/memory.c b/mm/memory.c
index 9606ceb..0961f70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2873,6 +2873,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_fault vmf;
int ret;
int page_mkwrite = 0;
+ unsigned long file_dirty_pages;

vmf.virtual_address = (void __user *)(address & PAGE_MASK);
vmf.pgoff = pgoff;
@@ -3024,6 +3025,13 @@ out:
/* file_update_time outside page_lock */
if (vma->vm_file)
file_update_time(vma->vm_file);
+
+ /* Start write-back ? */
+ if (mapping && file_dirty_bytes) {
+ file_dirty_pages = file_dirty_bytes / PAGE_SIZE;
+ if (mapping->nrdirty > file_dirty_pages)
+ write_inode_now(mapping->host, 0);
+ }
} else {
unlock_page(vmf.page);
if (anon)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 20890d8..1cabd7f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -87,6 +87,13 @@ int vm_dirty_ratio = 20;
unsigned long vm_dirty_bytes;

/*
+ * When a files total dirty data exceeds file_dirty_bytes, the current generator
+ * of dirty data would be made to do the writeback of dirty pages of that file.
+ * 0 disables this behaviour.
+ */
+unsigned long file_dirty_bytes = 0;
+
+/*
* The interval between `kupdate'-style writebacks
*/
unsigned int dirty_writeback_interval = 5 * 100; /* centiseconds */
@@ -796,6 +803,8 @@ static struct notifier_block __cpuinitdata ratelimit_nb = {
void __init page_writeback_init(void)
{
int shift;
+ unsigned long background;
+ unsigned long dirty;

writeback_set_ratelimit();
register_cpu_notifier(&ratelimit_nb);
@@ -803,6 +812,8 @@ void __init page_writeback_init(void)
shift = calc_period_shift();
prop_descriptor_init(&vm_completions, shift);
prop_descriptor_init(&vm_dirties, shift);
+ global_dirty_limits(&background, &dirty);
+ file_dirty_bytes = dirty;
}

/**
@@ -1126,6 +1137,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ mapping->nrdirty++;
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
task_dirty_inc(current);
task_io_account_write(PAGE_CACHE_SIZE);
@@ -1301,6 +1313,7 @@ int clear_page_dirty_for_io(struct page *page)
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+ mapping->nrdirty--;
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
return 1;
diff --git a/mm/truncate.c b/mm/truncate.c
index ba887bf..5846d6a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -75,6 +75,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+ mapping->nrdirty--;
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
if (account_size)


2010-08-16 11:05:53

by Peter Zijlstra

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

On Mon, 2010-08-16 at 09:49 +0530, Nikanth Karthikesan wrote:
> When the total dirty pages exceed vm_dirty_ratio, the dirtier is made to do
> the writeback. But this dirtier may not be the one who took the system to this
> state. Instead, if we can track the dirty count per-file, we could throttle
> the dirtier of a file, when the file's dirty pages exceed a certain limit.
> Even though this dirtier may not be the one who dirtied the other pages of
> this file, it is fair to throttle this process, as it uses that file.
>
> This patch
> 1. Adds dirty page accounting per-file.
> 2. Exports the number of pages of this file in cache and no of pages dirty via
> proc-fdinfo.
> 3. Adds a new tunable, /proc/sys/vm/file_dirty_bytes. When a files dirty data
> exceeds this limit, the writeback of that inode is done by the current
> dirtier.
>
> This certainly will affect the throughput of certain heavy-dirtying workloads,
> but should help for interactive systems.

I'm not really charmed by this.. it adds another random variable to prod
at. Nor does it really tell me why you're wanting to do this. We already
have per-task influence on the dirty limits; a task that sporadically
dirties pages (your vi) will already end up with a higher dirty limit
than a task that does nothing but dirty pages (your dd).

> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index b606c2c..4f8bc06 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -133,6 +133,15 @@ Setting this to zero disables periodic writeback altogether.
>
> ==============================================================
>
> +file_dirty_bytes
> +
> +When a files total dirty data exceeds file_dirty_bytes, the current generator
> +of dirty data would be made to do the writeback of dirty pages of that file.
> +
> +0 disables this behaviour.
> +
> +==============================================================
> +
> drop_caches
>
> Writing to this will cause the kernel to drop clean caches, dentries and

> diff --git a/fs/read_write.c b/fs/read_write.c
> index 74e3658..8881b7d 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -16,6 +16,8 @@
> #include <linux/syscalls.h>
> #include <linux/pagemap.h>
> #include <linux/splice.h>
> +#include <linux/buffer_head.h>
> +#include <linux/writeback.h>
> #include "read_write.h"
>
> #include <asm/uaccess.h>
> @@ -414,9 +416,19 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
>
> file = fget_light(fd, &fput_needed);
> if (file) {
> + struct address_space *as = file->f_mapping;
> + unsigned long file_dirty_pages;
> loff_t pos = file_pos_read(file);
> +
> ret = vfs_write(file, buf, count, &pos);
> file_pos_write(file, pos);
> + /* Start write-out ? */
> + if (file_dirty_bytes) {
> + file_dirty_pages = file_dirty_bytes / PAGE_SIZE;
> + if (as->nrdirty > file_dirty_pages)
> + write_inode_now(as->host, 0);
> + }
> +
> fput_light(file, fput_needed);
> }


This seems wrong, wth are you doing it here and not in the generic
balance_dirty_pages thing called by set_page_dirty()?


> diff --git a/mm/memory.c b/mm/memory.c
> index 9606ceb..0961f70 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2873,6 +2873,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> struct vm_fault vmf;
> int ret;
> int page_mkwrite = 0;
> + unsigned long file_dirty_pages;
>
> vmf.virtual_address = (void __user *)(address & PAGE_MASK);
> vmf.pgoff = pgoff;
> @@ -3024,6 +3025,13 @@ out:
> /* file_update_time outside page_lock */
> if (vma->vm_file)
> file_update_time(vma->vm_file);
> +
> + /* Start write-back ? */
> + if (mapping && file_dirty_bytes) {
> + file_dirty_pages = file_dirty_bytes / PAGE_SIZE;
> + if (mapping->nrdirty > file_dirty_pages)
> + write_inode_now(mapping->host, 0);
> + }
> } else {
> unlock_page(vmf.page);
> if (anon)

Idem, replicating that code at every site that can dirty a page is just
wrong, hook into the regular set_page_dirty()->balance_dirty_pages()
code.

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 20890d8..1cabd7f 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -87,6 +87,13 @@ int vm_dirty_ratio = 20;
> unsigned long vm_dirty_bytes;
>
> /*
> + * When a files total dirty data exceeds file_dirty_bytes, the current generator
> + * of dirty data would be made to do the writeback of dirty pages of that file.
> + * 0 disables this behaviour.
> + */
> +unsigned long file_dirty_bytes = 0;

So you're adding an extra cacheline to dirty even though it's not used by
default; that seems like suckage..


> @@ -1126,6 +1137,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
> {
> if (mapping_cap_account_dirty(mapping)) {
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> + mapping->nrdirty++;
> __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> task_dirty_inc(current);
> task_io_account_write(PAGE_CACHE_SIZE);
> @@ -1301,6 +1313,7 @@ int clear_page_dirty_for_io(struct page *page)
> */
> if (TestClearPageDirty(page)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> + mapping->nrdirty--;
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> return 1;
> diff --git a/mm/truncate.c b/mm/truncate.c
> index ba887bf..5846d6a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -75,6 +75,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
> struct address_space *mapping = page->mapping;
> if (mapping && mapping_cap_account_dirty(mapping)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> + mapping->nrdirty--;
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> if (account_size)


Preferably we don't add any extra fields under tree_lock so that we can
easily split it up if/when we decide to use a fine-grain locked radix
tree.

Also, as mentioned, you just added a whole new cacheline to dirty.

2010-08-16 16:53:45

by Bill Davidsen

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

Nikanth Karthikesan wrote:
> When the total dirty pages exceed vm_dirty_ratio, the dirtier is made to do
> the writeback. But this dirtier may not be the one who took the system to this
> state. Instead, if we can track the dirty count per-file, we could throttle
> the dirtier of a file, when the file's dirty pages exceed a certain limit.
> Even though this dirtier may not be the one who dirtied the other pages of
> this file, it is fair to throttle this process, as it uses that file.
>
I agree with your problem description: a single program which writes a single
large file can make an interactive system suck. Creating a 25+GB Blu-Ray image
will often saturate the buffer space. I played with per-fd limiting during
2.5.xx development when I had an app writing 5-10GB files. While I wanted to get
something to submit while the kernel was changing, I kept hitting corner cases.

> This patch
> 1. Adds dirty page accounting per-file.
> 2. Exports the number of pages of this file in cache and no of pages dirty via
> proc-fdinfo.
> 3. Adds a new tunable, /proc/sys/vm/file_dirty_bytes. When a files dirty data
> exceeds this limit, the writeback of that inode is done by the current
> dirtier.
>
I think you have this in the wrong place, can't it go in balance_dirty_pages?

> This certainly will affect the throughput of certain heavy-dirtying workloads,
> but should help for interactive systems.
>
I found that the effect was about the same as forcing the application to use
O_DIRECT, and since it was our application I could do that. Not all
badly-behaved programs are open source, so that addressed my issue but not the
general case.

I think you really need to track by process, not file. You said "Even though
this dirtier may not be the one who dirtied the other pages of this file...",
but that doesn't work: you block a process which is contributing minimally to the
problem while letting the real problem process continue. Ex: a log file, with
one process spewing error messages while others write a few lines/min. You have
to get it right, I think.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2010-08-17 02:41:47

by Fengguang Wu

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

On Mon, Aug 16, 2010 at 12:19:50PM +0800, Nikanth Karthikesan wrote:
> When the total dirty pages exceed vm_dirty_ratio, the dirtier is made to do
> the writeback. But this dirtier may not be the one who took the system to this
> state. Instead, if we can track the dirty count per-file, we could throttle
> the dirtier of a file, when the file's dirty pages exceed a certain limit.
> Even though this dirtier may not be the one who dirtied the other pages of
> this file, it is fair to throttle this process, as it uses that file.

Nikanth, there's a more elegant solution in the upstream kernel.
See the comment for task_dirty_limit() in commit 1babe1838.

NFS may want to limit per-file dirty pages, to prevent long stall time
inside the nfs_getattr()->filemap_write_and_wait() calls (and problems
like that). Peter Staubach has similar ideas on it.

Thanks,
Fengguang

2010-08-17 02:53:36

by Fengguang Wu

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

Bill,

On Tue, Aug 17, 2010 at 12:53:17AM +0800, Bill Davidsen wrote:
> Nikanth Karthikesan wrote:
> > When the total dirty pages exceed vm_dirty_ratio, the dirtier is made to do
> > the writeback. But this dirtier may not be the one who took the system to this
> > state. Instead, if we can track the dirty count per-file, we could throttle
> > the dirtier of a file, when the file's dirty pages exceed a certain limit.
> > Even though this dirtier may not be the one who dirtied the other pages of
> > this file, it is fair to throttle this process, as it uses that file.
> >
> I agree with your problem description, a single program which writes a single
> large file can make an interactive system suck. Creating a 25+GB Blu-Ray image
> will often saturate the buffer space. I played with per-fd limiting during
> 2.5.xx development and I had an app writing 5-10GB files. While I wanted to get
> something to submit while the kernel was changing, I kept hitting cornet cases.

The block layer in recent kernels is much better at preventing SYNC
read/write from being delayed by lots of ASYNC writeback requests.
And we are attacking the other responsiveness problems under light
memory pressure.

> > This patch
> > 1. Adds dirty page accounting per-file.
> > 2. Exports the number of pages of this file in cache and no of pages dirty via
> > proc-fdinfo.
> > 3. Adds a new tunable, /proc/sys/vm/file_dirty_bytes. When a files dirty data
> > exceeds this limit, the writeback of that inode is done by the current
> > dirtier.
> >
> I think you have this in the wrong place, can't it go in balance_dirty_pages?
>
> > This certainly will affect the throughput of certain heavy-dirtying workloads,
> > but should help for interactive systems.
> >
> I found that the effect was about the same as forcing the application to use
> O_DIRECT, and since it was our application I could do that. Not all
> badly-behaved programs are open source, so that addressed my issue but not the
> general case.
>
> I think you really need to track by process, not file, as you said "Even though
> this dirtier may not be the one who dirtied the other pages of this file..."
> that doesn't work, you block a process which is contributing minimally to the
> problem while letting the real problem process continue. Ex: a log file, with
> one process spewing error messages while others write a few lines/min. You have
> to get it right, I think.

Good point. Peter implemented that idea long ago in the upstream kernel;
see the comment for task_dirty_limit() in commit 1babe1838.

Thanks,
Fengguang

2010-08-17 05:06:53

by Nikanth Karthikesan

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

On Monday 16 August 2010 16:35:42 Peter Zijlstra wrote:
> On Mon, 2010-08-16 at 09:49 +0530, Nikanth Karthikesan wrote:
> > When the total dirty pages exceed vm_dirty_ratio, the dirtier is made to
> > do the writeback. But this dirtier may not be the one who took the system
> > to this state. Instead, if we can track the dirty count per-file, we
> > could throttle the dirtier of a file, when the file's dirty pages exceed
> > a certain limit. Even though this dirtier may not be the one who dirtied
> > the other pages of this file, it is fair to throttle this process, as it
> > uses that file.
> >
> > This patch
> > 1. Adds dirty page accounting per-file.
> > 2. Exports the number of pages of this file in cache and no of pages
> > dirty via proc-fdinfo.
> > 3. Adds a new tunable, /proc/sys/vm/file_dirty_bytes. When a files dirty
> > data exceeds this limit, the writeback of that inode is done by the
> > current dirtier.
> >
> > This certainly will affect the throughput of certain heavy-dirtying
> > workloads, but should help for interactive systems.
>
> I'm not really charmed by this.. it adds another random variable to prod
> at. Nor does it really tell me why you're wanting to do this. We already
> have per-task invluence on the dirty limits, a task that sporadically
> dirties pages (your vi) will already end up with a higher dirty limit
> than a task that only dirties pages (your dd).
>

Oh, nice. The per-task limit is an elegant solution, which should help in
most of the common cases.

But I just wonder what happens when
1. The dirtier is multiple co-operating processes.
2. Some app, like a shell script, repeatedly calls dd with seek and skip.
People do this for data deduplication, sparse skipping, etc.
3. The app dies and comes back again, like a VM that is rebooted and
continues writing to a disk backed by a file on the host.

Do you think, in those cases this might still be useful?

> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index b606c2c..4f8bc06 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -133,6 +133,15 @@ Setting this to zero disables periodic writeback
> > altogether.
> >
> > ==============================================================
> >
> > +file_dirty_bytes
> > +
> > +When a files total dirty data exceeds file_dirty_bytes, the current
> > generator +of dirty data would be made to do the writeback of dirty pages
> > of that file. +
> > +0 disables this behaviour.
> > +
> > +==============================================================
> > +
> > drop_caches
> >
> > Writing to this will cause the kernel to drop clean caches, dentries and
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 74e3658..8881b7d 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -16,6 +16,8 @@
> > #include <linux/syscalls.h>
> > #include <linux/pagemap.h>
> > #include <linux/splice.h>
> > +#include <linux/buffer_head.h>
> > +#include <linux/writeback.h>
> > #include "read_write.h"
> >
> > #include <asm/uaccess.h>
> > @@ -414,9 +416,19 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char
> > __user *, buf,
> >
> > file = fget_light(fd, &fput_needed);
> > if (file) {
> > + struct address_space *as = file->f_mapping;
> > + unsigned long file_dirty_pages;
> > loff_t pos = file_pos_read(file);
> > +
> > ret = vfs_write(file, buf, count, &pos);
> > file_pos_write(file, pos);
> > + /* Start write-out ? */
> > + if (file_dirty_bytes) {
> > + file_dirty_pages = file_dirty_bytes / PAGE_SIZE;
> > + if (as->nrdirty > file_dirty_pages)
> > + write_inode_now(as->host, 0);
> > + }
> > +
> > fput_light(file, fput_needed);
> > }
>
> This seems wrong, wth are you doing it here and not in the generic
> balance_dirty_pages thing called by set_page_dirty()?
>

Yes, this should be moved to the single site, inside the generic
balance_dirty_pages().

> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9606ceb..0961f70 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2873,6 +2873,7 @@ static int __do_fault(struct mm_struct *mm, struct
> > vm_area_struct *vma, struct vm_fault vmf;
> > int ret;
> > int page_mkwrite = 0;
> > + unsigned long file_dirty_pages;
> >
> > vmf.virtual_address = (void __user *)(address & PAGE_MASK);
> > vmf.pgoff = pgoff;
> > @@ -3024,6 +3025,13 @@ out:
> > /* file_update_time outside page_lock */
> > if (vma->vm_file)
> > file_update_time(vma->vm_file);
> > +
> > + /* Start write-back ? */
> > + if (mapping && file_dirty_bytes) {
> > + file_dirty_pages = file_dirty_bytes / PAGE_SIZE;
> > + if (mapping->nrdirty > file_dirty_pages)
> > + write_inode_now(mapping->host, 0);
> > + }
> > } else {
> > unlock_page(vmf.page);
> > if (anon)
>
> Idem, replicating that code at every site that can dirty a page is just
> wrong, hook into the regular set_page_dirty()->balance_dirty_pages()
> code.
>
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 20890d8..1cabd7f 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -87,6 +87,13 @@ int vm_dirty_ratio = 20;
> > unsigned long vm_dirty_bytes;
> >
> > /*
> > + * When a files total dirty data exceeds file_dirty_bytes, the current
> > generator + * of dirty data would be made to do the writeback of dirty
> > pages of that file. + * 0 disables this behaviour.
> > + */
> > +unsigned long file_dirty_bytes = 0;
>
> So you're adding a extra cacheline to dirty even though its not used by
> default, that seems like suckage..
>

Right, this could be avoided. Some f(vm_dirty_ratio) should be used. I just
wanted to provide a way to disable this behaviour at run-time.

Thanks
Nikanth

> > @@ -1126,6 +1137,7 @@ void account_page_dirtied(struct page *page, struct
> > address_space *mapping) {
> > if (mapping_cap_account_dirty(mapping)) {
> > __inc_zone_page_state(page, NR_FILE_DIRTY);
> > + mapping->nrdirty++;
> > __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > task_dirty_inc(current);
> > task_io_account_write(PAGE_CACHE_SIZE);
> > @@ -1301,6 +1313,7 @@ int clear_page_dirty_for_io(struct page *page)
> > */
> > if (TestClearPageDirty(page)) {
> > dec_zone_page_state(page, NR_FILE_DIRTY);
> > + mapping->nrdirty--;
> > dec_bdi_stat(mapping->backing_dev_info,
> > BDI_RECLAIMABLE);
> > return 1;
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index ba887bf..5846d6a 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -75,6 +75,7 @@ void cancel_dirty_page(struct page *page, unsigned int
> > account_size) struct address_space *mapping = page->mapping;
> > if (mapping && mapping_cap_account_dirty(mapping)) {
> > dec_zone_page_state(page, NR_FILE_DIRTY);
> > + mapping->nrdirty--;
> > dec_bdi_stat(mapping->backing_dev_info,
> > BDI_RECLAIMABLE);
> > if (account_size)
>
> Preferably we don't add any extra fields under tree_lock so that we can
> easily split it up if/when we decide to use a fine-grain locked radix
> tree.
>
> Also, like mentioned, you just added a whole new cacheline to dirty.
>

2010-08-17 08:25:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] Per file dirty limit throttling

On Tue, 2010-08-17 at 10:39 +0530, Nikanth Karthikesan wrote:
> Oh, nice. Per-task limit is an elegant solution, which should help during
> most of the common cases.
>
> But I just wonder what happens, when
> 1. The dirtier is multiple co-operating processes
> 2. Some app like a shell script, that repeatedly calls dd with seek and skip?
> People do this for data deduplication, sparse skipping etc..
> 3. The app dies and comes back again. Like a VM that is rebooted, and
> continues writing to a disk backed by a file on the host.
>
> Do you think, in those cases this might still be useful?

Those cases do indeed defeat the current per-task limit; however, I think
the solution to that is to limit the amount of writeback done by each
blocked process.

Jan Kara had some good ideas in that department.

2010-08-18 09:19:32

by Nikanth Karthikesan

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

On Tuesday 17 August 2010 13:54:35 Peter Zijlstra wrote:
> On Tue, 2010-08-17 at 10:39 +0530, Nikanth Karthikesan wrote:
> > Oh, nice. Per-task limit is an elegant solution, which should help
> > during most of the common cases.
> >
> > But I just wonder what happens, when
> > 1. The dirtier is multiple co-operating processes
> > 2. Some app like a shell script, that repeatedly calls dd with seek and
> > skip? People do this for data deduplication, sparse skipping etc..
> > 3. The app dies and comes back again. Like a VM that is rebooted, and
> > continues writing to a disk backed by a file on the host.
> >
> > Do you think, in those cases this might still be useful?
>
> Those cases do indeed defeat the current per-task-limit, however I think
> the solution to that is to limit the amount of writeback done by each
> blocked process.
>

Blocked on what? Sorry, I do not understand.

Thanks
Nikanth

> Jan Kara had some good ideas in that department.
>

2010-08-18 09:59:15

by Peter Zijlstra

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

On Wed, 2010-08-18 at 14:52 +0530, Nikanth Karthikesan wrote:
> On Tuesday 17 August 2010 13:54:35 Peter Zijlstra wrote:
> > On Tue, 2010-08-17 at 10:39 +0530, Nikanth Karthikesan wrote:
> > > Oh, nice. Per-task limit is an elegant solution, which should help
> > > during most of the common cases.
> > >
> > > But I just wonder what happens, when
> > > 1. The dirtier is multiple co-operating processes
> > > 2. Some app like a shell script, that repeatedly calls dd with seek and
> > > skip? People do this for data deduplication, sparse skipping etc..
> > > 3. The app dies and comes back again. Like a VM that is rebooted, and
> > > continues writing to a disk backed by a file on the host.
> > >
> > > Do you think, in those cases this might still be useful?
> >
> > Those cases do indeed defeat the current per-task-limit, however I think
> > the solution to that is to limit the amount of writeback done by each
> > blocked process.
> >
>
> Blocked on what? Sorry, I do not understand.

balance_dirty_pages(), by limiting the work done there (or actually, the
amount of page writeback completions you wait for -- starting IO isn't
that expensive), you can also affect the time it takes, and therefore
influence the impact.

2010-08-18 14:09:13

by Balbir Singh

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

* Peter Zijlstra <[email protected]> [2010-08-18 11:58:56]:

> On Wed, 2010-08-18 at 14:52 +0530, Nikanth Karthikesan wrote:
> > On Tuesday 17 August 2010 13:54:35 Peter Zijlstra wrote:
> > > On Tue, 2010-08-17 at 10:39 +0530, Nikanth Karthikesan wrote:
> > > > Oh, nice. Per-task limit is an elegant solution, which should help
> > > > during most of the common cases.
> > > >
> > > > But I just wonder what happens, when
> > > > 1. The dirtier is multiple co-operating processes
> > > > 2. Some app like a shell script, that repeatedly calls dd with seek and
> > > > skip? People do this for data deduplication, sparse skipping etc..
> > > > 3. The app dies and comes back again. Like a VM that is rebooted, and
> > > > continues writing to a disk backed by a file on the host.
> > > >
> > > > Do you think, in those cases this might still be useful?
> > >
> > > Those cases do indeed defeat the current per-task-limit, however I think
> > > the solution to that is to limit the amount of writeback done by each
> > > blocked process.
> > >
> >
> > Blocked on what? Sorry, I do not understand.
>
> balance_dirty_pages(), by limiting the work done there (or actually, the
> amount of page writeback completions you wait for -- starting IO isn't
> that expensive), you can also affect the time it takes, and therefore
> influence the impact.
>

There is an ongoing effort to look at per-cgroup dirty limits, and I honestly think it would be nice to do it at that level first. We need it there as part of the overall I/O controller, and as a specialized case it could handle your need as well.

--
Three Cheers,
Balbir

2010-08-18 14:25:37

by Peter Zijlstra

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

On Wed, 2010-08-18 at 19:38 +0530, Balbir Singh wrote:

> There is an ongoing effort to look at per-cgroup dirty limits and I
> honestly think it would be nice to do it at that level first. We need
> it there as a part of the overall I/O controller. As a specialized
> need it could handle your case as well.

Well, it would be good to isolate that to the cgroup code. Also from
what I understood, the plan was to simply mark dirty inodes with a
cgroup and use that from writeout_inodes() to write out inodes
specifically used by that cgroup.

That is, on top of what Andrea Righi already proposed, which would
provide the actual per cgroup dirty limit (although the per-bdi
proportions applied to a cgroup limit aren't strictly correct, but that
seems to be something you'll have to live with, a per-bdi-per-cgroup
proportion would simply be accounting insanity).

That is a totally different thing than what was proposed.

2010-08-18 14:48:18

by Balbir Singh

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

* Peter Zijlstra <[email protected]> [2010-08-18 16:25:18]:

> On Wed, 2010-08-18 at 19:38 +0530, Balbir Singh wrote:
>
> > There is an ongoing effort to look at per-cgroup dirty limits and I
> > honestly think it would be nice to do it at that level first. We need
> > it there as a part of the overall I/O controller. As a specialized
> > need it could handle your case as well.
>
> Well, it would be good to isolate that to the cgroup code. Also from
> what I understood, the plan was to simply mark dirty inodes with a
> cgroup and use that from writeout_inodes() to write out inodes
> specifically used by that cgroup.
>
> That is, on top of what Andrea Righi already proposed, which would
> provide the actual per cgroup dirty limit (although the per-bdi
> proportions applied to a cgroup limit aren't strictly correct, but that
> seems to be something you'll have to live with, a per-bdi-per-cgroup
> proportion would simply be accounting insanity).
>
> That is a totally different thing than what was proposed.

Understood, I was indirectly trying to get Nikanth to look at cgroups
since he was interested in the dirtier (as in task).


--
Three Cheers,
Balbir

2010-08-23 12:16:46

by Nikanth Karthikesan

Subject: Re: [RFC][PATCH] Per file dirty limit throttling

On Wednesday 18 August 2010 15:28:56 Peter Zijlstra wrote:
> On Wed, 2010-08-18 at 14:52 +0530, Nikanth Karthikesan wrote:
> > On Tuesday 17 August 2010 13:54:35 Peter Zijlstra wrote:
> > > On Tue, 2010-08-17 at 10:39 +0530, Nikanth Karthikesan wrote:
> > > > Oh, nice. Per-task limit is an elegant solution, which should help
> > > > during most of the common cases.
> > > >
> > > > But I just wonder what happens, when
> > > > 1. The dirtier is multiple co-operating processes
> > > > 2. Some app like a shell script, that repeatedly calls dd with seek
> > > > and skip? People do this for data deduplication, sparse skipping
> > > > etc.. 3. The app dies and comes back again. Like a VM that is
> > > > rebooted, and continues writing to a disk backed by a file on the
> > > > host.
> > > >
> > > > Do you think, in those cases this might still be useful?
> > >
> > > Those cases do indeed defeat the current per-task-limit, however I
> > > think the solution to that is to limit the amount of writeback done by
> > > each blocked process.
> >
> > Blocked on what? Sorry, I do not understand.
>
> balance_dirty_pages(), by limiting the work done there (or actually, the
> amount of page writeback completions you wait for -- starting IO isn't
> that expensive), you can also affect the time it takes, and therefore
> influence the impact.
>

But this has nothing specific to do with cases like a multi-threaded dirtier, which is why I was confused. :)

Thanks
Nikanth