LinuxLists.cc - [patch 03/13] io-accounting: write accounting

2006-12-08 11:53:13

Subject: [patch 03/13] io-accounting: write accounting

From: Andrew Morton <[email protected]>

Accounting writes is fairly simple: whenever a process flips a page from clean
to dirty, we accuse it of having caused a write to underlying storage of
PAGE_CACHE_SIZE bytes.

This may overestimate the amount of writing: the page-dirtying may cause only
one buffer_head's worth of writeout. Fixing that is possible, but probably a
bit messy and isn't obviously important.

Cc: Jay Lan <[email protected]>
Cc: Shailabh Nagar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Chris Sturtivant <[email protected]>
Cc: Tony Ernst <[email protected]>
Cc: Guillaume Thouvenin <[email protected]>
Cc: David Wright <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

fs/buffer.c | 5 ++++-
mm/page-writeback.c | 5 ++++-
2 files changed, 8 insertions(+), 2 deletions(-)

diff -puN fs/buffer.c~io-accounting-write-accounting fs/buffer.c
--- a/fs/buffer.c~io-accounting-write-accounting
+++ a/fs/buffer.c
@@ -35,6 +35,7 @@
#include <linux/hash.h>
#include <linux/suspend.h>
#include <linux/buffer_head.h>
+#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
@@ -729,8 +730,10 @@ int __set_page_dirty_buffers(struct page

write_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
- if (mapping_cap_account_dirty(mapping))
+ if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ task_io_account_write(PAGE_CACHE_SIZE);
+ }
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff -puN mm/page-writeback.c~io-accounting-write-accounting mm/page-writeback.c
--- a/mm/page-writeback.c~io-accounting-write-accounting
+++ a/mm/page-writeback.c
@@ -21,6 +21,7 @@
#include <linux/writeback.h>
#include <linux/init.h>
#include <linux/backing-dev.h>
+#include <linux/task_io_accounting_ops.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -768,8 +769,10 @@ int __set_page_dirty_nobuffers(struct pa
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
- if (mapping_cap_account_dirty(mapping))
+ if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ task_io_account_write(PAGE_CACHE_SIZE);
+ }
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
_

2006-12-13 09:01:10

by Andrew Morton

[permalink] [raw]

Subject: Re: [patch 03/13] io-accounting: write accounting

On Wed, 13 Dec 2006 00:45:50 -0800
Suleiman Souhlal <[email protected]> wrote:

> [email protected] wrote:
> > From: Andrew Morton <[email protected]>
> >
> > Accounting writes is fairly simple: whenever a process flips a page from clean
> > to dirty, we accuse it of having caused a write to underlying storage of
> > PAGE_CACHE_SIZE bytes.
>
> On architectures where dirtying a page doesn't cause a page fault (like i386), couldn't you end up billing the wrong process (in fact, I think that even on other archituctures set_page_dirty() doesn't get called immediately in the page fault handler)?

Yes, that would be a problem in 2.6.18 and earlier.

In 2.6.19 and later, we do take a fault when transitioning a page from
pte-clean to pte-dirty. That was done to get the dirty-page accounting
right - to avoid the all-of-memory-is-dirty-but-the-kernel-doesn't-know-it
problem.

> AFAICS, set_page_dirty() is mostly called when trying to unmap a page when trying to shrink LRU lists, and there is no guarantee that this happens under the process that dirtied it (in fact, the set_page_dirty() is often done by kswapd).

hm, that code is still there in zap_pte_range(). If all is well, that
set_page_dirty() call should never return true. Peter did, you ever test
for that?

(Well, it might return true in rare races, because zap_pte_range() doesn't
lock the pages)

2006-12-13 09:17:47

by Suleiman Souhlal

[permalink] [raw]

Subject: Re: [patch 03/13] io-accounting: write accounting

[email protected] wrote:
> From: Andrew Morton <[email protected]>
>
> Accounting writes is fairly simple: whenever a process flips a page from clean
> to dirty, we accuse it of having caused a write to underlying storage of
> PAGE_CACHE_SIZE bytes.

On architectures where dirtying a page doesn't cause a page fault (like i386), couldn't you end up billing the wrong process (in fact, I think that even on other archituctures set_page_dirty() doesn't get called immediately in the page fault handler)?

AFAICS, set_page_dirty() is mostly called when trying to unmap a page when trying to shrink LRU lists, and there is no guarantee that this happens under the process that dirtied it (in fact, the set_page_dirty() is often done by kswapd).

-- Suleiman

2006-12-13 10:22:14

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch 03/13] io-accounting: write accounting

On Wed, 2006-12-13 at 00:59 -0800, Andrew Morton wrote:
> On Wed, 13 Dec 2006 00:45:50 -0800
> Suleiman Souhlal <[email protected]> wrote:
>
> > [email protected] wrote:
> > > From: Andrew Morton <[email protected]>
> > >
> > > Accounting writes is fairly simple: whenever a process flips a
> page from clean
> > > to dirty, we accuse it of having caused a write to underlying
> storage of
> > > PAGE_CACHE_SIZE bytes.
> >
> > On architectures where dirtying a page doesn't cause a page fault
> (like i386), couldn't you end up billing the wrong process (in fact, I
> think that even on other archituctures set_page_dirty() doesn't get
> called immediately in the page fault handler)?
>
> Yes, that would be a problem in 2.6.18 and earlier.
>
> In 2.6.19 and later, we do take a fault when transitioning a page from
> pte-clean to pte-dirty. That was done to get the dirty-page
> accounting
> right - to avoid the
> all-of-memory-is-dirty-but-the-kernel-doesn't-know-it
> problem.
>
>
> > AFAICS, set_page_dirty() is mostly called when trying to unmap a
> page when trying to shrink LRU lists, and there is no guarantee that
> this happens under the process that dirtied it (in fact, the
> set_page_dirty() is often done by kswapd).

we're talking:

shrink_page_list()
TestSetPageLock()
try_to_unmap()
try_to_unmap_file()
try_to_unmap_one()
set_page_dirty()

Which, because its run under the page lock, should always be false,
right? perhaps a WARN_ON might do.

> hm, that code is still there in zap_pte_range(). If all is well, that
> set_page_dirty() call should never return true.

exit_mmap(), do_unmap()
unmap_vmas()
unmap_page_range()
zap_{pud,pmd,pte}_range()

vmtruncate()
unmap_mapping_range()
unmap_mapping_range_*()
unmap_mapping_range_vma(), madvise_dontneed()
zap_page_range()
unmap_vmas()
unmap_page_range()
zap_{pud,pmd,pte}_range()

Which is all ran without the page lock, so might be open for a race;
however, I don't see that hurting because the whole page is un-accounted
pretty soon afterwards.

> Peter did, you ever test for that?

Nope

> (Well, it might return true in rare races, because zap_pte_range()
> doesn't lock the pages)

2006-12-13 10:36:14

by Suleiman Souhlal

[permalink] [raw]

Subject: Re: [patch 03/13] io-accounting: write accounting

Andrew Morton wrote:
> On Wed, 13 Dec 2006 00:45:50 -0800
> Suleiman Souhlal <[email protected]> wrote:
>
>
>>[email protected] wrote:
>>
>>>From: Andrew Morton <[email protected]>
>>>
>>>Accounting writes is fairly simple: whenever a process flips a page from clean
>>>to dirty, we accuse it of having caused a write to underlying storage of
>>>PAGE_CACHE_SIZE bytes.
>>
>>On architectures where dirtying a page doesn't cause a page fault (like i386), couldn't you end up billing the wrong process (in fact, I think that even on other archituctures set_page_dirty() doesn't get called immediately in the page fault handler)?
>
>
> Yes, that would be a problem in 2.6.18 and earlier.
>
> In 2.6.19 and later, we do take a fault when transitioning a page from
> pte-clean to pte-dirty. That was done to get the dirty-page accounting
> right - to avoid the all-of-memory-is-dirty-but-the-kernel-doesn't-know-it
> problem.

Ah yes indeed. I'm unable to keep up with all the new developments. :-(

However, if my understanding of this code is correct, it seems that the
page fault is only done for shared writable VMAs, so you still can't
rely on set_page_dirty() always being called by the process that
dirtied the page in the first place.

Am I wrong?

-- Suleiman

2006-12-13 11:03:17

by Suleiman Souhlal

[permalink] [raw]

Subject: Re: [patch 03/13] io-accounting: write accounting

Suleiman Souhlal wrote:
> Andrew Morton wrote:
>
>> On Wed, 13 Dec 2006 00:45:50 -0800
>> Suleiman Souhlal <[email protected]> wrote:
>>
>>
>>> [email protected] wrote:
>>>
>>>> From: Andrew Morton <[email protected]>
>>>>
>>>> Accounting writes is fairly simple: whenever a process flips a page
>>>> from clean
>>>> to dirty, we accuse it of having caused a write to underlying
>>>> storage of
>>>> PAGE_CACHE_SIZE bytes.
>>>
>>>
>>> On architectures where dirtying a page doesn't cause a page fault
>>> (like i386), couldn't you end up billing the wrong process (in fact,
>>> I think that even on other archituctures set_page_dirty() doesn't get
>>> called immediately in the page fault handler)?
>>
>>
>>
>> Yes, that would be a problem in 2.6.18 and earlier.
>>
>> In 2.6.19 and later, we do take a fault when transitioning a page from
>> pte-clean to pte-dirty. That was done to get the dirty-page accounting
>> right - to avoid the
>> all-of-memory-is-dirty-but-the-kernel-doesn't-know-it
>> problem.
>
>
> Ah yes indeed. I'm unable to keep up with all the new developments. :-(
>
> However, if my understanding of this code is correct, it seems that the
> page fault is only done for shared writable VMAs, so you still can't
> rely on set_page_dirty() always being called by the process that
> dirtied the page in the first place.
>
> Am I wrong?

Yes I am.
The only I/O non-shared VMAs might cause is from swapping, and I'm not
sure if the io accounting patches actually care about that.
My confusion came from the term "shared": A VMA is considered shared
whenever MAP_SHARED is specified, even if it only has only one single
"user".

-- Suleiman

2006-12-13 19:08:01

by Andrew Morton

[permalink] [raw]

Subject: Re: [patch 03/13] io-accounting: write accounting

On Wed, 13 Dec 2006 03:02:38 -0800
Suleiman Souhlal <[email protected]> wrote:

> The only I/O non-shared VMAs might cause is from swapping, and I'm not
> sure if the io accounting patches actually care about that.

Yes, the patches do attempt to correctly account for swap IO. swapin is
accounted in submit_bio() and swapout is, err, not accounted at all. Drat,
I forgot to retest that.

2006-12-13 22:01:15

by Andrew Morton

[permalink] [raw]

Subject: Re: [patch 03/13] io-accounting: write accounting

On Wed, 13 Dec 2006 11:07:01 -0800
Andrew Morton <[email protected]> wrote:

> On Wed, 13 Dec 2006 03:02:38 -0800
> Suleiman Souhlal <[email protected]> wrote:
>
> > The only I/O non-shared VMAs might cause is from swapping, and I'm not
> > sure if the io accounting patches actually care about that.
>
> Yes, the patches do attempt to correctly account for swap IO. swapin is
> accounted in submit_bio() and swapout is, err, not accounted at all. Drat,
> I forgot to retest that.
>

hey, this is hard.

The obvious "fix" is to do:

--- a/mm/page-writeback.c~a
+++ a/mm/page-writeback.c
@@ -816,8 +816,10 @@ int fastcall set_page_dirty(struct page
return (*spd)(page);
}
if (!PageDirty(page)) {
- if (!TestSetPageDirty(page))
+ if (!TestSetPageDirty(page)) {
+ task_io_account_write(PAGE_SIZE);
return 1;
+ }
}
return 0;
}
_

but that means that memset(malloc(1000000)) will accuse the task of having
done 1MB of writing, which is daft.

What would be appropriate here is to account the task with the write when
someone moves a dirty anon page into swapcache. That means that some
random task needs to locate the task which "owns" this anon page. So in
shrink_page_list()->add_to_swap() we need to hunt down the appropriate
locks, do the rmap walk, find the vma, find the mm, then wonder how the
heck we find the right task_struct based on the mm_struct.

I think I'll back slowly away from this problem and mark it as "known
shortcoming".

Which perhaps means that we should not account for swapin either.