2012-06-28 10:55:17

by Sha Zhengju

Subject: [PATCH 0/7] Per-cgroup page stat accounting

This patch series provides the ability for each memory cgroup to have independent
dirty/writeback page stats. These can provide useful information for per-cgroup
direct reclaim. Meanwhile, we add more detailed dump messages for memcg OOMs.

Three features are included in this patch series:
(0) preparation patches for page accounting
1. memcg dirty page accounting
2. memcg writeback page accounting
3. memcg OOM dump info

In the (0) preparation patches, we rework the vfs set_page_dirty routines so that
"modify page info" and "dirty page accounting" stay in one function as much as
possible, for the sake of memcg's bigger lock.

These patches are based on Andrew's akpm tree.

Sha Zhengju (7):
memcg-update-cgroup-memory-document.patch
memcg-remove-MEMCG_NR_FILE_MAPPED.patch
Make-TestSetPageDirty-and-dirty-page-accounting-in-o.patch
Use-vfs-__set_page_dirty-interface-instead-of-doing-.patch
memcg-add-per-cgroup-dirty-pages-accounting.patch
memcg-add-per-cgroup-writeback-pages-accounting.patch
memcg-print-more-detailed-info-while-memcg-oom-happe.patch

Documentation/cgroups/memory.txt | 2 +
fs/buffer.c | 36 +++++++++-----
fs/ceph/addr.c | 20 +-------
include/linux/buffer_head.h | 2 +
include/linux/memcontrol.h | 27 +++++++---
mm/filemap.c | 5 ++
mm/memcontrol.c | 99 +++++++++++++++++++++++--------------
mm/page-writeback.c | 42 ++++++++++++++--
mm/rmap.c | 4 +-
mm/truncate.c | 6 ++
10 files changed, 159 insertions(+), 84 deletions(-)


2012-06-28 10:57:44

by Sha Zhengju

Subject: [PATCH 1/7] memcg: update cgroup memory document

From: Sha Zhengju <[email protected]>

Document cgroup dirty/writeback memory statistics.

The implementation of these new interfaces comes in the following patches of this
series.

Signed-off-by: Sha Zhengju <[email protected]>
---
Documentation/cgroups/memory.txt | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index dd88540..24d7e3c 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -420,6 +420,8 @@ pgpgin - # of charging events to the memory cgroup. The charging
pgpgout - # of uncharging events to the memory cgroup. The uncharging
event happens each time a page is unaccounted from the cgroup.
swap - # of bytes of swap usage
+dirty - # of bytes that are waiting to get written back to the disk.
+writeback - # of bytes that are actively being written back to the disk.
inactive_anon - # of bytes of anonymous memory and swap cache memory on
LRU list.
active_anon - # of bytes of anonymous and swap cache memory on active
--
1.7.1

2012-06-28 10:58:40

by Sha Zhengju

Subject: [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED

From: Sha Zhengju <[email protected]>

When accounting memcg page stats, it is not worth using MEMCG_NR_FILE_MAPPED as an
extra layer of indirection, given the complexity and presumed performance overhead.
We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
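
For example, call sites then pass the shared stat index directly and the translation
switch in mem_cgroup_update_page_stat() goes away (illustrative sketch only; the real
change is in the rmap.c hunks below):

    /* before: wrapper index, translated inside memcg */
    mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);

    /* after: use the memcg stat index itself */
    mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);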

Signed-off-by: Sha Zhengju <[email protected]>
---
include/linux/memcontrol.h | 25 +++++++++++++++++--------
mm/memcontrol.c | 24 +-----------------------
mm/rmap.c | 4 ++--
3 files changed, 20 insertions(+), 33 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 83e7ba9..20b0f2d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -27,9 +27,18 @@ struct page_cgroup;
struct page;
struct mm_struct;

-/* Stats that can be updated by kernel. */
-enum mem_cgroup_page_stat_item {
- MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+ /*
+ * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+ */
+ MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+ MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
+ MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
+ MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+ MEM_CGROUP_STAT_NSTATS,
};

struct mem_cgroup_reclaim_cookie {
@@ -164,17 +173,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
}

void mem_cgroup_update_page_stat(struct page *page,
- enum mem_cgroup_page_stat_item idx,
+ enum mem_cgroup_stat_index idx,
int val);

static inline void mem_cgroup_inc_page_stat(struct page *page,
- enum mem_cgroup_page_stat_item idx)
+ enum mem_cgroup_stat_index idx)
{
mem_cgroup_update_page_stat(page, idx, 1);
}

static inline void mem_cgroup_dec_page_stat(struct page *page,
- enum mem_cgroup_page_stat_item idx)
+ enum mem_cgroup_stat_index idx)
{
mem_cgroup_update_page_stat(page, idx, -1);
}
@@ -349,12 +358,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
}

static inline void mem_cgroup_inc_page_stat(struct page *page,
- enum mem_cgroup_page_stat_item idx)
+ enum mem_cgroup_stat_index idx)
{
}

static inline void mem_cgroup_dec_page_stat(struct page *page,
- enum mem_cgroup_page_stat_item idx)
+ enum mem_cgroup_stat_index idx)
{
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2677e0..ebed1ca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -77,20 +77,6 @@ static int really_do_swap_account __initdata = 0;
#endif


-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
- /*
- * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
- */
- MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
- MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
- MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
- MEM_CGROUP_STAT_NSTATS,
-};
-
static const char * const mem_cgroup_stat_names[] = {
"cache",
"rss",
@@ -1926,7 +1912,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
}

void mem_cgroup_update_page_stat(struct page *page,
- enum mem_cgroup_page_stat_item idx, int val)
+ enum mem_cgroup_stat_index idx, int val)
{
struct mem_cgroup *memcg;
struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -1939,14 +1925,6 @@ void mem_cgroup_update_page_stat(struct page *page,
if (unlikely(!memcg || !PageCgroupUsed(pc)))
return;

- switch (idx) {
- case MEMCG_NR_FILE_MAPPED:
- idx = MEM_CGROUP_STAT_FILE_MAPPED;
- break;
- default:
- BUG();
- }
-
this_cpu_add(memcg->stat->count[idx], val);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index 2144160..d6b93df 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1148,7 +1148,7 @@ void page_add_file_rmap(struct page *page)
mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (atomic_inc_and_test(&page->_mapcount)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
}
mem_cgroup_end_update_page_stat(page, &locked, &flags);
}
@@ -1202,7 +1202,7 @@ void page_remove_rmap(struct page *page)
NR_ANON_TRANSPARENT_HUGEPAGES);
} else {
__dec_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
}
/*
* It would be tidy to reset the PageAnon mapping here,
--
1.7.1

2012-06-28 11:01:33

by Sha Zhengju

Subject: [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func

From: Sha Zhengju <[email protected]>

Commit a8e7d49a ("Fix race in create_empty_buffers() vs __set_page_dirty_buffers()")
extracted TestSetPageDirty from __set_page_dirty and moved it far away from
account_page_dirtied. But it is better to keep the two operations in one single
function for modularity. So, in order to avoid the potential race mentioned in
commit a8e7d49a, we hold private_lock until __set_page_dirty completes.
A quick look suggests there is no deadlock between ->private_lock and ->tree_lock.

This is a preparation patch for the following memcg dirty page accounting patches.

Signed-off-by: Sha Zhengju <[email protected]>
---
fs/buffer.c | 25 +++++++++++++------------
1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 838a9cf..e8d96b8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -610,9 +610,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
* If warn is true, then emit a warning if the page is not uptodate and has
* not been truncated.
*/
-static void __set_page_dirty(struct page *page,
+static int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ if (unlikely(!mapping))
+ return !TestSetPageDirty(page);
+
+ if (TestSetPageDirty(page))
+ return 0;
+
spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -622,6 +628,8 @@ static void __set_page_dirty(struct page *page,
}
spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+ return 1;
}

/*
@@ -667,11 +675,9 @@ int __set_page_dirty_buffers(struct page *page)
bh = bh->b_this_page;
} while (bh != head);
}
- newly_dirty = !TestSetPageDirty(page);
+ newly_dirty = __set_page_dirty(page, mapping, 1);
spin_unlock(&mapping->private_lock);

- if (newly_dirty)
- __set_page_dirty(page, mapping, 1);
return newly_dirty;
}
EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1115,14 +1121,9 @@ void mark_buffer_dirty(struct buffer_head *bh)
return;
}

- if (!test_set_buffer_dirty(bh)) {
- struct page *page = bh->b_page;
- if (!TestSetPageDirty(page)) {
- struct address_space *mapping = page_mapping(page);
- if (mapping)
- __set_page_dirty(page, mapping, 0);
- }
- }
+ if (!test_set_buffer_dirty(bh))
+ __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
+
}
EXPORT_SYMBOL(mark_buffer_dirty);

--
1.7.1

2012-06-28 11:04:00

by Sha Zhengju

Subject: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem

From: Sha Zhengju <[email protected]>

In the following patches we will treat SetPageDirty and dirty page accounting as one
integrated operation. Filesystems should use the vfs interface directly instead of
duplicating those details.

Signed-off-by: Sha Zhengju <[email protected]>
---
fs/buffer.c | 2 +-
fs/ceph/addr.c | 20 ++------------------
include/linux/buffer_head.h | 2 ++
3 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e8d96b8..55522dd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
* If warn is true, then emit a warning if the page is not uptodate and has
* not been truncated.
*/
-static int __set_page_dirty(struct page *page,
+int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
if (unlikely(!mapping))
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8b67304..d028fbe 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -5,6 +5,7 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h> /* generic_writepages */
+#include <linux/buffer_head.h>
#include <linux/slab.h>
#include <linux/pagevec.h>
#include <linux/task_io_accounting_ops.h>
@@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
int undo = 0;
struct ceph_snap_context *snapc;

- if (unlikely(!mapping))
- return !TestSetPageDirty(page);
-
- if (TestSetPageDirty(page)) {
- dout("%p set_page_dirty %p idx %lu -- already dirty\n",
- mapping->host, page, page->index);
+ if (!__set_page_dirty(page, mapping, 1))
return 0;
- }

inode = mapping->host;
ci = ceph_inode(inode);
@@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
snapc, snapc->seq, snapc->num_snaps);
spin_unlock(&ci->i_ceph_lock);

- /* now adjust page */
- spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
- WARN_ON_ONCE(!PageUptodate(page));
- account_page_dirtied(page, page->mapping);
- radix_tree_tag_set(&mapping->page_tree,
- page_index(page), PAGECACHE_TAG_DIRTY);
-
/*
* Reference snap context in page->private. Also set
* PagePrivate so that we get invalidatepage callback.
@@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
undo = 1;
}

- spin_unlock_irq(&mapping->tree_lock);
-
if (undo)
/* whoops, we failed to dirty the page */
ceph_put_wrbuffer_cap_refs(ci, 1, snapc);

- __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
BUG_ON(!PageDirty(page));
return 1;
}
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 458f497..0a331a8 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
}

extern int __set_page_dirty_buffers(struct page *page);
+extern int __set_page_dirty(struct page *page,
+ struct address_space *mapping, int warn);

#else /* CONFIG_BLOCK */

--
1.7.1

2012-06-28 11:04:56

by Sha Zhengju

Subject: [PATCH 5/7] memcg: add per cgroup dirty pages accounting

From: Sha Zhengju <[email protected]>

This patch adds memcg routines to count dirty pages, which allows the memory
controller to maintain an accurate view of the amount of its dirty memory and can
provide some information for users while the group's direct reclaim is working.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"), we
can use a 'struct page' flag to test page state instead of a per-page_cgroup flag.
But memcg has a feature to move a page from one cgroup to another, so there may be a
race between "move" and "page stat accounting". In order to avoid that race, we have
designed a bigger lock:

mem_cgroup_begin_update_page_stat()
modify page information -->(a)
mem_cgroup_update_page_stat() -->(b)
mem_cgroup_end_update_page_stat()

It requires that (a) and (b) (the dirty page accounting) stay close enough together.
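
As a minimal sketch of the caller pattern (illustrative only; in the actual
__set_page_dirty() rework below, step (b) happens via account_page_dirtied()):

    bool locked;
    unsigned long flags;

    mem_cgroup_begin_update_page_stat(page, &locked, &flags);
    if (!TestSetPageDirty(page)) {                                    /* (a) */
        mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);   /* (b) */
        /* ... zone/bdi accounting, radix tree tagging, mark inode dirty ... */
    }
    mem_cgroup_end_update_page_stat(page, &locked, &flags);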

In the previous two preparation patches, we reworked the vfs set page dirty routines,
and now the interfaces are more explicit:
incrementing (2):
__set_page_dirty
__set_page_dirty_nobuffers
decrementing (2):
clear_page_dirty_for_io
cancel_dirty_page


Signed-off-by: Sha Zhengju <[email protected]>
---
fs/buffer.c | 17 ++++++++++++++---
include/linux/memcontrol.h | 1 +
mm/filemap.c | 5 +++++
mm/memcontrol.c | 28 +++++++++++++++++++++-------
mm/page-writeback.c | 30 ++++++++++++++++++++++++------
mm/truncate.c | 6 ++++++
6 files changed, 71 insertions(+), 16 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 55522dd..d3714cc 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ bool locked;
+ unsigned long flags;
+ int ret = 0;
+
if (unlikely(!mapping))
return !TestSetPageDirty(page);

- if (TestSetPageDirty(page))
- return 0;
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+ if (TestSetPageDirty(page)) {
+ ret = 0;
+ goto out;
+ }

spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
@@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

- return 1;
+ ret = 1;
+out:
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
+ return ret;
}

/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20b0f2d..ad37b59 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+ MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
MEM_CGROUP_STAT_NSTATS,
};

diff --git a/mm/filemap.c b/mm/filemap.c
index 1f19ec3..5159a49 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+ /*
+ * Do not change page state, so no need to use mem_cgroup_
+ * {begin, end}_update_page_stat to get lock.
+ */
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ebed1ca..90e2946 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
"rss",
"mapped_file",
"swap",
+ "dirty",
};

enum mem_cgroup_events_index {
@@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+ struct mem_cgroup *to,
+ enum mem_cgroup_stat_index idx)
+{
+ /* Update stat data for mem_cgroup */
+ preempt_disable();
+ __this_cpu_dec(from->stat->count[idx]);
+ __this_cpu_inc(to->stat->count[idx]);
+ preempt_enable();
+}
+
/**
* mem_cgroup_move_account - move account of the page
* @page: the page
@@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,

move_lock_mem_cgroup(from, &flags);

- if (!anon && page_mapped(page)) {
- /* Update mapped_file data for mem_cgroup */
- preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- preempt_enable();
- }
+ if (!anon && page_mapped(page))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_MAPPED);
+
+ if (PageDirty(page))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_DIRTY);
+
mem_cgroup_charge_statistics(from, anon, -nr_pages);

/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e5363f3..e79a2f7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
*/
int __set_page_dirty_nobuffers(struct page *page)
{
+ bool locked;
+ unsigned long flags;
+ int ret = 0;
+
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
struct address_space *mapping2;

- if (!mapping)
- return 1;
+ if (!mapping) {
+ ret = 1;
+ goto out;
+ }

spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
@@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
- return 1;
+ ret = 1;
}
- return 0;
+
+out:
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
+ return ret;
}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);

@@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
int clear_page_dirty_for_io(struct page *page)
{
struct address_space *mapping = page_mapping(page);
+ bool locked;
+ unsigned long flags;
+ int ret = 0;

BUG_ON(!PageLocked(page));

@@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
* the desired exclusion. See mm/memory.c:do_wp_page()
* for more comments.
*/
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (TestClearPageDirty(page)) {
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
- return 1;
+ ret = 1;
}
- return 0;
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
+ return ret;
}
return TestClearPageDirty(page);
}
diff --git a/mm/truncate.c b/mm/truncate.c
index 75801ac..052016a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
*/
void cancel_dirty_page(struct page *page, unsigned int account_size)
{
+ bool locked;
+ unsigned long flags;
+
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
task_io_account_cancelled_write(account_size);
}
}
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
}
EXPORT_SYMBOL(cancel_dirty_page);

--
1.7.1

2012-06-28 11:05:38

by Sha Zhengju

Subject: [PATCH 6/7] memcg: add per cgroup writeback pages accounting

From: Sha Zhengju <[email protected]>

Similar to dirty pages, we add per-cgroup writeback page accounting. The locking
rule is still:
mem_cgroup_begin_update_page_stat()
modify page WRITEBACK stat
mem_cgroup_update_page_stat()
mem_cgroup_end_update_page_stat()

There are two writeback interfaces to modify: test_clear_page_writeback and
test_set_page_writeback.

Signed-off-by: Sha Zhengju <[email protected]>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 5 +++++
mm/page-writeback.c | 12 ++++++++++++
3 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad37b59..9193d93 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -39,6 +39,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
MEM_CGROUP_STAT_NSTATS,
};

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90e2946..8493119 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,6 +83,7 @@ static const char * const mem_cgroup_stat_names[] = {
"mapped_file",
"swap",
"dirty",
+ "writeback",
};

enum mem_cgroup_events_index {
@@ -2604,6 +2605,10 @@ static int mem_cgroup_move_account(struct page *page,
mem_cgroup_move_account_page_stat(from, to,
MEM_CGROUP_STAT_FILE_DIRTY);

+ if (PageWriteback(page))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_WRITEBACK);
+
mem_cgroup_charge_statistics(from, anon, -nr_pages);

/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e79a2f7..7398836 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1981,6 +1981,7 @@ EXPORT_SYMBOL(account_page_dirtied);
*/
void account_page_writeback(struct page *page)
{
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
inc_zone_page_state(page, NR_WRITEBACK);
}
EXPORT_SYMBOL(account_page_writeback);
@@ -2214,7 +2215,10 @@ int test_clear_page_writeback(struct page *page)
{
struct address_space *mapping = page_mapping(page);
int ret;
+ bool locked;
+ unsigned long flags;

+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (mapping) {
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
@@ -2235,9 +2239,12 @@ int test_clear_page_writeback(struct page *page)
ret = TestClearPageWriteback(page);
}
if (ret) {
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
dec_zone_page_state(page, NR_WRITEBACK);
inc_zone_page_state(page, NR_WRITTEN);
}
+
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
return ret;
}

@@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
{
struct address_space *mapping = page_mapping(page);
int ret;
+ bool locked;
+ unsigned long flags;

+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (mapping) {
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
@@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
}
if (!ret)
account_page_writeback(page);
+
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
return ret;

}
--
1.7.1

2012-06-28 11:07:03

by Sha Zhengju

Subject: [PATCH 7/7] memcg: print more detailed info while memcg oom happening

From: Sha Zhengju <[email protected]>

When a memcg OOM happens, the dumped info is limited, so add this to provide the
memcg page stats.
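
The extra stats are appended to the existing OOM dump; schematically it looks like
the following (values are placeholders, the swap counter is skipped when
!do_swap_account, and everything after "Memory cgroup stat:" is emitted as one
continued line by the three loops below):

    Memory cgroup stat:
    cache:<n>KB rss:<n>KB mapped_file:<n>KB swap:<n>KB dirty:<n>KB writeback:<n>KB
    <event counters> inactive_anon:<n>KB active_anon:<n>KB inactive_file:<n>KB
    active_file:<n>KB unevictable:<n>KB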

Signed-off-by: Sha Zhengju <[email protected]>
---
mm/memcontrol.c | 42 ++++++++++++++++++++++++++++++++++--------
1 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8493119..3ed41e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -101,6 +101,14 @@ static const char * const mem_cgroup_events_names[] = {
"pgmajfault",
};

+static const char * const mem_cgroup_lru_names[] = {
+ "inactive_anon",
+ "active_anon",
+ "inactive_file",
+ "active_file",
+ "unevictable",
+};
+
/*
* Per memcg event counter is incremented at every pagein/pageout. With THP,
* it will be incremated by the number of pages. This counter is used for
@@ -1358,6 +1366,30 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
spin_unlock_irqrestore(&memcg->move_lock, *flags);
}

+#define K(x) ((x) << (PAGE_SHIFT-10))
+static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
+{
+ int i;
+
+ printk(KERN_INFO "Memory cgroup stat:\n");
+ for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+ if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+ continue;
+ printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
+ K(mem_cgroup_read_stat(memcg, i)));
+ }
+
+ for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
+ printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
+ mem_cgroup_read_events(memcg, i));
+
+ for (i = 0; i < NR_LRU_LISTS; i++)
+ printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
+ K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
+ printk(KERN_CONT "\n");
+
+}
+
/**
* mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
* @memcg: The memory cgroup that went over limit
@@ -1422,6 +1454,8 @@ done:
res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
+
+ mem_cgroup_print_oom_stat(memcg);
}

/*
@@ -4043,14 +4077,6 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
}
#endif /* CONFIG_NUMA */

-static const char * const mem_cgroup_lru_names[] = {
- "inactive_anon",
- "active_anon",
- "inactive_file",
- "active_file",
- "unevictable",
-};
-
static inline void mem_cgroup_lru_names_not_uptodate(void)
{
BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
--
1.7.1

2012-06-29 05:22:03

by Sage Weil

Subject: Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem

On Thu, 28 Jun 2012, Sha Zhengju wrote:

> From: Sha Zhengju <[email protected]>
>
> In the following patches we will treat SetPageDirty and dirty page accounting as one
> integrated operation. Filesystems should use the vfs interface directly instead of
> duplicating those details.
>
> Signed-off-by: Sha Zhengju <[email protected]>
> ---
> fs/buffer.c | 2 +-
> fs/ceph/addr.c | 20 ++------------------
> include/linux/buffer_head.h | 2 ++
> 3 files changed, 5 insertions(+), 19 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e8d96b8..55522dd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> * If warn is true, then emit a warning if the page is not uptodate and has
> * not been truncated.
> */
> -static int __set_page_dirty(struct page *page,
> +int __set_page_dirty(struct page *page,
> struct address_space *mapping, int warn)
> {
> if (unlikely(!mapping))

This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
continue to build as a module.
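
That is, something along these lines next to the definition in fs/buffer.c:

    EXPORT_SYMBOL(__set_page_dirty);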

With that fixed, the ceph bits are a welcome cleanup!

Acked-by: Sage Weil <[email protected]>

> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8b67304..d028fbe 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -5,6 +5,7 @@
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> #include <linux/writeback.h> /* generic_writepages */
> +#include <linux/buffer_head.h>
> #include <linux/slab.h>
> #include <linux/pagevec.h>
> #include <linux/task_io_accounting_ops.h>
> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
> int undo = 0;
> struct ceph_snap_context *snapc;
>
> - if (unlikely(!mapping))
> - return !TestSetPageDirty(page);
> -
> - if (TestSetPageDirty(page)) {
> - dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> - mapping->host, page, page->index);
> + if (!__set_page_dirty(page, mapping, 1))
> return 0;
> - }
>
> inode = mapping->host;
> ci = ceph_inode(inode);
> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
> snapc, snapc->seq, snapc->num_snaps);
> spin_unlock(&ci->i_ceph_lock);
>
> - /* now adjust page */
> - spin_lock_irq(&mapping->tree_lock);
> if (page->mapping) { /* Race with truncate? */
> - WARN_ON_ONCE(!PageUptodate(page));
> - account_page_dirtied(page, page->mapping);
> - radix_tree_tag_set(&mapping->page_tree,
> - page_index(page), PAGECACHE_TAG_DIRTY);
> -
> /*
> * Reference snap context in page->private. Also set
> * PagePrivate so that we get invalidatepage callback.
> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
> undo = 1;
> }
>
> - spin_unlock_irq(&mapping->tree_lock);
> -
> if (undo)
> /* whoops, we failed to dirty the page */
> ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>
> - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -
> BUG_ON(!PageDirty(page));
> return 1;
> }
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 458f497..0a331a8 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
> }
>
> extern int __set_page_dirty_buffers(struct page *page);
> +extern int __set_page_dirty(struct page *page,
> + struct address_space *mapping, int warn);
>
> #else /* CONFIG_BLOCK */
>
> --
> 1.7.1
>

2012-06-29 08:26:08

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/7] Per-cgroup page stat accounting

(2012/06/28 19:54), Sha Zhengju wrote:
> This patch series provides the ability for each memory cgroup to have independent
> dirty/writeback page stats. These can provide useful information for per-cgroup
> direct reclaim. Meanwhile, we add more detailed dump messages for memcg OOMs.
>
> Three features are included in this patch series:
> (0) preparation patches for page accounting
> 1. memcg dirty page accounting
> 2. memcg writeback page accounting
> 3. memcg OOM dump info
>
> In the (0) preparation patches, we rework the vfs set_page_dirty routines so that
> "modify page info" and "dirty page accounting" stay in one function as much as
> possible, for the sake of memcg's bigger lock.
>
> These patches are based on Andrew's akpm tree.
>

Thank you! It seems good in general; I'll review in detail later.

Do you have any performance comparison between before and after the series?
I mean, set_page_dirty() is a hot path, and we should be careful when adding new accounting.

Thanks,
-Kame



> Sha Zhengju (7):
> memcg-update-cgroup-memory-document.patch
> memcg-remove-MEMCG_NR_FILE_MAPPED.patch
> Make-TestSetPageDirty-and-dirty-page-accounting-in-o.patch
> Use-vfs-__set_page_dirty-interface-instead-of-doing-.patch
> memcg-add-per-cgroup-dirty-pages-accounting.patch
> memcg-add-per-cgroup-writeback-pages-accounting.patch
> memcg-print-more-detailed-info-while-memcg-oom-happe.patch
>
> Documentation/cgroups/memory.txt | 2 +
> fs/buffer.c | 36 +++++++++-----
> fs/ceph/addr.c | 20 +-------
> include/linux/buffer_head.h | 2 +
> include/linux/memcontrol.h | 27 +++++++---
> mm/filemap.c | 5 ++
> mm/memcontrol.c | 99 +++++++++++++++++++++++--------------
> mm/page-writeback.c | 42 ++++++++++++++--
> mm/rmap.c | 4 +-
> mm/truncate.c | 6 ++
> 10 files changed, 159 insertions(+), 84 deletions(-)
>