2009-04-15 12:06:07

by Balbir Singh

[permalink] [raw]
Subject: [PATCH] Add file based RSS accounting for memory resource controller (v2)


Feature: Add file RSS tracking per memory cgroup

From: Balbir Singh <[email protected]>

Changelog v2 -> v1

1. Rename file_rss to mapped_file
2. Add hooks into mem_cgroup_move_account for updating MAPPED_FILE statistics
3. Use a better name for the statistics routine.


We currently don't track file RSS, the RSS we report is actually anon RSS.
All the file mapped pages, come in through the page cache and get accounted
there. This patch adds support for accounting file RSS pages. It should

1. Help improve the metrics reported by the memory resource controller
2. Will form the basis for a future shared memory accounting heuristic
that has been proposed by Kamezawa.

Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
to "anon_rss". We however, add "mapped_file" data and hope to educate the end
user through documentation.

Testing

1. Feature enabled, built and booted the kernel. Mounted the memory
resource controller and verified the statistics
2. Compiled with CGROUP_MEM_RES_CTLR disabled


Signed-off-by: Balbir Singh <[email protected]>
---

include/linux/memcontrol.h | 9 +++++++-
include/linux/rmap.h | 4 ++--
mm/filemap_xip.c | 2 +-
mm/fremap.c | 2 +-
mm/memcontrol.c | 49 +++++++++++++++++++++++++++++++++++++++++++-
mm/memory.c | 8 ++++---
mm/migrate.c | 2 +-
mm/rmap.c | 13 +++++++-----
8 files changed, 73 insertions(+), 16 deletions(-)


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..6864b88 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,7 +116,8 @@ static inline bool mem_cgroup_disabled(void)
}

extern bool mem_cgroup_oom_called(struct task_struct *task);
-
+void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
+ int val);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

@@ -264,6 +265,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
+ struct mm_struct *mm,
+ int val)
+{
+}
+
#endif /* CONFIG_CGROUP_MEM_CONT */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b35bc0e..01b4af1 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -68,8 +68,8 @@ void __anon_vma_link(struct vm_area_struct *);
*/
void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_add_file_rmap(struct page *, struct vm_area_struct *);
+void page_remove_rmap(struct page *, struct vm_area_struct *);

#ifdef CONFIG_DEBUG_VM
void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 427dfe3..e8b2b18 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -193,7 +193,7 @@ retry:
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush_notify(vma, address, pte);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
diff --git a/mm/fremap.c b/mm/fremap.c
index b6ec85a..01ea2da 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -37,7 +37,7 @@ static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);
update_hiwater_rss(mm);
dec_mm_counter(mm, file_rss);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..c1d5c37 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,7 +62,8 @@ enum mem_cgroup_stat_index {
* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
- MEM_CGROUP_STAT_RSS, /* # of pages charged as rss */
+ MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
+ MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */

@@ -321,6 +322,33 @@ static bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
return css_is_removed(&mem->css);
}

+/*
+ * Currently used to update mapped file statistics, but the routine can be
+ * generalized to update other statistics as well.
+ */
+void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
+ int val)
+{
+ struct mem_cgroup *mem;
+ struct mem_cgroup_stat *stat;
+ struct mem_cgroup_stat_cpu *cpustat;
+ int cpu = get_cpu();
+
+ if (!page_is_file_cache(page))
+ return;
+
+ if (unlikely(!mm))
+ mm = &init_mm;
+
+ mem = try_get_mem_cgroup_from_mm(mm);
+ if (!mem)
+ return;
+
+ stat = &mem->stat;
+ cpustat = &stat->cpustat[cpu];
+
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
+}

/*
* Call callback function against all cgroup under hierarchy tree.
@@ -1096,6 +1124,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
struct mem_cgroup_per_zone *from_mz, *to_mz;
int nid, zid;
int ret = -EBUSY;
+ struct mem_cgroup_stat *stat;
+ struct mem_cgroup_stat_cpu *cpustat;
+ int cpu;

VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
@@ -1116,6 +1147,18 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,

res_counter_uncharge(&from->res, PAGE_SIZE);
mem_cgroup_charge_statistics(from, pc, false);
+
+ cpu = get_cpu();
+ /* Update mapped_file data for mem_cgroup "from" */
+ stat = &from->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, -1);
+
+ /* Update mapped_file data for mem_cgroup "to" */
+ stat = &to->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, 1);
+
if (do_swap_account)
res_counter_uncharge(&from->memsw, PAGE_SIZE);
css_put(&from->css);
@@ -2051,6 +2094,7 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
enum {
MCS_CACHE,
MCS_RSS,
+ MCS_MAPPED_FILE,
MCS_PGPGIN,
MCS_PGPGOUT,
MCS_INACTIVE_ANON,
@@ -2071,6 +2115,7 @@ struct {
} memcg_stat_strings[NR_MCS_STAT] = {
{"cache", "total_cache"},
{"rss", "total_rss"},
+ {"mapped_file", "total_mapped_file"},
{"pgpgin", "total_pgpgin"},
{"pgpgout", "total_pgpgout"},
{"inactive_anon", "total_inactive_anon"},
@@ -2091,6 +2136,8 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
s->stat[MCS_CACHE] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
s->stat[MCS_RSS] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_MAPPED_FILE);
+ s->stat[MCS_MAPPED_FILE] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
s->stat[MCS_PGPGIN] += val;
val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
diff --git a/mm/memory.c b/mm/memory.c
index a715b19..95a9ded 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -822,7 +822,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
mark_page_accessed(page);
file_rss--;
}
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
tlb_remove_page(tlb, page);
@@ -1421,7 +1421,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
/* Ok, finally just insert the thing.. */
get_page(page);
inc_mm_counter(mm, file_rss);
- page_add_file_rmap(page);
+ page_add_file_rmap(page, vma);
set_pte_at(mm, addr, pte, mk_pte(page, prot));

retval = 0;
@@ -2080,7 +2080,7 @@ gotten:
* mapcount is visible. So transitively, TLBs to
* old page will be flushed before it can be reused.
*/
- page_remove_rmap(old_page);
+ page_remove_rmap(old_page, vma);
}

/* Free the old page.. */
@@ -2718,7 +2718,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
page_add_new_anon_rmap(page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
- page_add_file_rmap(page);
+ page_add_file_rmap(page, vma);
if (flags & FAULT_FLAG_WRITE) {
dirty_page = page;
get_page(dirty_page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 068655d..098d365 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -131,7 +131,7 @@ static void remove_migration_pte(struct vm_area_struct *vma,
if (PageAnon(new))
page_add_anon_rmap(new, vma, addr);
else
- page_add_file_rmap(new);
+ page_add_file_rmap(new, vma);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..3e29864 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -686,10 +686,12 @@ void page_add_new_anon_rmap(struct page *page,
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page)
+void page_add_file_rmap(struct page *page, struct vm_area_struct *vma)
{
- if (atomic_inc_and_test(&page->_mapcount))
+ if (atomic_inc_and_test(&page->_mapcount)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
+ mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, 1);
+ }
}

#ifdef CONFIG_DEBUG_VM
@@ -719,7 +721,7 @@ void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, struct vm_area_struct *vma)
{
if (atomic_add_negative(-1, &page->_mapcount)) {
/*
@@ -738,6 +740,7 @@ void page_remove_rmap(struct page *page)
mem_cgroup_uncharge_page(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
+ mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, -1);
/*
* It would be tidy to reset the PageAnon mapping here,
* but that might overwrite a racing page_add_anon_rmap
@@ -835,7 +838,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
dec_mm_counter(mm, file_rss);


- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);

out_unmap:
@@ -950,7 +953,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
if (pte_dirty(pteval))
set_page_dirty(page);

- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);
dec_mm_counter(mm, file_rss);
(*mapcount)--;

--
Balbir


2009-04-16 00:54:45

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Wed, 15 Apr 2009 17:35:10 +0530
Balbir Singh <[email protected]> wrote:

> +void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
> + int val)
> +{
> + struct mem_cgroup *mem;
> + struct mem_cgroup_stat *stat;
> + struct mem_cgroup_stat_cpu *cpustat;
> + int cpu = get_cpu();
> +
> + if (!page_is_file_cache(page))
> + return;
> +
> + if (unlikely(!mm))
> + mm = &init_mm;
> +
> + mem = try_get_mem_cgroup_from_mm(mm);
> + if (!mem)
> + return;
> +
> + stat = &mem->stat;
> + cpustat = &stat->cpustat[cpu];
> +
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> +}
>
put_cpu() is necessary.



> /*
> * Call callback function against all cgroup under hierarchy tree.
> @@ -1096,6 +1124,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> struct mem_cgroup_per_zone *from_mz, *to_mz;
> int nid, zid;
> int ret = -EBUSY;
> + struct mem_cgroup_stat *stat;
> + struct mem_cgroup_stat_cpu *cpustat;
> + int cpu;
>
> VM_BUG_ON(from == to);
> VM_BUG_ON(PageLRU(pc->page));
> @@ -1116,6 +1147,18 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>
> res_counter_uncharge(&from->res, PAGE_SIZE);
> mem_cgroup_charge_statistics(from, pc, false);
> +
> + cpu = get_cpu();
> + /* Update mapped_file data for mem_cgroup "from" */
> + stat = &from->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, -1);
> +
> + /* Update mapped_file data for mem_cgroup "to" */
> + stat = &to->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, 1);
> +
here, too.

Seems no troubles in other parts.

Regards,
-Kame

> if (do_swap_account)
> res_counter_uncharge(&from->memsw, PAGE_SIZE);
> css_put(&from->css);
> @@ -2051,6 +2094,7 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> enum {
> MCS_CACHE,
> MCS_RSS,
> + MCS_MAPPED_FILE,
> MCS_PGPGIN,
> MCS_PGPGOUT,
> MCS_INACTIVE_ANON,
> @@ -2071,6 +2115,7 @@ struct {
> } memcg_stat_strings[NR_MCS_STAT] = {
> {"cache", "total_cache"},
> {"rss", "total_rss"},
> + {"mapped_file", "total_mapped_file"},
> {"pgpgin", "total_pgpgin"},
> {"pgpgout", "total_pgpgout"},
> {"inactive_anon", "total_inactive_anon"},
> @@ -2091,6 +2136,8 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
> s->stat[MCS_CACHE] += val * PAGE_SIZE;
> val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
> s->stat[MCS_RSS] += val * PAGE_SIZE;
> + val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_MAPPED_FILE);
> + s->stat[MCS_MAPPED_FILE] += val * PAGE_SIZE;
> val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
> s->stat[MCS_PGPGIN] += val;
> val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
> diff --git a/mm/memory.c b/mm/memory.c
> index a715b19..95a9ded 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -822,7 +822,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> mark_page_accessed(page);
> file_rss--;
> }
> - page_remove_rmap(page);
> + page_remove_rmap(page, vma);
> if (unlikely(page_mapcount(page) < 0))
> print_bad_pte(vma, addr, ptent, page);
> tlb_remove_page(tlb, page);
> @@ -1421,7 +1421,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> /* Ok, finally just insert the thing.. */
> get_page(page);
> inc_mm_counter(mm, file_rss);
> - page_add_file_rmap(page);
> + page_add_file_rmap(page, vma);
> set_pte_at(mm, addr, pte, mk_pte(page, prot));
>
> retval = 0;
> @@ -2080,7 +2080,7 @@ gotten:
> * mapcount is visible. So transitively, TLBs to
> * old page will be flushed before it can be reused.
> */
> - page_remove_rmap(old_page);
> + page_remove_rmap(old_page, vma);
> }
>
> /* Free the old page.. */
> @@ -2718,7 +2718,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> page_add_new_anon_rmap(page, vma, address);
> } else {
> inc_mm_counter(mm, file_rss);
> - page_add_file_rmap(page);
> + page_add_file_rmap(page, vma);
> if (flags & FAULT_FLAG_WRITE) {
> dirty_page = page;
> get_page(dirty_page);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 068655d..098d365 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -131,7 +131,7 @@ static void remove_migration_pte(struct vm_area_struct *vma,
> if (PageAnon(new))
> page_add_anon_rmap(new, vma, addr);
> else
> - page_add_file_rmap(new);
> + page_add_file_rmap(new, vma);
>
> /* No need to invalidate - it was non-present before */
> update_mmu_cache(vma, addr, pte);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1652166..3e29864 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -686,10 +686,12 @@ void page_add_new_anon_rmap(struct page *page,
> *
> * The caller needs to hold the pte lock.
> */
> -void page_add_file_rmap(struct page *page)
> +void page_add_file_rmap(struct page *page, struct vm_area_struct *vma)
> {
> - if (atomic_inc_and_test(&page->_mapcount))
> + if (atomic_inc_and_test(&page->_mapcount)) {
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> + mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, 1);
> + }
> }
>
> #ifdef CONFIG_DEBUG_VM
> @@ -719,7 +721,7 @@ void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long
> *
> * The caller needs to hold the pte lock.
> */
> -void page_remove_rmap(struct page *page)
> +void page_remove_rmap(struct page *page, struct vm_area_struct *vma)
> {
> if (atomic_add_negative(-1, &page->_mapcount)) {
> /*
> @@ -738,6 +740,7 @@ void page_remove_rmap(struct page *page)
> mem_cgroup_uncharge_page(page);
> __dec_zone_page_state(page,
> PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> + mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, -1);
> /*
> * It would be tidy to reset the PageAnon mapping here,
> * but that might overwrite a racing page_add_anon_rmap
> @@ -835,7 +838,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> dec_mm_counter(mm, file_rss);
>
>
> - page_remove_rmap(page);
> + page_remove_rmap(page, vma);
> page_cache_release(page);
>
> out_unmap:
> @@ -950,7 +953,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> if (pte_dirty(pteval))
> set_page_dirty(page);
>
> - page_remove_rmap(page);
> + page_remove_rmap(page, vma);
> page_cache_release(page);
> dec_mm_counter(mm, file_rss);
> (*mapcount)--;
>
> --
> Balbir
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-04-16 02:00:48

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 09:53:03]:

> On Wed, 15 Apr 2009 17:35:10 +0530
> Balbir Singh <[email protected]> wrote:
>
> > +void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
> > + int val)
> > +{
> > + struct mem_cgroup *mem;
> > + struct mem_cgroup_stat *stat;
> > + struct mem_cgroup_stat_cpu *cpustat;
> > + int cpu = get_cpu();
> > +
> > + if (!page_is_file_cache(page))
> > + return;
> > +
> > + if (unlikely(!mm))
> > + mm = &init_mm;
> > +
> > + mem = try_get_mem_cgroup_from_mm(mm);
> > + if (!mem)
> > + return;
> > +
> > + stat = &mem->stat;
> > + cpustat = &stat->cpustat[cpu];
> > +
> > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > +}
> >
> put_cpu() is necessary.

Thanks, I could have almost sworn I had it.. but I clearly don't

Here is the fixed version

Feature: Add file RSS tracking per memory cgroup

From: Balbir Singh <[email protected]>

Changelog v3 -> v2
1. Add corresponding put_cpu() for every get_cpu()

Changelog v2 -> v1

1. Rename file_rss to mapped_file
2. Add hooks into mem_cgroup_move_account for updating MAPPED_FILE statistics
3. Use a better name for the statistics routine.


We currently don't track file RSS, the RSS we report is actually anon RSS.
All the file mapped pages, come in through the page cache and get accounted
there. This patch adds support for accounting file RSS pages. It should

1. Help improve the metrics reported by the memory resource controller
2. Will form the basis for a future shared memory accounting heuristic
that has been proposed by Kamezawa.

Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
to "anon_rss". We however, add "mapped_file" data and hope to educate the end
user through documentation.

Signed-off-by: Balbir Singh <[email protected]>
---

include/linux/memcontrol.h | 9 +++++++-
include/linux/rmap.h | 4 ++-
mm/filemap_xip.c | 2 +-
mm/fremap.c | 2 +-
mm/memcontrol.c | 51 +++++++++++++++++++++++++++++++++++++++++++-
mm/memory.c | 8 +++----
mm/migrate.c | 2 +-
mm/rmap.c | 13 +++++++----
8 files changed, 75 insertions(+), 16 deletions(-)


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..6864b88 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,7 +116,8 @@ static inline bool mem_cgroup_disabled(void)
}

extern bool mem_cgroup_oom_called(struct task_struct *task);
-
+void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
+ int val);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

@@ -264,6 +265,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
+ struct mm_struct *mm,
+ int val)
+{
+}
+
#endif /* CONFIG_CGROUP_MEM_CONT */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b35bc0e..01b4af1 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -68,8 +68,8 @@ void __anon_vma_link(struct vm_area_struct *);
*/
void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_add_file_rmap(struct page *, struct vm_area_struct *);
+void page_remove_rmap(struct page *, struct vm_area_struct *);

#ifdef CONFIG_DEBUG_VM
void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 427dfe3..e8b2b18 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -193,7 +193,7 @@ retry:
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush_notify(vma, address, pte);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
diff --git a/mm/fremap.c b/mm/fremap.c
index b6ec85a..01ea2da 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -37,7 +37,7 @@ static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);
update_hiwater_rss(mm);
dec_mm_counter(mm, file_rss);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..797892c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,7 +62,8 @@ enum mem_cgroup_stat_index {
* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
- MEM_CGROUP_STAT_RSS, /* # of pages charged as rss */
+ MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
+ MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */

@@ -321,6 +322,34 @@ static bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
return css_is_removed(&mem->css);
}

+/*
+ * Currently used to update mapped file statistics, but the routine can be
+ * generalized to update other statistics as well.
+ */
+void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
+ int val)
+{
+ struct mem_cgroup *mem;
+ struct mem_cgroup_stat *stat;
+ struct mem_cgroup_stat_cpu *cpustat;
+ int cpu = get_cpu();
+
+ if (!page_is_file_cache(page))
+ return;
+
+ if (unlikely(!mm))
+ mm = &init_mm;
+
+ mem = try_get_mem_cgroup_from_mm(mm);
+ if (!mem)
+ return;
+
+ stat = &mem->stat;
+ cpustat = &stat->cpustat[cpu];
+
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
+ put_cpu();
+}

/*
* Call callback function against all cgroup under hierarchy tree.
@@ -1096,6 +1125,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
struct mem_cgroup_per_zone *from_mz, *to_mz;
int nid, zid;
int ret = -EBUSY;
+ struct mem_cgroup_stat *stat;
+ struct mem_cgroup_stat_cpu *cpustat;
+ int cpu;

VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
@@ -1116,6 +1148,19 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,

res_counter_uncharge(&from->res, PAGE_SIZE);
mem_cgroup_charge_statistics(from, pc, false);
+
+ cpu = get_cpu();
+ /* Update mapped_file data for mem_cgroup "from" */
+ stat = &from->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, -1);
+
+ /* Update mapped_file data for mem_cgroup "to" */
+ stat = &to->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, 1);
+ put_cpu();
+
if (do_swap_account)
res_counter_uncharge(&from->memsw, PAGE_SIZE);
css_put(&from->css);
@@ -2051,6 +2096,7 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
enum {
MCS_CACHE,
MCS_RSS,
+ MCS_MAPPED_FILE,
MCS_PGPGIN,
MCS_PGPGOUT,
MCS_INACTIVE_ANON,
@@ -2071,6 +2117,7 @@ struct {
} memcg_stat_strings[NR_MCS_STAT] = {
{"cache", "total_cache"},
{"rss", "total_rss"},
+ {"mapped_file", "total_mapped_file"},
{"pgpgin", "total_pgpgin"},
{"pgpgout", "total_pgpgout"},
{"inactive_anon", "total_inactive_anon"},
@@ -2091,6 +2138,8 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
s->stat[MCS_CACHE] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
s->stat[MCS_RSS] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_MAPPED_FILE);
+ s->stat[MCS_MAPPED_FILE] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
s->stat[MCS_PGPGIN] += val;
val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
diff --git a/mm/memory.c b/mm/memory.c
index a715b19..95a9ded 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -822,7 +822,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
mark_page_accessed(page);
file_rss--;
}
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
tlb_remove_page(tlb, page);
@@ -1421,7 +1421,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
/* Ok, finally just insert the thing.. */
get_page(page);
inc_mm_counter(mm, file_rss);
- page_add_file_rmap(page);
+ page_add_file_rmap(page, vma);
set_pte_at(mm, addr, pte, mk_pte(page, prot));

retval = 0;
@@ -2080,7 +2080,7 @@ gotten:
* mapcount is visible. So transitively, TLBs to
* old page will be flushed before it can be reused.
*/
- page_remove_rmap(old_page);
+ page_remove_rmap(old_page, vma);
}

/* Free the old page.. */
@@ -2718,7 +2718,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
page_add_new_anon_rmap(page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
- page_add_file_rmap(page);
+ page_add_file_rmap(page, vma);
if (flags & FAULT_FLAG_WRITE) {
dirty_page = page;
get_page(dirty_page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 068655d..098d365 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -131,7 +131,7 @@ static void remove_migration_pte(struct vm_area_struct *vma,
if (PageAnon(new))
page_add_anon_rmap(new, vma, addr);
else
- page_add_file_rmap(new);
+ page_add_file_rmap(new, vma);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..3e29864 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -686,10 +686,12 @@ void page_add_new_anon_rmap(struct page *page,
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page)
+void page_add_file_rmap(struct page *page, struct vm_area_struct *vma)
{
- if (atomic_inc_and_test(&page->_mapcount))
+ if (atomic_inc_and_test(&page->_mapcount)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
+ mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, 1);
+ }
}

#ifdef CONFIG_DEBUG_VM
@@ -719,7 +721,7 @@ void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, struct vm_area_struct *vma)
{
if (atomic_add_negative(-1, &page->_mapcount)) {
/*
@@ -738,6 +740,7 @@ void page_remove_rmap(struct page *page)
mem_cgroup_uncharge_page(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
+ mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, -1);
/*
* It would be tidy to reset the PageAnon mapping here,
* but that might overwrite a racing page_add_anon_rmap
@@ -835,7 +838,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
dec_mm_counter(mm, file_rss);


- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);

out_unmap:
@@ -950,7 +953,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
if (pte_dirty(pteval))
set_page_dirty(page);

- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);
dec_mm_counter(mm, file_rss);
(*mapcount)--;



--
Balbir

2009-04-16 02:04:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Thu, 16 Apr 2009 07:29:55 +0530
Balbir Singh <[email protected]> wrote:

> Thanks, I could have almost sworn I had it.. but I clearly don't
>
> Here is the fixed version
>
> Feature: Add file RSS tracking per memory cgroup
>
> From: Balbir Singh <[email protected]>
>
> Changelog v3 -> v2
> 1. Add corresponding put_cpu() for every get_cpu()
>
> Changelog v2 -> v1
>
> 1. Rename file_rss to mapped_file
> 2. Add hooks into mem_cgroup_move_account for updating MAPPED_FILE statistics
> 3. Use a better name for the statistics routine.
>
>
> We currently don't track file RSS, the RSS we report is actually anon RSS.
> All the file mapped pages, come in through the page cache and get accounted
> there. This patch adds support for accounting file RSS pages. It should
>
> 1. Help improve the metrics reported by the memory resource controller
> 2. Will form the basis for a future shared memory accounting heuristic
> that has been proposed by Kamezawa.
>
> Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
> to "anon_rss". We however, add "mapped_file" data and hope to educate the end
> user through documentation.
>
> Signed-off-by: Balbir Singh <[email protected]>

Nice feature :) Thanks.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

I'll test this today.

> ---
>
> include/linux/memcontrol.h | 9 +++++++-
> include/linux/rmap.h | 4 ++-
> mm/filemap_xip.c | 2 +-
> mm/fremap.c | 2 +-
> mm/memcontrol.c | 51 +++++++++++++++++++++++++++++++++++++++++++-
> mm/memory.c | 8 +++----
> mm/migrate.c | 2 +-
> mm/rmap.c | 13 +++++++----
> 8 files changed, 75 insertions(+), 16 deletions(-)
>
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..6864b88 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -116,7 +116,8 @@ static inline bool mem_cgroup_disabled(void)
> }
>
> extern bool mem_cgroup_oom_called(struct task_struct *task);
> -
> +void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
> + int val);
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> @@ -264,6 +265,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> {
> }
>
> +static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
> + struct mm_struct *mm,
> + int val)
> +{
> +}
> +
> #endif /* CONFIG_CGROUP_MEM_CONT */
>
> #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b35bc0e..01b4af1 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -68,8 +68,8 @@ void __anon_vma_link(struct vm_area_struct *);
> */
> void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
> void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
> -void page_add_file_rmap(struct page *);
> -void page_remove_rmap(struct page *);
> +void page_add_file_rmap(struct page *, struct vm_area_struct *);
> +void page_remove_rmap(struct page *, struct vm_area_struct *);
>
> #ifdef CONFIG_DEBUG_VM
> void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address);
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index 427dfe3..e8b2b18 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -193,7 +193,7 @@ retry:
> /* Nuke the page table entry. */
> flush_cache_page(vma, address, pte_pfn(*pte));
> pteval = ptep_clear_flush_notify(vma, address, pte);
> - page_remove_rmap(page);
> + page_remove_rmap(page, vma);
> dec_mm_counter(mm, file_rss);
> BUG_ON(pte_dirty(pteval));
> pte_unmap_unlock(pte, ptl);
> diff --git a/mm/fremap.c b/mm/fremap.c
> index b6ec85a..01ea2da 100644
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -37,7 +37,7 @@ static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
> if (page) {
> if (pte_dirty(pte))
> set_page_dirty(page);
> - page_remove_rmap(page);
> + page_remove_rmap(page, vma);
> page_cache_release(page);
> update_hiwater_rss(mm);
> dec_mm_counter(mm, file_rss);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e44fb0f..797892c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -62,7 +62,8 @@ enum mem_cgroup_stat_index {
> * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> */
> MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> - MEM_CGROUP_STAT_RSS, /* # of pages charged as rss */
> + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> + MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
>
> @@ -321,6 +322,34 @@ static bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
> return css_is_removed(&mem->css);
> }
>
> +/*
> + * Currently used to update mapped file statistics, but the routine can be
> + * generalized to update other statistics as well.
> + */
> +void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
> + int val)
> +{
> + struct mem_cgroup *mem;
> + struct mem_cgroup_stat *stat;
> + struct mem_cgroup_stat_cpu *cpustat;
> + int cpu = get_cpu();
> +
> + if (!page_is_file_cache(page))
> + return;
> +
> + if (unlikely(!mm))
> + mm = &init_mm;
> +
> + mem = try_get_mem_cgroup_from_mm(mm);
> + if (!mem)
> + return;
> +
> + stat = &mem->stat;
> + cpustat = &stat->cpustat[cpu];
> +
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> + put_cpu();
> +}
>
> /*
> * Call callback function against all cgroup under hierarchy tree.
> @@ -1096,6 +1125,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> struct mem_cgroup_per_zone *from_mz, *to_mz;
> int nid, zid;
> int ret = -EBUSY;
> + struct mem_cgroup_stat *stat;
> + struct mem_cgroup_stat_cpu *cpustat;
> + int cpu;
>
> VM_BUG_ON(from == to);
> VM_BUG_ON(PageLRU(pc->page));
> @@ -1116,6 +1148,19 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>
> res_counter_uncharge(&from->res, PAGE_SIZE);
> mem_cgroup_charge_statistics(from, pc, false);
> +
> + cpu = get_cpu();
> + /* Update mapped_file data for mem_cgroup "from" */
> + stat = &from->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, -1);
> +
> + /* Update mapped_file data for mem_cgroup "to" */
> + stat = &to->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, 1);
> + put_cpu();
> +
> if (do_swap_account)
> res_counter_uncharge(&from->memsw, PAGE_SIZE);
> css_put(&from->css);
> @@ -2051,6 +2096,7 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> enum {
> MCS_CACHE,
> MCS_RSS,
> + MCS_MAPPED_FILE,
> MCS_PGPGIN,
> MCS_PGPGOUT,
> MCS_INACTIVE_ANON,
> @@ -2071,6 +2117,7 @@ struct {
> } memcg_stat_strings[NR_MCS_STAT] = {
> {"cache", "total_cache"},
> {"rss", "total_rss"},
> + {"mapped_file", "total_mapped_file"},
> {"pgpgin", "total_pgpgin"},
> {"pgpgout", "total_pgpgout"},
> {"inactive_anon", "total_inactive_anon"},
> @@ -2091,6 +2138,8 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
> s->stat[MCS_CACHE] += val * PAGE_SIZE;
> val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
> s->stat[MCS_RSS] += val * PAGE_SIZE;
> + val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_MAPPED_FILE);
> + s->stat[MCS_MAPPED_FILE] += val * PAGE_SIZE;
> val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
> s->stat[MCS_PGPGIN] += val;
> val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
> diff --git a/mm/memory.c b/mm/memory.c
> index a715b19..95a9ded 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -822,7 +822,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> mark_page_accessed(page);
> file_rss--;
> }
> - page_remove_rmap(page);
> + page_remove_rmap(page, vma);
> if (unlikely(page_mapcount(page) < 0))
> print_bad_pte(vma, addr, ptent, page);
> tlb_remove_page(tlb, page);
> @@ -1421,7 +1421,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> /* Ok, finally just insert the thing.. */
> get_page(page);
> inc_mm_counter(mm, file_rss);
> - page_add_file_rmap(page);
> + page_add_file_rmap(page, vma);
> set_pte_at(mm, addr, pte, mk_pte(page, prot));
>
> retval = 0;
> @@ -2080,7 +2080,7 @@ gotten:
> * mapcount is visible. So transitively, TLBs to
> * old page will be flushed before it can be reused.
> */
> - page_remove_rmap(old_page);
> + page_remove_rmap(old_page, vma);
> }
>
> /* Free the old page.. */
> @@ -2718,7 +2718,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> page_add_new_anon_rmap(page, vma, address);
> } else {
> inc_mm_counter(mm, file_rss);
> - page_add_file_rmap(page);
> + page_add_file_rmap(page, vma);
> if (flags & FAULT_FLAG_WRITE) {
> dirty_page = page;
> get_page(dirty_page);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 068655d..098d365 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -131,7 +131,7 @@ static void remove_migration_pte(struct vm_area_struct *vma,
> if (PageAnon(new))
> page_add_anon_rmap(new, vma, addr);
> else
> - page_add_file_rmap(new);
> + page_add_file_rmap(new, vma);
>
> /* No need to invalidate - it was non-present before */
> update_mmu_cache(vma, addr, pte);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1652166..3e29864 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -686,10 +686,12 @@ void page_add_new_anon_rmap(struct page *page,
> *
> * The caller needs to hold the pte lock.
> */
> -void page_add_file_rmap(struct page *page)
> +void page_add_file_rmap(struct page *page, struct vm_area_struct *vma)
> {
> - if (atomic_inc_and_test(&page->_mapcount))
> + if (atomic_inc_and_test(&page->_mapcount)) {
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> + mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, 1);
> + }
> }
>
> #ifdef CONFIG_DEBUG_VM
> @@ -719,7 +721,7 @@ void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long
> *
> * The caller needs to hold the pte lock.
> */
> -void page_remove_rmap(struct page *page)
> +void page_remove_rmap(struct page *page, struct vm_area_struct *vma)
> {
> if (atomic_add_negative(-1, &page->_mapcount)) {
> /*
> @@ -738,6 +740,7 @@ void page_remove_rmap(struct page *page)
> mem_cgroup_uncharge_page(page);
> __dec_zone_page_state(page,
> PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> + mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, -1);
> /*
> * It would be tidy to reset the PageAnon mapping here,
> * but that might overwrite a racing page_add_anon_rmap
> @@ -835,7 +838,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> dec_mm_counter(mm, file_rss);
>
>
> - page_remove_rmap(page);
> + page_remove_rmap(page, vma);
> page_cache_release(page);
>
> out_unmap:
> @@ -950,7 +953,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> if (pte_dirty(pteval))
> set_page_dirty(page);
>
> - page_remove_rmap(page);
> + page_remove_rmap(page, vma);
> page_cache_release(page);
> dec_mm_counter(mm, file_rss);
> (*mapcount)--;
>
>
>
> --
> Balbir
>

2009-04-16 04:06:44

by Bharata B Rao

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Thu, Apr 16, 2009 at 7:29 AM, Balbir Singh <[email protected]> wrote:
>
> Feature: Add file RSS tracking per memory cgroup
>
> From: Balbir Singh <[email protected]>
>
> Changelog v3 -> v2
> 1. Add corresponding put_cpu() for every get_cpu()
>
> Changelog v2 -> v1
>
> 1. Rename file_rss to mapped_file
> 2. Add hooks into mem_cgroup_move_account for updating MAPPED_FILE statistics
> 3. Use a better name for the statistics routine.
>
>
> We currently don't track file RSS, the RSS we report is actually anon RSS.
> All the file mapped pages, come in through the page cache and get accounted
> there. This patch adds support for accounting file RSS pages. It should
>
> 1. Help improve the metrics reported by the memory resource controller
> 2. Will form the basis for a future shared memory accounting heuristic
> ? that has been proposed by Kamezawa.
>
> Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
> to "anon_rss". We however, add "mapped_file" data and hope to educate the end
> user through documentation.
>
> Signed-off-by: Balbir Singh <[email protected]>

Balbir, could you please also update the documentation with the
description about this new metric ?

Regards,
Bharata.
--
http://bharata.sulekha.com/blog/posts.htm

2009-04-16 04:35:26

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

* Bharata B Rao <[email protected]> [2009-04-16 09:29:53]:

> On Thu, Apr 16, 2009 at 7:29 AM, Balbir Singh <[email protected]> wrote:
> >
> > Feature: Add file RSS tracking per memory cgroup
> >
> > From: Balbir Singh <[email protected]>
> >
> > Changelog v3 -> v2
> > 1. Add corresponding put_cpu() for every get_cpu()
> >
> > Changelog v2 -> v1
> >
> > 1. Rename file_rss to mapped_file
> > 2. Add hooks into mem_cgroup_move_account for updating MAPPED_FILE statistics
> > 3. Use a better name for the statistics routine.
> >
> >
> > We currently don't track file RSS, the RSS we report is actually anon RSS.
> > All the file mapped pages, come in through the page cache and get accounted
> > there. This patch adds support for accounting file RSS pages. It should
> >
> > 1. Help improve the metrics reported by the memory resource controller
> > 2. Will form the basis for a future shared memory accounting heuristic
> > ? that has been proposed by Kamezawa.
> >
> > Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
> > to "anon_rss". We however, add "mapped_file" data and hope to educate the end
> > user through documentation.
> >
> > Signed-off-by: Balbir Singh <[email protected]>
>
> Balbir, could you please also update the documentation with the
> description about this new metric ?
>

Yes, I am going to in a follow up patch. I'll send that out after I do
some experimentation with shared memory size heuristics.

--
Balbir

2009-04-16 07:42:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Thu, 16 Apr 2009 11:02:46 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Thu, 16 Apr 2009 07:29:55 +0530
> Balbir Singh <[email protected]> wrote:
>
> > Thanks, I could have almost sworn I had it.. but I clearly don't
> >
> > Here is the fixed version
> >
> > Feature: Add file RSS tracking per memory cgroup
> >
> > From: Balbir Singh <[email protected]>
> >
> > Changelog v3 -> v2
> > 1. Add corresponding put_cpu() for every get_cpu()
> >
> > Changelog v2 -> v1
> >
> > 1. Rename file_rss to mapped_file
> > 2. Add hooks into mem_cgroup_move_account for updating MAPPED_FILE statistics
> > 3. Use a better name for the statistics routine.
> >
> >
> > We currently don't track file RSS, the RSS we report is actually anon RSS.
> > All the file mapped pages, come in through the page cache and get accounted
> > there. This patch adds support for accounting file RSS pages. It should
> >
> > 1. Help improve the metrics reported by the memory resource controller
> > 2. Will form the basis for a future shared memory accounting heuristic
> > that has been proposed by Kamezawa.
> >
> > Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
> > to "anon_rss". We however, add "mapped_file" data and hope to educate the end
> > user through documentation.
> >
> > Signed-off-by: Balbir Singh <[email protected]>
>
> Nice feature :) Thanks.
>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>
> I'll test this today.
>
Sorry, some troubles found. Ignore above Ack. 3points now.

1. get_cpu should be after (*)
==mem_cgroup_update_mapped_file_stat()
+ int cpu = get_cpu();
+
+ if (!page_is_file_cache(page))
+ return;
+
+ if (unlikely(!mm))
+ mm = &init_mm;
+
+ mem = try_get_mem_cgroup_from_mm(mm);
+ if (!mem)
+ return;
+ ----------------------------------------(*)
+ stat = &mem->stat;
+ cpustat = &stat->cpustat[cpu];
+
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
+ put_cpu();
+}
==

2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
(Because it's file cache, pc->mem_cgroup is not NULL always.)

I saw this very easily.
==
Cache: 4096
mapped_file: 20480
==

3. at force_empty().
==
+
+ cpu = get_cpu();
+ /* Update mapped_file data for mem_cgroup "from" */
+ stat = &from->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, -1);
+
+ /* Update mapped_file data for mem_cgroup "to" */
+ stat = &to->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, 1);
+ put_cpu();

This just breaks counter when page is not mapped. please check page_mapped().

like this:
==
if (page_is_file_cache(page) && page_mapped(page)) {
modify counter.
}
==

and call lock_page_cgroup() in mem_cgroup_update_mapped_file_stat().

This will be slow, but optimization will be very tricky and need some amount of time.


-Kame


2009-04-16 08:17:24

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)


> Sorry, some troubles found. Ignore above Ack. 3points now.
>
> 1. get_cpu should be after (*)
> ==mem_cgroup_update_mapped_file_stat()
> + int cpu = get_cpu();
> +
> + if (!page_is_file_cache(page))
> + return;
> +
> + if (unlikely(!mm))
> + mm = &init_mm;
> +
> + mem = try_get_mem_cgroup_from_mm(mm);
> + if (!mem)
> + return;
> + ----------------------------------------(*)
> + stat = &mem->stat;
> + cpustat = &stat->cpustat[cpu];
> +
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> + put_cpu();
> +}
> ==
>
> 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> (Because it's file cache, pc->mem_cgroup is not NULL always.)
>
> I saw this very easily.
> ==
> Cache: 4096
> mapped_file: 20480
> ==
>
> 3. at force_empty().
> ==
> +
> + cpu = get_cpu();
> + /* Update mapped_file data for mem_cgroup "from" */
> + stat = &from->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, -1);
> +
> + /* Update mapped_file data for mem_cgroup "to" */
> + stat = &to->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, 1);
> + put_cpu();
>
> This just breaks counter when page is not mapped. please check page_mapped().
>
> like this:
> ==
> if (page_is_file_cache(page) && page_mapped(page)) {
> modify counter.
> }
> ==
>
> and call lock_page_cgroup() in mem_cgroup_update_mapped_file_stat().
>
> This will be slow, but optimization will be very tricky and need some amount of time.
>

This is my fix for above 3 problems. but plz do as you like.
I'm not very intersted in details.
==

---
Index: mmotm-2.6.30-Apr14/mm/memcontrol.c
===================================================================
--- mmotm-2.6.30-Apr14.orig/mm/memcontrol.c
+++ mmotm-2.6.30-Apr14/mm/memcontrol.c
@@ -322,33 +322,39 @@ static bool mem_cgroup_is_obsolete(struc
return css_is_removed(&mem->css);
}

-/*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
- */
-void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
- int val)
+void mem_cgroup_update_mapped_file_stat(struct page *page, bool map)
{
struct mem_cgroup *mem;
struct mem_cgroup_stat *stat;
struct mem_cgroup_stat_cpu *cpustat;
- int cpu = get_cpu();
+ struct page_cgroup *pc;
+ int cpu;

if (!page_is_file_cache(page))
return;

- if (unlikely(!mm))
- mm = &init_mm;
-
- mem = try_get_mem_cgroup_from_mm(mm);
- if (!mem)
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
return;
-
- stat = &mem->stat;
- cpustat = &stat->cpustat[cpu];
-
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
- put_cpu();
+ lock_page_cgroup(pc);
+ mem = pc->mem_cgroup;
+ if (mem) {
+ cpu = get_cpu();
+ stat = &mem->stat;
+ cpustat = &stat->cpustat[cpu];
+ if (map)
+ __mem_cgroup_stat_add_safe(cpustat,
+ MEM_CGROUP_STAT_MAPPED_FILE, 1);
+ else
+ __mem_cgroup_stat_add_safe(cpustat,
+ MEM_CGROUP_STAT_MAPPED_FILE, -1);
+ put_cpu();
+ }
+ if (map)
+ SetPageCgroupMapped(pc);
+ else
+ ClearPageCgroupMapped(pc);
+ unlock_page_cgroup(pc);
}

/*
@@ -1149,17 +1155,19 @@ static int mem_cgroup_move_account(struc
res_counter_uncharge(&from->res, PAGE_SIZE);
mem_cgroup_charge_statistics(from, pc, false);

- cpu = get_cpu();
- /* Update mapped_file data for mem_cgroup "from" */
- stat = &from->stat;
- cpustat = &stat->cpustat[cpu];
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, -1);
-
- /* Update mapped_file data for mem_cgroup "to" */
- stat = &to->stat;
- cpustat = &stat->cpustat[cpu];
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, 1);
- put_cpu();
+ if (PageCgroupMapped(pc)) {
+ cpu = get_cpu();
+ /* Update mapped_file data for mem_cgroup "from" and "to" */
+ stat = &from->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat,
+ MEM_CGROUP_STAT_MAPPED_FILE, -1);
+ stat = &to->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat,
+ MEM_CGROUP_STAT_MAPPED_FILE, 1);
+ put_cpu();
+ }

if (do_swap_account)
res_counter_uncharge(&from->memsw, PAGE_SIZE);
Index: mmotm-2.6.30-Apr14/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.30-Apr14.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.30-Apr14/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
+ PCG_MAPPED, /* mapped file cache */
};

#define TESTPCGFLAG(uname, lname) \
@@ -46,6 +47,10 @@ TESTPCGFLAG(Cache, CACHE)
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)

+TESTPCGFLAG(Mapped, USED)
+CLEARPCGFLAG(Mapped, USED)
+SETPCGFLAG(Mapped, USED)
+
static inline int page_cgroup_nid(struct page_cgroup *pc)
{
return page_to_nid(pc->page);
Index: mmotm-2.6.30-Apr14/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.30-Apr14.orig/include/linux/memcontrol.h
+++ mmotm-2.6.30-Apr14/include/linux/memcontrol.h
@@ -116,8 +116,7 @@ static inline bool mem_cgroup_disabled(v
}

extern bool mem_cgroup_oom_called(struct task_struct *task);
-void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
- int val);
+void mem_cgroup_update_mapped_file_stat(struct page *page, bool map);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

@@ -266,8 +265,7 @@ mem_cgroup_print_oom_info(struct mem_cgr
}

static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
- struct mm_struct *mm,
- int val)
+ bool map)
{
}

Index: mmotm-2.6.30-Apr14/mm/rmap.c
===================================================================
--- mmotm-2.6.30-Apr14.orig/mm/rmap.c
+++ mmotm-2.6.30-Apr14/mm/rmap.c
@@ -690,7 +690,7 @@ void page_add_file_rmap(struct page *pag
{
if (atomic_inc_and_test(&page->_mapcount)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, 1);
+ mem_cgroup_update_mapped_file_stat(page, true);
}
}

@@ -740,7 +740,7 @@ void page_remove_rmap(struct page *page,
mem_cgroup_uncharge_page(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
- mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, -1);
+ mem_cgroup_update_mapped_file_stat(page, false);
/*
* It would be tidy to reset the PageAnon mapping here,
* but that might overwrite a racing page_add_anon_rmap




2009-04-16 12:04:20

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 17:15:35]:

>
> > Sorry, some troubles found. Ignore above Ack. 3points now.
> >
> > 1. get_cpu should be after (*)
> > ==mem_cgroup_update_mapped_file_stat()
> > + int cpu = get_cpu();
> > +
> > + if (!page_is_file_cache(page))
> > + return;
> > +
> > + if (unlikely(!mm))
> > + mm = &init_mm;
> > +
> > + mem = try_get_mem_cgroup_from_mm(mm);
> > + if (!mem)
> > + return;
> > + ----------------------------------------(*)
> > + stat = &mem->stat;
> > + cpustat = &stat->cpustat[cpu];
> > +
> > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > + put_cpu();
> > +}
> > ==

Yes or I should have a goto

> >
> > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > (Because it's file cache, pc->mem_cgroup is not NULL always.)

Hmmm.. not sure I understand this part. Are you suggesting that mm can
be NULL? I added the check for !mm as a safety check. Since this
routine is only called from rmap context, mm is not NULL, hence mem
should not be NULL. Did you find a race between mm->owner assignment
and lookup via mm->owner?

> >
> > I saw this very easily.
> > ==
> > Cache: 4096
> > mapped_file: 20480
> > ==
> >
> > 3. at force_empty().
> > ==
> > +
> > + cpu = get_cpu();
> > + /* Update mapped_file data for mem_cgroup "from" */
> > + stat = &from->stat;
> > + cpustat = &stat->cpustat[cpu];
> > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, -1);
> > +
> > + /* Update mapped_file data for mem_cgroup "to" */
> > + stat = &to->stat;
> > + cpustat = &stat->cpustat[cpu];
> > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, 1);
> > + put_cpu();
> >
> > This just breaks counter when page is not mapped. please check page_mapped().
> >
> > like this:
> > ==
> > if (page_is_file_cache(page) && page_mapped(page)) {
> > modify counter.
> > }
>

Oh! yeah..

> ==
> >
> > and call lock_page_cgroup() in mem_cgroup_update_mapped_file_stat().
> >
> > This will be slow, but optimization will be very tricky and need some amount of time.
> >
>
> This is my fix for above 3 problems. but plz do as you like.
> I'm not very intersted in details.
> ==
>
> ---
> Index: mmotm-2.6.30-Apr14/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.30-Apr14.orig/mm/memcontrol.c
> +++ mmotm-2.6.30-Apr14/mm/memcontrol.c
> @@ -322,33 +322,39 @@ static bool mem_cgroup_is_obsolete(struc
> return css_is_removed(&mem->css);
> }
>
> -/*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> - */
> -void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
> - int val)
> +void mem_cgroup_update_mapped_file_stat(struct page *page, bool map)
> {
> struct mem_cgroup *mem;
> struct mem_cgroup_stat *stat;
> struct mem_cgroup_stat_cpu *cpustat;
> - int cpu = get_cpu();
> + struct page_cgroup *pc;
> + int cpu;
>
> if (!page_is_file_cache(page))
> return;
>
> - if (unlikely(!mm))
> - mm = &init_mm;
> -
> - mem = try_get_mem_cgroup_from_mm(mm);
> - if (!mem)
> + pc = lookup_page_cgroup(page);
> + if (unlikely(!pc))
> return;
> -
> - stat = &mem->stat;
> - cpustat = &stat->cpustat[cpu];
> -
> - __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> - put_cpu();
> + lock_page_cgroup(pc);
> + mem = pc->mem_cgroup;
> + if (mem) {
> + cpu = get_cpu();
> + stat = &mem->stat;
> + cpustat = &stat->cpustat[cpu];
> + if (map)
> + __mem_cgroup_stat_add_safe(cpustat,
> + MEM_CGROUP_STAT_MAPPED_FILE, 1);
> + else
> + __mem_cgroup_stat_add_safe(cpustat,
> + MEM_CGROUP_STAT_MAPPED_FILE, -1);
> + put_cpu();
> + }
> + if (map)
> + SetPageCgroupMapped(pc);
> + else
> + ClearPageCgroupMapped(pc);
> + unlock_page_cgroup(pc);
> }
>
> /*
> @@ -1149,17 +1155,19 @@ static int mem_cgroup_move_account(struc
> res_counter_uncharge(&from->res, PAGE_SIZE);
> mem_cgroup_charge_statistics(from, pc, false);
>
> - cpu = get_cpu();
> - /* Update mapped_file data for mem_cgroup "from" */
> - stat = &from->stat;
> - cpustat = &stat->cpustat[cpu];
> - __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, -1);
> -
> - /* Update mapped_file data for mem_cgroup "to" */
> - stat = &to->stat;
> - cpustat = &stat->cpustat[cpu];
> - __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, 1);
> - put_cpu();
> + if (PageCgroupMapped(pc)) {
> + cpu = get_cpu();
> + /* Update mapped_file data for mem_cgroup "from" and "to" */
> + stat = &from->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat,
> + MEM_CGROUP_STAT_MAPPED_FILE, -1);
> + stat = &to->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat,
> + MEM_CGROUP_STAT_MAPPED_FILE, 1);
> + put_cpu();
> + }
>
> if (do_swap_account)
> res_counter_uncharge(&from->memsw, PAGE_SIZE);
> Index: mmotm-2.6.30-Apr14/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.30-Apr14.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.30-Apr14/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
> PCG_LOCK, /* page cgroup is locked */
> PCG_CACHE, /* charged as cache */
> PCG_USED, /* this object is in use. */
> + PCG_MAPPED, /* mapped file cache */
> };
>
> #define TESTPCGFLAG(uname, lname) \
> @@ -46,6 +47,10 @@ TESTPCGFLAG(Cache, CACHE)
> TESTPCGFLAG(Used, USED)
> CLEARPCGFLAG(Used, USED)
>
> +TESTPCGFLAG(Mapped, USED)
> +CLEARPCGFLAG(Mapped, USED)
> +SETPCGFLAG(Mapped, USED)
> +
> static inline int page_cgroup_nid(struct page_cgroup *pc)
> {
> return page_to_nid(pc->page);
> Index: mmotm-2.6.30-Apr14/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.30-Apr14.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.30-Apr14/include/linux/memcontrol.h
> @@ -116,8 +116,7 @@ static inline bool mem_cgroup_disabled(v
> }
>
> extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_mapped_file_stat(struct page *page, struct mm_struct *mm,
> - int val);
> +void mem_cgroup_update_mapped_file_stat(struct page *page, bool map);
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> @@ -266,8 +265,7 @@ mem_cgroup_print_oom_info(struct mem_cgr
> }
>
> static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
> - struct mm_struct *mm,
> - int val)
> + bool map)
> {
> }
>
> Index: mmotm-2.6.30-Apr14/mm/rmap.c
> ===================================================================
> --- mmotm-2.6.30-Apr14.orig/mm/rmap.c
> +++ mmotm-2.6.30-Apr14/mm/rmap.c
> @@ -690,7 +690,7 @@ void page_add_file_rmap(struct page *pag
> {
> if (atomic_inc_and_test(&page->_mapcount)) {
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> - mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, 1);
> + mem_cgroup_update_mapped_file_stat(page, true);
> }
> }
>
> @@ -740,7 +740,7 @@ void page_remove_rmap(struct page *page,
> mem_cgroup_uncharge_page(page);
> __dec_zone_page_state(page,
> PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> - mem_cgroup_update_mapped_file_stat(page, vma->vm_mm, -1);
> + mem_cgroup_update_mapped_file_stat(page, false);
> /*
> * It would be tidy to reset the PageAnon mapping here,
> * but that might overwrite a racing page_add_anon_rmap
>
>
>
>
>
>

--
Balbir

2009-04-16 12:15:03

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 16:40:36]:

> 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> (Because it's file cache, pc->mem_cgroup is not NULL always.)
>
> I saw this very easily.
> ==
> Cache: 4096
> mapped_file: 20480
> ==
>

May I ask how and what was expected?

--
Balbir

2009-04-16 23:59:10

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Thu, 16 Apr 2009 17:44:07 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 16:40:36]:
>
> > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > (Because it's file cache, pc->mem_cgroup is not NULL always.)
> >
> > I saw this very easily.
> > ==
> > Cache: 4096
> > mapped_file: 20480
> > ==
> >
>
> May I ask how and what was expected?
>

Mapped_file <= Cache,

Thanks,
-Kame


> --
> Balbir
>

2009-04-17 00:16:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Thu, 16 Apr 2009 17:33:16 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 17:15:35]:
>
> >
> > > Sorry, some troubles found. Ignore above Ack. 3points now.
> > >
> > > 1. get_cpu should be after (*)
> > > ==mem_cgroup_update_mapped_file_stat()
> > > + int cpu = get_cpu();
> > > +
> > > + if (!page_is_file_cache(page))
> > > + return;
> > > +
> > > + if (unlikely(!mm))
> > > + mm = &init_mm;
> > > +
> > > + mem = try_get_mem_cgroup_from_mm(mm);
> > > + if (!mem)
> > > + return;
> > > + ----------------------------------------(*)
> > > + stat = &mem->stat;
> > > + cpustat = &stat->cpustat[cpu];
> > > +
> > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > > + put_cpu();
> > > +}
> > > ==
>
> Yes or I should have a goto
>
> > >
> > > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > > (Because it's file cache, pc->mem_cgroup is not NULL always.)
>
> Hmmm.. not sure I understand this part. Are you suggesting that mm can
> be NULL?
No.

> I added the check for !mm as a safety check. Since this
> routine is only called from rmap context, mm is not NULL, hence mem
> should not be NULL. Did you find a race between mm->owner assignment
> and lookup via mm->owner?
>
No.

page_cgroup->mem_cgroup != try_get_mem_cgroup_from_mm(mm); in many many cases.

For example, libc and /bin/*** is tend to be loaded into default cgroup at boot but
used by many cgroups. But mapcount of page caches for /bin/*** is 0 if not running.

Then, File_Mapped can be greater than Cached easily if you use mm->owner.

I can't estimate RSS in *my* cgroup if File_Mapped includes pages which is under
other cgroups. It's meaningless.
Especially, when Cached==0 but File_Mapped > 0, I think "oh, the kernel leaks somehing..hmm..."

By useing page_cgroup->mem_cgroup, we can avoid above mess.

Thanks,
-Kame








2009-04-17 00:19:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Fri, 17 Apr 2009 09:14:59 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> page_cgroup->mem_cgroup != try_get_mem_cgroup_from_mm(mm); in many many cases.
>
> For example, libc and /bin/*** is tend to be loaded into default cgroup at boot but
> used by many cgroups. But mapcount of page caches for /bin/*** is 0 if not running.
>
> Then, File_Mapped can be greater than Cached easily if you use mm->owner.
>
> I can't estimate RSS in *my* cgroup if File_Mapped includes pages which is under
> other cgroups. It's meaningless.
> Especially, when Cached==0 but File_Mapped > 0, I think "oh, the kernel leaks somehing..hmm..."
>
> By useing page_cgroup->mem_cgroup, we can avoid above mess.
>
And, if pc->mem_cgroup is used, we can ignore "task move".

Thanks,
-Kame

2009-04-17 01:41:40

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 09:14:59]:

> On Thu, 16 Apr 2009 17:33:16 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 17:15:35]:
> >
> > >
> > > > Sorry, some troubles found. Ignore above Ack. 3points now.
> > > >
> > > > 1. get_cpu should be after (*)
> > > > ==mem_cgroup_update_mapped_file_stat()
> > > > + int cpu = get_cpu();
> > > > +
> > > > + if (!page_is_file_cache(page))
> > > > + return;
> > > > +
> > > > + if (unlikely(!mm))
> > > > + mm = &init_mm;
> > > > +
> > > > + mem = try_get_mem_cgroup_from_mm(mm);
> > > > + if (!mem)
> > > > + return;
> > > > + ----------------------------------------(*)
> > > > + stat = &mem->stat;
> > > > + cpustat = &stat->cpustat[cpu];
> > > > +
> > > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > > > + put_cpu();
> > > > +}
> > > > ==
> >
> > Yes or I should have a goto
> >
> > > >
> > > > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > > > (Because it's file cache, pc->mem_cgroup is not NULL always.)
> >
> > Hmmm.. not sure I understand this part. Are you suggesting that mm can
> > be NULL?
> No.
>
> > I added the check for !mm as a safety check. Since this
> > routine is only called from rmap context, mm is not NULL, hence mem
> > should not be NULL. Did you find a race between mm->owner assignment
> > and lookup via mm->owner?
> >
> No.
>
> page_cgroup->mem_cgroup != try_get_mem_cgroup_from_mm(mm); in many many cases.
>
> For example, libc and /bin/*** is tend to be loaded into default cgroup at boot but
> used by many cgroups. But mapcount of page caches for /bin/*** is 0 if not running.
>
> Then, File_Mapped can be greater than Cached easily if you use mm->owner.
>
> I can't estimate RSS in *my* cgroup if File_Mapped includes pages which is under
> other cgroups. It's meaningless.
> Especially, when Cached==0 but File_Mapped > 0, I think "oh, the kernel leaks somehing..hmm..."
>
> By useing page_cgroup->mem_cgroup, we can avoid above mess.

Yes, I see your point. I wanted mapped_file to show up in the cgroup
that mapped the page. But this works for me as well, but that means
we'll nest the page cgroup lock under the PTE lock.

--
Balbir

2009-04-17 02:05:37

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Fri, 17 Apr 2009 07:10:42 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 09:14:59]:
>
> > On Thu, 16 Apr 2009 17:33:16 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 17:15:35]:
> > >
> > > >
> > > > > Sorry, some troubles found. Ignore above Ack. 3points now.
> > > > >
> > > > > 1. get_cpu should be after (*)
> > > > > ==mem_cgroup_update_mapped_file_stat()
> > > > > + int cpu = get_cpu();
> > > > > +
> > > > > + if (!page_is_file_cache(page))
> > > > > + return;
> > > > > +
> > > > > + if (unlikely(!mm))
> > > > > + mm = &init_mm;
> > > > > +
> > > > > + mem = try_get_mem_cgroup_from_mm(mm);
> > > > > + if (!mem)
> > > > > + return;
> > > > > + ----------------------------------------(*)
> > > > > + stat = &mem->stat;
> > > > > + cpustat = &stat->cpustat[cpu];
> > > > > +
> > > > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > > > > + put_cpu();
> > > > > +}
> > > > > ==
> > >
> > > Yes or I should have a goto
> > >
> > > > >
> > > > > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > > > > (Because it's file cache, pc->mem_cgroup is not NULL always.)
> > >
> > > Hmmm.. not sure I understand this part. Are you suggesting that mm can
> > > be NULL?
> > No.
> >
> > > I added the check for !mm as a safety check. Since this
> > > routine is only called from rmap context, mm is not NULL, hence mem
> > > should not be NULL. Did you find a race between mm->owner assignment
> > > and lookup via mm->owner?
> > >
> > No.
> >
> > page_cgroup->mem_cgroup != try_get_mem_cgroup_from_mm(mm); in many many cases.
> >
> > For example, libc and /bin/*** is tend to be loaded into default cgroup at boot but
> > used by many cgroups. But mapcount of page caches for /bin/*** is 0 if not running.
> >
> > Then, File_Mapped can be greater than Cached easily if you use mm->owner.
> >
> > I can't estimate RSS in *my* cgroup if File_Mapped includes pages which is under
> > other cgroups. It's meaningless.
> > Especially, when Cached==0 but File_Mapped > 0, I think "oh, the kernel leaks somehing..hmm..."
> >
> > By useing page_cgroup->mem_cgroup, we can avoid above mess.
>
> Yes, I see your point. I wanted mapped_file to show up in the cgroup
> that mapped the page. But this works for me as well, but that means
> we'll nest the page cgroup lock under the PTE lock.

Don't worry. we do that nest at ANON's uncharge(), already.

About cost:

IIUC, the number of "mapcount 0->1/1->0" of file caches are much smaller than
that of o Anon. And there will be not very much cache pingpong.

If you use PCG_MAPPED flag in page_cgroup (as my patch), you can use
not-atomic version of set/clear when update is only under lock_page_cgroup().
If you find better way, plz use it. But we can't avoid some kind of atomic ops
for correct accounting, I think.

Thanks,
-Kame

2009-04-17 03:46:55

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 11:03:50]:

> On Fri, 17 Apr 2009 07:10:42 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 09:14:59]:
> >
> > > On Thu, 16 Apr 2009 17:33:16 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 17:15:35]:
> > > >
> > > > >
> > > > > > Sorry, some troubles found. Ignore above Ack. 3points now.
> > > > > >
> > > > > > 1. get_cpu should be after (*)
> > > > > > ==mem_cgroup_update_mapped_file_stat()
> > > > > > + int cpu = get_cpu();
> > > > > > +
> > > > > > + if (!page_is_file_cache(page))
> > > > > > + return;
> > > > > > +
> > > > > > + if (unlikely(!mm))
> > > > > > + mm = &init_mm;
> > > > > > +
> > > > > > + mem = try_get_mem_cgroup_from_mm(mm);
> > > > > > + if (!mem)
> > > > > > + return;
> > > > > > + ----------------------------------------(*)
> > > > > > + stat = &mem->stat;
> > > > > > + cpustat = &stat->cpustat[cpu];
> > > > > > +
> > > > > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > > > > > + put_cpu();
> > > > > > +}
> > > > > > ==
> > > >
> > > > Yes or I should have a goto
> > > >
> > > > > >
> > > > > > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > > > > > (Because it's file cache, pc->mem_cgroup is not NULL always.)
> > > >
> > > > Hmmm.. not sure I understand this part. Are you suggesting that mm can
> > > > be NULL?
> > > No.
> > >
> > > > I added the check for !mm as a safety check. Since this
> > > > routine is only called from rmap context, mm is not NULL, hence mem
> > > > should not be NULL. Did you find a race between mm->owner assignment
> > > > and lookup via mm->owner?
> > > >
> > > No.
> > >
> > > page_cgroup->mem_cgroup != try_get_mem_cgroup_from_mm(mm); in many many cases.
> > >
> > > For example, libc and /bin/*** is tend to be loaded into default cgroup at boot but
> > > used by many cgroups. But mapcount of page caches for /bin/*** is 0 if not running.
> > >
> > > Then, File_Mapped can be greater than Cached easily if you use mm->owner.
> > >
> > > I can't estimate RSS in *my* cgroup if File_Mapped includes pages which is under
> > > other cgroups. It's meaningless.
> > > Especially, when Cached==0 but File_Mapped > 0, I think "oh, the kernel leaks somehing..hmm..."
> > >
> > > By useing page_cgroup->mem_cgroup, we can avoid above mess.
> >
> > Yes, I see your point. I wanted mapped_file to show up in the cgroup
> > that mapped the page. But this works for me as well, but that means
> > we'll nest the page cgroup lock under the PTE lock.
>
> Don't worry. we do that nest at ANON's uncharge(), already.
>
> About cost:
>
> IIUC, the number of "mapcount 0->1/1->0" of file caches are much smaller than
> that of o Anon. And there will be not very much cache pingpong.
>
> If you use PCG_MAPPED flag in page_cgroup (as my patch), you can use
> not-atomic version of set/clear when update is only under lock_page_cgroup().
> If you find better way, plz use it. But we can't avoid some kind of atomic ops
> for correct accounting, I think.
>

Can you sign off on your patch, so that I can take it with your
signed-off-by. I will also make some minor changes, get_cpu() is not
needed, since we are in preempt disable context.

--
Balbir

2009-04-17 03:51:39

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Fri, 17 Apr 2009 09:15:39 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 11:03:50]:
>
> > On Fri, 17 Apr 2009 07:10:42 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 09:14:59]:
> > >
> > > > On Thu, 16 Apr 2009 17:33:16 +0530
> > > > Balbir Singh <[email protected]> wrote:
> > > >
> > > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 17:15:35]:
> > > > >
> > > > > >
> > > > > > > Sorry, some troubles found. Ignore above Ack. 3points now.
> > > > > > >
> > > > > > > 1. get_cpu should be after (*)
> > > > > > > ==mem_cgroup_update_mapped_file_stat()
> > > > > > > + int cpu = get_cpu();
> > > > > > > +
> > > > > > > + if (!page_is_file_cache(page))
> > > > > > > + return;
> > > > > > > +
> > > > > > > + if (unlikely(!mm))
> > > > > > > + mm = &init_mm;
> > > > > > > +
> > > > > > > + mem = try_get_mem_cgroup_from_mm(mm);
> > > > > > > + if (!mem)
> > > > > > > + return;
> > > > > > > + ----------------------------------------(*)
> > > > > > > + stat = &mem->stat;
> > > > > > > + cpustat = &stat->cpustat[cpu];
> > > > > > > +
> > > > > > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > > > > > > + put_cpu();
> > > > > > > +}
> > > > > > > ==
> > > > >
> > > > > Yes or I should have a goto
> > > > >
> > > > > > >
> > > > > > > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > > > > > > (Because it's file cache, pc->mem_cgroup is not NULL always.)
> > > > >
> > > > > Hmmm.. not sure I understand this part. Are you suggesting that mm can
> > > > > be NULL?
> > > > No.
> > > >
> > > > > I added the check for !mm as a safety check. Since this
> > > > > routine is only called from rmap context, mm is not NULL, hence mem
> > > > > should not be NULL. Did you find a race between mm->owner assignment
> > > > > and lookup via mm->owner?
> > > > >
> > > > No.
> > > >
> > > > page_cgroup->mem_cgroup != try_get_mem_cgroup_from_mm(mm); in many many cases.
> > > >
> > > > For example, libc and /bin/*** is tend to be loaded into default cgroup at boot but
> > > > used by many cgroups. But mapcount of page caches for /bin/*** is 0 if not running.
> > > >
> > > > Then, File_Mapped can be greater than Cached easily if you use mm->owner.
> > > >
> > > > I can't estimate RSS in *my* cgroup if File_Mapped includes pages which is under
> > > > other cgroups. It's meaningless.
> > > > Especially, when Cached==0 but File_Mapped > 0, I think "oh, the kernel leaks somehing..hmm..."
> > > >
> > > > By useing page_cgroup->mem_cgroup, we can avoid above mess.
> > >
> > > Yes, I see your point. I wanted mapped_file to show up in the cgroup
> > > that mapped the page. But this works for me as well, but that means
> > > we'll nest the page cgroup lock under the PTE lock.
> >
> > Don't worry. we do that nest at ANON's uncharge(), already.
> >
> > About cost:
> >
> > IIUC, the number of "mapcount 0->1/1->0" of file caches are much smaller than
> > that of o Anon. And there will be not very much cache pingpong.
> >
> > If you use PCG_MAPPED flag in page_cgroup (as my patch), you can use
> > not-atomic version of set/clear when update is only under lock_page_cgroup().
> > If you find better way, plz use it. But we can't avoid some kind of atomic ops
> > for correct accounting, I think.
> >
>
> Can you sign off on your patch, so that I can take it with your
> signed-off-by. I will also make some minor changes, get_cpu() is not
> needed, since we are in preempt disable context.
>
Hmm,
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

But some more clean up is necesarry.

=== This part ==
+ lock_page_cgroup(pc);
+ mem = pc->mem_cgroup;
+ if (mem) {
+ cpu = get_cpu();
+ stat = &mem->stat;
+ cpustat = &stat->cpustat[cpu];
+ if (map)

=== Should be ==
+ lock_page_cgroup(pc);
if (!PageCgroupUsed(pc)) {
unlock_page_cgroup(pc);
return;
}
mem = pc->mem_cgroup
VM_BUG_ON(!mem);
cpu = get_cpu();
stat = &mem->stat;
cpustat = &stat->cpustat[cpu];
if (map)
.....

Maybe much cleaner.

Thanks,
-Kame


2009-04-17 04:57:24

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 12:49:51]:

> On Fri, 17 Apr 2009 09:15:39 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 11:03:50]:
> >
> > > On Fri, 17 Apr 2009 07:10:42 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 09:14:59]:
> > > >
> > > > > On Thu, 16 Apr 2009 17:33:16 +0530
> > > > > Balbir Singh <[email protected]> wrote:
> > > > >
> > > > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 17:15:35]:
> > > > > >
> > > > > > >
> > > > > > > > Sorry, some troubles found. Ignore above Ack. 3points now.
> > > > > > > >
> > > > > > > > 1. get_cpu should be after (*)
> > > > > > > > ==mem_cgroup_update_mapped_file_stat()
> > > > > > > > + int cpu = get_cpu();
> > > > > > > > +
> > > > > > > > + if (!page_is_file_cache(page))
> > > > > > > > + return;
> > > > > > > > +
> > > > > > > > + if (unlikely(!mm))
> > > > > > > > + mm = &init_mm;
> > > > > > > > +
> > > > > > > > + mem = try_get_mem_cgroup_from_mm(mm);
> > > > > > > > + if (!mem)
> > > > > > > > + return;
> > > > > > > > + ----------------------------------------(*)
> > > > > > > > + stat = &mem->stat;
> > > > > > > > + cpustat = &stat->cpustat[cpu];
> > > > > > > > +
> > > > > > > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > > > > > > > + put_cpu();
> > > > > > > > +}
> > > > > > > > ==
> > > > > >
> > > > > > Yes or I should have a goto
> > > > > >
> > > > > > > >
> > > > > > > > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > > > > > > > (Because it's file cache, pc->mem_cgroup is not NULL always.)
> > > > > >
> > > > > > Hmmm.. not sure I understand this part. Are you suggesting that mm can
> > > > > > be NULL?
> > > > > No.
> > > > >
> > > > > > I added the check for !mm as a safety check. Since this
> > > > > > routine is only called from rmap context, mm is not NULL, hence mem
> > > > > > should not be NULL. Did you find a race between mm->owner assignment
> > > > > > and lookup via mm->owner?
> > > > > >
> > > > > No.
> > > > >
> > > > > page_cgroup->mem_cgroup != try_get_mem_cgroup_from_mm(mm); in many many cases.
> > > > >
> > > > > For example, libc and /bin/*** is tend to be loaded into default cgroup at boot but
> > > > > used by many cgroups. But mapcount of page caches for /bin/*** is 0 if not running.
> > > > >
> > > > > Then, File_Mapped can be greater than Cached easily if you use mm->owner.
> > > > >
> > > > > I can't estimate RSS in *my* cgroup if File_Mapped includes pages which is under
> > > > > other cgroups. It's meaningless.
> > > > > Especially, when Cached==0 but File_Mapped > 0, I think "oh, the kernel leaks somehing..hmm..."
> > > > >
> > > > > By useing page_cgroup->mem_cgroup, we can avoid above mess.
> > > >
> > > > Yes, I see your point. I wanted mapped_file to show up in the cgroup
> > > > that mapped the page. But this works for me as well, but that means
> > > > we'll nest the page cgroup lock under the PTE lock.
> > >
> > > Don't worry. we do that nest at ANON's uncharge(), already.
> > >
> > > About cost:
> > >
> > > IIUC, the number of "mapcount 0->1/1->0" of file caches are much smaller than
> > > that of o Anon. And there will be not very much cache pingpong.
> > >
> > > If you use PCG_MAPPED flag in page_cgroup (as my patch), you can use
> > > not-atomic version of set/clear when update is only under lock_page_cgroup().
> > > If you find better way, plz use it. But we can't avoid some kind of atomic ops
> > > for correct accounting, I think.
> > >
> >
> > Can you sign off on your patch, so that I can take it with your
> > signed-off-by. I will also make some minor changes, get_cpu() is not
> > needed, since we are in preempt disable context.
> >
> Hmm,
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> But some more clean up is necesarry.
>
> === This part ==
> + lock_page_cgroup(pc);
> + mem = pc->mem_cgroup;
> + if (mem) {
> + cpu = get_cpu();
> + stat = &mem->stat;
> + cpustat = &stat->cpustat[cpu];
> + if (map)
>
> === Should be ==
> + lock_page_cgroup(pc);
> if (!PageCgroupUsed(pc)) {
> unlock_page_cgroup(pc);
> return;
> }

Do we need this? If the page is mapped, pc should be used right?

--
Balbir

2009-04-17 05:19:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Fri, 17 Apr 2009 10:26:23 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 12:49:51]:
>
> > On Fri, 17 Apr 2009 09:15:39 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 11:03:50]:
> > >
> > > > On Fri, 17 Apr 2009 07:10:42 +0530
> > > > Balbir Singh <[email protected]> wrote:
> > > >
> > > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 09:14:59]:
> > > > >
> > > > > > On Thu, 16 Apr 2009 17:33:16 +0530
> > > > > > Balbir Singh <[email protected]> wrote:
> > > > > >
> > > > > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 17:15:35]:
> > > > > > >
> > > > > > > >
> > > > > > > > > Sorry, some troubles found. Ignore above Ack. 3points now.
> > > > > > > > >
> > > > > > > > > 1. get_cpu should be after (*)
> > > > > > > > > ==mem_cgroup_update_mapped_file_stat()
> > > > > > > > > + int cpu = get_cpu();
> > > > > > > > > +
> > > > > > > > > + if (!page_is_file_cache(page))
> > > > > > > > > + return;
> > > > > > > > > +
> > > > > > > > > + if (unlikely(!mm))
> > > > > > > > > + mm = &init_mm;
> > > > > > > > > +
> > > > > > > > > + mem = try_get_mem_cgroup_from_mm(mm);
> > > > > > > > > + if (!mem)
> > > > > > > > > + return;
> > > > > > > > > + ----------------------------------------(*)
> > > > > > > > > + stat = &mem->stat;
> > > > > > > > > + cpustat = &stat->cpustat[cpu];
> > > > > > > > > +
> > > > > > > > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > > > > > > > > + put_cpu();
> > > > > > > > > +}
> > > > > > > > > ==
> > > > > > >
> > > > > > > Yes or I should have a goto
> > > > > > >
> > > > > > > > >
> > > > > > > > > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > > > > > > > > (Because it's file cache, pc->mem_cgroup is not NULL always.)
> > > > > > >
> > > > > > > Hmmm.. not sure I understand this part. Are you suggesting that mm can
> > > > > > > be NULL?
> > > > > > No.
> > > > > >
> > > > > > > I added the check for !mm as a safety check. Since this
> > > > > > > routine is only called from rmap context, mm is not NULL, hence mem
> > > > > > > should not be NULL. Did you find a race between mm->owner assignment
> > > > > > > and lookup via mm->owner?
> > > > > > >
> > > > > > No.
> > > > > >
> > > > > > page_cgroup->mem_cgroup != try_get_mem_cgroup_from_mm(mm); in many many cases.
> > > > > >
> > > > > > For example, libc and /bin/*** is tend to be loaded into default cgroup at boot but
> > > > > > used by many cgroups. But mapcount of page caches for /bin/*** is 0 if not running.
> > > > > >
> > > > > > Then, File_Mapped can be greater than Cached easily if you use mm->owner.
> > > > > >
> > > > > > I can't estimate RSS in *my* cgroup if File_Mapped includes pages which is under
> > > > > > other cgroups. It's meaningless.
> > > > > > Especially, when Cached==0 but File_Mapped > 0, I think "oh, the kernel leaks somehing..hmm..."
> > > > > >
> > > > > > By useing page_cgroup->mem_cgroup, we can avoid above mess.
> > > > >
> > > > > Yes, I see your point. I wanted mapped_file to show up in the cgroup
> > > > > that mapped the page. But this works for me as well, but that means
> > > > > we'll nest the page cgroup lock under the PTE lock.
> > > >
> > > > Don't worry. we do that nest at ANON's uncharge(), already.
> > > >
> > > > About cost:
> > > >
> > > > IIUC, the number of "mapcount 0->1/1->0" of file caches are much smaller than
> > > > that of o Anon. And there will be not very much cache pingpong.
> > > >
> > > > If you use PCG_MAPPED flag in page_cgroup (as my patch), you can use
> > > > not-atomic version of set/clear when update is only under lock_page_cgroup().
> > > > If you find better way, plz use it. But we can't avoid some kind of atomic ops
> > > > for correct accounting, I think.
> > > >
> > >
> > > Can you sign off on your patch, so that I can take it with your
> > > signed-off-by. I will also make some minor changes, get_cpu() is not
> > > needed, since we are in preempt disable context.
> > >
> > Hmm,
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> > But some more clean up is necesarry.
> >
> > === This part ==
> > + lock_page_cgroup(pc);
> > + mem = pc->mem_cgroup;
> > + if (mem) {
> > + cpu = get_cpu();
> > + stat = &mem->stat;
> > + cpustat = &stat->cpustat[cpu];
> > + if (map)
> >
> > === Should be ==
> > + lock_page_cgroup(pc);
> > if (!PageCgroupUsed(pc)) {
> > unlock_page_cgroup(pc);
> > return;
> > }
>
> Do we need this? If the page is mapped, pc should be used right?
>

About file cache, it'd definitely charged at add-to-radix-tree
regardless of being mapped or not.

*But* we still have following code.
==
820 static int __mem_cgroup_try_charge(struct mm_struct *mm,
821 gfp_t gfp_mask, struct mem_cgroup **memcg,
822
834 /*
835 * We always charge the cgroup the mm_struct belongs to.
836 * The mm_struct's mem_cgroup changes on task migration if the
837 * thread group leader migrates. It's possible that mm is not
838 * set, if so charge the init_mm (happens for pagecache usage).
839 */
840 mem = *memcg;
841 if (likely(!mem)) {
842 mem = try_get_mem_cgroup_from_mm(mm);
843 *memcg = mem;
844 } else {
845 css_get(&mem->css);
846 }
847 if (unlikely(!mem))
848 return 0;
==

So, for _now_, we should use this style of checking page_cgroup is used or not.
Until we fix/confirm try_charge() does.

Thanks,
-Kame

2009-04-17 06:48:28

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 14:17:26]:

> On Fri, 17 Apr 2009 10:26:23 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 12:49:51]:
> >
> > > On Fri, 17 Apr 2009 09:15:39 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 11:03:50]:
> > > >
> > > > > On Fri, 17 Apr 2009 07:10:42 +0530
> > > > > Balbir Singh <[email protected]> wrote:
> > > > >
> > > > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-17 09:14:59]:
> > > > > >
> > > > > > > On Thu, 16 Apr 2009 17:33:16 +0530
> > > > > > > Balbir Singh <[email protected]> wrote:
> > > > > > >
> > > > > > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-16 17:15:35]:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > Sorry, some troubles found. Ignore above Ack. 3points now.
> > > > > > > > > >
> > > > > > > > > > 1. get_cpu should be after (*)
> > > > > > > > > > ==mem_cgroup_update_mapped_file_stat()
> > > > > > > > > > + int cpu = get_cpu();
> > > > > > > > > > +
> > > > > > > > > > + if (!page_is_file_cache(page))
> > > > > > > > > > + return;
> > > > > > > > > > +
> > > > > > > > > > + if (unlikely(!mm))
> > > > > > > > > > + mm = &init_mm;
> > > > > > > > > > +
> > > > > > > > > > + mem = try_get_mem_cgroup_from_mm(mm);
> > > > > > > > > > + if (!mem)
> > > > > > > > > > + return;
> > > > > > > > > > + ----------------------------------------(*)
> > > > > > > > > > + stat = &mem->stat;
> > > > > > > > > > + cpustat = &stat->cpustat[cpu];
> > > > > > > > > > +
> > > > > > > > > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> > > > > > > > > > + put_cpu();
> > > > > > > > > > +}
> > > > > > > > > > ==
> > > > > > > >
> > > > > > > > Yes or I should have a goto
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2. In above, "mem" shouldn't be got from "mm"....please get "mem" from page_cgroup.
> > > > > > > > > > (Because it's file cache, pc->mem_cgroup is not NULL always.)
> > > > > > > >
> > > > > > > > Hmmm.. not sure I understand this part. Are you suggesting that mm can
> > > > > > > > be NULL?
> > > > > > > No.
> > > > > > >
> > > > > > > > I added the check for !mm as a safety check. Since this
> > > > > > > > routine is only called from rmap context, mm is not NULL, hence mem
> > > > > > > > should not be NULL. Did you find a race between mm->owner assignment
> > > > > > > > and lookup via mm->owner?
> > > > > > > >
> > > > > > > No.
> > > > > > >
> > > > > > > page_cgroup->mem_cgroup != try_get_mem_cgroup_from_mm(mm); in many many cases.
> > > > > > >
> > > > > > > For example, libc and /bin/*** is tend to be loaded into default cgroup at boot but
> > > > > > > used by many cgroups. But mapcount of page caches for /bin/*** is 0 if not running.
> > > > > > >
> > > > > > > Then, File_Mapped can be greater than Cached easily if you use mm->owner.
> > > > > > >
> > > > > > > I can't estimate RSS in *my* cgroup if File_Mapped includes pages which is under
> > > > > > > other cgroups. It's meaningless.
> > > > > > > Especially, when Cached==0 but File_Mapped > 0, I think "oh, the kernel leaks somehing..hmm..."
> > > > > > >
> > > > > > > By useing page_cgroup->mem_cgroup, we can avoid above mess.
> > > > > >
> > > > > > Yes, I see your point. I wanted mapped_file to show up in the cgroup
> > > > > > that mapped the page. But this works for me as well, but that means
> > > > > > we'll nest the page cgroup lock under the PTE lock.
> > > > >
> > > > > Don't worry. we do that nest at ANON's uncharge(), already.
> > > > >
> > > > > About cost:
> > > > >
> > > > > IIUC, the number of "mapcount 0->1/1->0" of file caches are much smaller than
> > > > > that of o Anon. And there will be not very much cache pingpong.
> > > > >
> > > > > If you use PCG_MAPPED flag in page_cgroup (as my patch), you can use
> > > > > not-atomic version of set/clear when update is only under lock_page_cgroup().
> > > > > If you find better way, plz use it. But we can't avoid some kind of atomic ops
> > > > > for correct accounting, I think.
> > > > >
> > > >
> > > > Can you sign off on your patch, so that I can take it with your
> > > > signed-off-by. I will also make some minor changes, get_cpu() is not
> > > > needed, since we are in preempt disable context.
> > > >
> > > Hmm,
> > > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > >
> > > But some more clean up is necesarry.
> > >
> > > === This part ==
> > > + lock_page_cgroup(pc);
> > > + mem = pc->mem_cgroup;
> > > + if (mem) {
> > > + cpu = get_cpu();
> > > + stat = &mem->stat;
> > > + cpustat = &stat->cpustat[cpu];
> > > + if (map)
> > >
> > > === Should be ==
> > > + lock_page_cgroup(pc);
> > > if (!PageCgroupUsed(pc)) {
> > > unlock_page_cgroup(pc);
> > > return;
> > > }
> >
> > Do we need this? If the page is mapped, pc should be used right?
> >
>
> About file cache, it'd definitely charged at add-to-radix-tree
> regardless of being mapped or not.
>

Yes, what I meant was that before being mapped, the page should be
charged by the memory controller.

> *But* we still have following code.
> ==
> 820 static int __mem_cgroup_try_charge(struct mm_struct *mm,
> 821 gfp_t gfp_mask, struct mem_cgroup **memcg,
> 822
> 834 /*
> 835 * We always charge the cgroup the mm_struct belongs to.
> 836 * The mm_struct's mem_cgroup changes on task migration if the
> 837 * thread group leader migrates. It's possible that mm is not
> 838 * set, if so charge the init_mm (happens for pagecache usage).
> 839 */
> 840 mem = *memcg;
> 841 if (likely(!mem)) {
> 842 mem = try_get_mem_cgroup_from_mm(mm);
> 843 *memcg = mem;
> 844 } else {
> 845 css_get(&mem->css);
> 846 }
> 847 if (unlikely(!mem))
> 848 return 0;
> ==
>
> So, for _now_, we should use this style of checking page_cgroup is used or not.
> Until we fix/confirm try_charge() does.
>

Hmm... I think we need to fix this loop hole, if not mem, we should
look at charging the root cgroup. I suspect !mem cases should be 0,
I'll keep that as a TODO.

--
Balbir

2009-04-17 06:57:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v2)

On Fri, 17 Apr 2009 12:17:26 +0530
Balbir Singh <[email protected]> wrote:

> > *But* we still have following code.
> > ==
> > 820 static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > 821 gfp_t gfp_mask, struct mem_cgroup **memcg,
> > 822
> > 834 /*
> > 835 * We always charge the cgroup the mm_struct belongs to.
> > 836 * The mm_struct's mem_cgroup changes on task migration if the
> > 837 * thread group leader migrates. It's possible that mm is not
> > 838 * set, if so charge the init_mm (happens for pagecache usage).
> > 839 */
> > 840 mem = *memcg;
> > 841 if (likely(!mem)) {
> > 842 mem = try_get_mem_cgroup_from_mm(mm);
> > 843 *memcg = mem;
> > 844 } else {
> > 845 css_get(&mem->css);
> > 846 }
> > 847 if (unlikely(!mem))
> > 848 return 0;
> > ==
> >
> > So, for _now_, we should use this style of checking page_cgroup is used or not.
> > Until we fix/confirm try_charge() does.
> >
>
> Hmm... I think we need to fix this loop hole, if not mem, we should
> look at charging the root cgroup. I suspect !mem cases should be 0,
> I'll keep that as a TODO.
>
yes, I'd like to keep this in my mind, too.


Thanks,
-Kame

2009-04-17 14:19:33

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v3)

Hi, Kame,

How does this look? I did not use the mapped flag in page_cgroup
flags. page_is_file_cache and page_mapped worked as well.

Feature: Add file RSS tracking per memory cgroup

From: Balbir Singh <[email protected]>

Changelog v3 -> v2
1. Fix get_cpu(), put_cpu() matching. Moved away from get_cpu() and use
smp_processor_id(), since we are in preempt disable context
2. Use pc->mem_cgroup to identify the mem_cgroup instead of mm
3. page_add_file_rmap() and page_remove_rmap() argument changes are undone.

Changelog v2 -> v1

1. Rename file_rss to mapped_file
2. Add hooks into mem_cgroup_move_account for updating MAPPED_FILE statistics
3. Use a better name for the statistics routine.


We currently don't track file RSS, the RSS we report is actually anon RSS.
All the file mapped pages, come in through the page cache and get accounted
there. This patch adds support for accounting file RSS pages. It should

1. Help improve the metrics reported by the memory resource controller
2. Will form the basis for a future shared memory accounting heuristic
that has been proposed by Kamezawa.

Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
to "anon_rss". We however, add "mapped_file" data and hope to educate the end
user through documentation.

Signed-off-by: Balbir Singh <[email protected]>
---

include/linux/memcontrol.h | 7 ++++-
mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++++++++++-
mm/rmap.c | 5 +++
3 files changed, 75 insertions(+), 3 deletions(-)


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..05a5c11 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,7 +116,7 @@ static inline bool mem_cgroup_disabled(void)
}

extern bool mem_cgroup_oom_called(struct task_struct *task);
-
+void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

@@ -264,6 +264,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
+ int val)
+{
+}
+
#endif /* CONFIG_CGROUP_MEM_CONT */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..562bd76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,7 +62,8 @@ enum mem_cgroup_stat_index {
* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
- MEM_CGROUP_STAT_RSS, /* # of pages charged as rss */
+ MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
+ MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */

@@ -321,6 +322,44 @@ static bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
return css_is_removed(&mem->css);
}

+/*
+ * Currently used to update mapped file statistics, but the routine can be
+ * generalized to update other statistics as well.
+ */
+void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
+{
+ struct mem_cgroup *mem;
+ struct mem_cgroup_stat *stat;
+ struct mem_cgroup_stat_cpu *cpustat;
+ int cpu;
+ struct page_cgroup *pc;
+
+ if (!page_is_file_cache(page))
+ return;
+
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ lock_page_cgroup(pc);
+ mem = pc->mem_cgroup;
+ if (!mem)
+ goto done;
+
+ if (!PageCgroupUsed(pc))
+ goto done;
+
+ /*
+ * Preemption is already disabled, we don't need get_cpu()
+ */
+ cpu = smp_processor_id();
+ stat = &mem->stat;
+ cpustat = &stat->cpustat[cpu];
+
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
+done:
+ unlock_page_cgroup(pc);
+}

/*
* Call callback function against all cgroup under hierarchy tree.
@@ -1096,6 +1135,10 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
struct mem_cgroup_per_zone *from_mz, *to_mz;
int nid, zid;
int ret = -EBUSY;
+ struct page *page;
+ int cpu;
+ struct mem_cgroup_stat *stat;
+ struct mem_cgroup_stat_cpu *cpustat;

VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
@@ -1116,6 +1159,23 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,

res_counter_uncharge(&from->res, PAGE_SIZE);
mem_cgroup_charge_statistics(from, pc, false);
+
+ page = pc->page;
+ if (page_is_file_cache(page) && page_mapped(page)) {
+ cpu = smp_processor_id();
+ /* Update mapped_file data for mem_cgroup "from" */
+ stat = &from->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
+ -1);
+
+ /* Update mapped_file data for mem_cgroup "to" */
+ stat = &to->stat;
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
+ 1);
+ }
+
if (do_swap_account)
res_counter_uncharge(&from->memsw, PAGE_SIZE);
css_put(&from->css);
@@ -2051,6 +2111,7 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
enum {
MCS_CACHE,
MCS_RSS,
+ MCS_MAPPED_FILE,
MCS_PGPGIN,
MCS_PGPGOUT,
MCS_INACTIVE_ANON,
@@ -2071,6 +2132,7 @@ struct {
} memcg_stat_strings[NR_MCS_STAT] = {
{"cache", "total_cache"},
{"rss", "total_rss"},
+ {"mapped_file", "total_mapped_file"},
{"pgpgin", "total_pgpgin"},
{"pgpgout", "total_pgpgout"},
{"inactive_anon", "total_inactive_anon"},
@@ -2091,6 +2153,8 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
s->stat[MCS_CACHE] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
s->stat[MCS_RSS] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_MAPPED_FILE);
+ s->stat[MCS_MAPPED_FILE] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
s->stat[MCS_PGPGIN] += val;
val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..c3ba0b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -688,8 +688,10 @@ void page_add_new_anon_rmap(struct page *page,
*/
void page_add_file_rmap(struct page *page)
{
- if (atomic_inc_and_test(&page->_mapcount))
+ if (atomic_inc_and_test(&page->_mapcount)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
+ mem_cgroup_update_mapped_file_stat(page, 1);
+ }
}

#ifdef CONFIG_DEBUG_VM
@@ -738,6 +740,7 @@ void page_remove_rmap(struct page *page)
mem_cgroup_uncharge_page(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
+ mem_cgroup_update_mapped_file_stat(page, -1);
/*
* It would be tidy to reset the PageAnon mapping here,
* but that might overwrite a racing page_add_anon_rmap

--
Balbir

2009-04-17 16:31:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v3)

Balbir Singh wrote:
> Hi, Kame,
>
> How does this look? I did not use the mapped flag in page_cgroup
> flags. page_is_file_cache and page_mapped worked as well.
>
> Feature: Add file RSS tracking per memory cgroup
>
> From: Balbir Singh <[email protected]>
>
> Changelog v3 -> v2
> 1. Fix get_cpu(), put_cpu() matching. Moved away from get_cpu() and use
> smp_processor_id(), since we are in preempt disable context
> 2. Use pc->mem_cgroup to identify the mem_cgroup instead of mm
> 3. page_add_file_rmap() and page_remove_rmap() argument changes are
> undone.
>
> Changelog v2 -> v1
>
> 1. Rename file_rss to mapped_file
> 2. Add hooks into mem_cgroup_move_account for updating MAPPED_FILE
> statistics
> 3. Use a better name for the statistics routine.
>
>
> We currently don't track file RSS, the RSS we report is actually anon RSS.
> All the file mapped pages, come in through the page cache and get
> accounted
> there. This patch adds support for accounting file RSS pages. It should
>
> 1. Help improve the metrics reported by the memory resource controller
> 2. Will form the basis for a future shared memory accounting heuristic
> that has been proposed by Kamezawa.
>
> Unfortunately, we cannot rename the existing "rss" keyword used in
> memory.stat
> to "anon_rss". We however, add "mapped_file" data and hope to educate the
> end
> user through documentation.
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
>
> include/linux/memcontrol.h | 7 ++++-
> mm/memcontrol.c | 66
> +++++++++++++++++++++++++++++++++++++++++++-
> mm/rmap.c | 5 +++
> 3 files changed, 75 insertions(+), 3 deletions(-)
>
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..05a5c11 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -116,7 +116,7 @@ static inline bool mem_cgroup_disabled(void)
> }
>
> extern bool mem_cgroup_oom_called(struct task_struct *task);
> -
> +void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> @@ -264,6 +264,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> struct task_struct *p)
> {
> }
>
> +static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
> + int val)
> +{
> +}
> +
> #endif /* CONFIG_CGROUP_MEM_CONT */
>
> #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e44fb0f..562bd76 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -62,7 +62,8 @@ enum mem_cgroup_stat_index {
> * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> */
> MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> - MEM_CGROUP_STAT_RSS, /* # of pages charged as rss */
> + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> + MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
>
> @@ -321,6 +322,44 @@ static bool mem_cgroup_is_obsolete(struct mem_cgroup
> *mem)
> return css_is_removed(&mem->css);
> }
>
> +/*
> + * Currently used to update mapped file statistics, but the routine can
> be
> + * generalized to update other statistics as well.
> + */
> +void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
> +{
> + struct mem_cgroup *mem;
> + struct mem_cgroup_stat *stat;
> + struct mem_cgroup_stat_cpu *cpustat;
> + int cpu;
> + struct page_cgroup *pc;
> +
> + if (!page_is_file_cache(page))
> + return;
> +
> + pc = lookup_page_cgroup(page);
> + if (unlikely(!pc))
> + return;
> +
> + lock_page_cgroup(pc);
> + mem = pc->mem_cgroup;
> + if (!mem)
> + goto done;
> +
> + if (!PageCgroupUsed(pc))
> + goto done;
> +
> + /*
> + * Preemption is already disabled, we don't need get_cpu()
> + */
> + cpu = smp_processor_id();
> + stat = &mem->stat;
> + cpustat = &stat->cpustat[cpu];
> +
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
> +done:
> + unlock_page_cgroup(pc);
> +}
>
> /*
> * Call callback function against all cgroup under hierarchy tree.
> @@ -1096,6 +1135,10 @@ static int mem_cgroup_move_account(struct
> page_cgroup *pc,
> struct mem_cgroup_per_zone *from_mz, *to_mz;
> int nid, zid;
> int ret = -EBUSY;
> + struct page *page;
> + int cpu;
> + struct mem_cgroup_stat *stat;
> + struct mem_cgroup_stat_cpu *cpustat;
>
> VM_BUG_ON(from == to);
> VM_BUG_ON(PageLRU(pc->page));
> @@ -1116,6 +1159,23 @@ static int mem_cgroup_move_account(struct
> page_cgroup *pc,
>
> res_counter_uncharge(&from->res, PAGE_SIZE);
> mem_cgroup_charge_statistics(from, pc, false);
> +
> + page = pc->page;
> + if (page_is_file_cache(page) && page_mapped(page)) {
*Maybe* safe, but could you add comments ?

ok, let's start testing under -mm.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>

Nice feature :)

Thanks,
-Kame

2009-04-21 03:00:28

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v3)

2009/4/17 KAMEZAWA Hiroyuki <[email protected]>:
[snip]
> ok, let's start testing under -mm.
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Nice feature :)
>

Hi, Andrew,

Could you please pick this up for testing under -mm.

Thanks,
Balbir

2009-04-21 20:28:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v3)

On Fri, 17 Apr 2009 19:48:38 +0530
Balbir Singh <[email protected]> wrote:

>
> ...
>
> We currently don't track file RSS, the RSS we report is actually anon RSS.
> All the file mapped pages, come in through the page cache and get accounted
> there. This patch adds support for accounting file RSS pages. It should
>
> 1. Help improve the metrics reported by the memory resource controller
> 2. Will form the basis for a future shared memory accounting heuristic
> that has been proposed by Kamezawa.
>
> Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
> to "anon_rss". We however, add "mapped_file" data and hope to educate the end
> user through documentation.
>
> Signed-off-by: Balbir Singh <[email protected]>
>
> ...
>
> @@ -1096,6 +1135,10 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> struct mem_cgroup_per_zone *from_mz, *to_mz;
> int nid, zid;
> int ret = -EBUSY;
> + struct page *page;
> + int cpu;
> + struct mem_cgroup_stat *stat;
> + struct mem_cgroup_stat_cpu *cpustat;
>
> VM_BUG_ON(from == to);
> VM_BUG_ON(PageLRU(pc->page));
> @@ -1116,6 +1159,23 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>
> res_counter_uncharge(&from->res, PAGE_SIZE);
> mem_cgroup_charge_statistics(from, pc, false);
> +
> + page = pc->page;
> + if (page_is_file_cache(page) && page_mapped(page)) {
> + cpu = smp_processor_id();
> + /* Update mapped_file data for mem_cgroup "from" */
> + stat = &from->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
> + -1);
> +
> + /* Update mapped_file data for mem_cgroup "to" */
> + stat = &to->stat;
> + cpustat = &stat->cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
> + 1);
> + }

This function (mem_cgroup_move_account()) does a trylock_page_cgroup()
and if that fails it will bale out, and the newly-added code will not
be executed.

What are the implications of this? Does the missed accounting later get
performed somewhere, or does the error remain in place?

That trylock_page_cgroup() really sucks - trylocks usually do. Could
someone please raise a patch which completely documents the reasons for
its presence, and for any other uncommented/unobvious trylocks?

Where appropriate, the comment should explain why the trylock isn't
simply a bug - why it is safe and correct to omit the operations which
we wished to perform.

Thanks.

2009-04-22 00:04:04

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v3)

On Tue, 21 Apr 2009 13:25:51 -0700
Andrew Morton <[email protected]> wrote:

> On Fri, 17 Apr 2009 19:48:38 +0530
> Balbir Singh <[email protected]> wrote:
>
> >
> > ...
> >
> > We currently don't track file RSS, the RSS we report is actually anon RSS.
> > All the file mapped pages, come in through the page cache and get accounted
> > there. This patch adds support for accounting file RSS pages. It should
> >
> > 1. Help improve the metrics reported by the memory resource controller
> > 2. Will form the basis for a future shared memory accounting heuristic
> > that has been proposed by Kamezawa.
> >
> > Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
> > to "anon_rss". We however, add "mapped_file" data and hope to educate the end
> > user through documentation.
> >
> > Signed-off-by: Balbir Singh <[email protected]>
> >
> > ...
> >
> > @@ -1096,6 +1135,10 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> > struct mem_cgroup_per_zone *from_mz, *to_mz;
> > int nid, zid;
> > int ret = -EBUSY;
> > + struct page *page;
> > + int cpu;
> > + struct mem_cgroup_stat *stat;
> > + struct mem_cgroup_stat_cpu *cpustat;
> >
> > VM_BUG_ON(from == to);
> > VM_BUG_ON(PageLRU(pc->page));
> > @@ -1116,6 +1159,23 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> >
> > res_counter_uncharge(&from->res, PAGE_SIZE);
> > mem_cgroup_charge_statistics(from, pc, false);
> > +
> > + page = pc->page;
> > + if (page_is_file_cache(page) && page_mapped(page)) {
> > + cpu = smp_processor_id();
> > + /* Update mapped_file data for mem_cgroup "from" */
> > + stat = &from->stat;
> > + cpustat = &stat->cpustat[cpu];
> > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
> > + -1);
> > +
> > + /* Update mapped_file data for mem_cgroup "to" */
> > + stat = &to->stat;
> > + cpustat = &stat->cpustat[cpu];
> > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
> > + 1);
> > + }
>
> This function (mem_cgroup_move_account()) does a trylock_page_cgroup()
> and if that fails it will bale out, and the newly-added code will not
> be executed.
yes. and returns -EBUSY.

>
> What are the implications of this? Does the missed accounting later get
> performed somewhere, or does the error remain in place?
>
no error just -BUSY. the caller (now, only force_empty is the caller) will do retry.

> That trylock_page_cgroup() really sucks - trylocks usually do. Could
> someone please raise a patch which completely documents the reasons for
> its presence, and for any other uncommented/unobvious trylocks?
>
> Where appropriate, the comment should explain why the trylock isn't
> simply a bug - why it is safe and correct to omit the operations which
> we wished to perform.
>
> Thanks.
>
Hmm...maybe we can replace trylock with lock, here.

IIRC, this has been trylock because the old routine uses other locks
(mem_cgroup' zone mz->lru_lock) before calling this.
mz->lru_lock
lock_page_cgroup()
And there was other routine which calls lock_page_cgroup()->mz->lru_lock.
lock_page_cgroup()
-> mz->lru_lock.

So, I used trylock here. But now, the lock(mz->lru_lock) is removed.
I should check this.

Thank you for pointing out.

Regards,
-Kame

2009-04-22 03:18:25

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH] memcg: remove trylock_page_cgroup

How about this ? worth to be tested, I think.
-Kame
==
From: KAMEZAWA Hiroyuki <[email protected]>

Before synchronized-LRU patch, mem cgroup had its own LRU lock.
And there was a code which does
# assume mz as per zone struct of memcg.

spin_lock mz->lru_lock
lock_page_cgroup(pc).
and
lock_page_cgroup(pc)
spin_lock mz->lru_lock

because we cannot locate "mz" until we see pc->page_cgroup, we used
trylock(). But now, we don't have mz->lru_lock. All cgroup
uses zone->lru_lock for handling list. Moreover, manipulation of
LRU depends on global LRU now and we can isolate page from LRU by
very generic way.(isolate_lru_page()).
So, this kind of trylock is not necessary now.

I thought I removed all trylock in synchronized-LRU patch but there
is still one. This patch removes trylock used in memcontrol.c and
its definition. If someone needs, he should add this again with enough
reason.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page_cgroup.h | 5 -----
mm/memcontrol.c | 3 +--
2 files changed, 1 insertion(+), 7 deletions(-)

Index: mmotm-2.6.30-Apr21/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.30-Apr21.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.30-Apr21/include/linux/page_cgroup.h
@@ -61,11 +61,6 @@ static inline void lock_page_cgroup(stru
bit_spin_lock(PCG_LOCK, &pc->flags);
}

-static inline int trylock_page_cgroup(struct page_cgroup *pc)
-{
- return bit_spin_trylock(PCG_LOCK, &pc->flags);
-}
-
static inline void unlock_page_cgroup(struct page_cgroup *pc)
{
bit_spin_unlock(PCG_LOCK, &pc->flags);
Index: mmotm-2.6.30-Apr21/mm/memcontrol.c
===================================================================
--- mmotm-2.6.30-Apr21.orig/mm/memcontrol.c
+++ mmotm-2.6.30-Apr21/mm/memcontrol.c
@@ -1148,8 +1148,7 @@ static int mem_cgroup_move_account(struc
from_mz = mem_cgroup_zoneinfo(from, nid, zid);
to_mz = mem_cgroup_zoneinfo(to, nid, zid);

- if (!trylock_page_cgroup(pc))
- return ret;
+ lock_page_cgroup(pc);

if (!PageCgroupUsed(pc))
goto out;

2009-04-22 03:20:42

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add file based RSS accounting for memory resource controller (v3)

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-22 09:02:18]:

> On Tue, 21 Apr 2009 13:25:51 -0700
> Andrew Morton <[email protected]> wrote:
>
> > On Fri, 17 Apr 2009 19:48:38 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > >
> > > ...
> > >
> > > We currently don't track file RSS, the RSS we report is actually anon RSS.
> > > All the file mapped pages, come in through the page cache and get accounted
> > > there. This patch adds support for accounting file RSS pages. It should
> > >
> > > 1. Help improve the metrics reported by the memory resource controller
> > > 2. Will form the basis for a future shared memory accounting heuristic
> > > that has been proposed by Kamezawa.
> > >
> > > Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat
> > > to "anon_rss". We however, add "mapped_file" data and hope to educate the end
> > > user through documentation.
> > >
> > > Signed-off-by: Balbir Singh <[email protected]>
> > >
> > > ...
> > >
> > > @@ -1096,6 +1135,10 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> > > struct mem_cgroup_per_zone *from_mz, *to_mz;
> > > int nid, zid;
> > > int ret = -EBUSY;
> > > + struct page *page;
> > > + int cpu;
> > > + struct mem_cgroup_stat *stat;
> > > + struct mem_cgroup_stat_cpu *cpustat;
> > >
> > > VM_BUG_ON(from == to);
> > > VM_BUG_ON(PageLRU(pc->page));
> > > @@ -1116,6 +1159,23 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> > >
> > > res_counter_uncharge(&from->res, PAGE_SIZE);
> > > mem_cgroup_charge_statistics(from, pc, false);
> > > +
> > > + page = pc->page;
> > > + if (page_is_file_cache(page) && page_mapped(page)) {
> > > + cpu = smp_processor_id();
> > > + /* Update mapped_file data for mem_cgroup "from" */
> > > + stat = &from->stat;
> > > + cpustat = &stat->cpustat[cpu];
> > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
> > > + -1);
> > > +
> > > + /* Update mapped_file data for mem_cgroup "to" */
> > > + stat = &to->stat;
> > > + cpustat = &stat->cpustat[cpu];
> > > + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
> > > + 1);
> > > + }
> >
> > This function (mem_cgroup_move_account()) does a trylock_page_cgroup()
> > and if that fails it will bale out, and the newly-added code will not
> > be executed.
> yes. and returns -EBUSY.
>
> >
> > What are the implications of this? Does the missed accounting later get
> > performed somewhere, or does the error remain in place?
> >
> no error just -BUSY. the caller (now, only force_empty is the caller) will do retry.
>
> > That trylock_page_cgroup() really sucks - trylocks usually do. Could
> > someone please raise a patch which completely documents the reasons for
> > its presence, and for any other uncommented/unobvious trylocks?
> >
> > Where appropriate, the comment should explain why the trylock isn't
> > simply a bug - why it is safe and correct to omit the operations which
> > we wished to perform.
> >
> > Thanks.
> >
> Hmm...maybe we can replace trylock with lock, here.
>
> IIRC, this has been trylock because the old routine uses other locks
> (mem_cgroup' zone mz->lru_lock) before calling this.
> mz->lru_lock
> lock_page_cgroup()
> And there was other routine which calls lock_page_cgroup()->mz->lru_lock.
> lock_page_cgroup()
> -> mz->lru_lock.
>
> So, I used trylock here. But now, the lock(mz->lru_lock) is removed.
> I should check this.
>
> Thank you for pointing out.
>

This is definitely worth looking into. Since we run force_empty() in a
while loop with some margin, we've probably avoided the problem. I
think this code needs a second look and refactoring.



--
Balbir

2009-04-22 03:44:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove trylock_page_cgroup

On Wed, 22 Apr 2009 12:16:41 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:

> How about this ? worth to be tested, I think.
> -Kame
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Before synchronized-LRU patch, mem cgroup had its own LRU lock.
> And there was a code which does
> # assume mz as per zone struct of memcg.
>
> spin_lock mz->lru_lock
> lock_page_cgroup(pc).
> and
> lock_page_cgroup(pc)
> spin_lock mz->lru_lock
>
> because we cannot locate "mz" until we see pc->page_cgroup, we used
> trylock(). But now, we don't have mz->lru_lock. All cgroup
> uses zone->lru_lock for handling list. Moreover, manipulation of
> LRU depends on global LRU now and we can isolate page from LRU by
> very generic way.(isolate_lru_page()).
> So, this kind of trylock is not necessary now.
>
> I thought I removed all trylock in synchronized-LRU patch but there
> is still one. This patch removes trylock used in memcontrol.c and
> its definition. If someone needs, he should add this again with enough
> reason.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/page_cgroup.h | 5 -----
> mm/memcontrol.c | 3 +--
> 2 files changed, 1 insertion(+), 7 deletions(-)
>
> Index: mmotm-2.6.30-Apr21/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.30-Apr21.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.30-Apr21/include/linux/page_cgroup.h
> @@ -61,11 +61,6 @@ static inline void lock_page_cgroup(stru
> bit_spin_lock(PCG_LOCK, &pc->flags);
> }
>
> -static inline int trylock_page_cgroup(struct page_cgroup *pc)
> -{
> - return bit_spin_trylock(PCG_LOCK, &pc->flags);
> -}
> -
> static inline void unlock_page_cgroup(struct page_cgroup *pc)
> {
> bit_spin_unlock(PCG_LOCK, &pc->flags);
> Index: mmotm-2.6.30-Apr21/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.30-Apr21.orig/mm/memcontrol.c
> +++ mmotm-2.6.30-Apr21/mm/memcontrol.c
> @@ -1148,8 +1148,7 @@ static int mem_cgroup_move_account(struc
> from_mz = mem_cgroup_zoneinfo(from, nid, zid);
> to_mz = mem_cgroup_zoneinfo(to, nid, zid);
>
> - if (!trylock_page_cgroup(pc))
> - return ret;
> + lock_page_cgroup(pc);
>
> if (!PageCgroupUsed(pc))
> goto out;

But we can't remove that nasty `while (loop--)' thing?

I expect that it will reliably fail if the caller is running as
SCHED_FIFO and the machine is single-CPU, or if we're trying to yield
to a SCHED_OTHER task which is pinned to this CPU, etc. The cond_resched()
won't work.

2009-04-22 04:42:53

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove trylock_page_cgroup

On Tue, 21 Apr 2009 20:41:04 -0700
Andrew Morton <[email protected]> wrote:

> On Wed, 22 Apr 2009 12:16:41 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > How about this ? worth to be tested, I think.
> > -Kame
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Before synchronized-LRU patch, mem cgroup had its own LRU lock.
> > And there was a code which does
> > # assume mz as per zone struct of memcg.
> >
> > spin_lock mz->lru_lock
> > lock_page_cgroup(pc).
> > and
> > lock_page_cgroup(pc)
> > spin_lock mz->lru_lock
> >
> > because we cannot locate "mz" until we see pc->page_cgroup, we used
> > trylock(). But now, we don't have mz->lru_lock. All cgroup
> > uses zone->lru_lock for handling list. Moreover, manipulation of
> > LRU depends on global LRU now and we can isolate page from LRU by
> > very generic way.(isolate_lru_page()).
> > So, this kind of trylock is not necessary now.
> >
> > I thought I removed all trylock in synchronized-LRU patch but there
> > is still one. This patch removes trylock used in memcontrol.c and
> > its definition. If someone needs, he should add this again with enough
> > reason.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > include/linux/page_cgroup.h | 5 -----
> > mm/memcontrol.c | 3 +--
> > 2 files changed, 1 insertion(+), 7 deletions(-)
> >
> > Index: mmotm-2.6.30-Apr21/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-2.6.30-Apr21.orig/include/linux/page_cgroup.h
> > +++ mmotm-2.6.30-Apr21/include/linux/page_cgroup.h
> > @@ -61,11 +61,6 @@ static inline void lock_page_cgroup(stru
> > bit_spin_lock(PCG_LOCK, &pc->flags);
> > }
> >
> > -static inline int trylock_page_cgroup(struct page_cgroup *pc)
> > -{
> > - return bit_spin_trylock(PCG_LOCK, &pc->flags);
> > -}
> > -
> > static inline void unlock_page_cgroup(struct page_cgroup *pc)
> > {
> > bit_spin_unlock(PCG_LOCK, &pc->flags);
> > Index: mmotm-2.6.30-Apr21/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.30-Apr21.orig/mm/memcontrol.c
> > +++ mmotm-2.6.30-Apr21/mm/memcontrol.c
> > @@ -1148,8 +1148,7 @@ static int mem_cgroup_move_account(struc
> > from_mz = mem_cgroup_zoneinfo(from, nid, zid);
> > to_mz = mem_cgroup_zoneinfo(to, nid, zid);
> >
> > - if (!trylock_page_cgroup(pc))
> > - return ret;
> > + lock_page_cgroup(pc);
> >
> > if (!PageCgroupUsed(pc))
> > goto out;
>
> But we can't remove that nasty `while (loop--)' thing?
>
every call which use isolate_lru_page() should handle isolatation failure.
But its ok to remove force_empty_list()'s loop-- becasue we do retry
in force_empty()
force_empty() # does retry.
-> force_empty_list() # does retry.

> I expect that it will reliably fail if the caller is running as
> SCHED_FIFO and the machine is single-CPU, or if we're trying to yield
> to a SCHED_OTHER task which is pinned to this CPU, etc. The cond_resched()
> won't work.
>
Hm, signal_pending() is supported now (so special user scan use alaram())
I used yield() before cond_resched() but I was told don't use it.
Should I replace cond_resched() with congestion_wait(HZ/10) or some ?

But I'd like to do that in other patch than this patch bacause it
chages force_empty()'s logic.

Thanks,
-Kame

2009-04-22 06:05:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove trylock_page_cgroup

On Wed, 22 Apr 2009 13:41:08 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:

> > I expect that it will reliably fail if the caller is running as
> > SCHED_FIFO and the machine is single-CPU, or if we're trying to yield
> > to a SCHED_OTHER task which is pinned to this CPU, etc. The cond_resched()
> > won't work.
> >
> Hm, signal_pending() is supported now (so special user scan use alaram())
> I used yield() before cond_resched() but I was told don't use it.
> Should I replace cond_resched() with congestion_wait(HZ/10) or some ?

msleep(1) would be typical. That can also be used to give a
predictable number of seconds for the timeout.

If 1 millisecond is too coarse then it's possible to sleep for much
shorter intervals if the platform implements hi-res timers. We don't
appear to have a handy interface to that (usleep, microsleep,
nanosleep, etc?).

And an attempt to sleep for 1us will fall back to 1/HZ if the platform
doesn't implement hi-res timers, so that loop will need to be turned
into a do {} while(!timer_after(jiffies, start))) thing. Probably it
should be converted to that anyway, to be better behaved/predictable,
etc.

2009-04-22 06:15:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove trylock_page_cgroup

On Tue, 21 Apr 2009 23:01:47 -0700
Andrew Morton <[email protected]> wrote:

> On Wed, 22 Apr 2009 13:41:08 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > > I expect that it will reliably fail if the caller is running as
> > > SCHED_FIFO and the machine is single-CPU, or if we're trying to yield
> > > to a SCHED_OTHER task which is pinned to this CPU, etc. The cond_resched()
> > > won't work.
> > >
> > Hm, signal_pending() is supported now (so special user scan use alaram())
> > I used yield() before cond_resched() but I was told don't use it.
> > Should I replace cond_resched() with congestion_wait(HZ/10) or some ?
>
> msleep(1) would be typical. That can also be used to give a
> predictable number of seconds for the timeout.
>
> If 1 millisecond is too coarse then it's possible to sleep for much
> shorter intervals if the platform implements hi-res timers. We don't
> appear to have a handy interface to that (usleep, microsleep,
> nanosleep, etc?).
>
> And an attempt to sleep for 1us will fall back to 1/HZ if the platform
> doesn't implement hi-res timers, so that loop will need to be turned
> into a do {} while(!timer_after(jiffies, start))) thing. Probably it
> should be converted to that anyway, to be better behaved/predictable,
> etc.
>
I'm now consdiering to make this loop _short_ and call lru_add_drain_all()
when the loop finds a page is busy. (lru_add_drain_all() will sleep until
the end of keventd.) And yes, checking jiffies and do sleep if it seems
necessary.


I'll write some RFC today.

Thanks,
-Kame