2009-04-07 06:39:32

by Balbir Singh

[permalink] [raw]
Subject: [RFI] Shared accounting for memory resource controller

Hi, All,

This is a request for input for the design of shared page accounting for
the memory resource controller, here is what I have so far

Motivation for shared page accounting
-------------------------------------
1. Memory cgroup administrators will benefit from the knowledge of how
much of the data is shared, it helps size the groups correctly.
2. We currently report only the pages brought in by the cgroup, knowledge
of shared data will give a complete picture of the actual usage.

Use rmap to account sharing/unsharing through mapcount
-------------------------------------------------------

The current page has links to

+-------+
| |
| +--->pc->mem_cgroup (first mem_cgroup to touch the page)
| |
| page |
| +--->mapping (used for rmap)
| |
+-------+

While accounting shared pages works well, as pages get unshared, I've hit a
problem. Here is the current flow for shared accounting

Flow for sharing
----------------
1. Page not yet mapped anywhere (_mapcount is -1 and mem_cgroup,mapping is NULL)
2. Page gets mapped for the first time (_mapcount is 0, mem_cgroup points
to the memory resource group that brought in the page, mapping is set)
3. Page gets shared (_mapcount is 1, mem_cgroup points to the cgroup that
brought in the page, mapping is set and now has rmap information)

When a page is being shared at step 3, we detect we are sharing the page and

1. For page->pc->mem_cgroup, we note that the page is being shared
2. For any vma that maps this page, we get to vma->vm_mm and then to the
other mem_cgroup that is sharing this page and note this page is being
shared.

So far so good

When a page is being uncharged

1. We note that we need to deduct the shared accounting from the mem_cgroup
2. When the _mapcount reaches 0, we have no way of knowing which of the
mm's or mem_cgroup's is left behind. The original page->pc->mem_cgroup
could have unmapped this page long time back. At this point we want
to note the only mm that has this page mapped and the mem_cgroup is not
sharing the page, but that the page is private to it.

Figuring out the mem_cgroup/mm for the last uncharge, requires a rmap
lookup, which we cannot do with PTE lock held (I have all my hooks in
page_add.*rmap() and page_remove_rmap()).

Questions, suggestions

1. Does it make sense to use the rmap routines for shared accounting?
2. How do we solve the problem of the last unshare causing the pages
becoming private
a. Can we use rmap?
b. Can we live with leaving the page being marked as shared, even
though it is no longer shared?


--
Balbir


2009-04-07 07:02:21

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Tue, 7 Apr 2009 12:07:22 +0530
Balbir Singh <[email protected]> wrote:

> Hi, All,
>
> This is a request for input for the design of shared page accounting for
> the memory resource controller, here is what I have so far
>

In my first impression, I think simple counting is impossible.
IOW, "usage count" and "shared or not" is very different problem.

Assume a page and its page_cgroup.

Case 1)
1. a page is mapped by process-X under group-A
2. its mapped by process-Y in group-B (now, shared and charged under group-A)
3. move process-X to group-B
4. now the page is not shared.

Case 2)
swap is an object which can be shared.

Case 3)
1. a page known as "A" is mapped by process-X under group-A.
2. its mapped by process-Y under group-B(now, shared and charged under group-A)
3. Do copy-on-write by process-X.
Now, "A" is mapped only by B but accoutned under group-A.
This case is ignored intentionally, now.
Do you want to call try_charge() both against group-A and group-B
under process-X's page fault ?

There will be many many corner case.


> Motivation for shared page accounting
> -------------------------------------
> 1. Memory cgroup administrators will benefit from the knowledge of how
> much of the data is shared, it helps size the groups correctly.
> 2. We currently report only the pages brought in by the cgroup, knowledge
> of shared data will give a complete picture of the actual usage.
>

Motivation sounds good. But counting this in generic rmap will have tons of
troubles and slow-down.

I bet we should prepare a file as
/proc/<pid>/cgroup_maps

And show RSS/RSS-owned-by-us per process. Maybe this feature will be able to be
implemented in 3 days.

Thanks,
-Kame

2009-04-07 07:19:20

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:00:14]:

> On Tue, 7 Apr 2009 12:07:22 +0530
> Balbir Singh <[email protected]> wrote:
>
> > Hi, All,
> >
> > This is a request for input for the design of shared page accounting for
> > the memory resource controller, here is what I have so far
> >
>
> In my first impression, I think simple counting is impossible.
> IOW, "usage count" and "shared or not" is very different problem.
>
> Assume a page and its page_cgroup.
>
> Case 1)
> 1. a page is mapped by process-X under group-A
> 2. its mapped by process-Y in group-B (now, shared and charged under group-A)
> 3. move process-X to group-B
> 4. now the page is not shared.
>

By shared I don't mean only between cgroups, it could be a page shared
in the same cgroup

> Case 2)
> swap is an object which can be shared.
>

Good point, I expect the user to account all cached pages as shared -
no?

> Case 3)
> 1. a page known as "A" is mapped by process-X under group-A.
> 2. its mapped by process-Y under group-B(now, shared and charged under group-A)
> 3. Do copy-on-write by process-X.
> Now, "A" is mapped only by B but accoutned under group-A.
> This case is ignored intentionally, now.

Yes, that is the original design

> Do you want to call try_charge() both against group-A and group-B
> under process-X's page fault ?
>

No we don't, but copy-on-write is caught at page_rmap_dup() - no?

> There will be many many corner case.
>
>
> > Motivation for shared page accounting
> > -------------------------------------
> > 1. Memory cgroup administrators will benefit from the knowledge of how
> > much of the data is shared, it helps size the groups correctly.
> > 2. We currently report only the pages brought in by the cgroup, knowledge
> > of shared data will give a complete picture of the actual usage.
> >
>
> Motivation sounds good. But counting this in generic rmap will have tons of
> troubles and slow-down.
>
> I bet we should prepare a file as
> /proc/<pid>/cgroup_maps
>
> And show RSS/RSS-owned-by-us per process. Maybe this feature will be able to be
> implemented in 3 days.

Yes, we can probably do that, but if we have too many processes in one
cgroup, we'll need to walk across all of them in user space. One other
alternative I did not mention is to walk the LRU like we walk page
tables and look at page_mapcount of every page, but that will be
very slow.

>
> Thanks,
> -Kame
>
>

--
Balbir

2009-04-07 07:35:22

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Tue, 7 Apr 2009 12:48:25 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:00:14]:
>
> > On Tue, 7 Apr 2009 12:07:22 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > Hi, All,
> > >
> > > This is a request for input for the design of shared page accounting for
> > > the memory resource controller, here is what I have so far
> > >
> >
> > In my first impression, I think simple counting is impossible.
> > IOW, "usage count" and "shared or not" is very different problem.
> >
> > Assume a page and its page_cgroup.
> >
> > Case 1)
> > 1. a page is mapped by process-X under group-A
> > 2. its mapped by process-Y in group-B (now, shared and charged under group-A)
> > 3. move process-X to group-B
> > 4. now the page is not shared.
> >
>
> By shared I don't mean only between cgroups, it could be a page shared
> in the same cgroup
>
Hmm, is it good information ?

Such kind of information can be calucated by
==
rss = 0;
for_each_process_under_cgroup() {
mm = tsk->mm
rss += mm->anon_rss;
}
some_of_all_rss = rss;

shared_ratio = mem_cgrou->rss *100 / some_of_all_rss.
==
if 100%, all anon memory are not shared.


> > Case 2)
> > swap is an object which can be shared.
> >
>
> Good point, I expect the user to account all cached pages as shared -
> no
Maybe yes if we explain it's so ;)

?
>
> > Case 3)
> > 1. a page known as "A" is mapped by process-X under group-A.
> > 2. its mapped by process-Y under group-B(now, shared and charged under group-A)
> > 3. Do copy-on-write by process-X.
> > Now, "A" is mapped only by B but accoutned under group-A.
> > This case is ignored intentionally, now.
>
> Yes, that is the original design
>
> > Do you want to call try_charge() both against group-A and group-B
> > under process-X's page fault ?
> >
>
> No we don't, but copy-on-write is caught at page_rmap_dup() - no?
>
Hmm, if we don't consider group-B, maybe we can.
But I wonder counting is overkill..


> > There will be many many corner case.
> >
> >
> > > Motivation for shared page accounting
> > > -------------------------------------
> > > 1. Memory cgroup administrators will benefit from the knowledge of how
> > > much of the data is shared, it helps size the groups correctly.
> > > 2. We currently report only the pages brought in by the cgroup, knowledge
> > > of shared data will give a complete picture of the actual usage.
> > >
> >
> > Motivation sounds good. But counting this in generic rmap will have tons of
> > troubles and slow-down.
> >
> > I bet we should prepare a file as
> > /proc/<pid>/cgroup_maps
> >
> > And show RSS/RSS-owned-by-us per process. Maybe this feature will be able to be
> > implemented in 3 days.
>
> Yes, we can probably do that, but if we have too many processes in one
> cgroup, we'll need to walk across all of them in user space. One other
> alternative I did not mention is to walk the LRU like we walk page
> tables and look at page_mapcount of every page, but that will be
> very slow.

Can't we make use of information in mm_counters ? (As I shown in above)
(see set/get/add/inc/dec_mm_counters())

Thanks,
-Kame



2009-04-07 08:05:09

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:33:31]:

> On Tue, 7 Apr 2009 12:48:25 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:00:14]:
> >
> > > On Tue, 7 Apr 2009 12:07:22 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > > > Hi, All,
> > > >
> > > > This is a request for input for the design of shared page accounting for
> > > > the memory resource controller, here is what I have so far
> > > >
> > >
> > > In my first impression, I think simple counting is impossible.
> > > IOW, "usage count" and "shared or not" is very different problem.
> > >
> > > Assume a page and its page_cgroup.
> > >
> > > Case 1)
> > > 1. a page is mapped by process-X under group-A
> > > 2. its mapped by process-Y in group-B (now, shared and charged under group-A)
> > > 3. move process-X to group-B
> > > 4. now the page is not shared.
> > >
> >
> > By shared I don't mean only between cgroups, it could be a page shared
> > in the same cgroup
> >
> Hmm, is it good information ?
>
> Such kind of information can be calucated by
> ==
> rss = 0;
> for_each_process_under_cgroup() {
> mm = tsk->mm
> rss += mm->anon_rss;
> }
> some_of_all_rss = rss;
>
> shared_ratio = mem_cgrou->rss *100 / some_of_all_rss.
> ==
> if 100%, all anon memory are not shared.
>

Why only anon? This seems like a good idea, except when we have a page
charged to a cgroup and the task that charged it has migrated, in that
case sum_of_all_rss will be 0.

>
> > > Case 2)
> > > swap is an object which can be shared.
> > >
> >
> > Good point, I expect the user to account all cached pages as shared -
> > no
> Maybe yes if we explain it's so ;)
>
> ?
> >
> > > Case 3)
> > > 1. a page known as "A" is mapped by process-X under group-A.
> > > 2. its mapped by process-Y under group-B(now, shared and charged under group-A)
> > > 3. Do copy-on-write by process-X.
> > > Now, "A" is mapped only by B but accoutned under group-A.
> > > This case is ignored intentionally, now.
> >
> > Yes, that is the original design
> >
> > > Do you want to call try_charge() both against group-A and group-B
> > > under process-X's page fault ?
> > >
> >
> > No we don't, but copy-on-write is caught at page_rmap_dup() - no?
> >
> Hmm, if we don't consider group-B, maybe we can.
> But I wonder counting is overkill..
>
>
> > > There will be many many corner case.
> > >
> > >
> > > > Motivation for shared page accounting
> > > > -------------------------------------
> > > > 1. Memory cgroup administrators will benefit from the knowledge of how
> > > > much of the data is shared, it helps size the groups correctly.
> > > > 2. We currently report only the pages brought in by the cgroup, knowledge
> > > > of shared data will give a complete picture of the actual usage.
> > > >
> > >
> > > Motivation sounds good. But counting this in generic rmap will have tons of
> > > troubles and slow-down.
> > >
> > > I bet we should prepare a file as
> > > /proc/<pid>/cgroup_maps
> > >
> > > And show RSS/RSS-owned-by-us per process. Maybe this feature will be able to be
> > > implemented in 3 days.
> >
> > Yes, we can probably do that, but if we have too many processes in one
> > cgroup, we'll need to walk across all of them in user space. One other
> > alternative I did not mention is to walk the LRU like we walk page
> > tables and look at page_mapcount of every page, but that will be
> > very slow.
>
> Can't we make use of information in mm_counters ? (As I shown in above)
> (see set/get/add/inc/dec_mm_counters())
>

I've seen them, might be a good way to get started, except some corner
cases mentioned above.

--
Balbir

2009-04-07 08:26:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Tue, 7 Apr 2009 13:33:55 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:33:31]:
>
> > On Tue, 7 Apr 2009 12:48:25 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:00:14]:
> > >
> > > > On Tue, 7 Apr 2009 12:07:22 +0530
> > > > Balbir Singh <[email protected]> wrote:
> > > >
> > > > > Hi, All,
> > > > >
> > > > > This is a request for input for the design of shared page accounting for
> > > > > the memory resource controller, here is what I have so far
> > > > >
> > > >
> > > > In my first impression, I think simple counting is impossible.
> > > > IOW, "usage count" and "shared or not" is very different problem.
> > > >
> > > > Assume a page and its page_cgroup.
> > > >
> > > > Case 1)
> > > > 1. a page is mapped by process-X under group-A
> > > > 2. its mapped by process-Y in group-B (now, shared and charged under group-A)
> > > > 3. move process-X to group-B
> > > > 4. now the page is not shared.
> > > >
> > >
> > > By shared I don't mean only between cgroups, it could be a page shared
> > > in the same cgroup
> > >
> > Hmm, is it good information ?
> >
> > Such kind of information can be calucated by
> > ==
> > rss = 0;
> > for_each_process_under_cgroup() {
> > mm = tsk->mm
> > rss += mm->anon_rss;
> > }
> > some_of_all_rss = rss;
> >
> > shared_ratio = mem_cgrou->rss *100 / some_of_all_rss.
> > ==
> > if 100%, all anon memory are not shared.
> >
>
> Why only anon?

no serious intention.
Just because you wrote "expect the user to account all cached pages as shared" ;)

> This seems like a good idea, except when we have a page
> charged to a cgroup and the task that charged it has migrated, in that
> case sum_of_all_rss will be 0.
>
Yes. But we don't move pages at task-move under expectation that moved
process will call fork() soon.
"task move" has its own problem, so ignoring it for now is a choice.
That kind of troubls can be treated when we fixes "task move".
(or fix "task move" first.)

Thanks,
-Kame

2009-04-07 10:11:34

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 17:24:19]:

> On Tue, 7 Apr 2009 13:33:55 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:33:31]:
> >
> > > On Tue, 7 Apr 2009 12:48:25 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:00:14]:
> > > >
> > > > > On Tue, 7 Apr 2009 12:07:22 +0530
> > > > > Balbir Singh <[email protected]> wrote:
> > > > >
> > > > > > Hi, All,
> > > > > >
> > > > > > This is a request for input for the design of shared page accounting for
> > > > > > the memory resource controller, here is what I have so far
> > > > > >
> > > > >
> > > > > In my first impression, I think simple counting is impossible.
> > > > > IOW, "usage count" and "shared or not" is very different problem.
> > > > >
> > > > > Assume a page and its page_cgroup.
> > > > >
> > > > > Case 1)
> > > > > 1. a page is mapped by process-X under group-A
> > > > > 2. its mapped by process-Y in group-B (now, shared and charged under group-A)
> > > > > 3. move process-X to group-B
> > > > > 4. now the page is not shared.
> > > > >
> > > >
> > > > By shared I don't mean only between cgroups, it could be a page shared
> > > > in the same cgroup
> > > >
> > > Hmm, is it good information ?
> > >
> > > Such kind of information can be calucated by
> > > ==
> > > rss = 0;
> > > for_each_process_under_cgroup() {
> > > mm = tsk->mm
> > > rss += mm->anon_rss;
> > > }
> > > some_of_all_rss = rss;
> > >
> > > shared_ratio = mem_cgrou->rss *100 / some_of_all_rss.
> > > ==
> > > if 100%, all anon memory are not shared.
> > >
> >
> > Why only anon?
>
> no serious intention.
> Just because you wrote "expect the user to account all cached pages as shared" ;)

OK, I think we should mention that we can treat unmapped cache as
shared :)

>
> > This seems like a good idea, except when we have a page
> > charged to a cgroup and the task that charged it has migrated, in that
> > case sum_of_all_rss will be 0.
> >
> Yes. But we don't move pages at task-move under expectation that moved
> process will call fork() soon.
> "task move" has its own problem, so ignoring it for now is a choice.
> That kind of troubls can be treated when we fixes "task move".
> (or fix "task move" first.)
>

Yes, but the point I was making was that we could have pages left over
without tasks remaining, in the case of shared pages. I think we can
handle them suitably, probably an implementation issue.

--
Balbir

2009-04-08 05:29:51

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 17:24:19]:

> On Tue, 7 Apr 2009 13:33:55 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:33:31]:
> >
> > > On Tue, 7 Apr 2009 12:48:25 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > > > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-07 16:00:14]:
> > > >
> > > > > On Tue, 7 Apr 2009 12:07:22 +0530
> > > > > Balbir Singh <[email protected]> wrote:
> > > > >
> > > > > > Hi, All,
> > > > > >
> > > > > > This is a request for input for the design of shared page accounting for
> > > > > > the memory resource controller, here is what I have so far
> > > > > >
> > > > >
> > > > > In my first impression, I think simple counting is impossible.
> > > > > IOW, "usage count" and "shared or not" is very different problem.
> > > > >
> > > > > Assume a page and its page_cgroup.
> > > > >
> > > > > Case 1)
> > > > > 1. a page is mapped by process-X under group-A
> > > > > 2. its mapped by process-Y in group-B (now, shared and charged under group-A)
> > > > > 3. move process-X to group-B
> > > > > 4. now the page is not shared.
> > > > >
> > > >
> > > > By shared I don't mean only between cgroups, it could be a page shared
> > > > in the same cgroup
> > > >
> > > Hmm, is it good information ?
> > >
> > > Such kind of information can be calucated by
> > > ==
> > > rss = 0;
> > > for_each_process_under_cgroup() {
> > > mm = tsk->mm
> > > rss += mm->anon_rss;
> > > }
> > > some_of_all_rss = rss;
> > >
> > > shared_ratio = mem_cgrou->rss *100 / some_of_all_rss.
> > > ==
> > > if 100%, all anon memory are not shared.
> > >
> >
> > Why only anon?
>
> no serious intention.
> Just because you wrote "expect the user to account all cached pages as shared" ;)
>

OK, I noticed another thing, our RSS accounting is not RSS per-se, it
includes only anon RSS, file backed pages are accounted as cached.
I'll send out a patch to see if we can include anon RSS as well.

> > This seems like a good idea, except when we have a page
> > charged to a cgroup and the task that charged it has migrated, in that
> > case sum_of_all_rss will be 0.
> >
> Yes. But we don't move pages at task-move under expectation that moved
> process will call fork() soon.
> "task move" has its own problem, so ignoring it for now is a choice.
> That kind of troubls can be treated when we fixes "task move".
> (or fix "task move" first.)
>
> Thanks,
> -Kame
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
>

--
Balbir

2009-04-08 06:17:17

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, 8 Apr 2009 10:59:04 +0530
Balbir Singh <[email protected]> wrote:


> > no serious intention.
> > Just because you wrote "expect the user to account all cached pages as shared" ;)
> >
>
> OK, I noticed another thing, our RSS accounting is not RSS per-se, it
> includes only anon RSS, file backed pages are accounted as cached.
> I'll send out a patch to see if we can include anon RSS as well.
>

I think we can't do it in memcg layer without new-hook because file caches
are added to radix-tree before mapped.

mm struct has anon_rss and file_rss coutners. Then, you can show
sum of total maps of file pages. maybe.

Thanks,
-Kame

2009-04-08 07:05:03

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 15:15:29]:

> On Wed, 8 Apr 2009 10:59:04 +0530
> Balbir Singh <[email protected]> wrote:
>
>
> > > no serious intention.
> > > Just because you wrote "expect the user to account all cached pages as shared" ;)
> > >
> >
> > OK, I noticed another thing, our RSS accounting is not RSS per-se, it
> > includes only anon RSS, file backed pages are accounted as cached.
> > I'll send out a patch to see if we can include anon RSS as well.
> >
>
> I think we can't do it in memcg layer without new-hook because file caches
> are added to radix-tree before mapped.
>
> mm struct has anon_rss and file_rss coutners. Then, you can show
> sum of total maps of file pages. maybe.
>

Yes, correct and that is a hook worth adding, IMHO. Better statistics
are critical and it will also help us with the shared memory
accounting. Without that we can't account for file rss in the memory
cgroup.

--
Balbir

2009-04-08 07:09:25

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, 8 Apr 2009 12:34:01 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 15:15:29]:
>
> > On Wed, 8 Apr 2009 10:59:04 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> >
> > > > no serious intention.
> > > > Just because you wrote "expect the user to account all cached pages as shared" ;)
> > > >
> > >
> > > OK, I noticed another thing, our RSS accounting is not RSS per-se, it
> > > includes only anon RSS, file backed pages are accounted as cached.
> > > I'll send out a patch to see if we can include anon RSS as well.
> > >
> >
> > I think we can't do it in memcg layer without new-hook because file caches
> > are added to radix-tree before mapped.
> >
> > mm struct has anon_rss and file_rss coutners. Then, you can show
> > sum of total maps of file pages. maybe.
> >
>
> Yes, correct and that is a hook worth adding, IMHO. Better statistics
> are critical and it will also help us with the shared memory
> accounting. Without that we can't account for file rss in the memory
> cgroup.
>
Finally, you'll be asked "is it necessary ?", if the cost is big.
>From my point of view, I can't see what new information it will give us.
But maybe useful because the user can avoid some calculation.

Cheers,
-Kame

2009-04-08 07:12:09

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 16:07:33]:

> On Wed, 8 Apr 2009 12:34:01 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 15:15:29]:
> >
> > > On Wed, 8 Apr 2009 10:59:04 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > >
> > > > > no serious intention.
> > > > > Just because you wrote "expect the user to account all cached pages as shared" ;)
> > > > >
> > > >
> > > > OK, I noticed another thing, our RSS accounting is not RSS per-se, it
> > > > includes only anon RSS, file backed pages are accounted as cached.
> > > > I'll send out a patch to see if we can include anon RSS as well.
> > > >
> > >
> > > I think we can't do it in memcg layer without new-hook because file caches
> > > are added to radix-tree before mapped.
> > >
> > > mm struct has anon_rss and file_rss coutners. Then, you can show
> > > sum of total maps of file pages. maybe.
> > >
> >
> > Yes, correct and that is a hook worth adding, IMHO. Better statistics
> > are critical and it will also help us with the shared memory
> > accounting. Without that we can't account for file rss in the memory
> > cgroup.
> >
> Finally, you'll be asked "is it necessary ?", if the cost is big.
> >From my point of view, I can't see what new information it will give us.
> But maybe useful because the user can avoid some calculation.

OK, here is what I see

1. First our rss in memory.stat is confusing, we should call it anon
RSS
2. We need to add file rss, this is sort of inline with the
information we export per process file_rss and anon_rss
3. Using the above, we can then try to (using an algorithm you
proposed), try to do some work for figuring out the shared percentage.

--
Balbir

2009-04-08 07:20:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, 8 Apr 2009 12:41:15 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 16:07:33]:
> 1. First our rss in memory.stat is confusing, we should call it anon
> RSS
ok. but ....changing current interface ?

> 2. We need to add file rss, this is sort of inline with the
> information we export per process file_rss and anon_rss

maybe good. *but* active/incative ratio in lru file cache is good estimation for this.

> 3. Using the above, we can then try to (using an algorithm you
> proposed), try to do some work for figuring out the shared percentage.
>
This is the point. At last. Why "# of shared pages" is important ?

I wonder it's better to add new stat file as memory.cacheinfo which helps
following kind of commands.

#cacheinfo /cgroups/memory/group01/
/usr/lib/libc.so.1 30pages
/var/log/messages 1 pages
/tmp/xxxxxx 20 pages
.....
.....

Thanks,
-Kame

2009-04-08 07:31:35

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, Apr 8, 2009 at 12:48 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> On Wed, 8 Apr 2009 12:41:15 +0530
> Balbir Singh <[email protected]> wrote:
> > 3. Using the above, we can then try to (using an algorithm you
> > proposed), try to do some work for figuring out the shared percentage.
> >
> This is the point. At last. Why "# of shared pages" is important ?
>
> I wonder it's better to add new stat file as memory.cacheinfo which helps
> following kind of commands.
>
> ?#cacheinfo /cgroups/memory/group01/
> ? ? ? /usr/lib/libc.so.1 ? ? 30pages
> ? ? ? /var/log/messages ? ? ?1 pages
> ? ? ? /tmp/xxxxxx ? ? ? ? ? ?20 pages

Can I suggest that we don't add new files for additional stats and try
as far as possible to include them in <controller>.stat file. Please
note that we have APIs in libcgroup library which can return
statistics from controllers associated with a cgroup and these APIs
assume that stats are part of <controller>.stat file.

Regards,
Bharata.
--
http://bharata.sulekha.com/blog/posts.htm

2009-04-08 07:36:59

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, 8 Apr 2009 13:01:15 +0530
Bharata B Rao <[email protected]> wrote:

> On Wed, Apr 8, 2009 at 12:48 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > On Wed, 8 Apr 2009 12:41:15 +0530
> > Balbir Singh <[email protected]> wrote:
> > > 3. Using the above, we can then try to (using an algorithm you
> > > proposed), try to do some work for figuring out the shared percentage.
> > >
> > This is the point. At last. Why "# of shared pages" is important ?
> >
> > I wonder it's better to add new stat file as memory.cacheinfo which helps
> > following kind of commands.
> >
> >  #cacheinfo /cgroups/memory/group01/
> >       /usr/lib/libc.so.1     30pages
> >       /var/log/messages      1 pages
> >       /tmp/xxxxxx            20 pages
>
> Can I suggest that we don't add new files for additional stats and try
> as far as possible to include them in <controller>.stat file. Please
> note that we have APIs in libcgroup library which can return
> statistics from controllers associated with a cgroup and these APIs
> assume that stats are part of <controller>.stat file.
>
Hmm ? Is there generic assumption as all cgroup has "stat" file ?
And libcgroup cause bug if the new entry is added to stat file ?
(IOW, libcgroup can't ignore new entry added ?)

Thanks,
-Kame

2009-04-08 07:41:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, 8 Apr 2009 16:18:24 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Wed, 8 Apr 2009 12:41:15 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 16:07:33]:
> > 1. First our rss in memory.stat is confusing, we should call it anon
> > RSS
> ok. but ....changing current interface ?
>
> > 2. We need to add file rss, this is sort of inline with the
> > information we export per process file_rss and anon_rss
>
> maybe good. *but* active/incative ratio in lru file cache is good estimation for this.
>
> > 3. Using the above, we can then try to (using an algorithm you
> > proposed), try to do some work for figuring out the shared percentage.
> >
> This is the point. At last. Why "# of shared pages" is important ?
>
> I wonder it's better to add new stat file as memory.cacheinfo which helps
> following kind of commands.
>
> #cacheinfo /cgroups/memory/group01/
> /usr/lib/libc.so.1 30pages
> /var/log/messages 1 pages
> /tmp/xxxxxx 20 pages
> .....
> .....
To do above, I wonder it's better to add "cache count cgroup" rather than modify memcg.
plz ignore.

Thanks,
-Kame

2009-04-08 07:45:21

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, Apr 8, 2009 at 1:04 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 8 Apr 2009 13:01:15 +0530
> Bharata B Rao <[email protected]> wrote:
>
>> On Wed, Apr 8, 2009 at 12:48 PM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> >
>> > On Wed, 8 Apr 2009 12:41:15 +0530
>> > Balbir Singh <[email protected]> wrote:
>> > > 3. Using the above, we can then try to (using an algorithm you
>> > > proposed), try to do some work for figuring out the shared percentage.
>> > >
>> > This is the point. At last. Why "# of shared pages" is important ?
>> >
>> > I wonder it's better to add new stat file as memory.cacheinfo which helps
>> > following kind of commands.
>> >
>> > ?#cacheinfo /cgroups/memory/group01/
>> > ? ? ? /usr/lib/libc.so.1 ? ? 30pages
>> > ? ? ? /var/log/messages ? ? ?1 pages
>> > ? ? ? /tmp/xxxxxx ? ? ? ? ? ?20 pages
>>
>> Can I suggest that we don't add new files for additional stats and try
>> as far as possible to include them in <controller>.stat file. Please
>> note that we have APIs in libcgroup library which can return
>> statistics from controllers associated with a cgroup and these APIs
>> assume that stats are part of <controller>.stat file.
>>
> Hmm ? Is there generic assumption as all cgroup has "stat" file ?

No. But I would think if any controller has any stats to export, it
would do so via <controller>.stat file.

> And libcgroup cause bug if the new entry is added to stat file ?

No. It can cope with new entries being added to stat file as long as
they appear as (name, value) pairs.

> (IOW, libcgroup can't ignore new entry added ?)

It won't ignore, but can read the new stat.

Regards,
Bharata.

2009-04-08 07:48:55

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 16:18:24]:

> On Wed, 8 Apr 2009 12:41:15 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 16:07:33]:
> > 1. First our rss in memory.stat is confusing, we should call it anon
> > RSS
> ok. but ....changing current interface ?
>

No, lets just a new field called file_rss and make sure the
documentation reflects the correct information.

> > 2. We need to add file rss, this is sort of inline with the
> > information we export per process file_rss and anon_rss
>
> maybe good. *but* active/incative ratio in lru file cache is good estimation for this.

Active/Inactive tell us about frequently a page is referenced rather
than what is mapped and what is not. We could get very bad results if
we make an assumption.

>
> > 3. Using the above, we can then try to (using an algorithm you
> > proposed), try to do some work for figuring out the shared percentage.
> >
> This is the point. At last. Why "# of shared pages" is important ?
>

I posted this in my motivation yesterday. # of shared pages can help
plan the system better and the size of the cgroup. A cgroup might have
small usage_in_bytes but large number of shared pages. We need a
metric that can help figure out the fair usage of the cgroup.

> I wonder it's better to add new stat file as memory.cacheinfo which helps
> following kind of commands.
>
> #cacheinfo /cgroups/memory/group01/
> /usr/lib/libc.so.1 30pages
> /var/log/messages 1 pages
> /tmp/xxxxxx 20 pages
> .....
> .....
>

But, what I need at the moment is shared usage information.

--
Balbir

2009-04-08 07:53:52

by Dhaval Giani

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, Apr 08, 2009 at 01:15:01PM +0530, Bharata B Rao wrote:
> On Wed, Apr 8, 2009 at 1:04 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Wed, 8 Apr 2009 13:01:15 +0530
> > Bharata B Rao <[email protected]> wrote:
> >
> >> On Wed, Apr 8, 2009 at 12:48 PM, KAMEZAWA Hiroyuki
> >> <[email protected]> wrote:
> >> >
> >> > On Wed, 8 Apr 2009 12:41:15 +0530
> >> > Balbir Singh <[email protected]> wrote:
> >> > > 3. Using the above, we can then try to (using an algorithm you
> >> > > proposed), try to do some work for figuring out the shared percentage.
> >> > >
> >> > This is the point. At last. Why "# of shared pages" is important ?
> >> >
> >> > I wonder it's better to add new stat file as memory.cacheinfo which helps
> >> > following kind of commands.
> >> >
> >> > ?#cacheinfo /cgroups/memory/group01/
> >> > ? ? ? /usr/lib/libc.so.1 ? ? 30pages
> >> > ? ? ? /var/log/messages ? ? ?1 pages
> >> > ? ? ? /tmp/xxxxxx ? ? ? ? ? ?20 pages
> >>
> >> Can I suggest that we don't add new files for additional stats and try
> >> as far as possible to include them in <controller>.stat file. Please
> >> note that we have APIs in libcgroup library which can return
> >> statistics from controllers associated with a cgroup and these APIs
> >> assume that stats are part of <controller>.stat file.
> >>
> > Hmm ? Is there generic assumption as all cgroup has "stat" file ?
>
> No. But I would think if any controller has any stats to export, it
> would do so via <controller>.stat file.
>
> > And libcgroup cause bug if the new entry is added to stat file ?
>
> No. It can cope with new entries being added to stat file as long as
> they appear as (name, value) pairs.
>

And if it does not, we should fix it to cope up with it. But I agree
with bharata, we should avoid adding new files, and try to use the stat
file.

thanks,
--
regards,
Dhaval

2009-04-08 08:06:14

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, 8 Apr 2009 13:18:09 +0530
Balbir Singh <[email protected]> wrote:

> > > 3. Using the above, we can then try to (using an algorithm you
> > > proposed), try to do some work for figuring out the shared percentage.
> > >
> > This is the point. At last. Why "# of shared pages" is important ?
> >
>
> I posted this in my motivation yesterday. # of shared pages can help
> plan the system better and the size of the cgroup. A cgroup might have
> small usage_in_bytes but large number of shared pages. We need a
> metric that can help figure out the fair usage of the cgroup.
>
I don't fully understand but NR_FILE_MAPPED is an information in /proc/meminfo.
I personally think I want to support information in /proc/meminfo per memcg.

Hmm ? then, if you add a hook, it seems
== mm/rmap.c
689 void page_add_file_rmap(struct page *page)
690 {
691 if (atomic_inc_and_test(&page->_mapcount))
692 __inc_zone_page_state(page, NR_FILE_MAPPED);
693 }
== page_remove_rmap(struct page *page)
739 __dec_zone_page_state(page,
740 PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
==

Is good place to go, maybe.

page->page_cgroup->mem_cgroup-> inc/dec counter ?

Maybe the patch itself will be simple, overhead is unknown..

Thanks,
-Kame

2009-04-08 08:50:35

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 17:03:41]:

> On Wed, 8 Apr 2009 13:18:09 +0530
> Balbir Singh <[email protected]> wrote:
>
> > > > 3. Using the above, we can then try to (using an algorithm you
> > > > proposed), try to do some work for figuring out the shared percentage.
> > > >
> > > This is the point. At last. Why "# of shared pages" is important ?
> > >
> >
> > I posted this in my motivation yesterday. # of shared pages can help
> > plan the system better and the size of the cgroup. A cgroup might have
> > small usage_in_bytes but large number of shared pages. We need a
> > metric that can help figure out the fair usage of the cgroup.
> >
> I don't fully understand but NR_FILE_MAPPED is an information in /proc/meminfo.
> I personally think I want to support information in /proc/meminfo per memcg.
>
> Hmm ? then, if you add a hook, it seems
> == mm/rmap.c
> 689 void page_add_file_rmap(struct page *page)
> 690 {
> 691 if (atomic_inc_and_test(&page->_mapcount))
> 692 __inc_zone_page_state(page, NR_FILE_MAPPED);
> 693 }
> == page_remove_rmap(struct page *page)
> 739 __dec_zone_page_state(page,
> 740 PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> ==
>
> Is good place to go, maybe.
>
> page->page_cgroup->mem_cgroup-> inc/dec counter ?
>
> Maybe the patch itself will be simple, overhead is unknown..

I thought of the same thing, but then moved to the following

... mem_cgroup_charge_statistics(..) {
if (page_mapcount(page) == 0 && page_is_file_cache(page))
__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_FILE_RSS, val);

But I've not yet tested the end result


--
Balbir

2009-04-08 08:55:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

On Wed, 8 Apr 2009 14:19:52 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 17:03:41]:
>
> > On Wed, 8 Apr 2009 13:18:09 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > > > 3. Using the above, we can then try to (using an algorithm you
> > > > > proposed), try to do some work for figuring out the shared percentage.
> > > > >
> > > > This is the point. At last. Why "# of shared pages" is important ?
> > > >
> > >
> > > I posted this in my motivation yesterday. # of shared pages can help
> > > plan the system better and the size of the cgroup. A cgroup might have
> > > small usage_in_bytes but large number of shared pages. We need a
> > > metric that can help figure out the fair usage of the cgroup.
> > >
> > I don't fully understand but NR_FILE_MAPPED is an information in /proc/meminfo.
> > I personally think I want to support information in /proc/meminfo per memcg.
> >
> > Hmm ? then, if you add a hook, it seems
> > == mm/rmap.c
> > 689 void page_add_file_rmap(struct page *page)
> > 690 {
> > 691 if (atomic_inc_and_test(&page->_mapcount))
> > 692 __inc_zone_page_state(page, NR_FILE_MAPPED);
> > 693 }
> > == page_remove_rmap(struct page *page)
> > 739 __dec_zone_page_state(page,
> > 740 PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> > ==
> >
> > Is good place to go, maybe.
> >
> > page->page_cgroup->mem_cgroup-> inc/dec counter ?
> >
> > Maybe the patch itself will be simple, overhead is unknown..
>
> I thought of the same thing, but then moved to the following
>
> ... mem_cgroup_charge_statistics(..) {
> if (page_mapcount(page) == 0 && page_is_file_cache(page))
> __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_FILE_RSS, val);
>
> But I've not yet tested the end result
>
I think
- at uncharge:
charge_statistics is only called when FILE CACHE is removed from radix-tree.
mem_cgroup_uncharge() is called only when PageAnon(page).
- at charge:
charge_statistics is only called when FILE CACHE is added to radix-tree.

This "checking only radix-tree insert/delete" help us to remove most of overheads
on FILE CACHE.

So, adding new hooks to page_add_file_rmap() and page_remove_rmap()
is a way to go. (and easy to understand because we account it at the same time
NR_FILE_MAPPED is modified.)


Thanks,
-Kame





2009-04-08 09:03:22

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFI] Shared accounting for memory resource controller

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 17:54:09]:

> On Wed, 8 Apr 2009 14:19:52 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-08 17:03:41]:
> >
> > > On Wed, 8 Apr 2009 13:18:09 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > > > > > 3. Using the above, we can then try to (using an algorithm you
> > > > > > proposed), try to do some work for figuring out the shared percentage.
> > > > > >
> > > > > This is the point. At last. Why "# of shared pages" is important ?
> > > > >
> > > >
> > > > I posted this in my motivation yesterday. # of shared pages can help
> > > > plan the system better and the size of the cgroup. A cgroup might have
> > > > small usage_in_bytes but large number of shared pages. We need a
> > > > metric that can help figure out the fair usage of the cgroup.
> > > >
> > > I don't fully understand but NR_FILE_MAPPED is an information in /proc/meminfo.
> > > I personally think I want to support information in /proc/meminfo per memcg.
> > >
> > > Hmm ? then, if you add a hook, it seems
> > > == mm/rmap.c
> > > 689 void page_add_file_rmap(struct page *page)
> > > 690 {
> > > 691 if (atomic_inc_and_test(&page->_mapcount))
> > > 692 __inc_zone_page_state(page, NR_FILE_MAPPED);
> > > 693 }
> > > == page_remove_rmap(struct page *page)
> > > 739 __dec_zone_page_state(page,
> > > 740 PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> > > ==
> > >
> > > Is good place to go, maybe.
> > >
> > > page->page_cgroup->mem_cgroup-> inc/dec counter ?
> > >
> > > Maybe the patch itself will be simple, overhead is unknown..
> >
> > I thought of the same thing, but then moved to the following
> >
> > ... mem_cgroup_charge_statistics(..) {
> > if (page_mapcount(page) == 0 && page_is_file_cache(page))
> > __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_FILE_RSS, val);
> >
> > But I've not yet tested the end result
> >
> I think
> - at uncharge:
> charge_statistics is only called when FILE CACHE is removed from radix-tree.
> mem_cgroup_uncharge() is called only when PageAnon(page).

Good point, I missed it, testing would have caught it.

> - at charge:
> charge_statistics is only called when FILE CACHE is added to radix-tree.
>
> This "checking only radix-tree insert/delete" help us to remove most of overheads
> on FILE CACHE.
>
> So, adding new hooks to page_add_file_rmap() and page_remove_rmap()
> is a way to go. (and easy to understand because we account it at the same time
> NR_FILE_MAPPED is modified.)

Agreed.

--
Balbir