2010-01-05 18:52:34

by Balbir Singh

[permalink] [raw]
Subject: [PATCH -mm] Shared Page accounting for memory cgroup (v2)

Hi, All,

No major changes from v1, except for the use of get_mm_rss().
Kamezawa-San felt that this can be done in user space and I responded
to him with my concerns of doing it in user space. The thread
can be found at http://thread.gmane.org/gmane.linux.kernel.mm/42367.

If there are no major objections, can I ask for a merge into -mm.
Andrew, the patches are against mmotm 10 December 2009, if there
are some merge conflicts, please let me know, I can rebase after
you release the next mmotm.


Add shared accounting to memcg

From: Balbir Singh <[email protected]>

Currently there is no accurate way of estimating how many pages are
shared in a memory cgroup. The accurate way of accounting shared memory
is to

1. Either follow every page rmap and track number of users
2. Iterate through the pages and use _mapcount

We take an intermediate approach (suggested by Kamezawa), we sum up
the file and anon rss of the mm's belonging to the cgroup and then
subtract the values of anon rss and file mapped. This should give
us a good estimate of the pages being shared.

The shared statistic is called memory.shared_usage_in_bytes and
does not support hierarchical information, just the information
for the current cgroup.

Signed-off-by: Balbir Singh <[email protected]>
---

Documentation/cgroups/memory.txt | 6 +++++
mm/memcontrol.c | 42 ++++++++++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+), 0 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..c2c70c9 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -341,6 +341,12 @@ Note:
- a cgroup which uses hierarchy and it has child cgroup.
- a cgroup which uses hierarchy and not the root of hierarchy.

+5.4 shared_usage_in_bytes
+ This data lists the number of shared bytes. The data provided
+ provides an approximation based on the anon and file rss counts
+ of all the mm's belonging to the cgroup. The sum above is subtracted
+ from the count of rss and file mapped count maintained within the
+ memory cgroup statistics (see section 5.2).

6. Hierarchy support

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 488b644..e49b47a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3052,6 +3052,44 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
return 0;
}

+static u64 mem_cgroup_shared_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ struct cgroup_iter it;
+ struct task_struct *tsk;
+ u64 total_rss = 0, shared;
+ struct mm_struct *mm;
+ s64 val;
+
+ cgroup_iter_start(cgrp, &it);
+ val = mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_RSS);
+ val += mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_FILE_MAPPED);
+ while ((tsk = cgroup_iter_next(cgrp, &it))) {
+ if (!thread_group_leader(tsk))
+ continue;
+ mm = tsk->mm;
+ /*
+ * We can't use get_task_mm(), since mmput() its counterpart
+ * can sleep. We know that mm can't become invalid since
+ * we hold the css_set_lock (see cgroup_iter_start()).
+ */
+ if (tsk->flags & PF_KTHREAD || !mm)
+ continue;
+ total_rss += get_mm_rss(mm);
+ }
+ cgroup_iter_end(cgrp, &it);
+
+ /*
+ * We need to tolerate negative values due to the difference in
+ * time of calculating total_rss and val, but the shared value
+ * converges to the correct value quite soon depending on the changing
+ * memory usage of the workload running in the memory cgroup.
+ */
+ shared = total_rss - val;
+ shared = max_t(s64, 0, shared);
+ shared <<= PAGE_SHIFT;
+ return shared;
+}

static struct cftype mem_cgroup_files[] = {
{
@@ -3101,6 +3139,10 @@ static struct cftype mem_cgroup_files[] = {
.read_u64 = mem_cgroup_swappiness_read,
.write_u64 = mem_cgroup_swappiness_write,
},
+ {
+ .name = "shared_usage_in_bytes",
+ .read_u64 = mem_cgroup_shared_read,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP

--
Balbir


2010-01-06 00:10:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -mm] Shared Page accounting for memory cgroup (v2)

On Wed, 6 Jan 2010 00:22:26 +0530
Balbir Singh <[email protected]> wrote:

> Hi, All,
>
> No major changes from v1, except for the use of get_mm_rss().
> Kamezawa-San felt that this can be done in user space and I responded
> to him with my concerns of doing it in user space. The thread
> can be found at http://thread.gmane.org/gmane.linux.kernel.mm/42367.
>
> If there are no major objections, can I ask for a merge into -mm.
> Andrew, the patches are against mmotm 10 December 2009, if there
> are some merge conflicts, please let me know, I can rebase after
> you release the next mmotm.
>

The problem is that this isn't "shared" uasge but "considered to be shared"
usage. Okay ?

Then I don't want to provide this misleading value as "official report" from
the kernel. And this can be done in userland.

Then, NACK.

Thanks,
-Kame

>
> Add shared accounting to memcg
>
> From: Balbir Singh <[email protected]>
>
> Currently there is no accurate way of estimating how many pages are
> shared in a memory cgroup. The accurate way of accounting shared memory
> is to
>
> 1. Either follow every page rmap and track number of users
> 2. Iterate through the pages and use _mapcount
>
> We take an intermediate approach (suggested by Kamezawa), we sum up
> the file and anon rss of the mm's belonging to the cgroup and then
> subtract the values of anon rss and file mapped. This should give
> us a good estimate of the pages being shared.
>
> The shared statistic is called memory.shared_usage_in_bytes and
> does not support hierarchical information, just the information
> for the current cgroup.
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
>
> Documentation/cgroups/memory.txt | 6 +++++
> mm/memcontrol.c | 42 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 48 insertions(+), 0 deletions(-)
>
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index b871f25..c2c70c9 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -341,6 +341,12 @@ Note:
> - a cgroup which uses hierarchy and it has child cgroup.
> - a cgroup which uses hierarchy and not the root of hierarchy.
>
> +5.4 shared_usage_in_bytes
> + This data lists the number of shared bytes. The data provided
> + provides an approximation based on the anon and file rss counts
> + of all the mm's belonging to the cgroup. The sum above is subtracted
> + from the count of rss and file mapped count maintained within the
> + memory cgroup statistics (see section 5.2).
>
> 6. Hierarchy support
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 488b644..e49b47a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3052,6 +3052,44 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> return 0;
> }
>
> +static u64 mem_cgroup_shared_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + struct cgroup_iter it;
> + struct task_struct *tsk;
> + u64 total_rss = 0, shared;
> + struct mm_struct *mm;
> + s64 val;
> +
> + cgroup_iter_start(cgrp, &it);
> + val = mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_RSS);
> + val += mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_FILE_MAPPED);
> + while ((tsk = cgroup_iter_next(cgrp, &it))) {
> + if (!thread_group_leader(tsk))
> + continue;
> + mm = tsk->mm;
> + /*
> + * We can't use get_task_mm(), since mmput() its counterpart
> + * can sleep. We know that mm can't become invalid since
> + * we hold the css_set_lock (see cgroup_iter_start()).
> + */
> + if (tsk->flags & PF_KTHREAD || !mm)
> + continue;
> + total_rss += get_mm_rss(mm);
> + }
> + cgroup_iter_end(cgrp, &it);
> +
> + /*
> + * We need to tolerate negative values due to the difference in
> + * time of calculating total_rss and val, but the shared value
> + * converges to the correct value quite soon depending on the changing
> + * memory usage of the workload running in the memory cgroup.
> + */
> + shared = total_rss - val;
> + shared = max_t(s64, 0, shared);
> + shared <<= PAGE_SHIFT;
> + return shared;
> +}
>
> static struct cftype mem_cgroup_files[] = {
> {
> @@ -3101,6 +3139,10 @@ static struct cftype mem_cgroup_files[] = {
> .read_u64 = mem_cgroup_swappiness_read,
> .write_u64 = mem_cgroup_swappiness_write,
> },
> + {
> + .name = "shared_usage_in_bytes",
> + .read_u64 = mem_cgroup_shared_read,
> + },
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>
> --
> Balbir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2010-01-06 03:07:59

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH -mm] Shared Page accounting for memory cgroup (v2)

* KAMEZAWA Hiroyuki <[email protected]> [2010-01-06 09:07:08]:

> On Wed, 6 Jan 2010 00:22:26 +0530
> Balbir Singh <[email protected]> wrote:
>
> > Hi, All,
> >
> > No major changes from v1, except for the use of get_mm_rss().
> > Kamezawa-San felt that this can be done in user space and I responded
> > to him with my concerns of doing it in user space. The thread
> > can be found at http://thread.gmane.org/gmane.linux.kernel.mm/42367.
> >
> > If there are no major objections, can I ask for a merge into -mm.
> > Andrew, the patches are against mmotm 10 December 2009, if there
> > are some merge conflicts, please let me know, I can rebase after
> > you release the next mmotm.
> >
>
> The problem is that this isn't "shared" uasge but "considered to be shared"
> usage. Okay ?
>

Could you give me your definition of "shared". From the mem cgroup
perspective, total_rss (which is accumulated) subtracted from the
count of pages in the LRU which are RSS and FILE_MAPPED is shared, no?
I understand that some of the pages that might be shared, show up
in our LRU and accounting. These are not treated as shared by
our cgroup, but by other cgroups.

> Then I don't want to provide this misleading value as "official report" from
> the kernel. And this can be done in userland.
>

I explained some of the issues of doing this from user space, would
you be OK if I called them "non-private" pages?

--
Balbir

2010-01-06 03:21:51

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -mm] Shared Page accounting for memory cgroup (v2)

On Wed, 6 Jan 2010 08:37:52 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-01-06 09:07:08]:
>
> > On Wed, 6 Jan 2010 00:22:26 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > Hi, All,
> > >
> > > No major changes from v1, except for the use of get_mm_rss().
> > > Kamezawa-San felt that this can be done in user space and I responded
> > > to him with my concerns of doing it in user space. The thread
> > > can be found at http://thread.gmane.org/gmane.linux.kernel.mm/42367.
> > >
> > > If there are no major objections, can I ask for a merge into -mm.
> > > Andrew, the patches are against mmotm 10 December 2009, if there
> > > are some merge conflicts, please let me know, I can rebase after
> > > you release the next mmotm.
> > >
> >
> > The problem is that this isn't "shared" uasge but "considered to be shared"
> > usage. Okay ?
> >
>
> Could you give me your definition of "shared". From the mem cgroup
> perspective, total_rss (which is accumulated) subtracted from the
> count of pages in the LRU which are RSS and FILE_MAPPED is shared, no?

You consider only "mapped" pages are shared page. That's wrong.
And let's think about your "total_rss - RSS+MAPPED"

In this typical case,
fork() ---- process(A)
-> fork() --- process(B)
-> process(C)

total_rss = rss(A) + rss(B) + rss(C) = 3 * rss(A)
Then,

total_rss - RSS_MAPPED = 2 * rss(A).

How we call this number ? Is this "shared usage" ? I think no.
If you want to do this, scan LRU and count the number of really shared pages.
It's much better than detecting "shared pages" via process and will have no
big issue if implemented in proper way.

> I understand that some of the pages that might be shared, show up
> in our LRU and accounting. These are not treated as shared by
> our cgroup, but by other cgroups.
>
> > Then I don't want to provide this misleading value as "official report" from
> > the kernel. And this can be done in userland.
> >
>
> I explained some of the issues of doing this from user space, would
> you be OK if I called them "non-private" pages?
>

I think I explained there is no issue to do this in user-land.

Thanks,
-Kame

2010-01-06 03:49:42

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH -mm] Shared Page accounting for memory cgroup (v2)

* KAMEZAWA Hiroyuki <[email protected]> [2010-01-06 12:18:36]:

> On Wed, 6 Jan 2010 08:37:52 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2010-01-06 09:07:08]:
> >
> > > On Wed, 6 Jan 2010 00:22:26 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > > > Hi, All,
> > > >
> > > > No major changes from v1, except for the use of get_mm_rss().
> > > > Kamezawa-San felt that this can be done in user space and I responded
> > > > to him with my concerns of doing it in user space. The thread
> > > > can be found at http://thread.gmane.org/gmane.linux.kernel.mm/42367.
> > > >
> > > > If there are no major objections, can I ask for a merge into -mm.
> > > > Andrew, the patches are against mmotm 10 December 2009, if there
> > > > are some merge conflicts, please let me know, I can rebase after
> > > > you release the next mmotm.
> > > >
> > >
> > > The problem is that this isn't "shared" uasge but "considered to be shared"
> > > usage. Okay ?
> > >
> >
> > Could you give me your definition of "shared". From the mem cgroup
> > perspective, total_rss (which is accumulated) subtracted from the
> > count of pages in the LRU which are RSS and FILE_MAPPED is shared, no?
>
> You consider only "mapped" pages are shared page. That's wrong.
> And let's think about your "total_rss - RSS+MAPPED"
>
> In this typical case,
> fork() ---- process(A)
> -> fork() --- process(B)
> -> process(C)
>
> total_rss = rss(A) + rss(B) + rss(C) = 3 * rss(A)
> Then,
>
> total_rss - RSS_MAPPED = 2 * rss(A).
>
> How we call this number ? Is this "shared usage" ? I think no.

Why not? The pages in LRU is rss(A) and the total usage is 3*rss(A),
shared does not imply shared outside the cgroup. Why do you say it is
not shared?

> If you want to do this, scan LRU and count the number of really shared pages.

A page walk for large number of cases is expensive for a large memory
cgroup.

> It's much better than detecting "shared pages" via process and will have no
> big issue if implemented in proper way.
>

A walk is not cheap, specifically since the list is protected by zone
lru lock and there are now 5 lists.

> > I understand that some of the pages that might be shared, show up
> > in our LRU and accounting. These are not treated as shared by
> > our cgroup, but by other cgroups.
> >
> > > Then I don't want to provide this misleading value as "official report" from
> > > the kernel. And this can be done in userland.
> > >
> >
> > I explained some of the issues of doing this from user space, would
> > you be OK if I called them "non-private" pages?
> >
>
> I think I explained there is no issue to do this in user-land.
>

You did not respond back to the last message of (I thought I convinced
you) http://thread.gmane.org/gmane.linux.kernel.mm/42367

--
Balbir

2010-01-06 03:59:15

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -mm] Shared Page accounting for memory cgroup (v2)

On Wed, 6 Jan 2010 09:19:34 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-01-06 12:18:36]:
>
> > On Wed, 6 Jan 2010 08:37:52 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > * KAMEZAWA Hiroyuki <[email protected]> [2010-01-06 09:07:08]:
> > >
> > > > On Wed, 6 Jan 2010 00:22:26 +0530
> > > > Balbir Singh <[email protected]> wrote:
> > > >
> > > > > Hi, All,
> > > > >
> > > > > No major changes from v1, except for the use of get_mm_rss().
> > > > > Kamezawa-San felt that this can be done in user space and I responded
> > > > > to him with my concerns of doing it in user space. The thread
> > > > > can be found at http://thread.gmane.org/gmane.linux.kernel.mm/42367.
> > > > >
> > > > > If there are no major objections, can I ask for a merge into -mm.
> > > > > Andrew, the patches are against mmotm 10 December 2009, if there
> > > > > are some merge conflicts, please let me know, I can rebase after
> > > > > you release the next mmotm.
> > > > >
> > > >
> > > > The problem is that this isn't "shared" uasge but "considered to be shared"
> > > > usage. Okay ?
> > > >
> > >
> > > Could you give me your definition of "shared". From the mem cgroup
> > > perspective, total_rss (which is accumulated) subtracted from the
> > > count of pages in the LRU which are RSS and FILE_MAPPED is shared, no?
> >
> > You consider only "mapped" pages are shared page. That's wrong.
> > And let's think about your "total_rss - RSS+MAPPED"
> >
> > In this typical case,
> > fork() ---- process(A)
> > -> fork() --- process(B)
> > -> process(C)
> >
> > total_rss = rss(A) + rss(B) + rss(C) = 3 * rss(A)
> > Then,
> >
> > total_rss - RSS_MAPPED = 2 * rss(A).
> >
> > How we call this number ? Is this "shared usage" ? I think no.
>
> Why not? The pages in LRU is rss(A) and the total usage is 3*rss(A),
> shared does not imply shared outside the cgroup. Why do you say it is
> not shared?
>
If "shared" means "shared with other cgroup", I'll say "Oh, this is really
helpful". Usual novice user will think "shared" means "shared with other
cgroup".

Again, "shared of cgroup" should be "shared between cgroup" not
"shared between process".

Thanks,
-Kame