2011-03-18 15:25:42

by Michal Hocko

Subject: cgroup: real meaning of memory.usage_in_bytes

Hi Kame,

I have received a report that our SLE11-SP1 (based on 2.6.32) kernel
doesn't pass an LTP cgroup test case[*]. The test case basically creates a
cgroup (with a 100M limit), runs a simple allocator which dirties a certain
amount of anonymous memory (under the limit) and finally checks whether
memory.usage_in_bytes == memory.stat (rss value).

This is obviously not 100% correct as the test should also consider the
cache size, but this test case doesn't end up using any cache pages, so it
used to work when it was developed.

According to our documentation this is a reasonable test case:
Documentation/cgroups/memory.txt:
memory.usage_in_bytes # show current memory(RSS+Cache) usage.

This, however, doesn't work after your commit:
cdec2e4265d (memcg: coalesce charging via percpu storage)

because since then we charge in bulk, so we can end up with
rss+cache <= usage_in_bytes. A simple (attached) program demonstrates
this as well:
# mkdir /dev/memctl; mount -t cgroup -omemory cgroup /dev/memctl; cd /dev/memctl
# mkdir group_1; cd group_1; echo 100M > memory.limit_in_bytes
# cat memory.{usage_in_bytes,stat}
0
cache 0
rss 0
[...]

[run the program - it will print its pid and wait for enter]
# echo <pid> > tasks

[hit enter to make the program mmap and dirty pages]
# cat memory.{usage_in_bytes,stat}
131072
cache 0
rss 4096
[...]

[hit enter again to let it finish]
# cat memory.{usage_in_bytes,stat}
126976
cache 0
rss 0
[...]
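For illustration: 131072 above is exactly 32 pages (32 * 4096), i.e. one
pre-charged bulk with only a single page (rss 4096) actually used, and the
final 126976 is the remaining 31 pages still stocked. A rough sketch of the
per-cpu stock that cdec2e4265d introduced (paraphrased from the commit, not
quoted verbatim):
==
/* Sketch only - paraphrased from cdec2e4265d, not the verbatim kernel code. */
#define CHARGE_SIZE	(32 * PAGE_SIZE)

struct memcg_stock_pcp {
	struct mem_cgroup *cached;	/* memcg owning the pre-charge */
	int charge;			/* pre-charged bytes not yet consumed */
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

/*
 * Try to satisfy a PAGE_SIZE charge from this cpu's stock. On a miss the
 * caller charges a full CHARGE_SIZE to res->count and stocks the rest,
 * which is why res->count can exceed rss+cache by up to CHARGE_SIZE per cpu.
 */
static bool consume_stock(struct mem_cgroup *mem)
{
	struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
	bool ret = true;

	if (mem == stock->cached && stock->charge)
		stock->charge -= PAGE_SIZE;
	else
		ret = false;
	put_cpu_var(memcg_stock);
	return ret;
}
==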

I think we have several options here
	1) document that the value is actually >= rss+cache and it shows
	   the guaranteed charges for the group
	2) use rss+cache rather than res->count
	3) remove the file
	4) call drain_all_stock_sync before asking for the value in
	   mem_cgroup_read
	5) collect the current amount of stock charges and subtract it
	   from the current res->count value
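For concreteness, 4) would amount to something like this (just a sketch, and
mem_cgroup_read_usage is an illustrative name, not an existing function):
==
/* Sketch of option 4): flush all per-cpu stocks back to res->count
 * before reading it. Every read then throws away the cached charges. */
static u64 mem_cgroup_read_usage(struct mem_cgroup *mem)
{
	drain_all_stock_sync();		/* schedule + wait on each cpu */
	return res_counter_read_u64(&mem->res, RES_USAGE);
}
==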

1) and 2) would suggest that the file is actually not very useful.
3) is basically an interface change as well.
4) sounds a little bit invasive as we basically lose the advantage of the
pool whenever somebody reads the file. Btw. who is this file intended
for?
5) sounds like a compromise.

As I do not see the point of the file, I would like to get rid of it
completely rather than play games around it, but I am not sure why we
have it in the first place.

What do you (and others) think? I have a patch for 4) ready here, but I
would like to understand the purpose of the file more before I post it.

Thanks
---
[*] You can get source at http://sourceforge.net/projects/ltp/
./testcases/kernel/controllers/memctl/memctl_test01.c and
./testcases/kernel/controllers/memctl/run_memctl_test.sh

The test should be executed with 4 as the parameter
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic


2011-03-18 15:29:44

by Michal Hocko

Subject: Re: cgroup: real meaning of memory.usage_in_bytes

On Fri 18-03-11 16:25:32, Michal Hocko wrote:
> because since then we charge in bulk, so we can end up with
> rss+cache <= usage_in_bytes. A simple (attached) program demonstrates

And I forgot to attach the test case
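For the archive, a hypothetical reconstruction of what charge_test.c does,
inferred from the transcript in the previous mail (this is not the original
607-byte attachment):
==
/* Hypothetical reconstruction, not the original attachment: print the pid,
 * wait for enter, dirty one anonymous page, wait for enter again, exit. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char *p;

	printf("%d\n", getpid());
	getchar();	/* echo the pid into tasks, then hit enter */

	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	*p = 1;		/* fault the page in - rss grows by 4096 */
	getchar();	/* compare usage_in_bytes with stat, then hit enter */

	return 0;	/* exiting uncharges the page */
}
==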

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic


Attachments:
(No filename) (313.00 B)
charge_test.c (607.00 B)

2011-03-21 09:34:27

by Michal Hocko

Subject: Re: cgroup: real meaning of memory.usage_in_bytes

On Fri 18-03-11 16:25:32, Michal Hocko wrote:
[...]
> According to our documentation this is a reasonable test case:
> Documentation/cgroups/memory.txt:
> memory.usage_in_bytes # show current memory(RSS+Cache) usage.
>
> This however doesn't work after your commit:
> cdec2e4265d (memcg: coalesce charging via percpu storage)
>
> because since then we charge in bulk, so we can end up with
> rss+cache <= usage_in_bytes.
[...]
> I think we have several options here
> 	1) document that the value is actually >= rss+cache and it shows
> 	   the guaranteed charges for the group
> 	2) use rss+cache rather than res->count
> 	3) remove the file
> 	4) call drain_all_stock_sync before asking for the value in
> 	   mem_cgroup_read
> 	5) collect the current amount of stock charges and subtract it
> 	   from the current res->count value
>
> 1) and 2) would suggest that the file is actually not very useful.
> 3) is basically an interface change as well.
> 4) sounds a little bit invasive as we basically lose the advantage of the
> pool whenever somebody reads the file. Btw. who is this file intended
> for?
> 5) sounds like a compromise.

I guess that 4) is really too invasive - for no good reason so here we
go with the 5) solution.
---
From: Michal Hocko <[email protected]>
Subject: Drain memcg_stock before returning res->count value

Since commit cdec2e4265d (memcg: coalesce charging via percpu storage) we
charge the resource counter in batches. This means that the current
res->count value doesn't show the real consumed value (rss+cache as we
describe in the documentation) but rather the promised charges for the
future. We pre-charge a CHARGE_SIZE bulk at once and subsequent charges
are satisfied from the per-cpu memcg_stock pool.

We have seen a report that one of the LTP testcases checks exactly this
condition, so the test fails.

As this exported value is a part of the kernel->userspace interface, we
should try to preserve the original (and documented) semantics.

This patch fixes the issue by collecting the current usage of each per-cpu
stock and subtracting it from the current res counter value.

Signed-off-by: Michal Hocko <[email protected]>
Index: linus_tree/mm/memcontrol.c
===================================================================
--- linus_tree.orig/mm/memcontrol.c 2011-03-18 16:09:11.000000000 +0100
+++ linus_tree/mm/memcontrol.c 2011-03-21 10:21:55.000000000 +0100
@@ -3579,13 +3579,30 @@ static unsigned long mem_cgroup_recursiv
 	return val;
 }
 
+static u64 mem_cgroup_current_usage(struct mem_cgroup *mem)
+{
+	u64 val = res_counter_read_u64(&mem->res, RES_USAGE);
+	u64 per_cpu_val = 0;
+	int cpu;
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
+
+		per_cpu_val += stock->nr_pages * PAGE_SIZE;
+	}
+	put_online_cpus();
+
+	return (val > per_cpu_val) ? val - per_cpu_val : 0;
+}
+
 static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
 {
 	u64 val;
 
 	if (!mem_cgroup_is_root(mem)) {
 		if (!swap)
-			return res_counter_read_u64(&mem->res, RES_USAGE);
+			return mem_cgroup_current_usage(mem);
 		else
 			return res_counter_read_u64(&mem->memsw, RES_USAGE);
 	}
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-03-21 10:24:26

by Michal Hocko

Subject: [PATCH] memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM

[Sorry for reposting but I forgot to fully refresh the patch before
posting...]

On Mon 21-03-11 10:34:19, Michal Hocko wrote:
> On Fri 18-03-11 16:25:32, Michal Hocko wrote:
> [...]
>
> I guess that 4) is really too invasive - for no good reason so here we
> go with the 5) solution.
---
From: Michal Hocko <[email protected]>
Subject: memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM

Since commit cdec2e4265d (memcg: coalesce charging via percpu storage) we
charge the resource counter in batches. This means that the current
res->count value doesn't show the real consumed value (rss+cache as we
describe in the documentation) but rather the promised charges for the
future. We pre-charge a CHARGE_SIZE bulk at once and subsequent charges
are satisfied from the per-cpu memcg_stock pool.

We have seen a report that one of the LTP testcases checks exactly this
condition, so the test fails.

As this exported value is a part of the kernel->userspace interface, we
should try to preserve the original (and documented) semantics.

This patch fixes the issue by collecting the current usage of each per-cpu
stock and subtracting it from the current res counter value.

Signed-off-by: Michal Hocko <[email protected]>
Index: linus_tree/mm/memcontrol.c
===================================================================
--- linus_tree.orig/mm/memcontrol.c 2011-03-18 16:09:11.000000000 +0100
+++ linus_tree/mm/memcontrol.c 2011-03-21 10:21:55.000000000 +0100
@@ -3579,13 +3579,30 @@ static unsigned long mem_cgroup_recursiv
 	return val;
 }
 
+static u64 mem_cgroup_current_usage(struct mem_cgroup *mem)
+{
+	u64 val = res_counter_read_u64(&mem->res, RES_USAGE);
+	u64 per_cpu_val = 0;
+	int cpu;
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
+
+		per_cpu_val += stock->nr_pages * PAGE_SIZE;
+	}
+	put_online_cpus();
+
+	return (val > per_cpu_val) ? val - per_cpu_val : 0;
+}
+
 static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
 {
 	u64 val;
 
 	if (!mem_cgroup_is_root(mem)) {
 		if (!swap)
-			return res_counter_read_u64(&mem->res, RES_USAGE);
+			return mem_cgroup_current_usage(mem);
 		else
 			return res_counter_read_u64(&mem->memsw, RES_USAGE);
 	}
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-03-21 17:22:52

by Ying Han

Subject: Re: cgroup: real meaning of memory.usage_in_bytes

On Mon, Mar 21, 2011 at 2:34 AM, Michal Hocko <[email protected]> wrote:
> On Fri 18-03-11 16:25:32, Michal Hocko wrote:
> [...]
>> [...]
>
> I guess that 4) is really too invasive - for no good reason so here we
> go with the 5) solution.
> [...]

Michal,

Can you help to post the test result after applying the patch?

--Ying

> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

2011-03-22 00:16:47

by Kamezawa Hiroyuki

Subject: Re: [PATCH] memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM

On Mon, 21 Mar 2011 11:24:20 +0100
Michal Hocko <[email protected]> wrote:

> [Sorry for reposting but I forgot to fully refresh the patch before
> posting...]
>
> On Mon 21-03-11 10:34:19, Michal Hocko wrote:
> > On Fri 18-03-11 16:25:32, Michal Hocko wrote:
> > [...]
> >
> > I guess that 4) is really too invasive - for no good reason so here we
> > go with the 5) solution.

I think the test in LTP is bad... (it should be fuzzy) because we cannot
avoid races...
But OK, this itself will be a problem on a large machine with many cpus.
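(For a rough sense of the scale, assuming the 32-page CHARGE_SIZE bulk from
cdec2e4265d and 4KB pages: each cpu can keep up to 128KB of unused pre-charge
stocked, so e.g. a 64-cpu machine can over-report usage by up to
64 * 128KB = 8MB.)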


> ---
> From: Michal Hocko <[email protected]>
> Subject: memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM
> [...]

This doesn't seem correct.

> Index: linus_tree/mm/memcontrol.c
> ===================================================================
> --- linus_tree.orig/mm/memcontrol.c 2011-03-18 16:09:11.000000000 +0100
> +++ linus_tree/mm/memcontrol.c 2011-03-21 10:21:55.000000000 +0100
> @@ -3579,13 +3579,30 @@ static unsigned long mem_cgroup_recursiv
> return val;
> }
>
> +static u64 mem_cgroup_current_usage(struct mem_cgroup *mem)
> +{
> + u64 val = res_counter_read_u64(&mem->res, RES_USAGE);
> + u64 per_cpu_val = 0;
> + int cpu;
> +
> + get_online_cpus();
> + for_each_online_cpu(cpu) {
> + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> +
> + per_cpu_val += stock->nr_pages * PAGE_SIZE;

	if (stock->cached == mem)
		per_cpu_val += stock->nr_pages * PAGE_SIZE;

AND I think you don't handle the batched uncharge.
Do you have any idea? (Peter Zijlstra's patch will make the error from
batched uncharge bigger.)

So... rather than this, just always using the root memcg's code is
a good way. Could you try?
==
	usage = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
	usage += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_RSS);

	if (swap)
		usage += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_SWAPOUT);

	return usage << PAGE_SHIFT;
==
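Folded into mem_cgroup_usage(), that would look roughly like this (an untested
sketch of option 2, reusing the root-memcg path for every group):
==
/* Untested sketch of option 2): report the statistics-based usage for
 * every group, as the root memcg path already does. */
static u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
{
	u64 val;

	val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
	val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_RSS);
	if (swap)
		val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_SWAPOUT);

	return val << PAGE_SHIFT;
}
==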

Thanks,
-Kame


> + }
> + put_online_cpus();
> +
> + return (val > per_cpu_val)? val - per_cpu_val: 0;
> +}
> +
> static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
> {
> u64 val;
>
> if (!mem_cgroup_is_root(mem)) {
> if (!swap)
> - return res_counter_read_u64(&mem->res, RES_USAGE);
> + return mem_cgroup_current_usage(mem);
> else
> return res_counter_read_u64(&mem->memsw, RES_USAGE);
> }
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

2011-03-22 01:54:34

by Daisuke Nishimura

Subject: Re: [PATCH] memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM

On Tue, 22 Mar 2011 09:10:14 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Mon, 21 Mar 2011 11:24:20 +0100
> Michal Hocko <[email protected]> wrote:
>
> > [Sorry for reposting but I forgot to fully refresh the patch before
> > posting...]
> >
> > On Mon 21-03-11 10:34:19, Michal Hocko wrote:
> > > On Fri 18-03-11 16:25:32, Michal Hocko wrote:
> > > [...]
> > >
> > > I guess that 4) is really too invasive - for no good reason so here we
> > > go with the 5) solution.
>
> I think the test in LTP is bad... (it should be fuzzy) because we cannot
> avoid races...
I agree.

> But OK, this itself will be a problem on a large machine with many cpus.
>
>
> > ---
> > From: Michal Hocko <[email protected]>
> > Subject: memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM
> > [...]
>
> This doesn't seem correct.
>
> > Index: linus_tree/mm/memcontrol.c
> > ===================================================================
> > --- linus_tree.orig/mm/memcontrol.c 2011-03-18 16:09:11.000000000 +0100
> > +++ linus_tree/mm/memcontrol.c 2011-03-21 10:21:55.000000000 +0100
> > @@ -3579,13 +3579,30 @@ static unsigned long mem_cgroup_recursiv
> > return val;
> > }
> >
> > +static u64 mem_cgroup_current_usage(struct mem_cgroup *mem)
> > +{
> > + u64 val = res_counter_read_u64(&mem->res, RES_USAGE);
> > + u64 per_cpu_val = 0;
> > + int cpu;
> > +
> > + get_online_cpus();
> > + for_each_online_cpu(cpu) {
> > + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > +
> > + per_cpu_val += stock->nr_pages * PAGE_SIZE;
>
> 	if (stock->cached == mem)
> 		per_cpu_val += stock->nr_pages * PAGE_SIZE;
>
> AND I think you don't handle the batched uncharge.
> Do you have any idea? (Peter Zijlstra's patch will make the error from
> batched uncharge bigger.)
>
> So... rather than this, just always using the root memcg's code is
> a good way. Could you try?
> ==
> 	usage = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
> 	usage += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_RSS);
>
> 	if (swap)
> 		usage += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>
> 	return usage << PAGE_SHIFT;
> ==
>
So, option 2) above.

As Michal already said, this change would make *.usage_in_bytes not so useful,
i.e. we could just use memory.stat instead.

I don't have any good idea, but I tend to agree with 1) or 3) (or renaming the files) now.
Considering the batched uncharge, I think 4) and 5) are difficult.

Thanks,
Daisuke Nishimura.

2011-03-22 07:31:59

by Michal Hocko

Subject: Re: [PATCH] memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM

On Tue 22-03-11 10:47:23, Daisuke Nishimura wrote:
> On Tue, 22 Mar 2011 09:10:14 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Mon, 21 Mar 2011 11:24:20 +0100
> > Michal Hocko <[email protected]> wrote:
> >
> > > [Sorry for reposting but I forgot to fully refresh the patch before
> > > posting...]
> > >
> > > On Mon 21-03-11 10:34:19, Michal Hocko wrote:
> > > > On Fri 18-03-11 16:25:32, Michal Hocko wrote:
> > > > [...]
> > > >
> > > > I guess that 4) is really too invasive - for no good reason so here we
> > > > go with the 5) solution.
> >
> > I think the test in LTP is bad... (it should be fuzzy) because we cannot
> > avoid races...
> I agree.

I think that as well. In fact, I do not quite understand what it is
actually testing (that we account charges correctly? If yes, then what if
both values are wrong?). The other point is, though, that we have exported
this interface and documented its semantics. This is the reason I asked
about the initial purpose of the file in the first place. Is this
something for debugging only? Can we make use of the value somehow
(other than as a shortcut for rss+cache)?

If there is really no strong reason for the file's existence, I would rather
vote for removing it than for keeping an unusable semantic.

[...]
> > > Index: linus_tree/mm/memcontrol.c
> > > ===================================================================
> > > --- linus_tree.orig/mm/memcontrol.c 2011-03-18 16:09:11.000000000 +0100
> > > +++ linus_tree/mm/memcontrol.c 2011-03-21 10:21:55.000000000 +0100
> > > @@ -3579,13 +3579,30 @@ static unsigned long mem_cgroup_recursiv
> > > return val;
> > > }
> > >
> > > +static u64 mem_cgroup_current_usage(struct mem_cgroup *mem)
> > > +{
> > > + u64 val = res_counter_read_u64(&mem->res, RES_USAGE);
> > > + u64 per_cpu_val = 0;
> > > + int cpu;
> > > +
> > > + get_online_cpus();
> > > + for_each_online_cpu(cpu) {
> > > + struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > > +
> > > + per_cpu_val += stock->nr_pages * PAGE_SIZE;
> >
> > 	if (stock->cached == mem)
> > 		per_cpu_val += stock->nr_pages * PAGE_SIZE;

OK, thanks for pointing that out.

> > AND I think you don't handle the batched uncharge.
> > Do you have any idea?

Hmm, missed that... And this will be really hard as the batch structure
is embedded in the task struct (strange why we cannot simply uncharge the
same way as we do charges?).
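(For reference, a sketch of the per-task batching state added by 569b846d;
field names paraphrased from memory, so treat them as an assumption:)
==
/* Sketch, paraphrased from 569b846d (memcg: coalesce uncharge during
 * unmap/truncate). Unlike the charge stock, which is per-cpu, pending
 * uncharges live in whichever task is mid-truncate/unmap, so a reader
 * of usage_in_bytes has no cheap way to walk them. */
struct memcg_batch_info {
	int do_batch;			/* batched uncharge in progress? */
	struct mem_cgroup *memcg;	/* memcg being uncharged */
	unsigned long bytes;		/* accumulated memory uncharge */
	unsigned long memsw_bytes;	/* accumulated memory+swap uncharge */
};	/* embedded in struct task_struct as ->memcg_batch */
==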

> > (Peter Zijlstra's patch will make the error from
> > batched uncharge bigger.)
> >
> > So... rather than this, just always using the root memcg's code is
> > a good way. Could you try?
> > ==
> > 	usage = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
> > 	usage += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_RSS);
> >
> > 	if (swap)
> > 		usage += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
> >
> > 	return usage << PAGE_SHIFT;
> > ==
> >
> So, option 2) above.
>

I have tried that and it really works, but what is the point of the file
then? Or should we do this as a workaround for some time until we
remove the file completely?

> As Michal already said, this change would make *.usage_in_bytes not so useful,
> i.e. we could just use memory.stat instead.

It would be useful in the sense that we would not have to grep out two
values from the .stat file and sum them up. But then I am missing the point.
If we want to have a sum, we should add it to the .stat file, right? (like
we do in other exported files, e.g. meminfo etc.)

>
> I don't have any good idea, but I tend to agree with 1) or 3) (or renaming the files) now.

The more I think about it the more I think the same.

> Considering the batched uncharge, I think 4) and 5) are difficult.

Thanks for the comments.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-03-22 07:35:20

by Michal Hocko

Subject: Re: cgroup: real meaning of memory.usage_in_bytes

On Mon 21-03-11 10:22:41, Ying Han wrote:
[...]
>
> Michal,
>
> Can you help to post the test result after applying the patch?

The result of the LTP test is:
TEST 4: MEMORY CONTROLLER TESTING
RUNNING SETUP.....
WARN:/dev/memctl already exist..overwriting
Cleanup called
TEST STARTED: Please avoid using system while this test executes
memory usage from memory.usage_in_bytes= 62955520
memory usage from memory.stat= 62955520
TINFO Memory Resource Controller: stat check test passes first run
Test continues to run the second step.
memory usage from memory.usage_in_bytes= 78643200
memory usage from memory.stat=78643200
TPASS Memory Resource Controller: stat check test PASSED
Memory Resource Controller test executed successfully.
Cleanup called

The result of the attached simple test case is:
# mkdir /dev/memctl; mount -t cgroup -omemory cgroup /dev/memctl; cd /dev/memctl
# mkdir group_1; cd group_1; echo 100M > memory.limit_in_bytes
# cat memory.{usage_in_bytes,stat}
0
cache 0
rss 0
[start the test case, add its pid to the group and let it fault in]

# cat memory.{usage_in_bytes,stat}
4096
cache 0
rss 4096

[let it finish]
# cat memory.{usage_in_bytes,stat}
0
cache 0
rss 0

Thanks
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-03-22 17:06:34

by Ying Han

Subject: Re: cgroup: real meaning of memory.usage_in_bytes

On Tue, Mar 22, 2011 at 12:35 AM, Michal Hocko <[email protected]> wrote:
> On Mon 21-03-11 10:22:41, Ying Han wrote:
> [...]
>>
>> Michal,
>>
>> Can you help to post the test result after applying the patch?
>
> The result of the LTP test is:
> TEST 4: MEMORY CONTROLLER TESTING
> RUNNING SETUP.....
> WARN:/dev/memctl already exist..overwriting
> Cleanup called
> TEST STARTED: Please avoid using system while this test executes
> memory usage from memory.usage_in_bytes= 62955520
> memory usage from memory.stat= 62955520
> TINFO	Memory Resource Controller: stat check test passes first run
> Test continues to run the second step.
> memory usage from memory.usage_in_bytes= 78643200
> memory usage from memory.stat=78643200
> TPASS	Memory Resource Controller: stat check test PASSED
> Memory Resource Controller test executed successfully.
> Cleanup called
>
> The attached simple test case result is:
> # mkdir /dev/memctl; mount -t cgroup -omemory cgroup /dev/memctl; cd /dev/memctl
> # mkdir group_1; cd group_1; echo 100M > memory.limit_in_bytes
> # cat memory.{usage_in_bytes,stat}
> 0
> cache 0
> rss 0
> [start the test case, add its pid to the group and let it fault in]
>
> # cat memory.{usage_in_bytes,stat}
> 4096
> cache 0
> rss 4096
>
> [let it finish]
> # cat memory.{usage_in_bytes,stat}
> 0
> cache 0
> rss 0
>
> Thanks

Thanks Michal for fixing it up. Regardless of the performance
overhead, the change makes sense to me.

--Ying

> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

2011-03-23 00:32:54

by Daisuke Nishimura

Subject: Re: [PATCH] memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM

On Tue, 22 Mar 2011 08:31:50 +0100
Michal Hocko <[email protected]> wrote:

> On Tue 22-03-11 10:47:23, Daisuke Nishimura wrote:
> > On Tue, 22 Mar 2011 09:10:14 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > [...]
> > >
> > > I think the test in LTP is bad... (it should be fuzzy) because we cannot
> > > avoid races...
> > I agree.
>
> I think that as well. In fact, I do not quite understand what it is
> actually testing (that we account charges correctly? If yes, then what if
> both values are wrong?). The other point is, though, that we have exported
> this interface and documented its semantics. This is the reason I asked
> about the initial purpose of the file in the first place. Is this
> something for debugging only? Can we make use of the value somehow
> (other than as a shortcut for rss+cache)?
>
> If there is really no strong reason for the file's existence, I would rather
> vote for removing it than for keeping an unusable semantic.
>
Thinking about it more, without these files we cannot know the actual usage of
a res_counter, even though we set a limit on the res_counter. So, I want to keep
these files.

If no one has any objections, I'll prepare a patch to update the documentation.

Thanks,
Daisuke Nishimura.

2011-03-23 04:41:47

by Kamezawa Hiroyuki

Subject: Re: [PATCH] memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM

On Wed, 23 Mar 2011 09:27:08 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 22 Mar 2011 08:31:50 +0100
> Michal Hocko <[email protected]> wrote:
>
> > [...]
>
> Thinking about it more, without these files we cannot know the actual usage of
> a res_counter, even though we set a limit on the res_counter. So, I want to keep
> these files.
>
> If no one has any objections, I'll prepare a patch to update the documentation.
>

please.

Thanks,
-Kame

2011-03-23 14:26:49

by Michal Hocko

Subject: Re: cgroup: real meaning of memory.usage_in_bytes

On Tue 22-03-11 10:06:27, Ying Han wrote:
> On Tue, Mar 22, 2011 at 12:35 AM, Michal Hocko <[email protected]> wrote:
> > On Mon 21-03-11 10:22:41, Ying Han wrote:
> > [...]
> >>
> >> Michal,
> >>
> >> Can you help to post the test result after applying the patch?
> >
> > The result of the LTP test is:
> > TEST 4: MEMORY CONTROLLER TESTING
> > RUNNING SETUP.....
> > WARN:/dev/memctl already exist..overwriting
> > Cleanup called
> > TEST STARTED: Please avoid using system while this test executes
> > memory usage from memory.usage_in_bytes= 62955520
> > memory usage from memory.stat= 62955520
> > TINFO	Memory Resource Controller: stat check test passes first run
> > Test continues to run the second step.
> > memory usage from memory.usage_in_bytes= 78643200
> > memory usage from memory.stat=78643200
> > TPASS	Memory Resource Controller: stat check test PASSED
> > Memory Resource Controller test executed successfully.
> > Cleanup called
[...]
> Thanks Michal for fixing it up. Regardless of the performance
> overhead, the change makes sense to me.

As you can see in the other email in this thread, the patch is not 100%
correct because it doesn't consider batched uncharges, which are stored
in the task_struct. Making it 100% correct would be harder and probably
not worth the overhead. Daisuke Nishimura is working on a
documentation update patch which will most likely describe that
usage_in_bytes is not exactly rss+cache and that nobody should rely on
it.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-03-27 23:59:08

by Daisuke Nishimura

Subject: Re: [PATCH] memcg: consider per-cpu stock reserves when returning RES_USAGE for _MEM

> > Thinking about it more, without these files we cannot know the actual usage of
> > a res_counter, even though we set a limit on the res_counter. So, I want to keep
> > these files.
> >
> > If no one has any objections, I'll prepare a patch to update the documentation.
> >
>
> please.
>
I'm sorry for my late response. I've been out of the office because of a cold.
I'll prepare the patch later.

Thanks,
Daisuke Nishimura.

2011-03-28 04:29:12

by Daisuke Nishimura

Subject: [PATCH] memcg: update documentation to describe usage_in_bytes

On Mon, 28 Mar 2011 08:55:08 +0900
Daisuke Nishimura <[email protected]> wrote:

> > > Thinking about it more, without these files we cannot know the actual usage of
> > > a res_counter, even though we set a limit on the res_counter. So, I want to keep
> > > these files.
> > >
> > > If no one has any objections, I'll prepare a patch to update the documentation.
> > >
> >
> > please.
> >
> I'm sorry for my late response. I've been out of the office because of a cold.
> I'll prepare the patch later.
>
How about this?

===
From: Daisuke Nishimura <[email protected]>

Update the meaning of *.usage_in_bytes. These files don't show the actual usage
of RSS+Cache(+Swap). They show the res_counter->usage for memory and memory+swap.

Signed-off-by: Daisuke Nishimura <[email protected]>
---
Documentation/cgroups/memory.txt | 16 ++++++++++++++--
1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 7781857..ab7d4c1 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -52,8 +52,10 @@ Brief summary of control files.
tasks # attach a task(thread) and show list of threads
cgroup.procs # show list of processes
cgroup.event_control # an interface for event_fd()
- memory.usage_in_bytes # show current memory(RSS+Cache) usage.
- memory.memsw.usage_in_bytes # show current memory+Swap usage
+ memory.usage_in_bytes # show current res_counter usage for memory
+ (See 5.5 for details)
+ memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
+ (See 5.5 for details)
memory.limit_in_bytes # set/show limit of memory usage
memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
memory.failcnt # show the number of memory usage hits limits
@@ -453,6 +455,16 @@ memory under it will be reclaimed.
You can reset failcnt by writing 0 to failcnt file.
# echo 0 > .../memory.failcnt

+5.5 usage_in_bytes
+
+As described in 2.1, memory cgroup uses res_counter for tracking and limiting
+the memory usage. memory.usage_in_bytes shows the current res_counter usage for
+memory, and DOESN'T show the actual usage of RSS and Cache. It is usually bigger
+than the actual usage for performance reasons. If you want to know
+the actual usage, you can use memory.stat (see 5.2).
+It's the same for memory.memsw.usage_in_bytes, which shows the current
+res_counter usage for memory+swap.
+
6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.
--
1.7.1

2011-03-28 07:43:46

by Michal Hocko

Subject: Re: [PATCH] memcg: update documentation to describe usage_in_bytes

On Mon 28-03-11 13:25:50, Daisuke Nishimura wrote:
> From: Daisuke Nishimura <[email protected]>
>
> Update the meaning of *.usage_in_bytes. These files don't show the actual usage
> of RSS+Cache(+Swap). They show the res_counter->usage for memory and memory+swap.

Don't we want to add why this is not rss+cache? The reason is really
non-trivial for somebody who is not familiar with the code and with the fact
that we are heavily caching charges.

>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> ---
> Documentation/cgroups/memory.txt | 16 ++++++++++++++--
> 1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 7781857..ab7d4c1 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -52,8 +52,10 @@ Brief summary of control files.
> tasks # attach a task(thread) and show list of threads
> cgroup.procs # show list of processes
> cgroup.event_control # an interface for event_fd()
> - memory.usage_in_bytes # show current memory(RSS+Cache) usage.
> - memory.memsw.usage_in_bytes # show current memory+Swap usage
> + memory.usage_in_bytes # show current res_counter usage for memory
> + (See 5.5 for details)
> + memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
> + (See 5.5 for details)
> memory.limit_in_bytes # set/show limit of memory usage
> memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
> memory.failcnt # show the number of memory usage hits limits
> @@ -453,6 +455,16 @@ memory under it will be reclaimed.
> You can reset failcnt by writing 0 to failcnt file.
> # echo 0 > .../memory.failcnt
>
> +5.5 usage_in_bytes
> +
> +As described in 2.1, memory cgroup uses res_counter for tracking and limiting
> +the memory usage. memory.usage_in_bytes shows the current res_counter usage for
> +memory, and DOESN'T show the actual usage of RSS and Cache. It is usually bigger
> +than the actual usage for performance reasons.

Isn't an explicit mention about caching charges better?

> If you want to know
> +the actual usage, you can use memory.stat (see 5.2).
> +It's the same for memory.memsw.usage_in_bytes, which shows the current
> +res_counter usage for memory+swap.

Should we clarify who this file is intended for?

Thanks
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-03-28 09:17:59

by Kamezawa Hiroyuki

Subject: Re: [PATCH] memcg: update documentation to describe usage_in_bytes

On Mon, 28 Mar 2011 09:43:42 +0200
Michal Hocko <[email protected]> wrote:

> On Mon 28-03-11 13:25:50, Daisuke Nishimura wrote:
> > From: Daisuke Nishimura <[email protected]>
> >
> > Update the meaning of *.usage_in_bytes. They doesn't show the actual usage of
> > RSS+Cache(+Swap). They show the res_counter->usage for memory and memory+swap.
>
> Don't we want to add why this is not rss+cache? The reason is really non
> trivial for somebody who is not familiar with the code and with the fact
> that we are heavily caching charges.
>
> >
> > Signed-off-by: Daisuke Nishimura <[email protected]>
> > ---
> > Documentation/cgroups/memory.txt | 16 ++++++++++++++--
> > 1 files changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > index 7781857..ab7d4c1 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -52,8 +52,10 @@ Brief summary of control files.
> > tasks # attach a task(thread) and show list of threads
> > cgroup.procs # show list of processes
> > cgroup.event_control # an interface for event_fd()
> > - memory.usage_in_bytes # show current memory(RSS+Cache) usage.
> > - memory.memsw.usage_in_bytes # show current memory+Swap usage
> > + memory.usage_in_bytes # show current res_counter usage for memory
> > + (See 5.5 for details)
> > + memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
> > + (See 5.5 for details)
> > memory.limit_in_bytes # set/show limit of memory usage
> > memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
> > memory.failcnt # show the number of memory usage hits limits
> > @@ -453,6 +455,16 @@ memory under it will be reclaimed.
> > You can reset failcnt by writing 0 to failcnt file.
> > # echo 0 > .../memory.failcnt
> >
> > +5.5 usage_in_bytes
> > +
> > +As described in 2.1, memory cgroup uses res_counter for tracking and limiting
> > +the memory usage. memory.usage_in_bytes shows the current res_counter usage for
> > > +memory, and DOESN'T show the actual usage of RSS and Cache. It is usually bigger
> > > +than the actual usage for performance reasons.
>
> Isn't an explicit mention about caching charges better?
>

It's difficult to distinguish which is spec and which is implementation detail...

My one here ;)
==
5.5 usage_in_bytes

For efficiency, as other kernel components, memory cgroup uses some optimization to
avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
method and doesn't show the 'exact' value of usage; it's a fuzz value for efficient
access. (Of course, when necessary, it's synchronized.)
In usual, the value (RSS+CACHE) in memory.stat shows more exact value. IOW,
usage_in_bytes is less exact than memory.stat. The error will be larger on larger
hardware with many cpus and tasks.
==

Hmm?

Thanks,
-Kame

2011-03-28 09:48:25

by Michal Hocko

Subject: Re: [PATCH] memcg: update documentation to describe usage_in_bytes

On Mon 28-03-11 18:11:27, KAMEZAWA Hiroyuki wrote:
> On Mon, 28 Mar 2011 09:43:42 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Mon 28-03-11 13:25:50, Daisuke Nishimura wrote:
> > > From: Daisuke Nishimura <[email protected]>
[...]
> > > +5.5 usage_in_bytes
> > > +
> > > +As described in 2.1, memory cgroup uses res_counter for tracking and limiting
> > > +the memory usage. memory.usage_in_bytes shows the current res_counter usage for
> > > +memory, and DOESN'T show the actual usage of RSS and Cache. It is usually bigger
> > > +than the actual usage for performance reasons.
> >
> > Isn't an explicit mention about caching charges better?
> >
>
> > It's difficult to distinguish which is spec and which is implementation detail...

Sure. At least the commit log should contain the implementation details,
IMO, though.

>
> My one here ;)
> ==
> 5.5 usage_in_bytes
>
> For efficiency, as other kernel components, memory cgroup uses some optimization to
> avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
> method and doesn't show the 'exact' value of usage; it's a fuzz value for efficient
> access. (Of course, when necessary, it's synchronized.)
> In usual, the value (RSS+CACHE) in memory.stat shows more exact value. IOW,

- In usual, the value (RSS+CACHE) in memory.stat shows more exact value. IOW,
+ (RSS+CACHE) value from memory.stat shows more exact value and should be used
+ by userspace. IOW,

?

> usage_in_bytes is less exact than memory.stat. The error will be larger on larger
> hardware with many cpus and tasks.
> ==
>
> Hmm?

Looks much better.

Thanks
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-03-28 10:37:39

by Kamezawa Hiroyuki

Subject: Re: [PATCH] memcg: update documentation to describe usage_in_bytes

On Mon, 28 Mar 2011 11:48:20 +0200
Michal Hocko <[email protected]> wrote:

> On Mon 28-03-11 18:11:27, KAMEZAWA Hiroyuki wrote:
> > On Mon, 28 Mar 2011 09:43:42 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Mon 28-03-11 13:25:50, Daisuke Nishimura wrote:
> > > > From: Daisuke Nishimura <[email protected]>
> [...]
> > > > +5.5 usage_in_bytes
> > > > +
> > > > +As described in 2.1, memory cgroup uses res_counter for tracking and limiting
> > > > +the memory usage. memory.usage_in_bytes shows the current res_counter usage for
> > > > +memory, and DOESN'T show the actual usage of RSS and Cache. It is usually bigger
> > > > +than the actual usage for performance reasons.
> > >
> > > Isn't an explicit mention about caching charges better?
> > >
> >
> > It's difficult to distinguish which is spec and which is implementation detail...
>
> Sure. At least the commit log should contain the implementation details,
> IMO, though.
>
> >
> > My one here ;)
> > ==
> > 5.5 usage_in_bytes
> >
> > For efficiency, as other kernel components, memory cgroup uses some optimization to
> > avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
> > method and doesn't show the 'exact' value of usage; it's a fuzz value for efficient
> > access. (Of course, when necessary, it's synchronized.)
> > In usual, the value (RSS+CACHE) in memory.stat shows more exact value. IOW,
>
> - In usual, the value (RSS+CACHE) in memory.stat shows more exact value. IOW,
> + (RSS+CACHE) value from memory.stat shows more exact value and should be used
> + by userspace. IOW,
>
> ?
>

Seems good. Nishimura-san, could you update?

Thanks,
-Kame

2011-03-29 01:18:06

by Daisuke Nishimura

Subject: [PATCH v2] memcg: update documentation to describe usage_in_bytes

On Mon, 28 Mar 2011 19:31:08 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Mon, 28 Mar 2011 11:48:20 +0200
> Michal Hocko <[email protected]> wrote:
>
> > [...]
> Seems good. Nishimura-san, could you update?
>
> Thanks,
> -Kame
>
Thank you very much for your comments. This is the updated one.

===
From: Daisuke Nishimura <[email protected]>

Since 569b846d (memcg: coalesce uncharge during unmap/truncate), we do batched
(delayed) uncharge at truncation/unmap. And since cdec2e42 (memcg: coalesce
charging via percpu storage), we have a percpu cache for the res_counter.

These changes improved the performance of the memory cgroup very much, but made
res_counter->usage usually show a bigger value than the actual memory usage.
So, *.usage_in_bytes, which show res_counter->usage, are no longer suitable for
precise values of memory (and swap) usage.

Instead of removing these files completely (because we cannot know
res_counter->usage without them), this patch updates the meaning of those files.

Signed-off-by: Daisuke Nishimura <[email protected]>
---
Documentation/cgroups/memory.txt | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 7781857..4f49d91 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -52,8 +52,10 @@ Brief summary of control files.
tasks # attach a task(thread) and show list of threads
cgroup.procs # show list of processes
cgroup.event_control # an interface for event_fd()
- memory.usage_in_bytes # show current memory(RSS+Cache) usage.
- memory.memsw.usage_in_bytes # show current memory+Swap usage
+ memory.usage_in_bytes # show current res_counter usage for memory
+ (See 5.5 for details)
+ memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
+ (See 5.5 for details)
memory.limit_in_bytes # set/show limit of memory usage
memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
memory.failcnt # show the number of memory usage hits limits
@@ -453,6 +455,15 @@ memory under it will be reclaimed.
You can reset failcnt by writing 0 to failcnt file.
# echo 0 > .../memory.failcnt

+5.5 usage_in_bytes
+
+For efficiency, as other kernel components, memory cgroup uses some optimization
+to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
+method and doesn't show the 'exact' value of memory (and swap) usage; it's a fuzz
+value for efficient access. (Of course, when necessary, it's synchronized.)
+If you want to know a more exact memory usage, you should use the
+RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
+
6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.
--
1.7.1
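
To see why res_counter->usage can sit above rss+cache, the percpu charge
cache can be modeled in a few lines of userspace C. This is only a sketch
with assumed names and an assumed 32-page batch size (NR_CPUS, CHARGE_BATCH
and charge() are invented for illustration), not the kernel implementation:

#include <stdio.h>

#define NR_CPUS      4
#define CHARGE_BATCH (32UL * 4096)	/* assumed refill size: 32 pages */

static unsigned long res_usage;		/* what usage_in_bytes reports */
static unsigned long stock[NR_CPUS];	/* pre-charged but unused bytes */

/* Charge 'bytes' on 'cpu': consume the local stock when possible,
 * otherwise charge a whole batch against the global counter. */
static void charge(int cpu, unsigned long bytes)
{
	if (stock[cpu] < bytes) {
		res_usage += CHARGE_BATCH;	/* bulk charge */
		stock[cpu] += CHARGE_BATCH;
	}
	stock[cpu] -= bytes;
}

int main(void)
{
	unsigned long actual = 4096;	/* one page of real usage */

	charge(0, actual);
	printf("usage_in_bytes: %lu\n", res_usage);	/* 131072 */
	printf("actual rss:     %lu\n", actual);	/* 4096 */
	return 0;
}

Because the global counter moves only in batch-sized steps, it stays at or
above the real usage until the per-cpu stock is drained back.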

2011-03-29 01:31:14

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v2] memcg: update documentation to describe usage_in_bytes

On Tue, 29 Mar 2011 10:15:11 +0900
Daisuke Nishimura <[email protected]> wrote:

> ===
> From: Daisuke Nishimura <[email protected]>
>
> Since 569b846d (memcg: coalesce uncharge during unmap/truncate), we do batched
> (delayed) uncharge at truncation/unmap. And since cdec2e42 (memcg: coalesce
> charging via percpu storage), we have a percpu cache for res_counter.
> 
> These changes improved the performance of memory cgroup very much, but they
> made res_counter->usage usually show a bigger value than the actual memory
> usage. So *.usage_in_bytes, which show res_counter->usage, are no longer
> suitable for reading precise values of memory (and swap) usage.
> 
> Instead of removing these files completely (we cannot know res_counter->usage
> without them), this patch updates the documented meaning of those files.
>
> Signed-off-by: Daisuke Nishimura <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2011-03-29 07:21:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2] memcg: update documentation to describe usage_in_bytes

On Tue 29-03-11 10:15:11, Daisuke Nishimura wrote:
[...]
> From: Daisuke Nishimura <[email protected]>
>
> Since 569b846d (memcg: coalesce uncharge during unmap/truncate), we do batched
> (delayed) uncharge at truncation/unmap. And since cdec2e42 (memcg: coalesce
> charging via percpu storage), we have a percpu cache for res_counter.
> 
> These changes improved the performance of memory cgroup very much, but they
> made res_counter->usage usually show a bigger value than the actual memory
> usage. So *.usage_in_bytes, which show res_counter->usage, are no longer
> suitable for reading precise values of memory (and swap) usage.
> 
> Instead of removing these files completely (we cannot know res_counter->usage
> without them), this patch updates the documented meaning of those files.
>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> ---
> Documentation/cgroups/memory.txt | 15 +++++++++++++--
> 1 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 7781857..4f49d91 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -52,8 +52,10 @@ Brief summary of control files.
> tasks # attach a task(thread) and show list of threads
> cgroup.procs # show list of processes
> cgroup.event_control # an interface for event_fd()
> - memory.usage_in_bytes # show current memory(RSS+Cache) usage.
> - memory.memsw.usage_in_bytes # show current memory+Swap usage
> + memory.usage_in_bytes # show current res_counter usage for memory
> + (See 5.5 for details)
> + memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
> + (See 5.5 for details)
> memory.limit_in_bytes # set/show limit of memory usage
> memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
> memory.failcnt # show the number of memory usage hits limits
> @@ -453,6 +455,15 @@ memory under it will be reclaimed.
> You can reset failcnt by writing 0 to failcnt file.
> # echo 0 > .../memory.failcnt
>
> +5.5 usage_in_bytes
> +
> +For efficiency, as with other kernel components, memory cgroup uses some
> +optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
> +affected by the method and doesn't show the 'exact' value of memory (and swap)
> +usage; it's a fuzz value for efficient access. (Of course, when necessary,
> +it's synchronized.) If you want to know the more exact memory usage, you
> +should use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
> +
> 6. Hierarchy support
>
> The memory controller supports a deep hierarchy and hierarchical accounting.

Acked-by: Michal Hocko <[email protected]>

Although I would like to see a mention of the reason for keeping the file(s)
if their usefulness is so limited. Something like:
"We are keeping the files because we want to be consistent with other
cgroup implementations, and all of them export a usage counter in some
way. Make sure you know the exact meaning before you use the value
in userspace."

If nobody else feels that this is that important then please forget
about this comment.
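
To make that guidance concrete, here is a minimal sketch of the recommended
userspace reading; the cgroup path is an assumption for the example, and only
the 'rss' and 'cache' fields printed by memory.stat are relied on:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Hypothetical mount point and group; adjust to your setup. */
	FILE *f = fopen("/cgroup/memory/group_1/memory.stat", "r");
	char key[64];
	unsigned long val, rss = 0, cache = 0;

	if (!f) {
		perror("memory.stat");
		return 1;
	}
	/* memory.stat is a list of "name value" pairs, one per line. */
	while (fscanf(f, "%63s %lu", key, &val) == 2) {
		if (!strcmp(key, "rss"))
			rss = val;
		else if (!strcmp(key, "cache"))
			cache = val;
	}
	fclose(f);
	printf("usage: %lu bytes (rss %lu + cache %lu)\n",
	       rss + cache, rss, cache);
	return 0;
}

If swap accounting is enabled, the 'swap' field can be summed in the same way
to get the memsw counterpart.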
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-04-19 12:14:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2] memcg: update documentation to describe usage_in_bytes

Hi,

On Tue 29-03-11 09:21:22, Michal Hocko wrote:
> On Tue 29-03-11 10:15:11, Daisuke Nishimura wrote:
> [...]
> > From: Daisuke Nishimura <[email protected]>
> >
> > Since 569b846d (memcg: coalesce uncharge during unmap/truncate), we do batched
> > (delayed) uncharge at truncation/unmap. And since cdec2e42 (memcg: coalesce
> > charging via percpu storage), we have a percpu cache for res_counter.
> >
> > These changes improved the performance of memory cgroup very much, but they
> > made res_counter->usage usually show a bigger value than the actual memory
> > usage. So *.usage_in_bytes, which show res_counter->usage, are no longer
> > suitable for reading precise values of memory (and swap) usage.
> >
> > Instead of removing these files completely (we cannot know res_counter->usage
> > without them), this patch updates the documented meaning of those files.
> >
> > Signed-off-by: Daisuke Nishimura <[email protected]>
> > ---
> > Documentation/cgroups/memory.txt | 15 +++++++++++++--
> > 1 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > index 7781857..4f49d91 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -52,8 +52,10 @@ Brief summary of control files.
> > tasks # attach a task(thread) and show list of threads
> > cgroup.procs # show list of processes
> > cgroup.event_control # an interface for event_fd()
> > - memory.usage_in_bytes # show current memory(RSS+Cache) usage.
> > - memory.memsw.usage_in_bytes # show current memory+Swap usage
> > + memory.usage_in_bytes # show current res_counter usage for memory
> > + (See 5.5 for details)
> > + memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
> > + (See 5.5 for details)
> > memory.limit_in_bytes # set/show limit of memory usage
> > memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
> > memory.failcnt # show the number of memory usage hits limits
> > @@ -453,6 +455,15 @@ memory under it will be reclaimed.
> > You can reset failcnt by writing 0 to failcnt file.
> > # echo 0 > .../memory.failcnt
> >
> > +5.5 usage_in_bytes
> > +
> > +For efficiency, as with other kernel components, memory cgroup uses some
> > +optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
> > +affected by the method and doesn't show the 'exact' value of memory (and swap)
> > +usage; it's a fuzz value for efficient access. (Of course, when necessary,
> > +it's synchronized.) If you want to know the more exact memory usage, you
> > +should use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
> > +
> > 6. Hierarchy support
> >
> > The memory controller supports a deep hierarchy and hierarchical accounting.
>
> Acked-by: Michal Hocko <[email protected]>
>
> > Although I would like to see a mention of the reason for keeping the file(s)
> > if their usefulness is so limited. Something like:
> > "We are keeping the files because we want to be consistent with other
> > cgroup implementations, and all of them export a usage counter in some
> > way. Make sure you know the exact meaning before you use the value
> > in userspace."
>
> If nobody else feels that this is that important then please forget
> about this comment.

I am wondering what happened to the patch. Do you have any plans to
update/push it?

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-04-19 23:37:51

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH v2] memcg: update documentation to describe usage_in_bytes

Hi,

On Tue, 19 Apr 2011 14:14:14 +0200
Michal Hocko <[email protected]> wrote:

> Hi,
>
> On Tue 29-03-11 09:21:22, Michal Hocko wrote:
> > On Tue 29-03-11 10:15:11, Daisuke Nishimura wrote:
> > [...]
> > > From: Daisuke Nishimura <[email protected]>
> > >
> > > Since 569b846d (memcg: coalesce uncharge during unmap/truncate), we do batched
> > > (delayed) uncharge at truncation/unmap. And since cdec2e42 (memcg: coalesce
> > > charging via percpu storage), we have a percpu cache for res_counter.
> > >
> > > These changes improved the performance of memory cgroup very much, but they
> > > made res_counter->usage usually show a bigger value than the actual memory
> > > usage. So *.usage_in_bytes, which show res_counter->usage, are no longer
> > > suitable for reading precise values of memory (and swap) usage.
> > >
> > > Instead of removing these files completely (we cannot know res_counter->usage
> > > without them), this patch updates the documented meaning of those files.
> > >
> > > Signed-off-by: Daisuke Nishimura <[email protected]>
> > > ---
> > > Documentation/cgroups/memory.txt | 15 +++++++++++++--
> > > 1 files changed, 13 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > > index 7781857..4f49d91 100644
> > > --- a/Documentation/cgroups/memory.txt
> > > +++ b/Documentation/cgroups/memory.txt
> > > @@ -52,8 +52,10 @@ Brief summary of control files.
> > > tasks # attach a task(thread) and show list of threads
> > > cgroup.procs # show list of processes
> > > cgroup.event_control # an interface for event_fd()
> > > - memory.usage_in_bytes # show current memory(RSS+Cache) usage.
> > > - memory.memsw.usage_in_bytes # show current memory+Swap usage
> > > + memory.usage_in_bytes # show current res_counter usage for memory
> > > + (See 5.5 for details)
> > > + memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
> > > + (See 5.5 for details)
> > > memory.limit_in_bytes # set/show limit of memory usage
> > > memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
> > > memory.failcnt # show the number of memory usage hits limits
> > > @@ -453,6 +455,15 @@ memory under it will be reclaimed.
> > > You can reset failcnt by writing 0 to failcnt file.
> > > # echo 0 > .../memory.failcnt
> > >
> > > +5.5 usage_in_bytes
> > > +
> > > +For efficiency, as with other kernel components, memory cgroup uses some
> > > +optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
> > > +affected by the method and doesn't show the 'exact' value of memory (and swap)
> > > +usage; it's a fuzz value for efficient access. (Of course, when necessary,
> > > +it's synchronized.) If you want to know the more exact memory usage, you
> > > +should use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
> > > +
> > > 6. Hierarchy support
> > >
> > > The memory controller supports a deep hierarchy and hierarchical accounting.
> >
> > Acked-by: Michal Hocko <[email protected]>
> >
> > > Although I would like to see a mention of the reason for keeping the file(s)
> > > if their usefulness is so limited. Something like:
> > > "We are keeping the files because we want to be consistent with other
> > > cgroup implementations, and all of them export a usage counter in some
> > > way. Make sure you know the exact meaning before you use the value
> > > in userspace."
> >
> > If nobody else feels that this is that important then please forget
> > about this comment.
>
> I am wondering what happened to the patch. Do you have any plans to
> update/push it?
>
Ouch! Thank you for reminding me.

I'll resend this to Andrew soon.

Thanks,
Daisuke Nishimura.