2010-11-05 16:09:13

by Minchan Kim

Subject: [PATCH] memcg: use do_div to divide s64 in 32 bit machine.

Use do_div to divide the s64 value. Otherwise, the build fails as Dave
Young reported:

mm/built-in.o: In function `mem_cgroup_dirty_info':
/home/dave/vdb/build/mm/linux-2.6.36/mm/memcontrol.c:1251: undefined
reference to `__divdi3'
/home/dave/vdb/build/mm/linux-2.6.36/mm/memcontrol.c:1259: undefined
reference to `__divdi3'
make: *** [.tmp_vmlinux1] Error 1

Tested-by: Dave Young <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/memcontrol.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 76386f4..a15d95e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1247,18 +1247,20 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
if (dirty_param.dirty_bytes)
info->dirty_thresh =
DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
- else
- info->dirty_thresh =
- (dirty_param.dirty_ratio * available_mem) / 100;
+ else {
+ info->dirty_thresh = dirty_param.dirty_ratio * available_mem;
+ do_div(info->dirty_thresh, 100);
+ }

if (dirty_param.dirty_background_bytes)
info->background_thresh =
DIV_ROUND_UP(dirty_param.dirty_background_bytes,
PAGE_SIZE);
- else
- info->background_thresh =
- (dirty_param.dirty_background_ratio *
- available_mem) / 100;
+ else {
+ info->background_thresh = dirty_param.dirty_background_ratio *
+ available_mem;
+ do_div(info->background_thresh, 100);
+ }

info->nr_reclaimable =
mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
--
1.7.0.5
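
For context: on 32-bit kernels, the C '/' operator applied to a 64-bit
integer makes gcc emit a call to the libgcc helper __divdi3, which the
kernel does not link against; hence the link error above. A minimal
sketch of the do_div() idiom (the variable names and values here are
illustrative, not from the patch):

	#include <linux/types.h>	/* u64, u32 */
	#include <asm/div64.h>		/* do_div() */

	unsigned int dirty_ratio = 20;		/* illustrative values */
	unsigned long nr_pages = 1000000;

	u64 thresh = (u64)dirty_ratio * nr_pages;
	/*
	 * do_div(n, base) divides the u64 lvalue n in place by a 32-bit
	 * divisor and returns the 32-bit remainder; no libgcc helper is
	 * involved.
	 */
	u32 rem = do_div(thresh, 100);	/* thresh now holds the quotient */

Note that do_div() assumes an unsigned dividend, so applying it to an
s64 that may go negative is itself questionable; the type fixes later
in this thread sidestep that problem entirely.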


2010-11-05 16:35:35

by Greg Thelen

Subject: Re: [PATCH] memcg: use do_div to divide s64 in 32 bit machine.

Minchan Kim <[email protected]> writes:

> Use do_div to divide the s64 value. Otherwise, the build fails as Dave
> Young reported:
>
> mm/built-in.o: In function `mem_cgroup_dirty_info':
> /home/dave/vdb/build/mm/linux-2.6.36/mm/memcontrol.c:1251: undefined
> reference to `__divdi3'
> /home/dave/vdb/build/mm/linux-2.6.36/mm/memcontrol.c:1259: undefined
> reference to `__divdi3'
> make: *** [.tmp_vmlinux1] Error 1
>
> Tested-by: Dave Young <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
Tested-by: Greg Thelen <[email protected]>

Thanks for the report and the patch.

> ---
> mm/memcontrol.c | 16 +++++++++-------
> 1 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 76386f4..a15d95e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1247,18 +1247,20 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
> if (dirty_param.dirty_bytes)
> info->dirty_thresh =
> DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
> - else
> - info->dirty_thresh =
> - (dirty_param.dirty_ratio * available_mem) / 100;
> + else {
> + info->dirty_thresh = dirty_param.dirty_ratio * available_mem;
> + do_div(info->dirty_thresh, 100);
> + }
>
> if (dirty_param.dirty_background_bytes)
> info->background_thresh =
> DIV_ROUND_UP(dirty_param.dirty_background_bytes,
> PAGE_SIZE);
> - else
> - info->background_thresh =
> - (dirty_param.dirty_background_ratio *
> - available_mem) / 100;
> + else {
> + info->background_thresh = dirty_param.dirty_background_ratio *
> + available_mem;
> + do_div(info->background_thresh, 100);
> + }
>
> info->nr_reclaimable =
> mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);

2010-11-06 01:04:21

by Johannes Weiner

Subject: Re: [PATCH] memcg: use do_div to divide s64 in 32 bit machine.

On Sat, Nov 06, 2010 at 01:08:53AM +0900, Minchan Kim wrote:
> Use do_div to divide the s64 value. Otherwise, the build fails as Dave
> Young reported:

I thought about that too, but then I asked myself why you would want
to represent a number of pages as a signed 64-bit type, even on 32 bit?

Isn't the much better fix to get the types right instead?

2010-11-06 17:20:01

by Greg Thelen

Subject: Re: [PATCH] memcg: use do_div to divide s64 in 32 bit machine.

On Fri, Nov 5, 2010 at 6:03 PM, <[email protected]> wrote:
> On Sat, Nov 06, 2010 at 01:08:53AM +0900, Minchan Kim wrote:
>> Use do_div to divide the s64 value. Otherwise, the build fails as Dave
>> Young reported:
>
> I thought about that too, but then I asked myself why you would want
> to represent a number of pages as a signed 64-bit type, even on 32 bit?

I think the reason a 64-bit type is used for page counts in
memcontrol.c is that the low-level res_counter primitives operate
on 64-bit counters, even on 32-bit machines.

> Isn't the much better fix to get the types right instead?
>

I agree that consistent types between mem_cgroup_dirty_info() and
global_dirty_info() are important. There is a lot of s64 usage for
page counts in memcontrol.c, which I think is due to the res_counter
types. I think these s64s should be switched to unsigned long to be
consistent with the rest of the mm code. It looks like this will be a
clean patch, except at the lowest level where res_counter_read_u64()
is used, where some casting may be needed.

I'll post a patch for that change.
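
A sketch of the kind of narrowing that could be involved at that lowest
level (illustrative only, not taken from any posted patch; memcg is
assumed to be a struct mem_cgroup pointer):

	/*
	 * res_counter values are u64 byte counts, so deriving an
	 * unsigned long page count takes a shift and, on 32-bit
	 * machines, a narrowing conversion.
	 */
	unsigned long usage_pages = (unsigned long)
		(res_counter_read_u64(&memcg->res, RES_USAGE) >> PAGE_SHIFT);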

2010-11-06 17:31:15

by Minchan Kim

Subject: Re: [PATCH] memcg: use do_div to divide s64 in 32 bit machine.

On Sun, Nov 7, 2010 at 2:19 AM, Greg Thelen <[email protected]> wrote:
> On Fri, Nov 5, 2010 at 6:03 PM, <[email protected]> wrote:
>> On Sat, Nov 06, 2010 at 01:08:53AM +0900, Minchan Kim wrote:
>>> Use do_div to divide s64 value. Otherwise, build would be failed
>>> like Dave Young reported.
>>
>> I thought about that too, but then I asked myself why you would want
>> to represent a number of pages as a signed 64-bit type, even on 32 bit?
>
> I think the reason a 64-bit type is used for page counts in
> memcontrol.c is that the low-level res_counter primitives operate
> on 64-bit counters, even on 32-bit machines.
>
>> Isn't the much better fix to get the types right instead?
>>
>
> I agree that consistent types between mem_cgroup_dirty_info() and
> global_dirty_info() are important. There is a lot of s64 usage for
> page counts in memcontrol.c, which I think is due to the res_counter
> types. I think these s64s should be switched to unsigned long to be
> consistent with the rest of the mm code. It looks like this will be a
> clean patch, except at the lowest level where res_counter_read_u64()
> is used, where some casting may be needed.
>
> I'll post a patch for that change.
>

Agreed. I don't mind it.
Thanks, Hannes and Greg.



--
Kind regards,
Minchan Kim

2010-11-07 22:15:17

by Johannes Weiner

Subject: [patch 4/4] memcg: use native word page statistics counters

The statistic counters are in units of pages; there is no reason to
make them 64-bit wide on 32-bit machines.

Make them native words. Since they are signed, this leaves 31 bits on
32-bit machines, which can represent roughly 8TB assuming a page size
of 4k.
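
(Spelled out: 2^31 pages * 4096 bytes/page = 2^43 bytes = 8TiB.)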

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 2 +-
mm/memcontrol.c | 43 +++++++++++++++++++++----------------------
mm/page-writeback.c | 4 ++--
3 files changed, 24 insertions(+), 25 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -110,7 +110,7 @@ enum {
};

struct mem_cgroup_stat_cpu {
- s64 count[MEM_CGROUP_STAT_NSTATS];
+ long count[MEM_CGROUP_STAT_NSTATS];
unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
};

@@ -583,11 +583,11 @@ mem_cgroup_largest_soft_limit_node(struc
* common workload, threashold and synchonization as vmstat[] should be
* implemented.
*/
-static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
- enum mem_cgroup_stat_index idx)
+static long mem_cgroup_read_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_stat_index idx)
{
+ long val = 0;
int cpu;
- s64 val = 0;

for_each_online_cpu(cpu)
val += per_cpu(mem->stat->count[idx], cpu);
@@ -599,9 +599,9 @@ static s64 mem_cgroup_read_stat(struct m
return val;
}

-static s64 mem_cgroup_local_usage(struct mem_cgroup *mem)
+static long mem_cgroup_local_usage(struct mem_cgroup *mem)
{
- s64 ret;
+ long ret;

ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
@@ -1244,7 +1244,7 @@ bool mem_cgroup_dirty_info(unsigned long
struct vm_dirty_param dirty_param;
unsigned long available_mem;
struct mem_cgroup *memcg;
- s64 value;
+ long value;

if (mem_cgroup_disabled())
return false;
@@ -1301,10 +1301,10 @@ static inline bool mem_cgroup_can_swap(s
(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
}

-static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
- enum mem_cgroup_nr_pages_item item)
+static long mem_cgroup_local_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_nr_pages_item item)
{
- s64 ret;
+ long ret;

switch (item) {
case MEMCG_NR_DIRTYABLE_PAGES:
@@ -1365,11 +1365,11 @@ memcg_hierarchical_free_pages(struct mem
* Return the accounted statistic value or negative value if current task is
* root cgroup.
*/
-s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
{
- struct mem_cgroup *mem;
struct mem_cgroup *iter;
- s64 value;
+ struct mem_cgroup *mem;
+ long value;

get_online_cpus();
rcu_read_lock();
@@ -2069,7 +2069,7 @@ static void mem_cgroup_drain_pcp_counter

spin_lock(&mem->pcp_counter_lock);
for (i = 0; i < MEM_CGROUP_STAT_DATA; i++) {
- s64 x = per_cpu(mem->stat->count[i], cpu);
+ long x = per_cpu(mem->stat->count[i], cpu);

per_cpu(mem->stat->count[i], cpu) = 0;
mem->nocpu_base.count[i] += x;
@@ -3660,13 +3660,13 @@ static int mem_cgroup_hierarchy_write(st
}


-static u64 mem_cgroup_get_recursive_idx_stat(struct mem_cgroup *mem,
- enum mem_cgroup_stat_index idx)
+static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_stat_index idx)
{
struct mem_cgroup *iter;
- s64 val = 0;
+ long val = 0;

- /* each per cpu's value can be minus.Then, use s64 */
+ /* Per-cpu values can be negative, use a signed accumulator */
for_each_mem_cgroup_tree(iter, mem)
val += mem_cgroup_read_stat(iter, idx);

@@ -3686,12 +3686,11 @@ static inline u64 mem_cgroup_usage(struc
return res_counter_read_u64(&mem->memsw, RES_USAGE);
}

- val = mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_CACHE);
- val += mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_RSS);
+ val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
+ val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_RSS);

if (swap)
- val += mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_SWAPOUT);
+ val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_SWAPOUT);

return val << PAGE_SHIFT;
}
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -157,7 +157,7 @@ static inline void mem_cgroup_dec_page_s
bool mem_cgroup_has_dirty_limit(void);
bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
struct dirty_info *info);
-s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
+long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);

unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask);
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -133,10 +133,10 @@ static struct prop_descriptor vm_dirties

static unsigned long dirty_writeback_pages(void)
{
- s64 ret;
+ unsigned long ret;

ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
- if (ret < 0)
+ if ((long)ret < 0)
ret = global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK);


2010-11-07 22:15:10

by Johannes Weiner

Subject: [patch 0/4] memcg: variable type fixes

Hi Greg,

It is not the res_counter primitives; these are our own counters. We
have to keep signed types for most counters, as the per-cpu counter
folding can race and we can end up with negative values.

The fix for the original issue is in patch 1. There are no casts
needed: the range is checked to be sane and the value is then converted
to the unsigned type through assignment.

Patch 2, also a type fix, ensures we catch accounting races properly.
It is unrelated, but also important.

Patch 3 implements the idea that we only have to use signed types for
_some_ of the counters, but not for the monotonically increasing event
counters, where the sign bit would be a waste.

Patch 4 converts our fundamental page statistics counters to native
words as these should be wide enough for the expected values.

Hannes

2010-11-07 22:15:09

by Johannes Weiner

Subject: [patch 1/4] memcg: use native word to represent dirtyable pages

The memory cgroup dirty info calculation currently uses a signed
64-bit type to represent the amount of dirtyable memory in pages.

This can instead be changed to an unsigned word, which will allow the
formula to function correctly with up to 160G of LRU pages on a 32-bit
system, assuming 4k pages. That should be plenty even when taking
racy folding of the per-cpu counters into account.

This fixes a compilation error on 32-bit systems as this code tries to
do 64-bit division.

Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Dave Young <[email protected]>
---
mm/memcontrol.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1222,9 +1222,10 @@ static void __mem_cgroup_dirty_param(str
bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
struct dirty_info *info)
{
- s64 available_mem;
struct vm_dirty_param dirty_param;
+ unsigned long available_mem;
struct mem_cgroup *memcg;
+ s64 value;

if (mem_cgroup_disabled())
return false;
@@ -1238,11 +1239,11 @@ bool mem_cgroup_dirty_info(unsigned long
__mem_cgroup_dirty_param(&dirty_param, memcg);
rcu_read_unlock();

- available_mem = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
- if (available_mem < 0)
+ value = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+ if (value < 0)
return false;

- available_mem = min((unsigned long)available_mem, sys_available_mem);
+ available_mem = min((unsigned long)value, sys_available_mem);

if (dirty_param.dirty_bytes)
info->dirty_thresh =

2010-11-07 22:15:41

by Johannes Weiner

Subject: [patch 2/4] memcg: catch negative per-cpu sums in dirty info

Folding the per-cpu counters can yield a negative value in case of
accounting races between CPUs.

When collecting the dirty info, the code would read those sums into an
unsigned variable and then check for it being negative, which can not
work.

Instead, fold the counters into a signed local variable, make the
check, and only then assign it.

This way, the function signals correctly when there are insane values
instead of leaking them out to the caller.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1261,14 +1261,15 @@ bool mem_cgroup_dirty_info(unsigned long
(dirty_param.dirty_background_ratio *
available_mem) / 100;

- info->nr_reclaimable =
- mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
- if (info->nr_reclaimable < 0)
+ value = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+ if (value < 0)
return false;
+ info->nr_reclaimable = value;

- info->nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
- if (info->nr_writeback < 0)
+ value = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+ if (value < 0)
return false;
+ info->nr_writeback = value;

return true;
}

2010-11-07 22:15:56

by Johannes Weiner

Subject: [patch 3/4] memcg: break out event counters from other stats

For increasing and decreasing per-cpu cgroup usage counters it makes
sense to use signed types, as single per-cpu values might go negative
during updates. But this is not the case for only-ever-increasing
event counters.

All the counters have been signed 64-bit so far, which was enough to
count events even with the sign bit wasted.

The next patch narrows the usage counters type (on 32-bit CPUs, that
is), though, so break out the event counters and make them unsigned
words as they should have been from the start.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 49 +++++++++++++++++++++++++++++++++++++------------
1 file changed, 37 insertions(+), 12 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,21 +85,23 @@ enum mem_cgroup_stat_index {
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
- MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
MEM_CGROUP_STAT_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
- /* incremented at every pagein/pageout */
- MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
-
MEM_CGROUP_STAT_NSTATS,
};

+enum mem_cgroup_events_index {
+ MEM_CGROUP_EVENTS_PGPGIN, /* # of pages paged in */
+ MEM_CGROUP_EVENTS_PGPGOUT, /* # of pages paged out */
+ MEM_CGROUP_EVENTS_COUNT, /* # of pages paged in/out */
+ MEM_CGROUP_EVENTS_NSTATS,
+};
+
enum {
MEM_CGROUP_DIRTY_RATIO,
MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
@@ -109,6 +111,7 @@ enum {

struct mem_cgroup_stat_cpu {
s64 count[MEM_CGROUP_STAT_NSTATS];
+ unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
};

/*
@@ -612,6 +615,22 @@ static void mem_cgroup_swap_statistics(s
this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
}

+static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
+ enum mem_cgroup_events_index idx)
+{
+ unsigned long val = 0;
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ val += per_cpu(mem->stat->events[idx], cpu);
+#ifdef CONFIG_HOTPLUG_CPU
+ spin_lock(&mem->pcp_counter_lock);
+ val += mem->nocpu_base.events[idx];
+ spin_unlock(&mem->pcp_counter_lock);
+#endif
+ return val;
+}
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -626,10 +645,10 @@ static void mem_cgroup_charge_statistics
__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], val);

if (charge)
- __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGIN_COUNT]);
+ __this_cpu_inc(mem->stat->events[MEM_CGROUP_EVENTS_PGPGIN]);
else
- __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGOUT_COUNT]);
- __this_cpu_inc(mem->stat->count[MEM_CGROUP_EVENTS]);
+ __this_cpu_inc(mem->stat->events[MEM_CGROUP_EVENTS_PGPGOUT]);
+ __this_cpu_inc(mem->stat->events[MEM_CGROUP_EVENTS_COUNT]);

preempt_enable();
}
@@ -651,9 +670,9 @@ static unsigned long mem_cgroup_get_loca

static bool __memcg_event_check(struct mem_cgroup *mem, int event_mask_shift)
{
- s64 val;
+ unsigned long val;

- val = this_cpu_read(mem->stat->count[MEM_CGROUP_EVENTS]);
+ val = this_cpu_read(mem->stat->events[MEM_CGROUP_EVENTS_COUNT]);

return !(val & ((1 << event_mask_shift) - 1));
}
@@ -2055,6 +2074,12 @@ static void mem_cgroup_drain_pcp_counter
per_cpu(mem->stat->count[i], cpu) = 0;
mem->nocpu_base.count[i] += x;
}
+ for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
+ unsigned long x = per_cpu(mem->stat->events[i], cpu);
+
+ per_cpu(mem->stat->events[i], cpu) = 0;
+ mem->nocpu_base.events[i] += x;
+ }
/* need to clear ON_MOVE value, works as a kind of lock. */
per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) = 0;
spin_unlock(&mem->pcp_counter_lock);
@@ -3892,9 +3917,9 @@ mem_cgroup_get_local_stat(struct mem_cgr
s->stat[MCS_RSS] += val * PAGE_SIZE;
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
s->stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
+ val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGPGIN);
s->stat[MCS_PGPGIN] += val;
- val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGOUT_COUNT);
+ val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGPGOUT);
s->stat[MCS_PGPGOUT] += val;
if (do_swap_account) {
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);

2010-11-07 22:56:46

by Minchan Kim

Subject: Re: [patch 1/4] memcg: use native word to represent dirtyable pages

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
> The memory cgroup dirty info calculation currently uses a signed
> 64-bit type to represent the amount of dirtyable memory in pages.
>
> This can instead be changed to an unsigned word, which will allow the
> formula to function correctly with up to 160G of LRU pages on a 32-bit
> system, assuming 4k pages. That should be plenty even when taking
> racy folding of the per-cpu counters into account.
>
> This fixes a compilation error on 32-bit systems as this code tries to
> do 64-bit division.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Reported-by: Dave Young <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>


--
Kind regards,
Minchan Kim

2010-11-07 23:26:31

by Minchan Kim

Subject: Re: [patch 2/4] memcg: catch negative per-cpu sums in dirty info

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
> Folding the per-cpu counters can yield a negative value in case of
> accounting races between CPUs.
>
> When collecting the dirty info, the code would read those sums into an
> unsigned variable and then check for it being negative, which can not
> work.
>
> Instead, fold the counters into a signed local variable, make the
> check, and only then assign it.
>
> This way, the function signals correctly when there are insane values
> instead of leaking them out to the caller.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>


--
Kind regards,
Minchan Kim

2010-11-07 23:52:43

by Minchan Kim

Subject: Re: [patch 3/4] memcg: break out event counters from other stats

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
> For increasing and decreasing per-cpu cgroup usage counters it makes
> sense to use signed types, as single per-cpu values might go negative
> during updates. But this is not the case for only-ever-increasing
> event counters.
>
> All the counters have been signed 64-bit so far, which was enough to
> count events even with the sign bit wasted.
>
> The next patch narrows the usage counters type (on 32-bit CPUs, that
> is), though, so break out the event counters and make them unsigned
> words as they should have been from the start.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Fair enough.
We already use unsigned long in vmstat.

--
Kind regards,
Minchan Kim

2010-11-08 00:01:56

by Minchan Kim

Subject: Re: [patch 4/4] memcg: use native word page statistics counters

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
> The statistic counters are in units of pages; there is no reason to
> make them 64-bit wide on 32-bit machines.
>
> Make them native words. Since they are signed, this leaves 31 bits on
> 32-bit machines, which can represent roughly 8TB assuming a page size
> of 4k.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

This patch also renames mem_cgroup_get_recursive_idx_stat to
mem_cgroup_recursive_stat.
As you know, it would be better as a separate patch, although it's
trivial. But I don't mind.
I like the name. :)

Thanks, Hannes.





--
Kind regards,
Minchan Kim

2010-11-08 00:07:37

by Minchan Kim

Subject: Re: [patch 4/4] memcg: use native word page statistics counters

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
> The statistic counters are in units of pages; there is no reason to
> make them 64-bit wide on 32-bit machines.
>
> Make them native words. Since they are signed, this leaves 31 bits on
> 32-bit machines, which can represent roughly 8TB assuming a page size
> of 4k.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 2 +-
> mm/memcontrol.c | 43 +++++++++++++++++++++----------------------
> mm/page-writeback.c | 4 ++--
> 3 files changed, 24 insertions(+), 25 deletions(-)
>

<snip>

>
> static unsigned long dirty_writeback_pages(void)
> {
> - s64 ret;
> + unsigned long ret;
>
> ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> - if (ret < 0)
> + if ((long)ret < 0)
> ret = global_page_state(NR_UNSTABLE_NFS) +
> global_page_state(NR_WRITEBACK);

BTW, let me ask a question.
dirty_writeback_pages seems to depend on mem_cgroup_page_stat's
result (i.e., a negative value) to separate the global and memcg cases.
But mem_cgroup_page_stat can return a negative value due to per-cpu
races as well as for the root cgroup.
If I understand correctly, isn't that a problem?

--
Kind regards,
Minchan Kim

2010-11-08 09:08:57

by Johannes Weiner

Subject: Re: [patch 4/4] memcg: use native word page statistics counters

On Mon, Nov 08, 2010 at 09:01:54AM +0900, Minchan Kim wrote:
> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
> > The statistic counters are in units of pages; there is no reason to
> > make them 64-bit wide on 32-bit machines.
> >
> > Make them native words. Since they are signed, this leaves 31 bits on
> > 32-bit machines, which can represent roughly 8TB assuming a page size
> > of 4k.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
>
> This patch also renames mem_cgroup_get_recursive_idx_stat to
> mem_cgroup_recursive_stat.
> As you know, it would be better as a separate patch, although it's
> trivial. But I don't mind.
> I like the name. :)

I also feel that it's not too nice to mix such cleanups with
functionality changes. But I found the name too appalling to leave it
alone, and a separate patch not worth the trouble.

If somebody has strong feelings, I will happily split it up.

Thanks for your review, Minchan.

Hannes

2010-11-08 09:37:54

by Johannes Weiner

Subject: Re: memcg writeout throttling, was: [patch 4/4] memcg: use native word page statistics counters

On Mon, Nov 08, 2010 at 09:07:35AM +0900, Minchan Kim wrote:
> BTW, let me ask a question.
> dirty_writeback_pages seems to depend on mem_cgroup_page_stat's
> result (i.e., a negative value) to separate the global and memcg cases.
> But mem_cgroup_page_stat can return a negative value due to per-cpu
> races as well as for the root cgroup.
> If I understand correctly, isn't that a problem?

Yes, the numbers are not reliable and may be off by some. It appears
to me that the only sensible interpretation of a negative sum is to
assume zero, though. So to be honest, I don't understand the fallback
to global state when the local state fluctuates around low values.

This function is also only used in throttle_vm_writeout(), where the
outcome is compared to the global dirty threshold. So using the
number of writeback pages _from the current cgroup_ and falling back
to global writeback pages when this number is low makes no sense to me
at all.

It looks like it should rather compare the cgroup state with the cgroup
limit, and the global state with the global limit.

Can somebody explain the reasoning behind this? And in case it makes
sense after all, put a comment into this function?

Thanks!

2010-11-08 15:45:29

by Fengguang Wu

Subject: Re: memcg writeout throttling, was: [patch 4/4] memcg: use native word page statistics counters

On Mon, Nov 08, 2010 at 05:37:16PM +0800, Johannes Weiner wrote:
> On Mon, Nov 08, 2010 at 09:07:35AM +0900, Minchan Kim wrote:
> > BTW, let me ask a question.
> > dirty_writeback_pages seems to depend on mem_cgroup_page_stat's
> > result (i.e., a negative value) to separate the global and memcg cases.
> > But mem_cgroup_page_stat can return a negative value due to per-cpu
> > races as well as for the root cgroup.
> > If I understand correctly, isn't that a problem?
>
> Yes, the numbers are not reliable and may be off by some. It appears
> to me that the only sensible interpretation of a negative sum is to
> assume zero, though. So to be honest, I don't understand the fallback
> to global state when the local state fluctuates around low values.

Agreed. It does not make sense to compare values from different domains.

The bdi stats use percpu_counter_sum_positive(), which never returns
negative values. It may be suitable for memcg page counts, too.
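
A sketch of what that could look like for the memcg counters (a
hypothetical helper, not part of any posted patch):

	static unsigned long
	mem_cgroup_read_stat_positive(struct mem_cgroup *mem,
				      enum mem_cgroup_stat_index idx)
	{
		long val = mem_cgroup_read_stat(mem, idx);

		/*
		 * Racy per-cpu folding can yield transient negative
		 * sums; clamp them to zero, in the spirit of
		 * percpu_counter_sum_positive().
		 */
		return val > 0 ? val : 0;
	}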

> This function is also only used in throttle_vm_writeout(), where the
> outcome is compared to the global dirty threshold. So using the
> number of writeback pages _from the current cgroup_ and falling back
> to global writeback pages when this number is low makes no sense to me
> at all.
>
> It looks like it should rather compare the cgroup state with the cgroup
> limit, and the global state with the global limit.

Right.

> Can somebody explain the reasoning behind this? And in case it makes
> sense after all, put a comment into this function?

It seems a better match to test sc->mem_cgroup rather than
mem_cgroup_from_task(current). The latter could make mismatches. When
someone is changing the memcg limits and hence triggers memcg
reclaims, the current task is actually the (unrelated) shell. It's
also possible for the memcg task to trigger _global_ direct reclaim.

Thanks,
Fengguang

2010-11-08 19:01:22

by Greg Thelen

Subject: Re: memcg writeout throttling, was: [patch 4/4] memcg: use native word page statistics counters

On Mon, Nov 8, 2010 at 7:45 AM, Wu Fengguang <[email protected]> wrote:
> On Mon, Nov 08, 2010 at 05:37:16PM +0800, Johannes Weiner wrote:
>> On Mon, Nov 08, 2010 at 09:07:35AM +0900, Minchan Kim wrote:
>> > BTW, let me ask a question.
>> > dirty_writeback_pages seems to depend on mem_cgroup_page_stat's
>> > result (i.e., a negative value) to separate the global and memcg cases.
>> > But mem_cgroup_page_stat can return a negative value due to per-cpu
>> > races as well as for the root cgroup.
>> > If I understand correctly, isn't that a problem?
>>
>> Yes, the numbers are not reliable and may be off by some. It appears
>> to me that the only sensible interpretation of a negative sum is to
>> assume zero, though. So to be honest, I don't understand the fallback
>> to global state when the local state fluctuates around low values.
>
> Agreed. It does not make sense to compare values from different domains.
>
> The bdi stats use percpu_counter_sum_positive(), which never returns
> negative values. It may be suitable for memcg page counts, too.
>
>> This function is also only used in throttle_vm_writeout(), where the
>> outcome is compared to the global dirty threshold. So using the
>> number of writeback pages _from the current cgroup_ and falling back
>> to global writeback pages when this number is low makes no sense to me
>> at all.
>>
>> It looks like it should rather compare the cgroup state with the cgroup
>> limit, and the global state with the global limit.
>
> Right.
>
>> Can somebody explain the reasoning behind this? And in case it makes
>> sense after all, put a comment into this function?
>
> It seems a better match to test sc->mem_cgroup rather than
> mem_cgroup_from_task(current). The latter can mismatch: when someone
> is changing the memcg limits and thereby triggers memcg reclaim, the
> current task is actually the (unrelated) shell. It's also possible
> for the memcg task to trigger _global_ direct reclaim.

Good point. I am writing a patch that will pass mem_cgroup from
sc->mem_cgroup into mem_cgroup_page_stat() rather than using
mem_cgroup_from_task(current). I will post this patch in a few hours.

I will also fix the negative value issue in mem_cgroup_page_stat().

2010-11-08 22:25:47

by Greg Thelen

Subject: Re: [patch 1/4] memcg: use native word to represent dirtyable pages

Minchan Kim <[email protected]> writes:

> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
>> The memory cgroup dirty info calculation currently uses a signed
>> 64-bit type to represent the amount of dirtyable memory in pages.
>>
>> This can instead be changed to an unsigned word, which will allow the
>> formula to function correctly with up to 160G of LRU pages on a 32-bit
Is it really 160G of LRU pages? On a 32-bit machine we use a 32-bit
unsigned page number. With a 4KiB page size, I think that maps 16TiB
(1<<(32+12)) bytes. Or is there some other limit?
>> system, assuming 4k pages.  That should be plenty even when taking
>> racy folding of the per-cpu counters into account.
>>
>> This fixes a compilation error on 32-bit systems as this code tries to
>> do 64-bit division.
>>
>> Signed-off-by: Johannes Weiner <[email protected]>
>> Reported-by: Dave Young <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Greg Thelen <[email protected]>

2010-11-08 22:29:15

by Greg Thelen

Subject: Re: [patch 2/4] memcg: catch negative per-cpu sums in dirty info

Minchan Kim <[email protected]> writes:

> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
>> Folding the per-cpu counters can yield a negative value in case of
>> accounting races between CPUs.
>>
>> When collecting the dirty info, the code would read those sums into an
>> unsigned variable and then check for it being negative, which can not
>> work.
>>
>> Instead, fold the counters into a signed local variable, make the
>> check, and only then assign it.
>>
>> This way, the function signals correctly when there are insane values
>> instead of leaking them out to the caller.
>>
>> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Greg Thelen <[email protected]>

2010-11-08 22:39:04

by Johannes Weiner

Subject: Re: [patch 1/4] memcg: use native word to represent dirtyable pages

On Mon, Nov 08, 2010 at 02:25:15PM -0800, Greg Thelen wrote:
> Minchan Kim <[email protected]> writes:
>
> > On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
> >> The memory cgroup dirty info calculation currently uses a signed
> >> 64-bit type to represent the amount of dirtyable memory in pages.
> >>
> >> This can instead be changed to an unsigned word, which will allow the
> >> formula to function correctly with up to 160G of LRU pages on a 32-bit
> Is it really 160G of LRU pages? On a 32-bit machine we use a 32-bit
> unsigned page number. With a 4KiB page size, I think that maps 16TiB
> (1<<(32+12)) bytes. Or is there some other limit?

Yes, the dirty limit we calculate from it :)

We have to be able to multiply this number by up to 100 (maximum dirty
ratio value) without overflowing.
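
Spelled out: dirty_ratio * available_mem must fit in a 32-bit unsigned
long, so available_mem is capped at 2^32 / 100, about 42.9 million
pages; at 4KiB per page that is roughly 163GiB, hence the changelog's
"up to 160G".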

2010-11-08 22:43:34

by Greg Thelen

Subject: Re: [patch 1/4] memcg: use native word to represent dirtyable pages

On Mon, Nov 8, 2010 at 2:38 PM, Johannes Weiner <[email protected]> wrote:
> On Mon, Nov 08, 2010 at 02:25:15PM -0800, Greg Thelen wrote:
>> Minchan Kim <[email protected]> writes:
>>
>> > On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
>> >> The memory cgroup dirty info calculation currently uses a signed
>> >> 64-bit type to represent the amount of dirtyable memory in pages.
>> >>
>> >> This can instead be changed to an unsigned word, which will allow the
>> >> formula to function correctly with up to 160G of LRU pages on a 32-bit
>> Is it really 160G of LRU pages? On a 32-bit machine we use a 32-bit
>> unsigned page number. With a 4KiB page size, I think that maps 16TiB
>> (1<<(32+12)) bytes. Or is there some other limit?
>
> Yes, the dirty limit we calculate from it :)
>
> We have to be able to multiply this number by up to 100 (maximum dirty
> ratio value) without overflowing.

Duh :) thanks.

2010-11-08 22:51:22

by Greg Thelen

Subject: Re: [patch 4/4] memcg: use native word page statistics counters

Minchan Kim <[email protected]> writes:

> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
>> The statistic counters are in units of pages; there is no reason to
>> make them 64-bit wide on 32-bit machines.
>>
>> Make them native words. Since they are signed, this leaves 31 bits on
>> 32-bit machines, which can represent roughly 8TB assuming a page size
>> of 4k.
>>
>> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Greg Thelen <[email protected]>

2010-11-08 23:21:14

by Greg Thelen

Subject: Re: [patch 3/4] memcg: break out event counters from other stats

Minchan Kim <[email protected]> writes:

> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <[email protected]> wrote:
>> For increasing and decreasing per-cpu cgroup usage counters it makes
>> sense to use signed types, as single per-cpu values might go negative
>> during updates.  But this is not the case for only-ever-increasing
>> event counters.
>>
>> All the counters have been signed 64-bit so far, which was enough to
>> count events even with the sign bit wasted.
>>
>> The next patch narrows the usage counters type (on 32-bit CPUs, that
>> is), though, so break out the event counters and make them unsigned
>> words as they should have been from the start.
>>
>> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Greg Thelen <[email protected]>

2010-11-08 23:28:36

by Greg Thelen

Subject: Re: [patch 4/4] memcg: use native word page statistics counters

Johannes Weiner <[email protected]> writes:

> The statistic counters are in units of pages; there is no reason to
> make them 64-bit wide on 32-bit machines.
>
> Make them native words. Since they are signed, this leaves 31 bits on
> 32-bit machines, which can represent roughly 8TB assuming a page size
> of 4k.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 2 +-
> mm/memcontrol.c | 43 +++++++++++++++++++++----------------------
> mm/page-writeback.c | 4 ++--
> 3 files changed, 24 insertions(+), 25 deletions(-)
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -110,7 +110,7 @@ enum {
> };
>
> struct mem_cgroup_stat_cpu {
> - s64 count[MEM_CGROUP_STAT_NSTATS];
> + long count[MEM_CGROUP_STAT_NSTATS];
> unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
> };
>
> @@ -583,11 +583,11 @@ mem_cgroup_largest_soft_limit_node(struc
> * common workload, threashold and synchonization as vmstat[] should be
> * implemented.
> */
> -static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
> - enum mem_cgroup_stat_index idx)
> +static long mem_cgroup_read_stat(struct mem_cgroup *mem,
> + enum mem_cgroup_stat_index idx)
> {
> + long val = 0;
> int cpu;
> - s64 val = 0;
>
> for_each_online_cpu(cpu)
> val += per_cpu(mem->stat->count[idx], cpu);
> @@ -599,9 +599,9 @@ static s64 mem_cgroup_read_stat(struct m
> return val;
> }
>
> -static s64 mem_cgroup_local_usage(struct mem_cgroup *mem)
> +static long mem_cgroup_local_usage(struct mem_cgroup *mem)
> {
> - s64 ret;
> + long ret;
>
> ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
> ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
> @@ -1244,7 +1244,7 @@ bool mem_cgroup_dirty_info(unsigned long
> struct vm_dirty_param dirty_param;
> unsigned long available_mem;
> struct mem_cgroup *memcg;
> - s64 value;
> + long value;
>
> if (mem_cgroup_disabled())
> return false;
> @@ -1301,10 +1301,10 @@ static inline bool mem_cgroup_can_swap(s
> (res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> }
>
> -static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
> - enum mem_cgroup_nr_pages_item item)
> +static long mem_cgroup_local_page_stat(struct mem_cgroup *mem,
> + enum mem_cgroup_nr_pages_item item)
> {
> - s64 ret;
> + long ret;
>
> switch (item) {
> case MEMCG_NR_DIRTYABLE_PAGES:
> @@ -1365,11 +1365,11 @@ memcg_hierarchical_free_pages(struct mem
> * Return the accounted statistic value or negative value if current task is
> * root cgroup.
> */
> -s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> +long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> {
> - struct mem_cgroup *mem;
> struct mem_cgroup *iter;
> - s64 value;
> + struct mem_cgroup *mem;
> + long value;
>
> get_online_cpus();
> rcu_read_lock();
> @@ -2069,7 +2069,7 @@ static void mem_cgroup_drain_pcp_counter
>
> spin_lock(&mem->pcp_counter_lock);
> for (i = 0; i < MEM_CGROUP_STAT_DATA; i++) {
> - s64 x = per_cpu(mem->stat->count[i], cpu);
> + long x = per_cpu(mem->stat->count[i], cpu);
>
> per_cpu(mem->stat->count[i], cpu) = 0;
> mem->nocpu_base.count[i] += x;
> @@ -3660,13 +3660,13 @@ static int mem_cgroup_hierarchy_write(st
> }
>
>
> -static u64 mem_cgroup_get_recursive_idx_stat(struct mem_cgroup *mem,
> - enum mem_cgroup_stat_index idx)
> +static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *mem,
> + enum mem_cgroup_stat_index idx)
> {
> struct mem_cgroup *iter;
> - s64 val = 0;
> + long val = 0;
>
> - /* each per cpu's value can be minus.Then, use s64 */
> + /* Per-cpu values can be negative, use a signed accumulator */
> for_each_mem_cgroup_tree(iter, mem)
> val += mem_cgroup_read_stat(iter, idx);
>
> @@ -3686,12 +3686,11 @@ static inline u64 mem_cgroup_usage(struc
> return res_counter_read_u64(&mem->memsw, RES_USAGE);
> }
>
> - val = mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_CACHE);
> - val += mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_RSS);
> + val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
> + val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_RSS);
>
> if (swap)
> - val += mem_cgroup_get_recursive_idx_stat(mem,
> - MEM_CGROUP_STAT_SWAPOUT);
> + val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>
> return val << PAGE_SHIFT;
> }
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -157,7 +157,7 @@ static inline void mem_cgroup_dec_page_s
> bool mem_cgroup_has_dirty_limit(void);
> bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
> struct dirty_info *info);
> -s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> +long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);

Oops. I missed something in my review.

mem_cgroup_page_stat() appears twice in memcontrol.h. The return value
should match regardless of whether CONFIG_CGROUP_MEM_RES_CTLR is set.

I suggest integrating the following into your patch ([patch 4/4] memcg:
use native word page statistics counters):

---
include/linux/memcontrol.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4e046d6..7a3d915 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -351,7 +351,7 @@ static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
return false;
}

-static inline s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+static inline long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
{
return -ENOSYS;
}

> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> gfp_t gfp_mask);
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -133,10 +133,10 @@ static struct prop_descriptor vm_dirties
>
> static unsigned long dirty_writeback_pages(void)
> {
> - s64 ret;
> + unsigned long ret;
>
> ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> - if (ret < 0)
> + if ((long)ret < 0)
> ret = global_page_state(NR_UNSTABLE_NFS) +
> global_page_state(NR_WRITEBACK);
>

2010-11-08 23:46:02

by Johannes Weiner

Subject: Re: [patch 4/4] memcg: use native word page statistics counters

On Mon, Nov 08, 2010 at 03:27:26PM -0800, Greg Thelen wrote:
> Johannes Weiner <[email protected]> writes:
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -157,7 +157,7 @@ static inline void mem_cgroup_dec_page_s
> > bool mem_cgroup_has_dirty_limit(void);
> > bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
> > struct dirty_info *info);
> > -s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> > +long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
>
> Oops. I missed something in my review.
>
> mem_cgroup_page_stat() appears twice in memcontrol.h. The return value
> should match regardless of whether CONFIG_CGROUP_MEM_RES_CTLR is set.
>
> I suggest integrating the following into your patch ([patch 4/4] memcg:
> use native word page statistics counters):

Right you are! Thanks.

Hannes

> ---
> include/linux/memcontrol.h | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4e046d6..7a3d915 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -351,7 +351,7 @@ static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
> return false;
> }
>
> -static inline s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> +static inline long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> {
> return -ENOSYS;
> }

2010-11-16 03:43:21

by Kamezawa Hiroyuki

Subject: Re: [patch 1/4] memcg: use native word to represent dirtyable pages

On Sun, 7 Nov 2010 23:14:36 +0100
Johannes Weiner <[email protected]> wrote:

> The memory cgroup dirty info calculation currently uses a signed
> 64-bit type to represent the amount of dirtyable memory in pages.
>
> This can instead be changed to an unsigned word, which will allow the
> formula to function correctly with up to 160G of LRU pages on a 32-bit
> system, assuming 4k pages. That should be plenty even when taking
> racy folding of the per-cpu counters into account.
>
> This fixes a compilation error on 32-bit systems as this code tries to
> do 64-bit division.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Reported-by: Dave Young <[email protected]>


Thank you.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

I couldn't read email because I was on vacation.

2010-11-16 03:44:50

by Kamezawa Hiroyuki

Subject: Re: [patch 2/4] memcg: catch negative per-cpu sums in dirty info

On Sun, 7 Nov 2010 23:14:37 +0100
Johannes Weiner <[email protected]> wrote:

> Folding the per-cpu counters can yield a negative value in case of
> accounting races between CPUs.
>
> When collecting the dirty info, the code would read those sums into an
> unsigned variable and then check for it being negative, which can not
> work.
>
> Instead, fold the counters into a signed local variable, make the
> check, and only then assign it.
>
> This way, the function signals correctly when there are insane values
> instead of leaking them out to the caller.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2010-11-16 03:47:28

by Kamezawa Hiroyuki

Subject: Re: [patch 3/4] memcg: break out event counters from other stats

On Sun, 7 Nov 2010 23:14:38 +0100
Johannes Weiner <[email protected]> wrote:

> For increasing and decreasing per-cpu cgroup usage counters it makes
> sense to use signed types, as single per-cpu values might go negative
> during updates. But this is not the case for only-ever-increasing
> event counters.
>
> All the counters have been signed 64-bit so far, which was enough to
> count events even with the sign bit wasted.
>
> The next patch narrows the usage counters type (on 32-bit CPUs, that
> is), though, so break out the event counters and make them unsigned
> words as they should have been from the start.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2010-11-16 03:49:49

by Kamezawa Hiroyuki

Subject: Re: [patch 4/4] memcg: use native word page statistics counters

On Sun, 7 Nov 2010 23:14:39 +0100
Johannes Weiner <[email protected]> wrote:

> The statistic counters are in units of pages, there is no reason to
> make them 64-bit wide on 32-bit machines.
>
> Make them native words. Since they are signed, this leaves 31 bit on
> 32-bit machines, which can represent roughly 8TB assuming a page size
> of 4k.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>