2010-11-09 09:25:20

by Greg Thelen

Subject: [PATCH 0/6] *** memcg: make throttle_vm_writeout() cgroup aware ***

This series aims to:
- Make throttle_vm_writeout() cgroup aware.  Prior to this series, cgroup
  reclaim considered the global dirty limits when deciding whether to throttle.
  Now the cgroup's limits are used if the cgroup being reclaimed has dirty limits.

- Part of making throttle_vm_writeout() cgroup aware involves fixing the
  negative-value error signaling in mem_cgroup_page_stat().  Previously,
  mem_cgroup_page_stat() could falsely return a negative value when a per-cpu
  counter sum happened to be negative.  Calling routines treated any negative
  value as a special "cgroup does not have limits" indicator.

Greg Thelen (6):
memcg: add mem_cgroup parameter to mem_cgroup_page_stat()
memcg: pass mem_cgroup to mem_cgroup_dirty_info()
memcg: make throttle_vm_writeout() memcg aware
memcg: simplify mem_cgroup_page_stat()
memcg: simplify mem_cgroup_dirty_info()
memcg: make mem_cgroup_page_stat() return value unsigned

include/linux/memcontrol.h | 8 +++-
include/linux/writeback.h | 2 +-
mm/memcontrol.c | 92 ++++++++++++++++++++++---------------------
mm/page-writeback.c | 40 ++++++++++---------
mm/vmscan.c | 2 +-
5 files changed, 77 insertions(+), 67 deletions(-)

--
1.7.3.1
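As a hedged illustration of the signaling problem the cover letter describes (this is a user-space sketch, not kernel code; all names are invented for the example): per-cpu counters are summed without locking, so the total can be transiently negative even though the true page count is non-negative, and a caller that treats "negative" as "this cgroup has no dirty limits" will silently fall back to the global limits.

```c
#include <assert.h>

#define NR_CPUS 4

/* Sum per-cpu deltas; individual entries may be negative. */
static long page_stat_sum(const long percpu[NR_CPUS])
{
        long value = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                value += percpu[cpu];
        return value;
}

/* Old convention: a negative return doubles as a "no limits" error code,
 * so a transiently negative sum is indistinguishable from a real error. */
static int old_caller_thinks_no_limits(const long percpu[NR_CPUS])
{
        return page_stat_sum(percpu) < 0;
}
```

Patch 6 resolves this ambiguity by removing the error-signaling convention entirely and clamping the sum to zero.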


2010-11-09 09:25:07

by Greg Thelen

Subject: [PATCH 1/6] memcg: add mem_cgroup parameter to mem_cgroup_page_stat()

This new parameter can be used to query dirty memory usage
from a given memcg rather than the current task's memcg.

Signed-off-by: Greg Thelen <[email protected]>
---
 include/linux/memcontrol.h |    6 ++++--
 mm/memcontrol.c            |   37 +++++++++++++++++++++----------------
 mm/page-writeback.c        |    2 +-
 3 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a3d915..89a9278 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -157,7 +157,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 bool mem_cgroup_has_dirty_limit(void);
 bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
                           struct dirty_info *info);
-long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
+long mem_cgroup_page_stat(struct mem_cgroup *mem,
+                          enum mem_cgroup_nr_pages_item item);
 
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
                                            gfp_t gfp_mask);
@@ -351,7 +352,8 @@ static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
        return false;
 }
 
-static inline long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+static inline long mem_cgroup_page_stat(struct mem_cgroup *mem,
+                                        enum mem_cgroup_nr_pages_item item)
 {
        return -ENOSYS;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d8a06d6..1bff7cf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1245,22 +1245,20 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
        unsigned long available_mem;
        struct mem_cgroup *memcg;
        long value;
+       bool valid = false;
 
        if (mem_cgroup_disabled())
                return false;
 
        rcu_read_lock();
        memcg = mem_cgroup_from_task(current);
-       if (!__mem_cgroup_has_dirty_limit(memcg)) {
-               rcu_read_unlock();
-               return false;
-       }
+       if (!__mem_cgroup_has_dirty_limit(memcg))
+               goto done;
        __mem_cgroup_dirty_param(&dirty_param, memcg);
-       rcu_read_unlock();
 
-       value = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+       value = mem_cgroup_page_stat(memcg, MEMCG_NR_DIRTYABLE_PAGES);
        if (value < 0)
-               return false;
+               goto done;
 
        available_mem = min((unsigned long)value, sys_available_mem);
 
@@ -1280,17 +1278,21 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
                        (dirty_param.dirty_background_ratio *
                                available_mem) / 100;
 
-       value = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+       value = mem_cgroup_page_stat(memcg, MEMCG_NR_RECLAIM_PAGES);
        if (value < 0)
-               return false;
+               goto done;
        info->nr_reclaimable = value;
 
-       value = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+       value = mem_cgroup_page_stat(memcg, MEMCG_NR_WRITEBACK);
        if (value < 0)
-               return false;
+               goto done;
        info->nr_writeback = value;
 
-       return true;
+       valid = true;
+
+done:
+       rcu_read_unlock();
+       return valid;
 }

static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
@@ -1361,20 +1363,23 @@ memcg_hierarchical_free_pages(struct mem_cgroup *mem)
 
 /*
  * mem_cgroup_page_stat() - get memory cgroup file cache statistics
- * @item:      memory statistic item exported to the kernel
+ * @mem:       optional memory cgroup to query.  If NULL, use current task's
+ *             cgroup.
+ * @item:      memory statistic item exported to the kernel
  *
  * Return the accounted statistic value or negative value if current task is
  * root cgroup.
  */
-long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+long mem_cgroup_page_stat(struct mem_cgroup *mem,
+                          enum mem_cgroup_nr_pages_item item)
 {
        struct mem_cgroup *iter;
-       struct mem_cgroup *mem;
        long value;
 
        get_online_cpus();
        rcu_read_lock();
-       mem = mem_cgroup_from_task(current);
+       if (!mem)
+               mem = mem_cgroup_from_task(current);
        if (__mem_cgroup_has_dirty_limit(mem)) {
                /*
                 * If we're looking for dirtyable pages we need to evaluate
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a477f59..dc3dbe3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -135,7 +135,7 @@ static unsigned long dirty_writeback_pages(void)
 {
        unsigned long ret;
 
-       ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+       ret = mem_cgroup_page_stat(NULL, MEMCG_NR_DIRTY_WRITEBACK_PAGES);
        if ((long)ret < 0)
                ret = global_page_state(NR_UNSTABLE_NFS) +
                        global_page_state(NR_WRITEBACK);
--
1.7.3.1
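The pattern patch 1 introduces, an optional pointer argument that defaults to the caller's own context when NULL, can be sketched outside the kernel as follows (the struct and helper names here are illustrative stand-ins, not anything from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a memcg; in the kernel this is struct mem_cgroup. */
struct group { long nr_dirty; };

static struct group current_group = { .nr_dirty = 7 };

/* Hypothetical analogue of mem_cgroup_from_task(current). */
static struct group *group_from_current(void)
{
        return &current_group;
}

/*
 * Like the patched mem_cgroup_page_stat(): a NULL @g means "query the
 * current task's group"; a non-NULL @g queries that group directly.
 */
static long group_page_stat(struct group *g)
{
        if (!g)
                g = group_from_current();
        return g->nr_dirty;
}
```

The NULL default keeps all existing call sites working unchanged while letting reclaim pass an explicit cgroup.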

2010-11-09 09:25:30

by Greg Thelen

Subject: [PATCH 2/6] memcg: pass mem_cgroup to mem_cgroup_dirty_info()

Pass mem_cgroup parameter through memcg_dirty_info() into
mem_cgroup_dirty_info(). This allows for querying dirty memory
information from a particular cgroup, rather than just the
current task's cgroup.

Signed-off-by: Greg Thelen <[email protected]>
---
 include/linux/memcontrol.h |    2 ++
 mm/memcontrol.c            |    5 +++--
 mm/page-writeback.c        |    9 ++++++---
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 89a9278..a81dfda 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -156,6 +156,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 
 bool mem_cgroup_has_dirty_limit(void);
 bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
+                          struct mem_cgroup *memcg,
                           struct dirty_info *info);
 long mem_cgroup_page_stat(struct mem_cgroup *mem,
                          enum mem_cgroup_nr_pages_item item);
@@ -347,6 +348,7 @@ static inline bool mem_cgroup_has_dirty_limit(void)
 }
 
 static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
+                                        struct mem_cgroup *memcg,
                                         struct dirty_info *info)
 {
        return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1bff7cf..eb621ee 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1239,11 +1239,11 @@ static void __mem_cgroup_dirty_param(struct vm_dirty_param *param,
  * "memcg" will not be freed while holding rcu_read_lock().
  */
 bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
+                          struct mem_cgroup *memcg,
                           struct dirty_info *info)
 {
        struct vm_dirty_param dirty_param;
        unsigned long available_mem;
-       struct mem_cgroup *memcg;
        long value;
        bool valid = false;
 
@@ -1251,7 +1251,8 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
                return false;
 
        rcu_read_lock();
-       memcg = mem_cgroup_from_task(current);
+       if (!memcg)
+               memcg = mem_cgroup_from_task(current);
        if (!__mem_cgroup_has_dirty_limit(memcg))
                goto done;
        __mem_cgroup_dirty_param(&dirty_param, memcg);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index dc3dbe3..d717fa9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -461,12 +461,15 @@ void global_dirty_info(struct dirty_info *info)
  * Calculate the background-writeback and dirty-throttling thresholds and dirty
  * usage metrics from the current task's memcg dirty limit parameters.  Returns
  * false if no memcg limits exist.
+ *
+ * @memcg may be NULL if the current task's memcg should be used.
+ * @info is the location where the dirty information is written.
  */
-static bool memcg_dirty_info(struct dirty_info *info)
+static bool memcg_dirty_info(struct mem_cgroup *memcg, struct dirty_info *info)
 {
        unsigned long available_memory = determine_dirtyable_memory();
 
-       if (!mem_cgroup_dirty_info(available_memory, info))
+       if (!mem_cgroup_dirty_info(available_memory, memcg, info))
                return false;
 
        adjust_dirty_info(info);
@@ -534,7 +537,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 
        global_dirty_info(&sys_info);
 
-       if (!memcg_dirty_info(&memcg_info))
+       if (!memcg_dirty_info(NULL, &memcg_info))
                memcg_info = sys_info;

/*
--
1.7.3.1

2010-11-09 09:25:40

by Greg Thelen

Subject: [PATCH 4/6] memcg: simplify mem_cgroup_page_stat()

The cgroup given to mem_cgroup_page_stat() is not allowed to be
NULL or the root cgroup, so there is no need to complicate the code
by handling those cases.

Signed-off-by: Greg Thelen <[email protected]>
---
 mm/memcontrol.c |   48 ++++++++++++++++++++++--------------------------
 1 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eb621ee..f8df350 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1364,12 +1364,10 @@ memcg_hierarchical_free_pages(struct mem_cgroup *mem)
 
 /*
  * mem_cgroup_page_stat() - get memory cgroup file cache statistics
- * @mem:       optional memory cgroup to query.  If NULL, use current task's
- *             cgroup.
+ * @mem:       memory cgroup to query
  * @item:      memory statistic item exported to the kernel
  *
- * Return the accounted statistic value or negative value if current task is
- * root cgroup.
+ * Return the accounted statistic value.
  */
 long mem_cgroup_page_stat(struct mem_cgroup *mem,
                          enum mem_cgroup_nr_pages_item item)
@@ -1377,29 +1375,27 @@ long mem_cgroup_page_stat(struct mem_cgroup *mem,
        struct mem_cgroup *iter;
        long value;
 
+       VM_BUG_ON(!mem);
+       VM_BUG_ON(mem_cgroup_is_root(mem));
+
        get_online_cpus();
-       rcu_read_lock();
-       if (!mem)
-               mem = mem_cgroup_from_task(current);
-       if (__mem_cgroup_has_dirty_limit(mem)) {
-               /*
-                * If we're looking for dirtyable pages we need to evaluate
-                * free pages depending on the limit and usage of the parents
-                * first of all.
-                */
-               if (item == MEMCG_NR_DIRTYABLE_PAGES)
-                       value = memcg_hierarchical_free_pages(mem);
-               else
-                       value = 0;
-               /*
-                * Recursively evaluate page statistics against all cgroup
-                * under hierarchy tree
-                */
-               for_each_mem_cgroup_tree(iter, mem)
-                       value += mem_cgroup_local_page_stat(iter, item);
-       } else
-               value = -EINVAL;
-       rcu_read_unlock();
+
+       /*
+        * If we're looking for dirtyable pages we need to evaluate
+        * free pages depending on the limit and usage of the parents
+        * first of all.
+        */
+       if (item == MEMCG_NR_DIRTYABLE_PAGES)
+               value = memcg_hierarchical_free_pages(mem);
+       else
+               value = 0;
+       /*
+        * Recursively evaluate page statistics against all cgroup
+        * under hierarchy tree
+        */
+       for_each_mem_cgroup_tree(iter, mem)
+               value += mem_cgroup_local_page_stat(iter, item);
+
        put_online_cpus();
 
        return value;
--
1.7.3.1
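The VM_BUG_ON() calls patch 4 adds turn the removed runtime branches into contract checks. A hedged user-space analogue (invented names, plain assert() standing in for VM_BUG_ON()):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a cgroup; not kernel code. */
struct fake_cgroup { int is_root; long stat; };

/*
 * Once callers guarantee a non-NULL, non-root cgroup, the function can
 * assert the contract instead of branching on it, which is what replacing
 * the NULL/root handling with VM_BUG_ON() accomplishes.
 */
static long fake_page_stat(struct fake_cgroup *g)
{
        assert(g != NULL);      /* like VM_BUG_ON(!mem) */
        assert(!g->is_root);    /* like VM_BUG_ON(mem_cgroup_is_root(mem)) */
        return g->stat;
}
```

The checks document the precondition for future callers while costing nothing in production builds (VM_BUG_ON compiles away unless CONFIG_DEBUG_VM is set).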

2010-11-09 09:25:47

by Greg Thelen

Subject: [PATCH 6/6] memcg: make mem_cgroup_page_stat() return value unsigned

mem_cgroup_page_stat() used to return a negative page count
value to indicate error.

mem_cgroup_page_stat() no longer returns errors, so convert
the return value to the traditional page count type
(unsigned long).

Signed-off-by: Greg Thelen <[email protected]>
---
 include/linux/memcontrol.h |    6 +++---
 mm/memcontrol.c            |   12 ++++++++++--
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a81dfda..3433784 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -158,8 +158,8 @@ bool mem_cgroup_has_dirty_limit(void);
 bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
                           struct mem_cgroup *memcg,
                           struct dirty_info *info);
-long mem_cgroup_page_stat(struct mem_cgroup *mem,
-                         enum mem_cgroup_nr_pages_item item);
+unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
+                                  enum mem_cgroup_nr_pages_item item);
 
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
                                                gfp_t gfp_mask);
@@ -354,7 +354,7 @@ static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
        return false;
 }
 
-static inline long mem_cgroup_page_stat(struct mem_cgroup *mem,
+static inline unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
                                        enum mem_cgroup_nr_pages_item item)
 {
        return -ENOSYS;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ccdbb7e..ed070d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1361,8 +1361,8 @@ memcg_hierarchical_free_pages(struct mem_cgroup *mem)
  *
  * Return the accounted statistic value.
  */
-long mem_cgroup_page_stat(struct mem_cgroup *mem,
-                         enum mem_cgroup_nr_pages_item item)
+unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
+                                  enum mem_cgroup_nr_pages_item item)
 {
        struct mem_cgroup *iter;
        long value;
@@ -1388,6 +1388,14 @@ long mem_cgroup_page_stat(struct mem_cgroup *mem,
        for_each_mem_cgroup_tree(iter, mem)
                value += mem_cgroup_local_page_stat(iter, item);
 
+       /*
+        * The sum of unlocked per-cpu counters may yield a slightly negative
+        * value.  This function returns an unsigned value, so round it up to
+        * zero to avoid returning a very large value.
+        */
+       if (value < 0)
+               value = 0;
+
        put_online_cpus();
 
        return value;
--
1.7.3.1

2010-11-09 09:25:54

by Greg Thelen

Subject: [PATCH 5/6] memcg: simplify mem_cgroup_dirty_info()

Because mem_cgroup_page_stat() no longer returns negative numbers
to indicate failure, mem_cgroup_dirty_info() does not need to check
for such failures.

Signed-off-by: Greg Thelen <[email protected]>
---
 mm/memcontrol.c |   14 +++-----------
 1 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f8df350..ccdbb7e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1258,8 +1258,6 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
        __mem_cgroup_dirty_param(&dirty_param, memcg);
 
        value = mem_cgroup_page_stat(memcg, MEMCG_NR_DIRTYABLE_PAGES);
-       if (value < 0)
-               goto done;
 
        available_mem = min((unsigned long)value, sys_available_mem);
 
@@ -1279,15 +1277,9 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
                        (dirty_param.dirty_background_ratio *
                                available_mem) / 100;
 
-       value = mem_cgroup_page_stat(memcg, MEMCG_NR_RECLAIM_PAGES);
-       if (value < 0)
-               goto done;
-       info->nr_reclaimable = value;
-
-       value = mem_cgroup_page_stat(memcg, MEMCG_NR_WRITEBACK);
-       if (value < 0)
-               goto done;
-       info->nr_writeback = value;
+       info->nr_reclaimable =
+               mem_cgroup_page_stat(memcg, MEMCG_NR_RECLAIM_PAGES);
+       info->nr_writeback = mem_cgroup_page_stat(memcg, MEMCG_NR_WRITEBACK);
 
        valid = true;

--
1.7.3.1

2010-11-09 09:26:27

by Greg Thelen

Subject: [PATCH 3/6] memcg: make throttle_vm_writeout() memcg aware

If called with a mem_cgroup, then throttle_vm_writeout() should
query the given cgroup for its dirty memory usage limits.

dirty_writeback_pages() is no longer used, so delete it.

Signed-off-by: Greg Thelen <[email protected]>
---
 include/linux/writeback.h |    2 +-
 mm/page-writeback.c       |   31 ++++++++++++++++---------------
 mm/vmscan.c               |    2 +-
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 335dba1..1bacdda 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -97,7 +97,7 @@ void laptop_mode_timer_fn(unsigned long data);
 #else
 static inline void laptop_sync_completion(void) { }
 #endif
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d717fa9..bf85062 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -131,18 +131,6 @@ EXPORT_SYMBOL(laptop_mode);
 static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
-static unsigned long dirty_writeback_pages(void)
-{
-       unsigned long ret;
-
-       ret = mem_cgroup_page_stat(NULL, MEMCG_NR_DIRTY_WRITEBACK_PAGES);
-       if ((long)ret < 0)
-               ret = global_page_state(NR_UNSTABLE_NFS) +
-                       global_page_state(NR_WRITEBACK);
-
-       return ret;
-}
-
 /*
  * couple the period to the dirty_ratio:
  *
@@ -703,12 +691,25 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+/*
+ * Throttle the current task if it is near dirty memory usage limits.
+ * If @mem_cgroup is NULL or the root_cgroup, then use global dirty memory
+ * information; otherwise use the per-memcg dirty limits.
+ */
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup)
 {
        struct dirty_info dirty_info;
+       unsigned long nr_writeback;
 
        for ( ; ; ) {
-               global_dirty_info(&dirty_info);
+               if (!mem_cgroup || !memcg_dirty_info(mem_cgroup, &dirty_info)) {
+                       global_dirty_info(&dirty_info);
+                       nr_writeback = global_page_state(NR_UNSTABLE_NFS) +
+                               global_page_state(NR_WRITEBACK);
+               } else {
+                       nr_writeback = mem_cgroup_page_stat(
+                               mem_cgroup, MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+               }
 
                /*
                 * Boost the allowable dirty threshold a bit for page
@@ -717,7 +718,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                dirty_info.dirty_thresh +=
                        dirty_info.dirty_thresh / 10;   /* wheeee... */
 
-               if (dirty_writeback_pages() <= dirty_info.dirty_thresh)
+               if (nr_writeback <= dirty_info.dirty_thresh)
                        break;
                congestion_wait(BLK_RW_ASYNC, HZ/10);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d84858..8cc90d5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1844,7 +1844,7 @@ static void shrink_zone(int priority, struct zone *zone,
        if (inactive_anon_is_low(zone, sc))
                shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
 
-       throttle_vm_writeout(sc->gfp_mask);
+       throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup);
 }

/*
--
1.7.3.1
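The exit test of the patched throttle loop (boost the threshold by 10%, then compare the writeback page count against it) can be sketched in isolation; the struct and values below are illustrative, not the kernel's:

```c
#include <assert.h>

/* Illustrative stand-in for the kernel's struct dirty_info. */
struct dirty_info_sketch { unsigned long dirty_thresh; };

/*
 * Mirror of the loop's decision: throttle (return 1) while the writeback
 * page count exceeds the threshold boosted by 10%, as in
 * "dirty_info.dirty_thresh += dirty_info.dirty_thresh / 10".
 */
static int should_throttle(unsigned long nr_writeback,
                           const struct dirty_info_sketch *info)
{
        unsigned long thresh = info->dirty_thresh;

        thresh += thresh / 10;  /* boost the allowable threshold a bit */
        return nr_writeback > thresh;
}
```

In the kernel, nr_writeback comes either from the global NR_UNSTABLE_NFS + NR_WRITEBACK counters or from the memcg's MEMCG_NR_DIRTY_WRITEBACK_PAGES statistic, depending on whether a memcg with dirty limits was passed in.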

2010-11-09 12:15:15

by Fengguang Wu

Subject: Re: [PATCH 6/6] memcg: make mem_cgroup_page_stat() return value unsigned

On Tue, Nov 09, 2010 at 05:24:31PM +0800, Greg Thelen wrote:
> mem_cgroup_page_stat() used to return a negative page count
> value to indicate error.
>
> mem_cgroup_page_stat() has changed so it never returns
> error so convert the return value to the traditional page
> count type (unsigned long).
>
> Signed-off-by: Greg Thelen <[email protected]>
> ---

> + /*
> + * The sum of unlocked per-cpu counters may yield a slightly negative
> + * value. This function returns an unsigned value, so round it up to
> + * zero to avoid returning a very large value.
> + */
> + if (value < 0)
> + value = 0;

nitpick: it's a good candidate for unlikely().

Reviewed-by: Wu Fengguang <[email protected]>

Sorry, I lost track of the source code after so many patches.
It would help if you could put the patches in a git tree.

Thanks,
Fengguang
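For readers unfamiliar with the nitpick: the suggestion is to wrap the clamp's condition in the kernel's unlikely() branch-prediction hint. A user-space sketch using the underlying GCC builtin (the function name here is invented for the example):

```c
#include <assert.h>

/* User-space stand-in for the kernel's unlikely() annotation. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Clamp a possibly-negative per-cpu sum to zero before returning it as an
 * unsigned count, marking the negative case as the cold path so the
 * compiler lays out the common (non-negative) path first.
 */
static unsigned long clamp_page_stat(long value)
{
        if (unlikely(value < 0))
                value = 0;
        return (unsigned long)value;
}
```

The hint does not change behavior, only code layout, so it is purely a micro-optimization for the rare transiently-negative case.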

2010-11-09 22:53:48

by Minchan Kim

Subject: Re: [PATCH 1/6] memcg: add mem_cgroup parameter to mem_cgroup_page_stat()

On Tue, Nov 9, 2010 at 6:24 PM, Greg Thelen <[email protected]> wrote:
> This new parameter can be used to query dirty memory usage
> from a given memcg rather than the current task's memcg.
>
> Signed-off-by: Greg Thelen <[email protected]>
> ---
>  include/linux/memcontrol.h |    6 ++++--
>  mm/memcontrol.c            |   37 +++++++++++++++++++++----------------
>  mm/page-writeback.c        |    2 +-
>  3 files changed, 26 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7a3d915..89a9278 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -157,7 +157,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  bool mem_cgroup_has_dirty_limit(void);
>  bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>                            struct dirty_info *info);
> -long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> +long mem_cgroup_page_stat(struct mem_cgroup *mem,
> +                          enum mem_cgroup_nr_pages_item item);
>
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                             gfp_t gfp_mask);
> @@ -351,7 +352,8 @@ static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>         return false;
>  }
>
> -static inline long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> +static inline long mem_cgroup_page_stat(struct mem_cgroup *mem,
> +                                        enum mem_cgroup_nr_pages_item item)
>  {
>         return -ENOSYS;
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d8a06d6..1bff7cf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1245,22 +1245,20 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>         unsigned long available_mem;
>         struct mem_cgroup *memcg;
>         long value;
> +       bool valid = false;
>
>         if (mem_cgroup_disabled())
>                 return false;
>
>         rcu_read_lock();
>         memcg = mem_cgroup_from_task(current);
> -       if (!__mem_cgroup_has_dirty_limit(memcg)) {
> -               rcu_read_unlock();
> -               return false;
> -       }
> +       if (!__mem_cgroup_has_dirty_limit(memcg))
> +               goto done;
>         __mem_cgroup_dirty_param(&dirty_param, memcg);
> -       rcu_read_unlock();
>
> -       value = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +       value = mem_cgroup_page_stat(memcg, MEMCG_NR_DIRTYABLE_PAGES);
>         if (value < 0)
> -               return false;
> +               goto done;
>
>         available_mem = min((unsigned long)value, sys_available_mem);
>
> @@ -1280,17 +1278,21 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>                         (dirty_param.dirty_background_ratio *
>                                 available_mem) / 100;
>
> -       value = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +       value = mem_cgroup_page_stat(memcg, MEMCG_NR_RECLAIM_PAGES);
>         if (value < 0)
> -               return false;
> +               goto done;
>         info->nr_reclaimable = value;
>
> -       value = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +       value = mem_cgroup_page_stat(memcg, MEMCG_NR_WRITEBACK);
>         if (value < 0)
> -               return false;
> +               goto done;
>         info->nr_writeback = value;
>
> -       return true;
> +       valid = true;
> +
> +done:
> +       rcu_read_unlock();
> +       return valid;
>  }
>
>  static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> @@ -1361,20 +1363,23 @@ memcg_hierarchical_free_pages(struct mem_cgroup *mem)
>
>  /*
>   * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> - * @item:      memory statistic item exported to the kernel
> + * @mem:       optional memory cgroup to query.  If NULL, use current task's
> + *             cgroup.
> + * @item:      memory statistic item exported to the kernel
>   *
>   * Return the accounted statistic value or negative value if current task is
>   * root cgroup.
>   */
> -long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> +long mem_cgroup_page_stat(struct mem_cgroup *mem,
> +                          enum mem_cgroup_nr_pages_item item)
>  {
>         struct mem_cgroup *iter;
> -       struct mem_cgroup *mem;
>         long value;
>
>         get_online_cpus();
>         rcu_read_lock();
> -       mem = mem_cgroup_from_task(current);
> +       if (!mem)
> +               mem = mem_cgroup_from_task(current);
>         if (__mem_cgroup_has_dirty_limit(mem)) {
>                 /*
>                  * If we're looking for dirtyable pages we need to evaluate
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a477f59..dc3dbe3 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -135,7 +135,7 @@ static unsigned long dirty_writeback_pages(void)
>  {
>         unsigned long ret;
>
> -       ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +       ret = mem_cgroup_page_stat(NULL, MEMCG_NR_DIRTY_WRITEBACK_PAGES);
>         if ((long)ret < 0)
>                 ret = global_page_state(NR_UNSTABLE_NFS) +
>                         global_page_state(NR_WRITEBACK);
> --
> 1.7.3.1
>

I didn't look at the later patches, so this might have changed.

Now all callers of mem_cgroup_page_stat() except
dirty_writeback_pages() hold rcu_read_lock, and
mem_cgroup_page_stat() itself takes rcu_read_lock again.
Couldn't we remove the duplicated rcu lock by adding rcu_read_lock to
dirty_writeback_pages(), for consistency?


--
Kind regards,
Minchan Kim

2010-11-09 23:09:13

by Minchan Kim

Subject: Re: [PATCH 2/6] memcg: pass mem_cgroup to mem_cgroup_dirty_info()

On Tue, Nov 9, 2010 at 6:24 PM, Greg Thelen <[email protected]> wrote:
> Pass mem_cgroup parameter through memcg_dirty_info() into
> mem_cgroup_dirty_info().  This allows for querying dirty memory
> information from a particular cgroup, rather than just the
> current task's cgroup.
>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2010-11-09 23:22:47

by Minchan Kim

Subject: Re: [PATCH 3/6] memcg: make throttle_vm_writeout() memcg aware

On Tue, Nov 9, 2010 at 6:24 PM, Greg Thelen <[email protected]> wrote:
> If called with a mem_cgroup, then throttle_vm_writeout() should
> query the given cgroup for its dirty memory usage limits.
>
> dirty_writeback_pages() is no longer used, so delete it.
>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

> ---
>  include/linux/writeback.h |    2 +-
>  mm/page-writeback.c       |   31 ++++++++++++++++---------------
>  mm/vmscan.c               |    2 +-
>  3 files changed, 18 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 335dba1..1bacdda 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -97,7 +97,7 @@ void laptop_mode_timer_fn(unsigned long data);
>  #else
>  static inline void laptop_sync_completion(void) { }
>  #endif
> -void throttle_vm_writeout(gfp_t gfp_mask);
> +void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup);
>
>  /* These are exported to sysctl. */
>  extern int dirty_background_ratio;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d717fa9..bf85062 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -131,18 +131,6 @@ EXPORT_SYMBOL(laptop_mode);
>  static struct prop_descriptor vm_completions;
>  static struct prop_descriptor vm_dirties;
>
> -static unsigned long dirty_writeback_pages(void)
> -{
> -       unsigned long ret;
> -
> -       ret = mem_cgroup_page_stat(NULL, MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> -       if ((long)ret < 0)
> -               ret = global_page_state(NR_UNSTABLE_NFS) +
> -                       global_page_state(NR_WRITEBACK);
> -
> -       return ret;
> -}
> -

Nice cleanup.

>  /*
>   * couple the period to the dirty_ratio:
>   *
> @@ -703,12 +691,25 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>
> -void throttle_vm_writeout(gfp_t gfp_mask)
> +/*
> + * Throttle the current task if it is near dirty memory usage limits.
> + * If @mem_cgroup is NULL or the root_cgroup, then use global dirty memory
> + * information; otherwise use the per-memcg dirty limits.
> + */
> +void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup)
>  {
>         struct dirty_info dirty_info;
> +       unsigned long nr_writeback;
>
>         for ( ; ; ) {
> -               global_dirty_info(&dirty_info);
> +               if (!mem_cgroup || !memcg_dirty_info(mem_cgroup, &dirty_info)) {
> +                       global_dirty_info(&dirty_info);
> +                       nr_writeback = global_page_state(NR_UNSTABLE_NFS) +
> +                               global_page_state(NR_WRITEBACK);
> +               } else {
> +                       nr_writeback = mem_cgroup_page_stat(
> +                               mem_cgroup, MEMCG_NR_DIRTY_WRITEBACK_PAGES);

From the point of view of rcu_read_lock removal: the memcg can't be
destroyed here, thanks to mem_cgroup_select_victim's css_tryget?
Then we could remove the now-unnecessary rcu_read_lock in mem_cgroup_page_stat().



--
Kind regards,
Minchan Kim

2010-11-09 23:36:33

by Minchan Kim

Subject: Re: [PATCH 4/6] memcg: simplify mem_cgroup_page_stat()

On Tue, Nov 9, 2010 at 6:24 PM, Greg Thelen <[email protected]> wrote:
> The cgroup given to mem_cgroup_page_stat() is not allowed to be
> NULL or the root cgroup.  So there is no need to complicate the code
> handling those cases.
>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

You already did what I wanted. :)
I should have commented after seeing all the patches.

Thanks.
--
Kind regards,
Minchan Kim

2010-11-10 00:51:27

by Minchan Kim

Subject: Re: [PATCH 1/6] memcg: add mem_cgroup parameter to mem_cgroup_page_stat()

On Tue, Nov 9, 2010 at 6:24 PM, Greg Thelen <[email protected]> wrote:
> This new parameter can be used to query dirty memory usage
> from a given memcg rather than the current task's memcg.
>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>


--
Kind regards,
Minchan Kim

2010-11-10 01:01:19

by Minchan Kim

Subject: Re: [PATCH 5/6] memcg: simplify mem_cgroup_dirty_info()

On Tue, Nov 9, 2010 at 6:24 PM, Greg Thelen <[email protected]> wrote:
> Because mem_cgroup_page_stat() no longer returns negative numbers

For a while I tried to understand why mem_cgroup_page_stat() no longer
returns negative numbers.
I couldn't find the answer in the patches up through 5/6; the answer is
in 6/6.
It would be better to reorder 6/6 before 5/6.


> to indicate failure, mem_cgroup_dirty_info() does not need to check
> for such failures.
>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2010-11-10 01:04:31

by Minchan Kim

Subject: Re: [PATCH 6/6] memcg: make mem_cgroup_page_stat() return value unsigned

On Tue, Nov 9, 2010 at 6:24 PM, Greg Thelen <[email protected]> wrote:
> mem_cgroup_page_stat() used to return a negative page count
> value to indicate error.
>
> mem_cgroup_page_stat() has changed so it never returns
> error so convert the return value to the traditional page
> count type (unsigned long).
>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

This review seems late, by just one day.
I don't know why Andrew merged it in a hurry without any review.
Anyway, I add my Reviewed-by to this series since I can't find
any big bugs.

Thanks.
--
Kind regards,
Minchan Kim

2010-11-12 08:18:29

by Johannes Weiner

Subject: Re: [PATCH 3/6] memcg: make throttle_vm_writeout() memcg aware

On Tue, Nov 09, 2010 at 01:24:28AM -0800, Greg Thelen wrote:
> If called with a mem_cgroup, then throttle_vm_writeout() should
> query the given cgroup for its dirty memory usage limits.
>
> dirty_writeback_pages() is no longer used, so delete it.
>
> Signed-off-by: Greg Thelen <[email protected]>
> ---
>  include/linux/writeback.h |    2 +-
>  mm/page-writeback.c       |   31 ++++++++++++++++---------------
>  mm/vmscan.c               |    2 +-
>  3 files changed, 18 insertions(+), 17 deletions(-)

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d717fa9..bf85062 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -131,18 +131,6 @@ EXPORT_SYMBOL(laptop_mode);
>  static struct prop_descriptor vm_completions;
>  static struct prop_descriptor vm_dirties;
>
> -static unsigned long dirty_writeback_pages(void)
> -{
> -       unsigned long ret;
> -
> -       ret = mem_cgroup_page_stat(NULL, MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> -       if ((long)ret < 0)
> -               ret = global_page_state(NR_UNSTABLE_NFS) +
> -                       global_page_state(NR_WRITEBACK);

There are two bugfixes in this patch. One is getting rid of this
fallback to global numbers that are compared to memcg limits. The
other one is that reclaim will now throttle writeout based on the
cgroup it runs on behalf of, instead of that of the current task.

Both are undocumented and should arguably not even be in the same
patch...?

> @@ -703,12 +691,25 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> }
> EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>
> -void throttle_vm_writeout(gfp_t gfp_mask)
> +/*
> + * Throttle the current task if it is near dirty memory usage limits.
> + * If @mem_cgroup is NULL or the root_cgroup, then use global dirty memory
> + * information; otherwise use the per-memcg dirty limits.
> + */
> +void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup)
> {
> struct dirty_info dirty_info;
> + unsigned long nr_writeback;
>
> for ( ; ; ) {
> - global_dirty_info(&dirty_info);
> + if (!mem_cgroup || !memcg_dirty_info(mem_cgroup, &dirty_info)) {
> + global_dirty_info(&dirty_info);
> + nr_writeback = global_page_state(NR_UNSTABLE_NFS) +
> + global_page_state(NR_WRITEBACK);
> + } else {
> + nr_writeback = mem_cgroup_page_stat(
> + mem_cgroup, MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> + }

Odd branch ordering, but I may be OCDing again.

if (mem_cgroup && memcg_dirty_info())
do_mem_cgroup_stuff()
else
do_global_stuff()

would be more natural, IMO.

2010-11-12 08:20:09

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 4/6] memcg: simplify mem_cgroup_page_stat()

On Tue, Nov 09, 2010 at 01:24:29AM -0800, Greg Thelen wrote:
> The cgroup given to mem_cgroup_page_stat() is not allowed to be
> NULL or the root cgroup. So there is no need to complicate the code
> handling those cases.
>
> Signed-off-by: Greg Thelen <[email protected]>
> ---
> mm/memcontrol.c | 48 ++++++++++++++++++++++--------------------------
> 1 files changed, 22 insertions(+), 26 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index eb621ee..f8df350 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1364,12 +1364,10 @@ memcg_hierarchical_free_pages(struct mem_cgroup *mem)
>
> /*
> * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> - * @mem: optional memory cgroup to query. If NULL, use current task's
> - * cgroup.
> + * @mem: memory cgroup to query
> * @item: memory statistic item exported to the kernel
> *
> - * Return the accounted statistic value or negative value if current task is
> - * root cgroup.
> + * Return the accounted statistic value.
> */
> long mem_cgroup_page_stat(struct mem_cgroup *mem,
> enum mem_cgroup_nr_pages_item item)
> @@ -1377,29 +1375,27 @@ long mem_cgroup_page_stat(struct mem_cgroup *mem,
> struct mem_cgroup *iter;
> long value;
>
> + VM_BUG_ON(!mem);
> + VM_BUG_ON(mem_cgroup_is_root(mem));
> +
> get_online_cpus();
> - rcu_read_lock();
> - if (!mem)
> - mem = mem_cgroup_from_task(current);
> - if (__mem_cgroup_has_dirty_limit(mem)) {

What about mem->use_hierarchy that is checked in
__mem_cgroup_has_dirty_limit()? Is it no longer needed?

2010-11-12 08:22:09

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 5/6] memcg: simplify mem_cgroup_dirty_info()

On Tue, Nov 09, 2010 at 01:24:30AM -0800, Greg Thelen wrote:
> Because mem_cgroup_page_stat() no longer returns negative numbers
> to indicate failure, mem_cgroup_dirty_info() does not need to check
> for such failures.

This is simply not true at this point in time. Patch ordering is not
optional.

2010-11-12 08:29:41

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 6/6] memcg: make mem_cgroup_page_stat() return value unsigned

On Tue, Nov 09, 2010 at 01:24:31AM -0800, Greg Thelen wrote:
> mem_cgroup_page_stat() used to return a negative page count
> value to indicate value.

Whoops :)

> mem_cgroup_page_stat() has changed so it never returns
> error so convert the return value to the traditional page
> count type (unsigned long).

This changelog feels a bit beside the point.

What's really interesting is that we now don't consider negative sums
to be invalid anymore, but just assume zero! There is a real
semantical change here.

That the return type can then be changed to unsigned long is a nice
follow-up cleanup that happens to be folded into this patch.

2010-11-12 20:40:08

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 3/6] memcg: make throttle_vm_writeout() memcg aware

Johannes Weiner <[email protected]> writes:

> On Tue, Nov 09, 2010 at 01:24:28AM -0800, Greg Thelen wrote:
>> If called with a mem_cgroup, then throttle_vm_writeout() should
>> query the given cgroup for its dirty memory usage limits.
>>
>> dirty_writeback_pages() is no longer used, so delete it.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>> ---
>> include/linux/writeback.h | 2 +-
>> mm/page-writeback.c | 31 ++++++++++++++++---------------
>> mm/vmscan.c | 2 +-
>> 3 files changed, 18 insertions(+), 17 deletions(-)
>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index d717fa9..bf85062 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -131,18 +131,6 @@ EXPORT_SYMBOL(laptop_mode);
>> static struct prop_descriptor vm_completions;
>> static struct prop_descriptor vm_dirties;
>>
>> -static unsigned long dirty_writeback_pages(void)
>> -{
>> - unsigned long ret;
>> -
>> - ret = mem_cgroup_page_stat(NULL, MEMCG_NR_DIRTY_WRITEBACK_PAGES);
>> - if ((long)ret < 0)
>> - ret = global_page_state(NR_UNSTABLE_NFS) +
>> - global_page_state(NR_WRITEBACK);
>
> There are two bugfixes in this patch. One is getting rid of this
> fallback to global numbers that are compared to memcg limits. The
> other one is that reclaim will now throttle writeout based on the
> cgroup it runs on behalf of, instead of that of the current task.
>
> Both are undocumented and should arguably not even be in the same
> patch...?

I will better document these changes in the commit message and I will
split the change into two patches for clarity.

- sub-patch 1 will change throttle_vm_writeout() to only consider global
usage and limits. This would remove memcg consideration from
throttle_vm_writeout() and thus ensure that only global limits are
compared to global usage.

- sub-patch 2 will introduce memcg consideration consistently into
throttle_vm_writeout(). This will allow throttle_vm_writeout() to
consider memcg usage and limits, but they will be uniformly applied:
memcg usage will not be compared to global limits.

>> @@ -703,12 +691,25 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
>> }
>> EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>>
>> -void throttle_vm_writeout(gfp_t gfp_mask)
>> +/*
>> + * Throttle the current task if it is near dirty memory usage limits.
>> + * If @mem_cgroup is NULL or the root_cgroup, then use global dirty memory
>> + * information; otherwise use the per-memcg dirty limits.
>> + */
>> +void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup)
>> {
>> struct dirty_info dirty_info;
>> + unsigned long nr_writeback;
>>
>> for ( ; ; ) {
>> - global_dirty_info(&dirty_info);
>> + if (!mem_cgroup || !memcg_dirty_info(mem_cgroup, &dirty_info)) {
>> + global_dirty_info(&dirty_info);
>> + nr_writeback = global_page_state(NR_UNSTABLE_NFS) +
>> + global_page_state(NR_WRITEBACK);
>> + } else {
>> + nr_writeback = mem_cgroup_page_stat(
>> + mem_cgroup, MEMCG_NR_DIRTY_WRITEBACK_PAGES);
>> + }
>
> Odd branch ordering, but I may be OCDing again.
>
> if (mem_cgroup && memcg_dirty_info())
> do_mem_cgroup_stuff()
> else
> do_global_stuff()
>
> would be more natural, IMO.

I agree. I will resubmit this series with your improved branch ordering.

2010-11-12 20:41:15

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 5/6] memcg: simplify mem_cgroup_dirty_info()

Johannes Weiner <[email protected]> writes:

> On Tue, Nov 09, 2010 at 01:24:30AM -0800, Greg Thelen wrote:
>> Because mem_cgroup_page_stat() no longer returns negative numbers
>> to indicate failure, mem_cgroup_dirty_info() does not need to check
>> for such failures.
>
> This is simply not true at this point in time. Patch ordering is not
> optional.

Thanks. Patch order will be corrected.

2010-11-12 20:41:14

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 4/6] memcg: simplify mem_cgroup_page_stat()

Johannes Weiner <[email protected]> writes:

> On Tue, Nov 09, 2010 at 01:24:29AM -0800, Greg Thelen wrote:
>> The cgroup given to mem_cgroup_page_stat() is not allowed to be
>> NULL or the root cgroup. So there is no need to complicate the code
>> handling those cases.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>> ---
>> mm/memcontrol.c | 48 ++++++++++++++++++++++--------------------------
>> 1 files changed, 22 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index eb621ee..f8df350 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1364,12 +1364,10 @@ memcg_hierarchical_free_pages(struct mem_cgroup *mem)
>>
>> /*
>> * mem_cgroup_page_stat() - get memory cgroup file cache statistics
>> - * @mem: optional memory cgroup to query. If NULL, use current task's
>> - * cgroup.
>> + * @mem: memory cgroup to query
>> * @item: memory statistic item exported to the kernel
>> *
>> - * Return the accounted statistic value or negative value if current task is
>> - * root cgroup.
>> + * Return the accounted statistic value.
>> */
>> long mem_cgroup_page_stat(struct mem_cgroup *mem,
>> enum mem_cgroup_nr_pages_item item)
>> @@ -1377,29 +1375,27 @@ long mem_cgroup_page_stat(struct mem_cgroup *mem,
>> struct mem_cgroup *iter;
>> long value;
>>
>> + VM_BUG_ON(!mem);
>> + VM_BUG_ON(mem_cgroup_is_root(mem));
>> +
>> get_online_cpus();
>> - rcu_read_lock();
>> - if (!mem)
>> - mem = mem_cgroup_from_task(current);
>> - if (__mem_cgroup_has_dirty_limit(mem)) {
>
> What about mem->use_hierarchy that is checked in
> __mem_cgroup_has_dirty_limit()? Is it no longer needed?

It is no longer needed because the callers of mem_cgroup_page_stat()
call __mem_cgroup_has_dirty_limit(). In the current implementation, if
use_hierarchy=1 then the cgroup does not have dirty limits, so calls
into mem_cgroup_page_stat() are avoided. Specifically the callers of
mem_cgroup_page_stat() are:

1. mem_cgroup_dirty_info() which calls __mem_cgroup_has_dirty_limit()
and returns false if use_hierarchy=1.

2. throttle_vm_writeout(), which calls memcg_dirty_info() ->
mem_cgroup_dirty_info() -> __mem_cgroup_has_dirty_limit() and falls
back to global limits if use_hierarchy=1.

2010-11-12 20:41:40

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 6/6] memcg: make mem_cgroup_page_stat() return value unsigned

Johannes Weiner <[email protected]> writes:

> On Tue, Nov 09, 2010 at 01:24:31AM -0800, Greg Thelen wrote:
>> mem_cgroup_page_stat() used to return a negative page count
>> value to indicate value.
>
> Whoops :)
>
>> mem_cgroup_page_stat() has changed so it never returns
>> error so convert the return value to the traditional page
>> count type (unsigned long).
>
> This changelog feels a bit beside the point.
>
> What's really interesting is that we now don't consider negative sums
> to be invalid anymore, but just assume zero! There is a real
> semantical change here.

Prior to this patch series mem_cgroup_page_stat() returned a negative
value (specifically -EINVAL) to indicate that the current task was in
the root_cgroup and thus the per-cgroup usage and limit counter were
invalid. Callers treated any negative value as an indication that
the task was in the root cgroup.

Unfortunately there was another way that mem_cgroup_page_stat() could
return a negative value even when current was not in the root cgroup.
Negative sums were a possibility due to summing of unsynchronized
per-cpu counters. These occasional negative sums would fool callers
into thinking that the current task was in the root cgroup.

Would adding this description to the commit message address your
concerns?

> That the return type can then be changed to unsigned long is a nice
> follow-up cleanup that happens to be folded into this patch.

Good point. I can separate the change into two sub-patches:
1. use zero for a min-value (as described above)
2. change return value to unsigned

2010-11-16 03:55:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/6] memcg: add mem_cgroup parameter to mem_cgroup_page_stat()

On Tue, 9 Nov 2010 01:24:26 -0800
Greg Thelen <[email protected]> wrote:

> This new parameter can be used to query dirty memory usage
> from a given memcg rather than the current task's memcg.
>
> Signed-off-by: Greg Thelen <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2010-11-16 03:56:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/6] memcg: pass mem_cgroup to mem_cgroup_dirty_info()

On Tue, 9 Nov 2010 01:24:27 -0800
Greg Thelen <[email protected]> wrote:

> Pass mem_cgroup parameter through memcg_dirty_info() into
> mem_cgroup_dirty_info(). This allows for querying dirty memory
> information from a particular cgroup, rather than just the
> current task's cgroup.
>
> Signed-off-by: Greg Thelen <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2010-11-16 04:03:03

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/6] memcg: make throttle_vm_writeout() memcg aware

On Fri, 12 Nov 2010 12:39:35 -0800
Greg Thelen <[email protected]> wrote:

> > Odd branch ordering, but I may be OCDing again.
> >
> > if (mem_cgroup && memcg_dirty_info())
> > do_mem_cgroup_stuff()
> > else
> > do_global_stuff()
> >
> > would be more natural, IMO.
>
> I agree. I will resubmit this series with your improved branch ordering.
>

Hmm. I think this patch is troublesome.

This patch will make memcg's pageout routine _not_ throttled even when the whole
system vmscan's pageout is throttled.

So, one idea is....

Make this change
==
+++ b/mm/vmscan.c
@@ -1844,7 +1844,7 @@ static void shrink_zone(int priority, struct zone *zone,
if (inactive_anon_is_low(zone, sc))
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

- throttle_vm_writeout(sc->gfp_mask);
+ throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup);
}
==
as

==

if (!sc->mem_cgroup || throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup) == not throttled)
        throttle_vm_writeout(sc->gfp_mask, NULL);

Then, both of memcg and global dirty thresh will be checked.


-Kame

2010-11-16 04:04:24

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/6] memcg: simplify mem_cgroup_page_stat()

On Tue, 9 Nov 2010 01:24:29 -0800
Greg Thelen <[email protected]> wrote:

> The cgroup given to mem_cgroup_page_stat() is not allowed to be
> NULL or the root cgroup. So there is no need to complicate the code
> handling those cases.
>
> Signed-off-by: Greg Thelen <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2010-11-16 04:06:03

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 6/6] memcg: make mem_cgroup_page_stat() return value unsigned

On Tue, 9 Nov 2010 01:24:31 -0800
Greg Thelen <[email protected]> wrote:

> mem_cgroup_page_stat() used to return a negative page count
> value to indicate value.
>
> mem_cgroup_page_stat() has changed so it never returns
> error so convert the return value to the traditional page
> count type (unsigned long).
>
> Signed-off-by: Greg Thelen <[email protected]>

Okay.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2010-11-19 11:17:14

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 3/6] memcg: make throttle_vm_writeout() memcg aware

Hello,

On Tue, Nov 16, 2010 at 12:57:26PM +0900, KAMEZAWA Hiroyuki wrote:
> Hmm. I think this patch is troublesome.
>
> This patch will make memcg's pageout routine _not_ throttled even when the whole
> system vmscan's pageout is throttled.
>
> So, one idea is....
>
> Make this change
> ==
> +++ b/mm/vmscan.c
> @@ -1844,7 +1844,7 @@ static void shrink_zone(int priority, struct zone *zone,
> if (inactive_anon_is_low(zone, sc))
> shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>
> - throttle_vm_writeout(sc->gfp_mask);
> + throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup);
> }
> ==
> as
>
> ==
>
> if (!sc->mem_cgroup || throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup) == not throttled)
> throttle_vm_writeout(sc->gfp_mask, NULL);
>
> Then, both of memcg and global dirty thresh will be checked.

Good point, both limits should apply.

I'd prefer to stuff it all into throttle_vm_writeout() and not encode
memcg-specific behaviour into the caller, though.

2010-11-19 11:22:31

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 4/6] memcg: simplify mem_cgroup_page_stat()

On Fri, Nov 12, 2010 at 12:40:22PM -0800, Greg Thelen wrote:
> Johannes Weiner <[email protected]> writes:
>
> > On Tue, Nov 09, 2010 at 01:24:29AM -0800, Greg Thelen wrote:
> >> The cgroup given to mem_cgroup_page_stat() is not allowed to be
> >> NULL or the root cgroup. So there is no need to complicate the code
> >> handling those cases.
> >>
> >> Signed-off-by: Greg Thelen <[email protected]>
> >> ---
> >> mm/memcontrol.c | 48 ++++++++++++++++++++++--------------------------
> >> 1 files changed, 22 insertions(+), 26 deletions(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index eb621ee..f8df350 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -1364,12 +1364,10 @@ memcg_hierarchical_free_pages(struct mem_cgroup *mem)
> >>
> >> /*
> >> * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> >> - * @mem: optional memory cgroup to query. If NULL, use current task's
> >> - * cgroup.
> >> + * @mem: memory cgroup to query
> >> * @item: memory statistic item exported to the kernel
> >> *
> >> - * Return the accounted statistic value or negative value if current task is
> >> - * root cgroup.
> >> + * Return the accounted statistic value.
> >> */
> >> long mem_cgroup_page_stat(struct mem_cgroup *mem,
> >> enum mem_cgroup_nr_pages_item item)
> >> @@ -1377,29 +1375,27 @@ long mem_cgroup_page_stat(struct mem_cgroup *mem,
> >> struct mem_cgroup *iter;
> >> long value;
> >>
> >> + VM_BUG_ON(!mem);
> >> + VM_BUG_ON(mem_cgroup_is_root(mem));
> >> +
> >> get_online_cpus();
> >> - rcu_read_lock();
> >> - if (!mem)
> >> - mem = mem_cgroup_from_task(current);
> >> - if (__mem_cgroup_has_dirty_limit(mem)) {
> >
> > What about mem->use_hierarchy that is checked in
> > __mem_cgroup_has_dirty_limit()? Is it no longer needed?
>
> It is no longer needed because the callers of mem_cgroup_page_stat()
> call __mem_cgroup_has_dirty_limit(). In the current implementation, if
> use_hierarchy=1 then the cgroup does not have dirty limits, so calls
> into mem_cgroup_page_stat() are avoided. Specifically the callers of
> mem_cgroup_page_stat() are:
>
> 1. mem_cgroup_dirty_info() which calls __mem_cgroup_has_dirty_limit()
> and returns false if use_hierarchy=1.
>
> 2. throttle_vm_writeout(), which calls memcg_dirty_info() ->
> mem_cgroup_dirty_info() -> __mem_cgroup_has_dirty_limit() and falls
> back to global limits if use_hierarchy=1.

Thanks for the clarification.

Acked-by: Johannes Weiner <[email protected]>

2010-11-19 11:39:29

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 6/6] memcg: make mem_cgroup_page_stat() return value unsigned

On Fri, Nov 12, 2010 at 12:41:15PM -0800, Greg Thelen wrote:
> >> mem_cgroup_page_stat() has changed so it never returns
> >> error so convert the return value to the traditional page
> >> count type (unsigned long).
> >
> > This changelog feels a bit beside the point.
> >
> > What's really interesting is that we now don't consider negative sums
> > to be invalid anymore, but just assume zero! There is a real
> > semantical change here.
>
> Prior to this patch series mem_cgroup_page_stat() returned a negative
> value (specifically -EINVAL) to indicate that the current task was in
> the root_cgroup and thus the per-cgroup usage and limit counter were
> invalid. Callers treated any negative value as an indication that
> the task was in the root cgroup.
>
> Unfortunately there was another way that mem_cgroup_page_stat() could
> return a negative value even when current was not in the root cgroup.
> Negative sums were a possibility due to summing of unsynchronized
> per-cpu counters. These occasional negative sums would fool callers
> into thinking that the current task was in the root cgroup.
>
> Would adding this description to the commit message address your
> concerns?

I'd just describe that summing per-cpu counters is racy, that we can
end up with negative results, and the only sensible handling of that
is to assume zero.

> > That the return type can then be changed to unsigned long is a nice
> > follow-up cleanup that happens to be folded into this patch.
>
> Good point. I can separate the change into two sub-patches:
> 1. use zero for a min-value (as described above)
> 2. change return value to unsigned

Sounds good. You can just fold the previous patch (adjusting the
callsites) into 2, which should take care of the ordering problem.

2010-11-22 06:40:53

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 1/6] memcg: add mem_cgroup parameter to mem_cgroup_page_stat()

* Greg Thelen <[email protected]> [2010-11-09 01:24:26]:

> This new parameter can be used to query dirty memory usage
> from a given memcg rather than the current task's memcg.
>
> Signed-off-by: Greg Thelen <[email protected]>
>

How is this useful, documenting that in the changelog would be nice.

--
Three Cheers,
Balbir

2010-11-22 06:41:53

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 2/6] memcg: pass mem_cgroup to mem_cgroup_dirty_info()

* Greg Thelen <[email protected]> [2010-11-09 01:24:27]:

> Pass mem_cgroup parameter through memcg_dirty_info() into
> mem_cgroup_dirty_info(). This allows for querying dirty memory
> information from a particular cgroup, rather than just the
> current task's cgroup.
>
> Signed-off-by: Greg Thelen <[email protected]>
> ---

Acked-by: Balbir Singh <[email protected]>

--
Three Cheers,
Balbir