2012-11-07 08:40:23

by Sha Zhengju

Subject: [PATCH V2 0/2] Provide more precise dump info for memcg-oom

From: Sha Zhengju <[email protected]>


When a memcg oom is happening, the current memcg-related dump information
is of limited use for debugging. These patches provide more detailed memcg
page statistics and also take the hierarchy into consideration.
The previous, more primitive version can be found here: https://lkml.org/lkml/2012/7/30/179.

Change log:
1. some modifications to the hierarchy handling
2. reworked dump_tasks
3. rebased onto Michal's mm tree, branch since-3.6

Any comments are welcome. : )


Sha Zhengju (2):
memcg-oom-provide-more-precise-dump-info-while-memcg.patch
oom-rework-dump_tasks-to-optimize-memcg-oom-situatio.patch

include/linux/memcontrol.h | 7 ++++
include/linux/oom.h | 2 +
mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++++++++++-----
mm/oom_kill.c | 61 +++++++++++++++++++------------
4 files changed, 122 insertions(+), 33 deletions(-)


2012-11-07 08:41:43

by Sha Zhengju

Subject: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening

From: Sha Zhengju <[email protected]>

Currently, when a memcg oom is happening, the oom dump messages still show
only global state and provide little useful info for users. This patch
prints more targeted memcg page statistics for memcg-oom.
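
As a rough illustration (the values below are made up; the names come from
the existing mem_cgroup_stat_names, mem_cgroup_events_names and
mem_cgroup_lru_names arrays, and "swap" is only printed when swap
accounting is enabled), the new dump line looks like:

  cache:0KB rss:262144KB mapped_file:0KB swap:0KB pgpgin:65571 pgpgout:35
  pgfault:79221 pgmajfault:0 inactive_anon:131072KB active_anon:131072KB
  inactive_file:0KB active_file:0KB unevictable:0KB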

Signed-off-by: Sha Zhengju <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrew Morton <[email protected]>
---
mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-------
mm/oom_kill.c | 6 +++-
2 files changed, 66 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0eab7d5..2df5e72 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = {
"pgmajfault",
};

+static const char * const mem_cgroup_lru_names[] = {
+ "inactive_anon",
+ "active_anon",
+ "inactive_file",
+ "active_file",
+ "unevictable",
+};
+
/*
* Per memcg event counter is incremented at every pagein/pageout. With THP,
* it will be incremated by the number of pages. This counter is used for
@@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
spin_unlock_irqrestore(&memcg->move_lock, *flags);
}

+#define K(x) ((x) << (PAGE_SHIFT-10))
+static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *mi;
+ unsigned int i;
+
+ if (!memcg->use_hierarchy && memcg != root_mem_cgroup) {
+ for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+ if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+ continue;
+ printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
+ K(mem_cgroup_read_stat(memcg, i)));
+ }
+
+ for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
+ printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
+ mem_cgroup_read_events(memcg, i));
+
+ for (i = 0; i < NR_LRU_LISTS; i++)
+ printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
+ K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
+ } else {
+
+ for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+ long long val = 0;
+
+ if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+ continue;
+ for_each_mem_cgroup_tree(mi, memcg)
+ val += mem_cgroup_read_stat(mi, i);
+ printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
+ }
+
+ for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
+ unsigned long long val = 0;
+
+ for_each_mem_cgroup_tree(mi, memcg)
+ val += mem_cgroup_read_events(mi, i);
+ printk(KERN_CONT "%s:%llu ",
+ mem_cgroup_events_names[i], val);
+ }
+
+ for (i = 0; i < NR_LRU_LISTS; i++) {
+ unsigned long long val = 0;
+
+ for_each_mem_cgroup_tree(mi, memcg)
+ val += mem_cgroup_nr_lru_pages(mi, BIT(i));
+ printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val));
+ }
+ }
+ printk(KERN_CONT "\n");
+}
/**
- * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
* @memcg: The memory cgroup that went over limit
* @p: Task that is going to be killed
*
@@ -1569,6 +1628,8 @@ done:
res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
+
+ mem_cgroup_print_oom_stat(memcg);
}

/*
@@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
}
#endif /* CONFIG_NUMA */

-static const char * const mem_cgroup_lru_names[] = {
- "inactive_anon",
- "active_anon",
- "inactive_file",
- "active_file",
- "unevictable",
-};
-
static inline void mem_cgroup_lru_names_not_uptodate(void)
{
BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7e9e911..4b8a6dd 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
cpuset_print_task_mems_allowed(current);
task_unlock(current);
dump_stack();
- mem_cgroup_print_oom_info(memcg, p);
- show_mem(SHOW_MEM_FILTER_NODES);
+ if (memcg)
+ mem_cgroup_print_oom_info(memcg, p);
+ else
+ show_mem(SHOW_MEM_FILTER_NODES);
if (sysctl_oom_dump_tasks)
dump_tasks(memcg, nodemask);
}
--
1.7.6.1

2012-11-07 08:42:03

by Sha Zhengju

Subject: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation

From: Sha Zhengju <[email protected]>

When a memcg oom is happening, don't scan all system tasks to dump the
memory state of eligible tasks; instead, iterate only over the processes
attached to the oom memcg and avoid the rcu lock.


Signed-off-by: Sha Zhengju <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrew Morton <[email protected]>
---
include/linux/memcontrol.h | 7 +++++
include/linux/oom.h | 2 +
mm/memcontrol.c | 14 +++++++++++
mm/oom_kill.c | 55 ++++++++++++++++++++++++++-----------------
4 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c91e3c1..4322ca8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+extern void dump_tasks_memcg(const struct mem_cgroup *memcg,
+ const nodemask_t *nodemask);
extern void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage);

@@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline void
+dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
+{
+}
+
static inline void mem_cgroup_begin_update_page_stat(struct page *page,
bool *locked, unsigned long *flags)
{
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 20b5c46..9ba3344 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
unsigned long totalpages, const nodemask_t *nodemask,
bool force_kill);

+extern inline void dump_per_task(struct task_struct *p,
+ const nodemask_t *nodemask);
extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *mask, bool force_kill);
extern int register_oom_notifier(struct notifier_block *nb);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2df5e72..fe648f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
return min(limit, memsw);
}

+void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
+{
+ struct cgroup_iter it;
+ struct task_struct *task;
+ struct cgroup *cgroup = memcg->css.cgroup;
+
+ cgroup_iter_start(cgroup, &it);
+ while ((task = cgroup_iter_next(cgroup, &it))) {
+ dump_per_task(task, nodemask);
+ }
+
+ cgroup_iter_end(cgroup, &it);
+}
+
static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
int order)
{
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b8a6dd..aaf6237 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
return chosen;
}

+inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask)
+{
+ struct task_struct *task;
+
+ if (oom_unkillable_task(p, NULL, nodemask))
+ return;
+
+ task = find_lock_task_mm(p);
+ if (!task) {
+ /*
+ * This is a kthread or all of p's threads have already
+ * detached their mm's. There's no need to report
+ * them; they can't be oom killed anyway.
+ */
+ return;
+ }
+
+ pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n",
+ task->pid, from_kuid(&init_user_ns, task_uid(task)),
+ task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
+ task->mm->nr_ptes,
+ get_mm_counter(task->mm, MM_SWAPENTS),
+ task->signal->oom_score_adj, task->comm);
+ task_unlock(task);
+}
+
/**
* dump_tasks - dump current memory state of all system tasks
* @memcg: current's memory controller, if constrained
@@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
{
struct task_struct *p;
- struct task_struct *task;

pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n");
- rcu_read_lock();
- for_each_process(p) {
- if (oom_unkillable_task(p, memcg, nodemask))
- continue;
-
- task = find_lock_task_mm(p);
- if (!task) {
- /*
- * This is a kthread or all of p's threads have already
- * detached their mm's. There's no need to report
- * them; they can't be oom killed anyway.
- */
- continue;
- }

- pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n",
- task->pid, from_kuid(&init_user_ns, task_uid(task)),
- task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
- task->mm->nr_ptes,
- get_mm_counter(task->mm, MM_SWAPENTS),
- task->signal->oom_score_adj, task->comm);
- task_unlock(task);
+ if (memcg) {
+ dump_tasks_memcg(memcg, nodemask);
+ return;
}
+
+ rcu_read_lock();
+ for_each_process(p)
+ dump_per_task(p, nodemask);
rcu_read_unlock();
}

--
1.7.6.1

2012-11-07 18:02:14

by David Rientjes

Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening

On Wed, 7 Nov 2012, Sha Zhengju wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0eab7d5..2df5e72 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = {
> "pgmajfault",
> };
>
> +static const char * const mem_cgroup_lru_names[] = {
> + "inactive_anon",
> + "active_anon",
> + "inactive_file",
> + "active_file",
> + "unevictable",
> +};
> +
> /*
> * Per memcg event counter is incremented at every pagein/pageout. With THP,
> * it will be incremated by the number of pages. This counter is used for
> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
> spin_unlock_irqrestore(&memcg->move_lock, *flags);
> }
>
> +#define K(x) ((x) << (PAGE_SHIFT-10))
> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
> +{
> + struct mem_cgroup *mi;
> + unsigned int i;
> +
> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) {
> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> + continue;
> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],

This printk isn't continuing any previous printk, so using KERN_CONT here
will require a short header to be printed first ("Memcg: "?) with
KERN_INFO before the iterations.
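
A minimal sketch of that, with "Memcg: " as a placeholder header:

        printk(KERN_INFO "Memcg: ");    /* header line for KERN_CONT to continue */
        for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
                if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
                        continue;
                printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
                       K(mem_cgroup_read_stat(memcg, i)));
        }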

> + K(mem_cgroup_read_stat(memcg, i)));
> + }
> +
> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
> + mem_cgroup_read_events(memcg, i));
> +
> + for (i = 0; i < NR_LRU_LISTS; i++)
> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
> + } else {
> +

Spurious newline.

Eek, is there really no way to avoid this if-conditional and just use
for_each_mem_cgroup_tree() for everything and use

mem_cgroup_iter_break(memcg, iter);
break;

for !memcg->use_hierarchy?
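
That would turn each loop into something like this untested sketch:

        for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
                long long val = 0;

                if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
                        continue;
                for_each_mem_cgroup_tree(mi, memcg) {
                        val += mem_cgroup_read_stat(mi, i);
                        if (!memcg->use_hierarchy) {
                                /* non-hierarchical: stop after the memcg itself */
                                mem_cgroup_iter_break(memcg, mi);
                                break;
                        }
                }
                printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
        }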

> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> + long long val = 0;
> +
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> + continue;
> + for_each_mem_cgroup_tree(mi, memcg)
> + val += mem_cgroup_read_stat(mi, i);
> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
> + }
> +
> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
> + unsigned long long val = 0;
> +
> + for_each_mem_cgroup_tree(mi, memcg)
> + val += mem_cgroup_read_events(mi, i);
> + printk(KERN_CONT "%s:%llu ",
> + mem_cgroup_events_names[i], val);
> + }
> +
> + for (i = 0; i < NR_LRU_LISTS; i++) {
> + unsigned long long val = 0;
> +
> + for_each_mem_cgroup_tree(mi, memcg)
> + val += mem_cgroup_nr_lru_pages(mi, BIT(i));
> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val));
> + }
> + }
> + printk(KERN_CONT "\n");
> +}
> /**
> - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
> * @memcg: The memory cgroup that went over limit
> * @p: Task that is going to be killed
> *
> @@ -1569,6 +1628,8 @@ done:
> res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
> res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
> res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
> +
> + mem_cgroup_print_oom_stat(memcg);

I think this should be folded into mem_cgroup_print_oom_info(), I don't
see a need for a new function.

> }
>
> /*
> @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
> }
> #endif /* CONFIG_NUMA */
>
> -static const char * const mem_cgroup_lru_names[] = {
> - "inactive_anon",
> - "active_anon",
> - "inactive_file",
> - "active_file",
> - "unevictable",
> -};
> -
> static inline void mem_cgroup_lru_names_not_uptodate(void)
> {
> BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7e9e911..4b8a6dd 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> cpuset_print_task_mems_allowed(current);
> task_unlock(current);
> dump_stack();
> - mem_cgroup_print_oom_info(memcg, p);
> - show_mem(SHOW_MEM_FILTER_NODES);
> + if (memcg)
> + mem_cgroup_print_oom_info(memcg, p);

mem_cgroup_print_oom_info() already returns immediately for !memcg, so I'm
not sure why this change is made.

> + else
> + show_mem(SHOW_MEM_FILTER_NODES);

Well that's disappointing if memcg == root_mem_cgroup, we'd probably like
to know the global memory state to determine what the problem is.
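
E.g. keeping show_mem() unconditional (just a sketch):

        if (memcg)
                mem_cgroup_print_oom_info(memcg, p);
        show_mem(SHOW_MEM_FILTER_NODES);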

> if (sysctl_oom_dump_tasks)
> dump_tasks(memcg, nodemask);
> }

2012-11-07 18:07:19

by David Rientjes

Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation

On Wed, 7 Nov 2012, Sha Zhengju wrote:

> From: Sha Zhengju <[email protected]>
>
> When a memcg oom is happening, don't scan all system tasks to dump the
> memory state of eligible tasks; instead, iterate only over the processes
> attached to the oom memcg and avoid the rcu lock.
>

Avoiding the rcu lock isn't actually that impressive here; the cgroup
iterator will use its own lock for that memcg.

> Signed-off-by: Sha Zhengju <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> include/linux/memcontrol.h | 7 +++++
> include/linux/oom.h | 2 +
> mm/memcontrol.c | 14 +++++++++++
> mm/oom_kill.c | 55 ++++++++++++++++++++++++++-----------------
> 4 files changed, 56 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c91e3c1..4322ca8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
> void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
> extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> struct task_struct *p);
> +extern void dump_tasks_memcg(const struct mem_cgroup *memcg,
> + const nodemask_t *nodemask);

Shouldn't need the nodemask parameter, just have dump_tasks_memcg() pass
NULL to dump_per_task(), we won't be isolating to tasks with mempolicies
restricted to a particular set of nodes since we're in the memcg oom path
here, not the global page allocator oom path.
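
I.e., roughly (sketch only):

        extern void dump_tasks_memcg(const struct mem_cgroup *memcg);
        ...
        cgroup_iter_start(cgroup, &it);
        while ((task = cgroup_iter_next(cgroup, &it)))
                dump_per_task(task, NULL);      /* no mempolicy filtering here */
        cgroup_iter_end(cgroup, &it);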

> extern void mem_cgroup_replace_page_cache(struct page *oldpage,
> struct page *newpage);
>
> @@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> {
> }
>
> +static inline void
> +dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
> +{
> +}
> +
> static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> bool *locked, unsigned long *flags)
> {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 20b5c46..9ba3344 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> unsigned long totalpages, const nodemask_t *nodemask,
> bool force_kill);
>
> +extern inline void dump_per_task(struct task_struct *p,
> + const nodemask_t *nodemask);

This is a global symbol, so dump_per_task() doesn't make a lot of sense:
it would need to be prefixed with "oom_" so perhaps oom_dump_task() is
better?

> extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> int order, nodemask_t *mask, bool force_kill);
> extern int register_oom_notifier(struct notifier_block *nb);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2df5e72..fe648f8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> return min(limit, memsw);
> }
>
> +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
> +{
> + struct cgroup_iter it;
> + struct task_struct *task;
> + struct cgroup *cgroup = memcg->css.cgroup;
> +
> + cgroup_iter_start(cgroup, &it);
> + while ((task = cgroup_iter_next(cgroup, &it))) {
> + dump_per_task(task, nodemask);
> + }
> +
> + cgroup_iter_end(cgroup, &it);
> +}
> +
> static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> int order)
> {
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4b8a6dd..aaf6237 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> return chosen;
> }
>
> +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask)

No inline.

> +{
> + struct task_struct *task;
> +
> + if (oom_unkillable_task(p, NULL, nodemask))
> + return;
> +
> + task = find_lock_task_mm(p);
> + if (!task) {
> + /*
> + * This is a kthread or all of p's threads have already
> + * detached their mm's. There's no need to report
> + * them; they can't be oom killed anyway.
> + */
> + return;
> + }
> +
> + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n",
> + task->pid, from_kuid(&init_user_ns, task_uid(task)),
> + task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
> + task->mm->nr_ptes,
> + get_mm_counter(task->mm, MM_SWAPENTS),
> + task->signal->oom_score_adj, task->comm);
> + task_unlock(task);
> +}
> +
> /**
> * dump_tasks - dump current memory state of all system tasks
> * @memcg: current's memory controller, if constrained
> @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
> {
> struct task_struct *p;
> - struct task_struct *task;
>
> pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n");
> - rcu_read_lock();
> - for_each_process(p) {
> - if (oom_unkillable_task(p, memcg, nodemask))
> - continue;
> -
> - task = find_lock_task_mm(p);
> - if (!task) {
> - /*
> - * This is a kthread or all of p's threads have already
> - * detached their mm's. There's no need to report
> - * them; they can't be oom killed anyway.
> - */
> - continue;
> - }
>
> - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n",
> - task->pid, from_kuid(&init_user_ns, task_uid(task)),
> - task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
> - task->mm->nr_ptes,
> - get_mm_counter(task->mm, MM_SWAPENTS),
> - task->signal->oom_score_adj, task->comm);
> - task_unlock(task);
> + if (memcg) {
> + dump_tasks_memcg(memcg, nodemask);
> + return;
> }
> +
> + rcu_read_lock();
> + for_each_process(p)
> + dump_per_task(p, nodemask);
> rcu_read_unlock();
> }
>

2012-11-07 19:31:10

by Andrew Morton

Subject: Re: [PATCH V2 0/2] Provide more precise dump info for memcg-oom

On Wed, 7 Nov 2012 16:40:02 +0800
Sha Zhengju <[email protected]> wrote:

> When a memcg oom is happening, the current memcg-related dump information
> is of limited use for debugging. These patches provide more detailed memcg
> page statistics and also take the hierarchy into consideration.

Within the changelogs, please include a sample of the proposed output
so we can properly review the proposal.

Also it would be useful to provide some justification for the decisions
in this patch: which data is displayed and, particularly, which is not?
Why is the displayed information useful to developers, etc.

2012-11-07 22:17:15

by Michal Hocko

Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening

On Wed 07-11-12 16:41:36, Sha Zhengju wrote:
> From: Sha Zhengju <[email protected]>
>
> Currently, when a memcg oom is happening, the oom dump messages still show
> only global state and provide little useful info for users. This patch
> prints more targeted memcg page statistics for memcg-oom.
>
> Signed-off-by: Sha Zhengju <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-------
> mm/oom_kill.c | 6 +++-
> 2 files changed, 66 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0eab7d5..2df5e72 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
> spin_unlock_irqrestore(&memcg->move_lock, *flags);
> }
>
> +#define K(x) ((x) << (PAGE_SHIFT-10))
> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
> +{
> + struct mem_cgroup *mi;
> + unsigned int i;
> +
> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) {
> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> + continue;
> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
> + K(mem_cgroup_read_stat(memcg, i)));
> + }
> +
> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
> + mem_cgroup_read_events(memcg, i));
> +
> + for (i = 0; i < NR_LRU_LISTS; i++)
> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
> + } else {
> +
> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> + long long val = 0;
> +
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> + continue;
> + for_each_mem_cgroup_tree(mi, memcg)
> + val += mem_cgroup_read_stat(mi, i);
> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
> + }
> +
> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
> + unsigned long long val = 0;
> +
> + for_each_mem_cgroup_tree(mi, memcg)
> + val += mem_cgroup_read_events(mi, i);
> + printk(KERN_CONT "%s:%llu ",
> + mem_cgroup_events_names[i], val);
> + }
> +
> + for (i = 0; i < NR_LRU_LISTS; i++) {
> + unsigned long long val = 0;
> +
> + for_each_mem_cgroup_tree(mi, memcg)
> + val += mem_cgroup_nr_lru_pages(mi, BIT(i));
> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val));
> + }
> + }

This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware
and there is no need for the if (use_hierarchy) part.
The memcg != root_mem_cgroup test doesn't make much sense either because we
call that a global oom killer ;)
--
Michal Hocko
SUSE Labs

2012-11-07 22:34:43

by Michal Hocko

Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation

On Wed 07-11-12 16:41:59, Sha Zhengju wrote:
> From: Sha Zhengju <[email protected]>
>
> When a memcg oom is happening, don't scan all system tasks to dump the
> memory state of eligible tasks; instead, iterate only over the processes
> attached to the oom memcg and avoid the rcu lock.

you have replaced the rcu lock with css_set_lock which is, well, heavier
than rcu. Besides that, the patch is not correct because you have excluded
all tasks that are in subgroups, since you iterate only through the
top level one.
I am not sure the whole optimization would be a win even if implemented
correctly. Well, we currently scan through more tasks and most of them
are not relevant, but then you would need to exclude task_in_mem_cgroup
from oom_unkillable_task and that would be more code churn than the
win.
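
For completeness, a correct version would have to walk the whole
hierarchy, something like this untested sketch:

        void dump_tasks_memcg(struct mem_cgroup *memcg, const nodemask_t *nodemask)
        {
                struct mem_cgroup *iter;

                for_each_mem_cgroup_tree(iter, memcg) {
                        struct cgroup *cgroup = iter->css.cgroup;
                        struct cgroup_iter it;
                        struct task_struct *task;

                        cgroup_iter_start(cgroup, &it);
                        while ((task = cgroup_iter_next(cgroup, &it)))
                                dump_per_task(task, nodemask);
                        cgroup_iter_end(cgroup, &it);
                }
        }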

> Signed-off-by: Sha Zhengju <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> include/linux/memcontrol.h | 7 +++++
> include/linux/oom.h | 2 +
> mm/memcontrol.c | 14 +++++++++++
> mm/oom_kill.c | 55 ++++++++++++++++++++++++++-----------------
> 4 files changed, 56 insertions(+), 22 deletions(-)
>
[...]
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 20b5c46..9ba3344 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> unsigned long totalpages, const nodemask_t *nodemask,
> bool force_kill);
>
> +extern inline void dump_per_task(struct task_struct *p,
> + const nodemask_t *nodemask);
> extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> int order, nodemask_t *mask, bool force_kill);
> extern int register_oom_notifier(struct notifier_block *nb);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2df5e72..fe648f8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> return min(limit, memsw);
> }
>
> +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
> +{
> + struct cgroup_iter it;
> + struct task_struct *task;
> + struct cgroup *cgroup = memcg->css.cgroup;
> +
> + cgroup_iter_start(cgroup, &it);
> + while ((task = cgroup_iter_next(cgroup, &it))) {
> + dump_per_task(task, nodemask);
> + }
> +
> + cgroup_iter_end(cgroup, &it);
> +}
> +
> static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> int order)
> {
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4b8a6dd..aaf6237 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> return chosen;
> }
>
> +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask)
> +{
> + struct task_struct *task;
> +
> + if (oom_unkillable_task(p, NULL, nodemask))
> + return;
> +
> + task = find_lock_task_mm(p);
> + if (!task) {
> + /*
> + * This is a kthread or all of p's threads have already
> + * detached their mm's. There's no need to report
> + * them; they can't be oom killed anyway.
> + */
> + return;
> + }
> +
> + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n",
> + task->pid, from_kuid(&init_user_ns, task_uid(task)),
> + task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
> + task->mm->nr_ptes,
> + get_mm_counter(task->mm, MM_SWAPENTS),
> + task->signal->oom_score_adj, task->comm);
> + task_unlock(task);
> +}
> +
> /**
> * dump_tasks - dump current memory state of all system tasks
> * @memcg: current's memory controller, if constrained
> @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
> {
> struct task_struct *p;
> - struct task_struct *task;
>
> pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n");
> - rcu_read_lock();
> - for_each_process(p) {
> - if (oom_unkillable_task(p, memcg, nodemask))
> - continue;
> -
> - task = find_lock_task_mm(p);
> - if (!task) {
> - /*
> - * This is a kthread or all of p's threads have already
> - * detached their mm's. There's no need to report
> - * them; they can't be oom killed anyway.
> - */
> - continue;
> - }
>
> - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n",
> - task->pid, from_kuid(&init_user_ns, task_uid(task)),
> - task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
> - task->mm->nr_ptes,
> - get_mm_counter(task->mm, MM_SWAPENTS),
> - task->signal->oom_score_adj, task->comm);
> - task_unlock(task);
> + if (memcg) {
> + dump_tasks_memcg(memcg, nodemask);
> + return;
> }
> +
> + rcu_read_lock();
> + for_each_process(p)
> + dump_per_task(p, nodemask);
> rcu_read_unlock();
> }
>
> --
> 1.7.6.1
>

--
Michal Hocko
SUSE Labs

2012-11-08 09:08:04

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening

(2012/11/07 17:41), Sha Zhengju wrote:
> From: Sha Zhengju <[email protected]>
>
> Currently, when a memcg oom is happening, the oom dump messages still show
> only global state and provide little useful info for users. This patch
> prints more targeted memcg page statistics for memcg-oom.
>
> Signed-off-by: Sha Zhengju <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-------
> mm/oom_kill.c | 6 +++-
> 2 files changed, 66 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0eab7d5..2df5e72 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = {
> "pgmajfault",
> };
>
> +static const char * const mem_cgroup_lru_names[] = {
> + "inactive_anon",
> + "active_anon",
> + "inactive_file",
> + "active_file",
> + "unevictable",
> +};
> +

Are these the same strings as the ones used by show_free_areas()?


> /*
> * Per memcg event counter is incremented at every pagein/pageout. With THP,
> * it will be incremated by the number of pages. This counter is used for
> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
> spin_unlock_irqrestore(&memcg->move_lock, *flags);
> }
>
> +#define K(x) ((x) << (PAGE_SHIFT-10))
> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
> +{
> + struct mem_cgroup *mi;
> + unsigned int i;
> +
> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) {

Why do you need this condition check?

> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> + continue;
> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
> + K(mem_cgroup_read_stat(memcg, i)));

Hm, how about using the same style as show_free_areas()?

> + }
> +
> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
> + mem_cgroup_read_events(memcg, i));
> +

I don't think EVENTS info is useful for oom.

> + for (i = 0; i < NR_LRU_LISTS; i++)
> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));

How far does the format of your new information differ from the usual oom
output? Could you show a sample and the difference in the changelog?

Of course, I'd prefer both of them to have a similar format.
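
I mean printing through one fixed format string per line, e.g. a sketch
using the helpers from your patch (hierarchy summing left aside):

        printk("memcg: inactive_anon:%lukB active_anon:%lukB"
                " inactive_file:%lukB active_file:%lukB"
                " unevictable:%lukB\n",
                K(mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON))),
                K(mem_cgroup_nr_lru_pages(memcg, BIT(LRU_ACTIVE_ANON))),
                K(mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_FILE))),
                K(mem_cgroup_nr_lru_pages(memcg, BIT(LRU_ACTIVE_FILE))),
                K(mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE))));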





> + } else {
> +
> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> + long long val = 0;
> +
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> + continue;
> + for_each_mem_cgroup_tree(mi, memcg)
> + val += mem_cgroup_read_stat(mi, i);
> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
> + }
> +
> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
> + unsigned long long val = 0;
> +
> + for_each_mem_cgroup_tree(mi, memcg)
> + val += mem_cgroup_read_events(mi, i);
> + printk(KERN_CONT "%s:%llu ",
> + mem_cgroup_events_names[i], val);
> + }
> +
> + for (i = 0; i < NR_LRU_LISTS; i++) {
> + unsigned long long val = 0;
> +
> + for_each_mem_cgroup_tree(mi, memcg)
> + val += mem_cgroup_nr_lru_pages(mi, BIT(i));
> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val));
> + }
> + }
> + printk(KERN_CONT "\n");
> +}




> /**
> - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
> * @memcg: The memory cgroup that went over limit
> * @p: Task that is going to be killed
> *
> @@ -1569,6 +1628,8 @@ done:
> res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
> res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
> res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
> +
> + mem_cgroup_print_oom_stat(memcg);
> }

please put this directly in print_oom_info()



Thanks,
-Kame

2012-11-08 12:37:45

by Sha Zhengju

Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening

On 11/08/2012 02:02 AM, David Rientjes wrote:
> On Wed, 7 Nov 2012, Sha Zhengju wrote:
>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 0eab7d5..2df5e72 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = {
>> "pgmajfault",
>> };
>>
>> +static const char * const mem_cgroup_lru_names[] = {
>> + "inactive_anon",
>> + "active_anon",
>> + "inactive_file",
>> + "active_file",
>> + "unevictable",
>> +};
>> +
>> /*
>> * Per memcg event counter is incremented at every pagein/pageout. With THP,
>> * it will be incremated by the number of pages. This counter is used for
>> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
>> spin_unlock_irqrestore(&memcg->move_lock, *flags);
>> }
>>
>> +#define K(x) ((x) << (PAGE_SHIFT-10))
>> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
>> +{
>> + struct mem_cgroup *mi;
>> + unsigned int i;
>> +
>> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) {
>> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
>> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
>> + continue;
>> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
> This printk isn't continuing any previous printk, so using KERN_CONT here
> will require a short header to be printed first ("Memcg: "?) with
> KERN_INFO before the iterations.
>

Yep... I think I lost it while rebasing... sorry for the silly mistake.

>> + K(mem_cgroup_read_stat(memcg, i)));
>> + }
>> +
>> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
>> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
>> + mem_cgroup_read_events(memcg, i));
>> +
>> + for (i = 0; i < NR_LRU_LISTS; i++)
>> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
>> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
>> + } else {
>> +
> Spurious newline.
>
> Eek, is there really no way to avoid this if-conditional and just use
> for_each_mem_cgroup_tree() for everything and use
>
> mem_cgroup_iter_break(memcg, iter);
> break;
>
> for !memcg->use_hierarchy?
>

Now I'm ashamed of my foggy head yesterday for sending this chunk out...
Yes, the if-part code above is obviously unwanted, and
for_each_mem_cgroup_tree can already handle the hierarchy.


>> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
>> + long long val = 0;
>> +
>> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
>> + continue;
>> + for_each_mem_cgroup_tree(mi, memcg)
>> + val += mem_cgroup_read_stat(mi, i);
>> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
>> + }
>> +
>> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
>> + unsigned long long val = 0;
>> +
>> + for_each_mem_cgroup_tree(mi, memcg)
>> + val += mem_cgroup_read_events(mi, i);
>> + printk(KERN_CONT "%s:%llu ",
>> + mem_cgroup_events_names[i], val);
>> + }
>> +
>> + for (i = 0; i < NR_LRU_LISTS; i++) {
>> + unsigned long long val = 0;
>> +
>> + for_each_mem_cgroup_tree(mi, memcg)
>> + val += mem_cgroup_nr_lru_pages(mi, BIT(i));
>> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val));
>> + }
>> + }
>> + printk(KERN_CONT "\n");
>> +}
>> /**
>> - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
>> * @memcg: The memory cgroup that went over limit
>> * @p: Task that is going to be killed
>> *
>> @@ -1569,6 +1628,8 @@ done:
>> res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
>> res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
>> res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
>> +
>> + mem_cgroup_print_oom_stat(memcg);
> I think this should be folded into mem_cgroup_print_oom_info(), I don't
> see a need for a new function.
>
>> }
>>
>> /*
>> @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
>> }
>> #endif /* CONFIG_NUMA */
>>
>> -static const char * const mem_cgroup_lru_names[] = {
>> - "inactive_anon",
>> - "active_anon",
>> - "inactive_file",
>> - "active_file",
>> - "unevictable",
>> -};
>> -
>> static inline void mem_cgroup_lru_names_not_uptodate(void)
>> {
>> BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 7e9e911..4b8a6dd 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>> cpuset_print_task_mems_allowed(current);
>> task_unlock(current);
>> dump_stack();
>> - mem_cgroup_print_oom_info(memcg, p);
>> - show_mem(SHOW_MEM_FILTER_NODES);
>> + if (memcg)
>> + mem_cgroup_print_oom_info(memcg, p);
> mem_cgroup_print_oom_info() already returns immediately for !memcg, so I'm
> not sure why this change is made.
>

Here the if-else check aims at printing distinct messages for memcg and
non-memcg ooms. IMHO, the global state has little actual use for a
memcg-oom, so why not leave it out? Though mem_cgroup_print_oom_info
already checks for !memcg, the if-statement can avoid one function call
and save a little on the already-deep oom call stack.

>> + else
>> + show_mem(SHOW_MEM_FILTER_NODES);
> Well that's disappointing if memcg == root_mem_cgroup, we'd probably like
> to know the global memory state to determine what the problem is.
>

I'm really wondering if there is any case that can pass root_mem_cgroup
down here. It's called by the global or memcg oom killer, and the global
oom path sets memcg=NULL directly instead of root_mem_cgroup. Besides, the
root memcg does not go through charging, so there is no chance to call
mem_cgroup_out_of_memory for root cgroup tasks.



Thanks,
Sha


>> if (sysctl_oom_dump_tasks)
>> dump_tasks(memcg, nodemask);
>> }

2012-11-08 12:44:30

by Michal Hocko

Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening

On Thu 08-11-12 20:37:45, Sha Zhengju wrote:
> On 11/08/2012 02:02 AM, David Rientjes wrote:
> >On Wed, 7 Nov 2012, Sha Zhengju wrote:
[..]
> >>+ else
> >>+ show_mem(SHOW_MEM_FILTER_NODES);
> >Well that's disappointing if memcg == root_mem_cgroup, we'd probably like
> >to know the global memory state to determine what the problem is.
> >
>
> I'm really wondering if there is any case that can pass root_mem_cgroup
> down here.

No, it cannot, because the root cgroup doesn't have any limit, so we
cannot trigger the memcg oom killer.

--
Michal Hocko
SUSE Labs

2012-11-08 12:45:29

by Sha Zhengju

Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening

On 11/08/2012 06:17 AM, Michal Hocko wrote:
> On Wed 07-11-12 16:41:36, Sha Zhengju wrote:
>> From: Sha Zhengju <[email protected]>
>>
>> Currently, when a memcg oom is happening, the oom dump messages still show
>> only global state and provide little useful info for users. This patch
>> prints more targeted memcg page statistics for memcg-oom.
>>
>> Signed-off-by: Sha Zhengju <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: KAMEZAWA Hiroyuki <[email protected]>
>> Cc: David Rientjes <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> ---
>> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-------
>> mm/oom_kill.c | 6 +++-
>> 2 files changed, 66 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 0eab7d5..2df5e72 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
> [...]
>> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
>> spin_unlock_irqrestore(&memcg->move_lock, *flags);
>> }
>>
>> +#define K(x) ((x) << (PAGE_SHIFT-10))
>> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
>> +{
>> + struct mem_cgroup *mi;
>> + unsigned int i;
>> +
>> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) {
>> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
>> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
>> + continue;
>> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
>> + K(mem_cgroup_read_stat(memcg, i)));
>> + }
>> +
>> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
>> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
>> + mem_cgroup_read_events(memcg, i));
>> +
>> + for (i = 0; i < NR_LRU_LISTS; i++)
>> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
>> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
>> + } else {
>> +
>> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
>> + long long val = 0;
>> +
>> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
>> + continue;
>> + for_each_mem_cgroup_tree(mi, memcg)
>> + val += mem_cgroup_read_stat(mi, i);
>> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
>> + }
>> +
>> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
>> + unsigned long long val = 0;
>> +
>> + for_each_mem_cgroup_tree(mi, memcg)
>> + val += mem_cgroup_read_events(mi, i);
>> + printk(KERN_CONT "%s:%llu ",
>> + mem_cgroup_events_names[i], val);
>> + }
>> +
>> + for (i = 0; i < NR_LRU_LISTS; i++) {
>> + unsigned long long val = 0;
>> +
>> + for_each_mem_cgroup_tree(mi, memcg)
>> + val += mem_cgroup_nr_lru_pages(mi, BIT(i));
>> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val));
>> + }
>> + }
> This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware
> and there is no need for the if (use_hierarchy) part.
> The memcg != root_mem_cgroup test doesn't make much sense either because we
> call that a global oom killer ;)

Yes... how I regret this patch... The else-part using
for_each_mem_cgroup_tree is enough for the hierarchy. I'll send an updated
one later.
Sorry for the noise. : (


Thanks,
Sha

2012-11-08 13:12:00

by Sha Zhengju

Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening

On 11/08/2012 05:07 PM, Kamezawa Hiroyuki wrote:
> (2012/11/07 17:41), Sha Zhengju wrote:
>> From: Sha Zhengju <[email protected]>
>>
>> Currently, when a memcg oom is happening, the oom dump messages still show
>> only global state and provide little useful info for users. This patch
>> prints more targeted memcg page statistics for memcg-oom.
>>
>> Signed-off-by: Sha Zhengju <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: KAMEZAWA Hiroyuki <[email protected]>
>> Cc: David Rientjes <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> ---
>> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-------
>> mm/oom_kill.c | 6 +++-
>> 2 files changed, 66 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 0eab7d5..2df5e72 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = {
>> "pgmajfault",
>> };
>>
>> +static const char * const mem_cgroup_lru_names[] = {
>> + "inactive_anon",
>> + "active_anon",
>> + "inactive_file",
>> + "active_file",
>> + "unevictable",
>> +};
>> +
> Are these the same strings as the ones used by show_free_areas()?
>

I just moved the declaration here from the bottom of the source file so
that the following use compiles cleanly.

>> /*
>> * Per memcg event counter is incremented at every pagein/pageout. With THP,
>> * it will be incremated by the number of pages. This counter is used for
>> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
>> spin_unlock_irqrestore(&memcg->move_lock, *flags);
>> }
>>
>> +#define K(x) ((x) << (PAGE_SHIFT-10))
>> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
>> +{
>> + struct mem_cgroup *mi;
>> + unsigned int i;
>> +
>> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) {
> Why do you need this condition check?
>

Yes, the check is unnecessary... I'll remove it in the next version.

>> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
>> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
>> + continue;
>> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
>> + K(mem_cgroup_read_stat(memcg, i)));
> Hm, how about using the same style as show_free_areas()?
>

I'm also trying to do so. show_free_areas() prints the memory-related info
in two styles: one in page units and the other in KB (I've no idea why the
two are distinguished), but I think the KB format is more readable.


>> + }
>> +
>> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
>> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
>> + mem_cgroup_read_events(memcg, i));
>> +
> I don't think EVENTS info is useful for oom.
>

It seems you're right. : )

>> + for (i = 0; i < NR_LRU_LISTS; i++)
>> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
>> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
> How far does the format of your new information differ from the usual oom
> output? Could you show a sample and the difference in the changelog?
>
> Of course, I'd prefer both of them to have a similar format.
>
>
>
The new memcg-oom info leaves the global state out and prints the memcg
statistics instead, which is more concise. I'll add a sample next time.
Thanks for reminding me!


Thanks,
Sha

2012-11-08 13:58:50

by Sha Zhengju

Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation

On 11/08/2012 06:34 AM, Michal Hocko wrote:
> On Wed 07-11-12 16:41:59, Sha Zhengju wrote:
>> From: Sha Zhengju <[email protected]>
>>
>> When a memcg oom is happening, don't scan all system tasks to dump the
>> memory state of eligible tasks; instead, iterate only over the processes
>> attached to the oom memcg and avoid the rcu lock.
> you have replaced the rcu lock with css_set_lock which is, well, heavier
> than rcu. Besides that, the patch is not correct because you have excluded
> all tasks that are in subgroups, since you iterate only through the
> top level one.
> I am not sure the whole optimization would be a win even if implemented
> correctly. Well, we currently scan through more tasks and most of them
> are not relevant, but then you would need to exclude task_in_mem_cgroup
> from oom_unkillable_task and that would be more code churn than the
> win.

Thanks for your and David's advice.
This piece was trying to save some expense while dumping memcg tasks, but
it failed to scan subgroups because it iterates only over the top-level
cgroup. I agree with your cost/benefit assessment, so I've decided to give
up on this one. : )


Thanks,
Sha