2011-06-13 03:08:19

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH 0/5] memcg bugfixes in the last week.


In the last week, I (and other memcg people) posted 5 bug fixes.
I was slightly confused.

To make things clearer, I'm posting all 5 patches again,
which I'm now testing.

If I missed any patches/fixes/bugs, please notify me.

[1/5] - fix memory.numa_stat permission (this is in mmotm)
[2/5] - fix init_page_cgroup() nid with sparsemem
[3/5] - fix mm->owner update
[4/5] - fix wrong check of noswap with softlimit
[5/5] - fix percpu cached charge draining.


Thank you for all your help.

Thanks,
-Kame


2011-06-13 03:10:03

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH 1/5] memcg fix numa_stat permission

This is already queued in mmotm.
==
>From 3b1bec1d07ba21339697f069acab47dae6b81097 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Mon, 13 Jun 2011 09:32:28 +0900
Subject: [PATCH 1/5] memcg fix numa_stat permission

commit 406eb0c9b ("memcg: add memory.numastat api for numa statistics")
added the memory.numa_stat file for the memory cgroup, but the file
permissions are wrong.

[kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
---------- 1 root root 0 Jun 9 18:36 /cgroup/memory/A/memory.numa_stat

This patch fixes the permissions:

[root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
-r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat

Acked-by: Ying Han <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bd9052a..ce05835 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4640,6 +4640,7 @@ static struct cftype mem_cgroup_files[] = {
{
.name = "numa_stat",
.open = mem_control_numa_stat_open,
+ .mode = S_IRUGO,
},
#endif
};
--
1.7.4.1

2011-06-13 03:13:08

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem

Added some cleanups.
==
>From 45b8881201f73015c4e0352eb7e434f6e9c53fdd Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Mon, 13 Jun 2011 10:09:17 +0900
Subject: [PATCH 2/5] [BUGFIX] memcg: fix init_page_cgroup nid with sparsemem


commit 21a3c96 made page_cgroup allocation NUMA aware, but that caused
a problem: https://bugzilla.kernel.org/show_bug.cgi?id=36192

The problem was getting a NID from an invalid struct page, which was not
initialized because it was out of node, i.e. outside of
[node_start_pfn, node_end_pfn).

Now, with sparsemem, page_cgroup_init() scans pfns from 0 to max_pfn.
But this may scan a pfn which is not on any node and access a memmap
which is not initialized.

This patch makes page_cgroup_init() for SPARSEMEM node aware and removes
the code that gets the nid from page->flags, so that a valid NID is
always used.
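In condensed form, the new scan described above is (a restatement of the
hunk in the diff below, not additional code):

	for_each_node_state(nid, N_HIGH_MEMORY) {
		unsigned long pfn, start_pfn, end_pfn;

		start_pfn = node_start_pfn(nid);
		end_pfn = node_end_pfn(nid);
		/* walk the node section by section ... */
		for (pfn = start_pfn; pfn < end_pfn;
		     pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
			if (!pfn_valid(pfn))
				continue;
			/* ... skipping pfns owned by an overlapping node */
			if (pfn_to_nid(pfn) != nid)
				continue;
			if (init_section_page_cgroup(pfn, nid))
				goto oom;	/* see the error path in the diff */
		}
	}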

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Changelog:
- use better macros such as node_start/end_pfn(), PAGE_SECTION_MASK, etc.
- pull out error handling code, which is rarely called.
- fix the MEM_GOING_ONLINE case.
- add more comments and update the patch description.
---
mm/page_cgroup.c | 67 +++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 74ccff6..0742633 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -162,13 +162,13 @@ static void free_page_cgroup(void *addr)
}
#endif

-static int __meminit init_section_page_cgroup(unsigned long pfn)
+static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
{
struct page_cgroup *base, *pc;
struct mem_section *section;
unsigned long table_size;
unsigned long nr;
- int nid, index;
+ int index;

nr = pfn_to_section_nr(pfn);
section = __nr_to_section(nr);
@@ -176,7 +176,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
if (section->page_cgroup)
return 0;

- nid = page_to_nid(pfn_to_page(pfn));
table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
base = alloc_page_cgroup(table_size, nid);

@@ -196,7 +195,11 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
pc = base + index;
init_page_cgroup(pc, nr);
}
-
+ /*
+ * Passed "pfn" may not be aligned to SECTION. For the calculation,
+ * we need to apply a mask.
+ */
+ pfn &= PAGE_SECTION_MASK;
section->page_cgroup = base - pfn;
total_usage += table_size;
return 0;
@@ -225,10 +228,20 @@ int __meminit online_page_cgroup(unsigned long start_pfn,
start = start_pfn & ~(PAGES_PER_SECTION - 1);
end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);

+ if (nid == -1) {
+ /*
+ * In this case, "nid" already exists and contains valid memory.
+ * "start_pfn" passed for us is a pfn which is args for online_
+ * _pages(), and start_pfn should exists.
+ */
+ nid = pfn_to_nid(start_pfn);
+ VM_BUG_ON(!node_state(nid, N_ONLINE));
+ }
+
for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
if (!pfn_present(pfn))
continue;
- fail = init_section_page_cgroup(pfn);
+ fail = init_section_page_cgroup(pfn, nid);
}
if (!fail)
return 0;
@@ -284,25 +297,47 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
void __init page_cgroup_init(void)
{
unsigned long pfn;
- int fail = 0;
+ int nid;

if (mem_cgroup_disabled())
return;

- for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
- if (!pfn_present(pfn))
- continue;
- fail = init_section_page_cgroup(pfn);
- }
- if (fail) {
- printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
- panic("Out of memory");
- } else {
- hotplug_memory_notifier(page_cgroup_callback, 0);
+ for_each_node_state(nid, N_HIGH_MEMORY) {
+ unsigned long start_pfn, end_pfn;
+
+ start_pfn = node_start_pfn(nid);
+ end_pfn = node_end_pfn(nid);
+ /*
+ * start_pfn and end_pfn may not be aligned to SECTION and
+ * page->flags of out-of-node pages are not initialized. So we
+ * scan [start_pfn, the biggest section's pfn < end_pfn) here.
+ */
+ for (pfn = start_pfn;
+ pfn < end_pfn;
+ pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
+
+ if (!pfn_valid(pfn))
+ continue;
+ /*
+ * Nodes' pfn ranges can overlap.
+ * We know some arch can have nodes layout as
+ * -------------pfn-------------->
+ * N0 | N1 | N2 | N0 | N1 | N2|....
+ */
+ if (pfn_to_nid(pfn) != nid)
+ continue;
+ if (init_section_page_cgroup(pfn, nid))
+ goto oom;
+ }
}
+ hotplug_memory_notifier(page_cgroup_callback, 0);
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
" want memory cgroups\n");
+ return;
+oom:
+ printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+ panic("Out of memory");
}

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
--
1.7.4.1

2011-06-13 03:16:57

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH 3/5] memcg: clear mm->owner when last possible owner leaves

This is Hugh's version.
==
>From c3d1fb5637dd01ae034cdf2106aaf3249856738e Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Mon, 13 Jun 2011 10:26:29 +0900
Subject: [PATCH 3/5] [BUGFIX] memcg: clear mm->owner when last possible owner leaves

The following crash was reported:

> Call Trace:
> [<ffffffff81139792>] mem_cgroup_from_task+0x15/0x17
> [<ffffffff8113a75a>] __mem_cgroup_try_charge+0x148/0x4b4
> [<ffffffff810493f3>] ? need_resched+0x23/0x2d
> [<ffffffff814cbf43>] ? preempt_schedule+0x46/0x4f
> [<ffffffff8113afe8>] mem_cgroup_charge_common+0x9a/0xce
> [<ffffffff8113b6d1>] mem_cgroup_newpage_charge+0x5d/0x5f
> [<ffffffff81134024>] khugepaged+0x5da/0xfaf
> [<ffffffff81078ea0>] ? __init_waitqueue_head+0x4b/0x4b
> [<ffffffff81133a4a>] ? add_mm_counter.constprop.5+0x13/0x13
> [<ffffffff81078625>] kthread+0xa8/0xb0
> [<ffffffff814d13e8>] ? sub_preempt_count+0xa1/0xb4
> [<ffffffff814d5664>] kernel_thread_helper+0x4/0x10
> [<ffffffff814ce858>] ? retint_restore_args+0x13/0x13
> [<ffffffff8107857d>] ? __init_kthread_worker+0x5a/0x5a

What happens is that khugepaged tries to charge a huge page against an
mm whose last possible owner has already exited, and the memory
controller crashes when the stale mm->owner is used to look up the
cgroup to charge.

mm->owner has never been set to NULL with the last owner going away,
but nobody cared until khugepaged came along.

Even then it wasn't a problem because the final mmput() on an mm was
forced to acquire and release mmap_sem in write-mode, preventing an
exiting owner from going away while the mmap_sem was held, and until
"692e0b3 mm: thp: optimize memcg charge in khugepaged", the memory
cgroup charge was protected by mmap_sem in read-mode.

Instead of going back to relying on the mmap_sem to enforce lifetime
of a task, this patch ensures that mm->owner is properly set to NULL
when the last possible owner is exiting, which the memory controller
can handle just fine.
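
For illustration only (the helper below is hypothetical, not part of
this patch or of mm/memcontrol.c): with mm->owner allowed to become
NULL, a charge-side lookup can simply bail out instead of dereferencing
a stale task, along these lines:

	/* hypothetical sketch: how a charger could handle a NULL mm->owner */
	static struct mem_cgroup *charge_target(struct mm_struct *mm)
	{
		struct mem_cgroup *mem = NULL;
		struct task_struct *owner;

		rcu_read_lock();
		owner = rcu_dereference(mm->owner);
		if (owner)
			mem = mem_cgroup_from_task(owner);
		rcu_read_unlock();

		return mem;	/* NULL: the last possible owner has already exited */
	}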

Reported-by: Hugh Dickins <[email protected]>
Reported-by: Dave Jones <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
kernel/exit.c | 31 +++++++++++++++----------------
1 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 20a4064..26c5feb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -561,29 +561,28 @@ void exit_files(struct task_struct *tsk)

#ifdef CONFIG_MM_OWNER
/*
- * Task p is exiting and it owned mm, lets find a new owner for it
+ * A task is exiting and if it owned an mm, let's find a new owner for it
*/
-static inline int
-mm_need_new_owner(struct mm_struct *mm, struct task_struct *p)
-{
- /*
- * If there are other users of the mm and the owner (us) is exiting
- * we need to find a new owner to take on the responsibility.
- */
- if (atomic_read(&mm->mm_users) <= 1)
- return 0;
- if (mm->owner != p)
- return 0;
- return 1;
-}
-
void mm_update_next_owner(struct mm_struct *mm)
{
struct task_struct *c, *g, *p = current;

retry:
- if (!mm_need_new_owner(mm, p))
+ /*
+ * If the exiting or execing task is not the owner, it's
+ * someone else's problem.
+ */
+ if (mm->owner != p)
return;
+ /*
+ * The current owner is exiting/execing and there are no other
+ * candidates. Do not leave the mm pointing to a possibly
+ * freed task structure.
+ */
+ if (atomic_read(&mm->mm_users) <= 1) {
+ mm->owner = NULL;
+ return;
+ }

read_lock(&tasklist_lock);
/*
--
1.7.4.1

2011-06-13 03:18:06

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH 4/5] memcg: fix wrong check of noswap with softlimit

>From 0a0358d300330a4ba86e39ea56ed63f1e4519dfd Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Mon, 13 Jun 2011 10:31:16 +0900
Subject: [PATCH 4/5] fix wrong check of noswap with softlimit

Hierarchical reclaim doesn't swap out if the memsw and resource limits
are the same (memsw_is_minimum == true), because we would hit the
mem+swap limit anyway (during hard limit reclaim).
When it comes to the soft limit, we shouldn't consider memsw_is_minimum
at all because it doesn't make much sense there. Either the soft limit
is below the hard limit and then we cannot hit the mem+swap limit, or
direct reclaim takes precedence.

Reviewed-by: Michal Hocko <[email protected]>
Acked-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ce05835..915c3f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1663,7 +1663,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;

/* If memsw_is_minimum==1, swap-out is of-no-use. */
- if (root_mem->memsw_is_minimum)
+ if (!check_soft && root_mem->memsw_is_minimum)
noswap = true;

while (1) {
--
1.7.4.1

2011-06-13 03:23:52

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

>From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Mon, 13 Jun 2011 11:25:43 +0900
Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency

For performance, the memory cgroup caches some "charge" from the
res_counter into a per-cpu cache. This works well, but because it's a
cache, it needs to be flushed in some cases. Typical cases are
1. when someone hits the limit.
2. when rmdir() is called and the charges need to be 0.

But "1" has a problem.

Recently, with large SMP machines, we see many kworker runs because of
flushing memcg's cache. The bad thing in the implementation is that the
drain code is called even if a cpu holds a cache for a memcg unrelated
to the memcg which hit the limit.

This patch does
A) check whether the percpu cache contains useful data or not.
B) check that no other asynchronous percpu draining is running.
C) not call the local cpu callback.
D) not call draining at softlimit reclaim.

(*) This patch does not change the conditions under which draining is
called for the hard limit.
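
In condensed form, checks A)-C) amount to the per-cpu loop below (a
sketch of the hunks that follow, ignoring the rcu locking around the
hierarchy check):

	int cpu, curcpu = raw_smp_processor_id();

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		struct mem_cgroup *mem = stock->cached;

		if (cpu == curcpu || !mem)	/* C) local cpu, A) empty cache */
			continue;
		if (mem != root_mem &&
		    (!root_mem->use_hierarchy ||
		     !css_is_ancestor(&mem->css, &root_mem->css)))
			continue;		/* A) cache of an unrelated memcg */
		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
			schedule_work_on(cpu, &stock->work);	/* B) */
	}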

When I run "cat 1Gfile > /dev/null" under a 300M-limit memcg:

[Before]
13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat
58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1
60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1
4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0
57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1
61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1
62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1
63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1

[After]
2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat
2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top
1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0

Acked-by: Daisuke Nishimura <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Changelog:
- cleaned up and add more comments.
- don't call at softlimit reclaim.
---
mm/memcontrol.c | 61 +++++++++++++++++++++++++++++++++++++++---------------
1 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 915c3f3..79f68ab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -359,7 +359,7 @@ enum charge_type {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
-static void drain_all_stock_async(void);
+static void drain_all_stock_async(struct mem_cgroup *mem);

static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1670,8 +1670,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
victim = mem_cgroup_select_victim(root_mem);
if (victim == root_mem) {
loop++;
- if (loop >= 1)
- drain_all_stock_async();
+ if (!check_soft && loop >= 1)
+ drain_all_stock_async(root_mem);
if (loop >= 2) {
/*
* If we have not been able to reclaim
@@ -1934,9 +1934,11 @@ struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
struct work_struct work;
+ unsigned long flags;
+#define FLUSHING_CACHED_CHARGE (0)
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
-static atomic_t memcg_drain_count;
+DEFINE_MUTEX(percpu_charge_mutex);

/*
* Try to consume stocked charge on this cpu. If success, one page is consumed
@@ -1984,6 +1986,7 @@ static void drain_local_stock(struct work_struct *dummy)
{
struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
drain_stock(stock);
+ clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
}

/*
@@ -2008,26 +2011,50 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
* expects some charges will be back to res_counter later but cannot wait for
* it.
*/
-static void drain_all_stock_async(void)
+static void drain_all_stock_async(struct mem_cgroup *root_mem)
{
- int cpu;
- /* This function is for scheduling "drain" in asynchronous way.
- * The result of "drain" is not directly handled by callers. Then,
- * if someone is calling drain, we don't have to call drain more.
- * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
- * there is a race. We just do loose check here.
+ int cpu, curcpu;
+ /*
+ * If someone calls draining, avoid adding more kworker runs.
*/
- if (atomic_read(&memcg_drain_count))
+ if (!mutex_trylock(&percpu_charge_mutex))
return;
/* Notify other cpus that system-wide "drain" is running */
- atomic_inc(&memcg_drain_count);
get_online_cpus();
+
+ /*
+ * get a hint for avoiding draining charges on the current cpu,
+ * which must be exhausted by our charging. But this is not
+ * required to be a precise check, We use raw_smp_processor_id()
+ * instead of getcpu()/putcpu().
+ */
+ curcpu = raw_smp_processor_id();
for_each_online_cpu(cpu) {
struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
- schedule_work_on(cpu, &stock->work);
+ struct mem_cgroup *mem;
+
+ if (cpu == curcpu)
+ continue;
+
+ mem = stock->cached;
+ if (!mem)
+ continue;
+ if (mem != root_mem) {
+ if (!root_mem->use_hierarchy)
+ continue;
+ /* check whether "mem" is under tree of "root_mem" */
+ rcu_read_lock();
+ if (!css_is_ancestor(&mem->css, &root_mem->css)) {
+ rcu_read_unlock();
+ continue;
+ }
+ rcu_read_unlock();
+ }
+ if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+ schedule_work_on(cpu, &stock->work);
}
put_online_cpus();
- atomic_dec(&memcg_drain_count);
+ mutex_unlock(&percpu_charge_mutex);
/* We don't wait for flush_work */
}

@@ -2035,9 +2062,9 @@ static void drain_all_stock_async(void)
static void drain_all_stock_sync(void)
{
/* called when force_empty is called */
- atomic_inc(&memcg_drain_count);
+ mutex_lock(&percpu_charge_mutex);
schedule_on_each_cpu(drain_local_stock);
- atomic_dec(&memcg_drain_count);
+ mutex_unlock(&percpu_charge_mutex);
}

/*
--
1.7.4.1

2011-06-13 21:25:46

by Andrew Morton

Subject: Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

On Mon, 13 Jun 2011 12:16:48 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> >From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Mon, 13 Jun 2011 11:25:43 +0900
> Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
>
> For performance, memory cgroup caches some "charge" from res_counter
> into per cpu cache. This works well but because it's cache,
> it needs to be flushed in some cases. Typical cases are
> 1. when someone hit limit.
> 2. when rmdir() is called and need to charges to be 0.
>
> But "1" has problem.
>
> Recently, with large SMP machines, we see many kworker runs because
> of flushing memcg's cache. Bad things in implementation are
> that even if a cpu contains a cache for memcg not related to
> a memcg which hits limit, drain code is called.
>
> This patch does
> A) check percpu cache contains a useful data or not.
> B) check other asynchronous percpu draining doesn't run.
> C) don't call local cpu callback.
> D) don't call at softlimit reclaim.
>
>
> ...
>
> +DEFINE_MUTEX(percpu_charge_mutex);

I made this static. If we later wish to give it kernel-wide scope then
"percpu_charge_mutex" will not be a good choice of name.

2011-06-14 07:36:56

by Michal Hocko

Subject: Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

On Mon 13-06-11 12:16:48, KAMEZAWA Hiroyuki wrote:
> From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Mon, 13 Jun 2011 11:25:43 +0900
> Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
>
> For performance, memory cgroup caches some "charge" from res_counter
> into per cpu cache. This works well but because it's cache,
> it needs to be flushed in some cases. Typical cases are
> 1. when someone hit limit.
> 2. when rmdir() is called and need to charges to be 0.
>
> But "1" has problem.
>
> Recently, with large SMP machines, we see many kworker runs because
> of flushing memcg's cache. Bad things in implementation are
> that even if a cpu contains a cache for memcg not related to
> a memcg which hits limit, drain code is called.
>
> This patch does
> D) don't call at softlimit reclaim.

I think this needs some justification. The decision is not that
obvious IMO. I would say that this is a good decision because cached
charges will not help to free any memory (at least not directly) during
background reclaim. What about something like:
"
We are not draining per cpu cached charges during soft limit reclaim
because background reclaim doesn't care about charges. It tries to free
some memory, and draining charges will not free any.
Cached charges might influence only the selection of the biggest soft
limit offender, but as the call is done only after the selection has
already been made, it makes no difference.
"

Anyway, wouldn't it be better to have this change separate from the
async draining logic change?
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-14 09:42:55

by Johannes Weiner

Subject: Re: [BUGFIX][PATCH 1/5] memcg fix numa_stat permission

On Mon, Jun 13, 2011 at 12:03:01PM +0900, KAMEZAWA Hiroyuki wrote:
> This is already queued in mmotm.
> ==
> >From 3b1bec1d07ba21339697f069acab47dae6b81097 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Mon, 13 Jun 2011 09:32:28 +0900
> Subject: [PATCH 1/5] memcg fix numa_stat permission
>
> commit 406eb0c9b ("memcg: add memory.numastat api for numa statistics")
> adds memory.numa_stat file for memory cgroup. But the file permissions
> are wrong.
>
> [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
> ---------- 1 root root 0 Jun 9 18:36 /cgroup/memory/A/memory.numa_stat
>
> This patch fixes the permission as
>
> [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
> -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat
>
> Acked-by: Ying Han <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2011-06-14 09:45:31

by Johannes Weiner

Subject: Re: [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem

On Mon, Jun 13, 2011 at 12:06:08PM +0900, KAMEZAWA Hiroyuki wrote:
> added some clean ups.
> ==
> >From 45b8881201f73015c4e0352eb7e434f6e9c53fdd Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Mon, 13 Jun 2011 10:09:17 +0900
> Subject: [PATCH 2/5] [BUGFIX] memcg: fix init_page_cgroup nid with sparsemem
>
>
> commit 21a3c96 makes page_cgroup allocation as NUMA aware. But that caused
> a problem https://bugzilla.kernel.org/show_bug.cgi?id=36192.
>
> The problem was getting a NID from invalid struct pages, which was not
> initialized because it was out-of-node, out of [node_start_pfn, node_end_pfn)
>
> Now, with sparsemem, page_cgroup_init scans pfn from 0 to max_pfn.
> But this may scan a pfn which is not on any node and can access
> memmap which is not initialized.
>
> This makes page_cgroup_init() for SPARSEMEM node aware and remove
> a code to get nid from page->flags. (Then, we'll use valid NID
> always.)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2011-06-14 09:46:21

by Johannes Weiner

Subject: Re: [BUGFIX][PATCH 3/5] memcg: clear mm->owner when last possible owner leaves

On Mon, Jun 13, 2011 at 12:09:51PM +0900, KAMEZAWA Hiroyuki wrote:
> This is Hugh's version.

I approve of Hugh's fixes :-) Thanks

2011-06-14 09:49:04

by Johannes Weiner

Subject: Re: [BUGFIX][PATCH 4/5] memcg: fix wrong check of noswap with softlimit

On Mon, Jun 13, 2011 at 12:11:05PM +0900, KAMEZAWA Hiroyuki wrote:
> >From 0a0358d300330a4ba86e39ea56ed63f1e4519dfd Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Mon, 13 Jun 2011 10:31:16 +0900
> Subject: [PATCH 4/5] fix wrong check of noswap with softlimit
>
> Hierarchical reclaim doesn't swap out if memsw and resource limits are
> same (memsw_is_minimum == true) because we would hit mem+swap limit
> anyway (during hard limit reclaim).
> If it comes to the solft limit we shouldn't consider memsw_is_minimum at
> all because it doesn't make much sense. Either the soft limit is bellow
> the hard limit and then we cannot hit mem+swap limit or the direct
> reclaim takes a precedence.
>
> Reviewed-by: Michal Hocko <[email protected]>
> Acked-by: Daisuke Nishimura <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2011-06-14 10:05:05

by Johannes Weiner

Subject: Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

On Mon, Jun 13, 2011 at 12:16:48PM +0900, KAMEZAWA Hiroyuki wrote:
> @@ -1670,8 +1670,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> victim = mem_cgroup_select_victim(root_mem);
> if (victim == root_mem) {
> loop++;
> - if (loop >= 1)
> - drain_all_stock_async();
> + if (!check_soft && loop >= 1)
> + drain_all_stock_async(root_mem);

I agree with Michal, this should be a separate change.

> @@ -2008,26 +2011,50 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
> * expects some charges will be back to res_counter later but cannot wait for
> * it.
> */
> -static void drain_all_stock_async(void)
> +static void drain_all_stock_async(struct mem_cgroup *root_mem)
> {
> - int cpu;
> - /* This function is for scheduling "drain" in asynchronous way.
> - * The result of "drain" is not directly handled by callers. Then,
> - * if someone is calling drain, we don't have to call drain more.
> - * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
> - * there is a race. We just do loose check here.
> + int cpu, curcpu;
> + /*
> + * If someone calls draining, avoid adding more kworker runs.
> */
> - if (atomic_read(&memcg_drain_count))
> + if (!mutex_trylock(&percpu_charge_mutex))
> return;
> /* Notify other cpus that system-wide "drain" is running */
> - atomic_inc(&memcg_drain_count);
> get_online_cpus();
> +
> + /*
> + * get a hint for avoiding draining charges on the current cpu,
> + * which must be exhausted by our charging. But this is not
> + * required to be a precise check, We use raw_smp_processor_id()
> + * instead of getcpu()/putcpu().
> + */
> + curcpu = raw_smp_processor_id();
> for_each_online_cpu(cpu) {
> struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> - schedule_work_on(cpu, &stock->work);
> + struct mem_cgroup *mem;
> +
> + if (cpu == curcpu)
> + continue;
> +
> + mem = stock->cached;
> + if (!mem)
> + continue;
> + if (mem != root_mem) {
> + if (!root_mem->use_hierarchy)
> + continue;
> + /* check whether "mem" is under tree of "root_mem" */
> + rcu_read_lock();
> + if (!css_is_ancestor(&mem->css, &root_mem->css)) {
> + rcu_read_unlock();
> + continue;
> + }
> + rcu_read_unlock();

css_is_ancestor() takes the rcu read lock itself already.

2011-06-15 00:17:18

by Kamezawa Hiroyuki

Subject: Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

On Mon, 13 Jun 2011 14:25:01 -0700
Andrew Morton <[email protected]> wrote:

> On Mon, 13 Jun 2011 12:16:48 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > >From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Mon, 13 Jun 2011 11:25:43 +0900
> > Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> >
> > For performance, memory cgroup caches some "charge" from res_counter
> > into per cpu cache. This works well but because it's cache,
> > it needs to be flushed in some cases. Typical cases are
> > 1. when someone hit limit.
> > 2. when rmdir() is called and need to charges to be 0.
> >
> > But "1" has problem.
> >
> > Recently, with large SMP machines, we see many kworker runs because
> > of flushing memcg's cache. Bad things in implementation are
> > that even if a cpu contains a cache for memcg not related to
> > a memcg which hits limit, drain code is called.
> >
> > This patch does
> > A) check percpu cache contains a useful data or not.
> > B) check other asynchronous percpu draining doesn't run.
> > C) don't call local cpu callback.
> > D) don't call at softlimit reclaim.
> >
> >
> > ...
> >
> > +DEFINE_MUTEX(percpu_charge_mutex);
>
> I made this static. If we later wish to give it kernel-wide scope then
> "percpu_charge_mutex" will not be a good choice of name.

Thank you.
And, yes..... memcg_cached_charge_mutex ?

Thanks,
-Kame

2011-06-15 00:19:48

by Kamezawa Hiroyuki

Subject: Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

On Tue, 14 Jun 2011 09:36:51 +0200
Michal Hocko <[email protected]> wrote:

> On Mon 13-06-11 12:16:48, KAMEZAWA Hiroyuki wrote:
> > From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Mon, 13 Jun 2011 11:25:43 +0900
> > Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> >
> > For performance, memory cgroup caches some "charge" from res_counter
> > into per cpu cache. This works well but because it's cache,
> > it needs to be flushed in some cases. Typical cases are
> > 1. when someone hit limit.
> > 2. when rmdir() is called and need to charges to be 0.
> >
> > But "1" has problem.
> >
> > Recently, with large SMP machines, we see many kworker runs because
> > of flushing memcg's cache. Bad things in implementation are
> > that even if a cpu contains a cache for memcg not related to
> > a memcg which hits limit, drain code is called.
> >
> > This patch does
> > D) don't call at softlimit reclaim.
>
> I think this needs some justification. The decision is not that
> obvious IMO. I would say that this is a good decision because cached
> charges will not help to free any memory (at least not directly) during
> background reclaim. What about something like:
> "
> We are not draining per cpu cached charges during soft limit reclaim
> because background reclaim doesn't care about charges. It tries to free
> some memory and charges will not give any.
> Cached charges might influence only selection of the biggest soft limit
> offender but as the call is done only after the selection has been
> already done it makes no change.
> "
>
> Anyway, wouldn't it be better to have this change separate from the
> async draining logic change?

Hmm. I think calling "draining" at softlimit is just a bug.

Thanks,
-Kame

2011-06-15 00:23:54

by Kamezawa Hiroyuki

Subject: Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

On Tue, 14 Jun 2011 12:04:12 +0200
Johannes Weiner <[email protected]> wrote:

> On Mon, Jun 13, 2011 at 12:16:48PM +0900, KAMEZAWA Hiroyuki wrote:
> > @@ -1670,8 +1670,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > victim = mem_cgroup_select_victim(root_mem);
> > if (victim == root_mem) {
> > loop++;
> > - if (loop >= 1)
> > - drain_all_stock_async();
> > + if (!check_soft && loop >= 1)
> > + drain_all_stock_async(root_mem);
>
> I agree with Michal, this should be a separate change.
>

Hm, OK, I'll do that.

> > @@ -2008,26 +2011,50 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
> > * expects some charges will be back to res_counter later but cannot wait for
> > * it.
> > */
> > -static void drain_all_stock_async(void)
> > +static void drain_all_stock_async(struct mem_cgroup *root_mem)
> > {
> > - int cpu;
> > - /* This function is for scheduling "drain" in asynchronous way.
> > - * The result of "drain" is not directly handled by callers. Then,
> > - * if someone is calling drain, we don't have to call drain more.
> > - * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
> > - * there is a race. We just do loose check here.
> > + int cpu, curcpu;
> > + /*
> > + * If someone calls draining, avoid adding more kworker runs.
> > */
> > - if (atomic_read(&memcg_drain_count))
> > + if (!mutex_trylock(&percpu_charge_mutex))
> > return;
> > /* Notify other cpus that system-wide "drain" is running */
> > - atomic_inc(&memcg_drain_count);
> > get_online_cpus();
> > +
> > + /*
> > + * get a hint for avoiding draining charges on the current cpu,
> > + * which must be exhausted by our charging. But this is not
> > + * required to be a precise check, We use raw_smp_processor_id()
> > + * instead of getcpu()/putcpu().
> > + */
> > + curcpu = raw_smp_processor_id();
> > for_each_online_cpu(cpu) {
> > struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > - schedule_work_on(cpu, &stock->work);
> > + struct mem_cgroup *mem;
> > +
> > + if (cpu == curcpu)
> > + continue;
> > +
> > + mem = stock->cached;
> > + if (!mem)
> > + continue;
> > + if (mem != root_mem) {
> > + if (!root_mem->use_hierarchy)
> > + continue;
> > + /* check whether "mem" is under tree of "root_mem" */
> > + rcu_read_lock();
> > + if (!css_is_ancestor(&mem->css, &root_mem->css)) {
> > + rcu_read_unlock();
> > + continue;
> > + }
> > + rcu_read_unlock();
>
> css_is_ancestor() takes the rcu read lock itself already.
>

you're right.

I'll post an update.

Thanks,
-Kame

2011-06-15 01:19:02

by Kamezawa Hiroyuki

Subject: Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

On Wed, 15 Jun 2011 09:12:45 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 14 Jun 2011 09:36:51 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Mon 13-06-11 12:16:48, KAMEZAWA Hiroyuki wrote:
> > > From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> > > From: KAMEZAWA Hiroyuki <[email protected]>
> > > Date: Mon, 13 Jun 2011 11:25:43 +0900
> > > Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> > >
> > > For performance, memory cgroup caches some "charge" from res_counter
> > > into per cpu cache. This works well but because it's cache,
> > > it needs to be flushed in some cases. Typical cases are
> > > 1. when someone hit limit.
> > > 2. when rmdir() is called and need to charges to be 0.
> > >
> > > But "1" has problem.
> > >
> > > Recently, with large SMP machines, we see many kworker runs because
> > > of flushing memcg's cache. Bad things in implementation are
> > > that even if a cpu contains a cache for memcg not related to
> > > a memcg which hits limit, drain code is called.
> > >
> > > This patch does
> > > D) don't call at softlimit reclaim.
> >
> > I think this needs some justification. The decision is not that
> > obvious IMO. I would say that this is a good decision because cached
> > charges will not help to free any memory (at least not directly) during
> > background reclaim. What about something like:
> > "
> > We are not draining per cpu cached charges during soft limit reclaim
> > because background reclaim doesn't care about charges. It tries to free
> > some memory and charges will not give any.
> > Cached charges might influence only selection of the biggest soft limit
> > offender but as the call is done only after the selection has been
> > already done it makes no change.
> > "
> >
> > Anyway, wouldn't it be better to have this change separate from the
> > async draining logic change?
>
> Hmm. I think calling "draining" at softlimit is just a bug.
>
I'll divide the patches.

Thanks,
-kame

2011-06-15 01:20:21

by Andrew Morton

Subject: Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

On Wed, 15 Jun 2011 09:09:35 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:

> > I made this static. If we later wish to give it kernel-wide scope then
> > "percpu_charge_mutex" will not be a good choice of name.
>
> Thank you.
> And, yes..... memcg_cached_charge_mutex ?

I don't see a need to rename the mutex unless we change it to have
kernel-wide scope. Given that it's presently static to memcontrol.c,
it's obviously part of memcg ;)

2011-06-15 01:56:45

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH v6] memcg: fix percpu cached charge draining frequency

This is a replacement for
memcg-fix-percpu-cached-charge-draining-frequency.patch
+
memcg-fix-percpu-cached-charge-draining-frequency-fix.patch


Changelog:
- removed unnecessary rcu_read_lock()
- removed a fix for softlimit case (move to another independent patch)
- make mutex static.
- applied comment updates from Andrew Morton.

A patch for softlimit will follow this.

==
>From f3f41b827d70142858ba8b370510a82d608870d0 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Wed, 15 Jun 2011 10:39:57 +0900
Subject: [PATCH 5/6] memcg: fix behavior of per cpu charge cache draining.

For performance, the memory cgroup caches some "charge" from the
res_counter into a per-cpu cache. This works well, but because it's a
cache, it needs to be flushed in some cases. Typical cases are
1. when someone hits the limit.
2. when rmdir() is called and the charges need to be 0.

But "1" has a problem.

Recently, with large SMP machines, we see many kworker runs because of
flushing memcg's cache. The bad thing in the implementation is that the
drain code is called even if a cpu holds a cache for a memcg unrelated
to the memcg which hit the limit.

This patch does
A) check whether the percpu cache contains useful data or not.
B) check that no other asynchronous percpu draining is running.
C) not call the local cpu callback.

(*) This patch does not change the conditions under which draining is
called for the hard limit.

When I run "cat 1Gfile > /dev/null" under a 300M-limit memcg:

[Before]
13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat
58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1
60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1
4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0
57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1
61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1
62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1
63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1

[After]
2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat
2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top
1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0

Acked-by: Daisuke Nishimura <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Changelog:
- removed unnecessary rcu_read_lock()
- removed a fix for softlimit case (move to another independent patch)
- make mutex static.
- applied comment updates from Andrew Morton.
---
mm/memcontrol.c | 54 ++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 915c3f3..8fb29de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -359,7 +359,7 @@ enum charge_type {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
-static void drain_all_stock_async(void);
+static void drain_all_stock_async(struct mem_cgroup *mem);

static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1671,7 +1671,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
if (victim == root_mem) {
loop++;
if (loop >= 1)
- drain_all_stock_async();
+ drain_all_stock_async(root_mem);
if (loop >= 2) {
/*
* If we have not been able to reclaim
@@ -1934,9 +1934,11 @@ struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
struct work_struct work;
+ unsigned long flags;
+#define FLUSHING_CACHED_CHARGE (0)
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
-static atomic_t memcg_drain_count;
+static DEFINE_MUTEX(percpu_charge_mutex);

/*
* Try to consume stocked charge on this cpu. If success, one page is consumed
@@ -1984,6 +1986,7 @@ static void drain_local_stock(struct work_struct *dummy)
{
struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
drain_stock(stock);
+ clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
}

/*
@@ -2008,26 +2011,45 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
* expects some charges will be back to res_counter later but cannot wait for
* it.
*/
-static void drain_all_stock_async(void)
+static void drain_all_stock_async(struct mem_cgroup *root_mem)
{
- int cpu;
- /* This function is for scheduling "drain" in asynchronous way.
- * The result of "drain" is not directly handled by callers. Then,
- * if someone is calling drain, we don't have to call drain more.
- * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
- * there is a race. We just do loose check here.
+ int cpu, curcpu;
+ /*
+ * If someone calls draining, avoid adding more kworker runs.
*/
- if (atomic_read(&memcg_drain_count))
+ if (!mutex_trylock(&percpu_charge_mutex))
return;
/* Notify other cpus that system-wide "drain" is running */
- atomic_inc(&memcg_drain_count);
get_online_cpus();
+ /*
+ * Get a hint for avoiding draining charges on the current cpu,
+ * which must be exhausted by our charging. It is not required that
+ * this be a precise check, so we use raw_smp_processor_id() instead of
+ * getcpu()/putcpu().
+ */
+ curcpu = raw_smp_processor_id();
for_each_online_cpu(cpu) {
struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
- schedule_work_on(cpu, &stock->work);
+ struct mem_cgroup *mem;
+
+ if (cpu == curcpu)
+ continue;
+
+ mem = stock->cached;
+ if (!mem)
+ continue;
+ if (mem != root_mem) {
+ if (!root_mem->use_hierarchy)
+ continue;
+ /* check whether "mem" is under tree of "root_mem" */
+ if (!css_is_ancestor(&mem->css, &root_mem->css))
+ continue;
+ }
+ if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+ schedule_work_on(cpu, &stock->work);
}
put_online_cpus();
- atomic_dec(&memcg_drain_count);
+ mutex_unlock(&percpu_charge_mutex);
/* We don't wait for flush_work */
}

@@ -2035,9 +2057,9 @@ static void drain_all_stock_async(void)
static void drain_all_stock_sync(void)
{
/* called when force_empty is called */
- atomic_inc(&memcg_drain_count);
+ mutex_lock(&percpu_charge_mutex);
schedule_on_each_cpu(drain_local_stock);
- atomic_dec(&memcg_drain_count);
+ mutex_unlock(&percpu_charge_mutex);
}

/*
--
1.7.4.1

2011-06-15 01:59:19

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH v2] memcg: avoid percpu cached charge draining at softlimit

Based on Michal Hocko's comment.

=
>From 5d1121e4e5425187a58025cb63be8e297a97f624 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Wed, 15 Jun 2011 10:44:29 +0900
Subject: [PATCH 6/6] memcg: avoid percpu cached charge draining at softlimit.

We are not draining per cpu cached charges during soft limit reclaim
because background reclaim doesn't care about charges. It tries to free
some memory, and draining charges will not free any.
Cached charges might influence only the selection of the biggest soft
limit offender, but as the call is done only after the selection has
already been made, it makes no difference.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8fb29de..fa5c918 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1670,7 +1670,13 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
victim = mem_cgroup_select_victim(root_mem);
if (victim == root_mem) {
loop++;
- if (loop >= 1)
+ /*
+ * We are not draining per cpu cached charges during
+ * soft limit reclaim because global reclaim doesn't
+ * care about charges. It tries to free some memory, and
+ * draining charges will not free any.
+ */
+ if (!check_soft && loop >= 1)
drain_all_stock_async(root_mem);
if (loop >= 2) {
/*
--
1.7.4.1

2011-06-15 05:30:20

by Ying Han

Subject: Re: [BUGFIX][PATCH v6] memcg: fix percpu cached charge draining frequency

On Tue, Jun 14, 2011 at 6:49 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> This is a replacement for
> memcg-fix-percpu-cached-charge-draining-frequency.patch
> +
> memcg-fix-percpu-cached-charge-draining-frequency-fix.patch
>
>
> Changelog:
> - removed unnecessary rcu_read_lock()
> - removed a fix for softlimit case (move to another independent patch)
> - make mutex static.
> - applied comment updates from Andrew Morton.
>
> A patch for softlimit will follow this.
>

Tested-by: Ying Han <[email protected]>


--Ying

2011-06-15 07:15:06

by Michal Hocko

Subject: Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency

On Wed 15-06-11 10:12:02, KAMEZAWA Hiroyuki wrote:
> On Wed, 15 Jun 2011 09:12:45 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Tue, 14 Jun 2011 09:36:51 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Mon 13-06-11 12:16:48, KAMEZAWA Hiroyuki wrote:
> > > > From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> > > > From: KAMEZAWA Hiroyuki <[email protected]>
> > > > Date: Mon, 13 Jun 2011 11:25:43 +0900
> > > > Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> > > >
> > > > For performance, memory cgroup caches some "charge" from res_counter
> > > > into per cpu cache. This works well but because it's cache,
> > > > it needs to be flushed in some cases. Typical cases are
> > > > 1. when someone hit limit.
> > > > 2. when rmdir() is called and need to charges to be 0.
> > > >
> > > > But "1" has problem.
> > > >
> > > > Recently, with large SMP machines, we see many kworker runs because
> > > > of flushing memcg's cache. Bad things in implementation are
> > > > that even if a cpu contains a cache for memcg not related to
> > > > a memcg which hits limit, drain code is called.
> > > >
> > > > This patch does
> > > > D) don't call at softlimit reclaim.
> > >
> > > I think this needs some justification. The decision is not that
> > > obvious IMO. I would say that this is a good decision because cached
> > > charges will not help to free any memory (at least not directly) during
> > > background reclaim. What about something like:
> > > "
> > > We are not draining per cpu cached charges during soft limit reclaim
> > > because background reclaim doesn't care about charges. It tries to free
> > > some memory and charges will not give any.
> > > Cached charges might influence only selection of the biggest soft limit
> > > offender but as the call is done only after the selection has been
> > > already done it makes no change.
> > > "
> > >
> > > Anyway, wouldn't it be better to have this change separate from the
> > > async draining logic change?
> >
> > Hmm. I think calling "draining" at softlimit is just a bug.
> >
> I'll divide patches.

Thanks.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-16 10:10:06

by Ingo Molnar

Subject: [-git build bug, PATCH] Re: [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem


* KAMEZAWA Hiroyuki <[email protected]> wrote:

> Date: Mon, 13 Jun 2011 10:09:17 +0900
> Subject: [PATCH 2/5] [BUGFIX] memcg: fix init_page_cgroup nid with sparsemem

This fresh upstream commit:

37573e8c7182: memcg: fix init_page_cgroup nid with sparsemem

is causing widespread build failures on latest -git, on x86:

mm/page_cgroup.c:308:3: error: implicit declaration of function ‘node_start_pfn’ [-Werror=implicit-function-declaration]
mm/page_cgroup.c:309:3: error: implicit declaration of function ‘node_end_pfn’ [-Werror=implicit-function-declaration]

On any config that has CONFIG_CGROUP_MEM_RES_CTLR=y enabled but
CONFIG_NUMA disabled.

For now I've worked around it with the patch below, but the real
solution would be to make the page_cgroup.c code not depend on NUMA
(a sketch of that direction follows the workaround below).

Thanks,

Ingo

diff --git a/init/Kconfig b/init/Kconfig
index 412c21b..1593be9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -639,6 +639,7 @@ config RESOURCE_COUNTERS
config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
depends on RESOURCE_COUNTERS
+ depends on NUMA
select MM_OWNER
help
Provides a memory resource controller that manages both anonymous
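
A minimal sketch of that direction (an assumption on how it could look,
not the merged fix): give !NUMA builds generic pgdat-backed helpers so
mm/page_cgroup.c compiles regardless of CONFIG_NUMA:

	/* sketch only: NODE_DATA(0) exists on !NUMA, so back both helpers with it */
	#ifndef CONFIG_NUMA
	#define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
	#define node_end_pfn(nid)	(NODE_DATA(nid)->node_start_pfn + \
					 NODE_DATA(nid)->node_spanned_pages)
	#endif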

2011-06-16 13:01:28

by Hiroyuki Kamezawa

Subject: Re: [-git build bug, PATCH] Re: [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem

2011/6/16 Ingo Molnar <[email protected]>:
>
> * KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> Date: Mon, 13 Jun 2011 10:09:17 +0900
>> Subject: [PATCH 2/5] [BUGFIX] memcg: fix init_page_cgroup nid with sparsemem
>
> This fresh upstream commit commit:
>
> 37573e8c7182: memcg: fix init_page_cgroup nid with sparsemem
>
> is causing widespread build failures on latest -git, on x86:
>
> mm/page_cgroup.c:308:3: error: implicit declaration of function ‘node_start_pfn’ [-Werror=implicit-function-declaration]
> mm/page_cgroup.c:309:3: error: implicit declaration of function ‘node_end_pfn’ [-Werror=implicit-function-declaration]
>
> On any config that has CONFIG_CGROUP_MEM_RES_CTLR=y enabled but
> CONFIG_NUMA disabled.
>
> For now i've worked it around with the patch below, but the real
> solution would be to make the page_cgroup.c code not depend on NUMA.
>
> Thanks,
>
> Ingo
>
Yes, very sorry. I'm now preparing a fix in this thread.

http://marc.info/?t=130819986800002&r=1&w=2

I think I'll be able to post a final fix tomorrow. I'll cc you when I post it.
Thanks,
-Kame