2009-01-08 10:12:22

by Daisuke Nishimura

Subject: [RFC][PATCH 0/4] some memcg fixes

Hi.

These are patches that I've been testing.

They survived my tests (rmdir after task move under memory pressure,
and page migration) without big problems (except oom) for hours,
in both the use_hierarchy==0 and use_hierarchy==1 cases.

I want them to go into 2.6.29.

They are based on mmotm-2009-01-05-12-50.

[1/4] fix for mem_cgroup_get_reclaim_stat_from_page

[2/4] fix error path of mem_cgroup_move_parent

[3/4] fix for mem_cgroup_hierarchical_reclaim

[4/4] make oom less frequently


I think 1 and 2 are ok.
I'll update them based on comments.


Thanks,
Daisuke Nishimura.


2009-01-08 10:17:30

by Daisuke Nishimura

Subject: [RFC][PATCH 1/4] memcg: fix for mem_cgroup_get_reclaim_stat_from_page

In the case of swapin, a new page is added to the lru before it is charged,
so the page's pc->mem_cgroup points to NULL or to the last mem_cgroup the
page was charged to before.

In the latter case, if that mem_cgroup has already been freed by rmdir,
the area pointed to by pc->mem_cgroup may contain invalid data.

Actually, I saw a general protection fault.

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map
CPU 4
Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_region_hash dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac lp sg ide_cd_mod cdrom button serio_raw acpi_memhotplug parport_pc e1000 rtc_cmos parport rtc_core rtc_lib i2c_i801 i2c_core shpchp pcspkr ata_piix libata megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
Pid: 26038, comm: page01 Tainted: G W 2.6.28-rc9-mm1-mmotm-2008-12-22-16-14-f2ab3dea #1
RIP: 0010:[<ffffffff8028e710>] [<ffffffff8028e710>] update_page_reclaim_stat+0x2f/0x42
RSP: 0000:ffff8801ee457da8 EFLAGS: 00010002
RAX: 32353438312021c8 RBX: 0000000000000000 RCX: 32353438312021c8
RDX: 0000000000000000 RSI: ffff8800cb0b1000 RDI: ffff8801164d1d28
RBP: ffff880110002cb8 R08: ffff88010f2eae23 R09: 0000000000000001
R10: ffff8800bc514b00 R11: ffff880110002c00 R12: 0000000000000000
R13: ffff88000f484100 R14: 0000000000000003 R15: 00000000001200d2
FS: 00007f8a261726f0(0000) GS:ffff88010f2eaa80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f8a25d22000 CR3: 00000001ef18c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process page01 (pid: 26038, threadinfo ffff8801ee456000, task ffff8800b585b960)
Stack:
ffffe200071ee568 ffff880110001f00 0000000000000000 ffffffff8028ea17
ffff88000f484100 0000000000000000 0000000000000020 00007f8a25d22000
ffff8800bc514b00 ffffffff8028ec34 0000000000000000 0000000000016fd8
Call Trace:
[<ffffffff8028ea17>] ? ____pagevec_lru_add+0xc1/0x13c
[<ffffffff8028ec34>] ? drain_cpu_pagevecs+0x36/0x89
[<ffffffff802a4f8c>] ? swapin_readahead+0x78/0x98
[<ffffffff8029a37a>] ? handle_mm_fault+0x3d9/0x741
[<ffffffff804da654>] ? do_page_fault+0x3ce/0x78c
[<ffffffff804d7a42>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff804d860f>] ? page_fault+0x1f/0x30
Code: cc 55 48 8d af b8 0d 00 00 48 89 f7 53 89 d3 e8 39 85 02 00 48 63 d3 48 ff 44 d5 10 45 85 e4 74 05 48 ff 44 d5 00 48 85 c0 74 0e <48> ff 44 d0 10 45 85 e4 74 04 48 ff 04 d0 5b 5d 41 5c c3 41 54
RIP [<ffffffff8028e710>] update_page_reclaim_stat+0x2f/0x42
RSP <ffff8801ee457da8>


Signed-off-by: Daisuke Nishimura <[email protected]>
---
mm/memcontrol.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2996b8..62e69d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -559,6 +559,10 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
return NULL;

pc = lookup_page_cgroup(page);
+ smp_rmb();
+ if (!PageCgroupUsed(pc))
+ return NULL;
+
mz = page_cgroup_zoneinfo(pc);
if (!mz)
return NULL;

2009-01-08 10:18:03

by Daisuke Nishimura

Subject: [RFC][PATCH 2/4] memcg: fix error path of mem_cgroup_move_parent

There is a bug in the error path of mem_cgroup_move_parent.

The extra refcount taken by try_charge should be dropped, and the usage
counts incremented by try_charge should be decremented, in both error paths:

A: failure at get_page_unless_zero
B: failure at isolate_lru_page

This bug makes the parent directory unremovable.

In case A, rmdir doesn't return, because res.usage doesn't go
down to 0 in mem_cgroup_force_empty even after all the pc on the
lru are removed.
In case B, rmdir fails and returns -EBUSY, because extra ref
counts remain even after res.usage goes down to 0.


Signed-off-by: Daisuke Nishimura <[email protected]>
---
mm/memcontrol.c | 23 +++++++++++++++--------
1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 62e69d8..288e22c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -983,14 +983,15 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
if (pc->mem_cgroup != from)
goto out;

- css_put(&from->css);
res_counter_uncharge(&from->res, PAGE_SIZE);
mem_cgroup_charge_statistics(from, pc, false);
if (do_swap_account)
res_counter_uncharge(&from->memsw, PAGE_SIZE);
+ css_put(&from->css);
+
+ css_get(&to->css);
pc->mem_cgroup = to;
mem_cgroup_charge_statistics(to, pc, true);
- css_get(&to->css);
ret = 0;
out:
unlock_page_cgroup(pc);
@@ -1023,8 +1024,10 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
if (ret || !parent)
return ret;

- if (!get_page_unless_zero(page))
- return -EBUSY;
+ if (!get_page_unless_zero(page)) {
+ ret = -EBUSY;
+ goto uncharge;
+ }

ret = isolate_lru_page(page);

@@ -1033,19 +1036,23 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,

ret = mem_cgroup_move_account(pc, child, parent);

- /* drop extra refcnt by try_charge() (move_account increment one) */
- css_put(&parent->css);
putback_lru_page(page);
if (!ret) {
put_page(page);
+ /* drop extra refcnt by try_charge() */
+ css_put(&parent->css);
return 0;
}
- /* uncharge if move fails */
+
cancel:
+ put_page(page);
+uncharge:
+ /* drop extra refcnt by try_charge() */
+ css_put(&parent->css);
+ /* uncharge if move fails */
res_counter_uncharge(&parent->res, PAGE_SIZE);
if (do_swap_account)
res_counter_uncharge(&parent->memsw, PAGE_SIZE);
- put_page(page);
return ret;
}

2009-01-08 10:18:37

by Daisuke Nishimura

Subject: [RFC][PATCH 3/4] memcg: fix for mem_cgroup_hierarchical_reclaim

If root_mem has no children, last_scanned_child is set to root_mem itself.
But after some children are added to root_mem, mem_cgroup_get_next_node can
mem_cgroup_put() root_mem even though root_mem was never mem_cgroup_get()'d.

This patch fixes this behavior by:
- Setting last_scanned_child to NULL if root_mem has no children or if the
DFS search has returned to root_mem itself (root_mem is not a "child" of
root_mem). mem_cgroup_get_first_node returns root_mem in this case.
There is no mem_cgroup_get/put for root_mem.
- Renaming mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
mem_cgroup_get_first_node to mem_cgroup_get_next_node.
mem_cgroup_hierarchical_reclaim now calls only the new mem_cgroup_get_next_node.


Signed-off-by: Daisuke Nishimura <[email protected]>
---
mm/memcontrol.c | 37 +++++++++++++++++++++----------------
1 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 288e22c..dc38a0e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -622,7 +622,7 @@ unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
* called with hierarchy_mutex held
*/
static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
+__mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
{
struct cgroup *cgroup, *curr_cgroup, *root_cgroup;

@@ -644,8 +644,8 @@ mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
visit_parent:
if (curr_cgroup == root_cgroup) {
mem_cgroup_put(curr);
- curr = root_mem;
- mem_cgroup_get(curr);
+ /* caller handles NULL case */
+ curr = NULL;
goto done;
}

@@ -668,7 +668,6 @@ visit_parent:
goto visit_parent;

done:
- root_mem->last_scanned_child = curr;
return curr;
}

@@ -678,20 +677,29 @@ done:
* that to reclaim free pages from.
*/
static struct mem_cgroup *
-mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
+mem_cgroup_get_next_node(struct mem_cgroup *root_mem)
{
struct cgroup *cgroup;
struct mem_cgroup *ret;
bool obsolete;

- obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
-
/*
* Scan all children under the mem_cgroup mem
*/
mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
+
+ obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
+
if (list_empty(&root_mem->css.cgroup->children)) {
- ret = root_mem;
+ /*
+ * root_mem might have children before and last_scanned_child
+ * may point to one of them.
+ */
+ if (root_mem->last_scanned_child) {
+ VM_BUG_ON(!obsolete);
+ mem_cgroup_put(root_mem->last_scanned_child);
+ }
+ ret = NULL;
goto done;
}

@@ -705,13 +713,13 @@ mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
ret = mem_cgroup_from_cont(cgroup);
mem_cgroup_get(ret);
} else
- ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
+ ret = __mem_cgroup_get_next_node(root_mem->last_scanned_child,
root_mem);

done:
root_mem->last_scanned_child = ret;
mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
- return ret;
+ return (ret) ? ret : root_mem;
}

static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
@@ -769,21 +777,18 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
if (!root_mem->use_hierarchy)
return ret;

- next_mem = mem_cgroup_get_first_node(root_mem);
+ next_mem = mem_cgroup_get_next_node(root_mem);

while (next_mem != root_mem) {
if (mem_cgroup_is_obsolete(next_mem)) {
- mem_cgroup_put(next_mem);
- next_mem = mem_cgroup_get_first_node(root_mem);
+ next_mem = mem_cgroup_get_next_node(root_mem);
continue;
}
ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
get_swappiness(next_mem));
if (mem_cgroup_check_under_limit(root_mem))
return 0;
- mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
- next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
- mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
+ next_mem = mem_cgroup_get_next_node(root_mem);
}
return ret;
}

2009-01-08 10:19:13

by Daisuke Nishimura

Subject: [RFC][PATCH 4/4] memcg: make oom less frequently

In the previous implementation, mem_cgroup_try_charge checked the return
value of try_to_free_mem_cgroup_pages, and just retried if some pages
had been reclaimed.
But now, try_charge (and mem_cgroup_hierarchical_reclaim called from it)
only checks whether the usage is less than the limit.

This patch changes the behavior back to the previous one, to cause oom less frequently.

To prevent try_charge from getting stuck in an infinite loop,
MEM_CGROUP_RECLAIM_RETRIES_MAX is defined.


Signed-off-by: Daisuke Nishimura <[email protected]>
---
mm/memcontrol.c | 16 ++++++++++++----
1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 804c054..fedd76b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -42,6 +42,7 @@

struct cgroup_subsys mem_cgroup_subsys __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
+#define MEM_CGROUP_RECLAIM_RETRIES_MAX 32

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
/* Turned on only when memory cgroup is enabled && really_do_swap_account = 0 */
@@ -770,10 +771,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
* but there might be left over accounting, even after children
* have left.
*/
- ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
+ ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
get_swappiness(root_mem));
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
+ return 1; /* indicate reclaim has succeeded */
if (!root_mem->use_hierarchy)
return ret;

@@ -785,10 +786,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
next_mem = mem_cgroup_get_next_node(root_mem);
continue;
}
- ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
+ ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
get_swappiness(next_mem));
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
+ return 1; /* indicate reclaim has succeeded */
next_mem = mem_cgroup_get_next_node(root_mem);
}
return ret;
@@ -820,6 +821,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
{
struct mem_cgroup *mem, *mem_over_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+ int nr_retries_max = MEM_CGROUP_RECLAIM_RETRIES_MAX;
struct res_counter *fail_res;

if (unlikely(test_thread_flag(TIF_MEMDIE))) {
@@ -871,8 +873,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
if (!(gfp_mask & __GFP_WAIT))
goto nomem;

+ if (!nr_retries_max--)
+ goto oom;
+
ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
noswap);
+ if (ret)
+ continue;

/*
* try_to_free_mem_cgroup_pages() might not give us a full
@@ -886,6 +893,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
continue;

if (!nr_retries--) {
+oom:
if (oom) {
mutex_lock(&memcg_tasklist);
mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);

2009-01-08 10:59:35

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 1/4] memcg: fix for mem_cgroup_get_reclaim_stat_from_page

Daisuke Nishimura said:

> Call Trace:
> [<ffffffff8028ea17>] ? ____pagevec_lru_add+0xc1/0x13c
> [<ffffffff8028ec34>] ? drain_cpu_pagevecs+0x36/0x89
> [<ffffffff802a4f8c>] ? swapin_readahead+0x78/0x98
> [<ffffffff8029a37a>] ? handle_mm_fault+0x3d9/0x741
> [<ffffffff804da654>] ? do_page_fault+0x3ce/0x78c
> [<ffffffff804d7a42>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [<ffffffff804d860f>] ? page_fault+0x1f/0x30
> Code: cc 55 48 8d af b8 0d 00 00 48 89 f7 53 89 d3 e8 39 85 02 00 48 63 d3 48 ff 44 d5 10 45 85 e4 74 05 48 ff 44 d5 00 48 85 c0 74 0e <48> ff 44 d0 10 45 85 e4 74 04 48 ff 04 d0 5b 5d 41 5c c3 41 54
> RIP [<ffffffff8028e710>] update_page_reclaim_stat+0x2f/0x42
> RSP <ffff8801ee457da8>
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>
Yes. PageCgroupUsed() should be checked,
or
list_empty(&pc->lru) should be checked under zone->lock.
Your fix seems reasonable.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

> ---
> mm/memcontrol.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2996b8..62e69d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -559,6 +559,10 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> return NULL;
>
> pc = lookup_page_cgroup(page);
> + smp_rmb();
> + if (!PageCgroupUsed(pc))
> + return NULL;
> +
> mz = page_cgroup_zoneinfo(pc);
> if (!mz)
> return NULL;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-01-08 11:00:59

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 2/4] memcg: fix error path of mem_cgroup_move_parent

Daisuke Nishimura said:
> There is a bug in error path of mem_cgroup_move_parent.
>
> Extra refcnt got from try_charge should be dropped, and usages incremented
> by try_charge should be decremented in both error paths:
>
> A: failure at get_page_unless_zero
> B: failure at isolate_lru_page
>
> This bug makes this parent directory unremovable.
>
> In case of A, rmdir doesn't return, because res.usage doesn't go
> down to 0 at mem_cgroup_force_empty even after all the pc in
> lru are removed.
> In case of B, rmdir fails and returns -EBUSY, because it has
> extra ref counts even after res.usage goes down to 0.
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>
Thank you for catching this.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2009-01-08 11:08:18

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 3/4] memcg: fix for mem_cgroup_hierarchical_reclaim

Daisuke Nishimura said:
> If root_mem has no children, last_scanned_child is set to root_mem itself.
> But after some children are added to root_mem, mem_cgroup_get_next_node can
> mem_cgroup_put() root_mem even though root_mem was never mem_cgroup_get()'d.
>
> This patch fixes this behavior by:
> - Set last_scanned_child to NULL if root_mem has no children or DFS search
> has returned to root_mem itself(root_mem is not a "child" of root_mem).
> Make mem_cgroup_get_first_node return root_mem in this case.
> There are no mem_cgroup_get/put for root_mem.
> - Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
> mem_cgroup_get_first_node to mem_cgroup_get_next_node.
> Make mem_cgroup_hierarchical_reclaim call only new
> mem_cgroup_get_next_node.
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>

Hmm, this seems a necessary fix. Then, it's better to rebase my patch onto this.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

Maybe a simpler one can be written, but my patch removes all of this later anyway....

Thanks,
-Kame

> ---
> mm/memcontrol.c | 37 +++++++++++++++++++++----------------
> 1 files changed, 21 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 288e22c..dc38a0e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -622,7 +622,7 @@ unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> * called with hierarchy_mutex held
> */
> static struct mem_cgroup *
> -mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> +__mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> {
> struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
>
> @@ -644,8 +644,8 @@ mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> visit_parent:
> if (curr_cgroup == root_cgroup) {
> mem_cgroup_put(curr);
> - curr = root_mem;
> - mem_cgroup_get(curr);
> + /* caller handles NULL case */
> + curr = NULL;
> goto done;
> }
>
> @@ -668,7 +668,6 @@ visit_parent:
> goto visit_parent;
>
> done:
> - root_mem->last_scanned_child = curr;
> return curr;
> }
>
> @@ -678,20 +677,29 @@ done:
> * that to reclaim free pages from.
> */
> static struct mem_cgroup *
> -mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> +mem_cgroup_get_next_node(struct mem_cgroup *root_mem)
> {
> struct cgroup *cgroup;
> struct mem_cgroup *ret;
> bool obsolete;
>
> - obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
> -
> /*
> * Scan all children under the mem_cgroup mem
> */
> mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> +
> + obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
> +
> if (list_empty(&root_mem->css.cgroup->children)) {
> - ret = root_mem;
> + /*
> + * root_mem might have children before and last_scanned_child
> + * may point to one of them.
> + */
> + if (root_mem->last_scanned_child) {
> + VM_BUG_ON(!obsolete);
> + mem_cgroup_put(root_mem->last_scanned_child);
> + }
> + ret = NULL;
> goto done;
> }
>
> @@ -705,13 +713,13 @@ mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> ret = mem_cgroup_from_cont(cgroup);
> mem_cgroup_get(ret);
> } else
> - ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
> + ret = __mem_cgroup_get_next_node(root_mem->last_scanned_child,
> root_mem);
>
> done:
> root_mem->last_scanned_child = ret;
> mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> - return ret;
> + return (ret) ? ret : root_mem;
> }
>
> static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
> @@ -769,21 +777,18 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> if (!root_mem->use_hierarchy)
> return ret;
>
> - next_mem = mem_cgroup_get_first_node(root_mem);
> + next_mem = mem_cgroup_get_next_node(root_mem);
>
> while (next_mem != root_mem) {
> if (mem_cgroup_is_obsolete(next_mem)) {
> - mem_cgroup_put(next_mem);
> - next_mem = mem_cgroup_get_first_node(root_mem);
> + next_mem = mem_cgroup_get_next_node(root_mem);
> continue;
> }
> ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> get_swappiness(next_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> return 0;
> - mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> - next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
> - mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> + next_mem = mem_cgroup_get_next_node(root_mem);
> }
> return ret;
> }

2009-01-08 11:20:01

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 4/4] memcg: make oom less frequently

Daisuke Nishimura said:
> In the previous implementation, mem_cgroup_try_charge checked the return
> value of try_to_free_mem_cgroup_pages, and just retried if some pages
> had been reclaimed.
> But now, try_charge (and mem_cgroup_hierarchical_reclaim called from it)
> only checks whether the usage is less than the limit.
>
> This patch tries to change the behavior as before to cause oom less
> frequently.
>
> To prevent try_charge from getting stuck in infinite loop,
> MEM_CGROUP_RECLAIM_RETRIES_MAX is defined.
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>

I think this is a necessary change.
My version of hierarchy reclaim will do this.

But RETRIES_MAX is not clear ;) please use one counter.

And why MAX=32 ?
> + if (ret)
> + continue;
seems to do enough work.

While memory can be reclaimed, it's not a deadlock.
To handle the live-lock situation where "reclaimed memory is stolen very soon",
should we check signal_pending(current) or some flags ?

IMHO, using jiffies to detect how long we should retry is easy to understand
....like
"if memory charging cannot make progress for XXXX minutes,
trigger some notifier or show some flag to the user via the cgroupfs interface,
to show we're tooooooo busy."

-Kame


> ---
> mm/memcontrol.c | 16 ++++++++++++----
> 1 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 804c054..fedd76b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -42,6 +42,7 @@
>
> struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> #define MEM_CGROUP_RECLAIM_RETRIES 5
> +#define MEM_CGROUP_RECLAIM_RETRIES_MAX 32
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> /* Turned on only when memory cgroup is enabled && really_do_swap_account = 0 */
> @@ -770,10 +771,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> * but there might be left over accounting, even after children
> * have left.
> */
> - ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> + ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> get_swappiness(root_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> - return 0;
> + return 1; /* indicate reclaim has succeeded */
> if (!root_mem->use_hierarchy)
> return ret;
>
> @@ -785,10 +786,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> next_mem = mem_cgroup_get_next_node(root_mem);
> continue;
> }
> - ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> + ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> get_swappiness(next_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> - return 0;
> + return 1; /* indicate reclaim has succeeded */
> next_mem = mem_cgroup_get_next_node(root_mem);
> }
> return ret;
> @@ -820,6 +821,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> {
> struct mem_cgroup *mem, *mem_over_limit;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> + int nr_retries_max = MEM_CGROUP_RECLAIM_RETRIES_MAX;
> struct res_counter *fail_res;
>
> if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> @@ -871,8 +873,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> + if (!nr_retries_max--)
> + goto oom;
> +
> ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> noswap);
> + if (ret)
> + continue;
>
> /*
> * try_to_free_mem_cgroup_pages() might not give us a full
> @@ -886,6 +893,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> continue;
>
> if (!nr_retries--) {
> +oom:
> if (oom) {
> mutex_lock(&memcg_tasklist);
> mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);

2009-01-09 00:59:25

by Li Zefan

Subject: Re: [RFC][PATCH 1/4] memcg: fix for mem_cgroup_get_reclaim_stat_from_page

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2996b8..62e69d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -559,6 +559,10 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> return NULL;
>
> pc = lookup_page_cgroup(page);
> + smp_rmb();

It is better to add a comment to explain this smp_rmb. I think it's recommended
that every memory barrier has a comment.

> + if (!PageCgroupUsed(pc))
> + return NULL;
> +
> mz = page_cgroup_zoneinfo(pc);
> if (!mz)
> return NULL;
>
>

2009-01-09 01:06:45

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 1/4] memcg: fix for mem_cgroup_get_reclaim_stat_from_page

On Fri, 09 Jan 2009 08:57:59 +0800
Li Zefan <[email protected]> wrote:

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e2996b8..62e69d8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -559,6 +559,10 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> > return NULL;
> >
> > pc = lookup_page_cgroup(page);
> > + smp_rmb();
>
> It is better to add a comment to explain this smp_rmb. I think it's recommended
> that every memory barrier has a comment.
>
Ah, yes. good point.

Maybe text like this
/*
* Used bit is set without atomic ops but after smp_wmb().
* For making pc->mem_cgroup visible, insert smp_rmb() here.
*/

?
-Kame

2009-01-09 01:09:45

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 3/4] memcg: fix for mem_cgroup_hierarchical_reclaim

On Thu, 8 Jan 2009 20:08:01 +0900 (JST)
"KAMEZAWA Hiroyuki" <[email protected]> wrote:

> Daisuke Nishimura said:
> > If root_mem has no children, last_scanned_child is set to root_mem itself.
> > But after some children are added to root_mem, mem_cgroup_get_next_node can
> > mem_cgroup_put() root_mem even though root_mem was never mem_cgroup_get()'d.
> >
> > This patch fixes this behavior by:
> > - Set last_scanned_child to NULL if root_mem has no children or DFS search
> > has returned to root_mem itself(root_mem is not a "child" of root_mem).
> > Make mem_cgroup_get_first_node return root_mem in this case.
> > There are no mem_cgroup_get/put for root_mem.
> > - Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
> > mem_cgroup_get_first_node to mem_cgroup_get_next_node.
> > Make mem_cgroup_hierarchical_reclaim call only new
> > mem_cgroup_get_next_node.
> >
> >
> > Signed-off-by: Daisuke Nishimura <[email protected]>
>
> Hmm, seems necessary fix. Then, it's better to rebase my patch on to this.
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Maybe simpler one can be written but my patch remove all this out later....
>
How about this ? (just an example and not tested, sorry)



---
mm/memcontrol.c | 52 ++++++++++++++++++++++------------------------------
1 file changed, 22 insertions(+), 30 deletions(-)

Index: mmotm-2.6.28-Jan7/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Jan7.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Jan7/mm/memcontrol.c
@@ -621,6 +621,7 @@ static struct mem_cgroup *
mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
{
struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
+ struct mem_cgroup *orig = root_mem->last_scanned_child;

curr_cgroup = curr->css.cgroup;
root_cgroup = root_mem->css.cgroup;
@@ -629,19 +630,15 @@ mem_cgroup_get_next_node(struct mem_cgro
/*
* Walk down to children
*/
- mem_cgroup_put(curr);
cgroup = list_entry(curr_cgroup->children.next,
struct cgroup, sibling);
curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
goto done;
}

visit_parent:
if (curr_cgroup == root_cgroup) {
- mem_cgroup_put(curr);
curr = root_mem;
- mem_cgroup_get(curr);
goto done;
}

@@ -649,11 +646,9 @@ visit_parent:
* Goto next sibling
*/
if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
- mem_cgroup_put(curr);
cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
sibling);
curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
goto done;
}

@@ -664,7 +659,10 @@ visit_parent:
goto visit_parent;

done:
+ if (orig)
+ mem_cgroup_put(orig);
root_mem->last_scanned_child = curr;
+ mem_cgroup_get(curr);
return curr;
}

@@ -677,35 +675,25 @@ static struct mem_cgroup *
mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
{
struct cgroup *cgroup;
- struct mem_cgroup *ret;
- bool obsolete;
+ struct mem_cgroup *ret, *orig;

- obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
-
- /*
- * Scan all children under the mem_cgroup mem
- */
mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
- if (list_empty(&root_mem->css.cgroup->children)) {
- ret = root_mem;
- goto done;
- }
-
- if (!root_mem->last_scanned_child || obsolete) {
-
- if (obsolete && root_mem->last_scanned_child)
- mem_cgroup_put(root_mem->last_scanned_child);
+ orig = root_mem->last_scanned_child;

- cgroup = list_first_entry(&root_mem->css.cgroup->children,
- struct cgroup, sibling);
- ret = mem_cgroup_from_cont(cgroup);
+ if (!orig) {
+ if (list_empty(&root_mem->css.cgroup->children)) {
+ ret = root_mem;
+ } else {
+ cgroup =
+ list_first_entry(&root_mem->css.cgroup->children,
+ struct cgroup, sibling);
+ ret = mem_cgroup_from_cont(cgroup);
+ }
+ root_mem->last_scanned_child = ret;
mem_cgroup_get(ret);
- } else
+ } else /* get_next_node will manage refcnt */
ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
root_mem);
-
-done:
- root_mem->last_scanned_child = ret;
mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
return ret;
}
@@ -2232,7 +2220,11 @@ static void mem_cgroup_pre_destroy(struc
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- mem_cgroup_put(mem_cgroup_from_cont(cont));
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+ if (mem->last_scanned_child == mem)
+ mem_cgroup_put(mem);
+ mem_cgroup_put(mem);
}

static int mem_cgroup_populate(struct cgroup_subsys *ss,

2009-01-09 01:47:21

by Daisuke Nishimura

Subject: Re: [RFC][PATCH 4/4] memcg: make oom less frequently

On Thu, 8 Jan 2009 20:19:48 +0900 (JST), "KAMEZAWA Hiroyuki" <[email protected]> wrote:
> Daisuke Nishimura said:
> > In the previous implementation, mem_cgroup_try_charge checked the return
> > value of try_to_free_mem_cgroup_pages, and just retried if some pages
> > had been reclaimed.
> > But now, try_charge (and mem_cgroup_hierarchical_reclaim called from it)
> > only checks whether the usage is less than the limit.
> >
> > This patch tries to change the behavior as before to cause oom less
> > frequently.
> >
> > To prevent try_charge from getting stuck in infinite loop,
> > MEM_CGROUP_RECLAIM_RETRIES_MAX is defined.
> >
> >
> > Signed-off-by: Daisuke Nishimura <[email protected]>
>
> I think this is necessary change.
> My version of hierarchy reclaim will do this.
>
> But RETRIES_MAX is not clear ;) please use one counter.
>
> And why MAX=32 ?
I inserted a printk and counted the loop count at oom (tested with 4 children).
It seemed 32 would be enough.

> > + if (ret)
> > + continue;
> seems to do enough work.
>
> While memory can be reclaimed, it's not dead lock.
I see.
I introduced this max count because mmap_sem might be held for a long time
at page fault, but as you say, this is not a "dead" lock.

> To handle live-lock situation as "reclaimed memory is stolen very soon",
> should we check signal_pending(current) or some flags ?
>
> IMHO, using jiffies to detect how long we should retry is easy to understand
> ....like
> "if memory charging cannot make progress for XXXX minutes,
> trigger some notifier or show some flag to user via cgroupfs interface.
> to show we're tooooooo busy."
>
Good Idea.

But I think it would be enough for now to check signal_pending(current) and
return -ENOMEM.

How about this one?
===
From: Daisuke Nishimura <[email protected]>

In the previous implementation, mem_cgroup_try_charge checked the return
value of mem_cgroup_try_to_free_pages, and simply retried if some pages
had been reclaimed.
But now, try_charge (and mem_cgroup_hierarchical_reclaim called from it)
only checks whether the usage is less than the limit.

This patch restores the previous behavior so that oom is triggered less frequently.


Signed-off-by: Daisuke Nishimura <[email protected]>
---
mm/memcontrol.c | 14 ++++++++++----
1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dc38a0e..2ab0a5c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -770,10 +770,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
* but there might be left over accounting, even after children
* have left.
*/
- ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
+ ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
get_swappiness(root_mem));
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
+ return 1; /* indicate reclaim has succeeded */
if (!root_mem->use_hierarchy)
return ret;

@@ -784,10 +784,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
next_mem = mem_cgroup_get_next_node(root_mem);
continue;
}
- ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
+ ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
get_swappiness(next_mem));
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
+ return 1; /* indicate reclaim has succeeded */
next_mem = mem_cgroup_get_next_node(root_mem);
}
return ret;
@@ -870,8 +870,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
if (!(gfp_mask & __GFP_WAIT))
goto nomem;

+ if (signal_pending(current))
+ goto oom;
+
ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
noswap);
+ if (ret)
+ continue;

/*
* try_to_free_mem_cgroup_pages() might not give us a full
@@ -885,6 +890,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
continue;

if (!nr_retries--) {
+oom:
if (oom) {
mutex_lock(&memcg_tasklist);
mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
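For reference, the retry policy this diff implements can be sketched in plain user-space C. Everything below (sim_try_charge, sim_reclaim, SIM_RETRIES) is a hypothetical stand-in, not a real kernel API: a reclaim pass that brings usage back under the limit lets the charge succeed, while unproductive passes burn down a bounded retry counter until oom.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_RETRIES 3              /* stands in for the retry budget */

static int reclaim_succeeds_at;    /* attempt number at which reclaim "works" */
static int attempts;
static bool oom_triggered;

/* Stub for mem_cgroup_hierarchical_reclaim(): returns 1 once the group
 * would be back under its limit, 0 otherwise. */
static int sim_reclaim(void)
{
    return ++attempts >= reclaim_succeeds_at;
}

/* Simplified shape of the __mem_cgroup_try_charge() retry loop after this
 * patch: a productive reclaim pass lets the charge go through, while
 * unproductive passes consume nr_retries until oom is declared. */
static int sim_try_charge(void)
{
    int nr_retries = SIM_RETRIES;

    for (;;) {
        if (sim_reclaim())
            return 0;              /* reclaim succeeded: the charge now fits */
        if (!nr_retries--) {
            oom_triggered = true;  /* mem_cgroup_out_of_memory() analogue */
            return -1;
        }
    }
}
```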

2009-01-09 02:05:19

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/4] memcg: make oom less frequently

On Fri, 9 Jan 2009 10:44:16 +0900
Daisuke Nishimura <[email protected]> wrote:

> > To handle live-lock situation as "reclaimed memory is stolen very soon",
> > should we check signal_pending(current) or some flags ?
> >
> > IMHO, using jiffies to detect how long we should retry is easy to understand
> > ....like
> > "if memory charging cannot make progress for XXXX minutes,
> > trigger some notifier or show some flag to user via cgroupfs interface.
> > to show we're tooooooo busy."
> >
> Good Idea.
>
> But I think it would be enough for now to check signal_pending(current) and
> return -ENOMEM.
>
> How about this one?

Hmm, looks much simpler.

> ===
> From: Daisuke Nishimura <[email protected]>
>
> In previous implementation, mem_cgroup_try_charge checked the return
> value of mem_cgroup_try_to_free_pages, and just retried if some pages
> had been reclaimed.
> But now, try_charge(and mem_cgroup_hierarchical_reclaim called from it)
> only checks whether the usage is less than the limit.
>
> This patch tries to change the behavior as before to cause oom less frequently.
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> ---
> mm/memcontrol.c | 14 ++++++++++----
> 1 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index dc38a0e..2ab0a5c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -770,10 +770,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> * but there might be left over accounting, even after children
> * have left.
> */
> - ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> + ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> get_swappiness(root_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> - return 0;
> + return 1; /* indicate reclaim has succeeded */
> if (!root_mem->use_hierarchy)
> return ret;
>
> @@ -784,10 +784,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> next_mem = mem_cgroup_get_next_node(root_mem);
> continue;
> }
> - ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> + ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> get_swappiness(next_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> - return 0;
> + return 1; /* indicate reclaim has succeeded */
> next_mem = mem_cgroup_get_next_node(root_mem);
> }
> return ret;
> @@ -870,8 +870,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> + if (signal_pending(current))
> + goto oom;
> +

I think it's better to avoid adding this check *now*. And "signal is pending"
doesn't mean an oom situation.

Hmm.. Maybe in the future we can tell the caller "please retry the page fault;
the latency in memory reclaim is too long and you received a signal."

IMHO, the only quick path we can add here now is
==
if (test_thread_flag(TIF_MEMDIE)) { /* This thread is killed by OOM */
*memcg = NULL;
return 0;
}
==
like this.

Anyway, please discuss this "quick exit path" in another patch and just remove
the signal check.

Other part looks ok to me.

Thanks,
-Kame




> ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> noswap);
> + if (ret)
> + continue;
>
> /*
> * try_to_free_mem_cgroup_pages() might not give us a full
> @@ -885,6 +890,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> continue;
>
> if (!nr_retries--) {
> +oom:
> if (oom) {
> mutex_lock(&memcg_tasklist);
> mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
>

2009-01-09 02:32:22

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/4] memcg: make oom less frequently

On Fri, 9 Jan 2009 11:03:58 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Fri, 9 Jan 2009 10:44:16 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > > To handle live-lock situation as "reclaimed memory is stolen very soon",
> > > should we check signal_pending(current) or some flags ?
> > >
> > > IMHO, using jiffies to detect how long we should retry is easy to understand
> > > ....like
> > > "if memory charging cannot make progress for XXXX minutes,
> > > trigger some notifier or show some flag to user via cgroupfs interface.
> > > to show we're tooooooo busy."
> > >
> > Good Idea.
> >
> > But I think it would be enough for now to check signal_pending(current) and
> > return -ENOMEM.
> >
> > How about this one?
>
> Hmm, looks much simpler.
>
> > ===
> > From: Daisuke Nishimura <[email protected]>
> >
> > In previous implementation, mem_cgroup_try_charge checked the return
> > value of mem_cgroup_try_to_free_pages, and just retried if some pages
> > had been reclaimed.
> > But now, try_charge(and mem_cgroup_hierarchical_reclaim called from it)
> > only checks whether the usage is less than the limit.
> >
> > This patch tries to change the behavior as before to cause oom less frequently.
> >
> >
> > Signed-off-by: Daisuke Nishimura <[email protected]>
> > ---
> > mm/memcontrol.c | 14 ++++++++++----
> > 1 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index dc38a0e..2ab0a5c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -770,10 +770,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > * but there might be left over accounting, even after children
> > * have left.
> > */
> > - ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> > + ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> > get_swappiness(root_mem));
> > if (mem_cgroup_check_under_limit(root_mem))
> > - return 0;
> > + return 1; /* indicate reclaim has succeeded */
> > if (!root_mem->use_hierarchy)
> > return ret;
> >
> > @@ -784,10 +784,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > next_mem = mem_cgroup_get_next_node(root_mem);
> > continue;
> > }
> > - ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> > + ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> > get_swappiness(next_mem));
> > if (mem_cgroup_check_under_limit(root_mem))
> > - return 0;
> > + return 1; /* indicate reclaim has succeeded */
> > next_mem = mem_cgroup_get_next_node(root_mem);
> > }
> > return ret;
> > @@ -870,8 +870,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > if (!(gfp_mask & __GFP_WAIT))
> > goto nomem;
> >
> > + if (signal_pending(current))
> > + goto oom;
> > +
>
> I think it's better to avoid to add this check *now*. and "signal is pending"
> doesn't mean oom situation.
>
hmm.. charge is assumed to return 0 or -ENOMEM; what should we return in the
signal_pending case?

In the case of shmem, for example, if the charge at shmem_getpage fails with
-ENOMEM, shmem_fault returns VM_FAULT_OOM, so pagefault_out_of_memory would be
called. If memcg had not invoked the oom-killer, a system-wide oom would be invoked.

> Hmm..Maybe we can tell "please retry page fault again, it's too long latency in
> memory reclaim and you received signal." in future.
>
OK.

> IMHO, only quick path which we can add here now is
> ==
> if (test_thread_flag(TIF_MEMDIE)) { /* This thread is killed by OOM */
> *memcg = NULL;
> return 0;
> }
> ==
> like this.
>
> Anyway, please discuss this "quick exit path" in other patch and just remove
> signal check.
>
> Other part looks ok to me.
>
Thanks :)

I'll update this one by removing the signal_pending check.


Thanks,
Daisuke Nishimura.
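As an aside, the propagation described in the discussion above (a memcg charge failure at shmem_getpage surfacing as a system-wide oom via VM_FAULT_OOM) can be sketched in user-space C. All names below are illustrative stand-ins for the kernel path, not the real functions:

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_ENOMEM    12
#define SIM_FAULT_OOM 0x0001

static bool charge_fails;          /* stands in for the memcg being at its limit */
static bool global_oom_invoked;

static int sim_try_charge(void)    /* mem_cgroup_try_charge() stand-in */
{
    return charge_fails ? -SIM_ENOMEM : 0;
}

static int sim_shmem_fault(void)   /* shmem_fault() stand-in */
{
    if (sim_try_charge() == -SIM_ENOMEM)
        return SIM_FAULT_OOM;      /* the -ENOMEM becomes VM_FAULT_OOM */
    return 0;
}

static void sim_handle_fault(void) /* page-fault path stand-in */
{
    if (sim_shmem_fault() & SIM_FAULT_OOM)
        global_oom_invoked = true; /* pagefault_out_of_memory() analogue:
                                    * a per-cgroup failure goes system-wide */
}
```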

2009-01-09 02:38:38

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] memcg: fix for mem_cgroup_get_reclaim_stat_from_page

On Fri, 9 Jan 2009 10:05:31 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Fri, 09 Jan 2009 08:57:59 +0800
> Li Zefan <[email protected]> wrote:
>
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index e2996b8..62e69d8 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -559,6 +559,10 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> > > return NULL;
> > >
> > > pc = lookup_page_cgroup(page);
> > > + smp_rmb();
> >
> > It is better to add a comment to explain this smp_rmb. I think it's recommended
> > that every memory barrier has a comment.
> >
> Ah, yes. good point.
>
> Maybe text like this
> /*
> * Used bit is set without atomic ops but after smp_wmb().
> * For making pc->mem_cgroup visible, insert smp_rmb() here.
> */
>
OK. I'll add this comment.

BTW, mem_cgroup_rotate_lru_list and mem_cgroup_add_lru_list have similar code.
(mem_cgroup_add_lru_list has some comment already.)
Should I update them too ?


Thanks,
Daisuke Nishimura.
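For reference, the pairing that comment documents (the charge path stores pc->mem_cgroup, issues smp_wmb(), then sets the Used bit; the reader tests the Used bit, issues smp_rmb(), then loads pc->mem_cgroup) can be sketched with C11 fences. This is a single-threaded illustration with hypothetical types; in the kernel the two paths run concurrently:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_page_cgroup {
    void *mem_cgroup;              /* payload published by the charge path */
    atomic_bool used;              /* PageCgroupUsed analogue */
};

/* Writer side: publish the payload, then set the Used bit. The release
 * fence is the smp_wmb() analogue. */
static void sim_commit_charge(struct sim_page_cgroup *pc, void *memcg)
{
    pc->mem_cgroup = memcg;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&pc->used, true, memory_order_relaxed);
}

/* Reader side: test the Used bit first, as the fix does, and only then
 * read the payload. The acquire fence is the smp_rmb() analogue. */
static void *sim_reclaim_stat_lookup(struct sim_page_cgroup *pc)
{
    if (!atomic_load_explicit(&pc->used, memory_order_relaxed))
        return NULL;               /* not charged yet: bail out */
    atomic_thread_fence(memory_order_acquire);
    return pc->mem_cgroup;
}
```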

2009-01-09 02:40:49

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/4] memcg: make oom less frequently

On Fri, 9 Jan 2009 11:29:22 +0900
Daisuke Nishimura <[email protected]> wrote:

> > > @@ -870,8 +870,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > if (!(gfp_mask & __GFP_WAIT))
> > > goto nomem;
> > >
> > > + if (signal_pending(current))
> > > + goto oom;
> > > +
> >
> > I think it's better to avoid to add this check *now*. and "signal is pending"
> > doesn't mean oom situation.
> >
> hmm.. charge is assumed to return 0 or -ENOMEM, what should we return on
> signal_pending case ?
>
> In case of shmem for example, if charge at shmem_getpage fails by -ENOMEM,
> shmem_fault returns VM_FAULT_OOM, so pagefault_out_of_memory would be called.
> If memcg had not invoked oom-killer, system wide oom would be invoked.
>
yes, that's a problem.

I think generic -EAGAIN support would be appreciated. But it will not be for -rc ;)
(... it would make code other than memcontrol.c more complicated.)

Thanks,
-Kame

> > Hmm..Maybe we can tell "please retry page fault again, it's too long latency in
> > memory reclaim and you received signal." in future.
> >
> OK.
>
> > IMHO, only quick path which we can add here now is
> > ==
> > if (test_thread_flag(TIF_MEMDIE)) { /* This thread is killed by OOM */
> > *memcg = NULL;
> > return 0;
> > }
> > ==
> > like this.
> >
> > Anyway, please discuss this "quick exit path" in other patch and just remove
> > signal check.
> >
> > Other part looks ok to me.
> >
> Thanks :)
>
> I'll update this one by removing the signal_pending check.
>
>
> Thanks,
> Daisuke Nishimura.
>

2009-01-09 02:43:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] memcg: fix for mem_cgroup_get_reclaim_stat_from_page

On Fri, 9 Jan 2009 11:34:58 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Fri, 9 Jan 2009 10:05:31 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > On Fri, 09 Jan 2009 08:57:59 +0800
> > Li Zefan <[email protected]> wrote:
> >
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index e2996b8..62e69d8 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -559,6 +559,10 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> > > > return NULL;
> > > >
> > > > pc = lookup_page_cgroup(page);
> > > > + smp_rmb();
> > >
> > > It is better to add a comment to explain this smp_rmb. I think it's recommended
> > > that every memory barrier has a comment.
> > >
> > Ah, yes. good point.
> >
> > Maybe text like this
> > /*
> > * Used bit is set without atomic ops but after smp_wmb().
> > * For making pc->mem_cgroup visible, insert smp_rmb() here.
> > */
> >
> OK. I'll add this comment.
>
> BTW, mem_cgroup_rotate_lru_list and mem_cgroup_add_lru_list have similar code.
> (mem_cgroup_add_lru_list has some comment already.)
> Should I update them too ?
>
please :) it's helpful. Sorry for my too-short comment on the original patch.

-Kame

2009-01-09 03:05:26

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/4] memcg: fix for mem_cgroup_hierarchical_reclaim

On Fri, 9 Jan 2009 10:08:30 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Thu, 8 Jan 2009 20:08:01 +0900 (JST)
> "KAMEZAWA Hiroyuki" <[email protected]> wrote:
>
> > Daisuke Nishimura said:
> > > > If root_mem has no children, last_scanned_child is set to root_mem itself.
> > > But after some children added to root_mem, mem_cgroup_get_next_node can
> > > mem_cgroup_put the root_mem although root_mem has not been mem_cgroup_get.
> > >
> > > This patch fixes this behavior by:
> > > - Set last_scanned_child to NULL if root_mem has no children or DFS search
> > > has returned to root_mem itself(root_mem is not a "child" of root_mem).
> > > Make mem_cgroup_get_first_node return root_mem in this case.
> > > There are no mem_cgroup_get/put for root_mem.
> > > - Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
> > > mem_cgroup_get_first_node to mem_cgroup_get_next_node.
> > > Make mem_cgroup_hierarchical_reclaim call only new
> > > mem_cgroup_get_next_node.
> > >
> > >
> > > Signed-off-by: Daisuke Nishimura <[email protected]>
> >
> > Hmm, seems necessary fix. Then, it's better to rebase my patch on to this.
> >
> > Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Maybe simpler one can be written but my patch remove all this out later....
> >
> How about this? (just an example and not tested, sorry)
>
>
hmm, I don't think it's much simpler than this one (I don't want last_scanned_child
to point to the mem itself, because it's not a "child").

This part will be re-written by your patch, but this patch is needed
to fix a bug (I saw a general protection fault), so let's fix them one by one.
I'll post my original version. It's well tested :)


Thanks,
Daisuke Nishimura.

>
> ---
> mm/memcontrol.c | 52 ++++++++++++++++++++++------------------------------
> 1 file changed, 22 insertions(+), 30 deletions(-)
>
> Index: mmotm-2.6.28-Jan7/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-Jan7/mm/memcontrol.c
> @@ -621,6 +621,7 @@ static struct mem_cgroup *
> mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> {
> struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
> + struct mem_cgroup *orig = root_mem->last_scanned_child;
>
> curr_cgroup = curr->css.cgroup;
> root_cgroup = root_mem->css.cgroup;
> @@ -629,19 +630,15 @@ mem_cgroup_get_next_node(struct mem_cgro
> /*
> * Walk down to children
> */
> - mem_cgroup_put(curr);
> cgroup = list_entry(curr_cgroup->children.next,
> struct cgroup, sibling);
> curr = mem_cgroup_from_cont(cgroup);
> - mem_cgroup_get(curr);
> goto done;
> }
>
> visit_parent:
> if (curr_cgroup == root_cgroup) {
> - mem_cgroup_put(curr);
> curr = root_mem;
> - mem_cgroup_get(curr);
> goto done;
> }
>
> @@ -649,11 +646,9 @@ visit_parent:
> * Goto next sibling
> */
> if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
> - mem_cgroup_put(curr);
> cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
> sibling);
> curr = mem_cgroup_from_cont(cgroup);
> - mem_cgroup_get(curr);
> goto done;
> }
>
> @@ -664,7 +659,10 @@ visit_parent:
> goto visit_parent;
>
> done:
> + if (orig)
> + mem_cgroup_put(orig);
> root_mem->last_scanned_child = curr;
> + mem_cgroup_get(curr);
> return curr;
> }
>
> @@ -677,35 +675,25 @@ static struct mem_cgroup *
> mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> {
> struct cgroup *cgroup;
> - struct mem_cgroup *ret;
> - bool obsolete;
> + struct mem_cgroup *ret, *orig;
>
> - obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
> -
> - /*
> - * Scan all children under the mem_cgroup mem
> - */
> mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> - if (list_empty(&root_mem->css.cgroup->children)) {
> - ret = root_mem;
> - goto done;
> - }
> -
> - if (!root_mem->last_scanned_child || obsolete) {
> -
> - if (obsolete && root_mem->last_scanned_child)
> - mem_cgroup_put(root_mem->last_scanned_child);
> + orig = root_mem->last_scanned_child;
>
> - cgroup = list_first_entry(&root_mem->css.cgroup->children,
> - struct cgroup, sibling);
> - ret = mem_cgroup_from_cont(cgroup);
> + if (!orig) {
> + if (list_empty(&root_mem->css.cgroup->children)) {
> + ret = root_mem;
> + } else {
> + cgroup =
> + list_first_entry(&root_mem->css.cgroup->children,
> + struct cgroup, sibling);
> + ret = mem_cgroup_from_cont(cgroup);
> + }
> + root_mem->last_scanned_child = ret;
> mem_cgroup_get(ret);
> - } else
> + } else /* get_next_node will manage refcnt */
> ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
> root_mem);
> -
> -done:
> - root_mem->last_scanned_child = ret;
> mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> return ret;
> }
> @@ -2232,7 +2220,11 @@ static void mem_cgroup_pre_destroy(struc
> static void mem_cgroup_destroy(struct cgroup_subsys *ss,
> struct cgroup *cont)
> {
> - mem_cgroup_put(mem_cgroup_from_cont(cont));
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +
> + if (mem->last_scanned_child == mem)
> + mem_cgroup_put(mem);
> + mem_cgroup_put(mem);
> }
>
> static int mem_cgroup_populate(struct cgroup_subsys *ss,
>

2009-01-09 03:11:08

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/4] memcg: fix for mem_cgroup_hierarchical_reclaim

On Fri, 9 Jan 2009 11:51:03 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Fri, 9 Jan 2009 10:08:30 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > On Thu, 8 Jan 2009 20:08:01 +0900 (JST)
> > "KAMEZAWA Hiroyuki" <[email protected]> wrote:
> >
> > > Daisuke Nishimura said:
> > > > If root_mem has no children, last_scanned_child is set to root_mem itself.
> > > > But after some children added to root_mem, mem_cgroup_get_next_node can
> > > > mem_cgroup_put the root_mem although root_mem has not been mem_cgroup_get.
> > > >
> > > > This patch fixes this behavior by:
> > > > - Set last_scanned_child to NULL if root_mem has no children or DFS search
> > > > has returned to root_mem itself(root_mem is not a "child" of root_mem).
> > > > Make mem_cgroup_get_first_node return root_mem in this case.
> > > > There are no mem_cgroup_get/put for root_mem.
> > > > - Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
> > > > mem_cgroup_get_first_node to mem_cgroup_get_next_node.
> > > > Make mem_cgroup_hierarchical_reclaim call only new
> > > > mem_cgroup_get_next_node.
> > > >
> > > >
> > > > Signed-off-by: Daisuke Nishimura <[email protected]>
> > >
> > > Hmm, seems necessary fix. Then, it's better to rebase my patch on to this.
> > >
> > > Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> > >
> > > Maybe simpler one can be written but my patch remove all this out later....
> > >
> > How about this? (just an example and not tested, sorry)
> >
> >
> hmm, I don't think it's much simpler than this one(I don't want last_scanned_child
> to point to the mem itself, because it's not a "child").
>
> This part will be re-written by your patch, but this patch is needed
> to fix a bug(I saw general protection fault), so let's fix one by one.
of course.

> I'll post my original version. It's well tested :)
>
Ok, please go ahead.

Maybe creating a patch against "rc1" is better for all your fixes.
And please ask Andrew: "This is a bugfix, please put it into the fast path."

-Kame




> Thanks,
> Daisuke Nishimura.
>
> >
> > ---
> > mm/memcontrol.c | 52 ++++++++++++++++++++++------------------------------
> > 1 file changed, 22 insertions(+), 30 deletions(-)
> >
> > Index: mmotm-2.6.28-Jan7/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.28-Jan7.orig/mm/memcontrol.c
> > +++ mmotm-2.6.28-Jan7/mm/memcontrol.c
> > @@ -621,6 +621,7 @@ static struct mem_cgroup *
> > mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> > {
> > struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
> > + struct mem_cgroup *orig = root_mem->last_scanned_child;
> >
> > curr_cgroup = curr->css.cgroup;
> > root_cgroup = root_mem->css.cgroup;
> > @@ -629,19 +630,15 @@ mem_cgroup_get_next_node(struct mem_cgro
> > /*
> > * Walk down to children
> > */
> > - mem_cgroup_put(curr);
> > cgroup = list_entry(curr_cgroup->children.next,
> > struct cgroup, sibling);
> > curr = mem_cgroup_from_cont(cgroup);
> > - mem_cgroup_get(curr);
> > goto done;
> > }
> >
> > visit_parent:
> > if (curr_cgroup == root_cgroup) {
> > - mem_cgroup_put(curr);
> > curr = root_mem;
> > - mem_cgroup_get(curr);
> > goto done;
> > }
> >
> > @@ -649,11 +646,9 @@ visit_parent:
> > * Goto next sibling
> > */
> > if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
> > - mem_cgroup_put(curr);
> > cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
> > sibling);
> > curr = mem_cgroup_from_cont(cgroup);
> > - mem_cgroup_get(curr);
> > goto done;
> > }
> >
> > @@ -664,7 +659,10 @@ visit_parent:
> > goto visit_parent;
> >
> > done:
> > + if (orig)
> > + mem_cgroup_put(orig);
> > root_mem->last_scanned_child = curr;
> > + mem_cgroup_get(curr);
> > return curr;
> > }
> >
> > @@ -677,35 +675,25 @@ static struct mem_cgroup *
> > mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> > {
> > struct cgroup *cgroup;
> > - struct mem_cgroup *ret;
> > - bool obsolete;
> > + struct mem_cgroup *ret, *orig;
> >
> > - obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
> > -
> > - /*
> > - * Scan all children under the mem_cgroup mem
> > - */
> > mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> > - if (list_empty(&root_mem->css.cgroup->children)) {
> > - ret = root_mem;
> > - goto done;
> > - }
> > -
> > - if (!root_mem->last_scanned_child || obsolete) {
> > -
> > - if (obsolete && root_mem->last_scanned_child)
> > - mem_cgroup_put(root_mem->last_scanned_child);
> > + orig = root_mem->last_scanned_child;
> >
> > - cgroup = list_first_entry(&root_mem->css.cgroup->children,
> > - struct cgroup, sibling);
> > - ret = mem_cgroup_from_cont(cgroup);
> > + if (!orig) {
> > + if (list_empty(&root_mem->css.cgroup->children)) {
> > + ret = root_mem;
> > + } else {
> > + cgroup =
> > + list_first_entry(&root_mem->css.cgroup->children,
> > + struct cgroup, sibling);
> > + ret = mem_cgroup_from_cont(cgroup);
> > + }
> > + root_mem->last_scanned_child = ret;
> > mem_cgroup_get(ret);
> > - } else
> > + } else /* get_next_node will manage refcnt */
> > ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
> > root_mem);
> > -
> > -done:
> > - root_mem->last_scanned_child = ret;
> > mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> > return ret;
> > }
> > @@ -2232,7 +2220,11 @@ static void mem_cgroup_pre_destroy(struc
> > static void mem_cgroup_destroy(struct cgroup_subsys *ss,
> > struct cgroup *cont)
> > {
> > - mem_cgroup_put(mem_cgroup_from_cont(cont));
> > + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> > +
> > + if (mem->last_scanned_child == mem)
> > + mem_cgroup_put(mem);
> > + mem_cgroup_put(mem);
> > }
> >
> > static int mem_cgroup_populate(struct cgroup_subsys *ss,
> >
>
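For reference, the invariant the fix restores (last_scanned_child holds exactly one reference, the old reference is dropped when the cached pointer is replaced, and the root itself is never reference-counted) can be sketched generically in user-space C, with hypothetical names and a plain counter in place of mem_cgroup_get/put:

```c
#include <assert.h>
#include <stddef.h>

struct sim_memcg {
    int refcnt;                    /* mem_cgroup_get()/put() stand-in */
};

struct sim_root {
    struct sim_memcg *last_scanned_child;  /* NULL when no child is cached */
};

static void sim_get(struct sim_memcg *m) { m->refcnt++; }
static void sim_put(struct sim_memcg *m) { m->refcnt--; }

/* Cache 'next' as the last scanned child: take a reference on the new
 * child before dropping the one held on the previous child. 'next' may
 * be NULL when the scan wraps back to the root, which is never
 * reference-counted (the bug was putting a root that was never "got"). */
static void sim_set_last_scanned(struct sim_root *root, struct sim_memcg *next)
{
    struct sim_memcg *orig = root->last_scanned_child;

    if (next)
        sim_get(next);
    root->last_scanned_child = next;
    if (orig)
        sim_put(orig);
}
```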

2009-01-09 04:33:18

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] memcg: fix for mem_cgroup_get_reclaim_stat_from_page

* Daisuke Nishimura <[email protected]> [2009-01-08 19:14:30]:

> In case of swapin, a new page is added to lru before it is charged,
> so page->pc->mem_cgroup points to NULL or last mem_cgroup the page
> was charged before.
>
> In the latter case, if the mem_cgroup has already freed by rmdir,
> the area pointed to by page->pc->mem_cgroup may have invalid data.
>
> Actually, I saw general protection fault.
>
> general protection fault: 0000 [#1] SMP
> last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map
> CPU 4
> Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_region_hash dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac lp sg ide_cd_mod cdrom button serio_raw acpi_memhotplug parport_pc e1000 rtc_cmos parport rtc_core rtc_lib i2c_i801 i2c_core shpchp pcspkr ata_piix libata megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
> Pid: 26038, comm: page01 Tainted: G W 2.6.28-rc9-mm1-mmotm-2008-12-22-16-14-f2ab3dea #1
> RIP: 0010:[<ffffffff8028e710>] [<ffffffff8028e710>] update_page_reclaim_stat+0x2f/0x42
> RSP: 0000:ffff8801ee457da8 EFLAGS: 00010002
> RAX: 32353438312021c8 RBX: 0000000000000000 RCX: 32353438312021c8
> RDX: 0000000000000000 RSI: ffff8800cb0b1000 RDI: ffff8801164d1d28
> RBP: ffff880110002cb8 R08: ffff88010f2eae23 R09: 0000000000000001
> R10: ffff8800bc514b00 R11: ffff880110002c00 R12: 0000000000000000
> R13: ffff88000f484100 R14: 0000000000000003 R15: 00000000001200d2
> FS: 00007f8a261726f0(0000) GS:ffff88010f2eaa80(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007f8a25d22000 CR3: 00000001ef18c000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process page01 (pid: 26038, threadinfo ffff8801ee456000, task ffff8800b585b960)
> Stack:
> ffffe200071ee568 ffff880110001f00 0000000000000000 ffffffff8028ea17
> ffff88000f484100 0000000000000000 0000000000000020 00007f8a25d22000
> ffff8800bc514b00 ffffffff8028ec34 0000000000000000 0000000000016fd8
> Call Trace:
> [<ffffffff8028ea17>] ? ____pagevec_lru_add+0xc1/0x13c
> [<ffffffff8028ec34>] ? drain_cpu_pagevecs+0x36/0x89
> [<ffffffff802a4f8c>] ? swapin_readahead+0x78/0x98
> [<ffffffff8029a37a>] ? handle_mm_fault+0x3d9/0x741
> [<ffffffff804da654>] ? do_page_fault+0x3ce/0x78c
> [<ffffffff804d7a42>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [<ffffffff804d860f>] ? page_fault+0x1f/0x30
> Code: cc 55 48 8d af b8 0d 00 00 48 89 f7 53 89 d3 e8 39 85 02 00 48 63 d3 48 ff 44 d5 10 45 85 e4 74 05 48 ff 44 d5 00 48 85 c0 74 0e <48> ff 44 d0 10 45 85 e4 74 04 48 ff 04 d0 5b 5d 41 5c c3 41 54
> RIP [<ffffffff8028e710>] update_page_reclaim_stat+0x2f/0x42
> RSP <ffff8801ee457da8>
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> ---
> mm/memcontrol.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2996b8..62e69d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -559,6 +559,10 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> return NULL;
>
> pc = lookup_page_cgroup(page);
> + smp_rmb();

Do you really need the read memory barrier?

> + if (!PageCgroupUsed(pc))
> + return NULL;
> +

In this case we've hit a case where the page is valid and the pc is
not. This does fix the problem, but won't this impact us getting
correct reclaim stats and thus indirectly impact the working of
pressure?

> mz = page_cgroup_zoneinfo(pc);
> if (!mz)
> return NULL;
>
--
Balbir

2009-01-09 04:48:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] memcg: fix for mem_cgroup_get_reclaim_stat_from_page

On Fri, 9 Jan 2009 10:02:57 +0530
Balbir Singh <[email protected]> wrote:

> > pc = lookup_page_cgroup(page);
> > + smp_rmb();
>
> Do you really need the read memory barrier?
>
Necessary.

> > + if (!PageCgroupUsed(pc))
> > + return NULL;
> > +
>
> In this case we've hit a case where the page is valid and the pc is
> not. This does fix the problem, but won't this impact us getting
> correct reclaim stats and thus indirectly impact the working of
> pressure?
>
- If it returns NULL, only the global LRU's status is updated.

Because this page does not belong to any memcg, we cannot update
any counters. But yes, your point is a concern.

Maybe moving activate_page() to
==
do_swap_page()
{

- activate_page()
mem_cgroup_try_charge()..
....
mem_cgroup_commit_charge()....
....
+ activate_page()
}
==
is necessary. What do you think, Kosaki?

Thanks,
-Kame

2009-01-09 05:15:36

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/4] memcg: fix error path of mem_cgroup_move_parent

* Daisuke Nishimura <[email protected]> [2009-01-08 19:14:45]:

> There is a bug in error path of mem_cgroup_move_parent.
>
> Extra refcnt got from try_charge should be dropped, and usages incremented
> by try_charge should be decremented in both error paths:
>
> A: failure at get_page_unless_zero
> B: failure at isolate_lru_page
>
> This bug makes this parent directory unremovable.
>
> In case of A, rmdir doesn't return, because res.usage doesn't go
> down to 0 at mem_cgroup_force_empty even after all the pc in
> lru are removed.
> In case of B, rmdir fails and returns -EBUSY, because it has
> extra ref counts even after res.usage goes down to 0.
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> ---
> mm/memcontrol.c | 23 +++++++++++++++--------
> 1 files changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 62e69d8..288e22c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -983,14 +983,15 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> if (pc->mem_cgroup != from)
> goto out;
>
> - css_put(&from->css);
> res_counter_uncharge(&from->res, PAGE_SIZE);
> mem_cgroup_charge_statistics(from, pc, false);
> if (do_swap_account)
> res_counter_uncharge(&from->memsw, PAGE_SIZE);
> + css_put(&from->css);
> +
> + css_get(&to->css);
> pc->mem_cgroup = to;
> mem_cgroup_charge_statistics(to, pc, true);
> - css_get(&to->css);
> ret = 0;
> out:
> unlock_page_cgroup(pc);
> @@ -1023,8 +1024,10 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
> if (ret || !parent)
> return ret;
>
> - if (!get_page_unless_zero(page))
> - return -EBUSY;
> + if (!get_page_unless_zero(page)) {
> + ret = -EBUSY;
> + goto uncharge;
> + }
>
> ret = isolate_lru_page(page);
>
> @@ -1033,19 +1036,23 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
>
> ret = mem_cgroup_move_account(pc, child, parent);
>
> - /* drop extra refcnt by try_charge() (move_account increment one) */
> - css_put(&parent->css);
> putback_lru_page(page);
> if (!ret) {
> put_page(page);
> + /* drop extra refcnt by try_charge() */
> + css_put(&parent->css);
> return 0;
> }
> - /* uncharge if move fails */
> +
> cancel:
> + put_page(page);
> +uncharge:
> + /* drop extra refcnt by try_charge() */
> + css_put(&parent->css);
> + /* uncharge if move fails */
> res_counter_uncharge(&parent->res, PAGE_SIZE);
> if (do_swap_account)
> res_counter_uncharge(&parent->memsw, PAGE_SIZE);
> - put_page(page);
> return ret;
> }
>
>

Looks good to me. Just out of curiosity, how did you catch this error?
Through review or testing?

--
Balbir

2009-01-09 05:33:36

by Balbir Singh

Subject: Re: [RFC][PATCH 3/4] memcg: fix for mem_cgroup_hierarchical_reclaim

* Daisuke Nishimura <[email protected]> [2009-01-08 19:15:01]:

> If root_mem has no children, last_scanned_child is set to root_mem itself.
> But after some children are added to root_mem, mem_cgroup_get_next_node can
> mem_cgroup_put root_mem even though mem_cgroup_get has never been called on it.
>

Good catch!

> This patch fixes this behavior by:
> - Set last_scanned_child to NULL if root_mem has no children or DFS search
> has returned to root_mem itself (root_mem is not a "child" of root_mem).
> Make mem_cgroup_get_first_node return root_mem in this case.
> There are no mem_cgroup_get/put for root_mem.
> - Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
> mem_cgroup_get_first_node to mem_cgroup_get_next_node.
> Make mem_cgroup_hierarchical_reclaim call only new mem_cgroup_get_next_node.
>

How have you tested these changes? When I wrote up the patches, I did
several tests to make sure that all nodes in the hierarchy are covered
while reclaiming in order.

>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> ---
> mm/memcontrol.c | 37 +++++++++++++++++++++----------------
> 1 files changed, 21 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 288e22c..dc38a0e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -622,7 +622,7 @@ unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> * called with hierarchy_mutex held
> */
> static struct mem_cgroup *
> -mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> +__mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> {
> struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
>
> @@ -644,8 +644,8 @@ mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> visit_parent:
> if (curr_cgroup == root_cgroup) {
> mem_cgroup_put(curr);
> - curr = root_mem;
> - mem_cgroup_get(curr);
> + /* caller handles NULL case */
> + curr = NULL;
> goto done;
> }
>
> @@ -668,7 +668,6 @@ visit_parent:
> goto visit_parent;
>
> done:
> - root_mem->last_scanned_child = curr;
> return curr;
> }
>
> @@ -678,20 +677,29 @@ done:
> * that to reclaim free pages from.
> */
> static struct mem_cgroup *
> -mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> +mem_cgroup_get_next_node(struct mem_cgroup *root_mem)
> {
> struct cgroup *cgroup;
> struct mem_cgroup *ret;
> bool obsolete;
>
> - obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
> -
> /*
> * Scan all children under the mem_cgroup mem
> */
> mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> +
> + obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
> +
> if (list_empty(&root_mem->css.cgroup->children)) {
> - ret = root_mem;
> + /*
> + * root_mem might have children before and last_scanned_child
> + * may point to one of them.
> + */
> + if (root_mem->last_scanned_child) {
> + VM_BUG_ON(!obsolete);
> + mem_cgroup_put(root_mem->last_scanned_child);
> + }
> + ret = NULL;
> goto done;
> }
>
> @@ -705,13 +713,13 @@ mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> ret = mem_cgroup_from_cont(cgroup);
> mem_cgroup_get(ret);
> } else
> - ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
> + ret = __mem_cgroup_get_next_node(root_mem->last_scanned_child,
> root_mem);
>
> done:
> root_mem->last_scanned_child = ret;
> mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> - return ret;
> + return (ret) ? ret : root_mem;
> }
>
> static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
> @@ -769,21 +777,18 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> if (!root_mem->use_hierarchy)
> return ret;
>
> - next_mem = mem_cgroup_get_first_node(root_mem);
> + next_mem = mem_cgroup_get_next_node(root_mem);
>
> while (next_mem != root_mem) {
> if (mem_cgroup_is_obsolete(next_mem)) {
> - mem_cgroup_put(next_mem);
> - next_mem = mem_cgroup_get_first_node(root_mem);
> + next_mem = mem_cgroup_get_next_node(root_mem);
> continue;
> }
> ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> get_swappiness(next_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> return 0;
> - mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> - next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
> - mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> + next_mem = mem_cgroup_get_next_node(root_mem);
> }
> return ret;
> }
>


Looks good to me, I need to test it though

Acked-by: Balbir Singh <[email protected]>

--
Balbir

2009-01-09 05:35:17

by Balbir Singh

Subject: Re: [RFC][PATCH 3/4] memcg: fix for mem_cgroup_hierarchical_reclaim

* KAMEZAWA Hiroyuki <[email protected]> [2009-01-09 12:09:50]:

> Ok, please go ahead.
>
> Maybe creating a patch against "rc1" is better for all your fixes.
> And please ask Andrew: "This is a bugfix, please put it on the fast path."
>
> -Kame
>

Agreed and we'll test it thoroughly meanwhile!

--
Balbir

2009-01-09 05:40:26

by Daisuke Nishimura

Subject: Re: [RFC][PATCH 2/4] memcg: fix error path of mem_cgroup_move_parent

On Fri, 9 Jan 2009 10:45:22 +0530, Balbir Singh <[email protected]> wrote:
> * Daisuke Nishimura <[email protected]> [2009-01-08 19:14:45]:
>
> > There is a bug in error path of mem_cgroup_move_parent.
> >
> > The extra refcount taken by try_charge should be dropped, and usages incremented
> > by try_charge should be decremented in both error paths:
> >
> > A: failure at get_page_unless_zero
> > B: failure at isolate_lru_page
> >
> > This bug makes this parent directory unremovable.
> >
> > In case of A, rmdir doesn't return, because res.usage doesn't go
> > down to 0 at mem_cgroup_force_empty even after all the pc in
> > lru are removed.
> > In case of B, rmdir fails and returns -EBUSY, because it has
> > extra ref counts even after res.usage goes down to 0.
> >
> >
> > Signed-off-by: Daisuke Nishimura <[email protected]>
> > ---
> > mm/memcontrol.c | 23 +++++++++++++++--------
> > 1 files changed, 15 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 62e69d8..288e22c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -983,14 +983,15 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> > if (pc->mem_cgroup != from)
> > goto out;
> >
> > - css_put(&from->css);
> > res_counter_uncharge(&from->res, PAGE_SIZE);
> > mem_cgroup_charge_statistics(from, pc, false);
> > if (do_swap_account)
> > res_counter_uncharge(&from->memsw, PAGE_SIZE);
> > + css_put(&from->css);
> > +
> > + css_get(&to->css);
> > pc->mem_cgroup = to;
> > mem_cgroup_charge_statistics(to, pc, true);
> > - css_get(&to->css);
> > ret = 0;
> > out:
> > unlock_page_cgroup(pc);
> > @@ -1023,8 +1024,10 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
> > if (ret || !parent)
> > return ret;
> >
> > - if (!get_page_unless_zero(page))
> > - return -EBUSY;
> > + if (!get_page_unless_zero(page)) {
> > + ret = -EBUSY;
> > + goto uncharge;
> > + }
> >
> > ret = isolate_lru_page(page);
> >
> > @@ -1033,19 +1036,23 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
> >
> > ret = mem_cgroup_move_account(pc, child, parent);
> >
> > - /* drop extra refcnt by try_charge() (move_account increment one) */
> > - css_put(&parent->css);
> > putback_lru_page(page);
> > if (!ret) {
> > put_page(page);
> > + /* drop extra refcnt by try_charge() */
> > + css_put(&parent->css);
> > return 0;
> > }
> > - /* uncharge if move fails */
> > +
> > cancel:
> > + put_page(page);
> > +uncharge:
> > + /* drop extra refcnt by try_charge() */
> > + css_put(&parent->css);
> > + /* uncharge if move fails */
> > res_counter_uncharge(&parent->res, PAGE_SIZE);
> > if (do_swap_account)
> > res_counter_uncharge(&parent->memsw, PAGE_SIZE);
> > - put_page(page);
> > return ret;
> > }
> >
> >
>
> Looks good to me. Just out of curiosity, how did you catch this error?
> Through review or testing?
>
Through testing.

I sometimes got "an unremovable directory", which either had res.usage remaining
even after all LRU lists had become empty, or had ref counts remaining
even after res.usage had gone down to 0.
I then tracked down the cause of this problem.


Thanks,
Daisuke Nishimura.

2009-01-09 05:58:22

by Balbir Singh

Subject: Re: [RFC][PATCH 4/4] memcg: make oom less frequently

* Daisuke Nishimura <[email protected]> [2009-01-08 19:15:20]:

> In previous implementation, mem_cgroup_try_charge checked the return
> value of mem_cgroup_try_to_free_pages, and just retried if some pages
> had been reclaimed.
> But now, try_charge(and mem_cgroup_hierarchical_reclaim called from it)
> only checks whether the usage is less than the limit.
>
> This patch tries to change the behavior as before to cause oom less frequently.
>
> To prevent try_charge from getting stuck in infinite loop,
> MEM_CGROUP_RECLAIM_RETRIES_MAX is defined.
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> ---
> mm/memcontrol.c | 16 ++++++++++++----
> 1 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 804c054..fedd76b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -42,6 +42,7 @@
>
> struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> #define MEM_CGROUP_RECLAIM_RETRIES 5
> +#define MEM_CGROUP_RECLAIM_RETRIES_MAX 32

Why 32? Are you seeing frequent OOMs? I had 5 iterations to allow:

1. pages to move to swap cache, which added back pressure to memcg in
the original implementation, since the pages came back
2. It takes longer to move and reclaim those pages.

Ideally 3 would suffice, but I added an additional 2 retries for
safety.

>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> /* Turned on only when memory cgroup is enabled && really_do_swap_account = 0 */
> @@ -770,10 +771,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> * but there might be left over accounting, even after children
> * have left.
> */
> - ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> + ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> get_swappiness(root_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> - return 0;
> + return 1; /* indicate reclaim has succeeded */
> if (!root_mem->use_hierarchy)
> return ret;
>
> @@ -785,10 +786,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> next_mem = mem_cgroup_get_next_node(root_mem);
> continue;
> }
> - ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> + ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> get_swappiness(next_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> - return 0;
> + return 1; /* indicate reclaim has succeeded */
> next_mem = mem_cgroup_get_next_node(root_mem);
> }
> return ret;
> @@ -820,6 +821,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> {
> struct mem_cgroup *mem, *mem_over_limit;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> + int nr_retries_max = MEM_CGROUP_RECLAIM_RETRIES_MAX;
> struct res_counter *fail_res;
>
> if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> @@ -871,8 +873,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> + if (!nr_retries_max--)
> + goto oom;
> +
> ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> noswap);
> + if (ret)
> + continue;
>
> /*
> * try_to_free_mem_cgroup_pages() might not give us a full
> @@ -886,6 +893,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> continue;
>
> if (!nr_retries--) {
> +oom:
> if (oom) {
> mutex_lock(&memcg_tasklist);
> mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);

--
Balbir

2009-01-09 06:01:28

by Balbir Singh

Subject: Re: [RFC][PATCH 2/4] memcg: fix error path of mem_cgroup_move_parent

* Daisuke Nishimura <[email protected]> [2009-01-09 14:33:46]:

> > Looks good to me. Just out of curiosity, how did you catch this error?
> > Through review or testing?
> >
> Through testing.
>
> I sometimes got "an unremovable directory", which either had res.usage remaining
> even after all LRU lists had become empty, or had ref counts remaining
> even after res.usage had gone down to 0.
> I then tracked down the cause of this problem.
>

Thanks,

Acked-by: Balbir Singh <[email protected]>

--
Balbir

2009-01-09 06:05:40

by Daisuke Nishimura

Subject: Re: [RFC][PATCH 3/4] memcg: fix for mem_cgroup_hierarchical_reclaim

On Fri, 9 Jan 2009 11:03:23 +0530, Balbir Singh <[email protected]> wrote:
> * Daisuke Nishimura <[email protected]> [2009-01-08 19:15:01]:
>
> > If root_mem has no children, last_scanned_child is set to root_mem itself.
> > But after some children are added to root_mem, mem_cgroup_get_next_node can
> > mem_cgroup_put root_mem even though mem_cgroup_get has never been called on it.
> >
>
> Good catch!
>
Thanks :)

> > This patch fixes this behavior by:
> > - Set last_scanned_child to NULL if root_mem has no children or DFS search
> > has returned to root_mem itself (root_mem is not a "child" of root_mem).
> > Make mem_cgroup_get_first_node return root_mem in this case.
> > There are no mem_cgroup_get/put for root_mem.
> > - Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
> > mem_cgroup_get_first_node to mem_cgroup_get_next_node.
> > Make mem_cgroup_hierarchical_reclaim call only new mem_cgroup_get_next_node.
> >
>
> How have you tested these changes? When I wrote up the patches, I did
> several tests to make sure that all nodes in the hierarchy are covered
> while reclaiming in order.
>
I do something like:

1. mount memcg (at /cgroup/memory)
2. enable hierarchy (if testing use_hierarchy==1 case)
3. mkdir /cgroup/memory/01
4. run some programs in /cgroup/memory/01
5. select the next directory to move to at random from 01/, 01/aa, 01/bb,
02/, 02/aa and 02/bb.
6. move all processes to next directory.
7. remove the old directory if possible.
8. wait for a random period.
9. goto 5.

Before this patch, I sometimes got a general protection fault, which seemed
to be caused by an unexpected free of a mem_cgroup (and reuse of the area
for another purpose).


BTW, I think "mem_cgroup_put(mem->last_scanned_child)" is also needed
at mem_cgroup_destroy to prevent a memory leak.
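A minimal refcount model of why that put is needed — the scan takes a reference when it remembers a child, so destroy must drop the reference still held through last_scanned_child. Everything below is an illustrative sketch, not the kernel structures:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: last_scanned_child pins a child group via a reference.
 * Without a matching put on destroy, the child's refcount can never
 * reach zero and its memory would leak. */

struct model_memcg {
    int refcnt;
    struct model_memcg *last_scanned_child;
};

static void model_get(struct model_memcg *m) { m->refcnt++; }
static void model_put(struct model_memcg *m) { m->refcnt--; }

/* the scan remembers "child", taking a reference on it */
static void model_remember(struct model_memcg *root, struct model_memcg *child)
{
    if (root->last_scanned_child)
        model_put(root->last_scanned_child);
    model_get(child);
    root->last_scanned_child = child;
}

/* the proposed fix: destroy drops the remembered reference */
static void model_destroy(struct model_memcg *root)
{
    if (root->last_scanned_child) {
        model_put(root->last_scanned_child);
        root->last_scanned_child = NULL;
    }
}
```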

I'll update my patch later.


Thanks,
Daisuke Nishimura.

> >
> > Signed-off-by: Daisuke Nishimura <[email protected]>
> > ---
> > mm/memcontrol.c | 37 +++++++++++++++++++++----------------
> > 1 files changed, 21 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 288e22c..dc38a0e 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -622,7 +622,7 @@ unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> > * called with hierarchy_mutex held
> > */
> > static struct mem_cgroup *
> > -mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> > +__mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> > {
> > struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
> >
> > @@ -644,8 +644,8 @@ mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> > visit_parent:
> > if (curr_cgroup == root_cgroup) {
> > mem_cgroup_put(curr);
> > - curr = root_mem;
> > - mem_cgroup_get(curr);
> > + /* caller handles NULL case */
> > + curr = NULL;
> > goto done;
> > }
> >
> > @@ -668,7 +668,6 @@ visit_parent:
> > goto visit_parent;
> >
> > done:
> > - root_mem->last_scanned_child = curr;
> > return curr;
> > }
> >
> > @@ -678,20 +677,29 @@ done:
> > * that to reclaim free pages from.
> > */
> > static struct mem_cgroup *
> > -mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> > +mem_cgroup_get_next_node(struct mem_cgroup *root_mem)
> > {
> > struct cgroup *cgroup;
> > struct mem_cgroup *ret;
> > bool obsolete;
> >
> > - obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
> > -
> > /*
> > * Scan all children under the mem_cgroup mem
> > */
> > mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> > +
> > + obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
> > +
> > if (list_empty(&root_mem->css.cgroup->children)) {
> > - ret = root_mem;
> > + /*
> > + * root_mem might have children before and last_scanned_child
> > + * may point to one of them.
> > + */
> > + if (root_mem->last_scanned_child) {
> > + VM_BUG_ON(!obsolete);
> > + mem_cgroup_put(root_mem->last_scanned_child);
> > + }
> > + ret = NULL;
> > goto done;
> > }
> >
> > @@ -705,13 +713,13 @@ mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> > ret = mem_cgroup_from_cont(cgroup);
> > mem_cgroup_get(ret);
> > } else
> > - ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
> > + ret = __mem_cgroup_get_next_node(root_mem->last_scanned_child,
> > root_mem);
> >
> > done:
> > root_mem->last_scanned_child = ret;
> > mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> > - return ret;
> > + return (ret) ? ret : root_mem;
> > }
> >
> > static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
> > @@ -769,21 +777,18 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > if (!root_mem->use_hierarchy)
> > return ret;
> >
> > - next_mem = mem_cgroup_get_first_node(root_mem);
> > + next_mem = mem_cgroup_get_next_node(root_mem);
> >
> > while (next_mem != root_mem) {
> > if (mem_cgroup_is_obsolete(next_mem)) {
> > - mem_cgroup_put(next_mem);
> > - next_mem = mem_cgroup_get_first_node(root_mem);
> > + next_mem = mem_cgroup_get_next_node(root_mem);
> > continue;
> > }
> > ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> > get_swappiness(next_mem));
> > if (mem_cgroup_check_under_limit(root_mem))
> > return 0;
> > - mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> > - next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
> > - mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> > + next_mem = mem_cgroup_get_next_node(root_mem);
> > }
> > return ret;
> > }
> >
>
>
> Looks good to me, I need to test it though
>
> Acked-by: Balbir Singh <[email protected]>
>
> --
> Balbir

2009-01-09 08:53:56

by Daisuke Nishimura

Subject: Re: [RFC][PATCH 4/4] memcg: make oom less frequently

On Fri, 9 Jan 2009 11:28:04 +0530, Balbir Singh <[email protected]> wrote:
> * Daisuke Nishimura <[email protected]> [2009-01-08 19:15:20]:
>
> > In previous implementation, mem_cgroup_try_charge checked the return
> > value of mem_cgroup_try_to_free_pages, and just retried if some pages
> > had been reclaimed.
> > But now, try_charge(and mem_cgroup_hierarchical_reclaim called from it)
> > only checks whether the usage is less than the limit.
> >
> > This patch tries to change the behavior as before to cause oom less frequently.
> >
> > To prevent try_charge from getting stuck in infinite loop,
> > MEM_CGROUP_RECLAIM_RETRIES_MAX is defined.
> >
> >
> > Signed-off-by: Daisuke Nishimura <[email protected]>
> > ---
> > mm/memcontrol.c | 16 ++++++++++++----
> > 1 files changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 804c054..fedd76b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -42,6 +42,7 @@
> >
> > struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> > #define MEM_CGROUP_RECLAIM_RETRIES 5
> > +#define MEM_CGROUP_RECLAIM_RETRIES_MAX 32
>
> Why 32? Are you seeing frequent OOMs? I had 5 iterations to allow:
>
> 1. pages to move to swap cache, which added back pressure to memcg in
> the original implementation, since the pages came back
> 2. It takes longer to move and reclaim those pages.
>
> Ideally 3 would suffice, but I added an additional 2 retries for
> safety.
>
Before this patch, try_charge doesn't check the return value of
try_to_free_mem_cgroup_pages, i.e. how many pages have been reclaimed;
it only checks whether the usage has become less than the limit.
So, OOM can be caused if the group is too busy.

IIRC, the memory-cgroup-hierarchical-reclaim patch introduced this behavior,
and (I don't remember the details) some tests which had not caused OOM
started to cause OOM after it.
That was the motivation of my first version of this patch(*1).

*1 http://lkml.org/lkml/2008/11/28/35
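The policy difference can be sketched with a small userspace model that restores the "retry while reclaim makes progress" behavior: only once reclaim stops freeing anything does the bounded retry counter (and eventually OOM) come into play. Everything below — struct fields, batch size, function names — is an illustrative simulation, not the kernel code:

```c
#include <assert.h>

/* Toy model of the restored try_charge retry policy. */

enum charge_result { CHARGE_OK, CHARGE_OOM };

struct charge_state {
    long usage;
    long limit;
    long reclaimable;   /* pages the model can still reclaim */
};

/* reclaim up to `batch` pages; returns how many were freed */
static long model_reclaim(struct charge_state *cs, long batch)
{
    long freed = cs->reclaimable < batch ? cs->reclaimable : batch;
    cs->reclaimable -= freed;
    cs->usage -= freed;
    return freed;
}

static enum charge_result model_try_charge(struct charge_state *cs, int retries)
{
    while (cs->usage + 1 > cs->limit) {
        long progress = model_reclaim(cs, 32);

        if (cs->usage + 1 <= cs->limit)
            break;              /* under the limit again: charge succeeds */
        if (progress)
            continue;           /* reclaim made progress: keep retrying */
        if (retries-- == 0)
            return CHARGE_OOM;  /* no progress and retries exhausted */
    }
    cs->usage += 1;             /* account the new page */
    return CHARGE_OK;
}
```

A busy group whose pages are still reclaimable keeps retrying past the fixed retry count instead of declaring OOM; a group with nothing left to reclaim still hits OOM after the bounded retries.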

Anyway, this is the updated version.
I removed RETRIES_MAX.


Thanks,
Daisuke Nishimura.
===
From: Daisuke Nishimura <[email protected]>

In the previous implementation, mem_cgroup_try_charge checked the return
value of mem_cgroup_try_to_free_pages, and just retried if some pages
had been reclaimed.
But now, try_charge (and mem_cgroup_hierarchical_reclaim called from it)
only checks whether the usage is less than the limit.

This patch changes the behavior back so that OOM is caused less frequently.

ChangeLog: RFC->v1
- removed RETRIES_MAX.


Signed-off-by: Daisuke Nishimura <[email protected]>
---
mm/memcontrol.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7ba5c61..fb0e9eb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -781,10 +781,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
* but there might be left over accounting, even after children
* have left.
*/
- ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
+ ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
get_swappiness(root_mem));
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
+ return 1; /* indicate reclaim has succeeded */
if (!root_mem->use_hierarchy)
return ret;

@@ -795,10 +795,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
next_mem = mem_cgroup_get_next_node(root_mem);
continue;
}
- ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
+ ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
get_swappiness(next_mem));
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
+ return 1; /* indicate reclaim has succeeded */
next_mem = mem_cgroup_get_next_node(root_mem);
}
return ret;
@@ -883,6 +883,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,

ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
noswap);
+ if (ret)
+ continue;

/*
* try_to_free_mem_cgroup_pages() might not give us a full

2009-01-09 09:07:53

by Balbir Singh

Subject: Re: [RFC][PATCH 4/4] memcg: make oom less frequently

* Daisuke Nishimura <[email protected]> [2009-01-09 17:52:15]:

> On Fri, 9 Jan 2009 11:28:04 +0530, Balbir Singh <[email protected]> wrote:
> > * Daisuke Nishimura <[email protected]> [2009-01-08 19:15:20]:
> >
> > > In previous implementation, mem_cgroup_try_charge checked the return
> > > value of mem_cgroup_try_to_free_pages, and just retried if some pages
> > > had been reclaimed.
> > > But now, try_charge(and mem_cgroup_hierarchical_reclaim called from it)
> > > only checks whether the usage is less than the limit.
> > >
> > > This patch tries to change the behavior as before to cause oom less frequently.
> > >
> > > To prevent try_charge from getting stuck in infinite loop,
> > > MEM_CGROUP_RECLAIM_RETRIES_MAX is defined.
> > >
> > >
> > > Signed-off-by: Daisuke Nishimura <[email protected]>
> > > ---
> > > mm/memcontrol.c | 16 ++++++++++++----
> > > 1 files changed, 12 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 804c054..fedd76b 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -42,6 +42,7 @@
> > >
> > > struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> > > #define MEM_CGROUP_RECLAIM_RETRIES 5
> > > +#define MEM_CGROUP_RECLAIM_RETRIES_MAX 32
> >
> > Why 32? Are you seeing frequent OOMs? I had 5 iterations to allow:
> >
> > 1. pages to move to swap cache, which added back pressure to memcg in
> > the original implementation, since the pages came back
> > 2. It takes longer to move and reclaim those pages.
> >
> > Ideally 3 would suffice, but I added an additional 2 retries for
> > safety.
> >
> Before this patch, try_charge doesn't check the return value of
> try_to_free_mem_cgroup_pages, i.e. how many pages have been reclaimed;
> it only checks whether the usage has become less than the limit.
> So, OOM can be caused if the group is too busy.
>
> IIRC, the memory-cgroup-hierarchical-reclaim patch introduced this behavior,
> and (I don't remember the details) some tests which had not caused OOM
> started to cause OOM after it.
> That was the motivation of my first version of this patch(*1).
>
> *1 http://lkml.org/lkml/2008/11/28/35
>
> Anyway, this is the updated version.
> I removed RETRIES_MAX.
>
>
> Thanks,
> Daisuke Nishimura.
> ===
> From: Daisuke Nishimura <[email protected]>
>
> In previous implementation, mem_cgroup_try_charge checked the return
> value of mem_cgroup_try_to_free_pages, and just retried if some pages
> had been reclaimed.
> But now, try_charge(and mem_cgroup_hierarchical_reclaim called from it)
> only checks whether the usage is less than the limit.
>
> This patch tries to change the behavior as before to cause oom less frequently.
>
> ChangeLog: RFC->v1
> - removed RETRIES_MAX.
>
>
> Signed-off-by: Daisuke Nishimura <[email protected]>
> ---
> mm/memcontrol.c | 10 ++++++----
> 1 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7ba5c61..fb0e9eb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -781,10 +781,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> * but there might be left over accounting, even after children
> * have left.
> */
> - ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> + ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> get_swappiness(root_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> - return 0;
> + return 1; /* indicate reclaim has succeeded */
> if (!root_mem->use_hierarchy)
> return ret;
>
> @@ -795,10 +795,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> next_mem = mem_cgroup_get_next_node(root_mem);
> continue;
> }
> - ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> + ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> get_swappiness(next_mem));
> if (mem_cgroup_check_under_limit(root_mem))
> - return 0;
> + return 1; /* indicate reclaim has succeeded */
> next_mem = mem_cgroup_get_next_node(root_mem);
> }
> return ret;
> @@ -883,6 +883,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>
> ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> noswap);
> + if (ret)
> + continue;
>
> /*
> * try_to_free_mem_cgroup_pages() might not give us a full
>

This makes sense

Acked-by: Balbir Singh <[email protected]>


--
Balbir

2009-01-09 09:08:53

by Daisuke Nishimura

Subject: Re: [RFC][PATCH 3/4] memcg: fix for mem_cgroup_hierarchical_reclaim

On Fri, 9 Jan 2009 15:01:03 +0900, Daisuke Nishimura <[email protected]> wrote:
> On Fri, 9 Jan 2009 11:03:23 +0530, Balbir Singh <[email protected]> wrote:
> > * Daisuke Nishimura <[email protected]> [2009-01-08 19:15:01]:
> >
> > > If root_mem has no children, last_scanned_child is set to root_mem itself.
> > > But after some children are added to root_mem, mem_cgroup_get_next_node can
> > > mem_cgroup_put root_mem even though mem_cgroup_get has never been called on it.
> > >
> >
> > Good catch!
> >
> Thanks :)
>
> > > This patch fixes this behavior by:
> > > - Set last_scanned_child to NULL if root_mem has no children or DFS search
> > > has returned to root_mem itself (root_mem is not a "child" of root_mem).
> > > Make mem_cgroup_get_first_node return root_mem in this case.
> > > There are no mem_cgroup_get/put for root_mem.
> > > - Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
> > > mem_cgroup_get_first_node to mem_cgroup_get_next_node.
> > > Make mem_cgroup_hierarchical_reclaim call only new mem_cgroup_get_next_node.
> > >
> >
> > How have you tested these changes? When I wrote up the patches, I did
> > several tests to make sure that all nodes in the hierarchy are covered
> > while reclaiming in order.
> >
> I do something like:
>
> 1. mount memcg (at /cgroup/memory)
> 2. enable hierarchy (if testing use_hierarchy==1 case)
> 3. mkdir /cgroup/memory/01
> 4. run some programs in /cgroup/memory/01
> 5. select the next directory to move to at random from 01/, 01/aa, 01/bb,
> 02/, 02/aa and 02/bb.
> 6. move all processes to next directory.
> 7. remove the old directory if possible.
> 8. wait for a random period.
> 9. goto 5.
>
> Before this patch, I sometimes got a general protection fault, which seemed
> to be caused by an unexpected free of a mem_cgroup (and reuse of the area
> for another purpose).
>
>
> BTW, I think "mem_cgroup_put(mem->last_scanned_child)" is also needed
> at mem_cgroup_destroy to prevent a memory leak.
>
> I'll update my patch later.
>
This is the updated one.

I'm now testing these patches on -git12.
It has been working fine so far.


Thanks,
Daisuke Nishimura.
===
From: Daisuke Nishimura <[email protected]>

If root_mem has no children, last_scanned_child is set to root_mem itself.
But after some children are added to root_mem, mem_cgroup_get_next_node can
mem_cgroup_put root_mem even though mem_cgroup_get has never been called on it.

This patch fixes this behavior by:
- Set last_scanned_child to NULL if root_mem has no children or the DFS search
has returned to root_mem itself (root_mem is not a "child" of root_mem).
Make mem_cgroup_get_first_node return root_mem in this case.
There are no mem_cgroup_get/put for root_mem.
- Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and
mem_cgroup_get_first_node to mem_cgroup_get_next_node.
Make mem_cgroup_hierarchical_reclaim call only new mem_cgroup_get_next_node.

ChangeLog: RFC->v1
- add mem_cgroup_put of last_scanned_child at mem_cgroup_destroy.

Signed-off-by: Daisuke Nishimura <[email protected]>
---
mm/memcontrol.c | 46 +++++++++++++++++++++++++++++-----------------
1 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7be9b35..7ba5c61 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -633,7 +633,7 @@ unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
* called with hierarchy_mutex held
*/
static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
+__mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
{
struct cgroup *cgroup, *curr_cgroup, *root_cgroup;

@@ -655,8 +655,8 @@ mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
visit_parent:
if (curr_cgroup == root_cgroup) {
mem_cgroup_put(curr);
- curr = root_mem;
- mem_cgroup_get(curr);
+ /* caller handles NULL case */
+ curr = NULL;
goto done;
}

@@ -679,7 +679,6 @@ visit_parent:
goto visit_parent;

done:
- root_mem->last_scanned_child = curr;
return curr;
}

@@ -689,20 +688,29 @@ done:
* that to reclaim free pages from.
*/
static struct mem_cgroup *
-mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
+mem_cgroup_get_next_node(struct mem_cgroup *root_mem)
{
struct cgroup *cgroup;
struct mem_cgroup *ret;
bool obsolete;

- obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
-
/*
* Scan all children under the mem_cgroup mem
*/
mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
+
+ obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
+
if (list_empty(&root_mem->css.cgroup->children)) {
- ret = root_mem;
+ /*
+ * root_mem might have children before and last_scanned_child
+ * may point to one of them.
+ */
+ if (root_mem->last_scanned_child) {
+ VM_BUG_ON(!obsolete);
+ mem_cgroup_put(root_mem->last_scanned_child);
+ }
+ ret = NULL;
goto done;
}

@@ -716,13 +724,13 @@ mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
ret = mem_cgroup_from_cont(cgroup);
mem_cgroup_get(ret);
} else
- ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
+ ret = __mem_cgroup_get_next_node(root_mem->last_scanned_child,
root_mem);

done:
root_mem->last_scanned_child = ret;
mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
- return ret;
+ return (ret) ? ret : root_mem;
}

static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
@@ -780,21 +788,18 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
if (!root_mem->use_hierarchy)
return ret;

- next_mem = mem_cgroup_get_first_node(root_mem);
+ next_mem = mem_cgroup_get_next_node(root_mem);

while (next_mem != root_mem) {
if (mem_cgroup_is_obsolete(next_mem)) {
- mem_cgroup_put(next_mem);
- next_mem = mem_cgroup_get_first_node(root_mem);
+ next_mem = mem_cgroup_get_next_node(root_mem);
continue;
}
ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
get_swappiness(next_mem));
if (mem_cgroup_check_under_limit(root_mem))
return 0;
- mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
- next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
- mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
+ next_mem = mem_cgroup_get_next_node(root_mem);
}
return ret;
}
@@ -2254,7 +2259,14 @@ static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- mem_cgroup_put(mem_cgroup_from_cont(cont));
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ struct mem_cgroup *last_scanned_child = mem->last_scanned_child;
+
+ if (last_scanned_child) {
+ VM_BUG_ON(!mem_cgroup_is_obsolete(last_scanned_child));
+ mem_cgroup_put(last_scanned_child);
+ }
+ mem_cgroup_put(mem);
}

static int mem_cgroup_populate(struct cgroup_subsys *ss,

2009-01-09 09:38:21

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/4] memcg: make oom less frequently

On Fri, 9 Jan 2009 14:33:43 +0530
Balbir Singh <[email protected]> wrote:

> * Daisuke Nishimura <[email protected]> [2009-01-09 17:52:15]:
>
> > On Fri, 9 Jan 2009 11:28:04 +0530, Balbir Singh <[email protected]> wrote:
> > > * Daisuke Nishimura <[email protected]> [2009-01-08 19:15:20]:
> > >
> > > > In previous implementation, mem_cgroup_try_charge checked the return
> > > > value of mem_cgroup_try_to_free_pages, and just retried if some pages
> > > > had been reclaimed.
> > > > But now, try_charge(and mem_cgroup_hierarchical_reclaim called from it)
> > > > only checks whether the usage is less than the limit.
> > > >
> > > > This patch tries to change the behavior as before to cause oom less frequently.
> > > >
> > > > To prevent try_charge from getting stuck in infinite loop,
> > > > MEM_CGROUP_RECLAIM_RETRIES_MAX is defined.
> > > >
> > > >
> > > > Signed-off-by: Daisuke Nishimura <[email protected]>
> > > > ---
> > > > mm/memcontrol.c | 16 ++++++++++++----
> > > > 1 files changed, 12 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 804c054..fedd76b 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -42,6 +42,7 @@
> > > >
> > > > struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> > > > #define MEM_CGROUP_RECLAIM_RETRIES 5
> > > > +#define MEM_CGROUP_RECLAIM_RETRIES_MAX 32
> > >
> > > Why 32? Are you seeing frequent OOMs? I had 5 iterations to allow
> > >
> > > 1. pages to move to swap cache, which added back pressure to memcg in
> > > the original implementation, since the pages came back
> > > 2. It took longer to move, reclaim those pages.
> > >
> > > Ideally 3 would suffice, but I added an additional 2 retries for
> > > safety.
> > >
> > Before this patch, try_charge doesn't check the return value of
> > try_to_free_page, i.e. how many pages have been reclaimed, and
> > only checks whether the usage has become less than the limit.
> > So, oom can be caused if the group is too busy.
> >
> > IIRC memory-cgroup-hierarchical-reclaim patch introduced this behavior,
> > and, I don't remember in detail, some tests which had not caused oom
> > started to cause oom after it.
> > That was the motivation of my first version of this patch(*1).
> >
> > *1 http://lkml.org/lkml/2008/11/28/35
> >
> > Anyway, this is the updated version.
> > I removed RETRIES_MAX.
> >
> >
> > Thanks,
> > Daisuke Nishimura.
> > ===
> > From: Daisuke Nishimura <[email protected]>
> >
> > In previous implementation, mem_cgroup_try_charge checked the return
> > value of mem_cgroup_try_to_free_pages, and just retried if some pages
> > had been reclaimed.
> > But now, try_charge(and mem_cgroup_hierarchical_reclaim called from it)
> > only checks whether the usage is less than the limit.
> >
> > This patch tries to change the behavior as before to cause oom less frequently.
> >
> > ChangeLog: RFC->v1
> > - removed RETRIES_MAX.
> >
> >
> > Signed-off-by: Daisuke Nishimura <[email protected]>
> > ---
> > mm/memcontrol.c | 10 ++++++----
> > 1 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7ba5c61..fb0e9eb 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -781,10 +781,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > * but there might be left over accounting, even after children
> > * have left.
> > */
> > - ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> > + ret += try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> > get_swappiness(root_mem));
> > if (mem_cgroup_check_under_limit(root_mem))
> > - return 0;
> > + return 1; /* indicate reclaim has succeeded */
> > if (!root_mem->use_hierarchy)
> > return ret;
> >
> > @@ -795,10 +795,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > next_mem = mem_cgroup_get_next_node(root_mem);
> > continue;
> > }
> > - ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> > + ret += try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> > get_swappiness(next_mem));
> > if (mem_cgroup_check_under_limit(root_mem))
> > - return 0;
> > + return 1; /* indicate reclaim has succeeded */
> > next_mem = mem_cgroup_get_next_node(root_mem);
> > }
> > return ret;
> > @@ -883,6 +883,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >
> > ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> > noswap);
> > + if (ret)
> > + continue;
> >
> > /*
> > * try_to_free_mem_cgroup_pages() might not give us a full
> >
>
> This makes sense
>
> Acked-by: Balbir Singh <[email protected]>
>
me, too

Acked-by: KAMEZAWA Hiroyuki <[email protected]>



>
> --
> Balbir
>

2009-01-15 11:08:50

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH] mark_page_accessed() in do_swap_page() move latter than memcg charge


Sorry for the late response.

> > In this case we've hit a case where the page is valid and the pc is
> > not. This does fix the problem, but won't this impact us getting
> > correct reclaim stats and thus indirectly impact the working of
> > pressure?
> >
> - If retruns NULL, only global LRU's status is updated.
>
> Because this page is not belongs to any memcg, we cannot update
> any counters. But yes, your point is a concern.
>
> Maybe moving acitvate_page() to
> ==
> do_swap_page()
> {
>
> - activate_page()
> mem_cgroup_try_charge()..
> ....
> mem_cgroup_commit_charge()....
> ....
> + activate_page()
> }
> ==
> is necessary. How do you think, kosaki ?


OK, it makes sense, and my test found no bug.

==

mark_page_accessed() update reclaim_stat statics.
but currently, memcg charge is called after mark_page_accessed().

then, mark_page_accessed() don't update memcg statics correctly.

fixing here.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>

---
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/mm/memory.c
===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2426,8 +2426,6 @@ static int do_swap_page(struct mm_struct
count_vm_event(PGMAJFAULT);
}

- mark_page_accessed(page);
-
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

@@ -2480,6 +2478,8 @@ static int do_swap_page(struct mm_struct
try_to_free_swap(page);
unlock_page(page);

+ mark_page_accessed(page);
+
if (write_access) {
ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
if (ret & VM_FAULT_ERROR)


2009-01-15 11:13:46

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mark_page_accessed() in do_swap_page() move latter than memcg charge

On Thu, 15 Jan 2009 20:08:36 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

>
> sorry for late responce.
>
> > > In this case we've hit a case where the page is valid and the pc is
> > > not. This does fix the problem, but won't this impact us getting
> > > correct reclaim stats and thus indirectly impact the working of
> > > pressure?
> > >
> > - If retruns NULL, only global LRU's status is updated.
> >
> > Because this page is not belongs to any memcg, we cannot update
> > any counters. But yes, your point is a concern.
> >
> > Maybe moving acitvate_page() to
> > ==
> > do_swap_page()
> > {
> >
> > - activate_page()
> > mem_cgroup_try_charge()..
> > ....
> > mem_cgroup_commit_charge()....
> > ....
> > + activate_page()
> > }
> > ==
> > is necessary. How do you think, kosaki ?
>
>
> OK. it makes sense. and my test found no bug.
>
Thank you very much! KOSAKI.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

-Kame

> ==
>
> mark_page_accessed() update reclaim_stat statics.
> but currently, memcg charge is called after mark_page_accessed().
>
> then, mark_page_accessed() don't update memcg statics correctly.
>
> fixing here.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Daisuke Nishimura <[email protected]>
> Cc: Balbir Singh <[email protected]>
>
> ---
> mm/memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: b/mm/memory.c
> ===================================================================
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2426,8 +2426,6 @@ static int do_swap_page(struct mm_struct
> count_vm_event(PGMAJFAULT);
> }
>
> - mark_page_accessed(page);
> -
> lock_page(page);
> delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>
> @@ -2480,6 +2478,8 @@ static int do_swap_page(struct mm_struct
> try_to_free_swap(page);
> unlock_page(page);
>
> + mark_page_accessed(page);
> +
> if (write_access) {
> ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
> if (ret & VM_FAULT_ERROR)
>
>
>
>

2009-01-15 11:30:56

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] mark_page_accessed() in do_swap_page() move latter than memcg charge

* KOSAKI Motohiro <[email protected]> [2009-01-15 20:08:36]:

>
> sorry for late responce.
>
> > > In this case we've hit a case where the page is valid and the pc is
> > > not. This does fix the problem, but won't this impact us getting
> > > correct reclaim stats and thus indirectly impact the working of
> > > pressure?
> > >
> > - If retruns NULL, only global LRU's status is updated.
> >
> > Because this page is not belongs to any memcg, we cannot update
> > any counters. But yes, your point is a concern.
> >
> > Maybe moving acitvate_page() to
> > ==
> > do_swap_page()
> > {
> >
> > - activate_page()
> > mem_cgroup_try_charge()..
> > ....
> > mem_cgroup_commit_charge()....
> > ....
> > + activate_page()
> > }
> > ==
> > is necessary. How do you think, kosaki ?
>
>
> OK. it makes sense. and my test found no bug.
>
> ==
>
> mark_page_accessed() update reclaim_stat statics.
> but currently, memcg charge is called after mark_page_accessed().
>
> then, mark_page_accessed() don't update memcg statics correctly.
>
> fixing here.
>

Changelog needs to be a bit more elaborate, may be talk about invalid
pointer we hit if mark_page_accessed is not moved, etc.

> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Daisuke Nishimura <[email protected]>
> Cc: Balbir Singh <[email protected]>
>
> ---
> mm/memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: b/mm/memory.c
> ===================================================================
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2426,8 +2426,6 @@ static int do_swap_page(struct mm_struct
> count_vm_event(PGMAJFAULT);
> }
>
> - mark_page_accessed(page);
> -
> lock_page(page);
> delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>
> @@ -2480,6 +2478,8 @@ static int do_swap_page(struct mm_struct
> try_to_free_swap(page);
> unlock_page(page);
>
> + mark_page_accessed(page);
> +
> if (write_access) {
> ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
> if (ret & VM_FAULT_ERROR)
>
>
>

Looks good to me otherwise

Reviewed-by: Balbir Singh <[email protected]>

--
Balbir

2009-01-15 12:08:43

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mark_page_accessed() in do_swap_page() move latter than memcg charge

On Thu, 15 Jan 2009, KOSAKI Motohiro wrote:
>
> sorry for late responce.
>
> > > In this case we've hit a case where the page is valid and the pc is
> > > not. This does fix the problem, but won't this impact us getting
> > > correct reclaim stats and thus indirectly impact the working of
> > > pressure?
> > >
> > - If retruns NULL, only global LRU's status is updated.
> >
> > Because this page is not belongs to any memcg, we cannot update
> > any counters. But yes, your point is a concern.
> >
> > Maybe moving acitvate_page() to
> > ==
> > do_swap_page()
> > {
> >
> > - activate_page()
> > mem_cgroup_try_charge()..
> > ....
> > mem_cgroup_commit_charge()....
> > ....
> > + activate_page()
> > }
> > ==
> > is necessary. How do you think, kosaki ?
>
>
> OK. it makes sense. and my test found no bug.
>
> ==
>
> mark_page_accessed() update reclaim_stat statics.
> but currently, memcg charge is called after mark_page_accessed().
>
> then, mark_page_accessed() don't update memcg statics correctly.

Statics? "Stats" is a good abbreviation for statistics,
but statics are something else.

>
> fixing here.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Daisuke Nishimura <[email protected]>
> Cc: Balbir Singh <[email protected]>
>
> ---
> mm/memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: b/mm/memory.c
> ===================================================================
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2426,8 +2426,6 @@ static int do_swap_page(struct mm_struct
> count_vm_event(PGMAJFAULT);
> }
>
> - mark_page_accessed(page);
> -
> lock_page(page);
> delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>
> @@ -2480,6 +2478,8 @@ static int do_swap_page(struct mm_struct
> try_to_free_swap(page);
> unlock_page(page);
>
> + mark_page_accessed(page);
> +
> if (write_access) {
> ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
> if (ret & VM_FAULT_ERROR)

This catches my eye, because I'd discussed with Nick and was going to
send in a patch which entirely _removes_ this mark_page_accessed call
from do_swap_page (and replaces follow_page's mark_page_accessed call
by a pte_mkyoung): they seem inconsistent to me, in the light of
bf3f3bc5e734706730c12a323f9b2068052aa1f0 mm: don't mark_page_accessed
in fault path.

Though I need to give it another think through first: the situation
is muddied by the way we (rightly) don't bother to do the mark_page_
accessed on Anon in zap_pte_range anyway; and anon/swap has an
independent lifecycle now with the separate swapbacked LRUs.

What do you think? I didn't look further into your memcg situation,
and what this patch is about: I'm unclear whether my patch to remove
that mark_page_accessed would solve your problem, or mess you up.

Hugh

2009-01-15 12:28:59

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mark_page_accessed() in do_swap_page() move latter than memcg charge

Hugh Dickins wrote:
> On Thu, 15 Jan 2009, KOSAKI Motohiro wrote:
>>
>> sorry for late responce.
>>
>> > > In this case we've hit a case where the page is valid and the pc is
>> > > not. This does fix the problem, but won't this impact us getting
>> > > correct reclaim stats and thus indirectly impact the working of
>> > > pressure?
>> > >
>> > - If retruns NULL, only global LRU's status is updated.
>> >
>> > Because this page is not belongs to any memcg, we cannot update
>> > any counters. But yes, your point is a concern.
>> >
>> > Maybe moving acitvate_page() to
>> > ==
>> > do_swap_page()
>> > {
>> >
>> > - activate_page()
>> > mem_cgroup_try_charge()..
>> > ....
>> > mem_cgroup_commit_charge()....
>> > ....
>> > + activate_page()
>> > }
>> > ==
>> > is necessary. How do you think, kosaki ?
>>
>>
>> OK. it makes sense. and my test found no bug.
>>
>> ==
>>
>> mark_page_accessed() update reclaim_stat statics.
>> but currently, memcg charge is called after mark_page_accessed().
>>
>> then, mark_page_accessed() don't update memcg statics correctly.
>
> Statics? "Stats" is a good abbreviation for statistics,
> but statics are something else.
>
>>
>> fixing here.
>>
>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>> Cc: KAMEZAWA Hiroyuki <[email protected]>
>> Cc: Daisuke Nishimura <[email protected]>
>> Cc: Balbir Singh <[email protected]>
>>
>> ---
>> mm/memory.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> Index: b/mm/memory.c
>> ===================================================================
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2426,8 +2426,6 @@ static int do_swap_page(struct mm_struct
>> count_vm_event(PGMAJFAULT);
>> }
>>
>> - mark_page_accessed(page);
>> -
>> lock_page(page);
>> delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>>
>> @@ -2480,6 +2478,8 @@ static int do_swap_page(struct mm_struct
>> try_to_free_swap(page);
>> unlock_page(page);
>>
>> + mark_page_accessed(page);
>> +
>> if (write_access) {
>> ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
>> if (ret & VM_FAULT_ERROR)
>
> This catches my eye, because I'd discussed with Nick and was going to
> send in a patch which entirely _removes_ this mark_page_accessed call
> from do_swap_page (and replaces follow_page's mark_page_accessed call
> by a pte_mkyoung): they seem inconsistent to me, in the light of
> bf3f3bc5e734706730c12a323f9b2068052aa1f0 mm: don't mark_page_accessed
> in fault path.
>
Hmm

> Though I need to give it another think through first: the situation
> is muddied by the way we (rightly) don't bother to do the mark_page_
> accessed on Anon in zap_pte_range anyway; and anon/swap has an
> independent lifecycle now with the separate swapbacked LRUs.
>
> What do you think? I didn't look further into your memcg situation,
> and what this patch is about: I'm unclear whether my patch to remove
> that mark_page_accessed would solve your problem, or mess you up.
>
For the memcg situation, there were 2 problems.

1. page_cgroup->mem_cgroup was accessed before it was updated.
(in mmotm, fixed by Nishimura's patch, avoiding the panic.)
2. mem_cgroup's reclaim_stat is not updated correctly.

1. is fixed. Kosaki's patch is for "2".

If mark_page_accessed() is entirely removed, I have no concerns for
memcg; I'll just retest memcg's reclaim logic.

Anyway, at the end of last year it showed some unfair behavior;
I'll restart digging if the problem still exists. ;(

Thanks,
-Kame

2009-01-15 13:35:50

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mark_page_accessed() in do_swap_page() move latter than memcg charge

(CC to Rik and Nick)

Hi

Thank you for reviewing.

> > mark_page_accessed() update reclaim_stat statics.
> > but currently, memcg charge is called after mark_page_accessed().
> >
> > then, mark_page_accessed() don't update memcg statics correctly.
>
> Statics? "Stats" is a good abbreviation for statistics,
> but statics are something else.

Doh! You are definitely right. Thanks.

> > ---
> > mm/memory.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > Index: b/mm/memory.c
> > ===================================================================
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2426,8 +2426,6 @@ static int do_swap_page(struct mm_struct
> > count_vm_event(PGMAJFAULT);
> > }
> >
> > - mark_page_accessed(page);
> > -
> > lock_page(page);
> > delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> >
> > @@ -2480,6 +2478,8 @@ static int do_swap_page(struct mm_struct
> > try_to_free_swap(page);
> > unlock_page(page);
> >
> > + mark_page_accessed(page);
> > +
> > if (write_access) {
> > ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
> > if (ret & VM_FAULT_ERROR)
>
> This catches my eye, because I'd discussed with Nick and was going to
> send in a patch which entirely _removes_ this mark_page_accessed call
> from do_swap_page (and replaces follow_page's mark_page_accessed call
> by a pte_mkyoung): they seem inconsistent to me, in the light of
> bf3f3bc5e734706730c12a323f9b2068052aa1f0 mm: don't mark_page_accessed
> in fault path.

Actually, bf3f3bc5e734706730c12a323f9b2068052aa1f0 only removed
the mark_page_accessed() in filemap_fault().
Current mmotm's do_swap_page() still has mark_page_accessed().

But your suggestion is very worthwhile.
OK, I'm thinking it over and sorting it out again.

Rik's commit 9ff473b9a72942c5ac0ad35607cae28d8d59ed7a "vmscan: evict streaming IO first"
means "reclaim statistics aren't updated only at reclaim, but also at fault and read/write;
that gives proper anon/file reclaim balancing statistics before actual reclaim starts",
and it depends on the fault path calling mark_page_accessed().

Then, I think we need the following changes:

- Remove the mark_page_accessed() call from do_swap_page();
this makes it consistent with filemap_fault().
- Add a call to update_page_reclaim_stat() in do_swap_page() and
filemap_fault().

Am I overlooking something?


> Though I need to give it another think through first: the situation
> is muddied by the way we (rightly) don't bother to do the mark_page_
> accessed on Anon in zap_pte_range anyway; and anon/swap has an
> independent lifecycle now with the separate swapbacked LRUs.
>
> What do you think? I didn't look further into your memcg situation,
> and what this patch is about: I'm unclear whether my patch to remove
> that mark_page_accessed would solve your problem, or mess you up.

===========================
Subject: [PATCH] mm: solve reclaim statistics confliction

commit 9ff473b9a72942c5ac0ad35607cae28d8d59ed7a "vmscan: evict streaming IO first"
requires reclaim statistics to be updated at page fault time, and depends on the
fault path calling mark_page_accessed().

but commit bf3f3bc5e734706730c12a323f9b2068052aa1f0 "mm: don't mark_page_accessed
in fault path" requires removing mark_page_accessed() from the fault path.

These requirements conflict.


The solution is:
- Remove the mark_page_accessed() call from do_swap_page();
this makes it consistent with filemap_fault().
- Add a call to update_page_reclaim_stat() in do_swap_page() and
filemap_fault().


Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Rik van Riel <[email protected]>
---
include/linux/swap.h | 2 ++
mm/filemap.c | 1 +
mm/memory.c | 4 ++--
mm/swap.c | 4 ++--
4 files changed, 7 insertions(+), 4 deletions(-)

Index: b/include/linux/swap.h
===================================================================
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -185,6 +185,8 @@ extern void rotate_reclaimable_page(stru
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
+extern void update_page_reclaim_stat(struct zone *zone, struct page *page,
+ int file, int rotated);

/**
* lru_cache_add: add a page to the page lists
Index: b/mm/filemap.c
===================================================================
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1637,6 +1637,7 @@ retry_page_update:
/*
* Found the page and have a reference on it.
*/
+ update_page_reclaim_stat(page_zone(page), page, 1, 1);
ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
vmf->page = page;
return ret | VM_FAULT_LOCKED;
Index: b/mm/memory.c
===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2426,8 +2426,6 @@ static int do_swap_page(struct mm_struct
count_vm_event(PGMAJFAULT);
}

- mark_page_accessed(page);
-
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

@@ -2480,6 +2478,8 @@ static int do_swap_page(struct mm_struct
try_to_free_swap(page);
unlock_page(page);

+ update_page_reclaim_stat(page_zone(page), page, 0, 1);
+
if (write_access) {
ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
if (ret & VM_FAULT_ERROR)
Index: b/mm/swap.c
===================================================================
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -151,8 +151,8 @@ void rotate_reclaimable_page(struct pag
}
}

-static void update_page_reclaim_stat(struct zone *zone, struct page *page,
- int file, int rotated)
+void update_page_reclaim_stat(struct zone *zone, struct page *page,
+ int file, int rotated)
{
struct zone_reclaim_stat *reclaim_stat = &zone->reclaim_stat;
struct zone_reclaim_stat *memcg_reclaim_stat;

2009-01-15 13:43:30

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mark_page_accessed() in do_swap_page() move latter than memcg charge

> (CC to Rik and Nick)
>
> Hi
>
> Thank you reviewing.
>
> > > mark_page_accessed() update reclaim_stat statics.
> > > but currently, memcg charge is called after mark_page_accessed().
> > >
> > > then, mark_page_accessed() don't update memcg statics correctly.
> >
> > Statics? "Stats" is a good abbreviation for statistics,
> > but statics are something else.
>
> Doh! your are definitly right. thanks.
>
> > > ---
> > > mm/memory.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > Index: b/mm/memory.c
> > > ===================================================================
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2426,8 +2426,6 @@ static int do_swap_page(struct mm_struct
> > > count_vm_event(PGMAJFAULT);
> > > }
> > >
> > > - mark_page_accessed(page);
> > > -
> > > lock_page(page);
> > > delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> > >
> > > @@ -2480,6 +2478,8 @@ static int do_swap_page(struct mm_struct
> > > try_to_free_swap(page);
> > > unlock_page(page);
> > >
> > > + mark_page_accessed(page);
> > > +
> > > if (write_access) {
> > > ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
> > > if (ret & VM_FAULT_ERROR)
> >
> > This catches my eye, because I'd discussed with Nick and was going to
> > send in a patch which entirely _removes_ this mark_page_accessed call
> > from do_swap_page (and replaces follow_page's mark_page_accessed call
> > by a pte_mkyoung): they seem inconsistent to me, in the light of
> > bf3f3bc5e734706730c12a323f9b2068052aa1f0 mm: don't mark_page_accessed
> > in fault path.
>
> Actually, bf3f3bc5e734706730c12a323f9b2068052aa1f0 only remove
> the mark_page_accessed() in filemap_fault().
> current mmotm's do_swap_page() still have mark_page_accessed().
>
> but your suggestion is very worth.
> ok, I'm thinking and sorting out again.
>
> Rik's commit 9ff473b9a72942c5ac0ad35607cae28d8d59ed7a vmscan: evict streaming IO first
> does "reclaim stastics don't only update at reclaim, but also at fault and read/write.
> it makes proper anon/file reclaim balancing stastics value before starting actual reclaim".
> and it depend on fault path calling mark_page_accessed().
>
> Then, we need following change. I think.
>
> - Remove calling mark_page_accessed() in do_swap_page().
> it makes consistency against filemap_fault().
> - Add calling update_page_reclaim_stat() into do_swap_page() and
> filemap_fault().
>
> Am I overlooking something?

Doh! Please ignore the last mail's patch. I forgot to grab zone->lru_lock;
it's completely buggy.

I'll remake it tomorrow.