Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753606Ab0AMFlT (ORCPT ); Wed, 13 Jan 2010 00:41:19 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753189Ab0AMFlS (ORCPT ); Wed, 13 Jan 2010 00:41:18 -0500 Received: from TYO201.gate.nec.co.jp ([202.32.8.193]:49188 "EHLO tyo201.gate.nec.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750976Ab0AMFlS (ORCPT ); Wed, 13 Jan 2010 00:41:18 -0500 Date: Wed, 13 Jan 2010 14:27:07 +0900 From: Daisuke Nishimura To: KAMEZAWA Hiroyuki Cc: Andrew Morton , LKML , linux-mm , stable , Balbir Singh , Daisuke Nishimura Subject: Re: [BUGFIX][PATCH] memcg: ensure list is empty at rmdir Message-Id: <20100113142707.1c857d1d.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20100113122754.d390d0a2.kamezawa.hiroyu@jp.fujitsu.com> References: <20100112140836.45e7fabb.nishimura@mxp.nes.nec.co.jp> <20100113103006.8cf3b23c.nishimura@mxp.nes.nec.co.jp> <20100113122754.d390d0a2.kamezawa.hiroyu@jp.fujitsu.com> Organization: NEC Soft, Ltd. X-Mailer: Sylpheed 2.6.0 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9531 Lines: 176 On Wed, 13 Jan 2010 12:27:54 +0900, KAMEZAWA Hiroyuki wrote: > On Wed, 13 Jan 2010 10:30:06 +0900 > Daisuke Nishimura wrote: > > This patch tries to fix this bug by ensuring not only the usage is zero but also > > all of the LRUs are empty. mem_cgroup_force_empty_list() checks the list is empty > > or not, so we can make use of it. > > > > Hmm, too short ? ;) fix me if following is wrong. > > Logical Background. > > The problem here is pages on LRU may contain pointer to stale memcg. To make > res->usage to be 0, all pages on memcg must be uncharged. Uncharge page_cgroup or must be moved to another(parent) memcg. > contains pointer to memcg withou PCG_USED bit. (This asynchronous LRU work is > for improving performance.) If PCG_USED bit is not set, page_cgroup will never > be added to memcg's LRU. So, about pages not on LRU, they never access stale > pointer. Then, what we have to take care of is page_cgroup _on_ LRU list. > > Before this patch, mem->res.usage is checked after lru_add_drain(). But this ^^^^^ before > doesn't guarantee memcg's LRU is really empty (considering races with other cpus.) > In usual workload, in most case, current logic works without bug. (Considering > how rmdir->force_empty() works..). But in some heavy workload case, pages remain > on LRU can cause invalid access to freed memcg. This patch fixes rmdir->force_empty > to visit all all LRUs before exiting this force_empty loop and guarantee there > are no pages on memcg's LRU. > I don't think the problem lies in the place where lru_add_drain() is called. There are some cases in which !PageCgrupUsed pages exist on memcg's LRU. For example: - Pages can be uncharged by its owner process while they are on LRU. - race between mem_cgroup_add_lru_list() and __mem_cgroup_uncharge_common(). So, I think the problem in current code is mem_cgroup_force_empty_list() isn't called when the mem->res.usage is(or has become) zero. We must check both mem->res.usage and the LRU. That's why I changed "while() {}" to "do {} while()". Anyway, thank you for your greate explanation. It makes the problem more clear. This is the updated version(no change in the body of patch). Thanks, Daisuke Nishimura. === From: Daisuke Nishimura Current mem_cgroup_force_empty() only ensures mem->res.usage == 0 on success. But this doesn't guarantee memcg's LRU is really empty, because there are some cases in which !PageCgrupUsed pages exist on memcg's LRU. For example: - Pages can be uncharged by its owner process while they are on LRU. - race between mem_cgroup_add_lru_list() and __mem_cgroup_uncharge_common(). So there can be a case in which the usage is zero but some of the LRUs are not empty. OTOH, mem_cgroup_del_lru_list(), which can be called asynchronously with rmdir, accesses the mem_cgroup, so this access can cause a problem if it races with rmdir because the mem_cgroup might have been freed by rmdir. Actually, I saw a bug which seems to be caused by this race. [1530745.949906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [1530745.950651] IP: [] mem_cgroup_del_lru_list+0x30/0x80 [1530745.950651] PGD 3863de067 PUD 3862c7067 PMD 0 [1530745.950651] Oops: 0002 [#1] SMP [1530745.950651] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index1/shared_cpu_map [1530745.950651] CPU 3 [1530745.950651] Modules linked in: configs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp nfsd nfs_acl auth_rpcgss exportfs autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc battery ac lp kvm_intel kvm sg ide_cd_mod cdrom serio_raw tpm_tis tpm tpm_bios acpi_memhotplug button parport_pc parport rtc_cmos rtc_core rtc_lib e1000 i2c_i801 i2c_core pcspkr dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table] [1530745.950651] Pid: 19653, comm: shmem_test_02 Tainted: G M 2.6.32-mm1-00701-g2b04386 #3 Express5800/140Rd-4 [N8100-1065] [1530745.950651] RIP: 0010:[] [] mem_cgroup_del_lru_list+0x30/0x80 [1530745.950651] RSP: 0018:ffff8803863ddcb8 EFLAGS: 00010002 [1530745.950651] RAX: 00000000000001e0 RBX: ffff8803abc02238 RCX: 00000000000001e0 [1530745.950651] RDX: 0000000000000000 RSI: ffff88038611a000 RDI: ffff8803abc02238 [1530745.950651] RBP: ffff8803863ddcc8 R08: 0000000000000002 R09: ffff8803a04c8643 [1530745.950651] R10: 0000000000000000 R11: ffffffff810c7333 R12: 0000000000000000 [1530745.950651] R13: ffff880000017f00 R14: 0000000000000092 R15: ffff8800179d0310 [1530745.950651] FS: 0000000000000000(0000) GS:ffff880017800000(0000) knlGS:0000000000000000 [1530745.950651] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [1530745.950651] CR2: 0000000000000230 CR3: 0000000379d87000 CR4: 00000000000006e0 [1530745.950651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1530745.950651] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [1530745.950651] Process shmem_test_02 (pid: 19653, threadinfo ffff8803863dc000, task ffff88038612a8a0) [1530745.950651] Stack: [1530745.950651] ffffea00040c2fe8 0000000000000000 ffff8803863ddd98 ffffffff810c739a [1530745.950651] <0> 00000000863ddd18 000000000000000c 0000000000000000 0000000000000000 [1530745.950651] <0> 0000000000000002 0000000000000000 ffff8803863ddd68 0000000000000046 [1530745.950651] Call Trace: [1530745.950651] [] release_pages+0x142/0x1e7 [1530745.950651] [] ? pagevec_move_tail+0x6e/0x112 [1530745.950651] [] pagevec_move_tail+0xfd/0x112 [1530745.950651] [] lru_add_drain+0x76/0x94 [1530745.950651] [] exit_mmap+0x6e/0x145 [1530745.950651] [] mmput+0x5e/0xcf [1530745.950651] [] exit_mm+0x11c/0x129 [1530745.950651] [] ? audit_free+0x196/0x1c9 [1530745.950651] [] do_exit+0x1f5/0x6b7 [1530745.950651] [] ? up_read+0x2b/0x2f [1530745.950651] [] ? lockdep_sys_exit_thunk+0x35/0x67 [1530745.950651] [] do_group_exit+0x83/0xb0 [1530745.950651] [] sys_exit_group+0x17/0x1b [1530745.950651] [] system_call_fastpath+0x16/0x1b [1530745.950651] Code: 54 53 0f 1f 44 00 00 83 3d cc 29 7c 00 00 41 89 f4 75 63 eb 4e 48 83 7b 08 00 75 04 0f 0b eb fe 48 89 df e8 18 f3 ff ff 44 89 e2 <48> ff 4c d0 50 48 8b 05 2b 2d 7c 00 48 39 43 08 74 39 48 8b 4b [1530745.950651] RIP [] mem_cgroup_del_lru_list+0x30/0x80 [1530745.950651] RSP [1530745.950651] CR2: 0000000000000230 [1530745.950651] ---[ end trace c3419c1bb8acc34f ]--- [1530745.950651] Fixing recursive fault but reboot is needed! The problem here is pages on LRU may contain pointer to stale memcg. To make res->usage to be 0, all pages on memcg must be uncharged or moved to another(parent) memcg. Moved page_cgroup have already removed from original LRU, but uncharged page_cgroup contains pointer to memcg withou PCG_USED bit. (This asynchronous LRU work is for improving performance.) If PCG_USED bit is not set, page_cgroup will never be added to memcg's LRU. So, about pages not on LRU, they never access stale pointer. Then, what we have to take care of is page_cgroup _on_ LRU list. This patch fixes this problem by making mem_cgroup_force_empty() visit all LRUs before exiting its loop and guarantee there are no pages on its LRU. Signed-off-by: Daisuke Nishimura Acked-by: KAMEZAWA Hiroyuki Cc: Balbir Singh Cc: stable@kernel.org --- This patch is based on 2.6.33-rc3, and can be applied to older versions too. mm/memcontrol.c | 11 ++++------- 1 files changed, 4 insertions(+), 7 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 488b644..954032b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2586,7 +2586,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool free_all) if (free_all) goto try_to_free; move_account: - while (mem->res.usage > 0) { + do { ret = -EBUSY; if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children)) goto out; @@ -2614,8 +2614,8 @@ move_account: if (ret == -ENOMEM) goto try_to_free; cond_resched(); - } - ret = 0; + /* "ret" should also be checked to ensure all lists are empty. */ + } while (mem->res.usage > 0 || ret); out: css_put(&mem->css); return ret; @@ -2648,10 +2648,7 @@ try_to_free: } lru_add_drain(); /* try move_account...there may be some *locked* pages. */ - if (mem->res.usage) - goto move_account; - ret = 0; - goto out; + goto move_account; } int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/