Date: Fri, 27 May 2011 13:34:31 +0900
From: KAMEZAWA Hiroyuki
To: Ying Han
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
    nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com
Subject: Re: [RFC][PATCH v3 0/10] memcg async reclaim

On Thu, 26 May 2011 21:33:32 -0700
Ying Han wrote:

> On Thu, May 26, 2011 at 7:16 PM, KAMEZAWA Hiroyuki wrote:
> > On Thu, 26 May 2011 18:49:26 -0700
> > Ying Han wrote:
> >
> >> On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki wrote:
> >> >
> >> > It's now the merge window... I'm just dumping my patch queue to hear
> >> > others' ideas. I wonder if I should wait until dirty_ratio for memcg
> >> > is queued to mmotm... I'll be busy with LinuxCon Japan etc. next week.
> >> >
> >> > This patch set is on top of mmotm-May-11 plus some patches already
> >> > queued in mmotm, such as numa_stat.
> >> >
> >> > This is a patch set for memcg to keep a margin to the limit in the
> >> > background. By keeping some margin to the limit in the background, an
> >> > application can avoid foreground memory reclaim at charge(), and this
> >> > helps latency.
> >> >
> >> > Main changes from v2:
> >> >  - use SCHED_IDLE.
> >> >  - removed most of the heuristic code. Now the code is very simple.
> >> >
> >> > By using SCHED_IDLE, async memory reclaim consumes only about 0.3% of
> >> > CPU when the system is truly busy, but can use much more CPU when the
> >> > CPU is idle. Because my purpose is to reduce latency without affecting
> >> > other running applications, SCHED_IDLE fits this work.
> >> >
> >> > If the application has to stop for some I/O or event, background
> >> > memory reclaim will cull memory while the system is idle.
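Before the numbers, the mechanism above in a minimal kernel-style C sketch.
This is not the actual patch (the patch drives reclaim from a SCHED_IDLE
workqueue); the sketch uses a kthread only to show the scheduling idea, and
mem_cgroup_margin_low() and mem_cgroup_shrink_margin() are hypothetical
stand-ins for the patch's internals.

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

/* Per-memcg background reclaimer running at SCHED_IDLE (sketch). */
static int memcg_async_reclaimd(void *data)
{
	struct mem_cgroup *memcg = data;
	struct sched_param param = { .sched_priority = 0 };

	/* SCHED_IDLE: run only when the CPU has nothing better to do. */
	sched_setscheduler(current, SCHED_IDLE, &param);

	while (!kthread_should_stop()) {
		if (mem_cgroup_margin_low(memcg))	/* hypothetical */
			mem_cgroup_shrink_margin(memcg);/* hypothetical */
		else	/* enough margin; nap and recheck */
			schedule_timeout_interruptible(HZ / 10);
	}
	return 0;
}

A SCHED_IDLE task has a tiny CFS weight (3 versus 1024 for a nice-0 task,
roughly 0.3%), which is why the reclaimer barely competes with a busy system
but can soak up idle cycles freely.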
> >> >
> >> > Performance:
> >> >  Running an httpd (apache) under a 300M limit, accessing a 600MB
> >> >  working set with a normalized access distribution, driven by
> >> >  apache-bench. apache-bench's concurrency was 4 and it did 40960
> >> >  accesses.
> >> >
> >> > Without async reclaim:
> >> > Connection Times (ms)
> >> >              min  mean[+/-sd] median   max
> >> > Connect:        0    0   0.0      0       2
> >> > Processing:    30   37  28.3     32    1793
> >> > Waiting:       28   35  25.5     31    1792
> >> > Total:         30   37  28.4     32    1793
> >> >
> >> > Percentage of the requests served within a certain time (ms)
> >> >  50%     32
> >> >  66%     32
> >> >  75%     33
> >> >  80%     34
> >> >  90%     39
> >> >  95%     60
> >> >  98%    100
> >> >  99%    133
> >> >  100%   1793 (longest request)
> >> >
> >> > With async reclaim:
> >> > Connection Times (ms)
> >> >              min  mean[+/-sd] median   max
> >> > Connect:        0    0   0.0      0       2
> >> > Processing:    30   35  12.3     32     678
> >> > Waiting:       28   34  12.0     31     658
> >> > Total:         30   35  12.3     32     678
> >> >
> >> > Percentage of the requests served within a certain time (ms)
> >> >  50%     32
> >> >  66%     32
> >> >  75%     33
> >> >  80%     34
> >> >  90%     39
> >> >  95%     49
> >> >  98%     71
> >> >  99%     86
> >> >  100%    678 (longest request)
> >> >
> >> > It seems latency is stabilized by hiding memory reclaim.
> >> >
> >> > The memory-reclaim scores were as follows; see patch 10 for the
> >> > meaning of each field.
> >> >
> >> > == without async reclaim ==
> >> > recent_scan_success_ratio 44
> >> > limit_scan_pages 388463
> >> > limit_freed_pages 162238
> >> > limit_elapsed_ns 13852159231
> >> > soft_scan_pages 0
> >> > soft_freed_pages 0
> >> > soft_elapsed_ns 0
> >> > margin_scan_pages 0
> >> > margin_freed_pages 0
> >> > margin_elapsed_ns 0
> >> >
> >> > == with async reclaim ==
> >> > recent_scan_success_ratio 6
> >> > limit_scan_pages 0
> >> > limit_freed_pages 0
> >> > limit_elapsed_ns 0
> >> > soft_scan_pages 0
> >> > soft_freed_pages 0
> >> > soft_elapsed_ns 0
> >> > margin_scan_pages 1295556
> >> > margin_freed_pages 122450
> >> > margin_elapsed_ns 644881521
> >> >
> >> > For this case, the SCHED_IDLE workqueue can reclaim enough memory for
> >> > the httpd.
> >> >
> >> > I may need to dig into why scan_success_ratio differs so much between
> >> > the two cases. I guess the difference in elapsed_ns is because several
> >> > threads enter memory reclaim when async reclaim doesn't run. But maybe
> >> > not...
> >> >
> >>
> >> Hmm.. I noticed very strange behavior in a simple test with the patch
> >> set.
> >>
> >> Test:
> >> I created a 4g memcg and started doing cat. The memcg then got OOM
> >> killed as soon as it reached its hard limit. We shouldn't hit OOM even
> >> without async reclaim.
> >>
> >> Again, I will read through the patch, but I'd like to post the test
> >> result first.
> >>
> >> $ echo $$ >/dev/cgroup/memory/A/tasks
> >> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
> >> 4294967296
> >>
> >> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
> >> Killed
> >>
> >
> > I did the same kind of test without any problem... but OK, I'll do more
> > testing later.
> >
> >
> >> real  0m53.565s
> >> user  0m0.061s
> >> sys   0m4.814s
> >>
> >> Here is the OOM log:
> >>
> >> May 26 18:43:00  kernel: [  963.489112] cat invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
> >> May 26 18:43:00  kernel: [  963.489121] Pid: 9425, comm: cat Tainted: G        W   2.6.39-mcg-DEV #131
> >> May 26 18:43:00  kernel: [  963.489123] Call Trace:
> >> May 26 18:43:00  kernel: [  963.489134]  [] dump_header+0x82/0x1af
> >> May 26 18:43:00  kernel: [  963.489137]  [] ? spin_lock+0xe/0x10
> >> May 26 18:43:00  kernel: [  963.489140]  [] ? find_lock_task_mm+0x2d/0x67
> >> May 26 18:43:00  kernel: [  963.489143]  [] oom_kill_process+0x50/0x27b
> >> May 26 18:43:00  kernel: [  963.489155]  [] mem_cgroup_out_of_memory+0x9a/0xe4
> >> May 26 18:43:00  kernel: [  963.489160]  [] mem_cgroup_handle_oom+0x134/0x1fe
> >> May 26 18:43:00  kernel: [  963.489163]  [] ? __mem_cgroup_insert_exceeded+0x83/0x83
> >> May 26 18:43:00  kernel: [  963.489176]  [] __mem_cgroup_try_charge.clone.3+0x368/0x43a
> >> May 26 18:43:00  kernel: [  963.489179]  [] mem_cgroup_cache_charge+0x95/0x123
> >> May 26 18:43:00  kernel: [  963.489183]  [] add_to_page_cache_locked+0x42/0x114
> >> May 26 18:43:00  kernel: [  963.489185]  [] add_to_page_cache_lru+0x31/0x5f
> >> May 26 18:43:00  kernel: [  963.489189]  [] mpage_readpages+0xb6/0x132
> >> May 26 18:43:00  kernel: [  963.489194]  [] ? noalloc_get_block_write+0x24/0x24
> >> May 26 18:43:00  kernel: [  963.489197]  [] ? noalloc_get_block_write+0x24/0x24
> >> May 26 18:43:00  kernel: [  963.489201]  [] ? __switch_to+0x160/0x212
> >> May 26 18:43:00  kernel: [  963.489205]  [] ext4_readpages+0x1d/0x1f
> >> May 26 18:43:00  kernel: [  963.489209]  [] __do_page_cache_readahead+0x144/0x1e3
> >> May 26 18:43:00  kernel: [  963.489212]  [] ra_submit+0x21/0x25
> >> May 26 18:43:00  kernel: [  963.489215]  [] ondemand_readahead+0x18c/0x19f
> >> May 26 18:43:00  kernel: [  963.489218]  [] page_cache_async_readahead+0x7d/0x86
> >> May 26 18:43:00  kernel: [  963.489221]  [] generic_file_aio_read+0x2d8/0x5fe
> >> May 26 18:43:00  kernel: [  963.489225]  [] do_sync_read+0xcb/0x108
> >> May 26 18:43:00  kernel: [  963.489230]  [] ? fsnotify_perm+0x66/0x72
> >> May 26 18:43:00  kernel: [  963.489233]  [] ? security_file_permission+0x2e/0x33
> >> May 26 18:43:00  kernel: [  963.489236]  [] vfs_read+0xab/0x107
> >> May 26 18:43:00  kernel: [  963.489239]  [] sys_read+0x4a/0x6e
> >> May 26 18:43:00  kernel: [  963.489244]  [] sysenter_dispatch+0x7/0x27
> >> May 26 18:43:00  kernel: [  963.489248] Task in /A killed as a result of limit of /A
> >> May 26 18:43:00  kernel: [  963.489251] memory: usage 4194304kB, limit 4194304kB, failcnt 26
> >> May 26 18:43:00  kernel: [  963.489253] memory+swap: usage 0kB, limit 9007199254740991kB, failcnt 0
> >>
> >
> > Hmm, why is memory+swap usage 0kB here...
> >
> > In this set, I used mem_cgroup_margin() rather than res_counter_margin().
> > Hmm, did you disable swap accounting? If so, I may have missed something.
>
> Yes, I disabled swap accounting in .config:
> # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
>
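Before Ying's reproduction steps continue below, a hedged sketch of how a
margin check could misbehave when swap accounting is disabled. The
res/memsw counters in struct mem_cgroup, do_swap_account, and
res_counter_margin() are real 2.6.39 memcontrol internals; the two helpers
and the choice of counter are assumptions for illustration, not the actual
patch code.

/*
 * Hypothetical failure mode: if the async-reclaim trigger measured its
 * margin on the memory+swap counter, a kernel that never charges memsw
 * (CONFIG_CGROUP_MEM_RES_CTLR_SWAP unset) would see usage == 0 and an
 * effectively unlimited limit, i.e. an infinite margin, so background
 * reclaim would never start and charge() would slam into the hard limit.
 */
static unsigned long long margin_buggy(struct mem_cgroup *memcg)
{
	return res_counter_margin(&memcg->memsw);	/* wrong counter */
}

static unsigned long long margin_safe(struct mem_cgroup *memcg)
{
	if (do_swap_account)
		return min(res_counter_margin(&memcg->res),
			   res_counter_margin(&memcg->memsw));
	return res_counter_margin(&memcg->res);		/* memory only */
}

The "memory+swap: usage 0kB, limit 9007199254740991kB" line in the OOM
report is consistent with a memsw counter that is never charged.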
> Here is how I reproduce it:
>
> $ mkdir /dev/cgroup/memory/D
> $ echo 4g >/dev/cgroup/memory/D/memory.limit_in_bytes
>
> $ cat /dev/cgroup/memory/D/memory.limit_in_bytes
> 4294967296
>
> $ cat /dev/cgroup/memory/D/memory.
> memory.async_control            memory.max_usage_in_bytes
> memory.soft_limit_in_bytes      memory.use_hierarchy
> memory.failcnt                  memory.move_charge_at_immigrate
> memory.stat                     memory.force_empty
> memory.oom_control              memory.swappiness
> memory.limit_in_bytes           memory.reclaim_stat
> memory.usage_in_bytes
>
> $ cat /dev/cgroup/memory/D/memory.async_control
> 0
> $ echo 1 >/dev/cgroup/memory/D/memory.async_control
> $ cat /dev/cgroup/memory/D/memory.async_control
> 1
>
> $ echo $$ >/dev/cgroup/memory/D/tasks
> $ cat /proc/4358/cgroup
> 3:memory:/D
>
> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
> Killed
>

If you applied my patches correctly, async_control should be visible only
when the swap controller is configured, because of a BUG in the patch.

I could cat a 20G file under a 4G limit without any problem with the boot
option swapaccount=0.

Is there no problem if async_control == 0?

Thanks,
-Kame
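A hedged guess at the BUG mentioned in the reply (the thread does not show
the patch, so the array placement and handler names below are assumptions):
in 2.6.39 memcontrol.c, memsw_cgroup_files[] is registered by
register_memsw_files() only when do_swap_account is true, while
mem_cgroup_files[] is registered unconditionally. An async_control cftype
entry placed in the memsw array would therefore be invisible exactly when
the swap controller is not configured, matching what Kame describes.

/*
 * Hypothetical reconstruction in memcontrol.c context, not the actual
 * patch. mem_cgroup_async_read()/mem_cgroup_async_write() are made-up
 * handler names for the memory.async_control file.
 */
static struct cftype memcg_async_files[] = {
	{
		.name = "async_control",
		.read_u64 = mem_cgroup_async_read,
		.write_u64 = mem_cgroup_async_write,
	},
};

/*
 * Correct placement: register alongside the always-present memory
 * files, not from register_memsw_files(), which returns early when
 * do_swap_account is false.
 */
static int register_async_files(struct cgroup *cont,
				struct cgroup_subsys *ss)
{
	return cgroup_add_files(cont, ss, memcg_async_files,
				ARRAY_SIZE(memcg_async_files));
}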