Date: Fri, 9 Oct 2009 13:17:51 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: balbir@linux.vnet.ibm.com
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp
Subject: Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation
Message-Id: <20091009131751.03eea135.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20091009040142.GN6818@balbir.in.ibm.com>

On Fri, 9 Oct 2009 09:31:42 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-10-02 14:01:26]:
>
> > In a massively parallel environment, res_counter can be a performance
> > bottleneck. One strong technique for reducing lock contention is to
> > coalesce several calls into one.
> >
> > Considering charge/uncharge characteristics:
> > - charge is done one by one via demand paging.
> > - uncharge is done
> >   - in chunks at munmap, truncate, exit, execve...
> >   - one by one via vmscan/paging.
> >
> > It seems we have a chance in uncharge at unmap/truncation.
>
> A chance to improve scalability?

Yes.

> > This patch implements coalescing of uncharges. To avoid scattering
> > memcg structures into functions under /mm, this patch adds memcg
> > batch-uncharge information to the task.
>
> Is there a reason for associating the batch with the task rather than
> making it per-cpu or per-memcg?

Per-cpu is bad because we can't be robust against task migration.
(I mean, per-cpu cannot be as clean/simple as this patch.)

Per-memcg doesn't work well either. If we design it per-memcg, we'll
need some logic like a percpu counter: its batch length cannot be big
enough, and we would need finer-grained synchronization than this patch
uses.

> per-memcg, I suspect, would add some locking overhead; per-cpu would
> require synchronization across cpus while uncharging. Is that where
> per-task helps?

Yes, that's one of the reasons for this design.

> I suspect per-mm and per-signal will have the issues above.

Per-mm and per-signal won't help: multiple threads can do file
truncation (and unmap) in parallel.

The biggest reason for per-task is that I'd like to change this behavior
depending on the calling _context_. A per-task batch means we can make
use of the caller's context. For example, when uncharge is called by
try_to_unmap() or __remove_mapping() in vmscan.c, we do direct uncharge.

> > The degree of coalescing depends on the caller:
> > - at invalidate/truncate... pagevec size
> > - at unmap... ZAP_BLOCK_SIZE
> > (memory itself will be freed at this granularity.)
> > So we won't coalesce too much.
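As a side note, the effect of coalescing can be sketched in plain
user-space C. This is only a hypothetical model, not the patch's code:
fake_res_counter and the pthread mutex stand in for res_counter and its
internal spinlock, and the per-thread batch is just a local counter.

/* Model of coalesced uncharge: one lock round trip per batch, not per page. */
#include <pthread.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

struct fake_res_counter {
	pthread_mutex_t lock;	/* stands in for res_counter's spinlock */
	unsigned long usage;
};

static struct fake_res_counter res = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.usage = 128 * PAGE_SIZE,
};

/* One-by-one uncharge: one lock acquisition per page. */
static void uncharge_one(void)
{
	pthread_mutex_lock(&res.lock);
	res.usage -= PAGE_SIZE;
	pthread_mutex_unlock(&res.lock);
}

/* Coalesced uncharge: accumulate locally, take the lock once. */
static void uncharge_batch(unsigned long pages)
{
	pthread_mutex_lock(&res.lock);
	res.usage -= pages * PAGE_SIZE;
	pthread_mutex_unlock(&res.lock);
}

int main(void)
{
	unsigned long i, batch = 0;

	for (i = 0; i < 32; i++)
		uncharge_one();		/* 32 lock acquisitions */

	for (i = 0; i < 32; i++)
		batch++;		/* free pages, just remember the count */
	uncharge_batch(batch);		/* 1 lock acquisition */

	printf("usage now %lu pages\n", res.usage / PAGE_SIZE);
	return 0;
}

The patch plays the same trick against res_counter, with the per-task
memcg_batch struct playing the role of the local counter.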
> >
> > Changelog(now):
> > - rebased onto the latest mmotm + softlimit fix patches.
> >
> > Changelog(old):
> > - unified the patch for all callers.
> > - added comments.
> > - made ->do_batch a bool.
> > - removed css_get() et al. We don't need it.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/memcontrol.h |   13 ++++++
> >  include/linux/sched.h      |    7 +++
> >  mm/memcontrol.c            |   91 ++++++++++++++++++++++++++++++++++++++++++---
> >  mm/memory.c                |    2
> >  mm/truncate.c              |    6 ++
> >  5 files changed, 113 insertions(+), 6 deletions(-)
> >
> > Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
> > +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> > @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
> >  extern void mem_cgroup_del_lru(struct page *page);
> >  extern void mem_cgroup_move_lists(struct page *page,
> >  				enum lru_list from, enum lru_list to);
> > +
> > +/* For coalescing uncharges to reduce memcg overhead */
> > +extern void mem_cgroup_uncharge_start(void);
> > +extern void mem_cgroup_uncharge_end(void);
> > +
> >  extern void mem_cgroup_uncharge_page(struct page *page);
> >  extern void mem_cgroup_uncharge_cache_page(struct page *page);
> >  extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> > @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
> >  {
> >  }
> >
> > +static inline void mem_cgroup_uncharge_start(void)
> > +{
> > +}
> > +
> > +static inline void mem_cgroup_uncharge_end(void)
> > +{
> > +}
> > +
> >  static inline void mem_cgroup_uncharge_page(struct page *page)
> >  {
> >  }
> > Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> > +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> > @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
> >  	css_put(&mem->css);
> >  }
> >
> > +static void
> > +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> > +{
> > +	struct memcg_batch_info *batch = NULL;
> > +	bool uncharge_memsw = true;
> > +
> > +	/* If swapout, usage of swap doesn't decrease */
> > +	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > +		uncharge_memsw = false;
> > +	/*
> > +	 * do_batch > 0 when unmapping pages or invalidating/truncating
> > +	 * an inode. In those cases, all pages freed continuously can be
> > +	 * expected to be in the same cgroup, and we have a chance to
> > +	 * coalesce uncharges. And we uncharge one by one if the task is
> > +	 * being killed by OOM.
> > +	 */
> > +	if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> > +		goto direct_uncharge;
>
> Should we also not uncharge the current batch when the task is dying?
>

TIF_MEMDIE means OOM has happened. I take the "reduce usage as fast as
possible" approach here.

> > +
> > +	batch = &current->memcg_batch;
> > +	/*
> > +	 * Usually we do css_get() when we remember a memcg pointer.
> > +	 * But in this case, we keep res->usage until the end of a series
> > +	 * of uncharges, so it's OK to ignore memcg's refcnt.
> > +	 */
> > +	if (!batch->memcg)
> > +		batch->memcg = mem;
> > +	/*
> > +	 * In the typical case, batch->memcg == mem. This means we can
> > +	 * merge a series of uncharges into one uncharge of res_counter.
> > +	 * If not, we uncharge res_counter one by one.
> > +	 */
> > +	if (batch->memcg != mem)
> > +		goto direct_uncharge;
> > +	/* remember freed charge and uncharge it later */
> > +	batch->pages += PAGE_SIZE;
> > +	if (uncharge_memsw)
> > +		batch->memsw += PAGE_SIZE;
> > +	return;
> > +direct_uncharge:
> > +	res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +	if (uncharge_memsw)
> > +		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > +	return;
> > +}
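To make the merge-or-fallback decision above easier to follow, here is a
minimal user-space model of it. The names (struct memcg, struct
batch_info, do_uncharge) are simplified stand-ins I made up for
illustration, not the kernel types:

/* Model of __do_uncharge()'s three cases: direct, adopt, merge. */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SZ 4096UL

struct memcg { unsigned long usage; };

struct batch_info {
	int do_batch;		/* nesting depth; 0 = batching off */
	struct memcg *memcg;	/* first memcg seen in this batch */
	unsigned long pages;	/* bytes deferred so far */
};

static void direct_uncharge(struct memcg *mem, unsigned long sz)
{
	mem->usage -= sz;	/* res_counter_uncharge() stands here */
}

static void do_uncharge(struct batch_info *b, struct memcg *mem,
			unsigned long sz, bool dying)
{
	if (!b->do_batch || dying) {	/* no batch open, or OOM kill */
		direct_uncharge(mem, sz);
		return;
	}
	if (!b->memcg)			/* first page: adopt its memcg */
		b->memcg = mem;
	if (b->memcg != mem) {		/* different memcg: fall back */
		direct_uncharge(mem, sz);
		return;
	}
	b->pages += sz;			/* same memcg: defer and merge */
}

int main(void)
{
	struct memcg a = { .usage = 10 * PAGE_SZ }, c = { .usage = PAGE_SZ };
	struct batch_info b = { .do_batch = 1 };

	do_uncharge(&b, &a, PAGE_SZ, false);	/* deferred */
	do_uncharge(&b, &a, PAGE_SZ, false);	/* deferred, merged */
	do_uncharge(&b, &c, PAGE_SZ, false);	/* other memcg: direct */
	direct_uncharge(b.memcg, b.pages);	/* flush, as uncharge_end() does */
	printf("a=%lu c=%lu pages\n", a.usage / PAGE_SZ, c.usage / PAGE_SZ);
	return 0;
}

Note that the fallback keeps correctness when a pagevec spans two
memcgs; only the common single-memcg case gets the cheap path.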
> >
> >  /*
> >   * uncharge if !page_mapped(page)
> > @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
> >  		break;
> >  	}
> >
> > -	if (!mem_cgroup_is_root(mem)) {
> > -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> > -		if (do_swap_account &&
> > -			(ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> > -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > -	}
> > +	if (!mem_cgroup_is_root(mem))
> > +		__do_uncharge(mem, ctype);
> >  	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> >  		mem_cgroup_swap_statistics(mem, true);
> >  	mem_cgroup_charge_statistics(mem, pc, false);
> > @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
> >  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> >  }
> >
> > +/*
> > + * batch_start/batch_end is called from unmap_page_range/invalidate/truncate.
> > + * In those cases, pages are freed continuously and we can expect them
> > + * to be in the same memcg. All these callers themselves limit the number
> > + * of pages freed at once, so uncharge_start/end() are called properly.
> > + */
> > +
> > +void mem_cgroup_uncharge_start(void)
> > +{
> > +	if (!current->memcg_batch.do_batch) {
> > +		current->memcg_batch.memcg = NULL;
> > +		current->memcg_batch.pages = 0;
> > +		current->memcg_batch.memsw = 0;
> > +	}
> > +	current->memcg_batch.do_batch++;
> > +}
> > +
> > +void mem_cgroup_uncharge_end(void)
> > +{
> > +	struct mem_cgroup *mem;
> > +
> > +	if (!current->memcg_batch.do_batch)
> > +		return;
> > +
> > +	current->memcg_batch.do_batch--;
> > +	if (current->memcg_batch.do_batch) /* Nested ? */
> > +		return;
> > +
> > +	mem = current->memcg_batch.memcg;
> > +	if (!mem)
> > +		return;
> > +	/* This "mem" is valid because we hid charges behind us. */
> > +	if (current->memcg_batch.pages)
> > +		res_counter_uncharge(&mem->res, current->memcg_batch.pages);
> > +	if (current->memcg_batch.memsw)
> > +		res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
> > +	/* Not strictly necessary, but clear this pointer anyway */
> > +	current->memcg_batch.memcg = NULL;
> > +}
> > +
> >  #ifdef CONFIG_SWAP
> >  /*
> >   * called after __delete_from_swap_cache() and drop "page" account.
> > Index: mmotm-2.6.31-Sep28/include/linux/sched.h
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
> > +++ mmotm-2.6.31-Sep28/include/linux/sched.h
> > @@ -1549,6 +1549,13 @@ struct task_struct {
> >  	unsigned long trace_recursion;
> >  #endif /* CONFIG_TRACING */
> >  	unsigned long stack_start;
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> > +	struct memcg_batch_info {
> > +		int do_batch;
> > +		struct mem_cgroup *memcg;
> > +		long pages, memsw;
> > +	} memcg_batch;
> > +#endif
> >  };
> >
> >  /* Future-safe accessor for struct task_struct's cpus_allowed. */
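The do_batch field is a nesting counter rather than a flag, so a
start/end pair inside an already-open batch merges into the outer one
and only the outermost end flushes. A tiny user-space model of just that
protocol (the names here are made up for illustration):

/* Model of the uncharge_start/end nesting: flush only at the outermost end. */
#include <stdio.h>

static struct {
	int do_batch;		/* nesting depth */
	unsigned long pages;	/* deferred uncharges */
} batch;

static void uncharge_start(void)
{
	if (!batch.do_batch)
		batch.pages = 0;	/* fresh outermost section */
	batch.do_batch++;
}

static void uncharge_end(void)
{
	if (!batch.do_batch)
		return;
	if (--batch.do_batch)		/* still nested: keep deferring */
		return;
	if (batch.pages)
		printf("flush %lu deferred pages\n", batch.pages);
}

int main(void)
{
	uncharge_start();
	batch.pages += 2;	/* outer section frees pages */
	uncharge_start();	/* nested section opens */
	batch.pages += 3;
	uncharge_end();		/* no flush: still nested */
	uncharge_end();		/* prints: flush 5 deferred pages */
	return 0;
}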
> > Index: mmotm-2.6.31-Sep28/mm/memory.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/memory.c
> > +++ mmotm-2.6.31-Sep28/mm/memory.c
> > @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
> >  		details = NULL;
> >
> >  	BUG_ON(addr >= end);
> > +	mem_cgroup_uncharge_start();
> >  	tlb_start_vma(tlb, vma);
> >  	pgd = pgd_offset(vma->vm_mm, addr);
> >  	do {
> > @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
> >  				zap_work, details);
> >  	} while (pgd++, addr = next, (addr != end && *zap_work > 0));
> >  	tlb_end_vma(tlb, vma);
> > +	mem_cgroup_uncharge_end();
> >
> >  	return addr;
> >  }
> > Index: mmotm-2.6.31-Sep28/mm/truncate.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
> > +++ mmotm-2.6.31-Sep28/mm/truncate.c
> > @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
> >  			pagevec_release(&pvec);
> >  			break;
> >  		}
> > +		mem_cgroup_uncharge_start();
> >  		for (i = 0; i < pagevec_count(&pvec); i++) {
> >  			struct page *page = pvec.pages[i];
> >
> > @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
> >  			unlock_page(page);
> >  		}
> >  		pagevec_release(&pvec);
> > +		mem_cgroup_uncharge_end();
> >  	}
> >  }
> >  EXPORT_SYMBOL(truncate_inode_pages_range);
> > @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
> >  	pagevec_init(&pvec, 0);
> >  	while (next <= end &&
> >  		pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> > +		mem_cgroup_uncharge_start();
> >  		for (i = 0; i < pagevec_count(&pvec); i++) {
> >  			struct page *page = pvec.pages[i];
> >  			pgoff_t index;
> > @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
> >  				break;
> >  		}
> >  		pagevec_release(&pvec);
> > +		mem_cgroup_uncharge_end();
> >  		cond_resched();
> >  	}
> >  	return ret;
> > @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
> >  	while (next <= end && !wrapped &&
> >  		pagevec_lookup(&pvec, mapping, next,
> >  			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> > +		mem_cgroup_uncharge_start();
> >  		for (i = 0; i < pagevec_count(&pvec); i++) {
> >  			struct page *page = pvec.pages[i];
> >  			pgoff_t page_index;
> > @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
> >  			unlock_page(page);
> >  		}
> >  		pagevec_release(&pvec);
> > +		mem_cgroup_uncharge_end();
> >  		cond_resched();
> >  	}
> >  	return ret;
> >
>
> The patch overall looks good; I just had some questions about it.
>

Thanks,
-Kame