Date: Fri, 9 Oct 2009 09:31:42 +0530
From: Balbir Singh
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    akpm@linux-foundation.org, nishimura@mxp.nes.nec.co.jp
Subject: Re: [PATCH 1/2] memcg: coalescing uncharge at unmap and truncation
Message-ID: <20091009040142.GN6818@balbir.in.ibm.com>
Reply-To: balbir@linux.vnet.ibm.com
References: <20091002135531.3b5abf5c.kamezawa.hiroyu@jp.fujitsu.com>
 <20091002140126.61d15e5e.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20091002140126.61d15e5e.kamezawa.hiroyu@jp.fujitsu.com>
User-Agent: Mutt/1.5.20 (2009-06-14)

* KAMEZAWA Hiroyuki [2009-10-02 14:01:26]:

>
> In a massively parallel environment, res_counter can be a performance
> bottleneck. One strong technique to reduce lock contention is to reduce
> the number of calls by coalescing some amount of calls into one.
>
> Considering charge/uncharge characteristics,
>  - charge is done one by one via demand-paging.
>  - uncharge is done
>    - in chunks at munmap, truncate, exit, execve...
>    - one by one via vmscan/paging.
>
> It seems we have a chance in uncharge at unmap/truncation.

A chance to improve scalability?

>
> This patch is for coalescing uncharge. To avoid scattering memcg's
> structure into functions under mm/, this patch adds memcg batch uncharge
> information to the task.

Is there a reason for associating the batch with the task rather than
making it per-cpu or per-memcg? Per-memcg, I suspect, would add some
locking overhead, and per-cpu would require synchronization across CPUs
while uncharging; is that where per-task helps? I suspect per-mm and
per-signal would have the same issues. (A rough sketch of the comparison
I have in mind follows the quoted changelog below.)

> The degree of coalescing depends on the callers:
>  - at invalidate/truncate ... pagevec size
>  - at unmap ................ ZAP_BLOCK_SIZE
>  (memory itself will be freed in this granularity.)
> So we will not coalesce too much.
>
> Changelog(now):
>  - rebased onto the latest mmotm + softlimit fix patches.
>
> Changelog(old):
>  - unified patch for callers
>  - added comments.
>  - made ->do_batch a bool.
>  - removed css_get() et al. We don't need it.
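To make the per-cpu part of my question concrete, this is the kind of
thing I imagine a per-cpu batch inside mm/memcontrol.c would look like.
It is only a sketch of the alternative, not anything from your patch;
percpu_uncharge_batch and memcg_percpu_batch_add() are made-up names:

	#include <linux/percpu.h>

	/* Hypothetical per-cpu batch, for illustration only. */
	static DEFINE_PER_CPU(struct memcg_batch_info, percpu_uncharge_batch);

	static void memcg_percpu_batch_add(struct mem_cgroup *mem, long bytes)
	{
		/* get_cpu_var() pins us to this CPU while we touch the batch. */
		struct memcg_batch_info *batch = &get_cpu_var(percpu_uncharge_batch);

		if (!batch->memcg)
			batch->memcg = mem;
		if (batch->memcg == mem)
			batch->pages += bytes;
		else
			res_counter_uncharge(&mem->res, bytes);
		put_cpu_var(percpu_uncharge_batch);
	}

The problem I see is that the start()/end() window spans code that can
sleep and reschedule (page locks, cond_resched()), so the batch cannot
stay pinned to one CPU for the whole window without extra
synchronization, while current->memcg_batch is only ever touched by
"current" and needs none. Is that the reasoning?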
>
> Signed-off-by: KAMEZAWA Hiroyuki
> ---
>  include/linux/memcontrol.h |   13 ++++++
>  include/linux/sched.h      |    7 +++
>  mm/memcontrol.c            |   91 ++++++++++++++++++++++++++++++++++++++++++---
>  mm/memory.c                |    2
>  mm/truncate.c              |    6 ++
>  5 files changed, 113 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.31-Sep28/include/linux/memcontrol.h
> @@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
>  extern void mem_cgroup_del_lru(struct page *page);
>  extern void mem_cgroup_move_lists(struct page *page,
>  				  enum lru_list from, enum lru_list to);
> +
> +/* For coalescing uncharge, to reduce memcg's overhead */
> +extern void mem_cgroup_uncharge_start(void);
> +extern void mem_cgroup_uncharge_end(void);
> +
>  extern void mem_cgroup_uncharge_page(struct page *page);
>  extern void mem_cgroup_uncharge_cache_page(struct page *page);
>  extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> @@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
>  {
>  }
>
> +static inline void mem_cgroup_uncharge_start(void)
> +{
> +}
> +
> +static inline void mem_cgroup_uncharge_end(void)
> +{
> +}
> +
>  static inline void mem_cgroup_uncharge_page(struct page *page)
>  {
>  }
> Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Sep28/mm/memcontrol.c
> @@ -1826,6 +1826,49 @@ void mem_cgroup_cancel_charge_swapin(str
>  	css_put(&mem->css);
>  }
>
> +static void
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +{
> +	struct memcg_batch_info *batch = NULL;
> +	bool uncharge_memsw = true;
> +	/* If swapout, usage of swap doesn't decrease */
> +	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> +		uncharge_memsw = false;
> +	/*
> +	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> +	 * In those cases, all pages freed continuously can be expected to be
> +	 * in the same cgroup and we have a chance to coalesce uncharges.
> +	 * And we do uncharge one by one if this task is being killed by OOM.
> +	 */
> +	if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
> +		goto direct_uncharge;

Should we also flush (uncharge) the charges already accumulated in the
current batch when the task is dying, rather than only falling back to
direct uncharge for the pages that follow?

> +
> +	batch = &current->memcg_batch;
> +	/*
> +	 * Usually we do css_get() when we remember a memcg pointer.
> +	 * But in this case, we keep res->usage until the end of a series of
> +	 * uncharges. Then it's OK to ignore memcg's refcnt.
> +	 */
> +	if (!batch->memcg)
> +		batch->memcg = mem;
> +	/*
> +	 * In the typical case, batch->memcg == mem. This means we can
> +	 * merge a series of uncharges into one uncharge of res_counter.
> +	 * If not, we uncharge res_counter one by one.
> +	 */
> +	if (batch->memcg != mem)
> +		goto direct_uncharge;
> +	/* remember freed charge and uncharge it later */
> +	batch->pages += PAGE_SIZE;
> +	if (uncharge_memsw)
> +		batch->memsw += PAGE_SIZE;
> +	return;
> +direct_uncharge:
> +	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	if (uncharge_memsw)
> +		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +	return;
> +}
>
>  /*
>   * uncharge if !page_mapped(page)
> @@ -1874,12 +1917,8 @@ __mem_cgroup_uncharge_common(struct page
>  		break;
>  	}
>
> -	if (!mem_cgroup_is_root(mem)) {
> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> -		if (do_swap_account &&
> -				(ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> -	}
> +	if (!mem_cgroup_is_root(mem))
> +		__do_uncharge(mem, ctype);
>  	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>  		mem_cgroup_swap_statistics(mem, true);
>  	mem_cgroup_charge_statistics(mem, pc, false);
> @@ -1925,6 +1964,46 @@ void mem_cgroup_uncharge_cache_page(stru
>  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
>  }
>
> +/*
> + * batch_start/batch_end is called from unmap_page_range/invalidate/truncate.
> + * In those cases, pages are freed continuously and we can expect the pages
> + * to be in the same memcg. Each of these callers itself limits the number
> + * of pages freed at once, so uncharge_start/end() is called properly.
> + */
> +
> +void mem_cgroup_uncharge_start(void)
> +{
> +	if (!current->memcg_batch.do_batch) {
> +		current->memcg_batch.memcg = NULL;
> +		current->memcg_batch.pages = 0;
> +		current->memcg_batch.memsw = 0;
> +	}
> +	current->memcg_batch.do_batch++;
> +}
> +
> +void mem_cgroup_uncharge_end(void)
> +{
> +	struct mem_cgroup *mem;
> +
> +	if (!current->memcg_batch.do_batch)
> +		return;
> +
> +	current->memcg_batch.do_batch--;
> +	if (current->memcg_batch.do_batch) /* Nested ? */
> +		return;
> +
> +	mem = current->memcg_batch.memcg;
> +	if (!mem)
> +		return;
> +	/* This "mem" is valid because we hid the charges behind us. */
> +	if (current->memcg_batch.pages)
> +		res_counter_uncharge(&mem->res, current->memcg_batch.pages);
> +	if (current->memcg_batch.memsw)
> +		res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
> +	/* Not strictly necessary, but forget this pointer */
> +	current->memcg_batch.memcg = NULL;
> +}
> +
>  #ifdef CONFIG_SWAP
>  /*
>   * called after __delete_from_swap_cache() and drop "page" account.
> Index: mmotm-2.6.31-Sep28/include/linux/sched.h
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/include/linux/sched.h
> +++ mmotm-2.6.31-Sep28/include/linux/sched.h
> @@ -1549,6 +1549,13 @@ struct task_struct {
>  	unsigned long trace_recursion;
>  #endif /* CONFIG_TRACING */
>  	unsigned long stack_start;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> +	struct memcg_batch_info {
> +		int do_batch;
> +		struct mem_cgroup *memcg;
> +		long pages, memsw;
> +	} memcg_batch;
> +#endif
>  };
>
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> Index: mmotm-2.6.31-Sep28/mm/memory.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/memory.c
> +++ mmotm-2.6.31-Sep28/mm/memory.c
> @@ -940,6 +940,7 @@ static unsigned long unmap_page_range(st
>  		details = NULL;
>
>  	BUG_ON(addr >= end);
> +	mem_cgroup_uncharge_start();
>  	tlb_start_vma(tlb, vma);
>  	pgd = pgd_offset(vma->vm_mm, addr);
>  	do {
> @@ -952,6 +953,7 @@ static unsigned long unmap_page_range(st
>  					zap_work, details);
>  	} while (pgd++, addr = next, (addr != end && *zap_work > 0));
>  	tlb_end_vma(tlb, vma);
> +	mem_cgroup_uncharge_end();
>
>  	return addr;
>  }
> Index: mmotm-2.6.31-Sep28/mm/truncate.c
> ===================================================================
> --- mmotm-2.6.31-Sep28.orig/mm/truncate.c
> +++ mmotm-2.6.31-Sep28/mm/truncate.c
> @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
>  			pagevec_release(&pvec);
>  			break;
>  		}
> +		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>
> @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
>  			unlock_page(page);
>  		}
>  		pagevec_release(&pvec);
> +		mem_cgroup_uncharge_end();
>  	}
>  }
>  EXPORT_SYMBOL(truncate_inode_pages_range);
> @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
>  	pagevec_init(&pvec, 0);
>  	while (next <= end &&
>  		pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> +		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  			pgoff_t index;
> @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
>  				break;
>  		}
>  		pagevec_release(&pvec);
> +		mem_cgroup_uncharge_end();
>  		cond_resched();
>  	}
>  	return ret;
> @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
>  	while (next <= end && !wrapped &&
>  		pagevec_lookup(&pvec, mapping, next,
>  			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> +		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  			pgoff_t page_index;
> @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
>  			unlock_page(page);
>  		}
>  		pagevec_release(&pvec);
> +		mem_cgroup_uncharge_end();
>  		cond_resched();
>  	}
>  	return ret;
>

The patch overall looks good, just some questions about it (and a short
note below on how I read the intended calling pattern).

-- 
Balbir
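P.S.: For completeness, this is roughly the calling pattern I understand
the patch to introduce on the truncate side. It is my own paraphrase,
not code from the patch; truncate_one_pagevec() and truncate_one_page()
are made-up names, only the mem_cgroup_uncharge_start()/end()
bracketing and the pagevec helpers are real:

	/*
	 * Bracket a burst of page frees so that res_counter is hit once
	 * per batch instead of once per page.
	 */
	static void truncate_one_pagevec(struct pagevec *pvec)
	{
		int i;

		mem_cgroup_uncharge_start();	/* current->memcg_batch.do_batch++ */
		for (i = 0; i < pagevec_count(pvec); i++)
			/*
			 * Each page ends up in __mem_cgroup_uncharge_common(),
			 * which now accumulates into current->memcg_batch
			 * instead of calling res_counter_uncharge() per page.
			 */
			truncate_one_page(pvec->pages[i]);
		mem_cgroup_uncharge_end();	/* one res_counter_uncharge() of the
						 * accumulated sum, unless batches
						 * are still nested */
	}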