Date: Tue, 1 Mar 2011 13:44:55 +0900
From: Minchan Kim
To: Greg Thelen
Cc: Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    containers@lists.osdl.org, Andrea Righi, Balbir Singh,
    KAMEZAWA Hiroyuki, Daisuke Nishimura, Ciju Rajan K, David Rientjes,
    Wu Fengguang, Chad Talbott, Justin TerAvest, Vivek Goyal
Subject: Re: [PATCH v5 9/9] memcg: check memcg dirty limits in page writeback
Message-ID: <20110301044455.GB2107@barrios-desktop>
In-Reply-To: <1298669760-26344-10-git-send-email-gthelen@google.com>

On Fri, Feb 25, 2011 at 01:36:00PM -0800, Greg Thelen wrote:
> If the current process is in a non-root memcg, then
> balance_dirty_pages() will consider the memcg dirty limits as well as
> the system-wide limits. This allows different cgroups to have distinct
> dirty limits which trigger direct and background writeback at different
> levels.
>
> If called with a mem_cgroup, then throttle_vm_writeout() should query
> the given cgroup for its dirty memory usage limits.
>
> Signed-off-by: Andrea Righi
> Signed-off-by: Greg Thelen
> Acked-by: KAMEZAWA Hiroyuki
> Acked-by: Wu Fengguang
> ---
>  /*
> @@ -477,12 +502,14 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
>   * data. It looks at the number of dirty pages in the machine and will force
>   * the caller to perform writeback if the system is over `vm_dirty_ratio'.
>   * If we're over `background_thresh' then the writeback threads are woken to
> - * perform some writeout.
> + * perform some writeout. The current task may have per-memcg dirty
> + * limits, which are also checked.
>   */
>  static void balance_dirty_pages(struct address_space *mapping,
>  				unsigned long write_chunk)
>  {
>  	struct dirty_info sys_info;
> +	struct dirty_info memcg_info;
>  	long bdi_nr_reclaimable;
>  	long bdi_nr_writeback;
>  	unsigned long bdi_thresh;
> @@ -500,18 +527,27 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		};
>
>  		global_dirty_info(&sys_info);
> +		if (!memcg_dirty_info(NULL, &memcg_info))
> +			memcg_info = sys_info;

Sigh. I don't like this double check in the no-memcg configuration and
root-memcg cases. By "double check" I mean 1) the sys_info check and
2) the memcg_info check: memcg_info is just a copy of sys_info here, so
the second check is redundant. In addition, we always have to duplicate
the same logic for global and memcg, which bloats the binary.

>
>  		/*
>  		 * Throttle it only when the background writeback cannot
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
> +		if ((dirty_info_reclaimable(&sys_info) +
> +		     sys_info.nr_writeback <=
>  			(sys_info.background_thresh +
> -			 sys_info.dirty_thresh) / 2)
> +			 sys_info.dirty_thresh) / 2) &&
> +		    (dirty_info_reclaimable(&memcg_info) +
> +		     memcg_info.nr_writeback <=
> +			(memcg_info.background_thresh +
> +			 memcg_info.dirty_thresh) / 2))
>  			break;
>
> -		bdi_thresh = bdi_dirty_limit(bdi, sys_info.dirty_thresh);
> +		bdi_thresh = bdi_dirty_limit(bdi,
> +					     min(sys_info.dirty_thresh,
> +						 memcg_info.dirty_thresh));
>  		bdi_thresh = task_dirty_limit(current, bdi_thresh);
>
>  		/*
> @@ -541,7 +577,9 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		dirty_exceeded =
>  			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
>  			|| (dirty_info_reclaimable(&sys_info) +
> -			    sys_info.nr_writeback > sys_info.dirty_thresh);
> +			    sys_info.nr_writeback > sys_info.dirty_thresh)
> +			|| (dirty_info_reclaimable(&memcg_info) +
> +			    memcg_info.nr_writeback > memcg_info.dirty_thresh);
>
>  		if (!dirty_exceeded)
>  			break;
> @@ -559,7 +597,9 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		 * up.
>  		 */
>  		trace_wbc_balance_dirty_start(&wbc, bdi);
> -		if (bdi_nr_reclaimable > bdi_thresh) {
> +		if ((bdi_nr_reclaimable > bdi_thresh) ||
> +		    (dirty_info_reclaimable(&memcg_info) >
> +		     memcg_info.dirty_thresh)) {

Why does memcg need this check? As I understand it, bdi_thresh is
f(dirty_thresh), and the dirty_thresh passed in is already the memcg's
one here (via the min() above), so isn't the bdi_thresh comparison alone
enough? I don't know this logic well, but judging from the comment, you
are at least changing the behavior. Please write down why you need it,
and if you do need it, please update this comment as well:

"		 * Only move pages to writeback if this bdi is over its
		 * threshold otherwise wait until the disk writes catch
		 * up.
"
" > writeback_inodes_wb(&bdi->wb, &wbc); > pages_written += write_chunk - wbc.nr_to_write; > trace_wbc_balance_dirty_written(&wbc, bdi); > @@ -594,8 +634,10 @@ static void balance_dirty_pages(struct address_space *mapping, > * background_thresh, to keep the amount of dirty memory low. > */ > if ((laptop_mode && pages_written) || > - (!laptop_mode && (dirty_info_reclaimable(&sys_info) > > - sys_info.background_thresh))) > + (!laptop_mode && ((dirty_info_reclaimable(&sys_info) > > + sys_info.background_thresh) || > + (dirty_info_reclaimable(&memcg_info) > > + memcg_info.background_thresh)))) > bdi_start_background_writeback(bdi); > } > > @@ -653,12 +695,20 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping, > } > EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); > > -void throttle_vm_writeout(gfp_t gfp_mask) > +/* > + * Throttle the current task if it is near dirty memory usage limits. > + * If @mem_cgroup is NULL or the root_cgroup, then use global dirty memory > + * information; otherwise use the per-memcg dirty limits. > + */ > +void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup) > { > struct dirty_info sys_info; > + struct dirty_info memcg_info; > > for ( ; ; ) { > global_dirty_info(&sys_info); > + if (!memcg_dirty_info(mem_cgroup, &memcg_info)) > + memcg_info = sys_info; > > /* > * Boost the allowable dirty threshold a bit for page > @@ -666,9 +716,13 @@ void throttle_vm_writeout(gfp_t gfp_mask) > */ > sys_info.dirty_thresh += > sys_info.dirty_thresh / 10; /* wheeee... */ > + memcg_info.dirty_thresh += > + memcg_info.dirty_thresh / 10; /* wheeee... 
*/ > > - if (sys_info.nr_unstable_nfs + > - sys_info.nr_writeback <= sys_info.dirty_thresh) > + if ((sys_info.nr_unstable_nfs + > + sys_info.nr_writeback <= sys_info.dirty_thresh) && > + (memcg_info.nr_unstable_nfs + > + memcg_info.nr_writeback <= memcg_info.dirty_thresh)) > break; > congestion_wait(BLK_RW_ASYNC, HZ/10); > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index ba11e28..f723242 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1927,7 +1927,7 @@ restart: > sc->nr_scanned - nr_scanned, sc)) > goto restart; > > - throttle_vm_writeout(sc->gfp_mask); > + throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup); > } > > /* > -- > 1.7.3.1 > -- Kind regards, Minchan Kim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/