Date: Fri, 3 Jul 2015 15:02:13 +0200
From: Jan Kara
To: Tejun Heo
Cc: Jan Kara, axboe@kernel.dk, linux-kernel@vger.kernel.org,
	hch@infradead.org, hannes@cmpxchg.org, linux-fsdevel@vger.kernel.org,
	vgoyal@redhat.com, lizefan@huawei.com, cgroups@vger.kernel.org,
	linux-mm@kvack.org, mhocko@suse.cz, clm@fb.com,
	fengguang.wu@intel.com, david@fromorbit.com, gthelen@google.com,
	khlebnikov@yandex-team.ru
Subject: Re: [PATCH 41/51] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's
Message-ID: <20150703130213.GM23329@quack.suse.cz>
References: <1432329245-5844-1-git-send-email-tj@kernel.org>
	<1432329245-5844-42-git-send-email-tj@kernel.org>
	<20150701081528.GB7252@quack.suse.cz>
	<20150702023706.GK26440@mtj.duckdns.org>
In-Reply-To: <20150702023706.GK26440@mtj.duckdns.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 01-07-15 22:37:06, Tejun Heo wrote:
> Hello,
>
> On Wed, Jul 01, 2015 at 10:15:28AM +0200, Jan Kara wrote:
> > I was looking at who uses wakeup_flusher_threads(). There are two usecases:
> >
> > 1) sync() - we want to writeback everything
> > 2) We want to relieve memory pressure by cleaning and subsequently
> >    reclaiming pages.
> >
> > Neither of these cares about number of pages too much if you write enough.
>
> What's enough tho?  Saying "yeah let's try about 1000 pages" is one
> thing and "let's try about 1000 pages on each of 100 cgroups" is a
> quite different operation.
> Given the nature of "let's try to write
> some", I'd venture to say that writing somewhat less is an a lot
> better behavior than possibly trying to write out possibly huge amount
> given that the amount of fluctuation such behaviors may cause
> system-wide and how non-obvious the reasons for such fluctuations
> would be.
>
> > So similarly as we don't split the passed nr_pages argument among bdis, I
>
> bdi's are bound by actual hardware.  wb's aren't.  This is a purely
> logical construct and there can be a lot of them.  Again, trying to
> write 1024 pages on each of 100 devices and trying to write 1024 * 100
> pages to single device are quite different.

OK, I agree with your device vs logical construct argument. However, when
splitting pages based on the average throughput each cgroup generates, we
know nothing about the actual amount of dirty pages in each cgroup, so we
may end up writing far fewer pages than we originally wanted, since a
cgroup which was assigned a big chunk needn't have many pages available.
So your algorithm is basically bound to undershoot the requested number of
pages in some cases...

Another concern is that if we have two cgroups with the same amount of
dirty pages, but cgroup A has them randomly scattered (and thus gets much
lower bandwidth) and cgroup B has them laid out sequentially (and thus
gets higher bandwidth), you end up cleaning (and subsequently reclaiming)
more from cgroup B. That may be good for immediate memory pressure but
could be considered unfair by the cgroup owner.

Maybe it would be better to split the number of pages to write based on
the fraction of dirty pages each cgroup has in the bdi?

								Honza
-- 
Jan Kara
SUSE Labs, CR
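
To make the arithmetic behind the two splitting policies concrete, here is a
rough userspace model in Python. It is only a sketch and not kernel code: the
Wb class, its avg_bw and nr_dirty fields, and the helper names are all
hypothetical, and the numbers are invented for illustration. It assumes a wb
can never write more pages than it currently has dirty.

```python
from dataclasses import dataclass


@dataclass
class Wb:
    """Hypothetical stand-in for one cgroup's writeback context on a bdi."""
    name: str
    avg_bw: int    # assumed average write bandwidth, arbitrary units
    nr_dirty: int  # assumed number of dirty pages currently held


def _split(wbs, nr_pages, weight):
    """Split nr_pages among wbs in proportion to weight(wb), rounding down."""
    total = sum(weight(wb) for wb in wbs)
    if total == 0:
        return {wb.name: 0 for wb in wbs}
    return {wb.name: nr_pages * weight(wb) // total for wb in wbs}


def split_by_bandwidth(wbs, nr_pages):
    """Policy under discussion: share proportional to average bandwidth."""
    return _split(wbs, nr_pages, lambda wb: wb.avg_bw)


def split_by_dirty(wbs, nr_pages):
    """Suggested alternative: share proportional to dirty pages held."""
    return _split(wbs, nr_pages, lambda wb: wb.nr_dirty)


def pages_written(wbs, shares):
    """Pages actually cleaned: a wb cannot exceed its own dirty count."""
    return sum(min(shares[wb.name], wb.nr_dirty) for wb in wbs)


if __name__ == "__main__":
    # A is slow but holds most of the dirty pages; B is fast but nearly clean.
    wbs = [Wb("A", avg_bw=100, nr_dirty=800), Wb("B", avg_bw=900, nr_dirty=200)]
    for policy in (split_by_bandwidth, split_by_dirty):
        shares = policy(wbs, 1000)
        print(policy.__name__, shares, "written:", pages_written(wbs, shares))
```

With these made-up numbers the bandwidth split hands B 900 of the 1000
requested pages although B only has 200 dirty, so the request undershoots,
while the dirty-page split can satisfy it in full. Setting nr_dirty equal for
both wbs shows the fairness concern instead: the bandwidth split still cleans
far more from the faster wb.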